What's the difference between the --general-numeric-sort and --numeric-sort options in GNU sort?
Unix Problem Overview
sort provides two kinds of numeric sort. This is from the man page:
-g, --general-numeric-sort
compare according to general numerical value
-n, --numeric-sort
compare according to string numerical value
What's the difference?
Unix Solutions
Solution 1 - Unix
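General numeric sort (-g) compares the numbers as floating-point values. This allows scientific notation such as 1.234E10, but it is slower and subject to rounding error: two values that differ only beyond the precision of the floating-point type can compare as equal. Numeric sort (-n) compares the digit strings exactly; it is essentially a regular alphabetic sort that knows 10 comes after 9.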
See http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html
> ‘-g’ ‘--general-numeric-sort’ ‘--sort=general-numeric’
> Sort numerically, using the standard C function strtod to convert a prefix of each line to a double-precision floating point number. This allows floating point numbers to be specified in scientific notation, like 1.0e-34 and 10e100. The LC_NUMERIC locale determines the decimal-point character. Do not report overflow, underflow, or conversion errors. Use the following collating sequence:
>
> - Lines that do not start with numbers (all considered to be equal).
> - NaNs (“Not a Number” values, in IEEE floating point arithmetic) in a consistent but machine-dependent order.
> - Minus infinity.
> - Finite numbers in ascending numeric order (with -0 and +0 equal).
> - Plus infinity.
>
> Use this option only if there is no alternative; it is much slower than --numeric-sort (-n) and it can lose information when converting to floating point.
> ‘-n’ ‘--numeric-sort’ ‘--sort=numeric’
> Sort numerically. The number begins each line and consists of optional blanks, an optional ‘-’ sign, and zero or more digits possibly separated by thousands separators, optionally followed by a decimal-point character and zero or more digits. An empty number is treated as ‘0’. The LC_NUMERIC locale specifies the decimal-point character and thousands separator. By default a blank is a space or a tab, but the LC_CTYPE locale can change this.
>
> Comparison is exact; there is no rounding error.
>
> Neither a leading ‘+’ nor exponential notation is recognized. To compare such strings numerically, use the --general-numeric-sort (-g) option.
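The two differences quoted above can be tried out with a small sketch like the following. It assumes GNU sort and a POSIX shell; the sample values and the variable names (n_order, g_order, big1, big2, g_unique, n_unique) are made up for illustration, and LC_ALL=C is set so the decimal point is a period.

```shell
# Scientific notation: -n reads only the leading digits of "1e3" (that is, 1),
# while -g parses the whole token as 1000.
n_order=$(printf '%s\n' 10 9 1e3 | LC_ALL=C sort -n | paste -sd' ' -)
g_order=$(printf '%s\n' 10 9 1e3 | LC_ALL=C sort -g | paste -sd' ' -)
echo "-n: $n_order"   # -n: 1e3 9 10
echo "-g: $g_order"   # -g: 9 10 1e3

# Precision: two very long integers that differ only in the last digit.
# With -u, sort keeps one line per group of lines that compare equal,
# so counting the output lines shows whether the two values were told apart.
big1=10000000000000000000000000000000000000001
big2=10000000000000000000000000000000000000002
g_unique=$(printf '%s\n' "$big1" "$big2" | LC_ALL=C sort -gu | grep -c '')
n_unique=$(printf '%s\n' "$big1" "$big2" | LC_ALL=C sort -nu | grep -c '')
echo "distinct values seen by -g: $g_unique"   # 1 on common floating-point formats
echo "distinct values seen by -n: $n_unique"   # 2, since -n compares exactly
```

The second half relies on the numbers being longer than the floating-point type can represent exactly, which is why about 40 digits are used here.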
Solution 2 - Unix
You should be careful with your locale. For example, you might intend to sort a floating-point number (like 2.2), whereas your locale might expect a comma as the decimal separator (like 2,2).
As reported in this forum, you may get wrong results using the -n or -g flags.
In my case I use:
LC_ALL=C sort -k 6,6n file
in order to sort the 6th column that contains:
2.5
3.7
1.4
in order to obtain
1.4
2.5
3.7
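As a self-contained illustration of the command above, here is a sketch that builds a small six-column input inline instead of reading a file. The rows, the labels r1 to r3, and the variable name sorted_rows are invented for the example; only the sort invocation comes from the answer.

```shell
# Three whitespace-separated rows; the sixth field holds the decimal values.
sorted_rows=$(printf '%s\n' \
    'r1 a b c d 2.5' \
    'r2 a b c d 3.7' \
    'r3 a b c d 1.4' |
    LC_ALL=C sort -k 6,6n)   # key = field 6 only, compared numerically

printf '%s\n' "$sorted_rows"
# r3 a b c d 1.4
# r1 a b c d 2.5
# r2 a b c d 3.7
```

Setting LC_ALL=C for the single command leaves the rest of the shell session in its usual locale.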
Solution 3 - Unix
In addition to the accepted answer, which mentions that -g allows scientific notation, I want to show the part that most likely causes undesirable behavior.
With -g:
$ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -g myfile
baa
--inf
--inf
--inf-
--inf--
--inf-a
--nnf
nnf--
nnn
tnan
zoo
naN
Nana
nani lol
-inf
-inf--
-11
-2
-1
1
+1
2
+2
0xa
11
+11
inf
Look at the zoo; there are three important things here:
- Lines that start with NAN (e.g. Nana and nani lol) or -INF (single dash, not --INF) move to the end, but before the digits, while INF moves to the very end, after the digits, because it means infinity.
- NAN, INF, and -INF are case insensitive.
- Whitespace on either side of NAN, INF, and -INF is always ignored (regardless of LC_CTYPE). Other alphabetic lines may ignore whitespace on either side depending on the LC_COLLATE locale (e.g. LC_COLLATE=fr_FR.UTF-8 ignores it, but LC_COLLATE=us_EN.UTF-8 does not).
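The NAN/INF behavior in the list above can be reproduced with a much smaller input. This is only a sketch, assuming GNU sort on a system whose C library parses "nan" and "inf" prefixes case-insensitively (as glibc does); the sample lines and the variable name g_order are chosen for illustration.

```shell
# "zoo" is not a number, "Nana" starts with a NaN prefix,
# "-inf" and "inf" are parsed as the two infinities.
g_order=$(printf '%s\n' inf 2 zoo -inf Nana 1 | LC_ALL=C sort -g | paste -sd' ' -)

# Order per the GNU manual: non-numbers, NaNs, minus infinity,
# finite numbers, plus infinity.
echo "$g_order"   # zoo Nana -inf 1 2 inf
```

An ordinary word such as Nana therefore lands in the middle of the numeric range rather than with the other words.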
So if you are sorting arbitrary alphanumeric data, you probably don't want -g. If you really need scientific-notation comparison with -g, you probably want to extract the alphabetic and numeric data and compare them separately.
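One possible way to do that separation is sketched below. The regular expression used to decide what counts as "numeric" (an optional sign followed by a digit) is an assumption and may need adjusting for real data; the variable names input, numeric, and alpha are hypothetical.

```shell
input=$(printf '%s\n' zoo 1e3 nani 20 baa)

# Lines that look numeric go through -g, so 1e3 is read as 1000.
numeric=$(printf '%s\n' "$input" | grep -E '^[+-]?[0-9]' | LC_ALL=C sort -g)

# Everything else is sorted as plain text, so words such as "nani"
# are never interpreted as NaN.
alpha=$(printf '%s\n' "$input" | grep -Ev '^[+-]?[0-9]' | LC_ALL=C sort)

printf '%s\n' "$alpha" "$numeric"
# baa
# nani
# zoo
# 20
# 1e3
```

Whether the words or the numbers should be printed first is a design choice; here the words come first to mirror where -g places non-numeric lines.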
If you only need to sort ordinary numbers (e.g. 1, -1) and 0x/E/+ handling is not important to you, -n is enough:
$ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -n myfile
-1000
-22
-13
-11
-010
-10
-5
-2
-1
-0.2
-0.12
-0.11
-0.1
0x1
0x11
0xb
+1
+11
+2
-a
-aa
--aa
-aaa
-b
baa
BAA
bbb
+ignore
inf
-inf
--inf
--inf
--inf-
--inf--
-inf--
--inf-a
naN
Nana
nani lol
--nnf
nnf--
nnn
None
uum
Zero cool
-zzz
1
1.1
1.234E10
5
11
With either -g or -n, be aware of the locale's effect. You may want to set LC_NUMERIC to en_US.UTF-8 to avoid the failure that fr_FR.UTF-8 causes when sorting floating-point numbers:
$ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=fr_FR.UTF-8 sort -n myfile
-10
-5
-2
-1
-1.1
-1.2
-0.1
-0.11
-0.12
-0.2
-a
+b
middle
-wwe
+zoo
1
1.1
With LC_NUMERIC=en_US.UTF-8:
$ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -n myfile
-10
-5
-2
-1.2
-1.1
-1
-0.2
-0.12
-0.11
-0.1
-a
+b
middle
-wwe
+zoo
1
1.1
Or LC_NUMERIC=us_EN.UTF-8 to group +|-|space with alpha:
$ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=us_EN.UTF-8 sort -n myfile
-0.1
a
b
a
b
+b
+zoo
-a
-wwe
middle
1
You probably want to specify the locale when using sort if you want to write a portable script.
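A minimal sketch of what pinning the locale in a script might look like, assuming the data uses a period as the decimal point. The C locale is used here because it is available on every POSIX system; the sample values and the variable name sorted are illustrative.

```shell
#!/bin/sh
# Fix the locale for the whole script so the result does not depend on
# the caller's LC_NUMERIC or LC_COLLATE settings.
LC_ALL=C
export LC_ALL

# Negative decimals are the case that went wrong under fr_FR.UTF-8 above.
sorted=$(printf '%s\n' -1 -1.2 -0.1 -1.1 | sort -n | paste -sd' ' -)
echo "$sorted"   # -1.2 -1.1 -1 -0.1
```

LC_ALL overrides the individual LC_* variables, so a single assignment is enough.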