Sorting stuff
sort
$ sort --version | head -n1
sort (GNU coreutils) 8.25
$ man sort
SORT(1) User Commands SORT(1)
NAME
sort - sort lines of text files
SYNOPSIS
sort [OPTION]... [FILE]...
sort [OPTION]... --files0-from=F
DESCRIPTION
Write sorted concatenation of all FILE(s) to standard output.
With no FILE, or when FILE is -, read standard input.
...
Note: All examples shown here assume ASCII-encoded input files
Default sort
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
$ sort poem.txt
And so are you.
Roses are red,
Sugar is sweet,
Violets are blue,
- Well, that was easy. The lines were sorted alphabetically (ascending order by default), and it so happened that the first letter alone was enough to decide the order
- For the next example, let's extract all the words and sort them
  - this also showcases sort accepting stdin
  - See the GNU grep chapter if the grep command used below looks alien
$ # output might differ depending on locale settings
$ # note the case-insensitiveness of output
$ grep -oi '[a-z]*' poem.txt | sort
And
are
are
are
blue
is
red
Roses
so
Sugar
sweet
Violets
you
- Heed the following footnote from info sort regarding locale settings
$ info sort | tail
(1) If you use a non-POSIX locale (e.g., by setting ‘LC_ALL’ to
‘en_US’), then ‘sort’ may produce output that is sorted differently than
you’re accustomed to. In that case, set the ‘LC_ALL’ environment
variable to ‘C’. Note that setting only ‘LC_COLLATE’ has two problems.
First, it is ineffective if ‘LC_ALL’ is also set. Second, it has
undefined behavior if ‘LC_CTYPE’ (or ‘LANG’, if ‘LC_CTYPE’ is unset) is
set to an incompatible value. For example, you get undefined behavior
if ‘LC_CTYPE’ is ‘ja_JP.PCK’ but ‘LC_COLLATE’ is ‘en_US.UTF-8’.
- Example to help show effect of locale setting
$ # note how uppercase is sorted before lowercase
$ grep -oi '[a-z]*' poem.txt | LC_ALL=C sort
And
Roses
Sugar
Violets
are
are
are
blue
is
red
so
sweet
you
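If a case-insensitive order is wanted regardless of the locale in effect, one option is to combine LC_ALL=C with the -f option, which ignores case while comparing. A minimal sketch, assuming GNU sort; the word list is made up for illustration.

```shell
# sample words (hypothetical), one per line
words=$(printf '%s\n' Roses are And blue)

# C locale: plain byte order, so uppercase sorts before lowercase
c_sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"

# C locale plus -f: letter case is ignored while comparing
folded=$(printf '%s\n' "$words" | LC_ALL=C sort -f)
printf '%s\n' "$folded"
```

The first command prints And, Roses, are, blue and the second prints And, are, blue, Roses.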
Reverse sort
- The -r option simply reverses the default ascending order to descending order
$ sort -r poem.txt
Violets are blue,
Sugar is sweet,
Roses are red,
And so are you.
Various number sorting
$ cat numbers.txt
20
53
3
101
$ sort numbers.txt
101
20
3
53
- Whoops, what happened there? sort won't treat the lines as numbers unless told to; by default they are compared character by character, so 101 ends up before 20
- Depending on the format of the numbers, different options have to be used
- First up is the -n option, which sorts based on numerical value
$ sort -n numbers.txt
3
20
53
101
$ sort -nr numbers.txt
101
53
20
3
- The -n option can handle negative numbers
- As well as thousands separator and decimal point (depends on locale)
- The <() syntax is process substitution: put simply, it allows the output of a command to be passed as an input file to another command, without needing to manually create a temporary file
$ # multiple files are merged as single input by default
$ sort -n numbers.txt <(echo '-4')
-4
3
20
53
101
$ sort -n numbers.txt <(echo '1,234')
3
20
53
101
1,234
$ sort -n numbers.txt <(echo '31.24')
3
20
31.24
53
101
- Use -g if the input contains numbers prefixed by + or written in E scientific notation
$ cat generic_numbers.txt
+120
-1.53
3.14e+4
42.1e-2
$ sort -g generic_numbers.txt
-1.53
42.1e-2
+120
3.14e+4
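To see why -g is needed for such input, the same lines can be run through -n as well. A small sketch, assuming GNU sort, whose documentation says that -n recognizes neither a leading + nor exponential notation.

```shell
# same lines as generic_numbers.txt
nums=$(printf '%s\n' '+120' '-1.53' '3.14e+4' '42.1e-2')

# -n: +120 is treated as 0 and the exponent part is ignored
n_sorted=$(printf '%s\n' "$nums" | LC_ALL=C sort -n)
printf '%s\n' "$n_sorted"

# -g: the lines are parsed as general floating point values
g_sorted=$(printf '%s\n' "$nums" | LC_ALL=C sort -g)
printf '%s\n' "$g_sorted"
```

With -n, the order comes out as -1.53, +120, 3.14e+4, 42.1e-2, which is not the numerical order of the values.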
- Commands like du have options to display numbers in human-readable formats
- sort supports sorting such numbers using the -h option
$ du -sh *
104K power.log
746M projects
316K report.log
20K sample.txt
$ du -sh * | sort -h
20K sample.txt
104K power.log
316K report.log
746M projects
$ # --si uses powers of 1000 instead of 1024
$ du -s --si *
107k power.log
782M projects
324k report.log
21k sample.txt
$ du -s --si * | sort -h
21k sample.txt
107k power.log
324k report.log
782M projects
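One caveat worth keeping in mind: per the GNU documentation, -h compares the sign first, then the suffix, and only then the numeric value. That suits du output, where numbers are consistently scaled, but not arbitrary mixes. A sketch with made-up sizes:

```shell
# hypothetical sizes, consistently scaled as du -h would print them
sizes=$(printf '%s\n' 2M 900K 1G 512)
h_sorted=$(printf '%s\n' "$sizes" | LC_ALL=C sort -h)
printf '%s\n' "$h_sorted"

# not consistently scaled: K sorts before M whatever the number in front
mixed=$(printf '%s\n' 1M 2000K | LC_ALL=C sort -h)
printf '%s\n' "$mixed"
```

The first command prints 512, 900K, 2M, 1G. The second prints 2000K before 1M, even though 2000K is the larger quantity.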
- Version sort (the -V option): dealing with numbers mixed with other characters
- If this sorting is needed simply while displaying directory contents, use ls -v instead of piping to sort -V
$ cat versions.txt
foo_v1.2
bar_v2.1.3
foobar_v2
foo_v1.2.1
foo_v1.3
$ sort -V versions.txt
bar_v2.1.3
foobar_v2
foo_v1.2
foo_v1.2.1
foo_v1.3
- Another common use case is when there are multiple filenames differentiated by numbers
$ cat files.txt
file0
file10
file3
file4
$ sort -V files.txt
file0
file3
file4
file10
- Can be used when dealing with durations reported by the time command as well
$ # different solving durations
$ cat rubik_time.txt
5m35.363s
3m20.058s
4m5.099s
4m1.130s
3m42.833s
4m33.083s
$ # assuming consistent min/sec format
$ sort -V rubik_time.txt
3m20.058s
3m42.833s
4m1.130s
4m5.099s
4m33.083s
5m35.363s
Random sort
- The -R option sorts the lines in random order
- Note that duplicate lines will always end up next to each other
  - might be useful as a feature for some cases ;)
  - Use shuf if this is not desirable
- See also How can I shuffle the lines of a text file on the Unix command line or in a shell script?
$ cat nums.txt
1
10
10
12
23
563
$ # the two 10s will always be next to each other
$ sort -R nums.txt
563
12
1
10
10
23
$ # duplicates can end up anywhere
$ shuf nums.txt
10
23
1
10
563
12
Specifying output file
- The -o option can be used to specify the output file
- Useful for in-place editing, since the output file can be the same as the input file
$ sort -R nums.txt -o rand_nums.txt
$ cat rand_nums.txt
23
1
10
10
563
12
$ sort -R nums.txt -o nums.txt
$ cat nums.txt
563
23
10
10
1
12
- Use a shell loop if there are multiple files to be sorted in place
- The snippet below is for the bash shell
$ for f in *.txt; do echo sort -V "$f" -o "$f"; done
sort -V files.txt -o files.txt
sort -V rubik_time.txt -o rubik_time.txt
sort -V versions.txt -o versions.txt
$ # remove echo once commands look fine
$ for f in *.txt; do sort -V "$f" -o "$f"; done
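The reason for using -o rather than shell redirection for in-place sorting: with redirection, the shell truncates the file before sort gets to read it. A minimal sketch using throwaway files in a temporary directory (the file names are made up):

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 20 53 3 101 > "$tmpdir/a.txt"
cp "$tmpdir/a.txt" "$tmpdir/b.txt"

# risky: a.txt is emptied by the shell before sort starts reading it
sort -n "$tmpdir/a.txt" > "$tmpdir/a.txt"
if [ -s "$tmpdir/a.txt" ]; then redirect_result=kept; else redirect_result=empty; fi
echo "after redirection, a.txt is: $redirect_result"

# safe: sort reads all of the input before writing to the -o file
sort -n -o "$tmpdir/b.txt" "$tmpdir/b.txt"
inplace=$(cat "$tmpdir/b.txt")
printf '%s\n' "$inplace"

rm -r "$tmpdir"
```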
Unique sort
- Keep only the first copy of lines that are deemed to be the same according to the sort options used
$ cat duplicates.txt
foo
12 carrots
foo
12 apples
5 guavas
$ # only one copy of foo in output
$ sort -u duplicates.txt
12 apples
12 carrots
5 guavas
foo
- The definition of a duplicate varies according to the options used
- For example, when -n is used, matching numbers are deemed the same even if the rest of the line differs
  - Pipe the output to uniq if this is not desirable
$ # note how first copy of line starting with 12 is retained
$ sort -nu duplicates.txt
foo
5 guavas
12 carrots
$ # use uniq when entire line should be compared to find duplicates
$ sort -n duplicates.txt | uniq
foo
5 guavas
12 apples
12 carrots
- Use the -f option to ignore letter case while determining duplicates
$ cat words.txt
CAR
are
car
Are
foot
are
$ # only the two 'are' were considered duplicates
$ sort -u words.txt
are
Are
car
CAR
foot
$ # note again that first copy of duplicate is retained
$ sort -fu words.txt
are
CAR
foot
Column based sorting
From info sort
‘-k POS1[,POS2]’
‘--key=POS1[,POS2]’
Specify a sort field that consists of the part of the line between
POS1 and POS2 (or the end of the line, if POS2 is omitted),
_inclusive_.
Each POS has the form ‘F[.C][OPTS]’, where F is the number of the
field to use, and C is the number of the first character from the
beginning of the field. Fields and character positions are
numbered starting with 1; a character position of zero in POS2
indicates the field’s last character. If ‘.C’ is omitted from
POS1, it defaults to 1 (the beginning of the field); if omitted
from POS2, it defaults to 0 (the end of the field). OPTS are
ordering options, allowing individual keys to be sorted according
to different rules; see below for details. Keys can span multiple
fields.
- By default, blank characters (space and tab) serve as field separators
$ cat fruits.txt
apple 42
guava 6
fig 90
banana 31
$ sort fruits.txt
apple 42
banana 31
fig 90
guava 6
$ # sort based on 2nd column numbers
$ sort -k2,2n fruits.txt
guava 6
banana 31
apple 42
fig 90
- Using a different field separator
- Consider the following sample input file having fields separated by :
$ # name:pet_name:no_of_pets
$ cat pets.txt
foo:dog:2
xyz:cat:1
baz:parrot:5
abcd:cat:3
joe:dog:1
bar:fox:1
temp_var:squirrel:4
boss:dog:10
- Sorting based on a particular column, or from a column to the end of the line
- When multiple lines have equal keys, sort by default falls back to comparing the entire line to resolve the tie
$ # only 2nd column
$ # -k2,4 would mean 2nd column to 4th column
$ sort -t: -k2,2 pets.txt
abcd:cat:3
xyz:cat:1
boss:dog:10
foo:dog:2
joe:dog:1
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
$ # from 2nd column to end of line
$ sort -t: -k2 pets.txt
xyz:cat:1
abcd:cat:3
joe:dog:1
boss:dog:10
foo:dog:2
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
- Multiple keys can be specified to resolve ties
- Note that if lines still compare equal on all the specified keys, the entire line is compared as a last resort
$ # default sort for 2nd column, numeric sort on 3rd column to resolve ties
$ sort -t: -k2,2 -k3,3n pets.txt
xyz:cat:1
abcd:cat:3
joe:dog:1
foo:dog:2
boss:dog:10
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
$ # numeric sort on 3rd column, default sort for 2nd column to resolve ties
$ sort -t: -k3,3n -k2,2 pets.txt
xyz:cat:1
joe:dog:1
bar:fox:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10
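As the info sort excerpt mentions, ordering options can be attached to individual keys. A sketch, assuming GNU sort, that keeps the default ascending order for the 2nd column but resolves ties by descending count in the 3rd column:

```shell
# same content as pets.txt
pets=$(printf '%s\n' foo:dog:2 xyz:cat:1 baz:parrot:5 abcd:cat:3 \
    joe:dog:1 bar:fox:1 temp_var:squirrel:4 boss:dog:10)

# r applies only to the key it is attached to, here the numeric 3rd column
by_pet=$(printf '%s\n' "$pets" | LC_ALL=C sort -t: -k2,2 -k3,3nr)
printf '%s\n' "$by_pet"
```

Within each pet type, the entry with the highest count now comes first, for example boss:dog:10 before foo:dog:2 and joe:dog:1.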
- Use the -s option to retain the original order of lines in case of a tie
$ sort -s -t: -k2,2 pets.txt
xyz:cat:1
abcd:cat:3
foo:dog:2
joe:dog:1
boss:dog:10
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
- The -u option, as seen earlier, will retain only the first match
$ sort -u -t: -k2,2 pets.txt
xyz:cat:1
foo:dog:2
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
$ sort -u -t: -k3,3n pets.txt
xyz:cat:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10
- Sometimes, the input has to be sorted first and then -u applied on the sorted output
- See also remove duplicates based on the value of another column
$ # sort by number in 3rd column
$ sort -t: -k3,3n pets.txt
bar:fox:1
joe:dog:1
xyz:cat:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10
$ # then get unique entry based on 2nd column
$ sort -t: -k3,3n pets.txt | sort -t: -u -k2,2
xyz:cat:1
joe:dog:1
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
- Specifying particular characters within fields
- If a character position is not specified, it defaults to 1 for the starting column and 0 (last character) for the ending column
$ cat marks.txt
fork,ap_12,54
flat,up_342,1.2
fold,tn_48,211
more,ap_93,7
rest,up_5,63
$ # for 2nd column, sort numerically only from 4th character to end
$ sort -t, -k2.4,2n marks.txt
rest,up_5,63
fork,ap_12,54
fold,tn_48,211
more,ap_93,7
flat,up_342,1.2
$ # sort uniquely based on first two characters of line
$ sort -u -k1.1,1.2 marks.txt
flat,up_342,1.2
fork,ap_12,54
more,ap_93,7
rest,up_5,63
- If there are headers
$ cat header.txt
fruit qty
apple 42
guava 6
fig 90
banana 31
$ # separate and combine header and content to be sorted
$ cat <(head -n1 header.txt) <(tail -n +2 header.txt | sort -k2nr)
fruit qty
fig 90
apple 42
banana 31
guava 6
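When the data arrives on stdin rather than from a file, the header can be peeled off with the shell's read builtin and the rest handed to sort. A sketch of that idea using the same sample data; the variable names are illustrative only.

```shell
# same content as header.txt
data=$(printf '%s\n' 'fruit qty' 'apple 42' 'guava 6' 'fig 90' 'banana 31')

# read consumes only the first line (the header), which is printed as-is;
# sort then sees just the remaining lines
result=$(printf '%s\n' "$data" | {
    IFS= read -r header
    printf '%s\n' "$header"
    LC_ALL=C sort -k2,2nr
})
printf '%s\n' "$result"
```

This avoids reading the input twice, which matters when the input cannot be rewound.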
Further reading for sort
- There are many other options apart from the handful presented above. See man sort and info sort for detailed documentation and more examples
- sort like a master
- When -b to ignore leading blanks is needed
- sort Q&A on unix stackexchange
- sort on multiple columns using -k option
- sort a string character wise
- Scalability of 'sort -u' for gigantic files
uniq
$ uniq --version | head -n1
uniq (GNU coreutils) 8.25
$ man uniq
UNIQ(1) User Commands UNIQ(1)
NAME
uniq - report or omit repeated lines
SYNOPSIS
uniq [OPTION]... [INPUT [OUTPUT]]
DESCRIPTION
Filter adjacent matching lines from INPUT (or standard input), writing
to OUTPUT (or standard output).
With no options, matching lines are merged to the first occurrence.
...
Default uniq
$ cat word_list.txt
are
are
to
good
bad
bad
bad
good
are
bad
$ # adjacent duplicate lines are removed, leaving one copy
$ uniq word_list.txt
are
to
good
bad
good
are
bad
$ # To remove duplicates from entire file, input has to be sorted first
$ # also showcases that uniq accepts stdin as input
$ sort word_list.txt | uniq
are
bad
good
to
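When whole lines are compared and none of the other uniq options are needed, sort -u gives the same result as sort piped to uniq. A quick check on the same words:

```shell
# same content as word_list.txt
words=$(printf '%s\n' are are to good bad bad bad good are bad)

via_uniq=$(printf '%s\n' "$words" | LC_ALL=C sort | LC_ALL=C uniq)
via_u=$(printf '%s\n' "$words" | LC_ALL=C sort -u)

printf '%s\n' "$via_uniq"
[ "$via_uniq" = "$via_u" ] && echo 'both pipelines agree'
```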
Only duplicates
$ # duplicates adjacent to each other
$ uniq -d word_list.txt
are
bad
$ # duplicates in entire file
$ sort word_list.txt | uniq -d
are
bad
good
- Use -D to get only duplicates, showing every copy instead of one line per group
$ uniq -D word_list.txt
are
are
bad
bad
bad
$ sort word_list.txt | uniq -D
are
are
are
bad
bad
bad
bad
good
good
- To distinguish the different groups
$ # using --all-repeated=prepend will add a newline before the first group as well
$ sort word_list.txt | uniq --all-repeated=separate
are
are
are

bad
bad
bad
bad

good
good
Only unique
$ # lines with no adjacent duplicates
$ uniq -u word_list.txt
to
good
good
are
bad
$ # unique lines in entire file
$ sort word_list.txt | uniq -u
to
Prefix count
$ # adjacent lines
$ uniq -c word_list.txt
2 are
1 to
1 good
3 bad
1 good
1 are
1 bad
$ # entire file
$ sort word_list.txt | uniq -c
3 are
4 bad
2 good
1 to
$ # entire file, only duplicates
$ sort word_list.txt | uniq -cd
3 are
4 bad
2 good
- Sorting by count
$ # sort by count
$ sort word_list.txt | uniq -c | sort -n
1 to
2 good
3 are
4 bad
$ # reverse the order, highest count first
$ sort word_list.txt | uniq -c | sort -nr
4 bad
3 are
2 good
1 to
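When several entries share the same count, a second key makes the order among them predictable. A sketch with a made-up color list; awk is used at the end only to normalize the leading spaces that uniq -c adds.

```shell
# hypothetical input, one color per line
colors=$(printf '%s\n' Red Blue Yellow Red Green Blue Red Black Yellow Blue)

# count descending, then name ascending to settle ties
counts=$(printf '%s\n' "$colors" | LC_ALL=C sort | LC_ALL=C uniq -c |
    LC_ALL=C sort -k1,1nr -k2,2 | awk '{print $1, $2}')
printf '%s\n' "$counts"
```

Blue and Red both occur three times, and the second key places Blue first.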
- To get only the entries with min/max count, a bit of awk magic helps
$ # consider this result
$ sort colors.txt | uniq -c | sort -nr
3 Red
3 Blue
2 Yellow
1 Green
1 Black
$ # to get all max count
$ # save 1st line 1st column value to c and then print if 1st column equals c
$ sort colors.txt | uniq -c | sort -nr | awk 'NR==1{c=$1} $1==c'
3 Red
3 Blue
$ # to get all min count
$ sort colors.txt | uniq -c | sort -n | awk 'NR==1{c=$1} $1==c'
1 Black
1 Green
- Get a rough count of the most used commands from the history file
$ # awk '{print $1}' will get the 1st column alone
$ awk '{print $1}' "$HISTFILE" | sort | uniq -c | sort -nr | head
1465 echo
1180 grep
552 cd
531 awk
451 sed
423 vi
418 cat
392 perl
325 printf
320 sort
$ # extract command name from start of line or preceded by 'spaces|spaces'
$ # won't catch commands in other places like command substitution though
$ grep -oP '(^| +\| +)\K[^ ]+' "$HISTFILE" | sort | uniq -c | sort -nr | head
2006 grep
1469 echo
933 sed
698 awk
552 cd
513 perl
510 cat
453 sort
423 vi
327 printf
Ignoring case
$ cat another_list.txt
food
Food
good
are
bad
Are
$ # note how first copy is retained
$ uniq -i another_list.txt
food
good
are
bad
Are
$ uniq -iD another_list.txt
food
Food
Combining multiple files
$ sort -f word_list.txt another_list.txt | uniq -i
are
bad
food
good
to
$ sort -f word_list.txt another_list.txt | uniq -c
4 are
1 Are
5 bad
1 food
1 Food
3 good
1 to
$ sort -f word_list.txt another_list.txt | uniq -ic
5 are
5 bad
2 food
3 good
1 to
- If only adjacent lines should be compared (no sorting), concatenate the files using another command such as cat, since uniq accepts only one input file
$ uniq -id word_list.txt
are
bad
$ uniq -id another_list.txt
food
$ cat word_list.txt another_list.txt | uniq -id
are
bad
food
Column options
uniq has a few options dealing with column manipulations. They are not as extensive as sort -k, but are handy for some cases
- First up, skipping fields
- No option to specify a different delimiter
- From info uniq: Fields are sequences of non-space non-tab characters that are separated from each other by at least one space or tab
- Number of spaces/tabs between fields should be the same
$ cat shopping.txt
lemon 5
mango 5
banana 8
bread 1
orange 5
$ # skips first field
$ uniq -f1 shopping.txt
lemon 5
banana 8
bread 1
orange 5
$ # use -f3 to skip first three fields and so on
- Skipping characters
$ cat text
glue
blue
black
stack
stuck
$ # don't consider first 2 characters
$ uniq -s2 text
glue
black
stuck
$ # to visualize the above example
$ # assume there are two fields and uniq is applied on 2nd column
$ sed 's/^../& /' text
gl ue
bl ue
bl ack
st ack
st uck
- Comparing only up to a specified number of characters
$ # consider only first 2 characters
$ uniq -w2 text
glue
blue
stack
$ # to visualize the above example
$ # assume there are two fields and uniq is applied on 1st column
$ sed 's/^../& /' text
gl ue
bl ue
bl ack
st ack
st uck
- Combining -s and -w
- Can be combined with -f as well
$ # skip first 3 characters and then use next 2 characters
$ uniq -s3 -w2 text
glue
black
Further reading for uniq
- Do check out man uniq and info uniq for other options and more detailed documentation
- uniq Q&A on unix stackexchange
- process duplicate lines only based on certain fields
comm
$ comm --version | head -n1
comm (GNU coreutils) 8.25
$ man comm
COMM(1) User Commands COMM(1)
NAME
comm - compare two sorted files line by line
SYNOPSIS
comm [OPTION]... FILE1 FILE2
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
When FILE1 or FILE2 (not both) is -, read standard input.
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and
column three contains lines common to both files.
...
Default three column output
Consider below sample input files
$ # sorted input files viewed side by side
$ paste colors_1.txt colors_2.txt
Blue Black
Brown Blue
Purple Green
Red Red
Teal White
Yellow
- Without any option, comm gives 3-column output
  - lines unique to first file
  - lines unique to second file
  - lines common to both files
$ comm colors_1.txt colors_2.txt
        Black
                Blue
Brown
        Green
Purple
                Red
Teal
        White
Yellow
Suppressing columns
- -1 suppress lines unique to first file
- -2 suppress lines unique to second file
- -3 suppress lines common to both files
$ # suppressing column 3
$ comm -3 colors_1.txt colors_2.txt
        Black
Brown
        Green
Purple
Teal
        White
Yellow
- Combining options gives three distinct and useful constructs
- First, getting only common lines to both files
$ comm -12 colors_1.txt colors_2.txt
Blue
Red
- Second, lines unique to first file
$ comm -23 colors_1.txt colors_2.txt
Brown
Purple
Teal
Yellow
- And the third, lines unique to second file
$ comm -13 colors_1.txt colors_2.txt
Black
Green
White
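The three constructs above assume sorted input. For unsorted files, sorting into temporary files first is one way to go, taking care that sort and comm use the same locale. A sketch with the same colors in shuffled order; the file names are made up.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' Teal Blue Red Brown Yellow Purple > "$tmpdir/c1.txt"
printf '%s\n' White Red Black Green Blue > "$tmpdir/c2.txt"

# sort with the same locale that comm will use for comparing lines
LC_ALL=C sort "$tmpdir/c1.txt" > "$tmpdir/s1.txt"
LC_ALL=C sort "$tmpdir/c2.txt" > "$tmpdir/s2.txt"

common=$(LC_ALL=C comm -12 "$tmpdir/s1.txt" "$tmpdir/s2.txt")
only_first=$(LC_ALL=C comm -23 "$tmpdir/s1.txt" "$tmpdir/s2.txt")
only_second=$(LC_ALL=C comm -13 "$tmpdir/s1.txt" "$tmpdir/s2.txt")

printf 'common:\n%s\n' "$common"
printf 'only in first:\n%s\n' "$only_first"
printf 'only in second:\n%s\n' "$only_second"
rm -r "$tmpdir"
```

In bash, the temporary files can be replaced by process substitution, as in comm -12 <(sort c1.txt) <(sort c2.txt).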
- See also how the above three cases can be done using grep alone
  - Note that the input files do not need to be sorted for the grep solution

If a sort order different from the default is required, use --nocheck-order to suppress the error message
$ comm -23 <(sort -n numbers.txt) <(sort -n nums.txt)
3
comm: file 1 is not in sorted order
20
53
101
$ comm --nocheck-order -23 <(sort -n numbers.txt) <(sort -n nums.txt)
3
20
53
101
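Note that --nocheck-order only silences the check; comm still compares lines as text, so numerically sorted input may give unexpected results when the two files have lines in common. A more cautious sketch is to sort both inputs as plain text for comm and apply the numeric sort afterwards. The number lists here are hypothetical, with 20 as the only common line.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 20 53 3 101 | LC_ALL=C sort > "$tmpdir/first.txt"
printf '%s\n' 10 563 1 20 | LC_ALL=C sort > "$tmpdir/second.txt"

# compare in plain byte order, then present the result numerically
only_first=$(LC_ALL=C comm -23 "$tmpdir/first.txt" "$tmpdir/second.txt" | sort -n)
printf '%s\n' "$only_first"
rm -r "$tmpdir"
```

The result is 3, 53, 101; the common line 20 is correctly left out.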
Files with duplicates
- As many duplicate lines as match in both files will be considered common
- The rest will be unique to the respective files
- This is useful for cases like finding lines present in the first file but not in the second, taking into consideration the count of duplicates as well
  - This solution won't be possible with grep
$ paste list1 list2
a a
a b
a c
b c
b d
c
$ comm list1 list2
                a
a
a
                b
b
                c
        c
        d
$ comm -23 list1 list2
a
a
b
Further reading for comm
- man comm and info comm for more options and detailed documentation
- comm Q&A on unix stackexchange
shuf
$ shuf --version | head -n1
shuf (GNU coreutils) 8.25
$ man shuf
SHUF(1) User Commands SHUF(1)
NAME
shuf - generate random permutations
SYNOPSIS
shuf [OPTION]... [FILE]
shuf -e [OPTION]... [ARG]...
shuf -i LO-HI [OPTION]...
DESCRIPTION
Write a random permutation of the input lines to standard output.
With no FILE, or when FILE is -, read standard input.
...
Random lines
- Without repeating input lines
$ cat nums.txt
1
10
10
12
23
563
$ # duplicates can end up anywhere
$ # all lines are part of output
$ shuf nums.txt
10
23
1
10
563
12
$ # limit max number of output lines
$ shuf -n2 nums.txt
563
23
- Use the -o option to specify an output file name instead of displaying on stdout
- Helpful for in-place editing
$ shuf nums.txt -o nums.txt
$ cat nums.txt
10
12
23
10
563
1
- With repeated input lines
$ # -n3 for max 3 lines, -r allows input lines to be repeated
$ shuf -n3 -r nums.txt
1
1
563
$ seq 3 | shuf -n5 -r
2
1
2
1
2
$ # if a limit using -n is not specified, shuf -r will output lines indefinitely
- Use the -e option to specify multiple input lines from the command line itself
$ shuf -e red blue green
green
blue
red
$ shuf -e 'hi there' 'hello world' foo bar
bar
hi there
foo
hello world
$ shuf -n2 -e 'hi there' 'hello world' foo bar
foo
hi there
$ shuf -r -n4 -e foo bar
foo
foo
bar
foo
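For testing scripts, a repeatable shuffle can be handy. GNU shuf has a --random-source option that takes its random bytes from the given file, so the same file should give the same permutation each time. A sketch under that assumption; the seed file name and content are arbitrary, it only needs to hold enough bytes.

```shell
tmpdir=$(mktemp -d)
# any file with enough bytes can serve as the source of "randomness"
seq 1000 > "$tmpdir/seed.txt"

first=$(seq 6 | shuf --random-source="$tmpdir/seed.txt")
second=$(seq 6 | shuf --random-source="$tmpdir/seed.txt")

printf '%s\n' "$first"
[ "$first" = "$second" ] && echo 'same order on both runs'
rm -r "$tmpdir"
```

The output is no longer random in any useful sense, so this is suited to reproducible tests only.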
Random integer numbers
- The -i option accepts an integer range as input to be shuffled
$ shuf -i 3-8
3
7
6
4
8
5
- Combine with other options as needed
$ shuf -n3 -i 3-8
5
4
7
$ shuf -r -n4 -i 3-8
5
5
7
8
$ shuf -r -n5 -i 0-1
1
0
0
1
1
- Use seq to generate the input if negative numbers, floating point, etc. are needed
$ seq 2 -1 -2 | shuf
2
-1
-2
0
1
$ seq 0.3 0.1 0.7 | shuf -n3
0.4
0.5
0.7
Further reading for shuf
- man shuf and info shuf for more options and detailed documentation
- Generate random numbers in specific range
- Variable - randomly choose among three numbers
- Related to 'random' stuff: