Text Processing
The rich set of text processing commands is comprehensive and time saving. Just knowing that these commands exist helps you avoid writing yet another script (which takes time and effort, plus debugging) – a trap many beginners fall into. An extensive list of text processing commands and examples can be found here
sort
sort lines of text files
As the name implies, this command is used to sort files. How about alphabetic sort and numeric sort? Possible. How about sorting a particular column? Possible. Prioritized multiple sorting order? Possible. Randomize? Unique? Just about any sorting need is catered to by this powerful command
Options
- `-R` random sort
- `-r` reverse the sort order
- `-o` redirect sorted result to specified filename, very useful to sort a file in-place
- `-n` sort numerically
- `-V` version sort, aware of numbers within text
- `-h` sort human readable numbers like 4K, 3M, etc
- `-k` sort via key
- `-u` sort uniquely
- `-b` ignore leading white-spaces of a line while sorting
- `-t` use SEP instead of non-blank to blank transition as field separator
Examples
- `sort dir_list.txt` display sorted file on standard output
- `sort -bn numbers.txt -o numbers.txt` sort numbers.txt numerically (ignoring leading white-spaces) and overwrite the file with the sorted output
- `sort -R crypto_keys.txt -o crypto_keys_random.txt` sort randomly and write to a new file
    - `shuf crypto_keys.txt -o crypto_keys_random.txt` can also be used
- `du -sh * | sort -h` sort file/directory sizes in the current directory in human readable format
$ cat ip.txt
6.2 : 897 : bar
3.1 : 32 : foo
2.3 : 012 : bar
1.2 : 123 : xyz
$ # -k3,3 means sort from the 3rd column to the 3rd column, i.e. only the 3rd column
$ # for ex: to sort from 2nd column till end, use -k2
$ sort -t: -k3,3 ip.txt
2.3 : 012 : bar
6.2 : 897 : bar
3.1 : 32 : foo
1.2 : 123 : xyz
$ # -n option for numeric sort, check out what happens when -n is not used
$ sort -t: -k2,2n ip.txt
2.3 : 012 : bar
3.1 : 32 : foo
1.2 : 123 : xyz
6.2 : 897 : bar
$ # more than one rule can be specified to resolve same values
$ sort -t: -k3,3 -k1,1rn ip.txt
6.2 : 897 : bar
2.3 : 012 : bar
3.1 : 32 : foo
1.2 : 123 : xyz
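A quick sketch to contrast plain sort with the `-V` version sort, using made-up file names fed via `printf`:

```shell
# plain sort compares character by character, so file10 sorts before file2
printf 'file10\nfile2\nfile1\n' | sort
# file1
# file10
# file2

# version sort treats the embedded numbers numerically
printf 'file10\nfile2\nfile1\n' | sort -V
# file1
# file2
# file10
```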
uniq
report or omit repeated lines
This command is specialized for recognizing duplicates. It usually requires sorted input, as the comparison is made on adjacent lines only
Options
- `-d` print only duplicate lines
- `-c` prefix count to occurrences
- `-u` print only unique lines
Examples
- `sort test_list.txt | uniq` outputs lines of test_list.txt in sorted order with duplicate lines removed
    - `uniq <(sort test_list.txt)` same command using process substitution
    - `sort -u test_list.txt` equivalent command
- `uniq -d sorted_list.txt` print only duplicate lines
- `uniq -cd sorted_list.txt` print only duplicate lines and prefix each line with the number of times it is repeated
- `uniq -u sorted_list.txt` print only unique lines; repeated lines are ignored
- uniq Q&A on unix stackexchange
$ echo -e 'Blue\nRed\nGreen\nBlue\nRed\nBlack\nRed' > colors.txt
$ uniq colors.txt
Blue
Red
Green
Blue
Red
Black
Red
$ echo -e 'Blue\nRed\nGreen\nBlue\nRed\nBlack\nRed' | sort > sorted_colors.txt
$ uniq sorted_colors.txt
Black
Blue
Green
Red
$ uniq -d sorted_colors.txt
Blue
Red
$ uniq -cd sorted_colors.txt
2 Blue
3 Red
$ uniq -u sorted_colors.txt
Black
Green
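These options combine into a common frequency-count idiom: sort to make duplicates adjacent, count with `uniq -c`, then order by the counts. A sketch using the same colors data:

```shell
# sort groups duplicates together, uniq -c counts each group,
# sort -rn orders the result by count, highest first
printf 'Blue\nRed\nGreen\nBlue\nRed\nBlack\nRed\n' | sort | uniq -c | sort -rn
# first line will show count 3 for Red
```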
comm
compare two sorted files line by line
Without any options, it prints output in three columns - lines unique to file1, lines unique to file2 and lines common to both files
Options
- `-1` suppress lines unique to file1
- `-2` suppress lines unique to file2
- `-3` suppress lines common to both files
Examples
- `comm -23 sorted_file1.txt sorted_file2.txt` print lines unique to sorted_file1.txt
    - `comm -23 <(sort file1.txt) <(sort file2.txt)` same command using process substitution, if sorted input files are not available
- `comm -13 sorted_file1.txt sorted_file2.txt` print lines unique to sorted_file2.txt
- `comm -12 sorted_file1.txt sorted_file2.txt` print lines common to both files
- comm Q&A on unix stackexchange
$ echo -e 'Brown\nRed\nPurple\nBlue\nTeal\nYellow' | sort > colors_1.txt
$ echo -e 'Red\nGreen\nBlue\nBlack\nWhite' | sort > colors_2.txt
$ # the input files viewed side by side
$ paste colors_1.txt colors_2.txt
Blue Black
Brown Blue
Purple Green
Red Red
Teal White
Yellow
$ # 3 column output - unique to file1, file2 and common
$ comm colors_1.txt colors_2.txt
        Black
                Blue
Brown
        Green
Purple
                Red
Teal
        White
Yellow
$ # suppress 1 and 2 column, gives only common lines
$ comm -12 colors_1.txt colors_2.txt
Blue
Red
$ # suppress 1 and 3 column, gives lines unique to file2
$ comm -13 colors_1.txt colors_2.txt
Black
Green
White
$ # suppress 2 and 3 column, gives lines unique to file1
$ comm -23 colors_1.txt colors_2.txt
Brown
Purple
Teal
Yellow
cmp
compare two files byte by byte
Useful to compare binary files. If the two files are the same, no output is displayed (exit status 0)
If there is a difference, it prints the first difference - line number and byte location (exit status 1)
Options
- `-s` suppress the output, useful in scripts
$ cmp /bin/grep /bin/fgrep
/bin/grep /bin/fgrep differ: byte 25, line 1
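Since `-s` leaves only the exit status, it fits naturally in scripts; a minimal sketch using throwaway files in /tmp:

```shell
# -s suppresses output; the exit status alone tells whether the files match
printf 'hello\n' > /tmp/f1.txt
printf 'hello\n' > /tmp/f2.txt
if cmp -s /tmp/f1.txt /tmp/f2.txt; then
    echo 'files are identical'
else
    echo 'files differ'
fi
# prints: files are identical
```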
- More examples here
diff
compare files line by line
Useful to compare old and new versions of text files
All the differences are printed, which might not be desirable if files are too long
Options
- `-s` convey message when two files are the same
- `-y` two column output
- `-i` ignore case while comparing
- `-w` ignore white-spaces
- `-r` recursively compare files between the two directories specified
- `-q` report only whether files differ, not the details of the difference
Examples
- `diff -s test_list_mar2.txt test_list_mar3.txt` compare the two files
- `diff -s report.log bkp/mar10/` no need to specify the second filename if it is the same
- `diff -qr report/ bkp/mar10/report/` recursively compare files between report and bkp/mar10/report directories; filenames present in only one of the directories are also reported
    - see this link for detailed analysis and corner cases
- `diff report/ bkp/mar10/report/ | grep -w '^diff'` useful trick to get only the names of mismatching files (provided no difference lines contain the whole word diff at the start of line)
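The examples above don't show diff's default output format, so here is a minimal sketch with throwaway files (the file names and contents are made up for illustration):

```shell
# two versions of a file differing in the second line
printf 'good morning\nhello world\nbye\n' > /tmp/old.txt
printf 'good morning\nhello there\nbye\n' > /tmp/new.txt
diff /tmp/old.txt /tmp/new.txt
# 2c2
# < hello world
# ---
# > hello there
```

The `2c2` header means line 2 of the first file was changed (`c`) into line 2 of the second file; `<` marks the old text and `>` the new.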
Further Reading
- diff Q&A on unix stackexchange
- `gvimdiff` edit two, three or four versions of a file with Vim and show differences
- GUI diff and merge tools
tr
translate or delete characters
Options
- `-d` delete the specified characters
- `-c` complement the set of characters to be replaced
Examples
- `tr a-z A-Z < test_list.txt` convert lowercase to uppercase
- `tr -d ._ < test_list.txt` delete the dot and underscore characters
- `tr a-z n-za-m < test_list.txt > encrypted_test_list.txt` encrypt by replacing every lowercase letter with the 13th letter after it (ROT13)
    - running the same command on the encrypted text decrypts it
- tr Q&A on unix stackexchange
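The ROT13 round trip from the last example can be sketched directly on the command line:

```shell
# shifting by 13 twice returns the original text
echo 'hello' | tr a-z n-za-m
# uryyb
echo 'uryyb' | tr a-z n-za-m
# hello
```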
sed
stream editor for filtering and transforming text
Options
- `-n` suppress automatic printing of pattern space
- `-i` edit files in-place (makes a backup if SUFFIX is supplied)
- `-r` use extended regular expressions
- `-e` add the script to the commands to be executed
- `-f` add the contents of script-file to the commands to be executed
- for examples and details, refer to the links given below
commands
We'll be seeing examples only for three commonly used commands
- `d` delete the pattern space
- `p` print out the pattern space
- `s` search and replace
- check out the 'Often-Used Commands' and 'Less Frequently-Used Commands' sections in `info sed` for the complete list of commands
range
By default, `sed` acts on the entire input. This can be narrowed to a specific line number or to a range defined by line numbers, search patterns or a mix of the two
- `n,m` range from the nth line to the mth line, including n and m
- `i~j` act on the ith line and i+j, i+2j, i+3j, etc
    - `1~2` means 1st, 3rd, 5th, 7th, etc lines, i.e. odd-numbered lines
    - `5~3` means 5th, 8th, 11th, etc
- `n` only the nth line
- `$` only the last line
- `/pattern/` lines matching pattern
- `n,/pattern/` nth line to the line matching pattern
- `n,+x` nth line and x lines after
- `/pattern/,m` line matching pattern to the mth line
- `/pattern/,+x` line matching pattern and x lines after
- `/pattern1/,/pattern2/` line matching pattern1 to line matching pattern2
- `/pattern/I` lines matching pattern, with pattern treated case insensitively
- for more details, see the section 'Selecting lines with sed' in `info sed`
- see 'Regular Expressions' in the grep command for an extended regular expressions reference
- also check out the 'Overview of Regular Expression Syntax' section in `info sed`
Examples for selective deletion (d)
- `sed '/cat/d' story.txt` delete every line containing cat
- `sed '/cat/!d' story.txt` delete every line NOT containing cat
- `sed '$d' story.txt` delete the last line of the file
- `sed '2,5d' story.txt` delete lines 2,3,4,5 of the file
- `sed '1,/test/d' dir_list.txt` delete all lines from the beginning of the file to the first occurrence of a line containing test (the matched line is also deleted)
- `sed '/test/,$d' dir_list.txt` delete all lines from the first line containing test to the end of the file
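The line-number forms above can be sketched quickly with `seq` output in place of a story file:

```shell
# delete lines 2 to 4
seq 6 | sed '2,4d'
# 1
# 5
# 6

# delete only the last line
seq 6 | sed '$d'
# 1
# 2
# 3
# 4
# 5
```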
Examples for selective printing (p)
- `sed -n '5p' story.txt` print the 5th line; the `-n` option overrides the default print behavior of sed
    - use `sed '5q;d' story.txt` on large files. Read more
- `sed -n '/cat/p' story.txt` print every line containing the text cat
    - equivalent to `sed '/cat/!d' story.txt`
- `sed -n '4,8!p' story.txt` print all lines except lines 4 to 8
- `man grep | sed -n '/^\s*exit status/I,/^$/p'` extract the exit status information of a command from its manual
    - `/^\s*exit status/I` checks for a line starting with 'exit status' in a case insensitive way; white-space may be present at the start of the line
    - `/^$/` matches an empty line
- `man ls | sed -n '/^\s*-F/,/^$/p'` extract information on a command option from the manual
    - `/^\s*-F/` matches a line starting with the option '-F'; white-space may be present at the start of the line
Examples for search and replace (s)
- `sed -i 's/cat/dog/g' story.txt` search and replace every occurrence of cat with dog in story.txt
- `sed -i.bkp 's/cat/dog/g' story.txt` in addition to in-place editing, create the backup file story.txt.bkp, so that the original file can be restored if a mistake happens
- `sed -i.bkp 's/cat/dog/g' *.txt` perform the operation on all files ending with .txt in the current directory
- `sed -i '5,10s/cat/dog/gI' story.txt` search and replace every occurrence of cat (case insensitive due to the modifier I) with dog, only in lines 5 to 10
- `sed '/cat/ s/animal/mammal/g' story.txt` replace animal with mammal in all lines containing cat
    - since `-i` is not used, output is displayed on standard output and story.txt is not changed
    - spacing between the range and the command is optional: `sed '/cat/s/animal/mammal/g' story.txt` can also be used
- `sed -i -e 's/cat/dog/g' -e 's/lion/tiger/g' story.txt` search and replace every occurrence of cat with dog and lion with tiger
    - any number of `-e` options can be used
    - `sed -i 's/cat/dog/g ; s/lion/tiger/g' story.txt` alternative syntax; spacing around ; is optional
- `sed -r 's/(.*)/abc: \1 :xyz/' list.txt` add the prefix 'abc: ' and the suffix ' :xyz' to every line of list.txt
- `sed -i -r "s/(.*)/$(basename $PWD)\/\1/" dir_list.txt` add the current directory name and a forward-slash character at the start of every line
    - note the use of double quotes to perform command substitution
- `sed -i -r "s|.*|$HOME/\0|" dir_list.txt` add the home directory and a forward-slash at the start of every line
    - since the value of '$HOME' itself contains forward-slash characters, we cannot use `/` as the delimiter
    - any character other than backslash or newline can be used as the delimiter, for example `|` `#` `^` - see this link for more info
    - the back-reference `\0` contains the entire matched string
Example input file
$ cat mem_test.txt
mreg2 = 1200 # starting address
mreg4 = 2180 # ending address
dreg5 = get(mreg2) + get(mreg4)
print dreg5
- replace all reg with register
$ sed 's/reg/register/g' mem_test.txt
mregister2 = 1200 # starting address
mregister4 = 2180 # ending address
dregister5 = get(mregister2) + get(mregister4)
print dregister5
- change start and end address
$ sed 's/1200/1530/; s/2180/1870/' mem_test.txt
mreg2 = 1530 # starting address
mreg4 = 1870 # ending address
dreg5 = get(mreg2) + get(mreg4)
print dreg5
$ # to make changes only on mreg initializations, use
$ # sed '/mreg[0-9] *=/ s/1200/1530/; s/2180/1870/' mem_test.txt
- Using
bash
variables
$ s_add='1760'; e_add='2500'
$ sed "s/1200/$s_add/; s/2180/$e_add/" mem_test.txt
mreg2 = 1760 # starting address
mreg4 = 2500 # ending address
dreg5 = get(mreg2) + get(mreg4)
print dreg5
- split inline commented code to comment + code
$ sed -E 's/^([^#]+)(#.*)/\2\n\1/' mem_test.txt
# starting address
mreg2 = 1200
# ending address
mreg4 = 2180
dreg5 = get(mreg2) + get(mreg4)
print dreg5
- range of lines matching pattern
$ seq 20 | sed -n '/3/,/5/p'
3
4
5
13
14
15
- inplace editing
$ sed -i -E 's/([md]r)eg/\1/g' mem_test.txt
$ cat mem_test.txt
mr2 = 1200 # starting address
mr4 = 2180 # ending address
dr5 = get(mr2) + get(mr4)
print dr5
$ # more than one input files can be given
$ # use glob pattern if files share commonality, ex: *.txt
Further Reading
- sed basics
- sed detailed tutorial
- sed-book
- cheat sheet
- sed examples
- sed one-liners explained
- common search and replace examples with sed
- sed Q&A on unix stackexchange
- sed Q&A on stackoverflow
awk
pattern scanning and text processing language
`awk` derives its name from its authors Alfred Aho, Peter Weinberger and Brian Kernighan.
syntax
awk 'BEGIN {initialize} condition1 {stmts} condition2 {stmts}... END {finish}'
- `BEGIN {initialize}` used to initialize variables (could be user defined or awk variables or both), executed once; optional block
- `condition1 {stmts} condition2 {stmts}...` action performed for every line of input; the condition is optional, and more than one `{}` block can be used with or without a condition
- `END {finish}` perform action once at the end of program; optional block
- commands can be written in a file and passed using the `-f` option instead of writing it all on the command line
- for examples and details, refer to the links given below
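A minimal sketch of all three block types working together, summing `seq` output (the printed strings are arbitrary):

```shell
# BEGIN runs once before input, the middle block runs once per line,
# END runs once after all input is consumed
seq 5 | awk 'BEGIN{print "summing 1 to 5"} {sum += $1} END{print "total: " sum}'
# summing 1 to 5
# total: 15
```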
Example input file
$ cat test.txt
abc : 123 : xyz
3 : 32 : foo
-2.3 : bar : bar
- Just printing something, no input
$ awk 'BEGIN{print "Hello!\nTesting awk one-liner"}'
Hello!
Testing awk one-liner
- search and replace
- when the `{stmts}` portion of `condition {stmts}` is not specified, `{print $0}` is executed by default if the condition evaluates to true
- `1` is a commonly used awk idiom to print the contents of `$0` after performing some processing
- a `print` statement without an argument prints the contents of `$0`
$ # sub will replace only first occurrence
$ # third argument to sub specifies variable to change, defaults to $0
$ awk '{sub("3", "%")} 1' test.txt
abc : 12% : xyz
% : 32 : foo
-2.% : bar : bar
$ # gsub will replace all occurrences
$ awk '{gsub("3", "%")} 1' test.txt
abc : 12% : xyz
% : %2 : foo
-2.% : bar : bar
$ # add a condition to restrict processing only to those records
$ awk '/foo/{gsub("3", "%")} 1' test.txt
abc : 123 : xyz
% : %2 : foo
-2.3 : bar : bar
$ # using shell variables
$ r="@"
$ awk -v r_str="$r" '{sub("3", r_str)} 1' test.txt
abc : 12@ : xyz
@ : 32 : foo
-2.@ : bar : bar
$ # bash environment variables like PWD, HOME are also accessible via ENVIRON
$ s="%" awk '{sub("3", ENVIRON["s"])} 1' test.txt
abc : 12% : xyz
% : 32 : foo
-2.% : bar : bar
- filtering content
$ # regex pattern, by default tested against $0
$ awk '/a/' test.txt
abc : 123 : xyz
-2.3 : bar : bar
$ # use ! to invert condition
$ awk '!/abc/' test.txt
3 : 32 : foo
-2.3 : bar : bar
$ seq 30 | awk 'END{print}'
30
$ # generic, length(var) - default is $0
$ seq 8 13 | awk 'length==1'
8
9
- selecting based on line numbers
    - `NR` is the record (line) number
$ seq 123 135 | awk 'NR==7'
129
$ seq 123 135 | awk 'NR>=3 && NR<=5'
125
126
127
$ seq 5 | awk 'NR>=3'
3
4
5
$ # for large input, use exit to avoid unnecessary record processing
$ seq 14323 14563435 | awk 'NR==234{print; exit}'
14556
- selecting based on start and end conditions
    - for the following examples:
        - numbers 1 to 20 as input
        - regex pattern `/4/` is the start condition
        - regex pattern `/6/` is the end condition
        - `f` is idiomatically used as the name of a flag variable
$ # records between start and end
$ seq 20 | awk '/4/{f=1; next} /6/{f=0} f'
5
15
$ # records between start and end and also includes start
$ seq 20 | awk '/4/{f=1} /6/{f=0} f'
4
5
14
15
$ # records between start and end and also includes end
$ seq 20 | awk '/4/{f=1; next} f; /6/{f=0}'
5
6
15
16
$ # records from start to end
$ seq 20 | awk '/4/{f=1} f{print} /6/{f=0}'
4
5
6
14
15
16
$ # records excluding start to end
$ seq 10 | awk '/4/{f=1} !f; /6/{f=0}'
1
2
3
7
8
9
10
- column manipulations
- by default, one or more consecutive spaces/tabs are considered as field separators
$ echo -e "1 3 4\na b c"
1 3 4
a b c
$ # second column
$ echo -e "1 3 4\na b c" | awk '{print $2}'
3
b
$ # last column
$ echo -e "1 3 4\na b c" | awk '{print $NF}'
4
c
$ # default output field separator is single space character
$ echo -e "1 3 4\na b c" | awk '{print $1, $3}'
1 4
a c
$ # condition for specific field
$ echo -e "1 3 4\na b c" | awk '$2 ~ /[0-9]/'
1 3 4
- specifying a different input/output field separator
    - the separator can be a plain string or a regex; multiple separators can be specified using `|` in a regex pattern
$ awk -F' *: *' '$1 == "3"' test.txt
3 : 32 : foo
$ awk -F' *: *' '{print $1 "," $2}' test.txt
abc,123
3,32
-2.3,bar
$ awk -F' *: *' -v OFS="::" '{print $1, $2}' test.txt
abc::123
3::32
-2.3::bar
$ awk -F: -v OFS="\t" '{print $1 OFS $2}' test.txt
abc 123
3 32
-2.3 bar
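One related detail worth knowing: assigning to any field makes awk rebuild `$0` using `OFS`. A sketch using a record in the same style as test.txt:

```shell
# changing $2 forces awk to reconstruct the whole line with OFS
echo 'abc:123:xyz' | awk 'BEGIN{FS=OFS=":"} {$2 = "NUM"; print}'
# abc:NUM:xyz
```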
- dealing with duplicates, line/field wise
$ cat duplicates.txt
abc 123 ijk
foo 567 xyz
abc 123 ijk
bar 090 pqr
tst 567 zzz
$ # whole line
$ awk '!seen[$0]++' duplicates.txt
abc 123 ijk
foo 567 xyz
bar 090 pqr
tst 567 zzz
$ # particular column
$ awk '!seen[$2]++' duplicates.txt
abc 123 ijk
foo 567 xyz
bar 090 pqr
- inplace editing (the `-i inplace` extension requires GNU awk 4.1.0 or later)
$ awk -i inplace '{print NR ") " $0}' test.txt
$ cat test.txt
1) abc : 123 : xyz
2) 3 : 32 : foo
3) -2.3 : bar : bar
Further Reading
- awk basics
- Gawk: Effective AWK Programming
- awk detailed tutorial
- basic tutorials for grep, awk, sed
- awk one-liners explained
- awk book
- awk cheat-sheet for awk variables, statements, functions, etc
- awk examples
- awk Q&A on unix stackexchange
- awk Q&A on stackoverflow
perl
The Perl 5 language interpreter
Larry Wall wrote Perl as a general purpose scripting language, borrowing features from C, shell scripting, awk, sed, grep, cut, sort etc
Reference tables are given below for constructs frequently used in perl one-liners. Resource links are given at the end for further reading.
Descriptions adapted from perldoc - command switches
Option | Description |
---|---|
-e | execute perl code |
-n | iterate over input files in a loop, lines are NOT printed by default |
-p | iterate over input files in a loop, lines are printed by default |
-l | chomp input line, $\ gets value of $/ if no argument given |
-a | autosplit input lines on space, implicitly sets -n for Perl version 5.20.0 and above |
-F | specifies the pattern to split input lines, implicitly sets -a and -n for Perl version 5.20.0 and above |
-i | edit files inplace, if extension provided make a backup copy |
-0777 | slurp entire file as single string, not advisable for large input files |
Descriptions adapted from perldoc - Special Variables
Variable | Description |
---|---|
$_ | The default input and pattern-searching space |
$. | Current line number |
$/ | input record separator, newline by default |
$\ | output record separator, empty string by default |
@F | contains the fields of each line read, applicable with -a or -F option |
%ENV | contains current environment variables |
$ARGV | contains the name of the current file |
Descriptions adapted from perldoc - functions
Function | Description |
---|---|
length | Returns the length in characters of the value of EXPR. If EXPR is omitted, returns the length of $_ |
eof | Returns 1 if the next read on FILEHANDLE will return end of file |
Simple Perl program
$ perl -e 'print "Hello!\nTesting Perl one-liner\n"'
Hello!
Testing Perl one-liner
Example input file
$ cat test.txt
abc : 123 : xyz
3 : 32 : foo
-2.3 : bar : bar
- Search and replace
$ perl -pe 's/3/%/' test.txt
abc : 12% : xyz
% : 32 : foo
-2.% : bar : bar
$ # use g flag to replace all occurrences, not just first match in line
$ perl -pe 's/3/%/g' test.txt
abc : 12% : xyz
% : %2 : foo
-2.% : bar : bar
$ # conditional replacement
$ perl -pe 's/3/@/g if /foo/' test.txt
abc : 123 : xyz
@ : @2 : foo
-2.3 : bar : bar
$ # using shell variables
$ r="@"
$ perl -pe "s/3/$r/" test.txt
abc : 12@ : xyz
@ : 32 : foo
-2.@ : bar : bar
$ # preferred approach is to use ENV hash variable
$ export s="%"
$ perl -pe 's/3/$ENV{s}/' test.txt
abc : 12% : xyz
% : 32 : foo
-2.% : bar : bar
- Search and replace special characters
The `\Q` and `q()` constructs are helpful to nullify regex meta characters
$ # if not properly escaped or quoted, it can lead to errors
$ echo '*.^[}' | perl -pe 's/*.^[}/abc/'
Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE .^[}/ at -e line 1.
$ echo '*.^[}' | perl -pe 's/\*\.\^\[}/abc/'
abc
$ echo '*.^[}' | perl -pe 's/\Q*.^[}/abc/'
abc
$ echo '*.^[}' | perl -pe 's/\Q*.^[}/\$abc\$/'
$abc$
$ echo '*.^[}' | perl -pe 's/\Q*.^[}/q($abc$)/e'
$abc$
- Print lines based on line number or pattern
$ perl -ne 'print if /a/' test.txt
abc : 123 : xyz
-2.3 : bar : bar
$ perl -ne 'print if !/abc/' test.txt
3 : 32 : foo
-2.3 : bar : bar
$ seq 123 135 | perl -ne 'print if $. == 7'
129
$ seq 1 30 | perl -ne 'print if eof'
30
$ # Use exit to save time on large input files
$ seq 14323 14563435 | perl -ne 'if($. == 234){print; exit}'
14556
$ # length() can also be used instead of length $_
$ seq 8 13 | perl -lne 'print if length $_ == 1'
8
9
- Print range of lines based on line number or pattern
$ seq 123 135 | perl -ne 'print if $. >= 3 && $. <= 5'
125
126
127
$ # $. is default variable compared against when using ..
$ seq 123 135 | perl -ne 'print if 3..5'
125
126
127
$ # can use many alternatives, eof looks more readable
$ seq 5 | perl -ne 'print if 3..eof'
3
4
5
$ # matching regex specified by /pattern/ is checked against $_
$ seq 5 | perl -ne 'print if 3../4/'
3
4
$ seq 1 30 | perl -ne 'print if /4/../6/'
4
5
6
14
15
16
24
25
26
$ seq 2 8 | perl -ne 'print if !(/4/../6/)'
2
3
7
8
- `..` vs `...`
$ echo -e '10\n11\n10' | perl -ne 'print if /10/../10/'
10
10
$ echo -e '10\n11\n10' | perl -ne 'print if /10/.../10/'
10
11
10
- Column manipulations
$ echo -e "1 3 4\na b c" | perl -nale 'print $F[1]'
3
b
$ echo -e "1,3,4,8\na,b,c,d" | perl -F, -lane 'print $F[$#F]'
8
d
$ perl -F: -lane 'print "$F[0] $F[2]"' test.txt
abc xyz
3 foo
-2.3 bar
$ perl -F: -lane '$sum+=$F[1]; END{print $sum}' test.txt
155
$ perl -F: -lane '$F[2] =~ s/\w(?=\w)/$&,/g; print join ":", @F' test.txt
abc : 123 : x,y,z
3 : 32 : f,o,o
-2.3 : bar : b,a,r
$ perl -F'/:\s*[a-z]+/i' -lane 'print $F[0]' test.txt
abc : 123
3 : 32
-2.3
$ perl -F'\s*:\s*' -lane 'print join ",", grep {/[a-z]/i} @F' test.txt
abc,xyz
foo
bar,bar
$ perl -F: -ane 'print if (grep {/\d/} @F) < 2' test.txt
abc : 123 : xyz
-2.3 : bar : bar
- Dealing with duplicates
$ cat duplicates.txt
abc 123 ijk
foo 567 xyz
abc 123 ijk
bar 090 pqr
tst 567 zzz
$ # whole line
$ perl -ne 'print if !$seen{$_}++' duplicates.txt
abc 123 ijk
foo 567 xyz
bar 090 pqr
tst 567 zzz
$ # particular column
$ perl -ane 'print if !$seen{$F[1]}++' duplicates.txt
abc 123 ijk
foo 567 xyz
bar 090 pqr
- Multiline processing
$ # save previous lines to make it easier for multiline matching
$ perl -ne 'print if /3/ && $p =~ /abc/; $p = $_' test.txt
3 : 32 : foo
$ perl -ne 'print "$p$_" if /3/ && $p =~ /abc/; $p = $_' test.txt
abc : 123 : xyz
3 : 32 : foo
$ # with multiline matching, -0777 slurping not advisable for very large files
$ perl -0777 -ne 'print $1 if /.*abc.*\n(.*3.*\n)/' test.txt
3 : 32 : foo
$ perl -0777 -ne 'print $1 if /(.*abc.*\n.*3.*\n)/' test.txt
abc : 123 : xyz
3 : 32 : foo
$ # use s flag to allow .* to match across lines
$ perl -0777 -pe 's/(.*abc.*32)/ABC/s' test.txt
ABC : foo
-2.3 : bar : bar
$ # use m flag if ^$ anchors are needed to match individual lines
$ perl -0777 -pe 's/(.*abc.*3)/ABC/s' test.txt
ABC : bar : bar
$ perl -0777 -pe 's/(.*abc.*^3)/ABC/sm' test.txt
ABC : 32 : foo
-2.3 : bar : bar
$ # print multiple lines after matching line
$ perl -ne 'if(/abc/){ print; foreach (1..2){$n = <>; print $n} }' test.txt
abc : 123 : xyz
3 : 32 : foo
-2.3 : bar : bar
- Using modules
$ echo 'a,b,a,c,d,1,d,c,2,3,1,b' | perl -MList::MoreUtils=uniq -F, -lane 'print join ",",uniq(@F)'
a,b,c,d,1,2,3
$ base64 test.txt
YWJjICA6IDEyMyA6IHh5egozICAgIDogMzIgIDogZm9vCi0yLjMgOiBiYXIgOiBiYXIK
$ base64 test.txt | base64 -d
abc : 123 : xyz
3 : 32 : foo
-2.3 : bar : bar
$ base64 test.txt | perl -MMIME::Base64 -ne 'print decode_base64($_)'
abc : 123 : xyz
3 : 32 : foo
-2.3 : bar : bar
$ perl -MList::MoreUtils=indexes -nale '@i = indexes { /[a-z]/i } @F if $. == 1; print join ",", @F[@i]' test.txt
abc,xyz
3,foo
-2.3,bar
- In place editing
$ perl -i -pe 's/\d/*/g' test.txt
$ cat test.txt
abc : *** : xyz
* : ** : foo
-*.* : bar : bar
$ perl -i.bak -pe 's/\*/^/g' test.txt
$ cat test.txt
abc : ^^^ : xyz
^ : ^^ : foo
-^.^ : bar : bar
$ cat test.txt.bak
abc : *** : xyz
* : ** : foo
-*.* : bar : bar
Further Reading
- Perl Introduction - Introductory course for Perl 5 through examples
- Perl curated resources
- Handy Perl regular expressions
- What does this regex mean?
- Perl one-liners
- Perl command line switches
- Env
cut
remove sections from each line of files
For column operations with well defined delimiters, the `cut` command is handy
Examples
- `ls -l | cut -d' ' -f1` first column of `ls -l`
    - the `-d` option specifies the delimiter character, in this case a single space (the default delimiter is the TAB character)
    - the `-f` option specifies which fields to print, separated by commas; in this case field 1
- `cut -d':' -f1 /etc/passwd` prints the first column of the /etc/passwd file
- `cut -d':' -f1,7 /etc/passwd` prints the 1st and 7th columns of the /etc/passwd file with the : character in between
- `cut -d':' --output-delimiter=' ' -f1,7 /etc/passwd` use space as the delimiter between the 1st and 7th columns while printing
- cut Q&A on unix stackexchange
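The `-f` option also accepts ranges; a sketch with an inline record (the letters are arbitrary):

```shell
# selected fields, a closed range, and open-ended ranges
echo 'a:b:c:d:e' | cut -d: -f2,4
# b:d
echo 'a:b:c:d:e' | cut -d: -f2-
# b:c:d:e
echo 'a:b:c:d:e' | cut -d: -f-3
# a:b:c
```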
paste
merge lines of files
Examples
- `paste list1.txt list2.txt list3.txt > combined_list.txt` combines the three files column-wise into a single file, the entries separated by the TAB character
- `paste -d':' list1.txt list2.txt list3.txt > combined_list.txt` the entries are separated by the : character instead of TAB
- see the pr command for multiple character delimiters
- paste Q&A on unix stackexchange
$ # joining multiple files
$ paste -d, <(seq 5) <(seq 6 10)
1,6
2,7
3,8
4,9
5,10
$ paste -d, <(seq 3) <(seq 4 6) <(seq 7 10)
1,4,7
2,5,8
3,6,9
,,10
- Single column to multiple columns
$ seq 5 | paste - -
1 2
3 4
5
$ # specifying different output delimiter, default is tab
$ seq 5 | paste -d, - -
1,2
3,4
5,
$ # if number of columns to specify is large, use the printf trick
$ seq 5 | paste $(printf -- "- %.s" {1..3})
1 2 3
4 5
- Combine all lines to single line
$ seq 10 | paste -sd,
1,2,3,4,5,6,7,8,9,10
$ # for multiple character delimiter, perl can be used
$ seq 10 | perl -pe 's/\n/ : / if(!eof)'
1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10
column
columnate lists
$ cat dishes.txt
North alootikki baati khichdi makkiroti poha
South appam bisibelebath dosa koottu sevai
West dhokla khakhra modak shiro vadapav
East handoguri litti momo rosgulla shondesh
$ column -t dishes.txt
North  alootikki  baati         khichdi  makkiroti  poha
South  appam      bisibelebath  dosa     koottu     sevai
West   dhokla     khakhra       modak    shiro      vadapav
East   handoguri  litti         momo     rosgulla   shondesh
- More examples here
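`column -t` can also take an input delimiter via the `-s` option, similar to `cut -d`; a sketch with made-up inline data:

```shell
# -s sets the input delimiter, -t aligns the output into a table
printf '1:apple\n20:banana\n' | column -s: -t
```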
pr
convert text files for printing
$ pr sample.txt
2016-05-29 11:00 sample.txt Page 1
This is an example of adding text to a new file using cat command.
Press Ctrl+d on a newline to save and quit.
Adding a line of text at end of file
- Options include converting text files for printing with header, footer, page numbers, double space a file, combine multiple files column wise, etc
- More examples here
$ # single column to multiple column, split vertically
$ # for example, in command below, output of seq is split into two
$ seq 5 | pr -2t
1                                   4
2                                   5
3
$ # different output delimiter can be used by passing string to -s option
$ seq 5 | pr -2ts' '
1 4
2 5
3
$ seq 15 | pr -5ts,
1,4,7,10,13
2,5,8,11,14
3,6,9,12,15
- Use the `-a` option to split across, i.e. fill rows first instead of columns
$ seq 5 | pr -2ats' : '
1 : 2
3 : 4
5
$ seq 15 | pr -5ats,
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
$ # use bash's $'' quoting to expand escape characters like \t for tab
$ seq 5 | pr -3ts$'\t'
1 3 5
2 4
$ # or leave the argument to -s empty as tab is default
$ seq 5 | pr -3ts
1 3 5
2 4
- The default PAGE_WIDTH is 72
- The formula `(col-1)*len(delimiter) + col` seems to work in determining the minimum PAGE_WIDTH required for multiple column output
- The `-J` option will help in turning off line truncation
$ seq 74 | pr -36ats,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36
37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72
73,74
$ seq 74 | pr -37ats,
pr: page width too narrow
$ # (37-1)*1 + 37 = 73
$ seq 74 | pr -Jw 73 -37ats,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37
38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74
$ # (3-1)*4 + 3 = 11
$ seq 6 | pr -Jw 10 -3ats'::::'
pr: page width too narrow
$ seq 6 | pr -Jw 11 -3ats'::::'
1::::2::::3
4::::5::::6
- Use the `-m` option to combine multiple files in parallel
$ pr -mts', ' <(seq 3) <(seq 4 6) <(seq 7 9)
1, 4, 7
2, 5, 8
3, 6, 9
We can use a combination of different commands for complicated operations. For example, transposing a table
$ tr ' ' '\n' < dishes.txt | pr -$(wc -l < dishes.txt)t
North             South             West              East
alootikki         appam             dhokla            handoguri
baati             bisibelebath      khakhra           litti
khichdi           dosa              modak             momo
makkiroti         koottu            shiro             rosgulla
poha              sevai             vadapav           shondesh
Notice how `pr` neatly arranges the columns. If the spacing is too much, we can use `column`
$ tr ' ' '\n' < dishes.txt | pr -$(wc -l < dishes.txt)ts | column -t
North      South         West     East
alootikki  appam         dhokla   handoguri
baati      bisibelebath  khakhra  litti
khichdi    dosa          modak    momo
makkiroti  koottu        shiro    rosgulla
poha       sevai         vadapav  shondesh