Text Processing
- String methods
- Regular Expressions
- Pattern matching and extraction
- Search and Replace
- Compiling Regular Expressions
- Further Reading on Regular Expressions
String methods
- translate string characters
- str.maketrans() to get translation table
- translate() to perform the string mapping based on translation table
- the first argument to maketrans() is string characters to be replaced, the second is characters to replace with and the third is characters to be mapped to None
- character translation examples
>>> greeting = '===== Have a great day ====='
>>> greeting.translate(str.maketrans('=', '-'))
'----- Have a great day -----'
>>> greeting = '===== Have a great day!! ====='
>>> greeting.translate(str.maketrans('=', '-', '!'))
'----- Have a great day -----'
>>> import string
>>> quote = 'SIMPLICITY IS THE ULTIMATE SOPHISTICATION'
>>> tr_table = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
>>> quote.translate(tr_table)
'simplicity is the ultimate sophistication'
>>> sentence = "Thi1s is34 a senten6ce"
>>> sentence.translate(str.maketrans('', '', string.digits))
'This is a sentence'
>>> greeting.translate(str.maketrans('', '', string.punctuation))
' Have a great day '
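As a small sketch beyond the two- and three-argument forms shown above, str.maketrans() also accepts a single dictionary, which lets a character map to a longer string or to None. The sample text and the names table and cleaned are made up for illustration.

```python
# single-dictionary form: keys are single characters,
# values are replacement strings, or None to delete the character
table = str.maketrans({'&': 'and', '!': None})

cleaned = 'fish & chips!!'.translate(table)
print(cleaned)  # fish and chips
```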
- removing leading/trailing/both characters
- only consecutive characters from start/end string are removed
- by default whitespace characters are stripped
- if more than one character is specified, it is treated as a set and all combinations of it are used
>>> greeting = ' Have a nice day :) '
>>> greeting.strip()
'Have a nice day :)'
>>> greeting.rstrip()
' Have a nice day :)'
>>> greeting.lstrip()
'Have a nice day :) '
>>> greeting.strip(') :')
'Have a nice day'
>>> greeting = '===== Have a great day!! ====='
>>> greeting.strip('=')
' Have a great day!! '
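Because the argument is treated as a set of characters, strip() and its variants are easy to misuse when the intent is to remove an exact prefix or suffix. The following sketch, using a made-up filename, shows the difference and one possible workaround with endswith() and slicing.

```python
filename = 'test_test.txt'

# the argument is a set of characters, not a suffix:
# trailing 't', 'x', 't', '.', 't' are all removed
stripped = filename.rstrip('.txt')
print(stripped)  # test_tes

# removing an exact suffix instead
suffix = '.txt'
if filename.endswith(suffix):
    base = filename[:-len(suffix)]
else:
    base = filename
print(base)  # test_test
```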
- styling
- width argument specifies total output string length
>>> ' Hello World '.center(40, '*')
'************* Hello World **************'
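The same width and fill character idea applies to left and right alignment. A minimal sketch with illustrative values, using the ljust(), rjust() and zfill() string methods:

```python
label = 'Hello'

left = label.ljust(10, '-')    # pad on the right
right = label.rjust(10, '-')   # pad on the left
zeros = '42'.zfill(5)          # pad with leading zeros

print(left)   # Hello-----
print(right)  # -----Hello
print(zeros)  # 00042
```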
- changing case and case checking
>>> sentence = 'thIs iS a saMple StrIng'
>>> sentence.capitalize()
'This is a sample string'
>>> sentence.title()
'This Is A Sample String'
>>> sentence.lower()
'this is a sample string'
>>> sentence.upper()
'THIS IS A SAMPLE STRING'
>>> sentence.swapcase()
'THiS Is A SAmPLE sTRiNG'
>>> 'good'.islower()
True
>>> 'good'.isupper()
False
- check if string is made up of numbers
>>> '1'.isnumeric()
True
>>> 'abc1'.isnumeric()
False
>>> '1.2'.isnumeric()
False
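As the last example shows, isnumeric() rejects strings with a decimal point (a leading sign is rejected as well). One possible way to check whether a string can be read as a number is to attempt the conversion; is_number below is a hypothetical helper, not a built-in.

```python
def is_number(text):
    """Return True if float() can parse text (hypothetical helper)."""
    try:
        float(text)
    except ValueError:
        return False
    return True

# note: float() also accepts forms such as '1e5', 'inf' and 'nan'
results = [is_number(s) for s in ('1', '1.2', '-3', 'abc1')]
print(results)  # [True, True, True, False]
```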
- check if character sequence is present or not
>>> sentence = 'This is a sample string'
>>> 'is' in sentence
True
>>> 'this' in sentence
False
>>> 'This' in sentence
True
>>> 'this' in sentence.lower()
True
>>> 'is a' in sentence
True
>>> 'test' not in sentence
True
- get number of times character sequence is present (non-overlapping)
>>> sentence = 'This is a sample string'
>>> sentence.count('is')
2
>>> sentence.count('w')
0
>>> word = 'phototonic'
>>> word.count('oto')
1
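If overlapping occurrences need to be counted as well, one option is a regular expression with a lookahead assertion. Note that (?=...) is not listed in the reference tables of this chapter; it matches a position without consuming characters. A minimal sketch:

```python
import re

word = 'phototonic'

# str.count() skips past each match, so 'oto' is found once
non_overlapping = word.count('oto')

# lookahead matches at every position where 'oto' starts
overlapping = len(re.findall(r'(?=oto)', word))

print(non_overlapping, overlapping)  # 1 2
```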
- matching character sequence at start/end of string
>>> sentence
'This is a sample string'
>>> sentence.startswith('This')
True
>>> sentence.startswith('The')
False
>>> sentence.endswith('ing')
True
>>> sentence.endswith('ly')
False
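Both startswith() and endswith() also accept a tuple of strings, in which case the result is True if any one of them matches. A short sketch reusing the sentence from above:

```python
sentence = 'This is a sample string'

# True if the string starts with any of the given prefixes
starts = sentence.startswith(('The', 'This'))

# False, as none of the given suffixes match
ends = sentence.endswith(('ly', 'ed'))

print(starts, ends)  # True False
```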
- split string based on character sequence
- returns a list
- to split using regular expressions, use re.split() instead
>>> sentence = 'This is a sample string'
>>> sentence.split()
['This', 'is', 'a', 'sample', 'string']
>>> "oranges:5".split(':')
['oranges', '5']
>>> "oranges :: 5".split(' :: ')
['oranges', '5']
>>> "a e i o u".split(' ', maxsplit=1)
['a', 'e i o u']
>>> "a e i o u".split(' ', maxsplit=2)
['a', 'e', 'i o u']
>>> line = '{1.0 2.0 3.0}'
>>> nums = [float(s) for s in line.strip('{}').split()]
>>> nums
[1.0, 2.0, 3.0]
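Two related methods may be handy alongside split(): rsplit() starts splitting from the end of the string, and partition() always returns a three-element tuple around the first occurrence of the separator. The sample strings below are made up for illustration.

```python
path = 'usr/local/bin/python'

# split only once, starting from the right
head, tail = path.rsplit('/', maxsplit=1)
print(head, tail)  # usr/local/bin python

# text before the separator, the separator itself, text after it
key, sep, value = 'name=a=b'.partition('=')
print(key, value)  # name a=b
```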
- joining list of strings
>>> str_list = ['This', 'is', 'a', 'sample', 'string']
>>> ' '.join(str_list)
'This is a sample string'
>>> '-'.join(str_list)
'This-is-a-sample-string'
>>> c = ' :: '
>>> c.join(str_list)
'This :: is :: a :: sample :: string'
- replace characters
- third argument specifies how many times replace has to be performed
- variable has to be explicitly re-assigned to change its value
>>> phrase = '2 be or not 2 be'
>>> phrase.replace('2', 'to')
'to be or not to be'
>>> phrase
'2 be or not 2 be'
>>> phrase.replace('2', 'to', 1)
'to be or not 2 be'
>>> phrase = phrase.replace('2', 'to')
>>> phrase
'to be or not to be'
Further Reading
Regular Expressions
- Handy reference of regular expression (RE) elements
Meta characters | Description |
---|---|
\A | anchor to restrict matching to beginning of string |
\Z | anchor to restrict matching to end of string |
^ | anchor to restrict matching to beginning of line |
$ | anchor to restrict matching to end of line |
. | Match any character except newline character \n |
\| | OR operator for matching multiple patterns |
(RE) | capturing group |
(?:RE) | non-capturing group |
[] | Character class - match one character among many |
\^ | prefix \ to literally match meta characters like ^ |
Greedy Quantifiers | Description |
---|---|
* | Match zero or more times |
+ | Match one or more times |
? | Match zero or one times |
{m,n} | Match m to n times (inclusive) |
{m,} | Match at least m times |
{,n} | Match up to n times (including 0 times) |
{n} | Match exactly n times |
Appending a ? to greedy quantifiers makes them non-greedy
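The difference between greedy and non-greedy matching is easiest to see side by side. A minimal sketch with a made-up input string:

```python
import re

text = '<a> b <c>'

# greedy: .* grabs as much as possible, up to the last '>'
greedy = re.findall(r'<.*>', text)

# non-greedy: .*? stops at the first '>' that completes a match
lazy = re.findall(r'<.*?>', text)

print(greedy)  # ['<a> b <c>']
print(lazy)    # ['<a>', '<c>']
```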
Character classes | Description |
---|---|
[aeiou] | Match any vowel |
[^aeiou] | ^ inverts selection, so this matches any character other than a vowel |
[a-f] | - defines a range, so this matches any of abcdef characters |
\d | Match a digit, same as [0-9] |
\D | Match non-digit, same as [^0-9] or [^\d] |
\w | Match alphanumeric and underscore character, same as [a-zA-Z0-9_] |
\W | Match any character other than alphanumeric and underscore, same as [^a-zA-Z0-9_] or [^\w] |
\s | Match white-space character, same as [\ \t\n\r\f\v] |
\S | Match non white-space character, same as [^\s] |
\b | word boundary, see \w for characters constituting a word |
\B | not a word boundary |
Flags | Description |
---|---|
re.I | Ignore case |
re.M | Multiline mode, ^ and $ anchors work on lines |
re.S | Singleline mode, . will also match \n |
re.X | Verbose mode, for better readability and adding comments |
See Python docs - Compilation Flags for more details and long names for flags
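The effect of each flag in the table above can be sketched as follows. The input strings and variable names are chosen for illustration only; flags can be combined with the | operator.

```python
import re

text = 'hi there\nhave a nice day'

# without re.M, ^ matches only at the start of the string
first_only = re.findall(r'^h\w+', text)               # ['hi']
every_line = re.findall(r'^h\w+', text, flags=re.M)   # ['hi', 'have']

# without re.S, . does not match the newline character
no_match = re.search(r'there.have', text)             # None
across = re.search(r'there.have', text, flags=re.S)[0]

# re.X: whitespace and comments inside the pattern are ignored
num_pat = re.compile(r'''
    \b0*          # optional leading zeros
    [1-9]\d{2,}   # non-zero digit followed by two or more digits
    \b
    ''', flags=re.X)
big = num_pat.findall('0501 035 154 12 26 98234')     # ['0501', '154', '98234']

# combining flags
combined = re.findall(r'^H\w+', text, flags=re.I | re.M)  # ['hi', 'have']
```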
Variable | Description |
---|---|
\1 , \2 , \3 ... \99 | backreferencing matched patterns |
\g<1> , \g<2> , \g<3> ... | backreferencing matched patterns, prevents ambiguity |
\g<0> | entire matched portion |
\0 and \100 onwards are considered as octal values, hence cannot be used as backreferences.
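The ambiguity that \g<1> prevents shows up when a backreference in the replacement is immediately followed by a digit. A minimal sketch with a made-up input:

```python
import re

# \g<1>0 means: contents of group 1, followed by the character 0
safe = re.sub(r'(a)', r'\g<1>0', 'cat')
print(safe)  # ca0t

# \10 is read as a reference to group 10, which does not exist here
try:
    re.sub(r'(a)', r'\10', 'cat')
    failed = False
except re.error:
    failed = True
print(failed)  # True
```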
Pattern matching and extraction
To match/extract sequence of characters, use
- re.search() to see if input string contains a pattern or not
- re.findall() to get a list of all matching portions
- re.finditer() to get an iterator of re.Match objects of all matching portions
- re.split() to get a list from splitting input string based on a pattern
Their syntax is as follows:
re.search(pattern, string, flags=0)
re.findall(pattern, string, flags=0)
re.finditer(pattern, string, flags=0)
re.split(pattern, string, maxsplit=0, flags=0)
- As a good practice, always use raw strings to construct RE, unless other formats are required
- this will avoid clash of backslash escaping between RE and normal quoted strings
- examples for re.search
>>> sentence = 'This is a sample string'
# using normal string methods
>>> 'is' in sentence
True
>>> 'xyz' in sentence
False
# need to load the re module before use
>>> import re
# check if 'sentence' contains the pattern described by RE argument
>>> bool(re.search(r'is', sentence))
True
>>> bool(re.search(r'this', sentence, flags=re.I))
True
>>> bool(re.search(r'xyz', sentence))
False
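The advice above about raw strings can be checked with the \b word boundary: in a normal quoted string, '\b' is the backspace character, so the regular expression engine never sees a word boundary. A small sketch:

```python
import re

normal_len = len('\b')   # 1, a single backspace character
raw_len = len(r'\b')     # 2, a backslash followed by 'b'

# pattern is backspace + 'par', which does not occur in the input
without_raw = re.findall('\bpar', 'par spar')

# pattern is word boundary + 'par'
with_raw = re.findall(r'\bpar', 'par spar')

print(without_raw)  # []
print(with_raw)     # ['par']
```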
- examples for re.findall
# match whole word par with optional s at start and e at end
>>> re.findall(r'\bs?pare?\b', 'par spar apparent spare part pare')
['par', 'spar', 'spare', 'pare']
# numbers >= 100 with optional leading zeros
>>> re.findall(r'\b0*[1-9]\d{2,}\b', '0501 035 154 12 26 98234')
['0501', '154', '98234']
# if multiple capturing groups are used, each element of output
# will be a tuple of strings of all the capture groups
>>> re.findall(r'(x*):(y*)', 'xx:yyy x: x:yy :y')
[('xx', 'yyy'), ('x', ''), ('x', 'yy'), ('', 'y')]
# normal capture group will hinder ability to get whole match
# non-capturing group to the rescue
>>> re.findall(r'\b\w*(?:st|in)\b', 'cost akin more east run against')
['cost', 'akin', 'east', 'against']
# useful for debugging purposes as well before applying substitution
>>> re.findall(r't.*?a', 'that is quite a fabricated tale')
['tha', 't is quite a', 'ted ta']
- examples for re.split
# split based on one or more digit characters
>>> re.split(r'\d+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']
# split based on digit or whitespace characters
>>> re.split(r'[\d\s]+', '**1\f2\n3star\t7 77\r**')
['**', 'star', '**']
# to include the matching delimiter strings as well in the output
>>> re.split(r'(\d+)', 'Sample123string42with777numbers')
['Sample', '123', 'string', '42', 'with', '777', 'numbers']
# use non-capturing group if capturing is not needed
>>> re.split(r'hand(?:y|ful)', '123handed42handy777handful500')
['123handed42', '777', '500']
- backreferencing
# whole words that have at least one consecutive repeated character
>>> words = ['effort', 'flee', 'facade', 'oddball', 'rat', 'tool']
>>> [w for w in words if re.search(r'\b\w*(\w)\1\w*\b', w)]
['effort', 'flee', 'oddball', 'tool']
- The re.search function returns a re.Match object from which various details can be extracted like the matched portion of string, location of matched portion, etc
- Note that output here is shown for Python version 3.7
>>> re.search(r'b.*d', 'abc ac adc abbbc')
<re.Match object; span=(1, 9), match='bc ac ad'>
# retrieving entire matched portion
>>> re.search(r'b.*d', 'abc ac adc abbbc')[0]
'bc ac ad'
# capture group example
>>> m = re.search(r'a(.*)d(.*a)', 'abc ac adc abbbc')
# to get matched portion of second capture group
>>> m[2]
'c a'
# to get a tuple of all the capture groups
>>> m.groups()
('bc ac a', 'c a')
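A few more details can be pulled out of the re.Match object, such as the location of the match. Since re.search returns None when nothing matches, indexing the result directly would raise an error in that case; the guard below is one possible way to handle it.

```python
import re

m = re.search(r'b.*d', 'abc ac adc abbbc')
span = m.span()                    # (1, 9)
start, end = m.start(), m.end()    # 1 and 9
whole = m.group(0)                 # 'bc ac ad', same as m[0]

# no match: re.search gives None, so check before indexing
m2 = re.search(r'xyz', 'abc ac adc abbbc')
result = m2[0] if m2 else 'no match'
print(result)  # no match
```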
- examples for re.finditer
>>> m_iter = re.finditer(r'(x*):(y*)', 'xx:yyy x: x:yy :y')
>>> [(m[1], m[2]) for m in m_iter]
[('xx', 'yyy'), ('x', ''), ('x', 'yy'), ('', 'y')]
>>> m_iter = re.finditer(r'ab+c', 'abc ac adc abbbc')
>>> for m in m_iter:
...     print(m.span())
...
(0, 3)
(11, 16)
Search and Replace
Syntax
re.sub(pattern, repl, string, count=0, flags=0)
- examples
- Note that as strings are immutable, re.sub will not change the value of the variable passed to it; the result has to be explicitly assigned
>>> ip_lines = "catapults\nconcatenate\ncat"
>>> print(re.sub(r'^', r'* ', ip_lines, flags=re.M))
* catapults
* concatenate
* cat
# replace 'par' only at start of word
>>> re.sub(r'\bpar', r'X', 'par spar apparent spare part')
'X spar apparent spare Xt'
# same as: r'part|parrot|parent'
>>> re.sub(r'par(en|ro)?t', r'X', 'par part parrot parent')
'par X X X'
# remove first two columns where : is delimiter
>>> re.sub(r'\A([^:]+:){2}', r'', 'foo:123:bar:baz', count=1)
'bar:baz'
- backreferencing
# remove any number of consecutive duplicate words separated by space
# quantifiers can be applied to backreferences too!
>>> re.sub(r'\b(\w+)( \1)+\b', r'\1', 'aa a a a 42 f_1 f_1 f_13.14')
'aa a 42 f_1 f_13.14'
# add something around the matched strings
>>> re.sub(r'\d+', r'(\g<0>0)', '52 apples and 31 mangoes')
'(520) apples and (310) mangoes'
# swap words that are separated by a comma
>>> re.sub(r'(\w+),(\w+)', r'\2,\1', 'a,b 42,24')
'b,a 24,42'
- using functions in replace part of re.sub()
- Note that Python version 3.7 is used here
>>> from math import factorial
>>> numbers = '1 2 3 4 5'
# the function receives a re.Match object for each match
>>> def fact_num(m):
...     return str(factorial(int(m[0])))
...
>>> re.sub(r'\d+', fact_num, numbers)
'1 2 6 24 120'
# using lambda
>>> re.sub(r'\d+', lambda m: str(factorial(int(m[0]))), numbers)
'1 2 6 24 120'
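Another common use of a function in the replacement part is a dictionary lookup, for example to swap words in a single pass. The dictionary and input text below are made up for illustration.

```python
import re

swap = {'cat': 'dog', 'dog': 'cat'}
text = 'cat chases dog'

# each matched word is looked up in the dictionary
swapped = re.sub(r'\b(?:cat|dog)\b', lambda m: swap[m[0]], text)
print(swapped)  # dog chases cat
```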
Compiling Regular Expressions
- Regular expressions can be compiled using re.compile function, which gives back a re.Pattern object
- The top level re module functions are all available as methods for this object
- Compiling a regular expression helps if the RE has to be used in multiple places or called upon multiple times inside a loop (speed benefit)
- By default, Python maintains a small list of recently used RE, so the speed benefit doesn't apply for trivial use cases
>>> pet = re.compile(r'dog')
>>> type(pet)
<class 're.Pattern'>
>>> bool(pet.search('They bought a dog'))
True
>>> bool(pet.search('A cat crossed their path'))
False
>>> remove_parentheses = re.compile(r'\([^)]*\)')
>>> remove_parentheses.sub('', 'a+b(addition) - foo() + c%d(#modulo)')
'a+b - foo + c%d'
>>> remove_parentheses.sub('', 'Hi there(greeting). Nice day(a(b)')
'Hi there. Nice day'
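Flags can be passed to re.compile as well, and the resulting object can then be reused, for instance inside a loop or a list comprehension. A minimal sketch; the pattern and sample lines are illustrative only.

```python
import re

# compile once with the ignore-case flag, reuse many times
pet = re.compile(r'\b(?:dog|cat)s?\b', flags=re.I)

lines = ['They bought a Dog', 'A cat crossed their path', 'No pets here']
matched = [line for line in lines if pet.search(line)]
print(matched)  # ['They bought a Dog', 'A cat crossed their path']

found = pet.findall('Cats and dogs')
print(found)  # ['Cats', 'dogs']
```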
Further Reading on Regular Expressions
- Python re(gex)? - a book on regular expressions
- Python docs - re module
- Python docs - introductory tutorial to using regular expressions
- Comprehensive reference: What does this regex mean?
- rexegg - tutorials, tricks and more
- regular-expressions - tutorials and tools
- CommonRegex - collection of common regular expressions
- Practice tools
- regex101 - visual aid and online testing tool for regular expressions, select flavor as Python before use
- debuggex - railroad diagrams for regular expressions, select flavor as Python before use
- regexone - interactive tutorial
- cheatsheet - one can also learn it interactively
- regexcrossword - practice by solving crosswords, read 'How to play' section before you start