Regular expressions are commonly used in Linux command-line tools such as sed, awk, and grep. Most programming languages support them, either built in or through an external library.
The main drawback is that they are difficult to read, but they are well worth the effort to learn: a good regular expression can save you a lot of time.
Let's start with a simple example:
Validating an input string
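import re

# Very simple email pattern; the raw string keeps the backslash intact
x = r"[a-z]+@[a-z]+\.[a-z]+"
s1 = 'user@example.com'  # placeholder sample address
s2 = ''
c = re.match(x, s1)
if c:
    print('ok')
else:
    print('no')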
We declare a regular expression that matches an email address using very simple rules:
- at least one lowercase letter: [a-z]+
- followed by @
- followed by at least one letter
- followed by period
- followed by at least one letter
This is a very simple example: it doesn't accept digits, doesn't check for known extensions (com/net/org), and has other pitfalls. The point is that to add those rules we only need to change the regular expression.
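One of those pitfalls is easy to see in code: re.match only anchors at the start of the string, so trailing characters are accepted. The following is a minimal sketch, assuming re.fullmatch (Python 3.4+) is acceptable for validation; the helper name is_simple_email and the sample address are made up for illustration.

```python
import re

# Same simple pattern as above, written as a raw string.
EMAIL_PATTERN = r"[a-z]+@[a-z]+\.[a-z]+"

def is_simple_email(text):
    """Hypothetical helper: True only if the whole string fits the pattern."""
    return re.fullmatch(EMAIL_PATTERN, text) is not None

# re.match() stops caring once the pattern is satisfied,
# so the trailing "!!!" slips through.
loose = re.match(EMAIL_PATTERN, "joe@example.com!!!") is not None
strict = is_simple_email("joe@example.com!!!")
print(loose, strict)  # True False
```

Whether to use match or fullmatch depends on the goal: for validating a complete input value, fullmatch is usually the safer choice.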
Some match rules:
x?           match 0 or 1 occurrences of x
x+           match 1 or more occurrences of x
x*           match 0 or more occurrences of x
x{m,n}       match between m and n x's
hello        match hello
hello|world  match hello or world
^            match beginning of text
$            match end of text
[a-zA-Z]     match any char in the set
[^a-zA-Z]    match any char not in the set
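The rules above can be checked quickly in an interpreter. This is a small sketch with made-up sample strings; it uses re.fullmatch so that each pattern has to cover the whole text, which is an assumption of the example rather than something the rules require.

```python
import re

# (pattern, text, expected) triples; fullmatch() requires the whole text to fit.
checks = [
    (r"colou?r", "color", True),        # x?     0 or 1 occurrences of u
    (r"colou?r", "colouur", False),
    (r"a+", "", False),                 # x+     needs at least one a
    (r"a*", "", True),                  # x*     zero occurrences is fine
    (r"a{2,3}", "aaa", True),           # x{m,n} between 2 and 3 a's
    (r"a{2,3}", "aaaa", False),
    (r"hello|world", "world", True),    # alternation
    (r"[a-zA-Z]+", "Regex", True),      # any char in the set
    (r"[^a-zA-Z]+", "1234", True),      # any char not in the set
]
results = [re.fullmatch(p, t) is not None for p, t, _ in checks]
expected = [e for _, _, e in checks]
print(results == expected)  # True

# ^ and $ anchor a search to the beginning and end of the text.
starts = re.search(r"^hello", "hello world") is not None   # True
ends = re.search(r"hello$", "hello world") is not None     # False
```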
For example, to add support for digits in the first part of the email expression, we change the first [a-z]+ to [a-z0-9]+.
Or, to accept only .com or .net addresses, we change the final [a-z]+ to (com|net).
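A sketch of both changes, assuming the simple email pattern from the first example as the starting point. The variable names, the matches helper, and the sample addresses are hypothetical; fullmatch is used here so that something like .community is not accepted as .com.

```python
import re

# Digits allowed in the first part of the address.
with_digits = r"[a-z0-9]+@[a-z]+\.[a-z]+"
# Only .com or .net accepted as the extension.
com_or_net = r"[a-z]+@[a-z]+\.(com|net)"

def matches(pattern, text):
    """Hypothetical helper: True if the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(with_digits, "joe42@example.com"))  # True
print(matches(com_or_net, "joe@example.net"))     # True
print(matches(com_or_net, "joe@example.org"))     # False
```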
Some examples:
Phone number:
phone = r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'
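To see what the phone pattern picks up, here is a small sketch that runs it over a made-up sentence. The sample text is an assumption for illustration; the pattern is the one above, written as a raw string.

```python
import re

phone = r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'

# Made-up sample containing the three shapes the pattern allows.
sample = "Call 555-123-4567 or (555) 123-4567, ext 123-4567."

# With a single capturing group, findall() returns the text of that group.
found = re.findall(phone, sample)
print(found)  # ['555-123-4567', '(555) 123-4567', '123-4567']
```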
Search and Match
Search a text using a regular expression. For example, to find a passage that starts with 'hello' or 'bye' and ends with 'day' or 'month':
text = 'welcome and hello all, have a good day'
m = re.search("(hello|bye).*(day|month)", text)
if m:
    print('Matched', m.groups())
    print('Start index', m.start())
    print('End index', m.end())
Matched ('hello', 'day')
Start index 12
End index 38
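The difference between the two functions in this section's title is worth a quick check: re.match only tries the pattern at the beginning of the text, while re.search scans for the first position where it fits. A minimal sketch using the same sample sentence (the variable names are made up):

```python
import re

text = 'welcome and hello all, have a good day'
pattern = r"(hello|bye).*(day|month)"

# match() fails here because the text does not begin with hello or bye.
at_start = re.match(pattern, text)
# search() finds the pattern at index 12.
anywhere = re.search(pattern, text)

print(at_start)            # None
print(anywhere.span())     # (12, 38)
print(anywhere.group(0))   # hello all, have a good day
```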
Regular expression substitution
Sometimes you need to find and replace one substring with another. With regular expressions you can also search for a pattern. For example, to find all the numbers in a text and replace each one with *:
text = 'string with456 some111 888 numbers'
txt = re.sub('[0-9]+', '*', text)
print(txt)
string with* some* * numbers
You can also use subn, which returns a tuple of the new string and the number of substitutions made:
text = 'string with456 some111 888 numbers'
txt = re.subn('[0-9]+', '*', text)
print(txt)
if txt[1]:
    print(txt[0])
('string with* some* * numbers', 3)
string with* some* * numbers
You can also pass a function as the second argument. It is called for every match and returns the replacement text, letting you decide what to do:
def fn(match):
    print(match.group())  # the text that was matched
    return '#'

text = 'string with456 some111 888 numbers'
txt = re.subn('[0-9]+', fn, text)
print(txt)
if txt[1]:
    print(txt[0])
456
111
888
('string with# some# # numbers', 3)
string with# some# # numbers
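The replacement function becomes more useful when the result depends on what was matched. As a sketch, the hypothetical function double below replaces every number with twice its value:

```python
import re

def double(match):
    # match.group() is the matched digits as a string.
    return str(int(match.group()) * 2)

text = 'string with456 some111 888 numbers'
doubled = re.sub(r'[0-9]+', double, text)
print(doubled)  # string with912 some222 1776 numbers
```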
Splitting a string
With the built-in str.split, you can split a string on only one fixed separator.
With regular expressions, you can split on a pattern or on several separators. For example:
text = 'string with456 some111 888 numbers'
txt = re.split('[0-9]+', text)
print(txt)
['string with', ' some', ' ', ' numbers']
Multiple separators:
text = 'str,in;g wi,th*456 so#me1;11 88$8 numbers'
txt = re.split('[,;*#$]', text)
print(txt)
['str', 'in', 'g wi', 'th', '456 so', 'me1', '11 88', '8 numbers']
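Two further options of re.split may be handy here: a capturing group keeps the separators in the result, and maxsplit limits the number of splits. A short sketch with a made-up input string:

```python
import re

text = 'a,b;c'

# str.split handles a single fixed separator only.
plain = text.split(',')                        # ['a', 'b;c']
# A character set splits on either separator.
both = re.split(r'[,;]', text)                 # ['a', 'b', 'c']
# A capturing group keeps the separators in the output.
kept = re.split(r'([,;])', text)               # ['a', ',', 'b', ';', 'c']
# maxsplit stops after the given number of splits.
first_only = re.split(r'[,;]', text, maxsplit=1)  # ['a', 'b;c']
```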
There are some shortcuts for common patterns like numbers, words, etc.
For example, to find one or more digits we can use the pattern [0-9]+ . The shortcut \d+ does the same:
text = 'string with456 some111 888 numbers'
txt = re.split(r'\d+', text)
print(txt)
# output:
# ['string with', ' some', ' ', ' numbers']
Other shortcuts:
\D – not digit:
text = 'string with456 some111 888 numbers'
txt = re.split(r'\D+', text)
print(txt)
# output:
# ['', '456', '111', '888', '']
\w – word character (letter, digit, or underscore)
\W – non-word character
\s – whitespace
\S – non-whitespace
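A short sketch of the remaining shortcuts on a made-up string, chosen so that punctuation and an underscore show where the classes differ:

```python
import re

text = 'hello, big_world 42!'

words = re.findall(r'\w+', text)       # runs of letters, digits, underscore
non_words = re.findall(r'\W+', text)   # everything between those runs
by_space = re.split(r'\s+', text)      # split on whitespace
non_space = re.findall(r'\S+', text)   # runs of non-whitespace

print(words)      # ['hello', 'big_world', '42']
print(non_words)  # [', ', ' ', '!']
print(by_space)   # ['hello,', 'big_world', '42!']
print(non_space)  # ['hello,', 'big_world', '42!']
```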
Find All
text = 'string with456 some111 888 numbers'
txt = re.findall('[0-9]+', text)
print(txt)
# output:
# ['456', '111', '888']
Using an iterator:
text = 'string with456 some111 888 numbers'
for m in re.finditer('[0-9]+', text):
    print(m)
<re.Match object; span=(11, 14), match='456'>
<re.Match object; span=(19, 22), match='111'>
<re.Match object; span=(23, 26), match='888'>
(Python 3.7 and later; older versions show _sre.SRE_Match instead of re.Match.)
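Printing the match object is mostly useful for debugging. To work with the values, each match exposes group() for the matched text and span() for its position. A minimal sketch on the same sample string (the name hits is made up):

```python
import re

text = 'string with456 some111 888 numbers'

# Collect the matched text and its (start, end) position for every match.
hits = [(m.group(), m.span()) for m in re.finditer(r'[0-9]+', text)]
print(hits)  # [('456', (11, 14)), ('111', (19, 22)), ('888', (23, 26))]
```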
Compiling a regular expression
If you use a regular expression in a loop, for example while reading lines from a file, it is better to compile it once for performance:
reobj = re.compile(r"[0-9]+")
for line in myfile:  # myfile: an already opened text file
    m = reobj.match(line)  # match() only looks at the start of each line
    if m:
        print(m.string[m.start():m.end()])
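A compiled pattern offers the same methods as the module-level functions, so the choice between match and search still applies. The sketch below uses an in-memory list of made-up lines in place of a file, which is an assumption to keep the example self-contained:

```python
import re

number = re.compile(r"[0-9]+")

# Stand-in for lines read from a file (hypothetical data).
lines = ["42 apples", "no digits here", "room 101"]

# match(): only numbers at the very start of a line.
leading = [m.group() for m in map(number.match, lines) if m]
# search(): the first number anywhere in the line.
anywhere = [m.group() for m in map(number.search, lines) if m]

print(leading)   # ['42']
print(anywhere)  # ['42', '101']
```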