Python – Regular Expressions Practical Guide

Regular expressions are widely used in Linux command-line tools such as sed, awk and grep. Most programming languages support them, either built in or through an external library.

The main difficulty with regular expressions is that they can be hard to read, but they are well worth the effort to learn. A single expression can often replace many lines of hand-written string handling.

Let's start with a simple example:

Validating Input string

import re

x = r"[a-z]+@[a-z]+\.[a-z]+"

s1='[email protected]'
s2='liran#devarea.com'

c = re.match(x, s1)
if c:
    print('ok')
else:
    print('no')

We declare a regular expression that matches an email address using a few very simple rules:

  • at least one letter – [a-z]+
  • followed by @
  • followed by at least one letter
  • followed by period
  • followed by at least one letter

This is a deliberately simple example: it doesn’t accept digits, doesn’t check for a known extension (com/net/org), and has other pitfalls. The point is that to add those rules we only need to change the regular expression.
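One of those pitfalls is worth seeing early: re.match only anchors the pattern at the start of the string, so trailing junk is still accepted. The sketch below compares it with re.fullmatch, which requires the whole string to match. The sample addresses are made up for illustration.

```python
import re

pattern = r"[a-z]+@[a-z]+\.[a-z]+"

# Hypothetical inputs, chosen only for illustration.
samples = ["user@example.com", "user@example.com!!!", "user#example.com"]

# re.match: the pattern must match at the start, the rest is ignored.
match_results = [bool(re.match(pattern, s)) for s in samples]
# re.fullmatch: the pattern must cover the entire string.
fullmatch_results = [bool(re.fullmatch(pattern, s)) for s in samples]

for s, m, f in zip(samples, match_results, fullmatch_results):
    print(s, "match:", m, "fullmatch:", f)
```

For input validation, re.fullmatch (or explicit ^ and $ anchors) is usually the safer choice.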

Some match rules:

x?		match 0 or 1 occurrences of x
x+		match 1 or more occurrences of x
x*		match 0 or more occurrences of x
x{m,n}		match between m and n x’s

hello		      match hello
hello|world	      match hello or world
^		      match beginning of text
$		      match end of text

[a-zA-Z]	      match any char in the set
[^a-zA-Z]	      match any char not in the set
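The rules above can be tried out directly. This is a small sketch with made-up patterns and inputs, using re.fullmatch so that each result says whether the whole text matches:

```python
import re

# Each entry: (pattern, text, whether the whole text should match).
checks = [
    (r"colou?r", "color", True),      # ? : the 'u' is optional
    (r"colou?r", "colour", True),
    (r"ab+", "a", False),             # + : needs at least one 'b'
    (r"ab*", "a", True),              # * : zero 'b's is fine
    (r"a{2,3}", "aaaa", False),       # {m,n} : 2 to 3 'a's, not 4
    (r"hello|world", "world", True),  # | : either alternative
    (r"[^a-zA-Z]", "7", True),        # negated set: any non-letter
]

results = [bool(re.fullmatch(p, t)) for p, t, _ in checks]
for (p, t, _), got in zip(checks, results):
    print(p, repr(t), got)

# ^ and $ anchor a search to the beginning or end of the text.
starts_with_world = bool(re.search(r"^world", "hello world"))
ends_with_world = bool(re.search(r"world$", "hello world"))
print(starts_with_world, ends_with_world)
```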

For example, to allow digits in the first part of the email expression, we change it to:

x = r"[a-z0-9]+@[a-z]+\.[a-z]+"

Or, to accept only .com or .net addresses:

x = r"[a-z0-9]+@[a-z]+\.(net|com)"
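To check that the extended expression behaves as described, here is a minimal sketch. The addresses are invented for illustration, and re.fullmatch is used so that nothing may follow the extension:

```python
import re

x = r"[a-z0-9]+@[a-z]+\.(net|com)"

# Made-up addresses, for illustration only.
samples = ["user42@example.com", "user@example.net", "user@example.org"]

accepted = {}
for s in samples:
    m = re.fullmatch(x, s)
    # group(1) holds whatever the (net|com) group captured.
    accepted[s] = m.group(1) if m else None

print(accepted)
```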

Some examples:

Email:

email = r"^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*$"

URL:

url = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

Phone number:

phone = r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'
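As a rough illustration of how such a pattern is used, the sketch below runs the phone expression over a made-up sentence with re.findall. The sample text and numbers are assumptions, not real data.

```python
import re

phone = r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'

# Invented sample text with three differently formatted numbers.
text = "Call 555-123-4567 or (555) 987-6543, or just 555.0199 locally."

# The pattern has a single capturing group, so findall returns plain strings.
found = re.findall(phone, text)
print(found)
```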

 

Search and Match

re.search looks for the pattern anywhere in the text, not only at its beginning. For example, to find a stretch of text that starts with ‘hello’ or ‘bye’ and ends with ‘day’ or ‘month’:

s = 'welcome and hello all, have a good day'

m = re.search(r"(hello|bye).*(day|month)", s)
if m:
    print('Matched', m.groups())
    print('Start index', m.start())
    print('End index', m.end())

Output:

Matched ('hello', 'day')
Start index 12
End index 38
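The difference between searching and matching is easy to miss, so here is a short sketch on the same text: re.match only tries the pattern at the very start of the string, while re.search scans forward until it finds a match.

```python
import re

text = 'welcome and hello all, have a good day'
pattern = r"(hello|bye).*(day|month)"

at_start = re.match(pattern, text)    # None: the text starts with 'welcome'
anywhere = re.search(pattern, text)   # finds the span that begins at 'hello'

print(at_start)
print(anywhere.group(0))                      # the whole matched span
print(anywhere.group(1), anywhere.group(2))   # the two captured groups
```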

Regular expression substitution

Sometimes you need to find one substring and replace it with another. With regular expressions you can search for a pattern instead of a fixed string. For example, to find all the numbers in the text and replace each of them with * :

s = 'string with456 some111 888 numbers'
txt = re.sub('[0-9]+', '*', s)
print(txt)

Output:

string with* some* * numbers

You can use subn, which returns a tuple of the new string and the number of substitutions made:

s = 'string with456 some111 888 numbers'
txt = re.subn('[0-9]+', '*', s)
print(txt)
if txt[1]:
    print(txt[0])

Output:

('string with* some* * numbers', 3)
string with* some* * numbers

You can also supply a function as the second parameter. The function is called once for every match and returns the replacement text, which lets you decide what to do with each one:

def fn(match):
    print(match.group(0))
    return '#'

s = 'string with456 some111 888 numbers'
txt = re.subn('[0-9]+', fn, s)
print(txt)
if txt[1]:
    print(txt[0])

Output:

456
111
888
('string with# some# # numbers', 3)
string with# some# # numbers
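A replacement function becomes more useful when it computes something from the match. The following sketch doubles every number in the text; the helper name double is made up for this example. It also shows the count argument of re.sub, which limits how many matches are replaced.

```python
import re

def double(match):
    # match.group(0) is the matched run of digits, as a string.
    return str(int(match.group(0)) * 2)

s = 'string with456 some111 888 numbers'

doubled = re.sub(r'[0-9]+', double, s)
first_only = re.sub(r'[0-9]+', '*', s, count=1)  # replace the first match only

print(doubled)
print(first_only)
```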

 

Splitting a string

The string method str.split can only split on one fixed separator.

With regular expressions, you can split on a pattern, or on any of several separators. For example:

s = 'string with456 some111 888 numbers'
txt = re.split('[0-9]+', s)
print(txt)

Output:

['string with', ' some', ' ', ' numbers']

Multiple separators:

s = 'str,in;g wi,th*456 so#me1;11 88$8 numbers'
txt = re.split('[,;*#$]', s)
print(txt)

Output:

['str', 'in', 'g wi', 'th', '456 so', 'me1', '11 88', '8 numbers']
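Two further options of re.split are sketched here on a short made-up string: wrapping the pattern in a capturing group keeps the separators in the result, and maxsplit limits the number of splits.

```python
import re

s = 'a,b;c*d'

plain = re.split(r'[,;*]', s)                # separators are dropped
with_seps = re.split(r'([,;*])', s)          # capturing group keeps them
limited = re.split(r'[,;*]', s, maxsplit=1)  # split at the first separator only

print(plain)
print(with_seps)
print(limited)
```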

 

Shortcuts:

There are some shortcuts for common patterns like numbers, words, etc.

For example, to find one or more digits we can use the pattern [0-9]+ . The shortcut ‘\d+’ does the same thing:

s = 'string with456 some111 888 numbers'
txt = re.split(r'\d+', s)
print(txt)
# output:
# ['string with', ' some', ' ', ' numbers']

Other shortcuts:

\D – not a digit:

s = 'string with456 some111 888 numbers'
txt = re.split(r'\D+', s)
print(txt)
# output:
# ['', '456', '111', '888', '']

\w – word character (letter, digit or underscore)

\W – not a word character

\s – whitespace

\S – not whitespace
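The remaining shortcuts can be seen side by side in this small sketch. The sample string is made up for illustration.

```python
import re

s = 'id_7 = 42 ok'

words = re.findall(r'\w+', s)      # runs of letters, digits and underscores
non_words = re.findall(r'\W+', s)  # everything between those runs
fields = re.split(r'\s+', s)       # split on runs of whitespace
non_space = re.findall(r'\S+', s)  # runs of non-whitespace characters

print(words)
print(non_words)
print(fields)
print(non_space)
```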

Find All

s = 'string with456 some111 888 numbers'
txt = re.findall('[0-9]+', s)
print(txt)
# output:
# ['456', '111', '888']

Using an iterator, which yields a match object for every match:

s = 'string with456 some111 888 numbers'
for m in re.finditer('[0-9]+', s):
    print(m)

Output (Python 3.7 and later):

<re.Match object; span=(11, 14), match='456'>
<re.Match object; span=(19, 22), match='111'>
<re.Match object; span=(23, 26), match='888'>
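Printing the match object is handy for debugging, but in real code you usually want the pieces. This sketch pulls the matched text and its position out of each match with group() and span():

```python
import re

s = 'string with456 some111 888 numbers'

# Collect (matched text, (start, end)) for every run of digits.
hits = [(m.group(0), m.span()) for m in re.finditer(r'[0-9]+', s)]

for text, (start, end) in hits:
    print(text, start, end)
```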

 

Compiling a regular expression

If you use the same regular expression repeatedly, for example in a loop while reading lines from a file, it is better to compile it once up front and reuse the compiled object:

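reobj = re.compile(r"[0-9]+")
# myfile is assumed to be an open file (or any iterable of lines)
for line in myfile:
    # match() only succeeds if the line starts with digits
    m = reobj.match(line)
    if m:
        print(m.group())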
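A self-contained variant of the loop above, as a sketch: the list of lines stands in for a file and its contents are made up. It also shows that a compiled pattern offers the same match and search methods as the re module.

```python
import re

reobj = re.compile(r"[0-9]+")

# Stand-in for lines read from a file; the contents are invented.
lines = ["42 apples", "pears: 7", "no digits here"]

# match(): digits must appear at the start of the line.
at_start = [m.group() for m in map(reobj.match, lines) if m]
# search(): digits may appear anywhere in the line.
anywhere = [m.group() for m in map(reobj.search, lines) if m]

print(at_start)
print(anywhere)
```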


4 thoughts on “Python – Regular Expressions Practical Guide

  1. Not off to a great start – first example should be x=”[a-z]+@[a-z]+\.[a-z]+”

    1. You are right, thanks
      fixed

  2. using iterator print(m) is missing info on print, also needs m.span() and m,m.span(),m.group(0) to get “span=” and “match=”

  3. […] For complete guide look here […]