Regular expressions are commonly used in Linux command-line tools such as sed, awk and grep. Most programming languages support them, either built in or through an external library.
The main problem with using them is that they are difficult to read, but they are well worth the effort to learn. Using a regular expression can save you a lot of time.
Let’s start with a simple example:
Validating an input string
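import re
# raw string, so the backslash reaches the regex engine unchanged
x = r"[a-z]+@[a-z]+\.[a-z]+"
s1 = 'liran@devarea.com'
s2 = 'liran#devarea.com'
c = re.match(x, s1)  # re.match(x, s2) would return None and print 'no'
if c:
    print('ok')
else:
    print('no')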
We declare a regular expression that matches an email address using very simple rules:
- at least one letter – [a-z]+
- followed by @
- followed by at least one letter
- followed by period
- followed by at least one letter
This is a very simple example: it doesn’t accept digits, doesn’t check for known extensions (com/net/org), and has other pitfalls. The point is that to add such rules we only need to change the regular expression.
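One of those pitfalls is easy to see in code: re.match only anchors the pattern at the start of the string, so trailing characters are accepted. The sketch below assumes you want the whole string validated and uses re.fullmatch for that; the helper name is_simple_email is hypothetical and only for illustration.

```python
import re

# Same simple pattern as above, written as a raw string.
SIMPLE_EMAIL = r"[a-z]+@[a-z]+\.[a-z]+"

def is_simple_email(s):
    # Hypothetical helper: fullmatch() requires the whole string to match.
    return re.fullmatch(SIMPLE_EMAIL, s) is not None

# match() only anchors at the start, so the trailing '!!!' slips through.
print(bool(re.match(SIMPLE_EMAIL, "liran@devarea.com!!!")))  # True
print(is_simple_email("liran@devarea.com!!!"))               # False
print(is_simple_email("liran@devarea.com"))                  # True
```

Adding ^ and $ to the pattern is another way to get a similar effect.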
Some match rules:
x? match 0 or 1 occurrences of x
x+ match 1 or more occurrences of x
x* match 0 or more occurrences of x
x{m,n} match between m and n x’s
hello match hello
hello|world match hello or world
^ match beginning of text
$ match end of text
[a-zA-Z] match any char in the set
[^a-zA-Z] match any char not in the set
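To make these rules concrete, here is a small sketch that checks each one against a sample string. The sample patterns and strings are made up for illustration, and re.fullmatch is used so that the whole string has to match.

```python
import re

# (pattern, string, should the whole string match?)
checks = [
    (r"colou?r", "color", True),      # x?     the 'u' may be absent
    (r"ab+", "a", False),             # x+     needs at least one 'b'
    (r"ab*", "a", True),              # x*     zero 'b's is fine
    (r"[0-9]{2,3}", "1234", False),   # x{m,n} two or three digits, not four
    (r"hello|world", "world", True),  # alternation
    (r"[^a-zA-Z]", "7", True),        # any char not in the set
]
results = [re.fullmatch(p, s) is not None for p, s, _ in checks]
print(results)  # [True, False, True, False, True, True]

# ^ and $ anchor a search to the beginning and end of the text.
print(bool(re.search(r"^hello", "say hello")))  # False
print(bool(re.search(r"hello$", "say hello")))  # True
```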
To add support for digits in the first part of the email expression, we change it to:
x = r"[a-z0-9]+@[a-z]+\.[a-z]+"
Or, to allow only .com or .net addresses:
x = r"[a-z0-9]+@[a-z]+\.(net|com)"
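A quick way to confirm the restriction is to run the pattern over a few addresses. The addresses below are made up for illustration, and re.fullmatch is an assumption on my part so that nothing may follow the extension.

```python
import re

x = r"[a-z0-9]+@[a-z]+\.(net|com)"
addresses = ["liran1@devarea.com", "liran1@devarea.net", "liran1@devarea.org"]

# fullmatch() is used so nothing may follow the extension.
accepted = [a for a in addresses if re.fullmatch(x, a)]
print(accepted)  # ['liran1@devarea.com', 'liran1@devarea.net']
```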
Some examples:
Email:
email = r"^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]+$"
URL:
url = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
Phone number:
phone = r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'
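As a rough check of the phone pattern, the sketch below runs it over a made-up sentence with re.findall. Because the pattern is wrapped in a single group, findall returns the full text of each match. This is only a sample run, not proof that the pattern covers every phone format.

```python
import re

phone = r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'

s = 'Call 555-123-4567 or (555) 987-6543.'  # made-up sample text
found = re.findall(phone, s)
print(found)
# ['555-123-4567', '(555) 987-6543']
```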
Search and Match
re.search scans the text for the first place where the regular expression matches. For example, to find text that starts with ‘hello’ or ‘bye’ and ends with ‘day’ or ‘month’:
s = 'welcome and hello all, have a good day'
m = re.search(r"(hello|bye).*(day|month)", s)
if m:
    print('Matched', m.groups())
    print('Start index', m.start())
    print('End index', m.end())
Output:
Matched ('hello', 'day')
Start index 12
End index 38
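The difference between search and match is worth a quick sketch: re.match only succeeds if the pattern matches at the very beginning of the string, while re.search looks for it anywhere. The example reuses the same text and pattern.

```python
import re

s = 'welcome and hello all, have a good day'
pattern = r"(hello|bye).*(day|month)"

at_start = re.match(pattern, s)   # the string starts with 'welcome'
anywhere = re.search(pattern, s)  # finds 'hello ... day' further in

print(at_start)         # None
print(anywhere.span())  # (12, 38)
```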
Regular expression substitution
Sometimes you need to find and replace one substring with another. With regular expressions you can also search for a pattern. For example, to find all the numbers in the text and replace each one with *:
s = 'string with456 some111 888 numbers'
txt = re.sub('[0-9]+', '*', s)
print(txt)
Output:
string with* some* * numbers
You can also use subn, which returns a tuple of the new string and the number of substitutions made:
s = 'string with456 some111 888 numbers'
txt = re.subn('[0-9]+', '*', s)
print(txt)
if txt[1]:
    print(txt[0])
Output:
('string with* some* * numbers', 3)
string with* some* * numbers
You can also pass a function as the second parameter. It is called for each match, and its return value is used as the replacement:
def fn(match):
    print(match.group(0))
    return '#'

s = 'string with456 some111 888 numbers'
txt = re.subn('[0-9]+', fn, s)
print(txt)
if txt[1]:
    print(txt[0])
Output:
456
111
888
('string with# some# # numbers', 3)
string with# some# # numbers
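The callback becomes more useful when the replacement depends on what was matched. As a small sketch (the function name double is hypothetical), this version replaces each number with twice its value:

```python
import re

def double(match):
    # Hypothetical callback: replace each number with twice its value.
    return str(int(match.group(0)) * 2)

s = 'string with456 some111 888 numbers'
doubled = re.sub(r'[0-9]+', double, s)
print(doubled)
# string with912 some222 1776 numbers
```

Note that the variable holding the text must not be called str here, otherwise the built-in str() used inside the callback would be shadowed.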
Splitting a string
With the str.split method you can split a string into substrings on one fixed separator only.
With regular expressions you can split on a pattern and on multiple separators. For example:
s = 'string with456 some111 888 numbers'
txt = re.split('[0-9]+', s)
print(txt)
Output:
['string with', ' some', ' ', ' numbers']
Multiple separators:
s = 'str,in;g wi,th*456 so#me1;11 88$8 numbers'
txt = re.split('[,;*#$]', s)
print(txt)
Output:
['str', 'in', 'g wi', 'th', '456 so', 'me1', '11 88', '8 numbers']
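Two variations on splitting that often come in handy, sketched with made-up input: adding + to the set treats a run of separators as a single one, and the maxsplit argument of re.split limits how many splits are made.

```python
import re

# A run of commas, semicolons or whitespace counts as one separator.
parts = re.split(r'[,;\s]+', 'a, b;c  d')
print(parts)  # ['a', 'b', 'c', 'd']

# maxsplit=1 splits only at the first separator.
head_tail = re.split(r'[,;]', 'a,b;c', maxsplit=1)
print(head_tail)  # ['a', 'b;c']
```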
Shortcuts:
There are some shortcuts for common patterns like numbers, words, etc.
For example, to find one or more digits we can use the pattern [0-9]+. The shortcut ‘\d+’ does the same thing for ASCII text:
s = 'string with456 some111 888 numbers'
txt = re.split(r'\d+', s)
print(txt)
# output:
# ['string with', ' some', ' ', ' numbers']
Other shortcuts:
\D – not digit:
s = 'string with456 some111 888 numbers'
txt = re.split(r'\D+', s)
print(txt)
# output:
# ['', '456', '111', '888', '']
\w – word character (letter, digit or underscore)
\W – not a word character
\s – white space
\S – not white space
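These four shortcuts have no example above, so here is a short sketch on a made-up string showing what each one picks up:

```python
import re

s = 'Hello, big_world 42!'

words = re.findall(r'\w+', s)      # runs of letters, digits, underscore
non_words = re.findall(r'\W+', s)  # everything between those runs
by_space = re.split(r'\s+', s)     # split on white space
non_space = re.findall(r'\S+', s)  # runs of non-white-space characters

print(words)      # ['Hello', 'big_world', '42']
print(non_words)  # [', ', ' ', '!']
print(by_space)   # ['Hello,', 'big_world', '42!']
print(non_space)  # ['Hello,', 'big_world', '42!']
```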
Find All
findall returns all the non-overlapping matches as a list of strings:
s = 'string with456 some111 888 numbers'
txt = re.findall('[0-9]+', s)
print(txt)
# output:
# ['456', '111', '888']
Using an iterator (finditer yields a match object for each match):
s = 'string with456 some111 888 numbers'
for m in re.finditer('[0-9]+', s):
    print(m)
Output (Python 3.7 and later; older versions show the class as _sre.SRE_Match):
<re.Match object; span=(11, 14), match='456'>
<re.Match object; span=(19, 22), match='111'>
<re.Match object; span=(23, 26), match='888'>
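Printing the match object itself depends on the Python version, so in practice you will usually pull out the pieces you need. A minimal sketch using group() and span() on the same text:

```python
import re

s = 'string with456 some111 888 numbers'

# Collect the matched text and its (start, end) position for each match.
found = [(m.group(0), m.span()) for m in re.finditer(r'[0-9]+', s)]
for text, span in found:
    print(text, span)
# 456 (11, 14)
# 111 (19, 22)
# 888 (23, 26)
```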
Compiling a regular expression
If you use the same regular expression in a loop, for example while reading lines from a file, it is better to compile it once for performance:
reobj = re.compile(r"[0-9]+")
for line in myfile:  # myfile: an already opened text file
    m = reobj.match(line)  # match() only looks at the start of each line
    if m:
        print(m.group(0))
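A compiled pattern object has the same methods as the re module (match, search, findall and so on). The sketch below uses a small list of made-up lines as a stand-in for a file, and shows that match() on the compiled object still only matches at the start of each line, while search() looks anywhere in it.

```python
import re

reobj = re.compile(r"[0-9]+")
lines = ["42 apples", "no digits here", "room 101"]  # stand-in for a file

# match(): digits must be at the start of the line.
at_start = [m.group(0) for m in map(reobj.match, lines) if m]
# search(): digits may appear anywhere in the line.
anywhere = [m.group(0) for m in map(reobj.search, lines) if m]

print(at_start)  # ['42']
print(anywhere)  # ['42', '101']
```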