Regular expressions are commonly used in Linux command-line tools such as sed, awk and grep. Most programming languages support them, either built in or through an external library.
The main drawback is that they are difficult to read, but they are well worth the effort to learn. A well-chosen regular expression can save you a lot of time.
Let's start with a simple example:
Validating an input string
import re

x = r"[a-z]+@[a-z]+\.[a-z]+"
s1 = 'liran@devarea.com'
s2 = 'liran#devarea.com'
c = re.match(x, s1)
if c:
    print('ok')
else:
    print('no')
We declare a regular expression that matches an email address using a few very simple rules:
- at least one letter – [a-z]+
- followed by @
- followed by at least one letter
- followed by period
- followed by at least one letter
This is a very simple example: it doesn't accept digits, doesn't check for known extensions (com/net/org), and has other pitfalls. The point is that if we want to add those rules, we only need to change the regular expression.
Some match rules:
x?           match 0 or 1 occurrences of x
x+           match 1 or more occurrences of x
x*           match 0 or more occurrences of x
x{m,n}       match between m and n x's
hello        match hello
hello|world  match hello or world
^            match beginning of text
$            match end of text
[a-zA-Z]     match any char in the set
[^a-zA-Z]    match any char not in the set
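To see these rules in action, here is a small sketch that checks a few patterns against sample strings. The patterns and strings are made up for illustration, and re.fullmatch is used so that each pattern has to cover the whole string:

```python
import re

# Each entry: (pattern, text, whether the whole text should match)
cases = [
    (r"colou?r", "color", True),       # ?     : zero or one 'u'
    (r"colou?r", "colour", True),
    (r"ab+", "a", False),              # +     : needs at least one 'b'
    (r"ab*", "a", True),               # *     : zero 'b's is fine
    (r"a{2,3}", "aaaa", False),        # {m,n} : only 2 to 3 'a's allowed
    (r"hello|world", "world", True),   # |     : either alternative
    (r"[a-zA-Z]+", "Regex", True),     # set of allowed characters
    (r"[^a-zA-Z]+", "1234", True),     # negated set
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in cases]

for (pattern, text, expected), got in zip(cases, results):
    print(f"{pattern!r:16} {text!r:10} expected={expected} got={got}")
```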
For example, if we want to add support for digits in the first part of the email expression, we change it to:
x = r"[a-z0-9]+@[a-z]+\.[a-z]+"
Or, if we want to accept only .com or .net addresses, we change it to:
x = r"[a-z0-9]+@[a-z]+\.(net|com)"
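One thing to watch for when validating: re.match only anchors the pattern at the start of the string, so trailing characters are silently accepted. The sketch below, using a made-up address, compares it with re.fullmatch, which requires the whole string to match:

```python
import re

pattern = r"[a-z0-9]+@[a-z]+\.(net|com)"

# re.match stops caring once the pattern is satisfied, so the
# extra 'munity' after '.com' goes unnoticed.
loose = re.match(pattern, "liran@devarea.community")

# re.fullmatch succeeds only if the pattern consumes the entire string.
strict = re.fullmatch(pattern, "liran@devarea.community")
valid = re.fullmatch(pattern, "liran@devarea.com")

print(bool(loose), bool(strict), bool(valid))
```

Adding ^ and $ to the pattern, as the longer email example does, is another way to get the same strictness.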
Some examples:
Email:
email = r"^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*$"
URL:
url = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
Phone number:
phone = r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'
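As a quick check, the three patterns above can be tried against some sample text. The addresses, URLs and phone number below are invented for illustration only:

```python
import re

email = r"^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*$"
url = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
phone = r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'

# The email pattern is anchored with ^ and $, so re.match checks the whole string.
good_email = re.match(email, 'first.last+tag@mail.example.org')
bad_email = re.match(email, 'not-an-email')

# The URL pattern has only non-capturing groups, so findall returns full matches.
urls = re.findall(url, 'see https://example.com/a?b=1 and http://test.org')

# search finds the first phone-like substring anywhere in the text.
phone_match = re.search(phone, 'call 555-123-4567 today')

print(bool(good_email), bool(bad_email))
print(urls)
print(phone_match.group(0))
```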
Search and Match
You can search a text for the first place a regular expression matches. For example, suppose we want to find a phrase that starts with 'hello' or 'bye' and ends with 'day' or 'month':
s = 'welcome and hello all, have a good day'
m = re.search(r"(hello|bye).*(day|month)", s)
if m:
    print('Matched', m.groups())
    print('Start index', m.start())
    print('End index', m.end())
Output:
Matched ('hello', 'day')
Start index 12
End index 38
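The difference between the two functions in this section's title is worth a short sketch: re.match only tries the pattern at the beginning of the string, while re.search scans forward for the first position where it matches. Using the same sample sentence:

```python
import re

text = 'welcome and hello all, have a good day'
pattern = r"(hello|bye).*(day|month)"

# 'hello' does not appear at index 0, so match() gives up immediately.
at_start = re.match(pattern, text)

# search() keeps moving forward until the pattern matches somewhere.
anywhere = re.search(pattern, text)

print(at_start)
print(anywhere.group(0))
```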
Regular expression substitution
Sometimes you need to find one substring and replace it with another. With regular expressions, you can search for a pattern instead of a fixed string. For example, suppose we want to find all the numbers in the text and replace each one with *:
s = 'string with456 some111 888 numbers'
txt = re.sub('[0-9]+', '*', s)
print(txt)
Output:
string with* some* * numbers
You can also use subn, which returns a tuple of the new string and the number of substitutions made:
s = 'string with456 some111 888 numbers'
txt = re.subn('[0-9]+', '*', s)
print(txt)
if txt[1]:
    print(txt[0])
Output:
('string with* some* * numbers', 3)
string with* some* * numbers
You can also supply a function as the second parameter. The function is called for every match and returns the replacement text, so you can decide what to do with each one:
def fn(match):
    print(match.group(0))
    return '#'

s = 'string with456 some111 888 numbers'
txt = re.subn('[0-9]+', fn, s)
print(txt)
if txt[1]:
    print(txt[0])
Output:
456
111
888
('string with# some# # numbers', 3)
string with# some# # numbers
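A replacement function becomes more useful when the replacement depends on the matched text. As a minimal sketch (the function name double is just an illustration), this doubles every number found in the string:

```python
import re

def double(match):
    # match.group(0) holds the matched digits as a string;
    # the function must return a string as well.
    return str(int(match.group(0)) * 2)

text = 'string with456 some111 888 numbers'
doubled = re.sub(r'[0-9]+', double, text)
print(doubled)
```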
Splitting a string
The split method of the string class can only split on one fixed separator.
With regular expressions, you can split on a pattern or on several different separators. For example:
s = 'string with456 some111 888 numbers'
txt = re.split('[0-9]+', s)
print(txt)
Output:
['string with', ' some', ' ', ' numbers']
Multiple separators:
s = 'str,in;g wi,th*456 so#me1;11 88$8 numbers'
txt = re.split('[,;*#$]', s)
print(txt)
Output:
['str', 'in', 'g wi', 'th', '456 so', 'me1', '11 88', '8 numbers']
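Two further options of re.split may be handy here, sketched below on a made-up string: maxsplit limits how many splits are performed, and wrapping the separator pattern in a capturing group keeps the separators in the result.

```python
import re

text = 'a,b;c,d'

# str.split handles only one fixed separator at a time.
plain = text.split(',')

# maxsplit=1 stops after the first separator is found.
first_only = re.split(r'[,;]', text, maxsplit=1)

# A capturing group around the pattern returns the separators too.
with_seps = re.split(r'([,;])', text)

print(plain)
print(first_only)
print(with_seps)
```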
Shortcuts:
There are some shortcuts for common patterns like numbers, words, etc.
For example, to find one or more digits we can use the pattern [0-9]+, or the shortcut \d+. Write the pattern as a raw string (r'...') so that Python does not try to interpret the backslash itself:
s = 'string with456 some111 888 numbers'
txt = re.split(r'\d+', s)
print(txt)
# output:
# ['string with', ' some', ' ', ' numbers']
Other shortcuts:
\D – not a digit:
s = 'string with456 some111 888 numbers'
txt = re.split(r'\D+', s)
print(txt)
# output:
# ['', '456', '111', '888', '']
\w – word character (letter, digit or underscore)
\W – not a word character
\s – whitespace
\S – not whitespace
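The remaining shortcuts can be tried the same way. This is a small sketch on an invented log-like string that contains an underscore, a tab and a colon, so the differences between the classes show up:

```python
import re

text = 'user_1 logged in\tat 10:45'

words = re.findall(r'\w+', text)       # runs of letters, digits, underscore
non_words = re.findall(r'\W+', text)   # runs of everything else
fields = re.split(r'\s+', text)        # split on any run of whitespace
chunks = re.findall(r'\S+', text)      # runs of non-whitespace

print(words)
print(non_words)
print(fields)
print(chunks)
```

Note that ':' is neither a word character nor whitespace, so \w+ breaks '10:45' apart while \S+ keeps it together.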
Find All
s = 'string with456 some111 888 numbers'
txt = re.findall('[0-9]+', s)
print(txt)
# output:
# ['456', '111', '888']
Using iterator:
s = 'string with456 some111 888 numbers'
for m in re.finditer('[0-9]+', s):
    print(m)
Output (Python 3.7 and later; older versions print _sre.SRE_Match instead of re.Match):
<re.Match object; span=(11, 14), match='456'>
<re.Match object; span=(19, 22), match='111'>
<re.Match object; span=(23, 26), match='888'>
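Printing a match object directly only shows its summary. To work with the values, you would normally read them from the match object's methods, as in this short sketch:

```python
import re

text = 'string with456 some111 888 numbers'

# group(0) is the matched text; span() is the (start, end) index pair.
found = [(m.group(0), m.span()) for m in re.finditer(r'[0-9]+', text)]

for value, (start, end) in found:
    print(value, start, end)
```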
Compiling a regular expression
If you use the same regular expression inside a loop, for example while reading lines from a file, it is better to compile it once for performance:
reobj = re.compile(r"[0-9]+")
for line in myfile:
    # match() only succeeds if the line starts with digits
    m = reobj.match(line)
    if m:
        print(m.group(0))
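A compiled pattern offers the same methods as the re module functions (match, search, findall, sub and so on). The sketch below uses a list of made-up lines as a stand-in for a file and shows how match and search differ on a compiled pattern:

```python
import re

number = re.compile(r"[0-9]+")

# Stand-in for lines read from a file (hypothetical sample data).
lines = ["42 apples", "no digits here", "room 101"]

results = []
for line in lines:
    at_start = number.match(line)    # digits must be at the start of the line
    anywhere = number.search(line)   # digits may be anywhere in the line
    results.append((
        at_start.group(0) if at_start else None,
        anywhere.group(0) if anywhere else None,
    ))

print(results)
```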