Regular expressions are commonly used in Linux command line tools such as sed, awk and grep. Most programming languages support them, either built in or through an external library.
The main problem with regular expressions is that they are difficult to read, but they are well worth the effort to learn. Using a regular expression can save you a lot of time.
Let's start with a simple example:
Validating Input string

    import re

    x = r"[a-z]+@[a-z]+\.[a-z]+"
    s1 = 'liran@devarea.com'
    s2 = 'liran#devarea.com'

    c = re.match(x, s1)
    if c:
        print('ok')
    else:
        print('no')

We declare a regular expression that matches an email address using a few very simple rules:
- at least one letter – [a-z]+
- followed by @
- followed by at least one letter
- followed by period
- followed by at least one letter
This is a very simple example: it doesn’t accept digits, doesn’t check for known extensions (com/net/org), and has other pitfalls. The point is that if we want to add those rules, we only need to change the regular expression.
Some match rules:

    x?            match 0 or 1 occurrences of x
    x+            match 1 or more occurrences of x
    x*            match 0 or more occurrences of x
    x{m,n}        match between m and n x’s
    hello         match hello
    hello|world   match hello or world
    ^             match beginning of text
    $             match end of text
    [a-zA-Z]      match any char in the set
    [^a-zA-Z]     match any char not in the set

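To see how these rules behave, here is a minimal sketch that checks a few of them with re.fullmatch, which requires the whole string to fit the pattern. The sample patterns and strings are made up for illustration and are not part of the original examples.

```python
import re

# Each entry: (pattern, text, whether the whole text should fit the pattern).
# All sample values here are illustrative assumptions.
cases = [
    (r"colou?r", "color", True),       # x?     : 0 or 1 'u'
    (r"colou?r", "colouur", False),    # two 'u' is too many
    (r"a+", "", False),                # x+     : needs at least one 'a'
    (r"a*", "", True),                 # x*     : zero 'a' is fine
    (r"a{2,3}", "aaaa", False),        # x{m,n} : four 'a' is out of range
    (r"hello|world", "world", True),   # alternation
    (r"[a-zA-Z]+", "Regex", True),     # chars in the set
    (r"[^a-zA-Z]+", "1234", True),     # chars not in the set
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in cases]

for (pattern, text, _), got in zip(cases, results):
    print(f"{pattern!r:16} {text!r:10} -> {got}")
```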
For example, if we want to allow digits in the first part of the email expression, we change it to:

    x = r"[a-z0-9]+@[a-z]+\.[a-z]+"

Or, if we want to accept only .com or .net addresses, we change it to:

    x = r"[a-z0-9]+@[a-z]+\.(net|com)"

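One thing to watch for: re.match only anchors the pattern at the start of the string, so trailing characters are ignored. The sketch below uses re.fullmatch to require the whole address to fit. The helper name is_valid and the sample addresses are assumptions made for this illustration.

```python
import re

pattern = r"[a-z0-9]+@[a-z]+\.(net|com)"

def is_valid(address):
    """Hypothetical helper: True only if the whole address fits the pattern."""
    return re.fullmatch(pattern, address) is not None

print(is_valid("liran2@devarea.com"))   # True
print(is_valid("liran2@devarea.org"))   # False

# re.match accepts this one because it stops after ".com" ...
prefix_only = bool(re.match(pattern, "liran@devarea.community"))
# ... while fullmatch rejects it because "munity" is left over.
whole_string = is_valid("liran@devarea.community")
print(prefix_only, whole_string)        # True False
```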
Some examples:
Email:

    email = r"^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*$"

URL:

    url = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

Phone number:

    phone = r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'

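As a rough usage sketch, the phone pattern above can be applied to free text with re.finditer. The sample sentence and its phone numbers are invented for this example.

```python
import re

phone = r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'

# Made-up sample text containing two differently formatted numbers
text = 'Call 555-123-4567 or (555) 123-4567.'

# group(0) is the full text of each match
found = [m.group(0) for m in re.finditer(phone, text)]
print(found)
```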
Search and Match
re.search looks for the pattern anywhere in the text. For example, suppose we want to find a phrase that starts with ‘hello’ or ‘bye’ and ends with ‘day’ or ‘month’:

    text = 'welcome and hello all, have a good day'
    m = re.search(r"(hello|bye).*(day|month)", text)
    if m:
        print('Matched', m.groups())
        print('Start index', m.start())
        print('End index', m.end())

Output:

    Matched ('hello', 'day')
    Start index 12
    End index 38

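The difference between re.match and re.search is easy to miss, so here is a small sketch using the same text: match only tries the pattern at the beginning of the string, while search scans forward until it finds a match.

```python
import re

text = 'welcome and hello all, have a good day'
pattern = r"(hello|bye).*(day|month)"

# match() fails: the text does not start with 'hello' or 'bye'
at_start = re.match(pattern, text)
print(at_start)

# search() finds the first place where the pattern fits
m = re.search(pattern, text)
print(m.group(0))                        # the full matched text
print(m.group(1), m.group(2), m.span())  # the two groups and the (start, end) pair
```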
Regular expression substitution
Sometimes you need to find one substring and replace it with another. With regular expressions, you can also search for a pattern. For example, suppose we want to find every number in the text and replace it with * :

    text = 'string with456 some111 888 numbers'
    txt = re.sub('[0-9]+', '*', text)
    print(txt)

Output:

    string with* some* * numbers

You can also use subn, which returns a tuple containing the new string and the number of substitutions made:

    text = 'string with456 some111 888 numbers'
    txt = re.subn('[0-9]+', '*', text)
    print(txt)
    if txt[1]:
        print(txt[0])

Output:

    ('string with* some* * numbers', 3)
    string with* some* * numbers

You can also pass a function as the second parameter. The function is invoked for every match, receives the match object, and returns the replacement text, so you can decide what to do in each case:

    def fn(match):
        print(match.group(0))
        return '#'

    text = 'string with456 some111 888 numbers'
    txt = re.subn('[0-9]+', fn, text)
    print(txt)
    if txt[1]:
        print(txt[0])

Output:

    456
    111
    888
    ('string with# some# # numbers', 3)
    string with# some# # numbers

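A replacement function becomes more useful when the result depends on the matched text. As a small sketch, the function below (named double here purely for illustration) doubles every number it finds:

```python
import re

def double(match):
    # match.group(0) is the matched digits as a string
    return str(int(match.group(0)) * 2)

text = 'string with456 some111 888 numbers'
result = re.sub(r'[0-9]+', double, text)
print(result)
```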
Splitting a string
With the built-in str.split method, you can split a string on only one fixed separator.
With regular expressions, you can split on a pattern or on several different separators. For example:

    text = 'string with456 some111 888 numbers'
    txt = re.split('[0-9]+', text)
    print(txt)

Output:

    ['string with', ' some', ' ', ' numbers']

Multiple separators:

    text = 'str,in;g wi,th*456 so#me1;11 88$8 numbers'
    txt = re.split('[,;*#$]', text)
    print(txt)

Output:

    ['str', 'in', 'g wi', 'th', '456 so', 'me1', '11 88', '8 numbers']

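A separator pattern can also absorb the whitespace around it, and re.split accepts a maxsplit argument to limit the number of splits. This is a minimal sketch; the sample string is made up.

```python
import re

text = 'apple , banana;cherry ;  date'

# The separator is a comma or semicolon plus any spaces around it
parts = re.split(r'\s*[,;]\s*', text)
print(parts)

# maxsplit=1 stops after the first separator
first, rest = re.split(r'\s*[,;]\s*', text, maxsplit=1)
print(first, '|', rest)
```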
Shortcuts:
There are some shortcuts for common patterns like numbers, words, etc.
For example, to find one or more digits we can use the pattern [0-9]+ . The shortcut ‘\d+’ does the same thing:

    text = 'string with456 some111 888 numbers'
    txt = re.split(r'\d+', text)
    print(txt)
    # output:
    # ['string with', ' some', ' ', ' numbers']

Other shortcuts:
\D – not digit:

    text = 'string with456 some111 888 numbers'
    txt = re.split(r'\D+', text)
    print(txt)
    # output:
    # ['', '456', '111', '888', '']

\w – word character (letter, digit or underscore)
\W – not a word character
\s – white space
\S – not white space
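The remaining shortcuts work the same way. Here is a brief sketch with an invented sample string:

```python
import re

text = 'hello, world! 42_ok'

words = re.findall(r'\w+', text)      # runs of letters, digits and underscores
non_words = re.findall(r'\W+', text)  # everything in between
by_space = re.split(r'\s+', text)     # split on runs of white space
non_space = re.findall(r'\S+', text)  # runs of anything except white space

print(words)
print(non_words)
print(by_space)
print(non_space)
```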
Find All

    text = 'string with456 some111 888 numbers'
    txt = re.findall('[0-9]+', text)
    print(txt)
    # output:
    # ['456', '111', '888']

Using an iterator, which yields a match object for each match:

    text = 'string with456 some111 888 numbers'
    for m in re.finditer('[0-9]+', text):
        print(m)

Output (Python 3.7 and later; older versions show _sre.SRE_Match instead of re.Match):

    <re.Match object; span=(11, 14), match='456'>
    <re.Match object; span=(19, 22), match='111'>
    <re.Match object; span=(23, 26), match='888'>

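Printing the match object only shows its summary. To use the values in code, read them from the object with group() and span(), as in this short sketch:

```python
import re

text = 'string with456 some111 888 numbers'

# Collect (matched text, (start, end)) for every match
found = [(m.group(0), m.span()) for m in re.finditer('[0-9]+', text)]

for value, (start, end) in found:
    print(value, start, end)
```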
Compiling a regular expression
If you use a regular expression in a loop, for example while reading lines from a file, it is better to compile it once for performance:

    reobj = re.compile(r"[0-9]+")
    for line in myfile:
        # match() only succeeds when the digits are at the start of the line
        m = reobj.match(line)
        if m:
            print(m.group(0))

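Here is a self-contained variation of the loop above. It assumes an in-memory io.StringIO stands in for the file, with made-up lines, and it contrasts the compiled pattern's match method (start of line only) with its search method (anywhere in the line).

```python
import io
import re

# Stand-in for a real file; the contents are invented for this sketch
myfile = io.StringIO("42 apples\nno digits here\nroom 101\n")

reobj = re.compile(r"[0-9]+")  # compiled once, reused for every line

at_start = []   # numbers found at the beginning of a line
anywhere = []   # first number found anywhere in a line

for line in myfile:
    m = reobj.match(line)
    if m:
        at_start.append(m.group(0))
    s = reobj.search(line)
    if s:
        anywhere.append(s.group(0))

print(at_start)
print(anywhere)
```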