• To use regular expressions, import re.
  • Use re.search() to check whether a regular expression matches anywhere in a string.
  • Use re.findall() to extract strings that match a regular expression.
  • Reference: https://docs.python.org/3/howto/regex.html

Match a Regular Expression

import re

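# Open the file in a with block so it is closed automatically
with open('file.txt') as fileH:
    for line in fileH:
        line = line.rstrip()
        # Print only the lines that contain 'From: '
        if re.search('From: ', line):
            print(line)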
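
Because re.search() looks for a match anywhere in the string, a pattern without ^ also matches lines where the text appears in the middle. The short sketch below illustrates this with an in-memory list instead of a file; the sample lines and the variable names are made up for illustration.

```python
import re

# Hypothetical sample lines standing in for the contents of a file
lines = ['From: someone@hotmail.com', 'Subject: hello', 'Reply-From: x']

# Unanchored pattern: matches 'From: ' anywhere in the line
anywhere = [line for line in lines if re.search('From: ', line)]

# Anchored pattern: matches only lines that start with 'From: '
at_start = [line for line in lines if re.search('^From: ', line)]

print(anywhere)  # ['From: someone@hotmail.com', 'Reply-From: x']
print(at_start)  # ['From: someone@hotmail.com']
```

re.search() returns a match object when the pattern is found and None otherwise, which is why it can be used directly in an if statement.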

Extract Substring Matching a Regular Expression

The function re.findall() extracts every substring that matches a regular expression and returns the matches as a list.

import re

x = 'My 2 favorite numbers are 14 and 98'
y = re.findall('[0-9]+', x)
print(y)
['2', '14', '98']

Parentheses are not part of the match; they mark the portion of the matched text that re.findall() returns.

x = 'From: someone@hotmail.com Sat Jan 5'
y = re.findall(r'^From: (\S+@\S+)', x)
print(y)
['someone@hotmail.com']
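
To see what the parentheses change, it may help to run the same pattern with no group, with one group, and with two groups. This is only a small sketch; the variable names whole, one_group, and two_groups are chosen for illustration.

```python
import re

x = 'From: someone@hotmail.com Sat Jan 5'

# No parentheses: findall returns the entire matched text
whole = re.findall(r'^From: \S+@\S+', x)

# One pair of parentheses: findall returns only the enclosed part
one_group = re.findall(r'^From: (\S+@\S+)', x)

# Two pairs of parentheses: findall returns a tuple per match
two_groups = re.findall(r'^From: (\S+)@(\S+)', x)

print(whole)       # ['From: someone@hotmail.com']
print(one_group)   # ['someone@hotmail.com']
print(two_groups)  # [('someone', 'hotmail.com')]
```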

Greedy vs Non-Greedy

  • Greedy. The + and * operators match as much text as possible. Example: '^F.+:'
  • Non-greedy. A question mark after + or * makes the operator match as little text as possible. Example: '^F.+?:'
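
The difference shows up only when the text offers more than one place to stop. The sketch below applies both example patterns to a made-up line that contains two colons.

```python
import re

# Hypothetical sample line; note that it contains two colons
line = 'From: Using the : character'

# Greedy: .+ runs as far as it can, so the match ends at the last colon
greedy = re.findall('^F.+:', line)

# Non-greedy: .+? stops as soon as it can, so the match ends at the first colon
non_greedy = re.findall('^F.+?:', line)

print(greedy)      # ['From: Using the :']
print(non_greedy)  # ['From:']
```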

Special Characters

^        Matches the beginning of the line
$        Matches the end of the line
.        Matches any character except a newline
\s       Matches a whitespace character
\S       Matches a non-whitespace character
*        Repeats the preceding character zero or more times
*?       Repeats the preceding character zero or more times (non-greedy)
+        Repeats the preceding character one or more times
+?       Repeats the preceding character one or more times (non-greedy)
[aeiou]  Matches a single character in the listed set
[^XYZ]   Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
(        Indicates where string extraction is to start
)        Indicates where string extraction is to end
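
A few of the characters in the table can be tried out as follows. This is a minimal sketch; the sample strings and variable names are illustrative only.

```python
import re

sample = 'From: someone@hotmail.com Sat Jan 5'

# ^ and \S+ : the run of non-whitespace characters at the start of the line
first_word = re.findall(r'^\S+', sample)

# [0-9]+ and $ : the digits at the very end of the line
last_number = re.findall('[0-9]+$', sample)

# [aeiou] : each single vowel in the string
vowels = re.findall('[aeiou]', 'regex')

# [^XYZ]+ : runs of characters that are not X, Y, or Z
not_xyz = re.findall('[^XYZ]+', 'aXbYc')

# [a-z0-9]+ : runs of lowercase letters and digits
lower_alnum = re.findall('[a-z0-9]+', 'abc123 DEF')

# ( and ) : extract only the text that follows the @ sign
domain = re.findall(r'@(\S+)', sample)

print(first_word)   # ['From:']
print(last_number)  # ['5']
print(vowels)       # ['e', 'e']
print(not_xyz)      # ['a', 'b', 'c']
print(lower_alnum)  # ['abc123']
print(domain)       # ['hotmail.com']
```

Patterns that contain a backslash, such as \S, are written here as raw strings (r'...') so that Python passes the backslash to the re module unchanged.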