Documentation for the re module in Python 3. The re module for regular expressions

Regular expressions are supported in almost every programming language. They help you quickly get at the information you need, particularly when processing text. Python ships with a dedicated built-in module, re, for working with regular expressions.

Today we will look in detail at what regular expressions are, how to work with them, and how the re module helps.

Regular expressions: an introduction

Where are regular expressions used? Almost everywhere. For example:

  1. Web applications that require text validation. A typical example is online mail clients.
  2. Any other projects related to texts, databases and so on.

Before we start on the syntax, we should look at the basic principles of the re library and at what makes it useful. We will also give practical examples and describe how they work. With this knowledge you can build a pattern that suits a wide variety of text operations.

What is a pattern in the re library?

A pattern (template) describes the kind of text you are looking for. With it, you can search for information of various types, extract whatever matches, and then process that data, which makes the functions that consume it more flexible.

For example, take the following pattern: \s+. Here \s matches any single whitespace character, and the plus sign means one or more of them in a row. Tab characters alone can be matched in the same way with \t+.

Before using patterns, you need to import the re library. After that, a special function compiles the pattern. This takes two steps.

>>> import re
>>> regex = re.compile(r'\s+')

This code compiles a pattern that can then be used, for example, to search for runs of one or more whitespace characters.
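As a minimal sketch of what the two patterns mentioned above match, the snippet below runs both against a short string. The sample string and the variable names are made up for illustration.

```python
import re

# Compile once; the raw string (r'...') keeps the backslash intact for the regex engine.
whitespace = re.compile(r'\s+')
tabs = re.compile(r'\t+')

sample = "a  b\t\tc"  # hypothetical sample: two spaces, then two tabs

whitespace_runs = whitespace.findall(sample)
tab_runs = tabs.findall(sample)

print(whitespace_runs)  # ['  ', '\t\t']
print(tab_runs)         # ['\t\t']
```

Note that \s+ picks up both the spaces and the tabs, while \t+ only picks up the tabs.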

Getting separate information from different strings using regular expressions

Suppose we have a variable containing the following information.

>>> text = """100 INF Informatics
... 213 MAT Mathematics
... 156 ENG English"""

It contains three training courses. Each consists of three parts: number, code and name. The whitespace between these parts is not uniform. How do we break this string into separate numbers and words? There are two ways:

  1. Call the function re.split().
  2. Call the split() method of the compiled regex.

Here is an example of using the syntax of each of the methods for our variable.

>>> re.split(r'\s+', text)
# or
>>> regex.split(text)

Output: ['100', 'INF', 'Informatics', '213', 'MAT', 'Mathematics', '156', 'ENG', 'English']

Both methods give the same result. But if you need the same pattern more than once, it is easier to compile it once and call regex.split() than to pass the pattern to re.split() every time.
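To illustrate the reuse argument, here is a small sketch that compiles the pattern once and applies it to several strings. The list of course lines and the names splitter and rows are assumptions made for the example.

```python
import re

splitter = re.compile(r'\s+')  # compiled once, reused for every line below

# Hypothetical course lines with uneven spacing
lines = ["100 INF    Informatics", "213 MAT  Mathematics", "156 ENG English"]

rows = [splitter.split(line) for line in lines]
print(rows)
# [['100', 'INF', 'Informatics'], ['213', 'MAT', 'Mathematics'], ['156', 'ENG', 'English']]

# The module-level function gives the same result for a one-off call
one_off = re.split(r'\s+', lines[0])
print(one_off == rows[0])  # True
```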

Finding matches with three functions

Let’s say we need to extract only numbers from a string. What needs to be done for this?

re.findall()

Here is a use case for findall(): combined with a regular expression, it extracts every run of one or more digits from the text variable.

>>> print(text)
100 INF Informatics
213 MAT Mathematics
156 ENG English
>>> regex_num = re.compile(r'\d+')
>>> regex_num.findall(text)
['100', '213', '156']

The \d token matches any single digit in the text. Because we added + after it, at least one digit must be present for a match.

You can also use * instead, which means zero or more: a digit is then not required for a match to be found.

In our case, since we used +, findall() extracted each run of one or more digits, that is, the course numbers. The regular expression thus acts as a setting for the function.
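The difference between + and * is easiest to see on a string with no digits at all. The sketch below uses re.search() on the made-up input 'INF'; the variable names are illustrative.

```python
import re

plus = re.search(r'\d+', 'INF')  # needs at least one digit, so there is no match
star = re.search(r'\d*', 'INF')  # zero digits is acceptable, so it matches the empty string

print(plus)               # None
print(repr(star.group())) # ''
```

This is why * is rarely what you want on its own: it succeeds even when nothing useful was found.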

re.search() vs re.match()

As the names suggest, re.search() looks for a match in the text. How does it differ from findall()? It returns a match object for the first occurrence of the pattern, rather than a list of every result like the previous function.

The re.match() function works the same way, with one difference: the pattern must match at the very beginning of the string.

Let’s take an example that demonstrates this.

>>> # create a variable with text
>>> text2 = """INF Informatics
... 213 MAT Mathematics 156"""
>>> # compile the regex and look for the pattern
>>> regex_num = re.compile(r'\d+')
>>> s = regex_num.search(text2)
>>> print('First index:', s.start())
First index: 16
>>> print('Last index:', s.end())
Last index: 19
>>> print(text2[s.start():s.end()])
213

To get the same result more directly, call the match object's group() method: s.group() returns '213'.
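A short sketch contrasting the two functions on the same string: search() scans the whole text, while match() only tries the very beginning, so here it finds nothing. The names found and at_start are illustrative.

```python
import re

text2 = "INF Informatics\n213 MAT Mathematics 156"
regex_num = re.compile(r'\d+')

found = regex_num.search(text2)    # scans the whole string for the first match
at_start = regex_num.match(text2)  # only tries to match at position 0

print(found.group())  # 213
print(found.span())   # (16, 19)
print(at_start)       # None, because the string starts with a letter
```

Since both functions return None when nothing matches, it is worth checking the result before calling group() on it.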

Replacing part of text with Re library

To replace text, use the function re.sub(). Suppose our list of courses has changed a little: each course code is now followed by a tab. Our task is to normalize the spacing so that the whole sequence ends up on one line. To do this, we replace every match of \s+ with a single space.

The original text was:

# create a variable with text
>>> text = """100 INF \t Informatics
... 213 MAT \t Mathematics
... 156 ENG \t English"""
>>> print(text)
100 INF          Informatics
213 MAT          Mathematics
156 ENG          English

To perform the desired operation, we used the following lines of code.

# replace each run of whitespace with a single space
>>> regex = re.compile(r'\s+')
>>> print(regex.sub(' ', text))

As a result, we have one line.

100 INF Informatics 213 MAT Mathematics 156 ENG English

Now consider another problem. We no longer want to merge everything into one line; it matters more that each course stays on its own line. For this we need an expression that excludes the newline from the match.

The re library supports negative lookahead, written (?!...). It asserts that the text at the current position does not match what is inside the parentheses, without consuming any characters. So to skip the newline character, we write (?!\n) directly before \s+.

We get the following code.

# collapse runs of whitespace, but leave newlines alone
>>> regex = re.compile(r'((?!\n)\s+)')
>>> print(regex.sub(' ', text))
100 INF Informatics
213 MAT Mathematics
156 ENG English
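One caveat worth sketching: the lookahead only checks the first character of the run, so if spaces come directly before a newline, \s+ still swallows the newline. A character class that means "whitespace except newline", [^\S\n]+, avoids this. This is an alternative technique rather than the approach used above, and the input string with trailing spaces is made up for the demonstration.

```python
import re

# Hypothetical input: note the two trailing spaces before the newline
text = "100 INF \t Informatics  \n213 MAT \t Mathematics"

lookahead = re.compile(r'(?!\n)\s+')  # run must not START with a newline
char_class = re.compile(r'[^\S\n]+')  # run may not CONTAIN a newline

with_lookahead = lookahead.sub(' ', text)
with_class = char_class.sub(' ', text)

print(repr(with_lookahead))  # '100 INF Informatics 213 MAT Mathematics'
print(repr(with_class))      # '100 INF Informatics \n213 MAT Mathematics'
```

With clean input that has no trailing whitespace, both patterns give the same result.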

What are regular expression groups?

With regular expression groups, we can get the matched parts as separate elements rather than as one string.

Suppose we need the course number, code and name as separate elements. Done by hand, this takes a lot of unnecessary code.

In fact, the task can be greatly simplified. You can write one pattern for the whole entry and wrap the parts you want to extract in parentheses.

The result is only a few lines.

# define groups in the course pattern and extract them
>>> course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([a-zA-Z]{4,})'
>>> re.findall(course_pattern, text)
[('100', 'INF', 'Informatics'), ('213', 'MAT', 'Mathematics'), ('156', 'ENG', 'English')]
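Groups are also available on individual match objects. The sketch below, which assumes the same three-part course layout, uses finditer() and groups() to unpack each course into separate variables; the names courses and first are illustrative.

```python
import re

course_pattern = re.compile(r'([0-9]+)\s*([A-Z]{3})\s*([a-zA-Z]{4,})')
text = "100 INF Informatics\n213 MAT Mathematics\n156 ENG English"

courses = []
for m in course_pattern.finditer(text):
    number, code, name = m.groups()  # one string per parenthesized group
    courses.append((number, code, name))
    print(number, code, name)

first = course_pattern.search(text)
print(first.group(0))  # 100 INF Informatics (group 0 is the whole match)
print(first.group(2))  # INF
```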

The concept of “greedy” matching

By default, regular expressions are greedy: a quantifier matches as much text as it can, even if you need much less.

Let's look at a sample of HTML from which we need to extract a tag.

>>> text = "<body>Example of Greedy Regular Expression Matching</body>"
>>> re.findall(r'<.*>', text)
['<body>Example of Greedy Regular Expression Matching</body>']

Instead of extracting just one tag, Python returned the whole string. That is why it is called greedy.

And what to do to get only the tag? In this case, you need to use lazy matching. To make a quantifier lazy, add a question mark right after it: .*? instead of .*.

You will get the following code and the output of the interpreter.

>>> re.findall(r'<.*?>', text)
['<body>', '</body>']

If only the first occurrence is needed, use the search() method.

>>> re.search(r'<.*?>', text).group()
'<body>'

Then only the opening tag will be found.
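The same behaviour can be sketched on a string with more than one element, where the gap between greedy and lazy matching is even more visible. The HTML fragment and variable names here are invented for the example.

```python
import re

html = "<p>one</p><p>two</p>"  # hypothetical fragment with two elements

greedy = re.findall(r'<.*>', html)   # .* runs to the last '>' in the string
lazy = re.findall(r'<.*?>', html)    # .*? stops at the first '>' it can
first_tag = re.search(r'<.*?>', html).group()

print(greedy)     # ['<p>one</p><p>two</p>']
print(lazy)       # ['<p>', '</p>', '<p>', '</p>']
print(first_tag)  # <p>
```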

Popular Expression Templates

Here is a table containing the most commonly used regular expression patterns.
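As a small illustrative selection of frequently used tokens (chosen for this sketch, not taken from the table), the snippet below applies a few of them to one course line. The dictionary name results is an assumption.

```python
import re

line = "213 MAT Mathematics"  # a course line in the format used earlier

patterns = [
    r'\d+',       # one or more digits
    r'[A-Z]{3}',  # exactly three uppercase ASCII letters
    r'\w+',       # one or more word characters (letters, digits, underscore)
    r'^\d+',      # digits anchored at the start of the string
    r'\w+$',      # word characters anchored at the end of the string
]

results = {p: re.findall(p, line) for p in patterns}
for pattern, matches in results.items():
    print(pattern, matches)
```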


Conclusion

We have considered only the most basic methods for working with regular expressions. In any case, you have seen how important they are. And here it makes no difference whether it is necessary to parse the entire text or its individual fragments, whether it is necessary to analyze a post on a social network or collect data in order to process it later. Regular expressions are a reliable helper in this matter.

They allow you to perform tasks such as:

  1. Validating the format of data, such as an email address or phone number.
  2. Taking a string and splitting it into several smaller strings.
  3. Performing various operations on text, such as searching, extracting the necessary information, or replacing some of the characters.
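The first task in the list, format validation, can be sketched with fullmatch(), which succeeds only if the entire string fits the pattern. Both patterns below are deliberately simplified assumptions (a 3-3-4 digit phone layout and a rough email shape), not complete validators, and the helper is_valid is hypothetical.

```python
import re

# Simplified, illustrative patterns; real-world formats are more varied.
PHONE = re.compile(r'\d{3}-\d{3}-\d{4}')
EMAIL = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def is_valid(pattern, value):
    """Return True only if the whole value matches the compiled pattern."""
    return pattern.fullmatch(value) is not None

print(is_valid(PHONE, '555-123-4567'))      # True
print(is_valid(PHONE, '5551234567'))        # False
print(is_valid(EMAIL, 'user@example.com'))  # True
print(is_valid(EMAIL, 'user@example'))      # False
```

Using fullmatch() instead of search() matters here: search() would accept any string that merely contains a valid-looking fragment.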

Regular expressions also allow you to perform non-trivial operations. At first glance, mastering this science is not easy. But in practice, everything is standardized, so it’s enough to figure it out once, after which this tool can be used not only in Python, but also in any other programming language. Even Excel uses regular expressions to automate data processing. So it’s a sin not to use this tool.
