How important are regular expressions in programming

danielfett.de

"Regular expressions" are a kind of language that can be used in programming for various problem solutions, especially when it comes to editing and checking character strings or searching for something in them.

Because the name "regular expressions" is a bit unwieldy, they are often just called "regex" (plural "regexes").

Here is a short tutorial on these seemingly esoteric but incredibly powerful strings, which remind the uninitiated of a small child's first attempts at the keyboard.

Introduction

What are regular expressions? As mentioned, regular expressions can be used to check whether character strings have a certain structure. This matters, for example, in applications that expect input from a user.

Examples:

  • School grade
  • Postcode
  • E-mail address
  • Order numbers

Concrete application

Common places where you come across regular expressions and can apply this knowledge:

  • Web applications (e.g. in PHP, Perl)
  • Unix scripts

The following assumes that you have a suitable use case and have already found out how regular expressions are used in the environment of your choice. In PHP, for example, this is done with the functions preg_match (checking strings, finding character strings) and preg_replace (replacing character strings).

It is best to test your expression "on dry land" first. To do this, you can use an online tool like regex101 or the Regex Coach.

Good style

You will often find that there are many different solutions to a problem. The following trade-offs then arise:

  • An exact solution versus a more general one. If there is a risk that the exact solution is too restrictive (so that a user might give up in frustration because correct input is not accepted), the general one is the better choice. If the problem is clearly defined, it is better to use the exact one.
  • An exact solution versus a fast one. Long regular expressions can take a long time to process. Here, too, you have to weigh up what is more important to you. It could annoy users if validating an input takes a long time.
  • A simple solution versus an "elegant" one. Can you express in 10 characters what others cannot do in 60? Great, but remember that you might want to change something later, and that can be harder than you think. It is therefore better to use the simpler variant or to comment complicated expressions.

Problems

If your regular expression causes problems but you are fairly sure it is correct, first take a look at the special characters described below and at what you have to do in order to use them literally. Special characters include things as mundane as a dot.

You should also test it in the Regex Coach mentioned above! This sometimes reveals unbelievably irregular behavior ;-)

After that, Google is the next step ;-) There you will find plenty of information and many ready-made templates for expressing various problems as regular expressions, and Google Groups are always helpful.

Conventions

In this tutorial I often use the term "string", which is familiar to programmers, for a piece of text (I will spare you the underwear joke). In addition, the following phrases are used interchangeably: an expression "describes a text exactly" = "performs the desired function" = "matches" = "hits" = "eats (something)".

Simple expressions

One-element regular expressions

Let's start small. We want to check whether an entry corresponds to a school grade from 1 to 6. The expression [123456] does this for us. As you can see, the square brackets contain a list of the characters that are allowed. Overall, however, the entire bracketed expression stands for only one character: 1 or 2 or ... or 6.

Since our numbers from 1 to 6 follow one another so nicely, we can also write [1-6], which makes things a little clearer. In exactly the same way you could check, for example, whether an entry corresponds to a track at a train station: [1-9] for a station with 9 tracks. At our station, track 4 is closed today, so that input is not allowed: [1-35-9]. We have divided the input range into the two ranges 1-3 and 5-9. As you can see, the two ranges are simply written one after the other. This takes some getting used to at the beginning (intuitively you might want to add a space), but we will come across it more often.
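As a minimal sketch of the character classes described above, here is how the grade and track checks could look in Python's re module. The function names are made up for illustration, and fullmatch is used on the assumption that the whole input has to be described by the expression.

```python
import re

GRADE = re.compile(r"[1-6]")     # exactly one character: 1, 2, 3, 4, 5 or 6
TRACK = re.compile(r"[1-35-9]")  # ranges 1-3 and 5-9 back to back; 4 is excluded


def is_grade(text: str) -> bool:
    # fullmatch only succeeds if the pattern describes the entire input
    return GRADE.fullmatch(text) is not None


def is_open_track(text: str) -> bool:
    return TRACK.fullmatch(text) is not None


print(is_grade("3"), is_grade("7"))            # True False
print(is_open_track("4"), is_open_track("5"))  # False True
```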

Multi-element regular expressions

What if our station grows? Let's say it is expanded to 12 tracks: [1-12] (track 4 is open again ;-)). But be careful, there is a bug in here!

As mentioned above, the entire expression in square brackets stands for only one character. However, "12" contains two characters. The expression above won't work: it means "1 to 1" or "2". We will show later how to deal with stations with more than 9 tracks.

For the moment we want to check something else: platforms 1-9 each have sections "a" and "b". So we want to check for 1a, 1b, 2a, ..., 9a, 9b: [1-9][ab]. That means: a digit from 1 to 9 followed by a letter a or b.

Several square brackets correspond to several characters. If an expression is to describe several characters, these are simply attached one after the other. Then the input is compared with your expression from left to right.
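The following sketch, again in Python, illustrates both points: a range such as 1-12 inside square brackets does not mean "one to twelve", and two bracket expressions in a row describe two characters. The variable names are only illustrative.

```python
import re

buggy = re.compile(r"[1-12]")        # equivalent to [12]: the range 1-1 plus the character 2
platform = re.compile(r"[1-9][ab]")  # one digit 1-9, then one letter a or b

print(buggy.fullmatch("12"))               # None: the class describes only one character
print(buggy.fullmatch("2") is not None)    # True
print(buggy.fullmatch("7") is not None)    # False

print(platform.fullmatch("1a") is not None)  # True
print(platform.fullmatch("4d") is not None)  # False
```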

Of course, you do not always have to list every letter: ranges work for letters too, so [1-9][a-f], for example, now also matches 4d.

Please note that upper and lower case are treated separately. A useful extension would be, for example, [1-9][a-fA-F].

Options

Let's look at entering a house number. It can consist of one or more (up to 3) digits and may have a letter a-z at the end, but does not have to: [1-9][0-9]?[0-9]?[a-z]? The question mark after an element (and [1-9] is one element!) means: the preceding element may occur, but does not have to.

The expression reads: one digit (1-9), optionally two further digits (0-9), optionally one letter.
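To make the optional elements concrete, here is a small Python sketch of the house-number check as described: one digit 1-9, up to two optional digits, and an optional lowercase letter. The pattern is reconstructed from that description, and the helper name is hypothetical.

```python
import re

# Assumed pattern: each "?" makes the element directly before it optional.
HOUSE_NUMBER = re.compile(r"[1-9][0-9]?[0-9]?[a-z]?")


def is_house_number(text: str) -> bool:
    return HOUSE_NUMBER.fullmatch(text) is not None


for candidate in ["7", "12a", "123", "0", "1234", "12ab"]:
    print(candidate, is_house_number(candidate))
```

The first three candidates are accepted; "0" fails because the first digit must be 1-9, "1234" has too many digits, and "12ab" has too many letters.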

If you imagine such a construction with, say, 10 digits, you can see that it quickly becomes very long. There is therefore a different notation for allowing an element more than once: a{1,3}h matches "ah" as well as "aaah". The specification in curly brackets stands for the {minimum,maximum} number of repetitions. We can therefore write our house numbers as [1-9][0-9]{0,2}[a-z]? and then rewrite them for longer streets (USA ;-)) without any problems: [1-9][0-9]{1,4}[a-z]? now also allows five-digit house numbers, but requires at least a two-digit number (note the changed value before the comma).

A construction like {x} (i.e. curly brackets with only one value) requires the preceding element to be repeated exactly x times. In our case 5 times: [0-9]{5} is suitable for checking (German) postcodes.

We can also express "at least": [0-9]{3,} requires at least 3 digits.
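The three forms of the curly-bracket notation can be tried out with a short Python sketch. The concrete patterns are assumptions based on the descriptions above (in particular the upper bound 3 in a{1,3}h is only a guess at the original example).

```python
import re

ah = re.compile(r"a{1,3}h")                      # between 1 and 3 times "a", then "h"
long_house = re.compile(r"[1-9][0-9]{1,4}[a-z]?")  # 2 to 5 digits, optional letter
postcode = re.compile(r"[0-9]{5}")               # exactly 5 digits
at_least_three = re.compile(r"[0-9]{3,}")        # 3 or more digits

print(ah.fullmatch("ah") is not None, ah.fullmatch("aaah") is not None)  # True True
print(ah.fullmatch("aaaah") is not None)                                 # False
print(long_house.fullmatch("5") is not None)       # False: at least two digits required
print(long_house.fullmatch("12345b") is not None)  # True
print(postcode.fullmatch("54290") is not None)     # True
print(at_least_three.fullmatch("12") is not None)  # False
```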

Any repetitions

Up to now we always had to know how many characters should be found. However, there are certainly cases in which a character can be repeated as often as desired.

Let's look at checking phone numbers. When writing a regular expression, it is often advisable to first make a note of what it should meet. Our telephone numbers should be of this format:

  • 0651/55541-36
  • 0049 160 555678
  • 0180.23.555.63

In addition to digits, there can also be hyphens, slashes, spaces and periods. One valid element would therefore be [0-9/. -] (that is: this one character is a digit between 0 and 9, or a slash, or a period, or a space, or a minus). But be careful! As you can see, there are two hyphens in the brackets: one in its special function of indicating a range of digits, and one as a "real" hyphen that may appear in the input. To avoid confusion, we have to mask (escape) the latter to show that it has no special function here. This is done with a preceding backslash: [0-9/. \-] Next we want to say that these characters may appear as often as you like: [0-9/. \-]+ The plus means: the preceding element one or more times, i.e. at least once. If we also want to allow an empty telephone number (a mathematician would call it the null number), there is another character: [0-9/. \-]* now also matches "" (the empty string) and of course our telephone numbers.

This is of course a very general expression (see the discussion above). It also allows phone numbers that are obviously incorrect, such as one consisting only of slashes. If you want to restrict this further, you have to develop a more complicated expression.

To summarize: + stands for at least one occurrence; * means zero or any number of occurrences.
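A short Python sketch of the difference between + and *, using the three sample numbers from the list above. The character class is reconstructed from the description (digits, slash, period, space and an escaped minus).

```python
import re

PHONE = re.compile(r"[0-9/. \-]+")           # one or more allowed characters
PHONE_OR_EMPTY = re.compile(r"[0-9/. \-]*")  # zero or more allowed characters

samples = ["0651/55541-36", "0049 160 555678", "0180.23.555.63"]

print(all(PHONE.fullmatch(s) for s in samples))  # True
print(PHONE.fullmatch("") is None)               # True: + needs at least one character
print(PHONE_OR_EMPTY.fullmatch("") is not None)  # True: * also accepts the empty string
print(PHONE.fullmatch("--//") is not None)       # True: very general, accepts nonsense too
```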

Placeholders

Let's move on to another example: we want to search for an author in a library catalog, but we don't know whether he has a middle name. So any characters may follow the first name, and only then the last name. A possible solution looks like this: Marius .*Osterhase In detail this means: "Marius", then a space, followed by any character (that's what the dot stands for!) any number of times (i.e. possibly none at all), and then the surname. This matches "Marius Osterhase" as well as "Marius Müller Osterhase".

In the same way, you can of course put a plus after the dot to require at least one character.

The dot usually "eats" almost everything, but not line breaks. You can find out how to make it do so under Modifiers.
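Here is a hedged Python sketch of the dot as a placeholder. The pattern is the one implied by the description ("Marius", a space, any characters, then the surname); the variable names are illustrative.

```python
import re

any_middle = re.compile(r"Marius .*Osterhase")   # .* allows zero or more characters
some_middle = re.compile(r"Marius .+Osterhase")  # .+ requires at least one character

print(any_middle.fullmatch("Marius Osterhase") is not None)         # True
print(any_middle.fullmatch("Marius Müller Osterhase") is not None)  # True
print(some_middle.fullmatch("Marius Osterhase") is not None)        # False

# By default the dot does not match a line break.
print(any_middle.fullmatch("Marius Müller\nOsterhase") is not None)  # False
```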

The attentive reader will have noticed that we already met a dot above, namely inside the square brackets. Many, though not all, special characters have to be masked (with a preceding backslash) outside of square brackets, but not inside them.

Negate character classes

Suppose we don't know the author's middle name exactly, but remember that it contains neither a q nor a z. No problem either: Marius [^qz]+ Osterhase requires a middle name (hence the plus) and allows any character that is not q or z. The ^ negates a character class and applies exactly up to the closing bracket.
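A negated character class can be sketched in Python as follows. The pattern is an assumption reconstructed from the description, and the middle name "Enzo" is a made-up counterexample.

```python
import re

# [^qz]+ : at least one character, none of which may be q or z
no_q_or_z = re.compile(r"Marius [^qz]+ Osterhase")

print(no_q_or_z.fullmatch("Marius Müller Osterhase") is not None)  # True
print(no_q_or_z.fullmatch("Marius Enzo Osterhase") is not None)    # False: contains a z
print(no_q_or_z.fullmatch("Marius Osterhase") is not None)         # False: middle name required
```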

Brackets

Round brackets (parentheses) can be used to combine longer expressions into one element, which makes it possible to apply what you have learned above to partial expressions: Marius (Müller )?Osterhase matches exactly "Marius Müller Osterhase" or "Marius Osterhase" and nothing else. Thanks to the brackets, the question mark applies to the complete middle name and the following space.

You can also write Ba(na)+, which matches both the yellow fruit and, for example, "Banananana". The notation with curly brackets can also be used here, for example Ba(na){2,4}, which of course has a correspondingly different meaning.

Alternatives

You can also do other things with brackets, e.g. specify alternatives within a partial expression: (great|really bad) In an expression that ends with this group, the last words may be "great" or "really bad", but not both.
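The following Python sketch illustrates brackets used for grouping and for alternatives. The patterns follow the descriptions above; the sentence "Regex is ..." is a hypothetical example, not taken from the original.

```python
import re

optional_name = re.compile(r"Marius (Müller )?Osterhase")  # "?" applies to the whole group
banana = re.compile(r"Ba(na)+")                            # "na" one or more times
verdict = re.compile(r"Regex is (great|really bad)")       # exactly one of two alternatives

print(optional_name.fullmatch("Marius Osterhase") is not None)          # True
print(optional_name.fullmatch("Marius Schmidt Osterhase") is not None)  # False
print(banana.fullmatch("Banana") is not None)                           # True
print(banana.fullmatch("Banananana") is not None)                       # True
print(verdict.fullmatch("Regex is really bad") is not None)             # True
print(verdict.fullmatch("Regex is great really bad") is not None)       # False
```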

Modifiers

In most regex variants you can set so-called modifiers and thus control the exact behavior of the expression. In Java you do this, for example, when compiling the pattern; in PHP a regular expression always has a syntax like /expression/im, where the slashes are the delimiters (others are possible, e.g. ~) and "i" and "m" are the modifiers in this case. Common modifiers include:

  • i Case insensitivity: upper and lower case are not distinguished.
  • s Dot matches all: the dot also eats line breaks, which is not the case by default.
  • m Multiline mode: the characters ^ and $ also match at the beginnings and ends of lines. Without the modifier, they only match at the beginning and end of the entire string.

Modifiers always refer to the entire expression and are therefore an easily overlooked source of error.
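As a rough illustration, the three modifiers listed above have counterparts in Python's re module, where they are passed as flag constants rather than appended after a closing delimiter as in PHP: re.IGNORECASE for i, re.DOTALL for s and re.MULTILINE for m. The sample strings are made up.

```python
import re

# i: ignore upper and lower case
case_hit = re.search(r"marius", "MARIUS", re.IGNORECASE) is not None

# s: let the dot match line breaks as well
dot_default = re.search(r"a.b", "a\nb")                        # None
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None

# m: let ^ and $ match at every line, not only at the string boundaries
text = "one\ntwo"
first_words_default = re.findall(r"^\w+", text)                 # ['one']
first_words_multiline = re.findall(r"^\w+", text, re.MULTILINE)  # ['one', 'two']

print(case_hit, dot_default, dot_all)
print(first_words_default, first_words_multiline)
```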

More complicated expressions

Nesting

Of course, brackets can also be nested inside one another, e.g. in the following expression: (VW (Golf|Polo)|Fiat (Punto|Panda)), which matches "VW Golf", "VW Polo", "Fiat Punto" and "Fiat Panda". Although this allows very short expressions for long strings, it can also take a long time to process.

Such alternatives can be repeated in exactly the same way: (10|01)+ describes a sequence of zeros and ones in which at most 2 zeros or ones follow one another. (If this is not immediately clear: just think about what you can put together from "10" and "01".)
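A small Python sketch of nested and repeated alternatives. Both patterns are reconstructed from the descriptions above and should be read as assumptions, especially the choice of + rather than * for the repetition.

```python
import re

cars = re.compile(r"(VW (Golf|Polo)|Fiat (Punto|Panda))")
bits = re.compile(r"(10|01)+")  # built only from the blocks "10" and "01"

for name in ["VW Golf", "VW Polo", "Fiat Punto", "Fiat Panda", "VW Punto"]:
    print(name, cars.fullmatch(name) is not None)

for sequence in ["1001", "100110", "111", "1000"]:
    print(sequence, bits.fullmatch(sequence) is not None)
```

"VW Punto" is rejected because the inner alternatives belong to their respective brand, and "111" and "1000" are rejected because they cannot be assembled from "10" and "01".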

Greedy expressions

An example from practice: the following, somewhat longer expression is meant to find the links, or rather their target addresses, in an HTML file. A link in HTML source code usually has a format like <a href="target address" further attributes>. A possible expression for it is found very quickly: <a href=".*".*> Looks good, but doesn't work. Why?

The author did indeed think that the expression would end at the closing angle bracket belonging to the link, but it doesn't. Applied to a line of HTML that contains further tags after the link, the match extends all the way to the last closing angle bracket. This is clearly going too far! The second dot is too "greedy" and eats all characters, even across several closing angle brackets. In other situations the first dot could run amok in a similar way.

In our case there are two possible solutions. Often only one of them is available, so I present both:

  • Replace the dot with something that no longer eats angle brackets: <a href=".*"[^>]*> This method is used often (think about which characters mark the end and exclude them from what may be eaten). The same now has to be done for the first dot: <a href="[^"]*"[^>]*>
  • "Re-educate" the dot (or rather the *) to be frugal, so that the two together eat only as much as is absolutely necessary. This is done with an appended question mark, which of course no longer has its previously known function: <a href=".*?".*?> This special use of the question mark also works after a plus sign.
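The difference between the greedy expression and the two fixes can be sketched in Python. The HTML snippet is made up for illustration, and the three patterns are reconstructions of the ones discussed above.

```python
import re

# Hypothetical one-line HTML excerpt with two links
html = '<a href="a.html">A</a> <a href="b.html" class="x">B</a>'

greedy = re.search(r'<a href=".*".*>', html).group(0)
negated = re.findall(r'<a href="[^"]*"[^>]*>', html)  # exclude the terminating characters
lazy = re.findall(r'<a href=".*?".*?>', html)         # *? eats as little as possible

print(greedy == html)  # True: the greedy dots swallow everything up to the last ">"
print(negated)         # ['<a href="a.html">', '<a href="b.html" class="x">']
print(lazy == negated) # True
```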

Groups

Almost all RegEx dialects allow you to create groups and save them for later use. Brackets can also be used for this purpose.

Later use can mean, for example, that an expression should match a longer character string, but only part of it should actually be used.

Something like this would be very useful in conjunction with the above expression to filter all destination addresses from an HTML page.

How exactly you get at the contents of the groups is described in the documentation for the language you are using. In PHP see e.g. preg_match, in Java the java.util.regex.Matcher class. For the link expression above, this could look as follows: <a href="(.*?)".*?> The first dot and its "multipliers" are now in brackets, so the URL ends up in the first group.

Note that when numbering the groups, the order of the opening brackets counts. This is important with nested brackets. Also, all brackets are usually counted, even if they are only used to mark alternatives (see above).
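In Python, group contents are available through the match object or, for all hits at once, through re.findall. The sketch below assumes a lazy link pattern with the URL in the first group and uses a made-up HTML snippet; the car example shows how groups are numbered by their opening brackets.

```python
import re

html = '<a href="a.html">A</a> <a href="b.html" class="x">B</a>'  # hypothetical input

# With exactly one group in the pattern, findall returns the group contents.
urls = re.findall(r'<a href="(.*?)".*?>', html)
print(urls)  # ['a.html', 'b.html']

# Group numbers follow the order of the opening brackets, nested ones included.
m = re.fullmatch(r"(VW (Golf|Polo)|Fiat (Punto|Panda))", "Fiat Panda")
print(m.group(1))  # Fiat Panda
print(m.group(2))  # None: this alternative did not take part in the match
print(m.group(3))  # Panda
```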

Backreferences

There is another very useful use for groups. Imagine that in a sequence of numbers (let's say separated by a space) you want to find all the numbers that start and end with the same digit.

We can use an expression like " ([0-9])[0-9]*\1 " (note the space at each end), where \1 refers to (references) the content of the first bracket, so the text at the position of \1 must be the same as the text matched inside that bracket. The first and last characters of our expression are spaces, which mark the beginning and the end of a number.
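A short Python sketch of a backreference, assuming a pattern of the form space, digit group, further digits, \1, space. The number sequences are invented for illustration.

```python
import re

same_ends = re.compile(r" ([0-9])[0-9]*\1 ")  # \1 must repeat what group 1 matched

text = " 121 345 4884 90 "  # hypothetical input, numbers separated by spaces
hits = [m.group(0) for m in same_ends.finditer(text)]
print(hits)  # [' 121 ', ' 4884 ']

# Because each hit includes its surrounding spaces, two neighbouring numbers
# cannot both be found: the shared space is already used up by the first hit.
neighbours = [m.group(0) for m in same_ends.finditer(" 121 4884 ")]
print(neighbours)  # [' 121 ']
```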

If we look at what it matches, the leading and trailing spaces count towards each hit. This is no wonder; after all, there is a space at the front and at the end of our expression. That is not ideal if we don't actually care about those spaces. The regex language therefore provides a means for this as well, one that also works if we do not want to or cannot use groups.

Special characters: word boundaries

There are elements that appear in the regular expression but do not correspond to any text in the matched string. Does that mean nothing to you yet? Well, let's look at how to write the above expression without the spaces: \b([0-9])[0-9]*\1\b The "\b" is a single element and marks the beginning or the end of a word (i.e. a word boundary). Are you familiar with the "whole words only" option in the search function of many word processors and editors? If you select it, a search for "Kai" only finds "Kai" as a word of its own, not as part of a longer word. In regular expressions this is written as \bKai\b, so "Kai" only if it is surrounded by word boundaries.

What exactly is a word boundary anyway? A word boundary occurs between a word character and a non-word character. Huh? (Explanation follows!)
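Word boundaries can be tried out with the following Python sketch. The patterns are reconstructed from the description, and the sample texts (including the word "Kaiser") are made up.

```python
import re

# \b matches a position, not a character, so the hits contain no spaces.
same_ends = re.compile(r"\b([0-9])[0-9]*\1\b")
numbers = [m.group(0) for m in same_ends.finditer("121 4884 90 77")]
print(numbers)  # ['121', '4884', '77']

sentence = "Kai and the Kaiser met Kai."
print(len(re.findall(r"Kai", sentence)))      # 3: also hits the start of "Kaiser"
print(len(re.findall(r"\bKai\b", sentence)))  # 2: whole words only
```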

Special characters: More characters

There are (surprise!) even more special characters that you can use and that can shorten a regex nicely. They always consist of a backslash followed by another character (a letter). By the way, a capital letter always stands for the opposite of the lowercase one.

  • \w and \W: A word character (lowercase w) stands for [a-zA-Z0-9_] and a non-word character (capital W) for everything else, i.e. [^a-zA-Z0-9_].
  • \d and \D: A digit, i.e. a character from 0-9. Corresponds to [0-9], and in the capitalized form to [^0-9].
  • \b and \B: You already got to know the lowercase variant above. Accordingly, a capital B stands for all positions where there is no word boundary.
  • \s and \S: The lowercase variant stands for all whitespace, that is, almost all the characters you cannot see: return (or enter), space, tab.
  • \\: If special characters start with a backslash, you need some way to write the backslash itself when you mean exactly that. You do this simply by doubling it: the first backslash "masks" the second.
  • \. and similar: As you saw above, the dot and many other characters have a special function. They are stripped of this special function ("masked") by a preceding backslash. Within character classes this is needed for the minus sign, whereas the backslash can be omitted for many other characters.
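The shorthand characters can be checked quickly in Python. One caveat that is specific to Python 3 and not part of the original text: \w also matches non-ASCII letters such as "ü" unless the re.ASCII flag is given, which restricts it to [a-zA-Z0-9_].

```python
import re

two_digits = re.fullmatch(r"\d\d", "42") is not None
non_digits = re.findall(r"\D", "a1b2")       # ['a', 'b']
whitespace = re.findall(r"\s", "a b\tc\n")   # [' ', '\t', '\n']

ascii_word = re.fullmatch(r"\w+", "abc_123", re.ASCII) is not None
umlaut_word = re.fullmatch(r"\w+", "Müller", re.ASCII)  # None with re.ASCII

literal_dot = re.fullmatch(r"a\.b", "a.b") is not None   # masked dot: only a real dot
wildcard_off = re.fullmatch(r"a\.b", "axb") is None
backslash = re.search(r"\\", "C:\\temp") is not None     # doubled backslash: a literal one

print(two_digits, non_digits, whitespace)
print(ascii_word, umlaut_word, literal_dot, wildcard_off, backslash)
```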

Start and end characters

Imagine you want to check a date, say with [0-9]{2}\.[0-9]{2}\.[0-9]{4}, and you apply this expression to an input with one digit too many at the front, for example 111.11.2011. What happens? Your expression reports a hit. That makes sense in a way: if you leave out one of the ones, your date is there. But you want the entire input to look like your pattern? Isn't there a remedy for that? There is: ^[0-9]{2}\.[0-9]{2}\.[0-9]{4}$ The caret at the beginning says: only match if the string to be searched starts here. And the dollar sign says: only match if the string to be searched ends here.

Of course, you can also use each of them on its own. As an example, let's look at two regular expressions: [0-9]$ and ^\( The first matches a string that ends with a digit. The second matches a string that begins with an opening bracket. For the bracket we use the masking described above.
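The effect of the start and end characters can be seen in this Python sketch. The date format (two digits, dot, two digits, dot, four digits) and the sample inputs are assumptions chosen for illustration.

```python
import re

DATE = r"[0-9]{2}\.[0-9]{2}\.[0-9]{4}"

# Without anchors, the expression is satisfied by a part of the input.
loose = re.search(DATE, "111.11.2011")
print(loose.group(0))  # 11.11.2011

# With ^ and $ the whole input has to look like a date.
strict = re.search("^" + DATE + "$", "111.11.2011")
print(strict)  # None

ends_with_digit = re.search(r"[0-9]$", "track 9") is not None
starts_with_bracket = re.search(r"^\(", "(note)") is not None
print(ends_with_digit, starts_with_bracket)  # True True
```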

Positive lookaheads and lookbehinds

You have now seen how to check for word boundaries without eating any characters. There is also a more general mechanism for this.

Take a sequence of digits as an example. From it, we want to fish out all subsequences that consist of three non-zero digits and are delimited by a zero on each side, in our case "358", "345", "523", "789", "928" and "768". You can use the following expression to do this: (?<=0)[1-9]{3}(?=0) Don't panic! To explain, we start in the middle: the [1-9]{3} should be clear (3 non-zero digits). Before it you find the bracket (?<=0). You have to read the characters "?<=" en bloc; they give the bracket a special function. It stands for "preceded by a zero". And because "preceded" means, in the normal reading direction of the western world, that the bracket looks "backwards", this function is called "positive lookbehind".

The same applies to the characters "?=", which give the bracket the special function "must be followed by ...", in our case "followed by a zero". This is called "positive lookahead" because the bracket looks "forward".

We could also express our "\b" example from above as follows, using our knowledge of "\W": (?<=\W)([0-9])[0-9]*\1(?=\W) To explain, we again start in the middle: there is the expression known from above (one digit, any number of further digits, and one more digit that is the same as the first). The very first bracket requires a non-word character before the first digit, and the very last bracket requires a non-word character after the last digit.

Of course, lookaheads and lookbehinds can also appear individually in a regular expression, and even multiple times. The only thing that is not possible in most dialects is a lookbehind with an unknown number of characters (so nothing like *, + or {0,4}).
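To see why lookarounds are useful, the following Python sketch compares the lookaround expression with a version that eats the zeros itself. The digit sequence is invented for this example and is not the one from the original tutorial.

```python
import re

digits = "03580345012300789"  # hypothetical input

# Lookbehind and lookahead only check for the zeros; they do not consume them.
with_lookaround = re.findall(r"(?<=0)[1-9]{3}(?=0)", digits)
print(with_lookaround)  # ['358', '345', '123']

# If the zeros are eaten, a zero shared by two neighbours is used up,
# so "345" is missed and the hits include the zeros.
consuming = re.findall(r"0[1-9]{3}0", digits)
print(consuming)  # ['03580', '01230']
```

The trailing "789" is not found in either case because it is not followed by a zero.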

All clear? If not, go back to the beginning of this section; do not pass "Go" and do not collect 200 regexes.

Negative lookaheads and lookbehinds

Did you find that complicated, and are you slowly realizing how much voodoo there is in regular expressions? It gets better.

Here's an example: you want to check an e-mail address that must not come from France. First, let's look at a simple e-mail address check without the restriction: [^@]+@.+\.[^.]+ (At least one character that is not an @ sign, followed by an @ sign, any characters (but at least one), a dot, and then any characters that do not include another dot.) Important: the last pair of square brackets always eats the last part of the e-mail address (the top-level domain, e.g. "com").

Now we add a restriction and want to exclude "fr" at this position. So the dot must not be followed by "fr". [^@]+@.+\.(?!fr)[^.]+ does this for us. The "?!" thus marks a bracket as "this must not follow here".
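A minimal Python sketch of the e-mail check with and without the negative lookahead. Both patterns are reconstructed from the description, the addresses are placeholders, and fullmatch is used on the assumption that the whole input must be an address. This is a deliberately simple check, not a complete e-mail validator.

```python
import re

simple = re.compile(r"[^@]+@.+\.[^.]+")
not_french = re.compile(r"[^@]+@.+\.(?!fr)[^.]+")  # the last dot must not be followed by "fr"

for address in ["user@example.com", "user@mail.example.org", "user@example.fr"]:
    print(address,
          simple.fullmatch(address) is not None,
          not_french.fullmatch(address) is not None)
```

All three addresses pass the simple check, while the restricted expression rejects only the one ending in ".fr".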

This could also be used to build a special word search. \bF(?!eta\b)\w+ matches all words that begin with "F" but are not equal to "Feta". That was the "negative lookahead".

And because it was so nice, I'll add one more, the "negative lookbehind": (?<!Müll)eimer matches every "eimer" (bucket) that does not belong to a "Mülleimer" (rubbish bin). It means: "match at this position if it is not preceded by "Müll" (rubbish)".
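Both negative variants can be sketched in Python as follows. The patterns and the sample words are assumptions made for illustration ("Wassereimer" is German for water bucket, "Mülleimer" for rubbish bin).

```python
import re

# Negative lookahead: words starting with "F", except the exact word "Feta"
f_words = re.findall(r"\bF(?!eta\b)\w+", "Feta, Fondue, Fetacheese")
print(f_words)  # ['Fondue', 'Fetacheese']

# Negative lookbehind: "eimer" only where it is not preceded by "Müll"
buckets = [m.start() for m in re.finditer(r"(?<!Müll)eimer", "Mülleimer Wassereimer")]
print(buckets)  # [16]: only the "eimer" inside "Wassereimer"
```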

I hope this little tutorial has helped you find your way through the tangle of characters in regular expressions. If you still have questions, I recommend the websites mentioned above as reference works.