lf53, UNIXBasics: Regular Expressions

by Guido Socher
<guido(at)linuxfocus.org>

About the author:
Guido loves Linux because it is a free system and because it is a lot of fun to work with people from the Linux community all over the world. He spends his spare time with his girlfriend, listens to BBC World Service radio, rides his bike through the countryside and enjoys playing with Linux.
Content:

Introduction
A simple example
The syntax rules
Using Regular Expression for text editing

Regular Expressions

Abstract:

Regular expressions are used for advanced context-sensitive searches and text modifications. They can be found in many advanced editors, in parser programs and in programming languages.

Introduction

Regular expressions can be found in many advanced editors like vi and emacs, in the programs grep/egrep and languages like awk, perl and sed.

Regular expressions are used for advanced context-sensitive searches and text modifications. A regular expression is a formal description of a template to be matched against a text string.

When I first saw a person using regular expressions several years ago, I was fascinated. Text editing and searching tasks that would normally take hours could be done in just a few seconds. Yet I did not understand a word when I saw the expressions on the screen. They looked like a strange combination of dots, slashes, stars and some other characters. Still, I was determined to learn how they worked, and I soon found that they are quite easy to use. They follow a few simple syntax rules.

Although regular expressions are quite widespread in the Unix world, there is no such thing as 'the standard regular expression language'. It is more like several different dialects. There are, for example, two types of grep programs: grep and egrep. Both use regular expressions with slightly different capabilities. Perl probably has the most complete set of regular expressions. Fortunately, all of them follow the same principles. Once you understand the basic idea, it is easy to learn the details of the different dialects.

This article introduces the basics. For the details and capabilities of a particular program, look at its manual pages.

A simple example

Let's say you have a phone list of a company and it looks like this:

Phone Name  ID
     ...
     ...
3412    Bob 123
3834  Jonny 333
1248   Kate 634
1423   Tony 567
2567  Peter 435
3567  Alice 535
1548  Kerry 534
     ...

It is a company with 500 people, and the data is kept in a plain ASCII text file. People whose phone number starts with a 1 work in building 1. Who works in building 1?

Regular Expressions can answer that:
grep '^1' phonelist.txt
or
egrep '^1' phonelist.txt
or
perl -ne 'print if (/^1/)' phonelist.txt

In words: search for all lines that start with a one. The "^" matches the beginning of a line, so the expression matches only if the one is the first character of the line.
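The search can be tried without the full company file. The following is a minimal sketch that assumes a POSIX shell; the variable names phonelist and building1 are made up for this illustration, and the sample rows are the ones shown in the list above.

```shell
# Sample rows copied from the article's phone list (the "..." rows are omitted).
phonelist='Phone Name  ID
3412    Bob 123
3834  Jonny 333
1248   Kate 634
1423   Tony 567
2567  Peter 435
3567  Alice 535
1548  Kerry 534'

# Keep only the lines whose first character is a 1 (building 1).
building1=$(printf '%s\n' "$phonelist" | grep '^1')
printf '%s\n' "$building1"
```

With these sample rows the Kate, Tony and Kerry lines are printed.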

The syntax rules

Single-character Patterns

The basic building block of a regular expression is the single-character pattern. It matches just this character. An example of a single-character pattern is the 1 in the example above. It just matches a one in the text.

Another example of single-character patterns is:
egrep 'Kerry' phonelist.txt

This pattern consists only of single-character patterns (the letters K, e, ...).

Characters can be grouped together in a set. A set is written as a pair of square brackets with a list of characters between them. A set as a whole is also a single-character pattern: it matches exactly one character of the text, and that character must be one of those listed. An example:

[abc]    Is a single-character pattern that matches
         either the letter a, b or c
[ab0-9]  Is a single-character pattern that matches
         either a or b or a digit in the ascii range
         from zero to nine
[a-zA-Z0-9\-] This matches a single-character that
              is either an upper case or lower case
              letter, a digit or the minus sign.

Let's try it:
egrep '^1[348]' phonelist.txt

This searches for lines that start with 13 or 14 or 18.
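As a small sketch of the set in action, the same search can be run on a few rows of the list. The variable names are invented for the example, and grep -E is used as the standard spelling of egrep.

```shell
# A few rows from the article's phone list.
phonelist='1248   Kate 634
1423   Tony 567
1548  Kerry 534
3834  Jonny 333'

# A 1 in the first position, then exactly one character out of 3, 4 or 8.
matches=$(printf '%s\n' "$phonelist" | grep -E '^1[348]')
printf '%s\n' "$matches"
```

Only the Tony line (1423) is printed: 1248 and 1548 have a 2 and a 5 in the second position, and 3834 does not start with a 1.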

We have already seen that most ASCII characters match just themselves, but some ASCII characters have a special meaning. The opening square bracket, for example, starts a set, and inside a set the "-" has the special meaning of a range. To take away the special meaning of a special character you can precede it with a backslash. The minus sign in [a-zA-Z0-9\-] is an example of this. Note that this is the Perl way of writing it: in grep and egrep a backslash inside a set is an ordinary character, so there the minus sign is written as the last character of the set instead, as in [a-zA-Z0-9-]. There are also some dialects of the regexp language where special characters start with a backslash. In this case you need to remove the backslash to get the normal meaning.

The dot is an important special character. It matches any single character except the newline character. Example:

grep '^.2' phonelist.txt
 or
egrep '^.2' phonelist.txt

This searches for lines that have a 2 in the second position and any character in the first.
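The sketch below tries the dot on three rows of the list and also shows the backslash taking away the dot's special meaning. The sample strings a.c and abc and all variable names are assumptions made for the illustration.

```shell
phonelist='3412    Bob 123
1248   Kate 634
2567  Peter 435'

# Any character first, then a 2 in the second position.
second_is_2=$(printf '%s\n' "$phonelist" | grep '^.2')
printf '%s\n' "$second_is_2"

# With a backslash the dot loses its special meaning and matches only a real dot.
literal_dot=$(printf '%s\n' 'a.c' 'abc' | grep 'a\.c')
printf '%s\n' "$literal_dot"
```

The first search prints the Kate line, the second prints only a.c.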

Sets can be inverted by starting the set definition with "[^" instead of "[". Here the "^" no longer means the beginning of the line; the combination of "[" and "^" indicates the inverted set.

[0-9]    Is a single-character pattern that matches
         a digit in the ASCII range from
         zero to nine.
[^0-9]   Matches any single NON-digit character.
[^abc]   Matches any single character that is not an
         a, b or c.
.        The dot matches any single character except
         newline. It is the same as [^\n], where \n
         is the newline character.

To match all lines that do NOT start with a 1 we could write:
grep '^[^1]' phonelist.txt
or
egrep '^[^1]' phonelist.txt
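A quick sketch of the inverted set on sample rows from the list (variable names are invented for the example):

```shell
phonelist='Phone Name  ID
3412    Bob 123
1248   Kate 634
2567  Peter 435
1548  Kerry 534'

# First character of the line must be anything except a 1.
not_building1=$(printf '%s\n' "$phonelist" | grep '^[^1]')
printf '%s\n' "$not_building1"
```

The header line is printed as well, because it starts with a P. An empty line would not be printed: the inverted set is still a single-character pattern and needs one character to match.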

Anchors

We already saw the "^" in the previous part, where it matched the beginning of a line. Anchors are special regexp characters that match a position in the text rather than a character of the text.

^  Match the beginning of a line
$  Match the end of a line

To look for people with the company ID number 567 in our phonelist.txt we would use:

egrep '567$' phonelist.txt

This looks for lines with the number 567 at the end of the line.
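The anchor matters here because 567 also occurs inside some phone numbers. The sketch below compares the search with and without the "$" on three rows from the list; the variable names are made up for the illustration.

```shell
phonelist='1423   Tony 567
2567  Peter 435
3567  Alice 535'

# Without the anchor, 567 is also found inside the phone numbers 2567 and 3567.
anywhere=$(printf '%s\n' "$phonelist" | grep -c '567')

# With the anchor, only the line that ends in 567 is left.
at_end=$(printf '%s\n' "$phonelist" | grep '567$')

printf '%s\n' "$anywhere" "$at_end"
```

The unanchored search counts all three lines, while the anchored one returns only the Tony line.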

Multipliers

A multiplier determines how often a single-character pattern must occur in the text.

description	grep	egrep	perl	vi	vim	vile	elvis	emacs
zero or more times	*	*	*	*	*	*	*	*
one or more times	\{1,\}	+	+		\+	\+	\+	+
zero or one time	\?	?	?		\=	\?	\=	?
n to m times	\{n,m\}		{n,m}				\{n,m\}

Note: The vi variants must have the magic option set to work as shown above.

An example from the phone list:

....
1248   Kate 634
....
1548  Kerry 534
....

To match a line that starts with a 1, has some digits, at least one space and a name that starts with a K we can write:

grep '^1[0-9]\{1,\} \{1,\}K' phonelist.txt
or use * and repeat [0-9] and space:
grep '^1[0-9][0-9]*  *K' phonelist.txt
or
egrep '^1[0-9]+ +K' phonelist.txt
or
perl -ne 'print if (/^1[0-9]+ +K/)' phonelist.txt

The multiplier applies to the single-character pattern that precedes it. So "23*4" does NOT mean "2, then 3, then anything, then 4" (that would be "23.*4"). It means "one 2, then zero or more 3s, then one 4".
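The difference between the two patterns can be checked on a few made-up strings. This is only a sketch; the sample strings and variable names are assumptions chosen for the illustration.

```shell
samples='24
234
23334
2354'

# 23*4: a 2, then zero or more 3s, then a 4.
star=$(printf '%s\n' "$samples" | grep '23*4')

# 23.*4: a 2, a 3, then any run of characters, then a 4.
dot_star=$(printf '%s\n' "$samples" | grep '23.*4')

printf 'star:\n%s\ndot_star:\n%s\n' "$star" "$dot_star"
```

The string 24 matches only the first pattern (zero 3s are allowed), and 2354 matches only the second (the 5 is covered by the dot).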

It is also important to note that these multipliers are greedy. This means that the first multiplier in the pattern extends the match as far to the right as possible.

The expression ^1.*4
would match the whole line
1548  Kerry 534
from the start up to the very last 4.
It does NOT match only the 154.

This does not make much difference for grep but is important for editing and substitution.
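Greediness becomes visible as soon as the matched text is replaced. The sketch below uses sed, which the introduction lists among the regexp tools, as an assumed stand-in for an editor; the replacement text X and the variable names are made up for the example.

```shell
line='1548  Kerry 534'

# Greedy: .* runs on to the last 4 of the line, so the whole line is replaced.
greedy=$(printf '%s\n' "$line" | sed 's/^1.*4/X/')

# A set that excludes 4 cannot run past the first 4, so only 154 is replaced.
limited=$(printf '%s\n' "$line" | sed 's/^1[^4]*4/X/')

printf '%s\n' "$greedy" "$limited"
```

The first substitution leaves just X, the second leaves X8  Kerry 534. Using an inverted set instead of the dot is a common way to keep a multiplier from running too far.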

Parentheses as Memory

Putting part of an expression in parentheses does not change the way the expression matches. Instead, it causes the text matched by the enclosed part to be remembered, so that it can be referred to later in the expression.

The remembered parts are available via variables. The text remembered by the first pair of parentheses is available in variable one, that of the second pair in variable two, and so on.

program name	parentheses syntax	variable syntax
grep	\(\)	\1
egrep	()	\1
perl	()	\1 or ${1}
vi,vim,vile,elvis	\(\)	\1
emacs	\(\)	\1

Example:

The expression [a-z][a-z] would
match two lower case letters.

Now we can use these variables to search for patterns like the text 'otto':

egrep '([a-z])([a-z])\2\1'

The variable \1 contains the letter o
and \2 the letter t.

The expression would also match the name
anna but not yxyx.

Parentheses as Memory constructs are used not so much for finding names like otto and anna as for editing and substitution.
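The otto example can be tried on a short word list. This sketch uses the plain grep spelling with backslashed parentheses; the word list and the variable names are assumptions made for the illustration.

```shell
words='otto
anna
yxyx'

# grep spelling: \( \) remembers a part, \1 and \2 refer back to it.
# Two letters, then the second one again, then the first one again.
mirrored=$(printf '%s\n' "$words" | grep '\([a-z]\)\([a-z]\)\2\1')
printf '%s\n' "$mirrored"
```

otto and anna are printed; yxyx is not, because its third letter is not a repeat of the second.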

Using Regular Expression for text editing

To do editing you need an editor like vi or emacs, or you can use e.g. perl.

In emacs you use M-x query-replace-regexp, or you can bind the query-replace-regexp command to a function key. Alternatively, you can use the command replace-regexp. query-replace-regexp is interactive, replace-regexp is not.

In vi the substitution command :%s/ / /gc is used. The percent sign refers to the ex range 'whole file' and can be replaced by any appropriate range. In vim, for example, you can type shift-v, mark an area and then apply the substitution to that area only. I will not explain more about vim here, as this would be a tutorial of its own. The 'gc' version is interactive. The non-interactive version is :%s/ / /g

Interactive means that you are prompted at each match on whether or not to execute the substitution.

In perl you can use

perl -pe 's/ / /g'

Let's look at a few examples. The numbering plan in our company has changed, and all phone numbers that start with a 1 get a 2 inserted after the second digit.

This means that e.g. 1423 should become 14223.

The old list:

Phone Name  ID
     ...
3412    Bob 123
3834  Jonny 333
1248   Kate 634
1423   Tony 567
2567  Peter 435
3567  Alice 535
1548  Kerry 534
     ...

Here is how to do it:

vi:    s/^\(1.\)/\12/g
emacs: ^\(1.\)   replaced by  \12
perl:  perl -pe 's/^(1.)/${1}2/g' phonelist.txt

Now the new phone list looks like this:

Phone Name  ID
     ...
3412    Bob 123
3834  Jonny 333
12248   Kate 634
14223   Tony 567
2567  Peter 435
3567  Alice 535
15248  Kerry 534
     ...

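Perl can handle more than only the memory variables \1 to \9, so in Perl a 1 that is directly followed by a 2 is not read as variable one followed by the digit 2. Written as $12, for example, it would refer to the 12th variable, which is of course empty. To solve this we just use ${1}.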
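For a quick non-interactive check, the renumbering can also be sketched with sed. Using sed is an assumption of this example rather than one of the article's three variants; its s command takes the same pattern as the vi line above, and since sed only knows \1 to \9, the \12 is read as variable one followed by a 2. The variable names are invented.

```shell
oldlist='Phone Name  ID
3412    Bob 123
1248   Kate 634
1423   Tony 567
2567  Peter 435
1548  Kerry 534'

# Remember the leading 1 and the digit after it, then put both back
# followed by the new digit 2.
newlist=$(printf '%s\n' "$oldlist" | sed 's/^\(1.\)/\12/')
printf '%s\n' "$newlist"
```

Only the three numbers that start with a 1 change: 1248, 1423 and 1548 become 12248, 14223 and 15248.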

Now the alignment in the list is a bit disturbed. How can you fix it? You could just test whether there is a space in the 5th position and insert another one:

vi:     s/^\(....\) /\1  /g
emacs:  '^\(....\) '  replaced by  '\1  '
perl:   perl -pe 's/^(....) /${1}  /g' phonelist.txt

Now the phone list looks like this:

Phone Name  ID
      ...
3412     Bob 123
3834   Jonny 333
12248   Kate 634
14223   Tony 567
2567   Peter 435
3567   Alice 535
15248  Kerry 534
      ...
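The alignment fix can be checked on three rows of the renumbered list. As before, sed is an assumed stand-in that accepts the same pattern as the vi line, and the variable names are made up.

```shell
newlist='3412    Bob 123
12248   Kate 634
2567  Peter 435'

# Four arbitrary characters followed by a space: put the four characters
# back and follow them with two spaces instead of one.
aligned=$(printf '%s\n' "$newlist" | sed 's/^\(....\) /\1  /')
printf '%s\n' "$aligned"
```

The line with the five-digit number is left alone, because its fifth character is a digit and not a space.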

A colleague has manually edited the list and accidentally inserted some spaces at the beginning of some lines. How can we remove them?

Phone Name  ID
      ...
3412     Bob 123
     3834   Jonny 333
12248   Kate 634
14223   Tony 567
 2567   Peter 435
3567   Alice 535
  15248  Kerry 534
      ...

This should correct it:
vi:     s/^  *//  (there are 2 spaces because we do not have a +)
emacs:  '^ +'  replaced by the empty string
perl:   perl -pe 's/^ +//' phonelist.txt
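A sketch of the cleanup on three of the damaged rows, again with sed as an assumed stand-in for the vi variant (sed's basic syntax has no + either, hence the two spaces). The variable names are invented for the example.

```shell
edited='     3834   Jonny 333
12248   Kate 634
 2567   Peter 435'

# One space followed by zero or more spaces at the start of the line,
# replaced by nothing.
cleaned=$(printf '%s\n' "$edited" | sed 's/^  *//')
printf '%s\n' "$cleaned"
```

Spaces between the columns stay untouched, because the "^" ties the match to the beginning of the line.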

You are writing a program and you have the variables temp and temporary. Now you would like to rename the variable temp to counter. If the string temp is simply replaced with counter, then temporary becomes counterorary, which is not really what you want.

Regular expressions can do it. Just replace temp([^o]) with counter\1. That is, temp followed by a character that is not the letter o; the \1 puts that character back. (An alternative solution would be to use word boundaries, but we have not discussed this kind of anchoring pattern.)
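A sketch of the rename on a single made-up line of program text, using sed as an assumed tool and therefore the backslashed parentheses of the basic syntax:

```shell
code='temp = temporary + temp;'

# temp followed by one character that is not an o; \1 puts that character back.
renamed=$(printf '%s\n' "$code" | sed 's/temp\([^o]\)/counter\1/g')
printf '%s\n' "$renamed"
```

Both uses of temp are renamed and temporary is left alone. A temp at the very end of a line would be missed, because the inverted set needs one more character to match; this is one reason why word boundaries are the more robust solution.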

I hope that this article has caught your interest. Now you might want to have a look at the man pages and documentation of your favorite editor and learn the details.

There are also more special characters, e.g. the alternation, which is a kind of "or", and the word boundaries mentioned above.

Have fun, happy editing.

2002-10-22, generated by lfparser version 2.32