Regular Expressions (RegEx) Quick Start Tutorial

Most people are familiar with the MS-DOS wildcards "*" and "?": "*" means "match zero or more characters" and "?" means "match any single character." Some tools add "#" to match a single digit.

RegEx can be thought of as MS-DOS wildcards on steroids. Lots of steroids! The same principle applies: there are certain keywords or operators that have a particular meaning in how they'll be used to match up with whatever they're being tested against. Everything else matches exactly what is typed.

 

Operators

Here's a quick list of the operators that are understood by RegEx as used in the Content Filter DLL:

Operator   Description
.    Matches any single character
^   Matches the beginning of a line
$   Matches the end of a line
\b   Matches a word boundary
\B   Matches within a word
\w   Matches any character that is word-constituent
\W   Matches any character that is not word-constituent
\<   Matches the beginning of a word
\>   Matches the end of a word
\.   Matches a period character
\^   Matches a circumflex character
\$   Matches a dollar sign character
\[   Matches an opening bracket character
\]   Matches a closing bracket character
\(   Matches an opening parenthesis character
\)   Matches a closing parenthesis character
\*   Matches an asterisk character
\+   Matches a plus sign character
\?   Matches a question mark character
\\   Matches a backslash character
[abc]   Lists characters that are acceptable matches in that single character position (more on this later)
[^abc]   Lists characters that are not acceptable matches in that single character position (more on this later)
(xxx)   Specifies a group of operators (more on this later)
\digit   Back-reference operator (more on this later)
*   Means "zero or more" of the previous operator
+   Means "one or more" of the previous operator
?   Means "zero or one" of the previous operator
{a,b}   Means "between a and b occurrences" of the previous operator
{a,}   Means "a or more occurrences" of the previous operator
{,b}   Means "between zero and b occurrences" of the previous operator
\|   Means "or" and is only valid within a grouping of operators (see '(xxx)' above)
character   Any other character not listed above matches itself
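
The repetition operators in the table can be tried out with a short sketch. It uses Python's `re` module rather than the Content Filter DLL, on the assumption that these particular operators (*, +, ? and {}) behave the same way in both; the `matches` helper is a hypothetical name introduced only for this illustration.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


# "*": zero or more of the previous operator
assert matches(r"ab*", "a")
assert matches(r"ab*", "abbb")

# "+": one or more of the previous operator
assert not matches(r"ab+", "a")
assert matches(r"ab+", "ab")

# "?": zero or one of the previous operator
assert matches(r"ab?", "ab")
assert not matches(r"ab?", "abb")

# "{a,b}", "{a,}" and "{,b}": explicit occurrence counts
assert matches(r"ab{2,3}", "abbb")
assert not matches(r"ab{2,3}", "ab")
assert matches(r"ab{2,}", "abbbbb")
assert matches(r"ab{,2}", "a")
assert not matches(r"ab{,2}", "abbb")
```

Each repetition operator applies only to the operator immediately before it, which is why only the "b" is repeated in these patterns.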

The backslash is a special operator. Combined with certain characters it forms an operator of its own (such as "\b" or "\w"). Placed before a single-character operator, it "escapes" that operator so the character is matched literally instead of keeping its special meaning. Placed before a normal character that isn't an operator, it is interpreted as just that character without the backslash.
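
As a rough illustration of escaping, here is a sketch in Python's `re` module. It is not the Content Filter DLL's own engine, and Python differs in places (for example, it rejects a backslash before most ordinary letters), so treat it only as a demonstration of the idea. The sample strings are made up.

```python
import re

# Unescaped, "." is an operator: it matches any single character.
assert re.search(r"3.14", "3x14") is not None

# Escaped, "\." matches only a literal period.
assert re.search(r"3\.14", "3x14") is None
assert re.search(r"3\.14", "pi is 3.14") is not None

# re.escape() adds the backslashes needed to match a string literally.
literal = "cost (USD): $5.00*"
pattern = re.escape(literal)
assert re.fullmatch(pattern, literal) is not None
```

Escaping by hand is fine for one or two characters; a helper like `re.escape` is less error-prone when the text to match contains many operator characters.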

 

List Operator [abc]

The List Operator defines a list of characters or character set. If a character being tested is in this list, it matches. This is often used in combination with the *, +, ? or {} operators, so you can test multiple occurrences of any character in the character set.

If you place the "^" character in the first position, the list becomes everything except the characters specified. The "^" character anywhere else in the list is a match for that character itself.

Within a list, you can specify ranges of characters using the "-" character. For example, "a-z" would be all the lower case letters a through z. If you want a dash itself to be part of the list, it must be the first character in the list (or second if you use the "^" option), otherwise it has the special meaning of a range operator.

There are special operators that specify predefined groups of characters that can be used in a list:

Operator   Description
[:alnum:]    Letters and digits
[:alpha:]   Letters
[:blank:]   A space or tab
[:cntrl:]   Control characters like ASCII 0177 and codes less than ASCII 040.
[:digit:]   Digits
[:graph:]   Printable characters except a space
[:lower:]   Lower case characters
[:print:]   Printable characters (ASCII 040 through ASCII 0176)
[:punct:]   Neither control nor alphanumeric characters
[:space:]   Space, tab, carriage return, newline, vertical tab, form feed
[:upper:]   Upper case characters
[:xdigit:]   Hexadecimal digit 0-9, a-f, A-F

For example:

[abc]+

Matches things like:

ababababababa
abcabcaaaabcbaba
aaaaaaaa

Other examples:

[a-z1-9()]
[aeiou0123456789]
["']
[[:blank:]]
[[:graph:][:lower:]]
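
The list operator can be explored with a small sketch in Python's `re` module. This is an assumption-laden stand-in for the Content Filter DLL: Python does not support the predefined groups such as [:blank:] or [:xdigit:], so the sketch spells out equivalent lists by hand. All variable names are illustrative.

```python
import re

one_or_more_abc = re.compile(r"[abc]+")
for text in ["ababababababa", "abcabcaaaabcbaba", "aaaaaaaa"]:
    assert one_or_more_abc.fullmatch(text)

# "abd" is rejected as a whole because "d" is not in the list.
assert one_or_more_abc.fullmatch("abd") is None

# "^" in the first position turns the list into "everything except".
not_vowel = re.compile(r"[^aeiou]")
assert not_vowel.fullmatch("x")
assert not_vowel.fullmatch("e") is None

# A dash in the first position is a literal dash, not a range operator.
dash_or_digit = re.compile(r"[-0-9]+")
assert dash_or_digit.fullmatch("555-1234")

# Hand-written stand-ins for [:blank:] and [:xdigit:] (Python lacks both).
blank = re.compile(r"[ \t]")
xdigit = re.compile(r"[0-9a-fA-F]+")
assert blank.fullmatch("\t")
assert xdigit.fullmatch("7fA0")
```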
 

Grouping Operator (xxx)

The grouping operator logically combines operators so they're treated as if they were a single unit or operator. This is useful because the group can then be matched repeatedly with operators like *, +, ? and {}.

For example:

(abc)+

Would match things like:

abc
abcabcabcabcabcabc

Within the grouping operator, you can define multiple alternate groups using the "\|" operator. This is often used to provide varying match alternatives to a larger expression without having to duplicate the entire expression. For example:

the lazy brown (fox\|dog\|cat) jumped over the moon

Would match:

the lazy brown fox jumped over the moon
the lazy brown dog jumped over the moon
the lazy brown cat jumped over the moon

In addition, because the group is treated as a single unit, you can use operators like *, +, ?, and {} with it, such as:

the lazy brown (fox\|dog\|cat)+ jumped over the moon

This would match things like:

the lazy brown fox jumped over the moon
the lazy brown dogfox jumped over the moon
the lazy brown catcatcatcat jumped over the moon

The example itself probably isn't a useful one, but the concept is.
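
The grouping examples can be checked with a sketch in Python's `re` module. Note one syntax difference: Python writes the "or" operator as a bare "|", whereas the dialect described here writes it as "\|". Everything else in the pattern is assumed to carry over unchanged.

```python
import re

# Python spells alternation "|" where this document's dialect uses "\|".
pattern = re.compile(r"the lazy brown (fox|dog|cat)+ jumped over the moon")

assert pattern.fullmatch("the lazy brown fox jumped over the moon")
assert pattern.fullmatch("the lazy brown dogfox jumped over the moon")
assert pattern.fullmatch("the lazy brown catcatcatcat jumped over the moon")

# "+" needs at least one animal, so leaving the group out fails.
assert pattern.fullmatch("the lazy brown  jumped over the moon") is None

# A repeated group is one unit: (abc)+ matches whole copies of "abc" only.
repeated = re.compile(r"(abc)+")
assert repeated.fullmatch("abcabcabc")
assert repeated.fullmatch("abcab") is None
```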

 

Back-reference operator \digit

The back-reference operator takes some time to get used to. It's used following a grouping operator. It allows you to "reach back" to what a preceding grouping operator matched and use that text to match something again.

For a simple example:

my (boat\|ship) is better than (his\|her) \1

Would match:

my boat is better than his boat
my boat is better than her boat
my ship is better than his ship
my ship is better than her ship

But would not match:

my ship is better than her boat
my ship is better than his boat
my boat is better than her ship
my boat is better than his ship

The "\1" reaches back to the first group, (boat\|ship), and uses what that group matched. Groups are numbered from left to right, so "\2" would instead repeat what (his\|her) matched.

It's a complex operator, and it takes some practice to become accustomed to how it works. You should refer to more detailed Regular Expression tutorials and texts for other examples.
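
A sketch of the boat/ship idea in Python's `re` module may help. In Python, groups are numbered from the left, so "\1" repeats whatever the first group matched and "\2" whatever the second matched; alternation is written "|" rather than "\|". The doubled-word pattern at the end is an extra illustration that does not come from the Content Filter documentation.

```python
import re

# "\1" must repeat exactly what the first group, (boat|ship), matched.
pattern = re.compile(r"my (boat|ship) is better than (his|her) \1")

assert pattern.fullmatch("my boat is better than his boat")
assert pattern.fullmatch("my ship is better than her ship")
assert pattern.fullmatch("my ship is better than her boat") is None

m = pattern.fullmatch("my ship is better than her ship")
assert m.group(1) == "ship" and m.group(2) == "her"

# A common practical use: find an accidentally doubled word.
doubled = re.compile(r"\b(\w+) \1\b")
assert doubled.search("this is is a typo").group(1) == "is"
assert doubled.search("this is a test") is None
```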

 

Common Pitfalls

Because there are so many operators, it's easy to misuse them by accident.

Remember that using a period alone matches any character. If you want to match a period, use "\.".

Unlike MS-DOS wildcards, a "*" alone doesn't match any number of characters. You need to combine it with another operator, as in ".*", which means zero or more of any character. A "*" on its own really says "zero or more of the character prior to the *".

Expressions designed to match short words are also in danger of matching larger words that use that smaller word. For example "junction" would match "conjunction". Use the word boundary operator "\b" to ensure you're getting only that word: "\bjunction\b".

Be careful with the "*" operator. It means zero or more of the previous operator. An expression made of a single operator like "a*" will match everything, because "zero or more of the character a" is satisfied even when there is no "a" at all. You probably want the "+" operator, meaning "one or more", or an explicit minimum and maximum using the {} operator, like "a{1,10}", which means "at least 1 and no more than 10 of the character a".
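
These pitfalls are easy to reproduce. The sketch below uses Python's `re` module as a stand-in for the Content Filter DLL, assuming the operators involved (".", "\b", "*", "+" and "{}") behave the same way; the sample strings are made up.

```python
import re

# Pitfall: an unescaped period matches any character.
assert re.search(r"example.com", "example-com")
assert re.search(r"example\.com", "example-com") is None

# Pitfall: without \b, a short word also matches inside longer words.
assert re.search(r"junction", "conjunction")
assert re.search(r"\bjunction\b", "conjunction") is None
assert re.search(r"\bjunction\b", "at the junction ahead")

# Pitfall: "a*" succeeds on any input because zero occurrences is a match.
empty_match = re.search(r"a*", "xyz")
assert empty_match is not None and empty_match.group(0) == ""

# "+" and "{1,10}" both require at least one "a".
assert re.search(r"a+", "xyz") is None
assert re.fullmatch(r"a{1,10}", "aaa")
assert re.fullmatch(r"a{1,10}", "a" * 11) is None
```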

 

Further Reading

You can review this Microsoft Word document for a longer description of Regular Expressions. There are also quite a few published books on using Regular Expressions that you can find in your local bookstore.

 
