Regular Expressions

DataStore®DSX uses regular expressions to define the Content fields used in Data Definitions and Search Templates.

Each defined Position can be used to find one match. When a match is found, the rest of the line(s) is ignored.

The regular expression anchor \G is not supported.

The choice between * and + affects what is found. For example, part of a bank statement may look like:

01/01/88 05227 PREM	DP000001K	DATASURE (ACCOUNT HOLDER) LTD	234,041.09
01/01/88 05228 PREM	DP000001K	DATASURE (ACCOUNT HOLDER) LTD	234,041.09
01/01/88 02745 PREM	ZZ000000Z	ASSURED DESCRIPTION	5,000.25
01/01/88 02754 PREM	ZZ000000Z	ASSURED DESCRIPTION	5,000.25
01/01/88 02757 CASH	ZZ000000Z	TEST2 - IM/270691/1040 - SRP	5,000.25
01/01/88 02768 CASH	ZZ000000Z	ASSURED DESC	333.34

The regular expression \s*\w*, applied to each of the lines shown, matches only the 01 at the start of the line:

01/01/88 05227 PREM	DP000001K	DATASURE (ACCOUNT HOLDER) LTD	234,041.09
01/01/88 05228 PREM	DP000001K	DATASURE (ACCOUNT HOLDER) LTD	234,041.09
01/01/88 02745 PREM	ZZ000000Z	ASSURED DESCRIPTION	5,000.25
01/01/88 02754 PREM	ZZ000000Z	ASSURED DESCRIPTION	5,000.25
01/01/88 02757 CASH	ZZ000000Z	TEST2 - IM/270691/1040 - SRP	5,000.25
01/01/88 02768 CASH	ZZ000000Z	ASSURED DESC	333.34

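This is because \s* means “zero or more whitespace characters”, so the match can begin at the very start of the line, where \w* matches 01 and stops at the /.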

The regular expression \s+\w*, applied to each of the lines shown, matches the first space and the word that follows it (for example, “ 05227” in the first line):

01/01/88 05227 PREM	DP000001K	DATASURE (ACCOUNT HOLDER) LTD	234,041.09
01/01/88 05228 PREM	DP000001K	DATASURE (ACCOUNT HOLDER) LTD	234,041.09
01/01/88 02745 PREM	ZZ000000Z	ASSURED DESCRIPTION	5,000.25
01/01/88 02754 PREM	ZZ000000Z	ASSURED DESCRIPTION	5,000.25
01/01/88 02757 CASH	ZZ000000Z	TEST2 - IM/270691/1040 - SRP	5,000.25
01/01/88 02768 CASH	ZZ000000Z	ASSURED DESC	333.34

This is because \s+ means “one or more whitespace characters”, so the match cannot begin until the first space in the line.

When the result is returned, the whitespace is removed, so only “01” or “05227”, for example, is stored.
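The difference between the two expressions can be reproduced with a short sketch. It uses Python's re module as a stand-in, which is an assumption: the DataStore®DSX engine is not Python and may differ in detail. The helper first_match is hypothetical and only mimics "one match per Position, then strip the whitespace".

```python
import re

# Two lines from the sample statement; fields after the type are tab-separated.
lines = [
    "01/01/88 05227 PREM\tDP000001K\tDATASURE (ACCOUNT HOLDER) LTD\t234,041.09",
    "01/01/88 02768 CASH\tZZ000000Z\tASSURED DESC\t333.34",
]

def first_match(pattern, line):
    """Hypothetical helper: first match only, surrounding whitespace removed."""
    m = re.search(pattern, line)
    return m.group(0).strip() if m else None

# \s* may match nothing, so the match starts at the beginning of the line.
star = [first_match(r"\s*\w*", line) for line in lines]

# \s+ needs at least one whitespace character, so the match starts at the first space.
plus = [first_match(r"\s+\w*", line) for line in lines]

print(star)  # ['01', '01']
print(plus)  # ['05227', '02768']
```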

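The DataStore®DSX regular expression engine works in single-line mode. This means that some anchors that differ in multi-line mode are equivalent here: for example, ^ and \A both match only at the start of the string, and $, \Z and \z only at the end (see Table 207).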

The supported characters are described in Table 204.

Table 204. Regular Expressions: Supported Characters

Character	Description	Example
Any character except [\^$.|?*+()	All characters except the listed special characters match a single instance of themselves. { and } are literal characters, unless they are part of a valid regular expression token (e.g. the {n} quantifier, see “{n} where n is an integer >= 1”).	account matches account
\ (backslash) followed by any of [\^$.|?*+(){}	A backslash escapes special characters to suppress their special meaning.	\+ matches +; \.txt matches .txt; \\ matches \
\t	Matches a tab character. Can be used in character classes.	\t matches a tab character
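The escaping rules in Table 204 can be tried out with the sketch below. Python's re module is assumed as a stand-in for the DataStore®DSX engine, and the sample strings are made up for illustration.

```python
import re

# Ordinary characters match themselves.
literal = re.search(r"account", "bank account no.").group(0)

# A backslash suppresses the special meaning of +, . and \ itself.
plus_sign = re.search(r"\+", "1+2").group(0)
dot_txt = re.search(r"\.txt", "report.txt").group(0)
backslash = re.search(r"\\", "C:\\temp").group(0)

# Without the backslash, the dot matches any character, not only a literal dot.
unescaped = re.search(r".txt", "report_txt").group(0)
escaped_miss = re.search(r"\.txt", "report_txt")

# \t finds the tab that separates two fields.
tab_at = re.search(r"\t", "PREM\tDP000001K").start()

print(literal, plus_sign, dot_txt, backslash, unescaped, escaped_miss, tab_at)
```

Forgetting to escape the dot is the usual pitfall: the unescaped pattern still matches, but it also accepts strings that contain no dot at all.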

The supported character classes are described in Table 205.

Table 205. Regular Expressions: Supported Character Classes

Character	Description	Example
[ (opening square bracket)	Starts a character class. A character class matches a single character out of all the possibilities offered by the character class. Inside a character class, different rules apply. The rules in this section are only valid inside character classes. Note: The rules outside this table are not valid in character classes.	
Any character except ^-]\	Adds that character to the possible matches for the character class.	[abc] matches a, b or c
\ (backslash) followed by any of ^-]\	A backslash escapes special characters to suppress their special meaning.	[\^\]] matches ^ or ]
- (hyphen) except immediately after the opening [	Specifies a range of characters. Note: Specifies a hyphen if placed immediately after the opening [	[a-z] matches any lowercase letter; [a-zA-Z0-9] matches any letter or digit
^ (caret) immediately after the opening [	Negates the character class, causing it to match a single character not listed in the character class. Note: Specifies a caret if placed anywhere except after the opening [	[^a-d] matches any character except a, b, c or d
\d, \w and \s	Shorthand character classes matching digits, word characters (letters, digits and underscores) and whitespace (spaces, tabs and line breaks). Can be used inside and outside character classes.	[\d\s] matches a character that is a digit or whitespace; [\w] matches a character that is a letter, digit or underscore
\D, \W and \S	Negated versions of the above. Should be used only outside character classes. (Can be used inside, but that is confusing.)	\D matches a character that is not a digit
. (dot)	Matches any single character except line break characters \r and \n.	. matches x or (almost) any other character
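A few of the character class rules in Table 205 are sketched below, again assuming Python's re module as a stand-in for the DataStore®DSX engine. The helper first and the sample strings are hypothetical.

```python
import re

def first(pattern, text):
    """Hypothetical helper: return the first match, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

one_of = first(r"[abc]", "zebra")        # first a, b or c
in_range = first(r"[a-z]", "ABCdef")     # first lowercase letter
negated = first(r"[^a-d]", "abcdef")     # first character that is not a-d
escaped = first(r"[\^\]]", "x]y")        # escaped ^ and ] are literals
hyphen = first(r"[-a]", "x-y")           # hyphen right after [ is a literal
shorthand = first(r"[\d\s]", "ab 1")     # digit or whitespace
non_digit = first(r"\D", "42x")          # anything that is not a digit
dot = first(r".", "\nx")                 # dot skips the line break

print(one_of, in_range, negated, escaped, hyphen, repr(shorthand), non_digit, dot)
```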

The supported quantifiers are described in Table 206.

Table 206. Regular Expressions: Supported Quantifiers

Character	Description	Example
? (question mark)	Makes the preceding item optional. Greedy, so the optional item is included in the match if possible.	abc? tries to match abc but if that fails, tries to match ab
??	Makes the preceding item optional. Lazy, so the optional item is excluded from the match if possible. This construct is often excluded from documentation because of its limited use.	abc?? matches ab or abc
* (star)	Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is not matched at all.	“.*” matches “” “def” “ghi” in abc “” “def” “ghi” jkl
*? (lazy star)	Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item.	“.*?” matches “def” in abc 22 “def” “ghi” jkl
+ (plus)	Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once.	“.+” matches “def” “ghi” in abc “def” “ghi” jkl
+? (lazy plus)	Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.	“.+?” matches “def” in abc “def” “ghi” jkl
{n} where n is an integer >= 1	Repeats the previous item exactly n times.	a{3} matches aaa
{n,m} where n >= 0 and m >= n	Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times.	a{2,4} matches aaaa, aaa or aa
{n,m}? where n >= 0 and m >= n	Repeats the previous item between n and m times. Lazy, so repeating n times is tried before increasing the repetition to m times.	a{2,4}? matches aa, aaa or aaaa. Because it is lazy, it first tries to match aa and stops if that succeeds; otherwise it tries aaa, and so on. ^a{2,4}?[a-z]$ matches aaaac, because ^ requires the 2 to 4 a’s to be at the start of the string and [a-z]$ requires the final letter to be at the end of the string, so the quantifier is forced to expand to four a’s.
{n,} where n >= 0	Repeats the previous item at least n times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only n times.	a{2,} matches aaaaa in aaaaa
{n,}? where n >= 0	Repeats the previous item n or more times. Lazy, so the engine first matches the previous item n times, before trying permutations with ever increasing matches of the preceding item.	a{2,}? matches aa in aaaaa. Because it is lazy, it tries to match aa and stops if that succeeds.
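The greedy and lazy behaviour described in Table 206 can be checked with the following sketch. It assumes Python's re module as a stand-in for the DataStore®DSX engine and uses straight quotes in the sample text for simplicity.

```python
import re

def first(pattern, text):
    """Hypothetical helper: return the first match, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

text = 'abc "def" "ghi" jkl'

greedy_star = first(r'".*"', text)    # runs to the last quote
lazy_star = first(r'".*?"', text)     # stops at the next quote
greedy_plus = first(r'".+"', text)
lazy_plus = first(r'".+?"', text)

optional_greedy = first(r"abc?", "abc")   # keeps the optional c
optional_lazy = first(r"abc??", "abc")    # drops the optional c

greedy_range = first(r"a{2,4}", "aaaaa")   # as many as allowed
lazy_range = first(r"a{2,4}?", "aaaaa")    # as few as allowed
greedy_open = first(r"a{2,}", "aaaaa")
lazy_open = first(r"a{2,}?", "aaaaa")

# The anchors force the lazy quantifier to expand until the whole string fits.
anchored = first(r"^a{2,4}?[a-z]$", "aaaac")

print(greedy_star, lazy_star, greedy_range, lazy_range, anchored)
```

A lazy quantifier only prefers the shortest repetition; as the anchored case shows, it still expands when the rest of the expression cannot match otherwise.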

The supported boundaries and the alternation operator are described in Table 207.

Table 207. Regular Expressions: Supported Boundaries And Alternation

Character	Description	Example
^ (caret) or \A (equivalent in single-line mode)	Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character.	^. matches a in abc; \A. matches a in abc
$ (dollar) or \Z or \z (equivalent in single-line mode)	Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character.	.$ matches c in abc; .\Z matches c in abc; .\z matches c in abc
\b	Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W) as well as at the start and/or end of the string if the first and/or last characters in the string are word characters.	.\b matches c in abc def; .\b matches _ in abc123_+:lj9: (as _ is a word character)
\B	Matches at the position between two word characters (i.e. the position between \w\w) as well as at the position between two non-word characters (i.e. \W\W).	.\B matches a in abc def
| (pipe)	Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of options. The pipe has the lowest precedence of all operators. Use grouping to alternate only part of the regular expression.	abc|def|xyz matches abc, def or xyz; abc(def|xyz) matches abcdef or abcxyz
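The anchors, word boundaries and alternation in Table 207 are sketched below. Python's re module is assumed as a stand-in for the DataStore®DSX engine; \Z and \z are left out because their handling differs between regex flavours.

```python
import re

def first(pattern, text):
    """Hypothetical helper: return the first match, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

start_caret = first(r"^.", "abc")    # first character of the string
start_a = first(r"\A.", "abc")       # same position as ^ here
end_dollar = first(r".$", "abc")     # last character of the string

# \b sits between a word character and a non-word character.
word_end = first(r".\b", "abc def")
underscore = first(r".\b", "abc123_+:lj9:")   # _ counts as a word character

# \B sits between two word characters (or two non-word characters).
inside_word = first(r".\B", "abc def")

# The pipe has the lowest precedence, so grouping limits what is alternated.
whole = [bool(re.fullmatch(r"abc|def|xyz", s)) for s in ("abc", "def", "abcdef")]
grouped = [bool(re.fullmatch(r"abc(def|xyz)", s)) for s in ("abcdef", "abcxyz", "abc")]

print(start_caret, start_a, end_dollar, word_end, underscore, inside_word)
print(whole, grouped)
```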