Regex Test bed

Gathered to help me learn; I thought it might help others too. Add your own test strings to the embedded <dp:testdata> element, and your regexes to the template for <l>. Source here. Run the stylesheet against itself; no input document is needed.
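For anyone who wants to try the same idea outside XSLT, here is a minimal sketch of such a test bed in Python. It is not the stylesheet itself: the names CASES, run_case and run_all are made up for illustration, it assumes Python's re module rather than the XSLT 2.0 engine, and the three cases are copied from the listing below.

```python
import re

# Hypothetical case table: (title, string searched, regex).
# The cases are copied from this page; the engine is Python's re module,
# which is an assumption and not the XSLT 2.0 engine used for the listing.
CASES = [
    ("Unanchored match",
     "UnwantedRubbishwith Wanted Text xyzaAndLotsMore, then more rubbish",
     r"(Wanted Text)"),
    ("Alternates", "Alternatives, nappy :-)", r"(diaper|nappy)"),
    ("Anchored text", "Last word checks?", r"([\w]+)\?$"),
]


def run_case(title, text, pattern):
    """Return the report lines for one case, in the layout used on this page."""
    lines = [title,
             f"String searched is: [{text}]",
             f"Regex is: [{pattern}]"]
    found = re.search(pattern, text)
    if found is None:
        lines.append("No match")
    else:
        # One line per capturing group, numbered from 1.
        for number, group in enumerate(found.groups(), start=1):
            lines.append(f"Match No. {number} is [{group}]")
    return lines


def run_all(cases=CASES):
    """Run every case and return the whole report as a list of lines."""
    report = []
    for case in cases:
        report.extend(run_case(*case))
        report.append("")  # blank line between cases
    return report


if __name__ == "__main__":
    print("\n".join(run_all()))
```

Adding a case is just one more tuple in CASES, which mirrors adding a string and a regex to the stylesheet.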


Anchored Match
String searched is: [RepeatrepeatrepEat text]
Regex is: [^(Repeat)\p{L}+]
Match No. 1 is [Repeat]


Unanchored match
String searched is: [UnwantedRubbishwith Wanted Text xyzaAndLotsMore, then more rubbish]
Regex is: [(Wanted Text)]
Match No. 1 is [Wanted Text]


Greedy quantifier
String searched is: [UnwantedRubbishwith Wanted Text xyzaAndLotsMore, then more rubbish]
Regex is: [xyz([\p{L}]+)]
Match No. 1 is [aAndLotsMore]


Lazy quantifier
String searched is: [UnwantedRubbishwith Wanted Text xyzaAndLotsMore, then more rubbish]
Regex is: [xyz([\p{L}]+?)]
Match No. 1 is [a]
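
The greedy and lazy cases above can be compared side by side. This is a sketch in Python, with one loudly stated assumption: Python's re module has no \p{L}, so [A-Za-z] stands in for it, which is only equivalent for this ASCII test string.

```python
import re

text = "UnwantedRubbishwith Wanted Text xyzaAndLotsMore, then more rubbish"

# Assumption: [A-Za-z] replaces \p{L}, which Python's re does not support.
# Greedy: + takes as many letters as it can and stops at the comma.
greedy = re.search(r"xyz([A-Za-z]+)", text).group(1)

# Lazy: +? takes as few letters as it can, which is exactly one.
lazy = re.search(r"xyz([A-Za-z]+?)", text).group(1)

print(greedy)  # aAndLotsMore
print(lazy)    # a
```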


NFA or DFA engine
String searched is: [nfa not]
Regex is: [(nfa)|(nfa not)]
Match No. 1 is [nfa]
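
The same probe can be run against another engine. The sketch below assumes Python's re module, which is a backtracking engine: it takes the first alternative that succeeds, so the order of the alternatives decides the result. An engine that always prefers the longest match would report [nfa not] for both orderings.

```python
import re

text = "nfa not"

# The first alternative that matches wins, even though a longer one exists.
short_first = re.search(r"(nfa)|(nfa not)", text).group(0)

# Swapping the alternatives lets the longer one win.
long_first = re.search(r"(nfa not)|(nfa)", text).group(0)

print(short_first)  # nfa
print(long_first)   # nfa not
```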


Newline match
String searched is: [Testing for newline and tab On new line ]
Regex is: [(\nOn)]
Match No. 1 is [ On]


tab match
String searched is: [ tabbed input]
Regex is: [(\ttabbed)]
Match No. 1 is [ tabbed]


Class shorthand, \w\s\W
String searched is: [Words are made of letters]
Regex is: [Words [\w]+\W([\w]+)\s]
Match No. 1 is [made]


Lookahead fails, work round it.
String searched is: [Redhat and no drawers, or Redhat Linux FC-1]
Regex is: [Redhat Linux ([\-\w]+)]
Match No. 1 is [FC-1]


Numerical Character Entities
String searched is: [NCR's, ABCD etc. Src = &#x0042;BCD]
Regex is: [(ABC)]
Match No. 1 is [ABC]


Alternates
String searched is: [Alternatives, nappy :-)]
Regex is: [(diaper|nappy)]
Match No. 1 is [nappy]


Using \d for digits
String searched is: [The key to my safe isn't abd12345AB789]
Regex is: [[\d]+[\i]+([\d]+)]
Match No. 1 is [789]


Anchored text
String searched is: [Last word checks?]
Regex is: [([\w]+)\?$]
Match No. 1 is [checks]


Controlling greedy expressions
String searched is: [With blah blah blah 42 blah blah]
Regex is: [^W.*([\d]+)]
Match No. 1 is [2]


Two digits required, not optional
String searched is: [With blah blah blah 42 blah blah]
Regex is: [^W.*([0-9][0-9])]
Match No. 1 is [42]
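
Why the first of these two cases only captures [2]: the greedy .* first swallows the rest of the string, then gives back one character at a time until the group can match, and a single digit is enough to satisfy [\d]+. A sketch in Python (an assumed engine, not the XSLT one) shows that behaviour, the two-digit fix from this page, and a lazy .*? as another possible way round it.

```python
import re

text = "With blah blah blah 42 blah blah"

# Greedy .* backs off only as far as it must: one digit is left for the group.
one_digit = re.search(r"^W.*([\d]+)", text).group(1)

# The fix used on this page: demand two digits explicitly.
two_digits = re.search(r"^W.*([0-9][0-9])", text).group(1)

# Alternative: a lazy .*? stops at the first digit, so the group gets both.
lazy_prefix = re.search(r"^W.*?([\d]+)", text).group(1)

print(one_digit)    # 2
print(two_digits)   # 42
print(lazy_prefix)  # 42
```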


Word match
String searched is: [28 Word word word]
Regex is: [28 ([\-0-9A-Za-z]+)]
Match No. 1 is [Word]


url
String searched is: [http://www.dpawson.co.uk ]
Regex is: [^(http://[a-z.]+)]
Match No. 1 is [http://www.dpawson.co.uk]

And the rest.

To my knowledge, the current WD|rec doesn't cover the following regex idioms.

Lookaround               (?=) (?< )         Omitted
\x mode + comments       # comment .. \n    \x OK, no comments though.
Word boundaries          \b \<..\>          try \w workaround
Unicode combining char.  \X                 No valid alternate
Comments                 (?#...) and #..    Sadly missing.
Embed literals           \Q...\E            For ease of reading.
Backreferences           \1                 AFAIK
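
Two of these gaps can usually be worked round with a capturing group. The sketch below uses Python, which does support lookaround and \b, purely so that the workaround can be checked against the real thing. The lookbehind pattern and the \W-based boundary pattern are my own illustrations, not taken from the WD.

```python
import re

text = "Redhat and no drawers, or Redhat Linux FC-1"

# With lookaround (fine in Python, not available in the XSD-based dialect).
with_lookbehind = re.search(r"(?<=Redhat Linux )[\-\w]+", text).group(0)

# Workaround from the test case above: match the context, capture the rest.
with_group = re.search(r"Redhat Linux ([\-\w]+)", text).group(1)

sentence = "28 Word word word"

# A real word boundary.
with_boundary = re.search(r"\bWord\b", sentence).group(0)

# Emulated boundary: a non-word character or a string end on each side,
# with the wanted word in its own group.
emulated = re.search(r"(^|\W)(Word)(\W|$)", sentence).group(2)

print(with_lookbehind, with_group)  # FC-1 FC-1
print(with_boundary, emulated)      # Word Word
```

The price of the workaround is that the surrounding context is consumed by the match, so the wanted text has to be read from the group rather than from the whole match.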

And the following need escaping within a character class

Oddments

Be aware of the XSD usage of these; it may not match your previous experience. Of use to anyone using these for XML stuff: see XML for a definition of NameChar, the \c option and the \i option (XML

Character sequence   Equivalent character class
.                    [^\n\r]
\s                   [#x20\t\n\r]
\S                   [^\s]
\i                   the set of initial name characters, those matched by Letter | '_' | ':'
\I                   [^\i]
\c                   the set of name characters, those matched by NameChar
\C                   [^\c]
\d                   \p{Nd}
\D                   [^\d]
\w                   [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]
                     (all characters except the set of "punctuation", "separator" and "other" characters)
\W                   [^\w]
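
To see where these definitions part company with a more familiar engine, here is a small sketch in Python. The helper xsd_word_char is a made-up name that applies the \w definition from the table using Unicode general categories, and the ASCII class [A-Za-z_:] is only a rough stand-in for \i, which Python does not have.

```python
import re
import unicodedata


def xsd_word_char(ch):
    """True if ch falls in the XSD \\w set: not punctuation (P),
    not a separator (Z) and not an "other" character (C)."""
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")


# The underscore is connector punctuation, so the XSD \w excludes it,
# while Python's \w accepts it.
underscore_xsd = xsd_word_char("_")
underscore_python = re.fullmatch(r"\w", "_") is not None

# The plus sign is a symbol, so the XSD \w includes it; Python's \w does not.
plus_xsd = xsd_word_char("+")
plus_python = re.fullmatch(r"\w", "+") is not None

# Assumption: ASCII-only approximation of \i for the safe-key test case.
safe = "The key to my safe isn't abd12345AB789"
digits = re.search(r"[\d]+[A-Za-z_:]+([\d]+)", safe).group(1)

print(underscore_xsd, underscore_python)  # False True
print(plus_xsd, plus_python)              # True False
print(digits)                             # 789
```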