<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:dp="http://www.dpawson.co.uk/namespace#"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="dp xs"
  version="2.0">
  <dp:docs>
    <revhistory>
      <revision>
	<revnumber>1.0</revnumber>
	<date>11 Aug 2004</date>
	<authorinitials>DaveP</authorinitials>
	<revdescription><para>Initial Release</para> </revdescription>
      </revision>
      <revision>
	<revnumber>1.1</revnumber>
	<date>15 Aug 2004</date>
	<authorinitials>DaveP</authorinitials>
	<revdescription><para>Renamed for publication on faq.</para> </revdescription>
      </revision>

      </revhistory>
    </dp:docs>


  <xsl:output method="html" indent="yes" encoding="utf-8"/>


  <dp:testdata>
<l>Plain text</l>
<l>RepeatrepeatrepEat text</l>
<l>UnwantedRubbishwith Wanted Text xyzaAndLotsMore, then more rubbish</l>
<l>nfa not</l>
<l>=XX===</l>
<l>Testing for lazy qualifiers</l>
<l>Testing for newline and tab
On new line
</l>
<l>	tabbed input</l>
<l>Words are made of letters</l>
<l>Redhat and no drawers, or Redhat Linux FC-1</l>
<l>NCR's, A&#x0042;CD etc. Src = A&amp;#x0042;CD</l>
<l>Alternatives, nappy :-)</l>
<l>The key to my safe isn't abd12345AB789</l>
<l>Last word checks?</l>
<l>With blah blah blah 42  blah blah</l>
<l>28 Word word word</l>
<l>http://www.dpawson.co.uk </l>

</dp:testdata>



  <xsl:template match="/">

    <html>
      <head>
        <title>Regex Testbed</title>
      </head>
      <body bgcolor="#FFFFFF">
        <h3>Regex Test bed</h3>

        <p>Gathered to help me learn; thought it might help others too. Add your own test strings to the embedded &lt;dp:testdata> element, and your own regexes to the template for &lt;l>. <a href="regextestbed.xsl">Source here</a>. Run it against itself. No other input document is needed.</p>


        <xsl:apply-templates select="document('')//dp:testdata"/>


<div>
  <h3>And the rest.</h3>
  <p>To my knowledge, the current WD|rec doesn't cover the following regex idioms.</p>
  <table border="1">
    <tr><td>Lookaround</td><td>(?=...) (?!...) (?&lt;=...) (?&lt;!...)</td> <td>Omitted</td></tr>
    <tr><td>x flag + comments</td><td>  # comment .. \n</td> <td>x flag OK (whitespace in the regex is ignored), no comments though.</td></tr>
    <tr><td>Word boundaries</td><td>\b \&lt;..\></td> <td>Try a \w or \W workaround</td></tr>
    <tr><td>Unicode combining char.</td><td>\X</td> <td>No valid alternative</td></tr>
    <tr><td>Comments</td><td>(?#...) and #..</td> <td>Sadly missing.</td></tr>
    <tr><td>Embed literals</td><td>\Q...\E</td><td>Missing; would help ease of reading.</td></tr>
    <tr><td>Backreferences</td><td>\1</td> <td>Missing AFAIK at the time of writing; the XPath 2.0 Recommendation does allow back-references.</td></tr>
      <tr><td></td><td></td> <td></td></tr>
    </table>

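    <p>The following characters need a backslash escape to be matched literally. Within a character class only \, -, ^, [ and ] strictly need it, but escaping the others there is allowed.</p>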

    <ul>
      <li>\</li>
      <li>|</li>
      <li>.</li>
      <li>-</li>
      <li>^</li>
      <li>?</li>
      <li>*</li>
      <li>+</li>
      <li>{</li>
      <li>}</li>
      <li>(</li>
      <li>)</li>
      <li>[</li>
      <li>]</li>

</ul>

<h3>Oddments</h3>

<p>Be aware of the XSD usage of these; they may not match your previous experience. For anyone using these for XML work, see <a href="http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-NameChar">XML</a> for the definition of NameChar (the \c option) and <a href="http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Letter">XML</a> for the definition of Letter (used by the \i option).</p>

<table border="1">

<thead>
<tr>
<th>Character sequence</th>

<th>Equivalent character class</th>
</tr>
</thead>
<tbody>
  <tr>
    <td>.</td>
    <td>[^\n\r]</td>
  </tr>
  <tr>
    <td>\s</td>
    <td>[#x20\t\n\r]</td>
  </tr>
  <tr>
    <td>\S</td>
    <td>[^\s]</td>
  </tr>
  <tr>
    <td>\i</td>
    <td>the set of initial name characters, those matched by Letter | '_' | ':'</td>
  </tr>
  <tr>
    <td>\I</td>
    <td>[^\i]</td>
  </tr>
  <tr>
    <td>\c</td>
    <td>the set of name characters, those matched by  NameChar</td> 
  </tr>
  <tr>
    <td>\C</td>
    <td>[^\c]</td>
  </tr>
  <tr>
    <td>\d</td>
    <td>\p{Nd}</td>
  </tr>
  <tr>
    <td>\D</td>
    <td>[^\d]</td>
  </tr>
  <tr>
    <td>\w</td>
    <td>[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] <br />(all characters except the set of "punctuation", "separator" and "other" characters) </td>
  </tr>
  <tr>
    <td>\W</td>
    <td>[^\w]</td>
  </tr>
</tbody>
</table>



  </div>
      </body>
    </html>
  </xsl:template>


  <xsl:template match="dp:testdata">
    <xsl:apply-templates/>
  </xsl:template>


  
  <xsl:template match="l">

    
<!-- starting match -->
<xsl:copy-of select="dp:re(.,'^(Repeat)\p{L}+',1,'Anchored Match')"/>
<!-- mid block match -->
<xsl:copy-of select="dp:re(.,'(Wanted Text)',1,'Unanchored match')"/>
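<!-- regex engine test: an NFA engine takes the first alternative ('nfa'); a DFA (longest match) engine would take 'nfa not' -->
<xsl:copy-of select="dp:re(.,'(nfa)|(nfa not)',1,'nfa or dfa engine')"/>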
<!-- Non lazy quantifiers -->
<xsl:copy-of select="dp:re(.,'xyz([\p{L}]+)',1,'Greedy quantifier')"/>
<!-- Lazy Quantifiers -->
<xsl:copy-of select="dp:re(.,'xyz([\p{L}]+?)',1,'Lazy quantifier')"/>

<!-- newline, tab -->
<xsl:copy-of select="dp:re(.,'(\nOn)',1,'Newline match')"/>
<xsl:copy-of select="dp:re(.,'(\ttabbed)',1,'tab match')"/>
<!-- Class shorthands, \w \s \W -->
<xsl:copy-of select="dp:re(.,'Words [\w]+\W([\w]+)\s',1,'Class shorthand, \w\s\W')"/> 

<xsl:copy-of select="dp:re(.,'Redhat Linux ([\-\w]+)',1,'Lookahead fails, work round it.')"/> 
<!-- ncr's: the parser resolves &#x0042; to B before the regex sees the string -->
<xsl:copy-of select="dp:re(.,'(ABC)',1,'Numeric character references')"/> 
<!-- alternatives -->
<xsl:copy-of select="dp:re(.,'(diaper|nappy)',1,'Alternates')"/> 
<!-- \d -->
<xsl:copy-of select="dp:re(.,'[\d]+[\i]+([\d]+)',1,'Using \d for digits')"/> 
<!-- $ -->
<xsl:copy-of select="dp:re(.,'([\w]+)\?$',1,'Anchored text')"/> 
<!-- greed -->
<xsl:copy-of select="dp:re(.,'^W.*([\d]+)',1,'Controlling greedy expressions')"/> 
<xsl:copy-of select="dp:re(.,'^W.*([0-9][0-9])',1,'Two digits required, not optional')"/> 
<xsl:copy-of select="dp:re(.,'28 ([\-0-9A-Za-z]+)',1,'Word match')"/> 
<xsl:copy-of select="dp:re(.,'^(http://[a-z.]+)',1,'url')"/> 
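
<!-- Word boundary workaround. \b is not available in this regex dialect,
     so a whole word is approximated by requiring a non-word character
     (or an anchor) on either side. A sketch only: the pattern and the use
     of group 2 are illustrative assumptions, aimed at the 'nfa not' line. -->
<xsl:copy-of select="dp:re(.,'(^|\W)(not)(\W|$)',2,'Word boundary workaround')"/> 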

<!-- 
<xsl:copy-of select="dp:re(.,'',1,'')"/> 
 -->




  </xsl:template>





<!-- This is the function that does all the work.
param 1 = node with input string
param 2 = regular expression
param 3 = which parenthesised group is wanted in the output;
     set to 0 to leave the group out of the output.
param 4 = description, shown as the heading for each match

 -->
  <xsl:function name="dp:re" as="node()+">
    <xsl:param name="nd" />
    <xsl:param name="re"/>
    <xsl:param name="matchNo" as="xs:integer"/>
    <xsl:param name="desc" as="xs:string"/>
    <p>
    <xsl:for-each select="$nd">

      <xsl:analyze-string
        select="$nd"
        regex="{$re}">
        <xsl:matching-substring>
          <hr />
          <b><xsl:value-of select="$desc"/></b><br />
      <xsl:text>String searched is: [</xsl:text>
      <xsl:value-of select="$nd"/><xsl:text>] </xsl:text>
      <br/>     <xsl:text>Regex is: [</xsl:text>
      <i><xsl:value-of select="$re"/></i><xsl:text>] </xsl:text>
      <xsl:if test="$matchNo != 0">
      <br />      <xsl:text>Match No. </xsl:text>
      <xsl:value-of select="$matchNo"/><xsl:text> is </xsl:text>
      <b>[<xsl:value-of select="regex-group($matchNo)"/>]</b>
    </xsl:if>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:for-each>
  </p>
  </xsl:function>
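
  <!-- A possible variant, sketched under assumptions: the same report as
       dp:re above, with a hypothetical fifth parameter handed to the flags
       attribute of xsl:analyze-string (s, m, i and x are the XPath 2.0
       flags). XSLT 2.0 tells the two functions apart by arity. The parameter
       name 'flags' and the sample call are illustrative only.
       Sample call, e.g. from the template for l:
         dp:re(., '( wanted )  \s  ( text )', 1, 'x and i flags', 'xi')
       With x the whitespace in the regex is ignored; with i the case is.
  -->
  <xsl:function name="dp:re" as="node()+">
    <xsl:param name="nd"/>
    <xsl:param name="re"/>
    <xsl:param name="matchNo" as="xs:integer"/>
    <xsl:param name="desc" as="xs:string"/>
    <xsl:param name="flags" as="xs:string"/>
    <p>
      <xsl:for-each select="$nd">
        <!-- flags is an attribute value template, like regex -->
        <xsl:analyze-string select="." regex="{$re}" flags="{$flags}">
          <xsl:matching-substring>
            <hr/>
            <b><xsl:value-of select="$desc"/></b><br/>
            <xsl:text>String searched is: [</xsl:text>
            <xsl:value-of select="$nd"/><xsl:text>] </xsl:text>
            <br/><xsl:text>Regex is: [</xsl:text>
            <i><xsl:value-of select="$re"/></i><xsl:text>], flags: [</xsl:text>
            <xsl:value-of select="$flags"/><xsl:text>] </xsl:text>
            <xsl:if test="$matchNo != 0">
              <br/><xsl:text>Match No. </xsl:text>
              <xsl:value-of select="$matchNo"/><xsl:text> is </xsl:text>
              <b>[<xsl:value-of select="regex-group($matchNo)"/>]</b>
            </xsl:if>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </p>
  </xsl:function>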

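  <!-- Helper: emits "Line n" for each l element passed in.
       Not called by any template in this stylesheet. -->
  <xsl:function name="dp:num">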
    <xsl:param name="nd" />
    <xsl:for-each select="$nd">
      <xsl:text>Line </xsl:text> <xsl:number level="single" count="l" from="dp:testdata" format="1"/>
<br />
  </xsl:for-each>
</xsl:function>



  <xsl:template match="*"/>

</xsl:stylesheet>
