2009-08-23T05:51:22
Dave Pawson.
xml and text search
The task: to search 800-plus XML files. The search terms are vague. The choice? A plain text tool such as Lucene, an XML tool based on XPath, or something else? The source of the XML is the documentation for Erlang. It uses some 20+ DTDs and is quite well marked up, semantically, so there is information in the markup as well as in the text. The choices seem to be either to deal with it as plain text, ignoring the XML markup (Lucene variants do this), which gives me good search query support but loses out on the markup, or to use an XML toolkit which provides access to the markup, yet hasn't the same level of support for 'text engineering', as Sheffield Uni calls it with their GATE product. GATE may be a solution, but it's a big beast and I haven't yet persuaded it to give up its secrets.
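To make the trade-off concrete, here is a minimal sketch in Python using only the standard library. The sample document and its element names (erlref, func, name, desc) are made up for illustration and are not taken from the real Erlang DTDs; the two helper functions are likewise hypothetical. The first throws the markup away and counts hits in the flattened text; the second uses the limited XPath subset in xml.etree.ElementTree to confine the search to particular elements.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; element names are assumptions, not the real Erlang DTDs.
SAMPLE = """<erlref>
  <module>lists</module>
  <func>
    <name>reverse(List1) -> List2</name>
    <desc><p>Returns a list with the elements in reverse order.</p></desc>
  </func>
  <func>
    <name>sort(List1) -> List2</name>
    <desc><p>Returns a sorted list; see also reverse/1.</p></desc>
  </func>
</erlref>"""


def plain_text_hits(xml_source, term):
    """Ignore the markup: flatten all text and count case-insensitive matches."""
    root = ET.fromstring(xml_source)
    text = " ".join(root.itertext())
    return len(re.findall(re.escape(term), text, flags=re.IGNORECASE))


def scoped_hits(xml_source, term, path):
    """Use the markup: return the text of elements at `path` that contain `term`."""
    root = ET.fromstring(xml_source)
    needle = term.lower()
    hits = []
    for element in root.findall(path):
        content = "".join(element.itertext()).strip()
        if needle in content.lower():
            hits.append(content)
    return hits


if __name__ == "__main__":
    # Every mention of the term, wherever it occurs.
    print(plain_text_hits(SAMPLE, "reverse"))
    # Only function names that mention the term.
    print(scoped_hits(SAMPLE, "reverse", ".//func/name"))
```

The plain text version finds every mention but cannot tell a function definition from a passing reference; the scoped version can, but offers nothing like stemming, ranking or fuzzy matching. That gap is the one I'm trying to close.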
I'm really quite surprised that Lucene hasn't addressed XML as yet. Tika looks like it should have done, but instead of allowing the use of the current XML markup, it does a transform to some HTML variant! Solr wasn't much better, again providing a 'solr' XML rather than a generic solution. I'm coming to the conclusion that the text search world hasn't yet come to terms with XML, or at least isn't naturally at home in a markup environment. I'll keep looking. GATE and NLP seem promising, though the number of times GATE crashes speaks for its dev state. It has an OWL connection which may be of help, if I can figure out the integration with markup.
More to come on this one, I figure.
Keywords: xml