The main problem with putting TEI specifically into an XQuery engine is that TEI is such a mess, at least the stuff I've seen. A query to find a word in a "div" requires a union like //div1 | //div2 | //div3 ... (and that's the prettiest example). Having to know the messed-up structure of the original source at query time is pretty crazy, and I've yet to come across a *real* real-world need to query like that.
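
To make the numbered-div problem concrete, here is a minimal sketch (an illustration, not code from the Rossetti Archive). It assumes Python's standard xml.etree.ElementTree, a made-up sample document, and a hypothetical helper named divs_containing. Instead of spelling out one XPath per level, it normalizes div, div1, div2, ... with a regular expression while walking the tree.

```python
import re
import xml.etree.ElementTree as ET

# Matches div, div1, div2, ... but not other tags that merely start with "div".
DIV_TAG = re.compile(r"div\d*$")


def divs_containing(root, word):
    """Return the tag of every div-like element whose text contains `word`."""
    word = word.lower()
    hits = []
    for elem in root.iter():  # document order, all levels
        text = "".join(elem.itertext()).lower()
        if DIV_TAG.match(elem.tag) and word in text:
            hits.append(elem.tag)
    return hits


# Made-up sample with two levels of numbered divs.
sample = ET.fromstring(
    "<body><div1><p>rose</p><div2><p>blessed damozel</p></div2></div1>"
    "<div1><p>lily</p></div1></body>"
)
print(divs_containing(sample, "damozel"))
```

Note that the outer div1 matches as well as the inner div2, because its text includes its children. That overlap is part of what makes querying the raw hierarchy awkward.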

If your structured input is disastrous, an XML engine isn't going to help.

I fought with this stuff for a while in my efforts to refactor the search engine of the Rossetti Archive (http://www.rossettiarchive.org/rose ). That structured search section was made to satisfy the scholars who couldn't stand the thought of not having the control of fielded search, but the single query box to rule them all is more than good enough for the job (and if you enter a structured search you'll see the Lucene query on the results page). This was built pre-Solr and pre-dismax. Initially they were using a Tamino XML database (to be fair, an ancient version) for this stuff, and their searches literally took *minutes*, because a single user query was expanded across a boatload of XPaths.

Another interesting avenue to explore is using payloads to tag terms with the hierarchical position. See Tricia's work here (and I believe her need was TEI too): http://issues.apache.org/jira/browse/SOLR-380

        Erik


On May 27, 2009, at 11:15 PM, Otis Gospodnetic wrote:


Nice and timely topic for me.

You may find this interesting:

http://www.jroller.com/otis/entry/xml_dbs_vs_search_engines

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
From: Walter Underwood <wunderw...@netflix.com>
To: solr-user@lucene.apache.org
Sent: Wednesday, May 27, 2009 10:53:16 PM
Subject: Re: term vectors

If you really, really need to do XML-smart queries, go ahead and buy
MarkLogic. I've worked with the principal folks there and they are
really sharp. Their engine is awesome. XML search is hard, and you
can't take a regular search engine, even a really good one, and make
it do full XML without tons of work.

If, as Erik and Matt suggest, you can discover a substantially simpler (and flat) search schema that makes your users happy, then go ahead and
use Solr.

wunder

On 5/27/09 7:00 PM, "Matt Mitchell" wrote:

I've been experimenting with the XML + Solr combo too. What I've found to be
a good working solution is to:

- pick out the nodes you want as solr documents (every div1 or div2 etc.)
- index the text only (with lots of metadata fields)
- add a field for either the xpath to that node, or
- save the individual nodes (at index time) into separate files and store
  the name of the file in the solr doc
- You could even store the chunked XML in a non-tokenized, stored field in
  the solr document as long as the XML isn't too huge.

So when you do your search, you get all of the power of solr. Then use the
xpath field or the filename field to load the chunk, then transform.
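
The chunking step described above can be sketched roughly as follows. This is a minimal sketch, not Matt's actual code: it assumes Python's standard xml.etree.ElementTree, a made-up TEI sample, and hypothetical field names (id, xpath, type, text, xml). It only builds flat dicts and leaves sending them to Solr to whatever client you use.

```python
import xml.etree.ElementTree as ET

# Made-up sample; real TEI is far messier.
TEI_SAMPLE = """<TEI.2><text><body>
<div1 type="chapter" n="1"><head>Chapter One</head>
<p>In a village of La Mancha lived <name>Quijote</name>.</p>
<div2 n="1"><p>A nested section.</p></div2>
</div1>
<div1 type="chapter" n="2"><p>Second chapter text.</p></div1>
</body></text></TEI.2>"""

CHUNK_TAGS = {"div1", "div2"}  # assumption: which elements become documents


def chunk_documents(xml_text, source_id):
    """Yield one flat dict per chunk element, ready to hand to an indexer."""
    root = ET.fromstring(xml_text)

    def walk(elem, path):
        counts = {}  # per-tag sibling counter, giving positional XPath steps
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
            if child.tag in CHUNK_TAGS:
                yield {
                    "id": f"{source_id}#{child_path}",
                    "xpath": child_path,            # used to reload the chunk
                    "type": child.get("type", ""),  # example metadata field
                    # text only, markup stripped, whitespace collapsed
                    "text": " ".join("".join(child.itertext()).split()),
                    # the chunk itself, for a stored, non-tokenized field
                    "xml": ET.tostring(child, encoding="unicode"),
                }
            yield from walk(child, child_path)

    yield from walk(root, f"/{root.tag}")


docs = list(chunk_documents(TEI_SAMPLE, "sample"))
for doc in docs:
    print(doc["xpath"], "->", doc["text"])
```

In this sketch a div1 document also contains the text of any div2 nested inside it, so the same words are indexed at more than one level. Whether that overlap is wanted depends on what a search hit should display.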

Matt

On Wed, May 27, 2009 at 8:25 PM, Erik Hatcher
wrote:


On May 27, 2009, at 4:56 PM, Yosvanys Aponte wrote:

I understand what you say,
but the problem I have is

user can make query like this:

//tei.2//p"[quijote"]


A couple of problems with this... for one, there's no query parser that'll interpret that syntax as you mean it in Solr. And also, indexing the hierarchical structure (of TEI, which I'm painfully familiar with) requires flattening or doing lots of overlapped indexing of fields that represent the
hierarchy at various levels.
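
One way to picture the flattening is sketched below. It is only an illustration under stated assumptions: Python's standard xml.etree.ElementTree, a made-up sample, and hypothetical field names (element, ancestors, text). Each p becomes a flat document that carries the names of its ancestors, so a request like "p somewhere under tei.2 containing quijote" turns into plain field matching.

```python
import xml.etree.ElementTree as ET


def flatten_paragraphs(xml_text):
    """Return one flat dict per <p>, recording its ancestor element names."""
    root = ET.fromstring(xml_text)
    docs = []

    def walk(elem, ancestors):
        tag = elem.tag.lower()
        if tag == "p":
            docs.append({
                "element": tag,
                "ancestors": list(ancestors),  # e.g. ["tei.2", "text", ...]
                "text": " ".join("".join(elem.itertext()).split()),
            })
        for child in elem:
            walk(child, ancestors + [tag])

    walk(root, [])
    return docs


def matches(doc, ancestor, word):
    """Stand-in for a fielded query: ancestor name plus a word in the text."""
    return ancestor in doc["ancestors"] and word.lower() in doc["text"].lower()


# Made-up sample document.
SAMPLE = (
    "<TEI.2><text><body><div1><p>Don Quijote rides.</p>"
    "<div2><p>Sancho follows.</p></div2></div1></body></text></TEI.2>"
)
flat = flatten_paragraphs(SAMPLE)
hits = [d["text"] for d in flat if matches(d, "tei.2", "quijote")]
print(hits)
```

With documents shaped like this, the search side would be an ordinary fielded query on those hypothetical fields (something along the lines of text:quijote AND ancestors:tei.2) rather than an XPath. The cost is that every level you want to be queryable has to be decided and indexed up front.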

In my experience with the TEI domain, users don't *really* want to query like that even though they'll say they do because it's the only way they're
used to doing it.

Perhaps step back and ask yourself and your users what is really desired from the search application you're building. What's the goal? What needs to be displayed? What type of query entry form will they be typing into?

      Erik


