The main problem with putting TEI specifically into an XQuery engine
is that TEI is such a mess! At least the stuff I've seen. A query to
find a word in a "div" requires doing //div1 || //div2 || //div3 ...
queries (and that's the prettiest example). Having to know the messed
up structure of the original source at query time is pretty crazy and
I've yet to come across a *real* real-world need to query like that.
If your structured input is disastrous, an XML engine isn't going to help.
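Just to illustrate the pain (a hypothetical Python/lxml sketch, not code from any project mentioned here): TEI's numbered divs force you to enumerate every level yourself, while a flattened index reduces the same question to one query.

    from lxml import etree

    # Hypothetical TEI file; the point is the numbered divs (div1..div7).
    tree = etree.parse("sample_tei.xml")

    # With numbered divs, "find this word in a div" means spelling out
    # every level of the hierarchy in the query:
    union_xpath = " | ".join(
        "//div%d[contains(., 'quijote')]" % n for n in range(1, 8)
    )
    hits = tree.xpath(union_xpath)

    # If all div levels are flattened into one searchable field at index
    # time, the same question is a single Solr query: q=text:quijote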
I fought with this stuff for a while in my efforts to refactor the search engine of the Rossetti Archive (http://www.rossettiarchive.org/rose). That structured search section was built to satisfy the scholars who couldn't stand the thought of giving up the control of fielded search, but the single query box to rule them all is more than good enough for the job (and if you enter a structured search you'll see the Lucene query on the results page) - and this was built pre-Solr/pre-dismax. Initially they were using a Tamino XML database (to be fair, an ancient version) for this stuff, and their searches literally took *minutes*, because a single user query was being expanded across a boatload of XPaths.
Another interesting avenue to explore is using payloads to tag terms
with the hierarchical position. See Tricia's work here (and I believe
her need was TEI too): http://issues.apache.org/jira/browse/SOLR-380
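(A rough sketch of the general idea - not Tricia's SOLR-380 work itself - assuming Python/lxml on the extraction side and something like Solr's DelimitedPayloadTokenFilterFactory with the identity encoder on the indexing side:)

    from lxml import etree

    def delimited_payload_text(tei_file):
        # Emit "token|xpath" pairs; a delimited-payload token filter can
        # then attach each term's hierarchical position as a payload.
        tree = etree.parse(tei_file)
        pairs = []
        for p in tree.getroot().iter("p"):
            path = tree.getpath(p)   # e.g. /TEI.2/text/body/div1[2]/p[3]
            for token in " ".join(p.itertext()).split():
                pairs.append("%s|%s" % (token, path))
        return " ".join(pairs)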
Erik
On May 27, 2009, at 11:15 PM, Otis Gospodnetic wrote:
Nice and timely topic for me.
You may find this interesting:
http://www.jroller.com/otis/entry/xml_dbs_vs_search_engines
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Walter Underwood <wunderw...@netflix.com>
To: solr-user@lucene.apache.org
Sent: Wednesday, May 27, 2009 10:53:16 PM
Subject: Re: term vectors
If you really, really need to do XML-smart queries, go ahead and buy
MarkLogic. I've worked with the principal folks there and they are
really sharp. Their engine is awesome. XML search is hard, and you
can't take a regular search engine, even a really good one, and make
it do full XML without tons of work.
If, as Erik and Matt suggest, you can discover a substantially simpler (and flat) search schema that makes your users happy, then go ahead and use Solr.
wunder
On 5/27/09 7:00 PM, "Matt Mitchell" wrote:
I've been experimenting with the XML + Solr combo too. What I've found to be a good working solution is to:

- pick out the nodes you want as Solr documents (every div1 or div2, etc.)
- index the text only (with lots of metadata fields)
- add a field for either the xpath to that node, or save the individual nodes (at index time) into separate files and store the name of the file in the Solr doc

You could even store the chunked XML in a non-tokenized, stored field in the Solr document, as long as the XML isn't too huge.

So when you do your search, you get all of the power of Solr. Then use the xpath field or the filename field to load the chunk, then transform.
Matt
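A minimal sketch of the recipe Matt describes above, assuming Python with lxml and pysolr, and made-up field names (id, text, xpath, xml_chunk):

    from lxml import etree
    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/tei")  # hypothetical core
    tree = etree.parse("quijote.tei.xml")                  # hypothetical file

    docs = []
    for i, div in enumerate(tree.getroot().iter("div1", "div2")):
        docs.append({
            "id": "quijote-%d" % i,
            "text": " ".join(div.itertext()),    # searchable text only
            "xpath": tree.getpath(div),          # to re-locate the chunk later
            "xml_chunk": etree.tostring(div, encoding="unicode"),  # stored copy
        })
    solr.add(docs)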
On Wed, May 27, 2009 at 8:25 PM, Erik Hatcher wrote:
On May 27, 2009, at 4:56 PM, Yosvanys Aponte wrote:
I understand what you say, but the problem I have is that a user can make a query like this:

//tei.2//p"[quijote"]
A couple of problems with this... for one, there's no query parser that'll interpret that syntax as you mean it in Solr. And also, indexing the hierarchical structure (of TEI, which I'm painfully familiar with) requires flattening, or doing lots of overlapped indexing of fields that represent the hierarchy at various levels.
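Purely to illustrate the flattened alternative on the query side (hypothetical fields element, ancestors, and text, with each <p> indexed as its own Solr document): the intent behind the XPath above becomes an ordinary fielded query.

    # Assumed schema: every <p> is a Solr doc with "element" (its tag name),
    # "ancestors" (all ancestor element names) and "text" (its content).
    query = 'element:p AND ancestors:"tei.2" AND text:quijote'
    # e.g. with pysolr, as in the earlier sketch: results = solr.search(query)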
In my experience with the TEI domain, users don't *really* want to query like that, even though they'll say they do, because it's the only way they're used to doing it.

Perhaps step back and ask yourself and your users what is really desired from the search application you're building. What's the goal? What needs to be displayed? What type of query entry form will they be typing into?
Erik