The main problem with putting TEI specifically into an XQuery engine
is that TEI is such a mess! At least the stuff I've seen. A query to
find a word in a "div" requires doing //div1 || //div2 || //div3 ...
queries (and that's the prettiest example). Having to know the messed
up structure of the original source at query time is pretty crazy and
I've yet to come across a *real* real-world need to query like that.
If your structured input is disastrous, an XML engine isn't going to help.
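Just to illustrate the pain (a hypothetical Python/lxml sketch, not code from any project mentioned here): TEI's numbered divs force you to enumerate every level yourself, while a flattened index reduces the same question to one query.

    from lxml import etree

    # Hypothetical TEI file; the point is the numbered divs (div1..div7).
    tree = etree.parse("sample_tei.xml")

    # With numbered divs, "find this word in a div" means spelling out
    # every level of the hierarchy in the query:
    union_xpath = " | ".join(
        "//div%d[contains(., 'quijote')]" % n for n in range(1, 8)
    )
    hits = tree.xpath(union_xpath)

    # If all div levels are flattened into one searchable field at index
    # time, the same question is a single Solr query: q=text:quijote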
I fought with this stuff for a while in my efforts to refactor the search engine of the Rossetti Archive (http://www.rossettiarchive.org/rose). That structured search section was built to satisfy the scholars who couldn't stand the thought of giving up the control of fielded search, but the single query box to rule them all is more than good enough for the job (and if you enter a structured search you'll see the Lucene query on the results page) - and this was built pre-Solr/pre-dismax. Initially they were using a Tamino XML database (to be fair, an ancient version) for this stuff, and their searches literally took *minutes*, because a single user query was being expanded across a boatload of XPaths.
Another interesting avenue to explore is using payloads to tag terms
with the hierarchical position. See Tricia's work here (and I believe
her need was TEI too): http://issues.apache.org/jira/browse/SOLR-380
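(A rough sketch of the general idea - not Tricia's SOLR-380 work itself - assuming Python/lxml on the extraction side and something like Solr's DelimitedPayloadTokenFilterFactory with the identity encoder on the indexing side:)

    from lxml import etree

    def delimited_payload_text(tei_file):
        # Emit "token|xpath" pairs; a delimited-payload token filter can
        # then attach each term's hierarchical position as a payload.
        tree = etree.parse(tei_file)
        pairs = []
        for p in tree.getroot().iter("p"):
            path = tree.getpath(p)   # e.g. /TEI.2/text/body/div1[2]/p[3]
            for token in " ".join(p.itertext()).split():
                pairs.append("%s|%s" % (token, path))
        return " ".join(pairs)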
Erik
On May 27, 2009, at 11:15 PM, Otis Gospodnetic wrote:
Nice and timely topic for me.
You may find this interesting:
http://www.jroller.com/otis/entry/xml_dbs_vs_search_engines
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Walter Underwood <wunderw...@netflix.com>
To: solr-user@lucene.apache.org
Sent: Wednesday, May 27, 2009 10:53:16 PM
Subject: Re: term vectors
If you really, really need to do XML-smart queries, go ahead and buy
MarkLogic. I've worked with the principal folks there and they are
really sharp. Their engine is awesome. XML search is hard, and you
can't take a regular search engine, even a really good one, and make
it do full XML without tons of work.
If, as Erik and Matt suggest, you can discover a substantially simpler (and flat) search schema that makes your users happy, then go ahead and use Solr.
wunder
On 5/27/09 7:00 PM, "Matt Mitchell" wrote:
I've been experimenting with the XML + Solr combo too. What I've found to be a good working solution is to:

- pick out the nodes you want as Solr documents (every div1 or div2, etc.)
- index the text only (with lots of metadata fields)
- add a field for either the xpath to that node, or save the individual nodes (at index time) into separate files and store the name of the file in the Solr doc

You could even store the chunked XML in a non-tokenized, stored field in the Solr document, as long as the XML isn't too huge.

So when you do your search, you get all of the power of Solr. Then use the xpath field or the filename field to load the chunk, then transform.
Matt
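A minimal sketch of the recipe Matt describes above, assuming Python with lxml and pysolr, and made-up field names (id, text, xpath, xml_chunk):

    from lxml import etree
    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/tei")  # hypothetical core
    tree = etree.parse("quijote.tei.xml")                  # hypothetical file

    docs = []
    for i, div in enumerate(tree.getroot().iter("div1", "div2")):
        docs.append({
            "id": "quijote-%d" % i,
            "text": " ".join(div.itertext()),    # searchable text only
            "xpath": tree.getpath(div),          # to re-locate the chunk later
            "xml_chunk": etree.tostring(div, encoding="unicode"),  # stored copy
        })
    solr.add(docs)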
On Wed, May 27, 2009 at 8:25 PM, Erik Hatcher wrote:
On May 27, 2009, at 4:56 PM, Yosvanys Aponte wrote:
I understand what you say, but the problem I have is that a user can make a query like this:

//tei.2//p"[quijote"]
A couple of problems with this... for one, there's no query parser that'll interpret that syntax as you mean it in Solr. And also, indexing the hierarchical structure (of TEI, which I'm painfully familiar with) requires flattening, or doing lots of overlapped indexing of fields that represent the hierarchy at various levels.
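Purely to illustrate the flattened alternative on the query side (hypothetical fields element, ancestors, and text, with each <p> indexed as its own Solr document): the intent behind the XPath above becomes an ordinary fielded query.

    # Assumed schema: every <p> is a Solr doc with "element" (its tag name),
    # "ancestors" (all ancestor element names) and "text" (its content).
    query = 'element:p AND ancestors:"tei.2" AND text:quijote'
    # e.g. with pysolr, as in the earlier sketch: results = solr.search(query)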
In my experience with the TEI domain, users don't *really* want to query like that, even though they'll say they do, because it's the only way they're used to doing it.

Perhaps step back and ask yourself and your users what is really desired from the search application you're building. What's the goal? What needs to be displayed? What type of query entry form will they be typing into?
Erik