Re: metadata about result sets?

Corey Tisdale Fri, 10 Mar 2006 07:56:28 -0800

Interesting... I have been looking at Lucene in my spare time at work for all of 3 days now, so I have to apologize for my lack of understanding when it comes to how it works specifically :) We have a terrible internal search that we are looking to replace, and the only thing it does well is help you refine a terrible result set with faceted metadata. The way we build the metadata list is that, after indexing a product, we actually build a bit sequence that corresponds to all possible key/value combos for each product and associate each variation with the product. Then when someone refines with the price range, let's say, we just look for the one that matches. It gets a bit crazy, but it is the only way we could get the query time down for millions of documents in the index. I did read that email from CNET the other day, but it didn't really register what they were talking about until I saw your metadata group XML example doc here.

For the schema, I just meant the document format. The file is called schema.xml. I haven't tried it, but it looks like you can change that to affect the way Solr works without actually affecting the way Lucene handles it. Is that wrong? I guess it doesn't really matter, since it looks like your indexable groups make more sense from a maintainability standpoint (less redundant data).

As for 'scanning the resultset', I can see how I was a little shy on the details. Sorry about that. I meant looking through the results to see which facets apply to the result set. So if my company sells books and power tools, when someone searches for 'the little engine who could', once we know there are no power tools in the result set, I don't show the refinement facets for power tool metadata (like wattage or battery operated or blade size). For a big group of diverse data, we could potentially have several hundred group names, and it seems like it might be redundant to search for 300 metadata types when we know that only 5 apply to the result set. If, however, speed is not impacted noticeably by searching for metadata that does not exist, then we don't need to worry about this. I am not familiar enough with Lucene's performance to know which would be more optimal.
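
To make the 'scanning' idea concrete, here is a minimal sketch of pruning the candidate facet fields down to the ones that actually occur in a result set. It assumes results are plain dicts of field values; the function name `applicable_facets` and the sample fields are made up for illustration and are not Lucene or Solr APIs.

```python
from collections import Counter

def applicable_facets(results, candidate_fields):
    """Return {field: Counter(value -> count)} for fields seen in the results.

    Fields that never occur in the result set are simply absent from the
    returned dict, so the UI can skip their refinement groups.
    """
    counts = {}
    for doc in results:
        for field in candidate_fields:
            if field in doc:
                counts.setdefault(field, Counter())[doc[field]] += 1
    return counts

# Hypothetical result set: two books, no power tools.
results = [
    {"type": "book", "author": "adams", "binding": "hardcover"},
    {"type": "book", "author": "baker", "binding": "paperback"},
]
candidates = ["author", "binding", "wattage", "blade_size"]

facets = applicable_facets(results, candidates)
print(sorted(facets))  # only the fields that apply to these results
```

This walks every hit once, so its cost grows with the size of the result set; whether that beats running a query per facet is exactly the open performance question.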

In your example file, how does the name facet know to display only the names that start with whatever initial was selected? Would that be built in by modifying our result set first (by applying the author:a from the initial metadata group), then letting it gather all author names on the new set? That seems the easiest way to me, but I don't know how that would affect speed with Lucene.
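
The two-step approach asked about here (narrow by initial first, then gather names from what is left) can be sketched as follows. This only illustrates the question, not how the plugin would actually do it; the helper name `names_for_initial` and the sample data are hypothetical.

```python
def names_for_initial(results, initial):
    """Narrow results to authors starting with `initial`, then list their names."""
    prefix = initial.lower()
    # Step 1: apply the selected "initial" facet (the author:a* restriction).
    narrowed = [d for d in results if d["author"].lower().startswith(prefix)]
    # Step 2: gather the distinct author names from the narrowed set only.
    names = sorted({d["author"] for d in narrowed})
    return narrowed, names

# Hypothetical result set.
results = [
    {"title": "t1", "author": "Adams"},
    {"title": "t2", "author": "Allen"},
    {"title": "t3", "author": "Baker"},
    {"title": "t4", "author": "Adams"},
]

narrowed, names = names_for_initial(results, "a")
print(names)
```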

I think what I am starting to understand is that, coming from what we have (an RDBMS-based metadata gathering system), I need to rethink my process. I've spent so much time training myself to think in terms of how to make things fast in MySQL that I need to re-open my mind :)


-Corey

On Mar 10, 2006, at 12:44 AM, Chris Hostetter wrote:

: I like the idea of the wiki page; I think I will attempt to set one
: up after this email, but I wanted to see if I could do a little bit
: better job of fleshing out how pulling metadata out might work(in my
I finally got a chance to look at your ideas.

first off: as far as i know, there aren't any special edit permissions
necessary to modify the TaskList ... if the edit link wasn't showing up
for you after you logged in, it might just be that the page was cached,
try a force-reload.

Okay, on to the topic at hand..

: We add suggestable metadata as part of the product schema, so we
: could have something like
There's a difference between the index schema, and the "xml schema/dtd"
for adding documents. You seem to be suggesting a change to the xml used
when adding documents to indicate whether a field should be suggestable or
not, but that syntax is tied directly to the underlying lucene API for
Documents/Fields -- where would the suggestable/preceding info be stored?

: Once we reindex, we do a search for 'legal' again and our book is in
: it. Based on our index,  we can scan the resultset and see that the
: results have three suggestable fields, two of which do not require a
: preceding field.

I'm not sure what you mean by "scan the result" to get the
suggestable fields (and their values) ... can you elaborate?

I'm not sure if you read the thread yonik mentioned earlier about how we
do this at CNET, but the way we store info about which fields we want to
have facets on (and what those facets should be in the case of range
queries and such) is to put "metadata documents" into the index. for a
single user request, you pull out the metadata document, then use the info
contained in it to determine facets to search on and intersect with the
main result.

the format of the metadata docs we use is very custom, but perhaps a
similar, generalized approach could be implemented?

The plugin could dictate a specific XML format indicating the behavior to
drive the facets using either of the following mechanisms (more could be
added as needed)...
  * make group FF of all indexed values of field F
  * make group G using queries x, y, and z with labels a, b, and c
...users could index one or more metadata documents, containing the XML
info in any stored field they want defined in the schema -- when
configuring the plugin, they'd specify the field in the solrconfig.xml.
at query time, they specify two queries: one to restrict the main results,
and one to identify the metadata doc they want to use (if it's always the
same one, a default could be configured in solrconfig as well)
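
As a rough illustration of the "intersect with the main result" step, the sketch below models each query's matches as a plain Python set of document ids and counts the overlap per facet. This is an assumption made for readability, not the CNET implementation or any Lucene/Solr API; `facet_counts` and the sample ids are hypothetical.

```python
def facet_counts(main_result_ids, facet_doc_sets):
    """Count, for each facet label, how many main-result docs also match it."""
    return {
        label: len(main_result_ids & doc_ids)
        for label, doc_ids in facet_doc_sets.items()
    }

# Hypothetical doc-id sets: one for the main query, one per facet query
# listed in the metadata document.
main_result = {1, 2, 3, 5}
facet_sets = {
    "Under $20": {1, 2, 9},
    "$21 - $40": {3, 4},
    "$41 - $60": {7},
}

counts = facet_counts(main_result, facet_sets)
print(counts)
```

Facets whose count comes back as zero could simply be hidden from the user.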

an example of what i mean about XML stored in a field of the metadata
doc...

   <facets>
     <group id="price" label="Price">
       <facet id="0-20"  label="Under $20">price:[0 TO 20]</facet>
       <facet id="21-40" label="$21 - $40">price:[21 TO 40]</facet>
       <facet id="41-60" label="$41 - $60">price:[41 TO 60]</facet>
     </group>
     <group id="initial" label="Author">
       <facet id="a" label="A">author:a*</facet>
       ...
     </group>
     <group id="name" label="Author" depends="initial">
       <facet use-terms-field="author" />
     </group>
     ...
   </facets>
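
For illustration only, here is one way a stored XML value like the one above might be read into plain data structures using Python's standard `xml.etree.ElementTree`. The sample is a shortened copy of the example with the elided "..." parts left out, and `load_groups` plus the dict layout are assumptions, not part of any proposed plugin.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example; elided parts are omitted, not filled in.
FACETS_XML = """
<facets>
  <group id="price" label="Price">
    <facet id="0-20" label="Under $20">price:[0 TO 20]</facet>
    <facet id="21-40" label="$21 - $40">price:[21 TO 40]</facet>
  </group>
  <group id="initial" label="Author">
    <facet id="a" label="A">author:a*</facet>
  </group>
  <group id="name" label="Author" depends="initial">
    <facet use-terms-field="author" />
  </group>
</facets>
"""

def load_groups(xml_text):
    """Parse <facets> XML into a list of group dicts."""
    groups = []
    for group in ET.fromstring(xml_text.strip()).findall("group"):
        facets = []
        for facet in group.findall("facet"):
            terms_field = facet.get("use-terms-field")
            if terms_field is not None:
                # "All indexed values of field F" style facet.
                facets.append({"terms_field": terms_field})
            else:
                # Explicit query with an id and a display label.
                facets.append({
                    "id": facet.get("id"),
                    "label": facet.get("label"),
                    "query": (facet.text or "").strip(),
                })
        groups.append({
            "id": group.get("id"),
            "label": group.get("label"),
            "depends": group.get("depends"),  # None when not dependent
            "facets": facets,
        })
    return groups

groups = load_groups(FACETS_XML)
for g in groups:
    print(g["id"], g["depends"], len(g["facets"]))
```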


-Hoss
