Interesting... I have been looking at Lucene in my spare time at work for all of 3 days now, so I have to apologize for my lack of understanding when it comes to how it works specifically :) We have a terrible internal search that we are looking to replace, and the only thing it does well is help you refine a terrible result set with faceted metadata. The way we build the metadata list is post-indexing of products: we actually build a bit sequence that corresponds to all possible key/value combos for each product and associate each variation with the product. Then when someone refines by price range, let's say, we just look for the one that matches. It gets a bit crazy, but it is the only way we could get the response time down for millions of documents in the index. I did read that email from CNET the other day, but it didn't really register what they were talking about until I saw your metadata group XML example doc here.

For the schema, I just meant the document format. The file is called schema.xml. I haven't tried it, but it looks like you can change that to affect the way Solr works without actually affecting the way Lucene handles it. Is that wrong? I guess it doesn't really matter, since it looks like your indexable groups make more sense from a maintainability standpoint (less redundant data).

As for 'scanning the resultset', I can see how I was a little shy on the details. Sorry about that. I meant looking through the results to see which facets apply to the result set. So if my company sells books and power tools, and someone searches for 'the little engine who could', then once we know there are no power tools in the result set, I don't show the refinement facets for power tool metadata (like wattage, battery operated, or blade size). For a big group of diverse data, we could potentially have several hundred group names, and it seems like it might be redundant to search for 300 metadata types when we know that only 5 apply to the result set. If, however, speed is not noticeably impacted by searching for metadata that does not exist, then we don't need to worry about this. I am not familiar enough with Lucene's performance to know which would be faster.
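
To make the 'scanning' idea concrete, here is a minimal sketch in plain Python (no Lucene or Solr involved). It assumes each result is a dict carrying a hypothetical "metadata" mapping; the function name and the sample fields are made up for illustration only.

```python
from collections import Counter

def applicable_facets(results):
    """Collect only the metadata fields that occur in the result set.

    Returns {field_name: Counter(value -> number of matching results)}.
    The "metadata" key is an assumption for this sketch, not a Solr field.
    """
    facets = {}
    for doc in results:
        for field, value in doc.get("metadata", {}).items():
            facets.setdefault(field, Counter())[value] += 1
    return facets

# Hypothetical result set for a book search: no power tool metadata appears.
results = [
    {"id": 1, "metadata": {"format": "hardcover", "age_range": "3-5"}},
    {"id": 2, "metadata": {"format": "board book", "age_range": "3-5"}},
]

facets = applicable_facets(results)
```

With this result set, only "format" and "age_range" come back, so facets such as wattage would never be offered. Whether a scan like this beats simply querying every known facet is exactly the open performance question.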

In your example file, how does the name facet know to display only the names that start with whatever initial was selected? Would that be built in by modifying our result set first (by applying the author:a* facet from the initial metadata group) and then letting it gather all author names on the new set? That seems the easiest way to me, but I don't know how that would affect speed with Lucene.
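
The two-step approach I am asking about could look roughly like this in plain Python. This is only a sketch of my guess, not how Solr or the CNET code actually does it; the document layout and function name are hypothetical.

```python
from collections import Counter

def names_for_initial(results, initial):
    """Restrict results to authors starting with `initial`, then gather names.

    Step 1 mimics applying an author:a* style filter to the result set;
    step 2 gathers the distinct author values on the restricted set.
    """
    prefix = initial.lower()
    restricted = [d for d in results if d["author"].lower().startswith(prefix)]
    return Counter(d["author"] for d in restricted)

# Hypothetical documents, each with a single author field.
results = [
    {"id": 1, "author": "adams"},
    {"id": 2, "author": "asimov"},
    {"id": 3, "author": "adams"},
    {"id": 4, "author": "bradbury"},
]

name_counts = names_for_initial(results, "a")
```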

I think what I am starting to understand is that, coming from what we have (an RDBMS-based metadata gathering system), I need to rethink my process. I've spent so much time training myself to think in terms of how to make things fast in MySQL that I need to re-open my mind :)

-Corey

On Mar 10, 2006, at 12:44 AM, Chris Hostetter wrote:


: I like the idea of the wiki page; I think I will attempt to set one
: up after this email, but I wanted to see if I could do a little bit
: better job of fleshing out how pulling metadata out might work (in my

I finally got a chance to look at your ideas.

first off: as far as i know, there aren't any special edit permissions
necessary to modify the TaskList ... if the edit link wasn't showing up
for you after you logged in, it might just be that the page was cached,
try a force-reload.

Okay, on to the topic at hand..

: We add suggestable metadata as part of the product schema, so we
: could have something like

There's a difference between the index schema, and the "xml schema/dtd"
for adding documents. You seem to be suggesting a change to the xml used
when adding documents to indicate whether a field should be suggestable or
not, but that syntax is tied directly to the underlying lucene API for
Documents/Fields -- where would the suggestable/preceding info be stored?

: Once we reindex, we do a search for 'legal' again and our book is in
: it. Based on our index,  we can scan the resultset and see that the
: results have three suggestable fields, two of which do not require a
: preceding field.

I'm not sure what you mean by "scan the result" to get the
suggestable fields (and their values) ... can you elaborate?


I'm not sure if you read the thread yonik mentioned earlier about how we do this at CNET, but the way we store info about which fields we want to
have facets on (and what those facets should be in the case of range
queries and such) is to put "metadata documents" into the index. for a single user request, you pull out the metadata document, then use the info contained in it to determine facets to search on and intersect with the
main result.
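
As a rough illustration of the "intersect with the main result" step, here is a sketch in plain Python using sets of document ids. It is not the CNET implementation; the function name and the sample ids are assumptions, and a real index would work on its own document-set structures rather than Python sets.

```python
def facet_counts(main_result_ids, facet_doc_ids):
    """Count how many main-result docs fall under each facet.

    main_result_ids: ids matching the user's query.
    facet_doc_ids:   {facet label: ids matching that facet's query},
                     as described by a metadata document.
    """
    main = set(main_result_ids)
    return {label: len(main & set(ids)) for label, ids in facet_doc_ids.items()}

# Hypothetical doc ids for one request.
main_result = {1, 2, 3, 5, 8}
price_facets = {
    "Under $20": {1, 2, 9},
    "$21 - $40": {3, 4},
    "$41 - $60": {6, 7},
}

counts = facet_counts(main_result, price_facets)
```

Facets whose count comes back as zero can simply be left out of the display.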

the format of the metadata docs we use is very custom, but perhaps a
similar, generalized approach could be implemented?

The plugin could dictate a specific XML format indicating the behavior to
drive the facets using either of the following mechanisms (more could be
added as needed)...
  * make group FF of all indexed values of field F
  * make group G using queries x, y, and z with labels a, b, and c
...users could index one or more metadata documents, containing the XML
info in any stored field they want defined in the schema -- when
configuring the plugin, they'd specify the field in the solrconfig.xml. at
query time, they specify two queries: one to restrict the main results, and
one to identify the metadata doc they want to use (if it's always the
same one, a default could be configured in solrconfig as well)

an example of what i mean about XML stored in a field of the metadata
doc...

   <facets>
     <group id="price" label="Price">
       <facet id="0-20"  label="Under $20">price:[0 TO 20]</facet>
       <facet id="21-40" label="$21 - $40">price:[21 TO 40]</facet>
       <facet id="41-60" label="$41 - $60">price:[41 TO 60]</facet>
     </group>
     <group id="initial" label="Author">
       <facet id="a" label="A">author:a*</facet>
       ...
     </group>
     <group id="name" label="Author" depends="initial">
       <facet use-terms-field="author" />
     </group>
     ...
   </facets>
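
A plugin would first have to turn that stored XML into something it can act on. The sketch below, using only Python's standard library, shows one way the proposed format could be read. The XML here is a trimmed copy of the example with the "..." placeholders left out so that it parses, and the function name and returned dict layout are assumptions for illustration, not part of any existing Solr API.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example above; the "..." placeholders are omitted.
FACETS_XML = """
<facets>
  <group id="price" label="Price">
    <facet id="0-20" label="Under $20">price:[0 TO 20]</facet>
    <facet id="21-40" label="$21 - $40">price:[21 TO 40]</facet>
  </group>
  <group id="initial" label="Author">
    <facet id="a" label="A">author:a*</facet>
  </group>
  <group id="name" label="Author" depends="initial">
    <facet use-terms-field="author" />
  </group>
</facets>
"""

def parse_facet_groups(xml_text):
    """Return a list of group dicts describing query and term facets."""
    groups = []
    for group in ET.fromstring(xml_text).findall("group"):
        facets = []
        for facet in group.findall("facet"):
            terms_field = facet.get("use-terms-field")
            if terms_field is not None:
                # Mechanism 1: one facet per indexed value of a field.
                facets.append({"kind": "terms", "field": terms_field})
            else:
                # Mechanism 2: an explicit query with an id and a label.
                facets.append({
                    "kind": "query",
                    "id": facet.get("id"),
                    "label": facet.get("label"),
                    "query": (facet.text or "").strip(),
                })
        groups.append({
            "id": group.get("id"),
            "label": group.get("label"),
            "depends": group.get("depends"),  # None when not dependent
            "facets": facets,
        })
    return groups

groups = parse_facet_groups(FACETS_XML)
```

A group with a "depends" value would only be evaluated once a facet from the named group has been selected.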


-Hoss

