Interesting... I have been looking at Lucene in my spare time at work
for all of 3 days now, so I have to apologize for my lack of
understanding when it comes to how it works specifically :) We have
a terrible internal search that we are looking to replace, and the
only thing it does well is help you refine a terrible result set with
faceted metadata. The way we build the metadata list is a
post-indexing step: for each product, we build a bit sequence that
corresponds to all possible key/value combos and associate each
variation with the product. Then when someone refines by price range,
let's say, we just look for the one that matches. It gets a bit
crazy, but it is the only way we could get response times down with
millions of documents in the index. I did read that email from CNET
the other day, but it didn't really register what they were talking
about until I saw your metadata group XML example doc here.
For the schema, I just meant the document format. The file is called
schema.xml. I haven't tried it, but it looks like you can change that
to affect the way Solr works without actually affecting the way
Lucene handles it. Is that wrong? I guess it doesn't really matter,
since it looks like your indexable groups make more sense from a
maintainability standpoint (less redundant data).
As for 'scanning the resultset', I can see how I was a little shy on
the details. Sorry about that. I meant looking through the results to
see which facets apply to the result set. So if my company sells books
and power tools, and someone searches for 'the little engine who
could', once we know there are no power tools in the result set, I
don't show the refinement facets for power tool metadata (like
wattage, battery operated, or blade size). For a big group of
diverse data, we could potentially have several hundred group names,
and it seems like it might be redundant to search for 300 metadata
types when we know that only 5 apply to the result set. If, however,
speed is not impacted noticeably by searching for metadata that does
not exist, then we don't need to worry about this. I am not familiar
enough with Lucene's performance to know which would be faster.
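
To make the 'which facets apply' idea concrete, here is a minimal sketch in plain Python. It does not use Lucene or Solr at all; the hits, the field names, and the applicable_facets helper are all made up for illustration, and a real index would answer this very differently.

```python
from collections import Counter

# Hypothetical result set for a book search: each hit is a dict of
# field name -> value. None of these names come from a real schema.
hits = [
    {"id": "book-1", "author": "Author One", "price": 7},
    {"id": "book-2", "author": "Author Two", "price": 5},
]

# Hypothetical list of every facet field known to the whole catalog.
all_facet_fields = ["author", "price", "wattage", "battery_operated", "blade_size"]

def applicable_facets(hits, facet_fields):
    """Return (field, hit count) for each facet field present in at least one hit."""
    counts = Counter()
    for doc in hits:
        for field in facet_fields:
            if field in doc:
                counts[field] += 1
    # Keep the catalog order so the refinement list stays stable.
    return [(f, counts[f]) for f in facet_fields if counts[f] > 0]

print(applicable_facets(hits, all_facet_fields))
# prints [('author', 2), ('price', 2)]
```

The power tool fields drop out because no hit carries them, which is the behavior described above. Note that this loop touches every hit, so it says nothing about whether pre-filtering the facet list is actually cheaper than just asking for all of them.
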
In your example file, how does the name facet know to display only
the names that start with whatever initial was selected? Would that
be built in by modifying our result set first (by applying the
author:a* query from the initial metadata group) then letting it
gather all author names on the new set? That seems the easiest way to
me, but I don't know how that would affect speed with Lucene.
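
The 'narrow first, then gather' order asked about here can be sketched in a few lines of plain Python. This is only an illustration of the idea under made-up data: result_authors and author_facet are hypothetical names, and the prefix test merely stands in for what an author:a* filter would do in the index.

```python
from collections import Counter

# Hypothetical author values for the documents in the current result set.
result_authors = ["Adams", "Asimov", "Adams", "Bradbury", "Clarke"]

def author_facet(authors, initial=None):
    """Count author names, optionally restricted to one initial letter."""
    if initial is not None:
        # Narrow the set first, standing in for an author:a* style filter.
        authors = [a for a in authors if a.lower().startswith(initial.lower())]
    # Then gather every author name that is left, with its count.
    return sorted(Counter(authors).items())

print(author_facet(result_authors, "a"))
# prints [('Adams', 2), ('Asimov', 1)]
```

Without an initial, the same function returns counts for every author in the result set, so the dependent group is just the plain term facet applied to a smaller set.
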
I think what I am starting to understand is that, coming from what we
have (an RDBMS-based metadata gathering system), I need to rethink my
process. I've spent so much time training myself to think in terms of
how to make things fast in MySQL that I need to re-open my mind :)
-Corey
On Mar 10, 2006, at 12:44 AM, Chris Hostetter wrote:
: I like the idea of the wiki page; I think I will attempt to set one
: up after this email, but I wanted to see if I could do a little bit
: better job of fleshing out how pulling metadata out might work
: (in my
I finally got a chance to look at your ideas.
first off: as far as i know, there aren't any special edit permissions
necessary to modify the TaskList ... if the edit link wasn't showing up
for you after you logged in, it might just be that the page was cached,
try a force-reload.
Okay, on to the topic at hand..
: We add suggestable metadata as part of the product schema, so we
: could have something like
There's a difference between the index schema, and the "xml schema/dtd"
for adding documents. You seem to be suggesting a change to the xml used
when adding documents to indicate whether a field should be suggestable or
not, but that syntax is tied directly to the underlying lucene API for
Documents/Fields -- where would the suggestable/preceding info be stored?
: Once we reindex, we do a search for 'legal' again and our book is in
: it. Based on our index, we can scan the resultset and see that the
: results have three suggestable fields, two of which do not require a
: preceding field.
I'm not sure what you mean by "scan the result" to get the
suggestable fields (and their values) ... can you elaborate?
I'm not sure if you read the thread yonik mentioned earlier about how we
do this at CNET, but the way we store info about which fields we want to
have facets on (and what those facets should be in the case of range
queries and such) is to put "metadata documents" into the index. for a
single user request, you pull out the metadata document, then use the info
contained in it to determine facets to search on and intersect with the
main result.
the format of the metadata docs we use is very custom, but perhaps a
similar, generalized approach could be implemented?
The plugin could dictate a specific XML format indicating the behavior to
drive the facets using either of the following mechanisms (more could be
added as needed)...
* make group FF of all indexed values of field F
* make group G using queries x, y, and z with labels a, b, and c
...users could index one or more metadata documents, containing the XML
info in any stored field they want defined in the schema -- when
configuring the plugin, they'd specify the field in the solrconfig.xml.
at query time, they specify two queries: one to restrict the main results,
and one to identify the metadata doc they want to use (if it's always the
same one, a default could be configured in solrconfig as well)
an example of what i mean about XML stored in a field of the metadata
doc...
<facets>
  <group id="price" label="Price">
    <facet id="0-20" label="Under $20">price:[0 TO 20]</facet>
    <facet id="21-40" label="$21 - $40">price:[21 TO 40]</facet>
    <facet id="41-60" label="$41 - $60">price:[41 TO 60]</facet>
  </group>
  <group id="initial" label="Author">
    <facet id="a" label="A">author:a*</facet>
    ...
  </group>
  <group id="name" label="Author" depends="initial">
    <facet use-terms-field="author" />
  </group>
  ...
</facets>
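
As a rough illustration of how a plugin might consume a document like this, the sketch below parses the price group and counts, for each facet, how many docs in the main result also match the facet query. Everything outside the XML is an assumption made for the example: DOCS is a toy in-memory stand-in for the index, matching_ids is a hypothetical helper that understands only the field:[lo TO hi] form, and facet_counts is not part of Solr or Lucene.

```python
import re
import xml.etree.ElementTree as ET

# The price group from the example above, as it might sit in a stored field.
FACETS_XML = """
<facets>
  <group id="price" label="Price">
    <facet id="0-20" label="Under $20">price:[0 TO 20]</facet>
    <facet id="21-40" label="$21 - $40">price:[21 TO 40]</facet>
    <facet id="41-60" label="$41 - $60">price:[41 TO 60]</facet>
  </group>
</facets>
"""

# Hypothetical in-memory "index": doc id -> fields.
DOCS = {1: {"price": 12}, 2: {"price": 35}, 3: {"price": 18}, 4: {"price": 55}}

RANGE = re.compile(r"^(\w+):\[(\d+) TO (\d+)\]$")

def matching_ids(query):
    """Toy stand-in for a search; supports only field:[lo TO hi]."""
    m = RANGE.match(query.strip())
    if m is None:
        raise ValueError(f"unsupported query in this sketch: {query!r}")
    field, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return {i for i, d in DOCS.items() if field in d and lo <= d[field] <= hi}

def facet_counts(facets_xml, main_result_ids):
    """Intersect each query facet with the main result and count the overlap."""
    counts = {}
    for group in ET.fromstring(facets_xml).findall("group"):
        for facet in group.findall("facet"):
            if facet.text:  # term-based facets carry no query text
                hits = matching_ids(facet.text) & main_result_ids
                counts[(group.get("id"), facet.get("label"))] = len(hits)
    return counts

main_result = {1, 2, 3}  # ids matched by the user's main query
for key, n in facet_counts(FACETS_XML, main_result).items():
    print(key, n)
```

With the toy data, the two cheaper ranges get counts of 2 and 1 and the most expensive range gets 0, since doc 4 is not in the main result. Groups that use use-terms-field or depends would need their own handling, which this sketch leaves out.
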
-Hoss