crawler feed?
Hi: Are there relatively stand-alone crawlers that are suitable/customizable for Solr? Has anyone done any trials? I have seen some discussion about a cocoon crawler.. was that successful? Regards
Re: crawler feed?
On Wed, 2007-02-07 at 11:09 +0100, rubdabadub wrote:
> Hi: Are there relatively stand-alone crawlers that are suitable/customizable for Solr? Has anyone done any trials? I have seen some discussion about a cocoon crawler.. was that successful?

http://wiki.apache.org/solr/SolrForrest

I am using this approach in a custom project that is cocoon based, and it is working very well. However, cocoon's crawler is not standalone; it uses the cocoon CLI. I am using the solr/forrest plugin for the commit and dispatching the update. The indexing transformation in the plugin is a wee bit different than the one in my project, since I needed to extract more information from the documents to create better filters.

However, since the cocoon CLI is no longer in 2.2 (cocoon-trunk) and forrest uses this as its main component, I am keen to write a simple crawler that could be reused for cocoon, forrest, solr, nutch, ... I may start something pretty soon (I guess I will open a project in Apache Labs) and will keep this list informed.

My idea is to write a simple crawler which could be easily extended by plugins. So if a project/app needs special processing for a crawled URL, one could write a plugin to implement the functionality. A solr plugin for this crawler would be very simple: basically it would parse e.g. the html page and dispatch an update command for the extracted fields. I think one should try to reuse as much code from nutch as possible for this parsing.

If somebody is interested in such a standalone crawler project, I welcome any help, ideas, suggestions, feedback and/or questions.

salu2
--
Thorsten Scherler                         thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions
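[A rough Java sketch of the plugin idea described above. The CrawlerPlugin interface and the class below are hypothetical (this is not the Apache Labs code), and the update URL assumes the example Jetty setup on port 8983; the only Solr-specific piece is the XML <add>/<commit/> message posted to the /update handler.]

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical hook a plugin-based crawler could call for every fetched URL. */
interface CrawlerPlugin {
    void process(String url, String rawContent) throws Exception;
}

/** Sketch of a Solr plugin: turn extracted fields into an <add> command for the /update handler. */
class SolrUpdatePlugin implements CrawlerPlugin {
    private final String updateUrl;   // e.g. "http://localhost:8983/solr/update" (assumed default)

    SolrUpdatePlugin(String updateUrl) { this.updateUrl = updateUrl; }

    public void process(String url, String rawContent) throws Exception {
        // Real field extraction (title, body, ...) would use an HTML parser; kept trivial here.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", url);
        fields.put("text", rawContent);

        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            xml.append("<field name=\"").append(f.getKey()).append("\">")
               .append(escape(f.getValue())).append("</field>");
        }
        xml.append("</doc></add>");
        post(xml.toString());
        post("<commit/>");            // or batch commits somewhere else

    }

    private void post(String body) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(updateUrl).openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        con.setDoOutput(true);
        Writer w = new OutputStreamWriter(con.getOutputStream(), "UTF-8");
        w.write(body);
        w.close();
        if (con.getResponseCode() != 200) {
            throw new IllegalStateException("Solr update failed: HTTP " + con.getResponseCode());
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}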
Re: crawler feed?
Thorsten: Thank you very much for the update.

On 2/7/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> I am using this approach in a custom project that is cocoon based, and it is working very well. However, cocoon's crawler is not standalone; it uses the cocoon CLI. I am using the solr/forrest plugin for the commit and dispatching the update. The indexing transformation in the plugin is a wee bit different than the one in my project, since I needed to extract more information from the documents to create better filters.
>
> However, since the cocoon CLI is no longer in 2.2 (cocoon-trunk) and forrest uses this as its main component, I am keen to write a simple crawler that could be reused for cocoon, forrest, solr, nutch, ... I may start something pretty soon (I guess I will open a project in Apache Labs) and will keep this list informed.
>
> My idea is to write a simple crawler which could be easily extended by plugins. So if a project/app needs special processing for a crawled URL, one could write a plugin to implement the functionality. A solr plugin for this crawler would be very simple: basically it would parse e.g. the html page and dispatch an update command for the extracted fields. I think one should try to reuse as much code from nutch as possible for this parsing.
>
> If somebody is interested in such a standalone crawler project, I welcome any help, ideas, suggestions, feedback and/or questions.

I have seen some discussion regarding the nutch crawler. I think a standalone crawler would be more desirable.. as you pointed out, one could extend such a crawler via plugins. It seems difficult to "rip" the nutch crawler out as a standalone crawler, no? Because you would want as much of the "same code base" as possible, no? I also think such a crawler is interesting in the vertical search engine space. So Nutch 0.7 could be a good target, no?

Regards
Re: Debugging Solr memory usage/heap problems
To help find leaks I had good luck with jmap and even jhat in Java 1.6.

Otis

----- Original Message -----
From: Graham Stead <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, February 7, 2007 12:49:47 AM
Subject: RE: Debugging Solr memory usage/heap problems

Thanks, Chris. I will test with vanilla Solr to clarify whether the problem occurs with it, or only in the version where we have made changes.

-Graham

> : To tweak our scoring, a custom hit collector in SolrIndexSearcher creates 1 fieldCache and 3 ValueSources from 3 fields:
> : - an integer field with many unique values (order 10^4)
> : - another integer field with many unique values (order 10^4)
> : - an integer field with hundreds of unique values
>
> so you customized SolrIndexSearcher? ... is it possible you have a memory leak in that code?
>
> If you have all of your cache sizes set to zero, you should be able to start up the server, hit it with a bunch of queries, then trigger a commit and see your heap usage drop significantly. ... if you do that over and over again and see the heap usage grow and grow, there may be something else going on in those changes of yours.
Re: Analyzers and Tokenizers?
FYI, I have added that link into the Wiki.

Bill

On 2/6/07, rubdabadub <[EMAIL PROTECTED]> wrote:

Thanks Thorsten!

On 2/6/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> On Tue, 2007-02-06 at 17:27 +0100, rubdabadub wrote:
> > Hi: Are there more filters/tokenizers than the ones mentioned here?
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> > I have found some new ones in the example/schema.xml (fieldtype definitions with sortMissingLast="true" omitNorms="true", and more). Is there any complete list somewhere, or how can I find more info about them?
>
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/analysis/
>
> HTH
>
> salu2
>
> Kind regards,
> --
> Thorsten Scherler                         thorsten.at.apache.org
> Open Source Java & XML consulting, training and solutions
Re: crawler feed?
rubdabadub wrote:
> Hi: Are there relatively stand-alone crawlers that are suitable/customizable for Solr? Has anyone done any trials? I have seen some discussion about a cocoon crawler.. was that successful?

There's also an integration path available for Nutch [1] that I plan to integrate after 0.9.0 is out.

--
Sami Siren

[1] http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
facet optimizing
Any suggestions on how to optimize the loading of facets? My index is roughly 35,000 documents and I am asking Solr to return six facet fields on every query. On large result sets, searching is zippy with the facet params set to false, but when they are set to true and facet fields are designated, it takes some time to load. I've tried adjusting some/all of the Common Cache Configuration Parameters in the config but haven't gotten any better result times. Any suggestions?

Thanks,
Andrew
Re: facet optimizing
How many unique values do you have for those 6 fields? And are those fields multiValued or not? Single valued facets are much faster (though not realistic in my domain). Lots of values per field do not good facets make.

Erik

On Feb 7, 2007, at 11:10 AM, Gunther, Andrew wrote:
> Any suggestions on how to optimize the loading of facets? My index is roughly 35,000 documents and I am asking Solr to return six facet fields on every query. On large result sets, searching is zippy with the facet params set to false, but when they are set to true and facet fields are designated, it takes some time to load. I've tried adjusting some/all of the Common Cache Configuration Parameters in the config but haven't gotten any better result times. Any suggestions?
>
> Thanks,
> Andrew
Re: crawler feed?
This is really interesting. You mean to say I could give the patch a try now, i.e. the patch in the blog post :-) I am looking forward to it. I hope it will be standalone, i.e. you don't need "the whole nutch" to get a standalone crawler working.. I am not sure if this is how you planned it.

Regards

On 2/7/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> rubdabadub wrote:
> > Hi: Are there relatively stand-alone crawlers that are suitable/customizable for Solr? Has anyone done any trials? I have seen some discussion about a cocoon crawler.. was that successful?
>
> There's also an integration path available for Nutch [1] that I plan to integrate after 0.9.0 is out.
>
> --
> Sami Siren
>
> [1] http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
RE: facet optimizing
Yes, almost all terms are multi-valued, which I can't avoid. Since the data is coming from a library catalogue, I am translating a subject field to make a subject facet. That facet alone is the biggest, hovering near 39k. If I remove this facet.field, things return faster. So am I to assume that this particular field is bogging down operations and there are no other optimization options besides cutting down this field?

Thanks!
-Andrew

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 07, 2007 11:22 AM
To: solr-user@lucene.apache.org
Subject: Re: facet optimizing

How many unique values do you have for those 6 fields? And are those fields multiValued or not? Single valued facets are much faster (though not realistic in my domain). Lots of values per field do not good facets make.

Erik

On Feb 7, 2007, at 11:10 AM, Gunther, Andrew wrote:
> Any suggestions on how to optimize the loading of facets? My index is roughly 35,000 documents and I am asking Solr to return six facet fields on every query. On large result sets, searching is zippy with the facet params set to false, but when they are set to true and facet fields are designated, it takes some time to load. I've tried adjusting some/all of the Common Cache Configuration Parameters in the config but haven't gotten any better result times. Any suggestions?
>
> Thanks,
> Andrew
cache warming optimization
I'm interested in improving my existing custom cache warming by being selective about what to update rather than rebuilding completely.

How can I tell what documents were updated/added/deleted going from the old cache to the new IndexSearcher?

Thanks,
Erik
Re: crawler feed?
Hi: Just want to say that my tiny experiment with Sami's Solr/Nutch integration worked :-) Super thanks for the pointer. Which leads me to write the following..

It would be great if I could use this in my current project. This way I can eliminate my current python-based aggregator/crawler, which was used to submit docs to Solr. This solution works, but the crawler is not as robust as I wanted it to be. As far as I understand, SOLR-20 seems to be good to go for trunk, no? So I am lobbying for SOLR-20 :-)

Cheers

On 2/7/07, rubdabadub <[EMAIL PROTECTED]> wrote:
> This is really interesting. You mean to say I could give the patch a try now, i.e. the patch in the blog post :-) I am looking forward to it. I hope it will be standalone, i.e. you don't need "the whole nutch" to get a standalone crawler working.. I am not sure if this is how you planned it.
>
> Regards
>
> On 2/7/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> > rubdabadub wrote:
> > > Hi: Are there relatively stand-alone crawlers that are suitable/customizable for Solr? Has anyone done any trials? I have seen some discussion about a cocoon crawler.. was that successful?
> >
> > There's also an integration path available for Nutch [1] that I plan to integrate after 0.9.0 is out.
> >
> > --
> > Sami Siren
> >
> > [1] http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
Re: cache warming optimization
On 2/7/07 10:04 AM, "Erik Hatcher" <[EMAIL PROTECTED]> wrote:
> I'm interested in improving my existing custom cache warming by being selective about what to update rather than rebuilding completely.
>
> How can I tell what documents were updated/added/deleted going from the old cache to the new IndexSearcher?

We could add a system-maintained timestamp field. LDAP has that.

Knowing which documents were added or changed doesn't actually work for this, because the new or changed documents might now match queries that they didn't match before. Add a term to a document, and it shows up in new queries. Those queries need to be re-run.

In order to selectively warm, you need to know which terms changed. Build a set of all terms in documents before they are updated and all terms from the new documents. Then extract the terms from each query. If a query has any term that is in the set from the document changes, that query must be re-run.

We used to do something similar manually for stemmer dictionary changes. The same would be necessary for changes to protwords.txt. Search for the old and new forms, and reindex only the matching documents. This is very efficient for stemmer changes, but I'm not sure how well it would work for document changes.

If your documents are a good match to your queries (and I hope they are), a few changes could match many queries, and then you are back to a full re-warm.

wunder
--
Walter Underwood
Search Guru, Netflix
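[A toy Java sketch of the selection rule Walter describes, with plain strings standing in for terms and queries and nothing Solr-specific: a cached query only needs re-running if one of its terms shows up in the union of terms from the old and new versions of the changed documents.]

import java.util.*;

/** Toy model: decide which cached queries to re-warm after an update. */
public class SelectiveWarming {

    /** Union of terms from the old and new versions of every changed document. */
    static Set<String> changedTerms(List<Set<String>> oldDocs, List<Set<String>> newDocs) {
        Set<String> changed = new HashSet<>();
        oldDocs.forEach(changed::addAll);
        newDocs.forEach(changed::addAll);
        return changed;
    }

    /** A cached query must be re-run if any of its terms appears in the changed-term set. */
    static List<String> queriesToRerun(Map<String, Set<String>> cachedQueryTerms,
                                       Set<String> changed) {
        List<String> rerun = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : cachedQueryTerms.entrySet()) {
            if (!Collections.disjoint(e.getValue(), changed)) {
                rerun.add(e.getKey());
            }
        }
        return rerun;
    }

    public static void main(String[] args) {
        // One document gained the term "thriller"; its old version only had "drama".
        Set<String> changed = changedTerms(
            List.of(Set.of("drama", "1998")),
            List.of(Set.of("drama", "thriller", "1998")));

        Map<String, Set<String>> cached = Map.of(
            "q=thriller", Set.of("thriller"),
            "q=comedy",   Set.of("comedy"));

        // Prints [q=thriller]: only queries touching a changed term need re-warming.
        System.out.println(queriesToRerun(cached, changed));
    }
}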
Re: facet optimizing
Gunther, Andrew wrote:
> Yes, almost all terms are multi-valued, which I can't avoid. Since the data is coming from a library catalogue, I am translating a subject field to make a subject facet. That facet alone is the biggest, hovering near 39k. If I remove this facet.field, things return faster. So am I to assume that this particular field is bogging down operations and there are no other optimization options besides cutting down this field?

Andrew, I haven't yet found a successful way to implement the Solr faceting for library catalog data. I developed my own system: for every query, I first query 20 records. Let's say it finds 1000 records and returns the 20 records. Then I make a second query returning all 1000 records and build my own facets based on those 1000 records. It's a bit faster than using Solr's faceting system, but as you said, for large result sets it still takes a bit of time. I implemented it using AJAX so it doesn't slow down the loading of the page.

I'd be curious if anyone has been able to find a better way using Solr's faceting system.

Andrew
Re: facet optimizing
On 2/7/07, Gunther, Andrew <[EMAIL PROTECTED]> wrote:
> Yes, almost all terms are multi-valued, which I can't avoid. Since the data is coming from a library catalogue, I am translating a subject field to make a subject facet. That facet alone is the biggest, hovering near 39k. If I remove this facet.field, things return faster. So am I to assume that this particular field is bogging down operations and there are no other optimization options besides cutting down this field?

Well, the applicable optimizations probably will be related to how you use the results. Surely you are not displaying 39,000 facet counts to the user?

If you are only displaying the top subjects, one solution is to collect more documents than you need, enumerate the subjects of the results, and only facet on that subset. This could be built into Solr eventually (some sort of sampling bitset intersection size).

-Mike
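[A hedged sketch of that sampling idea: take the stored subject values from the first N hits, rank them, and only compute real facet counts for that short list afterwards. These are plain Java stand-ins, not Solr's SimpleFacets code, and the subject strings are invented.]

import java.util.*;
import java.util.stream.Collectors;

/** Toy sketch: pick candidate facet values by sampling the top of the result set. */
public class SampledFacets {

    /** docSubjects holds, for each hit in rank order, the multi-valued "subject" values. */
    static List<String> candidateSubjects(List<List<String>> docSubjects,
                                          int sampleSize, int maxFacets) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> subjects : docSubjects.subList(0, Math.min(sampleSize, docSubjects.size()))) {
            for (String s : subjects) {
                counts.merge(s, 1, Integer::sum);
            }
        }
        // Keep only the subjects that actually occur near the top of the results; exact facet
        // counts are then computed for this short list only, against the full result set.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(maxFacets)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> hits = Arrays.asList(
                Arrays.asList("Body, Human", "Anatomy"),
                Arrays.asList("Body, Human--Social aspects"),
                Arrays.asList("Anatomy"),
                Arrays.asList("Body, Human"));
        System.out.println(candidateSubjects(hits, 100, 2));   // e.g. [Anatomy, Body, Human]
    }
}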
Re: cache warming optimization
: I'm interested in improving my existing custom cache warming by being
: selective about what to update rather than rebuilding completely.
:
: How can I tell what documents were updated/added/deleted going from the old
: cache to the new IndexSearcher?

Cache warming in Solr is based mainly around the idea of "what keys were in the old cache?" rather than "what's changed?" ... because regardless of what updates may have happened, wholesale docid shifts might have taken place. Of course, if you are dealing with a custom cache where the values aren't DocSets or DocLists but your own custom objects that don't know about individual docIds, this doesn't really affect you as much.

I'm not entirely sure I understand your situation, but one trick Yonik found that really improved the cache warming in a custom CacheRegenerator I had was in dealing with big metadata documents that I was parsing into objects for use in a custom request handler. He pointed out that if I put the Lucene Document in my CacheValue objects, then when warming my newCache I could do a search on the newSearcher, get the Document back, and if it was the same as the Document in the value from my oldCache I could copy it wholesale instead of redoing all of the parsing (this was complicated by Document not supporting equals, but you get the idea).

I suppose to try and make CacheRegenerators' lives easier, we could expose the SolrIndexSearcher used with the oldCache -- but I'm still not sure how useful that would be ... "diffing" two IndexSearchers isn't very easy, but I suppose in some cases comparing the TermEnums for some fields (like the uniqueKey field for example) might be helpful.

-Hoss
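[For what it's worth, a rough sketch of that copy-instead-of-reparse trick, written against the CacheRegenerator interface as it looked in Solr around that time. The MyValue class, the "id" key field, and the stored "version_s" field are all hypothetical; comparing a cheap stored version field is one way to sidestep the Document-equals problem Hoss mentions.]

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.solr.search.CacheRegenerator;
import org.apache.solr.search.SolrCache;
import org.apache.solr.search.SolrIndexSearcher;

/** Hypothetical cached value: expensive to build, so copy it forward when the doc is unchanged. */
class MyValue {
    final String version;          // value of a stored version field when this object was parsed
    final Object parsedMetadata;   // the expensive-to-build part
    MyValue(String version, Object parsedMetadata) {
        this.version = version;
        this.parsedMetadata = parsedMetadata;
    }
}

public class CopyingRegenerator implements CacheRegenerator {
    public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                  SolrCache newCache, SolrCache oldCache,
                                  Object oldKey, Object oldVal) throws IOException {
        MyValue old = (MyValue) oldVal;
        int docId = newSearcher.getFirstMatch(new Term("id", (String) oldKey));
        if (docId < 0) {
            return true;                           // document is gone; nothing to regenerate for this key
        }
        Document doc = newSearcher.doc(docId);
        String version = doc.get("version_s");
        if (version != null && version.equals(old.version)) {
            newCache.put(oldKey, old);             // unchanged: copy wholesale, skip re-parsing
        } else {
            newCache.put(oldKey, parse(doc, version));  // changed: rebuild the expensive object
        }
        return true;                               // keep warming the remaining entries
    }

    private MyValue parse(Document doc, String version) {
        // stand-in for the real (expensive) parsing of the metadata document
        return new MyValue(version, doc.get("text"));
    }
}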
Re: facet optimizing
: Andrew, I haven't yet found a successful way to implement the Solr
: faceting for library catalog data. I developed my own system, so for

Just to clarify: the "out of the box" faceting support Solr has at the moment is very deliberately referred to as "SimpleFacets" ... it's intended to solve simple problems where you want facets based on all of the values in a field, or on specific hardcoded queries. It was primarily written as a demonstration of what is possible when writing a custom SolrRequestHandler.

When you start talking about really large data sets, with an extremely large volume of unique field values for fields you want to facet on, then "generic" solutions stop being very feasible, and you have to start looking at solutions more tailored to your dataset.

At CNET, when dealing with product data, we don't make any attempt to use the SimpleFacets support Solr provides to facet on things like Manufacturer or Operating System, because enumerating through every Manufacturer in the catalog on every query would be too expensive -- instead we have structured metadata that drives the logic: only compute the constraint counts for this subset of manufacturers when looking at the Desktops category, only look at the Operating System facet when in these categories, etc. Rules like these need to be defined based on your user experience, and it can be easy to build them using the metadata in your index -- but they really need to be precomputed, not calculated on the fly every time.

For something like a library system, where you might want to facet on Author but you have way too many authors for that to be practical, a system that either requires a category to be picked first (allowing you to constrain the list of authors you need to worry about) or precomputes the top 1000 authors for displaying initially (when the user hasn't provided any other constraints) are examples of the types of things a RequestHandler Solr plugin might do -- but the logic involved would probably be domain specific.

-Hoss
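[A tiny hedged illustration of that "precomputed rules" idea: the facet fields worth computing are looked up per category instead of enumerating every value of every field on every request. The category and field names are invented; in Solr, the chosen fields would simply become facet.field parameters on the query.]

import java.util.List;
import java.util.Map;

/** Toy rule table: which facets are worth computing for which category. */
public class FacetRules {
    // In practice this would be driven by indexed metadata or config, not hardcoded.
    private static final Map<String, List<String>> FACETS_BY_CATEGORY = Map.of(
        "desktops", List.of("manufacturer", "operating_system", "price_range"),
        "books",    List.of("author_top1000", "subject_tier1"));

    /** Only these fields get facet counts computed for a query in this category. */
    public static List<String> facetFieldsFor(String category) {
        return FACETS_BY_CATEGORY.getOrDefault(category, List.of());
    }

    public static void main(String[] args) {
        // e.g. turn this into "&facet=true&facet.field=manufacturer&facet.field=operating_system..."
        System.out.println(facetFieldsFor("desktops"));
    }
}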
Re: cache warming optimization
On Feb 7, 2007, at 19:04, Erik Hatcher wrote:
> I'm interested in improving my existing custom cache warming by being selective about what to update rather than rebuilding completely.

I know it is not Solr, but I've made great progress on my cache that updates affected results only, on insert and delete. It's available in LUCENE-550, and is based on the InstantiatedIndex and NotifiableIndex available in the same patch. Java 1.5.

Perhaps that is something you can take a look at for some ideas.

--
karl
Re: facet optimizing
On 2/7/07, Gunther, Andrew <[EMAIL PROTECTED]> wrote:
> Any suggestions on how to optimize the loading of facets? My index is roughly 35,000

35,000 documents? That's not that big.

> and I am asking Solr to return six facet fields on every query. On large result sets, searching is zippy with the facet params set to false, but when they are set to true and facet fields are designated, it takes some time to load. I've tried adjusting some/all of the Common Cache Configuration Parameters in the config but haven't gotten any better result times. Any suggestions?

The first time you do these facet requests, they should be slower. After that, they should be somewhat faster. Solr relies on the filter cache for faceting, and if it's not big enough you're going to get a near 0% hit rate. Check the statistics page and make sure there aren't any evictions after you do a query with facets. If there are, make the cache larger.

-Yonik
Re: facet optimizing
On Feb 7, 2007, at 4:42 PM, Yonik Seeley wrote:
> Solr relies on the filter cache for faceting, and if it's not big enough you're going to get a near 0% hit rate. Check the statistics page and make sure there aren't any evictions after you do a query with facets. If there are, make the cache larger.

Yonik - thanks! I was too deep into other things to worry about the slowness of massive multiValued facets, mainly because I was going to use the mess of all those nasty values we have in typical library data to push back and have it cleaned up.

But, I just adjusted my filter cache settings and my responses went from 2000+ ms to 85 ms! Now it takes longer to render the pie charts than it does to get the results back :)

Erik
Re: crawler feed?
On Wed, 2007-02-07 at 18:03 +0200, Sami Siren wrote:
> rubdabadub wrote:
> > Hi: Are there relatively stand-alone crawlers that are suitable/customizable for Solr? Has anyone done any trials? I have seen some discussion about a cocoon crawler.. was that successful?
>
> There's also an integration path available for Nutch [1] that I plan to integrate after 0.9.0 is out.

Sounds very nice; I just finished reading it. Thanks.

Today I submitted a proposal for an Apache Labs project called Apache Druids.
http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser

The basic idea is to create a flexible crawler framework. The core should be a simple crawler which could be easily extended by plugins. So if a project/app needs special processing for a crawled URL, one could write a plugin to implement the functionality.

salu2

> --
> Sami Siren
>
> [1] http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
--
Thorsten Scherler                         thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions
Re: facet optimizing
Are there any simple automatic tests we can run to see what fields would support fast faceting? Is it just that the cache size needs to be bigger than the number of distinct values for a field?

If so, it would be nice to add an /admin page that lists each field, the distinct value count, and a green/red box showing whether the current configuration will facet quickly. This could be a good place to suggest a good configuration for the data in your index.

ryan
Re: facet optimizing
: Is it just that the cache size needs to be bigger than the number of
: distinct values for a field?

Basically yes, but the cache is going to be used for all filters -- not just those for a single facet (so your cache might be big enough that faceting on fieldA or fieldB alone is fine, but if you facet on both you'll get performance pains).

: If so, it would be nice to add an /admin page that lists each field,
: the distinct value count, and a green/red box showing whether the current
: configuration will facet quickly. This could be a good place to
: suggest a good configuration for the data in your index.

That reminds me of an idea I had way back when for a Solr sanity checker tool that would inspect your schema, index, and logs of sample queries and point out things that didn't seem to make any sense, i.e.: String fields that don't omitNorms, sorting on a tokenized field, or a non-sortable int/float field that was multiValued but didn't seem like it needed to be (because every doc only had one value), etc.

-Hoss
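[Not the admin page Ryan asks about, but a hedged sketch of the check itself using the Lucene 2.x-era TermEnum API (the API Solr was built on at the time): count the distinct indexed terms per field and compare against the configured filterCache size. The index path, cache size, and field names are passed in as arguments; nothing here reads solrconfig.xml.]

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

/** Rough facet sanity check: will the filterCache hold one entry per term of each facet field? */
public class FacetCacheCheck {
    public static void main(String[] args) throws Exception {
        String indexDir = args[0];
        int filterCacheSize = Integer.parseInt(args[1]);   // the configured filterCache size
        IndexReader reader = IndexReader.open(indexDir);   // Lucene 2.x-era call
        try {
            for (int i = 2; i < args.length; i++) {
                String field = args[i];
                long distinct = countDistinctTerms(reader, field);
                String verdict = distinct <= filterCacheSize ? "OK" : "cache too small";
                System.out.println(field + ": " + distinct + " distinct values -> " + verdict);
            }
        } finally {
            reader.close();
        }
    }

    static long countDistinctTerms(IndexReader reader, String field) throws Exception {
        long count = 0;
        TermEnum terms = reader.terms(new Term(field, ""));      // positioned at first term of the field
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) break; // walked past this field's terms
                count++;
            } while (terms.next());
        } finally {
            terms.close();
        }
        return count;
    }
}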
RE: facet optimizing
In the library subject heading context, I wonder if a layered approach would bring performance into the acceptable range. Since Library of Congress Subject Headings break into standard parts, you could have first-tier facets representing the main heading, second-tier facets with the main heading and first subdivision, etc. So to extract the subject headings from a given result set, you'd first test all the first-tier facets like "Body, Human", and then, where warranted, test the associated second-tier facets like "Body, Human--Social aspects". If the first-tier facets represent a small enough subset of the set of subject headings as a whole, that might be enough to reduce the total number of facet tests.

I'm told by our metadata librarian, by the way, that there are 280,000 subject headings defined in LCSH at the moment (including cross-references), so that provides a rough upper limit on distinct values...

Peter

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 07, 2007 2:02 PM
To: solr-user@lucene.apache.org
Subject: Re: facet optimizing

: Andrew, I haven't yet found a successful way to implement the Solr
: faceting for library catalog data. I developed my own system, so for

Just to clarify: the "out of the box" faceting support Solr has at the moment is very deliberately referred to as "SimpleFacets" ... it's intended to solve simple problems where you want facets based on all of the values in a field, or on specific hardcoded queries. It was primarily written as a demonstration of what is possible when writing a custom SolrRequestHandler.

When you start talking about really large data sets, with an extremely large volume of unique field values for fields you want to facet on, then "generic" solutions stop being very feasible, and you have to start looking at solutions more tailored to your dataset.

At CNET, when dealing with product data, we don't make any attempt to use the SimpleFacets support Solr provides to facet on things like Manufacturer or Operating System, because enumerating through every Manufacturer in the catalog on every query would be too expensive -- instead we have structured metadata that drives the logic: only compute the constraint counts for this subset of manufacturers when looking at the Desktops category, only look at the Operating System facet when in these categories, etc. Rules like these need to be defined based on your user experience, and it can be easy to build them using the metadata in your index -- but they really need to be precomputed, not calculated on the fly every time.

For something like a library system, where you might want to facet on Author but you have way too many authors for that to be practical, a system that either requires a category to be picked first (allowing you to constrain the list of authors you need to worry about) or precomputes the top 1000 authors for displaying initially (when the user hasn't provided any other constraints) are examples of the types of things a RequestHandler Solr plugin might do -- but the logic involved would probably be domain specific.

-Hoss
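[A small Java sketch of that layered test, with plain BitSets standing in for Solr's cached filters; the headings and the cutoff are invented. The point is just the control flow: every main heading gets one intersection, and the subdivided "Heading--Subdivision" entries are only intersected for main headings that matched enough documents.]

import java.util.*;

/** Toy two-tier subject facets: only drill into "Heading--Subdivision" when "Heading" matched enough docs. */
public class TieredSubjectFacets {

    static String firstTier(String heading) {
        int i = heading.indexOf("--");
        return i < 0 ? heading : heading.substring(0, i);
    }

    /** headingDocs maps a full heading to the docs containing it; results is the current result set. */
    static Map<String, Integer> facet(Map<String, BitSet> headingDocs, BitSet results, int cutoff) {
        // Tier 1: union the doc sets of every heading sharing a main heading, one intersection each.
        Map<String, BitSet> tier1 = new HashMap<>();
        headingDocs.forEach((h, docs) ->
                tier1.computeIfAbsent(firstTier(h), k -> new BitSet()).or(docs));

        Map<String, Integer> counts = new LinkedHashMap<>();
        tier1.forEach((main, union) -> {
            BitSet hit = (BitSet) union.clone();
            hit.and(results);
            if (hit.cardinality() < cutoff) return;   // first tier too small: skip its subdivisions
            counts.put(main, hit.cardinality());
            // Tier 2: only now intersect the individual subdivided headings under this main heading.
            headingDocs.forEach((h, docs) -> {
                if (h.contains("--") && firstTier(h).equals(main)) {
                    BitSet sub = (BitSet) docs.clone();
                    sub.and(results);
                    if (sub.cardinality() > 0) counts.put(h, sub.cardinality());
                }
            });
        });
        return counts;
    }

    static BitSet bits(int... ids) {
        BitSet b = new BitSet();
        for (int i : ids) b.set(i);
        return b;
    }

    public static void main(String[] args) {
        Map<String, BitSet> headings = new HashMap<>();
        headings.put("Body, Human", bits(0, 1, 2));
        headings.put("Body, Human--Social aspects", bits(1, 2));
        headings.put("Cookery", bits(3));
        // Only "Body, Human" clears the cutoff, so only its subdivisions get tested.
        System.out.println(facet(headings, bits(0, 1, 2, 3), 2));
    }
}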
RE: facet optimizing
: headings from a given result set, you'd first test all the first-tier
: facets like "Body, Human", then where warranted test the associated
: second-tier facets like "Body, Human--Social aspects". If the
: first-tier facets represent a small enough subset of the set of subject
: headings as a whole, that might be enough to reduce the total number of
: facet tests.

That's exactly the type of thing I'm suggesting -- the trick would be to size your caches so that the first-tier constraints were pretty much always cached, and the popular second-tier constraints were usually cached -- but once you get to the second or third tiers, the number of possible constraints is small enough that even if they aren't cached, you can compute their counts in a reasonable amount of time.

A really cache-conscious RequestHandler could even use the non-caching SolrIndexSearcher methods if it knew it was dealing with a really low-tier constraint (although at that point, spending the time to implement a Cache implementation with an approximate LFU replacement strategy instead of LRU would probably be a more robust use of engineering resources).

-Hoss
Re: facet optimizing
Hi:

> When you start talking about really large data sets, with an extremely large volume of unique field values for fields you want to facet on, then "generic" solutions stop being very feasible, and you have to start looking at solutions more tailored to your dataset.
>
> At CNET, when dealing with product data, we don't make any attempt to use the SimpleFacets support Solr provides to facet on things like Manufacturer or Operating System, because enumerating through every Manufacturer in the catalog on every query would be too expensive -- instead we have structured metadata that drives the logic: only compute the constraint counts for this subset of manufacturers when looking at the Desktops category, only look at the Operating System facet when in these categories, etc. Rules like these need to be defined based on your user experience, and it can be easy to build them using the metadata in your index -- but they really need to be precomputed, not calculated on the fly every time.

Sounds interesting. Could you please provide an example of how one would go about doing a precomputed query?

> For something like a library system, where you might want to facet on Author but you have way too many authors for that to be practical, a system that either requires a category to be picked first (allowing you to constrain the list of authors you need to worry about) or precomputes the top 1000 authors for displaying initially (when the user hasn't provided any other constraints) are examples of the types of things a RequestHandler Solr plugin might do -- but the logic involved would probably be domain specific.

Specifically here, without getting any user constraints, how would one do this? I thought facets need to have user constraints?

I would appreciate your feedback. Thanks

> -Hoss
Re: facet optimizing
On 2/7/07, Binkley, Peter <[EMAIL PROTECTED]> wrote:
> In the library subject heading context, I wonder if a layered approach would bring performance into the acceptable range. Since Library of Congress Subject Headings break into standard parts, you could have first-tier facets representing the main heading, second-tier facets with the main heading and first subdivision, etc. So to extract the subject headings from a given result set, you'd first test all the first-tier facets like "Body, Human", and then, where warranted, test the associated second-tier facets like "Body, Human--Social aspects".

Yes... we've had discussions about hierarchical facets in the past, but more focused on organization/presentation than performance:
http://www.nabble.com/Hierarchical-Facets--tf2560327.html#a7135353

Which got me thinking... if we could use hierarchical facets to speed up faceting, then we should also be able to use the same type of strategy for non-hierarchical facets! We could create a facet-tree, where sets at parent nodes are the union of the child sets. This should allow one to more quickly zoom in on where the higher counts are concentrated, without necessitating storing all the facets. One could control the space/time tradeoff with the branching factor of the tree.

-Yonik
Re: facet optimizing
Yonik - I like the way you think. Yeah! It's turtles (err, trees) all the way down.

Erik

/me pulls the Algorithms book off my shelf so I can vaguely follow along.

On Feb 7, 2007, at 8:22 PM, Yonik Seeley wrote:
> On 2/7/07, Binkley, Peter <[EMAIL PROTECTED]> wrote:
> > In the library subject heading context, I wonder if a layered approach would bring performance into the acceptable range. Since Library of Congress Subject Headings break into standard parts, you could have first-tier facets representing the main heading, second-tier facets with the main heading and first subdivision, etc. So to extract the subject headings from a given result set, you'd first test all the first-tier facets like "Body, Human", and then, where warranted, test the associated second-tier facets like "Body, Human--Social aspects".
>
> Yes... we've had discussions about hierarchical facets in the past, but more focused on organization/presentation than performance:
> http://www.nabble.com/Hierarchical-Facets--tf2560327.html#a7135353
>
> Which got me thinking... if we could use hierarchical facets to speed up faceting, then we should also be able to use the same type of strategy for non-hierarchical facets! We could create a facet-tree, where sets at parent nodes are the union of the child sets. This should allow one to more quickly zoom in on where the higher counts are concentrated, without necessitating storing all the facets. One could control the space/time tradeoff with the branching factor of the tree.
>
> -Yonik
Re: facet optimizing
On 2/7/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> Yonik - I like the way you think. Yeah! It's turtles (err, trees) all the way down.

Heh... I'm still thinking/brainstorming about it... it only helps if you can effectively prune though.

Each node in the tree could also keep the max docfreq of any of its children. If either the facet count or the max docfreq for a node is less than the minimum facet count in your priority queue, then you don't have to expand that node.

One other piece of info you can glean: if your branching factor is 10, and the facet count for a node is 430, then you know that you are guaranteed at least a count of 43 at the next level. I'm not sure how one could use that info though (beyond just picking the highest node anyway). The facet counts at nodes could be used to direct where you expand first, but can't really be used to prune like the docfreq can.

Pruning won't work well when the base docset size is small... but if it's small enough, we can use other strategies for that. It also won't work well for fields where higher docfreqs are rare... author *might* be a case of that.

If we are making our own hierarchy, we can order it to be more effective. For instance, the node containing terms 1-10 doesn't have to be a sibling of the node with terms 11-20... the hard part is figuring out how we want to arrange them. At the leaves, perhaps putting nodes with high maxDf together??? That might make it more likely you could prune off other nodes quicker. It seems like one would want to minimize set overlap (i.e. minimize the total cardinality of all the sets at any one level), but that doesn't seem easy to do in a timely manner. So, off the top of my head right now, I guess maybe just sort the leaves by maxDf and build the rest of the tree on top of that.

Thinking all this stuff up from scratch seems like the hard way... Does anyone know how other people have implemented this stuff?

-Yonik
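[To make the brainstorming concrete, here is one hedged reading of it in plain Java, with nothing taken from Solr: leaves are terms with their doc sets, parents hold the union of their children plus the max leaf docfreq, and a branch is skipped when neither its intersected count nor its maxDf can beat the current k-th best facet count. Ties are pruned too, which a real implementation might not want.]

import java.util.*;

/** Toy facet tree: parents hold the union of child doc sets plus the max child docfreq, for pruning. */
public class FacetTree {

    static class Node {
        final String term;                       // non-null only at leaves
        final BitSet docs;                       // union of all leaf doc sets below this node
        final int maxDf;                         // largest leaf docfreq below this node
        final List<Node> children = new ArrayList<>();

        Node(String term, BitSet docs) {         // leaf: one term and its doc set
            this.term = term;
            this.docs = docs;
            this.maxDf = docs.cardinality();
        }

        Node(List<Node> kids) {                  // internal node built from its children
            this.term = null;
            this.docs = new BitSet();
            int df = 0;
            for (Node k : kids) {
                docs.or(k.docs);
                df = Math.max(df, k.maxDf);
                children.add(k);
            }
            this.maxDf = df;
        }
    }

    /** Top-k facet counts for a result set, expanding only branches that could still beat the k-th best. */
    static List<Map.Entry<String, Integer>> topFacets(Node root, BitSet results, int k) {
        PriorityQueue<Map.Entry<String, Integer>> best =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());  // min-heap of current top-k
        collect(root, results, k, best);
        List<Map.Entry<String, Integer>> out = new ArrayList<>(best);
        out.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return out;
    }

    private static void collect(Node n, BitSet results, int k,
                                PriorityQueue<Map.Entry<String, Integer>> best) {
        BitSet hit = (BitSet) n.docs.clone();
        hit.and(results);
        int count = hit.cardinality();
        int floor = best.size() < k ? 0 : best.peek().getValue();
        // Prune: neither the union's intersected count nor any single child's docfreq can beat the floor.
        if (count <= floor || n.maxDf <= floor) return;
        if (n.term != null) {                    // leaf: the intersected count is the exact facet count
            best.add(Map.entry(n.term, count));
            if (best.size() > k) best.poll();
            return;
        }
        for (Node child : n.children) collect(child, results, k, best);
    }

    static BitSet bits(int... ids) {
        BitSet s = new BitSet();
        for (int i : ids) s.set(i);
        return s;
    }

    public static void main(String[] args) {
        // Three terms; "a" and "b" grouped under one parent, "c" under another (branching factor 2).
        Node a = new Node("a", bits(0, 1, 2)), b = new Node("b", bits(2)), c = new Node("c", bits(1, 2, 3));
        Node root = new Node(List.of(new Node(List.of(a, b)), new Node(List.of(c))));
        System.out.println(topFacets(root, bits(0, 1, 2), 2));   // [a=3, c=2]
    }
}

[The branching factor and the ordering of the leaves, e.g. sorting them by maxDf as suggested above, control how much actually gets pruned.]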