On 9/8/2015 9:10 AM, adfel70 wrote:
> I am trying to understand why faceting on a field with lots of unique values
> has a great impact on query performance. Since Googling for Solr facet
> algorithm did not yield anything, I looked how facets are implemented in
> Lucene. I found out that there are 2 methods - taxonomy-based and
> SortedSetDocValues-based. Does Solr facet capabilities are based on one of
> those methods? if so, I still cant understand why unique values impacts
> query performance...

Lucene's facet implementation is completely separate (and different) from Solr's implementation. I am not familiar with the inner workings of either implementation. Solr implemented faceting long before Lucene did. I think *Solr* actually contains at least two different facet implementations, used for different kinds of facets.

Faceting on a field with many unique values uses a HUGE amount of heap memory, which is likely why query performance is impacted.

I have a dev system with all my indexes (each of which has dedicated hardware for production) on it. Normally it requires 15GB of heap to operate properly. Every now and then, I get asked to do a duplicate check on a field that *should* be unique, on an index with 250 million docs in it. The query that I am asked to do for the facet matches about 100 million docs. This facet query, on a field that DOES have docValues, will throw OOM if my heap is less than 27GB. The dev machine only has 32GB of RAM, so as you might imagine, performance is really terrible when I do this query. Thankfully it's a dev machine.

When I was doing these queries, it was running 4.9.1. I have since upgraded it to 5.2.1, as a proof of concept for upgrading our production indexes ... but I have not attempted the facet query since the upgrade.

Thanks,
Shawn
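
For readers who want to try a facet-based duplicate check like the one described above, here is a minimal sketch in Python. It only builds the request URL from standard Solr faceting parameters (facet.field, facet.mincount, facet.limit). The host, core name, and field name are hypothetical, and this is not necessarily the exact query used on the system described above.

```python
from urllib.parse import urlencode


def duplicate_check_params(field, query="*:*", limit=100):
    """Build Solr parameters for a facet-based duplicate check on `field`."""
    return {
        "q": query,
        "rows": 0,             # only the facet counts are needed, not documents
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 2,   # a value that occurs 2+ times is a duplicate
        "facet.limit": limit,  # cap how many duplicated values are returned
        "wt": "json",
    }


def duplicate_check_url(base_url, field, **kwargs):
    """Join a core URL and the encoded parameters into a /select request."""
    params = duplicate_check_params(field, **kwargs)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical host, core, and field names, for illustration only.
url = duplicate_check_url("http://localhost:8983/solr/mycore", "unique_id")
print(url)
```

Note that facet.limit only caps how many values come back in the response. Solr most likely still has to count across every unique value that the query matches, so this sketch should not be expected to avoid the heap cost described above.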