Hello, I'm using faceted search (perhaps in a dumb way) to collect some statistics for my index. I have documents in various languages, one of the fields is "language", and I simply want to see how many documents I have for each language. I have noticed that the search builds an int[maxDoc] array and then traverses the array to count. If facet.method=enum (which I discovered later) is used, the documents are still counted, just in a different way. But in this case, where all the documents are retrieved, the information is already available in the Lucene index. So I think it would be a good optimization to detect these cases (i.e. no filtering) and just return the number from the index instead of counting the docs again.
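To illustrate the shortcut I have in mind, here is a toy sketch in plain Java. It is not Solr or Lucene code: the class, the method and the two arrays are made-up stand-ins for the per-document value ids and for the per-term document counts kept in the index. As far as I understand, the counts stored in a real index can still include deleted documents, so a real implementation would need to handle that.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Toy model of faceting on a single-valued field; hypothetical names, not Solr/Lucene code. */
final class FacetCountSketch {

    /**
     * @param ordPerDoc      value id of the field for each document (the int[maxDoc] array)
     * @param docFreqPerTerm number of documents per value id, as kept in the term dictionary
     * @param filter         matching documents, or null when nothing is filtered out
     */
    static int[] facetCounts(int[] ordPerDoc, int[] docFreqPerTerm, BitSet filter) {
        if (filter == null) {
            // Unfiltered case: the answer is already stored, no per-document pass needed.
            return docFreqPerTerm.clone();
        }
        // Filtered case: count only the matching documents.
        int[] counts = new int[docFreqPerTerm.length];
        for (int doc = filter.nextSetBit(0); doc >= 0; doc = filter.nextSetBit(doc + 1)) {
            counts[ordPerDoc[doc]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] ordPerDoc = {0, 1, 0, 2, 0, 1}; // 0 = en, 1 = fr, 2 = de
        int[] docFreq = {3, 2, 1};            // documents per language

        // No filter: counts come straight from the stored statistics.
        System.out.println(Arrays.toString(facetCounts(ordPerDoc, docFreq, null)));

        // With a filter: only documents 0, 1 and 3 are counted.
        BitSet filter = new BitSet();
        filter.set(0);
        filter.set(1);
        filter.set(3);
        System.out.println(Arrays.toString(facetCounts(ordPerDoc, docFreq, filter)));
    }
}
```

The point is only the null-filter branch: when every document matches, the per-document pass and the int[maxDoc] array are not needed at all.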
Another issue: there is currently no way to disable the caching of the int[maxDoc], is there? If there are many fields to be faceted, this can quickly lead to out-of-memory situations. I think it would be good to give the option (as part of the query) to disable the caching. Even if it is slow, at least it works, and it is useful for non-interactive processing.

And another possible optimization for the int[maxDoc], inspired by column-store databases: the way they do it is to find the minimum number of bits needed to represent a value. If, for example, my language field has 30 possible values (i.e. I have docs in 30 languages), I only need 5 bits for each doc (instead of int = 32 bits). Then I can represent the whole int[maxDoc] in less than 1/6 of the space required now.

What's even better, sometimes the documents can be partitioned such that not all the values of a field are represented in the same partition. For example, let's assume that I have a field called doc_generation_date. If I harvest the documents every three days, and I consider a partition to be one such three-day batch of data, then each partition will have only three possible values for doc_generation_date. That means that I only need 2 bits for each document, plus a table for each partition that maps from the partition value id (one of the three values, represented on two bits) to the index value id (that is, the id stored in the Lucene index).

Of course, for the language field above, the partitioning would not help unless I indexed only English docs, then only French, and so on in succession. It also wouldn't work just like that for multi-valued fields.

nicolae
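P.S. To make the bit-packing idea concrete, here is a rough sketch in plain Java. Everything in it (PackedOrdinals, PartitionOrdinals, the method names) is hypothetical and only illustrates the layout. It is not an existing Solr or Lucene class, and it assumes a single-valued field.

```java
/** Fixed-width bit-packed array of value ids; a hypothetical sketch, not a Lucene class. */
final class PackedOrdinals {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedOrdinals(int size, int numDistinctValues) {
        if (size < 0 || numDistinctValues < 1) {
            throw new IllegalArgumentException("bad size or value count");
        }
        // ceil(log2(n)) bits, at least 1: 30 values -> 5 bits, 3 values -> 2 bits.
        this.bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(numDistinctValues - 1));
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) size * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
    }

    int bitsPerValue() {
        return bitsPerValue;
    }

    long bytesUsed() {
        return blocks.length * 8L;
    }

    void set(int index, int value) {
        if (value < 0 || value > mask) {
            throw new IllegalArgumentException("value does not fit in " + bitsPerValue + " bits");
        }
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] = (blocks[block] & ~(mask << shift)) | ((long) value << shift);
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // The value straddles two longs: store its high bits at the start of the next one.
            int lowBits = bitsPerValue - spill;
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> lowBits)) | ((long) value >>> lowBits);
        }
    }

    int get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            v |= blocks[block + 1] << (64 - shift);
        }
        return (int) (v & mask);
    }
}

/** One partition: small local ids per doc plus a table mapping local id -> index value id. */
final class PartitionOrdinals {
    private final PackedOrdinals localOrds;
    private final int[] localToGlobal;

    PartitionOrdinals(int numDocs, int[] localToGlobal) {
        this.localToGlobal = localToGlobal.clone();
        this.localOrds = new PackedOrdinals(numDocs, localToGlobal.length);
    }

    void setLocal(int doc, int localOrd) {
        localOrds.set(doc, localOrd);
    }

    int globalOrd(int doc) {
        return localToGlobal[localOrds.get(doc)];
    }

    int bitsPerDoc() {
        return localOrds.bitsPerValue();
    }
}
```

For example, new PackedOrdinals(1000000, 30) uses 5 bits per document, which comes to 625,000 bytes instead of the 4,000,000 bytes of an int[1000000]. A PartitionOrdinals built with a three-entry table uses 2 bits per document. The price is some shifting and masking on every read, which is why I would see this as an option rather than a default.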