I don't think it is an XY problem. I indexed about 90 million sentences and the PAS (predicate argument structures) they consist of (about 500 million). Then I try to do NER (named entity recognition) by searching for about 5 million entities. For each entity I need all search results, not just the top X. Since about 10 percent of the entities are highly frequent (e.g. there are more than 5 million hits for "human"), it takes very long to obtain the data from the index. "Very long" means about a day with 15 distributed Katta nodes. Katta is just a distribution and shard balancing solution on top of Lucene.
Initially, I tried distributed search with Solr, but it was too slow at retrieving a large set of documents. Then I switched to Lucene and made some improvements. I enabled the field cache for my ID field and for another single-char field (PAS type) to get the benefit of accessing the fields through an array. Unfortunately, the IDs are too large to fit in memory. I gave 12 GB of RAM to each node and also tried to use the MMapDirectory and/or CompressedOops. Lucene always ran out of memory.

Then I investigated how the fields are stored. String fields are stored in UTF-8 encoding, but my ID will never contain non-ASCII characters. It follows a numeric schema but does not fit into a single long. I encoded it into a byte array of 11 bytes (compared to 30 bytes in UTF-8 encoding). Then I changed the field definition in schema.xml to binary. I still use the EmbeddedSolrServer to create the indices. I also had to remove the uniqueKey element, because binary fields cannot be indexed, which is a requirement for the unique key. After reindexing I discovered that non-indexed or binary fields cannot be used with the FieldCache.

Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte array. The size increased to 7 characters (= 14 bytes), which is still a gain of more than 50 percent compared to the UTF-8 encoding. BTW: I found no sample of how to use the IndexableBinaryStringTools class except in the unit tests. Unfortunately, I was not able to use it with the EmbeddedSolrServer and the Lucene client. The IDs in the search results never matched the IDs used to create the SolrInputDocument. I assume that the char[] returned from IndexableBinaryStringTools.encode is encoded in UTF-8 again and then stored. At some point the information is lost and cannot be recovered.

Recently I upgraded to trunk (4.0) and tried to use the BytesRefs from FieldCache.DEFAULT.getTerms directly.
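For reference, packing a purely numeric ID into a fixed 11-byte array, as mentioned above, can be sketched roughly as follows. This is only a minimal sketch under assumptions: it treats the ID as a plain run of decimal digits (the real ID schema is not spelled out here), and the class and method names (PackedId, encode, decode) are hypothetical. Note that 11 bytes hold values below 2^88, i.e. any ID of up to 26 digits.

```java
import java.math.BigInteger;

/** Hypothetical helper: packs an all-digit ID into a fixed-width byte array. */
final class PackedId {
    /** Fixed width in bytes; 11 bytes can hold any value below 2^88. */
    static final int WIDTH = 11;

    /** Encodes a string of decimal digits as an unsigned, big-endian, 11-byte array. */
    static byte[] encode(String digits) {
        // Assumption: the ID consists of ASCII digits only.
        if (digits.isEmpty() || !digits.chars().allMatch(c -> c >= '0' && c <= '9')) {
            throw new IllegalArgumentException("ID must consist of decimal digits: " + digits);
        }
        BigInteger value = new BigInteger(digits);
        if (value.bitLength() > WIDTH * 8) {
            throw new IllegalArgumentException("ID does not fit into " + WIDTH + " bytes");
        }
        // toByteArray() is big-endian and may carry one extra leading sign byte (0x00).
        byte[] raw = value.toByteArray();
        byte[] out = new byte[WIDTH];
        int len = Math.min(raw.length, WIDTH);
        // Right-align the magnitude; dropping the sign byte is safe after the bitLength check.
        System.arraycopy(raw, raw.length - len, out, WIDTH - len, len);
        return out;
    }

    /** Decodes the packed bytes back into the decimal representation (no leading zeros). */
    static String decode(byte[] packed) {
        // Signum 1 makes BigInteger read the bytes as an unsigned magnitude.
        return new BigInteger(1, packed).toString();
    }
}
```

Leading zeros are not preserved by this sketch; if the ID schema has a fixed digit count, the padding would have to be re-applied when decoding.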
However, the bytes returned by FieldCache.DEFAULT.getTerms on trunk are encoded in a form unknown to me and cannot be decoded with IndexableBinaryStringTools.decode.

The question now is: how can I speed up retrieval of the binary field without exhausting memory?

I also read some comments that suggest using payloads, but I have never tried this approach. The column-stride fields approach (LUCENE-2186) also looks promising but is not released yet.

BTW: I made some tests with a smaller index and the ID encoded as a string. Using the field cache improves hit retrieval dramatically (from 18 seconds down to 2 seconds per query, with a large number of results).

--
Kind regards,
Mathias

> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Saturday, 23 October 2010 21:40
> To: solr-user@lucene.apache.org
> Subject: Re: FieldCache
>
> Why do you want to? Basically, the caches are there to improve
> #searching#. To search something, you must index it. Retrieving
> it is usually a rare enough operation that caching is irrelevant.
>
> This smells like an XY problem, see:
> http://people.apache.org/~hossman/#xyproblem
>
> If this seems like gibberish, could you explain your problem
> a little more?
>
> Best
> Erick
>
> On Thu, Oct 21, 2010 at 10:20 AM, Mathias Walter
> <mathias.wal...@gmx.net> wrote:
>
> > Hi,
> >
> > does a field which should be cached needs to be indexed?
> >
> > I have a binary field which is just stored. Retrieving it via
> > FieldCache.DEFAULT.getTerms returns empty ByteRefs.
> >
> > Then I found the following post:
> > http://www.mail-archive.com/d...@lucene.apache.org/msg05403.html
> >
> > How can I use the FieldCache with a binary field?
> >
> > --
> > Kind regards,
> > Mathias