Re: FieldCache

Mathias Walter Mon, 25 Oct 2010 00:42:30 -0700

I don't think it is an XY problem.

I indexed about 90 million sentences and the PAS (predicate argument
structures) they consist of, about 500 million in total. Then I try to
do NER (named entity recognition) by searching for about 5 million
entities. For each entity I need all search results, not just the top
X. Since about 10 percent of the entities are highly frequent (e.g.
there are more than 5 million hits for "human"), it takes very long to
obtain the data from the index. "Very long" means about a day with 15
distributed Katta nodes. Katta is just a distribution and shard
balancing solution on top of Lucene.

Initially, I tried distributed search with Solr, but it was too slow at
retrieving large sets of documents. Then I switched to Lucene and made
some improvements. I enabled the field cache for my ID field and
another single-char field (PAS type) to get the benefit of accessing
the fields through an array. Unfortunately, the IDs are too large to
fit in memory. I gave 12 GB of RAM to each node and also tried
MMapDirectory and/or CompressedOops, but Lucene always runs out of
memory.

Then I investigated how the fields are stored. String fields are stored
in UTF-8 encoding, but my ID never contains non-ASCII characters. It
follows a numeric scheme but does not fit into a single long. I encoded
it into a byte array of 11 bytes (compared to 30 bytes in UTF-8
encoding). Then I changed the field's type in schema.xml to binary. I
still use the EmbeddedSolrServer to create the indices.
Also, I had to remove the uniqueKey node because binary fields cannot
be indexed, and being indexed is a requirement for the unique key.
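
A minimal sketch of this kind of packing, assuming (hypothetically) that
the ID can be treated as one plain decimal number of up to 26 digits.
The real ID layout is not given in this mail, so the class name, the
method names and the sample value are illustrative only:

```java
import java.math.BigInteger;

public class IdPacking {
    static final int WIDTH = 11; // fixed width in bytes, as mentioned in the mail

    // Packs a non-negative decimal ID into a fixed-width big-endian byte array.
    static byte[] pack(String decimalId) {
        BigInteger value = new BigInteger(decimalId);
        if (value.signum() < 0 || value.bitLength() > WIDTH * 8) {
            throw new IllegalArgumentException("ID does not fit into " + WIDTH + " bytes");
        }
        // toByteArray() is minimal two's complement and may carry a leading sign byte.
        byte[] raw = value.toByteArray();
        byte[] out = new byte[WIDTH];
        int n = Math.min(raw.length, WIDTH);
        // Right-align the significant bytes; the left side stays zero-padded.
        System.arraycopy(raw, raw.length - n, out, WIDTH - n, n);
        return out;
    }

    // Restores the decimal string; leading zeros of the original are not preserved.
    static String unpack(byte[] packed) {
        return new BigInteger(1, packed).toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("12345678901234567890123456"); // made-up sample ID
        System.out.println(packed.length + " bytes -> " + unpack(packed));
    }
}
```

A fixed width keeps every entry the same size, which is what makes an
array-style cache attractive in the first place.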

After reindexing I discovered that non-indexed or binary fields cannot
be used with the FieldCache.

Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte
array. The size increased to 7 characters (= 14 bytes), which is still
a saving of more than 50 percent compared to the UTF-8 encoding. BTW: I
found no example of how to use the IndexableBinaryStringTools class
except in the unit tests.
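
The 7-character figure can be reproduced with a small calculation. The
sketch below assumes that the encoder packs 15 bits into each char and
appends one trailing char; that sizing rule is an assumption here and
should be checked against the class's own getEncodedLength in the
Lucene version in use. The class and method names are hypothetical:

```java
public class EncodedLength {
    // Assumed sizing rule: 15 payload bits per char, plus one trailing char.
    static int encodedChars(int inputBytes) {
        // long intermediates guard against overflow for large inputs
        return (int) ((8L * inputBytes + 14L) / 15L) + 1;
    }

    public static void main(String[] args) {
        int chars = encodedChars(11);
        // Each Java char occupies 2 bytes in memory (UTF-16).
        System.out.println(chars + " chars, " + (chars * 2) + " bytes as UTF-16");
    }
}
```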

Unfortunately, I was not able to use it with the EmbeddedSolrServer and
the Lucene client. The IDs in the search results never matched the IDs
used to create the SolrInputDocument.

I assume that the char[] returned from IndexableBinaryStringTools.encode
is encoded in UTF-8 again and then stored. At some point information is
lost and cannot be recovered.
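
One way to narrow this assumption down outside of Solr is to check
whether a plain UTF-8 round trip on its own alters such a char[]. The
sketch below uses made-up 15-bit sample values and only exercises the
JDK's UTF-8 codec. It says nothing definitive about what Solr or Lucene
do on their write path, and the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    // Encodes the chars the way a UTF-8 based store would.
    static byte[] toUtf8(char[] chars) {
        return new String(chars).getBytes(StandardCharsets.UTF_8);
    }

    // Decodes the stored bytes back into chars.
    static char[] fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).toCharArray();
    }

    public static void main(String[] args) {
        // Made-up sample: seven chars, each holding at most 15 bits.
        char[] encoded = {0x7FFF, 0x0001, 0x4000, 0x0080, 0x0800, 0x0000, 0x0003};
        byte[] stored = toUtf8(encoded);
        char[] restored = fromUtf8(stored);
        System.out.println("UTF-8 bytes: " + stored.length);
        System.out.println("round trip identical: " + Arrays.equals(encoded, restored));
    }
}
```

Comparing the result with the values that come back from the index can
help locate the step at which the IDs start to differ.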

Recently I upgraded to trunk (4.0) and tried to use the BytesRef values
from FieldCache.DEFAULT.getTerms directly. But the bytes are encoded in
a form unknown to me and cannot be decoded with
IndexableBinaryStringTools.decode.

The question now is how to speed up retrieval of the binary field
without exhausting memory.

I also read some comments that suggest using payloads, but I have never
tried this approach. The column-stride fields approach (LUCENE-2186)
also looks promising but is not released yet.

BTW: I ran some tests with a smaller index and the ID encoded as a
string. Using the field cache improves hit retrieval dramatically (from
18 seconds down to 2 seconds per query, with a large number of
results).

--
Kind regards,
Mathias

> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Saturday, 23 October 2010 21:40
> To: solr-user@lucene.apache.org
> Subject: Re: FieldCache
> 
> Why do you want to? Basically, the caches are there to improve
> #searching#. To search something, you must index it. Retrieving
> it is usually a rare enough operation that caching is irrelevant.
> 
> This smells like an XY problem, see:
> http://people.apache.org/~hossman/#xyproblem
> 
> If this seems like gibberish, could you explain your problem
> a little more?
> 
> Best
> Erick
> 
> On Thu, Oct 21, 2010 at 10:20 AM, Mathias Walter
> <mathias.wal...@gmx.net> wrote:
> 
> > Hi,
> >
> > does a field which should be cached need to be indexed?
> >
> > I have a binary field which is just stored. Retrieving it via
> > FieldCache.DEFAULT.getTerms returns empty ByteRefs.
> >
> > Then I found the following post:
> > http://www.mail-archive.com/d...@lucene.apache.org/msg05403.html
> >
> > How can I use the FieldCache with a binary field?
> >
> > --
> > Kind regards,
> > Mathias
> >
> >
