On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter <mathias.wal...@gmx.net> wrote:
> I indexed about 90 million sentences and the PAS (predicate argument
> structures) they consist of (which number about 500 million). Then I
> try to do NER (named entity recognition) by searching for about 5 million
> entities. For each entity I need all the search results, not just the
> top X. Since about 10 percent of the entities are highly frequent (i.e.
> there are more than 5 million hits for "human"), it takes very long to
> obtain the data from the index. "Very long" means about a day with 15
> distributed Katta nodes. Katta is just a distribution and shard
> balancing solution on top of Lucene.

If you aren't retrieving top-N results or doing ranked search, are you
sure a search engine library/server is the right tool for this job?

> Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte
> array. The size increased to 7 characters (= 14 bytes), which is still a
> gain of more than 50 percent compared to the UTF-8 encoding. BTW: I found
> no example of how to use the IndexableBinaryStringTools class except in
> the unit tests.

It is deprecated in trunk because you can index binary terms (your
own byte[]) directly if you want. To do this, you need to use a custom
AttributeFactory.

See src/test/org/apache/lucene/index/Test2BTerms or
https://issues.apache.org/jira/browse/LUCENE-2551 for examples of how
to do this.
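
On the size figures quoted above: a rough, stdlib-only sketch of the
general idea of packing bytes into chars at 15 payload bits per char.
This is not Lucene's code and does not call IndexableBinaryStringTools;
the class and method names (BinaryTermPacking, pack, unpack) are made up
for illustration, and the 15-bit layout is my assumption about how such
an encoding can work. Raw packing needs 6 chars for 11 bytes; the 7
chars you measured suggest the real class spends one more char on
bookkeeping, whereas this sketch passes the byte length separately.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Lucene.
final class BinaryTermPacking {
    private BinaryTermPacking() {}

    /** Chars needed for numBytes at 15 payload bits per char: ceil(8n / 15). */
    static int packedLength(int numBytes) {
        return (8 * numBytes + 14) / 15;
    }

    /** Packs bytes big-endian into chars, 15 bits each, so every char is <= 0x7FFF. */
    static char[] pack(byte[] in) {
        char[] out = new char[packedLength(in.length)];
        int acc = 0;   // bit accumulator
        int bits = 0;  // number of valid low bits in acc
        int o = 0;
        for (byte b : in) {
            acc = (acc << 8) | (b & 0xFF);
            bits += 8;
            if (bits >= 15) {
                bits -= 15;
                out[o++] = (char) ((acc >>> bits) & 0x7FFF);
                acc &= (1 << bits) - 1; // keep only the leftover bits
            }
        }
        if (bits > 0) {
            // Left-align the leftover bits in the final char, zero-padded.
            out[o] = (char) ((acc << (15 - bits)) & 0x7FFF);
        }
        return out;
    }

    /** Reverses pack(); the original byte count must be supplied by the caller. */
    static byte[] unpack(char[] in, int numBytes) {
        byte[] out = new byte[numBytes];
        int acc = 0;
        int bits = 0;
        int o = 0;
        for (char c : in) {
            acc = (acc << 15) | (c & 0x7FFF);
            bits += 15;
            while (bits >= 8 && o < numBytes) {
                bits -= 8;
                out[o++] = (byte) (acc >>> bits);
                acc &= (1 << bits) - 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] term = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, (byte) 0xFF};
        char[] packed = pack(term);
        System.out.println(term.length + " bytes -> " + packed.length + " chars");
        System.out.println("round trip ok: " + Arrays.equals(term, unpack(packed, term.length)));
    }
}
```

Keeping every char at or below 0x7FFF stays clear of the surrogate range,
which is presumably why an encoding like this gives up one bit per char.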
