On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter <mathias.wal...@gmx.net> wrote:
> I indexed about 90 million sentences and the PAS (predicate argument
> structures) they consist of (which are about 500 million). Then
> I try to do NER (named entity recognition) by searching about 5 million
> entities. For each entity I need the all search results, not
> just the top X. Since about 10 percent of the entities are high frequent
> (i.e. there are more than 5 million hits for "human"), it
> takes very long to obtain the data from the index. "Very long" means about a
> day with 15 distributed Katta nodes. Katta is just a
> distribution and shard balancing solution on top of Lucene.

If you aren't getting top-N results or doing search, are you sure a search engine library/server is the right tool for this job?

> Then I tried to use IndexableBinaryStringTools to re-encode my 11 byte array.
> The size was increased to 7 characters (= 14 bytes)
> which is still a gain of more than 50 percent compared to the UTF8 encoding.
> BTW: I found no sample how to use the
> IndexableBinaryStringTools class except in the unit tests.

It is deprecated in trunk, because you can index binary terms (your own byte[]) directly if you want. To do this, you need to use a custom AttributeFactory.

See src/test/org/apache/lucene/index/Test2BTerms or https://issues.apache.org/jira/browse/LUCENE-2551 for examples of how to do this.
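
As a rough check on the "11 bytes become 7 characters" figure quoted above, here is a small stdlib-only sketch of the size arithmetic. It assumes the encoding packs 15 payload bits into each Java char and appends one extra bookkeeping char; that is an assumption about how IndexableBinaryStringTools lays out its output, not something confirmed in this thread. The class name EncodedLength and the method encodedChars are hypothetical names used only for illustration, and nothing here calls Lucene.

```java
public class EncodedLength {

    // Assumption: 15 usable bits per char, rounded up, plus one trailing char.
    // Long arithmetic avoids int overflow for very large inputs.
    public static int encodedChars(int inputBytes) {
        return (int) ((8L * inputBytes + 14L) / 15L) + 1;
    }

    public static void main(String[] args) {
        int bytes = 11;                  // size of the original byte[] term
        int chars = encodedChars(bytes); // chars needed under the assumption above
        // Each Java char is 2 bytes in memory (UTF-16 code unit).
        System.out.println(bytes + " bytes -> " + chars + " chars ("
                + 2 * chars + " bytes as UTF-16)");
        // prints "11 bytes -> 7 chars (14 bytes as UTF-16)"
    }
}
```

Under that assumption the numbers match the ones reported: 88 input bits need 6 chars of payload, plus 1 trailing char, for 7 chars in total. Indexing the raw byte[] directly, as suggested above, would avoid this expansion altogether.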