see: https://issues.apache.org/jira/browse/LUCENE-4258
I'm sure the people working on this would gladly take all
the help they can get. WARNING: I suspect (although I haven't
looked myself) that this is very hairy code <G>.
Ah excellent! Thanks! Exactly what I was looking for. Looks like this has been in the pipeline for a good while now. I'll have a look over the patches, and if it's not too hairy I'll see what I can do.
I'll challenge this statement a bit, knowing full well that I
don't understand your problem space: I've seen some pretty
big, high-throughput installations go ahead and store all the
fields and use them for atomic updates. As in billions of
documents. And note that "index size" as it relates to storing
content is orthogonal to searching. By that I mean the index
bloat you get when storing fields doesn't really impact search
memory requirements much; the stored data is kept in separate
files and only assembled for docs as you return them (i.e. a
page's worth).
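
For reference, the kind of atomic update meant here could be sketched roughly as below. This is only an illustration: it assumes the schema's unique key is "id" and that all fields are stored, and the helper name, document id and field names are made up.

```python
import json

def atomic_update(doc_id, changes):
    """Build one Solr atomic-update document.

    Only the fields in `changes` are named; Solr rebuilds the rest of
    the document from its stored fields. Assumes the unique key is "id".
    """
    doc = {"id": doc_id}
    for field, value in changes.items():
        # The "set" modifier replaces the current value of the field.
        doc[field] = {"set": value}
    return doc

# Hypothetical document id and field names, for illustration only.
payload = json.dumps([atomic_update("doc-1", {"status": "processed", "retries": 3})])
print(payload)
```

The resulting JSON would be POSTed to the collection's /update handler with Content-Type application/json.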
Without going into too much detail, I'll say that we have billions of documents with ~50 indexed fields, fewer than 5 of which need to be updated, though some documents have to be updated 10 times in a reasonably short timespan. All the while we have to maintain an indexing throughput of ~4k messages/second, in near real time, on COTS hardware. Every I/O operation we can spare is a major win for us.

The impact on index size is around 15% in my tests. I will need a little more time to measure the impact on throughput and querying, but my gut instinct tells me that it won't be pretty.

 - Bram
