On 11/14/2014 01:43 PM, Erick Erickson wrote:
Just skimming, so maybe I misinterpreted.

ExternalFileField and ExternalFileFieldReloader refer to storing
values for each doc in an external file; they have nothing to do
with storing _files_.

The usual pattern is to have Solr store just enough data to have the
system-of-record return the actual file rather than have Solr
actually store the file. Solr isn't really built for this and while some
people do this it usually is a poor design if for no other reason than
as segments merge, the data gets copied again and again and again
to no good purpose.

I was worried about this, and spent a bunch of time working on a custom codec that would store files externally (to avoid the merge penalty) while still living inside the Solr/Lucene ecosystem. It was a lot of complicated work, and after a while I thought I'd better do some careful performance measurements to make sure it was worthwhile. What I found was that the merge cost was not very high relative to the other indexing costs we were paying (indexing large full-text documents with fairly complex analysis, but nothing unusual).

So I don't think this particular performance argument against storage in Solr/Lucene is telling, at least for many ratios of stored document size to indexed token size. It's also worth mentioning that my test involved reindexing every document once (basically a query-level replication of an existing index), so perhaps the amount of merging was less than it might be in other cases.
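To put a rough number on the merge penalty being discussed, here is a toy back-of-the-envelope model in Python. It is only a sketch: it assumes a simplified leveled merge scheme in which every `merge_factor` segments are combined into one, which is not exactly how Lucene's merge policies behave, and the function names and all figures are made up for illustration, not taken from my measurements.

```python
def merge_levels(num_flushed_segments: int, merge_factor: int = 10) -> int:
    """Count merge rounds needed to reduce the flushed segments to one.

    Toy model (an assumption, not Lucene's actual policy): each round
    merges groups of `merge_factor` segments into a single segment, so
    every document is copied once per round.
    """
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    levels = 0
    segments = num_flushed_segments
    while segments > 1:
        # Ceiling division: a partial group still produces one segment.
        segments = -(-segments // merge_factor)
        levels += 1
    return levels


def total_stored_gb_written(stored_gb: float,
                            num_flushed_segments: int,
                            merge_factor: int = 10) -> float:
    """Stored-field data written overall: the initial flush plus one
    full copy per merge round in the toy model."""
    return stored_gb * (1 + merge_levels(num_flushed_segments, merge_factor))


if __name__ == "__main__":
    # Hypothetical index: 50 GB of stored documents, flushed as 1000 segments.
    print(merge_levels(1000, 10))
    print(total_stored_gb_written(50, 1000, 10))
```

In this model the stored data is rewritten a handful of times, not hundreds, because the number of rounds grows only logarithmically with the number of flushed segments. Whether that copying matters depends on how it compares with the cost of analyzing the text, which is the comparison described above.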

I can see that there might be other reasons to store documents elsewhere, but in my experience, with our use case, it actually works pretty well to store them in Lucene indexes. Consider, for example, that if you are highlighting, you are probably already storing the full text of each document anyway. In our case we also need to store a marked-up version of the full text (so we can highlight an HTML view of a document as well as deliver plain-text snippets), so the incremental cost of storing PDFs was not crushing. Of course these could all be stored externally, too. Maybe we'll try that and get massive performance increases :)

-Mike
