On 11/14/2014 01:43 PM, Erick Erickson wrote:
> Just skimming, so maybe I misinterpreted.
>
> ExternalFileField and ExternalFileFieldReloader
> refer to storing values for each doc in an external file, they have
> nothing to do with storing _files_.
>
> The usual pattern is to have Solr store just enough data to have the
> system-of-record return the actual file rather than have Solr
> actually store the file. Solr isn't really built for this and while some
> people do this it usually is a poor design if for no other reason than
> as segments merge, the data gets copied again and again and again
> to no good purpose.

I was worried about this, and spent a bunch of time working on a custom
codec that would store files externally (to avoid the merge penalty),
while still living inside the Solr/Lucene ecosystem. It was a lot of
complicated work, and after a while I thought I'd better do some careful
performance measurements to make sure it was worthwhile. What I found
was that the merge cost was not very high relative to other indexing
costs we were paying (indexing large full text documents with fairly
complex analysis, but nothing unusual). So I don't think this particular
performance argument against storage in Solr/Lucene is telling, at least
for many ratios of stored doc size to indexed token size. It's also
worth mentioning that my test involved reindexing every document once
(basically a query-level replication of an existing index), so perhaps
the amount of merging was less than it might be in other cases.
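As a back-of-the-envelope illustration of that comparison (not the
measurement described above), here is a minimal Python sketch. It assumes a
simplified merge model in which segments are combined merge_factor at a time,
so each document's stored bytes are rewritten about once per merge level. The
function names, the flush size, and both throughput figures are hypothetical
placeholders, and real merge policies such as TieredMergePolicy behave less
regularly than this.

```python
def estimated_merge_passes(total_docs, docs_per_flush, merge_factor=10):
    """Roughly how many times each doc's stored bytes are rewritten by merges.

    Simplified model (an assumption): every flush creates one segment, and
    segments are merged merge_factor at a time, level by level.
    """
    segments = max(1, total_docs // docs_per_flush)
    passes = 0
    while segments >= merge_factor:
        segments //= merge_factor
        passes += 1
    return passes


def merge_fraction_of_indexing(total_docs, stored_bytes_per_doc, text_bytes_per_doc,
                               docs_per_flush=1_000, merge_factor=10,
                               copy_bytes_per_sec=200e6,      # hypothetical disk copy rate
                               analysis_bytes_per_sec=5e6):   # hypothetical analysis rate
    """Fraction of (merge copying + analysis) time spent on merge copying."""
    passes = estimated_merge_passes(total_docs, docs_per_flush, merge_factor)
    merge_seconds = total_docs * stored_bytes_per_doc * passes / copy_bytes_per_sec
    analysis_seconds = total_docs * text_bytes_per_doc / analysis_bytes_per_sec
    return merge_seconds / (merge_seconds + analysis_seconds)


if __name__ == "__main__":
    # Made-up workload: 1M docs, 500 KB stored per doc, 100 KB of analyzed text.
    share = merge_fraction_of_indexing(1_000_000, 500_000, 100_000)
    print(f"merge copying share of indexing time: {share:.0%}")
```

The point is only the shape of the trade-off: the merge share grows with the
stored-to-analyzed size ratio and with the number of merge levels, so it has
to be measured for a particular workload rather than assumed.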

I can see that there might be other reasons to store documents
elsewhere, but in my experience, with our use case, it actually works
pretty well to store them in Lucene indexes. Consider, for example,
that if you are highlighting, you are probably already storing the full
text of each document anyway. In our case we also need to store a
marked-up version of the full text (so we can highlight an HTML view of
a document as well as deliver plain-text snippets), so the incremental
cost of storing PDFs was not crushing. Of course these could all be
stored externally, too. Maybe we'll try that and get massive performance
increases :)
-Mike