There's no appreciable RAM cost during querying, faceting, sorting of
search results and so on. Stored fields are separate from the inverted
index. There is some cost in additional disk space required and I/O
during merging, but I think you'll find these are not significant. The
main cost we've observed from handling very large texts is highlighting.
The default highlighter essentially re-scans the entire document, so
it's necessary to limit its scope to get decent performance.
FastVectorHighlighter is better, but also has some scaling issues with
large documents, and it requires term vectors, which are expensive in
their own right. We've gotten the best performance from
PostingsHighlighter, but it doesn't handle phrase-sensitive
highlighting, and I haven't tried it on documents that large: I believe
it builds a mini-index on the fly in order to score highlighting
passages, and that could get expensive with 1GB docs.
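
To illustrate limiting the highlighter's scope, here is a minimal sketch of the request parameters involved. The field name "content", the core name "collection1", the host, and all numeric values are assumptions for illustration, not recommendations. hl.maxAnalyzedChars caps how many characters of the field the highlighter will analyze.

```python
from urllib.parse import urlencode


def highlight_params(query, field="content", max_chars=51200,
                     snippets=3, fragsize=150):
    """Build Solr query parameters that bound the highlighter's work.

    The field name and the numeric defaults are illustrative assumptions.
    """
    return {
        "q": query,
        "fl": "id",                            # do not return the huge stored field itself
        "hl": "true",
        "hl.fl": field,                        # field to highlight
        "hl.maxAnalyzedChars": str(max_chars), # stop analyzing after this many characters
        "hl.snippets": str(snippets),          # snippets returned per field
        "hl.fragsize": str(fragsize),          # approximate snippet length in characters
    }


params = highlight_params("content:lucene")
# Hypothetical host and core name.
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print(url)
```

The trade-off is that matches past the character limit are never highlighted, which matters a lot for documents in the 1GB range.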
You might find in the end that you are better off splitting these very
large documents into smaller pieces and rolling those up using
parent/child document indexing or grouping or something, primarily
because of the highlighting.
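
A rough sketch of the splitting idea, assuming fixed-size character chunks with a small overlap so that a phrase straddling a boundary still appears whole in one piece. The field names (id, parent_id, seq, content) and the sizes are hypothetical; a real setup might split on paragraph or section boundaries instead.

```python
def split_document(doc_id, text, chunk_chars=100_000, overlap=200):
    """Split one large text into smaller Solr-style documents.

    Each piece carries the parent's id so results can be rolled up later.
    Field names and sizes are illustrative assumptions.
    """
    if overlap >= chunk_chars:
        raise ValueError("overlap must be smaller than chunk_chars")
    step = chunk_chars - overlap
    chunks = []
    for seq, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "id": f"{doc_id}_{seq}",   # unique id per piece
            "parent_id": doc_id,       # shared key for rolling pieces up
            "seq": seq,                # position of the piece in the original
            "content": text[start:start + chunk_chars],
        })
        if start + chunk_chars >= len(text):
            break                      # this piece already reached the end
    return chunks


pieces = split_document("doc1", "x" * 250_000)
print(len(pieces), [len(p["content"]) for p in pieces])
```

With pieces indexed this way, one option for rolling them up at query time is result grouping on the shared key (group=true&group.field=parent_id), so each original document appears once in the results while highlighting only touches the matching pieces.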
-Mike
On 12/3/14 4:56 PM, Avishai Ish-Shalom wrote:
The use case is not PDFs or documents with images, but very large text
documents. My question is: does storing the documents degrade performance
more than just indexing without storing? I will only return highlighted
text of limited length and will probably never download the entire document.
On Tue, Dec 2, 2014 at 2:15 AM, Jack Krupansky <j...@basetechnology.com>
wrote:
In particular, if they are image-intensive, all the images go away. And
the formatting as well.
-- Jack Krupansky
-----Original Message----- From: Ahmet Arslan
Sent: Monday, December 1, 2014 6:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Large fields storage
Hi Avi,
I assume your documents are rich documents like PDF or Word, am I correct?
When you extract textual content from them, their size will shrink.
Ahmet
On Tuesday, December 2, 2014 12:11 AM, Avishai Ish-Shalom <
avis...@fewbytes.com> wrote:
Hi all,
I have very large documents (as big as 1GB) which I'm indexing and planning
to store in Solr in order to use highlighting snippets. I am concerned
about possible performance issues with such large fields - does storing the
fields require additional RAM over what is required to index/fetch/search?
I'm assuming Solr reads only the required data by offset from the storage
and not the entire field. Am I correct in this assumption?
Does anyone on this list have experience to share with such large documents?
Thanks,
Avishai