There's no appreciable RAM cost during querying, faceting, sorting of search results, and so on: stored fields are kept separate from the inverted index. There is some cost in additional disk space and in I/O during merging, but I think you'll find these are not significant.

The main cost we've observed from handling very large texts is highlighting. The default highlighter essentially re-scans the entire document, so it's necessary to limit its scope to get decent performance. FastVectorHighlighter is better, but it also has some scaling issues with large documents, and it requires term vectors, which are expensive in their own right. We've gotten the best performance from PostingsHighlighter, but it doesn't handle phrase-sensitive highlighting, and I haven't tried it on documents that large: I believe it builds a mini-index on the fly in order to score highlighting passages, and that could get expensive with 1GB docs.
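
One way to limit the default highlighter's scope is to cap how many characters it analyzes per field. The following is a minimal sketch in Python of the request parameters one might send. `hl.maxAnalyzedChars`, `hl.snippets` and `hl.fragsize` are standard Solr highlighting parameters; the field name `content`, the helper name `highlight_params` and the specific values are assumptions for illustration, not tuned recommendations.

```python
from urllib.parse import urlencode


def highlight_params(query, field="content", max_chars=100_000):
    """Build Solr query parameters that cap how much text the highlighter scans.

    `field` and `max_chars` are hypothetical defaults chosen for this sketch.
    """
    return {
        "q": query,
        "fl": "id",                        # return only the id, not the huge stored field
        "hl": "true",                      # turn highlighting on
        "hl.fl": field,                    # field to take snippets from
        "hl.maxAnalyzedChars": max_chars,  # stop scanning after this many characters
        "hl.snippets": 3,                  # at most three snippets per document
        "hl.fragsize": 200,                # approximate snippet length in characters
    }


params = highlight_params("lucene")
query_string = urlencode(params)  # append to the select handler URL of your core
```

The trade-off is that matches beyond the analyzed prefix of the document will not be highlighted, which is one more argument for splitting very large documents.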

You might find in the end that you are better off splitting these very large documents into smaller pieces and rolling those up using parent/child document indexing or grouping or something, primarily because of the highlighting.
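
A minimal sketch of the splitting idea, in Python. The field names (`parent_id`, `seq`, `content`), the helper name `split_into_chunks` and the chunk size are assumptions made up for this example; they are not anything Solr requires.

```python
def split_into_chunks(doc_id, text, chunk_chars=100_000):
    """Split one large text into smaller documents that share a parent_id.

    Splitting at fixed character offsets can cut a word or phrase in half;
    a real version would probably split at paragraph or sentence boundaries.
    """
    chunks = []
    for seq, start in enumerate(range(0, len(text), chunk_chars)):
        chunks.append({
            "id": f"{doc_id}_{seq}",    # unique id for each piece
            "parent_id": doc_id,        # lets results be rolled up per original document
            "seq": seq,                 # position of the piece within the original
            "content": text[start:start + chunk_chars],
        })
    return chunks


# Tiny example: a 250-character "document" split into pieces of 100 characters.
pieces = split_into_chunks("doc1", "a" * 250, chunk_chars=100)
```

With documents indexed this way, the pieces could be rolled up at query time with result grouping, for example `group=true&group.field=parent_id`, so that each original document appears once and highlighting only has to scan the small matching piece.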

-Mike

On 12/3/14 4:56 PM, Avishai Ish-Shalom wrote:
The use case is not PDFs or documents with images but very large text
documents. My question is: does storing the documents degrade performance
more than just indexing without storing? I will only return highlighted
text of limited length and will probably never download the entire document.

On Tue, Dec 2, 2014 at 2:15 AM, Jack Krupansky <j...@basetechnology.com>
wrote:

In particular, if they are image-intensive, all the images go away. And
the formatting as well.

-- Jack Krupansky

-----Original Message----- From: Ahmet Arslan
Sent: Monday, December 1, 2014 6:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Large fields storage


Hi Avi,

I assume your documents are rich documents like PDF or Word, am I correct?
When you extract textual content from them, their size will shrink.

Ahmet



On Tuesday, December 2, 2014 12:11 AM, Avishai Ish-Shalom <
avis...@fewbytes.com> wrote:
Hi all,

I have very large documents (as big as 1GB) which I'm indexing and planning
to store in Solr in order to use highlighting snippets. I am concerned
about possible performance issues with such large fields - does storing the
fields require additional RAM over what is required to index/fetch/search?
I'm assuming Solr reads only the required data by offset from the storage
and not the entire field. Am I correct in this assumption?

Does anyone on this list have experience to share with such large documents?

Thanks,
Avishai

