Re: Large fields storage

Michael Sokolov Thu, 04 Dec 2014 05:05:52 -0800

There's no appreciable RAM cost during querying, faceting, sorting of search results and so on. Stored fields are separate from the inverted index. There is some cost in additional disk space required and I/O during merging, but I think you'll find these are not significant. The main cost we've observed from handling very large texts is highlighting. The default highlighter essentially re-scans the entire document, so it's necessary to limit its scope to get decent performance. FastVectorHighlighter is better, but also has some scaling issues with large documents, and it requires term vectors, which are expensive in their own right. We've gotten best performance from PostingsHighlighter, but it doesn't handle phrase-sensitive highlighting, and I will say that I haven't tried it on such large documents as that: I believe it builds a mini-index on the fly in order to score highlighting passages, and that could get expensive with 1GB docs.
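One way to limit the default highlighter's scope is to cap how many characters it analyzes per field. The following is a minimal sketch rather than a tested configuration: it only builds a query string, and the field name `text`, the core path `/solr/collection1/select`, and the specific limits are assumptions chosen for illustration. The `hl.*` names are standard Solr highlighting parameters.

```python
from urllib.parse import urlencode


def build_highlight_query(user_query, field="text", max_chars=200_000,
                          snippets=3, fragsize=150):
    """Build a Solr query string that bounds highlighting work.

    The field name and the limits are illustrative assumptions.
    """
    params = {
        "q": user_query,
        "fl": "id",                        # do not return the huge stored field itself
        "hl": "true",
        "hl.fl": field,                    # field to highlight (assumed name)
        "hl.maxAnalyzedChars": max_chars,  # stop scanning the document after this many chars
        "hl.snippets": snippets,           # at most this many snippets per field
        "hl.fragsize": fragsize,           # approximate snippet length in characters
    }
    return urlencode(params)


if __name__ == "__main__":
    # Hypothetical core path; adjust to the actual deployment.
    print("/solr/collection1/select?" + build_highlight_query("lucene merging"))
```

The trade-off is that matches beyond `hl.maxAnalyzedChars` will not produce snippets, so with very large documents this trades completeness for speed.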

You might find in the end that you are better off splitting these very large documents into smaller pieces and rolling those up using parent/child document indexing or grouping or something, primarily because of the highlighting.
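The splitting idea can be sketched as follows. This is only an illustration under stated assumptions: the field names `parent_id`, `chunk_no` and `text`, the chunk size, and the overlap are all hypothetical, and splitting on raw character offsets ignores sentence boundaries. The small overlap is there so that a phrase straddling a chunk boundary still appears whole in one chunk.

```python
def split_into_chunks(doc_id, text, chunk_chars=100_000, overlap=200):
    """Split one large text into smaller documents sharing a parent_id.

    Field names and sizes are illustrative assumptions, not a Solr requirement.
    """
    if chunk_chars <= 0 or overlap < 0 or overlap >= chunk_chars:
        raise ValueError("need chunk_chars > overlap >= 0")

    chunks = []
    step = chunk_chars - overlap  # how far each new chunk advances
    start = 0
    n = 0
    while start < len(text):
        end = start + chunk_chars
        chunks.append({
            "id": f"{doc_id}_{n}",   # unique key for the chunk document
            "parent_id": doc_id,     # field to group or join on
            "chunk_no": n,           # preserves the original order
            "text": text[start:end],
        })
        if end >= len(text):
            break                    # this chunk already reached the end
        start += step
        n += 1
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("doc1", "abcdefghij", chunk_chars=4, overlap=1):
        print(chunk)
```

With documents indexed this way, results could be rolled up per original document using result grouping (`group=true&group.field=parent_id`), so each hit highlights only one small chunk.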


-Mike

On 12/3/14 4:56 PM, Avishai Ish-Shalom wrote:

The use case is not for PDFs or documents with images but for very large text
documents. My question is: does storing the documents degrade performance
more than just indexing without storing? I will only return highlighted
text of limited length and will probably never download the entire document.

On Tue, Dec 2, 2014 at 2:15 AM, Jack Krupansky <j...@basetechnology.com>
wrote:

In particular, if they are image-intensive, all the images go away. And
the formatting as well.

-- Jack Krupansky

-----Original Message----- From: Ahmet Arslan
Sent: Monday, December 1, 2014 6:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Large fields storage

Hi Avi,

I assume your documents are rich documents like PDF or Word, am I correct?
When you extract textual content from them, their size will shrink.

Ahmet

On Tuesday, December 2, 2014 12:11 AM, Avishai Ish-Shalom <
avis...@fewbytes.com> wrote:
Hi all,

I have very large documents (as big as 1GB) which I'm indexing and planning
to store in Solr in order to use highlighting snippets. I am concerned
about possible performance issues with such large fields - does storing the
fields require additional RAM over what is required to index/fetch/search?
I'm assuming Solr reads only the required data by offset from the storage
and not the entire field. Am I correct in this assumption?

Does anyone on this list have experience to share with such large documents?

Thanks,
Avishai
