That's the unpleasant part of semi-structured documents (PDF, Word,
whatever). You never know the relationship between raw size and
indexable text.

Basically, any clause that you don't need to contribute to _scoring_
is often better in an fq clause, since fq results are kept in the
filterCache and can be reused by later queries. You can also add
{!cache=false} to skip the cache when you know the filter is unlikely
to be reused.
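
To make that concrete, here's a minimal sketch in Python (standard
library only) that builds the request parameters with scoring in q and
restrictions in fq. The field names (text, doctype, session_id) and the
helper build_select_params are made up for illustration; they aren't
from your schema or from any Solr client library.

```python
from urllib.parse import urlencode, parse_qs


def build_select_params(main_query, filters=(), uncached=()):
    """Return an encoded query string for a Solr /select request.

    main_query -> q, the only clause that contributes to scoring.
    filters    -> fq clauses that may be cached and reused.
    uncached   -> fq clauses prefixed with {!cache=false}.
    """
    params = [("q", main_query)]
    for clause in filters:
        params.append(("fq", clause))
    for clause in uncached:
        params.append(("fq", "{!cache=false}" + clause))
    return urlencode(params)


# Hypothetical field names, for illustration only.
qs = build_select_params(
    "text:basalt",
    filters=["doctype:report"],
    uncached=["session_id:abc123"],
)

# Decode again to show what Solr would receive.
for clause in parse_qs(qs)["fq"]:
    print(clause)
```

Each restriction gets its own fq parameter here, so each one can be
cached and reused independently of the others.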

Two other points:

1> you can offload the parsing to clients rather than Solr and gain
more control over the process (assuming you haven't already). Here's
a blog:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

2> One reason not to move a clause to fq (unless you use
{!cache=false}) is if it uses bare NOW, say for ranges. One common
construct is fq=date:[NOW-1DAY TO NOW]. Bare NOW resolves to the
current millisecond, so every request produces a different filter and
the cached entries are never reused. Here's another blog on the
subject:
https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/
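
As a rough sketch of the difference in Python: the helper date_range_fq
is hypothetical and the field name date is just the one from the
example above. The rounded form uses Solr date math (NOW/DAY) so the
clause text stays the same for a whole day. Note that it covers whole
days, so it isn't exactly the same window as the bare-NOW range.

```python
def date_range_fq(field, lower, upper, cache=True):
    """Build a date-range fq clause; optionally mark it as uncached."""
    clause = f"{field}:[{lower} TO {upper}]"
    return clause if cache else "{!cache=false}" + clause


# Bare NOW changes every millisecond, so don't bother caching it.
bare = date_range_fq("date", "NOW-1DAY", "NOW", cache=False)

# Rounded to the day, so the same filter can be reused until midnight.
rounded = date_range_fq("date", "NOW/DAY-1DAY", "NOW/DAY+1DAY")

print(bare)
print(rounded)
```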


Best,
Erick


On Mon, Dec 4, 2017 at 6:08 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>>You'll have a few economies of scale I think with a single core, but frankly 
>>I don't know if they'd be enough to measure. You say the docs are "quite 
>>large" though, are you talking books? Magazine articles? Is 20K large or are
>>they 20M?
>
> Technical reports. Sometimes up to 200MB PDFs, but that would include a lot
> of imagery. More typically 20MB. A 140MB PDF contained only 400k of text.
>
> Thanks for the tip on fq: I will put that into code now as I have other fields
> used in similar fashion.
