That's the unpleasant part of semi-structured documents (PDF, Word, whatever). You never know the relationship between raw size and indexable text.

Basically, anything that you don't need to contribute to _scoring_ is usually better in an fq clause. You can also use {!cache=false} to bypass the filter cache if you know the filter is unlikely to be reused.

Two other points:

1> You can offload the parsing to clients rather than Solr and gain more control over the process (assuming you haven't already). Here's a blog:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

2> One reason not to go to fq clauses (unless you use {!cache=false}) is if you are using bare NOW in your clauses for, say, ranges. One common construct is fq=date:[NOW-1DAY TO NOW]. Here's another blog on the subject:
https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/

Best,
Erick

On Mon, Dec 4, 2017 at 6:08 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>> You'll have a few economies of scale I think with a single core, but frankly
>> I don't know if they'd be enough to measure. You say the docs are "quite
>> large" though, are you talking books? Magazine articles? Is 20K large or are
>> they 20M?
>
> Technical reports. Sometimes up to 200MB PDFs, but that would include a lot
> of imagery. More typically 20MB. A 140MB PDF contained only 400k of text.
>
> Thanks for the tip on fq: I will put that into code now as I have other fields
> used in similar fashion.
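
To make the fq advice above concrete, here is a minimal sketch of how a client might assemble the request parameters. It only builds the query string and does not talk to Solr. The helper names and the `doctype` field are hypothetical, chosen for illustration; `date` is the field from the example above. The rounded variant follows the idea in the date-math blog post: Solr resolves bare NOW to the current millisecond, so rounding with /DAY is what lets a cached filter be reused.

```python
from urllib.parse import urlencode


def uncached(filter_clause):
    """Prefix a filter with {!cache=false} so Solr skips the filter cache for it."""
    return "{!cache=false}" + filter_clause


def last_day_bare_now(field):
    # Solr resolves bare NOW to the current millisecond, so the resulting
    # filter differs on every request and a cached entry is never reused.
    return f"{field}:[NOW-1DAY TO NOW]"


def last_day_rounded(field):
    # Rounding to /DAY makes the resolved range identical for a whole day,
    # so the cached filter can be reused. Note that this covers a somewhat
    # wider window than "the last 24 hours".
    return f"{field}:[NOW/DAY-1DAY TO NOW/DAY+1DAY]"


def build_query_string(user_query, filters):
    """Put the scoring part in q and each non-scoring restriction in its own fq."""
    params = [("q", user_query)]
    params.extend(("fq", clause) for clause in filters)
    return urlencode(params)


if __name__ == "__main__":
    # Bare NOW: keep it in fq, but tell Solr not to cache it.
    print(build_query_string("geothermal", [uncached(last_day_bare_now("date"))]))
    # Rounded date math plus a reusable field filter: both are good cache candidates.
    print(build_query_string("geothermal", [last_day_rounded("date"), "doctype:report"]))
```

Each fq is sent as a separate parameter here on the assumption that the restrictions are reused independently, since Solr caches every fq clause on its own.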