Thanks Erick. I have already followed the SolrJ indexing article very closely, 
as I have to do a lot of manipulation at indexing time. The other blog article 
is very interesting, as I do indeed use "year" (year of publication) and it is 
very frequently used to filter queries. I will have a play with that now.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, 5 December 2017 4:11 p.m.
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Multiple cores versus a "source" field.

That's the unpleasant part of semi-structured documents (PDF, Word, whatever). 
You never know the relationship between raw size and indexable text.

Basically anything that you don't care to contribute to _scoring_ is often 
better in an fq clause. You can also use {!cache=false} to bypass actually 
using the cache if you know it's unlikely to be reused.
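
To make the q/fq split concrete, here is a minimal sketch in Python that builds the parameters for a /select request. The field names "year" and "source", the sample values, and the helper name build_select_params are all assumptions for illustration; only the q and fq parameters and the {!cache=false} local param are Solr's own.

```python
from urllib.parse import urlencode


def build_select_params(text, year=None, one_off_filter=None):
    """Return (name, value) pairs for a Solr /select request.

    Hypothetical helper: the scored text goes in q, everything that
    should not influence scoring goes in fq.
    """
    params = [("q", text)]
    if year is not None:
        # Reused often, so let Solr keep the result in its filter cache.
        params.append(("fq", f"year:{year}"))
    if one_off_filter is not None:
        # Unlikely to be reused, so ask Solr not to cache this filter.
        params.append(("fq", "{!cache=false}" + one_off_filter))
    return params


# Assumed example values; "source" is a made-up field name.
params = build_select_params("landslide hazard", year=2015,
                             one_off_filter="source:report_12345")
query_string = urlencode(params)
```

The resulting query_string can be appended to whatever /select URL your core uses. Each fq is sent as a separate parameter, so Solr can cache the "year" filter independently of the one-off filter.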

Two other points:

1> you can offload the parsing to clients rather than Solr and gain
more control over the process (assuming you haven't already). Here's a blog:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

2> One reason not to go to fq clauses (unless you use
{!cache=false}) is if you are using bare NOW in your clauses, say for ranges. 
One common construct is fq=date:[NOW-1DAY TO NOW]. Here's another blog on the 
subject:
https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/
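
As I understand the issue with bare NOW, it resolves to the current millisecond, so each request produces a different filter and the cached entry is never reused. Rounding with Solr date math (for example NOW/DAY) gives a filter that stays the same for the whole day. A rough sketch in Python, where the field name "date" and both helper names are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone


def last_day_filter(field="date", rounded=True):
    """Build an fq value for 'roughly the last day' (hypothetical helper)."""
    if rounded:
        # NOW/DAY rounds down to the start of the day, so the filter is
        # identical for every request made on that day.
        return f"{field}:[NOW/DAY-1DAY TO NOW/DAY+1DAY]"
    # Bare NOW: a different instant, hence a different filter, per request.
    return f"{field}:[NOW-1DAY TO NOW]"


def resolved_bounds(now):
    """Approximate the instants the rounded range covers for a UTC 'now'."""
    start_of_day = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return start_of_day - timedelta(days=1), start_of_day + timedelta(days=1)


morning = datetime(2017, 12, 5, 3, 11, tzinfo=timezone.utc)
evening = datetime(2017, 12, 5, 22, 0, tzinfo=timezone.utc)
same_window = resolved_bounds(morning) == resolved_bounds(evening)
```

Note that the rounded range is wider than the bare one: it runs from the start of yesterday to the end of today rather than exactly 24 hours back, which is the trade-off for getting a reusable filter.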


Best,
Erick


On Mon, Dec 4, 2017 at 6:08 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>>You'll have a few economies of scale I think with a single core, but frankly 
>>I don't know if they'd be enough to measure. You say the docs are "quite 
>>large" though, are you talking books? Magazine articles? Is 20K large or are 
>>they 20M?
>
> Technical reports. Sometimes up to 200MB pdfs, but that would include a lot 
> of imagery. More typically 20MB. A 140MB pdf contained only 400k of text.
>
> Thanks for the tip on fq: I will put that into code now as I have other fields 
> used in similar fashion.
> Notice: This email and any attachments are confidential and may not be used, 
> published or redistributed without the prior written consent of the Institute 
> of Geological and Nuclear Sciences Limited (GNS Science). If received in 
> error please destroy and immediately notify GNS Science. Do not copy or 
> disclose the contents.
