Re: Multiple cores versus a "source" field.

Walter Underwood Mon, 04 Dec 2017 20:51:23 -0800

One more opinion on source field vs separate collections for multiple corpora.


Index statistics don’t really settle down until at least 100k documents. Below 
that, idf is pretty noisy. With Ultraseek, we used pre-calculated frequency 
data for collections under 10k docs.
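
To make the noise concrete, here is a minimal sketch, assuming the idf formula used by Lucene's BM25Similarity, ln(1 + (N - n + 0.5) / (n + 0.5)). The helper names and the 0.1% term rate are hypothetical, chosen only for illustration; this is not how Ultraseek or Solr estimates anything. It shows how far idf moves when a single extra document happens to contain a rare term, at different collection sizes.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """idf in the BM25 form: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_shift_from_one_doc(doc_count: int, rate: float) -> float:
    """Drop in idf when one more document containing the term is added.

    `rate` is the (hypothetical) fraction of documents containing the term.
    """
    doc_freq = max(1, round(doc_count * rate))
    before = bm25_idf(doc_count, doc_freq)
    after = bm25_idf(doc_count + 1, doc_freq + 1)
    return before - after


if __name__ == "__main__":
    # A term that appears in roughly 0.1% of documents.
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{n:>9} docs: idf shift = {idf_shift_from_one_doc(n, 0.001):.4f}")
```

In a small collection the shift is large relative to the idf itself, so rankings can change when a handful of documents are added or removed. In a large one the same change is barely visible.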

If your corpora have similar word statistics, you might get more predictable 
relevance with a single collection. For example, if you have data sheets and 
press releases, but they are both about test instruments, then you might get 
some advantage from having more data points about the “text” and “title” fields.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 4, 2017, at 7:17 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
> 
> Thanks Erick. I have already followed the SolrJ indexing very closely - I have 
> to do a lot of manipulation at indexing time. The other blog article is very 
> interesting as I do indeed use "year" (year of publication) and it is very 
> frequently used to filter queries. I will have a play with that now.
> 
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, 5 December 2017 4:11 p.m.
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Multiple cores versus a "source" field.
> 
> That's the unpleasant part of semi-structured documents (PDF, Word, whatever). 
> You never know the relationship between raw size and indexable text.
> 
> Basically, anything that you don't want to contribute to _scoring_ is often 
> better in an fq clause. You can also use {!cache=false} to bypass the filter 
> cache if you know the filter is unlikely to be reused.
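
As a rough sketch of the fq advice above: the function below builds the query string for a Solr select request, keeping the scored part in q and moving the "year" and "source" restrictions into fq clauses. The function name, its arguments, and the example values are hypothetical; "year" and "source" are the field names mentioned in this thread, and the exact schema is an assumption.

```python
from urllib.parse import urlencode


def build_select_params(user_query, year=None, source=None, cache_filters=True):
    """Return a URL-encoded query string for a Solr /select request.

    Only `user_query` goes into q, so only it affects scoring.
    Each restriction becomes its own fq clause.
    """
    # The {!cache=false} local param asks Solr not to cache this filter.
    prefix = "" if cache_filters else "{!cache=false}"
    params = [("q", user_query)]
    if year is not None:
        params.append(("fq", f"{prefix}year:{year}"))
    if source is not None:
        params.append(("fq", f'{prefix}source:"{source}"'))
    return urlencode(params)


if __name__ == "__main__":
    # Frequently reused filter: let Solr cache it.
    print(build_select_params("landslide hazard", year=2015))
    # One-off filter: skip the filter cache.
    print(build_select_params("landslide hazard", source="reports", cache_filters=False))
```

Separate fq clauses are cached independently, so a popular "year" filter can be reused across queries that differ in everything else.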
> 
> Two other points:
> 
> 1> you can offload the parsing to clients rather than Solr and gain
> more control over the process (assuming you haven't already). Here's a blog:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
> 
> 2> One reason not to go to fq clauses (unless you use
> {!cache=false}) is if you are using bare NOW in your clauses, say for ranges. 
> One common construct is fq=date:[NOW-1DAY TO NOW]. Here's another blog on the 
> subject:
> https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/
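
A small sketch of the two usual ways around the bare-NOW problem, assuming a date field named "date" as in the construct above; the function and parameter names are hypothetical. Bare NOW resolves to the current millisecond, so the filter is different on every request and its cache entry is never reused. Rounding with Solr date math (NOW/DAY) keeps the clause identical for a whole day, at the cost of matching whole days rather than an exact sliding window. The alternative is to keep the exact window and skip the cache.

```python
def date_range_fq(field, days_back, cacheable=True):
    """Build an fq clause covering roughly the last `days_back` days.

    cacheable=True rounds both ends to day boundaries, so the clause
    is identical all day and the filter cache entry can be reused.
    cacheable=False keeps the exact sliding window but tells Solr
    not to cache it.
    """
    if cacheable:
        return f"{field}:[NOW/DAY-{days_back}DAYS TO NOW/DAY+1DAY]"
    return f"{{!cache=false}}{field}:[NOW-{days_back}DAYS TO NOW]"


if __name__ == "__main__":
    print(date_range_fq("date", 1))
    print(date_range_fq("date", 1, cacheable=False))
```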
> 
> 
> Best,
> Erick
> 
> 
> On Mon, Dec 4, 2017 at 6:08 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>>> You'll have a few economies of scale I think with a single core, but 
>>> frankly I don't know if they'd be enough to measure. You say the docs are 
>>> "quite large" though. Are you talking books? Magazine articles? Is 20K 
>>> large, or are they 20M?
>> 
>> Technical reports. Sometimes up to 200MB PDFs, but that would include a lot 
>> of imagery. More typically 20MB. A 140MB PDF contained only 400k of text.
>> 
>> Thanks for the tip on fq: I will put that into code now as I have other 
>> fields used in similar fashion.
>> Notice: This email and any attachments are confidential and may not be used, 
>> published or redistributed without the prior written consent of the 
>> Institute of Geological and Nuclear Sciences Limited (GNS Science). If 
>> received in error please destroy and immediately notify GNS Science. Do not 
>> copy or disclose the contents.
