Re: What are the options for obtaining IDF at interactive speeds?

Kathryn Mazaitis Mon, 08 Jul 2013 13:35:36 -0700

Hi All,

Resolution: I ended up cheating. :P Though now that I look at it, I think
this was Roman's second suggestion. Thanks!


Since the application that will be processing the IDF figures is located on
the same machine as SOLR, I opened a second IndexReader on the Lucene index
and used

reader.numDocs()
reader.docFreq(field,term)

to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
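
A minimal sketch of that arithmetic, for anyone following along. It is not the code from my application: the names IdfSketch and idf are made up for illustration, the two counts are passed in as plain ints where the real thing would read them from reader.numDocs() and reader.docFreq(...), and the formula is the plain log(N/df) form from the Wikipedia page, which is not necessarily what Lucene's own similarity classes compute.

```java
// Sketch only: IdfSketch and idf(...) are hypothetical names, not Lucene API.
final class IdfSketch {

    /**
     * Plain inverse document frequency, idf = ln(numDocs / docFreq),
     * as given on the Wikipedia tf-idf page.
     * In practice numDocs would come from reader.numDocs() and docFreq
     * from reader.docFreq(...); here they are passed in as plain ints.
     */
    static double idf(int numDocs, int docFreq) {
        if (numDocs <= 0) {
            throw new IllegalArgumentException("numDocs must be positive");
        }
        if (docFreq <= 0) {
            // Term absent from the index: the ratio is undefined, so this
            // sketch returns 0.0 instead of dividing by zero (an assumption).
            return 0.0;
        }
        return Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // A term found in 10 of 1000 documents.
        System.out.println(idf(1000, 10));
    }
}
```

Returning 0.0 for a term with no postings is just one choice; a smoothed variant that adds 1 to the denominator would avoid the special case.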

As it turns out, using this method to get IDF on all the terms mentioned in
the set of relevant documents runs in time comparable to retrieving the
documents in the first place (so, 0.1-1 s). This makes it fast enough that
it's no longer the slowest part of my algorithm by far. Problem solved! It
is possible that IDFValueSource would be faster; I may swap that in at a
later date.
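
The part that keeps this cheap is looking each distinct term up only once, however many documents it appears in. A rough sketch of that loop follows, under stated assumptions: BatchIdfSketch, idfForTerms and docFreqLookup are hypothetical names, and docFreqLookup merely stands in for a call such as reader.docFreq(...) so the example runs without Lucene on the classpath.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Sketch only: all names here are hypothetical, not Lucene or SolrJ API.
final class BatchIdfSketch {

    /**
     * Computes idf = ln(numDocs / docFreq) once per distinct term.
     * docFreqLookup stands in for a call such as reader.docFreq(...).
     */
    static Map<String, Double> idfForTerms(Iterable<String> terms,
                                           int numDocs,
                                           ToIntFunction<String> docFreqLookup) {
        Map<String, Double> idfByTerm = new HashMap<>();
        for (String term : terms) {
            if (idfByTerm.containsKey(term)) {
                continue; // already looked up for an earlier document
            }
            int docFreq = docFreqLookup.applyAsInt(term);
            // Terms missing from the index get 0.0 (an assumption of this sketch).
            double idf = docFreq > 0 ? Math.log((double) numDocs / docFreq) : 0.0;
            idfByTerm.put(term, idf);
        }
        return idfByTerm;
    }

    public static void main(String[] args) {
        // Made-up document frequencies standing in for the index.
        Map<String, Integer> fakeIndex = new HashMap<>();
        fakeIndex.put("solr", 10);
        fakeIndex.put("lucene", 100);

        Map<String, Double> idf = idfForTerms(
                Arrays.asList("solr", "lucene", "solr"),
                1000,
                term -> fakeIndex.getOrDefault(term, 0));
        System.out.println(idf);
    }
}
```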

I will keep Mikhail's debugQuery=true in my pocket, too; that technique
would never have occurred to me. Thank you too!

Best,
Katie


On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla <roman.ch...@gmail.com> wrote:

> Hi Kathryn,
> I wonder if you could index all your terms as separate documents and then
> construct a new query (2nd pass)
>
> q=term:term1 OR term:term2 OR term:term3
>
> and use func to score them
>
> idf(other_field,field(term))
>
> the 'term' index cannot be multi-valued, obviously.
>
> Other than that, if you could do it on the server side, that would be the
> fastest - the code is ready inside IDFValueSource:
>
> http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html
>
> roman
>
>
> On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis
> <kathryn.riv...@gmail.com> wrote:
>
> > Hi,
> >
> > I'm using SOLRJ to run a query, with the goal of obtaining:
> >
> > (1) the retrieved documents,
> > (2) the TF of each term in each document,
> > (3) the IDF of each term in the set of retrieved documents (TF/IDF would
> > be fine too)
> >
> > ...all at interactive speeds, or <10s per query. This is a demo, so if
> > all else fails I can adjust the corpus, but I'd rather, y'know, actually
> > do it.
> >
> > (1) and (2) are working; I completed the patch posted in the following
> > issue:
> > https://issues.apache.org/jira/browse/SOLR-949
> > and am just setting tv=true&tv.tf=true for my query. This way I get the
> > documents and the tf information all in one go.
> >
> > With (3) I'm running into trouble. I have found 2 ways to do it so far:
> >
> > Option A: set tv.df=true or tv.tf_idf=true for my query, and get the idf
> > information along with the documents and tf information. Since each term
> > may appear in multiple documents, this means retrieving idf information
> > for each term about 20 times, and takes over a minute to do.
> >
> > Option B: After I've gathered the tf information, run through the list of
> > terms used across the set of retrieved documents, and for each term, run
> > a query like:
> > {!func}idf(text,'the_term')&deftype=func&fl=score&rows=1
> > ...while this retrieves idf information only once for each term, the
> > added latency for doing that many queries piles up to almost two minutes
> > on my current corpus.
> >
> > Is there anything I didn't think of -- a way to construct a query to get
> > idf information for a set of terms all in one go, outside the bounds of
> > what terms happen to be in a document?
> >
> > Failing that, does anyone have a sense for how far I'd have to scale
> > down a corpus to approach interactive speeds, if I want this sort of data?
> >
> > Katie
> >
>
