Katie,

This case is actually quite hard to follow. Let me offer a counterexample, so that you can explain the problem better by pointing out the gap. What if I say that debugQuery=true already provides tf and idf for the terms and documents on the requested page of results? Why can't you use explain to solve the problem?
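
To make the explain idea concrete, here is a minimal sketch of pulling tf and idf out of one document's explain text. The class name ExplainTfIdf is hypothetical, and the line shapes it matches ("weight(field:term in N)", "= tf(", "= idf(") are an assumption about what the classic TF-IDF similarity prints, so check them against your own debugQuery=true output first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts per-term tf and idf from one explain string.
// ASSUMPTION: the explain text uses the classic TF-IDF layout, e.g.
//   0.3 = (MATCH) weight(text:foo in 0) [DefaultSimilarity], result of:
//     1.0 = tf(freq=1.0), with freq of:
//     0.3 = idf(docFreq=1, maxDocs=1)
final class ExplainTfIdf {
    // Start of a per-term clause; group 2 is the term.
    private static final Pattern WEIGHT =
            Pattern.compile("weight\\((\\w+):(\\S+) in \\d+\\)");
    // Number immediately before "= tf(".
    private static final Pattern TF = Pattern.compile("([0-9.Ee+-]+) = tf\\(");
    // Number immediately before "= idf(".
    private static final Pattern IDF = Pattern.compile("([0-9.Ee+-]+) = idf\\(");

    /** Returns term -> {tf, idf}; a value stays NaN if its line was not found. */
    static Map<String, double[]> parse(String explain) {
        Map<String, double[]> out = new LinkedHashMap<>();
        double[] current = null;
        for (String line : explain.split("\n")) {
            Matcher w = WEIGHT.matcher(line);
            if (w.find()) {
                // A new term clause begins; later tf/idf lines belong to it.
                current = new double[] {Double.NaN, Double.NaN};
                out.put(w.group(2), current);
                continue;
            }
            if (current == null) {
                continue; // tf/idf lines before any weight(...) clause are ignored
            }
            Matcher tf = TF.matcher(line);
            if (tf.find()) {
                current[0] = Double.parseDouble(tf.group(1));
                continue;
            }
            Matcher idf = IDF.matcher(line);
            if (idf.find()) {
                current[1] = Double.parseDouble(idf.group(1));
            }
        }
        return out;
    }
}
```

In SolrJ the explain strings should be available per document via QueryResponse.getExplainMap() when the query sets debugQuery=true. Note that this only covers the terms of the query itself and the documents on the returned page.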

On Wed, Jul 3, 2013 at 1:06 AM, Kathryn Mazaitis <kathryn.riv...@gmail.com> wrote:

> Hi,
>
> I'm using SOLRJ to run a query, with the goal of obtaining:
>
> (1) the retrieved documents,
> (2) the TF of each term in each document,
> (3) the IDF of each term in the set of retrieved documents (TF/IDF would be
> fine too)
>
> ...all at interactive speeds, or <10s per query. This is a demo, so if all
> else fails I can adjust the corpus, but I'd rather, y'know, actually do it.
>
> (1) and (2) are working; I completed the patch posted in the following
> issue:
> https://issues.apache.org/jira/browse/SOLR-949
> and am just setting tv=true&tv.tf=true for my query. This way I get the
> documents and the tf information all in one go.
>
> With (3) I'm running into trouble. I have found 2 ways to do it so far:
>
> Option A: set tv.df=true or tv.tf_idf for my query, and get the idf
> information along with the documents and tf information. Since each term
> may appear in multiple documents, this means retrieving idf information for
> each term about 20 times, and takes over a minute to do.
>
> Option B: After I've gathered the tf information, run through the list of
> terms used across the set of retrieved documents, and for each term, run a
> query like:
> {!func}idf(text,'the_term')&deftype=func&fl=score&rows=1
> ...while this retrieves idf information only once for each term, the added
> latency for doing that many queries piles up to almost two minutes on my
> current corpus.
>
> Is there anything I didn't think of -- a way to construct a query to get
> idf information for a set of terms all in one go, outside the bounds of
> what terms happen to be in a document?
>
> Failing that, does anyone have a sense for how far I'd have to scale down a
> corpus to approach interactive speeds, if I want this sort of data?
>
> Katie
>

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>