Dear all,

I am working on a research project in which I am creating an OS tool that
tries to detect "bad" and "good" records in a metadata collection
(such as a library catalog, museum database etc. -- you can find more
info here: http://pkiraly.github.io/). This is not the first project of
that kind; there are some scientific articles on the topic, and there
are some established metrics as well. One of the metrics is
"Conformance to expectation", which is more or less a variation of the
tf-idf calculation (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

The process in my case is to index the database, then iterate over the
records and calculate the tf-idf of the important fields. Since I haven't
found a method with which I can simply retrieve this from the Solr index, I
followed this method:

1. take a field value
2. use the /analysis/field handler to extract the terms from the original value
3. use /terms with terms.limit=1, terms.sort=index, and the terms.fl and
   terms.prefix parameters to retrieve the document frequency of each
   term
4. do the calculations based on those input variables
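
To make the last step concrete, here is a rough Python sketch of the
calculation, assuming the terms and document frequencies have already been
fetched from Solr. The function name, the sample values and the particular
tf-idf variant (raw term count times log(N/df)) are only illustrative
assumptions; this is not necessarily the formula Lucene uses for scoring.

```python
import math
from collections import Counter


def tf_idf(terms, doc_freqs, num_docs):
    """Return {term: tf-idf score} for the terms of one field value.

    terms     -- analyzed terms of the field value (as from /analysis/field)
    doc_freqs -- {term: document frequency} (as from /terms)
    num_docs  -- total number of documents in the index
    """
    term_counts = Counter(terms)  # raw term frequency within this value
    scores = {}
    for term, count in term_counts.items():
        df = doc_freqs.get(term, 0)
        if df == 0:
            continue  # no document frequency known for this term, skip it
        idf = math.log(num_docs / df)  # assumed variant: log(N / df)
        scores[term] = count * idf
    return scores


# Hypothetical sample input, not taken from a real index
terms = ["solr", "index", "solr"]
doc_freqs = {"solr": 10, "index": 100}
scores = tf_idf(terms, doc_freqs, num_docs=1000)
```

Here the /analysis/field and /terms requests of steps 2 and 3 are replaced
by the hard-coded `terms` and `doc_freqs` values.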

My question is: is there any more direct way to extract this
information from the Solr index either in Solr, or with the Lucene
API?

Thank you very much in advance!
Péter

-- 
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly
