Dear all,

I am working on a research project in which I am creating an open source tool that tries to detect "bad" and "good" records in a metadata collection (such as a library catalog, a museum database, etc. -- you can find more info here: http://pkiraly.github.io/). This is not the first project of its kind: there are some scientific articles on the topic, and there are some established metrics as well. One of the metrics is "Conformance to expectation", which is more or less a variation of the tf-idf calculation (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
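As a point of reference, the textbook form of tf-idf can be sketched in Python as below. This is only an illustration: the exact variant used by the "Conformance to expectation" metric is not spelled out here, so the raw-count term frequency and the log(N / df) weighting are assumptions, and the names `tf_idf` and `field_scores` are hypothetical helpers, not part of Solr or Lucene.

```python
import math


def tf_idf(term_freq: int, doc_freq: int, num_docs: int) -> float:
    """Textbook tf-idf: raw term frequency times log(N / df).

    Assumption: no smoothing or length normalization is applied.
    """
    if term_freq < 0:
        raise ValueError("term_freq must be non-negative")
    if doc_freq <= 0 or doc_freq > num_docs:
        raise ValueError("doc_freq must be in the range 1..num_docs")
    return term_freq * math.log(num_docs / doc_freq)


def field_scores(term_freqs: dict, doc_freqs: dict, num_docs: int) -> dict:
    """Score every term of one field value.

    term_freqs: term -> occurrences of the term in this field value
    doc_freqs:  term -> number of documents in the index containing the term
    """
    return {
        term: tf_idf(tf, doc_freqs[term], num_docs)
        for term, tf in term_freqs.items()
    }
```

The three inputs are the term frequencies within one field value, the document frequency of each term, and the total number of documents in the index, so any way of reading those from the index is enough to feed the calculation.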
The process in my case is to index the database, then iterate over the records and calculate the tf-idf of the important fields. Since I haven't found a way to simply retrieve this from the Solr index, I followed this method:

1. take a field value
2. use the /analysis/field handler to extract the terms from the original value
3. use /terms with the terms.limit=1, terms.sort=index, terms.fl, and terms.prefix parameters to retrieve the document frequency of each term
4. do the calculations based on those input variables

My question is: is there a more direct way to extract this information from the Solr index, either in Solr or with the Lucene API?

Thank you very much in advance!
Péter

--
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly