1) Diego's observation about IDF is absolutely correct here, but I don't think he was pointing it out as a negative aspect of your new approach. I think he just wanted to warn you about it.
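To make the IDF point concrete, here is a minimal sketch of how the IDF of a term depends on the index it is computed over. It assumes the Lucene-style BM25 IDF formula, ln(1 + (N - n + 0.5) / (n + 0.5)); the function name and the document counts are invented purely for illustration and do not come from your setup.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """Lucene-style BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5)).

    doc_freq  (n): number of documents containing the term
    doc_count (N): total number of documents in the index
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical numbers: the same term measured in a small per-experiment
# index and in a much bigger shared index.
small_index_idf = bm25_idf(doc_freq=50, doc_count=1_000)
big_index_idf = bm25_idf(doc_freq=500, doc_count=1_000_000)

print(f"IDF in the small index: {small_index_idf:.3f}")
print(f"IDF in the big index:   {big_index_idf:.3f}")
```

The same term gets a different weight in each index, so scores computed over different indexes are not directly comparable.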
BM25 uses the IDF of a term to estimate how important the term is in context (given its document frequency in the corpus). I don't think you should remove IDF from your similarity function; in fact, the IDF value coming from the bigger index is closer to reality (since your domain is the web, the ideal IDF would be the one calculated over the entire internet...). Of course, this holds only if you like BM25 as a similarity function (and if it is fit for purpose).

2) Regarding how to evaluate the experiments by experiment and crawling cycle, the quickest way may be to make the crawlingCycle field a dynamic field. The name of the field will then depend on the experimentId, following a pattern such as: *_crawling_cycle
For experimentId = exp01, you will have the field: exp01_crawling_cycle.
For experimentId = exp02, you will have the field: exp02_crawling_cycle.
If I understood your evaluation-time queries correctly, you will be able to check each field depending on the experiment you are interested in.

My doubt about the incremental approach is that a query such as "I want to know, for experiment 1, which pages were crawled in the first cycle" will not work, because you store only the last cycle that involved each page. The exact cycle ids assigned to the pages will therefore not be known.
But I am not sure I fully understood your use case, so ignore my observation if it is not useful.

Regards

-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html