Hi Luigi,

What about using an updatable DocValues field [1] for the field *x*? You could initially set it to -1, and then update it for the docs crawled at step *j*. Range queries should still work, and the update should be fast.
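A minimal sketch of this approach, assuming a unique "url" field and an illustrative "crawlCycle" field name (not tested against your setup; the `IndexWriter` plumbing is up to you):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;

public class CrawlCycleMarker {

    static final long NOT_CRAWLED = -1L;

    // One-time indexing of the static content. The docvalues field must be
    // present at index time: updateNumericDocValue() only works on documents
    // that already carry the field.
    static Document buildStaticDoc(String url, String title, String body) {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        doc.add(new NumericDocValuesField("crawlCycle", NOT_CRAWLED));
        return doc;
    }

    // During an experiment: mark the page as crawled at cycle x without
    // re-indexing the document (URL assumed unique).
    static void markCrawled(IndexWriter writer, String url, int x)
            throws java.io.IOException {
        writer.updateNumericDocValue(new Term("url", url), "crawlCycle", (long) x);
    }

    // Evaluation: documents crawled between cycles i and j (inclusive).
    // Docs still at -1 fall outside any range with i >= 0.
    static Query crawledBetween(int i, int j) {
        return NumericDocValuesField.newSlowRangeQuery("crawlCycle", i, j);
    }
}
```

Note that `newSlowRangeQuery` iterates docvalues instead of the points index, so it works best as a FILTER clause in a BooleanQuery alongside a more selective main query rather than on its own.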
Cheers

[1] http://shaierera.blogspot.com/2014/04/updatable-docvalues-under-hood.html

From: solr-user@lucene.apache.org
At: 01/30/18 18:42:01
To: solr-user@lucene.apache.org
Subject: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

Hello,

I am working on a project that simulates selective, large-scale crawling. The system adapts its behaviour according to external user queries received at crawling time. Briefly, it analyzes the already crawled pages in the top-k results for each query, and prioritizes the visit of the discovered links accordingly. In a generic experiment, I measure time units as the number of crawling cycles completed so far, i.e., as an integer value. Finally, I evaluate the experiment by analyzing the documents fetched over the crawling cycles. I am using Lucene 7.2.1, but this should not be an issue since I just need some conceptual help.

In my current implementation, an experiment starts with an empty index. When a Web page is fetched during crawling cycle *x*, the system builds a document with the URL as a StringField, the title and the body as TextFields, and *x* as an IntPoint. When I receive an external user query, I submit it to get the top-k relevant documents crawled so far. When I need to retrieve the documents indexed from cycle *i* to cycle *j*, I execute a range query over this last IntPoint field.

This strategy does the job, but the write operations take some hours overall for a single experiment, even if I crawl just half a million Web pages. Since I am not crawling real-time data, but working over a static set of many billions of Web pages (whose contents are already stored on disk), I am investigating opportunities to reduce the number of writes during an experiment. For instance, I could avoid indexing everything from scratch for each run.
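For reference, the per-cycle indexing and range filtering described above might look like the following sketch (field names are illustrative, not from the actual project):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.search.Query;

public class PerExperimentIndexing {

    // Document built when a page is fetched during crawling cycle x.
    static Document buildDoc(String url, String title, String body, int x) {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        doc.add(new IntPoint("crawlCycle", x)); // indexed as a point, not stored
        return doc;
    }

    // Documents indexed from cycle i to cycle j (inclusive).
    static Query cycleRange(int i, int j) {
        return IntPoint.newRangeQuery("crawlCycle", i, j);
    }
}
```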
I would be happy to index all the static contents of my dataset (i.e., URL, title and body of each Web page) once and for all. Then, for a single experiment, I would mark a document as crawled at cycle *x* without storing this information permanently, both to filter out the documents that have not yet been crawled in the current simulation when processing the external queries, and to still perform the range queries at evaluation time. Do you have any idea how to do that? Thank you in advance for your support.