Hi, first of all, thank you for your answers.
@ Rick: the reason is that the set of pages stored on disk represents just a
static view of the Web, so that my experiments are fully replicable. I need
to run simulations of different crawlers on top of it, each working on those
pages as if they were coming from the live Web. During a simulation, the
crawler receives a set of unpredictable user queries from an external module.
It then changes the visit priorities of the discovered but uncrawled pages
according to the current top-k results for those queries, given the contents
of the pages "crawled" so far. Moreover, distinct runs explore different
parts of the Web graph and receive different user queries; that is why I need
to build a separate index of crawled contents for each run. The observation
is that, since I am working with a snapshot of the Web, my indexing process
could be engineered so that all the Web pages are already stored in the
indexer and a flag makes a page retrievable only if it has been crawled in
the current experiment. In this way I would save some time, which I could use
to increase the scale of the crawling simulation and/or to run other
experiments.

@ Alessandro: your approach of using a static and a dynamic index and then
merging the results by means of query joins is what I had in mind at first
glance. It could still do the job, but you already highlighted a performance
limitation on the static index. Moreover, even if I store just the IDs and
the crawling cycles, the dynamic index will still be populated with several
million entries as the experiment proceeds. Atomic updates were another
option I investigated before asking for your help, but since they eventually
rewrite the entire document, I was hoping to find a more efficient solution.

@ Diego: your idea of using NumericDocValues sounds interesting. This is
probably the solution but, if I get the point, a NumericDocValues field has
some features in common with the IntPoint that I am currently using in my
index [1]: both store primitive data types instead of strings only, and both
live in a data structure other than the inverted index. So I am asking: is
there a chance to use the IntPoint in the same way?
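To make my question concrete, here is a minimal sketch of what I understand
the NumericDocValues-based approach to be. The field names, the helper
methods and the "0 = not crawled in this run" convention are just my own
assumptions for illustration; the Lucene calls are the 7.2 API as I read it:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class SnapshotIndexSketch {

  // One-time indexing of the whole snapshot: static content plus an
  // updatable numeric doc-values field holding the crawling cycle.
  // 0 means "not crawled in the current experiment"; the field is added
  // up front so that it can be updated in place later.
  public static void indexPageOnce(IndexWriter writer, String url,
                                   String title, String body) throws IOException {
    Document doc = new Document();
    doc.add(new StringField("url", url, Field.Store.YES));
    doc.add(new TextField("title", title, Field.Store.NO));
    doc.add(new TextField("body", body, Field.Store.NO));
    doc.add(new NumericDocValuesField("crawl_cycle", 0L));
    writer.addDocument(doc);
  }

  // During a run: mark a page as crawled at cycle x without rebuilding the
  // document; only the doc-values entry is updated.
  public static void markCrawled(IndexWriter writer, String url, long cycle)
      throws IOException {
    writer.updateNumericDocValue(new Term("url", url), "crawl_cycle", cycle);
  }

  // External user query, restricted to the pages crawled so far in this run.
  public static Query crawledSoFar(Query userQuery) {
    Query crawled =
        NumericDocValuesField.newSlowRangeQuery("crawl_cycle", 1L, Long.MAX_VALUE);
    return new BooleanQuery.Builder()
        .add(userQuery, BooleanClause.Occur.MUST)
        .add(crawled, BooleanClause.Occur.FILTER)
        .build();
  }

  // Evaluation: pages crawled between cycle i and cycle j (inclusive).
  public static Query crawledBetween(long i, long j) {
    return NumericDocValuesField.newSlowRangeQuery("crawl_cycle", i, j);
  }
}

If I read the javadoc correctly, updateNumericDocValue can only update
doc-values fields that already exist in the index, which is why every page
gets an initial crawl_cycle of 0, and the updated values should become
visible once the reader is reopened. Between two experiments I would either
reset the field to 0 for the pages crawled in the previous run, or simply
start from a copy of the pristine index; in both cases the text fields would
never be re-indexed. Is this roughly what you meant, and is there an
equivalent in-place update for an IntPoint?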
Cheers.

[1] https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/document/IntPoint.html

2018-01-31 13:45 GMT+01:00 Rick Leir <rl...@leirtech.com>:

> Luigi
> Is there a reason for not indexing all of your on-disk pages? That seems
> to be the first step. But I do not understand what your goal is.
> Cheers -- Rick
>
> On January 30, 2018 1:33:27 PM EST, Luigi Caiazza <lcaiazz...@gmail.com>
> wrote:
> >Hello,
> >
> >I am working on a project that simulates a selective, large-scale crawl.
> >The system adapts its behaviour according to some external user queries
> >received at crawling time. Briefly, it analyzes the already crawled pages
> >in the top-k results for each query, and prioritizes the visit of the
> >discovered links accordingly. In a generic experiment, I measure the time
> >units as the number of crawling cycles completed so far, i.e., with an
> >integer value. Finally, I evaluate the experiment by analyzing the
> >documents fetched over the crawling cycles. In this work I am using
> >Lucene 7.2.1, but this should not be an issue, since I need just some
> >conceptual help.
> >
> >In my current implementation, an experiment starts with an empty index.
> >When a Web page is fetched during crawling cycle *x*, the system builds
> >a document with the URL as a StringField, the title and the body as
> >TextFields, and *x* as an IntPoint. When I get an external user query,
> >I submit it to get the top-k relevant documents crawled so far. When I
> >need to retrieve the documents indexed from cycle *i* to cycle *j*, I
> >execute a range query over this last IntPoint field. This strategy does
> >the job, but the write operations take some hours overall for a single
> >experiment, even if I crawl just half a million Web pages.
> >
> >Since I am not crawling real-time data, but working over a static set
> >of many billions of Web pages (whose contents are already stored on
> >disk), I am investigating some opportunities to reduce the number of
> >writes during an experiment. For instance, I could avoid indexing
> >everything from scratch for each run. I would be happy to index all the
> >static contents of my dataset (i.e., URL, title and body of each Web
> >page) once and for all. Then, for a single experiment, I would mark a
> >document as crawled at cycle *x* without storing this information
> >permanently, in order both to filter out the documents that have not
> >been crawled in the current simulation when processing the external
> >queries, and to still perform the range queries at evaluation time. Do
> >you have any idea on how to do that?
> >
> >Thank you in advance for your support.
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
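P.S. Just to make sure we are all looking at the same thing, the
per-experiment indexing described in my quoted message boils down to roughly
the following. This is a simplified sketch with illustrative field and
class names, not my actual code:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Query;

public class PerExperimentIndexSketch {

  // Current approach: one document is built and written for every page
  // fetched at crawling cycle x, repeated from scratch in every experiment.
  public static void indexFetchedPage(IndexWriter writer, String url,
                                      String title, String body, int cycle)
      throws IOException {
    Document doc = new Document();
    doc.add(new StringField("url", url, Field.Store.YES));
    doc.add(new TextField("title", title, Field.Store.NO));
    doc.add(new TextField("body", body, Field.Store.NO));
    doc.add(new IntPoint("cycle", cycle));
    writer.addDocument(doc);
  }

  // Evaluation: documents indexed from cycle i to cycle j (inclusive).
  public static Query indexedBetween(int i, int j) {
    return IntPoint.newRangeQuery("cycle", i, j);
  }
}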