Hi,

First of all, thank you for your answers.

@ Rick: the reason is that the set of pages stored on disk represents
just a static view of the Web, so that my experiments are fully
replicable. My need is to run simulations of different crawlers on top
of it, each working on those pages as if they were coming from the real
Web. During a simulation, the crawler receives a set of unpredictable
user queries from an external module. It then changes the visit
priorities of the discovered but not yet crawled pages according to the
current top-k results for those queries, given the contents of the
pages "crawled" so far. Moreover, distinct runs explore different parts
of the Web graph and receive different user queries. That is why I need
to build a separate index of crawled contents for each run. The
observation is that, since I am working with a snapshot of the Web, my
indexing process could be engineered so that all the Web pages are
already stored in the index, and a per-run flag makes a page
retrievable only once it has been crawled in the current experiment. In
this way I save some time that I could use to increase the scale of the
crawling simulation and/or to run other experiments.
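
For concreteness, this is roughly how I picture the one-off indexing of
the static snapshot. It is only a sketch under assumptions of mine
(field names "url", "title" and "body", StandardAnalyzer, a placeholder
index path), not code I have already validated:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class SnapshotIndexer {

  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/static-index"));
         IndexWriter writer =
             new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      // One document per page of the snapshot, indexed once and for all;
      // in the real code this call would sit inside a loop that scans the
      // on-disk dataset.
      addPage(writer, "http://example.org/", "Example", "Page body ...");
      writer.commit();
    }
  }

  static void addPage(IndexWriter writer, String url, String title, String body)
      throws Exception {
    Document doc = new Document();
    doc.add(new StringField("url", url, Field.Store.YES)); // exact-match key
    doc.add(new TextField("title", title, Field.Store.NO));
    doc.add(new TextField("body", body, Field.Store.NO));
    // The per-run "crawled at cycle x" flag is exactly the part I am
    // trying to add without rewriting these documents in every experiment.
    writer.addDocument(doc);
  }
}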

@ Alessandro: your approach of using a static and a dynamic index and
then merging the results by means of query joins is what I had in mind
at first glance. It could still do the job, but you already highlighted
a performance limitation on the static index. Moreover, even if I store
just the IDs and the crawling cycles, the dynamic index will still grow
to several million entries as the experiment proceeds. Atomic updates
were another option I investigated before asking for your help, but
since they eventually rewrite the entire document, I was hoping to find
a more efficient solution.
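
Just to make explicit what I mean by the rewriting cost: as far as I
understand, an atomic update in plain Lucene would look roughly like the
sketch below on my hypothetical schema, with the title and body being
re-analyzed even though only the crawling cycle has changed:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class AtomicUpdateSketch {

  // "Atomic" update: Lucene deletes the old document matching the term
  // and re-indexes the new one in full, so every field is rewritten even
  // though only the crawling cycle is different.
  static void markCrawled(IndexWriter writer, String url, String title,
                          String body, int cycle) throws Exception {
    Document doc = new Document();
    doc.add(new StringField("url", url, Field.Store.YES));
    doc.add(new TextField("title", title, Field.Store.NO));
    doc.add(new TextField("body", body, Field.Store.NO)); // analyzed again
    doc.add(new IntPoint("crawl_cycle", cycle));
    writer.updateDocument(new Term("url", url), doc);
  }
}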

@ Diego: your idea of using NumericDocValues sounds interesting. This is
probably the solution but, if I get the point, a NumericDocValues field
has some features in common with the IntPoint that I am currently using
in my index [1]. Among them: it stores primitive data types instead of
strings only, and it lives in a data structure other than the inverted
index. So I am asking: is there a chance to use the IntPoint in the same
way?
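
To check whether I have understood the suggestion, here is a minimal
sketch of how I picture the NumericDocValues variant on top of the
pre-built static index. The "crawl_cycle" field name and the sentinel
value are my own assumptions, and I am assuming the field would have to
be added at static-index build time, since as far as I know in-place doc
values updates only work on fields that already exist in the document:

import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class DocValuesCrawlFlag {

  // Sentinel meaning "not crawled yet in the current experiment"; the
  // field would be initialized to this value when the static index is built.
  static final long NOT_CRAWLED = Long.MAX_VALUE;

  // Mark the page as crawled at the given cycle: only the "crawl_cycle"
  // doc values entry is changed in place, the document is not rewritten.
  static void markCrawled(IndexWriter writer, String url, int cycle)
      throws Exception {
    writer.updateNumericDocValue(new Term("url", url), "crawl_cycle", cycle);
  }

  // Top-k results for a user query, restricted to the pages crawled so
  // far (the searcher must come from a reader reopened after the updates).
  static TopDocs topKCrawledSoFar(IndexSearcher searcher, Query userQuery,
                                  int currentCycle, int k) throws Exception {
    Query crawledSoFar =
        NumericDocValuesField.newSlowRangeQuery("crawl_cycle", 0L, currentCycle);
    Query q = new BooleanQuery.Builder()
        .add(userQuery, BooleanClause.Occur.MUST)       // relevance
        .add(crawledSoFar, BooleanClause.Occur.FILTER)  // crawled-only filter
        .build();
    return searcher.search(q, k);
  }

  // Evaluation-time query for the documents crawled between cycles i and
  // j, analogous to my current IntPoint.newRangeQuery(...) call.
  static Query crawledBetween(int i, int j) {
    return NumericDocValuesField.newSlowRangeQuery("crawl_cycle", i, j);
  }
}

If that is what the approach amounts to, each run would only write one
small doc values update per crawled page instead of re-indexing
anything, which is exactly the kind of saving I was after.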

Cheers.

[1]
https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/document/IntPoint.html

2018-01-31 13:45 GMT+01:00 Rick Leir <rl...@leirtech.com>:

> Luigi
> Is there a reason for not indexing all of your on-disk pages? That seems
> to be the first step. But I do not understand what your goal is.
> Cheers -- Rick
>
> On January 30, 2018 1:33:27 PM EST, Luigi Caiazza <lcaiazz...@gmail.com>
> wrote:
> >Hello,
> >
> >I am working on a project that simulates a selective, large-scale
> >crawling. The system adapts its behaviour according to some external
> >user queries received at crawling time. Briefly, it analyzes the
> >already crawled pages in the top-k results for each query, and
> >prioritizes the visit of the discovered links accordingly. In a
> >generic experiment, I measure the time units as the number of
> >crawling cycles completed so far, i.e., with an integer value.
> >Finally, I evaluate the experiment by analyzing the documents fetched
> >over the crawling cycles. In this work I am using Lucene 7.2.1, but
> >this should not be an issue since I just need some conceptual help.
> >
> >In my current implementation, an experiment starts with an empty
> >index. When a Web page is fetched during crawling cycle *x*, the
> >system builds a document with the URL as a StringField, the title and
> >the body as TextFields, and *x* as an IntPoint. When I get an
> >external user query, I submit it to get the top-k relevant documents
> >crawled so far. When I need to retrieve the documents indexed from
> >cycle *i* to cycle *j*, I execute a range query over this last
> >IntPoint field. This strategy does the job, but of course the write
> >operations take some hours overall for a single experiment, even if I
> >crawl just half a million Web pages.
> >
> >Since I am not crawling real-time data, but working over a static set
> >of many billions of Web pages (whose contents are already stored on
> >disk), I am investigating some opportunities to reduce the number of
> >writes during an experiment. For instance, I could avoid indexing
> >everything from scratch for each run. I would be happy to index all
> >the static contents of my dataset (i.e., URL, title and body of a Web
> >page) once and for all. Then, for a single experiment, I would mark a
> >document as crawled at cycle *x* without storing this information
> >permanently, both to filter out, when processing the external
> >queries, the documents that have not been crawled in the current
> >simulation, and to still perform the range queries at evaluation
> >time. Do you have any idea of how to do that?
> >
> >Thank you in advance for your support.
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
