Hi,

Adding the same document many times is actually the scenario I wanted to test--indexing hits from Apache webserver logs along with the source of the referring page.
My expectation would be that the majority of hits on a given day would originate from a small number of referrers, so each of these referring pages would be indexed multiple times. I really wanted to check that this would scale better than indexing the same number of different documents--your explanation regarding term distribution explains why this is the case.

Many thanks,
Phil

Otis Gospodnetic wrote:
>
> Phil,
>
> Note that adding the same document multiple times and looking at the index
> size is not a very good approach. You are adding a fixed number of
> distinct terms over and over. In real-life scenario you will have a much
> greater term distribution, and that will affect index size.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
>> From: philmccarthy <philmccar...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, January 14, 2009 7:36:38 PM
>> Subject: Re: Indexing the same data in many records
>>
>> Thanks Otis. I tweaked the Solr example app a little and then uploaded a
>> ~55KB document to it a couple of thousand times (changing the ID each
>> time). The solr/data directory was 72MB on disc after adding the document
>> 2000 times, so it seems that the index is growing by approximately 36KB
>> for each document. That seems reasonable.
>>
>> I guess I need to do some research into expected data volumes now, and
>> limits on Lucene index size.
>>
>> Cheers,
>> Phil
>>
>> Otis Gospodnetic wrote:
>> >
>> > Phil,
>> >
>> > From what you described so far, I don't see any red flags. I would pay
>> > attention to reading those timestamps (covered on the Wiki and ML
>> > archives), that's all.
>> >
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> > ----- Original Message ----
>> >> From: philmccarthy
>> >> To: solr-user@lucene.apache.org
>> >> Sent: Tuesday, January 13, 2009 8:49:33 PM
>> >> Subject: Indexing the same data in many records
>> >>
>> >> Hi,
>> >>
>> >> I'd like to use Solr to index some webserver logs, in order to allow
>> >> easy ad-hoc querying and analysis. Each Solr Document will represent a
>> >> single request to the webserver, with fields for time, request URL,
>> >> referring URL etc.
>> >>
>> >> I'm also planning to fetch the page source of each referring URL, and
>> >> add that as an indexed field in the Solr document. The aim is to allow
>> >> queries like "find hits to /xyz.html where the referring page contains
>> >> the word 'foobar'".
>> >>
>> >> Since hundreds or even thousands of hits may all come from the same
>> >> referring page, would this approach be horribly inefficient? (Note the
>> >> page source won't be stored in each Document, just indexed). Am I going
>> >> to dramatically increase the index size if I do this?
>> >>
>> >> If so, is there a more elegant way to do what I want?
>> >>
>> >> Many thanks,
>> >> Phil
>> >>
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21448465.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
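
For anyone reading this thread later, the term-distribution point can be illustrated with a rough sketch. This is a toy inverted index in plain Python with a whitespace tokenizer, not Lucene's actual on-disk format or analyzers. The names (build_index, index_stats) and the sample text are made up for illustration. It only shows why re-adding one document keeps the number of distinct terms fixed, while varied documents grow the term dictionary as well as the postings.

```python
from collections import defaultdict


def build_index(docs):
    """Toy inverted index: term -> list of (doc_id, term_frequency)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return index


def index_stats(index):
    """Return (number of distinct terms, total number of postings)."""
    distinct_terms = len(index)
    postings = sum(len(plist) for plist in index.values())
    return distinct_terms, postings


# Hypothetical referring-page text with 10 distinct words.
page = "solr indexes referring page source for ad hoc log queries"

# Case 1: the same page indexed under 1000 different IDs.
same = {f"hit-{i}": page for i in range(1000)}

# Case 2: 1000 documents that each add two terms nobody else uses.
varied = {f"hit-{i}": f"{page} token{i} extra{i}" for i in range(1000)}

same_terms, same_postings = index_stats(build_index(same))
varied_terms, varied_postings = index_stats(build_index(varied))

print("same doc repeated:", same_terms, "terms,", same_postings, "postings")
print("varied documents: ", varied_terms, "terms,", varied_postings, "postings")
```

In the repeated case only the postings lists grow. In the varied case the set of distinct terms grows too, which is the effect Otis describes. How that maps to bytes on disk in a real Lucene index depends on compression and field options, so the toy counts are not a size estimate.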