RE: retrieve lucene "doc id"

Norskog, Lance Tue, 18 Dec 2007 21:12:00 -0800

Exactly.  We have done some projects where we extract records en masse.
With this technique we can make a query that will fetch roughly 3000
(+/- 50) records, and walk through them 50 records at a time using the
query as a filter. Works pretty well.

Lance
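
A rough sketch of the arithmetic behind picking a hex-prefix query of a
predictable size, as described in the quoted messages below. The class and
method names (PrefixSampling, prefixLengthFor, expectedMatches) are
hypothetical and not part of Solr or Lucene; the sketch only assumes that
IDs are uniformly distributed hex strings, so each extra prefix character
divides the matched set by 16.

```java
final class PrefixSampling {
    /** Hypothetical helper: number of hex prefix characters that gets closest to targetSize. */
    static int prefixLengthFor(long indexSize, long targetSize) {
        double ratio = (double) indexSize / targetSize;
        // Each prefix character narrows the match by a factor of 16,
        // so the length is log base 16 of the ratio, rounded.
        return (int) Math.max(0L, Math.round(Math.log(ratio) / Math.log(16)));
    }

    /** Expected number of documents matched by a prefix of the given length. */
    static double expectedMatches(long indexSize, int prefixLength) {
        return indexSize / Math.pow(16, prefixLength);
    }

    public static void main(String[] args) {
        long indexSize = 12_288_000L; // assumed index size, for illustration only
        int k = prefixLengthFor(indexSize, 3000);
        System.out.println("prefix length: " + k
                + ", expected matches: " + expectedMatches(indexSize, k));
    }
}
```

The actual count will vary around the expected value, since the
distribution is random rather than exact.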

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 18, 2007 11:07 AM
To: solr-user@lucene.apache.org
Subject: Re: retrieve lucene "doc id"

Hi Lance,

You said:
We use the standard (some RFC) text representation of 32 hex
characters.
This has the advantage that F* pulls 1/16 of the total index, with a
completely randomized distribution, F**  1/256, etc.  This is very
handy for data analysis and document extraction. 

Could you elaborate on the last sentence?  Maybe give an example of what
you have in mind?
Are you thinking that this, because of the uniform distribution, lets you
easily get a subset of documents of predictable size, and thus know
a priori how large a data set you'll get and work with?
Or something else?

Thanks,
Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: "Norskog, Lance" <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, December 17, 2007 2:43:55 PM
Subject: RE: retrieve lucene "doc id"

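We are using MD5 to generate our IDs. An MD5 is 128 bits, which gives a
highly randomized and, for practical purposes, unique number for the
content. Accidental collisions between two different data sets are
vanishingly unlikely, although deliberately constructed MD5 collisions
have been demonstrated since 2004.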

We use the standard (some RFC) text representation of 32 hex
characters.
This has the advantage that F* pulls 1/16 of the total index, with a
completely randomized distribution, F**  1/256, etc.  This is very
handy for data analysis and document extraction. 
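
A minimal sketch of the idea, assuming the ID is the MD5 of the document
content rendered as 32 lowercase hex characters (the case is an
assumption; the message above writes the prefix as F*). The class name
Md5Ids and its methods are hypothetical. Only the standard
java.security.MessageDigest API is used.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Ids {
    /** Returns the MD5 of the text as a 32-character lowercase hex string. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            // %032x left-pads with zeros so the ID is always 32 characters long.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Expected fraction of the index matched by a hex prefix of this length. */
    static double prefixFraction(int prefixLength) {
        return 1.0 / Math.pow(16, prefixLength);
    }

    public static void main(String[] args) {
        // Count how many synthetic documents get an ID starting with "f".
        int total = 100_000;
        int matched = 0;
        for (int i = 0; i < total; i++) {
            if (md5Hex("doc-" + i).startsWith("f")) {
                matched++;
            }
        }
        System.out.printf("prefix f matched %d of %d (expected about %.0f)%n",
                matched, total, total * prefixFraction(1));
    }
}
```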

MD5 creates 128 bits, but if your index is small enough that you are
willing to risk it, you could pick 64 bits and park them in a Java
long.
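
One way this could look, as a sketch only: take the first 8 bytes of the
digest and read them as a big-endian long. Which 64 bits to keep is an
arbitrary choice made here for illustration, and the class name Md5LongId
is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5LongId {
    /** Returns the first 64 bits of the MD5 digest as a (possibly negative) long. */
    static long md5AsLong(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            // ByteBuffer reads big-endian by default; getLong() consumes bytes 0..7.
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(md5AsLong("hello")));
    }
}
```

The risk mentioned above follows the birthday bound: with n documents and
64-bit IDs, the chance of at least one collision is roughly n^2 / 2^65.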

-----Original Message-----
From: Ryan McKinley [mailto:[EMAIL PROTECTED]
Sent: Monday, December 17, 2007 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: retrieve lucene "doc id"

Yonik Seeley wrote:
> On Dec 17, 2007 1:40 AM, Ben Incani <[EMAIL PROTECTED]> wrote:
>> I have converted to using the Solr search interface and I am trying
>> to retrieve documents from a list of search results (where previously
>> I had used the doc id directly from the lucene query results) and the
>> solr id I have got currently indexed is unfortunately configured not
>> to be unique!
>
> Ouch... I'd try to make a unique Id then!
> Or barring that, just try to make the query match exactly the docs you
> want back (don't do the 2 phase thing).
>

In 1.3-dev, you can use UUIDField to have solr generate a UUID for each
doc.

ryan
