Re: retrieve lucene "doc id"

Norberto Meijome Mon, 17 Dec 2007 20:35:11 -0800

On Mon, 17 Dec 2007 14:43:55 -0500
"Norskog, Lance" <[EMAIL PROTECTED]> wrote:


> We are using MD5 to generate our IDs. MD5s are 128 bits creating a very
> unique and very randomized number for the content. Nobody has ever
> reported two different data sets that create the same MD5.

Yup, we use two MD5s concatenated. The first part is the MD5 of a group name; the
second part is derived from the item in the group. The same item can be in
different groups, so the second part can repeat across IDs, but an item can
appear only once in any given group, so the combined ID is always unique.
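
For anyone wanting to try the same scheme, here is a minimal sketch of the
concatenated ID. It assumes the second half is the MD5 of some item key, which
is only one way to read "related to the item"; the names md5_hex, make_doc_id,
"group-a" and "item-42" are made up for illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the 32-character lowercase hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def make_doc_id(group_name: str, item_key: str) -> str:
    """Build a 64-character ID: MD5 of the group name, then MD5 of the item key.

    Assumption: the item half is an MD5 of an item key. Any stable
    per-item value of fixed width would work the same way.
    """
    return md5_hex(group_name) + md5_hex(item_key)


# Hypothetical group and item names.
id_a = make_doc_id("group-a", "item-42")
id_b = make_doc_id("group-b", "item-42")

# The same item in two groups shares the second half but not the first.
same_item_half = id_a[32:] == id_b[32:]
same_group_half = id_a[:32] == id_b[:32]
```

Because each half has a fixed width of 32 hex characters, the group half can
always be recovered by taking the first 32 characters of the ID.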

> 
> We use the standard (some RFC) text representation of 32 hex characters.
> This has the advantage that F* pulls 1/16 of the total index, with a
> completely randomized distribution, F**  1/256, etc.  This is very handy
> for data analysis and document extraction. 

Yup, and in our case the first half of the docId could be used to get all
items in a group. But your example is a good one - I haven't used it for that
yet, but it's a simple and practical use of the doc id :)
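
A rough sketch of both prefix tricks, done on an in-memory list rather than a
real index (in Lucene itself this would presumably be a prefix query on the ID
field). The document names, group names and the ids_with_prefix helper are
hypothetical, and the IDs here are lowercase hex, so the prefix is lowercased
before matching.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the 32-character lowercase hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def ids_with_prefix(ids, prefix):
    """Return the IDs that start with the given hex prefix (case-insensitive)."""
    prefix = prefix.lower()
    return [doc_id for doc_id in ids if doc_id.startswith(prefix)]


# 1) Sampling: one leading hex digit selects roughly 1/16 of the IDs,
#    two digits roughly 1/256. The documents are made-up placeholders.
ids = [md5_hex(f"doc-{n}") for n in range(16000)]
sample = ids_with_prefix(ids, "F")
fraction = len(sample) / len(ids)  # close to 1/16, not exactly

# 2) Grouping: with concatenated IDs, the group's MD5 is a 32-character
#    prefix shared by every item in that group.
composite_ids = [
    md5_hex(group) + md5_hex(item)
    for group in ("group-a", "group-b")
    for item in ("item-1", "item-2", "item-3")
]
group_a_items = ids_with_prefix(composite_ids, md5_hex("group-a"))
```

The sampling fraction is only approximately 1/16 because it depends on how the
hashes happen to fall; the group lookup is exact.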

cheers,
B
_________________________
{Beto|Norberto|Numard} Meijome

"I was born not knowing and have had only a little time to change that here and 
there." 
  Richard Feynman

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.
