Re: 400 MB Fields

2011-06-08 Thread Alexander Kanarsky
Otis, not sure about Solr, but with Lucene it was certainly doable. I saw fields way bigger than 400 MB indexed, sometimes having a large set of unique terms as well (think something like a log file with lots of alphanumeric tokens, a couple of gigs in size). While indexing and querying of such thi

RE: 400 MB Fields

2011-06-07 Thread Burton-West, Tom
Hi Otis, Our OCR fields average around 800 KB. My guess is that the largest docs we index (in a single OCR field) are somewhere between 2 and 10 MB. We have had issues where the in-memory representation of the document (the in-memory index structures being built) is several times the size of t

Re: 400 MB Fields

2011-06-07 Thread Lance Norskog
The Salesforce book is 2800 pages of PDF, last I looked. What can you do with a field that big? Can you get all of the snippets? On Tue, Jun 7, 2011 at 5:33 PM, Fuad Efendi wrote: > Hi Otis, > > > I am recalling "pagination" feature, it is still unresolved (with default > scoring implementation)

Re: 400 MB Fields

2011-06-07 Thread Fuad Efendi
Hi Otis, I recall the "pagination" issue; it is still unresolved (with the default scoring implementation): even with small documents, searching and retrieving documents 1 to 10 can take 0 milliseconds, but 100,000 to 100,010 can take a few minutes (I saw it with trunk version 6 months ago, and wi
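
One plausible reason deep pages cost more, sketched in Python rather than taken from Lucene's source: a top-N collector has to keep a priority queue covering every hit up to the end of the requested page, so the queue size grows with the offset. The function name `top_page` and the list-of-scores input are hypothetical, chosen only for illustration; this is a simplified model, not the actual Solr or Lucene implementation.

```python
import heapq

def top_page(scores, start, rows):
    """Return (doc_id, score) pairs for ranks start..start+rows-1, best first.

    `scores` is a hypothetical list where scores[doc_id] is that document's score.
    """
    k = start + rows  # the queue must hold every hit up to the end of the page
    heap = []         # min-heap of (score, doc_id), never larger than k
    for doc_id, score in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            # Better than the weakest retained hit: swap it in.
            heapq.heapreplace(heap, (score, doc_id))
    ranked = sorted(heap, reverse=True)
    return [(doc_id, score) for score, doc_id in ranked[start:start + rows]]
```

Under this model, asking for hits 1 to 10 keeps a queue of 10 entries, while asking for hits 100,000 to 100,010 keeps roughly 100,010, and each queue update costs time logarithmic in that size.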

Re: 400 MB Fields

2011-06-07 Thread Otis Gospodnetic
Hi, > I think the question is strange... Maybe you are wondering about possible > OOM exceptions? No, that's an easier one. I was more wondering whether with 400 MB Fields (indexed, not stored) it becomes incredibly slow to: * analyze * commit / write to disk * search > I think we can pass

Re: 400 MB Fields

2011-06-07 Thread Fuad Efendi
I think the question is strange... Maybe you are wondering about possible OOM exceptions? I think we can pass to Lucene a single document containing a comma-separated list of "term, term, ..." (a few billion times)... Except "stored" and "TermVectorComponent"... I believe thousands of companies already in

Re: 400 MB Fields

2011-06-07 Thread Erick Erickson
From older (2.4) Lucene days, I once indexed the 23-volume "Encyclopedia of Michigan Civil War Volunteers" in a single document/field, so it's probably within the realm of possibility at least ... Erick On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic wrote: > Hello, > > What are the biggest do