Thanks, Yonik. That explains it.

Regards,
Sourav

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Monday, November 24, 2008 7:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Sorting and JVM heap size ....

On Mon, Nov 24, 2008 at 9:19 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi Yonik,
>
> Thanks again for the detailed input.
>
> Let me try to re-confirm my understanding -
>
> 1. What you are saying is: if a sort is requested on a field, that field's
> values from all indexed documents are held in memory in un-inverted form.
> So if I have a String field of, say, 20 characters, then (assuming plain
> ASCII, no multibyte characters) for 200M documents I need at least
> 20 x 200 MB, i.e. 4 GB of memory.

That's the general idea, yes.
For Strings, it's actually just the unique values in a String[], plus
an int[200000000] of offsets into that String[] for each document.
See Lucene's FieldCache and StringIndex.
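
A minimal sketch of the layout described above, for illustration only: it is
not Lucene's FieldCache code. The class name StringIndexSketch, the helper
estimateBytes, and the bytesPerValue parameter are hypothetical, and the
estimate ignores JVM object headers and the real in-memory size of a Java
String. It assumes every document has exactly one non-null value.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Simplified illustration of an un-inverted string field:
// one array of sorted unique values plus one int per document.
final class StringIndexSketch {
    final String[] lookup; // sorted unique values across all documents
    final int[] order;     // order[doc] = position of that doc's value in lookup

    StringIndexSketch(String[] valuePerDoc) {
        // Collect the unique values in sorted order.
        lookup = new TreeSet<>(Arrays.asList(valuePerDoc)).toArray(new String[0]);
        order = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            // Every value is present in lookup, so the result is never negative.
            order[doc] = Arrays.binarySearch(lookup, valuePerDoc[doc]);
        }
    }

    // Sorting compares two small ints instead of two strings.
    int compareDocs(int docA, int docB) {
        return Integer.compare(order[docA], order[docB]);
    }

    // Rough payload estimate: 4 bytes per document for the int array,
    // plus an assumed bytesPerValue for each unique value.
    static long estimateBytes(long numDocs, long uniqueValues, long bytesPerValue) {
        return numDocs * 4L + uniqueValues * bytesPerValue;
    }

    public static void main(String[] args) {
        StringIndexSketch idx = new StringIndexSketch(new String[] {"b", "a", "b", "c"});
        System.out.println(Arrays.toString(idx.lookup));
        System.out.println(Arrays.toString(idx.order));
        // 200M documents; the unique-value count and size are made-up inputs.
        System.out.println(estimateBytes(200_000_000L, 1_000_000L, 20L));
    }
}
```

The per-document cost is fixed at 4 bytes by the int array, so the total
depends mostly on how many distinct values the field holds rather than on
the length of each value times the document count.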

-Yonik


> 2. So, if I want to have sorting on 2 such fields I need to allocate at least 
> 8 GB of memory.
>
> 3. Another case is - if there are 2 search requests concurrently hitting the 
> server, each with sorting on the same 20 character date field, then also it 
> would need 2x2GB memory. So if I know that I need to support at least 4 
> concurrent search requests, I need to start the JVM at least with 8 GB heap 
> size.
>
> Please let me know if my understanding is correct.
>
> Regards,
> Sourav
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
> Sent: Monday, November 24, 2008 6:03 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Sorting and JVM heap size ....
>
> On Mon, Nov 24, 2008 at 8:48 PM, souravm <[EMAIL PROTECTED]> wrote:
>> I have around 200M documents in the index. The field I'm sorting on is a
>> date string (containing date and time in dd-mmm-yyyy hh:mm:ss format) and
>> the field is part of the search criteria.
>>
>> Also please note that the number of documents returned by the search
>> criteria is much less than 200M. In fact, even with 0 hits I got a JVM
>> out-of-memory exception.
>
> Right... that's just how the Lucene FieldCache used for sorting works
> right now.
> The entire field is un-inverted and held in memory.
>
> 200M docs is a *lot*... you might try indexing your date fields as
> integer types that would take only 4 bytes per doc - and that will
> still take up 800M.  Given that 2 searchers can overlap, that still
> adds up to more than your heap - you will need to up that.
>
> The other option is to split your index across multiple nodes and use
> distributed search.  If you want to do any faceting in the future, or
> sort on multiple fields, you will need to do this anyway.
>
> -Yonik
>
