Hi Erick,

You are absolutely right about the memory calculations. I am glad to finally know that I was doing something wrong. Yes, I was getting confused with SQL.
I will back up and explain the use case. I am tracking file versions, and I want to give users an option to browse the system for the latest files. So, in order to remove duplicates (same filename), I used grouping.

Also, when you say sharding, is it okay if I use multiple cores, and does each core need a separate Tomcat? What I meant to ask is: can I use the same machine? The 150 million docs have about 120 million unique paths, too.

One more thing: if sharding means I need a new box, that won't be great, because this system still has horsepower left that I could use.

Thanks a ton for explaining the issue.

Erick Erickson <erickerick...@gmail.com> wrote:

You're putting a lot of data on a single box, then asking to group on what I presume is a string field. That's just going to eat up a _bunch_ of memory.

Let's say your average file name is 16 characters long. Each unique value will take up 58 + 32 bytes (58 bytes of overhead, presuming Solr 3.x, plus 16*2 bytes for the chars). So we're up to 90 bytes/string * the number of distinct file names. Say you have, for argument's sake, 100M distinct file names. You're up to a 9G memory requirement for sorting alone. Solr's sorting reads all the unique values into memory whether or not they satisfy the query...

And grouping can also be expensive. I don't think you really want to group in this case; I'd simply use a filter query, something like:

fq=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"

Then you're also grouping on conv_sort, which doesn't make much sense. Do you really want individual results returned for _each_ file name? What it looks like to me is that you're confusing SQL with Solr search and getting into bad situations...

Also, 150M documents in a single shard is... really a lot. You're probably at a point where you need to shard. Not to mention that your 400G index is trying to be jammed into 12G of memory.

This actually feels like an XY problem. Can you back up and let us know what the use-case you're trying to solve is?
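Erick's back-of-the-envelope FieldCache estimate can be checked with a few lines of arithmetic. This is only a sketch of his reasoning: the 58-byte per-entry overhead is his assumed Solr 3.x figure, and the 100M distinct names are his for-argument's-sake number, not measurements from this index.

```java
public class FieldCacheEstimate {
    public static void main(String[] args) {
        long distinctFileNames = 100_000_000L; // assumed, for argument's sake
        int avgNameLength = 16;                // characters, assumed average
        int perEntryOverhead = 58;             // bytes; Erick's Solr 3.x assumption
        int bytesPerChar = 2;                  // Java strings store UTF-16 chars

        // 58 + 16*2 = 90 bytes per distinct string
        long bytesPerEntry = perEntryOverhead + (long) avgNameLength * bytesPerChar;
        long totalBytes = distinctFileNames * bytesPerEntry;

        System.out.printf("%d bytes/entry, ~%.1f GB total%n",
                bytesPerEntry, totalBytes / 1e9);
        // prints: 90 bytes/entry, ~9.0 GB total
    }
}
```

That ~9 GB is for the uninverted string values alone, before the index's other heap demands, which is why it collides with a 12 GB Tomcat heap.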
Perhaps there are less memory-consumptive solutions possible.

Best,
Erick

On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee <tchatter...@commvault.com> wrote:
> Editing the query... remove <smb:.....>; I don't know where it came from while I did copy/paste....
>
> Tirthankar Chatterjee <tchatter...@commvault.com> wrote:
>
> Hi,
> I have a beefy box with 24GB RAM (12GB for Tomcat 7, which houses Solr 3.6), 2 Intel Xeon 64-bit processors, 30TB HDD, and JDK 1.7.0_03 x64.
>
> Data index dir size: 400GB
> Metadata of files is stored in it. I have around 15 schema fields.
> Total number of items: 150 million approx.
>
> I have a scenario which I will try to explain to the best of my knowledge here.
>
> Let us consider the fields I am interested in:
>
> url: entire path of a file in the Windows file system, including the filename, e.g. C:\Documents\A.txt
> mtm: modified time of the file
> jid: job ID
> conv_sort: a string field where the filename is stored
>
> I run a job where the following gets inserted:
>
> Total items: 2
> url: C:\personal\A1.txt
> mtm: 08/14/2012 12:00:00
> jid: 1
> conv_sort: A1.txt
> -----------------------------------
> url: C:\personal\B1.txt
> mtm: 08/14/2012 12:01:00
> jid: 1
> conv_sort: B1.txt
>
> In the second run, only one item changes:
>
> url: C:\personal\A1.txt
> mtm: 08/15/2012 1:00:00
> jid: 2
> conv_sort: A1.txt
>
> When queried, I would like to return the latest A1.txt and B1.txt to the end user. I am trying to use grouping with no luck.
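For reference, the "latest version per filename" result the poster is after (collapse on conv_sort, keep the row with the highest mtm) amounts to the following. This is a client-side sketch with a hypothetical Doc class standing in for an indexed document; it illustrates the desired semantics, not how Solr grouping works internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LatestVersionDedup {

    // Hypothetical stand-in for one indexed document.
    static class Doc {
        final String url, convSort;
        final long mtm;   // modified time, encoded so that larger == newer
        final int jid;
        Doc(String url, long mtm, int jid, String convSort) {
            this.url = url; this.mtm = mtm; this.jid = jid; this.convSort = convSort;
        }
    }

    // Keep only the newest doc (highest mtm) for each conv_sort value.
    static Map<String, Doc> latestPerFilename(List<Doc> docs) {
        Map<String, Doc> latest = new LinkedHashMap<>();
        for (Doc d : docs) {
            Doc old = latest.get(d.convSort);
            if (old == null || d.mtm > old.mtm) {
                latest.put(d.convSort, d);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // The three inserts from the two job runs described in the thread.
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("C:\\personal\\A1.txt", 201208141200L, 1, "A1.txt"));
        docs.add(new Doc("C:\\personal\\B1.txt", 201208141201L, 1, "B1.txt"));
        docs.add(new Doc("C:\\personal\\A1.txt", 201208150100L, 2, "A1.txt"));

        for (Doc d : latestPerFilename(docs).values()) {
            System.out.println(d.convSort + " -> jid " + d.jid);
        }
        // prints: A1.txt -> jid 2, then B1.txt -> jid 1
    }
}
```

Doing this collapse at index time (overwrite by a filename-based unique key) or in the client over a folder-scoped result set would avoid asking Solr to group over every distinct filename in the index.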
> It keeps throwing OOM... can someone please help, as it is critical for my project.
>
> The query I am trying: under a folder there are 1000 files, and I am putting in a filter query param too, asking it to group by filename or url, and neither works. What am I doing wrong here?
>
> http://172.19.108.78:8080/solr/select/?q=*:*&version=2.2&start=0&rows=10&indent=on&group.query=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"&group=true&group.limit=1&group.field=conv_sort&group.ngroups=true
>
> The stack trace:
>
> SEVERE: java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOfRange(Unknown Source)
>         at java.lang.String.<init>(Unknown Source)
>         at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
>         at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
>         at org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:882)
>         at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:233)
>         at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:856)
>         at org.apache.lucene.search.grouping.TermFirstPassGroupingCollector.setNextReader(TermFirstPassGroupingCollector.java:74)
>         at org.apache.lucene.search.MultiCollector.setNextReader(MultiCollector.java:113)
>         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:576)
>         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:364)
>         at org.apache.solr.search.Grouping.searchWithTimeLimiter(Grouping.java:376)
>         at org.apache.solr.search.Grouping.execute(Grouping.java:298)
>         at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:372)
>         at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>         at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
>         at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>         at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1001)
>         at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:585)
>         at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1770)
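Following Erick's suggestion earlier in the thread, the grouping request above could be replaced by a plain filter query on filefolder plus a newest-first sort on mtm. A sketch of building that URL follows; the host, field names, and folder value are taken from the thread, the escaping is done by hand rather than via SolrJ, and note this avoids the grouping FieldCache cost but does not by itself collapse duplicate filenames.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FilterQueryUrl {
    public static void main(String[] args) {
        String base = "http://172.19.108.78:8080/solr/select/";
        // fq restricts results to one folder without scoring or grouping;
        // sorting by mtm desc puts the latest file versions first.
        String fq = "filefolder:\"E\\:\\\\pd_dst\\\\646c6907-a948-4b83-ac1d-d44742bb0307\"";
        String url = base
                + "?q=*:*"
                + "&fq=" + URLEncoder.encode(fq, StandardCharsets.UTF_8)
                + "&sort=" + URLEncoder.encode("mtm desc", StandardCharsets.UTF_8)
                + "&start=0&rows=10";
        System.out.println(url);
    }
}
```

With the folder filter in place, only the files under that one folder (about 1000 here) come back, and any remaining per-filename deduplication can be done client-side over that small result set.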