Hi Erick,

You are absolutely right about the memory calculations. I am glad to finally know that I was doing something wrong. Yes, I was getting confused with SQL.
I will back up and explain the use case. I am tracking file versions, and I want to give users an option to browse the system for the latest files. So, in order to remove duplicates (same filename), I used grouping.

Also, when you say sharding, is it okay if I use multiple cores, and does each core need a separate Tomcat? What I meant to ask is: can I use the same machine? The 150 million docs have about 120 million unique paths, too.

One more thing: if sharding means I need a new box, that won't be great, because this system still has horsepower left that I could use.

Thanks a ton for explaining the issue.

Erick Erickson <erickerick...@gmail.com> wrote:

You're putting a lot of data on a single box, then asking to group on what I presume is a string field. That's just going to eat up a _bunch_ of memory.

Let's say your average file name is 16 characters long. Each unique value will take up 58 + 32 bytes (58 bytes of overhead, presuming Solr 3.x, plus 16*2 bytes for the chars). So we're up to 90 bytes/string * the number of distinct file names. Say you have, for argument's sake, 100M distinct file names. You're up to a 9G memory requirement for sorting alone. Solr's sorting reads all the unique values into memory whether or not they satisfy the query...

And grouping can also be expensive. I don't think you really want to group in this case; I'd simply use a filter query, something like:

fq=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"

Then you're also grouping on conv_sort, which doesn't make much sense. Do you really want individual results returned for _each_ file name? What it looks like to me is that you're confusing SQL with Solr search and getting into bad situations...

Also, 150M documents in a single shard is... really a lot. You're probably at a point where you need to shard. Not to mention that your 400G index is trying to be jammed into 12G of memory.

This actually feels like an XY problem. Can you back up and let us know what the use-case you're trying to solve is?
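Erick's back-of-the-envelope FieldCache estimate can be checked with a few lines of arithmetic. This is only a sketch of his reasoning: the 58-byte per-entry overhead is his assumed Solr 3.x figure, and the 100M distinct names are his for-argument's-sake number, not measurements from this index.

```java
public class FieldCacheEstimate {
    public static void main(String[] args) {
        long distinctFileNames = 100_000_000L; // assumed, for argument's sake
        int avgNameLength = 16;                // characters, assumed average
        int perEntryOverhead = 58;             // bytes; Erick's Solr 3.x assumption
        int bytesPerChar = 2;                  // Java strings store UTF-16 chars

        // 58 + 16*2 = 90 bytes per distinct string
        long bytesPerEntry = perEntryOverhead + (long) avgNameLength * bytesPerChar;
        long totalBytes = distinctFileNames * bytesPerEntry;

        System.out.printf("%d bytes/entry, ~%.1f GB total%n",
                bytesPerEntry, totalBytes / 1e9);
        // prints: 90 bytes/entry, ~9.0 GB total
    }
}
```

That ~9 GB is for the uninverted string values alone, before the index's other heap demands, which is why it collides with a 12 GB Tomcat heap.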
Perhaps there are less memory-consumptive solutions possible.

Best,
Erick

On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee <tchatter...@commvault.com> wrote:
> Editing the query... remove <smb:.....>; I don't know where it came from while I did copy/paste....
>
> Tirthankar Chatterjee <tchatter...@commvault.com> wrote:
>
> Hi,
> I have a beefy box with 24GB RAM (12GB for Tomcat 7, which houses Solr 3.6), 2 Intel Xeon 64-bit processors, 30TB HDD, and JDK 1.7.0_03 x64.
>
> Data index dir size: 400GB
> Metadata of files is stored in it. I have around 15 schema fields.
> Total number of items: 150 million approx.
>
> I have a scenario which I will try to explain to the best of my knowledge here.
>
> Let us consider the fields I am interested in:
>
> url: entire path of a file in the Windows file system, including the filename, e.g. C:\Documents\A.txt
> mtm: modified time of the file
> jid: job ID
> conv_sort: a string field where the filename is stored
>
> I run a job where the following gets inserted:
>
> Total items: 2
> url: C:\personal\A1.txt
> mtm: 08/14/2012 12:00:00
> jid: 1
> conv_sort: A1.txt
> -----------------------------------
> url: C:\personal\B1.txt
> mtm: 08/14/2012 12:01:00
> jid: 1
> conv_sort: B1.txt
>
> In the second run, only one item changes:
>
> url: C:\personal\A1.txt
> mtm: 08/15/2012 1:00:00
> jid: 2
> conv_sort: A1.txt
>
> When queried, I would like to return the latest A1.txt and B1.txt to the end user. I am trying to use grouping with no luck.
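For reference, the "latest version per filename" result the poster is after (collapse on conv_sort, keep the row with the highest mtm) amounts to the following. This is a client-side sketch with a hypothetical Doc class standing in for an indexed document; it illustrates the desired semantics, not how Solr grouping works internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LatestVersionDedup {

    // Hypothetical stand-in for one indexed document.
    static class Doc {
        final String url, convSort;
        final long mtm;   // modified time, encoded so that larger == newer
        final int jid;
        Doc(String url, long mtm, int jid, String convSort) {
            this.url = url; this.mtm = mtm; this.jid = jid; this.convSort = convSort;
        }
    }

    // Keep only the newest doc (highest mtm) for each conv_sort value.
    static Map<String, Doc> latestPerFilename(List<Doc> docs) {
        Map<String, Doc> latest = new LinkedHashMap<>();
        for (Doc d : docs) {
            Doc old = latest.get(d.convSort);
            if (old == null || d.mtm > old.mtm) {
                latest.put(d.convSort, d);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // The three inserts from the two job runs described in the thread.
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("C:\\personal\\A1.txt", 201208141200L, 1, "A1.txt"));
        docs.add(new Doc("C:\\personal\\B1.txt", 201208141201L, 1, "B1.txt"));
        docs.add(new Doc("C:\\personal\\A1.txt", 201208150100L, 2, "A1.txt"));

        for (Doc d : latestPerFilename(docs).values()) {
            System.out.println(d.convSort + " -> jid " + d.jid);
        }
        // prints: A1.txt -> jid 2, then B1.txt -> jid 1
    }
}
```

Doing this collapse at index time (overwrite by a filename-based unique key) or in the client over a folder-scoped result set would avoid asking Solr to group over every distinct filename in the index.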
> It keeps throwing OOM... can someone please help, as it is critical for my project.
>
> The query I am trying: under a folder there are 1000 files, and I am putting in a filter query param too, asking it to group by filename or url, and neither works. What am I doing wrong here?
>
> http://172.19.108.78:8080/solr/select/?q=*:*&version=2.2&start=0&rows=10&indent=on&group.query=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"&group=true&group.limit=1&group.field=conv_sort&group.ngroups=true
>
> The stack trace:
>
> SEVERE: java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOfRange(Unknown Source)
>         at java.lang.String.<init>(Unknown Source)
>         at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
>         at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
>         at org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:882)
>         at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:233)
>         at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:856)
>         at org.apache.lucene.search.grouping.TermFirstPassGroupingCollector.setNextReader(TermFirstPassGroupingCollector.java:74)
>         at org.apache.lucene.search.MultiCollector.setNextReader(MultiCollector.java:113)
>         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:576)
>         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:364)
>         at org.apache.solr.search.Grouping.searchWithTimeLimiter(Grouping.java:376)
>         at org.apache.solr.search.Grouping.execute(Grouping.java:298)
>         at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:372)
>         at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>         at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
>         at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>         at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1001)
>         at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:585)
>         at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1770)
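Following Erick's suggestion earlier in the thread, the grouping request above could be replaced by a plain filter query on filefolder plus a newest-first sort on mtm. A sketch of building that URL follows; the host, field names, and folder value are taken from the thread, the escaping is done by hand rather than via SolrJ, and note this avoids the grouping FieldCache cost but does not by itself collapse duplicate filenames.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FilterQueryUrl {
    public static void main(String[] args) {
        String base = "http://172.19.108.78:8080/solr/select/";
        // fq restricts results to one folder without scoring or grouping;
        // sorting by mtm desc puts the latest file versions first.
        String fq = "filefolder:\"E\\:\\\\pd_dst\\\\646c6907-a948-4b83-ac1d-d44742bb0307\"";
        String url = base
                + "?q=*:*"
                + "&fq=" + URLEncoder.encode(fq, StandardCharsets.UTF_8)
                + "&sort=" + URLEncoder.encode("mtm desc", StandardCharsets.UTF_8)
                + "&start=0&rows=10";
        System.out.println(url);
    }
}
```

With the folder filter in place, only the files under that one folder (about 1000 here) come back, and any remaining per-filename deduplication can be done client-side over that small result set.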