facet.missing=true returns null records with zero count also
All,

We had a requirement in our Solr-powered application where customers want to see all the documents that have a blank value for a field. So when they facet on a field, if the field has null values, they should be able to select that facet value and see all documents. I thought facet.missing=true was the answer. When I set facet.missing=true in solrconfig.xml, I expected to get facet values that are null along with their count. However, when there is no null value, I do not want the null to be returned along with a count of zero, which is what is happening now.

Background information: Using SolrJ with Solr 3.4 and jdk7

Sample program:

    SolrQuery facquery = new SolrQuery();
    facquery.setQuery("*:*");
    facquery.addFilterQuery("Field2:\"ISC\"");
    facquery.setRows(0);
    facquery.setFacet(true);
    facquery.setFacetMinCount(1);
    facquery.setFacetLimit(2);
    String[] orderedFacetList = new String[] {"Field1", "Field2", "Field3"};
    for (int i = 0; i < orderedFacetList.length; i++) {
        facquery.addFacetField(orderedFacetList[i]);
    }
    try {
        facResponse = server.query(facquery);
    } catch (SolrServerException ex) {
    }
    FacetField ff1 = facResponse.getFacetField("Field2");
    int count = ff1.getValueCount(); // This gives a count of 2
    List flist = ff1.getValues();    // The values are [ISC (1077), null (0)]

In the above program, I am applying a filter on the field Field2 with a value ISC, so the results will be only documents that have ISC for Field2. My expectation is that flist in the above program should only return [ISC (1077)].

Appreciate any pointers on this. Thank you

- Rahul
Re: facet.missing=true returns null records with zero count also
Hoss,

We rely heavily on facet.mincount because once a user has selected a facet, it doesn't make sense for us to show that facet field to him and let him filter again with the same facet. Also, when a facet has only one value, it doesn't make sense to show it to the user, since searching with that facet is just going to give the same result set again. So facet.missing not working with facet.mincount is a bit of a hassle for us. We will handle it in our program. Thank you for the clarification.

- Rahul

On Wed, Jun 5, 2013 at 12:32 AM, Chris Hostetter wrote:
>
> : that facet value and see all documents. I thought facet.missing=true was
> : the answer.
>       ...
> : facquery.setFacetMinCount(1);
>
> Hmm, yeah -- it looks like facet.missing doesn't take facet.mincount into
> consideration.
>
> I don't remember if that was intentional or not, but as a special-case
> one-off count it seems like a toss-up as to whether it would be more or
> less surprising to hide it if it's below the mincount. (It's very similar
> to doing a one-off facet.query for example, and those are always included in
> the response and don't consider the facet.mincount either.)
>
> In general, this seems like a low-impact thing though, correct? I mean:
> the main advantage of facet.mincount is to keep what could be a very
> large amount of useless data from being streamed from the server to the
> client, particularly in the case of using facet.sort, where you really need
> the constraints eliminated server-side in order to get the sort and limit
> applied correctly.
>
> But with the facet.missing value, it's just a single value per field that
> can easily be ignored by the client if it's not desired because of the
> mincount. Or to put it another way: the amount of work needed to ignore
> this on the client is less than the amount of work to make it
> configurable to ignore it on the server.
>
>
> -Hoss
>
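A minimal SolrJ sketch of the client-side handling Hoss describes — dropping the facet.missing bucket (it comes back with a null name) when its count is below the mincount. The helper name is illustrative and the API assumed is SolrJ 3.x:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.response.FacetField;

    // Strip the "missing" entry, e.g. null (0), when it falls below
    // the mincount we asked for; everything else passes through.
    static List<FacetField.Count> applyMinCount(FacetField ff, long minCount) {
        List<FacetField.Count> keep = new ArrayList<FacetField.Count>();
        for (FacetField.Count c : ff.getValues()) {
            if (c.getName() == null && c.getCount() < minCount) {
                continue;
            }
            keep.add(c);
        }
        return keep;
    }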
OR query with null value and non-null value(s)
I have recently enabled facet.missing=true in solrconfig.xml, which gives null facet values also. As I understand it, the syntax to do a faceted search on a null value is something like this:
&fq=-price:[* TO *]
So when I want to search on a particular value (for example: 4) OR the null value, I would expect the syntax to be something like this:
&fq=(price:4+OR+(-price:[* TO *]))
But this does not work. After searching around some more, I read somewhere that the right way to achieve this would be:
fq=-(-price:4+AND+price:[*+TO+*])
Now this does work, but it seems like a very roundabout way. Is there a better way to achieve this? I use SolrJ with Solr 3.4. Thank you.

- Rahul
Re: OR query with null value and non-null value(s)
Thank you Shawn. This does work. To help me understand better, why do we need the *:* ? Shouldn't it be implicit? Shouldn't
fq=(price:4+OR+(-price:[* TO *]))      // does not work
mean the same as
fq=(price:4+OR+(*:* -price:[* TO *]))  // works
Why does Solr need the *:* there?

On Fri, Jun 7, 2013 at 12:07 AM, Shawn Heisey wrote:
> On 6/6/2013 12:28 PM, Rahul R wrote:
>> I have recently enabled facet.missing=true in solrconfig.xml which gives
>> null facet values also. As I understand it, the syntax to do a faceted
>> search on a null value is something like this:
>> &fq=-price:[* TO *]
>> So when I want to search on a particular value (for example : 4) OR null
>> value, I would expect the syntax to be something like this:
>> &fq=(price:4+OR+(-price:[* TO *]))
>> But this does not work. After searching around for more, read somewhere
>> that the right way to achieve this would be:
>> fq=-(-price:4+AND+price:[*+TO+*])
>> Now this does work but seems like a very roundabout way. Is there a better
>> way to achieve this ?
>
> Pure negative queries don't work -- you have to have results in the query
> before you can subtract. For some top-level queries, Solr is able to
> detect this situation and fix it internally, but on inner queries you must
> explicitly state your intentions. It is best if you always use '*:*
> -query' syntax, just to be safe.
>
> fq=(price:4+OR+(*:* -price:[* TO *]))
>
> Thanks,
> Shawn
>
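A short SolrJ sketch of the working form Shawn gives — the *:* hands the negative clause a full document set to subtract from, which is exactly what an inner pure-negative query is missing (field name and value carried over from the example above):

    import org.apache.solr.client.solrj.SolrQuery;

    // Match price:4 OR documents that have no price at all.
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("price:4 OR (*:* -price:[* TO *])");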
Re: OR query with null value and non-null value(s)
Thank you for the clarification Shawn.

On Fri, Jun 7, 2013 at 7:34 PM, Jack Krupansky wrote:
> Yes, it SHOULD! And in the LucidWorks Search query parser it does. Why
> doesn't it in Solr? Ask Yonik to explain that!
>
> -- Jack Krupansky
>
> -----Original Message----- From: Rahul R
> Sent: Friday, June 07, 2013 1:21 AM
> To: solr-user@lucene.apache.org
> Subject: Re: OR query with null value and non-null value(s)
>
> Thank you Shawn. This does work. To help me understand better, why do
> we need the *:* ? Shouldn't it be implicit ? Shouldn't
> fq=(price:4+OR+(-price:[* TO *]))      // does not work
> mean the same as
> fq=(price:4+OR+(*:* -price:[* TO *]))  // works
>
> Why does Solr need the *:* there ?
License Info
Hello,

Since Apache Solr is governed by the Apache License 2.0 - does it mean that all jar files bundled within Solr are also governed by the same license? Do I have to worry about checking the license information of all bundled jar files in my commercial Solr-powered application? Even if I use them independently of Solr, will the same license apply? Some of the jar files - slf4j-api-1.6.1.jar, jcl-over-slf4j-1.6.1.jar, etc. - do not have any license file inside the jar.

Regards
Rahul
Lucene FieldCache - Out of memory exception
Hello,

I am using Solr 1.3 with jdk 1.5.0_14 and the Weblogic 10MP1 application server on Solaris. I use the embedded Solr server. More details:
Number of docs in solr index : 1.4 million
Physical size of index : 640MB
Total number of fields in the index : 700 (99% of these are dynamic fields)
Total number of fields enabled for faceting : 440
Avg number of facet fields participating in a faceted query : 50-70
Total RAM allocated to weblogic appserver : 3GB (max possible)

In a multi-user environment with 3 users using this application for a period of around 40 minutes, the application runs out of memory. Analysis of the heap dump shows that almost 85% of the memory is retained by the FieldCache. Now I understand that the field cache is out of our control, but I would appreciate some suggestions on how to handle this issue.

Some questions on this front:
- Some mail threads on this forum seem to indicate that there could be some connection between having dynamic fields and usage of the FieldCache. Is this true? Most of the fields in my index are dynamic fields.
- As mentioned above, most of my faceted queries could have around 50-70 facet fields (I would do SolrQuery.addFacetField() for around 50-70 fields per query). Could this be the source of the problem? Is this too high for Solr to support?
- Initially, I had a facet.sort defined in solrconfig.xml. Since the FieldCache builds up on sorting, I even removed the facet.sort and tried, but no respite: the behavior is the same as before.
- The document id that I have for each document is quite big (around 50 characters on average). Can this be a problem? I reduced this to around 15 characters and tried, but still there is no improvement.
- Can the size of the data be a problem? But on this forum, I see many users talking of more than 100 million documents in their index. I have only 1.4 million with a physical size of 640MB. The physical server on which this application is running has sufficient RAM and CPU.
- What gets stored in the FieldCache? Is it the entire document or just the document id?

Any help is much appreciated. Thank you.

regards
Rahul
Re: get a total count
Hello,

A related question on this topic. How do I programmatically find the total number of documents across many shards? For EmbeddedSolrServer, I use the following command to get the total count:
solrSearcher.getStatistics().get("numDocs")
With distributed search, how do I get the count of all records in all shards? Apart from doing a *:* query, is there a way to get the total count? I am not able to use the same command above because I am not able to get a handle to the SolrIndexSearcher object with distributed search. The conf and data directories of my index reside directly under a folder called solr (no core) under the weblogic domain. I don't have a SolrCore object. With EmbeddedSolrServer, I used to get the SolrIndexSearcher object using the following call:
solrSearcher = (SolrIndexSearcher)SolrCoreObject.getSearcher().get();

Stack information:
OS : Solaris
jdk : 1.5.0_14 32 bit
Solr : 1.3
App Server : Weblogic 10MP1

Thank you.
- Rahul

On Tue, Nov 15, 2011 at 10:49 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> I'm assuming the question was about how MANY documents have been indexed
> across all shards.
>
> Answer #1:
> Look at the Solr Admin Stats page on each of your Solr instances and add
> up the numDocs numbers you see there
>
> Answer #2:
> Use Sematext's free Performance Monitoring tool for Solr
> On the Index report choose "all, sum" in the Solr Host selector and that will
> show you the total # of docs across the cluster, total # of deleted docs,
> total segments, total size on disk, etc.
> URL: http://www.sematext.com/spm/solr-performance-monitoring/index.html
>
> Otis
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> > From: U Anonym
> > To: solr-user@lucene.apache.org
> > Sent: Monday, November 14, 2011 11:50 AM
> > Subject: get a total count
> >
> > Hello everyone,
> >
> > A newbie question: how do I find out how documents have been indexed
> > across all shards?
> >
> > Thanks much!
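For what it's worth, a hedged SolrJ sketch of the *:* approach mentioned above — rows=0 keeps the response tiny, and numFound is the merged count across all shards, so no SolrIndexSearcher handle is needed (shard URLs are illustrative; exception handling omitted):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    // Ask any one instance for zero rows across all shards and read the
    // merged hit count from the response.
    SolrServer server = new CommonsHttpSolrServer("http://host1:7001/solr");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    q.set("shards", "host1:7001/solr,host2:7001/solr");
    long total = server.query(q).getResults().getNumFound();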
Re: Lucene FieldCache - Out of memory exception
Here is one sample query that I picked up from the log file:

q=*%3A*&fq=Category%3A%223__107%22&fq=S_P1540477699%3A%22MICROCIRCUIT%2C+LINE+TRANSCEIVERS%22&rows=0&facet=true&facet.mincount=1&facet.limit=2&facet.field=S_C1503120369&facet.field=S_P1406389942&facet.field=S_P1430116878&facet.field=S_P1430116881&facet.field=S_P1406453552&facet.field=S_P1406451296&facet.field=S_P1406452465&facet.field=S_C2968809156&facet.field=S_P1406389980&facet.field=S_P1540477699&facet.field=S_P1406389982&facet.field=S_P1406389984&facet.field=S_P1406451284&facet.field=S_P1406389926&facet.field=S_P1424886581&facet.field=S_P2017662632&facet.field=F_P1946367021&facet.field=S_P1430116884&facet.field=S_P2017662620&facet.field=F_P1406451304&facet.field=F_P1406451306&facet.field=F_P1406451308&facet.field=S_P1500901421&facet.field=S_P1507138990&facet.field=I_P1406452433&facet.field=I_P1406453565&facet.field=I_P1406452463&facet.field=I_P1406453573&facet.field=I_P1406451324&facet.field=I_P1406451288&facet.field=S_P1406451282&facet.field=S_P1406452471&facet.field=S_P1424886605&facet.field=S_P1946367015&facet.field=S_P1424886598&facet.field=S_P1946367018&facet.field=S_P1406453556&facet.field=S_P1406389932&facet.field=S_P2017662623&facet.field=S_P1406450978&facet.field=F_P1406452455&facet.field=S_P1406389972&facet.field=S_P1406389974&facet.field=S_P1406389986&facet.field=F_P1946367027&facet.field=F_P1406451294&facet.field=F_P1406451286&facet.field=F_P1406451328&facet.field=S_P1424886593&facet.field=S_P1406453567&facet.field=S_P2017662629&facet.field=S_P1406453571&facet.field=F_P1946367030&facet.field=S_P1406453569&facet.field=S_P2017662626&facet.field=S_P1406389978&facet.field=F_P1946367024

My primary question here is: can Solr handle these kinds of queries with so many facet fields? I have tried using both enum and fc for facet.method and there is no improvement with either. Appreciate any help on this. Thank you.

- Rahul
Re: Lucene FieldCache - Out of memory exception
Jack,

Yes, the queries work fine till I hit the OOM. The fields that start with S_* are strings, F_* are floats, I_* are ints and so on. The dynamic field definitions from schema.xml :

*Each FieldCache will be an array with maxdoc entries (your total number of documents - 1.4 million) times the size of the field value or whatever a string reference is in your JVM*
So if I understand correctly - every field (dynamic or normal) will have its own field cache. The size of the field cache for any field will be (maxDocs * sizeOfField)? If the field has only 100 unique values, will it occupy (100 * sizeOfField) or will it still be (maxDocs * sizeOfField)?

*Roughly what is the typical or average length of one of your facet field values? And, on average, how many unique terms are there within a typical faceted field?*
Each field value may vary from 10 - 30 characters. Average of 20 maybe. The number of unique terms within a faceted field will vary from 100 - 1000. Average of 300. How will the number of unique terms affect performance?

*3 GB sounds like it might not be enough for such heavy use of faceting. It is probably not the 50-70 number, but the 440 or accumulated number across many queries that pushes the memory usage up*
I am using jdk1.5.0_14 - 32 bit. With a 32-bit jdk, I think there is a limitation that more RAM cannot be allocated.

*When you hit OOM, what does the Solr admin stats display say for FieldCache?*
I don't have Solr deployed as a separate web app. All Solr jar files are present in my webapp's WEB-INF\lib directory. I use EmbeddedSolrServer. So is there a way I can get this information that the admin would show?

Thank you for your time.

-Rahul

On Wed, May 2, 2012 at 5:19 PM, Jack Krupansky wrote:
> The FieldCache gets populated the first time a given field is referenced
> as a facet and then will stay around forever. So, as additional queries get
> executed with different facet fields, the number of FieldCache entries will
> grow.
>
> If I understand what you have said, these faceted queries do work
> initially, but after awhile they stop working with OOM, correct?
>
> The size of a single FieldCache depends on the field type. Since you are
> using dynamic fields, it depends on your "dynamicField" types - which you
> have not told us about. From your query I see that your fields start with
> "S_" and "F_" - presumably you have dynamic field types "S_*" and "F_*"?
> Are they strings, integers, floats, or what?
>
> Each FieldCache will be an array with maxdoc entries (your total number of
> documents - 1.4 million) times the size of the field value or whatever a
> string reference is in your JVM.
>
> String fields will take more space than numeric fields for the FieldCache,
> since a separate table is maintained for the unique terms in that field.
> Roughly what is the typical or average length of one of your facet field
> values? And, on average, how many unique terms are there within a typical
> faceted field?
>
> If you can convert many of these faceted fields to simple integers the
> size should go down dramatically, but that depends on your application.
>
> 3 GB sounds like it might not be enough for such heavy use of faceting. It
> is probably not the 50-70 number, but the 440 or accumulated number across
> many queries that pushes the memory usage up.
>
> When you hit OOM, what does the Solr admin stats display say for
> FieldCache?
>
> -- Jack Krupansky
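Plugging this thread's numbers into Jack's description gives a rough sense of why a 3 GB heap eventually runs out. The per-entry byte costs below are assumptions for a 32-bit JVM, not measured values; the document and field counts come from the thread:

    // Rough FieldCache estimate: one maxdoc-sized array per faceted
    // string field, plus that field's unique-term table.
    long maxDoc = 1400000L;      // documents in the index
    int refBytes = 4;            // assumed per-doc string reference (32-bit JVM)
    int uniqueTerms = 300;       // average unique values per field
    int bytesPerTerm = 80;       // ~20 chars per value, with String overhead (assumed)
    long perField = maxDoc * refBytes + (long) uniqueTerms * bytesPerTerm;
    // ~5.4 MB per string field; across all 440 facet-enabled fields:
    long worstCase = perField * 440;   // ~2.3 GB once every field has been hit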
Re: Lucene FieldCache - Out of memory exception
Jack,

Sorry for the delayed response:
Total memory allocated : 3GB
Free memory on startup of application server : 2.85GB (95%)
Free memory after first request by first user (1 request involves 3 queries) : 2.7GB (90%)
Free memory after a few requests by same user : 2.52GB (84%)

All values above were recorded after 2 forced GCs to identify the free memory.

The progression of memory usage looks quite high with the above numbers. As the number of searches widens, the speed of memory consumption decreases. But at some point it does hit OOM.

- Rahul

On Thu, May 3, 2012 at 8:37 PM, Jack Krupansky wrote:
> Just for a baseline, how much memory is available in the JVM (using
> jconsole or something similar) before you do your first query, and then
> after your first query (that has these 50-70 facets), and then after a few
> different queries (different facets.) Just to see how close you are to "the
> edge" even before a volume of queries start coming in.
>
> -- Jack Krupansky
Re: Lucene FieldCache - Out of memory exception
An update on the things I tried today. Since multiValued fields do not use the FieldCache, I changed my schema to define all my fields as multiValued fields. Although these fields need to be only single-valued, I made this change, recreated the index and tested with it.

Observations:
- A forced GC always results in freeing up most of the heap, i.e. the FieldCache doesn't seem to be created. So the OOM issue does not occur.
- Response time is terribly slow for faceting queries. The application is almost unusable, and system monitoring shows high CPU usage.
- Using the Solr caches - documentCache, filterCache & queryResultsCache - does not seem to improve performance. Cache sizes are documentCache - 100K, filterCache - 10K, queryResultsCache - 10K.

I don't think I can use this as a solution because response times are very poor. But a few questions:
- The Solr documentation indicates that the FieldCache gets built up on sorting and function queries only. When I use single-valued fields, I don't do any explicit sorting or use any functions. Could there be some setting that results in automatic sorting happening on the result set (although I don't want a sort)?
- Is there a way I can improve faceting performance with all my fields as multiValued fields?

Appreciate any help on this. Thank you.

- Rahul
Solr Caches
Hello,

I am trying to understand how I can size the caches for my Solr-powered application. Some details on the index and application:
Solr Version : 1.3
JDK : 1.5.0_14 32 bit
OS : Solaris 10
App Server : Weblogic 10 MP1
Number of documents : 1 million
Total number of fields : 1000 (750 strings, 225 int/float/double/long, 25 boolean)
Number of fields on which faceting and filtering can be done : 400
Physical size of index : 600MB
Number of unique values for a field : ranges from 5 - 1000, average of 150
-Xms and -Xmx vals for jvm : 3G
Expected number of concurrent users : 15
No sorting planned for now

Now I want to set appropriate values for the caches. I have put below some of my understanding and questions about the caches. Please correct and answer accordingly.

FilterCache: As per the Solr wiki, this is used to store an unordered list of ids of matching documents for an fq param. So if a query contains two fq params, it will create two separate entries for each of these fq params. The value of each entry is the list of ids of all documents across the index that match the corresponding fq param. Each entry is independent of any other entry. A minimum size for the filterCache could be (total number of fields * avg number of unique values per field)? Is this correct? I have not enabled . The max physical size of the filterCache would be (size * avg byte size of a document id * avg number of docs returned per fq param)?

QueryResultsCache: Used to store an ordered list of ids of the documents that match the most commonly used searches. So if my query is something like q=Status:Active&fq=Org:Apache&fq=Version:13, it will create one entry that contains the list of ids of documents that match this full query. Is this correct? How can I size my queryResultsCache? Some entries from solrconfig.xml : 50 200. The max physical size of the queryResultsCache would be (size * avg byte size of a document id * avg number of docs per query). Is this correct?

documentCache: Stores the documents that are stored in the index. Say I do two searches that return three documents each, with 1 document being common between both result sets. This will result in 5 entries in the documentCache for the 5 unique documents that have been returned for the two queries? Is this correct? For sizing, the Solr wiki states that "The size for the documentCache should always be greater than <max_results> * <max_concurrent_queries>". Why do we need the max_concurrent_queries parameter here? Is it when max_results is much smaller than numDocs? In my case, a q=*:* search is done the first time the index is loaded. So, will setting the documentCache size to numDocs be correct? Can this be like the max that I need to allocate? The max physical size of the documentCache would be (size * avg byte size of a document in the index). Is this correct?

Thank you
-Rahul
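A short sketch of the sizing arithmetic laid out above, using this post's own numbers; the rows-per-page figure is an assumption, and the wiki rule referenced is size > max_results * max_concurrent_queries:

    // filterCache: at minimum, one entry per (field, unique value) pair
    // that can ever appear in an fq.
    int facetFields = 400;       // fields faceted/filtered on
    int avgUniqueValues = 150;   // average unique values per field
    int filterCacheMin = facetFields * avgUniqueValues;  // 60,000 entries

    // documentCache: the wiki rule of thumb, so an in-flight query never
    // has its documents evicted underneath it.
    int maxResults = 50;              // rows per page (assumed)
    int maxConcurrentQueries = 15;    // expected concurrent users
    int documentCacheMin = maxResults * maxConcurrentQueries;  // 750 entries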
Limiting facets for huge data - setting indexed=false in schema.xml
Hello,

We are trying to get Solr to work for a really huge parts database. Details of the database:
- 55 million parts
- 3700 properties (facets) in total. But each record will not have a value for every property.
- Most of these facets are defined as dynamic fields within the Solr index

We were getting really unacceptable timings while doing faceting/searches on an index created with this database. With only one user using the system, query times are in excess of 1 minute. With more users concurrently using the system, the response times are even higher. We thought that by limiting the number of properties that are available for faceting, the performance could be improved. To test this, we enabled only 6 properties for faceting by setting indexed=true (in schema.xml) for only these properties. All other properties, which are defined as dynamic properties, had indexed=false. The observations after this change:
- Index size reduced by a meagre 5% only
- Performance did not improve. In fact, during the PSR run we observed that it degraded.

My questions:
- Will reducing the number of facets improve faceting and search performance?
- Is there a better way to reduce the number of facets?
- Will having a large number of properties defined as dynamic fields reduce performance?

Thank you.

Regards
Rahul
Re: Limiting facets for huge data - setting indexed=false in schema.xml
Erik,

I understand that caching is going to improve performance. In fact, we did a PSR run with caches enabled and we got awesome results. But these wouldn't be really representative, because the PSR scripts will be doing the same searches again and again. These would be cached and there would be virtually no evictions. This is not a practical case.

My hardware (in the PSR environment where I am testing) is pretty good - 12 CPU, 24 GB RAM, UltraSPARC III 1.2 GHz processors, Solaris 10. We have allocated 3.2 GB RAM for Weblogic (the JVM). This is the maximum that I am able to allocate for one JVM.

I think I need to go back and check whether I am not using all the fields in the query. I understand that setting indexed=false alone will not ensure that all fields don't participate in the query.

Thanks a lot for your response.

Regards
Rahul

On Fri, Jul 31, 2009 at 3:33 PM, Erik Hatcher wrote:
>
> On Jul 31, 2009, at 2:35 AM, Rahul R wrote:
>
>> Hello,
>> We are trying to get Solr to work for a really huge parts database. Details
>> of the database
>> - 55 million parts
>> - Totally 3700 properties (facets). But each record will not have value for
>> all properties.
>> - Most of these facets are defined as dynamic fields within the Solr Index
>>
>> We were getting really unacceptable timing while doing faceting/searches on
>> an index created with this database.
>
> Were you accounting for cache warming? Were your caches sized
> appropriately? What kind of hardware and RAM were you using? What were the
> JVM settings?
>
> And certainly not least important - what version of Solr are you running?
> The difference in faceting performance and scalability between Solr 1.3 and
> what will be Solr 1.4 is quite dramatic.
>
>> We thought that by limiting the number of properties that are available for
>> faceting, the performance can be improved. To test this, we enabled only 6
>> properties for faceting by setting indexed=true (in schema.xml) for only
>> these properties. All other properties which are defined as dynamic
>> properties had indexed=false.
>
> These settings won't matter - what matters in this case is what facets you
> request, not what is actually in the index.
>
>> My questions:
>> - Will reducing the number of facets improve faceting and search
>> performance ?
>
> Reducing what fields you request will, of course. But what you actually
> index has no effect on performance until you request it.
>
>> - Is there a better way to reduce the number of facets ?
>
> Hard to say without doing a deeper analysis of your needs.
>
>> - Will having a large number of properties defined as dynamic fields, reduce
>> performance ?
>
> Dynamic fields versus statically named fields have no effect on
> performance.
>
>    Erik
>
Re: Limiting facets for huge data - setting indexed=false in schema.xml
In a production environment, having the caches enabled makes a lot of sense, and most definitely we will be enabling them. However, the primary idea of this exercise is to verify whether limiting the number of facets will actually improve the performance.

An update on this. I did verify, and it looks like although I set indexed=false for most of the properties, I had not blocked them from participating in the query. I have now enabled only 7 properties for faceting, so at any given time only a maximum of 7 facets will participate in the query. Performance has now improved from an erstwhile 60 seconds to around 10 seconds.

This really helped. Thanks a lot!

Regards
Rahul

On Fri, Jul 31, 2009 at 6:34 PM, Erik Hatcher wrote:
>
> On Jul 31, 2009, at 7:17 AM, Rahul R wrote:
>
>> Erik,
>> I understand that caching is going to improve performance. In fact we did a
>> PSR run with caches enabled and we got awesome results. But these wouldn't
>> be really representative because the PSR scripts will be doing the same
>> searches again and again. These would be cached and there would be virtually
>> no evictions. This is not a practical case.
>
> I don't understand how this is not practical. Why wouldn't having the
> caches warmed and filled with the facets be practical for your needs?
>
>    Erik
>
Re: Limiting facets for huge data - setting indexed=false in schema.xml
We are using 1.3.0. Thanks for the suggestion. Will see if I can try one of the nightly builds.

On Fri, Jul 31, 2009 at 7:49 PM, Erik Hatcher wrote:
> What version of Solr? Try a nightly build if you're at Solr 1.3 or
> earlier and you'll be amazed at the difference.
>
>    Erik
JVM Heap utilization & Memory leaks with Solr
I am trying to track memory utilization with my application that uses Solr. Details of the setup:
- 3rd party software : Solaris 10, Weblogic 10, jdk_150_14, Solr 1.3.0
- Hardware : 12 CPU, 24 GB RAM

For testing during PSR I am using a smaller subset of the actual data that I want to work with. Details of this smaller sub-set:
- 5 million records, 4.5 GB index size

Observations during PSR:
A) I have allocated 3.2 GB for the JVM(s) that I used. After all users log out and after a forced GC, only 60% of the heap is reclaimed. As part of the logout process I am invalidating the HttpSession and doing a close() on CoreContainer. From my application's side, I don't believe I am holding on to any resource. I wanted to know if there are known issues surrounding memory leaks with Solr?
B) To further test this, I tried deploying with shards. 3.2 GB was allocated to each JVM. All JVMs had 96% free heap space after start up. I got varying results with this.
Case 1 : Used 6 weblogic domains. My application was deployed on 1 domain. I split the 5 million index into 5 parts of 1 million each and used them as shards. After multiple users used the system and after a forced GC, around 94-96% of heap was reclaimed in all the JVMs.
Case 2 : Used 2 weblogic domains. My application was deployed on 1 domain. On the other, I deployed the entire 5 million document index as one shard. After multiple users used the system and after a forced GC, around 76% of the heap was reclaimed in the shard JVM, and 96% was reclaimed in the JVM where my application was running. This result further convinces me that my application can be absolved of holding on to memory resources.

I am not sure how to interpret these results. For searching, I am using:
Without shards : EmbeddedSolrServer
With shards : CommonsHttpSolrServer
In terms of Solr objects, this is what differs in my code between normal search and shards search (distributed search).

After looking at Case 1, I thought that the CommonsHttpSolrServer was more memory efficient, but Case 2 proved me wrong. Or could there still be memory leaks in my application? Any thoughts or suggestions would be welcome.

Regards
Rahul
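For reference, a small sketch of the measurement procedure described above — System.gc() is only a request to the JVM, which is why it is common to issue it more than once before sampling:

    // Force GC, then report how much heap is still retained.
    Runtime rt = Runtime.getRuntime();
    System.gc();
    System.gc();
    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    System.out.println("Heap retained after forced GC: " + usedMb + " MB");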
Re: Rotating the primary shard in /solr/select
Philip,

I cannot answer your question, but I do have a question for you. Does aggregation happen at the primary shard? For example, if I have three JVMs:
JVM 1 : My application powered by Solr
JVM 2 : Shard 1
JVM 3 : Shard 2

I initialize my SolrServer like this:
SolrServer _solrServer = new CommonsHttpSolrServer(shard1);

Does aggregation now happen at JVM 2? Is there any other reason for initializing the SolrServer with one of the shard URLs?

On Wed, Jul 29, 2009 at 2:57 AM, Phillip Farber wrote:
>
> Is there any value in a round-robin scheme to cycle through the Solr
> instances supporting a multi-shard index over several machines when sending
> queries, or is it better to just pick one instance and stick with it? I'm
> assuming all machines in the cluster have the same hardware specs.
>
> So scenario A (round-robin):
>
> query 1: /solr-shard-1/select?q=dog... shards=shard-1,shard2
> query 2: /solr-shard-2/select?q=dog... shards=shard-1,shard2
> query 3: /solr-shard-1/select?q=dog... shards=shard-1,shard2
> etc.
>
> or scenario B (fixed):
>
> query 1: /solr-shard-1/select?q=dog... shards=shard-1,shard2
> query 2: /solr-shard-1/select?q=dog... shards=shard-1,shard2
> query 3: /solr-shard-1/select?q=dog... shards=shard-1,shard2
> etc.
>
> Is there evidence that distributing the overhead of result merging over
> more machines (A) gives a performance boost?
>
> Thanks,
>
> Phil
>
Re: Rotating the primary shard in /solr/select
*The SolrServer is initialized to the server to which you want to send the request. It has nothing to do with distributed search by itself.*

But isn't the request sent to all the shards? We set all the shard URLs in the 'shards' parameter of our HttpRequest. Or is it something like: the request is first sent to the server (with which SolrServer is initialized) and from there it is sent to all the other shards?

Regards
Rahul

On Tue, Aug 4, 2009 at 2:29 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
> On Tue, Aug 4, 2009 at 11:26 AM, Rahul R wrote:
>
> > Philip,
> > I cannot answer your question, but I do have a question for you. Does
> > aggregation happen at the primary shard ? For eg : if I have three JVMs
> > JVM 1 : My application powered by Solr
> > JVM 2 : Shard 1
> > JVM 3 : Shard 2
> >
> > I initialize my SolrServer like this
> > SolrServer _solrServer = new CommonsHttpSolrServer(shard1);
> >
> > Does aggregation now happen at JVM 2 ?
>
> Yes.
>
> > Is there any other reason for
> > initializing the SolrServer with one of the shard URLs ?
>
> The SolrServer is initialized to the server to which you want to send the
> request. It has nothing to do with distributed search by itself.
>
> --
> Regards,
> Shalin Shekhar Mangar.
Re: Rotating the primary shard in /solr/select
Shalin, thank you for the clarification. Philip, I just realized that I have diverted the original topic of the thread. My apologies. Regards Rahul On Tue, Aug 4, 2009 at 3:35 PM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > On Tue, Aug 4, 2009 at 2:37 PM, Rahul R wrote: > > > *The SolrServer is initialized to the server to which you want to send > the > > request. It has nothing to do with distributed search by itself.* > > > > But isn't the request sent to all the shards ? We set all the shard urls > in > > the 'shards' parameter of our HttpRequest.Or is it something like the > > request is first sent to the server (with which SolrServer is > initialized) > > and from there it is sent to all the other shards ? > > > > The request is sent to the server with which SolrServer is initialized. > That > server makes use of the shards parameter, queries other servers, merges the > responses and sends it back to the client. > > -- > Regards, > Shalin Shekhar Mangar. >
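Tying the two threads together, a hedged SolrJ sketch of Phillip's scenario A given Shalin's explanation: every request carries the same shards list, but the instance doing the merge rotates. Host names and the counter are illustrative, and exception handling is omitted:

    import java.util.concurrent.atomic.AtomicInteger;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    String[] shards = {"host1:7001/solr", "host2:7001/solr"};
    AtomicInteger turn = new AtomicInteger();

    // Round-robin choice of aggregator: merge work is spread across the
    // cluster instead of always landing on shard 1.
    int i = turn.getAndIncrement() % shards.length;
    SolrServer server = new CommonsHttpSolrServer("http://" + shards[i]);
    SolrQuery q = new SolrQuery("dog");
    q.set("shards", shards[0] + "," + shards[1]);
    server.query(q);  // the chosen instance queries both shards and merges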
Re: JVM Heap utilization & Memory leaks with Solr
Otis,

Thank you for your response. I know there are a few variables here, but the difference in memory utilization with and without shards somehow leads me to believe that the leak could be within Solr.

I tried using a profiling tool - YourKit. The trial version was free for 15 days, but I couldn't find anything of significance.

Regards
Rahul

On Tue, Aug 4, 2009 at 7:35 PM, Otis Gospodnetic wrote:
> Hi Rahul,
>
> A) There are no known (to me) memory leaks.
> I think there are too many variables for a person to tell you what exactly
> is happening, plus you are dealing with the JVM here. :)
>
> Try jmap -histo:live PID-HERE | less and see what's using your memory.
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
Re: JVM Heap utilization & Memory leaks with Solr
*You should try to generate heap dumps and analyze the heap using a tool like the Eclipse Memory Analyzer. Maybe it helps spotting a group of objects holding a large amount of memory*

The tool that I used also allows capturing heap snapshots. Eclipse had a lot of pre-requisites: you need to apply some three or five patches before you can start using it. My observations with this tool were that some HashMaps were taking up a lot of space, although I could not pin it down to the exact HashMap. These would either be Weblogic's or Solr's. I will anyway give Eclipse's a try and see how it goes. Thanks for your input.

Rahul

On Wed, Aug 12, 2009 at 2:15 PM, Gunnar Wagenknecht wrote:
> Rahul R schrieb:
> > I tried using a profiling tool - YourKit. The trial version was free for 15
> > days. But I couldn't find anything of significance.
>
> You should try to generate heap dumps and analyze the heap using a tool
> like the Eclipse Memory Analyzer. Maybe it helps spotting a group of
> objects holding a large amount of memory.
>
> -Gunnar
>
> --
> Gunnar Wagenknecht
> gun...@wagenknecht.org
> http://wagenknecht.org/
>
Re: JVM Heap utilization & Memory leaks with Solr
My primary issue is not an Out of Memory error at run time. It is memory leaks: heap space not being released even after doing a force GC. So after some time, as progressively more heap gets utilized, I start running out of memory. The verdict however seems unanimous that there are no known memory leak issues within Solr. I am still looking at my application to analyse the problem. Thank you.

On Thu, Aug 13, 2009 at 10:58 PM, Fuad Efendi wrote:
> Most OutOfMemoryExceptions (if not 100%) happening with SOLR are because of
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/FieldCache.html
> - it is used internally in Lucene to cache field values and document IDs.
>
> My very long-term observations: SOLR can run without any problems for a few
> days/months, and an unpredictable OOM happens just because someone tried a
> sorted search which will populate an array with the IDs of ALL documents in
> the index.
>
> The only solution: calculate exactly the amount of RAM needed for the
> FieldCache... For instance, for 100,000,000 documents a single instance of
> FieldCache may require 8*100,000,000 bytes (8 bytes per document ID?) which
> is almost 1Gb (at least!)
>
> I didn't notice any memory leaks after I started to use 16Gb RAM for the SOLR
> instance (almost a year without any restart!)
Re: JVM Heap utilization & Memory leaks with Solr
Fuad, We have around 5 million documents and around 3700 fields. All documents will not have values for all the fields JRockit is not approved for use within my organization. But thanks for the info anyway. Regards Rahul On Tue, Aug 18, 2009 at 9:41 AM, Funtick wrote: > > BTW, you should really prefer JRockit which really rocks!!! > > "Mission Control" has necessary toolongs; and JRockit produces _nice_ > exception stacktrace (explaining almost everything) in case of even OOM > which SUN JVN still fails to produce. > > > SolrServlet still catches "Throwable": > >} catch (Throwable e) { > SolrException.log(log,e); > sendErr(500, SolrException.toStr(e), request, response); >} finally { > > > > > > Rahul R wrote: > > > > Otis, > > Thank you for your response. I know there are a few variables here but > the > > difference in memory utilization with and without shards somehow leads me > > to > > believe that the leak could be within Solr. > > > > I tried using a profiling tool - Yourkit. The trial version was free for > > 15 > > days. But I couldn't find anything of significance. > > > > Regards > > Rahul > > > > > > On Tue, Aug 4, 2009 at 7:35 PM, Otis Gospodnetic > > >> wrote: > > > >> Hi Rahul, > >> > >> A) There are no known (to me) memory leaks. > >> I think there are too many variables for a person to tell you what > >> exactly > >> is happening, plus you are dealing with the JVM here. :) > >> > >> Try jmap -histo:live PID-HERE | less and see what's using your memory. > >> > >> Otis > >> -- > >> Sematext is hiring -- http://sematext.com/about/jobs.html?mls > >> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR > >> > >> > >> > >> - Original Message > >> > From: Rahul R > >> > To: solr-user@lucene.apache.org > >> > Sent: Tuesday, August 4, 2009 1:09:06 AM > >> > Subject: JVM Heap utilization & Memory leaks with Solr > >> > > >> > I am trying to track memory utilization with my Application that uses > >> Solr. > >> > Details of the setup : > >> > -3rd party Software : Solaris 10, Weblogic 10, jdk_150_14, Solr 1.3.0 > >> > - Hardware : 12 CPU, 24 GB RAM > >> > > >> > For testing during PSR I am using a smaller subset of the actual data > >> that I > >> > want to work with. Details of this smaller sub-set : > >> > - 5 million records, 4.5 GB index size > >> > > >> > Observations during PSR: > >> > A) I have allocated 3.2 GB for the JVM(s) that I used. After all users > >> > logout and doing a force GC, only 60 % of the heap is reclaimed. As > >> part > >> of > >> > the logout process I am invalidating the HttpSession and doing a > >> close() > >> on > >> > CoreContainer. From my application's side, I don't believe I am > holding > >> on > >> > to any resource. I wanted to know if there are known issues > surrounding > >> > memory leaks with Solr ? > >> > B) To further test this, I tried deploying with shards. 3.2 GB was > >> allocated > >> > to each JVM. All JVMs had 96 % free heap space after start up. I got > >> varying > >> > results with this. > >> > Case 1 : Used 6 weblogic domains. My application was deployed one 1 > >> domain. > >> > I split the 5 million index into 5 parts of 1 million each and used > >> them > >> as > >> > shards. After multiple users used the system and doing a force GC, > >> around > >> 94 > >> > - 96 % of heap was reclaimed in all the JVMs. > >> > Case 2: Used 2 weblogic domains. My application was deployed on 1 > >> domain. > >> On > >> > the other, I deployed the entire 5 million part index as one shard. 
> >> After > >> > multiple users used the system and doing a force GC, around 76 % of > the >> heap > >> > was reclaimed in the shard JVM. And 96 % was reclaimed in the JVM > where >> my > >> > application was running. This result further convinces me that my > >> > application can be absolved of holding on to memory resources. > >> > > >> > I am not sure how to interpret these results ? For searching, I am > >> using > >> > Without Shards : EmbeddedSolrServer > >> > With Shards : CommonsHttpSolrServer > >> > In terms of Solr objects this is what differs in my code between > normal > >> > search and shards search (distributed search) > >> > > >> > After looking at Case 1, I thought that the CommonsHttpSolrServer was > >> more > >> > memory efficient but Case 2 proved me wrong. Or could there still be > >> memory > >> > leaks in my application ? Any thoughts, suggestions would be welcome. > >> > > >> > Regards > >> > Rahul > >> > >> > > > > > > -- > View this message in context: > http://www.nabble.com/JVM-Heap-utilization---Memory-leaks-with-Solr-tp24802380p25018165.html > Sent from the Solr - User mailing list archive at Nabble.com. > >
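A minimal sketch of the kind of before/after measurement described above, assuming only plain JDK APIs; the helper class and its name are illustrative, and System.gc() is only a hint to the JVM, not a guaranteed full collection:

public class HeapProbe {
    // Log how much heap is still in use before and after requesting a GC,
    // e.g. right after invalidating sessions and closing the CoreContainer.
    public static void logReclaim() {
        Runtime rt = Runtime.getRuntime();
        long before = rt.totalMemory() - rt.freeMemory();
        System.gc(); // a hint only; the JVM may ignore it
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println("Heap used before GC: " + (before >> 20)
                + " MB, after GC: " + (after >> 20) + " MB");
    }
}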
Re: JVM Heap utilization & Memory leaks with Solr
All these 3700 fields are single-valued non-boolean fields. Thanks Regards Rahul On Wed, Aug 19, 2009 at 8:33 PM, Fuad Efendi wrote: > > Hi Rahul, > > JRockit could be used at least in a test environment to monitor JVM (and > troubleshoot SOLR, licensed for-free for developers!); they have even > Eclipse plugin now, and it is licensed by Oracle (BEA)... But, of course, > in > large companies test environment is in hands of testers :) > > > But... 3700 fields will create (over time) 3700 arrays each of size > 5,000,000!!! Even if most of fields are empty for most of documents... > Applicable to non-tokenized single-valued non-boolean fields only, Lucene > internals, FieldCache... and it won't be GC-collected after user log-off... > prefer dedicated box for SOLR. > > -Fuad > > > -Original Message- > From: Rahul R [mailto:rahul.s...@gmail.com] > Sent: August-19-09 6:19 AM > To: solr-user@lucene.apache.org > Subject: Re: JVM Heap utilization & Memory leaks with Solr > > Fuad, > We have around 5 million documents and around 3700 fields. All documents > will not have values for all the fields. JRockit is not approved for use > within my organization. But thanks for the info anyway. > > Regards > Rahul > > On Tue, Aug 18, 2009 at 9:41 AM, Funtick wrote: > > > > > BTW, you should really prefer JRockit which really rocks!!! > > > > "Mission Control" has the necessary tooling; and JRockit produces _nice_ > > exception stacktrace (explaining almost everything) in case of even OOM > > which the SUN JVM still fails to produce. > > > > > > SolrServlet still catches "Throwable": > > > >} catch (Throwable e) { > > SolrException.log(log,e); > > sendErr(500, SolrException.toStr(e), request, response); > >} finally { > > > > > > > > > > > > Rahul R wrote: > > > > > > Otis, > > > Thank you for your response. I know there are a few variables here but > > the > > > difference in memory utilization with and without shards somehow leads > me > > > to > > > believe that the leak could be within Solr. > > > > > > I tried using a profiling tool - Yourkit. The trial version was free > for > > > 15 > > > days. But I couldn't find anything of significance. > > > > > > Regards > > > Rahul > > > > > > > > > On Tue, Aug 4, 2009 at 7:35 PM, Otis Gospodnetic > > > > >> wrote: > > > > > >> Hi Rahul, > > >> > > >> A) There are no known (to me) memory leaks. > > >> I think there are too many variables for a person to tell you what > > >> exactly > > >> is happening, plus you are dealing with the JVM here. :) > > >> > > >> Try jmap -histo:live PID-HERE | less and see what's using your memory. > > >> > > >> Otis > > >> -- > > >> Sematext is hiring -- http://sematext.com/about/jobs.html?mls > > >> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR > > >> > > >> > > >> > > >> - Original Message > > >> > From: Rahul R > > >> > To: solr-user@lucene.apache.org > > >> > Sent: Tuesday, August 4, 2009 1:09:06 AM > > >> > Subject: JVM Heap utilization & Memory leaks with Solr > > >> > > > >> > I am trying to track memory utilization with my Application that > uses > > >> Solr. > > >> > Details of the setup : > > >> > -3rd party Software : Solaris 10, Weblogic 10, jdk_150_14, Solr > 1.3.0 > > >> > - Hardware : 12 CPU, 24 GB RAM > > >> > > > >> > For testing during PSR I am using a smaller subset of the actual > data > > >> that I > > >> > want to work with. 
Details of this smaller sub-set : > > >> > - 5 million records, 4.5 GB index size > > >> > > > >> > Observations during PSR: > > >> > A) I have allocated 3.2 GB for the JVM(s) that I used. After all > users > > >> > logout and doing a force GC, only 60 % of the heap is reclaimed. As > > >> part > > >> of > > >> > the logout process I am invalidating the HttpSession and doing a > > >> close() > > >> on > > >> > CoreContainer. From my application's side, I don't believe I am > > holding > > >> on > >
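To make Fuad's warning concrete, a back-of-the-envelope sketch; the per-entry size is an assumption (it varies by field type), and the field and document counts are the ones quoted in this thread:

public class FieldCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 5000000L;  // documents in the index (from this thread)
        int fields = 3700;       // fields that could each populate a cache array
        int bytesPerEntry = 4;   // assumed cost per document per field
        // Worst case: one array of maxDoc entries per field ever sorted or
        // faceted on, even if most documents have no value for that field.
        long bytes = maxDoc * fields * (long) bytesPerEntry;
        System.out.println("Worst-case FieldCache footprint: ~" + (bytes >> 30) + " GB");
    }
}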
Implementing a logout
Hello, Can somebody give me some pointers on the Solr objects I need to clean up/release while doing a logout on a Solr Application. I find that only the SolrCore object has a close() method. I typically do a lot of faceting queries on a large dataset with my application. I am using Solr 1.3.0. Regards Rahul
Re: Implementing a logout
Just clarifying : My query was more specific to Solr. I wanted to check if there are any Solr resources that are session-specific that we need to release. *>> I can't understand: do you use several web applications in a same >> container? >> Are you trying to close shared SolrCore when one of many users (of another >> application) logs off?* I have only one application that is built on top of Solr. I mentioned SolrCore in my mail only because that was the only object that I noticed which had a close() method. I don't intend to do a close() on SolrCore when a particular user logs out (closes his session) *> There is no 'logout'. There is no permanent state in Solr beyond the Lucene > index. There are caches, but these do not require any termination. The > Lucene API has very solid self-protection for the indexes and Solr uses the > API in the right way.* I understand Solr application does not have a logout. Please correct me if I am wrong but you seem to be stating that there is no explicit action required from our side to release any Solr resources when a user terminates his/her session of a Solr based application. If that is the case, then my query is answered. Thank you all. Regards Rahul On Sun, Aug 23, 2009 at 7:16 AM, Lance Norskog wrote: > Sorry, hit 'send' too soon. You can kill the servlet process, but it is > much > better to use the servlet container's shutdown protocol. > > On Sat, Aug 22, 2009 at 6:46 PM, Lance Norskog wrote: > > > There is no 'logout'. There is no permanent state in Solr beyond the > Lucene > > index. There are caches, but these do not require any termination. The > > Lucene API has very solid self-protection for the indexes and Solr uses > the > > API in the right way. > > > > If you run a Solr distribution in a standard servlet container, you can > > just use the servlet's shutdown protocol. If you call a commit with > > waitFlush=true, then do not index any records, you can kill the servlet > > process. > > > > On Fri, Aug 21, 2009 at 7:45 AM, Fuad Efendi wrote: > > > >> I can't understand: do you use several web applications in a same > >> container? > >> Are you trying to close shared SolrCore when one of many users (of > another > >> application) logs off? > >> > >> Usually one needs to clean up only user-session specific objects (such > as > >> non-persistent shopping cart)... > >> > >> > >> -Original Message- > >> From: Rahul R [mailto:rahul.s...@gmail.com] > >> Sent: August-21-09 1:20 AM > >> To: solr-user@lucene.apache.org > >> Subject: Implementing a logout > >> > >> Hello, > >> Can somebody give me some pointers on the Solr objects I need to clean > >> up/release while doing a logout on a Solr Application. I find that only > >> the > >> SolrCore object has a close() method. I typically do a lot of faceting > >> queries on a large dataset with my application. I am using Solr 1.3.0. > >> > >> Regards > >> Rahul > >> > >> > >> > > > > > > -- > > Lance Norskog > > goks...@gmail.com > > > > > > > -- > Lance Norskog > goks...@gmail.com >
Re: Implementing a logout
*"release any SOLR resources" - no need.* My query is answered. Thank you. Regards Rahul On Mon, Aug 24, 2009 at 12:32 AM, Fuad Efendi wrote: > Truly correct: > > - SOLR does not create HttpSession for user access to Admin screens (do we > have any other screens of UI?) > - SolrCore is shared object; closing it and reopening for each user session > is extremely expensive; this object requires gigabytes of RAM in even > simplest scenario > > User doesn't have any session with SOLR based application. User may have > session with different application, and this one may use resources provided > by SOLR. > > > "release any SOLR resources" - no need. > > Of course, user session may store pointers to thousands documents retrieved > via SOLR query, - just close the session object. > > > >I understand Solr application does not have a logout. Please correct me if > I > >am wrong but you seem to be stating that there is no explicit action > >required from our side to release any Solr resources when a user > terminates > >his/her session of a Solr based application. If that is the case, then my > >query is answered. > > > >
Monitoring split time for fq queries when filter cache is used
Hello, I am trying to measure the benefit that I am getting out of using the filter cache. As I understand, there are two major parts to an fq query. Please correct me if I am wrong : - doing full index queries of each of the fq params (if filter cache is used, this result will be retrieved from the cache) - set intersection of above results (Will be done again even with filter cache enabled) Is there any flag/setting that I can enable to monitor how much time the above operations take separately i.e. the querying and the set-intersection ? Regards Rahul
Re: Monitoring split time for fq queries when filter cache is used
Thank you Martijn. On Tue, Sep 1, 2009 at 8:07 PM, Martijn v Groningen < martijn.is.h...@gmail.com> wrote: > Hi Rahul, > > Yes, your understanding is correct, but it is not possible to > monitor these actions separately with Solr. > > Martijn > > 2009/9/1 Rahul R : > > Hello, > > I am trying to measure the benefit that I am getting out of using the > filter > > cache. As I understand, there are two major parts to an fq query. Please > > correct me if I am wrong : > > - doing full index queries of each of the fq params (if filter cache is > > used, this result will be retrieved from the cache) > > - set intersection of above results (Will be done again even with filter > > cache enabled) > > > > Is there any flag/setting that I can enable to monitor how much time the > > above operations take separately i.e. the querying and the > set-intersection > > ? > > > > Regards > > Rahul > > > > > > -- > Met vriendelijke groet, > > Martijn van Groningen >
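Since Solr does not break the timing down, a rough client-side approximation is to issue the same filter twice and compare the reported QTime: the first request pays for the index query, the second should be served from the filterCache. A sketch, assuming an already initialized SolrServer and an illustrative field name; note this folds the set intersection into both measurements, so it only approximates the split:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FilterCacheProbe {
    public static void compare(SolrServer server) throws SolrServerException {
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("inStock:true"); // illustrative field name
        q.setRows(0);
        QueryResponse cold = server.query(q); // filter computed against the index
        QueryResponse warm = server.query(q); // filter served from the filterCache
        System.out.println("cold QTime=" + cold.getQTime()
                + " ms, warm QTime=" + warm.getQTime() + " ms");
    }
}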
Questions on copyField
Hello, I have a few questions regarding the copyField directive in schema.xml 1. Does the destination field store a reference or the actual data ? If I have something like this: <copyField source="name" dest="text"/> then will the values in the 'name' field get copied into the 'text' field or will the 'text' field only store a reference to the 'name' field ? To put it more simply, if I later delete the 'name' field from the index will I lose the corresponding data in the 'text' field ? 2. Is there any inbuilt API which I can use to do the copyField action programmatically ? 3. Can I do a copyField from the schema as well as programmatically for the same destination field Suppose I want the 'text' field to contain values for name, age and location. In my index only 'name' and 'age' are defined as fields. So I can add directives like <copyField source="name" dest="text"/> and <copyField source="age" dest="text"/>. The location however, I want to add it to the 'text' field programmatically. I don't want to store the location as a separate field in the index. Can I do this ? Thank you. Regards Rahul
Re: Questions on copyField
Would appreciate any help on this. Thanks Rahul On Mon, Sep 14, 2009 at 5:12 PM, Rahul R wrote: > Hello, > I have a few questions regarding the copyField directive in schema.xml > > 1. Does the destination field store a reference or the actual data ? > If I have something like this: <copyField source="name" dest="text"/> > > then will the values in the 'name' field get copied into the 'text' field > or will the 'text' field only store a reference to the 'name' field ? To put > it more simply, if I later delete the 'name' field from the index will I > lose the corresponding data in the 'text' field ? > > 2. Is there any inbuilt API which I can use to do the copyField action > programmatically ? > > 3. Can I do a copyField from the schema as well as programmatically for the > same destination field > Suppose I want the 'text' field to contain values for name, age and > location. In my index only 'name' and 'age' are defined as fields. So I can > add directives like <copyField source="name" dest="text"/> and > <copyField source="age" dest="text"/>. > > The location however, I want to add it to the 'text' field > programmatically. I don't want to store the location as a separate field in > the index. Can I do this ? > > Thank you. > > Regards > Rahul >
Re: Questions on copyField
Shalin, Can you please elaborate a little more on the third response *You can send the location's value directly as the value of the text field.* I don't follow. I am adding 'name' and 'age' to the 'text' field through the schema. If I add the 'location' from the program, will either one copy (schema or program) not over-write the other ? *Also note, that you don't really need to index/store the source field. You can make the location field's type as ignored in the schema.* Understood. Thank you for your response. Regards Rahul On Wed, Sep 16, 2009 at 1:56 PM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > On Mon, Sep 14, 2009 at 5:12 PM, Rahul R wrote: > > > Hello, > > I have a few questions regarding the copyField directive in schema.xml > > > > 1. Does the destination field store a reference or the actual data ? > > > > It makes a copy. Storing or indexing of the field depends on the field > configuration. > > > > If I have something like this: <copyField source="name" dest="text"/> > > > > then will the values in the 'name' field get copied into the 'text' field > > or > > will the 'text' field only store a reference to the 'name' field ? To put > > it > > more simply, if I later delete the 'name' field from the index will I > lose > > the corresponding data in the 'text' field ? > > > > > The values will get copied. If you delete all values from the 'name' field > from the index, the data in "text" field remain as-is. > > > > > 2. Is there any inbuilt API which I can use to do the copyField action > > programmatically ? > > > > > No. But you can always copy explicitly before sending or you can use a > custom UpdateRequestProcessor to copy values from one field to another > during indexing. > > > > 3. Can I do a copyField from the schema as well as programmatically for > the > > same destination field > > Suppose I want the 'text' field to contain values for name, age and > > location. In my index only 'name' and 'age' are defined as fields. So I > can > > add directives like <copyField source="name" dest="text"/> and > > <copyField source="age" dest="text"/>. > > > > The location however, I want to add it to the 'text' field > > programmatically. > > I don't want to store the location as a separate field in the index. Can > I > > do this ? > > > > > You can send the location's value directly as the value of the text field. > Also note, that you don't really need to index/store the source field. You > can make the location field's type as ignored in the schema. > > -- > Regards, > Shalin Shekhar Mangar. >
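A sketch of Shalin's third suggestion at indexing time; the field names come from this thread, the values are made up, and server is assumed to be an existing SolrServer handle:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithLocation {
    public static void add(SolrServer server) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("name", "Samuel"); // copied into 'text' by schema copyField
        doc.addField("age", "30");      // copied into 'text' by schema copyField
        doc.addField("text", "Texas");  // location value sent straight into 'text';
                                        // it is appended to, not overwriting, the copies
        server.add(doc);
    }
}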
Re: Questions on copyField
Thank you Shalin. Regards Rahul On Thu, Sep 17, 2009 at 11:49 AM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > On Thu, Sep 17, 2009 at 11:19 AM, Rahul R wrote: > > > Shalin, > > Can you please elaborate a little more on the third response > > *You can send the location's value directly as the value of the text > > field.* > > I dont follow. I am adding 'name' and 'age' to the 'text' field through > the > > schema. If I add the 'location' from the program, will either one copy > > (schema or program) not over-write the other ? > > > > No, it will not overwrite, it will just append values of name and age to > the > values already sent as the text field. > -- > Regards, > Shalin Shekhar Mangar. >
Question on omitNorms definition
Hello, A rather trivial question on the omitNorms parameter in schema.xml. The out-of-the-box schema.xml uses this parameter both within the <fieldType> tag and within the <field> tag. If we define omitNorms during the fieldType definition, will it hold good for all fields that are defined using the same fieldType ? For eg: if a fieldType is defined with omitNorms="true" and dynamic fields are declared with that type (see the sketch below), will these dynamic fields have omitNorms=true for it ? I have read about significant RAM usage when omitNorms is not set to true. Hence would like to ensure that it is set to true for most of my fields. Regards Rahul
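A hedged reconstruction of the kind of definitions in question (the names are illustrative): attributes such as omitNorms set on a <fieldType> act as defaults for every field or dynamic field declared with that type, unless overridden on the individual <field>:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
...
<dynamicField name="S*" type="string" indexed="true" stored="true"/>
<!-- the S* dynamic fields inherit omitNorms="true" from the "string" fieldType -->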
Measuring timing with debugQuery=true
Hello, I am trying to measure why some of my queries take a long time. I am using EmbeddedSolrServer and with logging statements before and after the EmbeddedSolrServer.query(SolrQuery) function, I have found the time to be around 16s. I added debugQuery=true and the timing component for this reads as follows:

timing: time=2438.0
  prepare: time=0.0
    org.apache.solr.handler.component.QueryComponent: time=0.0
    org.apache.solr.handler.component.FacetComponent: time=0.0
    org.apache.solr.handler.component.MoreLikeThisComponent: time=0.0
    org.apache.solr.handler.component.HighlightComponent: time=0.0
    org.apache.solr.handler.component.DebugComponent: time=0.0
  process: time=2438.0
    org.apache.solr.handler.component.QueryComponent: time=2438.0
    org.apache.solr.handler.component.FacetComponent: time=0.0
    org.apache.solr.handler.component.MoreLikeThisComponent: time=0.0
    org.apache.solr.handler.component.HighlightComponent: time=0.0
    org.apache.solr.handler.component.DebugComponent: time=0.0

As you can see, this shows only 2.4s being used by the query. I can't seem to figure out where the rest of the time is being spent. This is within my office intranet and I don't think the request-response time over the wire will cause significant overhead. So my question : is the timing information presented here comprehensive or are there more time consuming operations that are not represented here ? I guess GC pause times could be one answer (I hope not !) Also, the above result was for a faceted query. I can't understand why the FacetComponent would be zero. Any thoughts ? Rahul
Re: Measuring timing with debugQuery=true
Yonik, I understand that the network can be a bottle-neck but I am pretty sure that it is not. I am operating on a 100 MBPS intranet... How do I ensure that stored fields are cached by the OS ? Only the Solr caches within the JVM are under my control.. The result set has around 10K documents of which I am retrieving only 10..I am displaying a max of only 3 fields per document in my result set. Can the reading time for these stored fields be so long ? I have totally around 1 million documents in my index Any thoughts on why the FacetComponent does not take any time while the QueryComponent takes around 2.4s. I am doing a faceted and keyword query ie I have both 'q' and 'fq' params in my query Thank you for your response. Regards Rahul On Mon, Sep 28, 2009 at 1:20 AM, Yonik Seeley wrote: > The response times in a Solr request don't include the time to read > stored fields (since the response is streamed) and doesn't include the > time to transfer/read the response (which can be increased by a > slow/congested network link, or a slow client that doesn't read the > response immediately). > > How many documents are you retrieving? Reading stored fields for > documents can be slow if they aren't cached by the OS since it's often > a disk seek per document read for a large index. > > -Yonik > http://www.lucidimagination.com > > > > On Sun, Sep 27, 2009 at 3:41 PM, Rahul R wrote: > > Hello, > > I am trying to measure why some of my queries take a long time. I am > using > > EmbeddedSolrServer and with logging statements before and > > after the EmbeddedSolrServer.query(SolrQuery) function, I have found the > > time to be around 16s. I added the debugQuery=true and the timing > component > > for this reads as following: > > > > * > > > timing:{time=2438.0,prepare={time=0.0,org.apache.solr.handler.component.QueryComponent={time=0.0},org.apache.solr.handler.component.FacetComponent={time=0.0},org.apache.solr.handler.component.MoreLikeThisComponent={time=0.0},org.apache.solr.handler.component.HighlightComponent={time=0.0},org.apache.solr.handler.component.DebugComponent={time=0.0}},process={time=2438.0,org.apache.solr.handler.component.QueryComponent={time=2438.0},org.apache.solr.handler.component.FacetComponent={time=0.0},org.apache.solr.handler.component.MoreLikeThisComponent={time=0.0},org.apache.solr.handler.component.HighlightComponent={time=0.0},org.apache.solr.handler.component.DebugComponent={time=0.0}}} > > * > > > > As you can see, this shows only 2.4s being used by the query. I can't > seem > > to figure out where the rest of the time is being spent. This is within > my > > office intranet and I don't think the request-response time over the wire > > will cause significant overhead. So my question : is the timing > information > > presented here comprehensive or are there more time consuming operations > > that are not represented here ? I guess GC pause times could be one > answer > > (I hope not !) Also, the above result was for a faceted query. I > can't > > understand why the FacetComponent would be zero. Any thoughts ? > > > > Rahul > > >
Re: Measuring timing with debugQuery=true
Sorry for the delayed response ** *How big are your documents?* I have totally 1 million documents. I have totally 1950 fields in the index. Every document would probably have values for around 20 - 50 fields. *What is the total size of the index?* 1 GB *What's the amount of RAM on your box? How big is the JVM heap (and how much free memory is left on your system)?* I have 4 GB RAM. I am using Weblogic 10, 32 Bit. Since it is a Windows box, I am able to allocate only 1 GB to the JVM. No other applications are running on the system. So the entire 4GB is at the disposal of the application. I am simulating load using a load tool (15 users) *Can you show what this slow query looks like (the whole request)?* q=*%3A*&rows=0&facet=true&facet.mincount=1&facet.limit=2&f.S9942.facet.limit=100&facet.field=S9942&facet.field=S6878&facet.field=S9156&facet.field=S0369&facet.field=S9926&facet.field=S1421&facet.field=S8990&facet.field=S6881&facet.field=S3552&debugQuery=true q=*%3A*&fq=S9942%3A%22TEXAS+INSTRUMENTS%22&rows=0&facet=true&facet.mincount=1&facet.limit=2&facet.field=S9942&facet.field=S6878&facet.field=S9156&facet.field=S0369&facet.field=S9926&facet.field=S1421&facet.field=S8990&facet.field=S6881&facet.field=S3552&debugQuery=true Other information Solr 1.3, JDK 1.5.0_14 regards Rahul On Mon, Sep 28, 2009 at 6:48 PM, Yonik Seeley wrote: > On Mon, Sep 28, 2009 at 7:51 AM, Rahul R wrote: > > Yonik, > > I understand that the network can be a bottle-neck but I am pretty sure > that > > it is not. I am operating on a 100 MBPS intranet... How do I ensure > that > > stored fields are cached by the OS ? Only the Solr caches within the JVM > are > > under my control.. The result set has around 10K documents of which I > am > > retrieving only 10..I am displaying a max of only 3 fields per > document > > in my result set. Can the reading time for these stored fields be so long > ? > > It could be a seek per document if the index is too big to fit in the > OS cache - but that still wouldn't be as slow as you report. > Something is fishy here. > > How big are your documents? > What is the total size of the index? > What's the amount of RAM on your box? > How big is the JVM heap (and how much free memory is left on your system)? > Can you show what this slow query looks like (the whole request)? > > > I have totally around 1 million documents in my index Any > thoughts > > on why the FacetComponent does not take any time while the QueryComponent > > takes around 2.4s. > > It could be a field that has very few unique values and faceting just > completes quickly. > Make sure you're actually getting faceting data back (that it's > correctly turned on). > > -Yonik > http://www.lucidimagination.com > > > I am doing a faceted and keyword query ie I have both 'q' > > and 'fq' params in my query Thank you for your response. > > > > Regards > > Rahul > > > > On Mon, Sep 28, 2009 at 1:20 AM, Yonik Seeley < > yo...@lucidimagination.com> > > wrote: > >> > >> The response times in a Solr request don't include the time to read > >> stored fields (since the response is streamed) and doesn't include the > >> time to transfer/read the response (which can be increased by a > >> slow/congested network link, or a slow client that doesn't read the > >> response immediately). > >> > >> How many documents are you retrieving? Reading stored fields for > >> documents can be slow if they aren't cached by the OS since it's often > >> a disk seek per document read for a large index. 
> >> > >> -Yonik > >> http://www.lucidimagination.com > >> > >> > >> > >> On Sun, Sep 27, 2009 at 3:41 PM, Rahul R wrote: > >> > Hello, > >> > I am trying to measure why some of my queries take a long time. I am > >> > using > >> > EmbeddedSolrServer and with logging statements before and > >> > after the EmbeddedSolrServer.query(SolrQuery) function, I have found > the > >> > time to be around 16s. I added the debugQuery=true and the timing > >> > component > >> > for this reads as following: > >> > > >> > * > >> > > >> > > timing:{time=2438.0,prepare={time=0.0,org.apache.solr.handler.component.QueryComponent={time=0.0},org.apache.solr.handler.component.FacetComponent={time=0.0},org.apache.solr.handler.component.MoreLikeThisComponent={time=0.0},org.apache.solr.
Re: Measuring timing with debugQuery=true
I just want to clarify here that I understand my memory allocation might be less given the load on the system. The response times were only slightly better when we ran the test on a Solaris box with 12 CPU, 24G RAM and with 3.2 GB allocated for the JVM. I know that I have a performance problem. My main concern is to identify the reasons for the inconsistency between the timing information shown in the debugQuery output (2.4s) and the entire time taken by the EmbeddedSolrServer.query(SolrQuery) function (16s). I feel that if I can find out where the remaining 13.6s gets used, then I can look to improve accordingly. Thank you. Regards Rahul On Tue, Sep 29, 2009 at 7:12 PM, Rahul R wrote: > Sorry for the delayed response > ** > *How big are your documents?* > I have totally 1 million documents. I have totally 1950 fields in the > index. Every document would probably have values for around 20 - 50 fields. > *What is the total size of the index?* > 1 GB > > *What's the amount of RAM on your box? How big is the JVM heap (and how > much free memory is left on your system)?* > I have 4 GB RAM. I am using Weblogic 10, 32 Bit. Since it is a Windows box, > I am able to allocate only 1 GB to the JVM. No other applications are > running on the system. So the entire 4GB is at the disposal of the > application. I am simulating load using a load tool (15 users) > > *Can you show what this slow query looks like (the whole request)?* > > q=*%3A*&rows=0&facet=true&facet.mincount=1&facet.limit=2&f.S9942.facet.limit=100&facet.field=S9942&facet.field=S6878&facet.field=S9156&facet.field=S0369&facet.field=S9926&facet.field=S1421&facet.field=S8990&facet.field=S6881&facet.field=S3552&debugQuery=true > > > q=*%3A*&fq=S9942%3A%22TEXAS+INSTRUMENTS%22&rows=0&facet=true&facet.mincount=1&facet.limit=2&facet.field=S9942&facet.field=S6878&facet.field=S9156&facet.field=S0369&facet.field=S9926&facet.field=S1421&facet.field=S8990&facet.field=S6881&facet.field=S3552&debugQuery=true > > Other information > Solr 1.3, JDK 1.5.0_14 > > regards > Rahul > > On Mon, Sep 28, 2009 at 6:48 PM, Yonik Seeley < > yo...@lucidimagination.com> wrote: > >> On Mon, Sep 28, 2009 at 7:51 AM, Rahul R wrote: >> > Yonik, >> > I understand that the network can be a bottle-neck but I am pretty sure >> that >> > it is not. I am operating on a 100 MBPS intranet... How do I ensure >> that >> > stored fields are cached by the OS ? Only the Solr caches within the JVM >> are >> > under my control.. The result set has around 10K documents of which >> I am >> > retrieving only 10..I am displaying a max of only 3 fields per >> document >> > in my result set. Can the reading time for these stored fields be so >> long ? >> >> It could be a seek per document if the index is too big to fit in the >> OS cache - but that still wouldn't be as slow as you report. >> Something is fishy here. >> >> How big are your documents? >> What is the total size of the index? >> What's the amount of RAM on your box? >> How big is the JVM heap (and how much free memory is left on your system)? >> Can you show what this slow query looks like (the whole request)? >> >> > I have totally around 1 million documents in my index Any >> thoughts >> > on why the FacetComponent does not take any time while the >> QueryComponent >> > takes around 2.4s. >> >> It could be a field that has very few unique values and faceting just >> completes quickly. >> Make sure you're actually getting faceting data back (that it's >> correctly turned on). 
>> >> -Yonik >> http://www.lucidimagination.com >> >> > I am doing a faceted and keyword query ie I have both 'q' >> > and 'fq' params in my query Thank you for your response. >> > >> > Regards >> > Rahul >> > >> > On Mon, Sep 28, 2009 at 1:20 AM, Yonik Seeley < >> yo...@lucidimagination.com> >> > wrote: >> >> >> >> The response times in a Solr request don't include the time to read >> >> stored fields (since the response is streamed) and doesn't include the >> >> time to transfer/read the response (which can be increased by a >> >> slow/congested network link, or a slow client that doesn't read the >> >> response immediately). >> >> >> >> How many documents are you retrieving? Reading stored fields fo
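One way to pin down where the remaining ~13.6s goes is to bracket the call and compare wall-clock time against what QTime attributes to query execution; whatever is left over is everything QTime excludes. A sketch, assuming the existing EmbeddedSolrServer handle and query object:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QueryTimer {
    public static void time(SolrServer server, SolrQuery solrQuery) throws Exception {
        long t0 = System.nanoTime();
        QueryResponse rsp = server.query(solrQuery);
        long elapsedMs = (System.nanoTime() - t0) / 1000000L;
        // The gap between the two numbers covers what QTime leaves out:
        // response writing/streaming, stored-field reads, GC pauses, etc.
        System.out.println("client elapsed=" + elapsedMs
                + " ms, solr QTime=" + rsp.getQTime() + " ms");
    }
}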
Trouble Configuring WordDelimiterFilterFactory
Hello, In our application we have a catch-all field (the 'text' field) which is configured as the default search field. Now this field will have a combination of numbers, alphabets, special characters etc. I have a requirement wherein the WordDelimiterFilterFactory does not work on numbers, especially those with decimal points. Accuracy of results with relevance to numerical data is quite important. So if the text field of a document has data like "Bridge-Diode 3.55 Volts", I want to make sure that a search for "355" or "35.5" does not retrieve this document. So I found the following setting for the WordDelimiterFilterFactory to work for me (for most parts): <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/> I am using the same setting for both index and query. Now the only problem is, if I have data like ".355". With the above setting, the analysis jsp shows me that WordDelimiterFilterFactory is creating term texts as both ".355" and "355". So a search for ".355" retrieves documents containing both ".355" and "355". A search for "355" also has the same effect. I noticed that when the entry for the WordDelimiterFilterFactory was completely removed (both index and query), then the above problem was resolved. But this seems too harsh a measure. Is there a way by which I can prevent the WordDelimiterFilterFactory from totally acting on numerical data ? Regards Rahul
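For context, a sketch of where this filter sits inside a text fieldType; the tokenizer and surrounding filters here are assumptions, and generateWordParts="1" is inferred from the follow-up mails rather than stated:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer> <!-- same chain used for both index and query -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="0" catenateWords="1" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>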
Re: Trouble Configuring WordDelimiterFilterFactory
Hello, Would really appreciate any inputs/suggestions on this. Thank you. On Tue, Nov 24, 2009 at 10:59 PM, Rahul R wrote: > Hello, > In our application we have a catch-all field (the 'text' field) which is > configured as the default search field. Now this field will have a > combination of numbers, alphabets, special characters etc. I have a > requirement wherein the WordDelimiterFilterFactory does not work on numbers, > especially those with decimal points. Accuracy of results with relevance to > numerical data is quite important. So if the text field of a document has > data like "Bridge-Diode 3.55 Volts", I want to make sure that a search for > "355" or "35.5" does not retrieve this document. So I found the following > setting for the WordDelimiterFilterFactory to work for me (for most parts): > <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" > generateNumberParts="0" catenateWords="1" catenateNumbers="0" > catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" > preserveOriginal="1"/> > > I am using the same setting for both index and query. > > Now the only problem is, if I have data like ".355". With the above > setting, the analysis jsp shows me that WordDelimiterFilterFactory is > creating term texts as both ".355" and "355". So a search for ".355" > retrieves documents containing both ".355" and "355". A search for "355" > also has the same effect. I noticed that when the entry for the > WordDelimiterFilterFactory was completely removed (both index and query), > then the above problem was resolved. But this seems too harsh a measure. > > Is there a way by which I can prevent the WordDelimiterFilterFactory from > totally acting on numerical data ? > > Regards > Rahul >
Re: Trouble Configuring WordDelimiterFilterFactory
Steve, My settings for both index and query are : <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/> Let me give an example. Suppose I have the following 2 documents: Document 1(Text Field): Bridge-Diode .355 Volts Document 2(Text Field): Bridge-Diode 355 Volts Requirement : Search for ".355" should retrieve only document 1 (Not happening now) Requirement: Search for "Bridge" should retrieve both documents (Works as expected) The reason why a search for ".355" is retrieving both documents is that term texts for .355 in the document are created as .355 and 355. Even if I set generateWordParts and catenateWords to "0", the way term texts are created for ".355" does not change. Thank you for your time. Regards Rahul On Sun, Nov 29, 2009 at 1:07 AM, Steven A Rowe wrote: > Hi Rahul, > > On 11/26/2009 at 12:53 AM, Rahul R wrote: > > Is there a way by which I can prevent the WordDelimiterFilterFactory > > from totally acting on numerical data ? > > "prevent ... from totally acting on" is pretty vague, and nowhere AFAICT do > you say precisely what it is you want. > > It would help if you could give example text and the terms you think should > be the result of analysis of the text. If you want different index/query > time behavior, please provide this info for both. > > Steve > >
IndexSearcher and Caches
Hello all, I have a few questions w.r.t the caches and the IndexSearcher available in solr. I am using solr 1.3. - The solr wiki states that the caches are per IndexSearcher object i.e. if I set my filterCache size to 1000 it means that 1000 entries can be assigned for every IndexSearcher object. Is this true for queryResultsCache, filterCache and documentCache ? For the document cache, the wiki states that the value should be greater than (number of records) * (max number of queries). If the document cache is also sized per IndexSearcher object, then why do we need the (max number of queries) parameter in the formula ? - In a web application, where multiple users may log into the system and query concurrently, should we assign a new IndexSearcher object for every user ? I tried sharing the IndexSearcher object but noticed that the search criteria and filters of one user get carried over to another. Or is there some way to get around that ? - Combining the above two, if the caches are per IndexSearcher objects, and if we have to assign a new IndexSearcher for every new user (in a web application), will the total cache size not explode ? Apologies if these seem really basic. Thank you. Regards Rahul
Re: IndexSearcher and Caches
Mitch, Thank you for your response. A few follow up questions for clarification : <<one IndexSearcher + its caches got a lifetime of one commit. After every commit, there will be a new one created.>> In my case, I have an index which will not be modified after creation. Does this mean that in a multi-user scenario, I can have a static IndexSearcher object that can be shared by multiple users ? <<The IndexSearcher is threadsafe. So there is no problem with concurrent usage.>> If the IndexSearcher object is threadsafe, then only issues related to concurrency are addressed. What about the case where the IndexSearcher is static? User 1 logs in to the system, queries with the static IndexSearcher, logs out; and then User 2 logs in to the system, queries with the same static IndexSearcher, logs out. In this case, the users 1 and 2 are not querying concurrently but one after another. Will the query information (filters or any other data) of User 1 be retained when User 2 uses this ? Understand your point about the filter cache but appreciate if you could throw some light on how these caches are tied to the IndexSearcher object. Pasting my initial question here : The solr wiki states that the caches are per IndexSearcher object i.e. if I set my filterCache size to 1000 it means that 1000 entries can be assigned for every IndexSearcher object. Is this true for queryResultsCache, filterCache and documentCache ? For the document cache, the wiki states that the value should be greater than (number of records) * (max number of queries). If the document cache is also sized per IndexSearcher object, then why do we need the (max number of queries) parameter in the formula ? Thank you. Regards Rahul On Fri, May 21, 2010 at 3:03 PM, MitchK wrote: > > Rahul, > > the IndexSearcher of Solr gets shared with every request within two > commits. > That means one IndexSearcher + its caches got a lifetime of one commit. > After every commit, there will be a new one created. > > The cache does not mean, that they are applied automatically. They mean, > that a filter from a query will be cached and whenever a user-query > requires the same filtering-criteria, they will use the cached filter > instead of creating a new one on the fly. > > I.e: fq=inStock:true > The result of this filtering-criteria gets cached one time. If another user > asks again for a query with fq=inStock:true, Solr reuses the already > existing filter. > Since such filters are cached as byteVectors, they are not large. > In this case it does not care for what the user is querying in his q-param. > > BTW: The IndexSearcher is threadsafe. So there is no problem with > concurrent > usage. > > Hope this helps??? > > Kind regards > - Mitch > -- > View this message in context: > http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p833841.html > Sent from the Solr - User mailing list archive at Nabble.com. >
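Mitch's point about filter reuse, as a sketch in SolrJ: two users issue different q parameters but the same fq, so the second request reuses the cached filter (server is assumed to be the shared SolrServer):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;

public class SharedFilterExample {
    public static void run(SolrServer server) throws Exception {
        SolrQuery q1 = new SolrQuery("somethingIWantToKnow");
        q1.addFilterQuery("name:Samuel"); // filter computed once and cached
        SolrQuery q2 = new SolrQuery("whatIReallyWantToKnow");
        q2.addFilterQuery("name:Samuel"); // served from the filterCache
        server.query(q1);
        server.query(q2);
    }
}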
Re: IndexSearcher and Caches
<<I am not sure, what you mean with "multi-user"-scenario. Can you tell me what you got in mind?>> I have an application deployed on an application server (Weblogic). This application uses solr to query an index. Users (sessions) will log in to the application, query and then log out. This login and logout has nothing to do with solr but the application manages them separately. I am using EmbeddedSolrServer here. I think I know where my mistake is. From what you say, it looks to me as though I should not create a new SolrIndexSearcher object because Solr will do this automatically. In my current implementation, I am explicitly creating a new SolrIndexSearcher object for every new user who logs into the application. Let me provide a code snippet to explain further. This is how I initialize the solr handles required for searching. I am using EmbeddedSolrServer.

// Done once, at application startup
SolrConfig solrConfig = new SolrConfig(configHome + "/solrconfig.xml");
IndexSchema indexSchema = new IndexSchema(solrConfig, configHome + "/schema.xml", null);
File corefile = new File(coreHome, "solr.xml");
CoreContainer coreContainer = new CoreContainer(coreHome, corefile);
CoreDescriptor coreDescriptor = new CoreDescriptor(coreContainer, coreName,
        solrConfig.getResourceLoader().getInstanceDir());
coreDescriptor.setConfigName(solrConfig.getResourceName());
coreDescriptor.setSchemaName(indexSchema.getResourceName());
SolrCore solrCore = new SolrCore(coreName, indexHome, solrConfig, indexSchema, coreDescriptor);
coreContainer.register(coreName, solrCore, false);
SolrServer solrServer = new EmbeddedSolrServer(coreContainer, coreName);

// Next two lines executed for every user
SolrIndexSearcher solrSearcher = solrCore.newSearcher("s1");
SolrRequestParsers solrRequestParsers = new SolrRequestParsers(solrConfig);

Many thanks for the response(s). Regards Rahul On Mon, May 24, 2010 at 1:55 AM, MitchK wrote: > > > > > In my case, I have an index which will not be modified after creation. > > Does > > this mean that in a multi-user scenario, I can have a static > IndexSearcher > > object that can be shared by multiple users ? > > > I am not sure, what you mean with "multi-user"-scenario. Can you tell me > what you got in mind? > If your index never changes, your IndexSearcher won't change. > > > > > > If the IndexSearcher object is threadsafe, then only issues related to > > concurrency are addressed. What about the case where the IndexSearcher is > > static? User 1 logs in to the system, queries with the static > > IndexSearcher, > > logs out; and then User 2 logs in to the system, queries with the same > > static IndexSearcher, logs out. In this case, the users 1 and 2 are not > > querying concurrently but one after another. Will the query information > > (filters or any other data) of User 1 be retained when User 2 uses this ? > > > I am not sure about the benefit of a static IndexSearcher. What do you > hope??? > > If user 1 uses a filter like "fq=name:Samuel&q=somethingIWantToKnow" and > user 2 queries for "fq=name:Samuel&q=whatIReallyWantToKnow" then they use > the same cached filter-object, retrieved from Solr's internal cache (of > course you need to have a cache-size that allows caching). > > > > > The solr wiki states that the caches are per IndexSearcher object i.e if > I > > set my filterCache size to 1000 it means that 1000 entries can be > assigned > > for every IndexSearcher object. > > > Yes. If a new searcher is created then the new Cache is built on the old > one. > > > > > Is this true for queryResultsCache, > > filterCache and documentCache ? > > > For FilterCache it's true. 
For queryResultsCache (if I understand the wiki > right), too. > Please note, that the documentCache's behaviour is different from the > already mentioned ones. > The wiki says: > > > > Note: This cache cannot be used as a source for autowarming because > > document IDs will change when anything in the index changes so they can't > > be used by a new searcher. > > > > The wiki says that the number of the document cache should not be bigger > than the number of _results_ * number of _concurrent_ queries. > I never worked with the document cache, so maybe someone else can throw > some > light into the dark. > But from what I have understood it means the following: > > If you show 10 results per request and you think of up to 500 concurrent > queries: > 10 * 500 => 5000 > > But I want to emphasize, that this is only a guess. I actually don't exactly > know more about this topic. > > Kind regards > - Mitch > -- > View this message in context: > http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p838367.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: IndexSearcher and Caches
Thank you I found the API to get the existing SolrIndexSearcher to be present in SolrCore: SolrCore.getSearcher().get() So if now the Index changes (a commit is done) in between, will I automatically get the new SolrIndexSearcher from this call ? Regards Rahul On Mon, May 24, 2010 at 11:25 PM, MitchK wrote: > > Ahh, now I understand. > > No, you need no second IndexSearcher as long as the Server is alive. > You can reuse your searcher for every user. > > The only commands you are executing per user are those to create a > search-query. > > Kind regards, > - Mitch > -- > View this message in context: > http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p840228.html > Sent from the Solr - User mailing list archive at Nabble.com. >
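One caveat worth adding for anyone going down this road: SolrCore.getSearcher() hands out a reference-counted wrapper, and each get must be balanced with a decref() or the superseded searcher can never be closed after a commit. A sketch:

import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

public class SearcherBorrower {
    public static void useSearcher(SolrCore solrCore) {
        // After a commit, a later getSearcher() call returns the newly
        // opened searcher; the old one is closed once its refcount drops.
        RefCounted<SolrIndexSearcher> ref = solrCore.getSearcher();
        try {
            SolrIndexSearcher searcher = ref.get();
            // ... use the searcher ...
        } finally {
            ref.decref(); // required for the old searcher to be released
        }
    }
}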
Re: IndexSearcher and Caches
Chris, I am using SolrIndexSearcher to get a handle to the total number of records in the index. I am doing it like this : int num = Integer.parseInt(solrSearcher.getStatistics().get("numDocs").toString()); Please let me know if there is a better way to do this. Mark, I can tell you what I do in my application. We provide a tool to do the index update and assume that the user will always use it to create/update the index. Whenever an update happens, we notify the querying application and it creates a new instance of SolrCore, SolrServer etc. These continue to be shared across multiple users (as statics) till the next update happens. Thank you. Regards Rahul On Tue, May 25, 2010 at 4:18 AM, Chris Hostetter wrote: > > : Thank you I found the API to get the existing SolrIndexSearcher to be > : present in SolrCore: > : SolrCore.getSearcher().get() > > I think perhaps you need to take 5 big steps back and explain what your > goal is. 99.999% of all solr users should never care about that method -- > even the 99.9% of the folks writing java code and using "EmbeddedSolr" > should never ever have a need to call those -- so what exactly is it you > are doing, and how did you get along the path you find yourself on? > > this thread started with some fairly innocuous questions about how caches > worked in regards to new searchers -- which is all fine and dandy, those > concepts that solr users should be aware of ... in the abstract. you > should almost never be instantiating those IndexSearchers or Caches > yourself. > > Stick with the SolrServer abstraction provided by SolrJ... > > http://wiki.apache.org/solr/Solrj#EmbeddedSolrServer > > http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrServer.html > > > -Hoss > >
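Following Hoss's advice to stay behind the SolrServer abstraction, the same count can be read without touching SolrIndexSearcher at all; a sketch, assuming the shared SolrServer handle:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class NumDocs {
    public static long count(SolrServer server) throws Exception {
        SolrQuery q = new SolrQuery("*:*"); // match every document
        q.setRows(0);                       // we only need the count
        QueryResponse rsp = server.query(q);
        return rsp.getResults().getNumFound();
    }
}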