Simulate facet.exists for json query facets
Hi,

I use JSON facets of type 'query'. As these queries are pretty slow and I'm only interested in whether there is a match or not, I'd like to restrict the query execution similar to standard faceting (as with the facet.exists parameter). My simplified query looks something like this (in reality *:* may be replaced by a complex edismax query, and multiple subfacets similar to "tour" occur):

  curl http://localhost:8983/solr/portal/select -d \
  "q=*:*\
  &json.facet={
    tour:{
      type : query,
      q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\"
    }
  }\
  &rows=0"

Is there any possibility to modify my request to ensure that the facet query stops as soon as it matches a hit for the first time?

Thanks!
Michael
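For reference, the facet.exists behavior mentioned above belongs to the classic facet API: it caps each term's count at 1 and stops counting that term at the first hit, and it works with the enum facet method. A rough sketch of that style of request against the same collection (using categoryId as the facet field is an assumption):

  curl http://localhost:8983/solr/portal/select -d \
  "q=*:*\
  &rows=0\
  &facet=true\
  &facet.field=categoryId\
  &facet.method=enum\
  &facet.exists=true"

The JSON Facet API documents no equivalent switch for query facets, which is what the question above is after.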
Tangent: old Solr versions
On Tue, Oct 27, 2020 at 04:25:54PM -0500, Mike Drob wrote:
> Based on the questions that we've seen over the past month on this list,
> there are still users with Solr on 6, 7, and 8. I suspect there are still
> Solr 5 users out there too, although they don't appear to be asking for
> help - likely they are in set it and forget it mode.

Oh, there are quite a few instances of Solr 4 out there as well. Many of them will be moving to v7 or v8, probably starting in the next 6-12 months.

--
Mark H. Wood
Lead Technology Analyst
University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu
Re: Simulate facet.exists for json query facets
This really sounds like an XY problem. The whole point of facets is to count the number of documents that have a value in some number of buckets, so trying to stop your facet query as soon as it matches a hit for the first time seems like an odd thing to do.

So what's the "X"? In other words, what is the problem you're trying to solve at a high level? Perhaps there's a better way to figure this out.

Best,
Erick

> On Oct 28, 2020, at 3:48 AM, michael dürr wrote:
>
> Hi,
>
> I use JSON facets of type 'query'. As these queries are pretty slow and I'm
> only interested in whether there is a match or not, I'd like to restrict
> the query execution similar to standard faceting (as with the
> facet.exists parameter). My simplified query looks something like this (in
> reality *:* may be replaced by a complex edismax query and multiple
> subfacets similar to "tour" occur):
>
> curl http://localhost:8983/solr/portal/select -d \
> "q=*:*\
> &json.facet={
>   tour:{
>     type : query,
>     q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\"
>   }
> }\
> &rows=0"
>
> Is there any possibility to modify my request to ensure that the facet
> query stops as soon as it matches a hit for the first time?
>
> Thanks!
> Michael
RE: SOLR uses too much CPU and GC is also weird on Windows server
Hi all,

It's me again. Anyway, I did a little research and we tried different things, and well, there are some questions I want to ask and some things that I found.

After monitoring my system with VisualVM, I found that GC is jumping from 0.5GB to 2.5GB, and it now has 4GB of memory, so it should not be an issue anymore, or should it? I will observe it a bit, as I guess it might still rise.

The next thing we found, or are thinking about, is that writing to disk might be an issue. We turned off the indexing and some other stuff, but I would say it did not save much.

I also went through all the schema fields; there are not that many really. They are all docValues=true. I must say they are all automatically generated, so no manual work there except one field, but this also has docValues=true. Just curious: if the field is not a string/text, can it be docValues=false, or is it still better to have true? As for uninversion, we are not using facets much, nor other specific things in queries, just simple queries.

Though I must say we are updating documents quite a bunch, but I'm not sure that explains the CPU usage being so high. The older version did not seem to use so much CPU...

I am a bit running out of ideas and hoping that this will continue to work, but I don't like the CPU usage even over night, when nobody uses it. We will try to figure out the issue here, and I hope I can ask more questions when in doubt or out of ideas. Also I must admit, Solr is really new for me personally.

Jaan

-----Original Message-----
From: Walter Underwood
Sent: 27 October 2020 18:44
To: solr-user@lucene.apache.org
Subject: Re: SOLR uses too much CPU and GC is also weird on Windows server

That first graph shows a JVM that does not have enough heap for the program it is running. Look at the bottom of the dips. That is the amount of memory still in use after a full GC.

You want those dips to drop to about half of the available heap, so I'd immediately increase that heap to 4G. That might not be enough, so you'll need to watch that graph after the increase.

I've been using 8G heaps with Solr since version 1.2. We run this config with Java 8 on over 100 machines. We do not do any faceting, which can take more memory.

SOLR_HEAP=8g
# Use G1 GC -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
  -XX:+UseG1GC \
  -XX:+ParallelRefProcEnabled \
  -XX:G1HeapRegionSize=8m \
  -XX:MaxGCPauseMillis=200 \
  -XX:+UseLargePages \
  -XX:+AggressiveOpts \
"

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Oct 27, 2020, at 12:48 AM, Jaan Arjasepp wrote:
>
> Hello,
>
> We have been using SOLR for quite some time. We used 6.0, and now we did a little upgrade to our system and servers and we started to use 8.6.1.
> We use it on a Windows Server 2019.
> Java version is 11.
> Basically we use it with default settings, except giving SOLR 2G of heap. It used 512MB, but it ran out of memory and stopped responding. Not sure if that was the issue. With the older version, it managed fine with 512MB.
> SOLR is not in cloud mode, but in solo mode, as we use it internally and it does not get too many requests nor much indexing actually.
> Document sizes are not big, I guess. We only use one core.
> Document stats are here:
> Num Docs: 3627341
> Max Doc: 4981019
> Heap Memory Usage: 434400
> Deleted Docs: 1353678
> Version: 15999036
> Segment Count: 30
>
> The size of the index is 2.66GB.
>
> While making the upgrade we had to modify one field and a bit of code that uses it. That's basically it. It works.
> If you need more information about the background of the system, I am happy to help.
>
> But now to the issue I am having.
> When SOLR is started, for the first 40-60 minutes it works just fine. CPU is not high, heap usage seems normal. All is good, but then suddenly the heap usage goes crazy, up and down, up and down, and CPU rises to 50-60% usage. I also noticed over the weekend, when there is no writing activity, the CPU remains low and decent. I can try it this weekend again to see if and how this works out.
> It also seems to me that after 4-5 days of working like this it stops responding, but this needs to be confirmed with more heap as well.
>
> Heap memory usage via JMX and jconsole ->
> https://drive.google.com/file/d/1Zo3B_xFsrrt-WRaxW-0A0QMXDNscXYih/view?usp=sharing
> As you can see, it starts off normal, but then goes crazy, and it has been like this over night.
>
> This is the overall monitoring graph; as you can see, CPU is working hard or hardly working. ->
> https://drive.google.com/file/d/1_Gtz-Bi7LUrj8UZvKfmNMr-8gF_lM2Ra/view?usp=sharing
> VM summary can be found here ->
> https://drive.google.com/file/d/1FvdCz0N5pFG1fmX_5OQ2855MVkaL048w/view?usp=sharing
> And finally, for a better and quicker overview of the SOLR execution parameters that I have ->
> https://drive.google.com/file/d/10VCtYDxflJcvb1aOoxt0u3Nb5JzTjrAI/view?usp=sharing
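Since this server runs Windows, the heap and GC settings discussed above normally live in bin\solr.in.cmd rather than solr.in.sh. A hedged sketch of the equivalent configuration (file location and variable names per a default Solr 8.x install; 4g is the heap size suggested in this thread):

  REM bin\solr.in.cmd
  set SOLR_JAVA_MEM=-Xms4g -Xmx4g
  set GC_TUNE=-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200

Note that -XX:+AggressiveOpts from the sh example above is deprecated on Java 11, and -XX:+UseLargePages needs OS-level support, so both are left out of this sketch.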
Re: Tangent: old Solr versions
Chegg is running a 4.10.2 master/slave cluster for textbook search and several other collections.

1. None of the features past 4.x are needed.
2. We depend on the extended edismax (SOLR-629).
3. Ain't broke.

We are moving our Solr Cloud clusters to 8.x, even though there are no features we need that aren't in 6.6.2. Moving the Solr 4 cluster is way at the bottom of the list.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Oct 28, 2020, at 5:37 AM, Mark H. Wood wrote:
>
> On Tue, Oct 27, 2020 at 04:25:54PM -0500, Mike Drob wrote:
>> Based on the questions that we've seen over the past month on this list,
>> there are still users with Solr on 6, 7, and 8. I suspect there are still
>> Solr 5 users out there too, although they don't appear to be asking for
>> help - likely they are in set it and forget it mode.
>
> Oh, there are quite a few instances of Solr 4 out there as well. Many
> of them will be moving to v7 or v8, probably starting in the next 6-12
> months.
>
> --
> Mark H. Wood
> Lead Technology Analyst
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu
Re: SOLR uses too much CPU and GC is also weird on Windows server
Double the heap. All that CPU is the GC trying to free up space.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Oct 28, 2020, at 6:29 AM, Jaan Arjasepp wrote:
>
> Hi all,
>
> It's me again. Anyway, I did a little research and we tried different things, and well, there are some questions I want to ask and some things that I found.
>
> After monitoring my system with VisualVM, I found that GC is jumping from 0.5GB to 2.5GB, and it now has 4GB of memory, so it should not be an issue anymore, or should it? I will observe it a bit, as I guess it might still rise.
>
> The next thing we found, or are thinking about, is that writing to disk might be an issue. We turned off the indexing and some other stuff, but I would say it did not save much.
>
> I also went through all the schema fields; there are not that many really. They are all docValues=true. I must say they are all automatically generated, so no manual work there except one field, but this also has docValues=true. Just curious: if the field is not a string/text, can it be docValues=false, or is it still better to have true? As for uninversion, we are not using facets much, nor other specific things in queries, just simple queries.
>
> Though I must say we are updating documents quite a bunch, but I'm not sure that explains the CPU usage being so high. The older version did not seem to use so much CPU...
>
> I am a bit running out of ideas and hoping that this will continue to work, but I don't like the CPU usage even over night, when nobody uses it. We will try to figure out the issue here, and I hope I can ask more questions when in doubt or out of ideas. Also I must admit, Solr is really new for me personally.
>
> Jaan
>
> -----Original Message-----
> From: Walter Underwood
> Sent: 27 October 2020 18:44
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR uses too much CPU and GC is also weird on Windows server
>
> That first graph shows a JVM that does not have enough heap for the program it is running. Look at the bottom of the dips. That is the amount of memory still in use after a full GC.
>
> You want those dips to drop to about half of the available heap, so I'd immediately increase that heap to 4G. That might not be enough, so you'll need to watch that graph after the increase.
>
> I've been using 8G heaps with Solr since version 1.2. We run this config with Java 8 on over 100 machines. We do not do any faceting, which can take more memory.
>
> SOLR_HEAP=8g
> # Use G1 GC -- wunder 2017-01-23
> # Settings from https://wiki.apache.org/solr/ShawnHeisey
> GC_TUNE=" \
>   -XX:+UseG1GC \
>   -XX:+ParallelRefProcEnabled \
>   -XX:G1HeapRegionSize=8m \
>   -XX:MaxGCPauseMillis=200 \
>   -XX:+UseLargePages \
>   -XX:+AggressiveOpts \
> "
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On Oct 27, 2020, at 12:48 AM, Jaan Arjasepp wrote:
>>
>> Hello,
>>
>> We have been using SOLR for quite some time. We used 6.0, and now we did a little upgrade to our system and servers and we started to use 8.6.1.
>> We use it on a Windows Server 2019.
>> Java version is 11.
>> Basically we use it with default settings, except giving SOLR 2G of heap. It used 512MB, but it ran out of memory and stopped responding. Not sure if that was the issue. With the older version, it managed fine with 512MB.
>> SOLR is not in cloud mode, but in solo mode, as we use it internally and it does not get too many requests nor much indexing actually.
>> Document sizes are not big, I guess. We only use one core.
>> Document stats are here:
>> Num Docs: 3627341
>> Max Doc: 4981019
>> Heap Memory Usage: 434400
>> Deleted Docs: 1353678
>> Version: 15999036
>> Segment Count: 30
>>
>> The size of the index is 2.66GB.
>>
>> While making the upgrade we had to modify one field and a bit of code that uses it. That's basically it. It works.
>> If you need more information about the background of the system, I am happy to help.
>>
>> But now to the issue I am having.
>> When SOLR is started, for the first 40-60 minutes it works just fine. CPU is not high, heap usage seems normal. All is good, but then suddenly the heap usage goes crazy, up and down, up and down, and CPU rises to 50-60% usage. I also noticed over the weekend, when there is no writing activity, the CPU remains low and decent. I can try it this weekend again to see if and how this works out.
>> It also seems to me that after 4-5 days of working like this it stops responding, but this needs to be confirmed with more heap as well.
>>
>> Heap memory usage via JMX and jconsole ->
>> https://drive.google.com/file/d/1Zo3B_xFsrrt-WRaxW-0A0QMXDNscXYih/view?usp=sharing
>> As you can see, it starts off normal, but then goes crazy, and it has been like this over night.
>>
>> This is the overall monitoring graph; as you can see, CPU is working hard or hardly working.
Re: SOLR uses too much CPU and GC is also weird on Windows server
DocValues=true is usually only used for "primitive" types: string, numerics, booleans and the like, specifically _not_ text-based. I say "usually" because there's a special SortableTextField where it does make some sense for a text-based field to have docValues, but that's intended for relatively short fields — for example, when you want to sort on a title field — and probably not something you're working with.

There's not much we can say from this distance, I'm afraid. I think I'd focus on the memory requirements; maybe take a heap dump and see what's using memory.

Did you restart Solr _after_ turning off indexing? I ask because that would help determine which side the problem is on, indexing or querying. It does sound like querying though.

As for docValues in general, if you want to be really brave, you can set uninvertible=false for all your fields where docValues=false. When you facet on such a field, you won't get anything back. If you sort on such a field, you'll get an error message back. That should test whether somehow not having docValues is the root of your problem. Do this on a test system of course ;) I think this is a low-probability issue, but it's a mystery anyway so...

Updating shouldn't be that much of a problem either, and if you still see high CPU with indexing turned off, that eliminates indexing as a candidate.

Is there any chance you changed your schema at all and didn't delete your entire index and add all your documents back? There are a lot of ways things can go wrong if that's the case. You had to reindex from scratch when you went to 8x from 6x; I'm wondering if during that process the schema changed without starting over. I'm grasping at straws here…

I'd also seriously consider going to 8.6.3. We only make point releases when there's something serious. Looking through lucene/CHANGES.txt, there is one memory leak fix in 8.6.2. I'd expect a gradual buildup of heap if that were what you're seeing, but you never know.

As for having docValues=false, that would cut down on the size of the index on disk and speed up indexing some, but in terms of memory usage or CPU usage when querying, unless the docValues structures are _needed_, they're never read into OS RAM by MMapDirectory… The question really is whether you ever, intentionally or not, do "something" that would be more efficient with docValues. That's where setting uninvertible=false whenever you set docValues=false makes sense: things will show up if your assumption that you don't need docValues is false.

Best,
Erick

> On Oct 28, 2020, at 9:29 AM, Jaan Arjasepp wrote:
>
> Hi all,
>
> It's me again. Anyway, I did a little research and we tried different things, and well, there are some questions I want to ask and some things that I found.
>
> After monitoring my system with VisualVM, I found that GC is jumping from 0.5GB to 2.5GB, and it now has 4GB of memory, so it should not be an issue anymore, or should it? I will observe it a bit, as I guess it might still rise.
>
> The next thing we found, or are thinking about, is that writing to disk might be an issue. We turned off the indexing and some other stuff, but I would say it did not save much.
>
> I also went through all the schema fields; there are not that many really. They are all docValues=true. I must say they are all automatically generated, so no manual work there except one field, but this also has docValues=true. Just curious: if the field is not a string/text, can it be docValues=false, or is it still better to have true? As for uninversion, we are not using facets much, nor other specific things in queries, just simple queries.
>
> Though I must say we are updating documents quite a bunch, but I'm not sure that explains the CPU usage being so high. The older version did not seem to use so much CPU...
>
> I am a bit running out of ideas and hoping that this will continue to work, but I don't like the CPU usage even over night, when nobody uses it. We will try to figure out the issue here, and I hope I can ask more questions when in doubt or out of ideas. Also I must admit, Solr is really new for me personally.
>
> Jaan
>
> -----Original Message-----
> From: Walter Underwood
> Sent: 27 October 2020 18:44
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR uses too much CPU and GC is also weird on Windows server
>
> That first graph shows a JVM that does not have enough heap for the program it is running. Look at the bottom of the dips. That is the amount of memory still in use after a full GC.
>
> You want those dips to drop to about half of the available heap, so I'd immediately increase that heap to 4G. That might not be enough, so you'll need to watch that graph after the increase.
>
> I've been using 8G heaps with Solr since version 1.2. We run this config with Java 8 on over 100 machines. We do not do any faceting, which can take more memory.
>
> SOLR_HEAP=8g
> # Use G1
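To make the uninvertible suggestion concrete: below is a hypothetical field definition (the field name is made up) with docValues off and uninversion disabled, so a facet or sort against it fails loudly instead of silently uninverting the field on the heap. The uninvertible attribute is available from Solr 7.6 on.

  <!-- hypothetical example: docValues off and uninversion disabled, so a
       facet on this field returns nothing and a sort returns an error -->
  <field name="someCode" type="string" indexed="true" stored="true"
         docValues="false" uninvertible="false"/>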
How to remove special characters from suggestion in Solr
Hello,

We are using the suggest component below in our Solr implementation:

  analyzinginfixsuggester
  analyzinginfixlookupfactory
  documentdictionaryfactory
  text_auto
  prefix_text
  true
  true

  FreeTextSuggester
  FreeTextLookupFactory
  DocumentDictionaryFactory
  text
  5
  text_general
  true
  true

For one document we have a large amount of data, and while syncing this document using the SolrNet library we get the exception below:

  SuggestComponent Exception in building suggester index for: AnalyzingInfixSuggester
  java.lang.IllegalArgumentException: Document contains at least one immense term
  in field="exacttext" (whose UTF8 encoding is longer than the max length 32766),
  all of which were skipped. Please correct the analyzer to not produce such terms.
  The prefix of the first immense term is: '[77, 101, 100, 105, 99, 97, 108, 32,
  108, 97, 117, 110, 99, 104, 32, 112, 97, 99, 107, 10, 65, 98, 105, 114, 97, 116,
  101, 114, 111, 110]...', original message: bytes can be at most 32766 in length;
  got 95994

Please help resolve this issue. Any way to remove special characters from the suggestion results would also work.

Thanks,
Abhay
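The configuration at the top of the message above arrived with its XML markup stripped, but it appears to define two suggesters. A plausible reconstruction — the element names, attribute names, and the mapping of the trailing "true" values are assumptions:

  <searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="name">analyzinginfixsuggester</str>
      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="suggestAnalyzerFieldType">text_auto</str>
      <str name="field">prefix_text</str>
      <str name="highlight">true</str>
      <str name="buildOnStartup">true</str>
    </lst>
    <lst name="suggester">
      <str name="name">FreeTextSuggester</str>
      <str name="lookupImpl">FreeTextLookupFactory</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">text</str>
      <str name="ngrams">5</str>
      <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
      <str name="allTermsRequired">true</str>
      <str name="buildOnStartup">true</str>
    </lst>
  </searchComponent>

The error itself means a single token in the "exacttext" field exceeds Lucene's hard 32766-byte term limit, which typically happens when a keyword-style (non-tokenizing) analyzer sees an entire large document as one token. One hedged option is to cap token length in the dictionary field's analyzer; the field type below is a sketch under that assumption, not the actual schema:

  <fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- assumption: strip non-alphanumeric characters, per the question -->
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="[^a-zA-Z0-9\s]" replacement=" "/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- keep every token far below the 32766-byte limit -->
      <filter class="solr.TruncateTokenFilterFactory" prefixLength="256"/>
    </analyzer>
  </fieldType>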
Re: Solr LockObtainFailedException and NPEs for CoreAdmin STATUS
Hi,

after reading some Solr source code, I might have found the cause: There was indeed a change in Solr 8.6 that leads to the NullPointerException for the CoreAdmin STATUS request in CoreAdminOperation#getCoreStatus. The instancePath is not retrieved from the ResourceLoader anymore, but from the registered CoreDescriptor. See commit [1].

SolrCore.getInstancePath(SolrCore.java:333) throws an NPE because the CoreContainer does not have a CoreDescriptor for the name, even though a SolrCore is available in the CoreContainer under that name (retrieved some lines above). This inconsistency is persistent: all STATUS requests keep failing until Solr is restarted.

IIUC, the underlying problem is that CoreContainer#create does not correctly handle concurrent requests to create the same core. There's a race condition (see TODO comment [2]), and CoreContainer#createFromDescriptor may be called subsequently for the same core. The second call then fails to create an IndexWriter (LockObtainFailedException), and this causes a call to SolrCores#removeCoreDescriptor [3]. This means the second call removes the CoreDescriptor for the SolrCore created by the first call. This is the inconsistency that causes the NPE in CoreAdminOperation#getCoreStatus.

Does this sound reasonable? I'll create a JIRA ticket tomorrow, if that's okay.

Thank you,
Andreas

[1] https://github.com/apache/lucene-solr/commit/17ae79b0905b2bf8635c1b260b30807cae2f5463#diff-9652fe8353b7eff59cd6f128bb2699d88361e670b840ee5ca1018b1bc45584d1R324
[2] https://github.com/apache/lucene-solr/blob/15241573d3c8da0db3dfd380d99e4efcfe500c2e/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1242
[3] https://github.com/apache/lucene-solr/blob/15241573d3c8da0db3dfd380d99e4efcfe500c2e/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1407
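For illustration only, here is a self-contained sketch of the check-then-act race described above. The names are simplified stand-ins, not the actual CoreContainer code:

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.atomic.AtomicInteger;

  // Schematic model of the race: two threads both observe "core absent" and
  // both proceed to the expensive creation step. In Solr, the losing thread's
  // createFromDescriptor() fails on the index write lock and then removes the
  // CoreDescriptor registered by the winner, leaving the maps inconsistent.
  public class CoreCreateRaceSketch {
      static final ConcurrentHashMap<String, Object> cores = new ConcurrentHashMap<>();
      static final AtomicInteger creations = new AtomicInteger();

      static Object create(String name) {
          Object existing = cores.get(name);   // check ...
          if (existing != null) return existing;
          creations.incrementAndGet();         // ... then act: both threads can get here
          Object core = new Object();
          cores.put(name, core);
          return core;
      }

      public static void main(String[] args) throws InterruptedException {
          Thread a = new Thread(() -> create("portal"));
          Thread b = new Thread(() -> create("portal"));
          a.start(); b.start();
          a.join(); b.join();
          System.out.println("creations = " + creations.get()); // 2 when the race hits
      }
  }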
Re: Simulate facet.exists for json query facets
Separately, and in parallel to Erick's question: indeed I'm not aware of any way to do this currently, but I *can* imagine cases where this would be useful. I have a sense this could be cleanly implemented as a stat facet function (https://lucene.apache.org/solr/guide/8_6/json-facet-api.html#stat-facet-functions), e.g.:

  curl http://localhost:8983/solr/portal/select -d \
  "q=*:*\
  &json.facet={
    tour: \"exists(+categoryId:6000 -categoryId:(6061 21493 8510))\"
  }\
  &rows=0"

The return value of the `exists` function could be boolean, which would be semantically clearer than capping count to 1, as I gather `facet.exists` does. For the same reason, implementing this as a function would probably be better than adding this functionality to the `query` facet type, which carries certain useful assumptions (the meaning of the "count" attribute in the response, the ability to nest stats and subfacets, etc.) ... just thinking out loud at the moment ...

On Wed, Oct 28, 2020 at 9:17 AM Erick Erickson wrote:
>
> This really sounds like an XY problem. The whole point of facets is
> to count the number of documents that have a value in some
> number of buckets. So trying to stop your facet query as soon
> as it matches a hit for the first time seems like an odd thing to do.
>
> So what's the "X"? In other words, what is the problem you're trying
> to solve at a high level? Perhaps there's a better way to figure this
> out.
>
> Best,
> Erick
>
>> On Oct 28, 2020, at 3:48 AM, michael dürr wrote:
>>
>> Hi,
>>
>> I use JSON facets of type 'query'. As these queries are pretty slow and I'm
>> only interested in whether there is a match or not, I'd like to restrict
>> the query execution similar to standard faceting (as with the
>> facet.exists parameter). My simplified query looks something like this (in
>> reality *:* may be replaced by a complex edismax query and multiple
>> subfacets similar to "tour" occur):
>>
>> curl http://localhost:8983/solr/portal/select -d \
>> "q=*:*\
>> &json.facet={
>>   tour:{
>>     type : query,
>>     q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\"
>>   }
>> }\
>> &rows=0"
>>
>> Is there any possibility to modify my request to ensure that the facet
>> query stops as soon as it matches a hit for the first time?
>>
>> Thanks!
>> Michael
Avoiding duplicate entry for a multivalued field
Hello,

Say I have a schema field which is multivalued. Is there a way to maintain distinct values for that field even though I keep adding duplicate values through atomic updates via SolrJ? Is there some property setting to keep only unique values in a multivalued field?

Thanks,
Srinivas
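For what it's worth, Solr 7.3 and later support an atomic-update modifier, "add-distinct", that appends a value only if the multivalued field does not already contain it (a plain "add" would duplicate it). A hedged SolrJ sketch — the core name, field names, and values here are made up:

  import java.util.Collections;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class AddDistinctSketch {
      public static void main(String[] args) throws Exception {
          // hypothetical core name
          try (HttpSolrClient client =
                   new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "doc1");
              // atomic update: "add-distinct" skips values already present,
              // so repeated updates leave the multivalued field duplicate-free
              doc.addField("tags", Collections.singletonMap("add-distinct", "red"));
              client.add(doc);
              client.commit();
          }
      }
  }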