Re: ApacheCon at Home 2020 starts tomorrow!
On 30/09/2020 05:14, Rahul Goswami wrote:
> Thanks for sharing this Anshum. Day 1 had some really interesting sessions.
> Missed out on a couple that I would have liked to listen to. Are the
> recordings of these sessions available anywhere?

The ASF will be uploading the recordings of all sessions "soon", which probably means in about a week or two.

- Bram
Re: Slow Solr 8 response for long query
Increasing the number of rows should not have this kind of impact in either version of Solr, so I think there's something fundamentally strange in your setup. Whether returning 10 or 300 documents, every document has to be scored. There are two differences between 10 and 300 rows:

1> When returning 10 rows, Solr keeps a sorted list of 10 docs, just IDs and score (assuming you're sorting by relevance); when returning 300, the list is 300 long. I find it hard to believe that keeping a list 300 items long is making that much of a difference.

2> Solr needs to fetch/decompress/assemble 300 documents vs. 10 documents for the response. Regardless of the fields returned, the entire document will be decompressed if you return any fields that are not docValues=true. So it's possible that what you're seeing is related.

Try adding, as Alexandre suggests, &debug=true to the query. Pay particular attention to the "timings" section too; that'll show you the time each component took _exclusive_ of step 2> above and should give a clue.

All that said, fq clauses don't score, so scoring is certainly involved in why the query takes so long to return even 10 rows but gets faster when you move the clause to a filter query. My intuition, though, is that there's something else going on as well to account for the difference when you return 300 rows.

Best,
Erick

> On Sep 29, 2020, at 8:52 PM, Alexandre Rafalovitch wrote:
>
> What do the debug versions of the query show between the two versions?
>
> One thing that changed is the sow (split on whitespace) parameter, among
> many. It is unlikely to be the cause, but I am mentioning it just in case.
> https://lucene.apache.org/solr/guide/8_6/the-standard-query-parser.html#standard-query-parser-parameters
>
> Regards,
>    Alex
>
> On Tue, 29 Sep 2020 at 20:47, Permakoff, Vadim wrote:
>>
>> Hi Solr Experts!
>> We are moving from Solr 6.5.1 to Solr 8.5.0 and having a problem with a long
>> query, which has a search text plus many OR and AND conditions (all in one
>> place; the query is about 20KB long).
>> For the same set of data (about 500K docs) and the same schema, the query in
>> Solr 6 returns results in less than 2 sec; Solr 8 takes more than 10 sec to
>> get 10 results. If I increase the number of rows to 300, in Solr 6 it takes
>> about 10 sec, in Solr 8 it takes more than 1 min. The results are small,
>> just IDs. It looks like the relevancy scoring plays a role, because if I move
>> this query to a filter query, both Solr versions work pretty fast.
>> The right way would be to change the query, but unfortunately it is
>> difficult to modify the application which creates these queries, so I want
>> to find some temporary workaround.
>>
>> What was changed from Solr 6 to Solr 8 in terms of scoring with many
>> conditions, which affects the search speed negatively?
>> Is there anything to configure in Solr 8 to get the same performance for
>> such a query as in Solr 6?
>>
>> Thank you,
>> Vadim
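To make the suggestions above concrete, here is a minimal SolrJ sketch (collection, URL, and field names are all made up) that moves the non-scoring conditions from q to fq and turns on debug output so the per-component timings can be inspected. The same thing can be done over plain HTTP by appending &debug=true to the request.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DebugTimings {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mycollection").build()) {

      SolrQuery q = new SolrQuery("text:foo");     // scored part of the query only
      q.addFilterQuery("category:(a OR b OR c)");  // conditions that don't need scoring
      q.setRows(300);
      q.setFields("id");
      q.set("debug", "true");                      // adds debug/timing sections to the response

      QueryResponse rsp = client.query(q);
      System.out.println("QTime=" + rsp.getQTime()
          + " numFound=" + rsp.getResults().getNumFound());
      System.out.println(rsp.getDebugMap());       // includes per-component timings
    }
  }
}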
Re: How to Resolve : "The request took too long to iterate over doc values"?
I went through the other queries for which we are getting the `The request took too long to iterate over doc values` warning. As suggested by Erick, I have cross-checked all the fields used in the queries, and there is no field we are searching against that has indexed=false and docValues=true.

A few observations I would like to share here:

- We are performing a load test on our system, and the above timeout warning occurs only for those queries which fetch a large number of documents.
- I stopped all load on the system and fired the same queries (for which we were getting the timeout warning). Here is the Solr response:

response: {
  numFound: 6082251,
  start: 0,
  maxScore: 4709.594,
  docs: [ ]
}

The response was quite weird (the header says there are `6082251` docs found, but the `docs` array is empty); also, there was no timeout warning in the logs. Then I increased `timeAllowed` to 5000ms (the default in our setup is 1000ms). This time the `docs` array was not empty, and in fact the numFound count increased. This clearly indicates that the query was not able to complete in 1000ms (the default timeAllowed).

I have the following questions:
1. Are docValues as efficient as ExternalFileField for function queries?
2. Why did I get the warning message when the system was under load but not when there was no load? When we were performing the load test (at the same load scale) with the ExternalFileField type, we were not getting any warning messages in our logs.

raj.yadav wrote
> Hey Erick,
>
> In cases for which we are getting this warning, I'm not able to extract the
> `exact solr query`. Instead the logger is logging the `parsedquery` for such
> cases. Here is one example:
>
> 2020-09-29 13:09:41.279 WARN (qtp926837661-82461) [c:mycollection
> s:shard1_0 r:core_node5 x:mycollection_shard1_0_replica_n3]
> o.a.s.s.SolrIndexSearcher Query: [+FunctionScoreQuery(+*:*, scored by
> boost(product(if(max(const(0),
> sub(float(my_doc_value_field1),const(500))),const(0.01),
> if(max(const(0),sub(float(my_doc_value_field2),const(290))),const(0.2),const(1))),
> sqrt(product(sum(const(1),float(my_doc_value_field3),float(my_doc_value_field4)),
> sqrt(sum(const(1),float(my_doc_value_field5
> #BitSetDocTopFilter]; The request took too long to iterate over doc values.
> Timeout: timeoutAt: 1635297585120522 (System.nanoTime(): 1635297690311384),
> DocValues=org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$8@7df12bf1
>
> As per my understanding, the query in the above case is `q=*:*`, and then
> there is a boost function which uses a function query on my_doc_value_field*
> (fieldtype doc_value_field, i.e. having indexed=false and docValues=true) to
> reorder the matched docs. If docValues work efficiently for _function
> queries_, then why are these warnings coming?
>
> Also, we do use frange queries on doc_value_field (having indexed=false and
> docValues=true), for example:
> {!frange l=1.0}my_doc_value_field1
>
> Erick Erickson wrote
>> Let's see the query. My bet is that you are _searching_ against the field
>> and have indexed=false.
>>
>> Searching against a docValues=true indexed=false field results in the
>> equivalent of a "table scan" in the RDBMS world. You may use
>> the docValues efficiently for _function queries_ to mimic some
>> search behavior.
>>
>> Best,
>> Erick
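For reference, a small SolrJ sketch (the collection name is made up) of the timeAllowed experiment described above. When the time limit is hit, Solr sets partialResults=true in the response header, which is consistent with the "numFound is large but docs is empty" response shown above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimeAllowedCheck {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setTimeAllowed(5000);  // milliseconds; raises the 1000 ms limit described above
      QueryResponse rsp = client.query(q);
      // "partialResults" appears in the responseHeader when the limit was hit
      Object partial = rsp.getHeader().get("partialResults");
      if (Boolean.TRUE.equals(partial)) {
        System.out.println("Query timed out; counts and docs may be incomplete");
      }
      System.out.println("numFound=" + rsp.getResults().getNumFound());
    }
  }
}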
Re: advice on whether to use stopwords for use case
You may also want to look at something like:
https://docs.querqy.org/index.html

ApacheCon had (is having..) a presentation on it that seemed quite relevant to your needs. The videos should be live in a week or so.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch wrote:
>
> I am not sure why you think stop words are your first choice. Maybe I
> misunderstand the question. I read it as saying that you need to completely
> exclude a set of documents that include specific keywords when the search
> is called from a specific module.
>
> If I wanted to differentiate the searches from a specific module, I
> would give that module a different end-point (Request Query Handler)
> instead of /select. So, /nocigs or whatever.
>
> Then, in that end-point, you could do all sorts of extra things, such
> as setting appends or even invariants parameters, which would include a
> filter query to exclude any documents matching specific keywords. I
> assume it is ok to return documents that match for other reasons.
>
> Ideally, you would mark the cigs documents during indexing with a
> binary or enumeration flag, and then during search you just need to
> check against that flag. In that case, you could copyField your text
> and run it against something like
> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
> combined with Shingles for multiword phrases. Or similar. And just transform
> it as index-only so that the result is basically a yes/no flag. A
> similar thing could be done with an UpdateRequestProcessor pipeline if
> you want to end up with a true boolean flag. The idea is the same:
> have an index-only flag that you can lock onto for any request from
> the specific module.
>
> Or even with something like the ElevationSearchComponent. Same idea.
>
> Hope this helps.
>
> Regards,
>    Alex.
>
> On Tue, 29 Sep 2020 at 22:28, Derek Poh wrote:
>>
>> Hi
>>
>> I have read in the mailing list that we should try to avoid using stop
>> words.
>>
>> I have a use case where I would like to know if there are alternative
>> solutions besides using stop words.
>>
>> There is a business requirement to return zero results when the search is
>> for cigarette-related words and the search is coming from a particular
>> module on our site. It does not apply to all searches from our site.
>> There is a list of these cigarette-related words. This list contains
>> single words, multiple words ("electronic cigar"), and multiple words with
>> punctuation ("e-cigarette case").
>> I am planning to copy to a different set of search fields, which will
>> include the stopword filter at the index and query stage, for this
>> module to use.
>>
>> For this use case, other than using stop words to handle it, is there
>> any alternative solution?
>>
>> Derek
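As a sketch of the dedicated end-point idea, assuming a hypothetical boolean field cigs_flag set during indexing, the solrconfig.xml entry could look like the following. Using invariants (rather than appends) means callers of /nocigs cannot override the filter:

<!-- Sketch for solrconfig.xml; "cigs_flag" is a hypothetical field
     populated at index time for cigarette-related documents. -->
<requestHandler name="/nocigs" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <lst name="invariants">
    <!-- invariant parameters cannot be overridden by the request -->
    <str name="fq">-cigs_flag:true</str>
  </lst>
</requestHandler>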
Re: advice on whether to use stopwords for use case
I’m not clear on the requirements. It sounds like the query “cigar” or “cuban cigar” should return zero results. Is that right?

If so, then check for those words in the query before sending it to Solr.

But the stopwords approach seems like the requirement is different. Could you give some examples?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch wrote:
>
> You may also want to look at something like:
> https://docs.querqy.org/index.html
>
> ApacheCon had (is having..) a presentation on it that seemed quite
> relevant to your needs. The videos should be live in a week or so.
> [...]
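A sketch of that client-side check, with a hypothetical blocklist; real matching would likely need the same normalization/tokenization the search field uses (e.g. to catch "e-cigarette" vs "e cigarette"):

import java.util.List;
import java.util.Locale;

public class CigsBlocklist {
  // Placeholder phrases; the real list would come from the business team.
  private static final List<String> BLOCKED = List.of(
      "cigarette", "electronic cigar", "e-cigarette case");

  static boolean isBlocked(String userQuery) {
    String normalized = userQuery.toLowerCase(Locale.ROOT);
    // Naive substring match over the normalized query.
    return BLOCKED.stream().anyMatch(normalized::contains);
  }

  public static void main(String[] args) {
    System.out.println(isBlocked("cuban cigar box"));  // false with this list
    System.out.println(isBlocked("E-Cigarette Case")); // true
  }
}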
Master/Slave
Based on the thread below (reading "legacy" as meaning "likely to be deprecated in later versions"), we have been working to extract ourselves from Master/Slave replication.

Most of our collections need to be in two data centers (a read/write copy in one local data center; the disaster-recovery-site SolrCloud can be read-only). We also need redundancy within each data center for when one host or another is unavailable. We implemented this by having different SolrClouds in the different data centers, with Master/Slave replication pulling data from one of the read/write replicas to each of the Slave replicas in the disaster-recovery-site read-only SolrCloud. Additionally, for some collections, there is a desire to have local read-only replicas remain unchanged for querying during the loading process. For these collections, there is a local read/write loading SolrCloud and a local read-only querying SolrCloud (normally configured for Master/Slave replication from one of the replicas of the loader SolrCloud to both replicas of the query SolrCloud, but with Master/Slave disabled while a load is in progress on the loader SolrCloud, and with Master/Slave resumed after the loaded data passes QA checks).

Based on the thread below, we made an attempt to switch to CDCR. The main reason for wanting to change was that CDCR was said to be the supported mechanism and the replacement for Master/Slave replication.

After multiple unsuccessful attempts to get CDCR to work, we ended up with reproducible cases of CDCR losing data in transit. In June, I initiated a thread in this group asking for clarification of how/whether CDCR could be made reliable. This seemed to me to be met with deafening silence until the announcement in July of the release of Solr 8.6 and the deprecation of CDCR.

So we are left with the question whether we should expect Master/Slave replication also to be deprecated; and if so, with what is it expected to be replaced (since not with CDCR)? Or is it now sufficiently safe to assume that Master/Slave replication will continue to be supported after all (since the assertion that it would be replaced by CDCR has been discredited)? In either case, are there other suggested implementations for having a read-only SolrCloud receive data from a read/write SolrCloud?

Thanks

-----Original Message-----
From: Shawn Heisey
Sent: Tuesday, May 21, 2019 11:15 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud (7.3) and Legacy replication slaves

On 5/21/2019 8:48 AM, Michael Tracey wrote:
> Is it possible to set up an existing SolrCloud cluster as the master for
> legacy replication to a slave server or two? It looks like another option
> is to use Uni-directional CDCR, but not sure what is the best option in
> this case.

You're asking for problems if you try to combine legacy replication with SolrCloud. The two features are not guaranteed to work together.

CDCR is your best bet. This replicates from one SolrCloud cluster to another.

Thanks,
Shawn
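For context, the Master/Slave arrangement described above is driven by the ReplicationHandler; a sketch of the two sides in solrconfig.xml (host name, core name, and poll interval are placeholders) might look like this. Pausing replication during a load, as described, can be done with the replication commands disablepoll/enablepoll against the slave's /replication endpoint.

<!-- On the read/write (master) core: -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">managed-schema</str>
  </lst>
</requestHandler>

<!-- On each read-only (slave) core: -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://loader-host:8983/solr/mycore</str>
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>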
Re: Master/Slave
> whether we should expect Master/Slave replication also to be deprecated

It better not ever be deprecated. It has been the most reliable mechanism for its purpose. SolrCloud isn't going to replace standalone; if it does, that's when I guess I stop upgrading or move to Elastic.

On Wed, Sep 30, 2020 at 2:58 PM Oakley, Craig (NIH/NLM/NCBI) [C] wrote:
> Based on the thread below (reading "legacy" as meaning "likely to be
> deprecated in later versions"), we have been working to extract ourselves
> from Master/Slave replication.
> [...]
Re: Master/Slave
We do this sort of thing outside of Solr. The indexing process includes creating a feed file with one JSON object per line. The feed files are stored in S3 with names that are ISO 8601 timestamps. Those files are picked up and loaded into Solr.

Because S3 is cross-region in AWS, those files are also our disaster recovery method for indexing. And of course, two clusters could be loaded from the same file.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Sep 30, 2020, at 12:09 PM, David Hastings wrote:
>
>> whether we should expect Master/Slave replication also to be deprecated
>
> It better not ever be deprecated. It has been the most reliable mechanism
> for its purpose. SolrCloud isn't going to replace standalone; if it does,
> that's when I guess I stop upgrading or move to Elastic.
> [...]
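A rough sketch of that feed-file step (the documents, file name, bucket, and collection names are made up; the S3 upload itself is only indicated in comments):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;

public class FeedWriter {
  public static void main(String[] args) throws Exception {
    // One JSON object per line (JSONL).
    List<String> jsonDocs = List.of(
        "{\"id\":\"1\",\"title\":\"first doc\"}",
        "{\"id\":\"2\",\"title\":\"second doc\"}");

    // ISO 8601 names sort chronologically, e.g. 2020-09-30T19:12:03Z.jsonl
    Path feed = Path.of(Instant.now().toString() + ".jsonl");
    Files.write(feed, String.join("\n", jsonDocs).getBytes(StandardCharsets.UTF_8));

    // From here the file would be put to S3 (e.g. s3://feed-bucket/<name>),
    // and a loader in each region would POST its contents to
    // /solr/<collection>/update/json/docs?commit=true
    System.out.println("wrote " + feed);
  }
}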
Transaction not closed on ms sql
Can someone help troubleshoot some issues that are happening with DIH?

Solr version: 8.2; ZooKeeper 3.4. SolrCloud with 4 nodes and 3 ZooKeepers.

1. Configured DIH for MS SQL with the MS SQL JDBC driver. When trying to pull the data from MS SQL, it connects and fetches records, but we see that the connection opened on the MS SQL end is not closed even though the full import has completed. Need some help in troubleshooting why it's leaving connections open.

2. The way I have scheduled this import API call is like a util that hits the DIH API every minute with a Solr pool URL, and with this it looks like multiple calls are going out from different Solr nodes, which I don't want. I always need the call to be taken by only one node. Can we control this with any config? Or is this happening because I have three ZooKeepers? Please suggest the best approach.

3. I do see some records shown as failed while doing the import. Is there a way to track these failures, i.e. why a small number of records are failing?

Sent from my iPhone
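For question 1, it may help to compare against a minimal data-config.xml sketch; everything below (URL, credentials, query) is a placeholder, but the readOnly and batchSize attributes on JdbcDataSource are worth checking when diagnosing lingering connections:

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://dbhost:1433;databaseName=mydb"
              user="solr_reader"
              password="secret"
              readOnly="true"
              batchSize="1000"/>
  <document>
    <entity name="item"
            query="SELECT id, name, updated_at FROM items"/>
  </document>
</dataConfig>

For question 2, one straightforward fix is to point the scheduled import at one specific node's URL rather than a load-balanced pool URL, since DIH runs on whichever node receives the command; the number of ZooKeepers should be unrelated.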
Solr 7.7 Indexing issue
Hello all

We are using Apache Solr 7.7 on the Windows platform. The data is synced to Solr using a Solr.Net commit. The data is being synced to Solr in batches. The documents are very large (~0.5GB on average) and Solr indexing is taking a long time. The total document size is ~200GB. As the Solr commit is done as part of the API, the API calls are failing because document indexing has not completed.

1. What is your advice on syncing such a large volume of data to Solr?
2. Because of the search field requirements, almost 8 fields are defined as text fields.
3. Currently SOLR_JAVA_MEM is set to 2GB. Is that enough for such a large volume of data? (IF "%SOLR_JAVA_MEM%"=="" set SOLR_JAVA_MEM=-Xms2g -Xmx2g)
4. How should Solr be set up in production on Windows? Currently it's set up as a standalone engine, and the client has been asked to take a backup of the drive. Is there a better way to do this? How should we set up for disaster recovery?

Thanks in advance.

Regards
Manisha Rahatadkar
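On points 1 and the failing commits: the usual pattern is to send documents in moderate-sized batches and commit once at the end (or rely on autoCommit in solrconfig.xml) rather than committing on every API call. The poster is on Solr.Net, but the idea is the same in any client; here is a SolrJ sketch with made-up collection and field names and an arbitrary batch size:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/docs").build()) {
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 10_000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        doc.addField("body_txt", "document body " + i);
        batch.add(doc);
        if (batch.size() == 500) {   // keep individual request bodies small
          client.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) client.add(batch);
      client.commit();               // one commit at the end, or use autoCommit
    }
  }
}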
Re: How to Resolve : "The request took too long to iterate over doc values"?
raj.yadav wrote
> In cases for which we are getting this warning, I'm not able to extract the
> `exact solr query`. Instead the logger is logging the `parsedquery` for such
> cases. Here is one example:
>
> 2020-09-29 13:09:41.279 WARN (qtp926837661-82461) [c:mycollection
> s:shard1_0 r:core_node5 x:mycollection_shard1_0_replica_n3]
> o.a.s.s.SolrIndexSearcher Query: [+FunctionScoreQuery(+*:*, scored by
> boost(product(if(max(const(0),
> sub(float(my_doc_value_field1),const(500))),const(0.01),
> if(max(const(0),sub(float(my_doc_value_field2),const(290))),const(0.2),const(1))),
> sqrt(product(sum(const(1),float(my_doc_value_field3),float(my_doc_value_field4)),
> sqrt(sum(const(1),float(my_doc_value_field5
> #BitSetDocTopFilter]; The request took too long to iterate over doc values.
> Timeout: timeoutAt: 1635297585120522 (System.nanoTime(): 1635297690311384),
> DocValues=org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$8@7df12bf1

Hi Community members,

In my previous mail, I mentioned that Solr is not logging the actual `solr_query` and instead is only logging the parsed query. Actually, Solr logs the solr_query just after logging the above warning message.

Coming back to the query for which we are getting the above warning: it retrieves all docs (i.e. q=*:*) and orders them using a multiplicative boost function (i.e. a boost function query). So this clearly rules out the possibility mentioned by Erick (i.e. that the query might be searching against a field which has indexed=false and docValues=true).

Is this expected when using docValues for function queries? This only happens when the query retrieves a large number of documents (in the millions). Has anyone else faced this issue before? We are experiencing it even when there is no load on the system.

Regards,
Raj
Re: advice on whether to use stopwords for use case
Hi Alex

The business requirement (for now) is not to return any results when the search keywords are cigarette related. The business user team will provide the list of cigarette-related keywords.

Will digest, explore and research your suggestions. Thank you.

On 30/9/2020 10:56 am, Alexandre Rafalovitch wrote:
> I am not sure why you think stop words are your first choice. Maybe I
> misunderstand the question. I read it as saying that you need to completely
> exclude a set of documents that include specific keywords when the search
> is called from a specific module.
> [...]
Re: advice on whether to use stopwords for use case
Yes, the requirement (for now) is not to return any results. I think they may change the requirements, pending their return from the holidays.

> If so, then check for those words in the query before sending it to Solr.

That is what I think too. Thinking further, with the stopwords approach there would still be results returned whenever the search keywords contain more words than just the stopwords.

On 1/10/2020 2:57 am, Walter Underwood wrote:
> I’m not clear on the requirements. It sounds like the query “cigar” or
> “cuban cigar” should return zero results. Is that right?
>
> If so, then check for those words in the query before sending it to Solr.
>
> But the stopwords approach seems like the requirement is different.
> Could you give some examples?
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
> [...]