Re: Async exceptions during distributed update
Hi Jay, This is low ingestion rate. What is the size of your index? What is heap size? I am guessing that this is not a huge index, so I am leaning toward what Shawn mentioned - some combination of DBQ/merge/commit/optimise that is blocking indexing. Though, it is strange that it is happening only on one node if you are sending updates randomly to both nodes. Do you monitor your hosts/Solr? Do you see anything different at the time when timeouts happen? Thanks, Emir -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > On 8 May 2018, at 03:23, Jay Potharaju wrote: > > I have about 3-5 updates per second. > > >> On May 7, 2018, at 5:02 PM, Shawn Heisey wrote: >> >>> On 5/7/2018 5:05 PM, Jay Potharaju wrote: >>> There are some deletes by query. I have not had any issues with DBQ, >>> currently have 5.3 running in production. >> >> Here's the big problem with DBQ. Imagine this sequence of events with >> these timestamps: >> >> 13:00:00: A commit for change visibility happens. >> 13:00:00: A segment merge is triggered by the commit. >> (It's a big merge that takes exactly 3 minutes.) >> 13:00:05: A deleteByQuery is sent. >> 13:00:15: An update to the index is sent. >> 13:00:25: An update to the index is sent. >> 13:00:35: An update to the index is sent. >> 13:00:45: An update to the index is sent. >> 13:00:55: An update to the index is sent. >> 13:01:05: An update to the index is sent. >> 13:01:15: An update to the index is sent. >> 13:01:25: An update to the index is sent. >> {time passes, more updates might be sent} >> 13:03:00: The merge finishes. >> >> Here's what would happen in this scenario: The DBQ and all of the >> update requests sent *after* the DBQ will block until the merge >> finishes. That means that it's going to take up to three minutes for >> Solr to respond to those requests. If the client that is sending the >> request is configured with a 60 second socket timeout, which inter-node >> requests made by Solr are by default, then it is going to experience a >> timeout error. The request will probably complete successfully once the >> merge finishes, but the connection is gone, and the client has already >> received an error. >> >> Now imagine what happens if an optimize (forced merge of the entire >> index) is requested on an index that's 50GB. That optimize may take 2-3 >> hours, possibly longer. A deleteByQuery started on that index after the >> optimize begins (and any updates requested after the DBQ) will pause >> until the optimize is done. A pause of 2 hours or more is a BIG problem. >> >> This is why deleteByQuery is not recommended. >> >> If the deleteByQuery were changed into a two-step process involving a >> query to retrieve ID values and then one or more deleteById requests, >> then none of that blocking would occur. The deleteById operation can >> run at the same time as a segment merge, so neither it nor subsequent >> update requests will have the significant pause. From what I >> understand, you can even do commits in this scenario and have changes be >> visible before the merge completes. I haven't verified that this is the >> case. >> >> Experienced devs: Can we fix this problem with DBQ? On indexes with a >> uniqueKey, can DBQ be changed to use the two-step process I mentioned? >> >> Thanks, >> Shawn >>
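For anyone who wants to try the two-step approach Shawn describes, here is a minimal SolrJ sketch. It assumes the uniqueKey field is named "id" and that the SolrClient is already pointed at the right core or collection; treat it as an illustration of the idea rather than a drop-in replacement for deleteByQuery.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class TwoStepDelete {
  // Collect uniqueKey values matching 'matchQuery' with a cursor, then delete them by ID.
  // Unlike deleteByQuery, the deleteById calls do not block behind a running segment merge.
  public static void deleteMatching(SolrClient client, String matchQuery) throws Exception {
    SolrQuery query = new SolrQuery(matchQuery);
    query.setFields("id");                          // only the uniqueKey is needed
    query.setRows(1000);
    query.setSort(SolrQuery.SortClause.asc("id"));  // cursors require a sort on the uniqueKey

    String cursor = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = client.query(query);

      List<String> ids = new ArrayList<>();
      for (SolrDocument doc : rsp.getResults()) {
        ids.add(doc.getFieldValue("id").toString());
      }
      if (!ids.isEmpty()) {
        client.deleteById(ids);                     // safe to run while a merge is in progress
      }

      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) {
        break;                                      // cursor did not advance, all pages seen
      }
      cursor = next;
    }
  }
}

The cursor keeps the paging stable even if a commit makes some deletes visible mid-loop, and re-sending an already-deleted ID is harmless.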
Re: Howto disable PrintGCTimeStamps in Solr
On 5/7/2018 8:22 AM, Bernd Fehling wrote: > thanks for asking, I figured it out this morning. > If setting -Xloggc= the option -XX:+PrintGCTimeStamps will be set > as default and can't be disabled. It's inside JAVA. > > Currently using Solr 6.4.2 with > Java HotSpot(TM) 64-Bit Server VM (25.121-b13) for linux-amd64 JRE > (1.8.0_121-b13) What is the end goal that has you trying to disable PrintGCTimeStamps? Is it to reduce the size of the GC log by only including one timestamp, or something else? Running java 1.8.0_144, I cannot seem to actually do it. I tried removing the parameter from the start script, and I also tried *changing* the parameter to explicitly disable it: -XX:-PrintGCTimeStamps Both times, I verified that the commandline had changed. GC logging still includes both the full date stamp, which PrintGCDateStamps enables, and seconds since JVM start, which PrintGCTimeStamps enables. For the attempt where I changed the parameter instead of removing it, this is the full commandline on the running java process that the start script executed: "C:\Program Files\Java\jdk1.8.0_144\bin\java" -server -Xms512m -Xmx512m -Duser.timezone=UTC -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:-OmitStackTraceInFastThrow -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:-PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime "-Xloggc:C:\Users\sheisey\Downloads\solr-7.3.0\server\logs\solr_gc.log" -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -Xss256k -Dsolr.log.dir="C:\Users\sheisey\Downloads\solr-7.3.0\server\logs" -Dlog4j.configuration="file:C:\Users\sheisey\Downloads\solr-7.3.0\server\resources\log4j.properties" -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Dsolr.log.muteconsole -Dsolr.solr.home="C:\Users\sheisey\Downloads\solr-7.3.0\server\solr" -Dsolr.install.dir="C:\Users\sheisey\Downloads\solr-7.3.0" -Dsolr.default.confdir="C:\Users\sheisey\Downloads\solr-7.3.0\server\solr\configsets\_default\conf" -Djetty.host=0.0.0.0 -Djetty.port=8983 -Djetty.home="C:\Users\sheisey\Downloads\solr-7.3.0\server" -Djava.io.tmpdir="C:\Users\sheisey\Downloads\solr-7.3.0\server\tmp" -jar start.jar "--module=http" "" That change should have done it. I think we're dealing with a Java bug/misfeature. Solr 5.5.5 with Java 1.7.0_80, 1.7.0_45, and 1.7.0_04 behave the same as 7.3.0 with Java 8. I have also verified that Solr 4.7.2 with Java 1.7.0_72 has the same issue. I do not have any information for Java 6 versions. All java versions examined are from Sun/Oracle. I filed a bug with Oracle. They have accepted it and it is now visible publicly. https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8202752 Thanks, Shawn
Re: Must clause with filter queries
On 5/7/2018 9:51 AM, manuj singh wrote: > I am kind of confused how must clause(+) behaves with the filter queries. > e.g i have below query: > q=*:*&fq=+{!frange cost=200 l=NOW-179DAYS u=NOW/DAY+1DAY incl=true > incu=false}date > > So i am filtering documents which are less then 179 old days. > So e.g if now is May 7th, 10.23 cst,2018, i should only see documents which > have date > Nov 9th, 10.23 cst, 2017. > > However with the above query i am also seeing documents which are done on > Nov 5th,2017 (which seems like it is returning some docs from filter cache. > which is wired because in my date range for the start date i am using > NOW-179DAYS and > Now is changing every time, so it shouldn't go to filtercache as every new > request will have a different time stamp. ) > > However if i remove the + from the filter query it seems to work fine. I'm not sure that trying to use the + with the frange query makes any sense. For one thing, putting anything before the localparams (which is the {!stuff otherstuff} syntax) probably causes Solr to not correctly interpret the localparams syntax. Typically localparams must be at the very beginning of the query. Adding a plus to a single-clause query like that is not necessary. Queries with one clause will effectively be interpreted as having the +/MUST on that clause. > I am mostly thinking it seems to be a filtercache issue but not sure how i > prove that. > > Our auto soft commit is 500 ms , so every 0.5 second we should have a new > searcher open and cache should be flushed. A commit interval that low could result in some big problems. I hope the autowarmCount setting on all your caches is zero. If it's not, you're going to want to have a much longer interval than 500 milliseconds. > Something is not right and i am not able to figure out what. Has some one > seen this kind of issue before ? > > If i move the query from fq to q then also it works fine. > > One more thing when i put debug query i see the following in the parse query > > *"QParser": "LuceneQParser", "filter_queries": [ "+{!frange cost=200 > l=NOW-179DAYS u=NOW/DAY+1DAY incl=true incu=false}date", "-_parent_:F" ], > "parsed_filter_queries": [ > "+FunctionRangeQuery(ConstantScore(frange(date(date)):[NOW-179DAYS TO > NOW/DAY+1DAY}))", "-_parent_:false" ]* > > So in the above i do not see the date getting resolved to an actual time > stamp. > > However if i change the syntax of the query to not use frange and local > params i see the transaction date resolving into correct timestamp. > > So for the following query > q=*:*&fq=+date:[NOW-179DAYS TO NOW/DAY+1DAY] > > i see the following in the debug query, and see the actualy timestamp: > "QParser": "LuceneQParser", "filter_queries": [ "date:[NOW-179DAYS TO > NOW/DAY+1DAY]", "-_parent_:F" ], "parsed_filter_queries": [ > "date:[1510242067383 > TO 152573760]", "-_parent_:false" ], If the filter you're trying to use is this kind of simple date range, I would stick with lucene and not use localparams to switch to another parser. I would also set the low value of the range to NOW/DAY-179DAYS so there's at least a chance that caching will be effective. Also, as mentioned, because this example only has one query clause, adding + is unnecessary. It might become necessary if you have multiple query clauses ... but in that case, you're not likely to be using something like frange. Thanks, Shawn
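For the simple date-range case, a cache-friendly version of that filter (assuming the field really is named "date") would be:

fq=date:[NOW/DAY-179DAYS TO NOW/DAY+1DAY]

Rounding both endpoints to NOW/DAY means the filter string only changes once per day, so the filterCache entry can actually be reused between requests.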
Re: Determine Solr Core Creation Timestamp
On 5/7/2018 3:50 PM, Atita Arora wrote: > I noticed the same and hence overruled the idea to use it. > Further , while exploring the V2 api (as we're currently in Solr 6.6 and > will soon be on Solr 7.X) ,I came across the shards API which has > "property.index.version": "1525453818563" > > Which is listed for each of the shards. I wonder if I should be leveraging > this as this seem to be the index version & I dont think this number should > vary on restart. The index version is a number that is milliseconds since the epoch -- 1970-01-01 00:00:00 UTC. This is how Java represents timestamps internally. All Lucene indexes have this information. The index version value appears to update every time the index changes, probably when a new searcher is opened. For SolrCloud collections, this information is actually already available, although getting to it may not be obvious. ZooKeeper itself keeps track of when all znodes are created, so the /collections/x znode creation time is effectively what you're after. This can be seen in Cloud->Tree in the admin UI, which means that there is a way to obtain the information with an HTTP API. When cores are created or manipulated by API calls, the core.properties file will have a comment with a timestamp of the last time Solr wrote/changed the file. CoreAdmin operations like CREATE, SWAP, RENAME, and others will update or create the timestamp in that comment, but if the properties file doesn't ever get changed by Solr, then the comment would reflect the creation time. That makes it not entirely reliable. Also, I do not know of a way to access that information with any Solr API -- access to the filesystem would probably be required. The core.properties file could be a place to store a true creation time, using a new property that Solr doesn't need for any other purpose. Solr could look for a creation time in that file when the core is started and update it to include the current time as the creation time if it is not present, and certain CoreAdmin operations could also write that property. Retrieving the value would need to be added to the CoreAdmin API. Thanks, Shawn
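As a quick illustration of decoding that number, a couple of lines of plain Java (nothing Solr-specific) turn the reported index version into a readable UTC timestamp:

import java.time.Instant;

public class IndexVersionToDate {
  public static void main(String[] args) {
    long indexVersion = 1525453818563L;  // the value reported by the shards API above
    // Interpret the value as milliseconds since the epoch of 1970-01-01 00:00:00 UTC.
    System.out.println(Instant.ofEpochMilli(indexVersion));  // prints 2018-05-04T17:10:18.563Z
  }
}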
Re: Determine Solr Core Creation Timestamp
Thank you Shawn for looking into this to such a depth. Let me try getting hold of someway to grab this information and use it and I may reach back to you or list for further thoughts. Thanks again, Atita On Tue, May 8, 2018, 3:11 PM Shawn Heisey wrote: > On 5/7/2018 3:50 PM, Atita Arora wrote: > > I noticed the same and hence overruled the idea to use it. > > Further , while exploring the V2 api (as we're currently in Solr 6.6 and > > will soon be on Solr 7.X) ,I came across the shards API which has > > "property.index.version": "1525453818563" > > > > Which is listed for each of the shards. I wonder if I should be > leveraging > > this as this seem to be the index version & I dont think this number > should > > vary on restart. > > The index version is a number that is milliseconds since the epoch -- > 1970-01-01 00:00:00 UTC. This is how Java represents timestamps > internally. All Lucene indexes have this information. > > The index version value appears to update every time the index changes, > probably when a new searcher is opened. > > For SolrCloud collections, this information is actually already > available, although getting to it may not be obvious. ZooKeeeper itself > keeps track of when all znodes are created, so the /collections/x > znode creation time is effectively what you're after. This can be seen > in Cloud->Tree in the admin UI, which means that there is a way to > obtain the information with an HTTP API. > > When cores are created or manipulated by API calls, the core.properties > file will have a comment with a timestamp of the last time Solr > wrote/changed the file. CoreAdmin operations like CREATE, SWAP, RENAME, > and others will update or create the timestamp in that comment, but if > the properties file doesn't ever get changed by Solr, then the comment > would reflect the creation time. That makes it not entirely reliable. > Also, I do not know of a way to access that information with any Solr > API -- access to the filesystem would probably be required. > > The core.properties file could be a place to store a true creation time, > using a new property that Solr doesn't need for any other purpose. Solr > could look for a creation time in that file when the core is started and > update it to include the current time as the creation time if it is not > present, and certain CoreAdmin operations could also write that > property. Retrieving the value would needed to be added to the > CoreAdmin API. > > Thanks, > Shawn > >
Filter Must/Must not clauses and parenthesis
Hi everyone, I found solr 5.5.4 is showing some unexpected behavior (at least unexpected for me) when using the Must and Must Not operators and parentheses for filtering, and it would be great if someone can confirm if this is unexpected or not and why. To clarify I will write an example: The following problematic query should give results but it is actually not giving any: q=*:*&defType=edismax&fq=NOT(status:"DELETED")+AND+(NOT(length:[186+TO+365])+AND+NOT(length:[366+TO+*])) which is parsed as -dynamic_multi_stored_facet_string_static_status:DELETED +(-dynamic_multi_stored_facet_long_core_length:[186 TO 365] -dynamic_multi_stored_facet_long_core_length:[366 TO *]) If I rewrite the query removing the enclosing parentheses as q=*:*&defType=edismax&fq=NOT(status:"DELETED")+AND+NOT(length:[186+TO+365])+AND+NOT(length:[366+TO+*])) it is parsed as -dynamic_multi_stored_facet_string_static_status:DELETED -dynamic_multi_stored_facet_long_core_length:[186 TO 365] -dynamic_multi_stored_facet_long_core_length:[366 TO *] and it gives the expected results. Again, if the parenthesis-enclosed condition is alone, as in q=*:*&defType=edismax&fq=(NOT(length:[186+TO+365])+AND+NOT(length:[366+TO+*])) it is parsed as (-dynamic_multi_stored_facet_long_core_length:[186 TO 365] -dynamic_multi_stored_facet_long_core_length:[366 TO *]) and gives more results. Do you have any idea why this is happening? Thanks for your help, Alfonso. -- Alfonso Noriega Software engineer Redlink GmbH e: alfonso.nori...@redlink.co w: http://redlink.co
Re: Solr Slave failed to initialize collection
Hi Shawn, Thanks for the info!! As I mentioned, the master index was fine; only for one of the collections was the slave index corrupted. Yes, we fixed the issue by removing the corrupted index and replicating again. The error message shared was what we received from the Solr Admin UI. The replication strategy seems fine, as it is happening properly from Master to Slave. Did this issue happen due to the size of the index? Are there any recommendations so it does not happen in the future? Please let me know. Regards, Aji Viswanadhan -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Howto disable PrintGCTimeStamps in Solr
Hi Shawn, the goal is that some GCviewer get confused if both DateStamps and TimeStamps are present in solr_gc.log file. And _not_ to reduce the GC log size, that would be stupid. Now I have a Perl-Script which will remove the TimeStamps (and only leaf the DateStamps) for Analysis of solr_gc.log for some GCviewers. Problem solved :-) Generally I can understand that DateStamps or TimeStamps are added as default when logging to a file, but it should be only one type and not both at once possible. Thanks for filing the bug report, I missed that. Regards Bernd Am 08.05.2018 um 11:32 schrieb Shawn Heisey: > On 5/7/2018 8:22 AM, Bernd Fehling wrote: >> thanks for asking, I figured it out this morning. >> If setting -Xloggc= the option -XX:+PrintGCTimeStamps will be set >> as default and can't be disabled. It's inside JAVA. >> >> Currently using Solr 6.4.2 with >> Java HotSpot(TM) 64-Bit Server VM (25.121-b13) for linux-amd64 JRE >> (1.8.0_121-b13) > > What is the end goal that has you trying to disable PrintGCTimeStamps? > Is it to reduce the size of the GC log by only including one timestamp, > or something else? > > Running java 1.8.0_144, I cannot seem to actually do it. I tried > removing the parameter from the start script, and I also tried > *changing* the parameter to explicitly disable it: > > -XX:-PrintGCTimeStamps > > Both times, I verified that the commandline had changed. GC logging > still includes both the full date stamp, which PrintGCDateStamps > enables, and seconds since JVM start, which PrintGCTimeStamps enables. > > For the attempt where I changed the parameter instead of removing it, > this is the full commandline on the running java process that the start > script executed: > > "C:\Program Files\Java\jdk1.8.0_144\bin\java" -server -Xms512m -Xmx512m > -Duser.timezone=UTC -XX:NewRatio=3 -XX:SurvivorRatio=4 > -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8 > -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=4 > -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark > -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly > -XX:CMSInitiatingOccupancyFraction=50 > -XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled > -XX:+ParallelRefProcEnabled -XX:-OmitStackTraceInFastThrow > -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails > -XX:-PrintGCTimeStamps -XX:+PrintGCDateStamps > -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime > "-Xloggc:C:\Users\sheisey\Downloads\solr-7.3.0\server\logs\solr_gc.log" > -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M > -Xss256k > -Dsolr.log.dir="C:\Users\sheisey\Downloads\solr-7.3.0\server\logs" > -Dlog4j.configuration="file:C:\Users\sheisey\Downloads\solr-7.3.0\server\resources\log4j.properties" > -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Dsolr.log.muteconsole > -Dsolr.solr.home="C:\Users\sheisey\Downloads\solr-7.3.0\server\solr" > -Dsolr.install.dir="C:\Users\sheisey\Downloads\solr-7.3.0" > -Dsolr.default.confdir="C:\Users\sheisey\Downloads\solr-7.3.0\server\solr\configsets\_default\conf" > > -Djetty.host=0.0.0.0 -Djetty.port=8983 > -Djetty.home="C:\Users\sheisey\Downloads\solr-7.3.0\server" > -Djava.io.tmpdir="C:\Users\sheisey\Downloads\solr-7.3.0\server\tmp" -jar > start.jar "--module=http" "" > > That change should have done it. I think we're dealing with a Java > bug/misfeature. > > Solr 5.5.5 with Java 1.7.0_80, 1.7.0_45, and 1.7.0_04 behave the same as > 7.3.0 with Java 8. I have also verified that Solr 4.7.2 with Java > 1.7.0_72 has the same issue. 
I do not have any information for Java 6 > versions. All java versions examined are from Sun/Oracle. > > I filed a bug with Oracle. They have accepted it and it is now visible > publicly. > > https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8202752 > > Thanks, > Shawn >
Re:LTR performance issues
Hello ilayaraja, I think it would be good to move this discussion on the Jira item: https://issues.apache.org/jira/browse/SOLR-8776?attachmentOrder=asc You can add your comments there, and also in the page I explained how it works. On the performance you are right: at the moment it is slow. We recently improved the performance a lot for the particular use case where you are interested only in one document per group ( first part of the change has been upstreamed in the las vegas patch [1] ). For the general case, my opinion is that we could speed up by allowing the user to rerank only the groups (without affecting the order of the documents **within** each group). 1. How many top groups are actually re-ranked, is it exactly what we pass in reRankDocs? > rerankDocs will rerank the top $rerankDocs groups, so if your groups contain > many documents you will rerank much more documents 2. How many documents within each group is re-ranked? Can we control it with group.limit or some other parameter? > $rerankDocs documents will be reranked inside each group - please double > check on the jira and add your comments there. Cheers, Diego [1] https://issues.apache.org/jira/browse/SOLR-11831?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16316605#comment-16316605 From: solr-user@lucene.apache.org At: 05/08/18 07:07:01To: solr-user@lucene.apache.org Subject: LTR performance issues LTR with grouping results in very high latency (3x) even while re-ranking 24 top groups. How is re-ranking implemented in Solr? Is it expected that it would result in 3x more query time. Need clarifications on: 1. How many top groups are actually re-ranked, is it exactly what we pass in reRankDocs? 2. How many documents within each group is re-ranked? Can we control it with group.limit or some other parameter? What causes LTR take more time when grouping is performed? Is it scoring the documents again or merging the re-ranked docs with rest of the docs? Is there anyway to optimize this? - --Ilay -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Filter Must/Must not clauses and parenthesis
Just skimmed, but perhaps related to : https://issues.apache.org/jira/browse/SOLR-12212? Best, Erick On Tue, May 8, 2018 at 3:02 AM, Alfonso Noriega wrote: > Hi everyone, > I found solr 5.5.4 is doing some unexpected behavior (at least unexpected > for me) when using Must and Must not operator and parenthesis for filtering > and it would be great if someone can confirm if this is unexpected or not > and why. > > To clarify I will write an example: > The following problematic query should give results but it is actually not > giving anyi > q=*:*&defType=edismax&fq=NOT(status:"DELETED")+AND+(NOT(length:[186+TO+365])+AND+NOT(length:[366+TO+*])) > which is parsed as -dynamic_multi_stored_facet_string_static_status:DELETED > +(-dynamic_multi_stored_facet_long_core_length:[186 TO 365] > -dynamic_multi_stored_facet_long_core_length:[366 TO *]) > > If I rewrite the query removing the enclosing parentheses as > q=*:*&defType=edismax&fq=NOT(status:"DELETED")+AND+NOT(length:[186+TO+365])+AND+NOT(length:[366+TO+*])) > is parsed as -dynamic_multi_stored_facet_string_static_status:DELETED > -dynamic_multi_stored_facet_long_core_length:[186 TO 365] > -dynamic_multi_stored_facet_long_core_length:[366 TO *] > and it gives the expected results. > > Again if the parenthesis enclosed condition is alone as > q=*:*&defType=edismax&fq=(NOT(length:[186+TO+365])+AND+NOT(length:[366+TO+*])) > it is pased as (-dynamic_multi_stored_facet_long_core_length:[186 TO > 365] -dynamic_multi_stored_facet_long_core_length:[366 TO *]) and > giving more results. > > Do you have any idea why is this happening? > > Thanks for your help, > Alfonso. > > -- > Alfonso Noriega > Software engineer > Redlink GmbH > e: alfonso.nori...@redlink.co > w: http://redlink.co
Re: Solr Slave failed to initialize collection
On 5/8/2018 4:32 AM, Aji Viswanadhan wrote: Is this issue happened due to the size of the index? or any recommendations to not happen in future. Please let me know. I have no idea why it happened. Running out of disk space could cause any number of problems. Program operation becomes unpredictable if resources run out. Thanks, Shawn
Re: Async exceptions during distributed update
Hi Emir, I was seeing this error as long as the indexing was running. Once I stopped the indexing the errors also stopped. Yes, we do monitor both hosts & solr but have not seen anything out of the ordinary except for a small network blip. In my experience solr generally recovers after a network blip and there are a few errors for streaming solr client...but have never seen this error before. Thanks Jay Thanks Jay Potharaju On Tue, May 8, 2018 at 12:56 AM, Emir Arnautović < emir.arnauto...@sematext.com> wrote: > Hi Jay, > This is low ingestion rate. What is the size of your index? What is heap > size? I am guessing that this is not a huge index, so I am leaning toward > what Shawn mentioned - some combination of DBQ/merge/commit/optimise that > is blocking indexing. Though, it is strange that it is happening only on > one node if you are sending updates randomly to both nodes. Do you monitor > your hosts/Solr? Do you see anything different at the time when timeouts > happen? > > Thanks, > Emir > -- > Monitoring - Log Management - Alerting - Anomaly Detection > Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > > > > > On 8 May 2018, at 03:23, Jay Potharaju wrote: > > > > I have about 3-5 updates per second. > > > > > >> On May 7, 2018, at 5:02 PM, Shawn Heisey wrote: > >> > >>> On 5/7/2018 5:05 PM, Jay Potharaju wrote: > >>> There are some deletes by query. I have not had any issues with DBQ, > >>> currently have 5.3 running in production. > >> > >> Here's the big problem with DBQ. Imagine this sequence of events with > >> these timestamps: > >> > >> 13:00:00: A commit for change visibility happens. > >> 13:00:00: A segment merge is triggered by the commit. > >> (It's a big merge that takes exactly 3 minutes.) > >> 13:00:05: A deleteByQuery is sent. > >> 13:00:15: An update to the index is sent. > >> 13:00:25: An update to the index is sent. > >> 13:00:35: An update to the index is sent. > >> 13:00:45: An update to the index is sent. > >> 13:00:55: An update to the index is sent. > >> 13:01:05: An update to the index is sent. > >> 13:01:15: An update to the index is sent. > >> 13:01:25: An update to the index is sent. > >> {time passes, more updates might be sent} > >> 13:03:00: The merge finishes. > >> > >> Here's what would happen in this scenario: The DBQ and all of the > >> update requests sent *after* the DBQ will block until the merge > >> finishes. That means that it's going to take up to three minutes for > >> Solr to respond to those requests. If the client that is sending the > >> request is configured with a 60 second socket timeout, which inter-node > >> requests made by Solr are by default, then it is going to experience a > >> timeout error. The request will probably complete successfully once the > >> merge finishes, but the connection is gone, and the client has already > >> received an error. > >> > >> Now imagine what happens if an optimize (forced merge of the entire > >> index) is requested on an index that's 50GB. That optimize may take 2-3 > >> hours, possibly longer. A deleteByQuery started on that index after the > >> optimize begins (and any updates requested after the DBQ) will pause > >> until the optimize is done. A pause of 2 hours or more is a BIG > problem. > >> > >> This is why deleteByQuery is not recommended. > >> > >> If the deleteByQuery were changed into a two-step process involving a > >> query to retrieve ID values and then one or more deleteById requests, > >> then none of that blocking would occur. 
The deleteById operation can > >> run at the same time as a segment merge, so neither it nor subsequent > >> update requests will have the significant pause. From what I > >> understand, you can even do commits in this scenario and have changes be > >> visible before the merge completes. I haven't verified that this is the > >> case. > >> > >> Experienced devs: Can we fix this problem with DBQ? On indexes with a > >> uniqueKey, can DBQ be changed to use the two-step process I mentioned? > >> > >> Thanks, > >> Shawn > >> > >
Re: Filter Must/Must not clauses and parenthesis
On 5/8/2018 4:02 AM, Alfonso Noriega wrote: I found solr 5.5.4 is doing some unexpected behavior (at least unexpected for me) when using Must and Must not operator and parenthesis for filtering and it would be great if someone can confirm if this is unexpected or not and why. Do you have any idea why is this happening? I'm surprised ANY of those examples are working. While the bug that Erick mentioned could be a problem, I think this is happening because you've got a multi-clause pure negative query. All query clauses have NOT attached to them. Purely negative queries do not actually work. The reason negative queries don't work is that if you start with nothing and then start subtracting things, you end up with nothing. To properly work, the first example would need to be written like this: *:* AND NOT(status:"DELETED") AND (*:* AND NOT(length:[186+TO+365]) AND NOT(length:[366+TO+*])) I have added the all documents query as the starting point for both major clauses, so that the subtraction (AND NOT) has something to subtract from. Some of those parentheses are unnecessary, but I have preserved them in the rewritten query. Without unnecessary parentheses/quotes, the query would look like this: *:* AND NOT status:DELETED AND (*:* AND NOT length:[186+TO+365] AND NOT length:[366+TO+*]) You might be wondering why something like "fq=-status:DELETED" will work even though it's a purely negative query. This works because with a super-simple query like that, Solr is able to detect the unworkable situation and automatically fix it by adding the all-docs starting point behind the scenes. The example you gave is too complicated for Solr's detection to work, so it doesn't get fixed. Thanks, Shawn
Re: Must clause with filter queries
Hi Shawn, Thanks for the response. We have multiple clauses; I was just giving a bare-bones example. Usually all our queries will have more than one clause. In the case of an frange query, how do we specify the Must clause? The reason we are using frange instead of the normal syntax is that we need to add a cost to this clause. Since this will return a lot of documents, we want it evaluated after all the other clauses. That is why we are using frange with a cost of 200. We have near real time requirements and that is the reason we are using 500 ms in the auto soft commit. We have autowarmCount="60%" for the filter cache. We are using Solr 6. -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Filter Must/Must not clauses and parenthesis
Thanks Shawn! I was not thinking of it as a subtraction but it makes all the sense put like that. On 8 May 2018 at 17:55, Shawn Heisey wrote: > On 5/8/2018 4:02 AM, Alfonso Noriega wrote: > >> I found solr 5.5.4 is doing some unexpected behavior (at least >> unexpected >> for me) when using Must and Must not operator and parenthesis for >> filtering >> and it would be great if someone can confirm if this is unexpected or not >> and why. >> > > > > Do you have any idea why is this happening? >> > > I'm surprised ANY of those examples are working. While the bug that Erick > mentioned could be a problem, I think this is happening because you've got > a multi-clause pure negative query. All query clauses have NOT attached to > them. Purely negative queries do not actually work. > > The reason negative queries don't work is that if you start with nothing > and then start subtracting things, you end up with nothing. > > To properly work, the first example would need to be written like this: > > *:* AND NOT(status:"DELETED") AND (*:* AND NOT(length:[186+TO+365]) > AND NOT(length:[366+TO+*])) > > I have added the all documents query as the starting point for both major > clauses, so that the subtraction (AND NOT) has something to subtract from. > Some of those parentheses are unnecessary, but I have preserved them in the > rewritten query.Without unnecessary parentheses/quotes, the query would > look like this: > > *:* AND NOT status:DELETED AND (*:* AND NOT length:[186+TO+365] > AND NOT length:[366+TO+*]) > > You might be wondering why something like "fq=-status:DELETED" will work > even though it's a purely negative query. This works because with a > super-simple query like that, Solr is able to detect the unworkable > situation and automatically fix it by adding the all-docs starting point > behind the scenes. The example you gave is too complicated for Solr's > detection to work, so it doesn't get fixed. > > Thanks, > Shawn > > -- -- Alfonso Noriega Software engineer Redlink GmbH e: alfonso.nori...@redlink.co w: http://redlink.co
Solr Json Facet
Hello, recently I have changed the way I get facet data from Solr. I was using GET method on request but due to the limit of the query I changed to POST method. Bellow is a sample of the data I send to Solr, in order to get facets. But there is something here that I don´t understand. If I do not tag the fq query, it woks fine: {'q':'*:*', 'fl': '*', 'fq':'city_colaboration:"College Station"', 'json.facet': '{city_colaboration:{type:terms, field: city_colaboration ,limit:5000}}'} If I tag the fq query and I query for a simple word it works fine too. But if query a multi word with space in the middle it breaks: {'q':'*:*', 'fl': '*', 'fq':'{!tag=city_colaboration_tag}city_colaboration:"College Station"', 'json.facet': '{city_colaboration:{type:terms, field: city_colaboration ,limit:5000, domain:{excludeTags:city_ colaboration_tag}}}'} All of this works fine for GET method, but breks on POST method. Below is the portion of the log. I really appreciate your help. Regards, Koji 01:49 ERROR true RequestHandlerBase org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: Cannot parse 'city_colaboration:"College': Lexical error at line 1, column 34. Encountered: after : "\"College" org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: Cannot parse 'cidade_colaboracao_exact:"College': Lexical error at line 1, column 34. Encountered: after : "\"College" at org.apache.solr.handler.component.QueryComponent. prepare(QueryComponent.java:219) at org.apache.solr.handler.component.SearchHandler.handleRequestBody( SearchHandler.java:270) at org.apache.solr.handler.RequestHandlerBase.handleRequest( RequestHandlerBase.java:173) at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) at org.apache.solr.servlet.SolrDispatchFilter.doFilter( SolrDispatchFilter.java:361) at org.apache.solr.servlet.SolrDispatchFilter.doFilter( SolrDispatchFilter.java:305) at org.eclipse.jetty.servlet.ServletHandler$CachedChain. doFilter(ServletHandler.java:1691) at org.eclipse.jetty.servlet.ServletHandler.doHandle( ServletHandler.java:582) at org.eclipse.jetty.server.handler.ScopedHandler.handle( ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle( SecurityHandler.java:548) at org.eclipse.jetty.server.session.SessionHandler. doHandle(SessionHandler.java:226) at org.eclipse.jetty.server.handler.ContextHandler. doHandle(ContextHandler.java:1180) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) at org.eclipse.jetty.server.session.SessionHandler. doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler. doScope(ContextHandler.java:1112) at org.eclipse.jetty.server.handler.ScopedHandler.handle( ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle( ContextHandlerCollection.java:213) at org.eclipse.jetty.server.handler.HandlerCollection. 
handle(HandlerCollection.java:119) at org.eclipse.jetty.server.handler.HandlerWrapper.handle( HandlerWrapper.java:134) at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle( RewriteHandler.java:335) at org.eclipse.jetty.server.handler.HandlerWrapper.handle( HandlerWrapper.java:134) at org.eclipse.jetty.server.Server.handle(Server.java:534) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at org.eclipse.jetty.server.HttpConnection.onFillable( HttpConnection.java:251) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded( AbstractConnection.java:273) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) at org.eclipse.jetty.io.SelectChannelEndPoint$2.run( SelectChannelEndPoint.java:93) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume. executeProduceConsume(ExecuteProduceConsume.java:303) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume. produceConsume(ExecuteProduceConsume.java:148) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run( ExecuteProduceConsume.java:136) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob( QueuedThreadPool.java:671) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run( QueuedThreadPool.java:589) at java.lang.Thread.run(Thread.java:748)
managed resources and SolrJ
Hi, we are looking into using managed resources for synonyms via the ManagedSynonymGraphFilterFactory. It seems like there is no SolrJ API for that. I would be especially interested in one via the CloudSolrClient. I found http://lifelongprogrammer.blogspot.de/2017/01/build-rest-apis-to-update-solrs-managed-resources.html. Is there a better solution? regards, Hendrik
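In the meantime, the managed resources REST endpoint can be driven from Java with any HTTP client, since SolrJ (at least as of 6.x/7.x) does not expose a typed API for it. Below is a rough sketch using plain HttpURLConnection; the collection name "mycollection" and resource name "english" are placeholders for whatever your ManagedSynonymGraphFilterFactory is configured with.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AddManagedSynonym {
  public static void main(String[] args) throws Exception {
    // Endpoint of the managed synonym resource; adjust host, collection and resource name.
    URL url = new URL("http://localhost:8983/solr/mycollection/schema/analysis/synonyms/english");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    String body = "{\"mad\":[\"angry\",\"upset\"]}";  // new synonym mapping to add
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP status: " + conn.getResponseCode());
  }
}

In SolrCloud the managed data is stored alongside the configset in ZooKeeper, so the request can go to any node, but the collection still has to be reloaded before the new synonyms take effect.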
Re: Solr Json Facet
Single backslash escaping works for me. On Tue, May 8, 2018 at 8:36 PM, Kojo wrote: > Hello, > recently I have changed the way I get facet data from Solr. I was using GET > method on request but due to the limit of the query I changed to POST > method. > > Bellow is a sample of the data I send to Solr, in order to get facets. But > there is something here that I don´t understand. > > If I do not tag the fq query, it woks fine: > {'q':'*:*', 'fl': '*', 'fq':'city_colaboration:"College Station"', > 'json.facet': '{city_colaboration:{type:terms, field: city_colaboration > ,limit:5000}}'} > > If I tag the fq query and I query for a simple word it works fine too. But > if query a multi word with space in the middle it breaks: > > {'q':'*:*', 'fl': '*', > 'fq':'{!tag=city_colaboration_tag}city_colaboration:"College > Station"', 'json.facet': '{city_colaboration:{type:terms, field: > city_colaboration ,limit:5000, domain:{excludeTags:city_ > colaboration_tag}}}'} > > > All of this works fine for GET method, but breks on POST method. > > > Below is the portion of the log. I really appreciate your help. > > Regards, > Koji > > > > 01:49 > ERROR true > RequestHandlerBase > org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: > Cannot parse 'city_colaboration:"College': Lexical error at line 1, > column 34. Encountered: after : "\"College" > org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: > Cannot parse 'cidade_colaboracao_exact:"College': Lexical error at line 1, > column 34. Encountered: after : "\"College" > at org.apache.solr.handler.component.QueryComponent. > prepare(QueryComponent.java:219) > at org.apache.solr.handler.component.SearchHandler.handleRequestBody( > SearchHandler.java:270) > at org.apache.solr.handler.RequestHandlerBase.handleRequest( > RequestHandlerBase.java:173) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) > at org.apache.solr.servlet.SolrDispatchFilter.doFilter( > SolrDispatchFilter.java:361) > at org.apache.solr.servlet.SolrDispatchFilter.doFilter( > SolrDispatchFilter.java:305) > at org.eclipse.jetty.servlet.ServletHandler$CachedChain. > doFilter(ServletHandler.java:1691) > at org.eclipse.jetty.servlet.ServletHandler.doHandle( > ServletHandler.java:582) > at org.eclipse.jetty.server.handler.ScopedHandler.handle( > ScopedHandler.java:143) > at org.eclipse.jetty.security.SecurityHandler.handle( > SecurityHandler.java:548) > at org.eclipse.jetty.server.session.SessionHandler. > doHandle(SessionHandler.java:226) > at org.eclipse.jetty.server.handler.ContextHandler. > doHandle(ContextHandler.java:1180) > at org.eclipse.jetty.servlet.ServletHandler.doScope( > ServletHandler.java:512) > at org.eclipse.jetty.server.session.SessionHandler. > doScope(SessionHandler.java:185) > at org.eclipse.jetty.server.handler.ContextHandler. > doScope(ContextHandler.java:1112) > at org.eclipse.jetty.server.handler.ScopedHandler.handle( > ScopedHandler.java:141) > at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle( > ContextHandlerCollection.java:213) > at org.eclipse.jetty.server.handler.HandlerCollection. 
> handle(HandlerCollection.java:119) > at org.eclipse.jetty.server.handler.HandlerWrapper.handle( > HandlerWrapper.java:134) > at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle( > RewriteHandler.java:335) > at org.eclipse.jetty.server.handler.HandlerWrapper.handle( > HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:534) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at org.eclipse.jetty.server.HttpConnection.onFillable( > HttpConnection.java:251) > at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded( > AbstractConnection.java:273) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) > at org.eclipse.jetty.io.SelectChannelEndPoint$2.run( > SelectChannelEndPoint.java:93) > at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume. > executeProduceConsume(ExecuteProduceConsume.java:303) > at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume. > produceConsume(ExecuteProduceConsume.java:148) > at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run( > ExecuteProduceConsume.java:136) > at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob( > QueuedThreadPool.java:671) > at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run( > QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748) > -- Sincerely yours Mikhail Khludnev
Re: Must clause with filter queries
On 5/8/2018 9:58 AM, root23 wrote: > In case of frange query how do we specify the Must clause ? Looking at how frange works, I'm pretty sure that all queries with frange are going to be effectively single-clause. So you don't need to specify MUST -- it's implied. > the reason we are using frange instead of the normal syntax is that we need > to add a cost to this clause. Since this will return a lot of documents, we > want to calculate at the end of all the clauses. That is why we are using > frange with a cost of 200. Ah, you want it to be a postFilter, which frange supports, but the standard lucene parser doesn't. FYI, to actually achieve a postFilter, you need to set cache=false in addition to a cost of 100 or higher. It's not possible to cache postFilters because of how they work, so they must be uncached. Which also means you don't need to worry about using NOW/DAY date rounding. See the "Expensive Filters" section on this blog post for an example with frange that includes cache=false and cost=200: https://lucidworks.com/2012/02/10/advanced-filter-caching-in-solr/ The requirement for cache=false is not mentioned in the blog post above. It was this post that alerted me to that requirement: https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/ > We have near real time requirements and that is the reason we are using 500 > ms in the autosoft commit. > We have autowarmCount="60%" for filter cache. What is the size of the filterCache? Chances are very good that this translates to a fairly high autowarmCount, and that it is making your automatic soft commits take far longer than 500 milliseconds. If the warming is slow, then you're not getting the half-second latency anyway, so configuring it is at best a waste of resources, and at worst a big performance problem. Achieving NRT indexing requires turning off all warming. To see how long it took to warm the searcher on the last commit, go to the admin UI. Choose your index from the dropdown, click on Plugins/Stats, click on CORE, then open the "searcher" entry. In the displayed information will be "warmupTime", with a value in milliseconds. I'm betting that this number will be larger than 500. If I'm wrong about that, then you might not have anything to worry about. You can also see warmup times for the individual caches with the CACHE entry in Plugins/Stats. Typically it's filterCache that takes the longest. https://www.dropbox.com/s/izwad4h2vl1z752/solr-filtercache-stats.png?dl=0 A long time ago, I was having issues on my servers with commits taking a minute or more. I discovered that it was autowarming on the filterCache that caused it. So I reduced autowarmCount on that cache. Eventually I got to an autowarmCount of *four*. Not 4 percent, I am literally doing warming from the top 4 cache entries. Even with the count that low, commits still sometimes take 10 seconds or more, and the vast majority of that time is spent executing those four warming queries from the filterCache. Thanks, Shawn
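Putting those two points together, the original filter rewritten as an uncached post filter would look something like this (assuming the field really is named "date", as in the earlier message):

fq={!frange cache=false cost=200 l=NOW-179DAYS u=NOW/DAY+1DAY incl=true incu=false}date

With cache=false set, it no longer matters for the filterCache that the lower bound changes on every request.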
Re: Solr Json Facet
On 5/8/2018 11:36 AM, Kojo wrote: > If I tag the fq query and I query for a simple word it works fine too. But > if query a multi word with space in the middle it breaks: > > {'q':'*:*', 'fl': '*', > 'fq':'{!tag=city_colaboration_tag}city_colaboration:"College > Station"', 'json.facet': '{city_colaboration:{type:terms, field: > city_colaboration ,limit:5000, domain:{excludeTags:city_ > colaboration_tag}}}'} Best guess is that this is happening because your JSON fails validation. One of the rules is that quotes must be escaped if you want to use a literal quote. Putting your JSON into a validator, it gets flagged with a BUNCH of errors. https://jsonformatter.curiousconcept.com/ I think I managed to fix it. Here's a new version that passes strict validation. The paste will expire one month from now: https://apaste.info/M46c I also fixed/validated the inner json in the json.facet parameter before I escaped it. As you can see, nested json is messy when it is correctly formed. This is the tool I used for the escaping: https://codebeautify.org/json-escape-unescape Development libraries for constructing JSON data would probably handle the escaping automatically. The JSON parser that Solr uses can handle some deviations from the strict standard, but not ALL deviations. Using data that passes strict validation will make success more likely. It's not what I would do, but you could probably also get this working just by escaping the quotes around the query text: \"College Station\" Thanks, Shawn
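Applied to the failing request, that last suggestion would change only the inner quotes of the fq value, roughly like this:

'fq':'{!tag=city_colaboration_tag}city_colaboration:\"College Station\"'

Whether that alone is enough depends on how the surrounding request body is assembled, but it removes the ambiguity about where the quoted phrase ends.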
Re: Solr Json Facet
On Tue, May 8, 2018 at 1:36 PM, Kojo wrote: > If I tag the fq query and I query for a simple word it works fine too. But > if query a multi word with space in the middle it breaks: Most likely the full query is not getting to Solr because of an HTTP protocol error (i.e. the request is not encoded correctly). How are you sending your request to Solr (with curl, or with some other method?) -Yonik
Rule based replica placement solr cloud 6.2.1
Hi, Would like to have the below rule set up in Solr Cloud 6.2.1. Not sure how to model this with the default snitch. Any suggestions? Don’t assign more than 1 replica of this collection to a host. Regards, Rajeswari
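If I recall the rule-based replica placement syntax correctly, the implicit snitch already provides a "host" tag, so a rule passed at collection creation time (or via MODIFYCOLLECTION) along the lines of

rule=replica:<2,host:*

should keep every host to fewer than 2 (i.e. at most 1) replicas of the collection. If each host runs exactly one Solr node, replica:<2,node:* would be the equivalent per-node form. Please verify against the Rule-based Replica Placement page of the Reference Guide for 6.2.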
Re: Solr Json Facet
Thank you all. I tried escaping but still not working Yonik, I am using Python Requests. It works if my fq is a single word, even if I use double quotes on this single word without escaping. This is the HTTP response: response.content '\n\n400 Bad Request\n\nBad Request\nYour browser sent a request that this server could not understand.\n\n\nApache/2.2.15 (Oracle) Server at leydenh Port 80\n\n' Thank you, 2018-05-08 18:46 GMT-03:00 Yonik Seeley : > On Tue, May 8, 2018 at 1:36 PM, Kojo wrote: > > If I tag the fq query and I query for a simple word it works fine too. > But > > if query a multi word with space in the middle it breaks: > > Most likely the full query is not getting to Solr because of an HTTP > protocol error (i.e. the request is not encoded correctly). > How are you sending your request to Solr (with curl, or with some other > method?) > > -Yonik >
Re: Solr Json Facet
Looks like some sort of proxy server in between the python client and solr server. I would still check first if the output from the python client is correctly escaped/encoded HTTP. One easy way is to use netcat to pretend to be a server: $ nc -l 8983 And then point the python client at that and send the request. -Yonik On Tue, May 8, 2018 at 9:17 PM, Kojo wrote: > Thank you all. I tried escaping but still not working > > Yonik, I am using Python Requests. It works if my fq is a single word, even > if I use double quotes on this single word without escaping. > > This is the HTTP response: > > response.content > > ' > 2.0//EN">\n\n400 Bad > Request\n\nBad Request\nYour browser sent > a request that this server could not understand. > />\n\n\nApache/2.2.15 (Oracle) Server at leydenh Port > 80\n\n' > > > Thank you, > > > > 2018-05-08 18:46 GMT-03:00 Yonik Seeley : > >> On Tue, May 8, 2018 at 1:36 PM, Kojo wrote: > >> > If I tag the fq query and I query for a simple word it works fine too. > >> But > >> > if query a multi word with space in the middle it breaks: > >> > >> Most likely the full query is not getting to Solr because of an HTTP > >> protocol error (i.e. the request is not encoded correctly). > >> How are you sending your request to Solr (with curl, or with some other > >> method?) > >> > >> -Yonik > >>
How to do indexing on remote location
Please don't take this as a joke! Any suggestion is welcome and appreciated. I have data on a remote WORM drive on a cluster that includes 3 hosts; each host contains the same copy of the data. I have a Solr server on a different host and need to index the data on the WORM drive. It is said that indexing can only be done on the local host, or on HDFS if in the same cluster. I was proposing to create a mapped drive/mount so the Solr server would see the WORM drive as a local location. The proposal was returned today by management saying a cross mount potentially introduces risk, and I was asked to figure out a workaround to do the indexing on a remote host without the cross mount. Thank you very much. Sincerely yours, Raymond
How to do multi-threading indexing on huge volume of JSON files?
I have a huge number of JSON files to be indexed in Solr. It took me 22 minutes to index 300,000 JSON files which were generated from a single bz2 file, and this is only 0.25% of the total amount of data from the same business flow; there are 100+ business flows to be indexed. I absolutely need a good solution for this. At the moment I use post.jar to work on a folder, and I am running post.jar in a single thread. I wonder what is the best practice for multi-threaded indexing? Can anyone provide a detailed example? Sincerely yours, Raymond
Re: Solr Json Facet
Everything working now. The code is not that clean and I am rewriting, so I don't know exactly what was wrong, but something malformed. I would like to ask another question regarding json facet. With GET method, i was used to use many fq on the same query, each one with it's own tag. It was working wondefully. With POST method, to post more than one fq parameter is a little complicated, so I am joining all queries in one fq with all the tags. When I select the first facet everything seems to be ok, but when I select the second facet it is "cleaning" the first filter for the facets which shows all the original values for this second facet, even though the result-set is filtering as expected. I will make more tests to understand the mechanics of this, but if someone has some advise on this subject I appreciate a lot. Thank you, 2018-05-08 23:54 GMT-03:00 Yonik Seeley : > Looks like some sort of proxy server inbetween the python client and > solr server. > I would still check first if the output from the python client is > correctly escaped/encoded HTTP. > > One easy way is to use netcat to pretend to be a server: > $ nc -l 8983 > And then send point the python client at that and send the request. > > -Yonik > > > On Tue, May 8, 2018 at 9:17 PM, Kojo wrote: > > Thank you all. I tried escaping but still not working > > > > Yonik, I am using Python Requests. It works if my fq is a single word, > even > > if I use double quotes on this single word without escaping. > > > > This is the HTTP response: > > > > response.content > > > > ' > 2.0//EN">\n\n400 Bad > > Request\n\nBad Request\nYour browser > sent > > a request that this server could not understand. > />\n\n\nApache/2.2.15 (Oracle) Server at leydenh Port > > 80\n\n' > > > > > > Thank you, > > > > > > > > 2018-05-08 18:46 GMT-03:00 Yonik Seeley : > > > >> On Tue, May 8, 2018 at 1:36 PM, Kojo wrote: > >> > If I tag the fq query and I query for a simple word it works fine too. > >> But > >> > if query a multi word with space in the middle it breaks: > >> > >> Most likely the full query is not getting to Solr because of an HTTP > >> protocol error (i.e. the request is not encoded correctly). > >> How are you sending your request to Solr (with curl, or with some other > >> method?) > >> > >> -Yonik > >> >
Re: Solr Json Facet
unsubscribe On Tue, May 8, 2018 at 9:19 PM, Kojo wrote: > Everything working now. The code is not that clean and I am rewriting, so I > don't know exactly what was wrong, but something malformed. > > I would like to ask another question regarding json facet. > > With GET method, i was used to use many fq on the same query, each one with > it's own tag. It was working wondefully. > > With POST method, to post more than one fq parameter is a little > complicated, so I am joining all queries in one fq with all the tags. When > I select the first facet everything seems to be ok, but when I select the > second facet it is "cleaning" the first filter for the facets which shows > all the original values for this second facet, even though the result-set > is filtering as expected. I will make more tests to understand the > mechanics of this, but if someone has some advise on this subject I > appreciate a lot. > > Thank you, > > > > > > 2018-05-08 23:54 GMT-03:00 Yonik Seeley : > >> Looks like some sort of proxy server inbetween the python client and >> solr server. >> I would still check first if the output from the python client is >> correctly escaped/encoded HTTP. >> >> One easy way is to use netcat to pretend to be a server: >> $ nc -l 8983 >> And then send point the python client at that and send the request. >> >> -Yonik >> >> >> On Tue, May 8, 2018 at 9:17 PM, Kojo wrote: >> > Thank you all. I tried escaping but still not working >> > >> > Yonik, I am using Python Requests. It works if my fq is a single word, >> even >> > if I use double quotes on this single word without escaping. >> > >> > This is the HTTP response: >> > >> > response.content >> > >> > '> > 2.0//EN">\n\n400 Bad >> > Request\n\nBad Request\nYour browser >> sent >> > a request that this server could not understand.> > />\n\n\nApache/2.2.15 (Oracle) Server at leydenh Port >> > 80\n\n' >> > >> > >> > Thank you, >> > >> > >> > >> > 2018-05-08 18:46 GMT-03:00 Yonik Seeley : >> > >> >> On Tue, May 8, 2018 at 1:36 PM, Kojo wrote: >> >> > If I tag the fq query and I query for a simple word it works fine too. >> >> But >> >> > if query a multi word with space in the middle it breaks: >> >> >> >> Most likely the full query is not getting to Solr because of an HTTP >> >> protocol error (i.e. the request is not encoded correctly). >> >> How are you sending your request to Solr (with curl, or with some other >> >> method?) >> >> >> >> -Yonik >> >> >>
Re: How to do multi-threading indexing on huge volume of JSON files?
I'd seriously consider a SolrJ program rather than posting; posting files is really intended to be a simple way to get started, and when it comes to indexing large volumes it's not very efficient. As a comparison, I index 3-4K docs/second (Wikipedia dump) on my macbook pro. Note that if each of your businesses has that many documents, you're talking 12 billion, hope you're sharding! Here's some SolrJ to get you started. Note you'll pretty much throw out the Tika and RDBMS in favor of constructing the SolrInputDocuments from parsing your data with your favorite JSON parser. https://lucidworks.com/2012/02/14/indexing-with-solrj/ Then you can rack N of these SolrJ programs (each presumably working on a separate subset of the data) to get your indexing speed up to what you need. 95% of the time slow indexing is because of the ETL pipeline. One key is to check the CPU usage on your Solr server and see if it's running hot or not. If not, then you aren't feeding docs fast enough to Solr. Do batch docs together as in the program; I typically start with batches of 1,000 docs. Best, Erick On Tue, May 8, 2018 at 8:25 PM, Raymond Xie wrote: > I have a huge amount of JSON files to be indexed in Solr, it costs me 22 > minutes to index 300,000 JSON files which were generated from 1 single bz2 > file, this is only 0.25% of the total amount of data from the same business > flow, there are 100+ business flow to be index'ed. > > I absolutely need a good solution on this, at the moment I use the post.jar > to work on folder and I am running the post.jar in single thread. > > I wonder what is the best practice to do multi-threading indexing? Can > anyone provide detailed example? > > > > ** > *Sincerely yours,* > > > *Raymond*
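To make that batching/threading idea a bit more concrete, here is a rough SolrJ skeleton. The ZooKeeper address, collection name, input directory, and the toSolrDoc() helper are all placeholders to replace with your own; it is a starting sketch, not a finished indexer.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelJsonIndexer {

  static final int BATCH_SIZE = 1000;  // docs per add() call, a reasonable starting point
  static final int THREADS = 8;        // tune upward until the Solr server's CPUs stay busy

  public static void main(String[] args) throws Exception {
    // CloudSolrClient is thread-safe, so a single instance is shared by all workers.
    CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder: your ZooKeeper ensemble
        .build();
    client.setDefaultCollection("mycollection");       // placeholder: your collection name

    List<Path> files;
    try (Stream<Path> walk = Files.walk(Paths.get("/data/json"))) {  // placeholder: input dir
      files = walk.filter(p -> p.toString().endsWith(".json")).collect(Collectors.toList());
    }

    ExecutorService pool = Executors.newFixedThreadPool(THREADS);
    for (int i = 0; i < files.size(); i += BATCH_SIZE) {
      List<Path> chunk = files.subList(i, Math.min(i + BATCH_SIZE, files.size()));
      pool.submit(() -> indexChunk(client, chunk));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.DAYS);
    client.commit();   // one explicit commit at the end; rely on autoCommit while indexing
    client.close();
  }

  static void indexChunk(SolrClient client, List<Path> chunk) {
    try {
      List<SolrInputDocument> batch = new ArrayList<>(chunk.size());
      for (Path p : chunk) {
        batch.add(toSolrDoc(Files.readAllBytes(p)));  // parse one JSON file into one document
      }
      client.add(batch);   // one request per batch instead of one request per file
    } catch (Exception e) {
      e.printStackTrace(); // in real code, log and retry or record the failed chunk
    }
  }

  // Hypothetical helper: build a SolrInputDocument with your favorite JSON parser.
  static SolrInputDocument toSolrDoc(byte[] rawJson) {
    SolrInputDocument doc = new SolrInputDocument();
    // doc.setField("id", ...); doc.setField("field", ...);  // map JSON fields to schema fields
    return doc;
  }
}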
Re: Solr Json Facet
Follow the instructions here: http://lucene.apache.org/solr/community.html#mailing-lists-irc. You must use the _exact_ same e-mail as you used to subscribe. If the initial try doesn't work and following the suggestions at the "problems" link doesn't work for you, let us know. But note you need to show us the _entire_ return header to allow anyone to diagnose the problem. On Tue, May 8, 2018 at 9:23 PM, Asher Shih wrote: > unsubscribe > > On Tue, May 8, 2018 at 9:19 PM, Kojo wrote: >> Everything working now. The code is not that clean and I am rewriting, so I >> don't know exactly what was wrong, but something malformed. >> >> I would like to ask another question regarding json facet. >> >> With GET method, i was used to use many fq on the same query, each one with >> it's own tag. It was working wondefully. >> >> With POST method, to post more than one fq parameter is a little >> complicated, so I am joining all queries in one fq with all the tags. When >> I select the first facet everything seems to be ok, but when I select the >> second facet it is "cleaning" the first filter for the facets which shows >> all the original values for this second facet, even though the result-set >> is filtering as expected. I will make more tests to understand the >> mechanics of this, but if someone has some advise on this subject I >> appreciate a lot. >> >> Thank you, >> >> >> >> >> >> 2018-05-08 23:54 GMT-03:00 Yonik Seeley : >> >>> Looks like some sort of proxy server inbetween the python client and >>> solr server. >>> I would still check first if the output from the python client is >>> correctly escaped/encoded HTTP. >>> >>> One easy way is to use netcat to pretend to be a server: >>> $ nc -l 8983 >>> And then send point the python client at that and send the request. >>> >>> -Yonik >>> >>> >>> On Tue, May 8, 2018 at 9:17 PM, Kojo wrote: >>> > Thank you all. I tried escaping but still not working >>> > >>> > Yonik, I am using Python Requests. It works if my fq is a single word, >>> even >>> > if I use double quotes on this single word without escaping. >>> > >>> > This is the HTTP response: >>> > >>> > response.content >>> > >>> > '>> > 2.0//EN">\n\n400 Bad >>> > Request\n\nBad Request\nYour browser >>> sent >>> > a request that this server could not understand.>> > />\n\n\nApache/2.2.15 (Oracle) Server at leydenh Port >>> > 80\n\n' >>> > >>> > >>> > Thank you, >>> > >>> > >>> > >>> > 2018-05-08 18:46 GMT-03:00 Yonik Seeley : >>> > >>> >> On Tue, May 8, 2018 at 1:36 PM, Kojo wrote: >>> >> > If I tag the fq query and I query for a simple word it works fine too. >>> >> But >>> >> > if query a multi word with space in the middle it breaks: >>> >> >>> >> Most likely the full query is not getting to Solr because of an HTTP >>> >> protocol error (i.e. the request is not encoded correctly). >>> >> How are you sending your request to Solr (with curl, or with some other >>> >> method?) >>> >> >>> >> -Yonik >>> >> >>>