SolrCloud replicas out of sync
I have a SolrCloud v5.4 collection with 3 replicas that appear to have fallen permanently out of sync. Users started to complain that the same search, executed twice, sometimes returned different result counts. Sure enough, our replicas are not identical:

>> shard1_replica1: 89867 documents / version 1453479763194
>> shard1_replica2: 89866 documents / version 1453479763194
>> shard1_replica3: 89867 documents / version 1453479763191

I do not think this discrepancy is going to resolve itself. The Solr Admin screen reports all 3 replicas as “Current”. The last modification to this collection was 2 hours before I captured this information, and our auto commit time is 60 seconds.

I have a lot of concerns here, but my first question is whether anyone else has had problems with out-of-sync replicas, and if so, what they have done to correct this?

Kind Regards,

David
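For anyone trying to reproduce this kind of observation, here is a minimal SolrJ sketch that runs the identical query repeatedly through the cloud client, and then asks each replica core directly with distrib=false so the per-replica counts can be compared. The ZK host, collection, and core URLs are placeholders, not the poster's real values:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ReplicaCountCheck {
      public static void main(String[] args) throws Exception {
        // Repeat the same search through the cloud client and watch whether numFound drifts.
        try (CloudSolrClient cloud = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
          cloud.setDefaultCollection("mycollection");
          SolrQuery q = new SolrQuery("*:*").setRows(0);
          for (int i = 0; i < 5; i++) {
            System.out.println("run " + i + ": numFound=" + cloud.query(q).getResults().getNumFound());
          }
        }
        // Ask each replica core directly, bypassing distributed search.
        String[] cores = {
            "http://host1:8983/solr/mycollection_shard1_replica1",
            "http://host2:8983/solr/mycollection_shard1_replica2",
            "http://host3:8983/solr/mycollection_shard1_replica3"};
        for (String url : cores) {
          try (HttpSolrClient core = new HttpSolrClient(url)) {
            SolrQuery q = new SolrQuery("*:*").setRows(0);
            q.set("distrib", "false");   // keep the query on this core only
            System.out.println(url + " -> " + core.query(q).getResults().getNumFound());
          }
        }
      }
    }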
Re: SolrCloud replicas out of sync
Thanks Jeff! A few comments

>> Although you could probably bounce a node and get your document counts back
>> in sync (by provoking a check)

If the check is a simple doc count, that will not work. We have found that replica1 and replica3, although they contain the same doc count, don’t have the SAME docs. They each missed at least one update, but of different docs. This also means none of our three replicas are complete.

>> it’s interesting that you’re in this situation. It implies to me that at some
>> point the leader couldn’t write a doc to one of the replicas,

That is our belief as well. We experienced a datacenter-wide network disruption of a few seconds, and user complaints started the first workday after that event.

The most interesting log entry during the outage is this:

"1/19/2016, 5:08:07 PM ERROR null DistributedUpdateProcessor Request says it is coming from leader, but we are the leader: update.distrib=FROMLEADER&distrib.from=http://dot.dot.dot.dot:8983/solr/blah_blah_shard1_replica3/&wt=javabin&version=2"

>> You might watch the achieved replication factor of your updates and see if
>> it ever changes

This is a good tip. I’m not sure I like the implication that any failure to write all 3 of our replicas must be retried at the app layer. Is this really how SolrCloud applications must be built to survive network partitions without data loss?

Regards,

David

On 1/26/16, 12:20 PM, "Jeff Wartes" wrote:

>My understanding is that the "version" represents the timestamp the searcher
>was opened, so it doesn’t really offer any assurances about your data.
>
>Although you could probably bounce a node and get your document counts back in
>sync (by provoking a check), it’s interesting that you’re in this situation.
>It implies to me that at some point the leader couldn’t write a doc to one of
>the replicas, but that the replica didn’t consider itself down enough to check
>itself.
>
>You might watch the achieved replication factor of your updates and see if it
>ever changes:
>https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
>(See Achieved Replication Factor/min_rf)
>
>If it does, that might give you clues about how this is happening. Also, it
>might allow you to work around the issue by trying the write again.
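For what it's worth, here is a hedged SolrJ sketch of the "watch the achieved replication factor" idea Jeff describes: send the update with a min_rf request parameter and read the achieved factor back from the response. The "rf" response-header key, the collection name, and the document values are assumptions/placeholders:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class MinRfExample {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "example-1");

          UpdateRequest req = new UpdateRequest();
          req.add(doc);
          req.setParam("min_rf", "3");   // ask Solr to report the achieved replication factor
          UpdateResponse rsp = req.process(client, "mycollection");

          Object rf = rsp.getResponseHeader().get("rf");   // assumed location of the achieved factor
          if (rf != null && Integer.parseInt(rf.toString()) < 3) {
            // Fewer than 3 replicas acknowledged the update: record it for a later retry.
            System.err.println("Degraded write, achieved rf=" + rf + " for doc example-1");
          }
        }
      }
    }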
Re: SolrCloud replicas out of sync
Jeff, again, very much appreciate your feedback.

It is interesting — the article you linked to by Shalin is exactly why we picked SolrCloud over ES, because (eventual) consistency is critical for our application and we will sacrifice availability for it. To be clear, after the outage, NONE of our three replicas are correct or complete.

So we definitely don’t have CP yet — our very first network outage resulted in multiple overlapped lost updates. As a result, I can’t pick one replica and make it the new “master”. I must rebuild this collection from scratch, which I can do, but that requires downtime, which is a problem in our app (24/7 High Availability with few maintenance windows).

So, I definitely need to “fix” this somehow. I wish I could outline a reproducible test case, but as the root cause is likely very tight timing issues and complicated interactions with Zookeeper, that is not really an option. I’m happy to share the full logs of all 3 replicas though if that helps.

I am curious though whether the thoughts have changed since https://issues.apache.org/jira/browse/SOLR-5468 about seriously considering a “majority quorum” model, with rollback? Done properly, this should be free of all lost-update problems, at the cost of availability. Some SolrCloud users (like us!!!) would gladly accept that tradeoff.

Regards

David

On 1/26/16, 4:32 PM, "Jeff Wartes" wrote:

>Ah, perhaps you fell into something like this then?
>https://issues.apache.org/jira/browse/SOLR-7844
>
>That says it’s fixed in 5.4, but that would be an example of a split-brain
>type incident, where different documents were accepted by different replicas
>who each thought they were the leader. If this is the case, and you actually
>have different data on each replica, I’m not aware of any way to fix the
>problem short of reindexing those documents. Before that, you’ll probably need
>to choose a replica and just force the others to get in sync with it. I’d
>choose the current leader, since that’s slightly easier.
>
>Typically, a leader writes an update to its transaction log, then sends the
>request to all replicas, and when those all finish it acknowledges the update.
>If a replica gets restarted, and is less than N documents behind, the leader
>will only replay that transaction log. (Where N is the numRecordsToKeep
>configured in the updateLog section of solrconfig.xml)
>
>What you want is to provoke the heavy-duty process normally invoked if a
>replica has missed more than N docs, which essentially does a checksum and
>file copy on all the raw index files. FetchIndex would probably work, but it’s
>a replication handler API originally designed for master/slave replication, so
>take care: https://wiki.apache.org/solr/SolrReplication#HTTP_API
>Probably a lot easier would be to just delete the replica and re-create it.
>That will also trigger a full file copy of the index from the leader onto the
>new replica.
>
>I think design decisions around Solr generally use CP as a goal. (I sometimes
>wish I could get more AP behavior!) See posts like this:
>http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks/
>
>So the fact that you encountered this sounds like a bug to me.
>That said, another general recommendation (of mine) is that you not use Solr
>as your primary data source, so you can rebuild your index from scratch if you
>really need to.
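A hedged sketch of the FetchIndex route Jeff describes, using SolrJ to hit the replication handler on an out-of-sync replica and pull the full index from the current leader. The core URLs are placeholders, and forcing a fetchindex against a SolrCloud core is exactly the "take care" situation the wiki warns about:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class ForceFetchIndex {
      public static void main(String[] args) throws Exception {
        String replicaUrl = "http://host2:8983/solr/mycollection_shard1_replica2"; // replica to repair
        String leaderUrl  = "http://host1:8983/solr/mycollection_shard1_replica1"; // current leader

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "fetchindex");                  // replication handler command
        params.set("masterUrl", leaderUrl + "/replication");  // where to copy the index from

        QueryRequest req = new QueryRequest(params);
        req.setPath("/replication");                          // send to the replica's replication handler
        try (HttpSolrClient replica = new HttpSolrClient(replicaUrl)) {
          System.out.println(req.process(replica));
        }
      }
    }

Deleting the replica with the Collections API DELETEREPLICA action and adding it back with ADDREPLICA, as Jeff suggests, achieves the same full copy while letting SolrCloud's own recovery logic do the work.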
Re: SolrCloud replicas out of sync
Sure. Here is our SolrCloud cluster:

+ Three (3) instances of Zookeeper on three separate (physical) servers. The ZK servers are beefy and fairly recently built, with 2x10 GigE (bonded) Ethernet connectivity to the rest of the data center. We recognize the importance of the stability and responsiveness of ZK to the stability of SolrCloud as a whole.

+ 364 collections, all with single shards and a replication factor of 3. Currently housing only 100,000,000 documents in aggregate. Expected to grow to 25 billion+. The size of a single document would be considered “large”, by the standards of what I’ve seen posted elsewhere on this mailing list.

We are always open to ZK recommendations from you or anyone else, particularly for running a SolrCloud cluster of this size.

Kind Regards,

David

On 1/27/16, 12:46 PM, "Jeff Wartes" wrote:

>If you can identify the problem documents, you can just re-index those after
>forcing a sync. Might save a full rebuild and downtime.
>
>You might describe your cluster setup, including ZK. It sounds like you’ve
>done your research, but improper ZK node distribution could certainly
>invalidate some of Solr’s assumptions.
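For the "identify the problem documents" step, here is a rough SolrJ sketch that pulls the id field from two replica cores with distrib=false and prints the ids present on one but not the other. It assumes the id sets fit in memory (reasonable at ~90k docs) and that the core URLs below are placeholders:

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class ReplicaDiff {
      static Set<String> ids(String coreUrl) throws Exception {
        try (HttpSolrClient core = new HttpSolrClient(coreUrl)) {
          SolrQuery q = new SolrQuery("*:*");
          q.setFields("id");
          q.setRows(200000);          // fine for ~90k docs; use cursorMark paging for big cores
          q.set("distrib", "false");  // stay on this core only
          Set<String> ids = new HashSet<>();
          for (SolrDocument d : core.query(q).getResults()) {
            ids.add((String) d.getFieldValue("id"));
          }
          return ids;
        }
      }

      public static void main(String[] args) throws Exception {
        Set<String> r1 = ids("http://host1:8983/solr/mycollection_shard1_replica1");
        Set<String> r3 = ids("http://host3:8983/solr/mycollection_shard1_replica3");
        Set<String> onlyR1 = new HashSet<>(r1); onlyR1.removeAll(r3);
        Set<String> onlyR3 = new HashSet<>(r3); onlyR3.removeAll(r1);
        System.out.println("only on replica1: " + onlyR1);
        System.out.println("only on replica3: " + onlyR3);
      }
    }

The documents found on only one side are the ones to re-index once the replicas have been forced back in sync.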
Re: SolrCloud replicas out of sync
Tomás,

Good find, but I don’t think the rate of updates was high enough during the network outage to create the overrun situation described in the ticket.

I did notice that one of the proposed fixes, https://issues.apache.org/jira/browse/SOLR-8586, is an entire-index consistency check between leader and replica. I really hope they are able to get this to work. Ideally, the replicas would never become (permanently) inconsistent, but given that they do, it is crucial that SolrCloud can internally detect and fix the problem, no matter what caused it or how long ago it happened.

Regards,

David

On 1/28/16, 1:08 PM, "Tomás Fernández Löbbe" wrote:

>Maybe you are hitting the reordering issue described in SOLR-8129?
>
>Tomás
Re: Replicas for same shard not in sync
Erick,

So that my understanding is correct, let me ask: if one or more replicas are down, updates presented to the leader still succeed, right?

If so, tedsolr is correct that the Solr client app needs to re-issue updates if it wants stronger guarantees on replica consistency than what Solr provides. The “Write Fault Tolerance” section of the Solr Wiki makes what I believe is the same point:

"On the client side, if the achieved replication factor is less than the acceptable level, then the client application can take additional measures to handle the degraded state. For instance, a client application may want to keep a log of which update requests were sent while the state of the collection was degraded and then resend the updates once the problem has been resolved."

https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance

Kind Regards,

David

On 4/25/16, 11:57 AM, "Erick Erickson" wrote:

>bq: I also read that it's up to the
>client to keep track of updates in case commits don't happen on all the
>replicas.
>
>This is not true. Or if it is, it's a bug.
>
>The update cycle is this:
>1> updates get to the leader
>2> updates are sent to all followers and indexed on the leader as well
>3> each replica writes the updates to the local transaction log
>4> all the replicas ack back to the leader
>5> the leader responds to the client.
>
>At this point, all the replicas for the shard have the docs locally
>and can take over as leader.
>
>You may be confusing indexing in batches and having errors with
>updates getting to replicas. When you send a batch of docs to Solr,
>if one of them fails indexing some of the rest of the docs may not
>be indexed. See SOLR-445 for some work on this front.
>
>That said, bouncing servers willy-nilly during heavy indexing, especially
>if the indexer doesn't know enough to retry if an indexing attempt fails, may
>be the root cause here. Have you verified that your indexing program
>retries in the event of failure?
>
>Best,
>Erick
>
>On Mon, Apr 25, 2016 at 6:13 AM, tedsolr wrote:
>> I've done a bit of reading - found some other posts with similar questions.
>> So I gather "Optimizing" a collection is rarely a good idea. It does not
>> need to be condensed to a single segment. I also read that it's up to the
>> client to keep track of updates in case commits don't happen on all the
>> replicas. Solr will commit and return success as long as one replica gets
>> the update.
>>
>> I have a state where the two replicas for one collection are out of sync.
>> One has some updates that the other does not. And I don't have log data to
>> tell me what the differences are. This happened during a maintenance window
>> when the servers got restarted while a large index job was running. Normally
>> this doesn't cause a problem, but it did last Thursday.
>>
>> What I plan to do is select the replica I believe is incomplete and delete
>> it. Then add a new one. I was just hoping Solr had a solution for this -
>> maybe using the ZK transaction logs to replay some updates, or force a
>> resync between the replicas.
>>
>> I will also implement a fix to prevent Solr from restarting unless one of
>> its config files has changed. No need to bounce Solr just for kicks.
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Replicas-for-same-shard-not-in-sync-tp4272236p4272602.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
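The wiki passage quoted above translates fairly directly into code. A hedged SolrJ sketch of the pattern, assuming the achieved factor comes back under "rf" in the response header and that appending failed batch ids to a local file is an acceptable resend log:

    import java.io.FileWriter;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class DegradedWriteLog {
      // Send a batch; if fewer replicas than we want acknowledged it, record the ids for a later resend.
      static void indexBatch(CloudSolrClient client, List<SolrInputDocument> batch) throws Exception {
        UpdateRequest req = new UpdateRequest();
        req.add(batch);
        req.setParam("min_rf", "3");   // we want all 3 replicas to acknowledge
        Object rf = req.process(client, "mycollection").getResponseHeader().get("rf"); // assumed key
        if (rf == null || Integer.parseInt(rf.toString()) < 3) {
          try (FileWriter log = new FileWriter("degraded-updates.log", true)) {
            for (SolrInputDocument doc : batch) {
              log.write(doc.getFieldValue("id") + "\n");   // replay these once the cluster is healthy
            }
          }
        }
      }
    }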
Trouble getting "langid.map.individual" setting to work in Solr 5.0.x
I am trying to use the “langid.map.individual” setting to allow field “a” to detect as, say, English, and be mapped to “a_en”, while in the same document, field “b” detects as, say, German and is mapped to “b_de”.

What happens in my tests is that the global language is detected (for example, German), but BOTH fields are mapped to “_de” as a result. I cannot get individual detection or mapping to work. Am I mis-understanding the purpose of this setting?

Here is the resulting document from my test:

{
  "id": "1005!22345",
  "language": [ "de" ],
  "a_de": "A title that should be detected as English with high confidence",
  "b_de": "Die Einführung einer anlasslosen Speicherung von Passagierdaten für alle Flüge aus einem Nicht-EU-Staat in die EU und umgekehrt ist näher gerückt. Der Ausschuss des EU-Parlaments für bürgerliche Freiheiten, Justiz und Inneres (LIBE) hat heute mit knapper Mehrheit für einen entsprechenden Richtlinien-Entwurf der EU-Kommission gestimmt. Bürgerrechtler, Grüne und Linke halten die geplante Richtlinie für eine andere Form der anlasslosen Vorratsdatenspeicherung, die alle Flugreisenden zu Verdächtigen mache.",
  "_version_": 1508494723734569000
}

I expected “a_de” to be “a_en”, and the “language” multi-valued field to have “en” and “de”.

Here is my configuration in solrconfig.xml:

true a,b true true language af:uns,ar:uns,bg:uns,bn:uns,cs:uns,da:uns,el:uns,et:uns,fa:uns,fi:uns,gu:uns,he:uns,hi:uns,hr:uns,hu:uns,id:uns,ja:uns,kn:uns,ko:uns,lt:uns,lv:uns,mk:uns,ml:uns,mr:uns,ne:uns,nl:uns,no:uns,pa:uns,pl:uns,ro:uns,ru:uns,sk:uns,sl:uns,so:uns,sq:uns,sv:uns,sw:uns,ta:uns,te:uns,th:uns,tl:uns,tr:uns,uk:uns,ur:uns,vi:uns,zh-cn:uns,zh-tw:uns en

The debug output of lang detect, during indexing, is as follows:

---
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.964723182276
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Detected main document language from fields [a, b]: de
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field a
DEBUG - 2015-08-03 14:37:54.451; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field b
DEBUG - 2015-08-03 14:37:54.453; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.964723182276
DEBUG - 2015-08-03 14:37:54.453; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field a using individually detected language de
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing mapping from a with language de to field a_de
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.454; org.eclipse.jetty.webapp.WebAppClassLoader; loaded class org.apache.solr.common.SolrInputField from WebAppClassLoader=525571@80503
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Removing old field a
DEBUG - 2015-08-03 14:37:54.455; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field a
DEBUG - 2015-08-03 14:37:54.455; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field b
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.980402022373
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field b using individually detected language de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing mapping from b with language de to field b_de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Removing old field b
---

From this, my takeaway is that every time the LangDetectLanguageIdentifierUpdateProcessor is asked to detect the language, it is using field a AND b. But I can’t quite tell from this output.

Any insight appreciated.

Regards,

David
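A small SolrJ sketch of the kind of test harness that helps isolate this: send one document with both fields through the language-identification update chain and print the field names that come back. The chain name "langid", the collection name, and the ZK host are placeholders for whatever the actual solrconfig.xml defines:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class LangIdTest {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181")) {
          client.setDefaultCollection("langid_test");

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "langid-check-1");
          doc.addField("a", "A title that should be detected as English with high confidence");
          doc.addField("b", "Die Einführung einer anlasslosen Speicherung von Passagierdaten ...");

          UpdateRequest req = new UpdateRequest();
          req.add(doc);
          req.setParam("update.chain", "langid");   // route through the chain holding the langid processor
          req.process(client);
          client.commit();

          // Fetch the document back and see which mapped field names actually exist.
          SolrQuery q = new SolrQuery("id:\"langid-check-1\"");
          System.out.println(client.query(q).getResults().get(0).getFieldNames());
        }
      }
    }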
Identical query returning different aggregate results
I have a prototype SolrCloud 4.10.2 setup with 13 collections (of 1 replica, 1 shard each) and a separate 1-node Zookeeper 3.4.6.

The very first app test case I wrote is failing intermittently in this environment, when I only have 4 documents ingested into the cloud.

I dug in and found when I query against multiple collections, using the "collection=" parameter, the aggregates I request are correct about 50% of the time. The other 50% of the time, the aggregate returned by Solr is not correct. Note this is for the identical query. In other words, I can run the same query multiple times in a row, and get different answers.

The simplest version of the query that still exhibits the odd behavior is as follows:

http://192.168.59.103:8985/solr/query_handler/query?facet.range=eventDate&f.eventDate.facet.range.end=2014-12-31T23:59:59.999Z&f.eventDate.facet.range.gap=%2B1DAY&fl=eventDate,id&start=0&collection=2014_04,2014_03&rows=10&f.eventDate.facet.range.start=2014-01-01T00:00:00.000Z&q=*:*&f.eventDate.facet.mincount=1&facet=true

When it SUCCEEDS, the aggregate correctly appears like this:

"facet_counts":{
  "facet_queries":{},
  "facet_fields":{},
  "facet_dates":{},
  "facet_ranges":{
    "eventDate":{
      "counts":[
        "2014-04-01T00:00:00Z",3],
      "gap":"+1DAY",
      "start":"2014-01-01T00:00:00Z",
      "end":"2015-01-01T00:00:00Z"}},
  "facet_intervals":{}}}

When it FAILS, note that the counts[] array is empty:

"facet_counts":{
  "facet_queries":{},
  "facet_fields":{},
  "facet_dates":{},
  "facet_ranges":{
    "eventDate":{
      "counts":[],
      "gap":"+1DAY",
      "start":"2014-01-01T00:00:00Z",
      "end":"2015-01-01T00:00:00Z"}},
  "facet_intervals":{}}}

If I further simplify the query, by removing range options or reducing to one (1) collection name, then the problem goes away.

The solr logs are clean at INFO level, and there is no substantive difference in log output when the query succeeds vs fails, leaving me stumped where to look next. Suggestions welcome.

Regards,

David
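For reference, the same multi-collection range-facet query expressed with SolrJ (5.x client classes shown; ZK host and collection names are placeholders). Running it in a loop makes the intermittent empty counts[] easy to observe:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.response.RangeFacet;

    public class RangeFacetCheck {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient("192.168.59.103:2181")) {
          client.setDefaultCollection("2014_04");

          SolrQuery q = new SolrQuery("*:*");
          q.setRows(10);
          q.set("collection", "2014_04,2014_03");   // query both collections
          q.set("facet", "true");
          q.set("facet.range", "eventDate");
          q.set("f.eventDate.facet.range.start", "2014-01-01T00:00:00.000Z");
          q.set("f.eventDate.facet.range.end", "2014-12-31T23:59:59.999Z");
          q.set("f.eventDate.facet.range.gap", "+1DAY");
          q.set("f.eventDate.facet.mincount", "1");

          for (int i = 0; i < 10; i++) {
            QueryResponse rsp = client.query(q);
            RangeFacet<?, ?> range = rsp.getFacetRanges().get(0);
            System.out.println("run " + i + ": " + range.getCounts().size() + " buckets");
          }
        }
      }
    }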
Re: Identical query returning different aggregate results
Alex,

Good suggestion, but in this case, no. This example is from a cleanroom-type test environment where the collections have very recently been created, there are only 4 documents total across all collections, and no deletes have been issued.

Kind regards,

David

On Tuesday, December 16, 2014 12:01 PM, Alexandre Rafalovitch wrote:

Facet counts include deleted documents until the segments merge. Could that be an issue?

Regards,
Alex
Re: Identical query returning different aggregate results
Hi Erick,

Thanks for your reply. My test environment only has one shard and one replica per collection. So, I think there is no possibility of replicas getting out of sync. Here is how I create each (month-based) collection:

http://192.168.59.103:8983/solr/admin/collections?action=CREATE&name=2014_01&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=main_conf
http://192.168.59.103:8983/solr/admin/collections?action=CREATE&name=2014_02&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=main_conf
http://192.168.59.103:8983/solr/admin/collections?action=CREATE&name=2014_03&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=main_conf
...etc, etc...

Still, I think you are on to something. I had already noticed that querying one collection at a time works. For example, if I change my query oh-so-slightly from this:

"collection=2014_04,2014_03"

to this:

"...collection=2014_04"

then the results are correct 100% of the time. I think substantively this is the same as specifying the name of the shard since, again, in my test environment I only have one shard per collection anyway.

I should mention that the "2014_03" collection is empty. 0 documents. All 3 documents which satisfy the facet range are in the "2014_04" collection. So, it's a real head-scratcher that introducing that collection name into the query makes the results misbehave.

Kind regards,
David

On Tuesday, December 16, 2014 2:25 PM, Erick Erickson wrote:

bq: Facet counts include deleted documents until the segments merge

Whoa! Facet counts do _not_ require segment merging to be accurate. What merging does is remove the _term_ information associated with deleted documents, and removes their contribution to the TF/IDF scores.

David:

Hmmm, what happens if you direct the query not only to a single collection, but to a single shard? Add &distrib=false to the query and point it to each of your replicas (one collection at a time).

The expectation is that each replica for a slice within a collection has identical documents. One possibility is that somehow your shards are out of sync on a collection. So the internal load balancing that happens sometimes sends the query to one replica and sometimes to another. 2 replicas (leader and follower) and 50% failure, coincidence?

That just bumps the question up another level of course; the next question is _why_ is the shard out of sync. So in that case I'd issue a commit to all the collections on the off chance that somehow that didn't happen and try again (very low probability that this is the root cause, but you never know).

But it sure sounds like one replica doesn't agree with another, so the above will give us a place to look.

Best,
Erick
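For anyone following along, a hedged SolrJ sketch (5.x client classes) of Erick's "issue a commit to all the collections, then retest" suggestion. The ZK host and the list of month collections are placeholders:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.UpdateRequest;

    public class CommitAll {
      public static void main(String[] args) throws Exception {
        String[] collections = {"2014_01", "2014_02", "2014_03", "2014_04"}; // ...and so on
        try (CloudSolrClient client = new CloudSolrClient("192.168.59.103:2181")) {
          for (String coll : collections) {
            UpdateRequest commit = new UpdateRequest();
            commit.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); // waitFlush, waitSearcher
            commit.process(client, coll);
            System.out.println("committed " + coll);
          }
        }
      }
    }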
Re: Identical query returning different aggregate results
Chris,

Yes, your suggestion worked. Changing the parameter in my query from

...f.eventDate.facet.mincount=1...

to

...f.eventDate.facet.mincount=0...

worked around the problem. And I agree that SOLR-6154 describes what I observed almost exactly. Once 5.0 is available, I'll test this again with "mincount=1".

Thanks everyone for your help! It is very much appreciated.

Regards,

David

On Tuesday, December 16, 2014 4:38 PM, Chris Hostetter wrote:

sounds like this bug...

https://issues.apache.org/jira/browse/SOLR-6154

...in which case it has nothing to do with your use of multiple collections, it's just dependent on whether or not the first node to respond happens to have a doc in every "range bucket" .. any bucket missing (because of your mincount=1) from the first core to respond is then ignored in the response from the subsequent cores.

workaround is to set mincount=0 for your facet ranges.

-Hoss
http://www.lucidworks.com/
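A hedged SolrJ sketch (5.x client classes) of the workaround: request the range facet with mincount=0 and simply skip the zero buckets on the client side. The ZK host and collection names are placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.RangeFacet;

    public class MincountWorkaround {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient("192.168.59.103:2181")) {
          client.setDefaultCollection("2014_04");
          SolrQuery q = new SolrQuery("*:*");
          q.set("collection", "2014_04,2014_03");
          q.set("facet", "true");
          q.set("facet.range", "eventDate");
          q.set("f.eventDate.facet.range.start", "2014-01-01T00:00:00.000Z");
          q.set("f.eventDate.facet.range.end", "2014-12-31T23:59:59.999Z");
          q.set("f.eventDate.facet.range.gap", "+1DAY");
          q.set("f.eventDate.facet.mincount", "0");   // SOLR-6154 workaround: ask Solr for every bucket

          RangeFacet<?, ?> range = client.query(q).getFacetRanges().get(0);
          for (RangeFacet.Count bucket : range.getCounts()) {
            if (bucket.getCount() > 0) {              // drop the empty buckets ourselves
              System.out.println(bucket.getValue() + " -> " + bucket.getCount());
            }
          }
        }
      }
    }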
Slow faceting performance on a docValues field
I have a query against a single 50M doc index (175GB) using Solr 4.10.2, that exhibits the following response times (via the debugQuery option in Solr Admin):

"process": {
  "time": 24709,
  "query": { "time": 54 },
  "facet": { "time": 24574 },

The query time of 54ms is great and exactly as expected -- this example was a single-term search that returned 3 hits.

I am trying to get the facet time (24.5 seconds) to be sub-second, and am having no luck. The facet part of the query is as follows:

"params": {
  "facet.range": "eventDate",
  "f.eventDate.facet.range.end": "2015-05-13T16:37:18.000Z",
  "f.eventDate.facet.range.gap": "+1DAY",
  "start": "0",
  "rows": "10",
  "f.eventDate.facet.range.start": "2005-03-13T16:37:18.000Z",
  "f.eventDate.facet.mincount": "1",
  "facet": "true",
  "debugQuery": "true",
  "_": "1421169383802"
}

And, the relevant schema definition is as follows:

During the 25-second query, the Solr JVM pegs one CPU, with little or no I/O activity detected on the drive that holds the 175GB index. I have 48GB of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.

I do NOT have any fieldValue caches configured as yet, because my (perhaps too simplistic?) reading of the documentation was that DocValues eliminates the need for a field-level cache on this facet field.

Any suggestions welcome.

Regards,

David
Re: Slow faceting performance on a docValues field
Shawn,

Thanks for the suggestion, but experimentally, in my case the same query with facet.method=enum returns in almost the same amount of time.

Regards

David

On Tuesday, January 13, 2015 12:02 PM, Shawn Heisey wrote:

24GB of RAM to cache 175GB is probably not enough in the general case, but if you're seeing very little disk I/O activity for this query, then we'll leave that alone and you can worry about it later.

What I would try immediately is setting the facet.method parameter to enum and seeing what that does to the facet time. I've had good luck generally with that, even in situations where the docs indicated that the default (fc) was supposed to work better. I have never explored the relationship between facet.method and docValues, though.

I'm out of ideas after this. I don't have enough experience with faceting to help much.

Thanks,
Shawn
Re: Slow faceting performance on a docValues field
Tomás,

Thanks for the response -- the performance of my query makes perfect sense in light of your information.

I looked at Interval faceting. My required interval is 1 day. I cannot change that requirement. Unless I am mis-reading the doc, that means to facet a 10 year range, the query needs to specify over 3,600 intervals ??

f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]&f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]&etc,etc

Each query would be 185MB in size if I structure it this way.

I assume I must be mis-understanding how to use Interval faceting with dates. Are there any concrete examples you know of? A google search did not come up with much.

Kind regards,
Dave

On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe wrote:

Range Faceting won't use the DocValues even if they are set; it translates each gap to a filter. This means that it will end up using the FilterCache, which should cause faster followup queries if you repeat the same gaps (and don't commit).

You may also want to try interval faceting, it will use DocValues instead of filters. The API is different, you'll have to provide the intervals yourself.

Tomás
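On the "3,600 intervals" concern: the interval strings do not have to be written by hand; they can be generated programmatically. A hedged SolrJ sketch, assuming the 4.10+ client's addIntervalFacets/getIntervalFacets helpers and using placeholder field, query, and connection values:

    import java.time.LocalDate;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.IntervalFacet;

    public class DailyIntervalFacet {
      public static void main(String[] args) throws Exception {
        // Build one [start,end] interval string per day across the 10-year window.
        List<String> intervals = new ArrayList<>();
        for (LocalDate day = LocalDate.of(2005, 1, 1); day.isBefore(LocalDate.of(2015, 1, 1)); day = day.plusDays(1)) {
          intervals.add("[" + day + "T00:00:00.000Z," + day + "T23:59:59.999Z]");
        }

        try (CloudSolrClient client = new CloudSolrClient("zk1:2181")) {
          client.setDefaultCollection("mycollection");
          SolrQuery q = new SolrQuery("someField:someTerm");
          q.setRows(10);
          q.addIntervalFacets("eventDate", intervals.toArray(new String[0]));

          for (IntervalFacet facet : client.query(q).getIntervalFacets()) {
            for (IntervalFacet.Count count : facet.getIntervals()) {
              if (count.getCount() > 0) {           // print only the non-empty days
                System.out.println(count.getKey() + " -> " + count.getCount());
              }
            }
          }
        }
      }
    }

Whether several thousand intervals actually beats the filter-per-gap behavior of range faceting is something to measure, of course.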
Re: Slow faceting performance on a docValues field
What is stumping me is that the search result has 3 hits, yet faceting those 3 hits takes 24 seconds.

The documentation for facet.method=fc is quite explicit about how Solr does faceting:

"fc (stands for Field Cache) The facet counts are calculated by iterating over documents that match the query and summing the terms that appear in each document. This was the default method for single valued fields prior to Solr 1.4."

If a search yielded millions of hits, I could understand 24 seconds to calculate the facets. But not for a search with only 3 hits. What am I missing?

Regards,

David

On Tuesday, January 13, 2015 1:12 PM, Tomás Fernández Löbbe wrote:

No, you are not misreading, right now there is no automatic way of generating the intervals on the server side similar to range faceting... I guess it won't work in your case. Maybe you should create a Jira to add this feature to interval faceting.

Tomás
Re: Slow faceting performance on a docValues field
Shawn,

I've been thinking along your lines, and continued to run tests through the day. The results surprised me.

For my index, Solr range faceting time is most closely related to the total number of documents in the index for the range specified. The number of "buckets" in the range is a second factor. I found NO correlation whatsoever to the number of hits in the query. Whether I have 3 hits or 1,500,000 hits, it's ~24 seconds to facet the result for that same time period. That is what surprised me.

For example, if my facet range is a 10 year period for which there exists 47M docs in the index, the facet time is 24 seconds. If I switch my facet range to a different 10 year period with 1.3M docs, the facet time drops to less than 5 seconds.

If I go back to my original 10 year period (with 47M docs in the index), but facet by month instead of day, my facet time drops to 2.5 seconds. Now, I can't meet my user needs this way, but it does show the relationship between # of buckets and faceting time.

Regards,

David
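A hedged sketch of the kind of timing comparison described above: run the same facet query with different gaps (or date ranges) and compare wall-clock time. The field name, query, dates, and connection details are placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;

    public class FacetGapTiming {
      static long timeFacet(CloudSolrClient client, String start, String end, String gap) throws Exception {
        SolrQuery q = new SolrQuery("someField:someTerm");
        q.setRows(0);
        q.set("facet", "true");
        q.set("facet.range", "eventDate");
        q.set("f.eventDate.facet.range.start", start);
        q.set("f.eventDate.facet.range.end", end);
        q.set("f.eventDate.facet.range.gap", gap);
        q.set("f.eventDate.facet.mincount", "1");
        long t0 = System.currentTimeMillis();
        client.query(q);                       // measure the whole request, dominated by facet time here
        return System.currentTimeMillis() - t0;
      }

      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181")) {
          client.setDefaultCollection("mycollection");
          System.out.println("+1DAY   over 2005-2015: "
              + timeFacet(client, "2005-03-13T16:37:18.000Z", "2015-05-13T16:37:18.000Z", "+1DAY") + " ms");
          System.out.println("+1MONTH over 2005-2015: "
              + timeFacet(client, "2005-03-13T16:37:18.000Z", "2015-05-13T16:37:18.000Z", "+1MONTH") + " ms");
        }
      }
    }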