SolrCloud replicas out of sync

2016-01-22 Thread David Smith
I have a SolrCloud v5.4 collection with 3 replicas that appear to have fallen 
permanently out of sync.  Users started to complain that the same search, 
executed twice, sometimes returned different result counts.  Sure enough, our 
replicas are not identical:

>> shard1_replica1:  89867 documents / version 1453479763194
>> shard1_replica2:  89866 documents / version 1453479763194
>> shard1_replica3:  89867 documents / version 1453479763191

I do not think this discrepancy is going to resolve itself.  The Solr Admin 
screen reports all 3 replicas as “Current”.  The last modification to this 
collection was 2 hours before I captured this information, and our auto commit 
time is 60 seconds.  
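For anyone wanting to reproduce this comparison: the counts above came from 
querying each core directly. A sketch of how that can be automated, with the 
host name hypothetical and the counts hard-coded from the numbers above rather 
than fetched live:

```python
# Sketch: build per-core count queries (distrib=false keeps the query
# local to that core) and flag replicas whose counts disagree.
# Host/port and core names are placeholders.
from urllib.parse import urlencode

def count_url(host, core):
    """URL asking one core for its local doc count only (no fan-out)."""
    params = {"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"}
    return f"http://{host}:8983/solr/{core}/select?{urlencode(params)}"

# Counts as observed above; in practice these would be read from the
# numFound field of each core's response.
observed = {"shard1_replica1": 89867,
            "shard1_replica2": 89866,
            "shard1_replica3": 89867}

def in_sync(counts):
    return len(set(counts.values())) == 1

print(in_sync(observed))  # False: replica2 is one document behind
```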

I have a lot of concerns here, but my first question is whether anyone else has 
had problems with out-of-sync replicas, and if so, what they did to correct 
them?

Kind Regards,

David



Re: SolrCloud replicas out of sync

2016-01-26 Thread David Smith
Thanks Jeff!  A few comments

>>
>> Although you could probably bounce a node and get your document counts back 
>> in sync (by provoking a check)
>>
 

If the check is a simple doc count, that will not work. We have found that 
replica1 and replica3, although they contain the same doc count, don’t have the 
SAME docs.  They each missed at least one update, but of different docs.  This 
also means none of our three replicas are complete.
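A doc-count check indeed cannot catch this; only comparing the actual id sets 
can. A minimal sketch, with invented id sets that mirror the situation 
described above:

```python
# Sketch: two replicas with equal doc counts can still hold different
# docs. The id sets below are hypothetical, mirroring this thread.
r1 = {"doc1", "doc2", "doc4"}   # missed the update for doc3
r3 = {"doc1", "doc2", "doc3"}   # missed doc4, yet same count as r1

def missing_updates(a, b):
    """Docs each side is missing relative to the other."""
    return {"only_in_first": a - b, "only_in_second": b - a}

diff = missing_updates(r1, r3)
print(len(r1) == len(r3))   # True: counts match...
print(diff)                 # ...but the documents differ
```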

>>
>>it’s interesting that you’re in this situation. It implies to me that at some 
>>point the leader couldn’t write a doc to one of the replicas,
>>

That is our belief as well. We experienced a datacenter-wide network disruption 
of a few seconds, and user complaints started the first workday after that 
event.  

The most interesting log entry during the outage is this:

"1/19/2016, 5:08:07 PM ERROR null DistributedUpdateProcessor Request says it is 
coming from leader, but we are the leader: 
update.distrib=FROMLEADER&distrib.from=http://dot.dot.dot.dot:8983/solr/blah_blah_shard1_replica3/&wt=javabin&version=2"

>>
>> You might watch the achieved replication factor of your updates and see if 
>> it ever changes
>>

This is a good tip. I’m not sure I like the implication that any failure to 
write all 3 of our replicas must be retried at the app layer.  Is this really 
how SolrCloud applications must be built to survive network partitions without 
data loss? 
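Watching the achieved replication factor and retrying is essentially an 
app-layer loop. A sketch of that loop with the Solr call stubbed out; the 
stub and its canned return values are hypothetical, standing in for a real 
client call that passes min_rf and reads the achieved "rf" back:

```python
# Sketch: retry an update until the achieved replication factor
# reaches the target. solr_update() is a stand-in for a real call.
def solr_update(doc, attempt, responses):
    return responses[attempt]       # simulated achieved rf per attempt

def send_with_retry(doc, target_rf, responses, max_attempts=3):
    for attempt in range(max_attempts):
        achieved = solr_update(doc, attempt, responses)
        if achieved >= target_rf:
            return attempt + 1      # number of attempts used
    raise RuntimeError("replication factor never reached target")

# First attempt reaches only 2 of 3 replicas; the retry reaches all 3.
attempts = send_with_retry({"id": "1"}, target_rf=3, responses=[2, 3])
print(attempts)  # 2
```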

Regards,

David


On 1/26/16, 12:20 PM, "Jeff Wartes"  wrote:

>
>My understanding is that the "version" represents the timestamp the searcher 
>was opened, so it doesn’t really offer any assurances about your data.
>
>Although you could probably bounce a node and get your document counts back in 
>sync (by provoking a check), it’s interesting that you’re in this situation. 
>It implies to me that at some point the leader couldn’t write a doc to one of 
>the replicas, but that the replica didn’t consider itself down enough to check 
>itself.
>
>You might watch the achieved replication factor of your updates and see if it 
>ever changes:
>https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
> (See Achieved Replication Factor/min_rf)
>
>If it does, that might give you clues about how this is happening. Also, it 
>might allow you to work around the issue by trying the write again.
>
>
>
>
>
>
>On 1/22/16, 10:52 AM, "David Smith"  wrote:
>
>>I have a SolrCloud v5.4 collection with 3 replicas that appear to have fallen 
>>permanently out of sync.  Users started to complain that the same search, 
>>executed twice, sometimes returned different result counts.  Sure enough, our 
>>replicas are not identical:
>>
>>>> shard1_replica1:  89867 documents / version 1453479763194
>>>> shard1_replica2:  89866 documents / version 1453479763194
>>>> shard1_replica3:  89867 documents / version 1453479763191
>>
>>I do not think this discrepancy is going to resolve itself.  The Solr Admin 
>>screen reports all 3 replicas as “Current”.  The last modification to this 
>>collection was 2 hours before I captured this information, and our auto 
>>commit time is 60 seconds.  
>>
>>I have a lot of concerns here, but my first question is if anyone else has 
>>had problems with out of sync replicas, and if so, what they have done to 
>>correct this?
>>
>>Kind Regards,
>>
>>David
>>



Re: SolrCloud replicas out of sync

2016-01-27 Thread David Smith
Jeff, again, very much appreciate your feedback.  

It is interesting — the article you linked to by Shalin is exactly why we 
picked SolrCloud over ES, because (eventual) consistency is critical for our 
application and we will sacrifice availability for it.  To be clear, after the 
outage, NONE of our three replicas are correct or complete.

So we definitely don’t have CP yet — our very first network outage resulted in 
multiple overlapped lost updates.  As a result, I can’t pick one replica and 
make it the new “master”.  I must rebuild this collection from scratch, which I 
can do, but that requires downtime which is a problem in our app (24/7 High 
Availability with few maintenance windows).


So, I definitely need to “fix” this somehow.  I wish I could outline a 
reproducible test case, but as the root cause is likely very tight timing 
issues and complicated interactions with Zookeeper, that is not really an 
option.  I’m happy to share the full logs of all 3 replicas though if that 
helps.

I am curious, though, whether thinking has changed since 
https://issues.apache.org/jira/browse/SOLR-5468 about seriously considering a 
“majority quorum” model, with rollback?  Done properly, this should be free of 
all lost update problems, at the cost of availability.  Some SolrCloud users 
(like us!!!) would gladly accept that tradeoff.  
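For what it's worth, the accept/reject rule under a majority-quorum model is 
simple to state. A toy sketch of the model discussed in SOLR-5468, not of 
anything SolrCloud actually does today:

```python
# Sketch of majority-quorum acknowledgement: an update is durable only
# if more than half of the replicas ack it; otherwise it is rolled
# back rather than silently lost on a minority partition.
def quorum_accepts(acks, replica_count):
    return acks > replica_count // 2

print(quorum_accepts(2, 3))  # True: 2 of 3 is a majority
print(quorum_accepts(1, 3))  # False: rolled back, not lost silently
```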

Regards

David


On 1/26/16, 4:32 PM, "Jeff Wartes"  wrote:

>
>Ah, perhaps you fell into something like this then? 
>https://issues.apache.org/jira/browse/SOLR-7844
>
>That says it’s fixed in 5.4, but that would be an example of a split-brain 
>type incident, where different documents were accepted by different replicas 
>who each thought they were the leader. If this is the case, and you actually 
>have different data on each replica, I’m not aware of any way to fix the 
>problem short of reindexing those documents. Before that, you’ll probably need 
>to choose a replica and just force the others to get in sync with it. I’d 
>choose the current leader, since that’s slightly easier.
>
>Typically, a leader writes an update to its transaction log, then sends the 
>request to all replicas, and when those all finish it acknowledges the update. 
>If a replica gets restarted, and is less than N documents behind, the leader 
>will only replay that transaction log. (Where N is the numRecordsToKeep 
>configured in the updateLog section of solrconfig.xml)
>
>What you want is to provoke the heavy-duty process normally invoked if a 
>replica has missed more than N docs, which essentially does a checksum and 
>file copy on all the raw index files. FetchIndex would probably work, but it’s 
>a replication handler API originally designed for master/slave replication, so 
>take care: https://wiki.apache.org/solr/SolrReplication#HTTP_API
>Probably a lot easier would be to just delete the replica and re-create it. 
>That will also trigger a full file copy of the index from the leader onto the 
>new replica.
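The delete-and-recreate route goes through the Collections API. A sketch of 
the request URLs involved; host, collection, shard, replica, and node names 
are placeholders:

```python
# Sketch: Collections API calls to drop an out-of-sync replica and add
# a fresh one, triggering a full index copy from the leader.
# Host, collection, shard, replica, and node names are placeholders.
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/admin/collections"

def delete_replica(collection, shard, replica):
    q = urlencode({"action": "DELETEREPLICA", "collection": collection,
                   "shard": shard, "replica": replica})
    return f"{BASE}?{q}"

def add_replica(collection, shard, node):
    q = urlencode({"action": "ADDREPLICA", "collection": collection,
                   "shard": shard, "node": node})
    return f"{BASE}?{q}"

print(delete_replica("blah_blah", "shard1", "core_node2"))
print(add_replica("blah_blah", "shard1", "host2:8983_solr"))
```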
>
>I think design decisions around Solr generally use CP as a goal. (I sometimes 
>wish I could get more AP behavior!) See posts like this: 
>http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks/
> 
>So the fact that you encountered this sounds like a bug to me.
>That said, another general recommendation (of mine) is that you not use Solr 
>as your primary data source, so you can rebuild your index from scratch if you 
>really need to. 
>
>
>
>
>
>
>On 1/26/16, 1:10 PM, "David Smith"  wrote:
>
>>Thanks Jeff!  A few comments
>>
>>>>
>>>> Although you could probably bounce a node and get your document counts 
>>>> back in sync (by provoking a check)
>>>>
>> 
>>
>>If the check is a simple doc count, that will not work. We have found that 
>>replica1 and replica3, although they contain the same doc count, don’t have 
>>the SAME docs.  They each missed at least one update, but of different docs.  
>>This also means none of our three replicas are complete.
>>
>>>>
>>>>it’s interesting that you’re in this situation. It implies to me that at 
>>>>some point the leader couldn’t write a doc to one of the replicas,
>>>>
>>
>>That is our belief as well. We experienced a datacenter-wide network 
>>disruption of a few seconds, and user complaints started the first workday 
>>after that event.  
>>
>>The most interesting log entry during the outage is this:
>>
>>"1/19/2016, 5:08:07 PM ERROR null DistributedUpdateProcessorRequest says it 
>>is coming from leader,​ but we are the leader: 
>>update.distrib=FROMLEADER&distrib.from=http://dot.dot.dot.dot:8983/solr/blah_blah_shard1_replica3/&wt=javabin&version

Re: SolrCloud replicas out of sync

2016-01-27 Thread David Smith
Sure.  Here is our SolrCloud cluster:

   + Three (3) instances of Zookeeper on three separate (physical) servers.  
The ZK servers are beefy and fairly recently built, with 2x10 GigE (bonded) 
Ethernet connectivity to the rest of the data center.  We recognize the importance 
of the stability and responsiveness of ZK to the stability of SolrCloud as a 
whole.

   + 364 collections, all with single shards and a replication factor of 3.  
Currently housing only 100,000,000 documents in aggregate.  Expected to grow to 
25 billion+.  The size of a single document would be considered “large”, by the 
standards of what I’ve seen posted elsewhere on this mailing list. 
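Back-of-the-envelope, that growth target implies a very different per-shard 
load than today. A rough calculation, assuming documents stay evenly spread 
across the 364 single-shard collections:

```python
# Rough capacity arithmetic for the cluster described above, assuming
# an even spread of documents across the 364 single-shard collections.
collections = 364
docs_now = 100_000_000
docs_target = 25_000_000_000

per_shard_now = docs_now // collections
per_shard_target = docs_target // collections
print(per_shard_now)     # roughly 275K docs per shard today
print(per_shard_target)  # roughly 68.7M docs per shard at 25B total
```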

We are always open to ZK recommendations from you or anyone else, particularly 
for running a SolrCloud cluster of this size.

Kind Regards,

David



On 1/27/16, 12:46 PM, "Jeff Wartes"  wrote:

>
>If you can identify the problem documents, you can just re-index those after 
>forcing a sync. Might save a full rebuild and downtime.
>
>You might describe your cluster setup, including ZK. it sounds like you’ve 
>done your research, but improper ZK node distribution could certainly 
>invalidate some of Solr’s assumptions.
>
>
>
>
>On 1/27/16, 7:59 AM, "David Smith"  wrote:
>
>>Jeff, again, very much appreciate your feedback.  
>>
>>It is interesting — the article you linked to by Shalin is exactly why we 
>>picked SolrCloud over ES, because (eventual) consistency is critical for our 
>>application and we will sacrifice availability for it.  To be clear, after 
>>the outage, NONE of our three replicas are correct or complete.
>>
>>So we definitely don’t have CP yet — our very first network outage resulted 
>>in multiple overlapped lost updates.  As a result, I can’t pick one replica 
>>and make it the new “master”.  I must rebuild this collection from scratch, 
>>which I can do, but that requires downtime which is a problem in our app 
>>(24/7 High Availability with few maintenance windows).
>>
>>
>>So, I definitely need to “fix” this somehow.  I wish I could outline a 
>>reproducible test case, but as the root cause is likely very tight timing 
>>issues and complicated interactions with Zookeeper, that is not really an 
>>option.  I’m happy to share the full logs of all 3 replicas though if that 
>>helps.
>>
>>I am curious though if the thoughts have changed since 
>>https://issues.apache.org/jira/browse/SOLR-5468 of seriously considering a 
>>“majority quorum” model, with rollback?  Done properly, this should be free 
>>of all lost update problems, at the cost of availability.  Some SolrCloud 
>>users (like us!!!) would gladly accept that tradeoff.  
>>
>>Regards
>>
>>David
>>
>>



Re: SolrCloud replicas out of sync

2016-01-29 Thread David Smith
Tomás,

Good find, but I don’t think the rate of updates was high enough during the 
network outage to create the overrun situation described in the ticket.

I did notice that one of the proposed fixes, 
https://issues.apache.org/jira/browse/SOLR-8586, is an entire-index consistency 
check between leader and replica.  I really hope they are able to get this to 
work.  Ideally, the replicas would never become (permanently) inconsistent, but 
given that they do, it is crucial that SolrCloud can internally detect and fix, 
no matter what caused it or how long ago it happened.
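The kind of check SOLR-8586 aims at can be pictured as comparing per-document 
versions between leader and replica. A toy sketch with invented data, not the 
actual proposed implementation:

```python
# Sketch: detect per-document divergence by comparing id -> _version_
# maps from a leader and a replica. The maps below are invented.
leader  = {"doc1": 10, "doc2": 12, "doc3": 15}
replica = {"doc1": 10, "doc2": 11}        # stale doc2, missing doc3

def divergent_docs(leader, replica):
    stale   = {d for d in replica if leader.get(d) != replica[d]}
    missing = set(leader) - set(replica)
    return stale | missing

print(sorted(divergent_docs(leader, replica)))  # ['doc2', 'doc3']
```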


Regards,

David



On 1/28/16, 1:08 PM, "Tomás Fernández Löbbe"  wrote:

>Maybe you are hitting the reordering issue described in SOLR-8129?
>
>Tomás
>
>On Wed, Jan 27, 2016 at 11:32 AM, David Smith 
>wrote:
>
>> Sure.  Here is our SolrCloud cluster:
>>
>>+ Three (3) instances of Zookeeper on three separate (physical)
>> servers.  The ZK servers are beefy and fairly recently built, with 2x10
>> GigE (bonded) Ethernet connectivity to the rest of the data center.  We
>> recognize importance of the stability and responsiveness of ZK to the
>> stability of SolrCloud as a whole.
>>
>>+ 364 collections, all with single shards and a replication factor of
>> 3.  Currently housing only 100,000,000 documents in aggregate.  Expected to
>> grow to 25 billion+.  The size of a single document would be considered
>> “large”, by the standards of what I’ve seen posted elsewhere on this
>> mailing list.
>>
>> We are always open to ZK recommendations from you or anyone else,
>> particularly for running a SolrCloud cluster of this size.
>>
>> Kind Regards,
>>
>> David
>>
>>
>>
>> On 1/27/16, 12:46 PM, "Jeff Wartes"  wrote:
>>
>> >
>> >If you can identify the problem documents, you can just re-index those
>> after forcing a sync. Might save a full rebuild and downtime.
>> >
>> >You might describe your cluster setup, including ZK. it sounds like
>> you’ve done your research, but improper ZK node distribution could
>> certainly invalidate some of Solr’s assumptions.
>> >
>> >
>> >
>> >
>> >On 1/27/16, 7:59 AM, "David Smith"  wrote:
>> >
>> >>Jeff, again, very much appreciate your feedback.
>> >>
>> >>It is interesting — the article you linked to by Shalin is exactly why
>> we picked SolrCloud over ES, because (eventual) consistency is critical for
>> our application and we will sacrifice availability for it.  To be clear,
>> after the outage, NONE of our three replicas are correct or complete.
>> >>
>> >>So we definitely don’t have CP yet — our very first network outage
>> resulted in multiple overlapped lost updates.  As a result, I can’t pick
>> one replica and make it the new “master”.  I must rebuild this collection
>> from scratch, which I can do, but that requires downtime which is a problem
>> in our app (24/7 High Availability with few maintenance windows).
>> >>
>> >>
>> >>So, I definitely need to “fix” this somehow.  I wish I could outline a
>> reproducible test case, but as the root cause is likely very tight timing
>> issues and complicated interactions with Zookeeper, that is not really an
>> option.  I’m happy to share the full logs of all 3 replicas though if that
>> helps.
>> >>
>> >>I am curious though if the thoughts have changed since
>> https://issues.apache.org/jira/browse/SOLR-5468 of seriously considering
>> a “majority quorum” model, with rollback?  Done properly, this should be
>> free of all lost update problems, at the cost of availability.  Some
>> SolrCloud users (like us!!!) would gladly accept that tradeoff.
>> >>
>> >>Regards
>> >>
>> >>David
>> >>
>> >>
>>
>>



Re: Replicas for same shard not in sync

2016-04-25 Thread David Smith
Erick,

So that my understanding is correct, let me ask, if one or more replicas are 
down, updates presented to the leader still succeed, right?  If so, tedsolr is 
correct that the Solr client app needs to re-issue updates, if it wants 
stronger guarantees on replica consistency than what Solr provides.

The “Write Fault Tolerance” section of the Solr Wiki makes what I believe is 
the same point:

"On the client side, if the achieved replication factor is less than the 
acceptable level, then the client application can take additional measures to 
handle the degraded state. For instance, a client application may want to keep 
a log of which update requests were sent while the state of the collection was 
degraded and then resend the updates once the problem has been resolved."


https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
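The wiki's suggestion amounts to keeping a client-side journal of updates 
sent while degraded and replaying it later. A minimal sketch; the send 
callable is a stand-in for a real client call, not an actual Solr API:

```python
# Sketch: journal updates whose achieved replication factor fell
# short, then resend them once the cluster recovers. send() is a
# stand-in returning the achieved rf for each update.
class ResilientIndexer:
    def __init__(self, target_rf, send):
        self.target_rf = target_rf
        self.send = send          # callable(doc) -> achieved rf
        self.journal = []         # updates to replay after recovery

    def index(self, doc):
        if self.send(doc) < self.target_rf:
            self.journal.append(doc)

    def replay(self):
        pending, self.journal = self.journal, []
        for doc in pending:
            self.index(doc)

# Degraded at first (rf=2 of 3), healthy afterwards.
rfs = iter([2, 3])
ix = ResilientIndexer(3, lambda doc: next(rfs))
ix.index({"id": "1"})
print(len(ix.journal))  # 1: journaled while degraded
ix.replay()
print(len(ix.journal))  # 0: resent successfully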


Kind Regards,

David




On 4/25/16, 11:57 AM, "Erick Erickson"  wrote:

>bq: I also read that it's up to the
>client to keep track of updates in case commits don't happen on all the
>replicas.
>
>This is not true. Or if it is, it's a bug.
>
>The update cycle is this:
>1> updates get to the leader
>2> updates are sent to all followers and indexed on the leader as well
>3> each replica writes the updates to the local transaction log
>4> all the replicas ack back to the leader
>5> the leader responds to the client.
>
>At this point, all the replicas for the shard have the docs locally
>and can take over as leader.
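The five-step cycle above can be modeled in a few lines: the leader only 
acknowledges the client after every replica has acked. A simplified model, 
not actual Solr code:

```python
# Simplified model of the update cycle: leader indexes and logs the
# update, fans it out to replicas, and acks the client only once all
# replicas have written it locally and acked back.
def handle_update(doc, replicas):
    logs = {"leader": [doc]}             # steps 1-2: leader indexes/logs
    acks = []
    for name in replicas:
        logs[name] = [doc]               # step 3: replica writes tlog
        acks.append(name)                # step 4: replica acks leader
    if len(acks) == len(replicas):
        return "OK", logs                # step 5: leader responds
    return "FAIL", logs

status, logs = handle_update({"id": "1"}, ["replica1", "replica2"])
print(status)         # OK
print(sorted(logs))   # every node now holds the doc locally
```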
>
>You may be confusing indexing in batches and having errors with
>updates getting to replicas. When you send a batch of docs to Solr,
>if one of them fails indexing some of the rest of the docs may not
>be indexed. See SOLR-445 for some work on this front.
>
>That said, bouncing servers willy-nilly during heavy indexing, especially
>if the indexer doesn't know enough to retry if an indexing attempt fails may
>be the root cause here. Have you verified that your indexing program
>retries in the event of failure?
>
>Best,
>Erick
>
>On Mon, Apr 25, 2016 at 6:13 AM, tedsolr  wrote:
>> I've done a bit of reading - found some other posts with similar questions.
>> So I gather "Optimizing" a collection is rarely a good idea. It does not
>> need to be condensed to a single segment. I also read that it's up to the
>> client to keep track of updates in case commits don't happen on all the
>> replicas. Solr will commit and return success as long as one replica gets
>> the update.
>>
>> I have a state where the two replicas for one collection are out of sync.
>> One has some updates that the other does not. And I don't have log data to
>> tell me what the differences are. This happened during a maintenance window
>> when the servers got restarted while a large index job was running. Normally
>> this doesn't cause a problem, but it did last Thursday.
>>
>> What I plan to do is select the replica I believe is incomplete and delete
>> it. Then add a new one. I was just hoping Solr had a solution for this -
>> maybe using the ZK transaction logs to replay some updates, or force a
>> resync between the replicas.
>>
>> I will also implement a fix to prevent Solr from restarting unless one of
>> its config files has changed. No need to bounce Solr just for kicks.
>>
>>
>>



Trouble getting "langid.map.individual" setting to work in Solr 5.0.x

2015-08-03 Thread David Smith
I am trying to use the “langid.map.individual” setting to allow field “a” to 
detect as, say, English, and be mapped to “a_en”, while in the same document, 
field “b” detects as, say, German and is mapped to “b_de”.

What happens in my tests is that the global language is detected (for example, 
German), but BOTH fields are mapped to “_de” as a result.  I cannot get 
individual detection or mapping to work.  Am I misunderstanding the purpose of 
this setting?

Here is the resulting document from my test:


  {
"id": "1005!22345",
"language": [
  "de"
],
"a_de": "A title that should be detected as English with high 
confidence",
"b_de": "Die Einführung einer anlasslosen Speicherung von 
Passagierdaten für alle Flüge aus einem Nicht-EU-Staat in die EU und umgekehrt 
ist näher gerückt. Der Ausschuss des EU-Parlaments für bürgerliche Freiheiten, 
Justiz und Inneres (LIBE) hat heute mit knapper Mehrheit für einen 
entsprechenden Richtlinien-Entwurf der EU-Kommission gestimmt. Bürgerrechtler, 
Grüne und Linke halten die geplante Richtlinie für eine andere Form der 
anlasslosen Vorratsdatenspeicherung, die alle Flugreisenden zu Verdächtigen 
mache.",
"_version_": 1508494723734569000
  }


I expected “a_de” to be “a_en”, and the “language” multi-valued field to have 
“en” and “de”.
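What I expected langid.map.individual to do, expressed as a sketch. The 
detect() function here is a canned fake standing in for the real language 
detector; only the mapping logic is the point:

```python
# Sketch of the per-field mapping I expected: each listed field is
# detected independently and renamed with its own language suffix.
# detect() is a canned fake standing in for the real detector.
def detect(text):
    return "de" if "Speicherung" in text else "en"

def map_individual(doc, fields):
    out = {k: v for k, v in doc.items() if k not in fields}
    langs = []
    for f in fields:
        lang = detect(doc[f])
        out[f"{f}_{lang}"] = doc[f]
        langs.append(lang)
    out["language"] = langs
    return out

doc = {"id": "1005!22345",
       "a": "A title that should be detected as English",
       "b": "Die Einführung einer anlasslosen Speicherung ..."}
result = map_individual(doc, ["a", "b"])
print(sorted(result))  # ['a_en', 'b_de', 'id', 'language']
```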

Here is my configuration in solrconfig.xml:

  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <bool name="langid">true</bool>
      <str name="langid.fl">a,b</str>
      <bool name="langid.map">true</bool>
      <bool name="langid.map.individual">true</bool>
      <str name="langid.langField">language</str>
      <str name="langid.map.lcmap">af:uns,ar:uns,bg:uns,bn:uns,cs:uns,da:uns,el:uns,et:uns,fa:uns,fi:uns,gu:uns,he:uns,hi:uns,hr:uns,hu:uns,id:uns,ja:uns,kn:uns,ko:uns,lt:uns,lv:uns,mk:uns,ml:uns,mr:uns,ne:uns,nl:uns,no:uns,pa:uns,pl:uns,ro:uns,ru:uns,sk:uns,sl:uns,so:uns,sq:uns,sv:uns,sw:uns,ta:uns,te:uns,th:uns,tl:uns,tr:uns,uk:uns,ur:uns,vi:uns,zh-cn:uns,zh-tw:uns</str>
      <str name="langid.fallback">en</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

The debug output of lang detect, during indexing, is as follows:

---
DEBUG - 2015-08-03 14:37:54.450; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language 
detected de with certainty 0.964723182276
DEBUG - 2015-08-03 14:37:54.450; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Detected 
main document language from fields [a, b]: de
DEBUG - 2015-08-03 14:37:54.450; 
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; 
Appending field a
DEBUG - 2015-08-03 14:37:54.451; 
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; 
Appending field b
DEBUG - 2015-08-03 14:37:54.453; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language 
detected de with certainty 0.964723182276
DEBUG - 2015-08-03 14:37:54.453; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping 
field a using individually detected language de
DEBUG - 2015-08-03 14:37:54.454; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing 
mapping from a with language de to field a_de
DEBUG - 2015-08-03 14:37:54.454; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping 
field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.454; org.eclipse.jetty.webapp.WebAppClassLoader; 
loaded class org.apache.solr.common.SolrInputField from 
WebAppClassLoader=525571@80503
DEBUG - 2015-08-03 14:37:54.454; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Removing 
old field a
DEBUG - 2015-08-03 14:37:54.455; 
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; 
Appending field a
DEBUG - 2015-08-03 14:37:54.455; 
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; 
Appending field b
DEBUG - 2015-08-03 14:37:54.456; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language 
detected de with certainty 0.980402022373
DEBUG - 2015-08-03 14:37:54.456; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping 
field b using individually detected language de
DEBUG - 2015-08-03 14:37:54.456; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing 
mapping from b with language de to field b_de
DEBUG - 2015-08-03 14:37:54.456; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping 
field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.456; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Removing 
old field b
-

From this, my takeaway is that every time the 
LangDetectLanguageIdentifierUpdateProcessor is asked to detect the language, it 
is using field a AND b.  But I can’t quite tell from this output.

Any insight appreciated.

Regards,

David




Identical query returning different aggregate results

2014-12-16 Thread David Smith
I have a prototype SolrCloud 4.10.2 setup with 13 collections (of 1 replica, 1 
shard each) and a separate 1-node Zookeeper 3.4.6.  
The very first app test case I wrote is failing intermittently in this 
environment, when I only have 4 documents ingested into the cloud.
I dug in and found when I query against multiple collections, using the 
"collection=" parameter, the aggregates I request are correct about 50% of the 
time.  The other 50% of the time, the aggregate returned by Solr is not 
correct. Note this is for the identical query.  In other words, I can run the 
same query multiple times in a row, and get different answers.

The simplest version of the query that still exhibits the odd behavior is as 
follows:
http://192.168.59.103:8985/solr/query_handler/query?facet.range=eventDate&f.eventDate.facet.range.end=2014-12-31T23:59:59.999Z&f.eventDate.facet.range.gap=%2B1DAY&fl=eventDate,id&start=0&collection=2014_04,2014_03&rows=10&f.eventDate.facet.range.start=2014-01-01T00:00:00.000Z&q=*:*&f.eventDate.facet.mincount=1&facet=true
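For readability, here is the same query decomposed into its parameters and 
reassembled; this is just a restatement of the URL above, nothing new:

```python
# The query above, decomposed into its parameters and reassembled.
from urllib.parse import urlencode

params = {
    "q": "*:*",
    "collection": "2014_04,2014_03",
    "rows": 10,
    "start": 0,
    "fl": "eventDate,id",
    "facet": "true",
    "facet.range": "eventDate",
    "f.eventDate.facet.range.start": "2014-01-01T00:00:00.000Z",
    "f.eventDate.facet.range.end": "2014-12-31T23:59:59.999Z",
    "f.eventDate.facet.range.gap": "+1DAY",
    "f.eventDate.facet.mincount": 1,
}
url = "http://192.168.59.103:8985/solr/query_handler/query?" + urlencode(params)
print(url)
```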

When it SUCCEEDS, the aggregate correctly appears like this:

  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_dates":{},
    "facet_ranges":{
      "eventDate":{
        "counts":[
          "2014-04-01T00:00:00Z",3],
        "gap":"+1DAY",
        "start":"2014-01-01T00:00:00Z",
        "end":"2015-01-01T00:00:00Z"}},
    "facet_intervals":{}}}

When it FAILS, note that the counts[] array is empty:
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_dates":{},
    "facet_ranges":{
      "eventDate":{
        "counts":[],
        "gap":"+1DAY",
        "start":"2014-01-01T00:00:00Z",
        "end":"2015-01-01T00:00:00Z"}},
    "facet_intervals":{}}}

If I further simplify the query, by removing range options or reducing to one 
(1) collection name, then the problem goes away.

The solr logs are clean at INFO level, and there is no substantive difference 
in log output when the query succeeds vs fails, leaving me stumped where to 
look next.  Suggestions welcome.
Regards,
David






Re: Identical query returning different aggregate results

2014-12-16 Thread David Smith
Alex,
Good suggestion, but in this case, no.  This example is from a cleanroom-type 
test environment where the collections have very recently been created, there 
are only 4 documents total across all collections, and no deletes have been 
issued.
Kind regards,
David
 

 On Tuesday, December 16, 2014 12:01 PM, Alexandre Rafalovitch 
 wrote:
   

 Facet counts include deleted documents until the segments merge. Could that
be an issue?

Regards,
    Alex
On 16/12/2014 12:18 pm, "David Smith"  wrote:

> I have a prototype SolrCloud 4.10.2 setup with 13 collections (of 1
> replica, 1 shard each) and a separate 1-node Zookeeper 3.4.6.
> The very first app test case I wrote is failing intermittently in this
> environment, when I only have 4 documents ingested into the cloud.
> I dug in and found when I query against multiple collections, using the
> "collection=" parameter, the aggregates I request are correct about 50% of
> the time.  The other 50% of the time, the aggregate returned by Solr is not
> correct. Note this is for the identical query.  In other words, I can run
> the same query multiple times in a row, and get different answers.
>
> The simplest version of the query that still exhibits the odd behavior is
> as follows:
>
> http://192.168.59.103:8985/solr/query_handler/query?facet.range=eventDate&f.eventDate.facet.range.end=2014-12-31T23:59:59.999Z&f.eventDate.facet.range.gap=%2B1DAY&fl=eventDate,id&start=0&collection=2014_04,2014_03&rows=10&f.eventDate.facet.range.start=2014-01-01T00:00:00.000Z&q=*:*&f.eventDate.facet.mincount=1&facet=true
>
> When it SUCCEEDS, the aggregate correctly appears like this:
>
>  "facet_counts":{    "facet_queries":{},    "facet_fields":{},
> "facet_dates":{},    "facet_ranges":{      "eventDate":{        "counts":[
>        "2014-04-01T00:00:00Z",3],        "gap":"+1DAY",
> "start":"2014-01-01T00:00:00Z",        "end":"2015-01-01T00:00:00Z"}},
> "facet_intervals":{}}}
>
> When it FAILS, note that the counts[] array is empty:
>  "facet_counts":{    "facet_queries":{},    "facet_fields":{},
> "facet_dates":{},    "facet_ranges":{      "eventDate":{
> "counts":[],        "gap":"+1DAY",        "start":"2014-01-01T00:00:00Z",
>      "end":"2015-01-01T00:00:00Z"}},    "facet_intervals":{}}}
>
> If I further simplify the query, by removing range options or reducing to
> one (1) collection name, then the problem goes away.
>
> The solr logs are clean at INFO level, and there is no substantive
> difference in log output when the query succeeds vs fails, leaving me
> stumped where to look next.  Suggestions welcome.
> Regards,
> David
>
>
>
>
>

   

Re: Identical query returning different aggregate results

2014-12-16 Thread David Smith
Hi Erick,
Thanks for your reply.
My test environment only has one shard and one replica per collection.  So, I 
think there is no possibility of replicas getting out of sync.  Here is how I 
create each (month-based) collection:
http://192.168.59.103:8983/solr/admin/collections?action=CREATE&name=2014_01&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=main_conf
http://192.168.59.103:8983/solr/admin/collections?action=CREATE&name=2014_02&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=main_conf
http://192.168.59.103:8983/solr/admin/collections?action=CREATE&name=2014_03&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=main_conf
...etc, etc...

Still, I think you are on to something.  I had already noticed that querying 
one collection at a time works.  For example, if I change my query 
oh-so-slightly from this:

"collection=2014_04,2014_03"

to this

"...collection=2014_04"

Then, the results are correct 100% of the time. I think substantively this is 
the same as specifying the name of the shard since, again, in my test 
environment I only have one shard per collection anyway.
I should mention that the "2014_03" collection is empty.  0 documents.  All 3 
documents which satisfy the facet range are in the "2014_04" collection.  So, 
it's a real head-scratcher that introducing that collection name into the query 
makes the results misbehave.
Kind regards,
David
 On Tuesday, December 16, 2014 2:25 PM, Erick Erickson 
 wrote:
   

 bq: Facet counts include deleted documents until the segments merge

Whoa! Facet counts do _not_ require segment merging to be accurate.
What merging does is remove the _term_ information associated with
deleted documents, and removes their contribution to the TF/IDF
scores.

David:
Hmmm, what happens if you direct the query not only to a single
collection, but to a single shard? Add &distrib=false to the query and
point it to each of your replicas. (one collection at a time). The
expectation is that each replica for a slice within a collection has
identical documents.

One possibility is that somehow your shards are out of sync on a
collection. So the internal load balancing that happens sometimes
sends the query to one replica and sometime to another. 2 replicas
(leader and follower) and 50% failure, coincidence?

That just bumps the question up another level of course, the next
question is _why_ is the shard out of sync. So in that case I'd issue
a commit to all the collections on the off chance that somehow that
didn't happen and try again (very low probability that this is the
root cause, but you never know).

but it sure sounds like one replica doesn't agree with another, so the
above will give us place to look.

Best,
Erick



On Tue, Dec 16, 2014 at 12:12 PM, David Smith
 wrote:
> Alex,
> Good suggestion, but in this case, no.  This example is from a cleanroom type 
> test environment where the collections have very recently been created, there 
> are only 4 documents total across all collections, and no delete's have been 
> issued.
> Kind regards,
> David
>
>
>      On Tuesday, December 16, 2014 12:01 PM, Alexandre Rafalovitch 
> wrote:
>
>
>  Facet counts include deleted documents until the segments merge. Could that
> be an issue?
>
> Regards,
>    Alex
> On 16/12/2014 12:18 pm, "David Smith"  wrote:
>
>> I have a prototype SolrCloud 4.10.2 setup with 13 collections (of 1
>> replica, 1 shard each) and a separate 1-node Zookeeper 3.4.6.
>> The very first app test case I wrote is failing intermittently in this
>> environment, when I only have 4 documents ingested into the cloud.
>> I dug in and found when I query against multiple collections, using the
>> "collection=" parameter, the aggregates I request are correct about 50% of
>> the time.  The other 50% of the time, the aggregate returned by Solr is not
>> correct. Note this is for the identical query.  In other words, I can run
>> the same query multiple times in a row, and get different answers.
>>
>> The simplest version of the query that still exhibits the odd behavior is
>> as follows:
>>
>> http://192.168.59.103:8985/solr/query_handler/query?facet.range=eventDate&f.eventDate.facet.range.end=2014-12-31T23:59:59.999Z&f.eventDate.facet.range.gap=%2B1DAY&fl=eventDate,id&start=0&collection=2014_04,2014_03&rows=10&f.eventDate.facet.range.start=2014-01-01T00:00:00.000Z&q=*:*&f.eventDate.facet.mincount=1&facet=true
>>
>> When it SUCCEEDS, the aggregate correctly appears like this:
>>
>>  "facet_counts":{    "facet_queries":{},    "facet_fields"…

Re: Identical query returning different aggregate results

2014-12-16 Thread David Smith
Chris,

Yes, your suggestion worked.  Changing the parameter in my query from 

...f.eventDate.facet.mincount=1...


to

...f.eventDate.facet.mincount=0...


worked around the problem. And I agree that SOLR-6154 describes what I observed 
almost exactly.  Once 5.0 is available, I'll test this again with "mincount=1".

Thanks everyone for your help! It is very much appreciated.

Regards,
David 

 On Tuesday, December 16, 2014 4:38 PM, Chris Hostetter wrote:

 
sounds like this bug...

https://issues.apache.org/jira/browse/SOLR-6154

...in which case it has nothing to do with your use of multiple 
collections, it's just dependent on whether or not the first node to 
respond happens to have a doc in every "range bucket" .. any bucket 
missing (because of your mincount=1) from the first core to 
respond is then ignored in the response from the subsequent cores.

workaround is to set mincount=0 for your facet ranges.



: Date: Tue, 16 Dec 2014 17:17:05 + (UTC)
: From: David Smith 
: Reply-To: solr-user@lucene.apache.org, David Smith 
: To: Solr-user 
: Subject: Identical query returning different aggregate results
: 
: I have a prototype SolrCloud 4.10.2 setup with 13 collections (of 1 replica, 
1 shard each) and a separate 1-node Zookeeper 3.4.6.  
: The very first app test case I wrote is failing intermittently in this 
environment, when I only have 4 documents ingested into the cloud.
: I dug in and found when I query against multiple collections, using the 
"collection=" parameter, the aggregates I request are correct about 50% of the 
time.  The other 50% of the time, the aggregate returned by Solr is not 
correct. Note this is for the identical query.  In other words, I can run the 
same query multiple times in a row, and get different answers.
: 
: The simplest version of the query that still exhibits the odd behavior is as 
follows:
: 
http://192.168.59.103:8985/solr/query_handler/query?facet.range=eventDate&f.eventDate.facet.range.end=2014-12-31T23:59:59.999Z&f.eventDate.facet.range.gap=%2B1DAY&fl=eventDate,id&start=0&collection=2014_04,2014_03&rows=10&f.eventDate.facet.range.start=2014-01-01T00:00:00.000Z&q=*:*&f.eventDate.facet.mincount=1&facet=true
: 
: When it SUCCEEDS, the aggregate correctly appears like this:
: 
:   "facet_counts":{
:     "facet_queries":{},
:     "facet_fields":{},
:     "facet_dates":{},
:     "facet_ranges":{
:       "eventDate":{
:         "counts":["2014-04-01T00:00:00Z",3],
:         "gap":"+1DAY",
:         "start":"2014-01-01T00:00:00Z",
:         "end":"2015-01-01T00:00:00Z"}},
:     "facet_intervals":{}}}
: 
: When it FAILS, note that the counts[] array is empty:
:   "facet_counts":{
:     "facet_queries":{},
:     "facet_fields":{},
:     "facet_dates":{},
:     "facet_ranges":{
:       "eventDate":{
:         "counts":[],
:         "gap":"+1DAY",
:         "start":"2014-01-01T00:00:00Z",
:         "end":"2015-01-01T00:00:00Z"}},
:     "facet_intervals":{}}}
: 
: If I further simplify the query, by removing range options or reducing to one 
(1) collection name, then the problem goes away.
: 
: The solr logs are clean at INFO level, and there is no substantive difference 
in log output when the query succeeds vs fails, leaving me stumped where to 
look next.  Suggestions welcome.
: Regards,
: David
: 
: 
: 
: 
: 

-Hoss
http://www.lucidworks.com/

   

Slow faceting performance on a docValues field

2015-01-13 Thread David Smith
I have a query against a single 50M doc index (175GB) using Solr 4.10.2, that 
exhibits the following response times (via the debugQuery option in Solr Admin):
"process": {
 "time": 24709,
 "query": { "time": 54 }, "facet": { "time": 24574 },


The query time of 54ms is great and exactly as expected -- this example was a 
single-term search that returned 3 hits.
I am trying to get the facet time (24.5 seconds) to be sub-second, and am 
having no luck.  The facet part of the query is as follows:

"params": { "facet.range": "eventDate",
 "f.eventDate.facet.range.end": "2015-05-13T16:37:18.000Z",
 "f.eventDate.facet.range.gap": "+1DAY",
 "start": "0",
 "rows": "10",
 "f.eventDate.facet.range.start": "2005-03-13T16:37:18.000Z",
 "f.eventDate.facet.mincount": "1",
 "facet": "true",
 "debugQuery": "true",
 "_": "1421169383802"
 }

And, the relevant schema definition is as follows:

   <field name="eventDate" ... multiValued="false" docValues="true"/>

   <fieldType ... positionIncrementGap="0"/>


During the 25-second query, the Solr JVM pegs one CPU, with little or no I/O 
activity detected on the drive that holds the 175GB index.  I have 48GB of RAM, 
1/2 of that dedicated to the OS and the other to the Solr JVM.

I do NOT have any fieldValue caches configured as yet, because my (perhaps too 
simplistic?) reading of the documentation was that DocValues eliminates the 
need for a field-level cache on this facet field.

Any suggestions welcome.

Regards,
David



Re: Slow faceting performance on a docValues field

2015-01-13 Thread David Smith
Shawn,

Thanks for the suggestion, but experimentally, in my case the same query with 
facet.method=enum returns in almost the same amount of time.

Regards
David 

 On Tuesday, January 13, 2015 12:02 PM, Shawn Heisey wrote:


 On 1/13/2015 10:35 AM, David Smith wrote:
> I have a query against a single 50M doc index (175GB) using Solr 4.10.2, that 
> exhibits the following response times (via the debugQuery option in Solr 
> Admin):
> "process": {
>  "time": 24709,
>  "query": { "time": 54 }, "facet": { "time": 24574 },
>
>
> The query time of 54ms is great and exactly as expected -- this example was a 
> single-term search that returned 3 hits.
> I am trying to get the facet time (24.5 seconds) to be sub-second, and am 
> having no luck.  The facet part of the query is as follows:
>
> "params": { "facet.range": "eventDate",
>  "f.eventDate.facet.range.end": "2015-05-13T16:37:18.000Z",
>  "f.eventDate.facet.range.gap": "+1DAY",
>  "start": "0",
>
>  "rows": "10",
>
>  "f.eventDate.facet.range.start": "2005-03-13T16:37:18.000Z",
>
>  "f.eventDate.facet.mincount": "1",
>
>  "facet": "true",
>
>  "debugQuery": "true",
>  "_": "1421169383802"
>  }
>
> And, the relevant schema definition is as follows:
>
>    <field name="eventDate" ... multiValued="false" docValues="true"/>
>
>    <fieldType ... positionIncrementGap="0"/>
>
>
> During the 25-second query, the Solr JVM pegs one CPU, with little or no I/O 
> activity detected on the drive that holds the 175GB index.  I have 48GB of 
> RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
>
> I do NOT have any fieldValue caches configured as yet, because my (perhaps 
> too simplistic?) reading of the documentation was that DocValues eliminates 
> the need for a field-level cache on this facet field.

24GB of RAM to cache 175GB is probably not enough in the general case,
but if you're seeing very little disk I/O activity for this query, then
we'll leave that alone and you can worry about it later.

What I would try immediately is setting the facet.method parameter to
enum and seeing what that does to the facet time.  I've had good luck
generally with that, even in situations where the docs indicated that
the default (fc) was supposed to work better.  I have never explored the
relationship between facet.method and docValues, though.

I'm out of ideas after this.  I don't have enough experience with
faceting to help much.

Thanks,
Shawn



   

Re: Slow faceting performance on a docValues field

2015-01-13 Thread David Smith
Tomás,


Thanks for the response -- the performance of my query makes perfect sense in 
light of your information.
I looked at Interval faceting.  My required interval is 1 day.  I cannot change 
that requirement.  Unless I am mis-reading the doc, that means to facet a 10 
year range, the query needs to specify over 3,600 intervals ??

f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]&f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]&etc,etc
 

Each query would be 185MB in size if I structure it this way.

I assume I must be mis-understanding how to use Interval faceting with dates.  
Are there any concrete examples you know of?  A google search did not come up 
with much.
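For a sense of the scale involved, a sketch that generates one interval.set parameter per day over a 10-year window (the dates are illustrative):

```python
from datetime import date, timedelta

# Illustrative 10-year window, one one-day interval per date.
start, end = date(2005, 1, 1), date(2014, 12, 31)

intervals = []
d = start
while d <= end:
    intervals.append(
        f"f.eventDate.facet.interval.set="
        f"[{d}T00:00:00.000Z,{d}T23:59:59.999Z]"
    )
    d += timedelta(days=1)

print(len(intervals))  # 3652 one-day intervals for this window
```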

Kind regards,
Dave

 On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe wrote:


 Range Faceting won't use the DocValues even if they are there set, it
translates each gap to a filter. This means that it will end up using the
FilterCache, which should cause faster followup queries if you repeat the
same gaps (and don't commit).
You may also want to try interval faceting, it will use DocValues instead
of filters. The API is different, you'll have to provide the intervals
yourself.

Tomás

On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey  wrote:

> On 1/13/2015 10:35 AM, David Smith wrote:
> > I have a query against a single 50M doc index (175GB) using Solr 4.10.2,
> that exhibits the following response times (via the debugQuery option in
> Solr Admin):
> > "process": {
> >  "time": 24709,
> >  "query": { "time": 54 }, "facet": { "time": 24574 },
> >
> >
> > The query time of 54ms is great and exactly as expected -- this example
> was a single-term search that returned 3 hits.
> > I am trying to get the facet time (24.5 seconds) to be sub-second, and
> am having no luck.  The facet part of the query is as follows:
> >
> > "params": { "facet.range": "eventDate",
> >  "f.eventDate.facet.range.end": "2015-05-13T16:37:18.000Z",
> >  "f.eventDate.facet.range.gap": "+1DAY",
> >  "start": "0",
> >
> >  "rows": "10",
> >
> >  "f.eventDate.facet.range.start": "2005-03-13T16:37:18.000Z",
> >
> >  "f.eventDate.facet.mincount": "1",
> >
> >  "facet": "true",
> >
> >  "debugQuery": "true",
> >  "_": "1421169383802"
> >  }
> >
> > And, the relevant schema definition is as follows:
> >
> >     <field name="eventDate" ... multiValued="false" docValues="true"/>
> >
> >     <fieldType ... positionIncrementGap="0"/>
> >
> >
> > During the 25-second query, the Solr JVM pegs one CPU, with little or no
> I/O activity detected on the drive that holds the 175GB index.  I have 48GB
> of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
> >
> > I do NOT have any fieldValue caches configured as yet, because my
> (perhaps too simplistic?) reading of the documentation was that DocValues
> eliminates the need for a field-level cache on this facet field.
>
> 24GB of RAM to cache 175GB is probably not enough in the general case,
> but if you're seeing very little disk I/O activity for this query, then
> we'll leave that alone and you can worry about it later.
>
> What I would try immediately is setting the facet.method parameter to
> enum and seeing what that does to the facet time.  I've had good luck
> generally with that, even in situations where the docs indicated that
> the default (fc) was supposed to work better.  I have never explored the
> relationship between facet.method and docValues, though.
>
> I'm out of ideas after this.  I don't have enough experience with
> faceting to help much.
>
> Thanks,
> Shawn
>
>

   

Re: Slow faceting performance on a docValues field

2015-01-13 Thread David Smith
What is stumping me is that the search result has 3 hits, yet faceting those 3 
hits takes 24 seconds.  The documentation for facet.method=fc is quite explicit 
about how Solr does faceting:


"fc (stands for Field Cache) The facet counts are calculated by iterating over 
documents that match the query and summing the terms that appear in each 
document. This was the default method for single valued fields prior to Solr 
1.4."

If a search yielded millions of hits, I could understand 24 seconds to 
calculate the facets.  But not for a search with only 3 hits.  


What am I missing?  

Regards,
David



 

 On Tuesday, January 13, 2015 1:12 PM, Tomás Fernández Löbbe wrote:


 No, you are not misreading, right now there is no automatic way of
generating the intervals on the server side similar to range faceting... I
guess it won't work in your case. Maybe you should create a Jira to add
this feature to interval faceting.

Tomás

On Tue, Jan 13, 2015 at 10:44 AM, David Smith wrote:

> Tomás,
>
>
> Thanks for the response -- the performance of my query makes perfect sense
> in light of your information.
> I looked at Interval faceting.  My required interval is 1 day.  I cannot
> change that requirement.  Unless I am mis-reading the doc, that means to
> facet a 10 year range, the query needs to specify over 3,600 intervals ??
>
>
> f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]&f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]&etc,etc
>
>
> Each query would be 185MB in size if I structure it this way.
>
> I assume I must be mis-understanding how to use Interval faceting with
> dates.  Are there any concrete examples you know of?  A google search did
> not come up with much.
>
> Kind regards,
> Dave
>
>      On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe <
> tomasflo...@gmail.com> wrote:
>
>
>  Range Faceting won't use the DocValues even if they are there set, it
> translates each gap to a filter. This means that it will end up using the
> FilterCache, which should cause faster followup queries if you repeat the
> same gaps (and don't commit).
> You may also want to try interval faceting, it will use DocValues instead
> of filters. The API is different, you'll have to provide the intervals
> yourself.
>
> Tomás
>
On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey wrote:
>
> > On 1/13/2015 10:35 AM, David Smith wrote:
> > > I have a query against a single 50M doc index (175GB) using Solr
> 4.10.2,
> > that exhibits the following response times (via the debugQuery option in
> > Solr Admin):
> > > "process": {
> > >  "time": 24709,
> > >  "query": { "time": 54 }, "facet": { "time": 24574 },
> > >
> > >
> > > The query time of 54ms is great and exactly as expected -- this example
> > was a single-term search that returned 3 hits.
> > > I am trying to get the facet time (24.5 seconds) to be sub-second, and
> > am having no luck.  The facet part of the query is as follows:
> > >
> > > "params": { "facet.range": "eventDate",
> > >  "f.eventDate.facet.range.end": "2015-05-13T16:37:18.000Z",
> > >  "f.eventDate.facet.range.gap": "+1DAY",
> > >  "start": "0",
> > >
> > >  "rows": "10",
> > >
> > >  "f.eventDate.facet.range.start": "2005-03-13T16:37:18.000Z",
> > >
> > >  "f.eventDate.facet.mincount": "1",
> > >
> > >  "facet": "true",
> > >
> > >  "debugQuery": "true",
> > >  "_": "1421169383802"
> > >  }
> > >
> > > And, the relevant schema definition is as follows:
> > >
> > >     <field name="eventDate" ... multiValued="false" docValues="true"/>
> > >
> > >     <fieldType ... positionIncrementGap="0"/>
> > >
> > >
> > > During the 25-second query, the Solr JVM pegs one CPU, with little or
> no
> > I/O activity detected on the drive that holds the 175GB index.  I have
> 48GB
> > of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
> > >
> > > I do NOT have any fieldValue caches configured as yet, because my
> > (perhaps too simplistic?) reading of the documentation was that DocValues
> > eliminates the need for a field-level cache on this facet field.
> >
> > 24GB of RAM to cache 175GB is probably not enough in the general case,
> > but if you're seeing very little disk I/O activity for this query, then
> > we'll leave that alone and you can worry about it later.
> >
> > What I would try immediately is setting the facet.method parameter to
> > enum and seeing what that does to the facet time.  I've had good luck
> > generally with that, even in situations where the docs indicated that
> > the default (fc) was supposed to work better.  I have never explored the
> > relationship between facet.method and docValues, though.
> >
> > I'm out of ideas after this.  I don't have enough experience with
> > faceting to help much.
> >
> > Thanks,
> > Shawn
> >
> >
>
>
>

   

Re: Slow faceting performance on a docValues field

2015-01-13 Thread David Smith
Shawn,
I've been thinking along your lines, and continued to run tests through the 
day.  The results surprised me.

For my index, Solr range faceting time is most closely related to the total 
number of documents in the index for the range specified.  The number of 
"buckets" in the range is a second factor.   

I found NO correlation whatsoever to the number of hits in the query.  Whether 
I have 3 hits or 1,500,000 hits, it's ~24 seconds to facet the result for that 
same time period.  That is what surprised me.

For example, if my facet range is a 10 year period for which there exists 47M 
docs in the index, the facet time is 24 seconds.  If I switch my facet range to 
a different 10 year period with 1.3M docs, the facet time drops to less than 5 
seconds.  

If I go back to my original 10 year period (with 47M docs in the index), but 
facet by month instead of day, my facet time drops to 2.5 seconds.  Now, I 
can't meet my user needs this way, but it does show the relationship between # 
of buckets and faceting time.
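The bucket counts behind that comparison can be sketched directly (the 10-year range is illustrative):

```python
from datetime import date

# Illustrative 10-year facet range.
start, end = date(2005, 1, 1), date(2015, 1, 1)

day_buckets = (end - start).days              # buckets with gap=+1DAY
month_buckets = (end.year - start.year) * 12  # buckets with gap=+1MONTH

print(day_buckets, month_buckets)  # 3652 120
```

Roughly a 30x difference in bucket count over the same range, consistent with the gap size dominating the per-range facet cost observed above.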

Regards,

David