Solr Cloud: Duplicate documents in multiple shards

2015-07-20 Thread mesenthil1
Hi All,

We are running a Solr 4.2.1 Cloud setup with 5 shards (1 leader and 1 replica
per shard) and are seeing the following issue: a few documents are returned
from more than one shard for the same query. When we try to update such a
document, it is updated on only one shard, not on both, and we are unable to
delete it either. Can you please clarify the following?

1. What happens if a shard (both leader and replica) goes down? If a document
belonging to the "dead shard" is updated, will the document be forwarded to a
new shard? If so, when the "dead shard" comes back up, will it not still be
considered the owner of the same hash key range?
2. Is there a way to fix this [removing duplicates across shards]?

We have 130 million documents in our setup, and the router is set to
"compositeId".

Senthil







Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-20 Thread mesenthil1
Thanks Erick for clarifying.
We are not explicitly setting a composite ID. We only pass numShards=5 as part
of server startup, and we use a uuid field as the unique key.

One sample id is:

possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30


Not sure how it could have ended up on multiple shards. Do you have any
suggestion for fixing this, or do we need to completely rebuild the index?
When the router is compositeId, should we explicitly include a "!" shard key
in the id?






Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread mesenthil1
We are unable to delete the documents even when passing distrib=false. It is
also difficult to identify the duplicate documents among the 130 million.

Is there a way to see the generated hash key for a document and map it to the
specific shard?
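
For what it's worth, a minimal sketch of computing the route hash outside
Solr (assuming the solr-solrj 4.x jar is on the classpath; CompositeIdRouter
hashes the whole id with MurmurHash3, seed 0, when the id contains no "!"):

    import org.apache.solr.common.util.Hash;

    public class RouteHashCheck {
        public static void main(String[] args) {
            // The uuid exactly as indexed (the sample id from this thread).
            String id = "possting.mongo-v2.services.com-intl-staging-"
                      + "c2d2a376-5e4a-11e2-8963-0026b9414f30";
            // With router=compositeId and no "!" in the id, the whole id is
            // hashed; with "shardKey!docId" the top 16 bits would come from
            // the shard key's hash instead.
            int hash = Hash.murmurhash3_x86_32(id, 0, id.length(), 0);
            // Compare this value against each shard's "range" in
            // clusterstate.json (e.g. 80000000-b332ffff) to find the owner.
            System.out.printf("route hash = %08x%n", hash);
        }
    }

Comparing this hash with the ranges in clusterstate.json shows which single
shard should own the document; any other shard returning it holds a stray
copy.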





Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-22 Thread mesenthil1
Alessandro,
Thanks. 
I see some confusion here.
*First of all you need a smart client that will load balance the docs to
index. Let's say the CloudSolrClient.*
All 5 shards sit behind a load balancer; requests are sent to the load
balancer, and whichever server is up accepts them (a sketch of the
smart-client alternative follows below).
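
For reference, a minimal sketch of the smart-client approach (assuming SolrJ
4.x, where the class is CloudSolrServer rather than CloudSolrClient; the
ZooKeeper hosts are placeholders):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudIndexer {
        public static void main(String[] args) throws Exception {
            // Connects via ZooKeeper and reads the cluster state, so updates
            // reach the right node without an external load balancer.
            CloudSolrServer server =
                new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("uuid", "samplehost-sampledb-c2d2a376-0026b9414f30");
            server.add(doc);
            server.shutdown();
        }
    }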

*What do you mean with: "will this not be considered for the same hash key
range"?*
Each shard owns a hash key range, and every document is assigned to the shard
whose range contains the document's hash key.

Reitzel,
The uuid is generated at update time; it is unique and stable, not a new id
for each update of the document. Also, a shard-specific route key [env] is
not possible in our case.


Thanks,
Senthil





Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-27 Thread mesenthil1
Thanks Erick. Now that I understand the entire cluster goes down if any one
shard is down, my first confusion is cleared up.

Here are the other details:

We really need to see details since I'm guessing we're talking
past each other. So:
*1> exactly how are you indexing documents?*
/We use HttpSolrServer and send all update requests to leader1/shard1 (see
the sketch below). autoCommit is enabled at 60 seconds, and the client
application never issues explicit commits./
*2> exactly how are you assigning a UUID to a doc?*
/A unique field is defined in schema.xml; its value is generated by the
client application, and the ID format is {mongoDBHostName}-{mongoDBName}-{UUID}./
*3> do you ever re-index documents? If so, how are you
   assuring that the UUID generated for any re-indexing operations
   are the same ones used the first time? *
/Yes, we re-index documents. We get the UUID from MongoDB, and the same code
generates the same ID when we update./
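
The sketch of our indexing path as described above (assuming SolrJ 4.x; the
URL is a placeholder for the shard1 leader):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class LeaderIndexer {
        public static void main(String[] args) throws Exception {
            // All updates go to one leader; Solr forwards each document to
            // the shard owning its hash range, at the cost of an extra hop.
            HttpSolrServer server = new HttpSolrServer(
                "http://shard1-leader:8983/solr/collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("uuid", "samplehost-sampledb-c2d2a376-0026b9414f30");
            server.add(doc);   // no explicit commit; autoCommit (60s) applies
            server.shutdown();
        }
    }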


We still cannot determine the root cause of the duplicate documents across
shards. It also looks like reindexing is the only way to remove the
duplicates.





Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-28 Thread mesenthil1
Thanks Erick. We cannot recall what might have happened in between.

Yes. We are seeing the same document in 2 shards.

"Uniquefiled" is set as uuid in schema and declared as String.  Will go with
reindexing. 

schema.xml : 

Query:
http://localhost:1004/solr/collection1/select?q=id:%22mongo.com-e25a2-11e3-8a73-0026b9414f30%22&wt=xml&shards.info=true

Response (shards.info per shard; the XML markup was stripped by the archive,
only the values survive):

    numFound=1  maxScore=17.853292  time=3
    numFound=1  maxScore=17.850622  time=2
    numFound=0  maxScore=0.0        time=3
    numFound=0  maxScore=0.0        time=4
    numFound=0  maxScore=0.0        time=19

Two shards report numFound=1 for the same id, which confirms the duplicate.



CDATA response is coming with "&lt;" instead of "<"

2015-04-21 Thread mesenthil1
We are using DIH to index XML files. Parts of the XML are enclosed in CDATA.
The content gets indexed, but in the response the CDATA content comes back as
escaped entities (&lt;) instead of the original symbols. Example:

Feed file (the markup was stripped by the archive; it contained a document
with an id field of 123 and a CDATA-wrapped field holding the text
"abc pqr xyz").

Re: CDATA response is coming with "&lt;" instead of "<"

2015-04-21 Thread mesenthil1
Thanks.

With wt=json the results come back properly, and I understand why the XML
writer returns the content as &lt;. Since our Solr client expects the value
to be wrapped in CDATA, I am looking for a way to achieve that.
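
One possible client-side workaround, as a minimal sketch (assuming Apache
commons-lang 2.x is available; the field value is a made-up example):

    import org.apache.commons.lang.StringEscapeUtils;

    public class UnescapeField {
        public static void main(String[] args) {
            // Field value as returned by wt=xml (entities escaped by the
            // response writer).
            String escaped = "&lt;data&gt;abc pqr xyz&lt;/data&gt;";
            String original = StringEscapeUtils.unescapeXml(escaped);
            System.out.println(original);   // <data>abc pqr xyz</data>
        }
    }

After unescaping, the client can re-wrap the markup in CDATA itself before
passing it on.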







Re: Getting IO Exception while Indexing

2017-07-20 Thread mesenthil1
Hi,
This is happening repeatedly for a few documents. When we compare them with
other, similar documents that index fine, we cannot find any difference.

Since Apache returns 400, the request never reaches Solr, so we are unable to
find the cause.

Senthil





Re: Getting IO Exception while Indexing

2017-07-20 Thread mesenthil1
While debugging, we found the following: when we send the same document as
JSON, it is indexed without any issue; when the same document is converted to
a SolrInputDocument and sent to Solr via SolrServer, it fails.
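
A minimal sketch of the JSON path that works for us (assuming Apache
HttpClient 4.x; the URL and field values are placeholders):

    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.entity.ContentType;
    import org.apache.http.entity.StringEntity;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class JsonIndexer {
        public static void main(String[] args) throws Exception {
            try (CloseableHttpClient http = HttpClients.createDefault()) {
                HttpPost post = new HttpPost(
                    "http://localhost:8983/solr/collection1/update/json");
                // One document; server-side autoCommit handles the commit.
                post.setEntity(new StringEntity(
                    "[{\"uuid\":\"samplehost-sampledb-0026b9414f30\"}]",
                    ContentType.APPLICATION_JSON));
                http.execute(post).close();
            }
        }
    }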





Re: Getting IO Exception while Indexing

2017-07-31 Thread mesenthil1
We added logging in most places but could not find any significant difference
between successful and failing documents. We changed our logic to use a
direct HTTP client and post the JSON messages straight to Solr Cloud. Most of
the ids are fine now.

But we still see the same issue with a few documents. When we run the same
code from different Linux boxes, it works fine. With Apache dumpio enabled,
we can see the payload is not completely passed to Apache when executing from
this machine. While collecting the Apache dump in error_log, we see the
following message:

"(70008)Partial results are valid but processing is incomplete: proxy:
prefetch request body failed to"

Because the request payload [incomplete or partial JSON] is truncated, the
request is not forwarded to Solr at all; it fails at the Apache level and is
returned as 400. On the client side we get a connection-reset exception.

Any help would be really appreciated.









Re: Suggestions from different dictionaries dynamically

2017-03-15 Thread mesenthil1
Yes we are using spellcheck dictionary.  Our default search field is "text".
Following is the solrconfig snippet. Please let us know if there is more
information required.

  
 
solrconfig.xml snippet (the XML markup was stripped by the archive; the
surviving values):

    request handler defaults:  defType=edismax, spellcheck=true
    component/dictionary names (fused by the stripping):
        typeaheadspellcheck, spellcheckresearcher, typeaheadtextSpellPhrase
    spellchecker settings:     name=spellchecker_phrase, field=spellphrase,
                               spellcheckIndexDir=./spellchecker_phrase_typeahead
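
For reference, a minimal sketch of selecting a dictionary per request from
SolrJ (assuming SolrJ 6.x; the core URL and the /typeahead handler path are
assumptions, while the dictionary name is taken from the snippet above):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.response.SpellCheckResponse;

    public class SpellcheckQuery {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/collection1").build();

            SolrQuery q = new SolrQuery("resarch");   // misspelled on purpose
            q.setRequestHandler("/typeahead");        // assumed handler path
            q.set("spellcheck", "true");
            // Choose the dictionary per request rather than fixing it in
            // the handler defaults:
            q.set("spellcheck.dictionary", "spellchecker_phrase");

            QueryResponse rsp = solr.query(q);
            for (SpellCheckResponse.Suggestion s
                    : rsp.getSpellCheckResponse().getSuggestions()) {
                System.out.println(s.getToken() + " -> " + s.getAlternatives());
            }
            solr.close();
        }
    }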





