No, you do not, although you might consider it, because you'd be getting a sort of integrated stack.

But really, the decision to switch to running Solr in HDFS should not be taken lightly. Unless you are on a team familiar with running a Hadoop stack, or you're willing to devote a lot of effort toward becoming proficient with one, I would recommend against it.

On 10/28/14 15:27, S.L wrote:
I'm using Apache Hadoop and Solr. Do I need to switch to Cloudera?

On Tue, Oct 28, 2014 at 1:27 PM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:

We index directly from mappers using SolrJ. It does work, but you pay the
price of having to instantiate all those sockets vs. the way
MapReduceIndexerTool works, where you're writing to an EmbeddedSolrServer
directly in the Reduce task.

You don't *need* to use MapReduceIndexerTool, but it's more efficient, and
if you don't, you then have to make sure to appropriately tune your Hadoop
implementation to match what your Solr installation is capable of.
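
For illustration, the mapper-side approach is roughly the following (just a sketch -- the collection name, field names, and ZK ensemble string are placeholders rather than our actual code; classes are SolrJ 4.x):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch of indexing straight from a map task with SolrJ 4.x.
    public class SolrIndexingMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        // Placeholder ZK ensemble; substitute your own hosts.
        private static final String ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181";
        private CloudSolrServer solr;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // One client per task; CloudSolrServer is thread-safe, so threads
            // inside the task can share it rather than opening more sockets.
            solr = new CloudSolrServer(ZK_HOSTS);
            solr.setDefaultCollection("dyCollection1");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", java.util.UUID.randomUUID().toString());
            doc.addField("thingURL", value.toString());
            try {
                // No explicit commit; visibility is left to the autoCommit
                // settings in solrconfig.xml.
                solr.add(doc);
            } catch (SolrServerException e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            solr.shutdown();
        }
    }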

On 10/28/14 12:39, S.L wrote:

Will,

I think in one of your other emails (which I am not able to find) you had asked if I was indexing directly from MapReduce jobs. Yes, I am indexing directly from the map task, using SolrJ with a CloudSolrServer initialized with the ZK ensemble URLs. Do I need to use something like MapReduceIndexerTool, which I suppose writes to HDFS and in a subsequent step moves that into the Solr index? If so, why?

I don't use any soft commits, and I autocommit every 15 seconds; the snippet from the configuration can be seen below.

       <autoSoftCommit>
         <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
       </autoSoftCommit>

       <autoCommit>
         <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
         <openSearcher>true</openSearcher>
       </autoCommit>

I looked at the localhost_access.log file; all the GET and POST requests have a sub-second response time.




On Tue, Oct 28, 2014 at 2:06 AM, Will Martin <wmartin...@gmail.com> wrote:

The easiest, and coarsest, measure of response time [not service time in a distributed system] can be picked up in your localhost_access.log file. You're using Tomcat, right? Look up AccessLogValve in the docs and server.xml. You can add configuration to report the payload and the time to service the request without touching any code.
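
Something along these lines in server.xml does it (a sketch -- keep the directory/prefix your install already uses; %D reports the time to service the request in milliseconds and %b the payload in bytes):

    <!-- %D = millis to service the request, %b = bytes sent (payload) -->
    <Valve className="org.apache.catalina.valves.AccessLogValve"
           directory="logs" prefix="localhost_access" suffix=".log"
           pattern="%h %l %u %t &quot;%r&quot; %s %b %D" />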

Queueing theory is what Otis was talking about when he said you've saturated your environment. In AWS people just auto-scale up and don't worry about where the load comes from; it's dumb if it happens more than twice. Capacity planning is tough; let's hope it doesn't disappear altogether.
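
If you want actual numbers for arrival rate vs. service rate out of Solr itself, the mbeans handler is one place to look (a sketch -- I'm reusing the host and collection names from your earlier mail, and the exact stat names can vary across 4.x releases):

    http://server3.mydomain.com:8082/solr/dyCollection1/admin/mbeans?stats=true&cat=QUERYHANDLER&wt=json

The stats reported for the /update handler (requests, totalTime, avgRequestsPerSecond, avgTimePerRequest) give a crude picture of whether requests arrive faster than they are serviced.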

G'luck


-----Original Message-----
From: S.L [mailto:simpleliving...@gmail.com]
Sent: Monday, October 27, 2014 9:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas
out of synch.

Good point about ZK logs; I do see the following exceptions intermittently in the ZK log.

2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO  [CommitProcessor:1:ZooKeeperServer@617] - Established session 0x14949db9da40037 with negotiated timeout 10000 for client /xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14949db9da40037, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:744)

As for queueing theory, I don't know of any way to see how fast the requests are being served by SolrCloud, or whether a queue builds up when the service rate is slower than the rate of requests from the incoming threads.

On Mon, Oct 27, 2014 at 7:09 PM, Will Martin <wmartin...@gmail.com> wrote:

  2 naïve comments, of course.


- Queueing theory
- ZooKeeper logs



From: S.L [mailto:simpleliving...@gmail.com]
Sent: Monday, October 27, 2014 1:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
replicas out of synch.



Please find the clusterstate.json attached.

Also, in this case at least the Shard1 replicas are out of sync, as can be seen below.

Shard 1 replica 1 *does not* return a result with distrib=false.

Query:
http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true



Result:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="shards.info">true</str>
      <str name="distrib">false</str>
      <str name="debug">track</str>
      <str name="wt">xml</str>
      <str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="debug"/>
</response>



Shard 1 replica 2 *does* return the result with distrib=false.

Query:
http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

Result:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="shards.info">true</str>
      <str name="distrib">false</str>
      <str name="debug">track</str>
      <str name="wt">xml</str>
      <str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="thingURL">http://www.xyz.com</str>
      <str name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str>
      <long name="_version_">1483135330558148608</long>
    </doc>
  </result>
  <lst name="debug"/>
</response>



On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

On Mon, Oct 27, 2014 at 9:40 PM, S.L <simpleliving...@gmail.com> wrote:

One is not smaller than the other; numDocs is the same for both "replicas", and essentially they seem to be disjoint sets.

  That is strange. Can we see your clusterstate.json? With that, please
also specify the two replicas which are out of sync.

Also, manually purging the replicas is not an option, because this is a frequently indexed index and we need everything to be automated.

What other options do I have now?

1. Turn off replication completely in SolrCloud.
2. Use the traditional master/slave replication model.
3. Introduce a "replica"-aware field in the index, to figure out which "replica" the request should go to from the client.
4. Try a distribution like Helios to see if it has any different behavior.

Just thinking out loud here......

On Mon, Oct 27, 2014 at 11:56 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

Hi - if there is a very large discrepancy, you could consider purging the smallest replica; it will then resync from the leader.
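
A sketch of one way to do that, assuming default SolrCloud core naming and that you can briefly stop the node (the path is illustrative):

    # Stop the Solr/Tomcat instance hosting the out-of-sync replica, then
    # remove that core's index so it does a full recovery from the leader:
    rm -rf /path/to/solr/home/dyCollection1_shard1_replica1/data/index
    # Restart the instance; the core enters recovery and pulls the
    # complete index from the shard leader.

The Collections API DELETEREPLICA/ADDREPLICA actions (both available in 4.10) can do the same without touching the filesystem.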


-----Original message-----

From: S.L <simpleliving...@gmail.com>
Sent: Monday 27th October 2014 16:41
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

Markus,

I would like to ignore it too, but what's happening is that there is a lot of discrepancy between the replicas; queries like q=*:*&fq=(id:220a8dce-3b31-4d46-8386-da8405595c47) fail depending on which replica the request goes to, because of the huge discrepancy between the replicas.

Thank you for confirming that it is a known issue; I was thinking I was the only one facing this due to my setup.

On Mon, Oct 27, 2014 at 11:31 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

It is an ancient issue. One of the major contributors to the issue was resolved some versions ago, but we are still seeing it sometimes too, and there is nothing to see in the logs. We ignore it and just reindex.
-----Original message-----

From: S.L <simpleliving...@gmail.com>
Sent: Monday 27th October 2014 16:25
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

Thanks Otis,

I have checked the logs, in my case the default catalina.out, and I don't see any OOMs or any other exceptions.

What other metrics do you suggest?

On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

Hi,

You may simply be overwhelming your cluster nodes. Have you checked various metrics to see if that is the case?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Oct 26, 2014, at 9:59 PM, S.L <simpleliving...@gmail.com> wrote:

Folks,
I have posted previously about this. I am using SolrCloud 4.10.1 and have a sharded collection with 6 nodes, 3 shards, and a replication factor of 2.

I am indexing into Solr using a Hadoop job. I have 15 map fetch tasks that can each have up to 5 threads, so the load on the indexing side can get as high as 75 concurrent threads.

I am facing an issue where the replicas of a particular shard(s) are consistently getting out of synch. Initially I thought this was because I was using a custom component, but I did a fresh install, removed the custom component, and reindexed using the Hadoop job; I still see the same behavior.

I do not see any exceptions in my catalina.out, like OOM or any other exceptions. I suspect this could be because of the multi-threaded indexing nature of the Hadoop job. I use CloudSolrServer from my Java code to index, and initialize the CloudSolrServer using a 3-node ZK ensemble.
Does anyone know of any known issues with highly multi-threaded indexing and SolrCloud?

Can someone help? This issue has been slowing things down on my end for a while now.

Thanks and much appreciated!

--
Regards,
Shalin Shekhar Mangar.





