Re: Auto-optimization for Solr indexes

2015-12-20 Thread Rahul Ramesh
Hi Erick,
We index several million documents per day, and we optimize every day
when the relative load is low. The reason we optimize is that we don't want
the index sizes to grow too large and auto optimize to kick in. When auto
optimize kicks in, it results in unpredictable performance because it is CPU
and IO intensive.

In older Solr (4.2), when the segment size grew too large, insertion used
to fail. Have we seen this problem in SolrCloud?

Also, we have observed that recovery takes a bit more time when the index is
not optimized. We don't have any quantitative measurement for this; it's just
an observation. Is this observation correct?

If we optimize every day, the indexes will not be skewed, right?

Please let me know if my understanding is correct.

Regards,
Rahul

On Mon, Dec 21, 2015 at 9:54 AM, Erick Erickson 
wrote:

> You'll probably have to shard before you get to the TB range. At that
> point, all the optimization is done individually on each shard so it
> really doesn't matter how many shards you have.
>
> Just issuing
> http://solr:port/solr/collection/update?optimize=true
>
> is sufficient, that'll forward the optimize command to all the shards
> in the collection.
>
> Best,
> Erick
>
> On Sun, Dec 20, 2015 at 8:19 PM, Zheng Lin Edwin Yeo
>  wrote:
> > Thanks for your information Erick.
> >
> > We have yet to decide how often we will update the index to include new
> > documents that came in. Let's say we update the index once a day, then
> when
> > the index is updated, we do the optimization (this will be done at
> night
> > when there are not many users using the system).
> > But my index size will probably grow quite big (potentially can go up to
> > more than 1TB in the future), so does that have to be taken into
> > consideration too?
> >
> > Regards,
> > Edwin
> >
> >
> > On 21 December 2015 at 12:12, Erick Erickson 
> > wrote:
> >
> >> Much depends on how often the index is updated. If your index only
> >> changes, say, once a day then it's probably a good idea. If you're
> >> constantly updating your index, then I'd recommend that you do _not_
> >> optimize.
> >>
> >> Optimizing will create one large segment. That segment will be
> >> unlikely to be merged since it is so large relative to other segments
> >> for quite a while, resulting in significant wasted space. So if you're
> >> regularly indexing documents that _replace_ existing documents, this
> >> will skew your index.
> >>
> >> Bottom line:
> >> If you have a relatively static index that you can build and then use
> >> for an extended time (as in 12 hours plus) it can be worth the time to
> >> optimize. Otherwise I wouldn't bother.
> >>
> >> Best,
> >> Erick
> >>
> >> On Sun, Dec 20, 2015 at 7:57 PM, Zheng Lin Edwin Yeo
> >>  wrote:
> >> > Hi,
> >> >
> >> > I would like to find out, will it be good to write a script to do an
> >> > auto-optimization of the indexes at a certain time every day? Is there
> >> > any advantage to do so?
> >> >
> >> > I found that optimization can reduce the index size by quite a
> >> > significant amount, and allow searching of the index to run faster.
> >> > But will there be advantage if we do the optimization every day?
> >> >
> >> > I'm using Solr 5.3.0
> >> >
> >> > Regards,
> >> > Edwin
> >>
>
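(For reference, the optimize call Erick mentions can be issued as a plain HTTP request, e.g. with curl; the host, port and collection name below are placeholders.)

curl "http://localhost:8983/solr/collection1/update?optimize=true"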


Re: Auto-optimization for Solr indexes

2015-12-20 Thread Rahul Ramesh
Thanks Erick!

Rahul

On Mon, Dec 21, 2015 at 10:07 AM, Erick Erickson 
wrote:

> Rahul:
>
> bq: we don't want the index sizes to grow too large and auto optimize to
> kick in
>
> Not quite what's going on. There is no "auto optimize". What
> there is is background merging that will take _some_ segments and
> merge them together. Very occasionally this will be the same as a full
> optimize if it just happens that "some" means all the segments.
>
> bq: recovery takes a bit more time when it is not optimized
>
> I'd be interested in formal measurements here. A recovery that copied
> the _entire_ index down from the leader shouldn't really be that
> much different between an optimized and non-optimized index, but
> all things are possible. If the recovery is a "peer sync" it shouldn't
> matter at all.
>
> If you're continually adding documents that _replace_ older documents,
> optimizing will recover any "holes" left by the old updated docs. An
> update is really a mark-as-deleted for the old version and a re-index
> of the new. Since segments are write-once, the old data is left there
> until the segment is merged. Now, one of the bits of information that
> goes into deciding whether to merge a segment or not is the size.
> Another is the percentage of deleted docs. When you optimize, you get
> one huge segment. Now you have to update a lot of docs for that
> segment to have a large percentage of deleted documents and be merged,
> thus wasting space and memory.
>
> So it's a tradeoff. But if you're getting satisfactory performance
> from what you have now, there's no reason to change.
>
> Here's a wonderful video about the process. You want the third one
> down (TieredMergePolicy) as that's the default.
>
>
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>
> Best,
> Erick
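(For readers who want to adjust the merge behaviour Erick describes, here is a sketch of how the default TieredMergePolicy can be tuned in solrconfig.xml for Solr 4.x/5.x; the values are illustrative, not recommendations.)

<indexConfig>
  <!-- illustrative values only; TieredMergePolicy is the default merge policy -->
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
    <double name="maxMergedSegmentMB">5120.0</double>
  </mergePolicy>
</indexConfig>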
>
> On Sun, Dec 20, 2015 at 8:26 PM, Rahul Ramesh  wrote:
> > Hi Erick,
> > We index several million documents per day, and we optimize every day
> > when the relative load is low. The reason we optimize is that we don't want
> > the index sizes to grow too large and auto optimize to kick in. When auto
> > optimize kicks in, it results in unpredictable performance because it is CPU
> > and IO intensive.
> >
> > In older Solr (4.2), when the segment size grew too large, insertion used
> > to fail. Have we seen this problem in SolrCloud?
> >
> > Also, we have observed that recovery takes a bit more time when the index
> > is not optimized. We don't have any quantitative measurement for this; it's
> > just an observation. Is this observation correct?
> >
> > If we optimize every day, the indexes will not be skewed, right?
> >
> > Please let me know if my understanding is correct.
> >
> > Regards,
> > Rahul
> >
> > On Mon, Dec 21, 2015 at 9:54 AM, Erick Erickson  >
> > wrote:
> >
> >> You'll probably have to shard before you get to the TB range. At that
> >> point, all the optimization is done individually on each shard so it
> >> really doesn't matter how many shards you have.
> >>
> >> Just issuing
> >> http://solr:port/solr/collection/update?optimize=true
> >>
> >> is sufficient, that'll forward the optimize command to all the shards
> >> in the collection.
> >>
> >> Best,
> >> Erick
> >>
> >> On Sun, Dec 20, 2015 at 8:19 PM, Zheng Lin Edwin Yeo
> >>  wrote:
> >> > Thanks for your information Erick.
> >> >
> >> > We have yet to decide how often we will update the index to include
> new
> >> > documents that came in. Let's say we update the index once a day, then
> >> when
> >> > the index is updated, we do the optimization (this will be done at
> >> night
> >> > when there are not many users using the system).
> >> > But my index size will probably grow quite big (potentially can go up
> to
> >> > more than 1TB in the future), so does that have to be taken into
> >> > consideration too?
> >> >
> >> > Regards,
> >> > Edwin
> >> >
> >> >
> >> > On 21 December 2015 at 12:12, Erick Erickson  >
> >> > wrote:
> >> >
> >> >> Much depends on how often the index is updated. If your index only
> >> >> changes, say, once a day then it's probably a good idea. If you're
> >> >> constantly updating your inde

Re: Pro and cons of using Solr Cloud vs standard Master Slave Replica

2016-01-11 Thread Rahul Ramesh
Please have a look at this post

https://support.lucidworks.com/hc/en-us/articles/201298317-What-is-SolrCloud-And-how-does-it-compare-to-master-slave-

We don't use a master/slave architecture; however, we use both Solr Cloud and
standalone Solr for our documents.

Indexing is a bit slower in cloud mode than in standalone, I think because of
replication. However, you will get faster query responses.

Solr Cloud also requires a slightly more elaborate setup, with ZooKeeper,
compared to master/slave or standalone.

However, once Solr Cloud is set up, it runs very smoothly and you don't have
to worry about performance or high availability.

Please check the post; it gives a detailed analysis and comparison of the two.

-Rahul


On Mon, Jan 11, 2016 at 4:58 PM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> Hi guys,
>
>
>
> a customer needs a comprehensive list of all the pros and cons of using standard
> master/slave replication vs. using Solr Cloud. I'm interested especially in
> query performance considerations, because in this specific situation the
> rate of new documents is really low, but the amount of data is about 50
> million documents, and the index size on disk for a single core is about
> 30 GB.
>
>
>
> Such an amount of data should be easily handled by master/slave replication
> with a single core replicated to a certain number of slaves, but we also need
> to evaluate the SolrCloud option, especially for fault tolerance.
>
>
>
> I’ve googled around but did not find anything really comprehensive, so
> I’m looking for real experience from you on the mailing list. :)
>
>
>
> Thanks in advance.
>
>
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
>


Understanding solr commit

2016-01-25 Thread Rahul Ramesh
We are facing an issue and finding it difficult to debug. We wanted to
understand how Solr commit works.
A background on our setup:
We have a 3-node Solr cluster running version 5.3.1. It's an index-heavy
use case; at peak load we index 400-500 documents/second.
We also want these documents to be visible as quickly as possible, hence we
run an external script which commits every 3 minutes.

Consider the three nodes as N1, N2, N3. Commit is a synchronous operation,
so we do not get control back until the commit operation is complete.

Consider the following scenario. Although it looks like a basic
distributed-systems scenario :-), we just wanted to eliminate this possibility.

Step 1: At time T1, a commit happens on node N1.
Step 2: At the same time T1, we search for all the documents inserted on node
N2.

My questions are:

1. Is commit an atomic operation? I mean, will the commit happen on all the
nodes at the same time?
2. Can we say that the search result will always contain the documents from
before the commit, or from after it? Or can it happen that we get new
documents from N1 and N2 but old documents (i.e., from before the commit) from N3?

Thank you,
Rahul
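(For reference, the explicit commit such a script issues is typically a plain HTTP request against the collection, which Solr forwards to all shards and replicas; the host and collection name below are placeholders.)

curl "http://localhost:8983/solr/coll1/update?commit=true&waitSearcher=true"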


Re: Understanding solr commit

2016-01-25 Thread Rahul Ramesh
Thanks for your replies.

A bit more detail about our setup.
The index size is close to 80 GB spread across 30 collections. The main
memory available is around 32 GB. We are always short of memory!
Unfortunately we cannot expand the memory, as the server motherboard
doesn't support it.

We tried Solr's autocommit features. However, we sometimes got
Java OOM exceptions, and when I started digging into it, somebody
suggested that I was not committing the collections often enough. So we started
committing the collections explicitly.

Please let me know if our approach is not correct.

*Emir*,
We are committing to the collection only once. We have nodes N1, N2 and N3,
and for a collection Coll1 the commit happens to N1/coll1 every 3 minutes;
we are not doing it for every node. We will remove _shard<>_replica<> and
use only the collection name to commit.

*Alessandro*,
We are using SolrCloud with a replication factor of 2 and the number of shards
either 2 or 3.

Thanks,
Rahul









On Mon, Jan 25, 2016 at 4:43 PM, Alessandro Benedetti  wrote:

> Let me answer in line :
>
> On 25 January 2016 at 11:02, Rahul Ramesh  wrote:
>
> > We are facing an issue and finding it difficult to debug. We wanted to
> > understand how Solr commit works.
> > A background on our setup:
> > We have a 3-node Solr cluster running version 5.3.1. It's an index-heavy
> > use case; at peak load we index 400-500 documents/second.
> > We also want these documents to be visible as quickly as possible, hence we
> > run an external script which commits every 3 minutes.
> >
>
> This is weird; why not use auto soft commit if you want visibility
> every 3 minutes?
> Is there any particular reason you trigger the commit from the client?
>
> >
> > Consider the three nodes as N1, N2, N3. Commit is a synchronous operation,
> > so we do not get control back until the commit operation is complete.
> >
> > Consider the following scenario. Although it looks like a basic scenario
> in
> > distributed system:-) but we just wanted to eliminate this possibility.
> >
> > step 1 : At time T1, commit happens to Node N1
> > step 2: At same time T1, we search for all the documents inserted in Node
> > N2.
> >
> > My question is
> >
> > 1. Is commit an atomic operation? I mean, will commit happen on all the
> > nodes at the same time?
> >
> Which kind of architecture of Solr are you using ? Are you using SolrCloud
> ?
>
> 2. Can we say that, the search result will always contain the documents
> > before commit / or after commit . Or can it so happen that we get new
> > documents from N1, N2 but old documents (i.e., before commit) from N3?
> >
> With a manual cluster it could faintly happen.
> In SolrCloud it should not, but I should double check the code !
>
> >
> > Thank you,
> > Rahul
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: Understanding solr commit

2016-01-25 Thread Rahul Ramesh
Can you give us a bit more details about Solr heap parameters?
Each node has 32 GB of RAM and we are using 8 GB for heap.
The index size on each node is around 80 GB.
Number of collections: 30


Also, can you give us info about the auto commit (both hard and soft) you used
when you experienced OOM?
<autoCommit> <maxDocs>15000</maxDocs> <maxTime>15000</maxTime> <openSearcher>false</openSearcher> </autoCommit>

Soft commit is not enabled.

-Rahul



On Mon, Jan 25, 2016 at 6:00 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Rahul,
> It is good that you commit only once, but I am not sure how external commits
> can do something auto commit cannot.
> Can you give us a bit more details about Solr heap parameters? Running Solr
> on the edge of OOM always risks starting a snowball effect and crashing the
> entire cluster. Also, can you give us info about the auto commit (both hard and
> soft) you used when you experienced OOM?
>
> Thanks,
> Emir
>
> On 25.01.2016 12:28, Rahul Ramesh wrote:
>
>> Thanks for your replies.
>>
>> A bit more detail about our setup.
>> The index size is close to 80Gb spread across 30 collections. The main
>> memory available is around 32Gb. We are always in short of memory!
>> Unfortunately we could not expand the memory as the server motherboard
>> doesnt support it.
>>
>> We tried with solr auto commit features. However, sometimes we were
>> getting
>> Java OOM exception and when I start digging more about it, somebody
>> suggested that I am not committing the collections often. So, we started
>> committing the collections explicitly.
>>
>> Please let me know if our approach is not correct.
>>
>> *Emir*,
>> We are committing to the collection only once. We have Node N1, N2 and N3
>> and for a collection Coll1, commit will happen to N1/coll1 every 3
>> minutes.
>> we are not doing it for every node. We will remove _shard<>_replica<> and
>> use only the collection name to commit.
>>
>> *Alessandro*,
>>
>> We are using Solr Cloud with replication factor of 2 and no of shards as
>> either 2 or 3.
>>
>> Thanks,
>> Rahul
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Jan 25, 2016 at 4:43 PM, Alessandro Benedetti <
>> abenede...@apache.org
>>
>>> wrote:
>>> Let me answer in line :
>>>
>>> On 25 January 2016 at 11:02, Rahul Ramesh  wrote:
>>>
>>> We are facing some issue and we are finding it difficult to debug the
>>>> problem. We wanted to understand how solr commit works.
>>>> A background on our setup:
>>>> We have  3 Node Solr Cluster running in version 5.3.1. Its a index heavy
>>>> use case. In peak load, we index 400-500 documents/second.
>>>> We also want these documents to be visible as quickly as possible, hence
>>>>
>>> we
>>>
>>>> run an external script which commits every 3 mins.
>>>>
>>>> This is weird, why not using the auto-soft commit if you want visibility
>>> every 3 minutes ?
>>> Is there any particular reason you trigger the commit from the client ?
>>>
> >>>> Consider the three nodes as N1, N2, N3. Commit is a synchronous
>>>>
>>> operation.
>>>
>>>> So, we will not get control till the commit operation is complete.
>>>>
>>>> Consider the following scenario. Although it looks like a basic scenario
>>>>
>>> in
>>>
>>>> distributed system:-) but we just wanted to eliminate this possibility.
>>>>
>>>> step 1 : At time T1, commit happens to Node N1
>>>> step 2: At same time T1, we search for all the documents inserted in
>>>> Node
>>>> N2.
>>>>
>>>> My question is
>>>>
>>>> 1. Is commit an atomic operation? I mean, will commit happen on all the
>>>> nodes at the same time?
>>>>
>>>> Which kind of architecture of Solr are you using ? Are you using
>>> SolrCloud
>>> ?
>>>
>>> 2. Can we say that, the search result will always contain the documents
>>>
>>>> before commit / or after commit . Or can it so happen that we get new
> >>>> documents from N1, N2 but old documents (i.e., before commit) from N3?
>>>>
>>>> With a manual cluster it could faintly happen.
>>> In SolrCloud it should not, but I should double check the code !
>>>
>>> Thank you,
>>>> Rahul
>>>>
>>>>
>>>
>>> --
>>> --
>>>
>>> Benedetti Alessandro
>>> Visiting card : http://about.me/alessandro_benedetti
>>>
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>>
>>> William Blake - Songs of Experience -1794 England
>>>
>>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: Understanding solr commit

2016-01-25 Thread Rahul Ramesh
Thank you Emir and Alessandro for the inputs. We use Sematext for monitoring.
We understand that Solr needs more memory, but unfortunately adding it means
moving to an altogether new range of servers.
As you say, eventually we will have to upgrade our servers.

Thanks,
Rahul
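(For reference, Emir's suggestion below, i.e. more frequent hard commits with openSearcher=false plus a soft commit every 3 minutes, would look roughly like this in solrconfig.xml; the values are illustrative, not the poster's actual settings.)

<autoCommit>
  <maxTime>15000</maxTime>           <!-- hard commit every 15s to flush to disk -->
  <openSearcher>false</openSearcher> <!-- do not open a new searcher on hard commit -->
</autoCommit>
<autoSoftCommit>
  <maxTime>180000</maxTime>          <!-- soft commit every 3 minutes for visibility -->
</autoSoftCommit>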


On Mon, Jan 25, 2016 at 6:32 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Rahul,
> It is hard to tell without seeing metrics, but an 8GB heap seems small for
> such a setup - e.g. with an indexing buffer of 32MB and 30 collections, that
> alone will eat almost 1GB of memory.
> About commits, you can set auto commit to be more frequent (keep
> openSearcher=false) and add soft commits every 3 min.
> What you need to tune is your heap and heap-related settings - indexing
> buffer, caches. Not sure what you use for monitoring Solr, but Sematext's
> SPM (http://sematext.com/spm) is one such tool that can give you info on how
> your Solr, JVM and host handle different loads. Such a tool can give you
> enough info to tune your Solr.
>
> Regards,
> Emir
>
>
> On 25.01.2016 13:42, Rahul Ramesh wrote:
>
>> Can you give us bit more details about Solr heap parameters.
>> Each node has 32Gb of RAM and we are using 8Gb for heap.
>> Index size in each node is around 80Gb
>> #of collections 30
>>
>>
>> Also can you give us info about auto commit (both hard and soft) you used
>> when experienced OOM.
>>  15000 15000
>> >
>>> false 
>>>
>> soft commit is not enabled.
>>
>> -Rahul
>>
>>
>>
>> On Mon, Jan 25, 2016 at 6:00 PM, Emir Arnautovic <
>> emir.arnauto...@sematext.com> wrote:
>>
>> Hi Rahul,
>>> It is good that you commit only once, but not sure how external commits
>>> can do something auto commit cannot.
>>> Can you give us bit more details about Solr heap parameters. Running Solr
>>> on the edge of OOM is always risk of starting snowball effect and
>>> crashing
>>> entire cluster. Also can you give us info about auto commit (both hard
>>> and
>>> soft) you used when experienced OOM.
>>>
>>> Thanks,
>>> Emir
>>>
>>> On 25.01.2016 12:28, Rahul Ramesh wrote:
>>>
>>> Thanks for your replies.
>>>>
>>>> A bit more detail about our setup.
>>>> The index size is close to 80Gb spread across 30 collections. The main
>>>> memory available is around 32Gb. We are always in short of memory!
>>>> Unfortunately we could not expand the memory as the server motherboard
>>>> doesnt support it.
>>>>
>>>> We tried with solr auto commit features. However, sometimes we were
>>>> getting
>>>> Java OOM exception and when I start digging more about it, somebody
>>>> suggested that I am not committing the collections often. So, we started
>>>> committing the collections explicitly.
>>>>
>>>> Please let me know if our approach is not correct.
>>>>
>>>> *Emir*,
>>>> We are committing to the collection only once. We have Node N1, N2 and
>>>> N3
>>>> and for a collection Coll1, commit will happen to N1/coll1 every 3
>>>> minutes.
>>>> we are not doing it for every node. We will remove _shard<>_replica<>
>>>> and
>>>> use only the collection name to commit.
>>>>
>>>> *Alessandro*,
>>>>
>>>> We are using Solr Cloud with replication factor of 2 and no of shards as
>>>> either 2 or 3.
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jan 25, 2016 at 4:43 PM, Alessandro Benedetti <
>>>> abenede...@apache.org
>>>>
>>>> wrote:
>>>>> Let me answer in line :
>>>>>
>>>>> On 25 January 2016 at 11:02, Rahul Ramesh  wrote:
>>>>>
>>>>> We are facing some issue and we are finding it difficult to debug the
>>>>>
>>>>>> problem. We wanted to understand how solr commit works.
>>>>>> A background on our setup:
>>>>>> We have  3 Node Solr Cluster running in version 5.3.1. Its a index
>>>>>> heavy
>>>>>> use case. In peak load, we index 400-500 documents/second.
>>>>>> We also want these documents to be visible as quickly as possible,
>>>>>> hence
>>>>>>
>

Re: Increasing Solr5 time out from 30 seconds while starting solr

2015-12-08 Thread Rahul Ramesh
Hi Debraj,
I don't think increasing the timeout will help. Are you sure Solr or another
program is not already running on port 8789? Please check the output of lsof -i :8789.

Regards,
Rahul
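(For reference, something along these lines can confirm what is holding the port and let you watch Solr start in the foreground; the port matches the one in the original command.)

lsof -i :8789                # show which process, if any, is listening on the port
bin/solr stop -p 8789        # stop a stray Solr instance bound to that port
bin/solr start -p 8789 -f    # start in the foreground to watch the cores load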

On Tue, Dec 8, 2015 at 11:58 PM, Debraj Manna 
wrote:

> Can someone help me on this?
> On Dec 7, 2015 7:55 PM, "D"  wrote:
>
> > Hi,
> >
> > Many times while starting Solr I see the message below, and then Solr is
> > not reachable.
> >
> > debraj@boutique3:~/solr5$ sudo bin/solr start -p 8789
> > Waiting to see Solr listening on port 8789 [-]  Still not seeing Solr
> listening on 8789 after 30 seconds!
> >
> > However, when I try to start Solr again with the same command, it says
> > that *"solr is already running on port 8789. Try using a
> > different port with -p"*.
> >
> > I have two cores in my local set-up. I am guessing this is happening
> > because one of the cores is a little big, so Solr is timing out while
> > loading it. If I take that core out of Solr, then everything
> > works fine.
> >
> > Can someone let me know how I can increase this timeout value from the
> > default 30 seconds?
> >
> > I am using Solr 5.2.1 on Debian 7.
> >
> > Thanks,
> >
> >
>


Re: Moving to SolrCloud, specifying dataDir correctly

2015-12-14 Thread Rahul Ramesh
We recently moved our data from a magnetic drive to SSD. We run Solr in cloud
mode. Only the data is stored on the drive; the configuration is stored in ZK.
We start Solr using the -s option to specify the data directory.
Command to start Solr:
./bin/solr start -c -h <host> -p <port> -z <zkHosts> -s <solrHome>

We followed these steps to migrate the data (a rough sketch is given below):

1. Stop all new insertions.
2. Copy the Solr data to the new location.
3. Restart the server with the -s option pointing to the new Solr directory.
4. We have a 3-node Solr cluster; the restarted server will get in sync
with the other two servers.
5. Repeat this procedure for the other two servers.

We used a similar procedure to upgrade from 5.2.1 to 5.3.1.
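(A rough illustration of the steps above for one node; the host name, port, ZooKeeper string and paths are placeholders.)

bin/solr stop -p 8983
rsync -a /old/disk/solr/ /data/solr/      # copy the existing index data to the SSD
bin/solr start -c -h host1 -p 8983 -z zk1:2181,zk2:2181,zk3:2181 -s /data/solr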





On Tue, Dec 15, 2015 at 5:07 AM, Jeff Wartes  wrote:

>
> Don’t set solr.data.dir. Instead, set the install dir. Something like:
> -Dsolr.solr.home=/data/solr
> -Dsolr.install.dir=/opt/solr
>
> I have many solrcloud collections, and separate data/install dirs, and
> I’ve never had to do anything with manual per-collection or per-replica
> data dirs.
>
> That said, it’s been a while since I set this up, and I may not remember
> all the pieces.
> You might need something like this too, for example:
>
> -Djetty.home=/opt/solr/server
>
>
> On 12/14/15, 3:11 PM, "Erick Erickson"  wrote:
>
> >Currently, it'll be a little tedious but here's what you can do (going
> >partly from memory)...
> >
> >When you create the collection, specify the special value EMPTY for
> >createNodeSet (Solr 5.3+).
> >Use ADDREPLICA to add each individual replica. When you do this, you
> >can add a dataDir for
> >each individual replica and thus keep them separate, i.e. for a
> >particular box the first
> >replica would get /data/solr/collection1_shard1_replica1, the second
> >/data/solr/collection1_shard2_replica1 and so forth.
> >
> >If you don't have Solr 5.3+, you can still do the same thing, except
> >you create your collection letting
> >the replicas fall where they will. Then do the ADDREPLICA as above.
> >When that's all done,
> >DELETEREPLICA for the original replicas.
> >
> >Best,
> >Erick
> >
> >On Mon, Dec 14, 2015 at 2:21 PM, Tom Evans 
> >wrote:
> >> On Mon, Dec 14, 2015 at 1:22 PM, Shawn Heisey 
> >>wrote:
> >>> On 12/14/2015 10:49 AM, Tom Evans wrote:
>  When I tried this in SolrCloud mode, specifying
>  "-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine
>  for the first collection, but then the second collection tried to use
>  the same directory to store its index, which obviously failed. I fixed
>  this by changing solrconfig.xml in each collection to specify a
>  specific directory, like so:
> 
>    <dataDir>${solr.data.dir:}products</dataDir>
> 
>  Looking back after the weekend, I'm not a big fan of this. Is there a
>  way to add a core.properties to ZK, or a way to specify
>  core.baseDatadir on the command line, or just a better way of handling
>  this that I'm not aware of?
> >>>
> >>> Since you're running SolrCloud, just let Solr handle the dataDir, don't
> >>> try to override it.  It will default to "data" relative to the
> >>> instanceDir.  Each instanceDir is likely to be in the solr home.
> >>>
> >>> With SolrCloud, your cores will not contain a "conf" directory (unless
> >>> you create it manually), therefore the on-disk locations will be *only*
> >>> data, there's not really any need to have separate locations for
> >>> instanceDir and dataDir.  All active configuration information for
> >>> SolrCloud is in zookeeper.
> >>>
> >>
> >> That makes sense, but I guess I was asking the wrong question :)
> >>
> >> We have our SSDs mounted on /data/solr, which is where our indexes
> >> should go, but our solr install is on /opt/solr, with the default solr
> >> home in /opt/solr/server/solr. How do we change where the indexes get
> >> put so they end up on the fast storage?
> >>
> >> Cheers
> >>
> >> Tom
>
>
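(A rough sketch of the Collections API calls Erick describes above: createNodeSet=EMPTY on CREATE, then ADDREPLICA with an explicit dataDir. The collection name, config name, node name and paths are placeholders, and the exact parameters should be checked against the reference guide for the Solr version in use.)

curl "http://host1:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=1&createNodeSet=EMPTY&collection.configName=myconf"
curl "http://host1:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard1&node=host1:8983_solr&dataDir=/data/solr/collection1_shard1_replica1"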


Re: solrcloud used a lot of memory and memory keep increasing during long time run

2015-12-15 Thread Rahul Ramesh
You should actually decrease the Solr heap size. Let me explain a bit.

Solr requires very little heap memory for its own operation and more memory for
keeping index data in main memory. This is because Solr uses mmap to access the
index files.
Please check the link
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html to
understand how Solr operates on files.

Solr has the typical Java garbage-collection problem once you set the heap size
to a large value: it will have unpredictable pauses due to GC. The amount of
heap memory required is difficult to tell in advance. However, the way we tuned
this parameter was to set it to a low value and increase it by 1 GB whenever
an OOM was thrown.

Please check the problem of having large Java Heap

http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap


Just for your reference, in our production setup we have around
60 GB of data per node spread across 25 collections. We have configured 8 GB as
heap and leave the rest of the memory to the OS to manage. We do around 1000
(search + insert) operations/second on the data.

I hope this helps.

Regards,
Rahul
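(As an illustration of the trade-off Rahul describes: the poster's Tomcat-based CDH deployment sets -Xms/-Xmx directly, but with the standard bin/solr script a modest heap on a 64GB box would be set roughly as below, leaving the remaining RAM to the OS page cache for the mmapped index files. The right heap size has to be found by experiment.)

bin/solr start -c -m 8g -z zk1:2181,zk2:2181,zk3:2181
# -m sets both -Xms and -Xmx; the RAM left over is used by the OS to cache index files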



On Tue, Dec 15, 2015 at 4:33 PM, zhenglingyun  wrote:

> Hi, list
>
> I’m new to Solr. Recently I encountered a “memory leak” problem with
> SolrCloud.
>
> I have two 64GB servers running a SolrCloud cluster. In the SolrCloud, I have
> one collection with about 400k docs. The index size of the collection is about
> 500MB. Memory for Solr is 16GB.
>
> Following is "ps aux | grep solr” :
>
> /usr/java/jdk1.7.0_67-cloudera/bin/java
> -Djava.util.logging.config.file=/var/lib/solr/tomcat-deployment/conf/logging.properties
> -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
> -Djava.net.preferIPv4Stack=true -Dsolr.hdfs.blockcache.enabled=true
> -Dsolr.hdfs.blockcache.direct.memory.allocation=true
> -Dsolr.hdfs.blockcache.blocksperbank=16384
> -Dsolr.hdfs.blockcache.slab.count=1 -Xms16608395264 -Xmx16608395264
> -XX:MaxDirectMemorySize=21590179840 -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
> -Xloggc:/var/log/solr/gc.log
> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -DzkHost=
> bjzw-datacenter-hadoop-160.d.yourmall.cc:2181,
> bjzw-datacenter-hadoop-163.d.yourmall.cc:2181,
> bjzw-datacenter-hadoop-164.d.yourmall.cc:2181/solr
> -Dsolr.solrxml.location=zookeeper -Dsolr.hdfs.home=hdfs://datacenter/solr
> -Dsolr.hdfs.confdir=/var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/hadoop-conf
> -Dsolr.authentication.simple.anonymous.allowed=true
> -Dsolr.security.proxyuser.hue.hosts=*
> -Dsolr.security.proxyuser.hue.groups=* -Dhost=
> bjzw-datacenter-solr-15.d.yourmall.cc -Djetty.port=8983 -Dsolr.host=
> bjzw-datacenter-solr-15.d.yourmall.cc -Dsolr.port=8983
> -Dlog4j.configuration=file:///var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/log4j.properties
> -Dsolr.log=/var/log/solr -Dsolr.admin.port=8984
> -Dsolr.max.connector.thread=1 -Dsolr.solr.home=/var/lib/solr
> -Djava.net.preferIPv4Stack=true -Dsolr.hdfs.blockcache.enabled=true
> -Dsolr.hdfs.blockcache.direct.memory.allocation=true
> -Dsolr.hdfs.blockcache.blocksperbank=16384
> -Dsolr.hdfs.blockcache.slab.count=1 -Xms16608395264 -Xmx16608395264
> -XX:MaxDirectMemorySize=21590179840 -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
> -Xloggc:/var/log/solr/gc.log
> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -DzkHost=
> bjzw-datacenter-hadoop-160.d.yourmall.cc:2181,
> bjzw-datacenter-hadoop-163.d.yourmall.cc:2181,
> bjzw-datacenter-hadoop-164.d.yourmall.cc:2181/solr
> -Dsolr.solrxml.location=zookeeper -Dsolr.hdfs.home=hdfs://datacenter/solr
> -Dsolr.hdfs.confdir=/var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/hadoop-conf
> -Dsolr.authentication.simple.anonymous.allowed=true
> -Dsolr.security.proxyuser.hue.hosts=*
> -Dsolr.security.proxyuser.hue.groups=* -Dhost=
> bjzw-datacenter-solr-15.d.yourmall.cc -Djetty.port=8983 -Dsolr.host=
> bjzw-datacenter-solr-15.d.yourmall.cc -Dsolr.port=8983
> -Dlog4j.configuration=file:///var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/log4j.properties
> -Dsolr.log=/var/log/solr -Dsolr.admin.port=8984
> -Dsolr.max.connector.thread=1 -Dsolr.solr.home=/var/lib/solr
> -Djava.endorsed.dirs=/usr/lib/bigtop-tomcat/endorsed -classpath
> /usr/lib/bigtop-tomcat/bin/bootstrap.jar
> -Dcatalina.base=/var/lib/solr/tomcat-deployment
> -Dcatalina.home=/usr/lib/bigtop-tomcat -Djava.io.tmpdir=/var/lib/solr/
> org.apache.catalina.startup.Bootstrap start
>
>
> solr version is solr4.4.0-cdh5.3.0
> jdk version is 1.7.0_67
>
> The soft commit time is 1.5s, and we have a real-time indexing/partial-update
> rate of about 100 docs per second.
>
> When fresh started, Solr will use ab