Solr Cloud Segments and Merging Issues

2014-03-13 Thread Varun Rajput
I am using Solr 4.6.0 in cloud mode. The setup is 4 shards, 1 on each
machine, with a ZooKeeper quorum running on 3 other machines. The index size
on each shard is about 15GB. I noticed that the number of segments in the
second shard was 42, while in the remaining shards it was between 25-30.

I am basically trying to get the number of segments down to a reasonable
number, like 4 or 5, in order to improve the search time. We do have some
documents indexed every day, so we don't want to run an optimize every day.

The merge factor with TieredMergePolicy is only the number of segments
per tier. Assuming there were 5 tiers (mergeFactor of 10) in the second
shard, I tried clearing the index, reducing the mergeFactor, and re-indexing
the same data in the same manner, multiple times, but I don't see a pattern
of reduction in the number of segments.

No mergeFactor set  => 42 segments
mergeFactor=5  =>   22 segments
mergeFactor=2  =>   22 segments

Below is the simple configuration, as specified in the documentation, that I
am using for merging:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">2</int>
  <int name="segmentsPerTier">2</int>
</mergePolicy>

What is the best way in which I can use merging to restrict the number of
segments being formed?

Also, we are moving from Solr 1.4 (Master-Slave) to Solr 4.6.0 Cloud and
see a great increase in response time from about 18ms to 150ms. Is this a
known issue? Is there no way to reduce the response time? In the MBeans,
the individual cores show the /select handler attributes having search
times around 8ms. What is it that causes the overall response time to
increase so much?

-Varun


Re: Solr Cloud Segments and Merging Issues

2014-03-13 Thread Varun Rajput
Hi Remi,

I read your post and, like you, I have also identified that running Solr
4.6.0 in cloud mode results in higher response times, which has something to
do with merging documents from the various shards.

Looking at the source code, we couldn't understand why merging the documents
would take so much time. If you do find a solution, please share it with me.

Thanks,
Varun



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-Segments-and-Merging-Issues-tp4123316p4123472.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Cloud Segments and Merging Issues

2014-03-13 Thread Varun Rajput
Hey Shawn,

> The config with the old policy used to be the literal name
> "mergeFactor".  With TieredMergePolicy, there are now three settings
> that must be changed in order to actually be the same as what
> mergeFactor used to do. The following config snippet is the equivalent
> config to a mergeFactor of 10, so these are the default settings.  If
> you don't change all three (especially segmentsPerTier), then you are
> not actually changing the "mergeFactor".
> 
> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>   <int name="maxMergeAtOnce">10</int>
>   <int name="segmentsPerTier">10</int>
>   <int name="maxMergeAtOnceExplicit">30</int>
> </mergePolicy>

I tried specifying all these configurations, but it still doesn't work as
expected. I even tried setting maxMergedSegmentMB to 20GB instead of the
default 5GB. This is the config I tried:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">2</int>
  <int name="segmentsPerTier">2</int>
  <int name="maxMergeAtOnceExplicit">100</int>
  <double name="maxMergedSegmentMB">2199023220</double>
</mergePolicy>


> With newer Solr versions, there is not as much speedup to be gained from
> fewer segments as before.  There *is* a noticeable change, but it is no
> longer the night/day difference it used to be.

We did a performance test on a normal and an optimized index and saw a
considerable improvement (response time almost halved) with the optimized
index. That's the reason we want to reduce our number of segments, as we have
a large index with a very small amount of updates.

> Assuming that there are no system resource limitations (especially RAM),
> a distributed index is slower than a single index of the same total
> size.  Where distributed indexes have an edge is in very large indexes
> or indexes with a moderately high query rate -- by applying more total
> RAM and/or CPU resources to the problem.  If your index already fits
> entirely into the OS disk cache, or you are sending a handful of test
> queries, you won't notice any performance benefit from going distributed.

We have a large index which won't fit in memory, and we need to support high
query rates.

> For SUPER high query rates, you need more replicas.  More shards might
> actually make performance go down in this situation.

This is something we identified while testing. We had to reduce the number of
shards to a smaller but still reasonable number that will allow us to grow
the size of the data in the future.

-Varun





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-Segments-and-Merging-Issues-tp4123316p4123489.html
Sent from the Solr - User mailing list archive at Nabble.com.


A field-wide remove duplicate tokens filter

2014-12-17 Thread Varun Rajput
The org.apache.solr.analysis.RemoveDuplicatesTokenFilter, as per its 
description, "Filters out any tokens which are at the same logical position in 
the tokenstream as a previous token with the same text."
A very useful filter would be one that filters out duplicate tokens throughout 
the field, irrespective of the logical position of the token. Does something 
like this already exist, or is it planned for inclusion in a coming release?
I have an implementation of this in one of my projects and can contribute it 
if the community finds it useful as well.
Best,
Varun
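
For reference, here is a minimal standalone sketch of the idea. This is not
the poster's actual implementation: a real filter would extend Lucene's
TokenFilter and read the term text via CharTermAttribute, but here a plain
Iterator<String> stands in for the token stream so the core
first-occurrence-wins logic is runnable without the Lucene jars. The class
and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Sketch of a field-wide duplicate-token filter: unlike
// RemoveDuplicatesTokenFilter, which only drops duplicates at the same
// logical position, this keeps the first occurrence of each token text
// anywhere in the field and discards every later repeat.
public class FieldWideDedup {
    public static List<String> dedupeField(Iterator<String> tokens) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        while (tokens.hasNext()) {
            String term = tokens.next();
            // Set.add returns false when the term was already seen,
            // so only first occurrences are kept.
            if (seen.add(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> in = List.of("quick", "brown", "quick", "fox", "brown");
        System.out.println(dedupeField(in.iterator())); // [quick, brown, fox]
    }
}
```

In a real TokenFilter the `seen` set would be cleared in `reset()` so each
field value starts fresh.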