Sure, sharding can give you accurate faceting, although do note there
are nuances: JSON faceting can occasionally be inexact, although
there are JIRAs being worked on to correct this.

"Traditional" faceting has a refinement phase that gets accurate counts.

But the net-net is that I believe your merging is just the first of
many problems you'll encounter with indexes this size and starting
over with a reasonable sharding strategy is likely the fastest path to
what you want.

Merging indexes isn't going to work for you, though; you'll have to
create a new collection and reindex everything. As a straw-man
recommendation, I'd put no more than 200G on each shard in terms of
index size.
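
As a rough sketch of what that straw-man number implies, here is the
arithmetic in Python. It assumes the ~3.5TB total mentioned further down
the thread, treated as 3500G. The helper names, the base URL, and the
collection name "merged" are made up for illustration; the URL uses the
Collections API CREATE action, and depending on your setup you may also
need to pass collection.configName.

```python
import math
from urllib.parse import urlencode


def shards_needed(total_index_gb, max_shard_gb=200):
    """Smallest shard count that keeps each shard at or under max_shard_gb."""
    return math.ceil(total_index_gb / max_shard_gb)


def create_collection_url(base_url, name, num_shards, replication_factor=1):
    """Build a Collections API CREATE request URL (hypothetical helper)."""
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    })
    return f"{base_url}/admin/collections?{params}"


# Assumption: ~3.5TB of total index, expressed as 3500G.
num_shards = shards_needed(3500)
print(num_shards)
print(create_collection_url("http://localhost:8983/solr", "merged", num_shards))
```

Treat the result as a starting point only; the right shard size depends
on your documents, queries, and hardware, so test before committing.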

Best,
Erick

On Wed, Nov 22, 2017 at 5:19 PM, Zheng Lin Edwin Yeo
<edwinye...@gmail.com> wrote:
> I'm doing the merging on the SSD drive, the speed should be ok?
>
> We need to merge because the data are indexed in two different collections,
> and we need them to be under the same collection, so that we can do things
> like faceting more accurately.
> Will sharding alone achieve this? Or do we have to merge first before we do
> the sharding?
>
> Regards,
> Edwin
>
> On 23 November 2017 at 01:32, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Really, let's back up here though. This sure seems like an XY problem.
>> You're merging indexes that will eventually be something on the order
>> of 3.5TB. I claim that an index of that size is very difficult to work
>> with effectively. _Why_ do you want to do this? Do you have any
>> evidence that you'll be able to effectively use it?
>>
>> And Shawn tells you that the result will be one large segment. If you
>> replace documents in that index, it will consist of around 3.4975T
>> wasted space before the segment is merged, see:
>> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
>>
>> You already know that merging is extremely painful. This sure seems
>> like a case where the evidence is mounting that you would be far
>> better off sharding and _not_ merging.
>>
>> FWIW,
>> Erick
>>
>> On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>> > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
>> >> I am using the IndexMergeTool from Solr, from the command below:
>> >>
>> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
>> >> org.apache.lucene.misc.IndexMergeTool
>> >>
>> >> The heap size is 32GB. There are more than 20 million documents in the
>> >> two cores.
>> >
>> > I have looked at IndexMergeTool, and confirmed that it does its job in
>> > exactly the same way that Solr does an optimize, so I would still expect
>> > a rate of 20 to 30 MB per second, unless it's running on REALLY old
>> > hardware that can't transfer data that quickly.
>> >
>> > Thanks,
>> > Shawn
>> >
>>
