Hi Erick,

Yes, we are planning to shard when we upgrade to the newer Solr 7.1.0, and
we will probably re-index everything then. But we are currently waiting for
certain issues with indexing EML files in Solr 7.1.0 to be addressed first,
such as https://issues.apache.org/jira/browse/SOLR-11622, which currently
gives the following error when indexing EML files:

java.lang.NoClassDefFoundError: org/apache/james/mime4j/stream/MimeConfig$Builder


Meanwhile, as we are still on Solr 6.5.1, we plan to just merge the index,
so that the customer can continue to access the current index. The
re-indexing is also likely to take 3 to 4 weeks, given the size of the
data. Also, is there any way to shard our current 3.5TB index, or is
re-indexing the only way?
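
As a rough sanity check on the numbers in this thread, here is a minimal
sketch of the arithmetic. It assumes Erick's straw-man limit of 200G per
shard and Shawn's expected rate of 20 to 30 MB per second (both quoted
below), takes 1TB as 1000GB, and assumes the whole 3.5TB is written once at
that rate. The function names are made up for illustration only.

```python
import math

INDEX_SIZE_GB = 3500      # 3.5TB index size from this thread (assuming 1TB = 1000GB)
MAX_SHARD_SIZE_GB = 200   # Erick's straw-man upper bound per shard


def shards_needed(index_gb, max_shard_gb):
    """Smallest number of shards that keeps each shard at or below the limit."""
    return math.ceil(index_gb / max_shard_gb)


def merge_hours(index_gb, rate_mb_per_s):
    """Hours needed to write index_gb once at a sustained rate (1GB = 1000MB)."""
    seconds = index_gb * 1000 / rate_mb_per_s
    return seconds / 3600


print("shards:", shards_needed(INDEX_SIZE_GB, MAX_SHARD_SIZE_GB))
for rate in (20, 30):
    hours = round(merge_hours(INDEX_SIZE_GB, rate), 1)
    print(f"merge at {rate} MB/s: about {hours} hours")
```

Under these assumptions that is at least 18 shards, and roughly 32 to 49
hours for the merge itself, not counting any other overhead.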

Regards,
Edwin


On 23 November 2017 at 09:31, Erick Erickson <erickerick...@gmail.com>
wrote:

> Sure, sharding can give you accurate faceting, although do note there
> are nuances: JSON faceting can occasionally be inexact, although
> there are JIRAs being worked on to correct this.
>
> "traditional" faceting has a refinement phase that gets accurate counts.
>
> But the net-net is that I believe your merging is just the first of
> many problems you'll encounter with indexes this size and starting
> over with a reasonable sharding strategy is likely the fastest path to
> what you want.
>
> Merging indexes isn't going to work for you, though; you'll have to
> create a new collection and reindex everything. As a straw-man
> recommendation, I'd put no more than 200G on each shard in terms of
> index size.
>
> Best,
> Erick
>
> On Wed, Nov 22, 2017 at 5:19 PM, Zheng Lin Edwin Yeo
> <edwinye...@gmail.com> wrote:
> > I'm doing the merging on an SSD drive, so the speed should be OK?
> >
> > We need to merge because the data are indexed in two different
> > collections, and we need them to be under the same collection, so that
> > we can do things like faceting more accurately.
> > Will sharding alone achieve this? Or do we have to merge first before we
> > do the sharding?
> >
> > Regards,
> > Edwin
> >
> > On 23 November 2017 at 01:32, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Really, let's back up here though. This sure seems like an XY problem.
> >> You're merging indexes that will eventually be something on the order
> >> of 3.5TB. I claim that an index of that size is very difficult to work
> >> with effectively. _Why_ do you want to do this? Do you have any
> >> evidence that you'll be able to effectively use it?
> >>
> >> And Shawn tells you that the result will be one large segment. If you
> >> replace documents in that index, it will consist of around 3.4975T
> >> wasted space before the segment is merged, see:
> >> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
> >>
> >> You already know that merging is extremely painful. This sure seems
> >> like a case where the evidence is mounting that you would be far
> >> better off sharding and _not_ merging.
> >>
> >> FWIW,
> >> Erick
> >>
> >> On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> >> > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
> >> >> I am using the IndexMergeTool from Solr, with the command below:
> >> >>
> >> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
> >> >> org.apache.lucene.misc.IndexMergeTool
> >> >>
> >> >> The heap size is 32GB. There are more than 20 million documents in the
> >> >> two cores.
> >> >
> >> > I have looked at IndexMergeTool, and confirmed that it does its job in
> >> > exactly the same way that Solr does an optimize, so I would still
> >> > expect a rate of 20 to 30 MB per second, unless it's running on REALLY
> >> > old hardware that can't transfer data that quickly.
> >> >
> >> > Thanks,
> >> > Shawn
> >> >
> >>
>
