Hi Erick,

Yes, we are planning to shard when we upgrade to the newer Solr 7.1.0, and will probably re-index everything. But for now we are waiting for certain issues with indexing EML files in Solr 7.1.0 to be addressed first, such as this JIRA, https://issues.apache.org/jira/browse/SOLR-11622, which currently gives the following error when indexing EML files:

java.lang.NoClassDefFoundError: org/apache/james/mime4j/stream/MimeConfig$Builder

Meanwhile, as we are still on Solr 6.5.1, we plan to just merge the indexes, so that the customer can continue to access the current index. The re-indexing is also likely to take 3 to 4 weeks, given the size of the data.

Also, is there any way to shard our current index of 3.5TB, or is re-indexing the only way?

Regards,
Edwin

On 23 November 2017 at 09:31, Erick Erickson <erickerick...@gmail.com> wrote:

> Sure, sharding can give you accurate faceting, although do note there
> are nuances, JSON faceting can occasionally be not exact, although
> there are JIRAs being worked on to correct this.
>
> "traditional" faceting has a refinement phase that gets accurate counts.
>
> But the net-net is that I believe your merging is just the first of
> many problems you'll encounter with indexes this size and starting
> over with a reasonable sharding strategy is likely the fastest path to
> what you want.
>
> Merging indexes isn't going to work for you though, you'll have to
> create a new collection and reindex everything. As a straw-man
> recommendation, I'd put no more than 200G on each shard in terms of
> index size.
>
> Best,
> Erick
>
> On Wed, Nov 22, 2017 at 5:19 PM, Zheng Lin Edwin Yeo
> <edwinye...@gmail.com> wrote:
> > I'm doing the merging on the SSD drive, the speed should be ok?
> >
> > We need to merge because the data are indexed in two different
> > collections, and we need them to be under the same collection, so
> > that we can do things like faceting more accurately.
> > Will sharding alone achieve this? Or do we have to merge first
> > before we do the sharding?
> >
> > Regards,
> > Edwin
> >
> > On 23 November 2017 at 01:32, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Really, let's back up here though. This sure seems like an XY problem.
> >> You're merging indexes that will eventually be something on the order
> >> of 3.5TB. I claim that an index of that size is very difficult to work
> >> with effectively. _Why_ do you want to do this? Do you have any
> >> evidence that you'll be able to effectively use it?
> >>
> >> And Shawn tells you that the result will be one large segment. If you
> >> replace documents in that index, it will consist of around 3.4975T
> >> wasted space before the segment is merged, see:
> >> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
> >>
> >> You already know that merging is extremely painful. This sure seems
> >> like a case where the evidence is mounting that you would be far
> >> better off sharding and _not_ merging.
> >>
> >> FWIW,
> >> Erick
> >>
> >> On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> >> > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
> >> >> I am using the IndexMergeTool from Solr, from the command below:
> >> >>
> >> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
> >> >> org.apache.lucene.misc.IndexMergeTool
> >> >>
> >> >> The heap size is 32GB. There are more than 20 million documents in
> >> >> the two cores.
> >> >
> >> > I have looked at IndexMergeTool, and confirmed that it does its job in
> >> > exactly the same way that Solr does an optimize, so I would still
> >> > expect a rate of 20 to 30 MB per second, unless it's running on REALLY
> >> > old hardware that can't transfer data that quickly.
> >> >
> >> > Thanks,
> >> > Shawn
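
For reference, the figures quoted in this thread can be checked with a rough back-of-the-envelope calculation. The Python sketch below is only an estimate under stated assumptions: decimal units (1 TB = 1000 GB, 1 GB = 1000 MB), the straw-man ceiling of 200G per shard, and the 20 to 30 MB per second merge rate mentioned above. The function names are illustrative and are not part of Solr or Lucene.

```python
import math

def shards_needed(index_gb, max_shard_gb=200.0):
    """Smallest shard count that keeps every shard at or under max_shard_gb."""
    return math.ceil(index_gb / max_shard_gb)

def merge_hours(index_gb, rate_mb_per_s):
    """Hours needed to write index_gb of data at a sustained rate (decimal units)."""
    seconds = index_gb * 1000 / rate_mb_per_s
    return seconds / 3600

# Assumption: the 3.5TB index is treated as 3500 GB.
INDEX_GB = 3500.0

print("shards at 200G each:", shards_needed(INDEX_GB))
print("merge hours at 20 MB/s:", round(merge_hours(INDEX_GB, 20), 1))
print("merge hours at 30 MB/s:", round(merge_hours(INDEX_GB, 30), 1))
```

Under these assumptions the index needs at least 18 shards, and a single merge pass over 3.5TB would take very roughly 32 to 49 hours of write time, before accounting for replicas or any other overhead.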