On 12/4/2011 12:41 AM, Ted Dunning wrote:
> Read the papers I referred to. They describe how to search a fairly
> enormous corpus with an 8GB in-memory index (and no disk cache at all).
They would seem to indicate moving away from Solr. While that would not
be entirely out of the question, I don't relish coming up with a whole
new system from scratch, one part of which would mean rewriting the
build system for a third time.
>> I have 16 processor cores available for each index chain (two servers). If
>> I set aside one for the distributed search itself and one for the
>> incremental (that small 3.5 to 7 day shard), it sounds like my ideal
>> numShards from Solr's perspective is 14. I have some fear that my database
>> server will fall over under the load of 14 DB connections during a full
>> index rebuild, though. Do you have any other thoughts for me?
> Off-line indexing from a flat-file dump? My guess is that you can dump to
> disk from the db faster than you can index and a single dumping thread
> might be faster than many.
What I envision when I read this is a single pass from the database
into one file, which then gets split into a number of pieces, one per
shard, and those pieces get imported simultaneously, one into each
shard's build core. Is that what you were thinking?
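If so, the "simultaneously" part ought to be the easy bit: a DIH
full-import command returns immediately and runs in the background, so
a plain loop can start all of the imports at once. Here is a rough
SolrJ sketch (host and core names are invented, and it assumes each
build core already has a dataimport config pointed at its own piece of
the dump):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

// Sketch only: start a DIH full-import on every build core at once.
// Host/core names are invented; each build core is assumed to have a
// /dataimport handler configured to read its own piece of the dump.
public class ParallelImport {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 14; i++) {
            CommonsHttpSolrServer core = new CommonsHttpSolrServer(
                "http://idx1.example.com:8983/solr/build" + i);
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "full-import");
            QueryRequest req = new QueryRequest(params);
            req.setPath("/dataimport");   // full-import returns right away,
            core.request(req);            // so this loop starts all 14 imports
        }
    }
}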
It looks like there is a way to have mysql output XML; would that be a
reasonable way to go about this? I know a little bit about handling XML
in Perl, but only by reading the entire file. I need a very speedy way
to read and write (split) a large XML file, preferably in Java.
mysql -u user -p -h dbhost db --quick --xml -e 'SELECT * FROM view' > view.xml
When I ran this command, it took 64 minutes (about a third of the total
time using the data import handler) and produced an XML file of
176,632,084 KB (about 169 GB) containing over 65 million documents.
The view includes only the fields necessary to build the Solr index;
all other fields are excluded. The total distributed index size is
about 60 GB right now. I'll be interested to see how long it takes to
split and import the XML.
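
For the split itself, here is roughly what I have in mind: a sketch
using the standard javax.xml.stream (StAX) API that copies <row>
elements round-robin into one output file per shard. The file names
and the shard count of 14 are placeholders.

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Sketch: split a "mysql --xml" dump into one file per shard by copying
// <row> elements round-robin.  Input/output file names and the shard
// count are placeholders.
public class SplitMysqlXml {
    public static void main(String[] args) throws Exception {
        final int shards = 14;
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();

        XMLEventReader reader =
            inFactory.createXMLEventReader(new FileInputStream("view.xml"));

        OutputStream[] streams = new OutputStream[shards];
        XMLEventWriter[] writers = new XMLEventWriter[shards];
        for (int i = 0; i < shards; i++) {
            streams[i] = new BufferedOutputStream(
                new FileOutputStream("view-shard" + i + ".xml"));
            writers[i] = outFactory.createXMLEventWriter(streams[i], "UTF-8");
            writers[i].add(events.createStartDocument("UTF-8", "1.0"));
            // Note: if the dump uses xsi:nil for NULL columns, the xsi
            // namespace declaration from the original root would need to
            // be carried over here as well.
            writers[i].add(events.createStartElement("", "", "resultset"));
        }

        int target = 0;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                && "row".equals(event.asStartElement().getName().getLocalPart())) {
                // Copy this entire <row> element to the current target file.
                XMLEventWriter out = writers[target];
                out.add(event);
                int depth = 1;
                while (depth > 0) {
                    XMLEvent e = reader.nextEvent();
                    if (e.isStartElement()) depth++;
                    if (e.isEndElement()) depth--;
                    out.add(e);
                }
                target = (target + 1) % shards;   // round-robin across shards
            }
        }
        reader.close();

        for (int i = 0; i < shards; i++) {
            writers[i].add(events.createEndElement("", "", "resultset"));
            writers[i].add(events.createEndDocument());
            writers[i].close();
            streams[i].close();   // the writer does not close the stream itself
        }
    }
}

StAX only ever holds one event in memory at a time, which is the main
reason I would reach for it rather than any DOM-style parse of a 169 GB
file.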
Thanks,
Shawn