On 12/4/2011 12:41 AM, Ted Dunning wrote:
Read the papers I referred to.  They describe how to search fairly enormous
corpus with an 8GB in-memory index (and no disk cache at all).

They would seem to indicate moving away from Solr. While that would not be entirely out of the question, I don't relish coming up with a whole new system from scratch, one part of which will mean rewriting the build system a third time.

I have 16 processor cores available for each index chain (two servers).  If
I set aside one for the distributed search itself and one for the
incremental (that small 3.5 to 7 day shard), it sounds like my ideal
numShards from Solr's perspective is 14.  I have some fear that my database
server will fall over under the load of 14 DB connections during a full
index rebuild, though.  Do you have any other thoughts for me?

Off-line indexing from a flat-file dump?  My guess is that you can dump to
disk from the db faster than you can index and a single dumping thread
might be faster than many.

What I envision when I read this is a single pass from the database into a file. That file is then split into a number of pieces, one for each shard, and the pieces are imported simultaneously, each into the build core for its shard. Is that what you were thinking?

It looks like there is a way to have mysql output XML. Would that be a reasonable way to go about this? I know a little bit about handling XML in Perl, but only by reading the entire file into memory. I need a very speedy way to read and write (split) large XML, preferably in Java.

mysql -u user -p -h dbhost db --quick --xml -e 'SELECT * FROM view' > view.xml

When I ran this command, it took 64 minutes (about a third of the total time using the data import handler) and produced an XML file of 176632084KB, or roughly 168GB, containing over 65 million documents. This view only includes the fields necessary to build the Solr index; all other fields are excluded. The total distributed index size is about 60GB right now. I'll be interested to see how long it takes to split and import the XML.
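
For the split step, a streaming (StAX) reader should avoid ever holding the whole file in memory. Below is a minimal sketch of that idea, not a tested part of any build system. It relies on the <resultset>/<row>/<field> layout that mysql --xml writes. The class name XmlShardSplitter, the shardN.xml file names, and the round-robin assignment of rows to shards are all assumptions made for illustration; a real split would presumably route each row by whatever key the shards are actually divided on.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams <row> elements out of a mysql --xml dump
// and distributes them round-robin over numShards output files.
class XmlShardSplitter {
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    static long split(Path input, Path outDir, int numShards) throws Exception {
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();
        Files.createDirectories(outDir);

        OutputStream[] streams = new OutputStream[numShards];
        XMLEventWriter[] writers = new XMLEventWriter[numShards];
        for (int i = 0; i < numShards; i++) {
            Path out = outDir.resolve("shard" + i + ".xml"); // assumed naming
            streams[i] = new BufferedOutputStream(Files.newOutputStream(out));
            writers[i] = outFactory.createXMLEventWriter(streams[i], "UTF-8");
            writers[i].add(events.createStartDocument("UTF-8", "1.0"));
            writers[i].add(events.createStartElement("", "", "resultset"));
            // mysql marks NULL columns with xsi:nil, so declare the prefix on each root.
            writers[i].add(events.createNamespace("xsi", XSI));
        }

        long rows = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(input))) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            XMLEventWriter current = null; // non-null only while inside a <row>
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement()
                        && "row".equals(e.asStartElement().getName().getLocalPart())) {
                    current = writers[(int) (rows % numShards)];
                    rows++;
                }
                if (current != null) {
                    current.add(e); // copy the event through unchanged
                }
                if (e.isEndElement()
                        && "row".equals(e.asEndElement().getName().getLocalPart())) {
                    current = null;
                }
            }
            reader.close();
        }

        for (int i = 0; i < numShards; i++) {
            writers[i].add(events.createEndElement("", "", "resultset"));
            writers[i].add(events.createEndDocument());
            writers[i].flush();
            writers[i].close();  // does not close the underlying stream
            streams[i].close();
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        long n = split(Paths.get(args[0]), Paths.get(args[1]), Integer.parseInt(args[2]));
        System.out.println("rows written: " + n);
    }
}
```

Invoked as something like "java XmlShardSplitter view.xml shards 14", this would leave one file per shard, each a well-formed <resultset> that can be fed to its own build core. Memory use stays flat because only one event is held at a time. Whether it is fast enough on a file this size is something I would have to measure.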

Thanks,
Shawn
