On 12/4/2011 12:41 AM, Ted Dunning wrote:
> Read the papers I referred to. They describe how to search a fairly
> enormous corpus with an 8GB in-memory index (and no disk cache at all).
They would seem to indicate moving away from Solr. While that would not
be entirely out of the question, I don't relish coming up with a whole
new system from scratch, one part of which would mean rewriting the
build system for a third time.
>> I have 16 processor cores available for each index chain (two servers). If
>> I set aside one for the distributed search itself and one for the
>> incremental (that small 3.5 to 7 day shard), it sounds like my ideal
>> numShards from Solr's perspective is 14. I have some fear that my database
>> server will fall over under the load of 14 DB connections during a full
>> index rebuild, though. Do you have any other thoughts for me?
> Off-line indexing from a flat-file dump? My guess is that you can dump to
> disk from the db faster than you can index and a single dumping thread
> might be faster than many.
What I envision when I read this is a single pass from the database
into one file, which then gets split into a number of pieces, one per
shard, and those pieces get imported simultaneously, one into each
shard's build core. Is that what you were thinking?
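If so, the "simultaneously" part ought to be the easy bit: a DIH
full-import command returns immediately and runs in the background, so
a plain loop can start all of the imports at once. Here is a rough
SolrJ sketch (host and core names are invented, and it assumes each
build core already has a dataimport config pointed at its own piece of
the dump):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

// Sketch only: start a DIH full-import on every build core at once.
// Host/core names are invented; each build core is assumed to have a
// /dataimport handler configured to read its own piece of the dump.
public class ParallelImport {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 14; i++) {
            CommonsHttpSolrServer core = new CommonsHttpSolrServer(
                "http://idx1.example.com:8983/solr/build" + i);
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "full-import");
            QueryRequest req = new QueryRequest(params);
            req.setPath("/dataimport");   // full-import returns right away,
            core.request(req);            // so this loop starts all 14 imports
        }
    }
}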
It looks like there is a way to have mysql output XML; would that be a
reasonable way to go about this? I know a little bit about handling XML
in Perl, but only by reading the entire file. I need a very speedy way
to read and write (split) a large XML file, preferably in Java.
mysql -u user -p -h dbhost db --quick --xml -e 'SELECT * FROM view' > view.xml
When I ran this command, it took 64 minutes (about a third of the total
time using the data import handler) and produced an XML file of
176,632,084 KB (about 169 GB) containing over 65 million documents.
The view includes only the fields necessary to build the Solr index;
all other fields are excluded. The total distributed index size is
about 60 GB right now. I'll be interested to see how long it takes to
split and import the XML.
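
For the split itself, here is roughly what I have in mind: a sketch
using the standard javax.xml.stream (StAX) API that copies <row>
elements round-robin into one output file per shard. The file names
and the shard count of 14 are placeholders.

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Sketch: split a "mysql --xml" dump into one file per shard by copying
// <row> elements round-robin.  Input/output file names and the shard
// count are placeholders.
public class SplitMysqlXml {
    public static void main(String[] args) throws Exception {
        final int shards = 14;
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();

        XMLEventReader reader =
            inFactory.createXMLEventReader(new FileInputStream("view.xml"));

        OutputStream[] streams = new OutputStream[shards];
        XMLEventWriter[] writers = new XMLEventWriter[shards];
        for (int i = 0; i < shards; i++) {
            streams[i] = new BufferedOutputStream(
                new FileOutputStream("view-shard" + i + ".xml"));
            writers[i] = outFactory.createXMLEventWriter(streams[i], "UTF-8");
            writers[i].add(events.createStartDocument("UTF-8", "1.0"));
            // Note: if the dump uses xsi:nil for NULL columns, the xsi
            // namespace declaration from the original root would need to
            // be carried over here as well.
            writers[i].add(events.createStartElement("", "", "resultset"));
        }

        int target = 0;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                && "row".equals(event.asStartElement().getName().getLocalPart())) {
                // Copy this entire <row> element to the current target file.
                XMLEventWriter out = writers[target];
                out.add(event);
                int depth = 1;
                while (depth > 0) {
                    XMLEvent e = reader.nextEvent();
                    if (e.isStartElement()) depth++;
                    if (e.isEndElement()) depth--;
                    out.add(e);
                }
                target = (target + 1) % shards;   // round-robin across shards
            }
        }
        reader.close();

        for (int i = 0; i < shards; i++) {
            writers[i].add(events.createEndElement("", "", "resultset"));
            writers[i].add(events.createEndDocument());
            writers[i].close();
            streams[i].close();   // the writer does not close the stream itself
        }
    }
}

StAX only ever holds one event in memory at a time, which is the main
reason I would reach for it rather than any DOM-style parse of a 169 GB
file.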
Thanks,
Shawn