On 10/27/2011 5:56 AM, Michael Sokolov wrote:
> From everything you've said, it certainly sounds like a low-level I/O
> problem in the client, not a server slowdown of any sort. Maybe Perl
> is using the same connection over and over (keep-alive) and Java is
> not. I really don't know. One thing I've heard is that
> StreamingUpdateSolrServer (I think that's what it's called) can give
> better throughput for large request batches. If you're not using
> that, you may be having problems w/closing and re-opening connections?
I can't claim to know for certain, but I'm fairly sure that the simple
LWP classes I'm using don't do keepalive unless you specifically
configure the user agent to do so. I'll look into it some more.
The StreamingUpdateSolrServer documentation recommends using it only
with the /update handler, not for queries. I'm not having a problem
with the deletes themselves; they go pretty fast. It's all of the
queries before each delete that are relatively slow. Doing those
queries really adds up. With multithreading, it does all the shards at
once, but it still can only query for a limited number of values at a
time due to maxBooleanClauses. Now I'm checking and deleting 1000
values at a time, on all shards simultaneously. I use
CommonsHttpSolrServer, and each of those objects is created only once,
when the program first starts up.
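The batching described above can be sketched roughly as follows. This is
only an illustration, not the actual build program: the class name
DeleteBatcher, the field name "did_field", and the plain OR-joined query
form are all assumptions. Only the batch size of 1000 comes from the
description above. The sketch splits a list of values into batches and
builds one query string per batch; actually sending each query to the
shards (for example through CommonsHttpSolrServer) is left out.

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteBatcher {
    // Hypothetical field name; the real schema field is not given in this thread.
    static final String FIELD = "did_field";

    // Batch size from the message: small enough to respect maxBooleanClauses.
    static final int BATCH_SIZE = 1000;

    /** Split values into consecutive batches of at most batchSize elements. */
    public static List<List<Long>> partition(List<Long> values, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Long>> batches = new ArrayList<>();
        for (int i = 0; i < values.size(); i += batchSize) {
            int end = Math.min(i + batchSize, values.size());
            // Copy the sublist so each batch is independent of the source list.
            batches.add(new ArrayList<>(values.subList(i, end)));
        }
        return batches;
    }

    /** Build a query of the form field:(v1 OR v2 OR ...), one clause per value. */
    public static String buildQuery(String field, List<Long> batch) {
        StringBuilder sb = new StringBuilder(field).append(":(");
        for (int i = 0; i < batch.size(); i++) {
            if (i > 0) {
                sb.append(" OR ");
            }
            sb.append(batch.get(i));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // Made-up values purely to exercise the batching.
        List<Long> ids = new ArrayList<>();
        for (long i = 1; i <= 2500; i++) {
            ids.add(i);
        }
        for (List<Long> batch : partition(ids, BATCH_SIZE)) {
            String q = buildQuery(FIELD, batch);
            System.out.println(batch.size() + " values, query length " + q.length());
        }
    }
}
```

Each batch costs one query round trip per shard before the delete, which
is why the per-query time dominates the total run time.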
I figure there are three possibilities:
1) A glaring inefficiency in CommonsHttpSolrServer queries as compared
to a straight HTTP POST request.
2) The compartmentalization provided by the virtual machine architecture
creates an odd synergy that is not present when there are only two Solr
instances on physical machines instead of eight of them (seven shards
plus a search broker) on virtual machines.
3) The extra physical memory on the virtualized servers gives a bigger
disk-cache performance boost than the other servers gain from running
without virtualization.
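One way to test the first possibility is to time the same query sent as
a straight HTTP POST from Java, bypassing SolrJ entirely, and compare
that with the CommonsHttpSolrServer timing. The following is a minimal
sketch under stated assumptions: the class name RawPostTimer, the URL
http://localhost:8983/solr/select, and the sample query are placeholders
and not taken from my setup. It uses only the JDK's HttpURLConnection.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RawPostTimer {
    /** Form-encode a query and row count as the body of a POST request. */
    public static String formBody(String query, int rows) {
        try {
            return "q=" + URLEncoder.encode(query, "UTF-8") + "&rows=" + rows;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    /** POST the body, read the whole response, and return elapsed milliseconds. */
    public static long timePost(String url, String body) throws Exception {
        long start = System.nanoTime();
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        // Drain the response so the measurement covers the full transfer.
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // discard
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL and query; substitute a real shard URL and field.
        String url = args.length > 0 ? args[0] : "http://localhost:8983/solr/select";
        String body = formBody("did_field:(1 OR 2 OR 3)", 0);
        for (int i = 0; i < 5; i++) {
            System.out.println("run " + i + ": " + timePost(url, body) + " ms");
        }
    }
}
```

If the raw POST is much faster than the same query through
CommonsHttpSolrServer, that points at the client library; if the two are
about the same, the other two possibilities become more likely.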
Only the first of those possible problems is something that can be
determined or fixed without migrating the other servers to my new
system. I'm having one other problem with the new build program. I
haven't figured out exactly what that problem is, so I am very reluctant
to switch everything over. So far it seems to be related to the MySQL
JDBC connector or my attempt at threading, not Solr.
I mentioned that the hardware is identical except for memory. That's
not quite true - the servers accessed by the Java program are better.
One of them has a slightly faster CPU than its counterpart with
virtualization, and they all have 1TB hard drives as opposed to the
mixed 500GB & 750GB drives in the other servers. All of the servers are
Dell 2950 with six-drive RAID10 arrays.