On 2/22/2013 9:02 AM, jimtronic wrote:
> Yes, these are good points. I'm using Solr to leverage user preference data
> and I need that data available in real time. SQL just can't do the kind of
> things I'm able to do in Solr, so I have to wait until the write (a user
> action, a user preference, etc.) gets to Solr from the db anyway.
>
> I'm kind of curious about how many single documents I can send through via
> the JSON update in a day. Millions would be nice, but I wonder what the
> upper limit would be.
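For reference, sending a single document through the JSON update handler is
just a small HTTP POST, so the request itself is cheap. Here's a minimal
sketch (the host, core name, field names, and commitWithin value are
placeholders for illustration, assuming a Solr 3.x/4.x-style /update/json
endpoint - adjust for your setup):

import json
import urllib.request

# Placeholder URL; point this at your core's JSON update handler.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update/json"

def send_doc(doc, commit_within_ms=1000):
    # commitWithin lets Solr batch commits instead of hard-committing
    # after every single document.
    url = "%s?commitWithin=%d" % (SOLR_UPDATE_URL, commit_within_ms)
    body = json.dumps([doc]).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

send_doc({"id": "user-42-pref-7", "user_id": "42", "pref": "jazz"})

Using commitWithin (or autoCommit) rather than an explicit commit per
document keeps the per-update overhead low, which matters most when you're
pushing millions of individual documents a day.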
I have a distributed index with about 76 million documents in it; the
original source is MySQL. It consists of seven shards - six of them
are large, with over 12 million docs each, and one shard is small, usually
containing only a few hundred thousand docs. The full-import updates
all seven shards in parallel, but other than that, it is not a
multi-threaded operation.
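Since DIH's full-import command returns immediately and runs in the
background, running all the shards "in parallel" really just means hitting
each core's dataimport handler and then polling status. A rough sketch
(the shard URLs and the /dataimport handler path are assumptions based on a
typical DIH setup):

import time
import urllib.request

# Placeholder shard cores; one entry per shard.
SHARDS = [
    "http://shard1:8983/solr/core1",
    "http://shard2:8983/solr/core2",
]

def dataimport(base_url, command):
    url = "%s/dataimport?command=%s&wt=json" % (base_url, command)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# full-import returns right away, so a plain loop is enough to get all
# shards importing concurrently.
for shard in SHARDS:
    dataimport(shard, "full-import")

# DIH reports a "busy" status while an import is running; wait until
# every shard has gone back to idle.
while any("busy" in dataimport(shard, "status") for shard in SHARDS):
    time.sleep(60)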
In my dev environment, I'm absolutely positive that the bottleneck is I/O
on the Solr server. That server has 7200RPM SAS drives in basic RAID1
and takes about 8 hours for a full-import. It contains the entire index.
In production, I am not sure where the bottleneck is - my guess is that
it's I/O, but it might be in the database. These servers have RAID10
with six 7200RPM SATA drives, a caching RAID controller, plenty of RAM,
and each one contains only half the index. On Solr 3.5, a full-import
takes about 3.5 hours; on 4.2-SNAPSHOT, it takes about 4 hours. The new
version has the updateLog enabled.
Thanks,
Shawn