Hi,

we are currently looking into performance as well.
Are there any sites on the net that describe what counts as a large index?

People always talk about huge indexes, heavy commits, etc., but I can't find 
any statistics that put numbers on this, or any information about the hardware used.

Maybe an article in the wiki would help.

I expect our index to be about 4 to 5 GB with 500,000 docs and 80,000 commits 
a day. Is that considered large, medium or small?
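
For a rough sense of scale, the figures above can be turned into per-second and per-document numbers. This is only a back-of-envelope sketch: it assumes decimal gigabytes and commits spread evenly over 24 hours, neither of which is stated in the message.

```python
# Back-of-envelope figures for the index described above.
# Assumptions (not stated in the message): 1 GB = 10**9 bytes,
# and commits are spread evenly across a 24-hour day.

SECONDS_PER_DAY = 24 * 60 * 60

docs = 500_000
commits_per_day = 80_000
index_bytes_low = 4 * 10**9
index_bytes_high = 5 * 10**9

commits_per_second = commits_per_day / SECONDS_PER_DAY
seconds_between_commits = SECONDS_PER_DAY / commits_per_day
bytes_per_doc_low = index_bytes_low / docs
bytes_per_doc_high = index_bytes_high / docs

print(f"commits per second:      {commits_per_second:.2f}")
print(f"seconds between commits: {seconds_between_commits:.2f}")
print(f"index bytes per doc:     {bytes_per_doc_low:.0f} to {bytes_per_doc_high:.0f}")
```

Under those assumptions this works out to roughly one commit per second and about 8 to 10 KB of index per document.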

Regards
Sebastian

-----Original Message-----
From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov] 
Sent: Thursday, November 03, 2011 2:00 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: large scale indexing issues / single threaded bottleneck

Shishir, we have 35 million "documents", and should be doing about 5,000-10,000 
new "documents" a day, but with very small "documents": 40 fields which have 
at most a few terms, with many being single terms.

You may occasionally see some impact from top level index merges but those 
should be very infrequent given your stated volumes.

For more concrete advice, you should also provide information on the size of 
your documents, and your search volume.

JRJ

-----Original Message-----
From: Awasthi, Shishir [mailto:shishir.awas...@baml.com]
Sent: Tuesday, November 01, 2011 10:58 PM
To: solr-user@lucene.apache.org
Subject: RE: large scale indexing issues / single threaded bottleneck

Roman,
How frequently do you update your index? I need to do real-time 
adds/deletes of SOLR documents at a rate of approximately 20/min.
The total number of documents is in the range of 4 million. Will there be any 
performance issues?

Thanks,
Shishir

-----Original Message-----
From: Roman Alekseenkov [mailto:ralekseen...@gmail.com]
Sent: Sunday, October 30, 2011 6:11 PM
To: solr-user@lucene.apache.org
Subject: Re: large scale indexing issues / single threaded bottleneck

Guys, thank you for all the replies.

On Friday night I think I figured out a partial solution to the problem. 
Adding a whole bunch of debug statements to the info stream showed that every 
document is following the "update document" path instead of the "add document" 
path. This means that all document IDs are getting into the "pending deletes" 
queue, and Solr has to rescan its index on every commit for potential deletions. 
This is single threaded and seems to get progressively slower with the index size.
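
The effect described above can be illustrated with a toy model. This is not Lucene code, and Lucene's actual delete resolution works differently in detail; the class and method names below are hypothetical. The sketch only shows why buffering a delete for every incoming document makes the work per commit grow with the size of the index, while a plain add does not.

```python
# Toy model only: this is NOT Lucene code. It mimics the behaviour described
# above, where every "update" buffers a delete that must be resolved against
# the existing index at commit time.

class ToyIndex:
    def __init__(self):
        self.committed = []          # (doc_id, doc) pairs already in the index
        self.buffered = []           # docs received since the last commit
        self.pending_deletes = set() # IDs whose older copies must be removed
        self.docs_scanned = 0        # work counter for the delete pass

    def add_document(self, doc_id, doc):
        # "add document" path: no delete is recorded
        self.buffered.append((doc_id, doc))

    def update_document(self, doc_id, doc):
        # "update document" path: remember the ID so older copies get removed
        self.pending_deletes.add(doc_id)
        self.buffered.append((doc_id, doc))

    def commit(self):
        if self.pending_deletes:
            # single-threaded pass over everything committed so far
            self.docs_scanned += len(self.committed)
            self.committed = [(i, d) for i, d in self.committed
                              if i not in self.pending_deletes]
            self.pending_deletes.clear()
        self.committed.extend(self.buffered)
        self.buffered = []


def run(use_update, batches=5, batch_size=1000):
    """Index batches of brand-new IDs, committing after each batch."""
    index = ToyIndex()
    next_id = 0
    for _ in range(batches):
        for _ in range(batch_size):
            if use_update:
                index.update_document(next_id, "doc")
            else:
                index.add_document(next_id, "doc")
            next_id += 1
        index.commit()
    return index.docs_scanned, len(index.committed)


print("add path:   ", run(use_update=False))   # (0, 5000)
print("update path:", run(use_update=True))    # (10000, 5000)
```

Both runs end with the same 5,000 documents, because every ID is new and nothing is actually deleted. The update path still scans 10,000 existing entries across the five commits, and that number keeps growing as the index grows.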

Adding overwrite=false to the URL of the /update handler did NOT help: my debug 
statements showed that documents still go to the updateDocument() function with 
deleteTerm not being null. So I hacked Lucene a little bit and set 
deleteTerm=null at the beginning of updateDocument() as a temporary solution, 
and it no longer calls applyDeletes().
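
For reference, a minimal sketch of what "adding overwrite=false to the URL" looks like on the client side. The host, port and path below are hypothetical placeholders, and the snippet only builds the URL without sending anything. As noted above, this parameter alone did not change the behaviour in this setup.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical base URL; host, port and path depend on the installation.
BASE_URL = "http://localhost:8983/solr/update"

def update_url(base_url, overwrite):
    # Render the flag as the lowercase string used in the message above.
    params = {"overwrite": "true" if overwrite else "false"}
    return f"{base_url}?{urlencode(params)}"

url = update_url(BASE_URL, overwrite=False)
print(url)  # http://localhost:8983/solr/update?overwrite=false

# Sanity check: the parameter round-trips through the query string.
print(parse_qs(urlsplit(url).query))  # {'overwrite': ['false']}
```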

This gave a 6-8x performance boost, and now we can index about 9 million 
documents/hour (producing 20 GB of index every hour). Right now it's at 1 TB 
index size and growing, without noticeable degradation of the indexing speed.
This is decent, but still the 24-core machine is barely utilized :)
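
To put these rates in per-second and per-document terms, here is a quick calculation. It is a sketch under assumptions not stated above: the sizes are taken as decimal gigabytes and terabytes, and the "6-8x" boost is applied to the current rate to estimate the earlier one.

```python
# Back-of-envelope numbers for the indexing rates quoted above.
# Assumptions: 1 GB = 10**9 bytes, 1 TB = 10**12 bytes.

docs_per_hour = 9_000_000
index_bytes_per_hour = 20 * 10**9
index_bytes_total = 10**12

docs_per_second = docs_per_hour / 3600
index_bytes_per_doc = index_bytes_per_hour / docs_per_hour

# Time to build 1 TB if the current rate had applied from the start
# (a lower bound, since indexing was slower before the change).
hours_for_total = index_bytes_total / index_bytes_per_hour

# Rate before the change, implied by a 6-8x speedup.
before_low = docs_per_hour / 8
before_high = docs_per_hour / 6

print(f"docs per second:        {docs_per_second:.0f}")
print(f"index bytes per doc:    {index_bytes_per_doc:.0f}")
print(f"hours to reach 1 TB:    {hours_for_total:.0f}")
print(f"docs/hour before boost: {before_low:.0f} to {before_high:.0f}")
```

That is about 2,500 documents per second and a little over 2 KB of index per document.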

Now I think it's hitting a merge bottleneck, where all indexing threads are 
being paused, and ConcurrentMergeScheduler with 4 threads is not helping much. 
I guess the changes on the trunk would definitely help, but we will likely stay 
on 3.4.

Will dig more into the issue on Monday. Really curious to see why 
"overwrite=false" didn't help, but the hack did.

Once again, thank you for the answers and recommendations.

Roman



--
View this message in context:
http://lucene.472066.n3.nabble.com/large-scale-indexing-issues-single-threaded-bottleneck-tp3461815p3466523.html
Sent from the Solr - User mailing list archive at Nabble.com.
