On 5/6/2017 6:49 PM, Zheng Lin Edwin Yeo wrote:
> For my rich documentation handling, I'm using Extracting Request Handler, and
> it requires OCR.
>
> However, currently, for the slow indexing speed which I'm experiencing, the
> indexing is done directly from the Sybase database. I will fetch about 1000
> records at a time from Sybase, and stored in into a CacheRowSet for it to be
> indexed. The query to the Sybase database is quite fast, and most of the time
> is spend on processes in the CacheRowSet.
<snip>
> A) 384 GB
<snip>
> A) 22 GB
<snip>
> A) 5 TB
<snip>
> A) A virtual machine with Sybase database is running on the server

The discussion about the drawbacks of the Extracting Request Handler has already taken place. Tika should be running on separate hardware, not embedded in Solr. Having high-impact Tika processing run on the Solr server is going to slow everything down.

Are the two types of indexing (ERH with OCR, and indexing from a DB) happening on the same Solr server?

As soon as you mention virtual machines, my mental picture of the setup becomes much less clear. You'll need to fully describe the OS and hardware setup, at both the hypervisor and virtual machine level. Then I will know what questions to ask for more detailed information. Is Solr in a virtual machine? Is the 384GB at the hypervisor level, or the virtual machine level? Is the 22GB heap the total heap memory, or is that per Solr instance?

If the 5TB is Solr index data, then there's no way you're going to get fast performance. Putting enough memory in one machine to effectively cache that much data is impractically expensive, and most server hardware doesn't have enough memory slots even if you do have the money. 384GB wouldn't be enough for 5TB of index, and that's not even taking into account the memory needed by your software, including Solr and Sybase.

Thanks,
Shawn
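
P.S. To put rough numbers on the memory point above, here is a back-of-the-envelope sketch in Python. It is only an illustration under assumptions I can't confirm yet: that the 384GB is available to the machine actually running Solr, that the 22GB is a single heap, that the 5TB is all index data (counted here as 5 * 1024 GB), and that nothing else is taken out for Sybase or the OS. The function name and the `other_gb` parameter are made up for this example.

```python
GB_PER_TB = 1024  # assumption: binary units; use 1000 if the sizes are decimal


def page_cache_coverage(total_ram_gb, heap_gb, other_gb, index_gb):
    """Return (RAM left for the OS disk cache in GB, fraction of the index it could hold)."""
    # Whatever is not claimed by the Java heap or other software is what the
    # OS can use to cache index files. Clamp at zero if memory is overcommitted.
    cache_gb = max(total_ram_gb - heap_gb - other_gb, 0)
    return cache_gb, cache_gb / index_gb


cache_gb, fraction = page_cache_coverage(
    total_ram_gb=384,          # from the answers quoted above
    heap_gb=22,                # assumed to be one Solr heap, not per instance
    other_gb=0,                # assumption: unknown; Sybase and the OS need some
    index_gb=5 * GB_PER_TB,    # assumed to be all Solr index data
)
print(f"{cache_gb} GB left for disk cache, about {fraction:.1%} of the index")
```

Under those assumptions only a small fraction of the index could ever sit in the disk cache, and any memory used by Sybase or additional Solr instances makes the real figure lower.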