On 5/6/2017 6:49 PM, Zheng Lin Edwin Yeo wrote:
> For my rich documentation handling, I'm using Extracting Request Handler, and 
> it requires OCR.
>
> However, currently, for the slow indexing speed which I'm experiencing, the 
> indexing is done directly from the Sybase database. I will fetch about 1000 
> records at a time from Sybase, and store them in a CacheRowSet for it to be 
> indexed. The query to the Sybase database is quite fast, and most of the time 
> is spent on processes in the CacheRowSet.
<snip>
> A) 384 GB
<snip>
> A) 22 GB
<snip>
> A) 5 TB
<snip>
> A) A virtual machine with Sybase database is running on the server

The discussion about the drawbacks of the Extracting Request Handler has
already taken place.  Tika should be running on separate hardware, not
embedded in Solr.  Having high-impact Tika processing run on the Solr
server is going to slow everything down.

Are the two types of indexing (ERH with OCR, and indexing from a DB)
happening on the same Solr server?

As soon as you mention virtual machines, my mental picture of the setup
becomes much less clear.  You'll need to fully describe the OS and
hardware setup, at both the hypervisor and virtual machine level.  Then
I will know what questions to ask for more detailed information.

Is Solr in a virtual machine?
Is the 384GB at the hypervisor level, or the virtual machine level?
Is the 22GB heap the total heap memory, or is that per Solr instance?

If the 5TB is Solr index data, then there's no way you're going to get
fast performance.  Putting enough memory in one machine to effectively
cache that much data is impractically expensive, and most server
hardware doesn't have enough memory slots even if you do have the
money.  384GB wouldn't be enough for 5TB of index, and that's not even
taking into account the memory needed by your software, including Solr
and Sybase.
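
To put rough numbers on that, here is a minimal sketch of the arithmetic.
It assumes the 384GB belongs to the machine running Solr and that the
22GB heap is the total across all Solr instances, both of which are
still open questions above.  The function name and its parameters are
made up for illustration, and it ignores the memory used by the OS and
the Sybase virtual machine, so the real figure would be lower.

```python
def cacheable_fraction(total_ram_gb, heap_gb, index_gb, other_gb=0.0):
    """Estimate the share of the index that the OS disk cache could hold.

    total_ram_gb: physical memory in the machine (assumed to be Solr's).
    heap_gb:      Java heap, which is not available for disk caching.
    index_gb:     on-disk size of the Solr index data.
    other_gb:     memory used by other software (hypothetical, default 0).
    """
    free_for_cache = max(total_ram_gb - heap_gb - other_gb, 0.0)
    return min(free_for_cache / index_gb, 1.0)


# 384GB of RAM, 22GB heap, 5TB (5 * 1024 GB) of index data
fraction = cacheable_fraction(384, 22, 5 * 1024)
print(f"{fraction:.1%} of the index fits in the disk cache")
```

Under those assumptions, only about 7 percent of the index could be
cached, before anything else on the machine takes its share.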

Thanks,
Shawn
