On 3/17/2017 11:14 AM, Imad Qureshi wrote:
> I understand that but unfortunately that's not an option right now. We 
> already have 16 TB of index in HDFS. 
>
> So let me rephrase this question. How important is data locality for SOLR? Is 
> performance impacted if SOLR data is on a remote node?

What's going to matter is how fast the data can be retrieved.  With
standard local filesystems, the operating system will use unallocated
memory to cache the data, so if you have enough available memory for
that caching to be effective, access is lightning fast -- the most
requested index data will be in memory, and pulled directly from there
into the application.  If the disk has to be read to obtain the needed
data, it will be slow.  If data has to be transferred over a network
that's gigabit or slower, that is also slow.  Faster network
technologies are available for a price premium, but if a disk has to be
read to get the data, the network speed won't matter.  Good performance
means avoiding going to the disk or transferring over the network.
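
To put a rough number on the network part, here is a back-of-the-envelope
sketch in Python.  It only computes a lower bound (payload bits divided by
the raw link rate) and ignores protocol overhead, latency, and disk reads
on the remote side, so real transfers will be slower.  The function name
and the 1 GB payload are my own illustrative assumptions, not anything
measured on a Solr or HDFS install.

```python
def min_transfer_seconds(num_bytes, link_gbits_per_sec):
    """Lower bound on transfer time: payload bits divided by raw link rate."""
    bits = num_bytes * 8
    return bits / (link_gbits_per_sec * 1e9)


ONE_GB = 10**9  # illustrative payload size (assumption)

# Best case for pulling 1 GB of index data over the wire.
gigabit_seconds = min_transfer_seconds(ONE_GB, 1)
ten_gigabit_seconds = min_transfer_seconds(ONE_GB, 10)

print(f"1 Gb/s link:  at least {gigabit_seconds:.1f} s per GB")
print(f"10 Gb/s link: at least {ten_gigabit_seconds:.1f} s per GB")
```

Even in the best case a gigabit link needs several seconds per gigabyte,
which is why data that isn't already cached in local memory hurts so much.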

SSD storage is faster than regular disks, but still not as fast as main
memory, and increased storage speed probably won't matter if the network
can't keep up.

If I'm not mistaken, an HDFS client can allocate system memory
for caching purposes to avoid the slow transfer for frequently requested
data.  If my understanding is correct, then enough memory allocated to
the HDFS client MIGHT avoid network/disk transfer for the important data
in the index ... but whether this works in practice is a question I
cannot answer.

Unless your 16TB of index data is being utilized by MANY Solr servers
that each use a very small part of the data and have the ability to
cache a significant percentage of the data they're using, it's highly
unlikely that you're going to have enough memory for good caching. 
Indexes that large are typically slow unless you can afford a LOT of
hardware, which means a lot of money.
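
As a rough illustration of that arithmetic, the Python sketch below
estimates what fraction of the index could sit in cache across a cluster.
Everything except the 16TB figure is a made-up assumption: the server
counts, the RAM per server, and the memory reserved for the Solr heap and
the OS are hypothetical, and it assumes the index is split evenly across
servers with no replicas.

```python
def cacheable_fraction(index_bytes, servers, ram_per_server, reserved_per_server):
    """Fraction of the index that fits in memory left over for caching.

    reserved_per_server covers the Solr heap, the OS, and anything else
    that is not available as cache (an assumed figure).
    """
    free_per_server = max(ram_per_server - reserved_per_server, 0)
    return min(servers * free_per_server / index_bytes, 1.0)


TB = 10**12
GB = 10**9
INDEX_SIZE = 16 * TB  # from the original question

# Hypothetical cluster A: 10 servers, 128 GB RAM each, 32 GB reserved.
small_cluster = cacheable_fraction(INDEX_SIZE, 10, 128 * GB, 32 * GB)

# Hypothetical cluster B: 100 servers, 256 GB RAM each, 32 GB reserved.
large_cluster = cacheable_fraction(INDEX_SIZE, 100, 256 * GB, 32 * GB)

print(f"10 servers:  {small_cluster:.0%} of the index cacheable")
print(f"100 servers: {large_cluster:.0%} of the index cacheable")
```

With numbers like these, a handful of servers can cache only a few percent
of the index, and it takes on the order of a hundred large machines before
all of it could fit in memory.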

Thanks,
Shawn
