On 3/17/2017 11:14 AM, Imad Qureshi wrote:
> I understand that but unfortunately that's not an option right now. We
> already have 16 TB of index in HDFS.
>
> So let me rephrase this question. How important is data locality for SOLR. Is
> performance impacted if SOLR data is on a remote node?
What's going to matter is how fast the data can be retrieved.  With standard local filesystems, the operating system will use unallocated memory to cache the data, so if you have enough available memory for that caching to be effective, access is lightning fast -- the most frequently requested index data will already be in memory, and is pulled directly from there into the application.

If the disk has to be read to obtain the needed data, access will be slow.  If the data has to be transferred over a network that's gigabit or slower, that is also slow.  Faster network technologies are available at a price premium, but if a disk has to be read to get the data, the network speed won't matter.  Good performance means avoiding both the disk read and the network transfer.  SSD storage is faster than spinning disks, but still not as fast as main memory, and increased storage speed won't matter much if the network can't keep up.

If I'm not mistaken, an HDFS client can allocate system memory for caching, to avoid that slow transfer for frequently requested data.  If my understanding is correct, then enough memory allocated to the HDFS client MIGHT avoid the network/disk transfer for the important data in the index ... but whether this works well in practice is a question I cannot answer.

Unless your 16TB of index data is spread across MANY Solr servers that each use a very small part of the data and can cache a significant percentage of the part they're using, it's highly unlikely that you're going to have enough memory for good caching.  Indexes that large are typically slow unless you can afford a LOT of hardware, which means a lot of money.

Thanks,
Shawn
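P.S. If you want to experiment with the HDFS client-side cache I mentioned, Solr exposes it through the HdfsDirectoryFactory settings in solrconfig.xml.  The snippet below is only a sketch based on the "Running Solr on HDFS" page in the reference guide -- the hdfs:// path, the confdir, and the slab count are placeholders you'd replace for your own cluster.  Each cache slab is 128MB of off-heap memory (16384 blocks of 8KB), so slab.count times 128MB is roughly how much index data each Solr node can hold in RAM, and the JVM's -XX:MaxDirectMemorySize has to be raised to match.

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
    <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
    <!-- off-heap block cache: slab.count x 128MB of direct memory -->
    <bool name="solr.hdfs.blockcache.enabled">true</bool>
    <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
    <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
    <int name="solr.hdfs.blockcache.slab.count">8</int>
  </directoryFactory>

Eight slabs is about 1GB of cache per node, which illustrates the scale problem I described above: even 64 nodes configured that way would cache well under one percent of a 16TB index.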