Thanks for the confirmation. I am aware of the current reader implementation along with its limitations, but wanted to confirm whether there was any ongoing/existing contribution related to this.
We are hoping to influence the selection of a single block replica location for remote reads based on global context, since we have workloads that run remotely as well as workloads that run locally and use short-circuit reads. Part of the planning phase involves assessing disk utilization across all the hosts prior to building a query dispatch plan. One obvious solution would be to completely disable short-circuit reads, but that is not an option for us given the criticality of those specific workloads.

On Sun, Jul 23, 2017 at 8:48 PM, Harsh J <[email protected]> wrote:

> There isn't an API way to hint/select DNs to read from currently - you may
> need to do manual changes (contribution of such a feature is welcome,
> please file a JIRA to submit a proposal).
>
> You can perhaps hook your control of which replica location for a given
> block is selected by the reader under the non-public method
> DFSInputStream#getBestNodeDNAddrPair(…):
> https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L982-L1021
> (ensure to preserve the existing logic around edge cases, however)
>
> Note though that the block replica location list returned for off-rack
> reads by the NameNode is randomized by default. Are you observing a
> non-random distribution of reads?
>
> On Wed, 19 Jul 2017 at 06:16 Shivram Mani <[email protected]> wrote:
>
>> We have an application which uses the DFSInputStream to read blocks from
>> a *remote* Hadoop cluster. Is there any way we can influence which
>> specific datanode the block fetch request is dispatched to?
>>
>> The reasoning behind this is that, since our application workload is very
>> heavy on IO, we would like to distribute the IO load as evenly as possible
>> across the hosts/disks. Hence, prior to reading data, we wish to obtain the
>> locations of the underlying blocks and build a dispatch plan so as to
>> maximize the IO throughput on the HDFS cluster.
>>
>> How do we go about this?
>>
>> --
>> Thanks
>> Shivram

--
Thanks
Shivram
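For readers landing on this thread: the planning step described above (obtain each block's replica hosts, then spread reads so no host is overloaded) can be sketched independently of Hadoop. The sketch below is illustrative only, under stated assumptions: the `DispatchPlanner`, `pickReplica`, and `plan` names are hypothetical, not HDFS APIs, and the load map stands in for whatever disk-utilization monitoring you have. In a real deployment the replica hosts would come from the public `FileSystem.getFileBlockLocations(...)` / `BlockLocation.getHosts()` API, and the chosen host would then have to be fed into the reader, e.g. via the non-public `DFSInputStream#getBestNodeDNAddrPair(…)` hook mentioned by Harsh J.

```java
import java.util.*;

// Hypothetical sketch of the replica-selection step of a dispatch plan.
// Given each block's replica hosts and a per-host load score (in practice,
// monitored disk utilization), greedily assign each block read to its
// least-loaded replica, charging the host for each read it is assigned.
public class DispatchPlanner {

    /** Pick the replica host with the lowest current load score. */
    static String pickReplica(List<String> replicaHosts, Map<String, Integer> load) {
        String best = replicaHosts.get(0);
        for (String h : replicaHosts) {
            if (load.getOrDefault(h, 0) < load.getOrDefault(best, 0)) {
                best = h;
            }
        }
        return best;
    }

    /** Assign every block to a replica, updating load scores as we go. */
    static Map<String, String> plan(Map<String, List<String>> blockReplicas,
                                    Map<String, Integer> load) {
        Map<String, String> assignment = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : blockReplicas.entrySet()) {
            String host = pickReplica(e.getValue(), load);
            assignment.put(e.getKey(), host);
            load.merge(host, 1, Integer::sum); // account for the new read
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical block -> replica-host mapping (as would be returned
        // by FileSystem.getFileBlockLocations in a real setup).
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        blocks.put("blk_1", List.of("dn1", "dn2", "dn3"));
        blocks.put("blk_2", List.of("dn1", "dn2"));
        blocks.put("blk_3", List.of("dn1", "dn3"));

        Map<String, Integer> load = new HashMap<>(); // all hosts start even
        System.out.println(plan(blocks, load));
        // reads end up spread across dn1, dn2 and dn3 rather than piling on dn1
    }
}
```

This is only the planning half of the problem; as the thread notes, there is no public API to hand the chosen host to `DFSInputStream`, so wiring the plan into the reader still requires manual changes (or the proposed JIRA contribution).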
