[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas updated HADOOP-14535:
----------------------------
    Attachment: 
0005-Random-access-and-seek-imporvements-to-azure-file-system.patch

I am attaching the updated patch 
(0005-Random-access-and-seek-improvements-to-azure-file-system.patch).  Random 
access to block blobs is as much as 90% faster, *without* any regressions.  
TestBlockBlobInputStream.java contains unit tests that demonstrate the 
random-access improvement, and unit tests that demonstrate there is no 
performance regression in sequential reads after a reverse seek.  

However, please note that unit tests and developer machines are not an 
appropriate environment for measuring performance.  The performance tests in 
TestBlockBlobInputStream.java merely demonstrate the behavior and guard against 
regressions.  Many things can affect performance measurements over short 
periods of time, including but not limited to fluctuations in network traffic 
and routing, fluctuations in the activity of other processes running on the 
client, fluctuations in load on the shared stamp that hosts your Azure Storage 
account, and throttling sometimes performed by enterprise IT departments.  The 
performance tests included with this change are written to execute quickly and 
to work around these fluctuations while still catching regressions in the 
code.  While implementing and running these unit tests, I also validated the 
performance improvements by running variations of the code for longer periods, 
and the results looked favorable.
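
As an illustration of one common way to dampen short-term fluctuations when 
timing an operation, the sketch below repeats the operation and reports the 
median elapsed time.  This is not the approach taken in 
TestBlockBlobInputStream.java (the message does not describe it); the class 
and method names (TimingSketch, median, medianNanos) are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch only; NOT the code from TestBlockBlobInputStream.java.
// All names here are hypothetical.
final class TimingSketch {

    /** Returns the median of the samples (mean of the two middle values when the count is even). */
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        long[] sorted = samples.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    /** Runs the operation the given number of times and returns the median elapsed time in nanoseconds. */
    static long medianNanos(Runnable operation, int runs) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            operation.run();
            samples[i] = System.nanoTime() - start;
        }
        return median(samples);
    }
}
```

A single slow run (for example, one caused by a burst of network traffic) 
shifts the median far less than it shifts the mean, which is why a median is 
often preferred for short measurements.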

My team plans to review and improve the instrumentation (Hadoop Metrics) for 
the wasb:// file system.  Although this change does not include new metrics, we 
will be looking into this in the future.

ALL tests in "hadoop-tools/hadoop-azure" are passing with the patch 
(0005-Random-access-and-seek-improvements-to-azure-file-system.patch).

> Support for random access and seek of block blobs
> -------------------------------------------------
>
>                 Key: HADOOP-14535
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14535
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/azure
>            Reporter: Thomas
>            Assignee: Thomas
>         Attachments: 
> 0001-Random-access-and-seek-imporvements-to-azure-file-system.patch, 
> 0003-Random-access-and-seek-imporvements-to-azure-file-system.patch, 
> 0004-Random-access-and-seek-imporvements-to-azure-file-system.patch, 
> 0005-Random-access-and-seek-imporvements-to-azure-file-system.patch
>
>
> This change adds a seek-able stream for reading block blobs to the wasb:// 
> file system.
> If seek() is not used or if only forward seek() is used, the behavior of 
> read() is unchanged.  That is, the stream is optimized for sequential reads 
> by reading chunks (over the network) in the size specified by 
> "fs.azure.read.request.size" (default is 4 megabytes).
> If reverse seek() is used, the behavior of read() changes in favor of 
> reading the actual number of bytes requested in the call to read(), with 
> some constraints.  If the size requested is smaller than 16 kilobytes and 
> cannot be satisfied by the internal buffer, the network read will be 16 
> kilobytes.  If the size requested is greater than 4 megabytes, it will be 
> satisfied by sequential 4 megabyte reads over the network.
> This change improves the performance of FSInputStream.seek() by not closing 
> and re-opening the stream, which for block blobs also involves a network 
> operation to read the blob metadata.  Now NativeAzureFsInputStream.seek() 
> checks if the stream is seek-able and moves the read position.
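
The read-size rules in the quoted description can be sketched roughly as 
follows.  This is a minimal illustration written from the description alone, 
not the code in the attached patch: the class, method, and constant names are 
hypothetical, and the sketch assumes the internal buffer cannot satisfy the 
call to read().

```java
// Illustrative sketch only; NOT the code from the attached patch.
// All names here are hypothetical.
final class ReadSizeSketch {

    // Minimum network read once a reverse seek has been seen (16 KB, per the description).
    static final int MIN_RANDOM_READ_BYTES = 16 * 1024;

    // Default value of fs.azure.read.request.size (4 MB, per the description).
    static final int DEFAULT_REQUEST_BYTES = 4 * 1024 * 1024;

    /**
     * Size of the next network read, assuming the internal buffer
     * cannot satisfy the request.
     *
     * @param reverseSeekSeen  true once a reverse seek() has been issued on the stream
     * @param requestedBytes   number of bytes requested in the call to read()
     * @param requestSizeBytes configured value of fs.azure.read.request.size
     */
    static int nextNetworkReadSize(boolean reverseSeekSeen,
                                   int requestedBytes,
                                   int requestSizeBytes) {
        if (!reverseSeekSeen) {
            // Sequential mode: always read a full chunk.
            return requestSizeBytes;
        }
        // Random-access mode: read what was asked for, but no less than 16 KB
        // and no more than one chunk; larger requests take several chunk reads.
        int atLeastMinimum = Math.max(requestedBytes, MIN_RANDOM_READ_BYTES);
        return Math.min(atLeastMinimum, requestSizeBytes);
    }
}
```

For example, after a reverse seek a 100-byte read() would trigger a 16 KB 
network read under this sketch, a 64 KB read() would trigger a 64 KB network 
read, and a 10 MB read() would be served by successive reads of at most 4 MB 
each.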



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
