[ 
https://issues.apache.org/jira/browse/HADOOP-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334311#comment-15334311
 ] 

Steve Loughran commented on HADOOP-13203:
-----------------------------------------

-1

I accept your contention that it works well for your benchmark. However, it does 
so at the expense of being pathologically bad for anything that reads a file in 
a different pattern. I can demonstrate this with the log from one of my 
SPARK-7481 runs. Essentially, reading a 20MB .csv.gz file has gone from under 
20s to about 6 minutes: roughly 20x slower.

{code}

2016-06-16 18:34:21,350 INFO  scheduler.DAGScheduler 
(Logging.scala:logInfo(58)) - Job 0 finished: count at S3LineCount.scala:99, 
took 350.460510 s
2016-06-16 18:34:21,355 INFO  examples.S3LineCount (Logging.scala:logInfo(58)) 
- Duration of  count s3a://landsat-pds/scene_list.gz = 350,666,373,013 ns
2016-06-16 18:34:21,355 INFO  examples.S3LineCount (Logging.scala:logInfo(58)) 
- line count = 514524
2016-06-16 18:34:21,356 INFO  examples.S3LineCount (Logging.scala:logInfo(58)) 
- File System = S3AFileSystem{uri=s3a://landsat-pds, 
workingDir=s3a://landsat-pds/user/stevel, partSize=5242880, 
enableMultiObjectsDelete=true, maxKeys=5000, readAhead=65536, 
blockSize=1048576, multiPartThreshold=5242880, statistics {22626526 bytes read, 
0 bytes written, 3 read ops, 0 large read ops, 0 write ops}, metrics 
{{Context=S3AFileSystem} 
{FileSystemId=03e96d8b-c5d4-4b3c-8b9d-04931588912b-landsat-pds} 
{fsURI=s3a://landsat-pds/scene_list.gz} {files_created=0} {files_copied=0} 
{files_copied_bytes=0} {files_deleted=0} {directories_created=0} 
{directories_deleted=0} {ignored_errors=1} {invocations_copyfromlocalfile=0} 
{invocations_exists=0} {invocations_getfilestatus=3} {invocations_globstatus=1} 
{invocations_is_directory=0} {invocations_is_file=0} {invocations_listfiles=0} 
{invocations_listlocatedstatus=0} {invocations_liststatus=0} 
{invocations_mdkirs=0} {invocations_rename=0} {object_copy_requests=0} 
{object_delete_requests=0} {object_list_requests=0} 
{object_metadata_requests=3} {object_multipart_aborted=0} {object_put_bytes=0} 
{object_put_requests=0} {streamReadOperations=1584} 
{streamForwardSeekOperations=0} {streamBytesRead=22626526} 
{streamSeekOperations=0} {streamReadExceptions=0} {streamOpened=1584} 
{streamReadOperationsIncomplete=1584} {streamAborted=0} 
{streamReadFullyOperations=0} {streamClosed=1584} {streamBytesSkippedOnSeek=0} 
{streamCloseOperations=1584} {streamBytesBackwardsOnSeek=0} 
{streamBackwardSeekOperations=0} }}

{code}

And without the patch

{code}
2016-06-16 18:37:55,688 INFO  scheduler.DAGScheduler 
(Logging.scala:logInfo(58)) - Job 0 finished: count at S3LineCount.scala:99, 
took 15.853566 s
2016-06-16 18:37:55,693 INFO  examples.S3LineCount (Logging.scala:logInfo(58)) 
- Duration of  count s3a://landsat-pds/scene_list.gz = 16,143,975,760 ns
2016-06-16 18:37:55,693 INFO  examples.S3LineCount (Logging.scala:logInfo(58)) 
- line count = 514524
2016-06-16 18:37:55,694 INFO  examples.S3LineCount (Logging.scala:logInfo(58)) 
- File System = S3AFileSystem{uri=s3a://landsat-pds, 
workingDir=s3a://landsat-pds/user/stevel, partSize=5242880, 
enableMultiObjectsDelete=true, maxKeys=5000, readAhead=65536, 
blockSize=1048576, multiPartThreshold=5242880, statistics {22626526 bytes read, 
0 bytes written, 3 read ops, 0 large read ops, 0 write ops}, metrics 
{{Context=S3AFileSystem} 
{FileSystemId=96650849-6e33-441f-a976-e74443239ad6-landsat-pds} 
{fsURI=s3a://landsat-pds/scene_list.gz} {files_created=0} {files_copied=0} 
{files_copied_bytes=0} {files_deleted=0} {directories_created=0} 
{directories_deleted=0} {ignored_errors=1} {invocations_copyfromlocalfile=0} 
{invocations_exists=0} {invocations_getfilestatus=3} {invocations_globstatus=1} 
{invocations_is_directory=0} {invocations_is_file=0} {invocations_listfiles=0} 
{invocations_listlocatedstatus=0} {invocations_liststatus=0} 
{invocations_mdkirs=0} {invocations_rename=0} {object_copy_requests=0} 
{object_delete_requests=0} {object_list_requests=0} 
{object_metadata_requests=3} {object_multipart_aborted=0} {object_put_bytes=0} 
{object_put_requests=0} {streamReadOperations=2601} 
{streamForwardSeekOperations=0} {streamBytesRead=22626526} 
{streamSeekOperations=0} {streamReadExceptions=0} {streamOpened=1} 
{streamReadOperationsIncomplete=2601} {streamAborted=0} 
{streamReadFullyOperations=0} {streamClosed=1} {streamBytesSkippedOnSeek=0} 
{streamCloseOperations=1} {streamBytesBackwardsOnSeek=0} 
{streamBackwardSeekOperations=0} }}
2016-06-16 18:37:55,694 INFO  examples.S3LineCount (Logging.scala:logInfo(58)) 
- Stopping Spark Context
{code}


The test, I believe, simply reads in the whole file: no seeks, no skipping. I 
see the number of stream open calls has gone from 1 to 1584 with the patch — one 
open per read operation. I suspect that is what's at play here.

I think this code needs what I suggested: some block mechanism which works both 
with plain open()/read() calls and with read operations where the full length is 
known. It also needs to handle the scenario of a read(byte[]) which overshoots 
the block currently being read, not by closing the current read and discarding 
the data, but by reading all the data it can from the current block, then 
starting to read the next block.

Also

* The default block size needs to be significantly bigger.
* A new S3Scale test to do a full byte-by-byte scan through the file; this 
will pick up performance problems immediately. (I may put that in myself 
anyway, to catch similar problems in other patches.)
* I think some more instrumentation would be good here, specifically "bytes 
from current read discarded".
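For the scale-test bullet, a byte-by-byte scan is the worst case for any change that reopens the stream per read, so it surfaces this class of regression immediately. A minimal sketch, with invented class and method names rather than the actual S3A scale-test machinery:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch of the suggested scale test: scan an entire stream
// one byte at a time. Single-byte reads magnify any per-read reopen cost,
// so a reopen-per-read bug shows up immediately as elapsed time.
public class ByteByByteScan {
    // Returns the byte count; a real test would also time the scan and
    // fail if it exceeded a duration budget.
    public static long scan(InputStream in) {
        long count = 0;
        try {
            while (in.read() >= 0) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```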

> S3a: Consider reducing the number of connection aborts by setting correct 
> length in s3 request
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13203
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13203
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HADOOP-13203-branch-2-001.patch, 
> HADOOP-13203-branch-2-002.patch, HADOOP-13203-branch-2-003.patch, 
> HADOOP-13203-branch-2-004.patch, stream_stats.tar.gz
>
>
> Currently file's "contentLength" is set as the "requestedStreamLen", when 
> invoking S3AInputStream::reopen().  As a part of lazySeek(), sometimes the 
> stream had to be closed and reopened. But lots of times the stream was closed 
> with abort() causing the internal http connection to be unusable. This incurs 
> lots of connection establishment cost in some jobs.  It would be good to set 
> the correct value for the stream length to avoid connection aborts. 
> I will post the patch once aws tests passes in my machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
