[
https://issues.apache.org/jira/browse/HADOOP-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574770#comment-15574770
]
ASF GitHub Bot commented on HADOOP-13560:
-----------------------------------------
Github user thodemoor commented on a diff in the pull request:
https://github.com/apache/hadoop/pull/130#discussion_r83385370
--- Diff: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md ---
@@ -881,40 +881,362 @@ Seoul
If the wrong endpoint is used, the request may fail. This may be reported
as a 301/redirect error, or as a 400 Bad Request.
-### S3AFastOutputStream
- **Warning: NEW in hadoop 2.7. UNSTABLE, EXPERIMENTAL: use at own risk**
-  <property>
-    <name>fs.s3a.fast.upload</name>
-    <value>false</value>
-    <description>Upload directly from memory instead of buffering to
-    disk first. Memory usage and parallelism can be controlled as up to
-    fs.s3a.multipart.size memory is consumed for each (part)upload actively
-    uploading (fs.s3a.threads.max) or queueing (fs.s3a.max.total.tasks)</description>
-  </property>
-  <property>
-    <name>fs.s3a.fast.buffer.size</name>
-    <value>1048576</value>
-    <description>Size (in bytes) of initial memory buffer allocated for an
-    upload. No effect if fs.s3a.fast.upload is false.</description>
-  </property>
+### <a name="s3a_fast_upload"></a>Stabilizing: S3A Fast Upload
+
+
+**New in Hadoop 2.7; significantly enhanced in Hadoop 2.9**
+
+
+Because of the nature of the S3 object store, data written to an S3A `OutputStream`
+is not written incrementally; instead, by default, it is buffered to disk
+until the stream's `close()` method is called.
+
+This can make output slow:
+
+* The execution time for `OutputStream.close()` is proportional to the amount
+of data buffered and inversely proportional to the bandwidth; that is,
+`O(data/bandwidth)`. (See the worked sketch after this list.)
+* The bandwidth is that available from the host to S3: other work in the same
+process, server or network at the time of upload may increase the upload time,
+and hence the duration of the `close()` call.
+* If a process uploading data fails before `OutputStream.close()` is called,
+all data is lost.
+* The disks hosting temporary directories defined in `fs.s3a.buffer.dir` must
+have the capacity to store the entire buffered file.
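+
+A rough worked sketch of that first point, using illustrative numbers rather
+than measured figures:
+
+```java
+// Illustrative only: close() time grows as O(data/bandwidth).
+public class CloseTimeEstimate {
+  public static void main(String[] args) {
+    long bufferedBytes = 1L << 30;          // assume 1 GB buffered before close()
+    long bandwidthBytesPerSec = 10L << 20;  // assume 10 MB/s from host to S3
+    System.out.println("close() blocks for ~"
+        + bufferedBytes / bandwidthBytesPerSec + " seconds");  // ~102 seconds
+  }
+}
+```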
+
+Put succinctly: the further the process is from the S3 endpoint, or the smaller
+the EC2-hosted VM is, the longer the work will take to complete.
+
+This can create problems in application code:
+
+* Code often assumes that the `close()` call is fast;
+ the delays can create bottlenecks in operations.
+* Very slow uploads sometimes cause applications to time out: generally,
+threads blocking during the upload stop reporting progress, and so trigger
+timeouts.
+* Streaming very large amounts of data may consume all disk space before the
+upload begins.
+
+
+Work to address this began in Hadoop 2.7 with the `S3AFastOutputStream`
+[HADOOP-11183](https://issues.apache.org/jira/browse/HADOOP-11183), and
+has continued with `S3ABlockOutputStream`
+[HADOOP-13560](https://issues.apache.org/jira/browse/HADOOP-13560).
+
+
+This adds an alternative output stream, "S3A Fast Upload", which:
+
+1. Always uploads large files as blocks with the size set by
+   `fs.s3a.multipart.size`. That is: the threshold at which multipart uploads
+   begin and the size of each upload are identical.
+1. Buffers blocks to disk (default) or in on-heap or off-heap memory.
+1. Uploads blocks in parallel in background threads.
+1. Begins uploading blocks as soon as the buffered data exceeds this partition
+   size.
+1. When buffering data to disk, uses the directory/directories listed in
+   `fs.s3a.buffer.dir`. The size of data which can be buffered is limited
+   to the available disk space.
+1. Generates output statistics as metrics on the filesystem, including
+   statistics of active and pending block uploads.
+1. Has the time to `close()` set by the amount of remaining data to upload,
+   rather than the total size of the file.
+
+With incremental writes of blocks, "S3A fast upload" offers an upload
+time at least as fast as the "classic" mechanism, with significant benefits
+on long-lived output streams, and when very large amounts of data are generated.
+The in-memory buffering mechanisms may also offer a speedup when running adjacent
+to S3 endpoints, as disks are not used for intermediate data storage.
+
+
+```xml
+<property>
+ <name>fs.s3a.fast.upload</name>
+ <value>true</value>
+ <description>
+ Use the incremental block upload mechanism with
+ the buffering mechanism set in fs.s3a.fast.upload.buffer.
+ The number of threads performing uploads in the filesystem is defined
+ by fs.s3a.threads.max; the queue of waiting uploads is limited by
+ fs.s3a.max.total.tasks.
+ The size of each buffer is set by fs.s3a.multipart.size.
+ </description>
+</property>
+
+<property>
+ <name>fs.s3a.fast.upload.buffer</name>
+ <value>disk</value>
+ <description>
+ The buffering mechanism to use when using S3A fast upload
+ (fs.s3a.fast.upload=true). Values: disk, array, bytebuffer.
+ This configuration option has no effect if fs.s3a.fast.upload is false.
+
+ "disk" will use the directories listed in fs.s3a.buffer.dir as
+ the location(s) to save data prior to being uploaded.
+
+ "array" uses arrays in the JVM heap
+
+ "bytebuffer" uses off-heap memory within the JVM.
+
+ Both "array" and "bytebuffer" will consume memory in a single stream
up to the number
+ of blocks set by:
+
+ fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks.
+
+ If using either of these mechanisms, keep this value low
+
+ The total number of threads performing work across all threads is set
by
+ fs.s3a.threads.max, with fs.s3a.max.total.tasks values setting the
number of queued
+ work items.
+ </description>
+</property>
+
+<property>
+ <name>fs.s3a.multipart.size</name>
+ <value>104857600</value>
+ <description>
+ How big (in bytes) to split upload or copy operations up into.
+ </description>
+</property>
+
+<property>
+ <name>fs.s3a.fast.upload.active.blocks</name>
+ <value>8</value>
+ <description>
+ Maximum number of blocks a single output stream can have
+ active (uploading, or queued to the central FileSystem
+ instance's pool of queued operations).
+
+ This stops a single stream overloading the shared thread pool.
+ </description>
+</property>
+```
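+
+As a hedged illustration of the settings above, here is a minimal Java sketch
+of enabling fast upload programmatically and writing a large object. The bucket
+and path (`s3a://mybucket/...`) are placeholders; the configuration keys are
+those documented above.
+
+```java
+// Hypothetical example: enable S3A fast upload and write a large object.
+// Assumes hadoop-aws and its AWS SDK dependency are on the classpath,
+// and that credentials for the bucket are configured.
+import java.net.URI;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
+public class FastUploadExample {
+  public static void main(String[] args) throws Exception {
+    Configuration conf = new Configuration();
+    conf.setBoolean("fs.s3a.fast.upload", true);    // use the block output stream
+    conf.set("fs.s3a.fast.upload.buffer", "disk");  // buffer blocks on local disk
+    FileSystem fs = FileSystem.get(URI.create("s3a://mybucket/"), conf);
+    byte[] block = new byte[1024 * 1024];           // 1 MB of demo data per write
+    try (FSDataOutputStream out = fs.create(new Path("s3a://mybucket/large-file"))) {
+      for (int i = 0; i < 512; i++) {               // 512 MB: several multipart blocks
+        out.write(block);                           // uploads start once a block fills
+      }
+    }  // close() waits only for the remaining blocks to upload
+  }
+}
+```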
+
+**Notes**
+
+* If the amount of data written to a stream is below that set in
+`fs.s3a.multipart.size`, the upload is performed in the `OutputStream.close()`
+operation, as with the original output stream.
+
+* The published Hadoop metrics include live queue length and upload operation
+counts, making it possible to identify a backlog of work, or a mismatch between
+data generation rates and network bandwidth. Per-stream statistics can also be
+logged by calling `toString()` on the current stream.
+
+* Incremental writes are not visible; the object can only be listed
+or read when the multipart operation completes in the `close()` call, which
+will block until the upload is completed.
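+
+A hedged sketch of logging those per-stream statistics. `getWrappedStream()`
+is the standard `FSDataOutputStream` accessor; the assumption here is that for
+an `s3a://` path the wrapped stream's `toString()` reports its upload state.
+
+```java
+// Hypothetical helper: write some data, then log the stream's statistics.
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
+public class StreamStatsExample {
+  static void writeAndLog(FileSystem fs, Path path, byte[] data) throws Exception {
+    try (FSDataOutputStream out = fs.create(path)) {
+      out.write(data);
+      // For an s3a:// stream, toString() includes block/upload statistics.
+      System.out.println("Stream state: " + out.getWrappedStream());
+    }
+  }
+}
+```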
+
+
+#### <a name="s3a_fast_upload_disk"></a>Fast Upload with Disk Buffers
`fs.s3a.fast.upload.buffer=disk`
+
+When `fs.s3a.fast.upload.buffer` is set to `disk`, all data is buffered
+to local hard disks prior to upload. This minimizes the amount of memory
+consumed, and so eliminates heap size as the limiting factor in queued uploads,
+exactly as the original "direct to disk" buffering used when
+`fs.s3a.fast.upload=false`.
+
+
+```xml
+<property>
+ <name>fs.s3a.fast.upload</name>
+ <value>true</value>
+</property>
+
+<property>
+ <name>fs.s3a.fast.upload.buffer</name>
+ <value>disk</value>
+</property>
+
+```
+
+
+#### <a name="s3a_fast_upload_bytebuffer"></a>Fast Upload with
ByteBuffers: `fs.s3a.fast.upload.buffer=bytebuffer`
+
+When `fs.s3a.fast.upload.buffer` is set to `bytebuffer`, all data is buffered
+in "direct" ByteBuffers prior to upload. This *may* be faster than buffering
+to disk, and it avoids consuming local disk space, which may be scarce
+(for example, on tiny EC2 VMs).
+
+The ByteBuffers are created in the memory of the JVM, but not in the Java heap
+itself (direct buffer capacity is normally capped by the JVM's
+`-XX:MaxDirectMemorySize` option). The amount of data which can be buffered is
+limited by the Java runtime, the operating system, and, for YARN applications,
+the amount of memory requested for each container.
+
+The slower the write bandwidth to S3, the greater the risk of running out
+of memory.
+
+
+```xml
+<property>
+ <name>fs.s3a.fast.upload</name>
+ <value>true</value>
+</property>
+
+<property>
+ <name>fs.s3a.fast.upload.buffer</name>
+ <value>bytebuffer</value>
+</property>
+```
+
+#### <a name="s3a_fast_upload_array"></a>Fast Upload with Arrays:
`fs.s3a.fast.upload.buffer=array`
+
+When `fs.s3a.fast.upload.buffer` is set to `array`, all data is buffered
+in byte arrays in the JVM's heap prior to upload.
+This *may* be faster than buffering to disk.
+
+This `array` option is similar to the in-memory-only stream offered in
+Hadoop 2.7 with `fs.s3a.fast.upload=true`.
+
+The amount of data which can be buffered is limited by the available
+size of the JVM heap. The slower the write bandwidth to S3, the greater
+the risk of heap overflows.
+
+```xml
+<property>
+ <name>fs.s3a.fast.upload</name>
+ <value>true</value>
+</property>
+
+<property>
+ <name>fs.s3a.fast.upload.buffer</name>
+ <value>array</value>
+</property>
+
+```
+#### <a name="s3a_fast_upload_thread_tuning"></a>S3A Fast Upload Thread
Tuning
+
+Both the [Array](#s3a_fast_upload_array) and [Byte buffer](#s3a_fast_upload_bytebuffer)
+buffer mechanisms can consume very large amounts of memory, on-heap or
+off-heap respectively. The [disk buffer](#s3a_fast_upload_disk) mechanism
+does not use much memory, but will consume hard disk capacity.
+
+If there are many output streams being written to in a single process, the
+amount of memory or disk used is the total of all the streams' active
+memory/disk use.
+
+Careful tuning may be needed to reduce the risk of running out of memory,
+especially if the data is buffered in memory.
+
+There are a number of parameters which can be tuned:
+
+1. The total number of threads available in the filesystem for data
+uploads *or any other queued filesystem operation*. This is set in
+`fs.s3a.threads.max`.
+
+1. The number of operations which can be queued for execution, *awaiting
+a thread*: `fs.s3a.max.total.tasks`.
+
+1. The number of blocks which a single output stream can have active,
+that is: being uploaded by a thread, or queued in the filesystem thread queue:
+`fs.s3a.fast.upload.active.blocks`.
+
+1. How long an idle thread can stay in the thread pool before it is
+retired: `fs.s3a.threads.keepalivetime`.
+
+
+When the maximum allowed number of active blocks of a single stream is reached,
+no more blocks can be uploaded from that stream until one or more of those
+active blocks' uploads complete. That is: a `write()` call which would trigger
+an upload of a now-full datablock will instead block until there is capacity
+in the queue.
+
+How does that come together?
+
+* As the pool of threads set in `fs.s3a.threads.max` is shared (and intended
+to be used across all streams), a larger number here can allow for more
+parallel operations. However, as uploads require network bandwidth, adding more
+threads does not guarantee speedup.
+
+* The extra queue of tasks for the thread pool (`fs.s3a.max.total.tasks`)
+covers all ongoing background S3A operations (future plans include: parallelized
+rename operations, asynchronous directory operations).
+
+* When using memory buffering, a small value of `fs.s3a.fast.upload.active.blocks`
+limits the amount of memory which can be consumed per stream. For example, at
+the default `fs.s3a.multipart.size` of 100 MB with 8 active blocks, a single
+stream may buffer up to 800 MB.
+
+* When using disk buffering, a larger value of `fs.s3a.fast.upload.active.blocks`
+does not consume much memory, but it may result in a large number of blocks
+competing with other filesystem operations.
+
+
+We recommend a low value of `fs.s3a.fast.upload.active.blocks`: enough
+to start background upload without overloading other parts of the system,
+then experiment to see if higher values deliver more throughput, especially
+from VMs running on EC2.
+
+```xml
+
+<property>
+ <name>fs.s3a.fast.upload.active.blocks</name>
+ <value>4</value>
+ <description>
+ Maximum number of blocks a single output stream can have
+ active (uploading, or queued to the central FileSystem
+ instance's pool of queued operations).
+
+ This stops a single stream overloading the shared thread pool.
+ </description>
+</property>
+
+<property>
+ <name>fs.s3a.threads.max</name>
+ <value>10</value>
+ <description>The total number of threads available in the filesystem for
+ data uploads *or any other queued filesystem operation*.</description>
+</property>
+
+<property>
+ <name>fs.s3a.max.total.tasks</name>
+ <value>5</value>
+ <description>The number of operations which can be queued for
+ execution</description>
+</property>
+
+<property>
+ <name>fs.s3a.threads.keepalivetime</name>
+ <value>60</value>
+ <description>Number of seconds a thread can be idle before being
+ terminated.</description>
+</property>
+
+```
+
+
+#### <a name="s3a_multipart_purge"></a>Cleaning up After Incremental
Upload Failures: `fs.s3a.multipart.purge`
+
+
+If an incremental streaming operation is interrupted, there may be
+intermediate partitions uploaded to S3: data which will be billed for.
+
+These charges can be reduced by enabling `fs.s3a.multipart.purge`,
+and setting a purge time in seconds, such as 86400 seconds (24 hours), after
+which the S3 service automatically deletes outstanding multipart
--- End diff --
To me, the wording here gives the impression this is a server-side
operation, but the purging happens on the client by listing all uploads and
then sending a delete call with the ones to be purged. Consequently, this can
cause a (slight) delay when instantiating an s3a FS instance if there are lots
of active uploads (to purge).
> S3ABlockOutputStream to support huge (many GB) file writes
> ----------------------------------------------------------
>
> Key: HADOOP-13560
> URL: https://issues.apache.org/jira/browse/HADOOP-13560
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.9.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13560-branch-2-001.patch,
> HADOOP-13560-branch-2-002.patch, HADOOP-13560-branch-2-003.patch,
> HADOOP-13560-branch-2-004.patch
>
>
> An AWS SDK [issue|https://github.com/aws/aws-sdk-java/issues/367] highlights
> that metadata isn't copied on large copies.
> 1. Add a test to do that large copy/rename and verify that the copy really
> works
> 2. Verify that metadata makes it over.
> Verifying large file rename is important on its own, as it is needed for very
> large commit operations for committers using rename
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)