[
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683527#comment-17683527
]
ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------
steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1094941352
##########
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##########
@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20 \
Because object stores are slow to list files, consider setting the
`-numListstatusThreads` option when performing a `-update` operation
on a large directory tree (the limit is 40 threads).
-When `DistCp -update` is used with object stores,
-generally only the modification time and length of the individual files are compared,
-not any checksums. The fact that most object stores do have valid timestamps
-for directories is irrelevant; only the file timestamps are compared.
-However, it is important to have the clock of the client computers close
-to that of the infrastructure, so that timestamps are consistent between
-the client/HDFS cluster and that of the object store. Otherwise, changed files may be
-missed/copied too often.
+When `DistCp -update` is used with object stores, generally only the
+modification time and length of the individual files are compared, not any
+checksums if the checksum algorithm between the two stores is different.
+
+* The `distcp -update` between two object stores with different checksum
+ algorithm compares the modification times of source and target files along
+ with the file size to determine whether to skip the file copy. The behavior
+ is controlled by the property `distcp.update.modification.time`, which is
+ set to true by default. If the source file is more recently modified than
+ the target file, it is assumed that the content has changed, and the file
+ should be updated.
+ We need to ensure that there is no clock skew between the machines.
+ The fact that most object stores do have valid timestamps for directories
+ is irrelevant; only the file timestamps are compared. However, it is
+ important to have the clock of the client computers close to that of the
+ infrastructure, so that timestamps are consistent between the client/HDFS
+ cluster and that of the object store. Otherwise, changed files may be
+ missed/copied too often.
+
+* `distcp.update.modification.time` can be used alongside the checksum check
+ in stores with same checksum algorithm as well. if set to true we check
+ both modification time and checksum between the files, but if this property
Review Comment:
Really? I think that if the checksums match, the timestamps shouldn't be
compared at all. If two files' checksums match, that is sufficient to say "they
are the same".
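
To make the suggested ordering concrete, here is a minimal sketch of the skip decision as described in this comment. It is not DistCp's actual code: `FileInfo`, `can_skip`, and `use_mod_time` are hypothetical names, with `use_mod_time` only standing in for the `distcp.update.modification.time` property. The behaviour when neither checksums nor timestamps can be used is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for the file metadata a copy tool would see."""
    length: int                    # file size in bytes
    mtime_ms: int                  # modification time, epoch milliseconds
    checksum_algo: Optional[str]   # None if the store exposes no checksum
    checksum: Optional[bytes]


def can_skip(source: FileInfo, target: FileInfo,
             use_mod_time: bool = True) -> bool:
    """Return True if copying `source` over `target` can be skipped."""
    # Different lengths always mean different content.
    if source.length != target.length:
        return False

    # Checksums are only comparable when both stores use the same algorithm.
    comparable = (source.checksum_algo is not None
                  and source.checksum_algo == target.checksum_algo
                  and source.checksum is not None
                  and target.checksum is not None)
    if comparable:
        # A comparable checksum is decisive; timestamps are not consulted.
        return source.checksum == target.checksum

    if use_mod_time:
        # No comparable checksum: a source newer than the target is
        # assumed to have changed.
        return source.mtime_ms <= target.mtime_ms

    # Assumption: with only a matching length to go on, skip the copy.
    return True
```

With this ordering, a source file that is newer than the target but has an identical, comparable checksum is still skipped; the timestamp comparison only matters when the two stores use different checksum algorithms, which is also where clock skew between the machines can cause missed or extra copies.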
> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
> ----------------------------------------------------------------------------------------------------
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
> Labels: pull-request-available
>
> Distcp -update currently relies on file size, block size, and checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores have different checksum algorithms, we should
> add a modification time check to these checks as well.
> This would ensure that, while performing -update, files that are perceived to
> be out of sync are copied. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Improve testing and documentation for modification time checks between
> different object stores to ensure no incorrect skipping of files.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)