[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683527#comment-17683527
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1094941352


##########
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##########
@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20  \
 Because object stores are slow to list files, consider setting the 
`-numListstatusThreads` option when performing a `-update` operation
 on a large directory tree (the limit is 40 threads).
 
-When `DistCp -update` is used with object stores,
-generally only the modification time and length of the individual files are compared,
-not any checksums. The fact that most object stores do have valid timestamps
-for directories is irrelevant; only the file timestamps are compared.
-However, it is important to have the clock of the client computers close
-to that of the infrastructure, so that timestamps are consistent between
-the client/HDFS cluster and that of the object store. Otherwise, changed files may be
-missed/copied too often.
+When `DistCp -update` is used with object stores, generally only the
+modification time and length of the individual files are compared, not any
+checksums if the checksum algorithm between the two stores is different.
+
+* The `distcp -update` between two object stores with different checksum
+  algorithm compares the modification times of source and target files along
+  with the file size to determine whether to skip the file copy. The behavior
+  is controlled by the property `distcp.update.modification.time`, which is
+  set to true by default. If the source file is more recently modified than
+  the target file, it is assumed that the content has changed, and the file
+  should be updated.
+  We need to ensure that there is no clock skew between the machines.
+  The fact that most object stores do have valid timestamps for directories
+  is irrelevant; only the file timestamps are compared. However, it is
+  important to have the clock of the client computers close to that of the
+  infrastructure, so that timestamps are consistent between the client/HDFS
+  cluster and that of the object store. Otherwise, changed files may be
+  missed/copied too often.
+
+* `distcp.update.modification.time` can be used alongside the checksum check
+  in stores with same checksum algorithm as well. if set to true we check
+  both modification time and checksum between the files, but if this property

Review Comment:
   Really? I think that if the checksums match, the timestamps shouldn't be
compared at all. If two files' checksums match, that is sufficient to say "they
are the same".
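
   To make the suggested ordering concrete, here is a minimal sketch of the skip decision. It is not the actual DistCp code: the class `UpdateSkipSketch`, the `FileInfo` holder, and the `canSkip` method are hypothetical names invented for illustration, and treating same-length files as unchanged when neither check is available is an assumption. The `useModTime` flag stands in for `distcp.update.modification.time`.

```java
import java.util.Objects;

/** Illustrative only: hypothetical names, not part of Hadoop DistCp. */
public final class UpdateSkipSketch {

  /** Minimal stand-in for the file metadata the comparison needs. */
  public static final class FileInfo {
    final long length;
    final long modificationTime;
    final String checksumAlgorithm; // null if the store exposes no checksum
    final String checksum;

    public FileInfo(long length, long modificationTime,
                    String checksumAlgorithm, String checksum) {
      this.length = length;
      this.modificationTime = modificationTime;
      this.checksumAlgorithm = checksumAlgorithm;
      this.checksum = checksum;
    }
  }

  /**
   * Returns true if the copy of source onto target can be skipped.
   * useModTime stands in for distcp.update.modification.time.
   */
  public static boolean canSkip(FileInfo source, FileInfo target,
                                boolean useModTime) {
    // Different lengths always mean the file must be copied.
    if (source.length != target.length) {
      return false;
    }
    // Checksums are only comparable when both stores use the same algorithm.
    boolean comparable = source.checksumAlgorithm != null
        && source.checksumAlgorithm.equals(target.checksumAlgorithm);
    if (comparable) {
      // Matching checksums are sufficient; timestamps are not consulted.
      return Objects.equals(source.checksum, target.checksum);
    }
    if (useModTime) {
      // A source newer than the target is assumed to have changed.
      return source.modificationTime <= target.modificationTime;
    }
    // Assumption: with no usable check left, equal length counts as unchanged.
    return true;
  }
}
```

   With this ordering, modification times only matter when the checksums cannot be compared, so clock skew cannot trigger a copy of files whose checksums already match.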





> Distcp -update between different cloud stores to use modification time while 
> checking for file skip.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18596
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18596
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Mehakmeet Singh
>            Assignee: Mehakmeet Singh
>            Priority: Major
>              Labels: pull-request-available
>
> Distcp -update currently relies on file size, block size, and checksum 
> comparisons to decide which files should be skipped or copied. 
> Since different cloud stores use different checksum algorithms, we should 
> add a modification time check as well.
> This would ensure that, while performing -update, files perceived to be 
> out of sync are copied. The machines between which the file transfers occur 
> should be time-synchronized to avoid extra copies.
> Testing and documentation for modification time checks between different 
> object stores should also be improved, so that files are not incorrectly skipped.


