xiangfu0 commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1059217816


##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-hadoop/src/main/java/org/apache/pinot/plugin/ingestion/batch/hadoop/HadoopSegmentCreationMapper.java:
##########
@@ -183,13 +184,24 @@ protected void map(LongWritable key, Text value, Context 
context) {
       LOGGER.info("Size for segment: {}, uncompressed: {}, compressed: {}", 
segmentName,
           DataSizeUtils.fromBytes(uncompressedSegmentSize), 
DataSizeUtils.fromBytes(compressedSegmentSize));
       //move segment to output PinotFS
-      URI outputSegmentTarURI = 
SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, 
outputDirURI)
-          .resolve(segmentTarFileName);
-      LOGGER.info("Copying segment tar file from [{}] to [{}]", 
localSegmentTarFile, outputSegmentTarURI);
-      outputDirFS.copyFromLocalFile(localSegmentTarFile, outputSegmentTarURI);
+      URI relativeOutputPath = 
SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, 
outputDirURI);
+      URI outputSegmentTarURI = relativeOutputPath.resolve(segmentTarFileName);
+      SegmentGenerationJobUtils.moveLocalTarFileToRemote(localSegmentTarFile, 
outputSegmentTarURI,
+          _spec.isOverwriteOutput());
 
+
+      String metadataTarFileName = URLEncoder.encode(segmentName + 
Constants.METADATA_TAR_GZ_FILE_EXT, "UTF-8");

Review Comment:
   > > > Doesn't this introduce inconsistency between the two steps. Assume the 
following scenario -
   > > > 
   > > > * Generate one segment tar using StandaloneJobRunner, we also generate 
the corresponding metadata tar for it
   > > > * Replaced that segment tar in the output dir with some other tar file 
with the same name
   > > > * We trigger metadata push which pushes the old metadata tar, however 
the segment tar file contains different data
   > > 
   > > 
   > > You are 100% correct. I can make the segment creation job to delete the 
metadata tarball as well if not exists. But if users intend to manually update 
the segment tarball, then it's pretty hard to detect that.
   > 
   > Thinking out loud here, but may be we can introduce a crc check and keep 
segment tar crc in the metadata tar?
   
   The purpose of using metadata push is to avoid the segment download from 
controller. The CRC is in segment metadata, but we cannot access it without 
download the segment.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to