xiangfu0 commented on code in PR #10034: URL: https://github.com/apache/pinot/pull/10034#discussion_r1059217816
########## pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-hadoop/src/main/java/org/apache/pinot/plugin/ingestion/batch/hadoop/HadoopSegmentCreationMapper.java: ########## @@ -183,13 +184,24 @@ protected void map(LongWritable key, Text value, Context context) { LOGGER.info("Size for segment: {}, uncompressed: {}, compressed: {}", segmentName, DataSizeUtils.fromBytes(uncompressedSegmentSize), DataSizeUtils.fromBytes(compressedSegmentSize)); //move segment to output PinotFS - URI outputSegmentTarURI = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI) - .resolve(segmentTarFileName); - LOGGER.info("Copying segment tar file from [{}] to [{}]", localSegmentTarFile, outputSegmentTarURI); - outputDirFS.copyFromLocalFile(localSegmentTarFile, outputSegmentTarURI); + URI relativeOutputPath = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI); + URI outputSegmentTarURI = relativeOutputPath.resolve(segmentTarFileName); + SegmentGenerationJobUtils.moveLocalTarFileToRemote(localSegmentTarFile, outputSegmentTarURI, + _spec.isOverwriteOutput()); + + String metadataTarFileName = URLEncoder.encode(segmentName + Constants.METADATA_TAR_GZ_FILE_EXT, "UTF-8"); Review Comment: > > > Doesn't this introduce inconsistency between the two steps. Assume the following scenario - > > > > > > * Generate one segment tar using StandaloneJobRunner, we also generate the corresponding metadata tar for it > > > * Replaced that segment tar in the output dir with some other tar file with the same name > > > * We trigger metadata push which pushes the old metadata tar, however the segment tar file contains different data > > > > > > You are 100% correct. I can make the segment creation job to delete the metadata tarball as well if not exists. But if users intend to manually update the segment tarball, then it's pretty hard to detect that. > > Thinking out loud here, but may be we can introduce a crc check and keep segment tar crc in the metadata tar? The purpose of using metadata push is to avoid the segment download from controller. The CRC is in segment metadata, but we cannot access it without download the segment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org