xiangfu0 commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1059217816
##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-hadoop/src/main/java/org/apache/pinot/plugin/ingestion/batch/hadoop/HadoopSegmentCreationMapper.java:
##########
@@ -183,13 +184,24 @@ protected void map(LongWritable key, Text value, Context
context) {
LOGGER.info("Size for segment: {}, uncompressed: {}, compressed: {}",
segmentName,
DataSizeUtils.fromBytes(uncompressedSegmentSize),
DataSizeUtils.fromBytes(compressedSegmentSize));
//move segment to output PinotFS
- URI outputSegmentTarURI =
SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI,
outputDirURI)
- .resolve(segmentTarFileName);
- LOGGER.info("Copying segment tar file from [{}] to [{}]",
localSegmentTarFile, outputSegmentTarURI);
- outputDirFS.copyFromLocalFile(localSegmentTarFile, outputSegmentTarURI);
+ URI relativeOutputPath =
SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI,
outputDirURI);
+ URI outputSegmentTarURI = relativeOutputPath.resolve(segmentTarFileName);
+ SegmentGenerationJobUtils.moveLocalTarFileToRemote(localSegmentTarFile,
outputSegmentTarURI,
+ _spec.isOverwriteOutput());
+
+ String metadataTarFileName = URLEncoder.encode(segmentName +
Constants.METADATA_TAR_GZ_FILE_EXT, "UTF-8");
Review Comment:
> > > Doesn't this introduce inconsistency between the two steps. Assume the
following scenario -
> > >
> > > * Generate one segment tar using StandaloneJobRunner, we also generate
the corresponding metadata tar for it
> > > * Replaced that segment tar in the output dir with some other tar file
with the same name
> > > * We trigger metadata push which pushes the old metadata tar, however
the segment tar file contains different data
> >
> >
> > You are 100% correct. I can make the segment creation job to delete the
metadata tarball as well if not exists. But if users intend to manually update
the segment tarball, then it's pretty hard to detect that.
>
> Thinking out loud here, but may be we can introduce a crc check and keep
segment tar crc in the metadata tar?
The purpose of using metadata push is to avoid the segment download from
controller. The CRC is in segment metadata, but we cannot access it without
download the segment.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]