kkrugler opened a new issue #7791: URL: https://github.com/apache/pinot/issues/7791
Currently, `SegmentPushUtils.generateSegmentMetadataFile(PinotFS fileSystem, URI tarFileURI)` calls

```java
fileSystem.copyToLocalFile(tarFileURI, tarFile);
```

to download the segment from deep store and create a temporary local copy. It then decompresses and untars the segment, extracts the two files of interest (`creation.meta` and `metadata.properties`), creates a new tarball containing just these two files, and pushes that to the controller.

The amount of data actually extracted from each segment is a tiny fraction of the segment's total size. For example, for a 200 MB segment, the two files of interest total about 4 KB. When a large number of segments are being pushed, this results in a significant performance hit.

Instead, it's possible to stream the segment and extract only the data from the two target files, via something like:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.UUID;
import java.util.zip.GZIPInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;

// Temp file that will hold the metadata-only tarball.
String uuid = UUID.randomUUID().toString();
File segmentMetadataTarFile = new File(FileUtils.getTempDirectory(), "segmentMetadata-" + uuid + ".tar.gz");
if (segmentMetadataTarFile.exists()) {
  FileUtils.forceDelete(segmentMetadataTarFile);
}

GzipCompressorOutputStream gzOut = new GzipCompressorOutputStream(new FileOutputStream(segmentMetadataTarFile));
TarArchiveOutputStream tos = new TarArchiveOutputStream(gzOut);

// Stream the segment straight from deep store instead of copying it locally first.
// segmentUri is the segment tarball's URI (tarFileURI in generateSegmentMetadataFile).
TarArchiveInputStream tis = new TarArchiveInputStream(new GZIPInputStream(fileSystem.open(segmentUri)));
TarArchiveEntry tarEntry;
while ((tarEntry = tis.getNextTarEntry()) != null) {
  System.out.format("%s: %d\n", tarEntry.getName(), tarEntry.getRealSize());

  // Match on the bare file name, ignoring any directory prefix inside the tarball.
  String fullName = tarEntry.getName();
  String filename = fullName.substring(fullName.lastIndexOf('/') + 1);
  if (tarEntry.isFile() && (filename.equals("metadata.properties") || filename.equals("creation.meta"))) {
    // Copy just this entry into the new tarball.
    TarArchiveEntry ae = new TarArchiveEntry(filename);
    ae.setSize(tarEntry.getRealSize());
    tos.putArchiveEntry(ae);
    IOUtils.copy(tis, tos);
    tos.closeArchiveEntry();
  }
}

tos.finish();
tos.close();
tis.close();
```

The above is just example code, without exception handling, etc., but it is able to create the required tarball from a source segment.

The real performance win comes from the two files of interest occurring in the segment tarball before the big files (`columns.psf`, `star_tree_index`, etc.). Because a tar stream is read sequentially, a reader could stop once both files have been copied rather than streaming the rest of the segment, so it would be appropriate to enforce this ordering in the segment writer utility code.
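To illustrate why the entry ordering matters, here is a minimal, dependency-free sketch (Java 11+) of a reader that stops as soon as both target files have been found. It is not Pinot code: `EarlyExitTarScan`, `extract`, `tarGz`, `field` and `skipFully` are hypothetical names, and it parses the 512-byte tar headers by hand only so that it runs without Commons Compress. It assumes short entry names and ignores long-name and PAX extensions. The `tarGz` helper is a test fixture that writes simplified headers (no checksum), so its output is not meant for real tar tools. With `TarArchiveInputStream`, the equivalent change would presumably be a `break` out of the entry loop once both files are copied.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class EarlyExitTarScan {
  private static final int BLOCK = 512; // tar headers and data are aligned to 512-byte blocks

  /** Streams a .tar.gz and returns the wanted files, stopping as soon as all of them are found. */
  static Map<String, byte[]> extract(InputStream tarGz, Set<String> wanted) throws IOException {
    Map<String, byte[]> found = new LinkedHashMap<>();
    try (InputStream in = new GZIPInputStream(tarGz)) {
      byte[] header = new byte[BLOCK];
      // Early exit: the loop ends once every wanted file has been collected.
      while (found.size() < wanted.size() && in.readNBytes(header, 0, BLOCK) == BLOCK) {
        if (header[0] == 0) {
          break; // an all-zero block marks the end of the archive
        }
        String fullName = field(header, 0, 100);
        String name = fullName.substring(fullName.lastIndexOf('/') + 1);
        long size = Long.parseLong(field(header, 124, 12).trim(), 8); // octal size field
        long padding = (BLOCK - size % BLOCK) % BLOCK;
        boolean regularFile = header[156] == '0' || header[156] == 0;
        if (regularFile && wanted.contains(name)) {
          found.put(name, in.readNBytes((int) size));
          skipFully(in, padding);
        } else {
          skipFully(in, size + padding); // still has to be decompressed, but is not stored
        }
      }
    }
    return found;
  }

  /** Reads a NUL-terminated ASCII field from a tar header. */
  private static String field(byte[] h, int off, int len) {
    int end = off;
    while (end < off + len && h[end] != 0) {
      end++;
    }
    return new String(h, off, end - off, StandardCharsets.US_ASCII);
  }

  private static void skipFully(InputStream in, long n) throws IOException {
    while (n > 0) {
      long skipped = in.skip(n);
      if (skipped <= 0) {
        if (in.read() < 0) {
          throw new EOFException("truncated tar stream");
        }
        skipped = 1;
      }
      n -= skipped;
    }
  }

  /** Test fixture only: builds a simplified in-memory .tar.gz with entries in the given order. */
  static byte[] tarGz(Map<String, byte[]> entries) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
      for (Map.Entry<String, byte[]> e : entries.entrySet()) {
        byte[] header = new byte[BLOCK];
        byte[] name = e.getKey().getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(name, 0, header, 0, name.length);
        byte[] size = String.format("%011o", e.getValue().length).getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(size, 0, header, 124, size.length);
        header[156] = '0'; // typeflag: regular file
        out.write(header);
        out.write(e.getValue());
        out.write(new byte[(BLOCK - e.getValue().length % BLOCK) % BLOCK]);
      }
      out.write(new byte[2 * BLOCK]); // end-of-archive marker
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] big = new byte[1 << 20];
    new Random(42).nextBytes(big); // incompressible stand-in for a large index file
    Map<String, byte[]> entries = new LinkedHashMap<>();
    entries.put("seg/v3/creation.meta", "meta".getBytes(StandardCharsets.US_ASCII));
    entries.put("seg/v3/metadata.properties", "k=v".getBytes(StandardCharsets.US_ASCII));
    entries.put("seg/v3/columns.psf", big); // big file placed last
    byte[] archive = tarGz(entries);
    ByteArrayInputStream source = new ByteArrayInputStream(archive);
    Map<String, byte[]> found = extract(source, Set.of("creation.meta", "metadata.properties"));
    System.out.println("found " + found.keySet() + ", unread compressed bytes: "
        + source.available() + " of " + archive.length);
  }
}
```

With the two small files first, almost all of the compressed stream is left unread when `extract` returns. If the large entry came first, the same loop would have to decompress all of it before reaching the metadata.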