[GitHub] [pinot] kkrugler opened a new issue #7791: Only download required files from segment for metadata push

GitBox Wed, 17 Nov 2021 16:18:17 -0800


kkrugler opened a new issue #7791:
URL: https://github.com/apache/pinot/issues/7791



   Currently, `SegmentPushUtils.generateSegmentMetadataFile(PinotFS fileSystem, URI tarFileURI)` calls
   ``` java
       fileSystem.copyToLocalFile(tarFileURI, tarFile);
   ```
   to download the segment from deep store and create a temporary local copy.
It then decompresses and untars the segment, extracts the two files of interest
(`creation.meta` and `metadata.properties`), creates a new tarball containing
only these two files, and pushes that tarball to the controller.
   
   The data actually extracted from each segment is a tiny fraction of the
segment's total size. For example, for a 200 MB segment the two files of
interest total about 4 KB. When a large number of segments are being pushed,
downloading each one in full results in a significant performance hit.
   
   Instead, it's possible to stream/unpack the segment and extract only the 
data from the two target files, via something like:
   ``` java
           // Build the metadata-only tarball in a uniquely named temp file.
           String uuid = UUID.randomUUID().toString();
           File segmentMetadataTarFile =
               new File(FileUtils.getTempDirectory(), "segmentMetadata-" + uuid + ".tar.gz");
           if (segmentMetadataTarFile.exists()) {
             FileUtils.forceDelete(segmentMetadataTarFile);
           }

           // Stream the segment from deep store instead of copying it locally first.
           try (TarArchiveInputStream tis = new TarArchiveInputStream(
                    new GZIPInputStream(fileSystem.open(tarFileURI)));
                TarArchiveOutputStream tos = new TarArchiveOutputStream(
                    new GzipCompressorOutputStream(new FileOutputStream(segmentMetadataTarFile)))) {
               int copied = 0;
               TarArchiveEntry tarEntry;
               // Stop once both files are copied, so later entries are never read.
               while (copied < 2 && (tarEntry = tis.getNextTarEntry()) != null) {
                   String fullName = tarEntry.getName();
                   String filename = fullName.substring(fullName.lastIndexOf('/') + 1);
                   if (tarEntry.isFile()
                       && (filename.equals("metadata.properties") || filename.equals("creation.meta"))) {
                       // Copy the current entry's bytes into the new tarball under its bare name.
                       TarArchiveEntry ae = new TarArchiveEntry(filename);
                       ae.setSize(tarEntry.getRealSize());
                       tos.putArchiveEntry(ae);
                       IOUtils.copy(tis, tos);
                       tos.closeArchiveEntry();
                       copied++;
                   }
               }
               tos.finish();
           }
   ```
   The above is only example code, without exception handling etc., but it
does create the required tarball from a source segment.
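   
   To check how much of a segment is actually pulled from deep store, the
source stream could be wrapped in a byte counter before it is handed to the
gzip/tar readers. The sketch below is only an illustration using the JDK;
`ByteCountingStream` is a hypothetical helper name, not an existing Pinot class,
and it does not account for bytes passed over via `skip()`.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of Pinot): counts the bytes that callers
// read from the wrapped stream via read(). Bytes skipped via skip() are
// not counted in this sketch.
final class ByteCountingStream extends FilterInputStream {
  private long bytesRead;

  ByteCountingStream(InputStream in) {
    super(in);
  }

  @Override
  public int read() throws IOException {
    int b = super.read();
    if (b >= 0) {
      bytesRead++;
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n = super.read(buf, off, len);
    if (n > 0) {
      bytesRead += n;
    }
    return n;
  }

  // Total number of bytes returned to callers so far.
  long getBytesRead() {
    return bytesRead;
  }
}
```

   Wrapping the stream returned by `fileSystem.open(...)` in this class and
logging `getBytesRead()` after the copy would show how much of the segment was
read.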
   
   The real performance win comes from the two files of interest occurring in
the segment tarball before the big files (`columns.psf`, `star_tree_index`,
etc.), because reading can then stop before the bulk of the segment is
transferred. It would therefore be appropriate to enforce this ordering in the
segment writer utility code.
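   
   One way the ordering could be enforced is to sort the entry names before
they are written to the tarball. This is a minimal sketch under that
assumption; `MetadataFirstOrder` and its methods are hypothetical names, not
existing Pinot code, and only the two file names come from this issue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Pinot): orders segment entry names so the
// small metadata files are written before the large index files.
final class MetadataFirstOrder {
  // The two files needed for a metadata push.
  static final Set<String> METADATA_FILES = Set.of("creation.meta", "metadata.properties");

  // Strips any directory prefix, e.g. "seg/v3/creation.meta" -> "creation.meta".
  static String baseName(String entryName) {
    return entryName.substring(entryName.lastIndexOf('/') + 1);
  }

  // Returns a new list with metadata entries first. List.sort is stable, so
  // the relative order within each group is unchanged.
  static List<String> sort(List<String> entryNames) {
    List<String> sorted = new ArrayList<>(entryNames);
    sorted.sort(Comparator.comparingInt(
        (String name) -> METADATA_FILES.contains(baseName(name)) ? 0 : 1));
    return sorted;
  }
}
```

   A writer that adds entries in the order returned by `sort(...)` would place
both metadata files at the front of the archive regardless of how the files
were listed.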




