ksnijjer opened a new issue #7328:
URL: https://github.com/apache/pinot/issues/7328


   As part of an offline ingestion job's YAML config, you need to define an 
**outputDir**, which in theory can be any S3 bucket location that the compute 
cluster (running the ingestion job) has access to. When we do a segment 
metadata push (e.g. `jobType: SegmentCreationAndMetadataPush`) as part of the 
job execution, we list all files in this output directory, download each 
segment tar.gz, and extract its metadata. This metadata and other related 
information is then sent to the controller, which updates the ZK metadata for 
the table/segment and subsequently pushes the segment download URI to the 
Pinot servers, which download these files to local disk. 
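   
   To make the listing and extraction step concrete, here is a minimal sketch 
that uses a local directory as a stand-in for the S3 outputDir. The function 
names are hypothetical, and the metadata file name `metadata.properties` and 
its key=value layout are assumptions made for illustration; this is not the 
actual Pinot push-job code.

```python
import tarfile
from pathlib import Path
from typing import Dict, List


def list_segment_tars(output_dir: Path) -> List[Path]:
    """Return every *.tar.gz file found under output_dir, sorted by path."""
    return sorted(output_dir.rglob("*.tar.gz"))


def read_segment_metadata(
    tar_path: Path, metadata_name: str = "metadata.properties"
) -> Dict[str, str]:
    """Read key=value pairs from the metadata file inside one tarball.

    metadata_name is an assumed file name; returns {} if no such file exists.
    """
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and Path(member.name).name == metadata_name:
                text = tar.extractfile(member).read().decode("utf-8")
                result = {}
                for line in text.splitlines():
                    # Skip comments and lines that are not key=value pairs.
                    if line.startswith("#") or "=" not in line:
                        continue
                    key, value = line.split("=", 1)
                    result[key.strip()] = value.strip()
                return result
    return {}
```

   Only the extracted metadata and the download URI go to the controller in 
this flow; the tarball itself stays in outputDir, which is what the issues 
below are about.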
   
   There are a few issues with the current design:
   
   1) Since the outputDir path can be a different location, it can bypass the 
controller data dir path configured for deep storage.
   2) Currently no data is copied from outputDir to the controller dataDir. If 
the output bucket is purged or deleted for any reason, **Pinot servers will 
have data loss**.
   3) Additionally, the user/Pinot admin needs to ensure that the Pinot cluster 
has access to the S3 output bucket. In environments where e.g. an IAM role/user 
is used by Pinot servers to access S3 (or any other object store), there is a 
strong possibility that access is limited to specific locations/buckets only.
   
   Pros of the current design:
   - Allows clean separation between ETL output and the controller data dir, 
preventing issues such as partial data generated by job failures from being 
propagated to end-user queries.
   
   Can we have a background task or process on the controller side that 
automatically triggers a copy of the data to the deep store when the segment 
download URI differs from the configured deep store URI? That would address 
issues 1 and 2 while retaining the existing separation between the ingestion 
job and data storage.
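   
   A rough sketch of the check such a background task might run is below. The 
function names and the example bucket names are hypothetical, and it assumes 
that "different from the deep store" means a different scheme, a different 
bucket, or a path outside the configured data dir prefix.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def is_under_deep_store(download_uri: str, deep_store_uri: str) -> bool:
    """Return True if download_uri lives inside the configured deep store."""
    seg = urlparse(download_uri)
    root = urlparse(deep_store_uri)
    # A different scheme or bucket means the segment is outside the deep store.
    if (seg.scheme, seg.netloc) != (root.scheme, root.netloc):
        return False
    # Compare on a trailing slash so "/data-old" does not match "/data".
    root_path = root.path.rstrip("/") + "/"
    return seg.path.startswith(root_path)


def segments_to_copy(download_uris: Iterable[str], deep_store_uri: str) -> List[str]:
    """Return the download URIs a copy task would need to move to deep store."""
    return [u for u in download_uris if not is_under_deep_store(u, deep_store_uri)]
```

   For example, with a deep store at `s3://pinot-deep/controller-data`, a 
segment at `s3://etl-out/myTable/seg_0.tar.gz` would be selected for copying, 
while one already under `s3://pinot-deep/controller-data/` would be left alone.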
   
   

