ksnijjer opened a new issue #7328: URL: https://github.com/apache/pinot/issues/7328
As part of an offline ingestion job's YAML config, you need to define an **outputDir**, which in theory can be any S3 bucket location that the compute cluster running the ingestion job has access to. When we do a segment metadata push (e.g. `jobType: SegmentCreationAndMetadataPush`) as part of the job execution, we list all files in this output directory, download each segment tar.gz, and extract its metadata. This metadata and other related information is then sent to the controller, which updates the ZK metadata for the table/segment and subsequently pushes the segment download URI to the Pinot servers, which actually download these files to local disk.

There are a few issues with the current design:

1) Since the outputDir path can be a different location, it can bypass the controller data dir path configured for deep storage.
2) Currently no data is copied from outputDir to the controller dataDir. If for any reason the output data bucket is purged or deleted, **Pinot servers will have data loss**.
3) Additionally, the user/Pinot admin needs to ensure that the Pinot cluster has access to the S3 output bucket. In environments where, for example, an IAM role/user is used by the Pinot server to access S3 (or any other object store), there is a strong possibility that access is limited to specific locations/buckets only.

Pros of the current design:

- Allows clean separation between ETL output and the controller data dir, preventing issues such as partial data generated by job failures from being propagated to end-user queries.

Can we have a background task or some processing on the controller side that automatically triggers a copy of the data to the deep store when the segment download URI differs from the configured deep store URI? That would address 1 and 2 and also retain the existing separation between the ingestion job and data storage.
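
To make the proposed check concrete, here is a minimal sketch in Python of how a background task might decide which segments need copying. This is not Pinot code: the function names, bucket names, and paths are all hypothetical, and it assumes that "differs from the deep store URI" means "the download URI does not sit under the configured deep store location".

```python
from typing import Dict, List
from urllib.parse import urlparse


def is_under_deep_store(download_uri: str, deep_store_uri: str) -> bool:
    """Return True if download_uri points inside the deep store location.

    Hypothetical helper for illustration; not part of Pinot.
    """
    seg = urlparse(download_uri)
    store = urlparse(deep_store_uri)
    # Scheme and bucket (netloc) must match exactly.
    if (seg.scheme, seg.netloc) != (store.scheme, store.netloc):
        return False
    # Compare on path-segment boundaries so /data does not match /database.
    prefix = store.path.rstrip("/") + "/"
    return seg.path.startswith(prefix)


def segments_to_copy(download_uris: Dict[str, str], deep_store_uri: str) -> List[str]:
    """Return the names of segments whose download URI lies outside the deep store."""
    return sorted(
        name
        for name, uri in download_uris.items()
        if not is_under_deep_store(uri, deep_store_uri)
    )


if __name__ == "__main__":
    # Hypothetical bucket names and paths, for illustration only.
    deep_store = "s3://pinot-deep-store/controller-data"
    uris = {
        "seg_a": "s3://pinot-deep-store/controller-data/myTable/seg_a.tar.gz",
        "seg_b": "s3://etl-output/jobs/run1/seg_b.tar.gz",
    }
    print(segments_to_copy(uris, deep_store))
```

A real task would also have to copy the file and then update the download URI in the segment's ZK metadata. Those steps are left out here because they depend on controller internals.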