lgo opened a new issue #5877: URL: https://github.com/apache/incubator-pinot/issues/5877
Here's some of the setup:

```
# pinot controller properties.
# Requires `-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3`
#
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
# Any S3 region
pinot.controller.storage.factory.s3.region=us-west-1
# Data directory for Pinot.
controller.data.dir=s3://mybucket/myfolder/pinot
```

When using an ingestion spec like the following:

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndUriPush
inputDirURI: ...
outputDirURI: 's3://mybucket/myfolder/pinot'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-west-2'
pushJobSpec:
  # NB: This is particularly weird. Specifically, this seems
  # to be the "adjusted path" that is provided to the controller. I assume
  # that is because the ingestion job URI may not be the same for a
  # Controller?
  segmentUriPrefix: 's3://'
  segmentUriSuffix: ''
recordReaderSpec:
  # Dataset specific config.
tableSpec:
  # Table specific config.
pinotClusterSpecs:
  # Cluster specific config.
```

When using the standalone ingestion job via `bin/pinot-ingestion-job.sh`:

* Segment generation is fine.
* Data shows up on S3 as expected, and the log line in `S3PinotFS` for `Copy` has the correct path, but the one in `SegmentPushUtils` does not, and the `SegmentUriPushJobRunner` fails with a 500 from the controller because the path is not found.
```
2020/08/17 14:45:38.719 INFO [S3PinotFS] [main] Copy /tmp/pinot-a4064eea-301d-4f24-8861-0575a73e6a0b/output/mytable_OFFLINE_1569293930_1569293987_0.tar.gz from local to s3://mybucket/myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz
2020/08/17 14:45:38.794 INFO [IngestionJobLauncher] [main] Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner
2020/08/17 14:45:38.795 INFO [PinotFSFactory] [main] Initializing PinotFS for scheme s3, classname org.apache.pinot.plugin.filesystem.S3PinotFS
2020/08/17 14:45:38.920 INFO [SegmentPushUtils] [main] Start sending table mytable segment URIs: [s3:///myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz] to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@4e07b95f]
2020/08/17 14:45:38.920 INFO [SegmentPushUtils] [main] Sending table mytable segment URI: s3:///myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz to location
```

Note that the bucket name (`mybucket`) is missing from the URI that `SegmentPushUtils` logs. I suspect it's related to how the output path is constructed before `SegmentPushUtils.sendSegmentUris`, but have not confirmed it.

https://github.com/apache/incubator-pinot/blob/2b58bfb520df074f691277f2ae5b01ecb5c686c2/pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-standalone/src/main/java/org/apache/pinot/plugin/ingestion/batch/standalone/SegmentUriPushJobRunner.java#L90-L91

It is also not clear whether the same issue would happen with the Hadoop/Spark SegmentUri push jobs.
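
To make the suspected cause concrete, here is a minimal sketch. It assumes, without having confirmed it against the Pinot source, that the pushed URI is assembled as `segmentUriPrefix` + the path component of the segment URI + `segmentUriSuffix`. The class `SegmentUriSketch` and the method `buildPushUri` are hypothetical names used only for illustration; the only real API involved is `java.net.URI`, which treats the S3 bucket as the URI authority rather than as part of the path.

```java
import java.net.URI;

// Hypothetical helper, NOT Pinot's actual code: it only illustrates the
// assumed "prefix + path + suffix" assembly of the pushed segment URI.
final class SegmentUriSketch {

    // Assumption: only the path component of the segment URI is kept.
    static String buildPushUri(String prefix, URI segmentUri, String suffix) {
        return prefix + segmentUri.getPath() + suffix;
    }

    public static void main(String[] args) {
        URI segment = URI.create(
            "s3://mybucket/myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz");

        // java.net.URI parses the bucket as the authority, so it is not in the path.
        System.out.println(segment.getAuthority()); // mybucket
        System.out.println(segment.getPath());      // /myfolder/pinot/mytable_OFFLINE_...tar.gz

        // With segmentUriPrefix 's3://' the bucket is dropped, as in the SegmentPushUtils log.
        System.out.println(buildPushUri("s3://", segment, ""));

        // Untested idea: carrying the bucket in the prefix would restore the full URI.
        System.out.println(buildPushUri("s3://mybucket", segment, ""));
    }
}
```

Under that assumption, the first `buildPushUri` call prints `s3:///myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz`, which matches the URI in the log above, and the second prints the full `s3://mybucket/...` URI. If the real code behaves this way, setting `segmentUriPrefix: 's3://mybucket'` might work around the problem, but that has not been tried here.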