DixitThinkbiz opened a new issue, #15285: URL: https://github.com/apache/pinot/issues/15285
## Issue: SegmentGenerationAndPushTask Fails with FileNotFoundException

### Summary

We are attempting to use S3 as deep storage for our Apache Pinot deployment. During execution of the `SegmentGenerationAndPushTask`, the task fails with a `FileNotFoundException` indicating that a segment tar file could not be found. This appears to occur when the minion attempts to push a generated segment.

### Error Log

```
2025/03/17 10:03:01.581 ERROR [TaskFactoryRegistry] [TaskStateModelFactory-task_thread-2] Caught exception while executing task: Task_SegmentGenerationAndPushTask_c870e94e-5051-4ca5-a9ca-31ab89c1118c_1742205780838_0
java.lang.RuntimeException: Failed to execute SegmentGenerationAndPushTask
	at org.apache.pinot.plugin.minion.tasks.segmentgenerationandpush.SegmentGenerationAndPushTaskExecutor.executeTask(SegmentGenerationAndPushTaskExecutor.java:128)
	...
Caused by: java.io.FileNotFoundException: /employee_attendance_OFFLINE_1742205781534_1742205781534_0_741c8aac-0020-4e69-aaa1-75e880255b68.tar.gz (No such file or directory)
```

### Environment

- **Apache Pinot Version:** 1.2.0
- **Deep Storage:** S3 (configured with S3PinotFS)
- **Kafka Topic:** `kafka_pinot_poc.public.employee_attendance.transformed`
- **Additional Info:** Using Minio for the S3 endpoint in the server configuration.

### Reproduction Steps

1. **Schema & Table Configurations:** Set up the schema and both the realtime and offline table configurations as specified in our configuration files (a sketch of how the configs can be registered follows this list).
2. **Service Setup:** Deploy the Controller, Server, and Minion using the provided configuration files (see the sections below).
3. **Task Execution:** The `SegmentGenerationAndPushTask` is triggered on its schedule but fails with a `FileNotFoundException` for the expected tar file.
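For reference, this is roughly how the schema and table configs below can be registered against the controller's REST API (`POST /schemas` and `POST /tables`). This is a minimal sketch, not necessarily the exact script we run: the controller address, module path, and helper function are illustrative.

```js
// Sketch: register the schema and table configs via the Pinot controller REST API.
// Assumes Node 18+ (global fetch), a controller at http://localhost:9000, and that the
// config objects shown below are exported from a local module (path is illustrative).
const {
  pinotSchemaAttendance,
  pinotTableConfigAttendanceRealtime,
  pinotTableConfigAttendanceOffline,
} = require("./pinot-configs");

const CONTROLLER = "http://localhost:9000";

// POST a JSON body to the controller and fail loudly on non-2xx responses.
async function post(path, body) {
  const res = await fetch(`${CONTROLLER}${path}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const text = await res.text();
  if (!res.ok) {
    throw new Error(`${path} failed: ${res.status} ${text}`);
  }
  return text;
}

(async () => {
  await post("/schemas", pinotSchemaAttendance);              // register the schema first
  await post("/tables", pinotTableConfigAttendanceRealtime);  // then the realtime table
  await post("/tables", pinotTableConfigAttendanceOffline);   // and the offline table
})().catch((err) => {
  console.error(err);
  process.exit(1);
});
```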
### Configuration Details

#### Schema Configuration

```js
const pinotSchemaAttendance = {
  schemaName: "employee_attendance",
  dimensionFieldSpecs: [
    { name: "attendance_id", dataType: "INT" },
    { name: "employee_id", dataType: "INT" },
  ],
  dateTimeFieldSpecs: [
    {
      name: "punch_time",
      dataType: "TIMESTAMP",
      format: "1:MILLISECONDS:EPOCH",
      granularity: "1:MILLISECONDS",
    },
  ],
  primaryKeyColumns: ["attendance_id"],
};
```

#### Realtime Table Configuration

```js
const pinotTableConfigAttendanceRealtime = {
  tableName: "employee_attendance_REALTIME",
  tableType: "REALTIME",
  segmentsConfig: {
    schemaName: "employee_attendance",
    replication: "1",
    retentionTimeUnit: "DAYS",
    retentionTimeValue: "15",
    replicasPerPartition: "1",
    minimizeDataMovement: false,
    timeColumnName: "punch_time",
  },
  tenants: {
    broker: "DefaultTenant",
    server: "DefaultTenant",
    tagOverrideConfig: {},
  },
  tableIndexConfig: {
    invertedIndexColumns: [],
    noDictionaryColumns: [],
    streamConfigs: {
      streamType: "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "kafka_pinot_poc.public.employee_attendance.transformed",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "192.168.1.120:9092",
      "realtime.segment.flush.threshold.rows": "10",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
    },
    loadMode: "MMAP",
    onHeapDictionaryColumns: [],
    varLengthDictionaryColumns: [],
    enableDefaultStarTree: false,
    enableDynamicStarTreeCreation: false,
    aggregateMetrics: false,
    nullHandlingEnabled: false,
    rangeIndexColumns: [],
    rangeIndexVersion: 2,
    optimizeDictionary: false,
    optimizeDictionaryForMetrics: false,
    noDictionarySizeRatioThreshold: 0.85,
    autoGeneratedInvertedIndex: false,
    createInvertedIndexDuringSegmentGeneration: false,
    sortedColumn: [],
    bloomFilterColumns: [],
  },
  metadata: {},
  quota: {},
  task: {
    taskTypeConfigsMap: {
      RealtimeToOfflineSegmentsTask: {
        bucketTimePeriod: "1h",
        bufferTimePeriod: "2h",
        mergeType: "concat",
        maxNumRecordsPerSegment: "100000",
        schedule: "0 * * * * ?",
      },
    },
  },
  routing: {},
  query: {
    timeoutMs: 60000,
  },
  ingestionConfig: {
    continueOnError: false,
    rowTimeValueCheck: false,
    segmentTimeValueCheck: true,
  },
  isDimTable: false,
};
```

#### Offline Table Configuration

```js
const pinotTableConfigAttendanceOffline = {
  tableName: "employee_attendance_OFFLINE",
  tableType: "OFFLINE",
  segmentsConfig: {
    schemaName: "employee_attendance",
    replication: "1",
    replicasPerPartition: "1",
    timeColumnName: "punch_time",
    minimizeDataMovement: false,
    segmentPushType: "APPEND",
    segmentPushFrequency: "HOURLY",
  },
  tenants: {
    broker: "DefaultTenant",
    server: "DefaultTenant",
  },
  tableIndexConfig: {
    invertedIndexColumns: [],
    noDictionaryColumns: [],
    rangeIndexColumns: [],
    rangeIndexVersion: 2,
    createInvertedIndexDuringSegmentGeneration: false,
    autoGeneratedInvertedIndex: false,
    sortedColumn: [],
    bloomFilterColumns: [],
    loadMode: "MMAP",
    onHeapDictionaryColumns: [],
    varLengthDictionaryColumns: [],
    enableDefaultStarTree: false,
    enableDynamicStarTreeCreation: false,
    aggregateMetrics: false,
    nullHandlingEnabled: false,
    optimizeDictionary: false,
    optimizeDictionaryForMetrics: false,
    noDictionarySizeRatioThreshold: 0.85,
  },
  metadata: {},
  quota: {},
  routing: {},
  query: {},
  ingestionConfig: {
    batchIngestionConfig: {
      segmentIngestionType: "APPEND",
      segmentIngestionFrequency: "DAILY",
      batchConfigMaps: [
        {
          "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
          "input.fs.prop.region": "ap-northeast-1",
          "input.fs.prop.endPoint": "https://s3.ap-northeast-1.amazonaws.com",
          "input.fs.prop.accessKey": "****",
          "input.fs.prop.secretKey": "*****",
          outputDirURI: "s3://ses-email-receiving-bucket-testing/",
          inputDirURI: "s3://ses-email-receiving-bucket-testing/",
          includeFileNamePattern: "glob:**/*.csv",
          inputFormat: "csv",
        },
      ],
    },
  },
  task: {
    taskTypeConfigsMap: {
      SegmentGenerationAndPushTask: {
        inputDirURI: "s3://bucket-name/",
        outputDirURI: "s3://bucket-name/",
        inputFormat: "csv",
        schedule: "0 */1 * * * ?",
      },
      MergeRollupTask: {
        "1hour.mergeType": "rollup",
        "1hour.bucketTimePeriod": "1h",
        "1hour.bufferTimePeriod": "3h",
        "1day.mergeType": "rollup",
        "1day.bucketTimePeriod": "1d",
        "1day.bufferTimePeriod": "1d",
        "CDR_COUNT.aggregationType": "sum",
        "DURATION.aggregationType": "sum",
        "VOLUME.aggregationType": "sum",
      },
    },
  },
  metadata: {
    customConfigs: {},
  },
};
```
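The `SegmentGenerationAndPushTask` above normally fires on its cron schedule; to reproduce the failure on demand, the controller can also be asked to schedule it immediately via its `POST /tasks/schedule` endpoint. A minimal sketch, assuming the controller runs at `http://localhost:9000` (error handling is illustrative):

```js
// Sketch: trigger SegmentGenerationAndPushTask for the offline table on demand.
// Assumes Node 18+ (global fetch) and a controller listening on http://localhost:9000.
const CONTROLLER = "http://localhost:9000";

async function scheduleSegmentGenerationAndPush() {
  const url =
    `${CONTROLLER}/tasks/schedule` +
    `?taskType=SegmentGenerationAndPushTask&tableName=employee_attendance_OFFLINE`;
  const res = await fetch(url, { method: "POST" });
  const text = await res.text();
  if (!res.ok) {
    throw new Error(`Scheduling failed: ${res.status} ${text}`);
  }
  console.log("Scheduled:", text);
}

scheduleSegmentGenerationAndPush().catch(console.error);
```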
#### Controller Configuration (`controller.conf`)

```conf
# Pinot Role
pinot.service.role=CONTROLLER

# Pinot Cluster name
pinot.cluster.name=pinot-quickstart

# Pinot Zookeeper Server
pinot.zk.server=localhost:2181

# Use hostname as Pinot Instance ID
pinot.set.instance.id.to.hostname=true

# Pinot Controller Port
controller.port=9000

controller.zk.str=pinot-zookeeper:2181
controller.vip.host=127.0.0.1
controller.vip.port=9000
controller.task.scheduler.enabled=true
controller.local.temp.dir=/var/pinot/controller/data

# Deep storage configuration
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.disableAcl=false
pinot.controller.storage.factory.s3.region=ap-northeast-1
controller.data.dir=s3://bucket-name/
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.controller.storage.factory.s3.accessKey=****
pinot.controller.storage.factory.s3.secretKey=****
```

#### Server Configuration (`server.conf`)

```conf
# Pinot Role
pinot.service.role=SERVER

# Pinot Cluster name
pinot.cluster.name=pinot-quickstart

# Pinot Zookeeper Server
pinot.zk.server=localhost:2181
pinot.set.instance.id.to.hostname=true

# Pinot Server Ports
pinot.server.netty.port=8098
pinot.server.adminapi.port=8097

# Data directories and deep storage
pinot.server.instance.dataDir=/tmp/pinot/data/server/index
pinot.server.instance.segmentTarDir=/tmp/pinot/data/server/segmentTar
pinot.server.segment.store.uri=s3://bucket-name/
pinot.server.storage.factory.s3.disableAcl=false
pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.server.storage.factory.s3.region=ap-northeast-1
pinot.server.segment.fetcher.protocols=file,http,s3
pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.server.storage.factory.s3.accessKey=****
pinot.server.storage.factory.s3.secretKey=****
pinot.server.storage.factory.s3.endpoint=http://minio:9000
```

#### Minion Configuration (`minion.conf`)

```conf
pinot.set.instance.id.to.hostname=true
pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.minion.storage.factory.s3.region=us-east-1
pinot.minion.segment.fetcher.protocols=file,http,s3
pinot.minion.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```

Please update the issue with any further observations or logs, and feel free to add details on your environment or any steps already taken to resolve the problem.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org