DixitThinkbiz opened a new issue, #15285:
URL: https://github.com/apache/pinot/issues/15285

   ## Issue: SegmentGenerationAndPushTask Fails with FileNotFoundException
   
   ### Summary
   We are using S3 as deep storage for our Apache Pinot deployment. When the `SegmentGenerationAndPushTask` executes, it fails with a `FileNotFoundException` indicating that a generated segment tar file could not be found. The failure occurs when the minion attempts to push the generated segment.
   
   ### Error Log
   ```
   2025/03/17 10:03:01.581 ERROR [TaskFactoryRegistry] [TaskStateModelFactory-task_thread-2] Caught exception while executing task: Task_SegmentGenerationAndPushTask_c870e94e-5051-4ca5-a9ca-31ab89c1118c_1742205780838_0
   java.lang.RuntimeException: Failed to execute SegmentGenerationAndPushTask
       at org.apache.pinot.plugin.minion.tasks.segmentgenerationandpush.SegmentGenerationAndPushTaskExecutor.executeTask(SegmentGenerationAndPushTaskExecutor.java:128)
       ...
   Caused by: java.io.FileNotFoundException: /employee_attendance_OFFLINE_1742205781534_1742205781534_0_741c8aac-0020-4e69-aaa1-75e880255b68.tar.gz (No such file or directory)
   ```
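   
   For context, the missing file appears to follow the segment tar naming pattern `<tableNameWithType>_<minTime>_<maxTime>_<sequenceId>_<uuid>.tar.gz`. The helper below is purely illustrative (a hypothetical function, not Pinot code) and just pulls those apparent components out of the path shown in the log:
   
   ```js
   // Hypothetical helper for illustration only -- splits the tar file name
   // from the error log into its apparent components.
   function parseSegmentTarName(path) {
     const name = path.split("/").pop();
     const m = name.match(/^(.+)_(\d+)_(\d+)_(\d+)_([0-9a-f-]+)\.tar\.gz$/);
     if (m === null) return null;
     return {
       tableNameWithType: m[1], // e.g. "employee_attendance_OFFLINE"
       minTime: Number(m[2]),   // segment start time (epoch millis)
       maxTime: Number(m[3]),   // segment end time (epoch millis)
       sequenceId: Number(m[4]),
       uuid: m[5],
     };
   }
   ```
   
   Note that the path in the exception is rooted at `/`, i.e. the minion looked for the tar file at the filesystem root rather than under a task working directory.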
   
   ### Environment
   - **Apache Pinot Version:** 1.2.0
   - **Deep Storage:** S3 (configured with S3PinotFS)
   - **Kafka Topic:** `kafka_pinot_poc.public.employee_attendance.transformed`
   - **Additional Info:** Minio is used as the S3 endpoint in the server configuration.
   
   ### Reproduction Steps
   1. **Schema & Table Configurations:**  
      Set up the schema and both the realtime and offline table configurations as specified in our configuration files.
   2. **Service Setup:**  
      Deploy the Controller, Server, and Minion using the provided configuration files (see the sections below).
   3. **Task Execution:**  
      The `SegmentGenerationAndPushTask` is triggered on schedule but fails with a `FileNotFoundException` for the expected tar file.
   
   ### Configuration Details
   
   #### Schema Configuration
   ```js
   const pinotSchemaAttendance = {
     schemaName: "employee_attendance",
     dimensionFieldSpecs: [
       { name: "attendance_id", dataType: "INT" },
       { name: "employee_id", dataType: "INT" },
     ],
     dateTimeFieldSpecs: [
       {
         name: "punch_time",
         dataType: "TIMESTAMP",
         format: "1:MILLISECONDS:EPOCH",
         granularity: "1:MILLISECONDS",
       },
     ],
     primaryKeyColumns: ["attendance_id"],
   };
   ```
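   
   For reference, an ingested message matching this schema would look like the following (illustrative values only; `punch_time` must be epoch milliseconds per the `1:MILLISECONDS:EPOCH` format):
   
   ```js
   // Illustrative sample record only -- all values are made up.
   const sampleAttendanceRecord = {
     attendance_id: 101,                              // INT dimension
     employee_id: 7,                                  // INT dimension
     punch_time: Date.parse("2025-03-17T10:00:00Z"),  // TIMESTAMP as epoch millis
   };
   ```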
   
   #### Realtime Table Configuration
   ```js
   const pinotTableConfigAttendanceRealtime = {
     tableName: "employee_attendance_REALTIME",
     tableType: "REALTIME",
     segmentsConfig: {
       schemaName: "employee_attendance",
       replication: "1",
       retentionTimeUnit: "DAYS",
       retentionTimeValue: "15",
       replicasPerPartition: "1",
       minimizeDataMovement: false,
       timeColumnName: "punch_time",
     },
     tenants: {
       broker: "DefaultTenant",
       server: "DefaultTenant",
       tagOverrideConfig: {},
     },
     tableIndexConfig: {
       invertedIndexColumns: [],
       noDictionaryColumns: [],
       streamConfigs: {
         streamType: "kafka",
         "stream.kafka.consumer.type": "lowlevel",
         "stream.kafka.topic.name": "kafka_pinot_poc.public.employee_attendance.transformed",
         "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
         "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
         "stream.kafka.broker.list": "192.168.1.120:9092",
         "realtime.segment.flush.threshold.rows": "10",
         "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
       },
       loadMode: "MMAP",
       onHeapDictionaryColumns: [],
       varLengthDictionaryColumns: [],
       enableDefaultStarTree: false,
       enableDynamicStarTreeCreation: false,
       aggregateMetrics: false,
       nullHandlingEnabled: false,
       rangeIndexColumns: [],
       rangeIndexVersion: 2,
       optimizeDictionary: false,
       optimizeDictionaryForMetrics: false,
       noDictionarySizeRatioThreshold: 0.85,
       autoGeneratedInvertedIndex: false,
       createInvertedIndexDuringSegmentGeneration: false,
       sortedColumn: [],
       bloomFilterColumns: [],
     },
     metadata: {},
     quota: {},
     task: {
       taskTypeConfigsMap: {
         RealtimeToOfflineSegmentsTask: {
           bucketTimePeriod: "1h",
           bufferTimePeriod: "2h",
           mergeType: "concat",
           maxNumRecordsPerSegment: "100000",
           schedule: "0 * * * * ?",
         },
       },
     },
     routing: {},
     query: {
       timeoutMs: 60000,
     },
     ingestionConfig: {
       continueOnError: false,
       rowTimeValueCheck: false,
       segmentTimeValueCheck: true,
     },
     isDimTable: false,
   };
   ```
   
   #### Offline Table Configuration
   ```js
   const pinotTableConfigAttendanceOffline = {
     tableName: "employee_attendance_OFFLINE",
     tableType: "OFFLINE",
     segmentsConfig: {
       schemaName: "employee_attendance",
       replication: "1",
       replicasPerPartition: "1",
       timeColumnName: "punch_time",
       minimizeDataMovement: false,
       segmentPushType: "APPEND",
       segmentPushFrequency: "HOURLY",
     },
     tenants: {
       broker: "DefaultTenant",
       server: "DefaultTenant",
     },
     tableIndexConfig: {
       invertedIndexColumns: [],
       noDictionaryColumns: [],
       rangeIndexColumns: [],
       rangeIndexVersion: 2,
       createInvertedIndexDuringSegmentGeneration: false,
       autoGeneratedInvertedIndex: false,
       sortedColumn: [],
       bloomFilterColumns: [],
       loadMode: "MMAP",
       onHeapDictionaryColumns: [],
       varLengthDictionaryColumns: [],
       enableDefaultStarTree: false,
       enableDynamicStarTreeCreation: false,
       aggregateMetrics: false,
       nullHandlingEnabled: false,
       optimizeDictionary: false,
       optimizeDictionaryForMetrics: false,
       noDictionarySizeRatioThreshold: 0.85,
     },
     metadata: {},
     quota: {},
     routing: {},
     query: {},
     ingestionConfig: {
       batchIngestionConfig: {
         segmentIngestionType: "APPEND",
         segmentIngestionFrequency: "DAILY",
         batchConfigMaps: [
           {
             "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
             "input.fs.prop.region": "ap-northeast-1",
             "input.fs.prop.endPoint": "https://s3.ap-northeast-1.amazonaws.com",
             "input.fs.prop.accessKey": "****",
             "input.fs.prop.secretKey": "*****",
             outputDirURI: "s3://ses-email-receiving-bucket-testing/",
             inputDirURI: "s3://ses-email-receiving-bucket-testing/",
             includeFileNamePattern: "glob:**/*.csv",
             inputFormat: "csv",
           },
         ],
       },
     },
     task: {
       taskTypeConfigsMap: {
         SegmentGenerationAndPushTask: {
           inputDirURI: "s3://bucket-name/",
           outputDirURI: "s3://bucket-name/",
           inputFormat: "csv",
           schedule: "0 */1 * * * ?",
         },
         MergeRollupTask: {
           "1hour.mergeType": "rollup",
           "1hour.bucketTimePeriod": "1h",
           "1hour.bufferTimePeriod": "3h",
           "1day.mergeType": "rollup",
           "1day.bucketTimePeriod": "1d",
           "1day.bufferTimePeriod": "1d",
           "CDR_COUNT.aggregationType": "sum",
           "DURATION.aggregationType": "sum",
           "VOLUME.aggregationType": "sum",
         },
       },
     },
     metadata: {
       customConfigs: {},
     },
   };
   ```
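   
   As a side note on the ingestion config above, `includeFileNamePattern: "glob:**/*.csv"` uses Java-NIO-style glob syntax, where `**` matches across directory separators and `*` stays within one path segment. The converter below is a naive approximation written only to illustrate that behavior (it is an assumption, not Pinot's actual matcher):
   
   ```js
   // Naive glob-to-regex converter for illustration only; it does not cover
   // every Java PathMatcher rule (e.g. "?", "{a,b}", character classes).
   function globToRegExp(glob) {
     const pattern = glob
       .replace(/^glob:/, "")       // drop the "glob:" scheme prefix
       .replace(/\./g, "\\.")       // escape literal dots
       .replace(/\*\*/g, "\u0000")  // protect "**" with a placeholder
       .replace(/\*/g, "[^/]*")     // "*" stays within one path segment
       .replace(/\u0000/g, ".*");   // "**" crosses path segments
     return new RegExp(`^${pattern}$`);
   }

   const matcher = globToRegExp("glob:**/*.csv");
   // matcher.test("2025/03/17/attendance.csv")  → true
   // matcher.test("2025/03/17/attendance.json") → false
   ```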
   
   #### Controller Configuration (`controller.conf`)
   ```conf
   # Pinot Role
   pinot.service.role=CONTROLLER
   
   # Pinot Cluster name
   pinot.cluster.name=pinot-quickstart
   
   # Pinot Zookeeper Server
   pinot.zk.server=localhost:2181
   
   # Use hostname as Pinot Instance ID
   pinot.set.instance.id.to.hostname=true
   
   # Pinot Controller Port
   controller.port=9000
   controller.zk.str=pinot-zookeeper:2181
   controller.vip.host=127.0.0.1
   controller.vip.port=9000
   
   controller.task.scheduler.enabled=true
   controller.local.temp.dir=/var/pinot/controller/data
   
   # Deep storage configuration
   pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
   pinot.controller.storage.factory.s3.disableAcl=false
   pinot.controller.storage.factory.s3.region=ap-northeast-1
   controller.data.dir=s3://bucket-name/
   
   pinot.controller.segment.fetcher.protocols=file,http,s3
   pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
   
   pinot.controller.storage.factory.s3.accessKey=****
   pinot.controller.storage.factory.s3.secretKey=****
   ```
   
   #### Server Configuration (`server.conf`)
   ```conf
   # Pinot Role
   pinot.service.role=SERVER
   
   # Pinot Cluster name
   pinot.cluster.name=pinot-quickstart
   
   # Pinot Zookeeper Server
   pinot.zk.server=localhost:2181
   pinot.set.instance.id.to.hostname=true
   
   # Pinot Server Ports
   pinot.server.netty.port=8098
   pinot.server.adminapi.port=8097
   
   # Data directories and deep storage
   pinot.server.instance.dataDir=/tmp/pinot/data/server/index
   pinot.server.instance.segmentTarDir=/tmp/pinot/data/server/segmentTar
   pinot.server.segment.store.uri=s3://bucket-name/
   pinot.server.storage.factory.s3.disableAcl=false
   pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
   pinot.server.storage.factory.s3.region=ap-northeast-1
   pinot.server.segment.fetcher.protocols=file,http,s3
   pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
   pinot.server.storage.factory.s3.accessKey=****
   pinot.server.storage.factory.s3.secretKey=****
   pinot.server.storage.factory.s3.endpoint=http://minio:9000
   ```
   
   #### Minion Configuration (`minion.conf`)
   ```conf
   pinot.set.instance.id.to.hostname=true
   pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
   pinot.minion.storage.factory.s3.region=us-east-1
   pinot.minion.segment.fetcher.protocols=file,http,s3
   pinot.minion.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
   ```
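   
   One observation while preparing this report: unlike `server.conf`, the `minion.conf` above sets no S3 endpoint or credentials, and its region (`us-east-1`) differs from the one used elsewhere (`ap-northeast-1`). If the minion is expected to reach the same Minio-backed bucket, properties along these lines may be required. This is only a sketch mirroring the server settings above; the values are assumptions, not a verified fix:
   
   ```conf
   # Sketch: aligning minion S3 settings with server.conf (assumed values)
   pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
   pinot.minion.storage.factory.s3.region=ap-northeast-1
   pinot.minion.storage.factory.s3.endpoint=http://minio:9000
   pinot.minion.storage.factory.s3.accessKey=****
   pinot.minion.storage.factory.s3.secretKey=****
   pinot.minion.segment.fetcher.protocols=file,http,s3
   pinot.minion.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
   ```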
   
   Any guidance on the root cause or on additional debugging steps would be appreciated; we are happy to provide further logs or environment details on request.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

