mneedham opened a new pull request #8337:
URL: https://github.com/apache/pinot/pull/8337


   At the moment, if you try to batch import files that have the same name but are in different directories, e.g.
   
   ```
   input/2009/movies.csv
   input/2010/movies.csv
   ```
   
   You'll get the following exception:
   
   ```
   2022/03/11 10:22:18.046 ERROR [SegmentGenerationJobRunner] [pool-2-thread-1] Failed to generate Pinot segment for file - file:/home/markhneedham/projects/pinot-recipes/recipes/import-data-files-different-directories/input/2000_2009/movies.csv
   java.lang.IllegalStateException: Input path {} does not exist. [/tmp/pinot-4214a741-1111-4b31-b7f2-452833954e6a/input/movies.csv]
        at com.google.common.base.Preconditions.checkState(Preconditions.java:518) ~[guava-20.0.jar:?]
        at org.apache.pinot.segment.spi.creator.SegmentGeneratorConfig.setInputFilePath(SegmentGeneratorConfig.java:420) ~[classes/:?]
        at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:112) ~[classes/:?]
        at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:266) ~[classes/:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
   2022/03/11 10:22:18.055 ERROR [SegmentGenerationJobRunner] [pool-2-thread-1] Failed to generate Pinot segment for file - file:/home/markhneedham/projects/pinot-recipes/recipes/import-data-files-different-directories/input/2010_2019/movies.csv
   java.lang.IllegalStateException: Input path {} does not exist. [/tmp/pinot-4214a741-1111-4b31-b7f2-452833954e6a/input/movies.csv]
        at com.google.common.base.Preconditions.checkState(Preconditions.java:518) ~[guava-20.0.jar:?]
        at org.apache.pinot.segment.spi.creator.SegmentGeneratorConfig.setInputFilePath(SegmentGeneratorConfig.java:420) ~[classes/:?]
        at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:112) ~[classes/:?]
        at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:266) ~[classes/:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
   ```
   
   When the CSV files are copied locally, each one is written to `/tmpdir/filename`. If several input files share the same name, they all map to the same local path and collide. Here the collision shows up as the local file having been deleted before it is read, but I guess one file could also be overwritten by another.
   
   So this PR tries to fix the problem by extracting the input files to:
   
   ```
   /tmpdir/-input-2009/movies.csv
   /tmpdir/-input-2010/movies.csv
   ```
   
   Instead of having them both extracted to:
   
   ```
   /tmpdir/movies.csv
   ```
   
   Maybe there's a better way to generate the directory name under `/tmpdir`? 
Happy to change if so!
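
For discussion, here is a minimal sketch of the naming scheme described above. It is not the code in this PR: the class `LocalInputPathSketch`, the method `localInputFile`, and the `baseUri` parameter are hypothetical names chosen for illustration. It assumes the input file URI sits under `baseUri`, takes the file's parent directory relative to that base, and replaces each `/` with `-` to build the directory name under the temp dir.

```java
import java.io.File;
import java.net.URI;

// Hypothetical illustration only; not the implementation in this PR.
class LocalInputPathSketch {

  // Maps an input file URI to a local file under tmpDir, placing it in a
  // directory named after the file's parent path relative to baseUri.
  static File localInputFile(File tmpDir, URI baseUri, URI inputFileUri) {
    // e.g. "input/2009/movies.csv" when inputFileUri is under baseUri
    String relative = baseUri.relativize(inputFileUri).getPath();

    int lastSlash = relative.lastIndexOf('/');
    String fileName = relative.substring(lastSlash + 1);
    String parent = lastSlash < 0 ? "" : relative.substring(0, lastSlash);

    // A file directly under the base has no parent part; keep it at the top level.
    if (parent.isEmpty()) {
      return new File(tmpDir, fileName);
    }

    // "input/2009" becomes "-input-2009"
    String dirName = "-" + parent.replace('/', '-');
    return new File(new File(tmpDir, dirName), fileName);
  }

  public static void main(String[] args) {
    File tmpDir = new File("/tmpdir");
    URI base = URI.create("file:/data/");
    System.out.println(localInputFile(tmpDir, base, URI.create("file:/data/input/2009/movies.csv")));
    System.out.println(localInputFile(tmpDir, base, URI.create("file:/data/input/2010/movies.csv")));
  }
}
```

One caveat with this scheme: replacing `/` with `-` is not injective, so `a-b/c/movies.csv` and `a/b-c/movies.csv` would still map to the same directory. Mirroring the relative directory structure under the temp dir would avoid that, if that is preferable.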


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


