[GitHub] [incubator-pinot] kkrugler commented on issue #6492: HadoopSegmentGenerationJobRunner isn't setting up Hadoop distributed cache correctly

GitBox Fri, 29 Jan 2021 14:06:44 -0800


kkrugler commented on issue #6492:
URL: https://github.com/apache/incubator-pinot/issues/6492#issuecomment-770075892



   While digging into the code, I found a few more issues that I'm fixing in 
the same PR:
   
   - Mapper was creating its temp plugin directory in the current working 
directory (probably only an issue during tests).
   - Mapper wasn't registering Pinot file systems before starting segment 
generation.
   - Mapper was incorrectly using the `overwriteOutput` flag when writing 
segments to staging dir.
   - Job runner wasn't clearing out the staging directory at the start of 
execution (a partial directory could be left behind by a failed run, which 
would cause the next attempt to fail because the output sub-dir already exists).
   - Job runner wasn't adding the scheme to input file paths before writing out 
to the (temp) Hadoop input files.
   - Job runner was setting the job jar class to itself, but this class is 
inside the Hadoop batch ingest plugin, which meant the "pinot-all" jar 
wasn't being distributed to the Hadoop worker nodes.
   - Job runner wasn't disabling speculative execution, which could cause the 
job to fail due to two mappers writing to the same output file in the staging 
directory.
   - Job runner was using the (very slow) copy command, rather than the move 
command, when updating the real output directory with the generated segments 
from the staging directory. It also wasn't honoring the `overwriteOutput` flag 
during the copy.
   - Job runner was adding the plugin tarball as an archive, which meant the 
Hadoop distributed cache system was trying to unpack it while the mapper code 
was also trying to expand it. I changed it to add the tarball as a plain file 
and left the mapper code as-is, though it might be cleaner to rely on Hadoop's 
unpacking instead.
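
   To make three of the job runner fixes above concrete (clearing the staging 
directory, moving instead of copying while honoring `overwriteOutput`, and 
adding a scheme to input paths), here is a minimal sketch. It is not the code 
from the PR: it uses plain `java.nio.file` and `java.net.URI` as a stand-in for 
whatever file system abstraction the job runner really uses, and the class name 
`StagingDirSketch` and all method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical illustration only; names are not taken from the Pinot code base.
public class StagingDirSketch {

    /** Remove anything a failed run left behind, then recreate an empty staging dir. */
    static void resetStagingDir(Path stagingDir) throws IOException {
        if (Files.exists(stagingDir)) {
            try (Stream<Path> walk = Files.walk(stagingDir)) {
                // Reverse order visits children before their parent directories.
                walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
        Files.createDirectories(stagingDir);
    }

    /**
     * Move (not copy) one staged segment file into the output directory.
     * Returns false and leaves both files untouched when the target already
     * exists and overwriteOutput is false.
     */
    static boolean moveSegment(Path staged, Path outputDir, boolean overwriteOutput)
            throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(staged.getFileName());
        if (Files.exists(target) && !overwriteOutput) {
            return false;
        }
        Files.move(staged, target, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }

    /** Add a scheme to an absolute path that has none; leave other URIs unchanged. */
    static URI withScheme(URI path, String defaultScheme) throws URISyntaxException {
        if (path.getScheme() != null) {
            return path;
        }
        return new URI(defaultScheme, path.getAuthority(), path.getPath(),
                path.getQuery(), path.getFragment());
    }
}
```

   In this sketch a move is preferred because on most file systems it is a 
rename rather than a byte-for-byte copy, and the overwrite check is done 
explicitly so the flag's effect is visible in one place.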
   
   I also added a test for standalone batch ingestion, as that's now sharing 
some code with the Hadoop batch ingestion, and it didn't seem to have any 
current test.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



