kkrugler opened a new issue #6492: URL: https://github.com/apache/incubator-pinot/issues/6492
Currently the code creates a tarball of the plugin directory inside the Pinot distribution directory, and then calls `job.addCacheArchive(file://<path to tarball>)`. This won't work, because all that call does is store the path in the JobConf. On the worker nodes, this path is used as the source for localizing the files, but that `file://xxx` path doesn't exist there.

The Hadoop [DistributedCache documentation](https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/filecache/DistributedCache.html) says:

> Applications specify the files, via urls (hdfs:// or http://) to be cached via the JobConf. The DistributedCache assumes that the files specified via urls are already present on the FileSystem at the path specified by the url and are accessible by every machine in the cluster.

So the `HadoopSegmentGenerationJobRunner` needs to copy the files to HDFS and set the distributed cache path to that location.

There are a few options for where to copy these files. If you use the standard Hadoop `-files xxx` command-line parameter (as an example), the Hadoop tool framework copies the file(s) to a job-specific directory inside the "staging" directory, so we could try to leverage that same location. But since Pinot already requires a staging directory to be specified in the job spec file, and that directory has to be in HDFS for a distributed job, we could instead use an explicit sub-dir within it.
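
For illustration, here is a minimal sketch of the proposed approach, not the actual `HadoopSegmentGenerationJobRunner` code. It assumes a locally-built plugins tarball and a staging directory taken from the job spec; the `"plugins"` sub-dir name and the method/variable names are illustrative assumptions.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheExample {

  public static void addPluginsToDistributedCache(Job job, Path localPluginsTarGz,
      Path stagingDir) throws Exception {
    Configuration conf = job.getConfiguration();

    // Resolve the staging directory's file system (expected to be HDFS for a
    // distributed job, per the Pinot job spec requirement).
    FileSystem stagingFs = stagingDir.getFileSystem(conf);

    // Copy the locally-built plugins tarball into an explicit sub-dir of the
    // staging directory so every node in the cluster can read it.
    Path remotePluginsTarGz =
        new Path(new Path(stagingDir, "plugins"), localPluginsTarGz.getName());
    stagingFs.copyFromLocalFile(localPluginsTarGz, remotePluginsTarGz);

    // Register the HDFS path (not the local file:// path) with the distributed
    // cache; Hadoop localizes and unpacks the archive on each task node.
    URI cacheUri = stagingFs.makeQualified(remotePluginsTarGz).toUri();
    job.addCacheArchive(cacheUri);
  }
}
```

The key difference from the current behavior is that `addCacheArchive` is given an HDFS URI that actually exists on a file system every node can reach, rather than a `file://` path that only exists on the submitting machine.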