kkrugler opened a new issue #6492: URL: https://github.com/apache/incubator-pinot/issues/6492
Currently the code creates a tarball of the plugin directory inside the Pinot distribution directory, and then calls `job.addCacheArchive(file://<path to tarball>)`. This won't work, because all that call does is store the path in the JobConf. On the worker nodes, this path is used as the source for localizing the files, but that `file://xxx` path doesn't exist there.

The Hadoop [DistributedCache documentation](https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/filecache/DistributedCache.html) says:

> Applications specify the files, via urls (hdfs:// or http://) to be cached via the JobConf. The DistributedCache assumes that the files specified via urls are already present on the FileSystem at the path specified by the url and are accessible by every machine in the cluster.

So the `HadoopSegmentGenerationJobRunner` needs to copy the files to HDFS and set the distributed cache path to that location.

There are a few options for where to copy these files. If you use the standard Hadoop `-files xxx` command-line parameter (as an example), the Hadoop tool framework copies the file(s) to a job-specific directory inside the "staging" directory, so we could try to leverage that same location. But since Pinot already requires a staging directory to be specified in the job spec file, and that directory has to be in HDFS for a distributed job, we could instead use an explicit sub-dir within it.
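
For illustration, here is a minimal sketch of the proposed approach, not the actual `HadoopSegmentGenerationJobRunner` code. It assumes a locally-built plugins tarball and a staging directory taken from the job spec; the `"plugins"` sub-dir name and the method/variable names are illustrative assumptions.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheExample {

  public static void addPluginsToDistributedCache(Job job, Path localPluginsTarGz,
      Path stagingDir) throws Exception {
    Configuration conf = job.getConfiguration();

    // Resolve the staging directory's file system (expected to be HDFS for a
    // distributed job, per the Pinot job spec requirement).
    FileSystem stagingFs = stagingDir.getFileSystem(conf);

    // Copy the locally-built plugins tarball into an explicit sub-dir of the
    // staging directory so every node in the cluster can read it.
    Path remotePluginsTarGz =
        new Path(new Path(stagingDir, "plugins"), localPluginsTarGz.getName());
    stagingFs.copyFromLocalFile(localPluginsTarGz, remotePluginsTarGz);

    // Register the HDFS path (not the local file:// path) with the distributed
    // cache; Hadoop localizes and unpacks the archive on each task node.
    URI cacheUri = stagingFs.makeQualified(remotePluginsTarGz).toUri();
    job.addCacheArchive(cacheUri);
  }
}
```

The key difference from the current behavior is that `addCacheArchive` is given an HDFS URI that actually exists on a file system every node can reach, rather than a `file://` path that only exists on the submitting machine.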