[
https://issues.apache.org/jira/browse/PIG-5177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904852#comment-15904852
]
Adam Szita commented on PIG-5177:
---------------------------------
[~kellyzly]
Basically, the issue is caused by the backend not being able to find the script
file (e.g. in _ScriptEngine#getScriptAsStream_).
1. This is only an issue in yarn-client mode; in local mode it works because
the script file is available in the local FS at its original location.
2. Script files have to be carried along to (backend) executor nodes. This is
done differently in MR/Tez vs Spark mode.
In all cases the script file paths are available in
pigContext().getScriptFiles() (after they were registered on the frontend). In
MR/Tez modes _JarManager#createPigScriptUDFJar(PigContext)_ creates a jar
file and puts the script files into it. This jar is distributed among the
backend nodes, and upon job execution the script files are accessed with a
ClassLoader (e.g. here:
https://github.com/apache/pig/blob/spark/src/org/apache/pig/scripting/ScriptEngine.java#L146).
In Spark we use _LoadConverter#registerUdfFiles_ on the frontend and let Spark
do the job of distributing the script files to executor nodes. Later, on the
backend, an executor can retrieve the path of the script file using
SparkFiles.get(originalFileName). This points to the file in the executor's
container, and we can use it to open a FileInputStream on the file.
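As a rough illustration of the two lookup paths described above (not the code
in the patch), the backend-side resolution could be sketched as follows. The
class name ScriptLookupSketch, the method openScript and the injected resolver
are hypothetical; in real Spark code the resolver would presumably be
SparkFiles::get, but it is passed in here so the sketch runs without Spark on
the classpath. It also assumes the file was distributed under its bare name.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Function;

public class ScriptLookupSketch {

    /**
     * Opens a script file on the backend.
     * 1. MR/Tez style: the script was packed into a jar, so ask the ClassLoader.
     * 2. Spark style: map the original file name to a local path through the
     *    resolver (a stand-in for SparkFiles.get) and open a FileInputStream.
     */
    public static InputStream openScript(String scriptPath,
                                         ClassLoader loader,
                                         Function<String, String> resolver)
            throws IOException {
        InputStream in = loader.getResourceAsStream(scriptPath);
        if (in != null) {
            return in;
        }
        // Assumption: the distributed copy is registered under the bare file name.
        String fileName = new File(scriptPath).getName();
        String localPath = resolver.apply(fileName);
        if (localPath == null || !Files.isRegularFile(Paths.get(localPath))) {
            throw new FileNotFoundException("Script not found on backend: " + scriptPath);
        }
        return new FileInputStream(localPath);
    }

    public static void main(String[] args) throws IOException {
        // A temp directory stands in for the executor's container directory.
        Path containerDir = Files.createTempDirectory("executor-container");
        Files.write(containerDir.resolve("myudf.py"),
                "def square(x): return x * x\n".getBytes(StandardCharsets.UTF_8));

        // Hypothetical resolver: file name -> path inside the container.
        Function<String, String> resolver =
                name -> containerDir.resolve(name).toString();

        try (InputStream in = openScript("scripts/myudf.py",
                     ScriptLookupSketch.class.getClassLoader(), resolver);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}
```

Trying the ClassLoader first keeps the MR/Tez behaviour unchanged, and the
resolver is only consulted when the script is not on the classpath.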
This patch fixes about 30 E2E test case failures, since this problem is common
to the scripting features.
> Scripting and StreamingPythonUDFs fail with Spark exec type
> -----------------------------------------------------------
>
> Key: PIG-5177
> URL: https://issues.apache.org/jira/browse/PIG-5177
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Adam Szita
> Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5177.0.patch, PIG-5177.1.patch, PIG-5177.2.patch
>
>
> We are thrown an exception because the Python script file is not found on the
> backend side (on spark executors).