[Solved - 1 Solution] Submit a Pig job from Oozie?



What is Oozie?

  • Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
  • It is integrated with the Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Oozie can also schedule jobs specific to a system, like Java programs or shell scripts.

Problem:

You are working on automating Pig jobs with Oozie on a Hadoop cluster.

You are able to run a sample Pig script from Oozie, but the next requirement is to run a Pig job where the Pig script receives its input parameters from a shell script. How can this be done?
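For reference, a workflow like the one in the solution below is typically submitted with a job.properties file. This is a minimal sketch; the host names, port numbers, and HDFS path are illustrative assumptions, not values from the original question:

```properties
# Minimal job.properties sketch (hosts, ports, and paths are assumed values)
nameNode=hdfs://namenode-host:8020
jobTracker=resourcemanager-host:8032
queueName=default
# HDFS directory containing workflow.xml and the shell/Pig scripts
oozie.wf.application.path=${nameNode}/user/${user.name}/pig-shell-demo
# Pull in the system sharelib (needed for pig actions)
oozie.use.system.libpath=true
```

The job would then be submitted with something like `oozie job -config job.properties -run`, pointing at your Oozie server via `-oozie` or the OOZIE_URL environment variable.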

Solution 1:

  • How can we pass a parameter from a shell script's output?

Here's the working example:

WORKFLOW.XML

<workflow-app xmlns='uri:oozie:workflow:0.3' name='shell-wf'>
    <start to='shell1' />
    <action name='shell1'>
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                  <name>mapred.job.queue.name</name>
                  <value>${queueName}</value>
                </property>
            </configuration>
            <exec>so.sh</exec>
            <argument>A</argument>
            <argument>B</argument>
            <file>so.sh</file> 
            <capture-output/>
        </shell>
        <ok to="shell2" />
        <error to="fail" />
    </action>


    <action name='shell2'>
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                  <name>mapred.job.queue.name</name>
                  <value>${queueName}</value>
                </property>
            </configuration>
            <exec>so2.sh</exec>
            <argument>${wf:actionData('shell1')['out']}</argument>
            <file>so2.sh</file> 
        </shell>
        <ok to="end" />
        <error to="fail" />
    </action>

    <kill name="fail">
        <message>Script failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>
  • Replace the second shell action with your pig action and pass the parameter to the Pig script like this:
...
<param>MY_PARAM=${wf:actionData('shell1')['out']}</param>
...
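A fuller sketch of what that replacement pig action could look like (the action name and script file name are illustrative assumptions; the param line is the one from the answer above):

```xml
<action name='pig1'>
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>my_script.pig</script>
        <!-- value captured from shell1's stdout via capture-output -->
        <param>MY_PARAM=${wf:actionData('shell1')['out']}</param>
    </pig>
    <ok to="end" />
    <error to="fail" />
</action>
```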

SO.SH

#!/bin/bash
echo "out=test"

SO2.SH

#!/bin/bash
echo "I'm so2.sh and I received the following param:"
echo "$1"

Regarding the sharelib issue: make sure the job properties define LIB_PATH=where/the/jars/are, hand this parameter over to the pig action,

<param>LIB_PATH=${LIB_PATH}</param>

and then register the jars from there:

REGISTER '$LIB_PATH/my_jar'

  • Map wf:actionData(String node)
  • This function is only applicable to action nodes that produce output data on completion.
  • The output data is in Java Properties format, and via this EL function it is available as a Map.

If the capture-output element is present, it instructs Oozie to capture the STDOUT of the shell command execution. The shell command's output must be in Java Properties file format and must not exceed 2KB.
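As a local sketch of that contract (no Oozie involved; the producer/consumer split here is purely illustrative), the producer emits KEY=VALUE lines on stdout and the consumer extracts a key the way wf:actionData('shell1')['out'] would:

```shell
# Local sketch of the capture-output contract (illustrative only, no Oozie):
# the action's stdout must be Java-Properties-style KEY=VALUE lines, under 2KB.
producer() {
  echo "out=test"
  echo "count=42"
}

# Extract one key's value, analogous to ${wf:actionData('shell1')['out']}.
out_val=$(producer | grep '^out=' | cut -d'=' -f2-)
echo "captured out: $out_val"
```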

Alternatively, you can use a simple workaround: execute your shell script from within Pig itself, save its result in a variable, and use that, like this:

%declare MY_VAR `echo "/abc/cba"`;
A = LOAD '$MY_VAR' ...
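The backtick trick above is ordinary command substitution; the same idea in plain shell (the path is just an illustrative value):

```shell
# Command substitution: the command's stdout becomes the variable's value,
# which is what Pig's %declare with backticks does at preprocessing time.
MY_VAR=$(echo "/abc/cba")
echo "loading from: $MY_VAR"
```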
