Apache Pig - User Defined Functions



What are User Defined Functions in Apache Pig?

  • In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDFs).
  • With UDFs, you can define your own functions and use them in Pig Latin scripts.

Supported languages:

  • UDF support is provided in six programming languages: Java, Jython, Python, JavaScript, Ruby and Groovy.
  • Complete support for writing UDFs is provided in Java; the remaining languages have limited support.
  • Using Java, we can write UDFs that cover all parts of the processing, such as data load/store, column transformation, and aggregation.
  • Because Apache Pig itself is written in Java, UDFs written in Java run more efficiently than those written in the other languages.
  • Apache Pig also has a Java repository for UDFs named Piggybank. Using Piggybank, you can access Java UDFs written by other users and contribute your own UDFs.

Types of UDFs in Java

When writing UDFs in Java, you can create and use the following three types of functions −

  • Filter Functions − Filter functions are used as conditions in FILTER statements. They accept a Pig value as input and return a Boolean value.
  • Eval Functions − Eval functions are used in FOREACH-GENERATE statements. They accept a Pig value as input and return a Pig result.
  • Algebraic Functions − Algebraic functions act on inner bags in a FOREACH-GENERATE statement. They are used to perform full MapReduce operations on an inner bag.
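
To make the three categories concrete, here is a minimal plain-Java sketch that runs without the Pig jar. It is not Pig's API: the class UdfKindsSketch and its methods are hypothetical stand-ins, and a bag is modeled as a java.util.List. In Pig itself these roles are played by the FilterFunc and EvalFunc base classes and the Algebraic interface. The algebraic part illustrates why such functions can be spread over several processing phases: partial results computed on chunks of a bag can be merged into the final answer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in class; none of these names come from Pig.
public class UdfKindsSketch {

    // Filter-style function: takes a value and returns a Boolean.
    public static Boolean isAdult(Integer age) {
        return age != null && age >= 18;
    }

    // Eval-style function: takes a value and returns a transformed value.
    public static String toUpper(String s) {
        return s == null ? null : s.toUpperCase();
    }

    // Algebraic-style function, first stage: count the items in one chunk of a bag.
    public static long initialCount(List<?> chunk) {
        return chunk.size();
    }

    // Algebraic-style function, later stages: merge partial counts into one total.
    public static long mergeCounts(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(isAdult(22));
        System.out.println(toUpper("kevin"));

        // A bag of five items split into two chunks, counted separately, then merged.
        List<String> chunk1 = Arrays.asList("a", "b", "c");
        List<String> chunk2 = Arrays.asList("d", "e");
        long total = mergeCounts(Arrays.asList(initialCount(chunk1), initialCount(chunk2)));
        System.out.println(total);
    }
}
```

Counting works in this staged way because the total of partial counts equals the count of the whole bag; a function without that property cannot be split up like this.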

Writing UDFs using Java:

  • To write a UDF using Java, the Pig jar file (pig-0.15.0.jar in this tutorial) must be on the project's classpath. In this section, we discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven on your system.

Follow the steps given below to write a UDF:

Step 1

  • Open Eclipse and create a new project (say myproject).

Step 2

  • Convert the newly created project into a Maven project.

Step 3

  • Copy the following content into the pom.xml file.
  • This file contains the Maven dependencies for Apache Pig and Hadoop-core jar files.
<project xmlns = "http://maven.apache.org/POM/4.0.0"
   xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	
   <modelVersion>4.0.0</modelVersion> 
   <groupId>Pig_Udf</groupId> 
   <artifactId>Pig_Udf</artifactId> 
   <version>0.0.1-SNAPSHOT</version>
	
   <build>    
      <sourceDirectory>src</sourceDirectory>    
      <plugins>      
         <plugin>        
            <artifactId>maven-compiler-plugin</artifactId>        
            <version>3.3</version>        
            <configuration>          
               <source>1.7</source>          
               <target>1.7</target>        
            </configuration>      
         </plugin>    
      </plugins>  
   </build>
	
   <dependencies> 
	
      <dependency>            
         <groupId>org.apache.pig</groupId>            
         <artifactId>pig</artifactId>            
         <version>0.15.0</version>     
      </dependency> 
		
      <dependency>        
         <groupId>org.apache.hadoop</groupId>            
         <artifactId>hadoop-core</artifactId>            
         <version>0.20.2</version>     
      </dependency> 
      
   </dependencies>  
	
</project>

Step 4

  • Save the file and refresh it. In the Maven Dependencies section, we can find the downloaded jar files.

Step 5

  • Create a new class file named Sample_Eval and copy the following content into it.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that converts its first argument to upper case.
public class Sample_Eval extends EvalFunc<String> {

   @Override
   public String exec(Tuple input) throws IOException {
      // Nothing to convert: no tuple or an empty tuple.
      if (input == null || input.size() == 0)
         return null;
      String str = (String) input.get(0);
      // A null field stays null instead of raising a NullPointerException.
      return (str == null) ? null : str.toUpperCase();
   }
}

An Eval UDF must extend the EvalFunc class and implement its exec() function. The code required for the UDF is written inside this function.
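
The guard logic inside exec() can be tried out without a Pig installation. The following is only a sketch under stated assumptions: the tuple is modeled as a java.util.List<Object> instead of org.apache.pig.data.Tuple, and the class ExecLogicSketch is a hypothetical name. It illustrates the edge cases a UDF of this kind usually has to handle: a missing tuple, an empty tuple, and a null field.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in: a List<Object> plays the role of Pig's Tuple.
public class ExecLogicSketch {

    // Mirrors the shape of exec(Tuple): returns null when there is nothing to convert.
    public static String exec(List<Object> input) {
        if (input == null || input.isEmpty()) {
            return null;
        }
        Object field = input.get(0);
        if (field == null) {
            return null;
        }
        // Convert the first field to upper case.
        return field.toString().toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(exec(Arrays.<Object>asList("Kevin", 22)));   // prints KEVIN
        System.out.println(exec(Collections.<Object>emptyList()));      // prints null
        System.out.println(exec(Arrays.<Object>asList((Object) null))); // prints null
    }
}
```

Returning null for missing input, rather than throwing, lets a single bad record pass through as a null value instead of failing the whole job.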

  • The Sample_Eval class above converts the contents of the specified column to uppercase.
  • After compiling the class without errors, right-click the Sample_Eval.java file and select Export from the menu that appears.
  • Clicking Export opens a new window. Select JAR file.
  • Click the Next> button. In the window that follows, enter the path in the local file system where the jar file should be stored.
  • Finally click the Finish button. In the specified folder, a Jar file sample_udf.jar is created. This jar file contains the UDF written in Java.

Using the UDF:

  • After writing the UDF and generating the Jar file, follow the steps given below.

Step 1:

Registering the Jar file

  • After writing the UDF (in Java), you have to register the Jar file that contains it using the Register operator.
  • Registering the Jar file tells Apache Pig where to find the UDF.

Syntax:

The Register operator syntax is given below.

REGISTER path;

Example:

  • As an example, let us register the sample_udf.jar file created earlier in this chapter.
  • Start Apache Pig in local mode and register the jar file sample_udf.jar as given below.
$ cd $PIG_HOME/bin
$ ./pig -x local

grunt> REGISTER '/$PIG_HOME/sample_udf.jar';

Note − This example assumes the Jar file is located at /$PIG_HOME/sample_udf.jar

Step 2:

Defining Alias

  • After registering the UDF, you can define an alias for it using the Define operator.

Syntax:

The syntax of the Define operator is shown below.

DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };

Example:

Define the alias sample_eval for the Sample_Eval class as shown below. The class name is case-sensitive and must match the Java class name.

DEFINE sample_eval Sample_Eval();

Step 3:

Using the UDF

  • After defining the alias, you can use the UDF in the same way as the built-in functions. Assume the HDFS directory /pig_data/ holds a file named emp1.txt with the following content.
11,Kevin,22,newyork
12,BOB,23,Kolkata
13,Oviya,23,Tokyo
14,Jack,25,London
15,David,23,Bhuwaneshwar
16,Maggy,22,Chennai
17,Anto,22,newyork
18,Syam,23,Kolkata
19,Mary,25,Tokyo
20,Saran,25,London
21,Stacy,25,Bhuwaneshwar
22,Kelly,22,Chennai

Ensure you have loaded this file into Pig as given below.

grunt> wikitechy_emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, city:chararray);
  • Let us now convert the names of the employees to upper case using the UDF sample_eval.
grunt> Upper_case = FOREACH wikitechy_emp_data GENERATE sample_eval(name);

Verification:

  • Verify the contents of the relation Upper_case as given below.
grunt> Dump Upper_case;

(KEVIN)
(BOB)
(OVIYA)
(JACK)
(DAVID)
(MAGGY)
(ANTO)
(SYAM)
(MARY)
(SARAN)
(STACY)
(KELLY)
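
As a cross-check of the output above, the same transformation can be simulated in plain Java. This is a sketch, not how Pig executes the script: the class ForeachSketch is a hypothetical name, only the first three rows of the sample data are inlined, and each line is split on commas in the way PigStorage(',') separates fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for FOREACH ... GENERATE sample_eval(name).
public class ForeachSketch {

    // Upper-cases the name field (index 1) of each comma-separated row
    // and wraps it in parentheses, the way Dump prints a one-field tuple.
    public static List<String> upperNames(List<String> rows) {
        List<String> result = new ArrayList<>();
        for (String row : rows) {
            String[] fields = row.split(",");
            result.add("(" + fields[1].toUpperCase() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        // First three rows of the sample employee data.
        List<String> rows = Arrays.asList(
            "11,Kevin,22,newyork",
            "12,BOB,23,Kolkata",
            "13,Oviya,23,Tokyo");
        for (String line : upperNames(rows)) {
            System.out.println(line);
        }
    }
}
```

Running it prints (KEVIN), (BOB) and (OVIYA), one per line, matching the first three tuples of the Dump output.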

More functions: DataFu Pig

  • Library of useful UDFs, released in 2010
  • Created by the LinkedIn engineering team:
    • Stats: variance, quantiles, median, etc.
    • Bags: concat, append, prepend, etc.
    • Sampling
    • Page rank
    • Session estimation
  • Last major release: 1.2.0 (Dec 2013), http://datafu.incubator.apache.org/