Apache Pig - User Defined Functions




What are User Defined Functions in Apache Pig?

  • In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDFs).
  • With UDFs, you can define your own functions and use them in Pig Latin scripts.

Supporting languages:

  • UDF support is provided in six programming languages: Java, Jython, Python, JavaScript, Ruby and Groovy.
  • Complete support for writing UDFs is provided in Java; the remaining languages have limited support.
  • Using Java, we can write UDFs that cover all parts of the processing, such as data load/store, column transformation, and aggregation.
  • Because Apache Pig itself is written in Java, UDFs written in Java run more efficiently than those written in the other languages.
  • Apache Pig also has a Java repository for UDFs named Piggybank. Using Piggybank, you can access Java UDFs written by other users and contribute your own UDFs.

Types of UDFs in Java

When writing UDFs in Java, you can create and use the following three types of functions:

  • Filter Functions: Filter functions are used as conditions in FILTER statements. These functions accept a Pig value as input and return a Boolean value.
  • Eval Functions: Eval functions are used in FOREACH ... GENERATE statements. These functions accept a Pig value as input and return a Pig result.
  • Algebraic Functions: Algebraic functions act on inner bags in a FOREACH ... GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag.
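
The three kinds of function can be pictured with a plain-Java analogy. The sketch below deliberately does not use Pig's own classes (in Pig the entry points are FilterFunc, EvalFunc, and the Algebraic interface). The class UdfKindsSketch, its method names, and the age threshold of 18 are hypothetical and exist only to show the shape of each kind: a filter returns a Boolean, an eval function returns a value, and an algebraic function splits an aggregate such as a count into initial, intermediate, and final stages so that partial results can be combined.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy only; none of these names come from the Pig API.
class UdfKindsSketch {

    // Filter-style: takes a value and returns a Boolean usable as a condition.
    // The threshold 18 is an arbitrary value chosen for illustration.
    static boolean isAdult(int age) {
        return age >= 18;
    }

    // Eval-style: takes a value and returns a transformed value.
    static String upper(String name) {
        return name == null ? null : name.toUpperCase();
    }

    // Algebraic-style count, split into three stages.
    // Initial: each input record contributes a partial count of 1.
    static long initial(Object record) {
        return 1L;
    }

    // Intermediate: combine partial counts (this stage may run several times).
    static long intermediate(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final: combine the remaining partial counts into the result.
    static long finalCount(List<Long> partials) {
        return intermediate(partials);
    }

    public static void main(String[] args) {
        System.out.println(isAdult(22));
        System.out.println(upper("Kevin"));
        // Count three records in two separate groups, then merge the partial counts.
        long left = intermediate(Arrays.asList(initial("a"), initial("b")));
        long right = intermediate(Arrays.asList(initial("c")));
        System.out.println(finalCount(Arrays.asList(left, right)));
    }
}
```

Splitting the count this way is what lets an aggregate be computed piecewise: each stage only needs the partial results of the previous one, not the whole bag.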

Writing UDFs using Java:

  • To write a UDF using Java, we have to add the jar file pig-0.15.0.jar to the project. In this section, we discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven on your system.

Follow the steps given below to write a UDF:

Step 1

  • Open Eclipse and create a new project (say myproject).

Step 2

  • Convert the newly created project into a Maven project.

Step 3

  • Copy the following content into pom.xml.
  • This file contains the Maven dependencies for Apache Pig and Hadoop-core jar files.
<project xmlns = "http://maven.apache.org/POM/4.0.0"
   xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	
   <modelVersion>4.0.0</modelVersion> 
   <groupId>Pig_Udf</groupId> 
   <artifactId>Pig_Udf</artifactId> 
   <version>0.0.1-SNAPSHOT</version>
	
   <build>    
      <sourceDirectory>src</sourceDirectory>    
      <plugins>      
         <plugin>        
            <artifactId>maven-compiler-plugin</artifactId>        
            <version>3.3</version>        
            <configuration>          
               <source>1.7</source>          
               <target>1.7</target>        
            </configuration>      
         </plugin>    
      </plugins>  
   </build>
	
   <dependencies> 
	
      <dependency>            
         <groupId>org.apache.pig</groupId>            
         <artifactId>pig</artifactId>            
         <version>0.15.0</version>     
      </dependency> 
		
      <dependency>        
         <groupId>org.apache.hadoop</groupId>            
         <artifactId>hadoop-core</artifactId>            
         <version>0.20.2</version>     
      </dependency> 
      
   </dependencies>  
	
</project>

Step 4

  • Save the file and refresh it. In the Maven Dependencies section, we can find the downloaded jar files.

Step 5

  • Create a new class file named Sample_Eval and copy the following content into it.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that converts its first argument to upper case.
public class Sample_Eval extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Pig passes the UDF arguments as a tuple; return null when there is nothing to convert.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String str = (String) input.get(0);
        return str.toUpperCase();
    }
}

When writing an eval UDF, the class must extend the EvalFunc class and implement the exec() function. The code required for the UDF is written within this function.
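
The behaviour expected of exec() can be tried out without Pig on the classpath. The following is a minimal sketch under stated assumptions: the class ExecLogicSketch is hypothetical, and a List<Object> stands in for Pig's Tuple, so this is not the Pig API itself. It returns null for a missing, empty, or null-valued input and otherwise upper-cases the first field.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-in for the exec() logic; List<Object> plays the role of Pig's Tuple here.
class ExecLogicSketch {

    // Returns the first field in upper case, or null when there is nothing to convert.
    static String exec(List<Object> input) {
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }

    public static void main(String[] args) {
        // A normal value, an empty input, and an input whose first field is null.
        System.out.println(exec(Arrays.<Object>asList("Kevin")));
        System.out.println(exec(Collections.<Object>emptyList()));
        System.out.println(exec(Arrays.<Object>asList((Object) null)));
    }
}
```

Returning null rather than throwing keeps a single bad record from failing the whole job, which is the usual reason for the guard at the top of exec().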

  • In the Sample_Eval example, exec() converts the contents of the specified column to uppercase.
  • After compiling the class without errors, right-click the Sample_Eval.java file and select Export from the menu that appears.

  • After you click Export, a new window opens. Click JAR file.

  • Click the Next> button. Another window opens, where you enter the path in the local file system at which the jar file should be stored.

  • Finally, click the Finish button. A jar file named sample_udf.jar is created in the specified folder. This jar file contains the UDF written in Java.

Using the UDF:

  • After writing the UDF and generating the jar file, follow the steps given below.

Step 1:

Registering the Jar file

  • After writing the UDF (in Java), you have to register the jar file that contains the UDF using the REGISTER operator.
  • Registering the jar file tells Apache Pig where the UDF is located.

Syntax:

The Register operator syntax is given below.

REGISTER path;

Example:

  • As an example, let us register the sample_udf.jar file created earlier in this chapter.
  • Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.
$ cd $PIG_HOME/bin
$ ./pig -x local

grunt> REGISTER '/$PIG_HOME/sample_udf.jar';

Note

This example assumes the jar file is located at /$PIG_HOME/sample_udf.jar.

Step 2:

Defining Alias

  • After registering the UDF, you can define an alias for it using the DEFINE operator.

Syntax:

The syntax of the Define operator is shown below.

DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };

Example:

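Define the alias sample_eval for the Sample_Eval class as shown below. The function name on the right must match the Java class name, including case.

grunt> DEFINE sample_eval Sample_Eval();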

Step 3:

Using the UDF

  • Once the alias is defined, you can use the UDF in the same way as the built-in functions. Assume there is a file named emp1.txt in the HDFS directory /pig_data/ with the following content.
11,Kevin,22,newyork
12,BOB,23,Kolkata
13,Oviya,23,Tokyo
14,Jack,25,London
15,David,23,Bhuwaneshwar
16,Maggy,22,Chennai
17,Anto,22,newyork
18,Syam,23,Kolkata
19,Mary,25,Tokyo
20,Saran,25,London
21,Stacy,25,Bhuwaneshwar
22,Kelly,22,Chennai

Load this file into Pig as shown below.

grunt> wikitechy_emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, city:chararray);
  • Now convert the names of the employees to upper case using the UDF sample_eval.
grunt> Upper_case = FOREACH wikitechy_emp_data GENERATE sample_eval(name);

Verification:

  • Verify the contents of the relation Upper_case as shown below.
grunt> Dump Upper_case;

(KEVIN)
(BOB)
(OVIYA)
(JACK)
(DAVID)
(MAGGY)
(ANTO)
(SYAM)
(MARY)
(SARAN)
(STACY)
(KELLY)
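
To see what the FOREACH ... GENERATE statement does to each row, the same transformation can be imitated in plain Java. This is only a sketch: the class ForeachSketch and its method upperNames are hypothetical, the rows are the first three lines of the sample file above, and no Pig classes are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java imitation of: FOREACH wikitechy_emp_data GENERATE sample_eval(name);
class ForeachSketch {

    // Splits each comma-separated row and upper-cases the second field (name).
    static List<String> upperNames(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String[] fields = row.split(","); // id, name, age, city
            out.add(fields[1].toUpperCase());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "11,Kevin,22,newyork",
            "12,BOB,23,Kolkata",
            "13,Oviya,23,Tokyo");
        // Print each result in the parenthesised form that Dump uses.
        for (String name : upperNames(rows)) {
            System.out.println("(" + name + ")");
        }
    }
}
```

As in the Pig output, each input row yields exactly one output tuple containing only the generated column.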

More functions: DataFu Pig

  • A library of useful UDFs, released in 2010
  • Created by the LinkedIn engineering team. It includes:
    • Stats: variance, quantiles, median, etc.
    • Bags: concat, append, prepend, etc.
    • Sampling
    • PageRank
    • Session estimation
  • Last major release: 1.2.0 (Dec 2013) http://datafu.incubator.apache.org/
  • [Figures not reproduced: how to use UDF libraries, Pig scripting, calling a script]
