pig tutorial - apache pig tutorial - Pig Example - Apache pig example - pig latin - apache pig - pig hadoop



What is pig?

  • Pig is a high level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig.

Problem Statement : 1

  • Example: movies_data.csv
    • 1,Dhadakebaz,1986,3.2,7560
    • 2,Dhumdhadaka,1985,3.8,6300
    • 3,Ashi hi banva banvi,1988,4.1,7802
    • 4,Zapatlela,1993,3.7,6022
    • 5,Ayatya Gharat Gharoba,1991,3.4,5420
    • 6,Navra Maza Navsacha,2004,3.9,4904
    • 7,De danadan,1987,3.4,5623
    • 8,Gammat Jammat,1987,3.4,7563
    • 9,Eka peksha ek,1990,3.2,6244
    • 10,Pachhadlela,2004,3.1,6956
  • Load data
    • $ pig -x local
    • grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration)
    • grunt> dump movies; it displays the contents
  • Filter data
    • grunt> movies_greater_than_35 = FILTER movies BY (float)rating > 3.5;
    • grunt> dump movies_greater_than_35;
    learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  - apache grunt filter data

    Store the results data from pig

    • grunt> store movies_greater_than_35 into 'my_movies';
    • It stores the result in local file system directory named 'my_movies'.

    Display the result

  • Now display the result from local file system.
  • cat my_movies/part-m-00000
  • Load command

    • The load command specified only the column names. We can modify the statement as follows to include the data type of the columns:
    •   grunt> movies = LOAD 
      'movies_data.csv' USING
      PigStorage(',') as (id:int,
      name:chararray, year:int,
      rating:double, duration:int);

    Check the filters

  • List the movies that were released between 1950 and 1960
  •  grunt> movies_between_90_95 = FILTER
    movies by year > 1990 and year < 1995;
  • List the movies that start with the Alpahbet D
  •  grunt> movies_starting_with_D = FILTER   
           movies by name matches 'D.*';
  • List the movies that have duration greater that 2 hours
  • grunt> movies_duration_2_hrs = FILTER 
    movies by duration > 7200; 

    Pig Code Output

    learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  - pig latin code output

    Foreach statement in Pig

  • FOREACH gives a simple way to apply transformations based on columns. Let’s understand this with an example.
  • List the movie names its duration in minutes
    grunt> movie_duration = FOREACH movies
    GENERATE name, (double)(duration/60);
  • The above statement generates a new alias that has the list of movies and it duration in minutes.
  • You can check the results using the DUMP command.
  • Output of the foreach statement in PIG

    learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  - pig foreach statement output

    Group Statement in PIG

  • The GROUP keyword is used to group fields in a relation.
  • List the years and the number of movies released each year.
    grunt> grouped_by_year = group movies
    by year;
    grunt> count_by_year = FOREACH
    grouped_by_year GENERATE group,
    COUNT(movies); 
  • learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  - pig group statement output

    Order by in PIG Statement

  • Let us question the data to illustrate the ORDER BY operation.
  • List all the movies in the ascending order of year.
  • grunt> desc_movies_by_year = ORDER
    movies BY year ASC;
    grunt> DUMP desc_movies_by_year;
  • List all the movies in the descending order of year.
  • grunt> asc_movies_by_year = ORDER movies
    by year DESC;
    grunt> DUMP asc_movies_by_year; 
  • Ascending by year
  • learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  - pig order by statement output
  • Limit operator in pig
    • Use the LIMIT keyword to get only a limited number for results from relation.
    • grunt> top_5_movies = LIMIT movies 5;
      grunt> DUMP top_10_movies;  
    learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  - apache pig engine

    Pig: Modes of Execution

  • Pig programs can be run in three methods which work in both local and MapReduce mode. They are
    • Script Mode
    • Grunt Mode
    • Embedded Mode
  • Script mode in pig

  • Script Mode or Batch Mode: In script mode, pig runs the commands specified in a script file. The following example shows how to run a pig programs from a script file:
  • $ vim scriptfile.pig
    A = LOAD 'script_file';
    DUMP A;
    $ pig x
    local scriptfile.pig 

    Grunt mode in pig

  • Grunt Mode or Interactive Mode: The grunt mode can also be called as interactive mode. Grunt is pig's interactive shell.
  • It is started when no file is specified for pig to run.
    $ pig -x local
    grunt> A = LOAD 'grunt_file';
    grunt> DUMP A;
  • You can also run pig scripts from grunt using run and exec commands.
    grunt> run scriptfile.pig
    grunt> exec scriptfile.pig
  • Embedded mode in PIG

  • You can embed pig programs in Java, python and ruby and can run from the same.
  • Problem Statement 2:

    • Using Pig find the most occurred start letter.

    Solution:

    Case 1:

    • Load the data into bag named "lines". The entire line is stuck to element line of type character array.
    grunt> lines  = LOAD "/user/Desktop/data.txt" AS (line: chararray);  
    

    Case 2:

    • The text in the bag lines needs to be tokenized this produces one word per row.
    grunt>tokens = FOREACH lines GENERATE flatten(TOKENIZE(line))   As token: chararray; 

    Case 3:

    • To retain the first letter of each word type the below command .This commands uses substring method to take the first character.
    grunt>letters = FOREACH tokens  GENERATE SUBSTRING(0,1)   as letter : chararray;  
    

    Case 4:

    • Create a bag for unique character where the grouped bag will contain the same character for each occurrence of that character.
    grunt>lettergrp = GROUP letters by letter;  
    

    Case 5:

    • The number of occurrence is counted in each group.
    grunt>countletter  = FOREACH  lettergrp  GENERATE group  , COUNT(letters); 

    Case 6:

    • Arrange the output according to count in descending order using the commands below.
    grunt>OrderCnt = ORDER countletter  BY  $1  DESC;  
    

    Case 7:

    • Limit to One to give the result.
    grunt> result  =LIMIT    OrderCnt    1;  
    

    Case 8:

    • Store the result in HDFS . The result is saved in output directory under sonoo folder.
    grunt> STORE   result   into 'home/sonoo/output';  
    

    Related Searches to Pig Example