Apache Pig Split Operator



What is the Split Operator in Apache Pig?

  • The SPLIT operator is used to split a relation into two or more relations.
  • The SPLIT operator takes a single input relation and leaves it unchanged.
  • The SPLIT operator produces two or more output relations. A tuple may be assigned to more than one relation, or to none, depending on the conditions.
  • SPLIT instruction:
    • Splits a relation into multiple relations based on conditions
  • Splitting Data into Training and Testing Dataset
  • SPLIT
    • SPLIT users into kids if age < 18, adults if age >= 18 and age <65, seniors otherwise;
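    • SPLIT data into testing if RANDOM() <= 0.10, training otherwise;
    • SPLIT operator cannot handle non-deterministic functions (such as RANDOM).
  • Thus the above command won't work and will raise an error. A workaround is to generate the random value as a field first and then split on that field, as in the following macro: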
    -- Tag each tuple with a random number, then split on that field.
    DEFINE split_into_training_testing(inputData, split_percentage)
    RETURNS training, testing {
        data = FOREACH $inputData GENERATE RANDOM() AS random_assignment, *;
        SPLIT data INTO testing_data IF random_assignment <= $split_percentage, training_data OTHERWISE;
        -- Drop the helper column ($0) and keep the original fields.
        $training = FOREACH training_data GENERATE $1..;
        $testing = FOREACH testing_data GENERATE $1..;
    };
    inData = LOAD 'some_files.txt' USING PigStorage('\t');
    training, testing = split_into_training_testing(inData, 0.1);
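    Because the assignment is random, the sizes of the two outputs vary from run to run. A minimal sketch of how the split could be checked, assuming the macro call above has been run; the aliases training_all, training_count, testing_all and testing_count are hypothetical names chosen for illustration:

```pig
-- Count the tuples in each output of the macro.
-- COUNT_STAR is used so that tuples with null fields are counted too.
training_all = GROUP training ALL;
training_count = FOREACH training_all GENERATE COUNT_STAR(training);

testing_all = GROUP testing ALL;
testing_count = FOREACH testing_all GENERATE COUNT_STAR(testing);

DUMP training_count;
DUMP testing_count;
```

    With a split_percentage of 0.1, roughly one tenth of the tuples should land in testing, but the exact counts are not deterministic.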
    
    Syntax for macro definition:

    DEFINE macro_name (param [, param ...]) RETURNS {void | alias [, alias ...]} { pig_latin_fragment };

    Syntax for macro expansion:

    alias [, alias ...] = macro_name (param [, param ...]) ;
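    The two forms above can be illustrated with a small macro that wraps SPLIT. This is only a sketch: the macro name split_by_age, its parameters, and the relation employees are hypothetical, and it assumes that employees has already been loaded with an int field named age and that the Pig version in use supports OTHERWISE.

```pig
-- Macro definition: two return aliases, referenced inside the body as $younger and $older.
DEFINE split_by_age(rel, cutoff) RETURNS younger, older {
    SPLIT $rel INTO younger_part IF age < $cutoff, older_part OTHERWISE;
    $younger = FOREACH younger_part GENERATE *;
    $older = FOREACH older_part GENERATE *;
};

-- Macro expansion: one alias on the left for each alias in RETURNS.
under_23, from_23 = split_by_age(employees, 23);
```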
    
    

    Syntax

    grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
    

    Example

    Ensure that we have a file named wikitechy_employee_details.txt in the HDFS directory /pig_data/ with the contents given below.

    111,Anu,Shankar,23,9876543210,Chennai
    112,Barvathi,Nambiayar,24,9876543211,Chennai
    113,Kajal,Nayak,24,9876543212,Trivendram
    114,Preethi,Antony,21,9876543213,Pune
    115,Raj,Gopal,21,9876543214,Hyderabad
    116,Yashika,Kannan,22,9876543215,Delhi
    117,siddu,Narayanan,22,9876543216,Kolkata
    118,Timple,Mohanthy,23,9876543217,Bhuwaneshwar
    
    • And we have loaded this file into Pig with the relation name wikitechy_employee_details as given below.
    wikitechy_employee_details = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_employee_details.txt' USING PigStorage(',')
       AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
    
    • Now split the relation into two, one listing the employees of age less than 23, and the other listing the employees whose age is greater than 22 and less than 25.
    SPLIT wikitechy_employee_details INTO wikitechy_employee_details1 IF age < 23, wikitechy_employee_details2 IF (age > 22 AND age < 25);
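    A SPLIT of this kind produces the same relations as one FILTER per condition on the same input. As a rough sketch of that equivalence for the two age ranges described above (the aliases under_23 and between_22_and_25 are hypothetical names used only here):

```pig
-- Each FILTER reads the same input relation independently,
-- so a tuple can match both conditions, one of them, or neither.
under_23 = FILTER wikitechy_employee_details BY age < 23;
between_22_and_25 = FILTER wikitechy_employee_details BY (age > 22 AND age < 25);
```

    SPLIT expresses the same intent in a single statement, which is usually easier to read when there are several conditions.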
    

    Verification

    Now verify the relations wikitechy_employee_details1 and wikitechy_employee_details2 using the DUMP operator as shown below.

    grunt> Dump wikitechy_employee_details1;

    grunt> Dump wikitechy_employee_details2;
    

    Output

    • The following output displays the contents of the relations wikitechy_employee_details1 and wikitechy_employee_details2 respectively.
    grunt> Dump wikitechy_employee_details1;
    (114,Preethi,Antony,21,9876543213,Pune)
    (115,Raj,Gopal,21,9876543214,Hyderabad)
    (116,Yashika,Kannan,22,9876543215,Delhi)
    (117,siddu,Narayanan,22,9876543216,Kolkata)

    grunt> Dump wikitechy_employee_details2;
    (111,Anu,Shankar,23,9876543210,Chennai)
    (112,Barvathi,Nambiayar,24,9876543211,Chennai)
    (113,Kajal,Nayak,24,9876543212,Trivendram)
    (118,Timple,Mohanthy,23,9876543217,Bhuwaneshwar)
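    A tuple that satisfies none of the conditions is not written to any output relation. One way to catch such tuples, sketched here under the assumption that the Pig version in use supports the OTHERWISE clause, is to add a final catch-all relation; the aliases age_under_23, age_23_to_24 and unmatched are hypothetical:

```pig
-- Same two conditions as in the example, plus a catch-all relation.
SPLIT wikitechy_employee_details INTO
    age_under_23 IF age < 23,
    age_23_to_24 IF (age > 22 AND age < 25),
    unmatched OTHERWISE;

-- Inspect the tuples that matched neither condition.
DUMP unmatched;
```

    In the sample file every employee is between 21 and 24 years old, so unmatched is expected to be empty here.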
    
