Apache Pig Distinct Operator



What is the DISTINCT Operator in Apache Pig?

  • The DISTINCT operator removes duplicate (redundant) tuples from a relation, leaving only unique records.
  • It compares entire records and cannot be applied to individual fields directly. To deduplicate on specific fields, project them first with FOREACH ... GENERATE and then apply DISTINCT.
  • It plays the same role as SELECT DISTINCT in SQL, which filters a result set to remove duplicates.
  • DISTINCT can be combined with an aggregate function, typically COUNT(), to count unique values.
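
Because DISTINCT compares whole tuples, deduplicating on one field means projecting that field first. A minimal sketch, assuming a hypothetical, already-loaded relation named students that has a city field (both names are illustrative):

```pig
-- Assumption: 'students' is an already-loaded relation with a 'city' field.
-- Project only the field of interest, so each tuple holds just the city.
cities = FOREACH students GENERATE city;

-- DISTINCT now compares single-field tuples and keeps one per city.
unique_cities = DISTINCT cities;

DUMP unique_cities;
```

The statements are written as a script; the same lines can be typed one at a time at the grunt> prompt.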

Pig Operations - Deduplication

  • Works only on entire records, not on individual fields
  • Forces a reduce phase, but Pig optimizes it by using the combiner
  • The DISTINCT instruction preserves only unique tuples
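
Since DISTINCT triggers a reduce phase, the statement accepts Pig's PARALLEL clause to request a number of reduce tasks, and EXPLAIN prints the plans Pig generates for a relation so the reduce stage can be inspected. A sketch, assuming a hypothetical, already-loaded relation named input_data and an arbitrary reducer count of 4:

```pig
-- Assumption: 'input_data' is an already-loaded relation; 4 is an arbitrary choice.
-- PARALLEL requests the number of reduce tasks for this operator.
deduped = DISTINCT input_data PARALLEL 4;

-- EXPLAIN prints the plans Pig would use to compute the relation.
EXPLAIN deduped;
```

A suitable reducer count depends on data volume and cluster size, so treat the value here as a placeholder.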

    Syntax

    grunt> Relation_name2 = DISTINCT Relation_name1;
    

    Example:

    File wikitechy_student_details.txt, stored in the HDFS directory /pig_data/:

    001,Sabrina,Reddy,9848022337,Hyderabad
    002,Arvin,Battacharya,9848022338,Kolkata
    002,Arvin,Battacharya,9848022338,Kolkata
    003,Arun,Khanna,9848022339,Delhi
    003,Arun,Khanna,9848022339,Delhi
    004,Preethi,Agarwal,9848022330,Pune
    005,Sruti,Mohanthy,9848022336,Bhuwaneshwar
    006,Vanitha,Mishra,9848022335,Chennai
    006,Vanitha,Mishra,9848022335,Chennai
    
    • Load this file into Pig as the relation wikitechy_student_details:
    grunt> wikitechy_student_details = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_student_details.txt' USING PigStorage(',')
       AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
    
    • Remove the redundant tuples from wikitechy_student_details with the DISTINCT operator and store the result in another relation, distinct_data:
    grunt> distinct_data = DISTINCT wikitechy_student_details;
    

    Verification

    grunt> Dump distinct_data;
    

    Output:

    (1,Sabrina,Reddy,9848022337,Hyderabad)
    (2,Arvin,Battacharya,9848022338,Kolkata)
    (3,Arun,Khanna,9848022339,Delhi)
    (4,Preethi,Agarwal,9848022330,Pune)
    (5,Sruti,Mohanthy,9848022336,Bhuwaneshwar)
    (6,Vanitha,Mishra,9848022335,Chennai)
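
DISTINCT can also be nested inside FOREACH to feed COUNT, which is a common way to count the unique values of a field. A sketch using the wikitechy_student_details relation loaded above; the aliases grouped, cities, unique_cities, city_count and n_cities are illustrative:

```pig
-- Put every tuple into a single group so the whole relation is counted at once.
grouped = GROUP wikitechy_student_details ALL;

city_count = FOREACH grouped {
    -- Project the city field from the bag of grouped tuples.
    cities = wikitechy_student_details.city;
    -- Nested DISTINCT removes repeated cities within the bag.
    unique_cities = DISTINCT cities;
    GENERATE COUNT(unique_cities) AS n_cities;
};

DUMP city_count;
```

The result is a single tuple holding the number of different cities in the file.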
    
