Apache Pig Distinct Operator




What is the DISTINCT Operator in Apache Pig?

  • The DISTINCT operator removes duplicate (redundant) tuples from a relation, keeping exactly one copy of each.
  • It compares entire records, not individual fields. To deduplicate on a single field, first project that field with FOREACH ... GENERATE and then apply DISTINCT.
  • Unlike SQL, where DISTINCT is a keyword inside a SELECT statement, DISTINCT in Pig Latin is a relational operator applied to a whole relation.
  • DISTINCT is often combined with an aggregate function, typically COUNT, to count unique values.
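
To illustrate the combination with COUNT, the sketch below counts the distinct outlinks for each url. The input path 'data' and the relation and field names (links, url, outlink) are assumptions made for this illustration only. DISTINCT is applied inside a nested FOREACH block so that it deduplicates within each group.

    grunt> links = LOAD 'data' AS (url:chararray, outlink:chararray);
    grunt> by_url = GROUP links BY url;   -- one group per url
    grunt> outlink_counts = FOREACH by_url {
               targets = links.outlink;             -- project the field to deduplicate
               unique_targets = DISTINCT targets;   -- unique outlinks within this group
               GENERATE group, COUNT(unique_targets);
           };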

Pig Operations - Deduplication

  • works only on entire records, not on individual fields
  • forces a reduce phase, but optimizes by using the combiner
  • DISTINCT instruction:
    • Only preserves unique tuples

    Syntax

    grunt> Relation_name2 = DISTINCT Relation_name1;
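
    Because DISTINCT forces a reduce phase, the optional PARALLEL clause can be used to request a number of reduce tasks. A minimal sketch, reusing the placeholder relation names from the syntax line; the value 4 is an arbitrary assumption, not a recommendation:

    grunt> Relation_name2 = DISTINCT Relation_name1 PARALLEL 4;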
    

    Example:

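    Assume the file wikitechy_student_details.txt in the HDFS directory /pig_data/ contains the following records, some of which are duplicated: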

    001,Sabrina,Reddy,9848022337,Hyderabad
    002,Arvin,Battacharya,9848022338,Kolkata
    002,Arvin,Battacharya,9848022338,Kolkata
    003,Arun,Khanna,9848022339,Delhi
    003,Arun,Khanna,9848022339,Delhi
    004,Preethi,Agarwal,9848022330,Pune
    005,Sruti,Mohanthy,9848022336,Bhuwaneshwar
    006,Vanitha,Mishra,9848022335,Chennai
    006,Vanitha,Mishra,9848022335,Chennai
    
    • We load this file into Pig as the relation wikitechy_student_details:
    grunt> wikitechy_student_details = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_student_details.txt' USING PigStorage(',') 
       as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
    
    • We remove the duplicate tuples from the relation wikitechy_student_details using the DISTINCT operator and store the result in another relation named distinct_data:
    grunt> distinct_data = DISTINCT wikitechy_student_details;
    

    Verification

    grunt> Dump distinct_data;
    

    Output:

    (1,Sabrina,Reddy,9848022337,Hyderabad)
    (2,Arvin,Battacharya,9848022338,Kolkata)
    (3,Arun,Khanna,9848022339,Delhi)
    (4,Preethi,Agarwal,9848022330,Pune)
    (5,Sruti,Mohanthy,9848022336,Bhuwaneshwar)
    (6,Vanitha,Mishra,9848022335,Chennai)
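
    Since DISTINCT compares whole records, deduplicating a single field requires projecting it first. The sketch below reuses the relations from this example; the names cities, distinct_cities, all_rows and unique_count are introduced here purely for illustration:

    grunt> cities = FOREACH wikitechy_student_details GENERATE city;   -- keep only the city field
    grunt> distinct_cities = DISTINCT cities;                          -- one tuple per distinct city
    grunt> Dump distinct_cities;

    The number of unique records can be obtained by grouping the deduplicated relation and applying COUNT:

    grunt> all_rows = GROUP distinct_data ALL;   -- a single group holding every tuple
    grunt> unique_count = FOREACH all_rows GENERATE COUNT(distinct_data);
    grunt> Dump unique_count;

    For the six unique records in the output above, this should print (6).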
    
