[Solved-1 Solution] Removing duplicates using PigLatin ?



What is Pig Latin?

  • Pig is a high-level scripting platform used with Apache Hadoop. It lets data workers write complex data transformations without knowing Java.
  • Pig's SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.

Problem:

  • Suppose you are using Pig Latin and want to remove duplicates from a bag, keeping only the last record for each key.

Input:

User1  7 LA
User1  8 NYC
User1  9 NYC
User2  3 NYC
User2  4 DC 

Output:

User1  9 NYC
User2  4 DC
  • Here the first field is the key, and the last record for each key should be retained in the output.
  • Retaining the first element is straightforward, as shown below, but the same approach does not retain the last element.
inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
      top_rec = LIMIT inpt 1;
      GENERATE FLATTEN(top_rec);
};

Solution 1:

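  • If you order each group by one of the fields in descending order, the first record of the sorted bag is the last record for that key. The code below orders by the second field of the input, which works here because that field increases with each record for a key.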

Input:

User1,7,LA
User1,8,NYC
User1,9,NYC
User2,3,NYC
User2,4,DC

Pig snippet:

-- Load the comma-separated input with an explicit schema
user_details = LOAD 'user_details.csv' USING PigStorage(',') AS (user_name:chararray, no:long, city:chararray);

-- Group the records by the key field
user_details_grp_user = GROUP user_details BY user_name;

-- Within each group, sort by no in descending order and keep the first record
required_user_details = FOREACH user_details_grp_user {
    user_details_sorted_by_no = ORDER user_details BY no DESC;
    top_record = LIMIT user_details_sorted_by_no 1;
    GENERATE FLATTEN(top_record);
};

Output: DUMP required_user_details;

(User1,9,NYC)
(User2,4,DC)
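
As a possible alternative to the ORDER and LIMIT combination, Pig's built-in TOP function can pick the record with the largest value in a given column of each group. The sketch below is a minimal illustration, not part of the original solution. It assumes the same user_details.csv file and schema as the snippet above, and it assumes that the largest value of no always marks the last record for a key.

```pig
-- Assumption: same comma-separated input and schema as in Solution 1
user_details = LOAD 'user_details.csv' USING PigStorage(',')
               AS (user_name:chararray, no:long, city:chararray);

user_details_grp_user = GROUP user_details BY user_name;

-- TOP(n, column, bag) returns the n tuples with the highest values in the
-- given column; the column index is 0-based, so 1 refers to the no field
required_user_details = FOREACH user_details_grp_user
                        GENERATE FLATTEN(TOP(1, 1, user_details));

DUMP required_user_details;
```

With the sample input, this should keep one record per user, the one with the highest no. If two records for a key share the same highest value, which of them is kept is not specified, so this approach fits best when the ordering field is unique within each key.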
