[Solved-1 Solution] Pig referencing



Pig Relation:

  • A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table.
  • Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
  • Also note that relations are unordered which means there is no guarantee that tuples are processed in any particular order. Furthermore, processing may be parallelized in which case tuples are not processed according to any total ordering.
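Because a relation has no inherent order, any ordering has to be requested explicitly. The following is a minimal sketch, assuming a tab-delimited file named 'student' with name, age, and gpa fields (the same file used in the examples of this tutorial); the alias B is chosen only for illustration.

```pig
-- Assumption: 'student' is a tab-delimited file with fields name, age, gpa.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- ORDER BY sorts the relation, here by gpa with the highest value first.
B = ORDER A BY gpa DESC;

-- Tuples with equal gpa values may still appear in either order.
DUMP B;
```

Note that the sort order is only dependable when B is dumped or stored directly; operators applied to B afterwards are not guaranteed to preserve it.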

Referencing Relations

  • Relations are referred to by name (or alias). Names are assigned by you as part of the Pig Latin statement. In this example the name (alias) of the relation is A.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
  • Fields are referenced by positional notation or by name (alias). Positional notation is generated by the system. It is indicated with the dollar sign ($) and begins with zero (0); for example, $0, $1, $2.
  • Names are assigned by using schemas (or, in the case of the GROUP operator and some functions, by the system). We can use any name that is not a Pig keyword; for example, f1, f2, f3 or a, b, c or name, age, gpa. In this example the field name is referenced by name and the field gpa by position ($2).
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name,$2;
DUMP X;
(John,4.0F)
(Mary,3.8F)
(Bill,3.9F)
(Joe,3.8F)
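Names and positions are interchangeable wherever a field is expected. The sketch below, which assumes the same 'student' file and uses the illustrative aliases X1, X2, and F, writes the same projection both ways and also uses a positional reference inside FILTER.

```pig
-- Assumption: same tab-delimited 'student' file as above.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- Projection by name.
X1 = FOREACH A GENERATE name, gpa;

-- The same projection by position: $0 is name, $2 is gpa.
X2 = FOREACH A GENERATE $0, $2;

-- Positional references work in other operators too; $1 is age.
F = FILTER A BY $1 > 18;

DUMP X1;
DUMP X2;
DUMP F;
```

X1 and X2 should contain the same tuples. Names are usually easier to read, while positions are useful when no schema has been declared.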

Referencing Fields that are Complex Data Types

The fields in a tuple can be any data type, including the complex data types: bags, tuples, and maps.

  • Use the schemas for complex data types to name fields that are complex data types.
  • Use the dereference operators to reference and work with fields that are complex data types.
  • In this example the data file contains tuples. A schema for complex data types (in this case, tuples) is used to load the data. Then, dereference operators (the dot in t1.t1a and t2.$0) are used to access the fields in the tuples.

The code below references one field by name (t1.t1a) and one by positional notation (t2.$0):

cat data;
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)

A = LOAD 'data' AS (t1:tuple(t1a:int, t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));

DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))

X = FOREACH A GENERATE t1.t1a,t2.$0;

DUMP X;
(3,4)
(1,3)
(2,9)
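Maps are dereferenced with the # operator and a key rather than with a dot. The following is a small sketch under stated assumptions: the file name 'mapdata', its contents, and the keys name and city are hypothetical and are not part of the original example.

```pig
-- Hypothetical input file 'mapdata', one map per line in Pig's text format:
--   [name#John,city#Paris]
--   [name#Mary]
A = LOAD 'mapdata' AS (m:map[]);

-- m#'key' looks up the value stored under that key.
-- A key that is missing from a map yields null for that tuple.
B = FOREACH A GENERATE m#'name', m#'city';

DUMP B;
```

Since the map is declared without a value type (map[]), the looked-up values are bytearrays and may need a cast before they are used in typed expressions.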

Problem:

How do you reference elements in Apache Pig?

Solution 1:

  • Let's work through a simple demonstration to understand this problem.
  • Suppose a file 'a.txt' is stored at '/tmp/a.txt' in HDFS. Load it and dump its contents:

A = LOAD '/tmp/a.txt' using PigStorage(',') AS (name:chararray,term:chararray,gpa:float);
Dump A;
(John,fl,3.9)
(John,fl,3.7)
(John,sp,4.0)
(John,sm,3.8)
(Mary,fl,3.8)
(Mary,fl,3.9)
(Mary,sp,4.0)
(Mary,sm,4.0)

Now let's group the relation A by two of its fields, name and term:

B = GROUP A BY (name,term);
dump B;
((John,fl),{(John,fl,3.7),(John,fl,3.9)})
((John,sm),{(John,sm,3.8)})
((John,sp),{(John,sp,4.0)})
((Mary,fl),{(Mary,fl,3.9),(Mary,fl,3.8)})
((Mary,sm),{(Mary,sm,4.0)})
((Mary,sp),{(Mary,sp,4.0)})
describe B;
B: {group: (name: chararray,term: chararray),A: {(name: chararray,term: chararray,gpa: float)}}

This grouped relation is where the question of referencing arises. The following statement shows how to access the fields of the group tuple, the fields of the tuples in the bag A, or both:

C = foreach B generate group.name,group.term,A.name,A.term,A.gpa;
dump C;
(John,fl,{(John),(John)},{(fl),(fl)},{(3.7),(3.9)})
(John,sm,{(John)},{(sm)},{(3.8)})
(John,sp,{(John)},{(sp)},{(4.0)})
(Mary,fl,{(Mary),(Mary)},{(fl),(fl)},{(3.9),(3.8)})
(Mary,sm,{(Mary)},{(sm)},{(4.0)})
(Mary,sp,{(Mary)},{(sp)},{(4.0)})

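This way we accessed all the elements. group.name and group.term dereference the group tuple and return single values, while A.name, A.term, and A.gpa project a field from every tuple in the bag A and therefore return bags.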
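A bag reference such as A.gpa is most often passed to an aggregate function. The sketch below is one possible follow-up, not part of the original solution: it assumes the same '/tmp/a.txt' file, and the aliases D, avg_gpa, and num_rows are illustrative names.

```pig
-- Assumption: same comma-delimited file as in the solution above.
A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (name:chararray, term:chararray, gpa:float);
B = GROUP A BY (name, term);

-- FLATTEN(group) turns the group tuple into two top-level fields.
-- A.gpa is a bag of gpa values, which AVG accepts; COUNT takes the whole bag A.
D = FOREACH B GENERATE FLATTEN(group) AS (name, term),
                       AVG(A.gpa) AS avg_gpa,
                       COUNT(A) AS num_rows;

DUMP D;
```

This should yield one tuple per (name, term) group. Because gpa is loaded as a float, the averages may print with small floating-point rounding differences.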

