[Solved-1 Solution] Group key values of a map in Pig?

What is group by?

  • GroupByKey is a core transform (the name comes from Apache Beam) that performs a parallel reduction over collections of key/value pairs. In Pig, the GROUP operator plays the same role.
  • Its input is a PCollection of key/value pairs that represents a multimap, in which several pairs share the same key but carry different values.
  • The transform gathers together all of the values in the multimap that share the same key.
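The grouping described above can be sketched in plain Java. This is only an illustration of the idea, not Beam's or Pig's API: the class name GroupPairsSketch, the group method, and the sample pairs are all hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GroupPairsSketch {

    // Gathers all values that share a key; keys are sorted by the TreeMap,
    // and values keep the order in which they were encountered.
    public static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(
            Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        // Hypothetical multimap: the key "cat" appears twice with different values.
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
            new SimpleEntry<String, Integer>("cat", 1),
            new SimpleEntry<String, Integer>("dog", 5),
            new SimpleEntry<String, Integer>("cat", 9));
        System.out.println(group(pairs)); // prints {cat=[1, 9], dog=[5]}
    }
}
```

Each key ends up with a single entry holding every value that was paired with it, which is the shape a group-by produces.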

Problem:

Here we have a file


Pig script

A = LOAD 'txt' AS (in: map[]);

We know that we can retrieve a value from the map by supplying its key. In the example above, we took the value stored under the key "a". Now suppose we do not know the keys in advance: we need to group the values by key across the relation and dump the result.


Does Pig allow such an operation natively, or do we need a UDF?

Solution 1:

  • We can create a custom UDF which converts the map to a bag (using Pig v0.10.0):
package com.example;

import java.io.IOException;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MapToBag extends EvalFunc<DataBag> {
    private static final BagFactory bagFactory = BagFactory.getInstance();
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        try {
            // The first field of the input tuple is the map to convert.
            @SuppressWarnings("unchecked")
            Map<String, Object> map = (Map<String, Object>) input.get(0);
            DataBag result = null;
            if (map != null) {
                result = bagFactory.newDefaultBag();
                // Emit one (key, value) tuple per map entry.
                for (Entry<String, Object> entry : map.entrySet()) {
                    Tuple tuple = tupleFactory.newTuple(2);
                    tuple.set(0, entry.getKey());
                    tuple.set(1, entry.getValue());
                    result.add(tuple);
                }
            }
            // A null map yields a null bag.
            return result;
        } catch (Exception e) {
            throw new RuntimeException("MapToBag error", e);
        }
    }
}

Then, after registering the jar that contains the UDF with REGISTER, use the code below:

B = foreach A generate 
      flatten(com.example.MapToBag(in)) as (k:chararray, v:chararray);
describe B;
B: {k: chararray,v: chararray}
  • Now group by key and use a nested foreach:
C = foreach (group B by k) {
    value = foreach B generate v;
    generate group as key, value;
};
dump C;
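To see what the three relations compute without a Pig installation, here is a minimal plain-Java sketch of the same pipeline: each map is flattened into (key, value) entries, as MapToBag and flatten do, and the entries are then collected per key, as group does. The class name GroupMapValuesSketch, the groupByKey method, and the sample rows are hypothetical and are not taken from the original input file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupMapValuesSketch {

    // Collects every value found under the same key across all input maps.
    public static Map<String, List<String>> groupByKey(List<Map<String, String>> rows) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map<String, String> row : rows) {
            if (row == null) {
                continue; // this sketch simply skips rows that have no map
            }
            // "Flatten" step: visit each (key, value) entry of the map.
            for (Map.Entry<String, String> entry : row.entrySet()) {
                // "Group" step: append the value to the list kept for its key.
                grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                       .add(entry.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical rows standing in for the maps loaded into relation A.
        Map<String, String> first = new LinkedHashMap<>();
        first.put("a", "1");
        first.put("b", "2");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("a", "3");
        second.put("c", "4");

        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(first);
        rows.add(second);

        System.out.println(groupByKey(rows)); // prints {a=[1, 3], b=[2], c=[4]}
    }
}
```

No key is named anywhere in the code, which is the point of the UDF approach: the keys are discovered from the data rather than supplied in the script.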

