How to get an array/bag of elements from a Hive GROUP BY, as with the GROUP operator in Pig?



What is Hive?

  • Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

What is GROUP BY?

  • In Pig, the GROUP operator groups the data in one or more relations. It collects all records that share the same key into a single bag.

Syntax

  • Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY age;

Problem:

  • Suppose we want to group by a given field and get, for each group, the collection of values from another field. Below is an example.
  • Imagine a table named 'sample_table' with two columns as below:

F1 F2
001 111
001 222
001 123
002 222
002 333
003 555

We want to write a Hive query that gives the output below:

001 [111, 222, 123]
002 [222, 333]
003 [555]

In Pig, this can be achieved very easily with something like this:

grouped_relation = GROUP sample_table BY F1;
  • Can somebody please suggest if there is a simple way to do so in Hive?

Solution 1:

  • The built-in aggregate function collect_set gets us almost what we want: it returns the distinct F2 values of each group as an array. It works on the example input above:
SELECT F1, collect_set(F2)
FROM sample_table
GROUP BY F1;
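
To try the query end to end, the following is a minimal sketch in HiveQL. It assumes both columns are stored as STRING (an assumption, chosen so the leading zeros in F1 are kept), uses the rows implied by the expected output above, and introduces the alias f2_values purely for illustration. The order of elements inside each returned array is not guaranteed.

```sql
-- Hypothetical table definition; STRING column types are an assumption.
CREATE TABLE sample_table (F1 STRING, F2 STRING);

-- INSERT ... VALUES is available in Hive 0.14 and later.
INSERT INTO TABLE sample_table VALUES
  ('001', '111'),
  ('001', '222'),
  ('001', '123'),
  ('002', '222'),
  ('002', '333'),
  ('003', '555');

-- One row per F1 value, with the distinct F2 values collected into an array.
SELECT F1, collect_set(F2) AS f2_values
FROM sample_table
GROUP BY F1;
```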

Solution 2:

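  • collect_set works as its name suggests: a set is, by definition, a collection of well-defined and distinct objects, so each object occurs exactly once or not at all. If a group contains the same F2 value more than once, that value therefore appears only once in the resulting array.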
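
If duplicate values within a group should be kept, which is closer to what a Pig bag holds, one option is the built-in aggregate collect_list, available in Hive 0.13.0 and later. The sketch below runs against the same sample_table; the alias f2_values is only illustrative.

```sql
-- collect_list keeps duplicates: one array element per input row in the group.
SELECT F1, collect_list(F2) AS f2_values
FROM sample_table
GROUP BY F1;
```

On the sample data both functions return the same groups, because no F2 value repeats within a single F1 group.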
