[Solved-2 Solutions] Self cross-join in Pig is disregarded?



What is CROSS?

  • Computes the cross product of two or more relations.

Syntax

alias = CROSS alias, alias [, alias …] [PARALLEL n];
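
As a minimal sketch of this syntax with two distinct relations (the file names 'data1' and 'data2', the aliases, and the schemas are assumptions made up for illustration):

```pig
-- Hypothetical inputs: 'data1' and 'data2' are placeholder file names.
A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
B = LOAD 'data2' AS (b1:int, b2:int);

-- Every tuple of A is paired with every tuple of B,
-- so X has (rows in A) * (rows in B) tuples of 5 fields each.
X = CROSS A, B;
DUMP X;
```

Because the result size is the product of the input sizes, CROSS can become expensive quickly on large relations.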

Problem:

  • Suppose we have data like this:
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
  • And then a cross-join is done on A, A:
B = CROSS A, A;
DUMP B;
(1,2,3)
(4,2,1)

Why is the second A optimized out of the query?

Info: Pig version 0.11
== UPDATE ==

If we sort A first, like this:

C = ORDER A BY a1;
D = CROSS A, C;

it gives the correct cross-join.

Solution 1:

  • You need to load the data twice to achieve what you want, i.e.:
A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = LOAD 'data' AS (a1:int,a2:int,a3:int);
B = CROSS A1, A2;

Solution 2:

  • We cannot CROSS (or JOIN) a relation with itself. To do this, we must create a copy of the data. In this case, we can use another LOAD statement. To do it with a relation further down a pipeline, we need to duplicate it using FOREACH.
  • We have several macros that we use frequently and IMPORT by default in all of our Pig scripts in case we need them. One is used for just this purpose:
DEFINE DUPLICATE(in) RETURNS out
{
        $out = FOREACH $in GENERATE *;
};

The following code then performs the cross-join:

A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = DUPLICATE(A1);
B = CROSS A1, A2;
  • Note that even though A1 and A2 are identical, we cannot assume that their records are in the same order. But if we are doing a CROSS or JOIN, this probably doesn't matter.
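
One way to sanity-check the result is to count the tuples in B. The sketch below inlines the FOREACH that the DUPLICATE macro performs and assumes the two-row sample 'data' file from the question. The aliases G and N are introduced here only for illustration. If the self cross-join is no longer collapsed, the count should be 4 (2 x 2) rather than 2.

```pig
-- Assumes 'data' holds the two sample rows (1,2,3) and (4,2,1).
A1 = LOAD 'data' AS (a1:int, a2:int, a3:int);

-- Same statement the DUPLICATE macro expands to.
A2 = FOREACH A1 GENERATE *;

B = CROSS A1, A2;

-- Collect all of B into a single group and count its tuples.
G = GROUP B ALL;
N = FOREACH G GENERATE COUNT(B);
DUMP N;
```

Counting avoids relying on the order of the dumped tuples, which Pig does not guarantee.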
