[Solved-1 Solution] Apache Pig - URL parsing into a map



What's a URL?

  • Uniform Resource Locators (URLs) provide a way to locate a resource using a specific scheme, most often but not limited to HTTP. Just think of a URL as an address to a resource, and the scheme as a specification of how to get there.

Parsing a URL

  • Java's URL class (java.net.URL) provides several methods that let you query URL objects. You can get the protocol, authority, host name, port number, path, query, filename, and reference (fragment) from a URL.
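As a rough illustration of those accessor methods, the sketch below collects each component of a URL into a map. The class name UrlPartsDemo, the method describe, and the example.com address are hypothetical and chosen only for this example. The URL is built through java.net.URI because the URL constructors are deprecated in recent JDKs.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

class UrlPartsDemo {
    // Collects the components exposed by java.net.URL into an insertion-ordered map.
    static Map<String, String> describe(String spec) {
        URL url;
        try {
            url = new URI(spec).toURL();
        } catch (URISyntaxException | MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("protocol", url.getProtocol());
        parts.put("authority", url.getAuthority());
        parts.put("host", url.getHost());
        parts.put("port", String.valueOf(url.getPort())); // -1 when no port is given
        parts.put("path", url.getPath());
        parts.put("query", url.getQuery());
        parts.put("file", url.getFile());                 // path plus query string
        parts.put("ref", url.getRef());                   // the part after '#'
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical URL, used only for illustration
        describe("http://example.com:8080/docs/index.html?user=3553&friend=2042#top")
            .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Note that getQuery() returns the query string as one piece ("user=3553&friend=2042"); splitting it into key/value pairs is a separate step.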

Problem:

How do you parse a URL into a map in Apache Pig?

Solution 1:

Use of FLATTEN

  • The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. FLATTEN un-nests tuples as well as bags. The idea is the same, but the operation and the result are different for each type of structure.
  • FLATTEN the result of STRSPLIT so that there is no useless level of nesting in tuples, and FLATTEN again inside the nested foreach
  • Also, STRSPLIT has an optional third argument to give the maximum number of output strings. Use that to guarantee a schema for its output.
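To see why the limit argument helps, here is a small plain-Java sketch. It assumes that STRSPLIT behaves like Java's String.split(regex, limit), which its description mirrors. The class SplitLimitDemo and the sample strings are made up for illustration.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Splits on the first '=' only, so the result never has more than two parts.
    static String[] keyValue(String pair) {
        return pair.split("=", 2);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(keyValue("user=3553"))); // [user, 3553]
        System.out.println(Arrays.toString(keyValue("q=a=b")));     // [q, a=b]
        System.out.println(Arrays.toString(keyValue("flag")));      // [flag]
    }
}
```

With a limit of 2, a value that itself contains '=' stays intact, and the output has at most two fields, which is what makes a fixed (key, val) schema reasonable. A pair with no '=' at all still yields a single part, so such input needs separate handling.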

The code below parses the URL parameters:

A = load 'test.log' as (f:chararray, url:chararray);
B = foreach A generate f, TOKENIZE(url,'&') as attr;
C = foreach B {
    D = foreach attr generate FLATTEN(STRSPLIT($0,'=',2)) AS (key:chararray, val:chararray);
    generate f, FLATTEN(D);
};
E = foreach (group C by (f, key)) generate group.f, TOMAP(group.key, C.val);
dump E;

Output

(test1,[user#{(3553)}])
(test1,[friend#{(2042)}])
(test1,[system#{(262)}])
(test2,[user#{(12523),(205)}])
(test2,[friend#{(26546),(3525),(353)}])
(test2,[browser#{(firfox)}])

  • After splitting out the tags and values, group by the tag as well to get a bag of values for each tag. Then put that bag into a map. Note that this assumes that if two lines have the same id (test2, here), their values have to be combined.
  • Unfortunately, there is apparently no way to combine maps without resorting to a UDF, but this should be just about the simplest of all possible UDFs.
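
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class COMBINE_MAPS extends EvalFunc<Map> {
    // Return type is the raw Map so that it matches EvalFunc<Map>
    @Override
    public Map exec(Tuple input) throws IOException {
        if (input == null || input.size() != 1) { return null; }

        // Input tuple is a singleton containing the bag of maps
        DataBag b = (DataBag) input.get(0);

        // Create map that we will construct and return
        Map<String, Object> m = new HashMap<String, Object>();

        // Iterate through the bag, adding the elements from each map
        Iterator<Tuple> iter = b.iterator();
        while (iter.hasNext()) {
            Tuple t = iter.next();
            // The first field of each tuple is expected to hold one map
            m.putAll((Map<String, Object>) t.get(0));
        }

        return m;
    }
}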

With a UDF like that (compiled into a jar and loaded with REGISTER), we can do:

F = foreach (group E by f) generate COMBINE_MAPS(E.$1);

For more robust URL parsing, add error-checking code to the UDF.
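One way such error checking might look is sketched below in plain Java, without the Pig classes, so that it can run on its own. The class SafeMapMerge and its combine method are hypothetical. In the UDF, the same checks would be applied to the value of t.get(0) for each tuple in the bag. Returning an empty map for null input is a choice made for this sketch; the UDF above returns null instead.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SafeMapMerge {
    // Merges every Map found in items, skipping entries that are null or not a Map.
    static Map<String, Object> combine(Iterable<?> items) {
        Map<String, Object> merged = new HashMap<>();
        if (items == null) {
            return merged;
        }
        for (Object item : items) {
            if (!(item instanceof Map)) {
                continue; // also skips null, since null fails instanceof
            }
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) item).entrySet()) {
                if (entry.getKey() != null) {
                    merged.put(entry.getKey().toString(), entry.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Sample values borrowed from the output above; the null and "bad" entries are malformed on purpose
        Map<String, Object> result = combine(Arrays.<Object>asList(
            Collections.singletonMap("user", "3553"),
            null,
            "bad",
            Collections.singletonMap("friend", "2042")));
        System.out.println(result.size()); // 2
    }
}
```

As with putAll in the UDF, a key that appears in more than one map keeps only the value that was merged last.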

