[Solved-1 Solution] Processing JSON through Pig Scripts



JSON:

  • Each Pig tuple is stored on one line (as one value for TextOutputFormat) so that it can be read back easily with TextInputFormat. Pig tuples are mapped to JSON objects.
  • Pig bags are mapped to JSON arrays. Pig maps are also mapped to JSON objects, and maps are assumed to be string to string. A schema is stored in a side file to handle the mapping between JSON and Pig types. The schema file shares the same format as the one used by PigStorage.

Problem:

  • Suppose you have recently started working with JSON files and processing data with Pig scripts. You came across PiggyBank and expected it to be useful for loading and processing a JSON file in a Pig script.
  • Here is a simple Pig script and the JSON file it reads:
REGISTER piggybank.jar;
a = LOAD 'file3.json' using org.apache.pig.piggybank.storage.JsonLoader() AS (json:map[]);
b = foreach a GENERATE flatten(json#'menu') AS menu;
c = foreach b generate flatten(menu#'popup') as popup;
d = foreach c generate flatten(popup#'menuitem') as menu;
e = foreach d generate flatten(menu#'value') as val;
DUMP e;

file3.json
{ "menu" : {
    "id" : "file",
    "value" : "File",
    "popup": {
      "menuitem" : [
        {"value" : "New", "onclick": "CreateNewDoc()"},
        {"value" : "Open", "onclick": "OpenDoc()"},
        {"value" : "Close", "onclick": "CloseDoc()"}
      ]
    }
 }}

The following exception is thrown at runtime:

org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error while reading input - Could not json-decode string: { "menu" : {
    at org.apache.pig.piggybank.storage.JsonLoader.parseStringToTuple(JsonLoader.java:127)

Pig log file:

Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias e

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias e
        at org.apache.pig.PigServer.openIterator(PigServer.java:901)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:655)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
        at org.apache.pig.Main.run(Main.java:561)
        at org.apache.pig.Main.main(Main.java:111)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
        at org.apache.pig.PigServer.openIterator(PigServer.java:893)
        ... 12 more
  

Please give the correct solution.

Solution 1:

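Nested JSON can be loaded with Twitter's Elephant Bird:

a = LOAD 'file3.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
  • This parses each JSON object into a Pig map; a nested JSON array is parsed into a DataBag of maps.
  • The loader reads its input one line at a time, so each JSON object should sit on a single line. The file3.json shown above spans several lines, which is consistent with the exception quoting only its first line.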
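
A minimal sketch of collapsing a pretty-printed file such as file3.json onto one line, the form a loader that reads line by line can decode. It is written in Python as a preprocessing step outside Pig and assumes the file holds exactly one JSON document. The names `to_single_line`, `convert_file`, and `file3_oneline.json` are hypothetical and do not come from the original question.

```python
import json


def to_single_line(text: str) -> str:
    """Parse one JSON document and re-serialize it without any newlines."""
    # json.dumps emits no line breaks unless an indent is requested;
    # compact separators only drop the optional spaces.
    return json.dumps(json.loads(text), separators=(",", ":"))


def convert_file(src_path: str, dst_path: str) -> None:
    """Write a one-line copy of the JSON document stored at src_path."""
    with open(src_path, encoding="utf-8") as src:
        compact = to_single_line(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        # One JSON object per line, terminated by a newline.
        dst.write(compact + "\n")
```

Calling `convert_file("file3.json", "file3_oneline.json")` and pointing the LOAD statement at the new file would be one way to apply it. A file that holds several JSON documents would need to be split first, which this sketch does not attempt.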
