Storing data to a SequenceFile from Apache Pig?

SequenceFile:

  • A SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format. It is also worth noting that, internally, the temporary outputs of maps are stored using SequenceFile.
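
To make "flat file of binary key/value pairs" a bit more concrete, here is a toy sketch in Python. It is only an illustration of the idea and is not byte-compatible with Hadoop's format: a real SequenceFile also carries a header (the key and value class names, compression settings) and periodic sync markers. The helper names `write_records` and `read_records`, the 4-byte length prefix, and the sample log records are all made up for this example.

```python
import io
import struct


def write_records(stream, records):
    """Append each (key, value) pair as two length-prefixed binary fields."""
    for key, value in records:
        for field in (key, value):
            data = field.encode("utf-8")
            # 4-byte big-endian length, followed by the raw bytes
            stream.write(struct.pack(">I", len(data)))
            stream.write(data)


def _read_field(stream, header):
    """Decode one field given its already-read 4-byte length header."""
    if len(header) != 4:
        raise ValueError("truncated length header")
    (length,) = struct.unpack(">I", header)
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated field")
    return data.decode("utf-8")


def read_records(stream):
    """Yield (key, value) pairs in the order they were written."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        key = _read_field(stream, header)
        value = _read_field(stream, stream.read(4))
        yield key, value


# Round trip through an in-memory "file" (hypothetical sample data)
buf = io.BytesIO()
write_records(buf, [("10.0.0.1", "GET /index.html"),
                    ("10.0.0.2", "GET /about.html")])
buf.seek(0)
pairs = list(read_records(buf))
```

The point of the sketch is that records are simply appended one after another with no index, which is what makes the file "flat" and cheap to stream through a mapper.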


Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:

REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
log = LOAD '/data/logs' USING SequenceFileLoader AS (...);

Is there also a library out there that would allow writing to Hadoop sequence files from Pig?

Solution 1:

  • This is possible now, although it will become a fair bit easier once Pig 0.7 comes out, as it includes a complete redesign of the Load/Store interfaces.
  • The "Hadoop expansion pack" that Twitter open-sourced on GitHub includes code for generating Load and Store funcs based on Google Protocol Buffers (building on Input/Output formats for the same; we already have those for sequence files, obviously).

