pig tutorial - apache pig tutorial - Apache Pig Storing Data - pig latin - apache pig - pig hadoop

What is data storing?

  • You can store the loaded data in the file system using the store operator.
  • A data store is a repository for persistently storing and managing collections of data which include not just repositories like databases, but also simpler store types such as simple files, emails etc.
  • Thus, any database or file is a series of bytes that, once stored, is called a data store.
  • STORE instruction:
    • Stores the relation into the local FS or HDFS (usually!)
    • Useful for debugging
    learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  - apache pig store operation

    Syntax of the Store statement

    STORE Relation_name INTO ' required_directory_path ' [USING function];


    • Assume we have a file student_data.txt in HDFS with the following content.
    001, Aadhira,Arushi  ,9848022337, Delhi
    002, Mahi,Champa,9848022338, Chennai
    003, Avantika,charu,9848022339, Pune
    004, Samaira,Hansa,9848022330, Kolkata
    005, Abhinav,Akaash,9848022336,Bhuwaneshwar
    006, Amarjeet,Aksat,9848022335, Hyderabad
    • And we have read it into a relation student using the LOAD operator as shown below.
    grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' 
       USING PigStorage(',')
       as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, 
       city:chararray );
    • Now, let us store the relation in the HDFS directory “/pig-Output/” as shown below.
    grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ ' USING PigStorage (',');


    • After executing the store statement, you will get the following output.
    • A directory is created with the specified name and the data will be stored in it.
    2015-10-05 13:05:05,429 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.
    MapReduceLau ncher - 100% complete
    2015-10-05 13:05:05,429 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - 
    Script Statistics:
    HadoopVersion    PigVersion    UserId    StartedAt             FinishedAt             Features 
    2.6.0            0.15.0        Hadoop    2015-10-0 13:03:03    2015-10-05 13:05:05    UNKNOWN  
    Job Stats (time in seconds): 
    JobId          Maps    Reduces    MaxMapTime    MinMapTime    AvgMapTime    MedianMapTime    
    job_14459_06    1        0           n/a           n/a           n/a           n/a
    MaxReduceTime    MinReduceTime    AvgReduceTime    MedianReducetime    Alias    Feature   
         0                 0                0                0             student  MAP_ONLY 
    OutPut folder
    Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt"  
    Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/pig_Output"  
    Total records written : 0
    Total bytes written : 0
    Spillable Memory Manager spill count : 0 
    Total bags proactively spilled: 0
    Total records proactively spilled: 0
    Job DAG: job_1443519499159_0006
    2015-10-05 13:06:06,192 [main] INFO  org.apache.pig.backend.hadoop.executionengine
    .mapReduceLayer.MapReduceLau ncher - Success!


    • Verify the stored data as shown below.

    Step 1:

    • First of all, list out the files in the directory named pig_output using the ls command as shown below.
    hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
    Found 2 items
    rw-r--r-   1 Hadoop supergroup          0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS
    rw-r--r-   1 Hadoop supergroup        224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000
    • You can observe that two files were created after executing the store statement.

    Step 2:

    • Using cat command, list the contents of the file named part-m-00000 as shown below.
    $ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000' 
    001, Aadhira,Arushi  ,9848022337, Delhi
    002, Mahi,Champa,9848022338, Chennai
    003, Avantika,charu,9848022339, Pune
    004, Samaira,Hansa,9848022330, Kolkata
    005, Abhinav,Akaash,9848022336,Bhuwaneshwar
    006, Amarjeet,Aksat,9848022335, Hyderabad

    Using PigStorage:

      • Delimiter
      • Limitations
    • PigStorage is a built-in function of Pig, and one of the most common functions used to load and store data in pigscripts.
    • PigStorage can be used to parse text data with an arbitrary delimiter, or to output data in an delimited format.


    • If no argument is provided, PigStorage will assume tab-delimited format.
    • If a delimiter argument is provided, it must be a single-byte character; any literal (eg: 'a', '|'), known escape character (eg: '\t', '\r') is a valid delimiter.


    data = LOAD 's3n://input-bucket/input-folder' USING PigStorage(' ')
                AS (field0:chararray, field1:int);
    • The schema must be provided in the AS clause.
    • To store data using PigStorage, the same delimiter rules apply:
    STORE data INTO 's3n://output-bucket/output-folder' USING PigStorage('\t');


    • PigStorage is an extremely simple loader that does not handle special cases such as embedded delimiters or escaped control characters; it will split on every instance of the delimiter regardless of context.
    • For this reason, when loading a CSV file it is recommended to use CSVExcelStorage <http://help.mortardata.com/integrations/amazon_s3/csv> rather than PigStorage with a comma delimiter.

    Related Searches to Apache Pig Storing Data

    Adblocker detected! Please consider reading this notice.

    We've detected that you are using AdBlock Plus or some other adblocking software which is preventing the page from fully loading.

    We don't have any banner, Flash, animation, obnoxious sound, or popup ad. We do not implement these annoying types of ads!

    We need money to operate the site, and almost all of it comes from our online advertising.

    Please add wikitechy.com to your ad blocking whitelist or disable your adblocking software.