Apache Pig - Storing Data
What is data storing?
- You can store loaded data in the file system using the STORE operator.
- A data store is a repository for persistently storing and managing collections of data. The term covers not only databases but also simpler store types such as plain files and emails.
- Thus, any database or file is a series of bytes that, once stored, is called a data store.
- The STORE operator writes a relation to the local file system or, more usually, to HDFS.
- Because the stored output can be inspected afterwards, it is also useful for debugging.
Syntax of the Store statement
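The general form of the statement can be sketched as follows. The names `alias`, `directory`, and `function` are placeholders: `alias` is the relation to store, `directory` is the output location, and `function` is an optional store function. If the USING clause is omitted, Pig falls back to PigStorage.

```pig
-- General form of STORE; square brackets mark the optional USING clause.
STORE alias INTO 'directory' [USING function];
```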
- Assume a file named student_data.txt exists in HDFS.
- Assume also that it has been read into a relation named student using the LOAD operator.
- Now store the relation in the HDFS directory '/pig-Output/' using the STORE operator.
- Executing the STORE statement runs a job, and Pig logs its progress and result.
- A directory is created with the specified name, and the data is stored in it.
- Next, verify the stored data.
- First, list the files in the '/pig-Output/' directory using the ls command.
- The listing shows that two files were created by the STORE statement.
- Then use the cat command to print the contents of the file named part-m-00000.
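The walkthrough above might look like the following in the Grunt shell. This is a minimal sketch, not the original listing: the input path `/pig_data/`, the comma delimiter, and the field names are assumptions chosen for illustration, and the part file name can differ depending on how the job runs.

```pig
-- Load the file into a relation named student.
-- The input path and the schema are hypothetical placeholders.
student = LOAD '/pig_data/student_data.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, city:chararray);

-- Write the relation to the output directory as comma-delimited text.
-- The output directory must not exist yet; otherwise the job fails.
STORE student INTO '/pig-Output/' USING PigStorage(',');

-- Verify the result with Grunt's fs command.
fs -ls /pig-Output/
fs -cat /pig-Output/part-m-00000
```

The `fs` command passes its arguments to the Hadoop file system shell, so the verification steps can be run without leaving Grunt.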
- PigStorage is a built-in Pig function and one of the most common functions used to load and store data in Pig scripts.
- PigStorage can be used to parse text data with an arbitrary delimiter, or to output data in a delimited format.
- If no argument is provided, PigStorage will assume tab-delimited format.
- If a delimiter argument is provided, it must be a single-byte character; any literal (e.g. 'a', '|') or known escape character (e.g. '\t', '\r') is a valid delimiter.
- When loading, the schema is provided in the AS clause of the LOAD statement.
- To store data using PigStorage, the same delimiter rules apply.
- PigStorage is an extremely simple loader that does not handle special cases such as embedded delimiters or escaped control characters; it will split on every instance of the delimiter regardless of context.
- For this reason, when loading a CSV file it is recommended to use CSVExcelStorage <http://help.mortardata.com/integrations/amazon_s3/csv> rather than PigStorage with a comma delimiter.
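The delimiter rules above can be sketched as follows. File names, relation names, and schemas are hypothetical, and the location of `piggybank.jar` depends on the installation; `CSVExcelStorage` is assumed here to come from Piggybank rather than from core Pig.

```pig
-- No argument: PigStorage splits each line on tabs.
a = LOAD 'input.tsv' USING PigStorage() AS (name:chararray, age:int);

-- Explicit single-character delimiter.
b = LOAD 'input.psv' USING PigStorage('|') AS (name:chararray, age:int);

-- Storing follows the same rule: fields are joined with the given delimiter.
STORE b INTO 'output_csv' USING PigStorage(',');

-- For CSV files with quoted fields or embedded commas, use CSVExcelStorage.
-- Adjust the jar path to match the local installation.
REGISTER piggybank.jar;
c = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage()
    AS (name:chararray, comment:chararray);
```

With `PigStorage(',')`, a quoted field that contains a comma is split into two fields; `CSVExcelStorage` is designed to keep it as one.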