Apache Pig - Reading Data
What is reading data in Pig?
- Apache Pig works on top of Hadoop.
- It is an analytical tool for large datasets stored in the Hadoop Distributed File System (HDFS).
- To analyze data using Apache Pig, we first have to load the data into Apache Pig.
- In MapReduce mode, Pig reads (loads) data from HDFS and stores the results back in HDFS.
- Therefore, start HDFS and create the following sample data in HDFS.
|Student ID||First Name||Last Name||Phone||City|
The dataset above contains the personal details of six students: ID, first name, last name, phone number, and city.
Step 1: Verifying Hadoop
- First of all, verify the installation using the Hadoop version command:
$ hadoop version
- If Hadoop is installed and the PATH variable is set, the command prints the installed Hadoop version.
Step 2: Starting HDFS
- Go to the sbin directory of Hadoop and start HDFS (the Hadoop distributed file system) and YARN using the start-dfs.sh and start-yarn.sh scripts.
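As a rough sketch, the two start scripts can be run as follows. It assumes that the HADOOP_HOME environment variable points at the Hadoop installation directory, which this tutorial does not state; daemons_started is only an illustrative variable name.

```shell
# Assumption: HADOOP_HOME points at the root of the Hadoop installation.
if [ -n "${HADOOP_HOME:-}" ] && [ -x "$HADOOP_HOME/sbin/start-dfs.sh" ]; then
  "$HADOOP_HOME/sbin/start-dfs.sh"    # start the HDFS daemons
  "$HADOOP_HOME/sbin/start-yarn.sh"   # start the YARN daemons
  daemons_started=yes
else
  echo "start-dfs.sh not found; check that HADOOP_HOME is set" >&2
  daemons_started=no
fi
```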
Step 3: Create a Directory in HDFS
- In HDFS, you can create directories using the mkdir command. Create a new directory named pig_data in the required path. (HDFS paths are case-sensitive, so use the same spelling in every later command.)
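A minimal sketch of this step, assuming the directory is created at the HDFS root under the lowercase path /pig_data that the description of the load statement refers to later (HDFS paths are case-sensitive). HDFS_DIR is an illustrative variable name.

```shell
# Assumption: the directory lives at the HDFS root under this name.
HDFS_DIR="/pig_data"

if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p "$HDFS_DIR"   # -p: no error if the directory already exists
  hdfs dfs -ls /                   # the new directory should appear in the listing
else
  echo "hdfs not found on PATH; would run: hdfs dfs -mkdir -p $HDFS_DIR"
fi
```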
Step 4: Placing the data in HDFS
- The input file for Pig contains one tuple (record) per line.
- The fields of each record are separated by a delimiter (in our example, a comma).
- In the local file system, create an input file named student_data.txt containing the sample data.
- Now copy the file from the local file system to HDFS using the put command. (You can use the copyFromLocal command as well.)
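The two steps above could look like the following sketch. The three rows are hypothetical placeholders in the five-column layout of the sample table, not the tutorial's actual records, and /pig_data is assumed to be the target directory in HDFS.

```shell
# Create the input file locally: one record per line, fields separated by ",".
# These rows are hypothetical placeholders, not the tutorial's sample data.
cat > student_data.txt <<'EOF'
001,First1,Last1,5550000001,City1
002,First2,Last2,5550000002,City2
003,First3,Last3,5550000003,City3
EOF

# Assumption: the target directory in HDFS is /pig_data.
HDFS_DIR="/pig_data"

if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -put student_data.txt "$HDFS_DIR/"   # copy the local file into HDFS
else
  echo "hdfs not found on PATH; skipping upload to $HDFS_DIR"
fi
```

The copyFromLocal command takes the same source and destination arguments as put.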
Verifying the file:
- Use the cat command to verify that the file has been copied into HDFS.
- The command prints the content of the file to the terminal.
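For instance, assuming the file was uploaded to /pig_data/student_data.txt, the check might be run like this (HDFS_FILE is an illustrative variable name):

```shell
# Assumption: this is where the file was placed in HDFS.
HDFS_FILE="/pig_data/student_data.txt"

if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -ls "$(dirname "$HDFS_FILE")"   # the file should appear in the listing
  hdfs dfs -cat "$HDFS_FILE"               # print the file's contents
else
  echo "hdfs not found on PATH; would run: hdfs dfs -cat $HDFS_FILE"
fi
```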
The Load Operator
- You can load data into Apache Pig from the file system (HDFS or local) using the LOAD operator of Pig Latin.
- The load statement consists of two parts divided by the “=” operator.
- On the left-hand side, we mention the name of the relation in which we want to store the data, and on the right-hand side, we define how the data is loaded.
- Given below is the syntax of the Load operator.
Relation_name = LOAD 'Input file path' USING function as schema;
- Relation_name − The name of the relation in which we want to store the data.
- Input file path − The path of the file to read. In MapReduce mode, this is a path in HDFS.
- function − We have to choose a function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
- Schema − We have to define the schema of the data. We can define the required schema as follows −
(column1 : data type, column2 : data type, column3 : data type);
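The types in a schema are optional, and so is the schema itself. The sketch below is a shell script that writes a small Pig Latin script loading the same file with field names only and with no schema at all, then runs it if Pig is installed. The relation names, field names, script name, and the /pig_data/student_data.txt path are assumptions for illustration.

```shell
# The quoted 'EOF' stops the shell from expanding $0 and $4 inside the script.
cat > schema_forms.pig <<'EOF'
-- Names only: fields can be addressed by name; each defaults to bytearray.
named = LOAD '/pig_data/student_data.txt' USING PigStorage(',')
        AS (id, firstname, lastname, phone, city);
DESCRIBE named;

-- No schema: fields are addressed by position, starting at $0.
raw     = LOAD '/pig_data/student_data.txt' USING PigStorage(',');
id_city = FOREACH raw GENERATE $0, $4;
DUMP id_city;
EOF

if command -v pig >/dev/null 2>&1; then
  pig -x mapreduce schema_forms.pig
else
  echo "pig not found on PATH; script written to schema_forms.pig"
fi
```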
- We can also load the data without specifying the schema. In that case, the columns are addressed by position as $0, $1, $2, and so on.
- As an example, let us load the data in student_data.txt into Pig under a relation named student using the LOAD command.
Start the Pig Grunt Shell
- First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.
$ pig -x mapreduce
- This starts the Pig Grunt shell and displays the grunt> prompt.
Execute the Load Statement
- Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell.
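A load statement matching the description in this section might look like the sketch below. The field names and types are assumptions derived from the columns of the sample table, and the script name is illustrative. The statement is written to a script file here so that it can be run non-interactively; the same LOAD statement can also be typed at the grunt> prompt.

```shell
# Write the load statement to a Pig Latin script.
cat > load_student.pig <<'EOF'
-- Field names and types are assumptions based on the sample table's columns.
student = LOAD '/pig_data/student_data.txt'
          USING PigStorage(',')
          AS (id:int, firstname:chararray, lastname:chararray,
              phone:chararray, city:chararray);

-- Print the schema of the relation to confirm that the statement was accepted.
DESCRIBE student;
EOF

if command -v pig >/dev/null 2>&1; then
  pig -x mapreduce load_student.pig
else
  echo "pig not found on PATH; script written to load_student.pig"
fi
```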
- Following is the description of the above statement.
|Relation name||We have stored the data in the relation student.|
|Input file path||We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.|
|Storage function||We have used the PigStorage() function, which loads and stores data as structured text files. It takes as a parameter the delimiter that separates the fields of a tuple. By default, the delimiter is the tab character (‘\t’).|
We have stored the data using the following schema.