Apache Pig - Reading Data



What is Reading Data in Pig?

  • Apache Pig works on top of Hadoop.
  • It is an analytical tool for analyzing large datasets stored in the Hadoop File System (HDFS).
  • To analyze data using Apache Pig, we first have to load the data into Apache Pig.

Preparing HDFS:

  • In MapReduce mode, Pig reads (loads) data from HDFS and stores the results back in HDFS.
  • Therefore, start HDFS and create the following sample data in HDFS.
Student ID   First Name   Last Name   Phone        City
001          Aadhira      Arushi      9848022337   Delhi
002          Mahi         Champa      9848022338   Chennai
003          Avantika     charu       9848022339   Pune
004          Samaira      Hansa       9848022330   Kolkata
005          Abhinav      Akaash      9848022336   Bhuwaneshwar
006          Amarjeet     Aksat       9848022335   Hyderabad

The above dataset contains the personal details of six students: id, first name, last name, phone number, and city.

Step 1: Verifying Hadoop

  • First of all, verify the Hadoop installation using the version command:

$ hadoop version

  • If Hadoop is installed on your system and the PATH variable is set, you will get output similar to the following −
Hadoop 2.6.0 
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1 
Compiled by jenkins on 2014-11-13T21:10Z 
Compiled with protoc 2.5.0 
From source with checksum 18e43357c8f927c0695f1e9522859d6a 
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar

Step 2: Starting HDFS

  • Browse to the sbin directory of Hadoop and start YARN and the Hadoop DFS (distributed file system) as shown below.
$ cd $HADOOP_HOME/sbin/ 
$ start-dfs.sh 
localhost: starting namenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out 
localhost: starting datanode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out 
Starting secondary namenodes [0.0.0.0] 
starting secondarynamenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out
 
$ start-yarn.sh 
starting yarn daemons 
starting resourcemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out 
localhost: starting nodemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out

Step 3: Create a Directory in HDFS

  • In Hadoop DFS, you can create directories using the mkdir command. Create a new directory in HDFS with the name pig_data in the required path as shown below.
$ cd $HADOOP_HOME/bin/ 
$ hdfs dfs -mkdir hdfs://localhost:9000/pig_data

Step 4: Placing the data in HDFS

  • In a Pig input file, each tuple/record occupies an individual line.
  • The entities of a record are separated by a delimiter (in our example, we used “,”).
  • In the local file system, create an input file student_data.txt containing data as shown below.
001,Aadhira,Arushi,9848022337,Delhi
002,Mahi,Champa,9848022338,Chennai
003,Avantika,charu,9848022339,Pune
004,Samaira,Hansa,9848022330,Kolkata
005,Abhinav,Akaash,9848022336,Bhuwaneshwar
006,Amarjeet,Aksat,9848022335,Hyderabad
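If you prefer to create the sample file programmatically rather than by hand, a short Python script like the following will do it. (This is just a convenience sketch; the filename student_data.txt and the rows come from the dataset above.)

```python
# Generate the sample input file used in this tutorial.
# Rows and filename (student_data.txt) are taken from the dataset above.
rows = [
    ("001", "Aadhira", "Arushi", "9848022337", "Delhi"),
    ("002", "Mahi", "Champa", "9848022338", "Chennai"),
    ("003", "Avantika", "charu", "9848022339", "Pune"),
    ("004", "Samaira", "Hansa", "9848022330", "Kolkata"),
    ("005", "Abhinav", "Akaash", "9848022336", "Bhuwaneshwar"),
    ("006", "Amarjeet", "Aksat", "9848022335", "Hyderabad"),
]

with open("student_data.txt", "w") as f:
    for row in rows:
        # One record per line, fields comma-separated.
        f.write(",".join(row) + "\n")
```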
  • Now, move the file from the local file system to HDFS using the put command as shown below. (You can use the copyFromLocal command as well.)
$ cd $HADOOP_HOME/bin 
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt hdfs://localhost:9000/pig_data/

Verifying the file:

  • Use the cat command to verify whether the file has been moved to HDFS, as shown below.
$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/student_data.txt

Output:

  • See the content of the file as shown below.
15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
  
001,Aadhira,Arushi,9848022337,Delhi
002,Mahi,Champa,9848022338,Chennai
003,Avantika,charu,9848022339,Pune
004,Samaira,Hansa,9848022330,Kolkata
005,Abhinav,Akaash,9848022336,Bhuwaneshwar
006,Amarjeet,Aksat,9848022335,Hyderabad

The Load Operator

  • You can load data into Apache Pig from the file system (HDFS/local) using the LOAD operator of Pig Latin.

Syntax:

  • The load statement consists of two parts divided by the “=” operator.
  • On the left-hand side, we mention the name of the relation where we want to store the data, and on the right-hand side, we define how the data is loaded.
  • Given below is the syntax of the Load operator.

Relation_name = LOAD 'Input file path' USING function as schema;

  • Where,
  • relation_name − We have to mention the relation in which we want to store the data.
  • Input file path − We have to mention the HDFS directory where the file is stored. (In MapReduce mode)
  • function − We have to choose a function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
  • Schema − We have to define the schema of the data. We can define the required schema as follows −

(column1 : data type, column2 : data type, column3 : data type);
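To make the schema concrete, here is a rough Python sketch of what a load function such as PigStorage does conceptually: split each record on the delimiter, then apply the declared column types. This is not Pig's implementation, only an illustration; the schema below mirrors the student dataset used in this tutorial.

```python
# Toy illustration of LOAD ... USING PigStorage(',') AS (schema).
# NOT Pig's implementation -- just the idea: split on the delimiter,
# then cast each field according to the declared schema.

schema = [("id", int), ("firstname", str), ("lastname", str),
          ("phone", str), ("city", str)]

def load_record(line, delimiter=","):
    fields = line.strip().split(delimiter)
    # Pair each field with its (name, type) entry and cast it.
    return tuple(cast(value) for (_, cast), value in zip(schema, fields))

record = load_record("001,Aadhira,Arushi,9848022337,Delhi")
print(record)  # (1, 'Aadhira', 'Arushi', '9848022337', 'Delhi')
```

Note that id comes back as an int because the schema declares it so, while phone stays a string (chararray in Pig terms).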

Note

  • You can also load the data without specifying a schema. In that case, the columns are addressed positionally as $0, $1, $2, and so on.
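Positional references work like zero-based indexing into each record. A rough Python analogy (for illustration only, not Pig's code):

```python
# Without a schema, every field is addressed by position,
# analogous to Pig's $0, $1, $2 references (illustrative only).
line = "001,Aadhira,Arushi,9848022337,Delhi"
fields = line.split(",")

print(fields[0])  # corresponds to $0 -> '001'
print(fields[4])  # corresponds to $4 -> 'Delhi'
```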

Example

  • As an example, let us load the data in student_data.txt into Pig under a relation named student using the LOAD command.
  • Start the Pig Grunt Shell
  • First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.

$ pig -x mapreduce

  • It will start the Pig Grunt shell as shown below.
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2015-10-01 12:33:38,080 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-01 12:33:38,080 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/Hadoop/.pigbootup not found
  
2015-10-01 12:33:39,630 [main]
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
 
grunt>

Execute the Load Statement

  • Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, 
   city:chararray );
  • Following is the description of the above statement.

Relation name    We have stored the data in the relation student.
Input file path  We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
Storage function We have used the PigStorage() function. It loads and stores data as structured text files. It takes as a parameter the delimiter that separates the entities of a tuple; by default, the delimiter is ‘\t’.
Schema           We have stored the data using the following schema:

column    id    firstname   lastname    phone       city
datatype  int   chararray   chararray   chararray   chararray
