What is Apache Pig?




What is Big Data?

  • Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  • Systems and enterprises generate huge amounts of data, from terabytes to petabytes of information.
  • It is very difficult to manage such huge volumes of data.

    Hadoop and its Characteristics

  • Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
  • It is an open-source data management technology with scale-out storage and distributed processing.

    Hadoop Ecosystem


    Need for Pig


    Where to use Pig?

  • Pig is a data flow language, so it is most suitable for:
    • Quickly changing data processing requirements
    • Processing data from multiple channels
    • Quick hypothesis testing
    • Time sensitive data refreshes
    • Data profiling using sampling
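The last use case above can be sketched in a few lines of Pig Latin. This is a minimal, illustrative sketch only; the file name and field names (visits.log, user, url, time) are assumptions, not from the original:

```pig
-- Minimal sketch: profile roughly 1% of a large log file (assumed file/fields).
raw     = LOAD 'visits.log' USING PigStorage('\t')
          AS (user:chararray, url:chararray, time:long);
sampled = SAMPLE raw 0.01;                 -- keep roughly 1% of the tuples
by_user = GROUP sampled BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(sampled) AS visits;
DUMP counts;
```

Because only a sample is processed, a script like this gives a quick, cheap picture of the data's shape before running a full job.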

    What is Pig?

  • It is an open-source data flow language.
  • Pig Latin is used to express queries and data manipulation operations in simple scripts.
  • Pig converts the scripts into a sequence of underlying MapReduce jobs.
  • What does it mean to be Pig?

  • Pigs Eat Everything
    • Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. It can also easily be extended to operate on data beyond files, including key/value stores, databases, etc.
  • Pigs Live Everywhere
    • Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework (see, for example, Pig on Tez).
  • Pigs Are Domestic Animals
    • Pig is designed to be easily controlled and modified by its users.
    • Pig allows integration of user code wherever possible, so it currently supports user-defined field transformation functions, user-defined aggregates, and user-defined conditionals.
    • Pig supports user provided load and store functions.
    • It supports external executables via its STREAM command and MapReduce jars via its MAPREDUCE command.
    • It allows users to provide a custom partitioner for their jobs in some circumstances and to set the level of reduce parallelism for their jobs.
  • Pigs fly
    • Pig processes data quickly. Its designers aim to consistently improve its performance and not to implement features in ways that weigh Pig down so that it can't fly.

    Apache Pig - Platforms

  • A platform for easier analysis of large data sets
    • Pig Latin: a simple but powerful data flow language, similar to scripting languages
    • Pig Latin is a high-level, easy-to-understand data flow programming language
    • Provides common data operations (e.g. filters, joins, ordering) and nested types (e.g. tuples, bags, maps)
    • It is more natural for analysts than MapReduce
    • Opens Hadoop to non-Java programmers
    • Pig Engine: parses, optimizes, and automatically executes Pig Latin scripts as a series of MapReduce jobs on a Hadoop cluster
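The common operations listed above (filter, join, ordering) can be combined in a few lines. This is an illustrative sketch; the file names users.txt and visits.txt and their schemas are assumptions, not from the original:

```pig
-- Illustrative only: file names and schemas are assumed for this sketch.
users  = LOAD 'users.txt'  USING PigStorage('\t') AS (name:chararray, age:int);
visits = LOAD 'visits.txt' USING PigStorage('\t') AS (name:chararray, url:chararray);
adults = FILTER users BY age >= 18;            -- filter
joined = JOIN adults BY name, visits BY name;  -- join
sorted = ORDER joined BY adults::age DESC;     -- ordering
DUMP sorted;
```

The same pipeline written directly against the MapReduce API would require substantially more Java code, which is the point of the comparison above.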

    Where does Pig live?

  • Pig is installed on the user's machine
  • No need to install anything on the Hadoop cluster
    • Pig and Hadoop versions must be compatible
  • Pig submits jobs to the Hadoop cluster and executes them there

    How does Pig work?


    Apache Pig - Data Model

  • Tuple
    • A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
    • Example: (wikitechy, 30)
  • Bag
    • A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema).
    • A bag is represented by '{}'. It is similar to a table in an RDBMS but, unlike an RDBMS table, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
    • Example: {(Raja, 30), (Mohammad, 45)}
    • A bag can be a field in a relation; in that context, it is known as inner bag.
    • Example: (wikitechy, 30, {984xxxxx338, [email protected]})
  • Relation
    • A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
  • Map
    • A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value may be of any type. It is represented by '[]'.
    • Example: [name#wikitechy, age#30]
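The types above can all appear together in a LOAD schema. The following is an illustrative sketch; the file name students.txt and its fields are assumptions for this example, not from the original:

```pig
-- Illustrative sketch: a LOAD schema combining the types above.
-- 'students.txt' and its fields are assumed for this example.
A = LOAD 'students.txt'
    AS (name:chararray,                                     -- simple field
        age:int,
        contacts:bag{t:(phone:chararray, email:chararray)}, -- inner bag of tuples
        details:map[]);                                     -- map of key-value pairs
DESCRIBE A;
```

DESCRIBE then prints the declared schema, which is a quick way to confirm how Pig has interpreted the nested types.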

    Internalizing Pig

  • Let’s find out people who “overall” visit “highly ranked” pages

    Pig in Real Time

  • Since Pig is a data flow language, it is a natural fit for:
    • Data factory operations
    • Typically, data is brought from multiple servers into HDFS
    • Pig is used for cleaning and preprocessing the data
    • It helps data analysts and researchers quickly prototype their theories
    • Since Pig is extensible, it becomes much easier for data analysts to run their scripting-language programs (e.g. Ruby or Python) effectively against large data sets

    Ways to Handle Pig

  • Grunt Mode:
    • The interactive mode of Pig
    • Very useful for syntax checking and ad hoc data exploration
  • Script Mode:
    • Runs a set of instructions from a file
    • Similar to a SQL script file
  • Embedded Mode:
    • Executes Pig programs from within a Java program
    • Suitable for creating Pig scripts on the fly

    Modes of Pig

  • All of the different Pig invocations can run in the following modes:
  • Local
    • In this mode, the entire Pig job runs as a single JVM process
    • Reads and stores data using local file system paths
    /* local mode */
    pig -x local …
    java -cp pig.jar org.apache.pig.Main -x local …
  • MapReduce
    • In this mode, the Pig job runs as a series of MapReduce jobs
    • Input and output paths are assumed to be HDFS paths
    /* mapreduce mode */
    pig … or pig -x mapreduce …
    java -cp pig.jar org.apache.pig.Main ...
    java -cp pig.jar org.apache.pig.Main -x mapreduce ...

    Pig Components


    Working with Data in pig


    Pig Programs Execution

  • Pig is essentially a wrapper on top of the MapReduce layer
  • It parses, optimizes, and converts a Pig script into a series of MapReduce jobs

    Apache Pig Sample Script

  • LOAD
    • Loads data from the file system.
    • LOAD 'data' [USING function] [AS schema];
    • If you specify a directory name, all the files in the directory are loaded.
    • A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
  • STORE
    • Stores or saves results to the file system.
    • STORE alias INTO 'directory' [USING function];
    • A = LOAD 't.txt' USING PigStorage('\t');
    • STORE A INTO 'myoutput' USING PigStorage('*');
  • LIMIT
    • Limits the number of output tuples.
    • alias = LIMIT alias n;
    • A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
    • B = LIMIT A 5;
  • FILTER
    • Selects tuples from a relation based on some condition..
    • alias = FILTER alias BY expression;
    • A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
    • B = FILTER A BY f2 > 2;
     A = LOAD '/user/mapr/training/pig/emp.csv' USING PigStorage(',') AS (id, firstname, lastname, designation, city);

     DUMP A;

     STORE A INTO '/user/mapr/training/pig/output';

    Apache Pig Example Scripts

     X = LOAD '/user/mapr/training/pig/emp_pig1.csv' USING PigStorage(',') AS
    (id, firstname, lastname, designation, city);
    Y = LOAD '/user/mapr/training/pig/emp_pig2.csv' USING PigStorage(',') AS
    (id, firstname, lastname, designation, city);
    Z = JOIN X BY (designation), Y BY (designation);
    final = FILTER Z BY X::designation MATCHES 'Manager';
    A = GROUP X BY city;
    B = FOREACH X GENERATE id, designation;
    STORE final INTO '/user/mapr/training/pig/output';

    Apache Pig Advanced Scripts

    • Get distinct elements in Pig
    • Process data in parallel in Pig
    • Sample data in Pig
    • Order elements in Pig
  • DISTINCT
    • Removes duplicate tuples in a relation.
    • alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];
    • A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
    • B = DISTINCT A;
  • DUMP
    • Dumps or displays results to screen.
    • DUMP alias;
    • A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
    • DUMP A;
  • ORDER BY
    • Sorts a relation based on one or more fields.
    • alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];
    • A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
    • B = ORDER A BY f2;
    • DUMP B;
  • UNION
    • Computes the union of two or more relations.
    • alias = UNION [ONSCHEMA] alias, alias [, alias …];
    • L1 = LOAD 'f1' AS (a : int, b : float);
    • L2 = LOAD 'f1' AS (a : long, c : chararray);
    • U = UNION ONSCHEMA L1, L2;
    • DESCRIBE U ;
    • U : {a : long, b : float, c : chararray}
  • Join(Inner)
    • Performs an inner join of two or more relations based on common field values.
    • alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING 'replicated' | 'skewed' | 'merge' | 'merge-sparse'] [PARTITION BY partitioner] [PARALLEL n];
    • A = LOAD 'mydata';
    • B = LOAD 'mydata';
    • C = JOIN A BY $0, B BY $0;
    • DUMP C;
  • Join(Outer)
    • Performs an outer join of two relations based on common field values.
    • alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column [USING 'replicated' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
    • A = LOAD 'a.txt' AS (n:chararray, a:int);
    • B = LOAD 'b.txt' AS (n:chararray, m:chararray);
    • C = JOIN A by $0 LEFT OUTER, B BY $0;
    • DUMP C;

    Apache Pig User Defined Functions

  • FOREACH
    • Generates data transformations based on columns of data.
    • alias = FOREACH { block | nested_block };
    • X = FOREACH A GENERATE f1;
    • X = FOREACH B { S = FILTER A BY f1 == 'xyz'; GENERATE COUNT(S); }
  • CROSS
    • Computes the cross product of two or more relations.
    • alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
    • A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
    • B = LOAD 'data2' AS (b1:int,b2:int);
    • X = CROSS A, B;
  • (CO)GROUP
    • Groups the data in one or more relations.
    • The GROUP and COGROUP operators are identical in function; by convention, GROUP is used when grouping a single relation, while COGROUP is used with multiple relations.
    • alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
    • A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
    • B = GROUP A BY age;
    • DUMP B;

    Apache Pig Storage


    Pig Latin vs HiveQL


    Pig’s Debugging Operators

  • Use the DUMP operator to display results to your terminal screen.
  • Use the DESCRIBE operator to review the schema of a relation.
  • Use the EXPLAIN operator to view the logical, physical, or MapReduce execution plans used to compute a relation.
  • Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.
  • Shortcuts for Debugging Operators

  • \d alias - shortcut for DUMP. If alias is omitted, the last defined alias is used.
  • \de alias - shortcut for DESCRIBE. If alias is omitted, the last defined alias is used.
  • \e alias - shortcut for EXPLAIN. If alias is omitted, the last defined alias is used.
  • \i alias - shortcut for ILLUSTRATE. If alias is omitted, the last defined alias is used.
  • \q - quits the Grunt shell
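Putting the debugging operators together, a quick session in the Grunt shell might look like the following sketch, which reuses the 't.txt' schema from the earlier examples:

```pig
-- Sketch reusing the t.txt schema from the earlier examples.
A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
B = FILTER A BY f2 > 2;
DESCRIBE B;     -- review B's schema
EXPLAIN B;      -- view the logical/physical/MapReduce plans
ILLUSTRATE B;   -- step-by-step execution on sample data
DUMP B;         -- print the resulting tuples
```

DESCRIBE, EXPLAIN, and ILLUSTRATE are cheap because they do not run the full job; DUMP triggers actual execution.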
  • Pig Advanced Operations

  • ASSERT
    • Asserts a condition on the data.
    • ASSERT alias BY expression [, message];
    • A = LOAD 'data' AS (a0:int,a1:int,a2:int);
    • ASSERT A by a0 > 0, 'a0 should be greater than 0';
  • CUBE
    • Performs cube/rollup operations.
    • alias = CUBE alias BY { CUBE expression | ROLLUP expression }, [ CUBE expression | ROLLUP expression ] [PARALLEL n];
    • cubedinp = CUBE salesinp BY CUBE(product,year);
    • rolledup = CUBE salesinp BY ROLLUP(region,state,city);
    • cubed_and_rolled = CUBE salesinp BY CUBE(product,year), ROLLUP(region, state, city);
  • SAMPLE
    • Selects a random sample of data based on the specified sample size.
    • SAMPLE alias size;
    • A = LOAD 'data' AS (f1:int,f2:int,f3:int);
    • X = SAMPLE A 0.01;
  • RANK
    • Returns each tuple with the rank within a relation.
    • alias = RANK alias [ BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [DENSE] ];
    • B = rank A;
    • C = rank A by f1 DESC, f2 ASC;
    • C = rank A by f1 DESC, f2 ASC DENSE;
  • MAPREDUCE
    • Executes native MapReduce jobs inside a Pig script.
    • alias1 = MAPREDUCE 'mr.jar' STORE alias2 INTO 'inputLocation' USING storeFunc LOAD 'outputLocation' USING loadFunc AS schema [`params, ... `];
    • A = LOAD 'WordcountInput.txt';
    • B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir' AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;
  • IMPORT
    • Imports macros defined in a separate file.
    • IMPORT 'file-with-macro';
  • STREAM
    • Sends data to an external script or program.
    • alias = STREAM alias [, alias …] THROUGH {`command` | cmd_alias } [AS schema] ;
    • A = LOAD 'data';
    • B = STREAM A THROUGH `perl stream.pl -n 5`;

    Built-in functions

    • Eval functions
      • AVG
      • CONCAT
      • COUNT
      • COUNT_STAR
    • Math functions
      • ABS
      • SQRT
      • Etc …
    • STRING functions
      • ENDSWITH
      • TRIM
    • Datetime functions
      • AddDuration
      • GetDay
      • GetHour
    • Dynamic Invokers
      • DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
      • encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
      • decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');

    File System Commands with Apache Pig - Hadoop


    Hadoop - Apache pig Utility Commands


    Some more commands in PIG

  • To select a few columns from a dataset
    • S1 = FOREACH A GENERATE a1, a2;
  • Simple calculation on a dataset
    • K = FOREACH A GENERATE $1, $2, $1*$2;
  • To display only 100 records
    • B = LIMIT A 100;
  • To see the structure/schema
    • DESCRIBE A;
  • To union two datasets
    • C = UNION A, B;

    Using Hive tables with HCatalog

  • HCatalog (a component of Hive) provides access to Hive’s metastore, so that Pig queries can reference schemas by name rather than specifying them in full each time.
  • For example, after loading data into a Hive table called records, Pig can access the table’s schema and data as follows:
     pig -useHCatalog
     grunt> records = LOAD 'School_db.student_tbl'
    USING org.apache.hcatalog.pig.HCatLoader();
     grunt> DESCRIBE records;
     grunt> DUMP records;
