# Pig Latin - Apache Pig Tutorial

## What is Pig Latin - Pig Programming Model: Data

• Pig operators work on relations
• A relation is a bag
• A bag is a collection of tuples
• A tuple is an ordered set of fields
• A field is a piece of data of any type
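
The nesting described above can be made concrete with a small Python model. This is only a sketch of the data model, not Pig itself: the `students` relation and its values are invented for illustration, and a Python list stands in for a bag (which in Pig is unordered and may contain duplicates).

```python
# A field is a single piece of data of any type.
name_field = "alice"

# A tuple is an ordered set of fields.
student = ("alice", 21, 3.7)

# A bag is a collection of tuples (duplicates allowed, order not guaranteed).
# A Python list is used here as a stand-in.
students = [
    ("alice", 21, 3.7),
    ("bob", 19, 3.1),
    ("alice", 21, 3.7),  # a bag may hold the same tuple twice
]

# A relation is an (outer) bag, so `students` is also a relation.
# A field may itself be a bag, which is how nested data is represented.
grouped = ("alice", [("alice", 21, 3.7), ("alice", 21, 3.7)])

print(len(students))    # number of tuples in the relation
print(student[0])       # fields can be accessed by position
print(len(grouped[1]))  # size of the nested bag
```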

## Basic data types:

• Boolean: True, False
• Int and Long: 1, 2, 3, 4, 5
• Float and Double: 2.3, 1.4, 4.5
• Chararray: 'Hello', 'I am a string'
• DateTime: 2014-09-11T12:20:14.1234+00:00
• … and more, but you probably won't use them very often

## Working with Data

• Data source: local file system or HDFS (usually!)
• Data is automatically loaded into a distributed relation
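
As a rough illustration of what loading does, the Python sketch below turns delimited text into typed tuples. It models only the parsing and casting step, not the distribution across a cluster. The file contents, the schema, and the helper names `cast` and `load` are assumptions made up for this example; the Pig Latin statement in the comment is the kind of `LOAD` being imitated.

```python
import io

# Pig Latin statement being modeled (file name and schema are invented):
#   students = LOAD 'students.tsv' USING PigStorage('\t')
#              AS (name:chararray, age:int, gpa:double);

SCHEMA = [("name", str), ("age", int), ("gpa", float)]

def cast(value, to_type):
    """Cast a raw text field; a failed cast yields None, mirroring Pig's null."""
    try:
        return to_type(value)
    except ValueError:
        return None

def load(lines, schema, delimiter="\t"):
    """Turn delimited text lines into a list of typed tuples."""
    relation = []
    for line in lines:
        raw = line.rstrip("\n").split(delimiter)
        relation.append(tuple(cast(v, t) for v, (_, t) in zip(raw, schema)))
    return relation

# An in-memory file stands in for a local or HDFS path.
data = io.StringIO("alice\t21\t3.7\nbob\tn/a\t3.1\n")
students = load(data, SCHEMA)
print(students)
```

The second input line has a non-numeric age, so that field becomes `None` in the resulting tuple instead of aborting the load.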

## Checking relations’ content

• DUMP instruction:
• Prints the content of a relation to standard output
• DESCRIBE instruction:
• Prints the schema of the relation to standard output
• ILLUSTRATE instruction:
• Prints the schema of the relation and an example tuple to standard output
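
To give a feel for what `DUMP` and `DESCRIBE` print, here is a small Python approximation. The output format imitates what Pig typically shows (one parenthesized tuple per line, and the alias followed by its fields and types), but it is a sketch rather than a guaranteed byte-for-byte match; the relation, schema, and function names are hypothetical.

```python
# Pig Latin statements being modeled:
#   DUMP students;
#   DESCRIBE students;

students = [("alice", 21, 3.7), ("bob", 19, 3.1)]
schema = [("name", "chararray"), ("age", "int"), ("gpa", "double")]

def dump(relation):
    """Model of DUMP: one parenthesized tuple per line."""
    return "\n".join("(" + ",".join(str(f) for f in t) + ")" for t in relation)

def describe(alias, schema):
    """Model of DESCRIBE: the alias followed by field names and types."""
    fields = ",".join(f"{name}: {typ}" for name, typ in schema)
    return f"{alias}: {{{fields}}}"

print(dump(students))
print(describe("students", schema))
```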

## Operating on relations

• FOREACH instruction:
• Generates a new relation by projecting data from a relation
• Let us execute the instruction and… it seems that nothing happens!
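
Projection itself can be sketched in a few lines of Python. The helper `foreach_generate` is a made-up name for this example: it applies one expression per output field to every tuple, which is roughly what `FOREACH ... GENERATE` does. The relation and field names are invented.

```python
# Pig Latin statement being modeled (field names are invented):
#   projected = FOREACH students GENERATE name, age + 1;

students = [("alice", 21, 3.7), ("bob", 19, 3.1)]

def foreach_generate(relation, *expressions):
    """Build a new relation with one output field per expression."""
    return [tuple(expr(t) for expr in expressions) for t in relation]

projected = foreach_generate(
    students,
    lambda t: t[0],      # name
    lambda t: t[1] + 1,  # age + 1
)
print(projected)
```

The input relation is left unchanged; the statement defines a new relation.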

• Pig employs lazy evaluation
• Computation happens only when output is requested, e.g. by DUMP or STORE
• Pig keeps a DAG of the MapReduce (MR) jobs needed to compute relations (optimized!)
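
The idea of lazy evaluation can be illustrated with a toy Python class. This is a deliberately simplified sketch, assuming nothing about Pig's internals: the `Relation` class, its methods, and the `load` function are all hypothetical, and no optimization or MapReduce compilation is modeled. Each relation only remembers how to compute itself and which relations it depends on; data is touched when `dump` is called.

```python
class Relation:
    """A lazily evaluated relation: stores how to compute itself, not the data."""

    def __init__(self, compute, parents=()):
        self._compute = compute
        self.parents = parents  # edges of the dependency DAG

    def filter(self, predicate):
        return Relation(
            lambda: [t for t in self.evaluate() if predicate(t)], parents=(self,)
        )

    def foreach(self, fn):
        return Relation(lambda: [fn(t) for t in self.evaluate()], parents=(self,))

    def evaluate(self):
        return self._compute()

    def dump(self):
        """Requesting output is what finally triggers computation."""
        for t in self.evaluate():
            print(t)

calls = []  # records every time the source data is actually read

def load():
    calls.append("load")
    return [("alice", 21), ("bob", 17)]

students = Relation(load)
adults = students.filter(lambda t: t[1] >= 18)
names = adults.foreach(lambda t: (t[0],))

print(calls)  # still empty: defining relations computed nothing
names.dump()  # now the whole chain runs
print(calls)  # the source has been read
```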

• FILTER instruction:
• Generates a new relation by filtering the tuples of a relation
• SPLIT instruction:
• Splits a relation into multiple relations based on conditions
• GROUP instruction:
• Creates tuples with the key and a bag of the tuples that share that key value
• We can group multiple relations at once, which creates one bag per relation
• Nested FOREACH:
• Operate on data in bags inside a relation and then project
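• (inner) JOIN instruction:
• Our classic database operator for relations! Keeps only the tuples whose keys match in both relations
• (left) JOIN instruction:
• The classic left outer join: keeps every tuple of the left relation, with nulls where the right relation has no match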
• CROSS instruction:
• Cartesian product of two or more relations
• UNION instruction:
• Merges multiple relations into a single relation
• DISTINCT instruction:
• Only preserves unique tuples
• ORDER BY instruction:
• Sorts a relation by one or more fields
• LIMIT instruction:
• Truncates a relation to a given number of tuples
• RANK instruction:
• Adds a field with the position (rank) of each tuple in the relation
• We can also rank by a sort order (RANK ... BY)
• SAMPLE instruction:
• Selects a random sample of the relation's tuples
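• CUBE instruction:
• Is this really useful? Yes! It produces the groupings for every combination of dimensions, so many aggregates can be computed with just one operation
• CUBE/ROLLUP instruction:
• Like standard CUBE, but null values are introduced from right to left (hierarchical subtotals only)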
• ASSERT instruction:
• Assert that the whole relation fulfills a condition
• Useful for debugging
• STORE instruction:
• Stores the relation into the local FS or HDFS (usually!)
• Useful for debugging
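
A few of the operators listed above (GROUP, nested FOREACH, inner and left JOIN, ORDER BY, LIMIT) can be modeled in plain Python to show the shape of their output. This is a sketch under stated assumptions: the relations, field names, and the helpers `group_by` and `join` are invented, lists stand in for bags, and Pig-specific details such as null-key handling or output ordering guarantees are not modeled.

```python
from collections import defaultdict

# Pig Latin statements being modeled (aliases and field names are invented):
#   by_dept = GROUP students BY dept;
#   counts  = FOREACH by_dept GENERATE group, COUNT(students);
#   joined  = JOIN students BY dept, depts BY id;
#   left    = JOIN students BY dept LEFT OUTER, depts BY id;
#   ordered = ORDER counts BY $1 DESC;
#   top     = LIMIT ordered 1;

students = [("alice", "cs"), ("bob", "math"), ("carol", "cs"), ("dave", "art")]
depts = [("cs", "Computer Science"), ("math", "Mathematics")]

def group_by(relation, key_index):
    """GROUP: one tuple per key, holding (key, bag of tuples with that key)."""
    bags = defaultdict(list)
    for t in relation:
        bags[t[key_index]].append(t)
    return [(key, bag) for key, bag in bags.items()]

def join(left, left_key, right, right_key, left_outer=False):
    """JOIN: concatenate matching tuples; optionally keep unmatched left tuples."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    width = len(right[0]) if right else 0
    out = []
    for l in left:
        matches = index.get(l[left_key], [])
        if matches:
            out.extend(l + r for r in matches)
        elif left_outer:
            out.append(l + (None,) * width)  # nulls for the missing right side
    return out

by_dept = group_by(students, 1)
counts = [(key, len(bag)) for key, bag in by_dept]      # nested FOREACH with COUNT
joined = join(students, 1, depts, 0)
left = join(students, 1, depts, 0, left_outer=True)
ordered = sorted(counts, key=lambda t: t[1], reverse=True)  # ORDER BY ... DESC
top = ordered[:1]                                            # LIMIT 1

print(by_dept)
print(counts)
print(joined)
print(left)
print(top)
```

Note how the grouped relation nests whole tuples inside a bag per key, while the joins produce flat tuples that concatenate the fields of both sides.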

## Where to find useful PigLatin scripts?

• PiggyBank - Pig's repository of user-contributed functions
• load/store functions (e.g. from XML)
• datetime, text, math, and stats functions
• DataFu - LinkedIn's collection of Pig UDFs
• statistics functions (quantiles, variance etc.)
• convenient bag functions (intersection, union etc.)
• utility functions (assertions, random numbers, MD5, distance between lat/long pair), PageRank

## How to develop PigLatin scripts?

• Eclipse plugins
• PigEditor
• syntax/error highlighting
• checks that alias names exist
• auto completion of keywords, UDF names
• PigPen
• graphical visualization of scripts (box and arrows)
• Pig-Eclipse
• Plugins for Vim, Emacs, TextMate
• Usually provide syntax highlighting and code completion

## How to run PigLatin scripts?

• PigServer Java class, a JDBC-like interface
• Python and JavaScript with PigLatin code embedded
• adds control flow constructs such as if and for
• avoids the need to invent a new language
• uses a JDBC-like compile, bind, run model