What is Pig Latin - Pig Programming Model: Data

  • Pig operations operate on relations
  • A relation is a bag
  • A bag is a collection of tuples
  • A tuple is an ordered set of fields
  • A field is a piece of data of any type

Basic data types:

  • Boolean: True, False
  • Int and Long: 1, 2, 3, 4, 5
  • Float and Double: 2.3, 1.4, 4.5
  • Chararray: 'Hello', 'I am a string'
  • DateTime: 2014-09-11T12:20:14.1234+00:00
  • … and more, but you probably won’t use them very often
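
As a minimal sketch of how these types appear in practice, a LOAD schema can declare one field per type. The file name events.tsv and all field names below are hypothetical, chosen only for illustration.

```pig
-- Hypothetical tab-separated file; field names are illustrative only
events = LOAD 'events.tsv'
         AS (active:boolean, clicks:int, user_id:long,
             score:float, amount:double, label:chararray,
             created:datetime);

-- Print the declared schema to confirm the type of each field
DESCRIBE events;
```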

Tuple: A catch-all data type


Bag: A collection of tuples
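
Tuples and bags can nest inside each other. The sketch below assumes a hypothetical file enrolled.txt whose second column is itself a bag of (course, grade) tuples; the file and field names are not from the original tutorial.

```pig
-- Hypothetical input, one tuple per line (tab-separated), e.g.:
--   alice    {(math,90),(art,85)}
enrolled = LOAD 'enrolled.txt'
           AS (name:chararray,
               courses:bag{t:tuple(course:chararray, grade:int)});

-- Each element of enrolled is a tuple; its second field is a bag of tuples
DESCRIBE enrolled;
```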


Working with Data


Loading data?

  • Data source: Local or HDFS (usually!)
  • LOAD instruction:
    • Data is automatically loaded into a distributed relation
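
    A minimal LOAD sketch, assuming a hypothetical comma-separated file students.csv with three columns (the file and field names are illustrative only):

    ```pig
    -- The path is resolved against the local file system in local mode
    -- (pig -x local) and against HDFS in MapReduce mode
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);
    ```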

    Checking relations’ content

  • DUMP instruction:
    • Prints the content of a relation to standard output
  • DESCRIBE instruction:
    • Prints the schema of the relation to standard output
  • ILLUSTRATE instruction:
    • Prints the schema of the relation and an example tuple to standard output
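
    The three inspection statements can be tried side by side. This is a sketch on a hypothetical students.csv file; the file and field names are assumptions.

    ```pig
    -- Hypothetical input file and schema
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);

    DUMP students;        -- prints every tuple of the relation
    DESCRIBE students;    -- prints only the schema
    ILLUSTRATE students;  -- prints the schema with example tuples
    ```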

    Operating on relations

  • FOREACH instruction:
    • Generates a new relation by projecting data from a relation
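
    A projection could look like the following sketch, assuming a hypothetical students.csv file (file and field names are illustrative):

    ```pig
    -- Hypothetical input file and schema
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);

    -- Keep two fields as they are
    names_ages = FOREACH students GENERATE name, age;

    -- Projections may also compute new fields from existing ones
    next_year = FOREACH students GENERATE name, age + 1 AS next_age;
    ```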
  • FOREACH instruction:
    • Let us execute the instruction and… it seems that nothing happens!
    • We had some tracing output with LOAD, DUMP, and ILLUSTRATE…

    Operating on relations

  • Pig employs lazy evaluation
  • Computation is triggered only by statements that need results:
    • DUMP, STORE (and ILLUSTRATE, on a data sample)
  • Pig keeps an optimized DAG of the MapReduce (MR) jobs needed to compute relations
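
    Lazy evaluation can be observed with a short script. This is a sketch with a hypothetical input file and hypothetical field names:

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);

    -- These statements only extend the logical plan; no job is launched
    adults = FILTER students BY age >= 18;
    names  = FOREACH adults GENERATE name;

    -- Only now does Pig compile the plan into jobs and run them
    DUMP names;
    ```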

  • FILTER instruction:
    • Generates a new relation by keeping only the tuples of a relation that satisfy a condition
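
    A minimal FILTER sketch, with a hypothetical input file and field names:

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);

    -- Conditions can be combined with AND, OR, and NOT
    good_adults = FILTER students BY age >= 18 AND gpa > 3.0;
    DUMP good_adults;
    ```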
  • SPLIT instruction:
    • Splits a relation into multiple relations based on conditions
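
    A SPLIT sketch on a hypothetical students.csv file (names are illustrative). A tuple goes to every output relation whose condition it satisfies, and to none if no condition matches.

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);

    -- One input relation, two output relations
    SPLIT students INTO minors IF age < 18,
                        adults IF age >= 18;

    DUMP minors;
    DUMP adults;
    ```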
  • GROUP instruction:
    • Creates tuples with the key and a bag of the tuples that share that key value
  • GROUP also accepts multiple relations (COGROUP). It then creates one bag per relation
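
    Grouping one relation and grouping two relations together can be sketched as follows. Both input files and all field names are hypothetical.

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);
    grades   = LOAD 'grades.csv' USING PigStorage(',')
               AS (name:chararray, course:chararray, grade:int);

    -- Result tuples have a field named group (the key) and a bag named
    -- students holding the tuples that share that key
    by_age = GROUP students BY age;

    -- With two relations, each result tuple holds the key plus one bag per
    -- input relation
    by_name = COGROUP students BY name, grades BY name;

    DESCRIBE by_age;
    DESCRIBE by_name;
    ```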
  • Nested FOREACH:
    • Operate on data in bags inside a relation and then project
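
    A nested FOREACH sketch, assuming a hypothetical students.csv file; the threshold 3.5 and all names are illustrative.

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);
    by_age = GROUP students BY age;

    -- Inside the braces, operators run on the bag of each group
    -- before the final GENERATE projects the result
    summary = FOREACH by_age {
        good = FILTER students BY gpa >= 3.5;
        GENERATE group AS age, COUNT(good) AS n_good;
    };
    DUMP summary;
    ```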
  • (inner) JOIN instruction:
    • Our classic database operator for relations!
  • (left) JOIN instruction:
    • Like the inner join, but tuples of the left relation without a match are kept and padded with nulls
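
    Inner and left outer joins can be compared in a short sketch. Both input files and all field names are hypothetical.

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);
    grades   = LOAD 'grades.csv' USING PigStorage(',')
               AS (name:chararray, course:chararray, grade:int);

    -- Inner join: only students that have at least one grade
    inner_j = JOIN students BY name, grades BY name;

    -- Left outer join: every student, with nulls where no grade matches
    left_j = JOIN students BY name LEFT OUTER, grades BY name;

    DUMP inner_j;
    DUMP left_j;
    ```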
  • CROSS instruction:
    • Cartesian product of two or more relations
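
    A CROSS sketch with two small hypothetical inputs. The result grows as the product of the input sizes, so it is best kept to small relations.

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);
    courses  = LOAD 'courses.csv' USING PigStorage(',')
               AS (course:chararray, credits:int);

    -- Every student paired with every course
    pairs = CROSS students, courses;
    DUMP pairs;
    ```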
  • UNION instruction:
    • Merges multiple relations into a single relation
  • DISTINCT instruction:
    • Only preserves unique tuples
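
    UNION and DISTINCT are often used together, since UNION keeps duplicate tuples. The sketch below uses two hypothetical files with the same schema.

    ```pig
    s2023 = LOAD 'students_2023.csv' USING PigStorage(',')
            AS (name:chararray, age:int, gpa:float);
    s2024 = LOAD 'students_2024.csv' USING PigStorage(',')
            AS (name:chararray, age:int, gpa:float);

    -- All tuples of both relations, duplicates included
    all_students = UNION s2023, s2024;

    -- Identical tuples collapse into one
    unique_students = DISTINCT all_students;
    DUMP unique_students;
    ```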
  • ORDER BY instruction:
    • Sorts a relation by one or more fields
  • LIMIT instruction:
    • Truncates a relation to a given maximum number of tuples
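
    ORDER BY and LIMIT combine naturally into a top-N query. This is a sketch on a hypothetical students.csv file with illustrative field names.

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);

    -- Highest gpa first; ties broken by name
    sorted = ORDER students BY gpa DESC, name ASC;

    -- Keep at most the first three tuples
    top3 = LIMIT sorted 3;
    DUMP top3;
    ```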
  • RANK instruction:
    • Adds the position of each tuple in the relation as a new first field
  • We can also sort and rank in one step with RANK ... BY
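
    Both forms of RANK, sketched on a hypothetical students.csv file (file and field names are assumptions):

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);

    -- Plain RANK: numbers the tuples in their current order
    ranked = RANK students;

    -- RANK ... BY: sorts by gpa first, then assigns ranks
    ranked_by_gpa = RANK students BY gpa DESC;

    DUMP ranked;
    DUMP ranked_by_gpa;
    ```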
  • SAMPLE instruction:
    • Selects a random sample of the relation, given a sampling fraction
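
    A SAMPLE sketch with a hypothetical input file. The fraction 0.1 is an arbitrary choice, and the size of the result is approximate rather than exact.

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);

    -- Keep roughly 10% of the tuples, chosen at random
    tenth = SAMPLE students 0.1;
    DUMP tenth;
    ```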
  • CUBE instruction:
    • Groups by all combinations of the given fields. Is this really useful? Yes! Many aggregates with just one operation
  • CUBE/ROLLUP instruction:
    • Like standard CUBE, but null values are introduced from right to left
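
    CUBE and ROLLUP can be sketched on a hypothetical sales.csv file (the file and its fields are assumptions). A null in a dimension stands for "all values" of that dimension.

    ```pig
    sales = LOAD 'sales.csv' USING PigStorage(',')
            AS (product:chararray, region:chararray, amount:int);

    -- CUBE: groups for (product, region), (product, null),
    -- (null, region), and (null, null)
    cubed = CUBE sales BY CUBE(product, region);
    cube_totals = FOREACH cubed
                  GENERATE FLATTEN(group), SUM(cube.amount) AS total;

    -- ROLLUP: groups for (product, region), (product, null),
    -- and (null, null) only
    rolled = CUBE sales BY ROLLUP(product, region);
    rollup_totals = FOREACH rolled
                    GENERATE FLATTEN(group), SUM(cube.amount) AS total;

    DUMP cube_totals;
    DUMP rollup_totals;
    ```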
  • ASSERT instruction:
    • Assert that the whole relation fulfills a condition
    • Useful for debugging
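
    An ASSERT sketch on a hypothetical students.csv file. The condition and message are illustrative, and the check runs when the relation is actually computed.

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);

    -- The script fails with this message if any tuple violates the condition
    ASSERT students BY age >= 0, 'age must be non-negative';

    DUMP students;
    ```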
  • STORE instruction:
    • Stores the relation into the local FS or HDFS (usually!)
    • Useful for debugging
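
    A STORE sketch, with a hypothetical input file and a hypothetical output path. The output is written as a directory of part files.

    ```pig
    students = LOAD 'students.csv' USING PigStorage(',')
               AS (name:chararray, age:int, gpa:float);
    adults = FILTER students BY age >= 18;

    -- Writes comma-separated part files under output/adults
    STORE adults INTO 'output/adults' USING PigStorage(',');
    ```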

    Where to find useful PigLatin scripts?

    • PiggyBank - Pig’s repository of user-contributed functions
      • load/store functions (e.g. from XML)
      • datetime, text, math, and stats functions
    • DataFu - LinkedIn's collection of Pig UDFs
      • statistics functions (quantiles, variance etc.)
      • convenient bag functions (intersection, union etc.)
      • utility functions (assertions, random numbers, MD5, distance between lat/long pair), PageRank

    How to develop PigLatin scripts?

  • Eclipse plugins
    • PigEditor
      • syntax/errors highlighting
      • check of alias name existence
      • auto completion of keywords, UDF names
    • PigPen
      • graphical visualization of scripts (box and arrows)
    • Pig-Eclipse
  • Plugins for Vim, Emacs, TextMate
    • Usually provide syntax highlighting and code completion

    How to run PigLatin scripts?

  • PigServer Java class, a JDBC-like interface
  • Python and JavaScript with PigLatin code embedded
    • adds control flow constructs such as if and for
    • avoids the need to invent a new language
    • uses a JDBC-like compile, bind, run model
