Apache Pig Tutorial - Python UDFs in Pig
- Apache Pig is a popular system for executing complex Hadoop map-reduce data flows. It adds a layer of abstraction on top of Hadoop's map-reduce mechanisms, allowing developers to take a high-level view of the data and the operations on that data.
- Pig allows us to express these operations directly. For example, we can join two or more data sources (much like an SQL join). Writing a join as map and reduce functions by hand is a bit of a drag and it's usually worth avoiding.
- So Pig is great because it simplifies complex tasks: it provides a high-level scripting language that allows users to take more of a big-picture view of their data flow.
- Pig is especially great because it is extensible. This tutorial will focus on that extensibility.
- We will be able to write PigLatin scripts that execute Python code as part of a larger map-reduce workflow. Pig can be extended with other languages too, but for now we'll stick to Python.
User Defined Functions (UDFs)
- A Pig UDF is a function that is accessible to Pig but written in a language that isn't PigLatin. Pig allows us to register UDFs for use within a PigLatin script.
- A UDF needs to fit a specific prototype - we can't just write our function however we want, because then Pig won't know how to call it, what kinds of arguments it needs, or what kind of return value to expect.
Eval UDFs
- An eval UDF is the most common type. It's used in FOREACH-type statements. Here's an example of an eval function in action:
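A minimal sketch (the function and relation names here are hypothetical, not from the original): an eval UDF is just an ordinary Python function, which a PigLatin script can then call inside a FOREACH ... GENERATE.

```python
# to_upper.py -- a hypothetical eval UDF.
# Under Pig, the outputSchema decorator is injected by the Jython runtime;
# this no-op fallback (an assumption) lets the file also run standalone.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        return lambda func: func

@outputSchema("name:chararray")
def to_upper(name):
    # Called once per row by the FOREACH statement below.
    return name.upper()

# Roughly how a PigLatin script would use it:
#   REGISTER 'to_upper.py' USING jython AS myudfs;
#   users       = LOAD 'user_data' AS (name:chararray);
#   upper_names = FOREACH users GENERATE myudfs.to_upper(name);
```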
Aggregate UDFs
- These are just a special case of an eval UDF. An aggregate function is usually applied to grouped data. For example:
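A sketch of what such an aggregate UDF might look like (names are hypothetical): after a GROUP, Pig passes each group's rows to the UDF as a bag, which arrives in Python as a list-like collection of tuples.

```python
# Under Pig, the outputSchema decorator is injected by the Jython runtime;
# this no-op fallback (an assumption) lets the file also run standalone.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        return lambda func: func

@outputSchema("purchases:long")
def count_purchases(bag):
    # The bag holds every row in one group; counting its tuples
    # gives the number of purchases for that user.
    return len(bag)

# Roughly how a PigLatin script would use it:
#   REGISTER 'sales_udfs.py' USING jython AS myudfs;
#   sales    = LOAD 'sales' AS (user_id:chararray, item:chararray);
#   grouped  = GROUP sales BY user_id;
#   per_user = FOREACH grouped GENERATE group, myudfs.count_purchases(sales);
```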
- In other words, an aggregate UDF is a UDF that combines multiple pieces of information. Here we are aggregating sales data to show how many purchases were made by each user.
Filter UDFs
- A filter UDF returns a boolean value. If we have a data source with a bunch of rows, and only a portion of those rows are useful for the current analysis, then a filter function would be useful.
An example of a filter function in action follows:
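A sketch under assumed names: a filter UDF is a Python function returning a boolean, which a PigLatin FILTER ... BY statement can use to keep or discard rows.

```python
# Under Pig, the outputSchema decorator is injected by the Jython runtime;
# this no-op fallback (an assumption) lets the file also run standalone.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        return lambda func: func

@outputSchema("recent:boolean")
def is_recent(year):
    # Keep only rows from 2020 onward; guard against null fields,
    # which Pig passes through as None.
    return year is not None and year >= 2020

# Roughly how a PigLatin script would use it:
#   REGISTER 'filters.py' USING jython AS myudfs;
#   records = LOAD 'events' AS (name:chararray, year:int);
#   recent  = FILTER records BY myudfs.is_recent(year);
```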
Here is about the simplest Python UDF we can write:
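A minimal sketch (the function name and schema string are illustrative): a function that takes no arguments and returns a constant string, decorated with the schema of its output.

```python
# Under Pig, the outputSchema decorator is injected by the Jython runtime;
# this no-op fallback (an assumption) lets the file also run standalone.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        return lambda func: func

@outputSchema("greeting:chararray")
def hi_world():
    return "hello world"
```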
- The data output from a function has a specific form. Pig likes it if we specify the schema of the data, because then it knows what it can do with that data. That's what the outputSchema decorator is for. There are a bunch of different ways to specify a schema; we'll get to that in a little bit.
Specifying the UDF output schema
- Now a UDF has input and output. This little section is all about the outputs. Here we'll go over the different ways we can specify the output format of a Python UDF through use of the outputSchema decorator. We have a few options; here they are:
So outputSchema can be used to specify that a function outputs one of the basic types, or a combination of them.
- Those types are:
- chararray: like a string
- bytearray: a bunch of bytes in a row. Like a string but not as human friendly
- long: long integer
- int: normal integer
- double: floating point number
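A sketch of a few schema strings built from the types above (the function names and the exact schema syntax here are illustrative; Pig's schema strings also allow tuples written as `name:(field:type, ...)`):

```python
# Under Pig, the outputSchema decorator is injected by the Jython runtime;
# this no-op fallback (an assumption) lets the file also run standalone.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        return lambda func: func

@outputSchema("age:int")
def to_int(s):
    # A single basic type: an int.
    return int(s)

@outputSchema("stats:(mean:double, n:long)")
def mean_and_count(bag):
    # A combination of basic types: a tuple of a double and a long.
    vals = [t[0] for t in bag]
    return (sum(vals) / float(len(vals)), len(vals))
```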
- If no schema is specified then Pig assumes that the UDF outputs a bytearray.
- Not only does a UDF have outputs but inputs as well! This sentence should be filed under 'duh', but we reserved it for a separate section so as not to clutter the discussion of output schemas.
First, some UDFs:
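A sketch with hypothetical names: Pig passes a row's fields to the UDF as positional arguments, with chararray arriving as a Python string and int/long/double as the corresponding Python numbers.

```python
# udfs.py -- hypothetical UDFs taking one or more input fields.
# Under Pig, the outputSchema decorator is injected by the Jython runtime;
# this no-op fallback (an assumption) lets the file also run standalone.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        return lambda func: func

@outputSchema("full_name:chararray")
def full_name(first, last):
    # Two chararray inputs, one chararray output.
    return "%s %s" % (first, last)

@outputSchema("total:double")
def total_price(quantity, unit_price):
    # An int and a double in, a double out.
    return quantity * unit_price
```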
Here is the use of those UDFs in a PigLatin script:
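A sketch of such a script, assuming UDFs like `full_name(first, last)` and `total_price(quantity, unit_price)` registered from a hypothetical file called udfs.py:

```pig
-- Register the Python file; its functions become available under the alias.
REGISTER 'udfs.py' USING jython AS udfs;

orders = LOAD 'orders.tsv' AS (first:chararray, last:chararray,
                               quantity:int, unit_price:double);

-- Each UDF receives the named fields as positional arguments.
labelled = FOREACH orders GENERATE
    udfs.full_name(first, last)            AS buyer,
    udfs.total_price(quantity, unit_price) AS total;

DUMP labelled;
```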