Specifying the UDF output schema

  • A UDF has an input and an output. Here are the different ways you can specify the output format of a Python UDF using the outputSchema decorator.

Sample Code:

# the original udf
# it returns a single chararray (that's Pig Latin for string)
@outputSchema('word:chararray')
def hi_world():
    return "hello world"

# this one returns a Python tuple. Pig recognises the first element
# of the tuple as a chararray like before, and the next one as a
# long (a kind of integer)
@outputSchema("word:chararray,number:long")
def hi_everyone():
    return "hi there", 15

# we can use outputSchema to define nested schemas too; here is a bag of tuples
@outputSchema('some_bag:bag{t:(field_1:chararray, field_2:int)}')
def bag_udf():
    return [
        ('hi', 1000),
        ('there', 2000),
        ('bill', 0)
    ]

# and here is a map
@outputSchema('something_nice:map[]')
def my_map_maker():
    return {"a": "b", "c": "d", "e": "f"}

outputSchema can be used to declare that a function outputs one basic type or a combination of them. Those types are:

  • chararray: like a string
  • bytearray: a raw sequence of bytes; like a string, but not as human-friendly
  • long: long integer
  • int: normal integer
  • double: floating point number
  • datetime
  • boolean

If no schema is specified, Pig assumes that the UDF outputs a bytearray.
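
To try UDFs like the ones above in a plain Python session, outside Pig, a stand-in decorator can be handy. The sketch below is only an assumption for local experiments, not Pig's own implementation: the local `outputSchema` simply stores the schema string on the function, and `check_flat_schema`, `PIG_TO_PYTHON` and `wrong_types` are hypothetical names introduced here. The checker handles only flat schemas of simple types, not bags or maps.

```python
# Hypothetical stand-in for Pig's outputSchema decorator (local testing only).
def outputSchema(schema_str):
    def wrap(func):
        # Record the declared schema string on the function object.
        func.outputSchema = schema_str
        return func
    return wrap

# Assumed mapping from some Pig simple types to Python types.
PIG_TO_PYTHON = {
    "chararray": str,
    "long": int,
    "int": int,
    "double": float,
}

def check_flat_schema(func, *args):
    """Call func and compare each returned value with its declared simple type."""
    fields = [f.strip() for f in func.outputSchema.split(",")]
    result = func(*args)
    # A single return value is treated as a one-field result.
    values = result if isinstance(result, tuple) else (result,)
    if len(values) != len(fields):
        return False
    for field, value in zip(fields, values):
        _name, pig_type = field.split(":")
        if not isinstance(value, PIG_TO_PYTHON[pig_type.strip()]):
            return False
    return True

@outputSchema("word:chararray,number:long")
def hi_everyone():
    return "hi there", 15

# Deliberately returns the fields in the wrong order.
@outputSchema("word:chararray,number:long")
def wrong_types():
    return 15, "hi there"
```

With these definitions, `check_flat_schema(hi_everyone)` evaluates to `True` and `check_flat_schema(wrong_types)` to `False`, which gives a quick way to catch a mismatch between the declared schema and the returned values before running the script through Pig.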
