Specifying the UDF output schema
- A UDF has an input and an output. Here are the different ways you can specify the output format of a Python UDF using the outputSchema decorator.
Sample Code:
# the original udf
# it returns a single chararray (that's Pig Latin for string)
@outputSchema('word:chararray')
def hi_world():
    return "hello world"
# this one returns a Python tuple. Pig recognises the first element
# of the tuple as a chararray like before, and the next one as a
# long (a kind of integer)
@outputSchema("word:chararray,number:long")
def hi_everyone():
    return "hi there", 15
# we can use outputSchema to define nested schemas too, here is a bag of tuples
@outputSchema('some_bag:bag{t:(field_1:chararray, field_2:int)}')
def bag_udf():
    return [
        ('hi', 1000),
        ('there', 2000),
        ('bill', 0)
    ]
# and here is a map
@outputSchema('something_nice:map[]')
def my_map_maker():
    return {"a": "b", "c": "d", "e": "f"}
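To try UDFs like these outside of Pig, it can help to have a stand-in for the decorator. The sketch below is an assumption, not Pig's own implementation: it defines a hypothetical local outputSchema that only records the schema string on the function and does no validation. The attribute name outputSchema on the function is also chosen just for illustration.

```python
# Hypothetical stand-in for the decorator that Pig makes available to
# Python UDFs. It only stores the schema string on the function object;
# it does not check that the return value matches the schema.
def outputSchema(schema_str):
    def wrap(func):
        func.outputSchema = schema_str
        return func
    return wrap


@outputSchema("word:chararray,number:long")
def hi_everyone():
    return "hi there", 15


if __name__ == "__main__":
    # Show the declared schema next to the value the UDF returns.
    print(hi_everyone.outputSchema)
    print(hi_everyone())
```

With a stand-in like this, the UDF bodies can be exercised in an ordinary Python interpreter before the script is registered with Pig.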
The outputSchema decorator can be used to declare that a function outputs one or a combination of basic types. Those types are:
- chararray: like a string
- bytearray: a bunch of bytes in a row. Like a string but not as human friendly
- long: 64-bit integer
- int: 32-bit integer
- double: 64-bit floating point number
- datetime
- boolean
If no schema is specified, Pig assumes that the UDF outputs a bytearray.
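As a rough illustration of how a flat schema string such as "word:chararray,number:long" breaks down into name and type pairs, here is a minimal sketch. The helper parse_flat_schema and the BASIC_TYPES set are hypothetical names introduced for this example. It only handles comma-separated basic types and rejects bag, tuple and map schemas; it is not how Pig itself parses schemas.

```python
# Basic type names listed in this section.
BASIC_TYPES = {
    "chararray", "bytearray", "long", "int",
    "double", "datetime", "boolean",
}


def parse_flat_schema(schema_str):
    """Split 'name:type,name:type' into a list of (name, type) pairs.

    Hypothetical helper for illustration only: nested schemas
    (bag, tuple, map) are not supported and raise ValueError.
    """
    fields = []
    for part in schema_str.split(","):
        # Everything before the first colon is the field name.
        name, _, type_name = part.strip().partition(":")
        type_name = type_name.strip()
        if type_name not in BASIC_TYPES:
            raise ValueError("not a basic type: %r" % type_name)
        fields.append((name.strip(), type_name))
    return fields


if __name__ == "__main__":
    print(parse_flat_schema("word:chararray,number:long"))
```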