Apache Pig TOKENIZE() Function



What is the TOKENIZE() function in Apache Pig?

  • The TOKENIZE() function in Apache Pig splits a string held in a single tuple and returns a bag containing the results of the split.
  • The input string is broken into tokens at delimiter characters. By default, the separators are space, double quote ("), comma (,), parentheses, and asterisk (*); an optional delimiter can be passed as a second argument.
  • Each substring found between separators becomes one token, and each token is placed in its own single-field tuple inside the returned bag.
  • If no separator occurs in the input, the bag contains one tuple, which holds the whole input string.

Syntax

grunt> TOKENIZE(expression [, 'field_delimiter']) 
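
The two call forms in the syntax line can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes the sample file, path, and schema from the Example section below, a Pig version that accepts the optional delimiter argument, and the aliases students, default_tokens, and space_tokens, which are hypothetical names chosen for this sketch. The statements are written in script form, without the grunt> prompt.

```pig
-- Sketch only: 'students', 'default_tokens' and 'space_tokens' are hypothetical aliases.
-- Load the sample file with the same path and schema used in the Example section.
students = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_student_details.txt'
    USING PigStorage(',')
    AS (id:int, name:chararray, age:int, city:chararray);

-- One-argument form: split the name on the default separators.
default_tokens = FOREACH students GENERATE TOKENIZE(name);

-- Two-argument form: split the name on the space character only.
space_tokens = FOREACH students GENERATE TOKENIZE(name, ' ');

DUMP space_tokens;
```

Because the names in the sample data contain only letters and spaces, both forms should produce the same bags for this file. The delimiter argument matters when the text contains characters such as commas or parentheses that should stay inside a token.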

Example

wikitechy_student_details.txt

111,Suresh Reddy,21,Hyderabad
112,Arvin Battacharya,22,Kolkata 
113,Ramesh Khanna,22,Delhi 
114,Preethi Agarwal,21,Pune 
115,Sruthi Mohanthy,23,Bhuwaneshwar 
116,Vanitha Mishra,23,Chennai
117,Kamala Nayak,24,trivendram 
118,Bhargavi Nambiayar,24,Chennai 

The file has been loaded into Pig as the relation wikitechy_student_details, as shown below:

grunt> wikitechy_student_details = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_student_details.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int,  city:chararray);

Tokenizing a String

We can use the TOKENIZE() function to split a string into tokens. The statement below tokenizes the name field of each student.

grunt> student_name_tokenize = FOREACH wikitechy_student_details GENERATE TOKENIZE(name);

Verification

grunt> Dump student_name_tokenize;

Output

({(Suresh),(Reddy)})
({(Arvin),(Battacharya)})
({(Ramesh),(Khanna)})
({(Preethi),(Agarwal)})
({(Sruthi),(Mohanthy)})
({(Vanitha),(Mishra)})
({(Kamala),(Nayak)})
({(Bhargavi),(Nambiayar)})
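
Each row above holds a single bag. When one row per token is more convenient, for example as the first step of a word count, the bag can be unnested with FLATTEN. The sketch below assumes the wikitechy_student_details relation loaded earlier; the alias student_name_words and the field name word are hypothetical names chosen for this illustration.

```pig
-- Sketch only: 'student_name_words' and 'word' are hypothetical names.
-- FLATTEN unnests the bag returned by TOKENIZE, emitting one row per token
-- and repeating the student's id on each of those rows.
student_name_words = FOREACH wikitechy_student_details
    GENERATE id, FLATTEN(TOKENIZE(name)) AS word;

DUMP student_name_words;
```

With the sample data, the first student should appear as two rows, (111,Suresh) and (111,Reddy), rather than as one row containing a bag.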
