[Solved-1 Solution] Datetime parsing in Apache Pig ?



What is parsing?

  • Parsing converts the string representation of a date and time into an equivalent date-time value. In .NET the result is a DateTime object; in Pig, ToDate returns a datetime.
  • In .NET, parsing is influenced by the properties of a format provider, which supplies information such as the strings used for date and time separators and the names of months, days, and eras.
  • That format provider is the current DateTimeFormatInfo object, supplied implicitly by the current thread culture or explicitly through the IFormatProvider parameter of a parsing method.
  • For the IFormatProvider parameter, specify a CultureInfo object, which represents a culture, or a DateTimeFormatInfo object.

Problem:

We are trying to parse a date in a Pig script, and the job fails with the message "Hadoop does not return any error message". Here is an example of the date format: 16/7/18 11:00 AM

data = LOAD 'cleaned.txt'
AS (Date, Block, Primary_Type, Description, Location_Description, Arrest, Domestic, District, Year);

times = FOREACH data GENERATE ToDate(Date, 'M/d/yy h:mm a') As Time;

The error appears to be triggered by the STORE command on "times".

If we run a DUMP instead, we get this error:

ERROR 1066: Unable to open iterator for alias times

It happens only when we use the ToDate function.
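
A quick way to narrow this down is to ask Pig which types it has assigned to the fields before ToDate is called. The sketch below reuses the LOAD statement from the question and adds DESCRIBE, Pig's schema diagnostic operator. Since no types are declared in the AS clause, we would expect every field, including Date, to be reported as bytearray rather than chararray.

```pig
-- Same untyped LOAD as in the question
data = LOAD 'cleaned.txt'
AS (Date, Block, Primary_Type, Description, Location_Description, Arrest, Domestic, District, Year);

-- Print the schema Pig has assigned to the relation
DESCRIBE data;
```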

Solution 1:

  • Specify the loader in the LOAD statement:
USING PigStorage('\t')
  • Always declare the schema with an explicit type for each field:
data = LOAD 'SO/date2parse.txt' USING PigStorage('\t') AS (Date:chararray, Block:chararray, Primary_Type:chararray, Description:chararray, Location_Description:chararray, Arrest:chararray, Domestic:chararray, District:chararray, Year:chararray);

After this change, the date conversion works:

(2016-03-09T23:55:00.000Z)
(2016-03-09T23:55:00.000Z)
(2016-03-09T23:55:00.000Z)

Use the code below:

data = LOAD 'SO/date2parse.txt' USING PigStorage('\t') AS (Date:chararray, Block:chararray, Primary_Type:chararray, Description:chararray, Location_Description:chararray, Arrest:chararray, Domestic:chararray, District:chararray, Year:chararray);
times = FOREACH data GENERATE ToDate(Date, 'M/d/yy h:mm a') As Time;
DUMP times;
  • PigStorage is the default load function for the LOAD operator, and tab is its default delimiter, so the USING clause mainly makes the delimiter explicit.
  • The original issue was caused by the missing data types. If you don't assign types, fields default to type bytearray.
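
If changing the LOAD schema is not convenient, an explicit cast at the point of use should have a similar effect. The following is a minimal sketch, not a tested fix. It assumes the Date field holds plain text in the 'M/d/yy h:mm a' layout, and the alias typed_times is a made-up name used only for illustration.

```pig
-- Untyped load: every field defaults to bytearray
data = LOAD 'SO/date2parse.txt' USING PigStorage('\t')
    AS (Date, Block, Primary_Type, Description, Location_Description, Arrest, Domestic, District, Year);

-- Cast the bytearray to chararray before handing it to ToDate
typed_times = FOREACH data GENERATE ToDate((chararray)Date, 'M/d/yy h:mm a') AS Time;

DUMP typed_times;
```

Declaring the types once in the LOAD schema is usually the tidier choice, because every later statement then sees chararray fields without repeated casts.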

