[Solved-1 Solution] Using hive table over parquet in Pig?



What is a Hive table?

Tables:

  • Homogeneous units of data which share the same schema. An example of a table could be a page_views table, where each row comprises the following columns (schema):

Timestamp

  • Which is of INT type that corresponds to a UNIX timestamp of when the page was viewed.

userid

  • Which is of BIGINT type that identifies the user who viewed the page.

page_url

  • Which is of STRING type that captures the location of the page.

referer_url

  • Which is of STRING type that captures the location of the page from which the user arrived at the current page.

IP

  • Which is of STRING type that captures the IP address from where the page request was made.
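As a concrete sketch, the page_views table described above could be declared in HiveQL like this (the DDL is illustrative, reconstructed from the column descriptions; `timestamp` is backquoted because it is a reserved word in Hive):

```sql
CREATE TABLE page_views (
  `timestamp` INT,     -- UNIX timestamp of when the page was viewed
  userid BIGINT,       -- identifies the user who viewed the page
  page_url STRING,     -- location of the page
  referer_url STRING,  -- location of the page the user arrived from
  ip STRING            -- IP address the page request was made from
);
```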

Problem:

  • Suppose you try to create a Hive table with schema string,string,double on a folder containing two Parquet files.
  • The first Parquet file's schema is string,string,double, while the schema of the second file is string,double,string.
CREATE EXTERNAL TABLE dynschema (
 trans_date string,
 currency string,
 rate double) 
STORED AS PARQUET
LOCATION '/user/impadmin/test/parquet/evolution/';

If you try to use the Hive table in a Pig (0.14) script:

A = LOAD 'dynschema' USING org.apache.hive.hcatalog.pig.HCatLoader();
DUMP A;

But you get this error:

java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.hive.serde2.io.DoubleWritable

  • This is suspected to be because the schema of the second file differs from the table schema: the first file's split is read successfully, but the exception occurs while reading the second file's split.
  • There is conversion logic from the data schema to the output schema, but while debugging, no difference was found between the two schemas.

1. Does Pig support reading data from a Hive table created over multiple Parquet files with different schemas?
2. If yes, how can this be done?

Solution 1:

If we have files with two different schemas, the following approach seems sensible:

1. Split up the files, based on which schema they have
2. Make tables out of them
3. If desirable, load the individual tables and store them into a supertable
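The steps above can be sketched in HiveQL. The table names and the per-schema subfolders (`v1/`, `v2/`) are hypothetical: they assume the Parquet files have first been moved into separate folders by schema (e.g. with `hdfs dfs -mv`). Because Hive resolves Parquet columns by name, selecting columns by name lets both tables be loaded into one supertable with a single column order:

```sql
-- 1. & 2. One external table per schema, each over its own folder
--         (paths and table names are hypothetical).
CREATE EXTERNAL TABLE dynschema_v1 (
  trans_date string,
  currency string,
  rate double)
STORED AS PARQUET
LOCATION '/user/impadmin/test/parquet/evolution/v1/';

CREATE EXTERNAL TABLE dynschema_v2 (
  trans_date string,
  rate double,
  currency string)
STORED AS PARQUET
LOCATION '/user/impadmin/test/parquet/evolution/v2/';

-- 3. If desirable, load both into a supertable with one fixed column order.
CREATE TABLE dynschema_all (
  trans_date string,
  currency string,
  rate double)
STORED AS PARQUET;

INSERT INTO TABLE dynschema_all
SELECT trans_date, currency, rate FROM dynschema_v1;

INSERT INTO TABLE dynschema_all
SELECT trans_date, currency, rate FROM dynschema_v2;
```

The Pig script can then load dynschema_all through HCatLoader as before, since every file under the supertable now matches the table schema.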

