[Solved - 1 Solution] Running a Pig query over data stored in Hive?
What is a Pig query?
- Pig can be used to run a query that finds the rows which exceed a threshold value, or to join two different datasets on a key.
- Pig can be used to run iterative algorithms over a dataset. It is ideal for ETL operations (Extract, Transform, Load): it lets you spell out, step by step, how the data is to be transformed, and it can handle data with an inconsistent schema.
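The uses above (threshold filter, key join) can be sketched in a few lines of Pig Latin. This is a minimal illustration only; the paths, field names and the threshold value are made up:

```pig
-- hypothetical inputs; adjust paths, delimiters and schemas to your data
users  = LOAD '/data/users'  USING PigStorage('\t') AS (id:int, name:chararray);
scores = LOAD '/data/scores' USING PigStorage('\t') AS (id:int, score:int);

-- rows that exceed a threshold value
high   = FILTER scores BY score > 100;

-- join two different datasets on a key
joined = JOIN users BY id, high BY id;
DUMP joined;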
What is Apache Hive?
- Apache Hive enables advanced work on the Apache Hadoop Distributed File System (HDFS) and MapReduce. It allows SQL developers to write Hive Query Language (HQL) statements similar to standard SQL ones.
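For example, an HQL statement reads much like standard SQL (the table and columns here are hypothetical):

```sql
-- hypothetical table; HiveQL looks like ordinary SQL
SELECT page, COUNT(*) AS hits
FROM access_logs
WHERE status = 404
GROUP BY page;
```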
How to run Pig queries over data stored in Hive's format?
We have configured Hive to store compressed data. Previously we simply used Pig's normal load function with Hive's default delimiter (^A), but now Hive stores the data in compressed sequence files. Which load function should we use?
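For reference, the "normal" load mentioned above is just PigStorage with Hive's default field delimiter, Ctrl-A (^A, i.e. \u0001). A minimal sketch, with a placeholder warehouse path:

```pig
-- plain-text Hive table: fields separated by Ctrl-A ('\u0001');
-- some Pig versions accept '\u0001' directly, others need the '\\u0001' escape
raw = LOAD '/user/hive/warehouse/mytable'
      USING PigStorage('\\u0001')
      AS (id:int, name:chararray);
```

This only works for uncompressed text tables, which is why a different loader is needed once compression is enabled.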
Here's what we found out: using HiveColumnarLoader makes sense if the data is stored as an RCFile.
To load a table with this loader we first need to register some jars:
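A sketch of the registration and the load, assuming typical jar locations and a made-up table; HiveColumnarLoader takes the table's column list as a 'name type' schema string:

```pig
-- register the Hive and PiggyBank jars (paths depend on your installation)
REGISTER /usr/lib/hive/lib/hive-exec.jar;
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;

-- hypothetical RCFile-backed table with three columns
rc = LOAD '/user/hive/warehouse/my_rcfile_table'
     USING org.apache.pig.piggybank.storage.HiveColumnarLoader('id int,name string,score int');
```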
- To load data from a sequence file we have to use PiggyBank (as in the previous example). The SequenceFileLoader from PiggyBank should handle compressed files:
- This may not work with Pig 0.7, because it is unable to read the BytesWritable type and cast it to a Pig type, and the load may fail with a cast exception.
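A minimal sketch of the sequence-file load, again with a placeholder jar path and table location; SequenceFileLoader yields (key, value) pairs:

```pig
-- piggybank.jar path depends on your installation
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;

-- the loader handles the (compressed) sequence files transparently
seq = LOAD '/user/hive/warehouse/my_seq_table'
      USING org.apache.pig.piggybank.storage.SequenceFileLoader()
      AS (key, value);
```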