[Solved-2 Solutions] How to use Cassandra's MapReduce with or without Pig?
- A developer writes a MapReduce program that uses Cassandra as the data source as follows.
What is a Custom InputFormat?
- The data to be processed on top of Hadoop is usually stored on a distributed file system, e.g. HDFS (the Hadoop Distributed File System).
- To read the data to be processed, Hadoop provides the InputFormat abstraction, which has two responsibilities: compute the input splits of the data, and provide the logic to read each input split.
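Those two responsibilities map directly onto the two methods of Hadoop's InputFormat contract. A minimal sketch, assuming the `org.apache.hadoop.mapreduce` API is on the classpath (the class name `SketchInputFormat` is illustrative, not from a real library):

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class SketchInputFormat<K, V> extends InputFormat<K, V> {

    // Responsibility 1: split the input into chunks, one per map task.
    @Override
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Responsibility 2: provide the reader that turns one split
    // into the key/value records fed to the mapper.
    @Override
    public abstract RecordReader<K, V> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;
}
```

Any data source (HDFS, Cassandra, or anything else) can be plugged into Hadoop by implementing these two methods.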
- We write a regular MapReduce program, and the jars that are now available provide a CustomInputFormat that allows the input source to be Cassandra rather than HDFS.
If we are using Pycassa, we are out of luck until either:
- (1) The maintainer of that project adds support for MapReduce, or
- (2) We throw some Python functions together that generate a Java MapReduce program and run it.
- Cassandra provides an implementation of InputFormat. In case you are new to Hadoop: the InputFormat is (basically) what the mapper uses to load your data.
- This subclass connects your mapper so that it pulls its data in from Cassandra. What is also great here is that the Cassandra folks have spent the time implementing the integration in the classic “Word Count” example.
- Cassandra rows or row fragments (that is, pairs of key + SortedMap of columns) are input to Map tasks for processing by your job, as specified by a SlicePredicate that describes which columns to fetch from each row. Here’s how this looks in the word_count example, which selects just one configurable columnName from each row:
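A sketch of that configuration, modeled on the word_count example shipped with older (Thrift-era) Cassandra releases; class names such as `ColumnFamilyInputFormat` and `ConfigHelper`, the port 9160, and the keyspace/column-family names are taken from that example and vary by version:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountSetup {

    public static Job configure(Configuration conf, String columnName) throws Exception {
        Job job = new Job(conf, "wordcount");
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Tell the InputFormat where the cluster is and what to read.
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");

        // The SlicePredicate selects which columns to fetch from each row;
        // here, just the single configurable columnName.
        SlicePredicate predicate = new SlicePredicate().setColumn_names(
                Arrays.asList(ByteBufferUtil.bytes(columnName)));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
        return job;
    }

    // Each map input is a row key plus a SortedMap of its fetched columns,
    // matching the "key + SortedMap of columns" description above.
    public static class TokenizerMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable> {
        // ... tokenize the column value and emit (word, 1) pairs ...
    }
}
```

The mapper's input key/value types are fixed by the InputFormat; only the SlicePredicate changes which columns arrive in the SortedMap.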
- Cassandra also provides a Pig LoadFunc for running jobs in the Pig DSL instead of writing Java code by hand.
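A sketch in Pig Latin, modeled on the example script that ships with Cassandra's Pig support; the `cassandra://` URI scheme and the `CassandraStorage` LoadFunc come from that contrib module, and details differ by version:

```pig
-- Load rows from Cassandra via its LoadFunc; each row arrives as a key
-- plus a bag of (name, value) column tuples.
rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});

-- Flatten out the columns and count how often each column name appears.
cols = FOREACH rows GENERATE FLATTEN(columns);
colnames = FOREACH cols GENERATE $0;
namegroups = GROUP colnames BY $0;
namecounts = FOREACH namegroups GENERATE group, COUNT($1);
DUMP namecounts;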