When to use Hadoop, HBase, Hive and Pig?
Use when dealing with large amounts of data:
- MapReduce is just a computing framework; HBase has nothing to do with it. That said, we can efficiently put or fetch data to/from HBase by writing MapReduce jobs.
- Alternatively, we can write sequential programs using other HBase APIs, such as the Java client API, to put or fetch data.
- But we use Hadoop, HBase, etc. to deal with gigantic amounts of data, so sequential access doesn't make much sense there: ordinary sequential programs would be highly inefficient when the data is that huge.
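To make the MapReduce model concrete, here is a minimal word-count sketch in plain Python. This is not the Hadoop API, just an illustration of the two phases: a map step that emits key/value pairs and a reduce step that aggregates them per key after a shuffle/sort.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Emit (word, 1) for every word, as a Mapper would.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Sorting and grouping by key stands in for the shuffle/sort step;
    # each group is then summed, as a Reducer would.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big cluster", "big data"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'cluster': 1, 'data': 2}
```

In real Hadoop the map and reduce functions run in parallel across the cluster, with HDFS holding the input and output; the logic per record, though, is exactly this simple.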
- Hadoop is basically 2 things: a distributed file system (HDFS) + a computation/processing framework (MapReduce).
- Like any other file system, HDFS provides us storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). But, being a file system, HDFS lacks random read and write access.
- This is where HBase comes into the picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs.
- Apache HBase is an open source NoSQL database that provides real-time read/write access to large datasets.
- HBase scales linearly to handle huge data sets with billions of rows and millions of columns, and it easily combines data sources that use a wide variety of different structures and schemas.
- HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
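HBase's data model can be pictured as a sorted map from row key to columns to values. The toy sketch below is not HBase itself (the table and column names are invented), but it shows what "random read/write access by key" buys you over scanning files:

```python
# Toy model of HBase's layout: row key -> {column -> value}.
# Illustration only -- real HBase adds column families, versions,
# and distributes regions of the sorted key space across servers.
table = {}

def put(row, column, value):
    # Random write: lands directly at (row, column).
    table.setdefault(row, {})[column] = value

def get(row, column):
    # Random read: jump straight to one row, no full scan needed.
    return table.get(row, {}).get(column)

put("user#42", "info:name", "Ada")
put("user#42", "info:city", "London")
put("user#7", "info:name", "Alan")

print(get("user#42", "info:name"))  # Ada

# A scan, by contrast, walks rows in sorted key order --
# which is how HBase serves range queries.
for row in sorted(table):
    print(row, table[row])
```

The key point is the contrast with plain HDFS files: HDFS is append-oriented and read sequentially, while this key-addressed layout supports point reads and writes.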
- Hive provides data warehousing facilities on top of an existing Hadoop cluster. Along with that, it provides an SQL-like interface (HiveQL) which makes the work easier.
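HiveQL reads much like standard SQL. The sketch below runs a comparable aggregation through Python's built-in sqlite3 purely to show the style of query Hive lets you express over HDFS data; real HiveQL has its own DDL (external tables, partitions, etc.), and the table and column names here are made up for the example.

```python
import sqlite3

# Illustration only: an aggregation in ordinary SQL, the kind of
# query HiveQL lets you run over files in HDFS. In Hive, this
# would be compiled down to distributed jobs rather than executed
# by a local engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "/home"), ("u1", "/about"), ("u2", "/home")],
)

rows = conn.execute(
    "SELECT url, COUNT(*) AS views FROM page_views "
    "GROUP BY url ORDER BY views DESC"
).fetchall()
print(rows)  # [('/home', 2), ('/about', 1)]
```

This is precisely the appeal of Hive: analysts who know SQL can query huge datasets without writing MapReduce code by hand.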
- Pig, on the other hand, is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly.
- Pig basically has 2 parts: the Pig interpreter and the language, PigLatin. We write Pig scripts in PigLatin and process them with the Pig interpreter.
- Pig makes our life a lot easier; writing MapReduce jobs by hand is not always easy.
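PigLatin expresses a job as a chain of whole-relation transformations (LOAD, FOREACH ... GENERATE, GROUP, ORDER) instead of hand-written map and reduce classes. The sketch below is plain Python, not PigLatin syntax; it only mimics that step-by-step dataflow style, with comments naming the PigLatin operation each step corresponds to, to show why the dataflow view of a word count is so much terser:

```python
from collections import Counter

# Dataflow-style word count: each line transforms the whole
# "relation", the way a PigLatin script chains its operators.
lines = ["big data big cluster", "big data"]          # LOAD
words = [w for line in lines for w in line.split()]   # FOREACH ... GENERATE
grouped = Counter(words)                              # GROUP ... / COUNT
top = grouped.most_common()                           # ORDER ... DESC
print(top)  # [('big', 3), ('data', 2), ('cluster', 1)]
```

The same job in raw MapReduce means writing mapper and reducer classes plus driver boilerplate; the Pig interpreter generates those jobs from the dataflow script for you.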