Apache Pig tutorial: Pig vs Hive vs SQL
- SQL programmers needed a language that was relatively easy to learn for someone with a SQL background, free of SQL’s excess baggage, and able to handle large data sets easily.
- Originally developed at Yahoo Research in 2006, Pig addressed all these issues and provided better optimization scope and extensibility.
- Apache Pig also lets developers follow a multi-query approach, which reduces the number of passes over the data.
- It has provisions for a number of nested data types (Maps, Tuples and Bags) and commonly used data operations such as Filters, Ordering and Joins.
- These advantages have led to Pig being adopted by a large number of users around the globe.
- Structured Query Language (SQL) has been a programmer’s companion for decades. It was the de facto solution for extracting data for further processing.
- Big Data has changed how we visualize and process data.
- SQL’s demand that data be stored under a strict relational schema, together with its declarative nature, often deflects focus from the ultimate purpose: extracting data for analysis.
- For all its processing power, Pig requires programmers to learn and master something new on top of SQL.
- Hive statements are remarkably similar to SQL, and despite the limitations of Hive Query Language (HQL) in terms of the commands that it understands, it is still very useful.
- Hive provides an excellent open source SQL-like layer on top of Hadoop MapReduce.
- It works well for processing data stored in a distributed manner, unlike SQL, which requires strict adherence to schemas when storing data.
- Among the three approaches to data extraction, processing and analysis, no single one fits every case.
- A number of factors such as data storage approach, programming language architecture and expected results should be given due consideration before making the choice.
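The nested data types and operations listed earlier (tuples, bags and maps; filtering, ordering and joining) can be modeled roughly in plain Python. This is only an illustrative sketch, not Pig Latin and not how Pig is implemented: the helper names `filter_bag`, `order_bag` and `join_bags` and all sample values are hypothetical.

```python
# Illustrative model only: plain Python stand-ins for Pig's nested types.
# A tuple is an ordered set of fields, a bag is a collection of tuples,
# and a map is a set of key/value pairs.

# Bag of (name, dept_id, details-map) tuples; all values are made up.
employees = [
    ("asha", 10, {"city": "Pune", "level": 3}),
    ("ben", 20, {"city": "Leeds", "level": 1}),
    ("chen", 10, {"city": "Xi'an", "level": 2}),
]
# Bag of (dept_id, dept_name) tuples.
departments = [(10, "engineering"), (20, "sales")]


def filter_bag(bag, predicate):
    """Keep only the tuples for which predicate is true (a filter step)."""
    return [t for t in bag if predicate(t)]


def order_bag(bag, key):
    """Return the tuples sorted by the given key (an ordering step)."""
    return sorted(bag, key=key)


def join_bags(left, right, left_key, right_key):
    """Inner join: concatenate each pair of tuples whose keys match."""
    index = {}
    for r in right:
        index.setdefault(right_key(r), []).append(r)
    return [l + r for l in left for r in index.get(left_key(l), [])]


# Filter on a value inside the nested map, then order, then join.
senior = filter_bag(employees, lambda t: t[2]["level"] >= 2)
ordered = order_bag(senior, key=lambda t: t[0])
joined = join_bags(ordered, departments, lambda t: t[1], lambda t: t[0])

for row in joined:
    print(row)
```

Each step produces a new named collection that feeds the next one, which is roughly the shape a Pig script takes as well.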
Pig vs SQL
- The database systems that SQL runs on are generally considered faster than MapReduce, on which Pig runs through Pig Latin scripts. However, loading data into an RDBMS is more challenging, which makes setup difficult.
- Pig Latin offers a number of advantages in terms of declaring execution plans, ETL routines and pipeline modification.
- SQL is declarative, while Pig Latin is procedural to a large extent.
- In other words, in SQL we largely specify “what” is to be accomplished, whereas in Pig we spell out “how” a task is to be performed, step by step.
- A script written in Pig is converted to one or more MapReduce jobs in the background before it is executed.
- A Pig script is typically much shorter than the equivalent hand-written MapReduce program, which significantly cuts down development time.
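The declarative versus procedural distinction above can be sketched in Python. The first half states the desired result as a single SQL query (run here with the standard-library `sqlite3` module); the second half spells out the same computation as an explicit sequence of dataflow steps, which is the style a Pig script follows. The table name `logs`, its columns and the sample records are assumptions made up for illustration, and the procedural half is plain Python rather than Pig Latin.

```python
import sqlite3

# Hypothetical sample data: (username, bytes_transferred) log records.
records = [("ann", 120), ("bob", 40), ("ann", 300), ("cid", 75), ("bob", 500)]

# Declarative: state WHAT result is wanted; the engine chooses the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (username TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?)", records)
declarative = conn.execute(
    "SELECT username, SUM(bytes) AS total FROM logs "
    "WHERE bytes >= 50 GROUP BY username ORDER BY total DESC"
).fetchall()
conn.close()

# Procedural dataflow: spell out HOW, one named step at a time.
filtered = [r for r in records if r[1] >= 50]        # step 1: filter
grouped = {}                                         # step 2: group by user
for username, nbytes in filtered:
    grouped.setdefault(username, []).append(nbytes)
totals = [(u, sum(vals)) for u, vals in grouped.items()]       # step 3: sum
procedural = sorted(totals, key=lambda t: t[1], reverse=True)  # step 4: order

print(declarative)
print(procedural)
```

Both halves produce the same rows. The difference is that every intermediate result in the second half is visible and can be modified or reused, which is what makes pipeline changes easy in a procedural dataflow.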
Hive vs SQL
- SQL is a general purpose database language that has extensively been used for both transactional and analytical queries.
- Hive, on the other hand, is built with an analytical focus. Hive has traditionally lacked row-level update and delete operations (newer releases support them only on transactional tables), but at very large data volumes it reads and processes data faster than a traditional SQL database.
- Hence, even though HiveQL is SQL-like, its limited support for modifying or deleting data is a major difference.