Apache Hive - Language Capabilities



  • Hive is a data warehousing infrastructure built on top of Apache Hadoop.
  • Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the MapReduce programming paradigm) on commodity hardware.
  • Hive enables easy data summarization, ad-hoc querying and analysis of large volumes of data.
  • It is best used for batch jobs over large sets of immutable data (such as web logs).
  • It provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to easily perform ad-hoc querying, summarization and data analysis.

Hive - Data Flow Architecture at Facebook :

[Figure: Hive data flow architecture at Facebook]


Hive Query Language capabilities:

  • The Hive query language provides basic SQL-like operations. These operations work on tables or partitions; a short HiveQL sketch follows this list.
    • Ability to create and manage tables and partitions (create, drop and alter).
    • Ability to support various relational, arithmetic and logical operators.
    • Ability to do various joins between two tables.
    • Ability to evaluate functions like aggregations on multiple “group by” columns in a table.
    • Ability to store the results of a query into another table.
    • Ability to download the contents of a table to a local directory.
    • Ability to create an external table that points to a specified location within HDFS.
    • Ability to store the results of a query in an HDFS directory.
    • Ability to plug in custom scripts, in the language of choice, for custom map/reduce jobs.
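A minimal HiveQL sketch of these capabilities (all table, column, script and path names, such as page_views, users, daily_counts, my_script.py and /tmp/pv_local, are hypothetical):

    -- Create and manage a partitioned table
    CREATE TABLE page_views (user_id BIGINT, url STRING, ip STRING)
      PARTITIONED BY (dt STRING);
    ALTER TABLE page_views ADD COLUMNS (referrer STRING);

    -- Join two tables and aggregate on multiple "group by" columns
    SELECT pv.dt, u.country, COUNT(1) AS views
    FROM page_views pv JOIN users u ON (pv.user_id = u.id)
    GROUP BY pv.dt, u.country;

    -- Store the results of a query into another table
    INSERT OVERWRITE TABLE daily_counts
    SELECT dt, COUNT(1) FROM page_views GROUP BY dt;

    -- Download the contents of a table to a local directory
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_local'
    SELECT * FROM page_views;

    -- Create an external table that points to a location within HDFS
    CREATE EXTERNAL TABLE page_views_ext (user_id BIGINT, url STRING)
      LOCATION '/data/page_views_ext';

    -- Plug in a custom script for the map/reduce work
    ADD FILE my_script.py;
    SELECT TRANSFORM (user_id, url) USING 'python my_script.py' AS (user_id, clean_url)
    FROM page_views;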

Hive & Hadoop Usage @ Facebook :

  • Hadoop/Hive Warehouse – the new generation at Facebook
    • 5,800 cores, raw storage capacity of 8.7 petabytes
    • 12 TB per node
    • Two-level network topology
      • 1 Gbit/sec from node to rack switch
      • 4 Gbit/sec to the top-level rack switch
  • Statistics per day at Facebook :
    • 12 TB of compressed new data added per day
    • 135 TB of compressed data scanned per day
    • 7500+ Hive jobs per day
    • 80K compute hours per day
  • Hive simplifies Hadoop at Facebook:
    • New engineers go through a Hive training session
    • ~200 people/month run jobs on Hadoop/Hive
    • Analysts (non-engineers) use Hadoop through Hive
    • 95% of jobs are Hive jobs
  • Types of Applications:
    • Reporting
      • E.g., daily/weekly aggregations of impression/click counts
      • Measures of user engagement
      • MicroStrategy reports
    • Ad hoc Analysis
      • E.g., how many group admins, broken down by state/country
    • Machine Learning (assembling training data)
      • Ad optimization
      • E.g., user engagement as a function of user attributes
    • Many others

Major Components of Hive and its Interaction with Hadoop:

  • Hive provides external interfaces like the command line (CLI) and web UI, and application programming interfaces (APIs) like JDBC and ODBC.
[Figure: Hive architecture]

  • The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages.
  • The Metastore is the system catalog. All other components of Hive interact with the Metastore.
  • The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution.
  • The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates this statement into a plan consisting of a DAG of map/reduce jobs.
  • The driver submits the individual map/reduce jobs from the DAG to the Execution Engine in topological order. Hive currently uses Hadoop as its execution engine.
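A quick way to see what the compiler produces is HiveQL's EXPLAIN statement, which prints the plan (the stages of map/reduce work) for a query without executing it; the sales table below is hypothetical:

    -- Print the plan the compiler generates for this query
    EXPLAIN
    SELECT day, SUM(profit)
    FROM sales
    GROUP BY day;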

ORCFile – Columnar Storage for Hive :

  • ORC is an optimized, compressed, columnar storage format
    • Only needed columns are read
    • Blocks of data can be skipped using indexes and predicate pushdown
[Figure: ORCFile columnar storage layout]
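A minimal sketch of creating an ORC-backed table and loading it from an existing table (the names sales and sales_staging are hypothetical):

    -- Store the table in the ORC columnar format
    CREATE TABLE sales (item STRING, profit DOUBLE, day INT)
      STORED AS ORC;

    -- Rewrite data from an existing (e.g. text-format) table as compressed, indexed ORC files
    INSERT OVERWRITE TABLE sales
    SELECT item, profit, day FROM sales_staging;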

  • Hive tables can be value-partitioned
    • Each partition is associated with a folder in HDFS
    • All partitions have an entry in the Hive Catalog
    • The Hive optimizer parses the query for filter conditions and skips unneeded partitions
[Figure: Hive partition folder structure in HDFS]
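A minimal sketch of value partitioning, assuming a hypothetical page_views table partitioned by a dt column; each distinct dt value becomes its own sub-folder under the table's warehouse directory:

    -- Each loaded dt value becomes a folder such as
    --   /user/hive/warehouse/page_views/dt=2016-01-03/
    CREATE TABLE page_views (user_id BIGINT, url STRING, ip STRING)
      PARTITIONED BY (dt STRING);

    -- The filter on dt lets the optimizer skip all other partition folders
    SELECT COUNT(1) FROM page_views WHERE dt = '2016-01-03';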


Hive ORCFile – Predicate Pushdown :

  • Predicate pushdown
    • ORC (and other storage formats) supports predicate pushdown
      • Query filters are pushed down into the storage handler
      • Based on the ORC index, blocks of data can be skipped without reading them from HDFS
    • Example: SELECT SUM(PROFIT) FROM SALES WHERE DAY = 03
[Figure: ORC predicate pushdown]
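A minimal sketch of the same idea against a hypothetical ORC-backed sales table (hive.optimize.ppd and hive.optimize.index.filter are standard Hive settings; they are shown explicitly here in case they are not already enabled by default in your version):

    -- Push query filters into the storage layer and use the ORC index to skip row groups
    SET hive.optimize.ppd=true;
    SET hive.optimize.index.filter=true;

    -- Row groups whose min/max statistics rule out DAY = 03 are skipped
    -- without being read from HDFS
    SELECT SUM(profit) FROM sales WHERE day = 03;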

Hive - Partitioning vs. Predicate Pushdown :

  • Both reduce the amount of data that needs to be read
    • Partitioning works at split generation, so no containers need to be started for pruned partitions
    • Predicate pushdown is applied during file reads
  • Partitioning is applied at split generation, in the optimizer
    • A large number of partitions puts load on the optimizer and HCatalog
    • Thousands of partitions will result in performance problems
  • Predicate pushdown needs to read the file footers
    • Containers are allocated even though they may run very quickly
    • No overhead in the optimizer/catalog

Hive - Partitioning and Predicate Pushdown :

[Figure: Combining partition pruning and predicate pushdown]
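A minimal sketch combining the two, assuming a hypothetical page_views table that is partitioned by dt and stored as ORC: the dt filter prunes whole partition folders at split generation, while the user_id filter is pushed down into the ORC reader for the files that remain:

    SELECT COUNT(1)
    FROM page_views
    WHERE dt = '2016-01-03'   -- partition pruning: splits are generated only for this folder
      AND user_id = 12345;    -- predicate pushdown: ORC row groups that cannot contain this value are skipped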


Hive - Loading Data with Dynamic Partitioning :

[Figure: Loading data with dynamic partitioning]
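A minimal sketch of loading with dynamic partitioning (the raw_page_views staging table is hypothetical; the two SET properties are standard Hive settings):

    -- Allow Hive to create partitions from the values produced by the SELECT
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- The dt value of each row decides which partition (HDFS folder) it is written to;
    -- the dynamic partition column must come last in the SELECT list
    INSERT OVERWRITE TABLE page_views PARTITION (dt)
    SELECT user_id, url, ip, dt FROM raw_page_views;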


Hive - More Real-World Use Cases :

  • Bizo: We use Hive for reporting and ad hoc queries.
  • Chitika: … for data mining and analysis …
  • CNET: … for data mining, log analysis and ad hoc queries
  • Digg: … data mining, log analysis, R&D, reporting/analytics
  • Grooveshark: … user analytics, dataset cleaning, machine learning R&D.
  • Hi5: … analytics, machine learning, social graph analysis.
  • HubSpot: … to serve near real-time web analytics.
  • Last.fm: … for various ad hoc queries.
  • Trending Topics: … for log data normalization and building sample data sets for trend detection R&D.
  • VideoEgg: … analyze all the usage data

Hive - Interfaces to Front-End UI Technology :

  • JDBC
    • Enables integration with JDBC-based SQL clients
  • ODBC
    • Enables integration with MicroStrategy
  • Thrift
    • Enables writing cross-language clients
    • Main form of integration with the PHP-based web UI

