What is Cloudera Impala?

Answer: Impala was the first to bring interactive SQL querying on Hadoop to the public, in April 2013…

Cloudera’s Impala

Impala was the first to bring interactive SQL querying on Hadoop to the public, in April 2013. It comes with a number of notable features:
  • Impala can query many file formats, such as Parquet, Avro, Text, RCFile and SequenceFile
  • Impala supports data stored in HDFS, Apache HBase and Amazon S3
  • Impala supports multiple compression codecs:
    • Snappy (recommended for its effective balance between compression ratio and decompression speed),
    • Gzip (recommended when the highest level of compression is the goal),
    • Deflate (not supported for text files), Bzip2, LZO (for text files only);
  • Impala provides security through:
    • Authorization based on Sentry (OS user ID): which users are allowed to access which resources, and what operations they are allowed to perform,
    • Authentication based on Kerberos, plus the ability to specify an Active Directory username/password: how Impala verifies the identity of users to confirm that they are allowed to exercise the privileges assigned to them,
    • Auditing: what operations were attempted and whether they succeeded, which allows suspicious activity to be tracked down; audit data are collected by Cloudera Manager;
  • Impala supports SSL network encryption between Impala and client programs, and between the Impala-related daemons running on different nodes in the cluster;
  • Impala supports user-defined functions (UDFs) and user-defined aggregate functions (UDAFs);
  • Impala automatically orders joins to make them as efficient as possible;
  • Impala provides admission control: prioritization and queueing of queries within Impala;
  • Impala allows multi-user concurrent queries;
  • Impala caches frequently accessed data in memory;
  • Impala computes statistics (with COMPUTE STATS);
  • Impala provides window functions (aggregation with OVER and PARTITION BY, RANK, LEAD, LAG, NTILE, and so on) for more advanced SQL analytic capabilities (since version 2.0);
  • Impala allows external joins and aggregation using disk (since version 2.0): operations can spill to disk if their internal state exceeds the aggregate memory size;
  • Impala allows subqueries inside WHERE clauses;
  • Impala allows incremental statistics: statistics are computed only on new or changed data, which makes statistics computation even faster;
  • Impala enables queries on complex nested structures including maps, structs and arrays;
  • Impala enables merging updates into existing tables (MERGE);
  • Impala enables some OLAP functions (ROLLUP, CUBE, GROUPING SETS);
  • Impala can be used for inserts and updates into HBase.
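
To make the window functions mentioned in the list above more concrete, here is a minimal sketch of RANK, LAG and an aggregate with OVER and PARTITION BY. It is an illustration under stated assumptions, not Impala itself: it uses Python's built-in sqlite3 module (which needs SQLite 3.25 or later for window functions) as a stand-in engine, and the `sales` table, its columns and its rows are hypothetical. The SELECT is written in standard SQL and is expected to be accepted by Impala 2.0 or later, although that has not been verified here.

```python
import sqlite3

# In-memory SQLite stands in for Impala; the table, columns and rows
# below are hypothetical and exist only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", 1, 100),
        ("east", 2, 150),
        ("east", 3, 120),
        ("west", 1, 200),
        ("west", 2, 180),
    ],
)

WINDOW_QUERY = """
SELECT region,
       month,
       amount,
       -- rank of each month's amount within its region, highest first
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
       -- amount of the previous month in the same region (NULL for the first)
       LAG(amount) OVER (PARTITION BY region ORDER BY month) AS prev_amount,
       -- aggregation over the whole partition, repeated on every row
       SUM(amount) OVER (PARTITION BY region) AS region_total
FROM sales
ORDER BY region, month
"""

rows = conn.execute(WINDOW_QUERY).fetchall()
for row in rows:
    print(row)
```

Unlike GROUP BY, the window functions keep one output row per input row, so each month can be shown next to its rank, the previous month's amount and the regional total.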