sqoop - sqoop1 vs sqoop2 - apache sqoop - sqoop tutorial - sqoop hadoop



sqoop1 vs sqoop2 - Service Level Integration

  • Sqoop1
    • Hive, HBase
      • Require local installation
    • Oozie – von Neumann(esque) integration:
      • Package Sqoop as an action
      • Then run Sqoop from node machines, causing one MR job to be dependent on another MR job
      • Error prone, difficult to debug
  • Sqoop2
    • Hive, HBase
      • Server side integration
    • Oozie
      • REST API integration

    Sqoop1 Architecture :

  • Sqoop1 Challenges
    • Cryptic, contextual command line arguments
    • Tight coupling between data transfer and output format
    • Security concerns with openly shared credentials
    • Installation and configuration are not easy to manage
    • Connectors are forced to follow JDBC model
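To make the command-line and credential points concrete, here is a minimal Python sketch that assembles a Sqoop1 import command. The flags used (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop1 import options. The host, database, table, paths, and the helper name `build_sqoop1_import` are placeholders assumed for illustration. The command is only built and printed, not executed.

```python
import shlex


def build_sqoop1_import(jdbc_url, username, password_file, table,
                        target_dir, num_mappers=4):
    """Assemble the argv for a Sqoop1 import (hypothetical helper, not part of Sqoop)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC-style URL: every connector follows this model
        "--username", username,
        "--password-file", password_file,   # avoids putting the password itself on the command line
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),  # parallelism is chosen by the user per invocation
    ]


# Placeholder values, assumed for illustration only.
argv = build_sqoop1_import(
    jdbc_url="jdbc:mysql://db.example.com/sales",
    username="etl_user",
    password_file="/user/etl/.db-password",
    table="orders",
    target_dir="/data/raw/orders",
)
print(shlex.join(argv))
```

Every invocation has to carry the full connection details, which is where the concerns about shared credentials and hard-to-manage configuration come from.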

    Sqoop2 Architecture :


    Sqoop1: Client side Tool

  • Client side installation + configuration
    • Connectors are installed/configured locally
    • Local installation requires root privileges
    • JDBC drivers are needed locally
    • Database connectivity is needed locally

    Sqoop2: Sqoop as a Service

  • Server side installation + configuration
    • Connectors are installed/configured in one place
    • Managed by administrator and run by operator
    • JDBC drivers are needed in one place
    • Database connectivity is needed on the server

    Client Interface

  • Sqoop1 client interface:
    • Command line interface (CLI) based
    • Can be automated via scripting
  • Sqoop 2 client interface:
    • CLI based (in either interactive or script mode)
    • Web based (remotely accessible)
    • REST API is exposed for external tool integration
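As a rough illustration of what "REST API for external tool integration" means in practice, the Python sketch below builds the URL of a Sqoop 2 server's version resource and defines a small helper to fetch it. The host name is a placeholder, and port 12000 with the `/sqoop` root is assumed to be the server's default setup. Check your server's configuration and REST documentation before relying on either. No request is sent when the block runs.

```python
import json
from urllib.request import urlopen


def sqoop2_url(host, resource, port=12000):
    """Build a URL under the Sqoop 2 REST root (port and /sqoop root are assumed defaults)."""
    return f"http://{host}:{port}/sqoop/{resource.lstrip('/')}"


def fetch_json(url, timeout=10):
    """GET a REST resource and decode its JSON body. Requires a reachable server."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Only the URL is built here; fetch_json() is not called.
version_url = sqoop2_url("sqoop2.example.com", "version")
print(version_url)
```

An external tool such as a scheduler would call `fetch_json(version_url)` (or other resources) over HTTP, with no Sqoop installation on the calling machine.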

    Implementing Connectors :

  • Sqoop 1
    • Connectors are forced to follow JDBC model
      • Connectors are limited/required to use common JDBC vocabulary (URL, database, table, etc)
    • Connectors must implement all Sqoop functionality they want to support
      • New functionality may not be available for previously implemented connectors
    • Require knowledge of database idiosyncrasies
      • e.g. Couchbase does not need a table name, but Sqoop requires one, so --table gets overloaded to mean a backfill or dump operation
      • e.g. null string representation is not supported by all connectors
    • Functionality is limited to what the implicitly chosen connector supports
  • Sqoop 2
    • Connectors are not restricted to JDBC model
      • Connectors can define own domain
    • Common functionality is abstracted out of connectors
      • Connectors are only responsible for data transfer
      • Common Reduce phase implements data transformation and system integration
      • Connectors can benefit from future development of common functionality
    • Users make explicit connector choice
      • Less error-prone, more predictable
    • Users need not be aware of the functionality of all connectors
      • Couchbase users need not care that other connectors use tables
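The split between connector-specific transfer and shared functionality can be sketched as a toy model in Python. This is not Sqoop 2's connector API. The class names, the `extract` method, and the `run_import` function are all hypothetical, chosen only to show the idea that a connector moves records in its own domain while the framework applies common behavior such as null representation.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Toy connector: responsible only for moving records."""

    @abstractmethod
    def extract(self):
        """Yield records as dicts."""


class TableConnector(Connector):
    """JDBC-like domain: rows that belong to a named table."""

    def __init__(self, table, rows):
        self.table = table
        self.rows = rows

    def extract(self):
        yield from self.rows


class KeyValueConnector(Connector):
    """A connector with its own domain: keys and values, no table name needed."""

    def __init__(self, items):
        self.items = items

    def extract(self):
        for key, value in self.items.items():
            yield {"key": key, "value": value}


def run_import(connector, null_string="\\N"):
    """Framework-side common functionality: one null representation for every connector."""
    return [
        {k: (null_string if v is None else v) for k, v in record.items()}
        for record in connector.extract()
    ]


# The user picks the connector explicitly instead of having it inferred.
result = run_import(KeyValueConnector({"user:1": "alice", "user:2": None}))
print(result)
```

Because null handling lives in `run_import`, a connector written before that feature existed picks it up without changes, which is the benefit the list above describes.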

    Sqoop2 – Security :


  • Sqoop1: Security
    • Inherit/Propagate Kerberos principal for the jobs it launches
    • Access to files on HDFS can be controlled via HDFS security
    • Limited support (user/password) for secure access to external systems
  • Sqoop2: Security
    • Inherit/Propagate Kerberos principal for the jobs it launches
    • Access to files on HDFS can be controlled via HDFS security
    • Support for secure access to external systems via role-based access to connection objects
      • Administrators create/edit/delete connections
      • Operators use connections

    Sqoop2 – External System Access :

  • Sqoop1: External System Access
    • Every invocation requires necessary credentials to access external systems (e.g. relational database)
      • Workaround: create a user with limited access in lieu of giving out password
      • Does not scale
      • Permission granularity is hard to obtain
    • Hard to prevent misuse once credentials are given
  • Sqoop2: External System Access
    • Connections are enabled as first-class objects
    • Connections encompass credentials
    • Connections are created once and then used many times for various import/export jobs
    • Connections are created by administrator and used by operator
    • Safeguard credential access from end users
    • Connections can be restricted in scope based on operation (import/export)
    • Operators cannot abuse credentials
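The Sqoop2 model above can be illustrated with a small Python sketch. It is a conceptual model only, not Sqoop 2 code. `Connection`, `ConnectionStore`, the role strings, and the method names are hypothetical. It shows credentials stored once inside a connection object, created by an administrator, used by an operator, and restricted by operation.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """A connection as a first-class object that encapsulates its credentials."""
    name: str
    jdbc_url: str
    password: str = field(repr=False)  # never shown when the object is printed
    allowed_operations: frozenset = frozenset({"import", "export"})


class ConnectionStore:
    """Toy server-side registry enforcing the administrator/operator split."""

    def __init__(self):
        self._connections = {}

    def create(self, role, connection):
        if role != "admin":
            raise PermissionError("only administrators may create connections")
        self._connections[connection.name] = connection

    def submit_job(self, role, connection_name, operation):
        if role not in ("admin", "operator"):
            raise PermissionError("unknown role")
        conn = self._connections[connection_name]
        if operation not in conn.allowed_operations:
            raise PermissionError(f"{operation} is not allowed on {conn.name}")
        # The password stays on the server; the caller only gets a job description.
        return f"{operation} job submitted using connection '{conn.name}'"


store = ConnectionStore()
store.create("admin", Connection("sales-db", "jdbc:mysql://db.example.com/sales",
                                 password="s3cret",
                                 allowed_operations=frozenset({"import"})))
print(store.submit_job("operator", "sales-db", "import"))
```

The operator refers to the connection by name and never handles the password, and an export attempt on this import-only connection would raise `PermissionError`.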

    Sqoop2 – Resource Management :

  • Sqoop1: Resource Management
    • No explicit resource management policy
    • Users specify the number of map tasks to run
    • Cannot throttle load on external systems
  • Sqoop2: Resource Management
    • Connections allow specification of resource management policy
    • Administrators can limit the total number of physical connections open at one time
    • Connections can also be disabled
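One way to picture a per-connection resource policy is a counter that caps how many physical connections may be open at once. The Python sketch below is an assumption-laden illustration, not how Sqoop 2 implements it. `ConnectionLimiter` and its methods are hypothetical names.

```python
import threading


class ConnectionLimiter:
    """Toy policy: cap concurrent physical connections and allow disabling."""

    def __init__(self, max_open):
        self._slots = threading.BoundedSemaphore(max_open)
        self.enabled = True  # an administrator could switch this off

    def try_open(self):
        """Return True if a physical connection may be opened right now."""
        if not self.enabled:
            return False
        return self._slots.acquire(blocking=False)

    def close(self):
        """Give a slot back once a physical connection is closed."""
        self._slots.release()


limiter = ConnectionLimiter(max_open=2)
results = [limiter.try_open() for _ in range(3)]
print(results)  # [True, True, False]
```

The third request is refused because both slots are taken, which is the kind of throttling on external systems that Sqoop1 cannot express when each user picks their own number of map tasks.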
