Sqoop Import Mainframe - Apache Sqoop Tutorial



What is sqoop import-mainframe?

  • The import-mainframe tool imports all sequential datasets in a partitioned dataset (PDS) on a mainframe to HDFS.
  • A PDS is akin to a directory on open systems. The records in a dataset can contain only character data.
  • Records will be stored with the entire record as a single text field.

Syntax:

$ sqoop import-mainframe (generic-args) (import-args)
$ sqoop-import-mainframe (generic-args) (import-args)
Click "Copy code" button to copy into clipboard - By wikitechy - sqoop tutorial - team

While the Hadoop generic arguments must precede any import arguments, you can type the import arguments in any order with respect to one another.

Common arguments:

Argument                               Description
--connect <hostname>                   Specify mainframe host to connect
--connection-manager <class-name>      Specify connection manager class to use
--hadoop-mapred-home <dir>             Override $HADOOP_MAPRED_HOME
--help                                 Print usage instructions
--password-file                        Set path for a file containing the authentication password
-P                                     Read password from console
--password <password>                  Set authentication password
--username <username>                  Set authentication username
--verbose                              Print more information while working
--connection-param-file <filename>     Optional properties file that provides connection parameters
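As a hedged illustration only, several of these common arguments might be combined in a single command; the host name z390, the dataset name EMPLOYEES, and the password-file path below are placeholders, and --dataset is explained later in this tutorial:

# Illustrative sketch: connect to a mainframe host, read the password from a
# local file, and print verbose progress information while importing.
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username david --password-file ${user.home}/.password \
    --verbose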

Connecting to a Mainframe

  • Sqoop is designed to import mainframe datasets into HDFS. To do so, you must specify a mainframe host name in the Sqoop --connect argument.
$ sqoop import-mainframe --connect z390
Click "Copy code" button to copy into clipboard - By wikitechy - sqoop tutorial - team
  • This will connect to the mainframe host z390 via FTP.
  • You might need to authenticate against the mainframe host to access it. You can use --username to supply a username to the mainframe.
  • Sqoop provides a couple of different ways, secure and non-secure, to supply a password to the mainframe; these are detailed below.

Secure way of supplying password to the mainframe:

  • Save the password in a file in the user's home directory with 400 permissions and specify the path to that file using the --password-file argument; this is the preferred method of entering credentials.
  • Sqoop will then read the password from the file and pass it to the MapReduce cluster using secure means without exposing the password in the job configuration.
  • The file containing the password can either be on the Local FS or HDFS.

Example:

$ sqoop import-mainframe --connect z390 \
    --username david --password-file ${user.home}/.password
Click "Copy code" button to copy into clipboard - By wikitechy - sqoop tutorial - team

Another way of supplying passwords is using the -P argument which will read a password from a console prompt.


Note

  • The --password parameter is insecure, as other users may be able to read your password from the command-line arguments via the output of programs such as ps.
  • The -P argument is preferred over the --password argument. Even so, credentials may still be transferred between nodes of the MapReduce cluster using insecure means.

Example:

$ sqoop import-mainframe --connect z390 --username david --password 12345
Click "Copy code" button to copy into clipboard - By wikitechy - sqoop tutorial - team

Import control arguments:

Argument                     Description
--as-avrodatafile            Imports data to Avro Data Files
--as-sequencefile            Imports data to SequenceFiles
--as-textfile                Imports data as plain text (default)
--as-parquetfile             Imports data to Parquet Files
--delete-target-dir          Delete the import target directory if it exists
-m,--num-mappers <n>         Use n map tasks to import in parallel
--target-dir <dir>           HDFS destination dir
--warehouse-dir <dir>        HDFS parent for table destination
-z,--compress                Enable compression
--compression-codec <c>      Use Hadoop codec (default gzip)
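A hedged example combining a few of these import control arguments; the host, dataset, and target directory are placeholders:

# Illustrative sketch: import the dataset as SequenceFiles into an explicit
# target directory, removing any previous output and using 4 parallel tasks.
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser -P \
    --as-sequencefile --delete-target-dir \
    --target-dir /data/employees -m 4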

Selecting the Files to Import

  • You can use the --dataset argument to specify a partitioned dataset name. All sequential datasets in the partitioned dataset will be imported.
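For example (a hedged sketch; SOME.PDS.NAME is a placeholder dataset name):

# Import every sequential member of the partitioned dataset SOME.PDS.NAME.
$ sqoop import-mainframe --connect z390 --dataset SOME.PDS.NAME \
    --username SomeUser -P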

Controlling Parallelism

  • Sqoop imports data in parallel by making multiple FTP connections to the mainframe to transfer multiple files simultaneously.
  • You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument.
  • Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ.
  • By default, four tasks are used. You can adjust this value to maximize the data transfer rate from the mainframe.
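As a hedged sketch, the degree of parallelism is adjusted simply by passing -m or --num-mappers; the host and dataset below are placeholders:

# Illustrative only: use 2 parallel FTP transfers instead of the default 4,
# for example when the PDS contains only a few members.
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser -P --num-mappers 2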

Controlling Distributed Cache

  • Sqoop copies the jars in the $SQOOP_HOME/lib folder to the job cache every time it starts a Sqoop job.
  • When launched by Oozie this is unnecessary, since Oozie uses its own Sqoop share lib, which keeps the Sqoop dependencies in the distributed cache.
  • Oozie localizes the Sqoop dependencies on each worker node only once, during the first Sqoop job, and reuses the jars on the worker nodes for subsequent jobs.
  • Using the --skip-dist-cache option in the Sqoop command when launched by Oozie skips the step in which Sqoop copies its dependencies to the job cache, saving a large amount of I/O (see the sketch below).
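A hedged sketch of such a command, as it might appear inside an Oozie action; host, dataset, and password-file path are placeholders:

# Illustrative only: skip copying the $SQOOP_HOME/lib jars to the job cache,
# because Oozie's Sqoop share lib already provides them.
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser --password-file /user/someuser/.password \
    --skip-dist-cache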

Controlling the Import Process

  • By default, Sqoop will import all sequential files in a partitioned dataset pds to a directory named pds inside your home directory in HDFS.
  • For example, if your username is someuser, then the import tool will write to /user/someuser/pds/(files).
  • You can adjust the parent directory of the import with the --warehouse-dir argument. For example:
$ sqoop import-mainframe --connect <host> --dataset foo --warehouse-dir /shared \
    ...
Click "Copy code" button to copy into clipboard - By wikitechy - sqoop tutorial - team
  • This command would write to a set of files in the /shared/pds/ directory.
  • You can also explicitly choose the target directory, like so:
$ sqoop import-mainframe --connect <host> --dataset foo --target-dir /dest \
    ...
Click "Copy code" button to copy into clipboard - By wikitechy - sqoop tutorial - team
  • This will import the files into the /dest directory. --target-dir is incompatible with --warehouse-dir.
  • By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory’s contents.
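If you need to re-run an import into the same location, one option (a hedged sketch; host, dataset, and path are placeholders) is to let Sqoop remove the existing directory first with --delete-target-dir:

# Illustrative only: overwrite a previous import by deleting /dest first.
$ sqoop import-mainframe --connect <host> --dataset foo \
    --target-dir /dest --delete-target-dir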

File Formats:

  • By default, each record in a dataset is stored as a text record with a newline at the end.
  • Each record is assumed to contain a single text field with the name DEFAULT_COLUMN.
  • When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates.
  • You can also import mainframe records to Sequence, Avro, or Parquet files.
  • By default, data is not compressed.
  • You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument.
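For instance, a hedged sketch of a compressed Avro import; the codec class is the standard Hadoop Snappy codec, and the host and dataset are placeholders:

# Illustrative only: store the imported records as compressed Avro data files.
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser -P \
    --as-avrodatafile --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec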

Output line formatting arguments:

Argument                          Description
--enclosed-by <char>              Sets a required field enclosing character
--escaped-by <char>               Sets the escape character
--fields-terminated-by <char>     Sets the field separator character
--lines-terminated-by <char>      Sets the end-of-line character
--mysql-delimiters                Uses MySQL's default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char>   Sets a field enclosing character

Since a mainframe record contains only one field, importing to delimited files will not produce any field delimiter. However, the field may be enclosed with an enclosing character or escaped by an escape character, as in the sketch below.
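A hedged sketch of enclosing and escaping the single text field (host and dataset are placeholders):

# Illustrative only: enclose each record's single text field in double quotes
# and use a backslash as the escape character.
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser -P \
    --enclosed-by '\"' --escaped-by \\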

Input parsing arguments:

Argument                                  Description
--input-enclosed-by <char>                Sets a required field encloser
--input-escaped-by <char>                 Sets the input escape character
--input-fields-terminated-by <char>       Sets the input field separator
--input-lines-terminated-by <char>        Sets the input end-of-line character
--input-optionally-enclosed-by <char>     Sets a field enclosing character
  • When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import.
  • The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk, and how the generated parse() method reinterprets this data.
  • The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on.
  • This is useful, for example, to generate classes which can parse records created with one set of delimiters, and emit the records to a different set of files using a separate set of delimiters.
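A hedged sketch of choosing the output and parse-side formats independently; this is purely illustrative, and the host and dataset are placeholders:

# Illustrative only: write records enclosed in double quotes, while the
# generated parse() method is configured to expect single-quote enclosure.
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser -P \
    --enclosed-by '\"' --input-enclosed-by "'"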

Example Invocations

  • The following examples illustrate how to use the import tool in a variety of situations.
  • A basic import of all sequential files in a partitioned dataset named EMPLOYEES in the mainframe host z390:
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser -P
Enter password: (hidden)
Click "Copy code" button to copy into clipboard - By wikitechy - sqoop tutorial - team
  • Controlling the import parallelism (using 8 parallel tasks):
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser --password-file mypassword -m 8
Click "Copy code" button to copy into clipboard - By wikitechy - sqoop tutorial - team
  • Importing the data to Hive:
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --hive-import
Click "Copy code" button to copy into clipboard - By wikitechy - sqoop tutorial - team

