HdfsFile

Hdfs File Source Connector

Key Features

  • Read all the data in a split in a single pollNext call. The splits that have been read are saved in the snapshot.

  • Read data from the HDFS file system.

Supported DataSource Info

| Datasource | Supported Versions |
|------------|--------------------|
| HdfsFile   | hadoop 2.x and 3.x |

Source Options

| Name | Type | Required | Default | Description |
|------|------|----------|---------|-------------|
| path | string | yes | - | The source file path. |
| file_format_type | string | yes | - | The supported file types are: text, csv, parquet, orc, json, excel, xml, binary. Note that the final file name will end with the file format's suffix; the suffix of the text format is txt. |
| fs.defaultFS | string | yes | - | The Hadoop cluster address, starting with hdfs://, for example: hdfs://hadoopcluster. |
| read_columns | list | no | - | The column list to read from the data source; it can be used to implement field projection. The file types that support column projection are: text, json, csv, orc, parquet, excel, xml. Tip: to use this feature when reading text, json, or csv files, the schema option must be configured. |
| hdfs_site_path | string | no | - | The path of hdfs-site.xml, used to load the HA configuration of NameNodes. |
| delimiter/field_delimiter | string | no | \001 | Field delimiter, used to tell the connector how to slice fields when reading text files. Defaults to \001, the same as Hive's default delimiter. |
| parse_partition_from_path | boolean | no | true | Controls whether partition keys and values are parsed from the file path. For example, if you read a file from the path hdfs://hadoop-cluster/tmp/nexus/parquet/name=tyrantlucifer/age=26, the fields [name:tyrantlucifer, age:26] will be added to every record read from the file. Tip: do not define partition fields in the schema option. |
| date_format | string | no | yyyy-MM-dd | Date type format, used to tell the connector how to convert a string to a date. Supported formats: yyyy-MM-dd, yyyy.MM.dd, yyyy/MM/dd. Default: yyyy-MM-dd. |
| datetime_format | string | no | yyyy-MM-dd HH:mm:ss | Datetime type format, used to tell the connector how to convert a string to a datetime. Supported formats: yyyy-MM-dd HH:mm:ss, yyyy.MM.dd HH:mm:ss, yyyy/MM/dd HH:mm:ss, yyyyMMddHHmmss. Default: yyyy-MM-dd HH:mm:ss. |
| time_format | string | no | HH:mm:ss | Time type format, used to tell the connector how to convert a string to a time. Supported formats: HH:mm:ss, HH:mm:ss.SSS. Default: HH:mm:ss. |
| remote_user | string | no | - | The login user used to connect to Hadoop. It is intended for remote users in RPC and will not carry any credentials. |
| krb5_path | string | no | /etc/krb5.conf | The krb5 path of Kerberos. |
| kerberos_principal | string | no | - | The principal of Kerberos. |
| kerberos_keytab_path | string | no | - | The keytab path of Kerberos. |
| skip_header_row_number | long | no | 0 | Skip the first few lines, but only for txt and csv files. For example, setting skip_header_row_number = 2 makes Nexus skip the first 2 lines of the source files. |
| schema | config | no | - | The schema fields of the upstream data. |
| sheet_name | string | no | - | The sheet of the workbook to read. Only used when file_format_type is excel. |
| xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file. Only used when file_format_type is xml. |
| xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format. Only used when file_format_type is xml. |
| compress_codec | string | no | none | The compress codec of files. |
| encoding | string | no | UTF-8 | The encoding of the file to read. Only used when file_format_type is json, text, csv, or xml. |
| common-options |  | no | - | Source plugin common parameters, please refer to Source Common Options for details. |
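
As an illustration of field projection, the minimal source block below is a sketch only; the cluster address, path, and field names (name, age, city) are placeholder assumptions, not values from a real deployment.

```hocon
source {
  HdfsFile {
    fs.defaultFS = "hdfs://hadoopcluster"   # placeholder cluster address
    path = "/tmp/nexus/text"                # placeholder source path
    file_format_type = "text"
    field_delimiter = "\t"
    # Column projection on text files requires the schema option.
    schema {
      fields {
        name = string
        age = int
        city = string
      }
    }
    # Only these two fields will be emitted downstream.
    read_columns = ["name", "age"]
  }
}
```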

delimiter/field_delimiter [string]

The delimiter parameter will be deprecated after version 2.3.5; please use field_delimiter instead.

compress_codec [string]

The compress codec of files. The supported codecs are:

  • txt: lzo none

  • json: lzo none

  • csv: lzo none

  • orc/parquet: automatically recognizes the compression type, no additional settings required.
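
For example, reading lzo-compressed text files only needs the codec set on the source. This is a sketch; the address and path are placeholders.

```hocon
source {
  HdfsFile {
    fs.defaultFS = "hdfs://hadoopcluster"   # placeholder cluster address
    path = "/data/logs"                     # placeholder source path
    file_format_type = "text"
    compress_codec = "lzo"                  # txt/json/csv accept lzo or none
  }
}
```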

encoding [string]

Only used when file_format_type is json, text, csv, or xml. The encoding of the file to read. This parameter will be parsed by Charset.forName(encoding).
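
For instance, a sketch reading GBK-encoded csv files; the address and path are placeholders, and the charset name must be one that Java's Charset.forName accepts.

```hocon
source {
  HdfsFile {
    fs.defaultFS = "hdfs://hadoopcluster"   # placeholder cluster address
    path = "/data/gbk_csv"                  # placeholder source path
    file_format_type = "csv"
    encoding = "GBK"                        # parsed via Charset.forName
  }
}
```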

Task Example

This example defines a Nexus synchronization task that reads data from HDFS and writes it back to HDFS.
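
The configuration below is a minimal sketch assembled from the options documented above; the cluster address, paths, and schema fields are placeholder assumptions rather than values from a real deployment.

```hocon
# Defines the runtime environment
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  HdfsFile {
    fs.defaultFS = "hdfs://hadoopcluster"   # placeholder cluster address
    path = "/apps/demo/student"             # placeholder source path
    file_format_type = "json"
    # json files need a schema so the connector knows the field types
    schema {
      fields {
        name = string
        age = int
      }
    }
  }
}

sink {
  HdfsFile {
    fs.defaultFS = "hdfs://hadoopcluster"   # placeholder cluster address
    path = "/tmp/nexus/output"              # placeholder sink path
    file_format_type = "orc"
  }
}
```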
