S3File
S3 File Source Connector
Read all the data in a split in a single pollNext call. The splits that have been read are saved in the snapshot.
Read data from the AWS S3 file system.
Data type mapping depends on the type of file being read. The following file types are supported: `text`, `csv`, `parquet`, `orc`, `json`, `excel`, `xml`.
If you assign the file type to `json`, you should also assign the `schema` option to tell the connector how to parse the data into the rows you want.
For example, upstream data is the following:
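(The record below is illustrative; the field names `code`, `data`, and `success` are assumptions reused in the schema sketch and output table that follow.)

```json
{"code": 200, "data": "get success", "success": true}
```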
You can also save multiple pieces of data in one file and split them by newline:
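(Again, illustrative values only:)

```json
{"code": 200, "data": "get success", "success": true}
{"code": 300, "data": "get failed", "success": false}
```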
You should assign the schema as the following:
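(A minimal sketch, assuming the `schema` option accepts a `schema { fields { ... } }` block; the field names and types match the illustrative record above.)

```hocon
schema {
  fields {
    code = int
    data = string
    success = boolean
  }
}
```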
The connector will generate data as the following:

| code | data        | success |
|------|-------------|---------|
| 200  | get success | true    |
If you assign the file type to `text` or `csv`, you can choose whether or not to specify the schema information.
For example, upstream data is the following:
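(A one-line text file containing the sample record used in this section:)

```text
tyrantlucifer#26#male
```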
If you do not assign the data schema, the connector will treat the upstream data as a single field:

| content               |
|-----------------------|
| tyrantlucifer#26#male |
If you assign the data schema, you should also assign the `field_delimiter` option (except for the CSV file type). You should assign the schema and delimiter as the following:
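(A sketch of the delimiter and schema settings; the field names `name`, `age`, and `gender` are assumptions chosen to match the sample record.)

```hocon
field_delimiter = "#"
schema {
  fields {
    name = string
    age = int
    gender = string
  }
}
```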
The connector will generate data as the following:

| name          | age | gender |
|---------------|-----|--------|
| tyrantlucifer | 26  | male   |
If you assign the file type to `parquet` or `orc`, the `schema` option is not required; the connector can infer the schema of the upstream data automatically. Orc data types map to Nexus data types as follows:
| Orc Data Type       | Nexus Data Type                                         |
|---------------------|---------------------------------------------------------|
| BOOLEAN             | BOOLEAN                                                 |
| INT                 | INT                                                     |
| BYTE                | BYTE                                                    |
| SHORT               | SHORT                                                   |
| LONG                | LONG                                                    |
| FLOAT               | FLOAT                                                   |
| DOUBLE              | DOUBLE                                                  |
| BINARY              | BINARY                                                  |
| STRING VARCHAR CHAR | STRING                                                  |
| DATE                | LOCAL_DATE_TYPE                                         |
| TIMESTAMP           | LOCAL_DATE_TIME_TYPE                                    |
| DECIMAL             | DECIMAL                                                 |
| LIST(STRING)        | STRING_ARRAY_TYPE                                       |
| LIST(BOOLEAN)       | BOOLEAN_ARRAY_TYPE                                      |
| LIST(TINYINT)       | BYTE_ARRAY_TYPE                                         |
| LIST(SMALLINT)      | SHORT_ARRAY_TYPE                                        |
| LIST(INT)           | INT_ARRAY_TYPE                                          |
| LIST(BIGINT)        | LONG_ARRAY_TYPE                                         |
| LIST(FLOAT)         | FLOAT_ARRAY_TYPE                                        |
| LIST(DOUBLE)        | DOUBLE_ARRAY_TYPE                                       |
| Map<K,V>            | MapType; the K and V types are transformed to Nexus types |
| STRUCT              | NexusRowType                                            |
The same holds for `parquet` files: the `schema` option is not required and the connector infers the schema of the upstream data automatically. Parquet data types map to Nexus data types as follows:
| Parquet Data Type    | Nexus Data Type                                         |
|----------------------|---------------------------------------------------------|
| INT_8                | BYTE                                                    |
| INT_16               | SHORT                                                   |
| DATE                 | DATE                                                    |
| TIMESTAMP_MILLIS     | TIMESTAMP                                               |
| INT64                | LONG                                                    |
| INT96                | TIMESTAMP                                               |
| BINARY               | BYTES                                                   |
| FLOAT                | FLOAT                                                   |
| DOUBLE               | DOUBLE                                                  |
| BOOLEAN              | BOOLEAN                                                 |
| FIXED_LEN_BYTE_ARRAY | TIMESTAMP, DECIMAL                                      |
| DECIMAL              | DECIMAL                                                 |
| LIST(STRING)         | STRING_ARRAY_TYPE                                       |
| LIST(BOOLEAN)        | BOOLEAN_ARRAY_TYPE                                      |
| LIST(TINYINT)        | BYTE_ARRAY_TYPE                                         |
| LIST(SMALLINT)       | SHORT_ARRAY_TYPE                                        |
| LIST(INT)            | INT_ARRAY_TYPE                                          |
| LIST(BIGINT)         | LONG_ARRAY_TYPE                                         |
| LIST(FLOAT)          | FLOAT_ARRAY_TYPE                                        |
| LIST(DOUBLE)         | DOUBLE_ARRAY_TYPE                                       |
| Map<K,V>             | MapType; the K and V types are transformed to Nexus types |
| STRUCT               | NexusRowType                                            |
| name | type | required | default value | description |
|------|------|----------|---------------|-------------|
| path | string | yes | - | The s3 path that needs to be read. It can have sub paths, but the sub paths need to meet certain format requirements; see the `parse_partition_from_path` option for the specific requirements. |
| file_format_type | string | yes | - | File type. The following file types are supported: `text`, `csv`, `parquet`, `orc`, `json`, `excel`, `xml`, `binary`. |
| bucket | string | yes | - | The bucket address of the s3 file system, for example: `s3n://nexus-test`. If you use the `s3a` protocol, this parameter should be `s3a://nexus-test`. |
| fs.s3a.endpoint | string | yes | - | The fs s3a endpoint. |
| fs.s3a.aws.credentials.provider | string | yes | com.amazonaws.auth.InstanceProfileCredentialsProvider | The way to authenticate s3a; see the note below. |
| read_columns | list | no | - | The read column list of the data source; the user can use it to implement field projection. Column projection is supported for the following file types: `text`, `csv`, `parquet`, `orc`, `json`, `excel`, `xml`. If the user wants to use this feature when reading `text`, `json`, or `csv` files, the `schema` option must be configured. |
| access_key | string | no | - | Only used when `fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider`. |
| access_secret | string | no | - | Only used when `fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider`. |
| hadoop_s3_properties | map | no | - | Additional hadoop s3 properties; see the note below. |
| delimiter/field_delimiter | string | no | \001 | Field delimiter, used to tell the connector how to slice fields when reading text files. Default `\001`, the same as Hive's default delimiter. |
| parse_partition_from_path | boolean | no | true | Controls whether partition keys and values are parsed from the file path. For example, if you read a file from the path `s3n://hadoop-cluster/tmp/nexus/parquet/name=tyrantlucifer/age=26`, every record read from the file will have these two fields added: name="tyrantlucifer", age=26. |
| date_format | string | no | yyyy-MM-dd | Date type format, used to tell the connector how to convert a string to a date. Supported formats: `yyyy-MM-dd`, `yyyy.MM.dd`, `yyyy/MM/dd`. Default `yyyy-MM-dd`. |
| datetime_format | string | no | yyyy-MM-dd HH:mm:ss | Datetime type format, used to tell the connector how to convert a string to a datetime. Supported formats: `yyyy-MM-dd HH:mm:ss`, `yyyy.MM.dd HH:mm:ss`, `yyyy/MM/dd HH:mm:ss`, `yyyyMMddHHmmss`. |
| time_format | string | no | HH:mm:ss | Time type format, used to tell the connector how to convert a string to a time. Supported formats: `HH:mm:ss`, `HH:mm:ss.SSS`. |
| skip_header_row_number | long | no | 0 | Skip the first few lines, but only for txt and csv files. For example, with `skip_header_row_number = 2`, Nexus will skip the first 2 lines of the source files. |
| schema | config | no | - | The schema of the upstream data. |
| sheet_name | string | no | - | Read the sheet of the workbook. Only used when file_format_type is excel. |
| xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file. Only valid for XML files. |
| xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format. Only valid for XML files. |
| compress_codec | string | no | none | The compress codec of files; the supported codecs are listed below. |
| encoding | string | no | UTF-8 | The encoding of the file to read; see the note below. |
| common-options | | no | - | Source plugin common parameters; see the note below. |
Note: the `delimiter` parameter will be deprecated after version 2.3.5; please use `field_delimiter` instead.
The `compress_codec` option specifies the compress codec of files. The supported codecs are:
- txt: `lzo`, `none`
- json: `lzo`, `none`
- csv: `lzo`, `none`
- orc/parquet: the compression type is recognized automatically; no additional settings are required.
The `encoding` option is only used when file_format_type is json, text, csv, or xml. It specifies the encoding of the file to read and is parsed by `Charset.forName(encoding)`.
In this example, we read data from the s3 path `s3a://nexus-test/nexus/text`, and the file type in this path is orc. We use `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` for authentication, so `access_key` and `access_secret` are required. All columns in the file will be read and sent to the sink; a configuration sketch is shown below.
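(A minimal sketch, assuming the source is declared as an `S3File` block inside a `source` section; the endpoint and credential values are placeholders.)

```hocon
source {
  S3File {
    path = "/nexus/text"
    bucket = "s3a://nexus-test"
    fs.s3a.endpoint = "s3.cn-north-1.amazonaws.com.cn"   # placeholder endpoint
    fs.s3a.aws.credentials.provider = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
    access_key = "xxxxxxxxxxxxxxxxx"                     # placeholder
    access_secret = "xxxxxxxxxxxxxxxxx"                  # placeholder
    file_format_type = "orc"
  }
}
```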
Use `InstanceProfileCredentialsProvider` for authentication. The file type in S3 is json, so the `schema` option needs to be configured; a configuration sketch is shown below.
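(A sketch for this scenario; the path, endpoint, and schema fields are illustrative.)

```hocon
source {
  S3File {
    path = "/nexus/json"                                 # illustrative path
    bucket = "s3a://nexus-test"
    fs.s3a.endpoint = "s3.cn-north-1.amazonaws.com.cn"   # placeholder endpoint
    fs.s3a.aws.credentials.provider = "com.amazonaws.auth.InstanceProfileCredentialsProvider"
    file_format_type = "json"
    schema {
      fields {
        code = int
        data = string
        success = boolean
      }
    }
  }
}
```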
Use `InstanceProfileCredentialsProvider` for authentication. The file type in S3 is json and has five fields (`id`, `name`, `age`, `sex`, `type`), so the `schema` option needs to be configured. In this job we only need to send the `id` and `name` columns to MySQL, so `read_columns` can be used for the projection; a configuration sketch is shown below.
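(A sketch for this scenario; the path and endpoint are placeholders, and the field types are assumptions. `read_columns`, documented above, performs the projection to `id` and `name`.)

```hocon
source {
  S3File {
    path = "/nexus/json"                                 # illustrative path
    bucket = "s3a://nexus-test"
    fs.s3a.endpoint = "s3.cn-north-1.amazonaws.com.cn"   # placeholder endpoint
    fs.s3a.aws.credentials.provider = "com.amazonaws.auth.InstanceProfileCredentialsProvider"
    file_format_type = "json"
    # Full schema of the upstream records; the types are assumed for illustration.
    schema {
      fields {
        id = int
        name = string
        age = int
        sex = string
        type = string
      }
    }
    # Field projection: only id and name are read and sent to the sink.
    read_columns = ["id", "name"]
  }
}
```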
The `fs.s3a.aws.credentials.provider` option specifies the way to authenticate s3a. Only `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` and `com.amazonaws.auth.InstanceProfileCredentialsProvider` are supported for now. More information about the credential providers can be found in the Hadoop AWS documentation.
If you need to set other S3A options, you can add them in `hadoop_s3_properties`; refer to the Hadoop S3A documentation for the available properties.
`common-options` are the source plugin common parameters; please refer to the common options documentation for details.