# OssFile
OSS file source connector.

Reads all the data in a split in a single `pollNext` call. The splits that have been read are saved in the snapshot.
The data type mapping depends on the type of file being read. The following file types are supported:

- text
- csv
- parquet
- orc
- json
- excel
- xml
If you assign the file type to `json`, you should also assign the `schema` option to tell the connector how to parse the data into the rows you want. You can also store multiple records in one file, separated by newlines (one JSON object per line). Given a schema, the connector generates one row per JSON object; for example, a record with the field values `200`, `get success`, and `true` becomes a three-column row.
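As a sketch of this flow (the field names `code`, `data`, and `success` are hypothetical, chosen to match the example values), upstream data such as

```json
{"code": 200, "data": "get success", "success": true}
```

could be parsed with a schema like:

```hocon
schema {
  fields {
    code = int
    data = string
    success = boolean
  }
}
```

producing a row with `code = 200`, `data = get success`, and `success = true`.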
If you assign the file type to `text` or `csv`, you can choose whether to specify the schema information.

For example, suppose the upstream data is the following:

`tyrantlucifer#26#male`

If you do not assign a schema, the connector will treat the whole line as a single string field.
If you do assign a schema, you should also assign the `field_delimiter` option, except for the `csv` file type, which is already comma-delimited. With the delimiter `#`, the connector will generate a row with the three fields `tyrantlucifer`, `26`, and `male`.
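As a sketch (the field names `name`, `age`, and `gender` are hypothetical, chosen to match the example values), the schema and delimiter could be assigned as:

```hocon
field_delimiter = "#"
schema {
  fields {
    name = string
    age = int
    gender = string
  }
}
```

With this configuration, the line `tyrantlucifer#26#male` produces a row with `name = tyrantlucifer`, `age = 26`, and `gender = male`.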
If you assign the file type to `parquet` or `orc`, the `schema` option is not required; the connector can read the schema of the upstream data automatically. Orc data types map to Nexus data types as follows:
| Orc Data Type | Nexus Data Type |
|---------------|-----------------|
| BOOLEAN | BOOLEAN |
| INT | INT |
| BYTE | BYTE |
| SHORT | SHORT |
| LONG | LONG |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| BINARY | BINARY |
| STRING / VARCHAR / CHAR | STRING |
| DATE | LOCAL_DATE_TYPE |
| TIMESTAMP | LOCAL_DATE_TIME_TYPE |
| DECIMAL | DECIMAL |
| LIST(STRING) | STRING_ARRAY_TYPE |
| LIST(BOOLEAN) | BOOLEAN_ARRAY_TYPE |
| LIST(TINYINT) | BYTE_ARRAY_TYPE |
| LIST(SMALLINT) | SHORT_ARRAY_TYPE |
| LIST(INT) | INT_ARRAY_TYPE |
| LIST(BIGINT) | LONG_ARRAY_TYPE |
| LIST(FLOAT) | FLOAT_ARRAY_TYPE |
| LIST(DOUBLE) | DOUBLE_ARRAY_TYPE |
| Map<K,V> | MapType; the K and V types are each transformed to a Nexus type |
| STRUCT | NexusRowType |
Likewise, if you assign the file type to `parquet` or `orc`, the `schema` option is not required; the connector can read the schema of the upstream data automatically. Parquet data types map to Nexus data types as follows:
| Parquet Data Type | Nexus Data Type |
|-------------------|-----------------|
| INT_8 | BYTE |
| INT_16 | SHORT |
| DATE | DATE |
| TIMESTAMP_MILLIS | TIMESTAMP |
| INT64 | LONG |
| INT96 | TIMESTAMP |
| BINARY | BYTES |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| BOOLEAN | BOOLEAN |
| FIXED_LEN_BYTE_ARRAY | TIMESTAMP / DECIMAL |
| DECIMAL | DECIMAL |
| LIST(STRING) | STRING_ARRAY_TYPE |
| LIST(BOOLEAN) | BOOLEAN_ARRAY_TYPE |
| LIST(TINYINT) | BYTE_ARRAY_TYPE |
| LIST(SMALLINT) | SHORT_ARRAY_TYPE |
| LIST(INT) | INT_ARRAY_TYPE |
| LIST(BIGINT) | LONG_ARRAY_TYPE |
| LIST(FLOAT) | FLOAT_ARRAY_TYPE |
| LIST(DOUBLE) | DOUBLE_ARRAY_TYPE |
| Map<K,V> | MapType; the K and V types are each transformed to a Nexus type |
| STRUCT | NexusRowType |
The connector supports the following options:

| name | type | required | default value | description |
|------|------|----------|---------------|-------------|
| path | string | yes | - | The OSS path to read. It may contain sub-paths, but the sub-paths must meet certain format requirements; see the `parse_partition_from_path` option. |
| file_format_type | string | yes | - | File type. Supported types: `text`, `csv`, `parquet`, `orc`, `json`, `excel`, `xml`, `binary`. |
| bucket | string | yes | - | The bucket address of the OSS file system, for example: `oss://nexus-test`. |
| endpoint | string | yes | - | The OSS endpoint. |
| read_columns | list | no | - | The column list to read from the data source; users can use it to implement field projection. Column projection is supported for the `text`, `csv`, `parquet`, `orc`, `json`, `excel`, and `xml` file types. To use this feature when reading `text`, `json`, or `csv` files, the `schema` option must be configured. |
| access_key | string | no | - | - |
| access_secret | string | no | - | - |
| delimiter | string | no | \001 | Field delimiter, used to tell the connector how to split fields when reading text files. Defaults to `\001`, the same as Hive's default delimiter. |
| parse_partition_from_path | boolean | no | true | Controls whether partition keys and values are parsed from the file path. For example, if you read a file from the path `oss://hadoop-cluster/tmp/nexus/parquet/name=tyrantlucifer/age=26`, the two fields `name=tyrantlucifer` and `age=26` are added to every record read from that file. |
| date_format | string | no | yyyy-MM-dd | Date format, used to tell the connector how to convert a string to a date. Supported formats: `yyyy-MM-dd`, `yyyy.MM.dd`, `yyyy/MM/dd`. Defaults to `yyyy-MM-dd`. |
| datetime_format | string | no | yyyy-MM-dd HH:mm:ss | Datetime format, used to tell the connector how to convert a string to a datetime. Supported formats: `yyyy-MM-dd HH:mm:ss`, `yyyy.MM.dd HH:mm:ss`, `yyyy/MM/dd HH:mm:ss`, `yyyyMMddHHmmss`. |
| time_format | string | no | HH:mm:ss | Time format, used to tell the connector how to convert a string to a time. Supported formats: `HH:mm:ss`, `HH:mm:ss.SSS`. |
| skip_header_row_number | long | no | 0 | Skips the first few lines; only effective for `text` and `csv` files. For example, with `skip_header_row_number = 2`, Nexus skips the first 2 lines of each source file. |
| schema | config | no | - | The schema of the upstream data. |
| sheet_name | string | no | - | The sheet of the workbook to read. Only used when `file_format_type` is `excel`. |
| xml_row_tag | string | no | - | The tag name of the data rows within the XML file. Only used when `file_format_type` is `xml`. |
| xml_use_attr_format | boolean | no | - | Whether to process data using the tag attribute format. Only used when `file_format_type` is `xml`. |
| compress_codec | string | no | none | The compress codec of the files. |
| encoding | string | no | UTF-8 | The encoding of the file to read. |
| file_filter_pattern | string | no | *.txt | Filter pattern used for filtering files; the default `*.txt` means only files ending with `.txt` are read. |
| common-options | config | no | - | Source plugin common parameters. |
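The effect of `parse_partition_from_path` can be sketched in Python. This is illustrative only, not the connector's actual implementation: each `key=value` segment of the file path becomes an extra field on every record read from that file.

```python
# Illustrative sketch of parse_partition_from_path = true:
# key=value path segments are extracted as partition fields.
path = "oss://hadoop-cluster/tmp/nexus/parquet/name=tyrantlucifer/age=26/file.parquet"

partitions = dict(
    seg.split("=", 1)          # split each segment into key and value
    for seg in path.split("/")
    if "=" in seg              # only key=value segments are partitions
)

print(partitions)  # {'name': 'tyrantlucifer', 'age': '26'}
```

Every record read from this file would then carry the additional fields `name=tyrantlucifer` and `age=26`.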
The `compress_codec` option selects the compress codec of the files. The supported codecs depend on the file format:

- txt: `lzo`, `none`
- json: `lzo`, `none`
- csv: `lzo`, `none`
- orc/parquet: the compression type is recognized automatically; no additional settings are required.

The `encoding` option is only used when `file_format_type` is `json`, `text`, `csv`, or `xml`. It is the encoding of the file to read, and is parsed by `Charset.forName(encoding)`.

The `file_filter_pattern` option is a filter pattern used for selecting which files to read.

The `schema` option describes the schema of the upstream data. It only needs to be configured when `file_format_type` is `text`, `json`, `excel`, `xml`, or `csv` (or another format from which the schema cannot be read from file metadata).
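The `file_filter_pattern` option behaves like a glob-style filename pattern. A rough Python analogue (an assumption for illustration; the connector's own matcher may differ in detail) is `fnmatch`:

```python
import fnmatch

# Glob-style filtering, analogous to file_filter_pattern = "*.txt":
files = ["a.txt", "b.csv", "c.txt"]
matched = fnmatch.filter(files, "*.txt")

print(matched)  # ['a.txt', 'c.txt']
```

Here only the files ending with `.txt` survive the filter, matching the documented behavior of the default pattern.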
The following example demonstrates how to create a data synchronization job that reads data from OSS and prints it on the local client. For file types such as `orc`, there is no need to configure a schema; for file types such as `json`, the `schema` option must be configured.
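A minimal sketch of such a job, assuming a HOCON job definition with `env`/`source`/`sink` blocks and a `Console` sink; the path, bucket, endpoint, and credentials below are placeholders:

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  OssFile {
    path = "/nexus/orc"
    bucket = "oss://nexus-test"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    access_key = "xxxxxxxxxx"
    access_secret = "xxxxxxxxxx"
    # orc files carry their own schema, so no schema block is needed;
    # for json files, add a schema { fields { ... } } block here.
    file_format_type = "orc"
  }
}

sink {
  Console {}
}
```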
For source plugin common parameters, please refer to the source common options documentation for details.
fields [Config]: the `fields` block inside `schema` defines the field names and types of the upstream data.