# OssFile

> Oss file source connector

### Key features <a href="#key-features" id="key-features"></a>

* [x] batch
* [ ] stream
* [x] exactly-once

All the data in a split is read in a single `pollNext` call. The splits that have been read are saved in the snapshot.

* [x] column projection
* [x] parallelism
* [ ] support user-defined split
* [x] file format type
  * [x] text
  * [x] csv
  * [x] parquet
  * [x] orc
  * [x] json
  * [x] excel
  * [x] xml
  * [x] binary

### Data Type Mapping <a href="#data-type-mapping" id="data-type-mapping"></a>

Data type mapping depends on the type of file being read. The following file types are supported:

`text` `csv` `parquet` `orc` `json` `excel` `xml`

#### JSON File Type <a href="#json-file-type" id="json-file-type"></a>

If you set the file type to `json`, you should also set the `schema` option to tell the connector how to parse the data into the rows you want.

For example, given the following upstream data:

```
{"code":  200, "data":  "get success", "success":  true}
```

You can also save multiple pieces of data in one file and split them by newline:

```
{"code":  200, "data":  "get success", "success":  true}
{"code":  300, "data":  "get failed", "success":  false}
```

Assign the schema as follows:

```
schema {
    fields {
        code = int
        data = string
        success = boolean
    }
}
```

The connector will generate the following data:

| code | data        | success |
| ---- | ----------- | ------- |
| 200  | get success | true    |

#### Text Or CSV File Type <a href="#text-or-csv-file-type" id="text-or-csv-file-type"></a>

If you set the file type to `text` or `csv`, you can choose whether or not to specify the schema.

For example, given the following upstream data:

```
tyrantlucifer#26#male
```

If you do not specify a schema, the connector treats the upstream data as follows:

| content               |
| --------------------- |
| tyrantlucifer#26#male |

If you specify a schema, you should also set the `field_delimiter` option, except for the CSV file type.

Assign the schema and delimiter as follows:

```
field_delimiter = "#"
schema {
    fields {
        name = string
        age = int
        gender = string
    }
}
```

The connector will generate the following data:

| name          | age | gender |
| ------------- | --- | ------ |
| tyrantlucifer | 26  | male   |
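
Putting the pieces above together, a minimal sketch of a complete source block for this text example might look as follows. The `path` value is a hypothetical placeholder, and the bucket, endpoint, and credentials are the placeholder values used in the job examples later on this page; substitute your own.

```
source {
  OssFile {
    # Hypothetical directory that holds '#'-delimited text files
    path = "/nexus/text"
    bucket = "oss://tyrantlucifer-image-bed"
    access_key = "xxxxxxxxxxxxxxxxx"
    access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    file_format_type = "text"
    # Split each line on '#' and map the pieces onto the schema fields in order
    field_delimiter = "#"
    schema {
      fields {
        name = string
        age = int
        gender = string
      }
    }
  }
}
```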

#### Orc File Type <a href="#orc-file-type" id="orc-file-type"></a>

If you set the file type to `parquet` or `orc`, the `schema` option is not required; the connector detects the schema of the upstream data automatically.

| Orc Data type                        | Nexus Data type                                            |
| ------------------------------------ | ---------------------------------------------------------- |
| BOOLEAN                              | BOOLEAN                                                    |
| INT                                  | INT                                                        |
| BYTE                                 | BYTE                                                       |
| SHORT                                | SHORT                                                      |
| LONG                                 | LONG                                                       |
| FLOAT                                | FLOAT                                                      |
| DOUBLE                               | DOUBLE                                                     |
| BINARY                               | BINARY                                                     |
| STRING<br>VARCHAR<br>CHAR            | STRING                                                     |
| DATE                                 | LOCAL\_DATE\_TYPE                                          |
| TIMESTAMP                            | LOCAL\_DATE\_TIME\_TYPE                                    |
| DECIMAL                              | DECIMAL                                                    |
| LIST(STRING)                         | STRING\_ARRAY\_TYPE                                        |
| LIST(BOOLEAN)                        | BOOLEAN\_ARRAY\_TYPE                                       |
| LIST(TINYINT)                        | BYTE\_ARRAY\_TYPE                                          |
| LIST(SMALLINT)                       | SHORT\_ARRAY\_TYPE                                         |
| LIST(INT)                            | INT\_ARRAY\_TYPE                                           |
| LIST(BIGINT)                         | LONG\_ARRAY\_TYPE                                          |
| LIST(FLOAT)                          | FLOAT\_ARRAY\_TYPE                                         |
| LIST(DOUBLE)                         | DOUBLE\_ARRAY\_TYPE                                        |
| Map\<K,V>                            | MapType; K and V are converted to Nexus types              |
| STRUCT                               | NexusRowType                                               |

#### Parquet File Type <a href="#parquet-file-type" id="parquet-file-type"></a>

If you set the file type to `parquet` or `orc`, the `schema` option is not required; the connector detects the schema of the upstream data automatically.

| Parquet Data type       | Nexus Data type                                            |
| ----------------------- | ---------------------------------------------------------- |
| INT\_8                  | BYTE                                                       |
| INT\_16                 | SHORT                                                      |
| DATE                    | DATE                                                       |
| TIMESTAMP\_MILLIS       | TIMESTAMP                                                  |
| INT64                   | LONG                                                       |
| INT96                   | TIMESTAMP                                                  |
| BINARY                  | BYTES                                                      |
| FLOAT                   | FLOAT                                                      |
| DOUBLE                  | DOUBLE                                                     |
| BOOLEAN                 | BOOLEAN                                                    |
| FIXED\_LEN\_BYTE\_ARRAY | <p>TIMESTAMP<br>DECIMAL</p>                                |
| DECIMAL                 | DECIMAL                                                    |
| LIST(STRING)            | STRING\_ARRAY\_TYPE                                        |
| LIST(BOOLEAN)           | BOOLEAN\_ARRAY\_TYPE                                       |
| LIST(TINYINT)           | BYTE\_ARRAY\_TYPE                                          |
| LIST(SMALLINT)          | SHORT\_ARRAY\_TYPE                                         |
| LIST(INT)               | INT\_ARRAY\_TYPE                                           |
| LIST(BIGINT)            | LONG\_ARRAY\_TYPE                                          |
| LIST(FLOAT)             | FLOAT\_ARRAY\_TYPE                                         |
| LIST(DOUBLE)            | DOUBLE\_ARRAY\_TYPE                                        |
| Map\<K,V>               | MapType; K and V are converted to Nexus types              |
| STRUCT                  | NexusRowType                                               |

### Options <a href="#options" id="options"></a>

| name                         | type    | required | default value       | Description                                                                                                                                                                                                                                                                                                                         |
| ---------------------------- | ------- | -------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| path                         | string  | yes      | -                   | The Oss path to read. It can contain sub paths, but they must meet certain format requirements; see the `parse_partition_from_path` option for details. |
| file\_format\_type           | string  | yes      | -                   | File type. The following file types are supported: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` |
| bucket                       | string  | yes      | -                   | The bucket address of oss file system, for example: `oss://nexus-test`.                                                                                                                                                                                                                                                             |
| endpoint                     | string  | yes      | -                   | The endpoint of the Oss file system. |
| read\_columns                | list    | no       | -                   | The list of columns to read from the data source; use it to implement column projection. Column projection is supported for the following file types: `text` `csv` `parquet` `orc` `json` `excel` `xml`. To use this feature when reading `text`, `json`, or `csv` files, the `schema` option must be configured. |
| access\_key                  | string  | no       | -                   | The access key of the Oss file system. |
| access\_secret               | string  | no       | -                   | The access secret of the Oss file system. |
| delimiter                    | string  | no       | \001                | Field delimiter, used to tell the connector how to split fields when reading text files. Default `\001`, the same as Hive's default delimiter. |
| parse\_partition\_from\_path | boolean | no       | true                | Controls whether the partition keys and values are parsed from the file path. For example, if you read a file from the path `oss://hadoop-cluster/tmp/nexus/parquet/name=tyrantlucifer/age=26`, these two fields are added to every record read from the file: name="tyrantlucifer", age=26 |
| date\_format                 | string  | no       | yyyy-MM-dd          | Date type format, used to tell the connector how to convert a string to a date. Supported formats: `yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`. Default `yyyy-MM-dd`. |
| datetime\_format             | string  | no       | yyyy-MM-dd HH:mm:ss | Datetime type format, used to tell the connector how to convert a string to a datetime. Supported formats: `yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss` |
| time\_format                 | string  | no       | HH:mm:ss            | Time type format, used to tell the connector how to convert a string to a time. Supported formats: `HH:mm:ss` `HH:mm:ss.SSS` |
| skip\_header\_row\_number    | long    | no       | 0                   | The number of leading lines to skip; only applies to text and csv files. For example, with `skip_header_row_number = 2`, Nexus skips the first 2 lines from the source files. |
| schema                       | config  | no       | -                   | The schema of upstream data.                                                                                                                                                                                                                                                                                                        |
| sheet\_name                  | string  | no       | -                   | The sheet of the workbook to read. Only used when file\_format\_type is excel. |
| xml\_row\_tag                | string  | no       | -                   | Specifies the tag name of the data rows within the XML file. Only used when file\_format\_type is xml. |
| xml\_use\_attr\_format       | boolean | no       | -                   | Specifies whether to process data using the tag attribute format. Only used when file\_format\_type is xml. |
| compress\_codec              | string  | no       | none                | The compression codec used by the files. |
| encoding                     | string  | no       | UTF-8               | The encoding of the file to read. Only used when file\_format\_type is json, text, csv, or xml. |
| file\_filter\_pattern        | string  | no       |                     | Filter pattern used for filtering files. For example, `*.txt` means only files ending with `.txt` are read. |
| common-options               | config  | no       | -                   | Source plugin common parameters, please refer to  [source-common-options](https://docs.selfuel.digital/data-integration-with-nexus/nexus-elements/connectors/source/source-common-options "mention") for details.                                                                                                                   |
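
As an illustration of how several of these options can be combined, the sketch below reads csv files that begin with one header line and keeps only two of the three columns. The `path` value and the field names are assumptions made up for this example, and the connection values are placeholders.

```
source {
  OssFile {
    # Hypothetical directory that holds csv files with a single header line
    path = "/nexus/csv"
    bucket = "oss://tyrantlucifer-image-bed"
    access_key = "xxxxxxxxxxxxxxxxx"
    access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    file_format_type = "csv"
    # Skip the header line
    skip_header_row_number = 1
    # Column projection: only these columns are read
    read_columns = ["name", "age"]
    # Required for column projection on text, json, and csv files
    schema {
      fields {
        name = string
        age = int
        gender = string
      }
    }
  }
}
```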

#### compress\_codec \[string] <a href="#compress_codec-string" id="compress_codec-string"></a>

The compression codec of the files. The supported codecs are as follows:

* text: `lzo` `none`
* json: `lzo` `none`
* csv: `lzo` `none`
* orc/parquet: the compression type is recognized automatically; no additional settings are required.
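
For instance, a source that reads lzo-compressed text files could be sketched as below. The `path` value is a hypothetical placeholder and the connection values are placeholders as well. No schema is given, so each line would arrive in a single `content` column, as described in the text file section.

```
source {
  OssFile {
    # Hypothetical directory that holds lzo-compressed text files
    path = "/nexus/text_lzo"
    bucket = "oss://tyrantlucifer-image-bed"
    access_key = "xxxxxxxxxxxxxxxxx"
    access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    file_format_type = "text"
    # Tell the connector that the files are lzo-compressed
    compress_codec = "lzo"
  }
}
```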

#### encoding \[string] <a href="#encoding-string" id="encoding-string"></a>

Only used when file\_format\_type is json, text, csv, or xml. The encoding of the file to read. This parameter is parsed by `Charset.forName(encoding)`.
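
A minimal sketch, assuming the files are encoded in ISO-8859-1 (a charset name that `Charset.forName` accepts on every Java platform). The `path` value is hypothetical and the connection values are placeholders.

```
source {
  OssFile {
    # Hypothetical directory that holds ISO-8859-1 encoded text files
    path = "/nexus/text_latin1"
    bucket = "oss://tyrantlucifer-image-bed"
    access_key = "xxxxxxxxxxxxxxxxx"
    access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    file_format_type = "text"
    # Overrides the UTF-8 default
    encoding = "ISO-8859-1"
  }
}
```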

#### file\_filter\_pattern \[string] <a href="#file_filter_pattern-string" id="file_filter_pattern-string"></a>

The filter pattern used for filtering files.
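
As a sketch, the block below reuses the `*.txt` pattern from the options table to read only files ending with `.txt`. The `path` value is hypothetical and the connection values are placeholders; check how patterns are matched in your version before relying on more complex expressions.

```
source {
  OssFile {
    # Hypothetical directory that holds files of mixed types
    path = "/nexus/mixed"
    bucket = "oss://tyrantlucifer-image-bed"
    access_key = "xxxxxxxxxxxxxxxxx"
    access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    file_format_type = "text"
    # Only files ending with .txt are read
    file_filter_pattern = "*.txt"
  }
}
```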

#### schema \[config] <a href="#schema-config" id="schema-config"></a>

Only needs to be configured when file\_format\_type is text, json, excel, xml, or csv (or any other format whose schema cannot be read from metadata).

**fields \[Config]**

The schema of upstream data.
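
For example, an excel source needs a schema because the column types cannot be read from file metadata. The sketch below is illustrative only: the `path`, the sheet name `Sheet1`, and the field names are assumptions, and the connection values are placeholders.

```
source {
  OssFile {
    # Hypothetical directory that holds excel workbooks
    path = "/nexus/excel"
    bucket = "oss://tyrantlucifer-image-bed"
    access_key = "xxxxxxxxxxxxxxxxx"
    access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    file_format_type = "excel"
    # Hypothetical sheet name; only used for excel files
    sheet_name = "Sheet1"
    schema {
      fields {
        name = string
        age = int
        gender = string
      }
    }
  }
}
```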

### How to Create an Oss Data Synchronization Job <a href="#how-to-create-a-oss-data-synchronization-jobs" id="how-to-create-a-oss-data-synchronization-jobs"></a>

The following example demonstrates how to create a data synchronization job that reads data from Oss and prints it on the local client:

```
# Set the basic configuration of the task to be performed
env {
  parallelism = 1
  job.mode = "BATCH"
}

# Create a source to connect to Oss
source {
  OssFile {
    path = "/nexus/orc"
    bucket = "oss://tyrantlucifer-image-bed"
    access_key = "xxxxxxxxxxxxxxxxx"
    access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    file_format_type = "orc"
  }
}

# Console printing of the read Oss data
sink {
  Console {
  }
}
```

The following example reads `json` files and declares their schema:

```
# Set the basic configuration of the task to be performed
env {
  parallelism = 1
  job.mode = "BATCH"
}

# Create a source to connect to Oss
source {
  OssFile {
    path = "/nexus/json"
    bucket = "oss://tyrantlucifer-image-bed"
    access_key = "xxxxxxxxxxxxxxxxx"
    access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    file_format_type = "json"
    schema {
      fields {
        id = int 
        name = string
      }
    }
  }
}

# Console printing of the read Oss data
sink {
  Console {
  }
}
```

#### Multiple Table <a href="#multiple-table" id="multiple-table"></a>

For file types that do not require a schema, e.g. `orc`:

```
env {
  parallelism = 1
  spark.app.name = "Nexus"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
  spark.master = local
  job.mode = "BATCH"
}

source {
  OssFile {
    tables_configs = [
      {
          schema = {
              table = "fake01"
          }
          bucket = "oss://whale-ops"
          access_key = "xxxxxxxxxxxxxxxxxxx"
          access_secret = "xxxxxxxxxxxxxxxxxxx"
          endpoint = "https://oss-accelerate.aliyuncs.com"
          path = "/test/nexus/read/orc"
          file_format_type = "orc"
      },
      {
          schema = {
              table = "fake02"
          }
          bucket = "oss://whale-ops"
          access_key = "xxxxxxxxxxxxxxxxxxx"
          access_secret = "xxxxxxxxxxxxxxxxxxx"
          endpoint = "https://oss-accelerate.aliyuncs.com"
          path = "/test/nexus/read/orc"
          file_format_type = "orc"
      }
    ]
    result_table_name = "fake"
  }
}

sink {
  Assert {
    rules {
        table-names = ["fake01", "fake02"]
    }
  }
}
```

For file types that require a schema, e.g. `json`:

```

env {
  execution.parallelism = 1
  spark.app.name = "Nexus"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
  spark.master = local
  job.mode = "BATCH"
}

source {
  OssFile {
    tables_configs = [
      {
          bucket = "oss://whale-ops"
          access_key = "xxxxxxxxxxxxxxxxxxx"
          access_secret = "xxxxxxxxxxxxxxxxxxx"
          endpoint = "https://oss-accelerate.aliyuncs.com"
          path = "/test/nexus/read/json"
          file_format_type = "json"
          schema = {
            table = "fake01"
            fields {
              c_map = "map<string, string>"
              c_array = "array<int>"
              c_string = string
              c_boolean = boolean
              c_tinyint = tinyint
              c_smallint = smallint
              c_int = int
              c_bigint = bigint
              c_float = float
              c_double = double
              c_bytes = bytes
              c_date = date
              c_decimal = "decimal(38, 18)"
              c_timestamp = timestamp
              c_row = {
                C_MAP = "map<string, string>"
                C_ARRAY = "array<int>"
                C_STRING = string
                C_BOOLEAN = boolean
                C_TINYINT = tinyint
                C_SMALLINT = smallint
                C_INT = int
                C_BIGINT = bigint
                C_FLOAT = float
                C_DOUBLE = double
                C_BYTES = bytes
                C_DATE = date
                C_DECIMAL = "decimal(38, 18)"
                C_TIMESTAMP = timestamp
              }
            }
          }
      },
      {
          bucket = "oss://whale-ops"
          access_key = "xxxxxxxxxxxxxxxxxxx"
          access_secret = "xxxxxxxxxxxxxxxxxxx"
          endpoint = "https://oss-accelerate.aliyuncs.com"
          path = "/test/nexus/read/json"
          file_format_type = "json"
          schema = {
            table = "fake02"
            fields {
              c_map = "map<string, string>"
              c_array = "array<int>"
              c_string = string
              c_boolean = boolean
              c_tinyint = tinyint
              c_smallint = smallint
              c_int = int
              c_bigint = bigint
              c_float = float
              c_double = double
              c_bytes = bytes
              c_date = date
              c_decimal = "decimal(38, 18)"
              c_timestamp = timestamp
              c_row = {
                C_MAP = "map<string, string>"
                C_ARRAY = "array<int>"
                C_STRING = string
                C_BOOLEAN = boolean
                C_TINYINT = tinyint
                C_SMALLINT = smallint
                C_INT = int
                C_BIGINT = bigint
                C_FLOAT = float
                C_DOUBLE = double
                C_BYTES = bytes
                C_DATE = date
                C_DECIMAL = "decimal(38, 18)"
                C_TIMESTAMP = timestamp
              }
            }
          }
      }
    ]
    result_table_name = "fake"
  }
}

sink {
  Assert {
    rules {
      table-names = ["fake01", "fake02"]
    }
  }
}
```
