Source
Source connectors in Nexus share a set of core features, though the level of support for each feature varies by connector.
Exactly-once: If every data item from the source is sent downstream only once, the source connector is considered to support exactly-once delivery.
In Nexus, the Split being read and its offset (the position reached within that Split, such as a line number, byte position, or message offset) can be stored as a StateSnapshot during checkpointing. If a task restarts, the system retrieves the last StateSnapshot, locates the Split and offset it recorded, and resumes sending data downstream from that point.
Example connectors: File, Kafka.
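The checkpoint-and-resume behaviour described above can be sketched as follows. This is a minimal illustration, not the Nexus API: the `LineReader` class, the fields of `StateSnapshot`, and the in-memory splits are all assumptions made for the example, and snapshots are assumed to be taken between records.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass(frozen=True)
class StateSnapshot:
    """Hypothetical snapshot: the split being read and the position reached in it."""
    split_id: str
    offset: int  # number of records already sent downstream from this split


class LineReader:
    """Toy reader over in-memory splits; each split is a list of lines."""

    def __init__(self, splits: Dict[str, List[str]],
                 restored: Optional[StateSnapshot] = None):
        self.splits = splits
        self.split_ids = sorted(splits)
        # Resume from the snapshot if one exists, otherwise start from the beginning.
        self.current = restored.split_id if restored else self.split_ids[0]
        self.offset = restored.offset if restored else 0

    def snapshot_state(self) -> StateSnapshot:
        """Called at checkpoint time to capture the current split and offset."""
        return StateSnapshot(self.current, self.offset)

    def read(self) -> Iterator[str]:
        start = self.split_ids.index(self.current)
        for split_id in self.split_ids[start:]:
            if split_id != self.current:
                # Moving on to a new split: its offset starts at zero.
                self.current, self.offset = split_id, 0
            lines = self.splits[split_id]
            while self.offset < len(lines):
                record = lines[self.offset]
                # Advance before yielding so a snapshot taken after the
                # record is emitted already accounts for it.
                self.offset += 1
                yield record


splits = {"a": ["a0", "a1", "a2"], "b": ["b0", "b1"]}

reader = LineReader(splits)
records = reader.read()
before_restart = [next(records) for _ in range(4)]
snapshot = reader.snapshot_state()  # checkpoint

# Simulated task restart: a new reader resumes from the snapshot.
after_restart = list(LineReader(splits, restored=snapshot).read())
```

Because the restarted reader begins at the recorded split and offset, no record in this sketch is sent twice and none is skipped.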
Column projection: A connector supports column projection if it can read only the specified columns from the source data. Reading all columns first and then filtering out the unnecessary ones using the schema is not considered true column projection.
For example, JDBCSource can use SQL to define which columns to read.
In contrast, KafkaSource reads all content from the topic and then uses the schema to filter out unnecessary columns, which is not true column projection.
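The difference between the two approaches can be illustrated with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `users` table and both helper functions are hypothetical and are not part of any Nexus connector.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada', 'ada@example.com')")


def read_projected(conn, table, columns):
    """True column projection: only the requested columns leave the source."""
    # Table and column names are assumed to come from trusted configuration.
    query = f"SELECT {', '.join(columns)} FROM {table}"
    return conn.execute(query).fetchall()


def read_then_filter(conn, table, columns):
    """Not true projection: every column is read, then the extras are dropped."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    names = [description[0] for description in cursor.description]
    keep = [names.index(column) for column in columns]
    return [tuple(row[i] for i in keep) for row in cursor.fetchall()]


projected = read_projected(conn, "users", ["id", "name"])
filtered = read_then_filter(conn, "users", ["id", "name"])
```

Both calls return the same rows, but only the first avoids fetching the `email` column from the source at all.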
Batch mode: In batch job mode, the data to be read is bounded, and the job stops once all data has been read.
Stream mode: In streaming job mode, the data to be read is unbounded, and the job keeps running rather than stopping on its own.
Parallelism: The source connector supports configuring parallelism, meaning multiple tasks can be created to read data concurrently. When parallelism is used, the source is divided into multiple splits, which an enumerator then assigns to SourceReaders for processing.
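As a rough sketch of how splits might be created and handed out, the example below cuts a source of `total_rows` rows into fixed-size splits and assigns them to readers round-robin. The function names, the use of row ranges as splits, and the round-robin policy are assumptions for illustration; an actual enumerator may use a different assignment strategy.

```python
from typing import Dict, List


def create_splits(total_rows: int, split_size: int) -> List[range]:
    """Divide rows [0, total_rows) into consecutive splits of at most split_size rows."""
    return [
        range(start, min(start + split_size, total_rows))
        for start in range(0, total_rows, split_size)
    ]


def assign_splits(splits: List[range], parallelism: int) -> Dict[int, List[range]]:
    """Hypothetical enumerator policy: hand splits to readers in round-robin order."""
    assignment: Dict[int, List[range]] = {reader: [] for reader in range(parallelism)}
    for index, split in enumerate(splits):
        assignment[index % parallelism].append(split)
    return assignment


splits = create_splits(total_rows=10, split_size=3)
assignment = assign_splits(splits, parallelism=2)
```

With 10 rows, a split size of 3, and a parallelism of 2, reader 0 receives rows 0 to 2 and 6 to 8, while reader 1 receives rows 3 to 5 and row 9, so every row is read by exactly one reader.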
Support for user-defined split: Users can define their own split rules for how the data should be partitioned.
Support for reading multiple tables: Nexus supports reading from multiple tables in a single job.