Apache Kafka Connect in action: Bridging SFTP and Amazon S3 for Data Integration
Transferring files from an SFTP server to an Amazon S3 bucket is a common requirement in many data workflows. While this task is often handled with custom scripts or manual processes, such approaches can be difficult to maintain and prone to error over time.
Apache Kafka Connect offers a structured way to move data between systems using pre-built connectors. In this guide, you'll set up a Kafka Connect pipeline that reads CSV files from an SFTP server and writes them to an Amazon S3 bucket in Avro format. We'll configure a source connector to monitor the SFTP server for new files and a sink connector to deliver that data to S3, no custom code required.
Kafka Connectors explained
- Source Connector (SFTP)
A source connector in Kafka Connect is responsible for pulling data from an external system (the source) and publishing it to one or more Kafka topics. In the context of an SFTP server, the source connector reads files from the server, processes them according to the specified configuration, and streams them into the Kafka cluster.
- Sink Connector (Amazon S3)
A sink connector in Kafka Connect is used to consume data from Kafka topics and deliver it to an external system (the sink). In this use case, the sink connector pulls data from the Kafka topics where the source connector published it and writes it to an Amazon S3 bucket.
Prerequisites
Before getting started, ensure the following:
- SFTP server with access to the required folders.
- AWS user with correct S3 permissions.
- Docker and Docker Compose are installed.
- Java is installed.
- An SFTP client is installed.
Environment setup: Setting up your Kafka Connect environment with Docker
To build this data pipeline between an SFTP server and Amazon S3 using Apache Kafka Connect, you'll first need to set up the required services. We'll use Docker to run all components in isolated containers.
Required services
Use Docker (or Docker Compose) to run the following containers:
- confluentinc/cp-kafka:7.8.0 — Kafka broker
- cnfldemos/cp-server-connect-datagen:0.6.4-7.6.0 — Kafka Connect worker
- confluentinc/cp-schema-registry:7.8.0 — for managing Avro schemas
- confluentinc/cp-enterprise-control-center:7.8.0 — graphical UI for Kafka Connect
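Once the containers are running, it is worth confirming that the Connect worker, Schema Registry, and Control Center respond before configuring anything. Here is a minimal Python sketch; the ports 8083 and 8081 are the usual defaults for the Connect worker and Schema Registry and are assumptions that should match your Docker setup (9021 for Control Center is used later in this guide):

# Quick reachability check for the Kafka Connect stack.
# Assumed ports: Connect worker 8083, Schema Registry 8081, Control Center 9021.
import requests

services = {
    "Kafka Connect": "http://localhost:8083/",
    "Schema Registry": "http://localhost:8081/subjects",
    "Control Center": "http://localhost:9021/",
}

for name, url in services.items():
    try:
        response = requests.get(url, timeout=5)
        print(f"{name}: HTTP {response.status_code}")
    except requests.ConnectionError:
        print(f"{name}: not reachable at {url}")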
Required connectors
Download and install the following Kafka Connect connectors from Confluent Hub:
- SFTP Source Connector
Enables Kafka Connect to read CSV files from an SFTP server.
- S3 Sink Connector
Allows Kafka Connect to write data into an Amazon S3 bucket in structured format.
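After installing the connectors (for example with the confluent-hub CLI inside the Connect container), you can confirm that the worker actually sees both plugins through its REST API. A minimal sketch, assuming the worker is reachable on localhost:8083:

# Lists the connector plugins registered with the Connect worker and checks
# that the SFTP source and S3 sink connector classes are present.
import requests

plugins = requests.get("http://localhost:8083/connector-plugins", timeout=10).json()
classes = {p["class"] for p in plugins}

for expected in (
    "io.confluent.connect.sftp.SftpCsvSourceConnector",
    "io.confluent.connect.s3.S3SinkConnector",
):
    status = "installed" if expected in classes else "MISSING"
    print(f"{expected}: {status}")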
Kafka Connect SFTP to S3: Configuration steps
Once your environment is up and running, follow these steps to configure the Kafka Connect pipeline. You'll first create a Kafka topic, then set up the SFTP source connector, followed by the S3 sink connector.
Step 1: Access the Kafka Control Center
Open the Kafka Control Center UI in your browser: http://localhost:9021
Step 2: Create a Kafka Topic
Go to the Topics section and create a new topic named users-topic.
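If you prefer creating the topic from code rather than the Control Center UI, a minimal sketch using the confluent-kafka Python client is shown below; the bootstrap address localhost:9092 is an assumption and should match your broker's advertised listener:

# Creates users-topic with a single partition, matching a single-broker setup.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed listener
futures = admin.create_topics([NewTopic("users-topic", num_partitions=1, replication_factor=1)])

for topic, future in futures.items():
    try:
        future.result()  # raises if topic creation failed
        print(f"Created topic {topic}")
    except Exception as exc:
        print(f"Topic {topic} not created: {exc}")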
Step 3: Set up the SFTP Source Connector
- Navigate to Connect > Add connector
- Select SftpCsvSourceConnector inside the default cluster listed under the Connect page.
- Fill in the configuration with the following values:
{
  "name": "SftpCsvSourceConnectorConnector_0",
  "config": {
    "name": "SftpCsvSourceConnectorConnector_0",
    "connector.class": "io.confluent.connect.sftp.SftpCsvSourceConnector",
    "sftp.host": "192.168.1.236",
    "sftp.port": "2222",
    "sftp.username": "foo",
    "sftp.password": "****",
10 "kafka.topic": "tp-users-topic",
11 "input.path": "/upload/input",
12 "finished.path": "/upload/finished",
13 "error.path": "/upload/fixit",
14 "input.file.pattern": "users[0-9]{0,3}.csv",
15 "key.schema": "{\"name\" : \"com.example.users.UserKey\",\"type\" : \"STRUCT\",\"isOptional\" : false, \"fieldSchemas\" : {\"id\" : {\"type\" : \"INT64\",\"isOptional\" : false}}}",
16 "value.schema": " { \"name\": \"com.example.users.User\", \"type\": \"STRUCT\", \"isOptional\": false, \"fieldSchemas\": { \"firstName\": { \"type\": \"STRING\", \"isOptional\": true }, \"lastName\": { \"type\": \"STRING\", \"isOptional\": true } } }",
17 "schema.generation.enabled": "true",
18 "csv.skip.lines": "1"
19 }
20}
Once launched, check that the connector status is marked as Running.
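If you prefer to manage connectors outside the Control Center, the same configuration can be submitted and monitored through the Kafka Connect REST API. A minimal sketch, assuming the worker is reachable on localhost:8083 (the same approach works for the S3 sink connector in the next step):

# Submits the SFTP source connector config and checks its status via the
# Kafka Connect REST API, as an alternative to the Control Center UI.
import requests

connect_url = "http://localhost:8083"  # assumed Connect worker address

connector = {
    "name": "SftpCsvSourceConnectorConnector_0",
    "config": {
        "connector.class": "io.confluent.connect.sftp.SftpCsvSourceConnector",
        # ... the same key/value pairs as the configuration shown above ...
    },
}

# Create the connector (fails if it already exists; use PUT /connectors/<name>/config to upsert).
response = requests.post(f"{connect_url}/connectors", json=connector, timeout=10)
print(response.status_code, response.json())

# Verify that the connector and its task report RUNNING.
status = requests.get(f"{connect_url}/connectors/{connector['name']}/status", timeout=10).json()
print(status["connector"]["state"], [t["state"] for t in status.get("tasks", [])])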
Step 4: Set up the S3 Sink Connector
- Go back to Connect > Add connector
- Select S3SinkConnector from the list.
- Provide the following values for the parameters on the settings page and hit the Launch button:
{
  "name": "S3SinkConnectorConnector_0",
  "config": {
    "name": "S3SinkConnectorConnector_0",
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "users-topic",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "flush.size": "5",
    "rotate.interval.ms": "60000",
    "schema.compatibility": "BACKWARD",
    "s3.bucket.name": "kafka-connector-tp-bucket",
    "s3.region": "eu-west-2",
    "aws.access.key.id": "<AWS Access Key>",
    "aws.secret.access.key": "****************************************",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "topics.dir": "tp-sink"
  }
}
- Again, verify that the connector status shows Running.
Step 5: Test the solution we just built
- Log in to the SFTP server and upload the following CSV file to the input folder with the filename users001.csv.
firstName, lastName
Grace, Abraham
Hannah, Allan
Heather, Alsop
- After following the above steps, the file users001.csv should be moved to the finished folder on the SFTP server and a new file in Avro format should be visible in the specified S3 bucket.
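To exercise the pipeline from code instead of an SFTP client, the upload and the S3 check can both be scripted. A rough sketch using paramiko and boto3, reusing the host, paths, and bucket from the connector configurations above (the credentials are placeholders):

# Uploads users001.csv to the SFTP input folder, then lists the sink bucket
# to confirm the Avro output appears once the connectors have processed it.
import io
import boto3
import paramiko

csv_body = "firstName, lastName\nGrace, Abraham\nHannah, Allan\nHeather, Alsop\n"

# SFTP upload (host, port, and credentials as used in the source connector config).
transport = paramiko.Transport(("192.168.1.236", 2222))
transport.connect(username="foo", password="****")  # placeholder password
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.putfo(io.BytesIO(csv_body.encode("utf-8")), "/upload/input/users001.csv")
sftp.close()
transport.close()

# Give the connectors a moment to run, then inspect the bucket.
s3 = boto3.client("s3", region_name="eu-west-2")
objects = s3.list_objects_v2(Bucket="kafka-connector-tp-bucket", Prefix="tp-sink/")
for obj in objects.get("Contents", []):
    print(obj["Key"], obj["Size"])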
Conclusion
Building a data pipeline between an SFTP server and Amazon S3 using Apache Kafka Connect is a reliable way to automate file transfers and reduce operational overhead. By using dedicated source and sink connectors, you avoid the need for custom scripts or manual interventions. This makes your workflow easier to maintain, more scalable and adaptable to future needs.
Once configured, the pipeline monitors your SFTP server for new files, moves them through Kafka and writes them to S3 in a structured format ready for analytics, archiving, or further processing.
Whether you're dealing with periodic batch uploads or planning a more dynamic, real-time data infrastructure, this approach offers a solid foundation for modern data integration.
If you're setting up a Kafka Connect pipeline and need technical guidance, from configuration issues to custom connector development, feel free to reach out.