Apache Kafka Connect in action: Bridging SFTP and Amazon S3 for Data Integration

Transferring files from an SFTP server to an Amazon S3 bucket is a common requirement in many data workflows. While this task is often handled with custom scripts or manual processes, such approaches can be difficult to maintain and prone to error over time.

Apache Kafka Connect offers a structured way to move data between systems using pre-built connectors. In this guide, you'll set up a Kafka Connect pipeline that reads CSV files from an SFTP server and writes them to an Amazon S3 bucket in Avro format. We'll configure a source connector to monitor the SFTP server for new files and a sink connector to deliver that data to S3, with no custom code required.

Kafka Connectors explained

  • Source Connector (SFTP)
    A source connector in Kafka Connect is responsible for pulling data from an external system (the source) and publishing it to one or more Kafka topics. In the context of an SFTP server, the source connector reads files or data from the server, processes it according to the specified configuration and streams it into the Kafka cluster.
  • Sink Connector (Amazon S3)
    A sink connector in Kafka Connect is used to consume data from Kafka topics and deliver it to an external system (the sink). In this use case, the sink connector pulls data from the Kafka topics where the source connector published it and writes it to an Amazon S3 bucket.
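
Both connectors run inside a Kafka Connect worker and can be managed either through Control Center (as in this guide) or through the worker's REST API. As a quick illustration, once the Docker environment from the next section is running, you can confirm that the worker sees the installed connector plugins (assuming the worker's default REST port 8083 is published):

# List the connector plugins installed on the Connect worker
curl -s http://localhost:8083/connector-plugins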

Diagram

[Architecture diagram: CSV files on the SFTP server are read by the SFTP source connector, published to a Kafka topic, and delivered to an Amazon S3 bucket by the S3 sink connector.]

Prerequisites

Before getting started, ensure the following:

  • An SFTP server with access to the required folders.
  • An AWS user with the correct S3 permissions.
  • Docker and Docker Compose are installed.
  • Java is installed.
  • An SFTP client is installed.

Environment setup: Setting up your Kafka Connect environment with Docker

To build this data pipeline between an SFTP server and Amazon S3 using Apache Kafka Connect, you'll first need to set up the required services. We'll use Docker to run all components in isolated containers.

Required services

Use Docker (or Docker Compose) to run the following containers; a sample docker-compose.yml sketch follows the list:

  • confluentinc/cp-kafka:7.8.0 — Kafka broker
  • cnfldemos/cp-server-connect-datagen:0.6.4-7.6.0 — Kafka Connect worker
  • confluentinc/cp-schema-registry:7.8.0 — for managing Avro schemas
  • confluentinc/cp-enterprise-control-center:7.8.0 — graphical UI for Kafka Connect
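
The sketch below is a trimmed-down docker-compose.yml for these four services, modelled on Confluent's cp-all-in-one example. Treat it as a starting point rather than a reference configuration: the service names, listener layout, converters, CLUSTER_ID and single-node replication factors are assumptions for a development setup and may need adjusting for your environment.

# docker-compose.yml (single-node development sketch, adapt as needed)
services:
  broker:
    image: confluentinc/cp-kafka:7.8.0
    hostname: broker
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: 'broker,controller'
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@broker:29093'
      KAFKA_LISTENERS: 'PLAINTEXT://broker:29092,CONTROLLER://broker:29093,PLAINTEXT_HOST://0.0.0.0:9092'
      KAFKA_ADVERTISED_LISTENERS: 'PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'

  schema-registry:
    image: confluentinc/cp-schema-registry:7.8.0
    hostname: schema-registry
    depends_on:
      - broker
    ports:
      - "8081:8081"
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: 'broker:29092'
      SCHEMA_REGISTRY_LISTENERS: http://0.0.0.0:8081

  connect:
    image: cnfldemos/cp-server-connect-datagen:0.6.4-7.6.0
    hostname: connect
    depends_on:
      - broker
      - schema-registry
    ports:
      - "8083:8083"
    environment:
      CONNECT_BOOTSTRAP_SERVERS: 'broker:29092'
      CONNECT_REST_ADVERTISED_HOST_NAME: connect
      CONNECT_GROUP_ID: compose-connect-group
      CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_KEY_CONVERTER: org.apache.kafka.connect.storage.StringConverter
      CONNECT_VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter
      CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081
      CONNECT_PLUGIN_PATH: "/usr/share/java,/usr/share/confluent-hub-components"

  control-center:
    image: confluentinc/cp-enterprise-control-center:7.8.0
    hostname: control-center
    depends_on:
      - broker
      - schema-registry
      - connect
    ports:
      - "9021:9021"
    environment:
      CONTROL_CENTER_BOOTSTRAP_SERVERS: 'broker:29092'
      CONTROL_CENTER_CONNECT_CONNECT-DEFAULT_CLUSTER: 'connect:8083'
      CONTROL_CENTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
      CONTROL_CENTER_REPLICATION_FACTOR: 1
      CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1
      CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1
      CONFLUENT_METRICS_TOPIC_REPLICATION: 1

Start everything with docker compose up -d and wait until all four containers are up before continuing.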

Required connectors

Download and install the following Kafka Connect connectors from Confluent Hub:

  • confluentinc/kafka-connect-sftp (SFTP Source connector, provides SftpCsvSourceConnector)
  • confluentinc/kafka-connect-s3 (Amazon S3 Sink connector, provides S3SinkConnector)
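
One way to install them, assuming the Connect container from the compose sketch above is named connect and that the image includes the confluent-hub CLI (Confluent's Connect images normally do), is to run the installer inside the container and restart the worker so it picks up the new plugins:

# Install the connectors inside the running Connect container (container name and versions are assumptions)
docker exec -it connect confluent-hub install --no-prompt confluentinc/kafka-connect-sftp:latest
docker exec -it connect confluent-hub install --no-prompt confluentinc/kafka-connect-s3:latest
docker restart connect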

Kafka Connect SFTP to S3: Configuration steps

Once your environment is up and running, follow these steps to configure the Kafka Connect pipeline. You'll first create a Kafka topic, then set up the SFTP source connector, followed by the S3 sink connector.

Step 1: Access the Kafka Control Center

Open the Control Center UI in your browser:
http://localhost:9021

Step 2: Create a Kafka Topic

Go to the Topics section and create a new topic named:
users-topic
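
If you prefer the command line, the topic can also be created with the kafka-topics tool inside the broker container (the container name broker and the single-partition, single-replica settings are assumptions based on the development setup sketched earlier):

# Create the topic from inside the broker container
docker exec -it broker kafka-topics --bootstrap-server localhost:9092 \
  --create --topic users-topic --partitions 1 --replication-factor 1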


Step 3: Set up the SFTP Source Connector

  • Navigate to Connect > Add connector
  • Select SftpCsvSourceConnector from the default cluster listed on the Connect page.
  • Fill in the configuration with the following values:
{
  "name": "SftpCsvSourceConnectorConnector_0",
  "config": {
    "name": "SftpCsvSourceConnectorConnector_0",
    "connector.class": "io.confluent.connect.sftp.SftpCsvSourceConnector",
    "sftp.host": "192.168.1.236",
    "sftp.port": "2222",
    "sftp.username": "foo",
    "sftp.password": "****",
    "kafka.topic": "users-topic",
    "input.path": "/upload/input",
    "finished.path": "/upload/finished",
    "error.path": "/upload/fixit",
    "input.file.pattern": "users[0-9]{0,3}.csv",
    "key.schema": "{\"name\": \"com.example.users.UserKey\", \"type\": \"STRUCT\", \"isOptional\": false, \"fieldSchemas\": {\"id\": {\"type\": \"INT64\", \"isOptional\": false}}}",
    "value.schema": "{\"name\": \"com.example.users.User\", \"type\": \"STRUCT\", \"isOptional\": false, \"fieldSchemas\": {\"firstName\": {\"type\": \"STRING\", \"isOptional\": true}, \"lastName\": {\"type\": \"STRING\", \"isOptional\": true}}}",
    "schema.generation.enabled": "true",
    "csv.skip.lines": "1"
  }
}
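
If you would rather skip the UI, the same JSON can be posted to the Kafka Connect REST API (assuming the worker's default port 8083 is published and that the JSON above is saved in a file such as sftp-source.json, a name chosen here for illustration):

# Register the SFTP source connector via the Connect REST API
curl -s -X POST -H "Content-Type: application/json" \
  --data @sftp-source.json http://localhost:8083/connectors

# Check that the connector and its task are RUNNING
curl -s http://localhost:8083/connectors/SftpCsvSourceConnectorConnector_0/status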

Once launched, check that the connector status is marked as Running.


Step 4: Set up the S3 Sink Connector

  • Go back to Connect > Add connector
  • Select S3SinkConnector from the list
  • Provide the following values on the settings page and click Launch:
{
  "name": "S3SinkConnectorConnector_0",
  "config": {
    "name": "S3SinkConnectorConnector_0",
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "users-topic",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "flush.size": "5",
    "rotate.interval.ms": "60000",
    "schema.compatibility": "BACKWARD",
    "s3.bucket.name": "kafka-connector-tp-bucket",
    "s3.region": "eu-west-2",
    "aws.access.key.id": "<AWS Access Key>",
    "aws.secret.access.key": "****************************************",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "topics.dir": "tp-sink"
  }
}
  • Again, verify that the connector status shows Running.
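
With these settings, the sink buffers records per topic partition and commits an object once flush.size records have accumulated (rotate.interval.ms can force earlier rotation based on record timestamps). Assuming the connector's default partitioner, objects appear in the bucket under keys of roughly this shape, where the partition number and starting offset depend on your data:

tp-sink/users-topic/partition=0/users-topic+0+0000000000.avro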

Step 5: Test the solution we just built

  • Log in to the SFTP server and upload the following CSV file to the input folder (/upload/input) with the filename users001.csv (see the example commands after this list).
firstName, lastName
Grace, Abraham
Hannah, Allan
Heather, Alsop
  • After following the above steps, the file users001.csv should be moved to the finished folder on the SFTP server and a new file in Avro format should be visible in the specified S3 bucket.
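
For reference, here is a minimal sketch of the upload and verification steps from the command line, assuming the OpenSSH sftp client can reach the server and the AWS CLI is configured with access to the bucket:

# Upload the sample file into the SFTP input folder
sftp -P 2222 foo@192.168.1.236 <<'EOF'
put users001.csv /upload/input/
EOF

# Once the connectors have processed the file, list the objects written to S3
aws s3 ls s3://kafka-connector-tp-bucket/tp-sink/ --recursive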

Conclusion

Building a data pipeline between an SFTP server and Amazon S3 using Apache Kafka Connect is a reliable way to automate file transfers and reduce operational overhead. By using dedicated source and sink connectors, you avoid the need for custom scripts or manual interventions. This makes your workflow easier to maintain, more scalable and adaptable to future needs.

Once configured, the pipeline monitors your SFTP server for new files, moves them through Kafka and writes them to S3 in a structured format ready for analytics, archiving, or further processing.

Whether you're dealing with periodic batch uploads or planning a more dynamic, real-time data infrastructure, this approach offers a solid foundation for modern data integration.

If you're setting up a Kafka Connect pipeline and need technical guidance, from configuration issues to custom connector development, feel free to reach out.


Nisal Fernando
