Kafka 101

Kafka Basics

An application (the producer) sends messages to Kafka, and the messages are stored in the associated topic as records. Each topic is generally divided into a number of partitions, which are stored on server nodes known as brokers.

Kafka is an open-source technology originally developed at LinkedIn and later donated to the Apache Software Foundation. Some of the LinkedIn developers who built it went on to found their own company, Confluent, which provides services around Kafka.

A topic can get quite big, so it is divided into smaller partitions for scalability. When a topic is created, we need to choose the number of partitions (such as 8 for small topics and 32 for large topics) and the replication factor (the recommended value is 3). The replicas of a partition are placed on different brokers; for each partition, one of these brokers acts as the leader and the rest are followers. Instead of creating topics manually, we can also configure brokers to auto-create a topic when a message is published to a topic that does not yet exist.
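To make the leader/follower layout concrete, here is a minimal sketch that spreads the replicas of each partition over distinct brokers. The function name `assign_replicas` and the round-robin placement are assumptions made for illustration; this is not Kafka's actual assignment algorithm.

```python
from typing import Dict, List


def assign_replicas(num_partitions: int, replication_factor: int,
                    broker_ids: List[int]) -> Dict[int, List[int]]:
    """Return {partition: [leader, follower, ...]} using simple round-robin.

    Illustrative only: this is NOT Kafka's real placement logic.
    """
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment: Dict[int, List[int]] = {}
    for partition in range(num_partitions):
        # Take consecutive brokers starting at a rotating offset, so every
        # replica of a partition lands on a different broker.
        replicas = [broker_ids[(partition + i) % len(broker_ids)]
                    for i in range(replication_factor)]
        assignment[partition] = replicas  # first entry plays the leader role
    return assignment


layout = assign_replicas(num_partitions=8, replication_factor=3,
                         broker_ids=[1, 2, 3, 4])
for partition, replicas in layout.items():
    print(f"partition {partition}: leader={replicas[0]} followers={replicas[1:]}")
```

With four brokers and a replication factor of 3, every partition ends up with one leader and two followers on three different brokers.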

Kafka retains records for a configurable retention period (7 days by default, but it could be set to one day, for example) or until a configured size threshold is reached.
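The two retention limits can be pictured with a small sketch. It assumes records are kept oldest-first as `(timestamp, payload)` pairs and prunes them one by one; the helper `apply_retention` is hypothetical, and a real broker works on larger log files rather than on individual records.

```python
from typing import List, Optional, Tuple

Record = Tuple[float, bytes]  # (timestamp in seconds, payload)


def apply_retention(records: List[Record], now: float,
                    retention_seconds: float,
                    max_bytes: Optional[int] = None) -> List[Record]:
    """Drop the oldest records that violate the time or size limit."""
    # Time-based retention: keep only records younger than the retention period.
    kept = [r for r in records if now - r[0] <= retention_seconds]
    # Size-based retention: drop from the oldest end until under the threshold.
    if max_bytes is not None:
        total = sum(len(payload) for _, payload in kept)
        while kept and total > max_bytes:
            _, payload = kept.pop(0)
            total -= len(payload)
    return kept


DAY = 24 * 60 * 60
log = [(0 * DAY, b"a" * 10), (5 * DAY, b"b" * 10), (8 * DAY, b"c" * 10)]
remaining = apply_retention(log, now=9 * DAY, retention_seconds=7 * DAY)
print(len(remaining))
```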

A Kafka topic has several partitions, and each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log.
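The commit-log idea can be sketched as an in-memory, append-only list in which each record's position is its offset. The class `PartitionLog` is a hypothetical illustration, not part of any Kafka client.

```python
from typing import List, Tuple


class PartitionLog:
    """Append-only log: records get increasing offsets and are never modified."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, value: bytes) -> int:
        """Append a record and return its offset."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Tuple[int, bytes]]:
        """Return up to max_records (offset, value) pairs starting at offset."""
        chunk = self._records[offset:offset + max_records]
        return [(offset + i, value) for i, value in enumerate(chunk)]


log = PartitionLog()
first = log.append(b"hello")
second = log.append(b"world")
print(first, second, log.read(1))
```

Because records are only ever appended, a reader can resume from any offset it has remembered and see the same records in the same order.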

Producers

Producers can choose to receive acknowledgment of data writes:

  • acks=0: Producer won’t wait for an acknowledgment (possible data loss)

  • acks=1: Producer waits for leader acknowledgment (limited data loss)

  • acks=all: Leader + in-sync replicas acknowledgment (no data loss as long as at least one in-sync replica remains alive)
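The three levels can be modelled as a small decision function. This is a simplified sketch of the semantics listed above, with a hypothetical function name and parameters; it does not talk to a broker.

```python
def is_write_acknowledged(acks: str, leader_wrote: bool,
                          followers_in_sync: int, followers_acked: int) -> bool:
    """Decide whether the producer may treat a send as complete."""
    if acks == "0":
        # Fire and forget: the producer never waits for the broker.
        return True
    if acks == "1":
        # Only the leader has to persist the record.
        return leader_wrote
    if acks in ("all", "-1"):
        # The leader and every in-sync follower must have the record.
        return leader_wrote and followers_acked >= followers_in_sync
    raise ValueError(f"unsupported acks value: {acks!r}")


print(is_write_acknowledged("all", True, followers_in_sync=2, followers_acked=1))
```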

Producers can choose to send a key with the message. A key can be a string, a number, or any other value; generally it is some attribute, such as chat_id. If key=null, messages are distributed across the topic's partitions in a round-robin fashion. If a key is sent, the key is hashed to select the partition, so all messages whose keys produce the same hash always go to the same partition (as long as the number of partitions does not change). This means that all updates related to a given chat_id always go to the same partition.
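A minimal sketch of key-based partition selection follows. It uses `zlib.crc32` as a stand-in hash purely for illustration (the Java client's default partitioner uses a murmur2 hash instead), and the function name `choose_partition` is an assumption.

```python
import zlib
from itertools import count
from typing import Optional

_round_robin = count()  # shared counter for messages without a key


def choose_partition(key: Optional[bytes], num_partitions: int) -> int:
    """Pick a partition: hash the key if present, otherwise rotate."""
    if key is None:
        return next(_round_robin) % num_partitions
    # Same key -> same hash -> same partition, for a fixed partition count.
    return zlib.crc32(key) % num_partitions


p1 = choose_partition(b"chat-42", 8)
p2 = choose_partition(b"chat-42", 8)
print(p1 == p2)
```

Note that the mapping depends on `num_partitions`, which is why adding partitions to an existing topic can send a key to a different partition than before.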

Brokers

Kafka receives, stores, and sends messages on different nodes (separate server machines, such as separate EC2 instances) called brokers. Each broker is identified by an integer ID, broker.id.

Kafka APIs

In Kafka, communication between clients and servers takes place over the TCP protocol.

Producer Configs

  1. bootstrap.servers - A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping; this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).

  2. compression.type - The compression type for all data generated by the producer. The default is none (i.e. no compression). Valid values are none, gzip, snappy, lz4, or zstd. Compression is of full batches of data, so the efficacy of batching will also impact the compression ratio (more batching means better compression).

  3. ssl.key.password - The password of the private key in the key store file or the PEM key specified in 'ssl.keystore.key'.

  4. ssl.keystore.key - Private key in the format specified by 'ssl.keystore.type'.

  5. ssl.keystore.location - The location of the key store file.

  6. ssl.keystore.password - The store password for the key store file.

  7. batch.size - The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition. This helps performance on both the client and the server. This configuration controls the default batch size in bytes. No attempt will be made to batch records larger than this size.

  8. client.id - An id string to pass to the server when making requests. The purpose of this is to be able to track the source of requests beyond just ip/port by allowing a logical application name to be included in server-side request logging.

  9. delivery.timeout.ms - An upper bound on the time to report success or failure after a call to send() returns. This limits the total time that a record will be delayed prior to sending, the time to await acknowledgement from the broker (if expected), and the time allowed for retriable send failures. The producer may report failure to send a record earlier than this config if either an unrecoverable error is encountered, the retries have been exhausted, or the record is added to a batch which reached an earlier delivery expiration deadline. The value of this config should be greater than or equal to the sum of request.timeout.ms and linger.ms.

  10. request.timeout.ms - The configuration controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses the client will resend the request if necessary or fail the request if retries are exhausted. This should be larger than replica.lag.time.max.ms (a broker configuration) to reduce the possibility of message duplication due to unnecessary producer retries.

  11. sasl.kerberos.service.name - The Kerberos principal name that Kafka runs as. This can be defined either in Kafka's JAAS config or in Kafka's config.

  12. sasl.mechanism - SASL mechanism used for client connections. This may be any mechanism for which a security provider is available. GSSAPI is the default mechanism.

  13. security.protocol - Protocol used to communicate with brokers. Valid values are: PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL.

  14. ssl.enabled.protocols - The list of protocols enabled for SSL connections. The default is 'TLSv1.2,TLSv1.3' when running with Java 11 or newer, 'TLSv1.2' otherwise. With the default value for Java 11, clients and servers will prefer TLSv1.3 if both support it and fall back to TLSv1.2 otherwise (assuming both support at least TLSv1.2). This default should be fine for most cases. Also see the config documentation for ssl.protocol.

  15. ssl.protocol - The SSL protocol used to generate the SSLContext. The default is 'TLSv1.3' when running with Java 11 or newer, 'TLSv1.2' otherwise. This value should be fine for most use cases. Allowed values in recent JVMs are 'TLSv1.2' and 'TLSv1.3'. 'TLS', 'TLSv1.1', 'SSL', 'SSLv2' and 'SSLv3' may be supported in older JVMs, but their usage is discouraged due to known security vulnerabilities. With the default value for this config and 'ssl.enabled.protocols', clients will downgrade to 'TLSv1.2' if the server does not support 'TLSv1.3'. If this config is set to 'TLSv1.2', clients will not use 'TLSv1.3' even if it is one of the values in ssl.enabled.protocols and the server only supports 'TLSv1.3'.

  16. acks - The number of acknowledgments the producer requires the leader to have received before considering a request complete. This controls the durability of records that are sent. The following settings are allowed:

    • acks=0 If set to zero then the producer will not wait for any acknowledgment from the server at all. No guarantee can be made that the server has received the record in this case, and the retries configuration will not take effect (as the client won't generally know of any failures). The offset given back for each record will always be set to -1.

    • acks=1 This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers.

    • acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting.
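As a rough illustration of how several of these settings fit together, the sketch below collects them in a dictionary keyed by the property names from the list and checks the timeout relationship stated for delivery.timeout.ms. The host names, client id, numeric values, and the `validate` helper are assumptions chosen for the example, not recommended settings.

```python
from typing import Any, Dict, List

producer_config: Dict[str, Any] = {
    "bootstrap.servers": "broker1:9092,broker2:9092",  # hypothetical hosts
    "client.id": "chat-service",                        # hypothetical name
    "acks": "all",
    "compression.type": "lz4",
    "batch.size": 16384,            # bytes, illustrative value
    "linger.ms": 5,                 # illustrative value
    "request.timeout.ms": 30000,    # illustrative value
    "delivery.timeout.ms": 120000,  # illustrative value
}

VALID_COMPRESSION = {"none", "gzip", "snappy", "lz4", "zstd"}
VALID_ACKS = {"0", "1", "all", "-1"}


def validate(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems: List[str] = []
    if config["compression.type"] not in VALID_COMPRESSION:
        problems.append("compression.type is not a supported codec")
    if str(config["acks"]) not in VALID_ACKS:
        problems.append("acks must be 0, 1, all, or -1")
    budget = int(config["request.timeout.ms"]) + int(config["linger.ms"])
    if int(config["delivery.timeout.ms"]) < budget:
        problems.append(
            "delivery.timeout.ms must be >= request.timeout.ms + linger.ms")
    return problems


print(validate(producer_config))
```

A dictionary like this would typically be handed to a producer client's constructor; which key spelling a given client library expects is outside the scope of this sketch.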

Security

Apache Kafka allows clients to use SSL for encryption of traffic as well as authentication.

Generate SSL key and certificate for each Kafka broker - The first step of deploying one or more brokers with SSL support is to generate a public/private keypair for every server. You need to specify two parameters in the command:

  1. keystorefile: the keystore file that stores the keys (and later the certificate) for this broker. The keystore file contains the private and public keys of this broker, therefore it needs to be kept safe. Ideally this step is run on the Kafka broker that the key will be used on, as this key should never be transmitted/leave the server that it is intended for.

  2. validity: the valid time of the key in days.

> keytool -keystore {keystorefile} -alias localhost -validity {validity} -genkey -keyalg RSA -storetype pkcs12

Enabling SSL for Kafka

You're likely familiar with SSL from HTTPS websites. When SSL is enabled for a Kafka listener, all traffic for that channel will be encrypted with TLS, which employs digital certificates for identity verification.

Client-Broker Authentication

When a client opens a connection to a broker under SSL, it verifies the broker's certificate in order to confirm the broker's identity. If it checks out, the client is satisfied, but the broker may also wish to verify the client by certificate, making sure that the KafkaPrincipal associated with the connection represents the client’s identity. To ensure that the client’s certificate is checked by the broker, you can set ssl.client.auth=required. (Note that you can also set ssl.client.auth=requested, which isn't recommended because it only authenticates clients that present a certificate and assigns all others the problematic ANONYMOUS KafkaPrincipal.)
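The difference between the ssl.client.auth modes can be summarised in a small model. This sketch only mirrors the behaviour described above; it assumes any presented certificate has already been validated, and the function `resolve_principal` and the sample subject are hypothetical.

```python
from typing import Optional


def resolve_principal(ssl_client_auth: str,
                      client_cert_subject: Optional[str]) -> str:
    """Return the principal a connection would be treated as."""
    if ssl_client_auth not in {"required", "requested", "none"}:
        raise ValueError(f"unknown ssl.client.auth value: {ssl_client_auth!r}")
    if ssl_client_auth == "none":
        # The broker never asks for a certificate.
        return "User:ANONYMOUS"
    if client_cert_subject is not None:
        # Identity is taken from the (already validated) certificate.
        return f"User:{client_cert_subject}"
    if ssl_client_auth == "required":
        raise PermissionError("handshake rejected: client certificate required")
    # "requested" with no certificate: connection allowed, but anonymous.
    return "User:ANONYMOUS"


print(resolve_principal("requested", None))
```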

Since TLS uses certificates, you'll need to generate them and configure your clients and brokers with the appropriate keys and certificates. You will also need to periodically update the certificates before they expire, in order to avoid TLS handshake failures. See the Kafka documentation for more details on creating keys and certificates.

Inter-Broker Authentication

Everything in the previous client-broker authentication section applies to authentication between brokers using SSL. Essentially, the broker initiating the connection functions similarly to the client in the client-broker approach. Use the inter.broker.listener.name or security.inter.broker.protocol settings to configure listeners for inter-broker communication.

Enabling SASL-SSL for Kafka

SASL-SSL (SASL stands for Simple Authentication and Security Layer) uses TLS encryption like SSL but differs in its authentication process. To use the protocol, you must specify one of the four authentication methods supported by Apache Kafka: GSSAPI, PLAIN, SCRAM-SHA-256/512, or OAUTHBEARER. One of the main reasons you might choose SASL-SSL over SSL is that you'd like to integrate Kafka with existing authentication infrastructure in your organization, for example a Kerberos server backed by Active Directory or LDAP.
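As a sketch of what a client-side setup might look like, the snippet below renders a properties file for SASL_SSL with the SCRAM-SHA-512 mechanism. The property names are standard Kafka client settings; the user name, passwords, truststore path, and the `render_properties` helper are placeholders invented for the example.

```python
from typing import Dict


def render_properties(settings: Dict[str, str]) -> str:
    """Render key=value lines in the style of a Java .properties file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


sasl_ssl_client = {
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-512",
    # Placeholder credentials; never hard-code real secrets.
    "sasl.jaas.config": (
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        'username="alice" password="alice-secret";'
    ),
    # Placeholder truststore holding the CA that signed the broker certs.
    "ssl.truststore.location": "/path/to/client.truststore.jks",
    "ssl.truststore.password": "changeit",
}

text = render_properties(sasl_ssl_client)
print(text)
```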
