Kafka Optimisation in Action — Part I

Arpit Singh
4 min read · May 10, 2020

Achieving High Throughput, Availability, Durability and Low Latency with Kafka.

Organisations of all kinds use Kafka to handle the high volume of data generated every second.

For example, Pinterest stated in 2018 that its Kafka cluster was processing 1.2 petabytes per day with 2,000 brokers. LinkedIn's Kafka deployment, with over 4,000 brokers, has surpassed 7 trillion messages per day.

Understandably, we can never configure Kafka to get the best of availability, throughput, latency and durability all at once. There will always be tradeoffs to consider while configuring.

For example, to ensure durability we would write to every replica and wait before acknowledging the producer, which compromises throughput. So as long as we are aware of the important components, we can always set the best configuration for our use case.

Let’s revisit a few components that can help us understand what the best configuration for Kafka would be.

Producer

  • Batch Size : Allows the producer to accumulate messages until the batch reaches a certain size before sending it to the broker, reducing the load on brokers as well as the CPU overhead of processing each request.
  • Compression : Allows more data to be sent over the network in fewer bytes, at the cost of some CPU for compressing and decompressing.
  • Acknowledgment : Determines how many acknowledgments the leader broker must collect before responding to the producer. Setting acks=0 means the producer does not wait for any acknowledgment; acks=1 means the leader broker writes to its local log and responds without waiting for followers; acks=all means the leader waits until all in-sync replicas have the message.
  • Retries : Controls how many times the producer resends a failed message. Disabling retries frees bandwidth for unsent messages, while enabling them improves durability at the cost of possible duplicates or reordering.
  • Buffer Memory : Determines the amount of memory allocated to store unsent messages. If the memory limit is reached, the producer will block on additional sends until memory frees up or until the max.block.ms time expires.
  • Linger : Determines how long the producer waits before dispatching the current batch. A longer linger time combined with a larger batch size helps increase overall throughput, at the cost of latency.
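Taken together, these settings sketch a throughput-leaning producer configuration. The keys below are standard Kafka producer properties; the values are illustrative assumptions for this example, not recommended defaults:

```java
import java.util.Properties;

public class ProducerTuning {
    // A throughput-leaning producer config; values are illustrative assumptions.
    static Properties throughputTuned() {
        Properties props = new Properties();
        props.put("batch.size", "65536");       // accumulate up to 64 KB per partition batch
        props.put("linger.ms", "20");           // wait up to 20 ms for a batch to fill
        props.put("compression.type", "lz4");   // send more records in fewer bytes
        props.put("acks", "1");                 // leader-only ack: faster, less durable than acks=all
        props.put("retries", "3");              // resend failed batches a few times
        props.put("buffer.memory", "67108864"); // 64 MB buffer for unsent records
        props.put("max.block.ms", "60000");     // send() blocks up to 60 s when the buffer is full
        return props;
    }

    public static void main(String[] args) {
        throughputTuned().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

For a durability-leaning profile we would instead flip acks to "all" and keep retries enabled, accepting lower throughput.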

Consumer

  • Byte Size : Determines how much data the consumer gets from the leader in each fetch request. A higher value means fewer fetch requests and less CPU overhead at the broker end for processing each one.
  • Consumer Group : Using a consumer group helps parallelise consumption, because multiple consumers can balance the load and process multiple partitions simultaneously.
  • Session : If a consumer in a consumer group goes down, Kafka detects the failure and rebalances its partitions among the remaining members. A lower session timeout lets Kafka acknowledge the failure sooner, at the risk of spurious rebalances.
  • Commits : To ensure better durability, disable message auto commit: a consumer may fail after reading a message but before finishing processing it. Committing offsets manually, for example within a transaction, gives consumers control over when a message counts as consumed.
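These consumer knobs can be sketched the same way. The keys are standard Kafka consumer properties; the values and the group name are illustrative assumptions:

```java
import java.util.Properties;

public class ConsumerTuning {
    // A durability-leaning consumer config; values are illustrative assumptions.
    static Properties durabilityTuned() {
        Properties props = new Properties();
        props.put("group.id", "order-processors"); // hypothetical group name; members share partitions
        props.put("fetch.min.bytes", "65536");     // fewer, larger fetches -> less broker CPU
        props.put("session.timeout.ms", "10000");  // detect a dead group member within ~10 s
        props.put("enable.auto.commit", "false");  // commit offsets manually after processing
        return props;
    }

    public static void main(String[] args) {
        durabilityTuned().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With auto commit off, the application calls the consumer's commit API only after a record has been fully processed, so a crash mid-processing leads to a re-read rather than a lost message.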

Broker

  • Replicas : We can leverage Kafka’s replication to keep multiple copies of the data spread across different brokers. The in-sync replica (ISR) set contains the replicas that are caught up with the leader broker. For each partition, the leader automatically replicates messages to the other brokers in its ISR list.
  • Leader Election : When a leader goes down, the Kafka cluster automatically detects the failure and elects a new leader from the running replicas (only from the ISR if unclean.leader.election.enable=false). Choosing a leader from the ISR avoids losing messages that were committed but not yet replicated.
  • Fetcher Threads : Increasing the number of fetcher threads increases the degree of I/O parallelism in the follower broker, allowing faster replication of messages from the leader.
  • Topic Creation : If auto topic creation is enabled, it is useful to set the default replication factor to 3 to ensure availability. However, it is often better to turn off auto topic creation so that the replication and partition settings of each topic stay under explicit control.
  • Logs : By default the operating system controls when logs are flushed from the page cache to disk. In some scenarios, such as low-throughput topics, we may want explicit control over the flush interval.
  • Log Recovery : While bootstrapping, a broker scans its log files in preparation for getting in sync with the other brokers. Kafka dedicates a configurable number of threads to this process, and increasing that number can reduce log loading time.
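On the broker side, the corresponding knobs live in server.properties. A sketch with illustrative values (not Kafka’s defaults):

```properties
# Replication and leader election
default.replication.factor=3
unclean.leader.election.enable=false
# More threads replicating from leaders (fetcher threads)
num.replica.fetchers=4

# Create topics explicitly instead of on first use
auto.create.topics.enable=false

# Flush to disk after this many messages instead of leaving it to the OS
log.flush.interval.messages=10000

# Parallelise log scanning during startup/recovery
num.recovery.threads.per.data.dir=4
```

Each of these is a real broker config key; the numeric values are assumptions to tune per workload, and flush settings in particular are usually best left to the OS unless durability requirements demand otherwise.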
