Deep Dive Into Apache Kafka Architecture for Big Data Processing

Wojciech Gębiś - August 15, 2022 - updated on April 26, 2023

There’s no doubt that big data is changing the world as we know it. The volume, variety, and velocity of data are growing at an unprecedented rate, and businesses are looking for new ways to process and make use of this data.

In the world of big data, Apache Kafka reigns as one of the most popular platforms for processing large volumes of data in real-time. Its event-driven architecture and scalability make it an attractive choice for companies looking to build a robust big data ecosystem. In this article, we will explore the basics of Apache Kafka architecture and discuss some of its key use cases. We’ll also take a look at how you can get started using Kafka for your own big data needs!

Event-Driven Architecture

Before we dive into Apache Kafka, it’s essential to understand the concept of event-driven architecture (EDA). Event-driven architecture is a software architecture based on the production, detection, and consumption of events.

In an event-driven system, events are generated by one or more producers. These events are then detected by one or more consumers, who take some action in response to the event. For example, a consumer might save the event to a database or trigger another process in response to the event.

Event-driven systems have many advantages over traditional architectures. They are highly scalable and can handle large volumes of data with ease. They are also very flexible and can be easily adapted to changing requirements.

Reactive Manifesto

The Reactive Manifesto is a set of principles for building responsive, resilient, and elastic systems. It was created in response to the need for systems that can handle the ever-increasing volume, velocity, and variety of data.

Features of a reactive system

The manifesto states that reactive systems should be:

  • Responsive: The system should respond in a timely manner to user requests.
  • Resilient: The system should be able to recover from failures gracefully.
  • Elastic: The system should be able to scale up or down as needed to handle varying load.
  • Message Driven: The system should use asynchronous message passing to communicate between components.

Apache Kafka is an excellent example of a message-driven, reactive system. It is designed to handle large volumes of data with low latency and high throughput. It is also fault-tolerant and highly available.

If you want to dig deeper into various features of designing data-intensive applications, head over to our recent article: Future According to Designing Data-Intensive Applications.

Apache Kafka

Apache Kafka is an open-source, distributed streaming platform that allows for the development of real-time event-driven applications. In detail, it enables developers to create applications that continuously produce and consume data streams.

Kafka is distributed and runs as a cluster that can span multiple servers or even data centers. The produced records are replicated and partitioned in such a way that allows a high volume of users to use the application simultaneously without any perceptible lag in performance.

Kafka’s architecture is based on the concept of streams. A stream is an ordered, immutable sequence of records that are processed as they arrive. Each record in a stream consists of a key, a value, and a timestamp. Streams are divided into partitions, which are ordered subsets of records; partitions allow multiple consumers to process data in parallel, while within each partition Kafka preserves the exact order in which records occurred. Kafka is designed to be highly scalable, with support for horizontal scaling, and it is fault-tolerant, with the ability to recover automatically from node failures.

Put together, these features paint a picture of a very compelling platform.

What Is Kafka Used For?

Decoupling system dependencies: One of the key benefits of using Kafka is that it allows for decoupling system dependencies. Businesses today rely on an ever-growing number of systems that generate data, which makes the collection process challenging. For example, if you have a service that needs to process data from multiple sources, you can use Kafka to decouple the data producers from the data consumers. This way, the complex point-to-point integrations go away, as it becomes Kafka’s responsibility to stream all the incoming and outgoing data and broadcast updates to the services subscribed to a given stream.

Messaging: Another widespread use case for Kafka is messaging. Apache Kafka is used as a message broker to provide reliable, asynchronous, and scalable messaging services.

Notifications: Another area where Kafka shines is notifications. For example, let’s say you have a service that needs to send out emails or push notifications to users when certain events occur. With Kafka, you can easily decouple the event producers from the notification service by having the event producer publish the events to a Kafka topic and having the notification service consume those events from the topic.

Stream processing: Another common use case for Kafka is stream processing. In this type of application, you process streams of events with more sophisticated operations such as joins, aggregations, filters, transformations, and conditional processing, using event-time semantics and exactly-once processing. Such applications often power real-time analytics or fraud detection, where the ability to store and query data is an important capability for implementing stateful operations.

Real-time data pipelines: Kafka is often used as a central hub for collecting data from multiple sources. It can act as a buffer between services that produce and consume data. For example, you may have a service that collects data from numerous sensors and writes that data to Kafka. Then you can have other services that consume that data from Kafka and perform analytics or operations on the data or store the data in a database. It is also used as the central hub for streaming data in a microservices architecture.

Processing big data: Another everyday use case for Kafka is to store high-volume data and process it. Kafka can be used to process big data in batch or streaming mode. For example, you can use Kafka to process log files from multiple servers and store the processed data in a database or search index.

This list is in no way exhaustive, though, as you can see, Kafka has a lot of different use cases. It is a potent tool that can be used to build scalable, real-time applications. If you need a step-by-step guide on the real-life application of Kafka, head over to our blog posts, where we documented the process of building a fast, reliable, and fraud-safe game based on Kafka architecture: Part 1 and Part 2.

Now that we’ve covered the basics of Kafka, let’s take a more detailed look at its architecture.

Apache Kafka Architecture

Apache Kafka is a distributed streaming platform that consists of four core components:

  • Producers
  • Topics
  • Brokers
  • Consumers

Kafka Producers are the applications that generate data records. They use the Producer API to publish these streams, creating records and producing them to topics.
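As a minimal sketch of the Producer API (the broker address, topic name, key, and payload below are illustrative placeholders, not part of the original article), a producer serializes a keyed record and sends it to a topic; records with the same key are routed to the same partition, which preserves their order:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") determines the partition; same key -> same partition -> order preserved.
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "order-events", "user-42", "{\"orderId\": 1001, \"status\": \"CREATED\"}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Stored in %s-%d at offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any buffered records
    }
}
```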

Topics are ordered lists of events with a unique name. Topics are persisted to disk: records can be retained for just a few minutes if they are consumed immediately, for much longer, or even forever, as long as there is enough physical storage space. A topic can have multiple partitions, and each partition can have multiple replicas; the number of partitions and replicas is configurable.
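The partition count, replication factor, and retention period are all set per topic. Here is a hedged sketch using the Java AdminClient; the topic name, counts, and retention value are illustrative assumptions:

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("order-events", 6, (short) 3)  // 6 partitions, 3 replicas
                    .configs(Map.of("retention.ms", "604800000"));       // keep records for 7 days
            admin.createTopics(Set.of(topic)).all().get();               // block until the topic exists
        }
    }
}
```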

Brokers are the servers that run Kafka. They store data records in the partitions they lead and replicate them to follower replicas hosted on other brokers.

Consumers are the applications that read data from Kafka topics. They can either read all the records in a topic or subscribe to specific topic partitions to ingest the data. The Consumer API can read data in real time, or it can consume older records retained in the topic. The same published data set can be consumed multiple times by different consumers, which is a common scenario in modern cloud applications where the same data needs to be fed to multiple specialized systems. Because Kafka persists data to disk, it can deliver data to both real-time and batch consumers concurrently, without performance degradation.
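A minimal Consumer API sketch (the group id and topic are assumptions chosen to match the producer sketch above): the consumer joins a consumer group, subscribes to a topic, and polls records; other consumer groups can read the same retained data independently:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // assumed broker address
        props.put("group.id", "notification-service");          // each group gets its own copy of the stream
        props.put("auto.offset.reset", "earliest");             // start from the oldest retained record
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("order-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```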

Producers can write data that consumers read directly from topics, and that works for simple applications where the data does not need to change. For more complex applications, when you need to transform the data, you use the Streams API for sophisticated stream processing. The Streams API builds on the Producer and Consumer APIs to consume real-time data from one or more topics, analyze, aggregate, and transform it as needed in real time, and produce the resulting transformed streams to existing or new topics.

The Streams API is a powerful feature of Kafka that enables you to develop specialized systems and complex streaming applications.
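To illustrate, here is a hedged Kafka Streams sketch (the topic names and the filtering logic are illustrative assumptions) that consumes one topic, filters and transforms the events, and writes the resulting stream to another topic:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderFilterStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-filter-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("order-events");
        orders.filter((key, value) -> value.contains("\"status\": \"CREATED\"")) // keep only new orders
              .mapValues(value -> value.toUpperCase())                           // illustrative transformation
              .to("created-orders");                                             // write results to another topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```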

The Connector API enables you to write connectors, which are reusable producers and consumers that simplify and automate integration work. Whenever you need to integrate a common data source (e.g., MongoDB or any other database), you can use a pre-built connector to get data from that source into, or out of, the Kafka cluster.
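As a minimal illustration, the file source connector that ships with Kafka is configured with a short properties file and launched with the Connect runtime; purpose-built connectors (for MongoDB, JDBC databases, and so on) are configured in the same declarative way. The file path and topic below are placeholders:

```properties
# Minimal, illustrative source connector config (file path and topic are placeholders).
# Run with: bin/connect-standalone.sh config/connect-standalone.properties connect-file-source.properties
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
# Every line appended to this file becomes a record on the topic below.
file=/tmp/events.txt
topic=file-events
```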

Kafka architecture - Kafka message flow through components

Building Big Data Ecosystem Based on Kafka

Kafka acts as a central nervous system at the core of modern cloud applications that rely on real-time experiences and move millions of data and event records across the infrastructure. But Kafka does not stand alone; it is often combined with other technologies to create a broader stream processing, event-driven architecture, or big data analytics solution.

A big data ecosystem is a set of software components that may be used to create a distributed architecture for processing large amounts of data, including structured, semi-structured, or unstructured information. These data sets come from many sources and range in size from terabytes to petabytes and beyond.

Modern big data architecture

Big data frameworks such as these are frequently used in high-performance computing (HPC). This technique may be used to tackle complex issues in areas like logistics, engineering, or the banking sector. Identifying solutions to problems within these areas is often dependent on sifting through as much relevant data as possible and doing so in real-time.

Features to Look Out for When Building Big Data Pipeline

To create a reliable and effective big data processing pipeline with Kafka, there are a few key features that you should look out for:

  • Support for various big data sources
  • Low latency
  • High-throughput
  • Scalability
  • Batch-processing and stream-processing features
  • Flexibility
  • Cost-effectiveness, even when processing large volumes of data

A stream data platform for big data that will address the features mentioned above can rely on solutions like Kafka together with other frameworks - Hadoop, Spark, Flink, and Hive, all of which are open source projects developed by the Apache Software Foundation.

Big data architecture based on Kafka, Hadoop, Spark and other frameworks and DBs

Apache Hadoop for Big Data Ingestion and Processing

Hadoop is an open-source big data framework that handles batch-oriented data ingestion, storage, and analysis. The Hadoop Distributed File System (HDFS) is its storage layer, which stores data cost-effectively across multiple commodity servers. YARN is the resource management layer of Hadoop, responsible for job scheduling and cluster resource utilization.

Apache Spark

Apache Spark is a data processing engine that can quickly execute processing tasks on huge data sets and distribute the work across several machines. Spark is known for its in-memory processing power. It is not a pure real-time system; instead, it executes processing at pre-defined intervals in micro-batches. It is an ultra-fast unified analytics engine whose features make it a go-to tool for big data and machine learning solutions. Spark can run on top of HDFS, supports several programming languages such as Scala, Java, and Python, and provides an interactive shell.
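To show how Spark and Kafka are typically wired together, here is a hedged Structured Streaming sketch; it assumes the spark-sql-kafka connector is on the classpath, and the broker address and topic are placeholders. Spark subscribes to a Kafka topic and processes the records in micro-batches:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaToSpark {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-to-spark")
                .getOrCreate();

        // Subscribe to a Kafka topic; each row carries key, value, topic, partition, offset, timestamp.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker address
                .option("subscribe", "order-events")                  // placeholder topic
                .load();

        // Decode the binary value column and print the micro-batches to the console for inspection.
        events.selectExpr("CAST(value AS STRING) AS json")
              .writeStream()
              .format("console")
              .start()
              .awaitTermination();
    }
}
```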

If you’re looking for a more comprehensive article on Apache Spark, head over to our recent article on this stream processing framework - What is Apache Spark? Architecture, Use Cases, and Benefits.

Apache Hudi

Hudi stands for Hadoop Upserts Deletes and Incrementals. This framework manages the storage of large analytical datasets on DFS (Cloud stores, HDFS, or any Hadoop FileSystem compatible storage), bringing transactions, record-level updates/deletes, and change streams to data lakes. Hudi is a tool that helps manage data pipelines and provides capabilities for incremental processing and time-travel analytics on top of different big data engines.

Apache Flink

The Apache Flink framework is designed to perform both batch and stream processing, allowing for stateful computations on streaming data. It features built-in support for event-time operations that can be used to handle the late arrival of events with ease.

Real-time stream processing and batch processing with Flink

Apache Hive

Hive is a distributed data warehouse system that helps with batch-oriented querying and analysis of large datasets stored in HDFS. It supports various file formats like ORC, Parquet, CSV, or JSON. Users can access the data using an SQL-like language called HiveQL to perform different types of transformations and analyses efficiently on large datasets. Read more about Apache Hive here.

Presto

Presto, from the Presto Foundation, is a distributed SQL query engine designed to run interactive analytic queries against data sources of all sizes. It can connect both to non-relational sources, such as HDFS, Amazon S3, Cassandra, or MongoDB, and to relational data sources such as MySQL, PostgreSQL, Microsoft SQL Server, and Amazon Redshift. Presto can query data where it is currently stored, without the need to move it into yet another analytics system.

Big data architecture with incremental ingestion for modern applications

Data pipelines built with the help of these big data processing frameworks allow for the efficient movement of huge amounts of data from multiple sources into a central location where it can be processed and analyzed further. The use of Apache Kafka as part of the architecture helps to provide a fast, scalable, and reliable solution that can deal with large volumes of data and can be used for real-time data analysis.

Kafka Architecture Advantages

Scalability

Kafka is horizontally scalable to support a growing number of users and use cases, meaning that it can handle an increasing amount of data by adding more nodes and partitions to the system.
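For example, an existing topic can be scaled out by raising its partition count so more consumers can read in parallel. A hedged sketch using the Java AdminClient; the topic name and target count are illustrative:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class ScaleOutTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow "order-events" from its current partition count to 12.
            // Partitions can only be increased, and key-to-partition mapping changes for new records.
            admin.createPartitions(Map.of("order-events", NewPartitions.increaseTo(12)))
                 .all()
                 .get();
        }
    }
}
```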

Performance - High Throughput and Low Latency

Kafka has low latency, meaning that messages are processed quickly, which matters for real-time applications where data needs to be processed in near-real time. Kafka is also designed for high throughput: it transports and distributes data to multiple specialized systems at high speed, which it achieves by batching records together and compressing them before they are stored or transmitted.
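Batching and compression are controlled through producer configuration. Below is a hedged sketch of the relevant settings; the values are illustrative starting points, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ThroughputTunedProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);       // collect up to 64 KB per partition batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);           // wait up to 20 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // compress whole batches before sending
        return props;
    }
}
```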

Fault Tolerance & Reliability

Kafka is designed to be a highly available and fault-tolerant system. It uses a replicated, immutable log that persists messages to disk, so messages are not lost and each message is delivered at least once.

Kafka has the ability to re-sync nodes that have failed and restore their state from a replica. This helps to minimize downtime in the event of a node failure and ensures that the data is always available.
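In practice, durability comes from pairing topic-level replication with producer acknowledgment settings. A hedged sketch (values are illustrative) of a producer configuration aimed at at-least-once, replicated writes:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DurableProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas to acknowledge
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);   // avoid duplicates on producer retries
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // keep retrying transient failures
        // On the topic side, pair this with replication.factor=3 and min.insync.replicas=2
        // so a write survives the loss of a single broker.
        return props;
    }
}
```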

Trusted Open-Source Project

Kafka is an open-source Apache project that is trusted by thousands of companies, including over 80% of the Fortune 100. Among its heavy users you will find many of the big tech companies, including Uber, Shopify, Airbnb, Intuit, Zalando, and of course LinkedIn, where Kafka was originally created.

Kafka is constantly evolving, with new features and improvements being added in each release. The community is very active, and there is a lot of support available.

Choosing the Right Big Data Solution

When it comes to choosing the right big data solution, it is important to keep in mind the features that are required for your specific use case. In general, a good big data architecture should handle batch and stream processing, be scalable and economical, offer low latency and high throughput, and provide flexibility. If you are looking for a tool that can help you with event-driven architecture and streaming data processing, then Apache Kafka is definitely worth considering. Combined with other big data solutions like Hadoop, Spark, or Hive, it can provide you with a robust ecosystem for all your big data needs.

Still unsure which big data solution is right for you or how to approach your application modernization? Our team of experts can help you design the perfect big data architecture for your specific needs. Get in touch with us today to find out more.

About the author

Wojciech Gębiś

Project Lead & DevOps Engineer


Wojciech is a seasoned engineer with experience in development and management. He has worked on many projects and in different industries, making him very knowledgeable about what it takes to succeed in the workplace by applying Agile methodologies. Wojciech has deep knowledge about DevOps principles and Machine Learning. His practices guarantee that you can reliably build and operate a scalable AI solution.
You can find Wojciech working on open source projects or reading up on new technologies that he may want to explore more deeply.
