What is Apache Flink? Architecture, Use Cases, and Benefits

What is Apache Flink? Architecture, Use Cases, and Benefits

Wojciech Gębiś - November 29, 2022 - updated on May 27, 2024

Apache Flink is a robust open-source stream processing framework that has gained much traction in the big data community in recent years. It allows users to process and analyze large amounts of streaming data in real time, making it an attractive choice for modern applications such as fraud detection, stock market analysis, and machine learning.

In this article, we’ll take a closer look at what Apache Flink is and how it can be used to benefit your business.

Modern Big Data Architecture

Big data is more than just a buzzword- it’s a reality for businesses of all sizes. And to take advantage of big data, you need a modern big data ecosystem.

A modern big data ecosystem includes hardware, software, and services that work together to process and analyze large volumes of data. The goal is to enable businesses to make better decisions faster and improve their bottom line.

Several components are essential to a thriving big data ecosystem:

  • Data Variety: Different data types from multiple sources are ingested and outputted (structured, unstructured, semi-structured).
  • Velocity: Fast ingest and processing of data in real-time.
  • Volume: Scalable storage and processing of large amounts of data.
  • Cheap raw storage: Ability to store data affordably in its original form.
  • Flexible processing: Ability to run various processing engines on the same data.
  • Support for streaming analytics: Streaming analytics refers to providing low latency to process real-time data streams in near-real-time.
  • Support for modern applications: Ability to power new types of applications that require fast, flexible data processing like BI tools, machine learning systems, log analysis, and more.

What is Batch Processing?

Batch processing is a type of computing process that involves collecting data and running it through a set of tasks in batches. Data is collected, sorted, and there are usually multiple steps involved in the process. The result of the batch process is typically stored for future use.

Batch processing has been used for decades to manage large volumes of data and still has many applications. Still, it isn’t suitable for real-time applications where near-instantaneous results are required.

Batch processing

Batch processing

What is Stream Processing?

Before we get into Apache Flink, it’s essential to understand stream processing. Stream processing is a type of data processing that deals with continuous, real-time data streams.

How does stream processing work?

How does stream processing work?

Data streaming differs from batch processing, which deals with discrete data sets processed in batches. Batch processing can be thought of as dealing with “data at rest,” while stream processing deals with “data in motion.”

Continuous stream processing - stream processing tools run operations on streaming data to enable real time analytics

Continuous stream processing - stream processing tools run operations on streaming data to enable real time analytics

Stream processing has several benefits over batch processing:

  • Lower latency: Since stream processors deal with data in near-real-time, the overall latency is lower and offers the opportunity for multiple specific use cases that need in-motion checks.
  • Flexibility: Stream process transaction data is generally more flexible than batch, as a wider variety of end applications, data types, and formats can easily be handled. It can also accommodate changes to the data sources (e.g., adding a new sensor to an IoT application).
  • Less expensive: Since stream processors can handle a continuous data flow, the overall cost is lower (lack of a need to store data before processing it).

Continuous stream processing - stream processing tools run operations on streaming data to enable real time analytics

Continuous stream processing - stream processing tools run operations on streaming data to enable real time analytics

Stream Processing Tools

Now that we’ve covered the basics of big data and stream processing, let’s take a closer look at stream processing frameworks.

Several stream processing tools are available, each with its own strengths and weaknesses. Some of the most popular stream processing tools include Apache Storm, Apache Samza, Apache Spark, and Apache Flink - the framework we want to focus on in this article.

Apache Flink is an open-source stream processing framework and distributed processing engine from the Apache Software Foundation that provides powerful, fault-tolerant, and expressive data processing capabilities. It was designed to combine the strengths of both batch and streaming processes, allowing developers to create applications that process real-time and historical data in a single system.

Process Unbounded and Bounded Data Streams

Apache Flink allows for both bounded and unbounded data stream processing. Bounded data streams are finite, while unbounded streams are infinite.

Bounded and unbounded streams

Bounded and unbounded streams

Bounded Data Streams

Bounded data streams have a defined beginning and end; they can be processed in one batch job or multiple parallel jobs. Apache Flink’s DataSet API is used to process bounded data sets, consisting of individual elements over which the user iterates. This type of system is often used for batch-like processing of data that is already present and known ahead of time - such as a customer database or log files.

Unbounded Streams

Unbounded data streams, on the other hand, have no start or end point; they continuously receive new elements that need to be processed right away. This type of processing requires a system that is always running and ready to accept incoming elements as soon as they arrive. To accomplish this, Apache Flink offers a DataStream API for real-time processing of the streaming data, allowing users to write applications that process unbounded streams of data.

Flink is based on a distributed dataflow engine that doesn’t have its own storage layer. Instead, it utilizes external storage systems like HDFS (Hadoop Distributed File System), S3, HBase, Kafka, Apache Flume, Cassandra, and any RDBMS (relational database) with a set of connectors. This allows Flink to process data from any source at any scale in a distributed manner. At its core is a distributed execution engine that supports various workloads, including batch processing, streaming, graph processing, and machine learning.

The next layer of Flink’s architecture is deployment management. Flink can be either deployed in local mode (for test and development purposes) or in a distributed manner for production use. The deployment management layer consists of components like Flink-runtime, Flink-client, Flink-web UI, Flink-distributed shell, and Flink-container. These components work together to manage the deployment and execution of Flink applications across a distributed cluster. To run as a multi-node cluster, Flink is tightly integrated with resource managers like YARN (Yet Another Resource Negotiator), Mesos, Docker, Kubernetes, or in the standalone mode.

High-level Apache Flink Application

High-level Apache Flink Application

Flink Kernel is the core element of the Apache Flink framework. The runtime layer provides distributed processing, fault tolerance, reliability, and native iterative processing capability.

The execution engine handles Flink tasks, which are units of distributed computations spread over many cluster nodes. This ensures that Flink can run efficiently on large-scale clusters.

Apache Flink: Stateful Computations over Data Streams supporting event-driven applications, streaming pipelines, and stream and batch analytics

Apache Flink: Stateful Computations over Data Streams supporting event-driven applications, streaming pipelines, and stream and batch analytics

Flink uses a master/slave architecture with JobManager and TaskManagers. The Job Manager is responsible for scheduling and managing the jobs submitted to Flink and orchestrating the execution plan by allocating resources for tasks. The Task Managers are accountable for executing user-defined functions on allocated resources across multiple nodes in a cluster.

Apache Flink master/slave core architecture with Flink Master and its JobManager and Resource Manager, and Task Managers for distributed streaming dataflow

Apache Flink master/slave core architecture with Flink Master and its JobManager and Resource Manager, and Task Managers for distributed streaming dataflow

The advantage of this architecture is that it can efficiently scale to process large data sets in near real-time. It also provides fault tolerance and allows for job restarts with minimal data loss - a crucial capability for mission-critical applications.

Flink is not just a data processing tool but an ecosystem with many different tools and libraries. The most important ones are the following:

Apache Flink Ecosystem Components - DataStream API for stream processing and DataSet API for batch processing and supporting libraries: CEP, Table, FlinkML, Gelly,

Apache Flink Ecosystem Components - DataStream API for stream processing and DataSet API for batch processing and supporting libraries: CEP, Table, FlinkML, Gelly,
Flink Layered APIs
Flink Layered APIs

DataSet APIs

The DataSet API is Flink’s core API for batch processing. It is used for operations like map, reduce, (outer) join, co-group, and iterate.

DataStream APIs

DataStream API is used to process streaming data (unbounded and infinite live data streams). It allows users to define arbitrary operations on incoming events, such as windowing, record-at-a-time transformations, and enriching events by querying an external data store.

Complex Event Processing (CEP)

Flink’s Complex Event Processing library allows users to specify patterns of events using a regular expression or state machine. The CEP library is integrated with Flink’s DataStream API so that pattern recognition can be performed on data in real time. Potential applications for the CEP library include network anomaly detection, rule-based alerting, process monitoring, and fraud detection.

SQL & Table API

The Flink ecosystem also includes APIs for relational queries - SQL and Table APIs. They provide a unified way of expressing and executing both stream and batch processing jobs. It allows users to write SQL queries, use the Table API, and easily manipulate data based on table schemas to construct complex data transformation pipelines with minimal effort.

Gelly

Gelly is a versatile graph processing and analysis library that runs on top of the DataSet API. Gelly integrates seamlessly with the DataSet API, making it both scalable and robust. Gelly features built-in algorithms such as label propagation, triangle enumeration, and page rank but also provides a Graph API to ease the implementation of custom graph algorithms.

FlinkML

FlinkML is a library of distributed machine learning algorithms that run on top of the DataSet API. It provides users with a unified way to apply both supervised and unsupervised learning techniques such as linear regression, logistic regression, decision trees, k-means clustering, LDA, and more. FlinkML also features an experimental deep learning framework for building neural networks (packaging TensorFlow).

Apache Flink is a powerful tool for handling big data and streaming applications. It supports both bounded and unbounded data streams, making it an ideal platform for a variety of use cases, such as:

  • Event-driven applications: Event-driven applications access their data locally rather than querying a remote database. By doing so, they improve performance in terms of both throughput and latency. Many of Flink’s outstanding features are centered around the proficient handling of time and state. Flink can be the central point of an event-driven architecture of a stateful application that ingests events from one or more event streams and reacts to incoming events by triggering computations, state updates, or external actions (e.g., fraud detection, anomaly detection, rule-based alerting, business process monitoring, financial and credit card transactions systems, social networks, and other message-driven systems).

  • Continuous data pipelines: Instead of running periodic ETL jobs (Extract Transform Load), you can achieve similar functionalities of transforming and enriching data and moving it from one storage system to another but in a continuous streaming mode.

    Periodic ETL vs. continuous data pipeline

    Periodic ETL vs. continuous data pipeline

  • Real-time data analytics: Flink is a true streaming engine with very low processing latencies that is ideal for processing data in near real-time, making it an excellent tool for monitoring and triggering actions or alerts (e.g., ad-hoc analysis of live data in various industries, customer experience monitoring, large-scale graph analysis, and network intrusion detection).

    Batch and real-time processing with Flink

    Batch and real-time processing with Flink

  • Machine learning: FlinkML provides a library of distributed machine learning algorithms that run on top of the DataSet API, allowing developers to train models quickly with large datasets. FlinkML enables integration with other deep learning frameworks for more complex AI solutions.

  • Graph processing: Gelly is a versatile graph processing and analysis library that runs on top of the DataSet API, providing graph computations.

Apache Flink is a powerful distributed processing system for stateful computations that has become increasingly popular recently. There are many reasons for Flink’s popularity, but some of the most important benefits include its speed, ease of use, and ability to handle large data sets.

We can specify many advantages of using Apache Flink, including the following:

  • Stateful stream processing: Flink’s stateful stream processing allows users to define distributed computations over continuous data streams. This enables complex event processing analytics on event streams such as windowed joins and aggregations, pattern matching, etc.
  • Stream and batch processing: Apache Flink is a great choice for real-time streaming applications that need to process both streaming and batch data.
  • Scalability: Apache Flink can scale up to thousands of nodes with minimal latency and throughput loss due to its efficient network communication protocols.
  • API support: Apache Flink supports APIs for writing streaming applications in Java and Scala.
  • Fault tolerance and availability: Apache Flink framework is built on top of the robust Akka actor system, which provides inherent fault tolerance. Apache Flink’s distributed runtime engine ensures high availability and fault-tolerant stream processing, making it a great choice for mission-critical streaming applications.
  • Low latency and high throughput: Apache Flink’s lightning-fast speed and high throughput processing make it ideal for real-time analytics or processing data from sources like sensor measurements from IoT devices, machine logs, credit card transactional data, or web and mobile click streams.
  • Flexible data formats: Managing data in different formats can be challenging, but Apache Flink supports several different data formats like CSV, JSON, Apache Parquet, and Apache Avro.
  • Optimization: Flink query optimizer provides several built-in optimizations, such as pipelining, data fusion, and loop unrolling, to reduce computation time. Flink Table API and SQL provide additional query optimizations and tuned operator implementations.
  • Flexible deployment: Apache Flink offers first-class support for several common clustered deployment targets, including YARN, Apache Mesos, Docker, and Kubernetes. It can also be configured to run as a standalone cluster.

There are also some limitations and disadvantages to using Apache Spark, including the following:

  • Steep learning curve: Apache Flink is a robust framework with many features and capabilities, which can make it overwhelming for new users.
  • Project maturity and community size: It is not as popular as its competitors but is recently gaining more and more popularity, and the Apache Flink community is steadily growing.
  • Limited API support: Apache Flink currently only supports the Java and Scala APIs, so developers using other languages will have to use wrappers or external libraries.
  • Basic machine learning support: While Apache Flink provides basic machine learning support through the FlinkML library, it is limited compared to more comprehensive frameworks (still, support for deep learning is provided by community projects like TensorFlow on the Flink project).

In the big data environment, Flink is a component that solely focuses on computation and does not offer storage. As part of the big data infrastructure stack, Flink combines with other technologies to provide an end-to-end solution for organizations looking to analyze large datasets quickly and efficiently. Usually, it is set up together with Apache Kafka as an event log and systems such as HDFS or other databases as the storage layer to offer periodic ETL jobs or continuous data pipelines.

Apache Flink in Data Ecosystem with Apache Kafka, HDFS, Elasticsearch, HBase, and others providing data ingestion and ETL functionalities and analytics on both batch and streaming data

Apache Flink in Data Ecosystem with Apache Kafka, HDFS, Elasticsearch, HBase, and others providing data ingestion and ETL functionalities and analytics on both batch and streaming data

Flink is being used by some of the world’s leading companies, including Amadeus, Capital One, Netflix, eBay, Lyft, Uber, and Zalando. Each of these use cases requires a different approach to data processing or support for machine learning solutions, and Apache Flink can handle all of them with ease.

You can implement Flink on your own or use it as a fully-managed service. Fully-managed services are an alternative approach to getting started with Flink without worrying about the underlying infrastructure.

If you seek a managed solution, then Apache Flink can be found as part of Amazon EMR, Amazon Kinesis Data Analytics, Google Cloud Dataproc, Microsoft Azure HDInsight, Cloudera, and Ververica Platform. Although they may be less flexible in some cases, these comprehensive managed services offer the underlying infrastructure for Flink and support for provisioning compute resources, parallel computation, automatic scaling, and application backups.

Conclusion

So there you have it – a quick introduction to Apache Flink, common use cases, and its many benefits. Like most of the stream processing frameworks on the market, it can be used together with other tools to create a more robust bid data processing architecture.

Overall, Apache Flink offers several significant benefits that have made it one of the most popular analytics engines available today. Its lightning-fast speed, the fact that it is a distributed system that can process both batch and streaming data in a fault-tolerant manner, and its ability to handle large data sets make it an appealing option for a wide range of applications.

If you’re looking for a powerful low latency streaming engine that can handle all your workloads (and more), then Apache Flink is definitely worth considering. And if you need help getting started, don’t hesitate to contact our team of experts. We’d be happy to walk you through the basics and help get your Flink program implementation up and running in no time!

References

Apache Flink

Apache Flink GitHub page

Flink Forward Conference

About the author

Wojciech Gębiś

Wojciech Gębiś

Project Lead & DevOps Engineer

Linkedin profile Twitter Github profile

Wojciech is a seasoned engineer with experience in development and management. He has worked on many projects and in different industries, making him very knowledgeable about what it takes to succeed in the workplace by applying Agile methodologies. Wojciech has deep knowledge about DevOps principles and Machine Learning. His practices guarantee that you can reliably build and operate a scalable AI solution.
You can find Wojciech working on open source projects or reading up on new technologies that he may want to explore more deeply.

Would you like to discuss AI opportunities in your business?

Let us know and Dorota will arrange a call with our experts.

Dorota Owczarek
Dorota Owczarek
AI Product Lead

Thanks for the message!

We'll do our best to get back to you
as soon as possible.

This article is a part of

Becoming AI Driven
98 articles

Becoming AI Driven

Artificial Intelligence solutions are becoming the next competitive edge for many companies within various industries. How do you know if your company should invest time into emerging tech? How to discover and benefit from AI opportunities? How to run AI projects?

Follow our article series to learn how to get on a path towards AI adoption. Join us as we explore the benefits and challenges that come with AI implementation and guide business leaders in creating AI-based companies.

check it out

Becoming AI Driven

Insights on practical AI applications just one click away

Sign up for our newsletter and don't miss out on the latest insights, trends and innovations from this sector.

Done!

Thanks for joining the newsletter

Check your inbox for the confirmation email & enjoy the read!

This site uses cookies for analytical purposes.

Accept Privacy Policy

In the interests of your safety and to implement the principle of lawful, reliable and transparent processing of your personal data when using our services, we developed this document called the Privacy Policy. This document regulates the processing and protection of Users’ personal data in connection with their use of the Website and has been prepared by Nexocode.

To ensure the protection of Users' personal data, Nexocode applies appropriate organizational and technical solutions to prevent privacy breaches. Nexocode implements measures to ensure security at the level which ensures compliance with applicable Polish and European laws such as:

  1. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (published in the Official Journal of the European Union L 119, p 1); Act of 10 May 2018 on personal data protection (published in the Journal of Laws of 2018, item 1000);
  2. Act of 18 July 2002 on providing services by electronic means;
  3. Telecommunications Law of 16 July 2004.

The Website is secured by the SSL protocol, which provides secure data transmission on the Internet.

1. Definitions

  1. User – a person that uses the Website, i.e. a natural person with full legal capacity, a legal person, or an organizational unit which is not a legal person to which specific provisions grant legal capacity.
  2. Nexocode – NEXOCODE sp. z o.o. with its registered office in Kraków, ul. Wadowicka 7, 30-347 Kraków, entered into the Register of Entrepreneurs of the National Court Register kept by the District Court for Kraków-Śródmieście in Kraków, 11th Commercial Department of the National Court Register, under the KRS number: 0000686992, NIP: 6762533324.
  3. Website – website run by Nexocode, at the URL: nexocode.com whose content is available to authorized persons.
  4. Cookies – small files saved by the server on the User's computer, which the server can read when when the website is accessed from the computer.
  5. SSL protocol – a special standard for transmitting data on the Internet which unlike ordinary methods of data transmission encrypts data transmission.
  6. System log – the information that the User's computer transmits to the server which may contain various data (e.g. the user’s IP number), allowing to determine the approximate location where the connection came from.
  7. IP address – individual number which is usually assigned to every computer connected to the Internet. The IP number can be permanently associated with the computer (static) or assigned to a given connection (dynamic).
  8. GDPR – Regulation 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of individuals regarding the processing of personal data and onthe free transmission of such data, repealing Directive 95/46 / EC (General Data Protection Regulation).
  9. Personal data – information about an identified or identifiable natural person ("data subject"). An identifiable natural person is a person who can be directly or indirectly identified, in particular on the basis of identifiers such as name, identification number, location data, online identifiers or one or more specific factors determining the physical, physiological, genetic, mental, economic, cultural or social identity of a natural person.
  10. Processing – any operations performed on personal data, such as collecting, recording, storing, developing, modifying, sharing, and deleting, especially when performed in IT systems.

2. Cookies

The Website is secured by the SSL protocol, which provides secure data transmission on the Internet. The Website, in accordance with art. 173 of the Telecommunications Act of 16 July 2004 of the Republic of Poland, uses Cookies, i.e. data, in particular text files, stored on the User's end device.
Cookies are used to:

  1. improve user experience and facilitate navigation on the site;
  2. help to identify returning Users who access the website using the device on which Cookies were saved;
  3. creating statistics which help to understand how the Users use websites, which allows to improve their structure and content;
  4. adjusting the content of the Website pages to specific User’s preferences and optimizing the websites website experience to the each User's individual needs.

Cookies usually contain the name of the website from which they originate, their storage time on the end device and a unique number. On our Website, we use the following types of Cookies:

  • "Session" – cookie files stored on the User's end device until the Uses logs out, leaves the website or turns off the web browser;
  • "Persistent" – cookie files stored on the User's end device for the time specified in the Cookie file parameters or until they are deleted by the User;
  • "Performance" – cookies used specifically for gathering data on how visitors use a website to measure the performance of a website;
  • "Strictly necessary" – essential for browsing the website and using its features, such as accessing secure areas of the site;
  • "Functional" – cookies enabling remembering the settings selected by the User and personalizing the User interface;
  • "First-party" – cookies stored by the Website;
  • "Third-party" – cookies derived from a website other than the Website;
  • "Facebook cookies" – You should read Facebook cookies policy: www.facebook.com
  • "Other Google cookies" – Refer to Google cookie policy: google.com

3. How System Logs work on the Website

User's activity on the Website, including the User’s Personal Data, is recorded in System Logs. The information collected in the Logs is processed primarily for purposes related to the provision of services, i.e. for the purposes of:

  • analytics – to improve the quality of services provided by us as part of the Website and adapt its functionalities to the needs of the Users. The legal basis for processing in this case is the legitimate interest of Nexocode consisting in analyzing Users' activities and their preferences;
  • fraud detection, identification and countering threats to stability and correct operation of the Website.

4. Cookie mechanism on the Website

Our site uses basic cookies that facilitate the use of its resources. Cookies contain useful information and are stored on the User's computer – our server can read them when connecting to this computer again. Most web browsers allow cookies to be stored on the User's end device by default. Each User can change their Cookie settings in the web browser settings menu: Google ChromeOpen the menu (click the three-dot icon in the upper right corner), Settings > Advanced. In the "Privacy and security" section, click the Content Settings button. In the "Cookies and site date" section you can change the following Cookie settings:

  • Deleting cookies,
  • Blocking cookies by default,
  • Default permission for cookies,
  • Saving Cookies and website data by default and clearing them when the browser is closed,
  • Specifying exceptions for Cookies for specific websites or domains

Internet Explorer 6.0 and 7.0
From the browser menu (upper right corner): Tools > Internet Options > Privacy, click the Sites button. Use the slider to set the desired level, confirm the change with the OK button.

Mozilla Firefox
browser menu: Tools > Options > Privacy and security. Activate the “Custom” field. From there, you can check a relevant field to decide whether or not to accept cookies.

Opera
Open the browser’s settings menu: Go to the Advanced section > Site Settings > Cookies and site data. From there, adjust the setting: Allow sites to save and read cookie data

Safari
In the Safari drop-down menu, select Preferences and click the Security icon.From there, select the desired security level in the "Accept cookies" area.

Disabling Cookies in your browser does not deprive you of access to the resources of the Website. Web browsers, by default, allow storing Cookies on the User's end device. Website Users can freely adjust cookie settings. The web browser allows you to delete cookies. It is also possible to automatically block cookies. Detailed information on this subject is provided in the help or documentation of the specific web browser used by the User. The User can decide not to receive Cookies by changing browser settings. However, disabling Cookies necessary for authentication, security or remembering User preferences may impact user experience, or even make the Website unusable.

5. Additional information

External links may be placed on the Website enabling Users to directly reach other website. Also, while using the Website, cookies may also be placed on the User’s device from other entities, in particular from third parties such as Google, in order to enable the use the functionalities of the Website integrated with these third parties. Each of such providers sets out the rules for the use of cookies in their privacy policy, so for security reasons we recommend that you read the privacy policy document before using these pages. We reserve the right to change this privacy policy at any time by publishing an updated version on our Website. After making the change, the privacy policy will be published on the page with a new date. For more information on the conditions of providing services, in particular the rules of using the Website, contracting, as well as the conditions of accessing content and using the Website, please refer to the the Website’s Terms and Conditions.

Nexocode Team

Close

Want to unlock the full potential of Artificial Intelligence technology?

Download our ebook and learn how to drive AI adoption in your business.

GET EBOOK NOW