Stream Processing Frameworks Compared: Top Tools for Processing Data Streams

Stream Processing Frameworks Compared: Top Tools for Processing Data Streams

Wojciech Marusarz - November 4, 2022

When it comes to stream processing, there are a lot of different frameworks to choose from. Each framework has its own unique set of features and benefits. This blog post will compare a selection of the most popular stream processing frameworks available today. We will start by defining stream processing and explaining how it works. Then, we will list solutions for stream processing from the Apache Foundation and the biggest cloud providers. Finally, we will give you some tips on how to choose a stream processor for your application.

What Is Stream Processing?

Stream processing means collecting and processing data in real-time as it is generated or near-real time when it is required for a particular use case, which is vital in circumstances where any delays would lead to negative results. It is one method of managing the ever-increasing volumes of data that are being generated nowadays.

The technology reads and processes a data stream continually from input sources, writes the results to an output stream, and can use multiple threads to enable parallelism. Stream processing can therefore support many applications that require real-time data analysis and decision-making, such as generating reports or triggering responses with minimal latency.

Some tasks that this complex event processing method is commonly used for are loan risk analysis, anti-fraud detection, sensor data monitoring, and target marketing.

Stream Processing Engines – How They Work

A stream processing framework will ingest streaming data from input sources, analyze it, and write the results to output streams. The processor will typically have the following four main components:

  • Input sources – where data is read from (examples include Kafka, Flume, Social Media, and IoT Sensors).
  • Output streams – where the processed data is written to (e.g., HDFS, Cassandra, and HBase).
  • Processing logic – defines how the data is processed (this can be done with Java, Scala, or Python code).
  • State management – allows the processor to keep track of its progress and maintain state information, which can be further used for exactly-once processing (i.e., when the same output is generated regardless of how many times the input stream is read).

Stream processing engine components

Stream processing engine components

The stream processing engine organizes data from the input source into short batches and presents them as continuous data streams output to other applications, simplifying the logic for developers who (re)combine data from different sources and time scales – which are all relative when it comes to real-time analysis.

The processing logic component is where most of the work is done, simplifying the necessary tasks in data management for consistently and securely ingesting, processing, and publishing data. This stage is where you define the transformations that are applied to the data as it is consumed from a publish-subscribe service before it is published back there or to other data storage.

Examples of processes here could be to analyze, filter, combine, transform, or clean the data. For example, you might want to extract certain fields from the data, perform some aggregations, or join different streams together.

State management is important in stream processing because, unlike batch methods, data is processed as it arrives, meaning that the processing framework needs to keep track of its progress. To provide exactly-once processing, the framework needs to store state information somewhere – typically a key-value store – where it can be restored from if necessary.

For example, if the stream processor crashes, it can be restarted from the last checkpoint and will then pick up where it left off. Likewise, if the input stream is replayed, the output stream will be generated correctly, even though the data has already been processed once.

Tools for Processing Streaming Data from Apache Software Foundation

There are various data streaming tools from Apache available.

Apache Kafka Streams

A distributed data streaming platform that comes packaged with Apache Kafka. This Java API can be used to build real-time streaming data pipelines and applications, as well as filter, join, aggregate, and group without any coding.

Kafka Streams has a low entry barrier – you can get started with just a few lines of code to write and deploy basic Java or Scala applications. It is also easy to scale and integrate with other services (yet it doesn’t have to be kept running with a particular cluster manager), is fault-tolerant, and supports exactly-once processing semantics.

Advantages:

  • easy integration with other applications (without requiring multiple)
  • low-latency
  • replaces the need to have standard message brokers

Disadvantages:

  • missing point-to-point queuing or other essential messaging paradigms
  • lacking streaming data analytics features
  • struggles with Kafka Cluster queues

Used by:

  • Zalando
  • Pinterest
  • Uber
  • TransferWise

Apache Spark

Apache Spark is a general-purpose cluster computing system that is suitable for large-scale stream data processing. It can be used for stream processing, batch processing, or interactive queries. Spark has a wide range of applications, including graph processing, machine learning, and SQL.

Spark Streaming is the module of Apache Spark for streaming the processing of data in real-time from various sources such as Kafka, Flume, and Twitter. This module integrates with other Spark libraries to provide a complete set of functionality for building streaming applications.

Advantages:

  • fault-tolerant
  • advanced streaming data analytics
  • multiple languages supported
  • fast performance
  • simple batch processing

Disadvantages:

  • difficult to learn
  • high memory usage
  • lack of built-in caching algorithm

Used by:

  • Uber
  • Shopify
  • Slack

Flink is a stream processor that can be used for batch and stream processing to compute unbounded or bounded data streams from many sources. It has a low-latency processing engine, supports event-time processing, and can handle out-of-order data.

Written in Java and Scala, Flink also includes libraries for SQL, machine learning, and graph processing. It can be deployed in standalone mode, on YARN, or Mesos, plus a streaming connector for Kafka.

Advantages:

  • high throughput with low latency
  • easily understandable UI
  • can analyze and optimize tasks in a dynamic way

Disadvantages:

  • difficult to integrate with YARN
  • only supports Java and Scala (besides experimental Python API)

Used by:

  • Gympass
  • Lime

Apache Samza

A stream processing framework that can process data in real-time from multiple sources, including Apache Kafka, which Samza was developed in conjunction with. It is written in Java and Scala, uses Apache YARN for resource management, and provides exactly-once processing semantics.

Advantages:

  • fault tolerance
  • exactly-once processing
  • stateful processing
  • pluggable architecture allowing for integration with many different systems
  • continuous computation and output result in sub-second response times

Disadvantages:

  • not easy to use if either of Kafka and YARN aren’t in your processing data pipeline
  • advanced streaming features are lacking (e.g., Watermarks, Sessions, and triggers)

Used by:

  • LinkedIn

Apache Hive

Apache Hive is a data warehousing system that runs on top of a Hadoop cluster and can be used for batch or stream processing or interactive queries. It supports SQL-like queries and can be used to process structured or semi-structured data.

Hive Streaming is a feature of Apache Hive that allows real-time data processing by streaming data into Hive from various sources, such as Apache Kafka. The same SQL-like queries as batch processing are then used to process streaming data.

Advantages:

  • familiar SQL interface
  • scalable and efficient
  • can be used for batch, streaming, or interactive queries

Disadvantages:

  • poor performance for interactive queries
  • complicated to update data – due to HDFS, can overwrite partitions or add records
  • high latency and slow

Used by:

  • Facebook (initial development)
  • Netflix
  • the Financial Industry Regulatory Authority (FINRA)

Apache Storm

Apache Storm is a distributed stream processing system that is written in Clojure and Java, using Apache Zookeeper for coordination and allowing for batch, distributed streaming data processing.

An application in Storm is designed as a “topology” acting as a data transformation pipeline in a directed acyclic graph (DAG) shape with custom “spouts” and “bolts” for vertices defining information sources and manipulations, and “streams” for edges directing data between nodes.

Advantages:

  • simple API
  • can process millions of records per second
  • flexible and extensible

Disadvantages:

  • no guaranteed message processing
  • lack of built-in windowing or state management

Used by:

  • Twitter (acquisition)

Apache Apex

Apache Apex is a native YARN stream framework for processing high-velocity and high-volume data streams. It is written in Java and can be deployed on Apache Hadoop YARN clusters.

Advantages:

  • scalable and performant
  • high flexibility
  • support for multiple data sources and sinks
  • minimal API for easy development

Disadvantages:

  • database and platform lock
  • debugging, customization, and version control are difficult
  • uncached page-embedded Javascript and uploaded images

Apache Flume

Apache Flume has a simple and flexible architecture based on streaming data flows for the efficient collection, aggregation, movement, and to process massive amounts of log data.

Advantages:

  • distributed, reliable, and available software
  • robust and fault tolerant
  • tunable reliability mechanisms
  • failover and recovery mechanisms
  • simple extensible model for online data analytics applications

Disadvantages:

  • weak ordering guarantee
  • duplicacy
  • poor scalability
  • unreliable
  • topology is complex

Used by:

  • Blue Cross Blue Shield Association

Fully-Managed Services for Stream Processing from Cloud Providers

Google Cloud Dataflow

A processing platform for developing and executing streaming data processing pipelines in a simple, serverless approach allowing developers to focus on the programming languages. Google Cloud Dataflow has accuracy control for streaming and batch data plus the Apache Beam SDK for MapReduce operations, with AI-powered processing in real-time.

Advantages:

  • infinite capacity for managing workloads
  • reduces operational complexities
  • low latency
  • native integrations with BigQuery and AI Platform
  • highly-accessible stream data analytics

Disadvantages:

  • limited to the service of Cloud Datastore only
  • costly to use DataFlow/BigQuery in streaming mode
  • custom sources incompatible with Google CDN
  • experimental big data processing tasks aren’t suitable

Used by:

  • Spotify
  • NY Times

Amazon Kinesis Data Streams

A fully managed, durable service to ingest, process, and analyze real-time streaming data from multiple sources, including event streams, social media feeds, applications, and their logs. Amazon Kinesis is ideal for building real-time applications (e.g., for fraud detection or behavioral analysis) that require fast decision-making.

Advantages:

  • simple setup and maintenance
  • handles any streaming data volume
  • integration with Amazon’s big data toolset

Disadvantages:

  • per hour, per shard pricing of commercial cloud service
  • complicated documentation
  • lack of direct streaming support

Used by:

  • Deliveroo
  • Lyft

Azure Stream Analytics

A cloud stream processing service that analyzes high volumes of data streaming from multiple connected input and output devices and sensors to derive business insights in near real-time. Azure Stream Analytics is fully managed and serverless, making it easy to set up, use, and scale.

Advantages:

  • low-cost and highly available
  • integrated with Azure IoT Hub
  • support for filtering, aggregating, and joining streaming data

Disadvantages:

  • no support for streaming from on-premises data sources or directly to Azure Blob Storage
  • limited query language support

Used by:

  • Renault-Nissan-Mitsubishi Alliance
  • Volkswagen Group

How to Choose a Stream Processor for Your Application?

With so many stream processors on the market, it can be tough to decide which one is right for your business and specific use case. However, there are some must-have features that any stream processing tool should have:

  • data ingestion with a message broker supported – for an event rate greater than that of a solitary stream processor node
  • streaming SQL – for faster development times and easy maintenance
  • stream processing API and query writing environment – for improved productivity
  • if a system needs a throughput of less than 50K events/second, you could have major savings with a two-node High Availability (HA) deployment
  • reliability and high availability (HA) – for recovery from failures with minimal interruption

Big data architecture based on Kafka, Hadoop, Spark and other frameworks and DBs

Big data architecture based on Kafka, Hadoop, Spark and other frameworks and DBs

You’ll often need a combination of tools to get the job done. A stream data platform for big data that will address the features mentioned above can rely on solutions like Kafka together with other frameworks - Hadoop, Spark, Flink, and Hive. If you’re relying on managed cloud services from your cloud providers, you can use stream processing solutions that are fully integrated with their ecosystems.

Not sure where to start? You can always try out a few stream processing frameworks in a test environment to see which one works best for your needs. Or how about consulting with nexocode data scientists and engineers?

About the author

Wojciech Marusarz

Wojciech Marusarz

Software Engineer

Linkedin profile Twitter Github profile

Wojciech enjoys working with small teams where the quality of the code and the project's direction are essential. In the long run, this allows him to have a broad understanding of the subject, develop personally and look for challenges. He deals with programming in Java and Kotlin. Additionally, Wojciech is interested in Big Data tools, making him a perfect candidate for various Data-Intensive Application implementations.

Would you like to discuss AI opportunities in your business?

Let us know and Mateusz will arrange a call with our experts.

Thanks for the message!

We'll do our best to get back to you
as soon as possible.

This article is a part of

Becoming AI Driven
29 articles

Becoming AI Driven

Artificial Intelligence solutions are becoming the next competitive edge for many companies within various industries. How do you know if your company should invest time into emerging tech? How to discover and benefit from AI opportunities? How to run AI projects?

Follow our article series to learn how to get on a path towards AI adoption. Join us as we explore the benefits and challenges that come with AI implementation and guide business leaders in creating AI-based companies.

check it out

Becoming AI Driven

Insights on practical AI applications just one click away

Sign up for our newsletter and don't miss out on the latest insights, trends and innovations from this sector.

Done!

Thanks for joining the newsletter

Check your inbox for the confirmation email & enjoy the read!

Find us on

Need help with implementing AI in your business?

Let's talk blue circle

This site uses cookies for analytical purposes.

Accept Privacy Policy

In the interests of your safety and to implement the principle of lawful, reliable and transparent processing of your personal data when using our services, we developed this document called the Privacy Policy. This document regulates the processing and protection of Users’ personal data in connection with their use of the Website and has been prepared by Nexocode.

To ensure the protection of Users' personal data, Nexocode applies appropriate organizational and technical solutions to prevent privacy breaches. Nexocode implements measures to ensure security at the level which ensures compliance with applicable Polish and European laws such as:

  1. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (published in the Official Journal of the European Union L 119, p 1); Act of 10 May 2018 on personal data protection (published in the Journal of Laws of 2018, item 1000);
  2. Act of 18 July 2002 on providing services by electronic means;
  3. Telecommunications Law of 16 July 2004.

The Website is secured by the SSL protocol, which provides secure data transmission on the Internet.

1. Definitions

  1. User – a person that uses the Website, i.e. a natural person with full legal capacity, a legal person, or an organizational unit which is not a legal person to which specific provisions grant legal capacity.
  2. Nexocode – NEXOCODE sp. z o.o. with its registered office in Kraków, ul. Wadowicka 7, 30-347 Kraków, entered into the Register of Entrepreneurs of the National Court Register kept by the District Court for Kraków-Śródmieście in Kraków, 11th Commercial Department of the National Court Register, under the KRS number: 0000686992, NIP: 6762533324.
  3. Website – website run by Nexocode, at the URL: nexocode.com whose content is available to authorized persons.
  4. Cookies – small files saved by the server on the User's computer, which the server can read when when the website is accessed from the computer.
  5. SSL protocol – a special standard for transmitting data on the Internet which unlike ordinary methods of data transmission encrypts data transmission.
  6. System log – the information that the User's computer transmits to the server which may contain various data (e.g. the user’s IP number), allowing to determine the approximate location where the connection came from.
  7. IP address – individual number which is usually assigned to every computer connected to the Internet. The IP number can be permanently associated with the computer (static) or assigned to a given connection (dynamic).
  8. GDPR – Regulation 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of individuals regarding the processing of personal data and onthe free transmission of such data, repealing Directive 95/46 / EC (General Data Protection Regulation).
  9. Personal data – information about an identified or identifiable natural person ("data subject"). An identifiable natural person is a person who can be directly or indirectly identified, in particular on the basis of identifiers such as name, identification number, location data, online identifiers or one or more specific factors determining the physical, physiological, genetic, mental, economic, cultural or social identity of a natural person.
  10. Processing – any operations performed on personal data, such as collecting, recording, storing, developing, modifying, sharing, and deleting, especially when performed in IT systems.

2. Cookies

The Website is secured by the SSL protocol, which provides secure data transmission on the Internet. The Website, in accordance with art. 173 of the Telecommunications Act of 16 July 2004 of the Republic of Poland, uses Cookies, i.e. data, in particular text files, stored on the User's end device.
Cookies are used to:

  1. improve user experience and facilitate navigation on the site;
  2. help to identify returning Users who access the website using the device on which Cookies were saved;
  3. creating statistics which help to understand how the Users use websites, which allows to improve their structure and content;
  4. adjusting the content of the Website pages to specific User’s preferences and optimizing the websites website experience to the each User's individual needs.

Cookies usually contain the name of the website from which they originate, their storage time on the end device and a unique number. On our Website, we use the following types of Cookies:

  • "Session" – cookie files stored on the User's end device until the Uses logs out, leaves the website or turns off the web browser;
  • "Persistent" – cookie files stored on the User's end device for the time specified in the Cookie file parameters or until they are deleted by the User;
  • "Performance" – cookies used specifically for gathering data on how visitors use a website to measure the performance of a website;
  • "Strictly necessary" – essential for browsing the website and using its features, such as accessing secure areas of the site;
  • "Functional" – cookies enabling remembering the settings selected by the User and personalizing the User interface;
  • "First-party" – cookies stored by the Website;
  • "Third-party" – cookies derived from a website other than the Website;
  • "Facebook cookies" – You should read Facebook cookies policy: www.facebook.com
  • "Other Google cookies" – Refer to Google cookie policy: google.com

3. How System Logs work on the Website

User's activity on the Website, including the User’s Personal Data, is recorded in System Logs. The information collected in the Logs is processed primarily for purposes related to the provision of services, i.e. for the purposes of:

  • analytics – to improve the quality of services provided by us as part of the Website and adapt its functionalities to the needs of the Users. The legal basis for processing in this case is the legitimate interest of Nexocode consisting in analyzing Users' activities and their preferences;
  • fraud detection, identification and countering threats to stability and correct operation of the Website.

4. Cookie mechanism on the Website

Our site uses basic cookies that facilitate the use of its resources. Cookies contain useful information and are stored on the User's computer – our server can read them when connecting to this computer again. Most web browsers allow cookies to be stored on the User's end device by default. Each User can change their Cookie settings in the web browser settings menu: Google ChromeOpen the menu (click the three-dot icon in the upper right corner), Settings > Advanced. In the "Privacy and security" section, click the Content Settings button. In the "Cookies and site date" section you can change the following Cookie settings:

  • Deleting cookies,
  • Blocking cookies by default,
  • Default permission for cookies,
  • Saving Cookies and website data by default and clearing them when the browser is closed,
  • Specifying exceptions for Cookies for specific websites or domains

Internet Explorer 6.0 and 7.0
From the browser menu (upper right corner): Tools > Internet Options > Privacy, click the Sites button. Use the slider to set the desired level, confirm the change with the OK button.

Mozilla Firefox
browser menu: Tools > Options > Privacy and security. Activate the “Custom” field. From there, you can check a relevant field to decide whether or not to accept cookies.

Opera
Open the browser’s settings menu: Go to the Advanced section > Site Settings > Cookies and site data. From there, adjust the setting: Allow sites to save and read cookie data

Safari
In the Safari drop-down menu, select Preferences and click the Security icon.From there, select the desired security level in the "Accept cookies" area.

Disabling Cookies in your browser does not deprive you of access to the resources of the Website. Web browsers, by default, allow storing Cookies on the User's end device. Website Users can freely adjust cookie settings. The web browser allows you to delete cookies. It is also possible to automatically block cookies. Detailed information on this subject is provided in the help or documentation of the specific web browser used by the User. The User can decide not to receive Cookies by changing browser settings. However, disabling Cookies necessary for authentication, security or remembering User preferences may impact user experience, or even make the Website unusable.

5. Additional information

External links may be placed on the Website enabling Users to directly reach other website. Also, while using the Website, cookies may also be placed on the User’s device from other entities, in particular from third parties such as Google, in order to enable the use the functionalities of the Website integrated with these third parties. Each of such providers sets out the rules for the use of cookies in their privacy policy, so for security reasons we recommend that you read the privacy policy document before using these pages. We reserve the right to change this privacy policy at any time by publishing an updated version on our Website. After making the change, the privacy policy will be published on the page with a new date. For more information on the conditions of providing services, in particular the rules of using the Website, contracting, as well as the conditions of accessing content and using the Website, please refer to the the Website’s Terms and Conditions.

Nexocode Team

Close

Want to unlock the full potential of Artificial Intelligence technology?

Download our ebook and learn how to drive AI adoption in your business.

GET EBOOK NOW