We are in the era of Stream processing-as-a-service and for any data-driven organization, Stream-based computing has becoming the norm. In the last three parts https://bit.ly/2WgnILP, https://bit.ly/3a6ij2k, https://bit.ly/3gICm88, I had explored Lambda Architecture and its variants. In this article let’s discover Streaming in the big data. ‘Real-time analytics’, ‘Real-time data’ and ‘Streaming data’ has become mandatory in any big data platform. The aspiration to extend data analysis (predictive, descriptive, or otherwise) to streaming event data has been common across every enterprise and there is a growing interest to find real-time big data architectures. Kappa (K) Architecture is one that deals with streaming. Let’s see why Real-Time Analytics matter more than ever and mandates data streaming and how streaming architecture like Kappa works. Is Kappa an alternative to lambda?
“You and I are streaming data engines.” – Jeff Hawkins
Questioning Lambda
Lambda architecture fits very well in many real-time use cases, mainly in re-computing algorithms. At the same time, Lambda Architecture has the inherent development and operational complexities like all the algorithms must be implemented twice, once in the cold path, the batch layer, and another execution in the hot path or the real-time layer. Apart from this dual execution path, the Lambda Architecture has the inevitable issue of debugging. Because operating two distributed multi-node services is more complex than operating one.
Given the obvious discrepancies of Lambda Architecture, Jay Kreps, CEO of Confluent, co-creator of Apache Kafka started the discussion on the need for new architecture paradigm which uses less code resource and could perform well in certain enterprise scenarios. This gave rise to Kappa (K) Architecture. The real need Kappa Architecture isn’t about efficiency at all, but rather about allowing people to develop, test, debug, and operate their systems on top of a single processing framework. In fact, Kappa is not taken as competitor to LA on the contrary it is seen as an alternative.
What is Streaming & Streaming Architecture?
Modern business requirements necessitate a paradigm shift from traditional approach of batch processing to real-time data streams. Data-centric organizations mandate the Stream first approach. Real-time data streaming or Stream first approach means at the very moment. So real-time analytics, either On-demand real-time analytics or Continuous real-time analytics, is the capability to process data right at the moment it arrives in the system. There is no possibility of batch processing of data. Not to mention, it enhances the ability to make better decision making and performing meaningful action on a timely basis. At the right place and at the right time, real-time analytics combines and analyzes data. Thus, it generates value from disparate data.
Typically, most of the streaming architectures will have the following 3 components:
- an aggregator that gathers event streams and batch files from a variety of data sources,
- a broker that makes data available for consumption,
- an analytics engine that analyzes the data, correlates values and blends streams together.
Kappa (K) Architecture for Big Data era
Kappa (K) Architecture is one of the new software architecture patterns for the new Data era. It’s mainly used for processing streaming data. Kappa architecture gets the name Kappa from the Greek letter (K) and is attributed to Jay Kreps for introducing this architecture.
The main idea behind the Kappa Architecture is that both the real-time and batch processing can be carried out, especially for analytics, with a single technology stack. The data from IoT, streaming, and static/batch sources or near real-time sources like change data capture is ingested into messaging/ pub-sub platforms like Apache Kafka.
An append-only immutable log store is used in the Kappa Architecture as the canonical store. Following are the pub/sub or message buses or log databases that can be used for ingestion:
- Amazon Quantum Ledger Database (QLDB)
- Apache Kafka
- Apache Pulsar
- Amazon Kinesis
- Amazon DynamoDB Streams
- Azure Cosmos DB Change Feed
- Azure EventHub
- DistributedLog
- EventStore
- Chronicle Queue
- Pravega
Distributed Stream processing engines like Apache Spark, Apache Flink, etc. will read the data from the streaming platform and transform it into an analyzable format, and then store it into an analytics database in the serving layer. Following are some of the distributed streaming computation systems
- Amazon Kinesis
- Apache Flink
- Apache Samza
- Apache Spark
- Apache Storm
- Apache Beam
- Azure Stream Analytics
- Hazelcast Jet
- Kafka Streams
- Onyx
- Siddhi
In short, any query in the Kappa Architecture is defined by the following functional equation.
Query = λ (Complete data) = λ (live streaming data) * λ (Stored data)
The equation means that all the queries can be catered by applying Kappa function to the live streams of data at the speed layer. It also signifies that the stream processing occurs on the speed layer in Kappa architecture.
Pros and Cons of Kappa architecture
Pros
- Any architecture that is used to develop data systems that doesn’t need batch layer like online learning, real-time monitoring & alerting system, can use Kappa Architecture.
- If computations and analysis done in the batch and streaming layer are identical, then using Kappa is likely the best solution.
- Re-computations or re-iterations is required only when the code changes.
- It can be deployed with fixed memory.
- It can be used for horizontally scalable systems.
- Fewer resources are required as the machine learning is being done on the real-time basis.
Cons
Absence of batch layer might result in errors during data processing or while updating the database that requires having an exception manager to reprocess the data or reconciliation.
On finding the right architecture for any data driven organizations, a lot of considerations were taken in. Like most successful analytics project, which involves streaming first approach, the key is to start small in scope with well-defined deliverables, then iterate. The reason for considering distributed systems architecture (Generic Lambda or unified Lambda or Kappa) is due to minimized time to value.
About the author
January 7, 2021
Bargunan Somasundaram
“ Bargunan is a Big Data Engineer and a programming enthusiast. His passion is to share his knowledge by writing his experiences about them. He believes “Gaining knowledge is the first step to wisdom and sharing it is the first step to humanity.“