The Future of the Modern Data Stack | by Luhui Hu | Sep, 2022

How does the data platform evolve? And what’s the future of the modern data stack?


Big Data has been a red-hot field for over a decade, from unlocking business insights to feeding deep learning. However, the term lost its shine at two turning points: the merger of Cloudera and Hortonworks in 2019 and the rise of large deep learning models. We had set high hopes for big data, predicting that the next largest company would be a big data company. We were excited about Cloudera and Hortonworks as big data pioneers, but neither company's market cap ever reached $10 billion. They ended up merging, and the new Cloudera is now a private company. Meanwhile, the growing challenge of large models was pushing us toward data-centric AI.

In September 2020, the San Mateo-based company Snowflake went public in the largest software IPO in history. Its market capitalization once exceeded $120 billion, a phenomenal figure; it was the first time a data company other than Oracle had reached this milestone. The pursuit of big data was rekindled in our hearts.

Snowflake positions itself as a data cloud company. Along with this highly performant and scalable data cloud, a new era has emerged with a nascent ecosystem of data platforms: the modern data stack. So what is the modern data stack? Was there a traditional data stack? And what is the future of the modern data stack?

Before the advent of the modern data stack, the term “data stack” was never formal, nor did it sound respectable. The data platform has evolved through three stages in the era of big data, and with the modern data stack we are now at Data 3.0.

Data Platform Evolution (by author)

Data 1.0 was the era of Apache Hadoop, and a booming one. If you are familiar with the many data components in the Hadoop ecosystem, you will not be surprised by the abundance of tools and services in the modern data stack. Data 1.0 mainly focused on big data batch processing, though there were some early cloud solutions, such as Amazon EMR, Azure HDInsight, and Azure Databricks.

Data 2.0 was the era of real-time big data, typically represented by Spark and Flink. It supported both streaming and batch data processing, though often in separate systems. Spark and Flink can now each handle both modes, even though Spark started with batch processing while Flink began with stream processing. For streaming data collection and ingestion, Apache Kafka and Amazon Kinesis lead in the open source community and the cloud, respectively.
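The unification idea can be sketched in plain Python. This is an illustrative toy, not the Spark or Flink API: the same transformation logic runs unchanged over a finite batch and over an unbounded stream, which is what unified engines promise at scale.

```python
from typing import Iterable, Iterator

def enrich(records: Iterable[dict]) -> Iterator[dict]:
    """One transformation definition, shared by batch and streaming paths."""
    for r in records:
        if r["amount"] > 0:                       # drop refunds
            yield {**r, "amount_usd": r["amount"] * r["rate"]}

# Batch: a finite, fully materialized dataset.
batch = [{"amount": 10, "rate": 2}, {"amount": -5, "rate": 2}]
batch_out = list(enrich(batch))

# Streaming: the same logic over an (in principle unbounded) iterator.
def event_stream():
    yield {"amount": 3, "rate": 2}
    yield {"amount": 7, "rate": 2}

stream_out = list(enrich(event_stream()))
```

The field names and exchange rates are made up; the point is that one definition of `enrich` serves both execution modes, so no lambda architecture with duplicated logic is needed.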

The modern data stack is Data 3.0. It is the data cloud age. Undoubtedly, it should be cloud-native, easy to use, secure, and scalable. This new ecosystem is primarily driven by Snowflake, Databricks, and leading cloud providers (e.g., AWS, Azure, and GCP). The stack extensively supports cloud storage, data query, processing, ELT/ETL/reverse ETL, BI/Analytics, observability, discoverability, orchestration, governance, ML Ops, and more.

The modern data stack is a set of cloud-native, open data platforms and services, and cloud-native architecture and modularity are its most apparent features. So what are its unique characteristics? Four stand out.

Four characteristics of the modern data stack: cloud-native, unification of batch and streaming processing, data Lakehouse (data lake and data warehouse integration), and comprehensive data engineering.

  1. Cloud Modernization: First, the stack should be cloud-native, not merely cloud-based, cloud-managed, or cloud-hosted. For instance, Amazon EMR, Azure HDInsight, and Azure Databricks are cloud-based or cloud-managed but not cloud-native, which limits their impact on the modern data stack. Snowflake employs a cloud-native architecture that decouples compute and storage, which lets it scale computing and storage independently and improves service performance and reliability. Amazon Redshift launched earlier but used a cluster-hosted architecture like EMR, and it unfortunately fell far behind Snowflake in both performance and revenue.
  2. Unification of Batch and Streaming Processing: Batch and streaming processing have been the main thread of data platform evolution over the years. Before the current Spark and Flink arrived, we were thrilled by Hadoop for first handling massive data, Spark for accelerating MapReduce, and Storm for pioneering stream processing. It is now time to unify batch and streaming processing with Spark or Flink, without two separate systems or a lambda architecture. Given the speed and complexity of data tasks, data transformers (aka ELT tools) have become handy tools in the ecosystem.
  3. Consolidation of Data Lake and Data Warehouse: The data lake debuted to support unstructured, semi-structured, and structured data. It usually employs object storage with in-place queries, but those queries are slow. The cloud data warehouse excels at query performance for structured data. Consolidating the two is a compelling strategy: the term data lakehouse was coined for an integrated data cloud architecture with data lake flexibility and data warehouse performance. Furthermore, the ETL process from data lake to data warehouse can be mitigated, and duplicate storage systems and data silos eliminated.
  4. Flexible and Comprehensive Data Engineering: The modern data stack is a flexible set of open data platforms and services, supporting comprehensive data engineering from data integration to storage, processing, BI/analytics, observability, governance, and more. For example: Fivetran and Airbyte for ETL, dbt for ELT, Census for reverse ETL, Snowflake and Databricks Delta Lake for storage and query, and Spark and Flink for processing. All of these are orchestrated together to cover the full range of data engineering tasks.
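To make the fourth characteristic concrete, here is a toy end-to-end pipeline in Python. Every function, table, and tiering rule below is an illustrative stand-in, not the real Fivetran, dbt, or Census API; each stage simply mimics the role that class of tool plays in the stack.

```python
# A toy end-to-end pipeline: extract -> transform -> load -> reverse ETL.
# Every name below is an illustrative stand-in, not a real-tool API.

def extract():
    """What a Fivetran/Airbyte-style connector does: pull raw source rows."""
    return [{"user": "a", "spend": 40}, {"user": "b", "spend": 60}]

def transform(rows):
    """What a dbt-style model does: derive analytics-ready columns."""
    return [{**r, "tier": "high" if r["spend"] >= 50 else "low"} for r in rows]

warehouse = {}

def load(rows):
    """Land the modeled data in the 'warehouse' (here, just a dict)."""
    warehouse["user_tiers"] = rows

def reverse_etl():
    """What a Census-style sync does: push results back to operational tools."""
    return {r["user"]: r["tier"] for r in warehouse["user_tiers"]}

load(transform(extract()))
synced = reverse_etl()
```

The orchestration layer (Airflow, Dagster, and the like) is what sequences these stages in production; here the plain function composition stands in for it.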

The modern data stack has made a giant leap forward as Data 3.0, but it is still nascent. There are many challenges and opportunities. The top three challenges are as follows.

  1. Complicated: The modern data stack is comprehensive, flexible, and open, but it is still too complex to choose from or integrate due to differing languages, protocols, regulations, and even infrastructure. This recalls the Hadoop ecosystem, where dozens of tools existed for various or overlapping purposes and we were overwhelmed by continuous package upgrades and interface integrations.
  2. Far Behind for Data Science: The modern data stack currently focuses on data engineering, and the tools for data science lag far behind. Data is the core of AI/ML, but feature engineering and feature/parameter/model management still live in a separate domain. For example, can we merge BI data marts with AI-specific feature stores? Can we have a single, unified catalog, lineage, and observability layer for data, features, parameters, models, signals, and hyperparameters? There are further disconnects between data engineering and data science. Fortunately, data-centric AI engineering is emerging and will address the data lifecycle and extend to AI/ML.
  3. Giant Cloud Silos: The modern data stack resides in the cloud, and it is hard to avoid cloud silos across disparate cloud providers. We may be moving from the issue of organizational data silos to the problem of giant cloud silos, which involve processes, SLAs, and compliance requirements beyond data itself. But this also opens up opportunities for multi-cloud metadata and orchestration.

With the above features and challenges in mind, let’s look into the future of the modern data stack. What is Data 3.5 or Data 4.0?

There are seven meaningful and exciting areas for the future: holistic analytics, focus on value, multi-cloud virtualization, open data platform, open source strategy, speed is king, and rising of SQL.

The Future of the Modern Data Stack (by author)

Holistic Data Analytics (HDA)

Data analytics is not just BI. With the current data processing and machine learning capabilities, we can move from BI or traditional analytics to advanced holistic data analytics.

We can integrate business intelligence and intelligent analytics to enable holistic analytics, including descriptive, diagnostic, predictive, and prescriptive analytics.

Can we use past and current data to generate future insights without explicit training or serving? ML analytics (predictive and prescriptive) will be the trend for data analytics and platforms.
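The four analytics levels can be illustrated with a tiny Python sketch. The revenue numbers and the decision rules are purely hypothetical, and the "prediction" is a naive extrapolation rather than a trained model; the point is that all four levels can live in one pipeline.

```python
from statistics import mean

revenue = [100, 110, 125, 135]             # illustrative monthly revenue

# Descriptive: what happened.
avg = mean(revenue)

# Diagnostic: month-over-month deltas hint at why the numbers moved.
deltas = [b - a for a, b in zip(revenue, revenue[1:])]

# Predictive: a naive linear extrapolation from the average delta.
forecast = revenue[-1] + mean(deltas)

# Prescriptive: a trivial rule acting on the prediction.
action = "scale up" if forecast > revenue[-1] else "hold"
```

A real platform would replace the extrapolation with a trained model and the rule with an optimization step, but the descriptive-to-prescriptive progression is the same.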

Focus on Value with Engineering

Unlocking business value from data is always the primary goal of data platforms. The journey from data to business value often involves multiple people or teams. Should we provide different tools for each team? The answer should be no.

There are three practical ways to accelerate data business value:

  1. Close the data flow with an end-to-end loop to maximize business value. Although the industry has been aware of this approach for several years, there is no straightforward solution. It requires connecting all relevant points and flowing high-quality data through every node in the pipeline.
  2. Integrate data stacks and unify data to simplify processes and operations. This can effectively reduce overall cost and improve productivity.
  3. Provide low-code/no-code tools to democratize data. We moved the data stack and platform to the cloud and enabled Data as a Service, but for most business users this may not be enough. Low-code/no-code is the solution and can provide a painless, out-of-the-box experience for them.

Multi-Cloud Virtualization

The cloud has significantly enhanced the scalability and reliability of the modern data stack. But there are, and will continue to be, multiple cloud providers, and unfortunately they lack direct connections such as shared domain layers or secure tunnels.

In addition, some data must be stored locally or regionally due to specific regulations, and more and more enterprises are adopting a multi-cloud strategy. Across these large providers, public and private clouds need to be federated efficiently and securely.

This will form a connected virtual cloud layer on top of multiple public and private clouds: in effect, virtualizing the already-virtualized clouds. Two inspiring frontiers exist for this dual virtualization:

  1. A consensus meta platform for multiple clouds. In this approach, data can be shared and used without being moved. The meta platform layer contains the consensus functions of governance, observability, and discoverability.
  2. Multi-cloud resource virtualization with unified orchestration. This will be more challenging than managing resources across VPCs, which is already feasible within a single cloud provider. Data storage may be relatively simple to distribute and retrieve via a few proxy interfaces, and we can then run queries on top of them.
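The meta-platform idea can be sketched as a routing catalog in Python. The backend classes and dataset names below are hypothetical stand-ins, not real cloud SDKs; the point is that a thin consensus layer can route reads to whichever cloud holds a dataset, without copying data across clouds.

```python
# Hypothetical stand-ins for per-cloud storage backends (not real SDKs).
class S3Backend:
    def get(self, name):
        return {"source": "aws", "dataset": name}

class GCSBackend:
    def get(self, name):
        return {"source": "gcp", "dataset": name}

class MultiCloudCatalog:
    """A thin meta layer that routes logical dataset names to clouds."""
    def __init__(self):
        self.routes = {}                    # dataset name -> backend

    def register(self, name, backend):
        self.routes[name] = backend

    def read(self, name):
        # Data is read where it lives; nothing moves across clouds.
        return self.routes[name].get(name)

catalog = MultiCloudCatalog()
catalog.register("clickstream", S3Backend())
catalog.register("billing", GCSBackend())
result = catalog.read("billing")            # served from GCP in place
```

Governance, observability, and discoverability would hang off the same catalog layer, since it already knows where every dataset lives.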

Flexible but Cohesive Open Platform

The future of the data stack will be an open platform for easy integration, secure sharing, low latency, high reliability, and consistent governance.

For example, it should take no extra effort to flow data from source to storage via ETL, transform it in place via ELT, and feed results back via reverse ETL. The platform should dramatically enhance data quality and engineering productivity through observability and discoverability.

Data lineage, semantics, statistics, metrics, and other knowledge can then become first-class citizens. The platform can also extend from ML analytics to AI engineering and machine learning, such as knowledge graphs. So the future stack will be flexible yet cohesive open platforms.

Open Source Strategy

Big data and data platforms tend to start from open source and move to the cloud later. Given the nature of open source and the limitations of the cloud, the future of the modern data stack should embrace both for business success and user engagement. Startups with open source strategies are now highly attractive to venture capital; for example, TDengine from TAOS Data has grown exponentially through open source since its inception. There are many more open source success stories in the modern data stack community, such as Databricks, Starburst, and Dremio.

Speed is King

Over the past few decades, we have significantly improved the performance of data processing and querying. In the modern data stack, speed is still king: it is not just about user experience but about decision speed and cost. We preferred Spark over Hadoop because of its performance, and it is the same story for Snowflake over Redshift. With the growing volume and complexity of data, a new breakthrough in velocity will be the next milestone in the modern data stack; Firebolt, for example, has risen rapidly on the promise of higher speed.

Rising of SQL

SQL stems from data management and databases. Its elegant simplicity and widely used standard make it the most common language in the modern data stack, and more and more data services and platforms are starting to support it. Querying and analyzing streaming data with SQL is no longer new, and a few startups use SQL to retrieve instant predictive analytics results. We can expect increasing use of SQL in data engineering and of Python in AI engineering.
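The appeal is that the same declarative query text works across engines. As a minimal runnable sketch, here is an analytics-style aggregation using Python's built-in sqlite3; the table and values are made up, but a cloud warehouse or streaming SQL engine would accept essentially the same query with only dialect tweaks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10), ("a", 5), ("b", 20)])

# The same declarative SQL would run on SQLite, a cloud warehouse,
# or a streaming SQL engine with only minor dialect changes.
rows = conn.execute(
    "SELECT user, SUM(amount) AS total FROM events "
    "GROUP BY user ORDER BY total DESC"
).fetchall()
```

The engine decides how to execute the aggregation; the analyst only states what result is wanted, which is why SQL travels so well across the stack.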

We can see many opportunities in the future of data platforms. The seven areas above arise from the core perspectives of business value, data infrastructure, user experience, and team collaboration.

The modern data stack emerged after the business success of Snowflake. It is proliferating but still in its early stage, full of challenges and opportunities for working with data and AI. The modern data stack is Data 3.0.

The evolution of data platforms will never stop. We can foresee seven changes: advanced holistic analytics, an always value-driven focus, multi-cloud virtualization, open data platforms, open source strategies, speed still as king, and the rise of SQL. The future of the modern data stack will be even more exciting, from infrastructure to user experience, performance, and beyond.
