Is Enterprise Data Operating System (EDOS) a possible future evolution from the current Modern Data Stack (MDS)? — Part 1
Modern Data Stack, Enterprise Architecture and Evolution
Background
This post is part of a three-part series on understanding the evolution of the enterprise data stack over the years, and on making the case for a possible future evolution of this stack from the current state, loosely called the Modern Data Stack (MDS), to what I call the Enterprise Data Operating System (EDOS).
To do so, I have split my thoughts into a three-part series:
Part 1: This post traces the evolution of the data stack/architecture over the past two decades from a bird's-eye perspective: the issues each generation addressed and the problems it solved, the underlying design patterns that turned out to be winners, the lessons learned along the way, and the opportunities for improvement in the current MDS value proposition.
Part 2: Building on the opportunities for improvement and unresolved issues from Part 1, I lay out my proposal for the Enterprise Data Operating System (EDOS). That post will cover first design principles, including design considerations, core principles, and the cost-performance levers that should drive such an architecture forward, along with a proposed alternate enterprise data stack at a high level.
Part 3: A glimpse into lower-level design details, detailing the core ideas of EDOS. I wrap up with an architecture review comparing MDS and EDOS and their trade-offs.
Key Takeaways from this post
Part-1 (this post) covers the following agenda:
Historical Context & Evolution of data stacks leading up to the current state of Enterprise Data Architecture.
A bird's-eye view of the growth and rise of the Modern Data Stack (MDS).
Learnings, challenges & opportunities emerging from experience with Modern Data Stack.
Current State of Enterprise Data Architecture.
Issues that were solved, and those that still remain as opportunities for the future.
Evolution and the Story So Far
We have seen the enterprise data stack evolve every 3-5 years over the past two decades: from the emergence of big data stacks built around Hadoop in the late 2000s and early 2010s, to the movement from legacy data warehouses locked inside the enterprise to cloud data warehouses.
Starting with cloud data lakes: these made a fundamental design choice to store tabular and non-tabular data alike as files, much as a traditional file system does, while improving the file formats for tabular data to columnar formats similar to those used in traditional columnar warehouses. While this was a welcome change for various other workloads, it did not evolve enough to pick up the query workload of legacy data warehouses, whose primary use cases were traditional BI and ad-hoc analysis. Data lakes still lacked the metadata and other optimizations of traditional warehouse engines, which made SQL queries over large datasets less performant and ill-suited to ad-hoc workloads on larger tables.
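To make the columnar-format point concrete, here is a toy illustration (not any vendor's implementation) of why columnar layouts suit analytical scans: an aggregate over one column touches only that column's values, while a row layout forces reading every field of every record.

```python
# Row-oriented: each record stored together, as in a CSV or row-store page.
rows = [
    {"order_id": i, "region": "EU" if i % 2 else "US", "amount": float(i)}
    for i in range(1_000)
]

# Column-oriented: each column stored contiguously, as in Parquet/ORC.
columns = {
    "order_id": [r["order_id"] for r in rows],
    "region":   [r["region"] for r in rows],
    "amount":   [r["amount"] for r in rows],
}

# SELECT SUM(amount): the columnar layout reads 1 of 3 "files",
# while the row layout must touch every field of every record.
row_scan_total = sum(r["amount"] for r in rows)
col_scan_total = sum(columns["amount"])

assert row_scan_total == col_scan_total
```

Real columnar formats add compression, encoding, and per-file statistics on top of this layout, which is exactly the metadata the early lakes lacked for fast SQL.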
That led to the emergence of the enterprise cloud data warehouse (CDW), which separated SQL workloads from cloud data lakes, and we have been on that journey from 2012 until now. During this timeframe, we saw the emergence of various cloud warehouses, the popular ones being AWS Redshift, Google BigQuery, and Snowflake. Each of these engines innovated and competed on one axis or another to offer the best price-performance and the design choices they believed would work for enterprises.
However, one central theme emerged across these CDW engines: over the years, all of them made one key design choice, the separation of data storage from the query/engine layer. This separation lets enterprises leverage the infinite scalability of cloud storage and compute independently, starting and stopping cloud-based VMs dynamically as workloads need to be processed. Google's BigQuery took the approach of a serverless query engine working directly on top of object storage, with pricing based on per-query workload. Snowflake extended the central idea of separating storage from the SQL query engine but kept the simplicity of the UX layer that enterprise DW users were used to, while enabling them to leverage parallel workloads and the infinite scalability of the cloud by simplifying/automating the starting and stopping of engines based on workload needs.
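The storage/compute separation described above can be sketched in a few lines. This is a minimal conceptual model (all names hypothetical, no real vendor API): durable shared storage is always on, while stateless compute "virtual warehouses" are spun up per workload and suspended afterward.

```python
# Shared object storage: cheap, durable, always available.
object_store = {"sales/part-0": [("US", 120.0), ("EU", 80.0), ("US", 40.0)]}

class VirtualWarehouse:
    """Stateless compute that reads shared storage; billed only while alive."""
    def __init__(self, store):
        self.store = store
        self.running = True          # "spun up" on demand

    def query_total_by_region(self, key):
        totals = {}
        for region, amount in self.store[key]:
            totals[region] = totals.get(region, 0.0) + amount
        return totals

    def suspend(self):
        self.running = False         # compute stops; data survives in storage

vw = VirtualWarehouse(object_store)           # start compute for a workload
result = vw.query_total_by_region("sales/part-0")
vw.suspend()                                  # stop paying for compute

assert result == {"US": 160.0, "EU": 80.0}
assert "sales/part-0" in object_store         # storage outlives compute
```

Because the warehouse holds no state of its own, any number of such engines can be started against the same storage, which is what makes concurrency and elasticity cheap in this design.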
These cloud EDWs promised the most optimal blend of architectural scalability, UX simplicity (relative to what enterprises were used to), and price-performance compared to legacy enterprise DWs and the earlier cloud data lake innovations. They gained momentum and popularity both through vendor push and through the positive reception generated by migrations from traditional to cloud-based DWs, further boosted and accelerated by the pandemic toward the latter part of the last decade. Lower entry barriers, almost zero data management, infinite scalability, and ease of use accelerated this further.
Tools from the many cloud-based EL vendors, along with dbt from dbt Labs, lowered the entry barrier further by laying rails to bring data into these warehouses from various SaaS applications, and by enabling SQL-based data modelers (now popularly referred to as analytics engineers) to transform data right within the warehouse.
Subsequently, various data applications such as data catalogs and data observability tools started to emerge on top of the cloud data warehouses, addressing gaps in the base stack. This collection is now popularly clubbed under the umbrella term Modern Data Stack (MDS).
Meanwhile, in the parallel world of the data lake, Hadoop stacks started to become obsolete because of the skill level required to use them. They were replaced by a simpler architecture in which Apache Spark became largely the only performant engine for larger non-SQL (or less performant SQL) workloads, primarily ETL rather than ad-hoc query workloads, often running directly on data lakes built on object storage with columnar file formats. Data lakes + Spark continued to grow across a variety of non-SQL workloads: as a landing area for data such as event streams, for large-scale and differentiated transformations (typically harder to achieve with SQL alone), and for data science/machine learning workflows.
Various applications for ML workloads evolved; however, for the most part the workloads lived as notebooks written against client libraries around Apache Spark, reading from and writing to the native data lake. Applications built on this layer were mostly productivity enablers for data scientists and ETL engineers.
In another parallel world, that of data science/ML modeling, Python with its rich set of libraries remained the popular programming language for smaller datasets, with Pandas/Scikit-learn lowering the bar for a data scientist to discover and model. While Python's ease of use brought it more users, the lack of distributed/scalable engines within its native ecosystem meant engineers had to move workloads back and forth between native Python libraries and Apache Spark or other engines.
To address this gap, various Python-native distributed compute engines for data science and feature engineering workloads have started to emerge, promising to scale workloads natively while keeping the simplicity and popularity of pandas/scikit-learn and the other DS libraries that data scientists and engineers are used to. Engines continuing on that path include Dask, Ray, and Modin.
Appreciating the popularity of the pandas ecosystem, existing cloud data lake and cloud data warehouse vendors started releasing wrapper libraries that let data scientists and engineers keep the simplicity of the pandas interface while transferring the workload to scalable cloud engines.
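The "same interface, different engine" idea behind these wrappers can be sketched conceptually. Modin, for instance, documents a one-line swap (`import modin.pandas as pd`); the `Frame` class and both backends below are hypothetical stand-ins, with a sharded backend playing the role of a distributed engine.

```python
class Frame:
    """A pandas-like facade whose groupby-sum can be served by any backend."""
    def __init__(self, records, backend):
        self.records = records
        self.backend = backend       # local or "distributed" execution

    def groupby_sum(self, key, value):
        return self.backend(self.records, key, value)

def local_backend(records, key, value):
    # Single-process aggregation, like plain pandas.
    out = {}
    for r in records:
        out[r[key]] = out.get(r[key], 0.0) + r[value]
    return out

def sharded_backend(records, key, value, shards=4):
    # Stand-in for a distributed engine: partition, aggregate, then merge.
    partials = [local_backend(records[i::shards], key, value)
                for i in range(shards)]
    out = {}
    for p in partials:
        for k, v in p.items():
            out[k] = out.get(k, 0.0) + v
    return out

data = [{"region": "US", "amount": 10.0}, {"region": "EU", "amount": 5.0},
        {"region": "US", "amount": 2.5}]

# Same user-facing call, same answer, different execution engine underneath.
assert Frame(data, local_backend).groupby_sum("region", "amount") == \
       Frame(data, sharded_backend).groupby_sum("region", "amount")
```

The user code never changes; only the engine behind the familiar interface does, which is precisely the promise these wrapper libraries make.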
Quick Review of Modern Data Stack (MDS)
So what is the Modern Data Stack (MDS), really, and why is there so much interest in it?
There are various definitions, and they keep evolving. But as the term suggests, it was put together to be a “modern replacement” for the on-prem legacy “custom ETL pipelines + data warehouse + BI” stack. It clearly extends that story while retaining a few core characteristics of the original stack.
a) It continues to use SQL as the user interface for all the core functionality: data storage and management (DDL), data authentication/authorization (DCL), data manipulation/transformation (DML), and data analytics/expressions (DEL), centered around an MPP data warehouse.
but adds the following in the new incarnation
a) It is cloud native.
b) Separation of data storage and compute, with multiple virtual warehouses instead of a single warehouse, enables infinite scalability and concurrency.
c) It follows the EL + T paradigm for the most part, rather than the ETL of the previous generation. T continues to be in-warehouse SQL-based models, with improvements in how we build, version, and manage often-large models.
Note: There is much more to write about the separate 'T' and the tradeoffs thereof but that would be another post/another topic.
d) A loose collection of applications glorified as best-in-class (backed by VC marketing dollars) evolved around the SQL-based MPP warehouse, with data ingestion and data visualization being the top two, as in the block diagram above.
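The EL + T pattern in point (c) can be sketched end to end: raw data is landed first (EL), and the transformation (T) then runs as SQL inside the warehouse, dbt-style. Here sqlite3 stands in for the cloud warehouse, and the table and model names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# E + L: land raw, untransformed records straight into the warehouse.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "US", 120.0), (2, "EU", 80.0), (3, "US", 40.0)],
)

# T: a SQL "model" materialized inside the warehouse, as dbt would do it.
conn.execute("""
    CREATE TABLE revenue_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY region
""")

result = dict(conn.execute("SELECT region, revenue FROM revenue_by_region"))
assert result == {"EU": 80.0, "US": 160.0}
```

The key shift from ETL is that no transformation happens in flight; the warehouse's own engine does all the shaping, which is what lets SQL-only analytics engineers own the T step.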
MDS Growth
There has been explosive growth in adoption of the modern data stack over the past 3-4 years, more so during the two pandemic years. Multiple events contributed to the success of MDS; below are some of the key drivers as I see them.
Pandemic accelerated the move to cloud even among laggard organizations
There has also been increased digital investment, coupled with the need to modernize, which has led to increased adoption of SaaS apps.
Movement of legacy workloads off on-prem data warehouses accelerated significantly during the pandemic; modernization of these legacy warehouses was long overdue.
More and more data continued to be created, leaving legacy warehouse software staring at near obsolescence, in part due to its lack of dynamic scalability to cater to data growth, user growth, etc.
Organizations started rolling out data access to analytical/business users, even the less skilled ones.
But clearly the original concept of the MDS was not sufficient. There were glaring gaps, or evolution still pending, in the original vision as users started to build a story around the core MDS applications: data cataloging, data workflow orchestration, data observability, metric stores, feature stores, data governance and privacy, reverse ETL, real-time analytics, and more.
Turning the page into 2021, more and more apps arrived to join the MDS party, each promising to fill a gap or opportunity not addressed by the base idea.
Matt Turck published the Machine Learning, Analytics and Data Landscape, 2021, showcasing what he calls the “MAD” landscape, as shown below, and talks about it here. This landscape of tools, each solving one unit of the problem, now looks bigger than the big data landscape we shunned in 2012-2014 for the complexity it brought.
The obvious questions are:
Is MDS failing organizations in their bid to modernize and become data driven organizations?
Has it made it more complex and therefore more time consuming and requiring more effort?
How many of those applications/tools do I need to be truly a data driven organization?
How do organizations/CDOs/data teams calculate the price-performance of such a stack and explain its ROI to CFOs?
What are the newer problems emanating because of loosely coupled architecture of applications on cloud?
To many of those questions I would not have a clear “yes” or “no” or a specific number. I would say: it depends 😏
What stage of data maturity and evolution is your enterprise at today?
What are the internal skill sets available to manage such a stack? Are you hearing FinOps louder than DBA like I do?
Do you hear too often that while simplifying the access and onboarding of data and models, we have created a newer set of problems/challenges?
Emerging Issues with the MDS Architecture
Data Quality: Does MDS have a much bigger data quality problem than conceived? Have we reached a point of no return on garbage in, garbage out? Does ease of use and a lower skill requirement translate into a lack of planning, requirements, and design?
Data Privacy: Have enterprises lost control over data privacy in the MDS world? Given that regulatory requirements for customer data handling are becoming stricter, are these concerns becoming even more important in the data cloud?
Copies, Copies and Copies: Depending on the number of solutions you have, you end up making as many copies of data/metadata, if not more, to create the final value. While a few copies cannot be avoided, as the number of tools grows, data and metadata tend to spread across multiple tool silos, which leads to the risk of unenforceable governance policies and potential financial and regulatory exposure over time.
Pricing Model Concerns: Most Tier-1 SaaS vendors in the MDS ecosystem have more or less moved to a consumption-based pricing model from the subscription model of the legacy environment. While this lowers the barrier to adopting such tools, as consumption grows, so does the SaaS bill. This started with disk/compute on the cloud, but it is now pretty much the norm across newer data/SaaS applications. The consumption model has also led to anxiety about unpredictable upcoming bills as every organization becomes data driven and more data users are enabled on such systems.
TCO (Total Cost of Ownership) Concerns: While MDS makes it easy to run concurrent workloads by automatically spawning new compute resources, removing the limit has had the undesired effect of bloating bills to 2x-5x of the original data warehouse spend.
To be fair to the SAAS vendors, it was hard to grow in the existing accounts with the traditional subscription model and getting a price increase from enterprise on every upgrade to a newer version was not easy. So growth was capped.
However, the pricing pendulum has now swung to the other extreme: enterprises have to watch usage of such systems closely, often disallow junior engineers from using them, and often push projects/experiments with unclear ROI out of the production stack into parallel, cheaper alternatives.
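A back-of-the-envelope model (all prices hypothetical, chosen only for illustration) shows both halves of this dynamic: consumption pricing is far cheaper at low usage, then overtakes a flat subscription as usage grows.

```python
def consumption_bill(credits_used, price_per_credit=3.0):
    """Pay-per-use: the bill scales linearly with consumption."""
    return credits_used * price_per_credit

def subscription_bill(flat_fee=30_000.0):
    """Legacy-style flat license fee per period, independent of usage."""
    return flat_fee

# A small team starting out: consumption is far cheaper than a flat license.
assert consumption_bill(1_000) < subscription_bill()       # 3,000 vs 30,000

# Broad rollout: the same pricing model now dwarfs the old subscription.
assert consumption_bill(25_000) > 2 * subscription_bill()  # 75,000 vs 60,000

# Crossover: the usage level at which the two models cost the same.
crossover = subscription_bill() / 3.0
assert crossover == 10_000.0  # credits per period
```

The anxiety described above lives on the right side of that crossover: every newly enabled user moves the bill in a way no one budgeted for under the old flat-fee model.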
Multiple Stacks: MDS, with the data warehouse at its center, obviously didn't address the concerns of data science and AI, which continue to live in a parallel world of applications and the lake. This is further aggravated by cost concerns, with enterprises spinning up even more platforms as parallel, lower-cost experimentation tools.
Reduced Flexibility on Engine Choice: As with many things, the devil lies in the details. If you have a use case for a real-time dashboard alongside traditional BI, the MDS choice of data storage and engine can be a bottleneck in building such a dashboard, both because of the frequent-update needs, which are not without cost, and because of query latency, given the I/O choices many of these engines make.
State of Enterprise Data Architecture Today
A quick review of where we stand on enterprise architecture today. While this may represent the best possible typical architecture, in reality it is often a more complex combination of multiple legacy and newer generations of architectures and tools running in parallel.
A quick note on important learnings and the opportunities for improvement that emanate from them:
Compute Engines: MPP-architecture engines continue to be central to the cloud data warehouse. Simplicity of UX and infinite scalability at a reasonable price-performance seem to be the key drivers of cloud DW vendor choice. Other distributed general-purpose engines such as Apache Spark, Flink, Dask, Modin, and Ray live in the parallel world of the data lake.
Separation of Data and Compute: Separation of data storage and the SQL engine in the cloud DW, or of data storage and the compute engine in the enterprise data lake, is now a widely accepted design pattern for infinitely scalable systems.
Data Formats: Even though data and compute are technically separable, data often continues to reside in standalone cloud warehouses and enterprise data lakes with no interoperable, widely accepted standard format, leading to multiple proprietary formats across these individual tools.
Multiple Repositories and Copies: This emanates from the previous point: multiple proprietary storage formats exist despite SQL/processing engines technically being separable from storage, leading to multiple copies and multiple sources of truth.
Data Modeling Tools: Data modeling continues to live in the parallel worlds of the warehouse and the data lake, a common pattern over most of the last decade. The choice of modeling is often driven by the complexity of the use case, tool proficiency and/or limitations, and the skills available within the enterprise. There is little cross-leverage or collaboration between different modeling efforts in most cases; even where teams do collaborate, concerns about accuracy and lack of trust often lead to silos and duplication of effort. There is also little differentiation between long-term models and short-term ad-hoc models.
Lack of Embedded Data Discovery/Observability: The absence of default data discovery tools creates the need for yet more tools through which data users discover the data present in these repositories, impacting the seamless experience of data developers.
Data Governance & Cataloging: Lack of unified data governance/cataloging lends itself to broken data models, dashboards and downstream applications.
Emergence of Newer Stores: Newer stores emerge as data in the data warehouse often breaks and is not considered reliable for upstream use cases.
Too Many Tools and Added Complexity in Enterprise Architecture
This is the bigger one, close to my heart, and hence the title of this article. We have moved from the monolithic solutions of the old world to best-of-breed solutions for each individual point function in the MDS world. Proponents argue that this drives innovation within enterprises, increases choice, and gives enterprises the flexibility to pick what suits their requirements. But it has become a problem of plenty, since these best-of-breed point solutions together don't form the data operating system that organizations truly need.
Often many of these solutions evolve independently with a find-and-fix approach: identify a gap in the current MDS solutions and the problems users face at that moment in time. They solve the issue at hand in many cases, but add to an enterprise data architecture complexity that keeps growing. Governance and collaboration become a mess that is hard to fix. Progress on the user experience/presentation layer is hindered because data users must move across these point tools. Too many tools also demand specialised, tool-specific skills.
Summary
A quick recap.
Being a data-driven enterprise continues to be a top priority and the lifeblood of organizational decision making, and there is no moving away from that goal.
We have come a long way in the evolution of data architecture, traversing from the legacy world to the Modern Data Stack of today.
The Modern Data Stack, while bringing best-of-breed point solutions to each problem space and removing bottlenecks by leveraging the cloud, brings in newer challenges and leaves open the opportunity for further evolution of the enterprise stack.
There are visible opportunities for improving enterprise data platform architecture, as well as many unresolved challenges (new and old), including serious ones such as garbage in/garbage out, the too-many-tools problem, broken data governance, and the lack of a seamless user experience.
Total Cost of Ownership (TCO) and pricing model issues have started emerging and will likely grow as data sizes, users, and usage grow.
A 100-billion-dollar question on data ownership/locality: should the enterprise architecture gravitate toward the data warehouse, or should the engines gravitate toward enterprise data? Instead of “I own all your databases,” can SaaS be re-imagined to help enterprises own their data architecture in their own cloud environment?
References
Red Hot: The 2021 Machine Learning, AI and Data (MAD) Landscape
Apache Spark - Unified engine for large-scale data analytics
Modin - Distributed Compute Engine for scaling your python workload
About the Author
Rajesh Parikh is the Founder and CEO of Cynepia Technologies