Solving Complex Data Problems with Data Mesh Using Data Fabric

By Subramanian V posted 03-31-2023 19:29

Data generation and consumption keep growing. So does the need to ingest huge volumes of data from varied sources in varied formats, comply with governance norms, and perform fast and accurate analytics on data spread across different platforms. The result has been a deluge of big data projects in the last five years.

Most new big data projects, be it in cloud or on-prem, start with creating a team, most likely a centralized group including designers, architects, developers, testers and infrastructure and project managers for implementations such as a data lake, data warehouse or a combination of both for reporting and analytics purposes.

While this approach might accomplish some short-term goals for the enterprise, it often fails to meet the planned objectives, for reasons including:

1. A general lack of understanding of the data domain by the centralized team, which is primarily a technical team.

2. An inability to scale quickly, which turns the centralized team into a single bottleneck for all activities.

3. A monolithic resulting system, since it tries to address the needs of all the teams in the organization.

Mitigating the problem

Several changes help mitigate these problems:

1. Move from a single centralized system to multiple decentralized subsystems organized around data domains.

2. Build each data domain from a combination of code, workflows, repositories and processes, with teams creating “data as a product” that is specific to the business domain.

3. Let the domain teams, who know the data, establish the right data models, governance policies, quality checks and access controls, and expose data using well-defined APIs.

This federated approach eliminates centralized, monolithic systems and operational bottlenecks. The resulting architectural approach is called a data mesh.
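To make the “data as a product” idea concrete, here is a minimal Python sketch of how a domain team might describe one data product together with its schema, quality check and access control. The `DataProduct` class, its fields and the `orders` example are hypothetical names chosen for illustration; they are not part of any specific product or of the architecture described in this post.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A domain-owned dataset published with its contract (hypothetical model)."""
    name: str
    domain: str
    owner: str
    schema: dict  # column name -> expected Python type name
    allowed_roles: frozenset = field(default_factory=frozenset)

    def can_read(self, role: str) -> bool:
        # Access control is decided by the owning domain, not a central team.
        return role in self.allowed_roles

    def validate(self, record: dict) -> list:
        """Return a list of quality problems for one record (empty means OK)."""
        problems = []
        for column, type_name in self.schema.items():
            if column not in record:
                problems.append(f"missing column: {column}")
            elif type(record[column]).__name__ != type_name:
                problems.append(f"bad type for {column}: expected {type_name}")
        return problems


# Example product owned by a (hypothetical) sales domain team.
orders = DataProduct(
    name="orders",
    domain="sales",
    owner="sales-data-team",
    schema={"order_id": "int", "amount": "float"},
    allowed_roles=frozenset({"analyst", "finance"}),
)
```

In this sketch the schema, the quality rule and the access list all travel with the product, so the team that knows the data is the one that defines them.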

Data mesh-based architectures, in many ways, bring a microservices view to the data platform. Data mesh, a term coined at ThoughtWorks in 2019 and later accepted by other benchmarking organizations, rests on four basic principles, illustrated in Figure-1 below.

Data mesh offers numerous benefits and advantages

· Decentralization of data resulting in a “data as a product” vision.

· More agile data platforms resulting in the ability to scale and grow.

· Decentralized teams, removing the bottlenecks that are otherwise created by centralized teams.

· Realizing the vision of overall lower technical debt.

· Data democratization.

· Cost efficiencies.

· Increased interoperability.

Yet these advantages do not make data mesh a silver bullet for every problem in creating an end-to-end data ecosystem. The truth is, data mesh alone can’t solve everything.

Challenges of having a data mesh-only approach

· Siloed data domains make it necessary to integrate and cooperate to get meaningful insights on the overall data.

· Too much authority for individual teams results in governance issues.

· Cross-domain reporting can become technically challenging.

· Creation of an enterprise catalog or data dictionary across the domains becomes a challenge.

· Proliferation of dark data.

Adding data fabric to mitigate the challenges

Data fabric is a technology solution that enables data mesh to work properly. Leading industry analysts call data fabric the “future of data management.”

Gartner defines data fabric as follows:

“A data fabric is an emerging data management design for attaining flexible, reusable and augmented data integration pipelines, services and semantics. A data fabric supports both operational and analytics use cases delivered across multiple deployment and orchestration platforms and processes. Data fabrics support a combination of different data integration styles and leverage active metadata, knowledge graphs, semantics and ML to augment data integration design and delivery.”

In their ebook, Understanding the Role of Data Fabric (Guide 4 of 5), Gartner lays out the benefits as illustrated in Figure-2 below:

According to Gartner, data fabric is an emerging data management design that delivers numerous benefits

At Hitachi Vantara, we see data fabric as the technology/architectural construct that helps realize a cohesive view of data from disparate data mesh domains. Data fabric itself is realized as integration, orchestration, data virtualization and federation layers built on top of multiple, disjointed data repositories, data lakes or data marts to provide a unified view of all enterprise data. It is independent of current physical implementation and agnostic to existing data environments and processes.

A data fabric supports both operational and analytic use cases delivered across multiple deployment and orchestration platforms. It also supports a combination of data integration styles by using active metadata, knowledge graphs, semantics and machine learning.

The key foundation for the data fabric is a DataOps platform that brings DevOps practices to the data pipeline using a consolidated set of processes and tools. To a large extent, DataOps automates the end-to-end process of discovery, integration, storage, governance and self-service consumption of data, thereby maximizing the value derived in a meaningful and cost-effective way.

In all, data fabric is the engine that powers the data mesh. Data fabric makes the data mesh better by automating key mesh concepts to create data products faster, in a globally governed way, while providing a seamless link between all data components and users.

Figure-3 below illustrates some of the salient features of the data fabric:

Advantages of data fabric

Implementing a data fabric offers numerous advantages, including:

· Intelligent integration using data catalogs

· Automation of ingestion, lifecycle, quality and access

· Removal of silos by bringing a level of standardization and governance

· Ensuring data quality by having policies for ingestion, storage and extraction

· Cost efficiencies in storage and compute

· Leaner model by using flexible and reusable data integration pipelines

· Data protection by having security measures

· Reliable pipelines with monitorability

· Faster and secure SDLC using DevSecOps

· Data lifecycle for leaner data storage and purge

· Cloud and AI-ready architecture

Summarizing data fabric and data mesh concepts

A data fabric is designed to provide an integrated view of all the data components and to make it easy to ingest, store, transform, access, manage and analyze that data, while a data mesh is designed to provide a more decentralized way to ingest, store and manage data, and make it easier for users and applications to access.

In a data mesh, each service or team has its own data store and data model, and data is shared through APIs. In contrast, in a data fabric, data is managed centrally, and access is granted through a centralized layer. Data fabric focuses on a unified view of data and ease of access, while data mesh focuses on flexibility, scalability and ownership.
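The contrast above can be sketched in a few lines of Python. In this minimal sketch, each domain keeps its own store and exposes a reader function, while a central catalog provides the unified access layer and reads from the owning domain at query time instead of copying data. The `FabricCatalog` class, the product names and the in-memory lists standing in for real stores are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class FabricCatalog:
    """Central access layer: knows where products live, does not copy their data."""

    def __init__(self) -> None:
        self._readers: Dict[str, Callable[[], Iterable[Row]]] = {}

    def register(self, product: str, reader: Callable[[], Iterable[Row]]) -> None:
        # Each domain registers the API (here, a function) that serves its product.
        self._readers[product] = reader

    def products(self) -> List[str]:
        # A very small stand-in for an enterprise catalog.
        return sorted(self._readers)

    def read(self, product: str) -> List[Row]:
        # Data is pulled from the owning domain at query time.
        return list(self._readers[product]())

    def join(self, left: str, right: str, key: str) -> List[Row]:
        """Cross-domain inner join on a shared key."""
        index: Dict[object, List[Row]] = {}
        for row in self.read(right):
            index.setdefault(row[key], []).append(row)
        return [
            {**left_row, **right_row}
            for left_row in self.read(left)
            for right_row in index.get(left_row[key], [])
        ]


# Two domains, each owning its own store (in-memory lists stand in for real stores).
sales_orders = [{"customer_id": 1, "amount": 120.0}, {"customer_id": 2, "amount": 35.5}]
crm_customers = [{"customer_id": 1, "region": "EMEA"}, {"customer_id": 3, "region": "APAC"}]

fabric = FabricCatalog()
fabric.register("sales.orders", lambda: sales_orders)
fabric.register("crm.customers", lambda: crm_customers)

report = fabric.join("sales.orders", "crm.customers", key="customer_id")
```

Ownership stays with the sales and CRM domains, while cross-domain reporting goes through one layer. A real fabric would add metadata, security and query pushdown on top of this idea.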

Figure-4 provides a visual representation of how data mesh and data fabric can work together to create a better data ecosystem:

Combination of architectural principles of data mesh and data fabric

Implementing data mesh using data fabric on Azure

Figure-5 provides an architectural approach to implementing a decentralized data lake platform based on data mesh, combining data fabric’s capabilities to ingest, integrate, store, manage, analyze, secure and govern the entire data platform using Azure’s service components.

Technical architecture of a data mesh implementation using data fabric on Azure

This architecture on Azure provides an idea of how it can be implemented on other clouds, and also on a platform containing both on-prem and cloud components.

Reasons to consider this data fabric plus data mesh architecture

The factors below have strengthened the case for a data mesh or a data fabric. The combined approach addresses each of them differently from the traditional approach:

· Increasing data volume

A traditional approach would mean ingesting data into different layers, performing ETL, and then changing all components before any value can be derived.

Data mesh plus data fabric architecture means automated data ingestion policies and virtualization for users querying the data, thereby reducing redundancy of ETL and storage, and a self-serve BI layer to ensure faster readiness of data for reporting and analytics.

· Variety of data types and formats

A traditional approach would necessitate extensive standardization and manual processes to integrate the data sets with one another for analytics, increasing development and testing costs.

Data mesh plus data fabric architecture means modularizing the entire data lifecycle and treating each set of data as an entity owned by the team that generates it. This makes sure teams keep the data in a form the analytics layer can consume as is, while only the mandatory summarizations are done for combined analytics.

· Expanding IoT data and edge, core and third-party data streams

A traditional approach would mean more layers of data storage, which would increase cost without knowing whether the data holds any value.

Data mesh plus data fabric architectures have data governance policies, and different storage types and retention periods for different types of data, for cost savings and automated compliance.

· Diversified analytics

A traditional approach would mean a fixed set of KPIs, forcing all the data on different platforms to comply with standards to satisfy those KPIs.

Data mesh plus data fabric architectures provide the advantage of accessing and processing the data as a single version of truth, close to the source, and provide different platforms to cater to different levels of KPIs feeding different categories of users.

· Multiple data locations including on-prem and cloud

A traditional approach forces organizations to spend millions of dollars on integrating the data just to find out whether it can provide any value.

Data mesh plus data fabric architectures solve integration issues by implementing a framework to decide which data should reside where, how the data sets need to be merged, and how virtualization can be used to first understand the combined value before making investments.
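As one illustration of the policy-driven approach described above for IoT, edge and third-party data streams, the following minimal Python sketch maps a data category to a storage tier and a retention period and checks whether a data set is due for purge. The categories, tiers and day counts are invented for the example and would in practice come from each organization's governance policies.

```python
from datetime import date, timedelta

# Hypothetical policy table: data category -> (storage tier, retention in days).
RETENTION_POLICY = {
    "iot_raw": ("cold", 30),
    "edge_aggregates": ("warm", 365),
    "third_party": ("warm", 90),
}
# Assumption for this sketch: unclassified data goes to cold storage and expires fast.
DEFAULT_POLICY = ("cold", 7)


def storage_tier(category: str) -> str:
    """Return the storage tier assigned to a data category."""
    return RETENTION_POLICY.get(category, DEFAULT_POLICY)[0]


def is_expired(category: str, created: date, today: date) -> bool:
    """Return True when data of this category is older than its retention period."""
    retention_days = RETENTION_POLICY.get(category, DEFAULT_POLICY)[1]
    return today - created > timedelta(days=retention_days)
```

A scheduled job could call `is_expired` for each stored data set and purge or archive the ones that return `True`, which keeps storage lean without manual review.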

Conclusion

Data fabric was listed among Gartner’s Top Strategic Technology Trends for 2022. Together with data mesh’s key principles, it can serve as the next beacon for organizations planning to embark on their journey towards data modernization.

Data mesh and data fabric are two related concepts that aim to address the complexity and scalability challenges that arise when building and maintaining large, distributed systems. Together, they offer a powerful approach to building data-driven systems that are more scalable, adaptable and resilient. By breaking down data silos and promoting autonomy, data mesh allows teams to move quickly and independently. And by providing a common foundation for data management, data fabric makes it easier for teams to share data and collaborate effectively.

However, despite the advantages, implementing data mesh and data fabric can be challenging. It requires a significant shift in organizational culture and practices, as well as adapting to newer technologies and infrastructure services. It also requires a deep understanding of the data domains and business requirements.

On that topic, Hitachi Vantara is here to help. We have extensive expertise in handling complex data projects and in designing and implementing approaches based on data fabric and data mesh. We are happy to help and guide customers with all their data needs.

Subramanian (Subbu) Venkatesan

Senior Architect – Technology & Solutions

Email: Subbu.Venkatesan@hitachivantara.com

#ApplicationModernization
