Data generation and consumption have grown enormously. Enterprises now need to ingest huge volumes of data from varied sources in varied formats, comply with governance norms, and perform fast, accurate analytics on data spread across different platforms. That combination has driven a deluge of big data projects over the last five years.
Most new big data projects, whether in the cloud or on-premises, start by creating a team, most likely a centralized group of designers, architects, developers, testers, infrastructure engineers and project managers, to implement a data lake, a data warehouse or a combination of both for reporting and analytics.
While this approach might accomplish some short-term goals for the enterprise, it often fails to meet the planned objectives, for reasons including:
1. A general lack of understanding of the data domain by the centralized team, which is at best a technical team.
2. An inability to scale quickly, which makes the centralized team a single bottleneck for all activities.
3. A resulting system that is monolithic, since it tries to address the needs of every team in the organization.
Mitigating the problem
There are a number of ways to prevent this from becoming a problem:
1. Move from a single centralized system to multiple decentralized subsystems organized around data domains.
2. Each data domain consists of a combination of code, workflows, repositories and processes, and teams create “data as a product” that is specific to the business domain.
3. With knowledge of the data, domain teams are better equipped to establish the right data models, governance policies, quality checks and access controls, and to expose data using well-defined APIs.
This federated approach eliminates centralized, monolithic systems and operational bottlenecks. The resulting architectural approach is called a data mesh.
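The "data as a product" idea above can be sketched in code. The following is a minimal, hypothetical illustration, not a standard data mesh API: all names (`DataProduct`, `publish`, `quality_check`) are invented for this example. It shows a domain team owning its schema, quality rules and a well-defined read interface for consumers in other domains.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch of a domain-owned "data product".
# The names here are illustrative only, not part of any standard.

@dataclass
class DataProduct:
    domain: str                                   # owning business domain
    name: str                                     # product name within the domain
    schema: dict                                  # column name -> type: the published contract
    quality_check: Callable[[list], bool]         # domain-defined quality rule
    _rows: list = field(default_factory=list)

    def publish(self, rows: list) -> None:
        # The domain team validates data before exposing it to others.
        bad = [r for r in rows if set(r) != set(self.schema)]
        if bad or not self.quality_check(rows):
            raise ValueError("rows violate the product's schema or quality rules")
        self._rows.extend(rows)

    def read(self) -> list:
        # Well-defined, read-only API for consumers in other domains.
        return list(self._rows)

# The sales domain team defines and operates its own product.
orders = DataProduct(
    domain="sales",
    name="orders",
    schema={"order_id": int, "amount": float},
    quality_check=lambda rows: all(r["amount"] >= 0 for r in rows),
)
orders.publish([{"order_id": 1, "amount": 99.5}])
print(len(orders.read()))  # 1
```

In a real system the read interface would typically be a versioned REST or SQL endpoint rather than an in-process method, but the ownership pattern (domain team defines schema, quality and access) is the same.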
Data mesh-based architectures, in many ways, bring a microservices view to the data platform. The term, coined by ThoughtWorks in 2019 and since adopted by other benchmarking organizations, rests on four basic principles, illustrated in Figure-1 below.
Adding data fabric to mitigate the challenges
Data fabric is a technology solution that enables a data mesh to work properly. Leading industry analysts call data fabric the "future of data management."
Gartner defines data fabric as follows:
“A data fabric is an emerging data management design for attaining flexible, reusable and augmented data integration pipelines, services and semantics. A data fabric supports both operational and analytics use cases delivered across multiple deployment and orchestration platforms and processes. Data fabrics support a combination of different data integration styles and leverage active metadata, knowledge graphs, semantics and ML to augment data integration design and delivery.”
In its ebook, Understanding the Role of Data Fabric (Guide 4 of 5), Gartner lays out the benefits, as illustrated in Figure-2 below: