A Guide to Data Reliability Engineering

By Subramanian V posted 09-15-2023 03:54


Data Reliability Engineering

After the first deluge of big data projects a few years ago, we now have an even bigger pipeline of data modernization projects. With modernization and cloud migration projects, the analytical framework needs to run on accurate and reliable data. Data Reliability Engineering ensures that data systems are reliable, available, and performant.

In this technical document, I discuss the concepts of Data Reliability Engineering (DRE), share my thoughts on its architecture components, principles, and metrics, and explain Hitachi’s DRE service offering, which ensures the data can be trusted and acted upon in a timely and meaningful manner.

What is Data Reliability Engineering?

I see DRE as a practice that ensures the processes that load the data, the platforms that hold and analyze the data, and the data itself are all reliable - and that the three work cohesively by complementing one another.

Contrary to the popular belief that DRE is an additional layer on top of other engineering layers, DRE - considering its implementation and its benefits - is in fact a core part of the other major engineering functions:

And within the realm of data, DRE can be seen as encompassing:

·        data artifacts’ creation and validation through DevOps, and

·        automation and governance of managing a data ecosystem through DataOps, while

·        implementing code that makes data reliable at each layer through DRE principles.

Why is Data Reliability overlooked today?

Many enterprises starting their data and analytics journey, either from scratch or by modernizing existing workflows, jump directly to creating the storage platform followed by an analytics platform. Most projects make the following common mistakes:

·        Creating a data ingestion pipeline without a reconciliation mechanism.

·        Not having any auditing framework to give data and operational health score.

·        Doing all data ingestion error handling manually.

·        Addressing issues only after they have penetrated deep into the system and users raise them.


Such an approach has both immediate and long-term repercussions. Here are some statistics:

·        Gartner research has found that organizations believe poor data quality results in $15 million per year in losses.

·        Nearly 60% of organizations don’t measure the annual financial cost of poor-quality data, according to the same Gartner survey.

·        An MIT article cites a survey which estimates the cost of bad data at 15% to 25% of revenue for most companies, and that employees waste 50% of their time coping with mundane data quality tasks.

·        Experian reports that companies across the globe believe 26% of their data is dirty, which contributes to further losses.

·        One book on data quality states that 40–60% of a service organization’s expense may be consumed as a result of poor data.

To summarize, there are losses in terms of money, time, and resources by handling data quality issues reactively or by not handling them at all.

Metrics of DRE:

Now that we know what DRE is, let’s look at the parameters used to evaluate Data Reliability. There are five main metrics that directly validate a system’s DRE readiness or indicate the maturity of DRE in an application.

Achieving a high score on all the metrics is critical, as even one bad metric can pull down the reliability factor.

Here’s a one-liner on each of the five metrics to help arrive at a scorecard of our application:

Accuracy: How well can the data be trusted as a single version of truth?

Completeness: How comprehensive is the data in terms of the volume, measures, and attributes required to make analytical decisions?

Freshness: How quickly has the data been brought into the data ecosystem from the source OLAP/OLTP systems, and how out of date is it compared to the most recent source inputs?

Validity: How well does the data platform conform to organizational standards such as business rules and compliance norms?

Consistency: How well can each data entity be trusted individually, and does the data from all entities give the same picture?
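As an illustration of how these five metrics might roll up into an application scorecard, here is a minimal sketch in Python. The metric names mirror the five above, but the 0.9 readiness threshold and the example scores are hypothetical:

```python
# Minimal DRE scorecard sketch. Per-metric scores are assumed to be
# normalized to [0, 1] by upstream checks; the 0.9 readiness threshold
# is an illustrative choice, not a standard.

REQUIRED = {"accuracy", "completeness", "freshness", "validity", "consistency"}

def dre_scorecard(scores: dict) -> dict:
    """Summarize per-metric scores into an overall reliability verdict."""
    missing = REQUIRED - scores.keys()
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    # One bad metric pulls the whole score down, so take the minimum.
    overall = min(scores[m] for m in REQUIRED)
    return {"overall": overall, "dre_ready": overall >= 0.9}

report = dre_scorecard({
    "accuracy": 0.98, "completeness": 0.95, "freshness": 0.72,
    "validity": 0.99, "consistency": 0.96,
})
# freshness (0.72) caps the overall score, so the application is not DRE-ready
```

Taking the minimum rather than an average reflects the point above: one bad metric pulls down the whole reliability factor.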

To achieve these metrics, we need to implement the following engineering principles, which are part of Hitachi Vantara’s DRE service offerings.

Key Principles of DRE (Verticals):

There are five main tenets that make an application DRE-ready. A set of DRE Enablement Practices, which we will list next, forms the canvas for achieving data reliability.

Data Observability

·        Quality and Freshness: Implement data reconciliation checks, automatic quality alerts, anomaly detection, ML infused self-heal mechanisms based on trends, automated data loads, KPI based audits, quality thresholds for each pipeline, and need based scaling.

·        Distribution: Implement automatic indexing and partitioning based on thresholds, and modularize the data stores for faster DevOps, streamlined access and self-serve reporting.

·        Schema: Maintain schemas, their relationships, definitions, and usage. Create a semantic layer for business reporting and KPI tracking.

·        Volume: The application must store and analyze data in real time. Optimize for scale, granularity, and latency.
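As a sketch of what the Quality/Freshness and Volume checks above could look like in code (the 2-hour freshness SLA and the 3-sigma volume band are hypothetical thresholds):

```python
# Sketch of two observability checks: a freshness SLA and a 3-sigma
# volume band. Thresholds and row counts are illustrative.
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev

def check_freshness(last_loaded_at, sla=timedelta(hours=2)):
    """True if the latest load is within its freshness SLA."""
    return datetime.now(timezone.utc) - last_loaded_at <= sla

def check_volume(todays_rows, history):
    """True if today's row count is within 3 sigma of historical loads."""
    mu, sigma = mean(history), stdev(history)
    return abs(todays_rows - mu) <= 3 * sigma

history = [10_120, 9_980, 10_050, 10_210, 9_890]
volume_ok = check_volume(2_400, history)  # a partial load, far below trend
```

A failing check would then feed the automatic quality alerts and self-heal mechanisms described above.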


Data Envelope

·        Create & leverage metadata: Create data definitions, data sources and data relationships for data discoverability and data-driven decision making.

·        Data lineage: Map data sources, define data transformations, capture metadata, and maintain data lineage.

·        Data contracts and agreements: Define rules for how the data can be used, shared, and managed, including rules for data ownership, security, quality, privacy, and compliance.

·        Data segmentation:  Segment data based on line-of-business, demographics or usage patterns. This helps in faster decision making, and increased efficiency.
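A data contract can be as simple as a declared schema with ownership and privacy rules that every record is validated against. The sketch below is illustrative; the dataset name, owner, and columns are hypothetical:

```python
# Sketch of a lightweight data contract check. The contract fields
# (dataset, owner, allowed columns) are hypothetical examples of the
# ownership/quality rules described above.

CONTRACT = {
    "dataset": "customer_orders",
    "owner": "sales-data-team",
    "columns": {"order_id": int, "amount": float, "region": str},
}

def validate_record(record: dict, contract: dict) -> list:
    """Return a list of contract violations for one record (empty = valid)."""
    errors = []
    for col, typ in contract["columns"].items():
        if col not in record:
            errors.append(f"missing column: {col}")
        elif not isinstance(record[col], typ):
            errors.append(f"{col}: expected {typ.__name__}")
    for col in record:
        if col not in contract["columns"]:
            errors.append(f"undeclared column: {col}")
    return errors

errors = validate_record({"order_id": 1, "amount": "12.5", "region": "EMEA"}, CONTRACT)
# amount arrives as a string, violating the declared schema
```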


Data Resiliency and Performance

·        Tolerance: Develop self-heal engineering tools for Anomaly detection, log based automatic resolution, automatic data validation, and dynamic scheduling.

·        Scalability: Implement automatic backups and offload, and automatic scaling of storage and resources.

·        Availability: Implement automatic data sync between environments, checks and alerts for data loads, and performance improvements.

·        FMEA (Failure Mode and Effects Analysis): Implement comprehensive audit dashboards and dependency tracking across pipelines.
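One concrete tolerance mechanism is retrying a transiently failing load with exponential backoff before escalating to incident response. A minimal sketch (the retry budget and delays are illustrative):

```python
# Sketch: retry a flaky load step with exponential backoff, then escalate.
import time

def run_with_retries(load_step, max_attempts=3, base_delay=1.0):
    """Run load_step, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return load_step()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted retries: escalate to incident response
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky_load():
    """Simulated source that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient source outage")
    return "loaded"

result = run_with_retries(flaky_load, base_delay=0.01)
```

A real self-heal engine would also classify the exception (from logs or known-issue signatures) before deciding whether a retry is safe.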


FinOps for Data

·        Data Redundancy: Maintain non-siloed data stores which are a single version of truth. Implement de-duplication checks at each layer.

·        Transfer cost: Reduce unnecessary I/O between layers, implement a self-serve BI layer, and keep file and data transfer between applications and environments to a minimum.

·        Data Tiering: Automatically move data from expensive to cheaper storage, with rules based on the age, type, and source of data.

·        Data Usage:  Reduce expenses by using cloud usage analytics, cost forecasting, performance monitoring, right sizing and implementing automation for each of these.
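An age-based tiering rule from the Data Tiering item above might be sketched as follows. The tier names and the 30-day/365-day cut-offs are hypothetical; real cut-offs come from access patterns and storage pricing:

```python
# Sketch of an age-based tiering decision. Cut-offs are illustrative.
from datetime import date, timedelta

def pick_tier(last_accessed: date, today: date) -> str:
    """Choose a storage tier from the age of the data's last access."""
    age = today - last_accessed
    if age <= timedelta(days=30):
        return "hot"    # frequently queried, premium storage
    if age <= timedelta(days=365):
        return "warm"   # occasional access, cheaper storage
    return "cold"       # archive / object storage

tier = pick_tier(date(2023, 1, 10), today=date(2023, 9, 15))
```

A tiering job would run this rule per partition and move data accordingly, feeding the cost forecasting described under Data Usage.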


Data Service Management

·        Incident Response: Implement log-based automatic resolution for pipelines, automatic resolution for known issues, and alerts triggered to communicate with users.

·        Release and Change management: Define release schedules, scope, testing guidelines, approval process, deployment guidelines, and post-deployment validation checklist.

·        Blameless post-mortem: Implement automatic monitoring tools, user reports, pseudo-testing to recreate bugs, and branching and version control for bug-fix deployment.

·        Production readiness: Implement standards for deploying code, databases, pipelines, networking components, compute and storage services, roles and user groups.



Enablers to achieve DRE (Horizontals):

In implementing the DRE principles, enablement practices play a very important role. We talked earlier about how DRE is a culmination of Software Engineering, Data Engineering and Platform Engineering. Let’s look at how the principles of each of those engineering verticals come together to ensure that the application is DRE-ready.

Let’s now discuss each practice in brief along with a sample list of implementation facets and their benefits.


Adopt a Data Mesh:

For the implementation and maintenance of DRE constructs -

-         Move from a single centralized system to multiple decentralized subsystems organized around data domains.

-         Implement “data as a product” that is business domain centric.

I have written another detailed whitepaper on the architecture, benefits, and implementation of a Data Mesh-based approach using a Data Fabric.


Benefits:

-         Data Quality and Observability measures, and faster implementation of these measures in the future.

-         Removal of unwanted overhead of extensive standardization.

-         Not having a monolith ensures data issues do not penetrate to unmanageable levels.

-         Bringing a microservices view to the data platforms for ability to scale faster.

Automate (and use ML):

Within the realm of DRE, automation refers to implementation of the following:

-         Self-healing pipelines for data quality and data ingestion issues

-         ML-infused dynamic DRE

-         Automated alerts for data discrepancies

-         Dashboard for data readiness and completeness of data pipelines

-         Automated audits for reconciliation, referential-integrity, trends-based and time-based checks
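As a sketch of the automated reconciliation audit above, comparing row counts and a control total between source and target (the record shapes are illustrative):

```python
# Sketch of an automated reconciliation audit: compare row counts and a
# control total per load. Record contents are illustrative.

def reconcile(source_rows, target_rows, amount_key="amount"):
    """Return an audit record comparing counts and control totals."""
    src_count, tgt_count = len(source_rows), len(target_rows)
    src_sum = sum(r[amount_key] for r in source_rows)
    tgt_sum = sum(r[amount_key] for r in target_rows)
    return {
        "count_match": src_count == tgt_count,
        "sum_match": abs(src_sum - tgt_sum) < 1e-9,
        "source_count": src_count,
        "target_count": tgt_count,
    }

audit = reconcile(
    [{"amount": 10.0}, {"amount": 20.0}, {"amount": 30.0}],
    [{"amount": 10.0}, {"amount": 20.0}],   # one record dropped in flight
)
```

A mismatched audit record would raise an automated discrepancy alert and appear on the data-readiness dashboard listed above.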


-         Data Quality at each data layer

-         Dynamic DRE using source’s metadata.

-         Reduction in manual efforts (and errors) arising from data discrepancies with interfaces.

-         Observability at one place to make informed decisions.

-         Faster DCUT (design, coding, and unit testing).



Scale (leverage SRE):

Amounts of data, frequency of data, and types of data are all extremely dynamic in today’s environment. The ability to scale the infrastructure so that all the data gets loaded on time is paramount. This is where flavors of Site Reliability Engineering (SRE) help in achieving DRE.

Hitachi Application Reliability Centers (HARC) provide a suite of services that helps resolve underlying data storage, compute, and platform issues before they occur, to achieve DRE.



Benefits:

-         Automatic resolution of data issues arising from the inability to scale.

-         Reduction in cost of Ops.

-         Improved productivity and efficiency.

-         Reduced risk due to Observability.



DataOps:

DataOps wraps the entire ecosystem in a set of practices for working in an agile, automated, and collaborative way. DataOps should enforce:

-         Automated data pipelines

-         Lineage, Security

-         Governance

-         Alerting

-         Access Controls

-         Cost and Performance improvements.



Benefits:

-         Constant data validations

-         Improved data quality.

-         Improved Data traceability.

-         Faster time to value.

-         Enhanced security


Release and Scope standards:

This is the most important governance pillar wherein we establish:

-         An architectural governance framework to define what data needs to go where.

-         Data Modeling to ensure data is loaded in the form that fits the purpose.

-         Compliance to ensure data is stored and accessed appropriately.

-         Metadata.



Benefits:

-         Avoid data redundancies, orphan data marts, and cost overruns.

-         Better analytics, semantics, and self-serve reporting layer.

-         Compliant and Reliable data platforms.


DevOps:

DevOps is the glue that binds all the architectural principles together, ensuring all the pillars move individually in an automated and efficient way without friction. It should implement:

-         Automated data pipelines and testing.

-         Automated platform provisioning.

-         Data artifacts and data copy from one environment to another for SIT and UAT.

-         Agile ceremonies and planning releases.



Benefits:

-         Faster time to value.

-         Parameterizable schedules for easier data incorporation.

-         Faster testing.

-         Governed project delivery.




Architecture to implement DRE:

DRE as a concept has architectural components that cater to all layers of the data ecosystem: an Ingestion or Storage Layer, a Lake Layer (or the Medallion Architecture), a dedicated Data Warehouse Layer for summarized analytics, and a Reporting Layer for consumption and visualization.

The DRE Framework has two main engines: a Process-based engine and a Rule-based dynamic engine. The architecture is built to perform, to scale, and to be platform-agnostic.



Process-based DRE Engine:

This DRE engine works by implementing the facets of all five principles outlined above. This makes the entire system extremely dependable.

The Process-based DRE engine takes inputs from all the sources below:

1.      Code-level attributes such as pipeline information, code logs, and database performance log.

2.      Process-level attributes such as CMDB, Metadata, and Database Catalogs.

3.      Requirement-level attributes such as business requirements, user KPIs, and SLAs.

Architecturally, this is the core engine of the DRE which runs on an IaaS or a PaaS platform.

Rule-based dynamic DRE Engine:

This DRE engine provides features to implement functionality dynamically by augmenting it with Machine Learning and Prompt Engineering. This makes the entire system extremely robust, as it empowers users and data owners to seamlessly validate at run-time and to integrate data into the current platforms dynamically.

There are two main inputs to this engine:

1.      Metadata from the source, which contains descriptive and instructional information.

Based on the information and commands encapsulated in the metadata, the DRE engine will automatically perform the following actions:

o   Dynamic Update Strategy (Insert vs Upsert vs Replace).

o   Reconciliation at each layer.

o   On-demand scheduling.

2.      Audit and Quality Database which has issues, alerts, trends, and lineage information updated in every pipeline, along with other details such as escalation matrix and parameters for scheduling.

Based on this information, DRE engine will automatically perform the following:

o   Proactive alerting based on trends.

o   Record and remediate frequent issues.

o   Alerts based on Trends or on missing or delayed source data.

Architecturally, this engine will have a lot of serverless components working together and running dynamically based on triggers and events.
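As a sketch of how source metadata could drive the dynamic update strategy above (the metadata key and the strategy names are hypothetical; records are modeled as a key-to-value dict):

```python
# Sketch: pick the load strategy from source metadata. The metadata key
# and strategy names are hypothetical; a real engine would read them
# from the metadata store.

def apply_load(existing: dict, incoming: dict, metadata: dict) -> dict:
    """Apply an incoming batch to the target using the declared strategy."""
    strategy = metadata.get("update_strategy", "insert")
    if strategy == "replace":
        return dict(incoming)             # full refresh of the target
    if strategy == "upsert":
        return {**existing, **incoming}   # update matches, insert new keys
    if strategy == "insert":
        return {**incoming, **existing}   # append-only: never overwrite
    raise ValueError(f"unknown strategy: {strategy}")

existing = {1: "old-a", 2: "old-b"}
incoming = {2: "new-b", 3: "new-c"}
upserted = apply_load(existing, incoming, {"update_strategy": "upsert"})
```

In a serverless implementation, each incoming batch's metadata would trigger this dispatch as an event, alongside the per-layer reconciliation and on-demand scheduling listed above.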



Benefits of DRE:

DRE as a service, when implemented end-to-end, has numerous benefits, including addressing the immediate and long-term repercussions we discussed above with statistics.

DRE benefits all the personas in the entire data value chain, helping organizations effectively optimize time, money, and resources.


Hitachi’s DRE Services Catalog:

Hitachi’s DRE service offerings comprise a complete suite of services for all types of needs, based on the customer’s existing data platform maturity and enterprise requirements.

Happy to help!

By now, we all know the benefits of having reliable data and why DRE can’t be treated as an afterthought. Hitachi Vantara has extensive experience implementing reliability practices for data and applications for a range of customers. We are happy to guide customers through Hitachi’s implementation of DRE as part of HARC.



Subramanian (Subbu) Venkatesan

Senior Architect – Technology & Solutions