Making the Right Data Available to the Right Consumers at the Right Time

By Nishant Kohli posted 12-08-2019 17:31



Paul Lewis makes a great point in a recent blog post. He notes that traditional data management infrastructure just isn’t designed for self-service, let alone for the DataOps paradigm, which presupposes a continuous process of development, testing, and deployment.

For one thing, the technologies and practices of the two worlds don’t align and can’t be reconciled with one another. Data integration in traditional data management has different priorities and a fundamentally different purpose than data engineering in the context of self-service.

DataOps, on the other hand, is essentially a successor paradigm to traditional data management.

Although DataOps uses concepts and techniques with familiar names (e.g., “data integration,” “data profiling,” “data quality,” and “data lineage”), they have new or, at least, subtly different meanings. They involve different kinds of “creators”; they are consumed by different kinds of “users”; they are embedded in or associated with different processes and practices; and, on the whole, they occur at a radically different cadence. Let’s just say that traditional data management has a different set of priorities and a different purpose than either self-service or DataOps. [1]

Distribution’s the thing

Paul’s blog also gets at the most significant difference between the old and the new data management regimes: distribution. People are distributed. Data is distributed. Processing is distributed.

DataOps expects people, data, storage, and processing to be distributed, be it in the context of self-service discovery, guided self-service analysis, or any of a dozen similar self-service use cases, and it views the problem of data management through this same lens. It expects data to live in multiple places or contexts, not all of which are physically – or, for that matter, virtually – local to one another. In the DataOps regime, data movement is chaordic: e.g., data processed at the edge is vectored to one or more destinations in the cloud; data sourced from one or more cloud services is processed in a separate cloud service and vectored to disparate destinations in the on- and off-premises enterprise, the cloud, the Web, and all points in between.

DataOps is able to accommodate both repeatable, reusable data flows – the bread and butter of traditional data management – and the (usually one-off) data pipelines that are the products of data science and other advanced analytics practices. DataOps makes it possible to shift from what thought-leader Donald Farmer calls a “gatekeeper” governance model – which emphasizes control (and, especially, restriction) at the expense of access and freedom – to a so-called “shopkeeper” model, which tries to accommodate the reasonable needs of self-service users and other consumers.

On its own, DataOps is just a means to an end: a paradigm, a mindset, a methodology, and a set of tools and practices for managing data and for supporting data work (particularly, the self-service experience) in a new era of “distributedness.” There are different ways of “doing” DataOps. One common approach is to attempt to integrate DataOps-like technologies, concepts, and practices into an existing data management practice. The problem with this approach stems from the essential mismatch between the design, optimization, and goals of traditional data management and those of the new DataOps regime. Integrating DataOps into an existing practice risks canceling out the ease-of-use, ease-of-access, and ease-of-agency features that make self-service so disruptive. It likewise complicates the continuous development, testing, and deployment philosophy that is at the heart of DataOps itself.

It’s a lot like when an organization says it’s practicing agile project management but, in reality, is as anti-agile as it gets. Or consider another example. Ten or 15 years ago, many organizations used relational database systems to store non-relational data. Yes, the RDBMS could store a native JSON object, but at what cost? A database management system is designed to optimize for the storage, retrieval, and processing of data. Most RDBMSs couldn’t do this with JSON or similar non-relational data types. Instead, they stored JSON and other objects (uncompressed) in BLOB storage, where they consumed considerably more space than relational data. They cost more to retrieve and process, too.
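To make the BLOB problem concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for an older RDBMS. The table names, column names, and sample readings are invented for illustration. The point is only that when JSON sits in a BLOB column, the database sees opaque bytes, so the application has to fetch and parse every row itself, whereas a relational layout lets the engine do the filtering.

```python
import json
import sqlite3

# In-memory database as a stand-in for an RDBMS without native JSON support.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings_blob (id INTEGER PRIMARY KEY, payload BLOB)")
conn.execute(
    "CREATE TABLE readings_rel (id INTEGER PRIMARY KEY, sensor TEXT, temp_c REAL)"
)

# Hypothetical sample data: six readings from three sensors.
readings = [{"sensor": f"s{i % 3}", "temp_c": 20.0 + i} for i in range(6)]
for r in readings:
    blob = json.dumps(r).encode("utf-8")
    conn.execute("INSERT INTO readings_blob (payload) VALUES (?)", (blob,))
    conn.execute(
        "INSERT INTO readings_rel (sensor, temp_c) VALUES (?, ?)",
        (r["sensor"], r["temp_c"]),
    )


def hot_from_blob(threshold):
    """The engine cannot look inside the BLOB: fetch every row, parse in the app."""
    rows = conn.execute("SELECT payload FROM readings_blob ORDER BY id").fetchall()
    result = []
    for (payload,) in rows:
        doc = json.loads(payload)
        if doc["temp_c"] > threshold:
            result.append(doc)
    return result


def hot_from_rel(threshold):
    """With relational columns, the engine filters before returning rows."""
    rows = conn.execute(
        "SELECT sensor, temp_c FROM readings_rel WHERE temp_c > ? ORDER BY id",
        (threshold,),
    ).fetchall()
    return [{"sensor": s, "temp_c": t} for s, t in rows]


print(hot_from_blob(23.0))
print(hot_from_rel(23.0))
```

Both functions return the same two readings, but the BLOB version pays for a full scan plus a parse of every stored document to get there.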

A data infrastructure just for DataOps

The RDBMS was simply a bad fit for non-relational data. No matter how hard they tried, organizations couldn’t make it work. They needed something new: a cost-effective and scalable solution optimized for non-relational data storage. This was why most large organizations experimented with dedicated NoSQL platforms. These systems had their own problems, however. They were optimized for a single purpose (Hadoop, Cassandra), for certain types of content (documents, files, images or video, etc.), or for certain storage use cases.

Besides, NoSQL is, in a sense, passé. There’s now a better way: the distributed object store. An object store gives us the biggest advantage of NoSQL (an ability to store data of any type) without these drawbacks. The object store is the perfect substrate for a DataOps practice centered on a modern data lake. Enriched with complementary technologies – data virtualization, metadata cataloging, self-service data preparation, the ability to catalog and manage unstructured and semi-structured data – a data lake functions as a scalable, performant, resilient, cost-effective data hub for storing, managing, and processing data of any type. Data can be vectored to it from all directions: from on-premises databases, applications, and services; from RESTful services; from stage-to-stage in a self-service data engineering or self-service analytics pipeline; and, most important, from the edge.
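The pairing of an object store with a metadata catalog can be sketched in a few lines. What follows is a toy, in-memory stand-in, not the API of any real object store: the `ObjectStore` class, its `put`, `get`, and `find` methods, and the keys and metadata fields are all hypothetical. It shows the two ideas from the paragraph above: objects of any type live side by side as bytes under a key, and a catalog of metadata is what makes them findable.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ObjectStore:
    """Hypothetical in-memory stand-in for a distributed object store."""

    _objects: Dict[str, bytes] = field(default_factory=dict)
    _catalog: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def put(self, key: str, data: bytes, **metadata: str) -> None:
        # The store never inspects the payload; any bytes are accepted.
        self._objects[key] = data
        self._catalog[key] = dict(metadata)

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def find(self, **criteria: str) -> List[str]:
        # Catalog lookup: return keys whose metadata matches every criterion.
        return sorted(
            key
            for key, meta in self._catalog.items()
            if all(meta.get(name) == value for name, value in criteria.items())
        )


store = ObjectStore()
# Semi-structured data arriving from the edge.
store.put(
    "edge/sensor-42/2019-12-08.json",
    b'{"temp_c": 21.5}',
    content_type="application/json",
    source="edge",
)
# Structured data exported from an on-premises application.
store.put(
    "erp/orders/2019-12.csv",
    b"order_id,total\n1001,250.00\n",
    content_type="text/csv",
    source="on-premises",
)
# Unstructured data: the first bytes of a JPEG image.
store.put(
    "media/site-a/cam1.jpg",
    bytes([0xFF, 0xD8, 0xFF, 0xE0]),
    content_type="image/jpeg",
    source="edge",
)

print(store.find(source="edge"))
print(store.find(content_type="text/csv"))
```

A real deployment would add durability, replication, and access control; the sketch only illustrates why the catalog matters as much as the storage itself.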

Paul gets at this in his blog. He makes the point that – instead of trying to retrofit an existing data management infrastructure for DataOps – it makes more sense to build a parallel environment just for DataOps. There’s virtually no risk of redundancy. In practice, the DataOps-only environment functions as a kind of “factory” in which new types of data and new kinds of analytics can be prototyped and hardened for use in production. Once they’ve demonstrated value (and dependability) in the DataOps environment, they can be slipstreamed into the traditional data management environment. In the same way, data integration workloads, operational and analytical reports, dashboards, KPIs, and other core decision-support assets can, over time, be replicated, tested, and hardened in the DataOps environment. Once they’re tested and vetted, the organization can consider the pros and cons of eliminating them from the traditional data management environment. It may be impossible to move some core decision-support workloads. Even so, the DataOps environment will host a larger and larger proportion of them over time, and the two environments will gradually converge.

The very model of a modern (major) data lake

The modern data lake functions as a central hub for data collection, preparation, provisioning, and access. It gives organizations greater flexibility with respect to how they store data, as well as which kinds of data they store. It is an ideal source (and target) for the DataOps paradigm and supports traditional decision-support use cases, too. Above all, it makes it possible for organizations to maintain data in its original format. This is important because not everybody needs the clean, consistent data that is the byproduct of traditional data integration. Data scientists, data engineers, and business analysts need access to data in its raw form. Certain types of advanced analytics (such as data mining and machine learning) require raw data as input, too – and so do ETL processes, for that matter. Less sophisticated users require that data be prepared for them, however. So, too, do operational reporting workloads, or, for that matter, any workload associated with governed decision-support processes.
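A small sketch shows why keeping data in its original format matters. The records, field names, and cleaning rules below are invented for illustration. The `prepare` function produces the clean, consistent view a reporting user would want, while the raw records stay untouched for anyone who needs them.

```python
import copy

# Hypothetical raw records, exactly as they landed in the lake.
RAW_EVENTS = [
    {"customer": " Acme Corp ", "amount": "1,200.50", "region": "emea"},
    {"customer": "acme corp", "amount": "300", "region": "EMEA"},
    {"customer": "Globex", "amount": None, "region": "apac"},
]


def prepare(events):
    """Build a cleaned copy for governed reporting; never mutate the input."""
    prepared = []
    for event in events:
        if event["amount"] is None:
            # Reporting users want complete rows, so incomplete ones are dropped.
            continue
        prepared.append(
            {
                "customer": " ".join(event["customer"].split()).title(),
                "amount": float(event["amount"].replace(",", "")),
                "region": event["region"].upper(),
            }
        )
    return prepared


snapshot = copy.deepcopy(RAW_EVENTS)
report_rows = prepare(RAW_EVENTS)

print(report_rows)
print(RAW_EVENTS == snapshot)  # the raw records are unchanged
```

The cleaned view merges the two Acme spellings and drops the Globex row with the missing amount. That is what a dashboard needs, but a data scientist studying data-entry errors or missing values would need exactly the details that preparation removed.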

To sum up, the modern data lake, based on a distributed object store and enhanced with complementary technologies, is the ideal substrate for a DataOps practice. It is flexible enough to accommodate DataOps, self-service, and traditional decision-support use cases alike. It is cost-effective and scalable and can easily span the on- and off-premises enterprise. Thanks to its cost-effectiveness, it can also complement a traditional decision-support practice as a separate, parallel environment.

Nishant Kohli

[1] For what it’s worth, self-service and DataOps aren’t mutually exclusive. Self-service tools are commonly used in DataOps. Self-service is a type or kind of use; DataOps is a paradigm in which different kinds of use take place.
