
Self-Service, Data Architecture and the Modern Data Lake

By Nishant Kohli posted 09-30-2019 15:28

  
Self-Service: The Unfinished Revolution

Few would dispute that self-service has been an incredibly positive disruption. Self-service tools and features make it easier for users to access, explore, and analyze data. Self-service technologies simplify many aspects of data preparation and data movement; for certain use cases, self-service technology also automates the experience of analysis itself. But self-service has a few drawbacks, too.

First, in making more things possible for users, self-service technologies also create considerably more work for them. A data scientist might spend the bulk of her time discovering, moving, and preparing data for analysis. An analyst might discover a valuable source of data only to be stopped short by regulatory red tape: before she can use the data, she has to redact, transform, hash, or mask it.

Second, self-service requires a fundamentally different approach to data governance. Traditionally, data governance gave priority to control—with an emphasis on restriction—at the expense of access. Analytics thought leader Donald Farmer calls this “gatekeeper” governance. Its aim is to put up roadblocks to data access. Farmer contrasts this with “shopkeeper” governance, which—rather than putting up roadblocks to access—tries to accommodate the reasonable needs of self-service users.

Legacy data management technologies are not equipped to deal with disruption of this magnitude. A new data architecture, grounded in a new approach to managing and governing data, is required.

This is the impetus for the modern data lake: it’s the central hub in a distributed data architecture.

Towards a Self-Serviceable Data Architecture

From the perspectives of self-service users, developers, and even rank-and-file business people, the modern data lake is a lot like Grand Central Station: it’s where you go to get data—or, for that matter, to shop for data—because it’s the central site into which data is vectored from all directions.

This contrasts with a data management status quo in which self-service users spend more time searching for and preparing data than actually doing analysis. The dysfunction brings to mind the old proverb about giving a person a fish versus teaching her to fish. Today's self-service tools don't just teach the user how to fish; they also require that she assemble and string her own rod and reel, procure her own fishing line, and fabricate her own lures and tackle. What the self-service user needs is a data architecture that complements what she's doing (or wants to do) with her self-service front-end tools: one that provides the equivalent of pre-fab rods, reels, line, lures, and tackle, simplifying access to data and automating routine, tedious, or complex tasks.

Until recently, this was a pipe dream. The original data lake, which was supposed to serve as a central site for data ingest, provisioning, and access, did little to help with this. The first-generation data lakes failed because they were conceived with the legacy gatekeeper data management paradigm in mind.

The modern data lake aligns with the premise of shopkeeper governance. It is a radically new—a self-serviceable—take on the data lake. The modern data lake is a data hub: a central destination not only for on-premises OLTP systems and cloud services, but for data that originates at the enterprise edge, too. It’s a source for “consumers” of any and every kind, whether they’re self-service users; rank-and-file business people; downstream repositories—such as data warehouses and data marts or document- and content-management databases—for which it functions as a site for landing and staging data; and RESTful programs or services, which expect to use APIs to exchange data. The modern data lake incorporates ease-of-use and user-assist features to simplify (and in some cases automate) data access, preparation, movement, and analysis for a wide variety of users and use cases.

The modern data lake is undergirded by a distributed object store, a scalable storage technology that is more performant and cost-effective than legacy technologies. The object store is “smarter” than these legacy data stores, too, automating data access and preparation for many common use cases.

Not all object stores are alike, however. Some run only in the cloud. Some perform best in the on-premises enterprise. Some run in both contexts and, moreover, support popular cloud APIs, such as Amazon's S3 API, which facilitates interoperability between on-premises and cloud deployments. Some are faster than others, boasting superior raw throughput and I/O bandwidth. Some are "smarter" than others, using intelligence to automate certain kinds of data transformations (converting raw document or audio files into common formats such as PDF or MP3) or to simplify certain tasks (importing data into a self-service discovery tool). And some are more extensible than others.
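To make the interoperability point concrete, here is a minimal sketch of what S3-API compatibility buys you in practice: the same client code can address AWS S3 or an on-premises object store that speaks the S3 API, simply by pointing at a different endpoint. The endpoint URL, bucket name, object key, and credentials below are placeholders for illustration, not a reference to any particular product.

```python
import boto3

# Sketch of S3-API interoperability: the same client code can point at AWS S3
# or at an on-premises object store that exposes the S3 API. The endpoint,
# bucket, key, and credentials here are all hypothetical placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.internal:9000",
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# List objects in a landing-zone bucket, then fetch one for local inspection.
for obj in s3.list_objects_v2(Bucket="landing-zone").get("Contents", []):
    print(obj["Key"], obj["Size"])

s3.download_file("landing-zone", "sales/2019-09/orders.csv", "orders.csv")
```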

Extensibility matters. The distributed object store is not the only component of the modern data lake. In practice, it is complemented by a raft of other technologies (data virtualization, metadata cataloging, content management, relational and natural-language search, self-service data preparation, and AI-driven automation) that make it possible to knit together distributed data sources into a virtual data fabric. Users and consumers see unified views of (or enjoy unified access to) data, irrespective of its physical location. The integration and interoperability of these technologies is the modern data lake. And the distributed object store is its foundation: the rock on which an organization will build the (user-self-serviceable) modern data lake; the central hub of a distributed data architecture.

We’ll discuss the requirements of the users (and services) that will depend on the modern data lake, as well as the use cases it will enable, in an ongoing series of blogs.

To whet your appetite for this, let’s enumerate the core criteria of the modern data lake itself. It:
 
  • Deploys in the on-premises enterprise and in the cloud;
  • Supports multi-cloud deployments;
  • Functions as a scalable, performant, fault-tolerant landing zone for ingesting data of any type;
  • Ingests data from any/all possible vectors: e.g., core systems, cloud, and the enterprise edge;
  • Provides a central, scalable context for managing and governing data of any/all kinds;
  • Provides a means of automating or triggering repeatable data flows, such as ETL flows from core OLTP systems and edge systems, data exchange via RESTful services, etc.;
  • Automates certain types of ad hoc data flows, such as CSV extracts generated automatically from source Word, Excel, and PDF files or from JSON objects (see the sketch after this list);
  • Exposes a useful set of self-service data discovery and preparation tools/features.
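As an illustration of the ad hoc flows mentioned above, the sketch below flattens newline-delimited JSON records into a CSV extract. It is a deliberately minimal example, not any particular product's implementation; the file names and the assumption of flat, newline-delimited records are placeholders for the sake of illustration.

```python
import csv
import json

def json_to_csv_extract(json_path: str, csv_path: str) -> None:
    """Flatten newline-delimited JSON records into a CSV extract.

    A minimal illustration of the kind of ad hoc flow a modern data lake
    might trigger automatically on ingest; paths and the flat record
    layout are assumptions made for this example.
    """
    with open(json_path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    if not records:
        return

    # Union of keys across all records, preserving first-seen order,
    # becomes the CSV header.
    fieldnames = list(dict.fromkeys(k for rec in records for k in rec))

    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)

json_to_csv_extract("orders.jsonl", "orders_extract.csv")
```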

This last is especially important. Again, the modern data lake must help, not hinder, the self-service use case. This is its raison d'être. It can do so in several ways: for example, by automatically generating an appropriately typed table schema whenever a user imports CSV, JSON, or XML data. Or by annotating a schema based on the semantics of known or profiled data. (This is a company name, that is a product name, etc.) Or by flagging (and, optionally, automatically removing) duplicate data. Or by generating a basic statistical profile (min/max, histograms, outliers, duplicates, nulls) of data during import. In the conventional self-service model, too much of this work gets offloaded to the user.
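By way of illustration, here is a minimal sketch of the schema-inference and profiling behaviour just described, using pandas against a local CSV file. The file name is a placeholder and the checks are deliberately simple; a real data lake would perform this on ingest, at scale, and without the user writing any code.

```python
import pandas as pd

def profile_csv(path: str) -> None:
    """Infer a typed schema and print a basic statistical profile.

    A rough sketch of the user-assist behaviour described above; the
    file path is a placeholder and the heuristics are intentionally basic.
    """
    df = pd.read_csv(path)  # pandas infers column types on import

    # Inferred schema: column name -> dtype (the "appropriately typed table schema")
    print("Inferred schema:")
    print(df.dtypes)

    # Basic statistical profile: min/max and quartiles for numeric columns
    print("\nNumeric summary:")
    print(df.describe())

    # Nulls per column and duplicate rows, flagged for the user
    print("\nNulls per column:")
    print(df.isna().sum())
    print(f"\nDuplicate rows: {df.duplicated().sum()}")

profile_csv("customers.csv")
```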

The modern data lake is a step in the direction of finishing what the self-service revolution started.

Nishant Kohli
 

#IoTSoftware
#ThoughtLeadership
#Blog
