Data Speaks for Itself

In the November 16th issue of the Data Administration Newsletter, Dr. John Talburt makes some interesting points about data governance (DG), drawing inspiration from Waldo, an early robotic device created in 1945 to let a human remotely operate mechanical hands in a sealed chamber.

So, Talburt asks, what does this have to do with data governance today? “While the Waldo was not an electronic device, it did enforce a certain level of governance on the user in terms of controlling the user’s access to materials and the actions the user could perform. One of the biggest problems we have with data governance today is the lack of automation… In my opinion, the two largest gaps in trying to automate the enforcement of policies and standards are the lack of software controls to limit user actions on data, and the lack of automatic metadata generation and capture of the actions that are allowed.”

The current approaches to DG automation focus primarily on user authentication and data source access, Talburt points out. “All organizations implement some type of user authentication mechanism and most have some way of limiting access to certain data sources, or even specific fields within the data… But once a user is authenticated and his or her role grants access to data, the compliance to policies regarding user actions performed on the data to policy is largely voluntary. In addition to the lack of automated controls on data actions, there is also the issue of operational context. The same user might have authorization to perform certain actions on a given dataset for one purpose, but not for another purpose. These scenarios can be difficult to manage through role assignment. Systems for automating DG need to accommodate all four of these factors including the user, the data, the actions and the context.”

So, to be clear, he continues, “I am suggesting that DG automation needs to be more like Waldo where users only interact with the data through an intervening layer. You can think about this as an extension of the growing trend to implement self-service analytics. Self-service systems allow authenticated users access to a limited number of datasets to perform a limited (predefined) set of actions in a fixed context, usually business analytics. In the same way we have seen the data warehouse migrate from the backend of operational systems to the frontend (i.e., lake houses), I predict we will see self-service analytics expand to include more datasets and more upstream data operations.”

A very useful model for controlling data tools is Ranger, Talburt says, “a software application designed to communicate with and control other tools such as Hive and Spark through a stored digital policy.” He adds: “In addition to controlling the actions of a data tool, Ranger can also collect metadata describing the actions the tool performed. Some commercial applications for the big data space have followed this model to develop very robust automated DG policies that can control data, users and actions, but mostly in the context of data analytics, not as an enterprise solution.”

But just for a moment, he continues, “try to imagine a world in which all your organization’s data are safely enclosed in a controlled data space and users can only manipulate the data through a software control layer that encloses the space and automates compliance to digital policies. And suppose the control layer’s digital policies only allow authorized users to perform the operations allowable for their roles and given operational context. And, as long as we are imagining, suppose that every action taken inside of the controlled data space is captured as metadata and sent back to the control layer where a series of filters forward updates to the data catalog, dashboards, and other monitors in real, or near-real, time.”
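
To make the idea concrete, here is a minimal Python sketch of such a control layer. It is only an illustration and does not come from Talburt's article or from Ranger: the names `Rule`, `ControlLayer`, and `request` are hypothetical, and the policy model (an exact match on role, dataset, action, and context) is an assumed simplification. It shows the two pieces described above: a request is allowed only if a stored policy covers all four factors, and every request, allowed or not, is captured as metadata.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Rule:
    """One hypothetical policy entry: a role may perform an action
    on a dataset, but only within a given operational context."""
    role: str
    dataset: str
    action: str
    context: str


@dataclass
class ControlLayer:
    """Hypothetical intervening layer between users and data."""
    rules: set
    audit_log: list = field(default_factory=list)

    def request(self, user, role, dataset, action, context):
        # Allow only when a stored rule matches role, dataset,
        # action, and context exactly (assumed policy model).
        allowed = Rule(role, dataset, action, context) in self.rules
        # Capture metadata for every request, whether allowed or denied.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "dataset": dataset,
            "action": action,
            "context": context,
            "allowed": allowed,
        })
        return allowed


# Example: the same analyst may read sales data for reporting,
# but not for a different purpose.
layer = ControlLayer(rules={Rule("analyst", "sales", "read", "reporting")})
ok = layer.request("ana", "analyst", "sales", "read", "reporting")
denied = layer.request("ana", "analyst", "sales", "read", "marketing")
```

In this sketch the audit log stands in for the metadata stream that would feed a data catalog or dashboard. A real system would sit in front of the data tools themselves rather than rely on callers to go through `request`.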

While developers and data scientists might shudder at the thought of having to go through an intermediate software layer that limits their data access and actions, Talburt believes that eventually this will all happen. “The increasing risks and costs of not having automated DG policy compliance is simply unacceptable.”

Read the whole article at Data Speaks for Itself: Waldo, Where are You?