DataOps and the Data Deluge From AI

By Hubert Yoshida posted 08-19-2019 16:03

I was talking to one of our data scientists, Mike Foley, and I was surprised by how many files are created in the course of building an AI model. He said that in his experience, his relatively small data science teams would begin by designing a schema of approximately 30 analytical dataviews for their data science workbench. These would be the key features anticipated for predictive modeling. From that relatively small start, over the course of a couple of years, models would be developed, trained, tested and validated to produce predictions, and the volume of data generated would approach 10,500 dataframes! If you thought Big Data was a lot of data, think of what this means when you start to build AI models on top of Big Data!

Why do you need so many files? Mike explained some of the reasons. The first is exploratory data analysis (EDA). This is an approach to analyzing data sets in which you “let the data be your guide”: you represent the data visually and look at descriptive statistics to see how the data is behaving, so that the model follows the data. All of this influences model selection and design. According to Mike, this is a critical stage prior to formal modeling or hypothesis testing. Mike referred me to a book by Bruce Ratner entitled Statistical and Machine-Learning Data Mining, but I preferred to take the easy route through Wikipedia for more information!

The Wikipedia example is an analysis task: find the variables that best predict the tip that a dining party will give to the waiter. The variables available in the data collected for this task are: the tip amount, total bill, payer gender, smoking/non-smoking section, time of day, day of the week, and size of the party. The primary analysis task is approached by fitting a regression model where the tip rate is the response variable. The fitted model is:

     Tip rate = 0.18 - 0.01×party_size

which says that as the size of the dining party increases by one person (leading to a higher bill), the tip rate will decrease by one percentage point (0.01).
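To make the arithmetic concrete, here is a minimal sketch that simply evaluates the fitted model quoted above for a few party sizes. The function name is my own; only the two coefficients come from the example.

```python
def predicted_tip_rate(party_size: int) -> float:
    """Evaluate the fitted model: tip rate = 0.18 - 0.01 * party_size."""
    return 0.18 - 0.01 * party_size


# Each additional diner lowers the predicted tip rate by 0.01 (one percentage point).
for size in range(1, 7):
    print(f"party of {size}: predicted tip rate {predicted_tip_rate(size):.2f}")
```

So a party of two is predicted to tip at a rate of about 0.16, and a party of six at about 0.12.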

Setting aside this linear regression model for tip rate for the moment, exploring the data reveals other interesting features not described by this model.

Histograms of the tip amount looked different depending on whether the bins were $1 or $0.10 wide. The finer bins showed peaks at round amounts, which indicated that many customers round up their tips. Scatterplots of tips versus bill amount tended to show that more customers are very cheap than very generous. Scatterplots of tips versus bill separated by payer gender and smoking section status showed that smoking parties have a lot more variability in the tips that they give. Males tend to pay the higher bills, and the female non-smokers tend to be very consistent tippers.

So, we may want to build a more elaborate model with predictor variables for gender, smoking status, and other demographics.  We may even want to build two different models to address very distinct differences among the dining population -- or combine several variables that are highly correlated into a single predictor.  This is the area of variable selection and feature engineering.

The patterns found by exploring the data suggest hypotheses about tipping that may not have been anticipated in advance, and which could lead to interesting follow-up experiments where the hypotheses are formally stated and tested by collecting new data. EDA can spawn many new data sets as we use different combinations of variables.
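As a rough illustration of how quickly this multiplies, the sketch below enumerates every non-empty combination of the six candidate predictors from the tipping example (the tip itself is the response). It assumes, purely for illustration, that each combination would be saved as its own data set; that is an upper bound, not a description of Mike's actual workflow, and the variable names are my own.

```python
from itertools import combinations

# Candidate predictors from the tipping example (names are illustrative).
PREDICTORS = [
    "total_bill",
    "payer_gender",
    "smoking_section",
    "time_of_day",
    "day_of_week",
    "party_size",
]


def predictor_subsets(variables):
    """Yield every non-empty combination of the candidate predictors."""
    for k in range(1, len(variables) + 1):
        yield from combinations(variables, k)


subsets = list(predictor_subsets(PREDICTORS))
print(len(subsets))  # 2**6 - 1 = 63
```

Six variables already allow 63 distinct combinations, before any derived or combined features are added.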

On top of this, there is the training, testing and validation of models to see if they have learned enough to predict accurately. In AI and machine learning, there may not be enough data for training and testing, so new data may need to be created. We also run the data through different models (logistic regression, Support Vector Machine or Random Forest, for example). Mike uses at least three models to find the technique with the highest predictive accuracy. All of these create data files as we go through iterative training and testing cycles.
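Here is a hedged sketch of why one round of evaluation writes so many files. It assumes k-fold cross-validation, which is my choice for illustration and not necessarily what Mike's teams use, and it only plans the splits and names the outputs; it does not train anything. The row count, fold count and artifact names are all hypothetical.

```python
import random

MODEL_NAMES = ["logistic_regression", "support_vector_machine", "random_forest"]


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices once and yield (train, test) index lists for k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def planned_artifacts(n_rows, k):
    """List (name, row_count) for each dataframe one evaluation round would write."""
    artifacts = []
    for fold, (train, test) in enumerate(k_fold_indices(n_rows, k)):
        artifacts.append((f"fold{fold}_train", len(train)))
        artifacts.append((f"fold{fold}_test", len(test)))
        for model in MODEL_NAMES:
            # One set of predictions per model on the held-out rows.
            artifacts.append((f"fold{fold}_{model}_predictions", len(test)))
    return artifacts


artifacts = planned_artifacts(n_rows=1000, k=5)
print(len(artifacts))  # 5 folds * (2 splits + 3 prediction sets) = 25
```

Under these assumptions, a single pass over one candidate feature set produces 25 dataframes, and every change to the features or the model settings repeats the cycle.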

Once the feature engineering is completed and the final model developed, predictions are output – creating another group of datasets.

AI will generate a tremendous amount of data. Fortunately, this type of data doesn’t need high performance, so we can use the cloud or low-cost object storage. Typically, we would use HCP Anywhere, which acts like a bottomless local filer. When the local store surpasses a capacity threshold, the older data is stubbed out to HCP, where it can be stored on an erasure-coded storage node that uses low-cost commodity disk drives, or it can be stored in a public cloud using S3. HCP has governance controls, which include encryption, multitenancy, and hashing for immutability. Duplication is provided for availability, and customer metadata tagging is available for additional governance and transparency. These features are important as users are beginning to question whether they can trust AI. AI can no longer be a “black box” since it has the ability to make decisions that affect our lives.
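The threshold-based stubbing described above can be pictured with a toy model. To be clear, this is not HCP's API or its actual tiering policy; the class, the 80% threshold and the file names are all assumptions made up for illustration. It just marks the least recently used files as stubs until local usage drops back under the threshold.

```python
from dataclasses import dataclass


@dataclass
class LocalFile:
    name: str
    size_gb: float
    last_access: int       # larger means more recently used
    stubbed: bool = False  # True once the content lives only in the object store


def stub_oldest(files, capacity_gb, threshold=0.8):
    """Stub least recently used files until local usage is at or below the threshold."""
    limit = capacity_gb * threshold
    used = sum(f.size_gb for f in files if not f.stubbed)
    moved = []
    for f in sorted(files, key=lambda item: item.last_access):
        if used <= limit:
            break
        if not f.stubbed:
            f.stubbed = True  # hypothetical: content would move to the object store here
            used -= f.size_gb
            moved.append(f.name)
    return moved


files = [
    LocalFile("eda_2017.parquet", 40, last_access=1),
    LocalFile("train_2018.parquet", 30, last_access=2),
    LocalFile("predictions_2019.parquet", 25, last_access=3),
]
print(stub_oldest(files, capacity_gb=100))  # ['eda_2017.parquet']
```

In this example the local store holds 95 GB against an 80 GB limit, so only the oldest file is stubbed, and the two newer ones stay local.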

The governed storage capability of an object store that can scale across low-cost erasure-coded storage nodes and/or public cloud storage is one of the DataOps advantages that come from partnering with Hitachi Vantara when working with the data deluge of Big Data and AI.