This post was written by Steve Miller and originally published on Wednesday, August 2, 2017
A few weeks ago, I participated in a conference call with a customer to discuss their new initiative in predictive analytics. We'd been helping them build a web-based analytic app for about nine months, and the partnership has been quite successful. With analytics, once stakeholders start to see their data and statistics, they often say, "Aha, this is great; now I'd like to see…." Or, in this case, "Now that I've seen some of the drilldowns, I'd like to produce go-forward predictive models…." The discussion was wide-ranging, spirited, and productive.
The analytics in the app today are primarily exploratory, with slice-and-dice drilldowns from important business dimensions such as product, geography, and vendor into financial targets like billings, revenue, and profit. The aim is to identify those dimension categories that are business leaders and laggards. In analytics parlance, the drilldown dimensions are independent variables, or features, while the financial targets are the outputs, or dependent variables.
Going forward, the mandate is to predict performance on those targets with regression or machine learning models.
I had my statistical hat on early in the call, translating “dimensions” into categorical features, and “billings/profit” into numeric regression dependent variables. In concept, the work would be pretty straightforward, producing models that looked something like: profit = catvar1 + catvar2 + …..
After the call I flashed back to grad school, where we spent a lot of time working with precisely the same regression model challenge of continuous target variables against categorical or qualitative features. The tricky part was transforming the categorical variables so they made sense in the regression setting. What we ended up doing was coding “dummy variables” of 0’s and 1’s to represent qualitative variable categories. We then regressed the output variables of interest against those dummy variables. Back then, it was quite tedious to set up; today, thankfully, such “model matrices” are generated automatically by statistical software regression functions.
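Dummy coding is easy to see with a small example. Below is a minimal pure-Python sketch of treatment coding, the default scheme R's model.matrix applies inside lm(); the data and function name are made up for illustration (the post's actual pipeline is in R):

```python
def dummy_code(values):
    """Expand a categorical column into 0/1 dummy columns.

    The first level (in sorted order) is treated as the baseline and
    dropped, mirroring R's default treatment coding: a level's effect
    is measured relative to the baseline level.
    """
    levels = sorted(set(values))
    baseline, rest = levels[0], levels[1:]
    # One 0/1 indicator column per non-baseline level.
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in rest}

# A toy "number of bathrooms" feature with three levels.
bath = ["1", "2", "3", "2", "1"]
print(dummy_code(bath))
# {'2': [0, 1, 0, 1, 0], '3': [0, 0, 1, 0, 0]}
```

Regressing the target on these columns (plus an intercept for the baseline) is exactly the tedious hand-coding the software now automates.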
I also remembered a class assignment we were given to “prove” that the predictions derived from a regression of a numeric dependent on a qualitative feature were in fact equivalent to the means of the dependent summarized across levels of the categorical. In other words, in this simple case, the results of the predictive analytics would be identical to the exploratory.
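That class result takes only a line or two to rederive. With one indicator per level and no intercept (cell-means coding), least squares decouples across the groups:

```latex
y_i = \sum_{j} \beta_j d_{ij} + \varepsilon_i,
\qquad d_{ij} = \mathbf{1}\{i \in G_j\}

\text{SSE}(\beta) = \sum_{j} \sum_{i \in G_j} (y_i - \beta_j)^2,
\qquad
\frac{\partial\,\text{SSE}}{\partial \beta_j}
  = -2 \sum_{i \in G_j} (y_i - \beta_j) = 0
\;\Longrightarrow\;
\hat{\beta}_j = \frac{1}{n_j} \sum_{i \in G_j} y_i = \bar{y}_j
```

So the fitted value for every observation in level j is that level's mean; treatment coding with an intercept spans the same column space and therefore yields the same fitted values.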
So I just had to dig up a data set to test the conjecture. Lo and behold, I resurrected a blog post from a few months back that used the Ames, Iowa housing data set, which perfectly captured the need: a continuous dependent variable (house sale price) and scores of mostly categorical feature variables. So I loaded the data into R and ran a few models to test the exploratory = predictive conjecture.
What follows are the analyses. I’ll first explore the relationship between selected categorical predictors and house sale price visually. I’ll then compare house sale price means by those features with the unique predictions derived from regression models to determine if memory serves.
First, set a few options, load some packages, and read the data into an R data.table named housing. Compute several new variables, converting character vars to factors and dates as appropriate. Also, create binned variables for grlivarea (ground living area size).
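The binned living-area variable amounts to quantile bucketing, which in R would be a cut() over quantile() breaks. A rough pure-Python sketch of the idea, with hypothetical square-footage values (ties and quantile interpolation are handled more crudely here than R's quantile() does):

```python
def quantile_bin(x, n_bins=4):
    """Assign each value a quantile-bin label Q1..Qn.

    Breakpoints are taken at the rank-based empirical quantiles, so with
    n_bins=4 each value lands in its quartile of the data.
    """
    ordered = sorted(x)
    # Upper breakpoint of each bin, by rank.
    breaks = [ordered[int(len(ordered) * k / n_bins) - 1]
              for k in range(1, n_bins + 1)]

    def label(v):
        for idx, b in enumerate(breaks, start=1):
            if v <= b:
                return f"Q{idx}"
        return f"Q{n_bins}"

    return [label(v) for v in x]

area = [334, 864, 1262, 1710, 2090, 2524, 3112, 4676]
print(quantile_bin(area))
# ['Q1', 'Q1', 'Q2', 'Q2', 'Q3', 'Q3', 'Q4', 'Q4']
```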
Define a frequencies function and examine bathrcde, quantgrlivarea, and their two-way combination. Note that while the distributions of bathrcde and quantgrlivarea are individually reasonable, the two-way crosstab is quite sparse, since living area size and the number of bathrooms are correlated. This is a common problem with observational data, reflecting collinearity among features.
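The sparsity is easy to see in miniature. A small sketch with toy data in which bathroom count tracks the living-area quartile, so off-diagonal cells of the crosstab are empty (all names and values here are illustrative):

```python
from collections import Counter

def crosstab(a, b):
    """Two-way frequency table as a Counter keyed by (level_a, level_b).

    Missing keys count as zero, so empty cells read naturally as 0.
    """
    return Counter(zip(a, b))

# Correlated toy features: more bathrooms go with larger homes.
baths = ["1", "1", "1", "2", "2", "3", "3", "3"]
areaq = ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3", "Q4", "Q4"]

tab = crosstab(baths, areaq)
print(tab[("1", "Q1")])  # 2
print(tab[("1", "Q4")])  # 0 -- an empty cell caused by the correlation
```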
First look at the density violin plot of log10saleprice by bathrcde, ordered by median log10saleprice from left to right. Not surprisingly, median sale price increases with the number of bathrooms. From the x-axis, note that there are 955 one-bathroom homes with a median sale price of $127,500, compared to 48 homes with four or more bathrooms and a median sale price of $294,859. bathrcde seems to be highly related to log10saleprice.
Next, look at log10saleprice by quantgrlivarea. Again not surprisingly, median sale price increases quite dramatically with the percentile-binned living area size.
Finally, consider the two-way, faceted violin plot of log10saleprice by bathrcde and quantgrlivarea, which affirms both the positive relationships of the number of bathrooms and floor space with sale price and the cross-feature sparsity.
Now contrast the log10saleprice means broken down by bathrcde in the first table with the predictions from the regression of log10saleprice on bathrcde in the second. For the regression, only the distinct predictions are shown. Confirming the math, the means by group are identical to the regression predictions.
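The check itself needs no special software. With cell-means coding (one dummy per level, no intercept), X'X is diagonal and the normal equations reduce to per-group sums over counts, so each distinct prediction is exactly a group mean. A minimal sketch with made-up log prices; the post fits treatment-coded models in R, which produce the same fitted values:

```python
from collections import defaultdict

def ols_cellmeans(levels, y):
    """OLS of y on one categorical feature, cell-means coding.

    With one 0/1 dummy per level and no intercept, X'X holds the level
    counts on its diagonal and X'y the per-level sums, so each
    coefficient -- hence each distinct prediction -- is a group mean.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for lvl, val in zip(levels, y):
        sums[lvl] += val
        counts[lvl] += 1
    return {lvl: sums[lvl] / counts[lvl] for lvl in sums}

bath = ["1", "1", "2", "2", "2", "3"]
logprice = [5.0, 5.2, 5.3, 5.5, 5.4, 5.8]

preds = ols_cellmeans(bath, logprice)
means = {lvl: sum(v for l, v in zip(bath, logprice) if l == lvl) /
              bath.count(lvl) for lvl in set(bath)}
assert preds == means  # exploratory means == regression predictions
print(preds)
```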
Ditto for log10saleprice by quantgrlivarea.
And finally, the exploratory means summarized by both bathrcde and quantgrlivarea are identical to the predictions derived from the two-way regression of log10saleprice on bathrcde and quantgrlivarea plus the interaction term.
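The two-way case works the same way: with the interaction included, the model is saturated over the observed (bathrcde, quantgrlivarea) cells, so each distinct prediction is a cell mean. A sketch, again with toy data standing in for the housing table:

```python
from collections import defaultdict

def ols_two_way_saturated(f1, f2, y):
    """OLS of y on f1, f2, and their interaction (saturated model).

    A saturated two-factor model has one free parameter per observed
    (f1, f2) cell, so the fitted value within each cell is the cell
    mean -- which is why the regression predictions match the
    exploratory two-way summary exactly.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for a, b, val in zip(f1, f2, y):
        sums[(a, b)] += val
        counts[(a, b)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

baths = ["1", "1", "2", "2", "2", "3"]
areaq = ["Q1", "Q1", "Q2", "Q3", "Q3", "Q4"]
logprice = [5.0, 5.2, 5.3, 5.5, 5.4, 5.8]
print(ols_two_way_saturated(baths, areaq, logprice))
```

Note that only observed cells get a prediction; the sparse, empty cells of the crosstab simply contribute no rows, which is the flip side of the collinearity noted earlier.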
The takeaways from this simple exercise? First, I haven't completely lost my memory just yet. Second, it may sometimes be prudent not to obsess over the distinction between exploratory and predictive analytics, which in certain situations produce identical results. A better focus is the design and execution of asking and answering business analytic questions with pertinent visual and predictive tools.