The M2M (Connected Machines) team is one of the core Social Innovation Business Units at HDS. As a team, we sometimes think of ourselves as a very well-funded start-up, backed of course by the multinational conglomerate Hitachi. We are moving at the speed of light in the hope of transforming the way companies around the world analyze and interpret their machine data.
The first phase in achieving this goal is the commercialization of Hitachi Live Insight for IT Operations, a cloud-based machine data analytics platform. Our team of Data Scientists becomes an important piece of the puzzle when it comes to working with actual customer data. Drawing on their deep knowledge of HDS infrastructure, they build dashboards that provide insight into IT infrastructure performance and availability. Based on customer data, our Data Science team determines and builds the KPIs that are necessary to gain real-time insights.
When I met with one of our Data Scientists, Xiaoling Huang, I asked her about the best story she has ever told with data. She described one of the many exciting projects our Data Science team is working on: building machine learning models to predict storage performance. The team builds models of VSP performance offline, using algorithms such as decision trees, neural networks, support vector regression and random forests. They then publish the results into the Hitachi Live Insight for IT Operations platform. Why is this so valuable for customers?
According to Xiaoling Huang, “The model could predict or forecast array behavior based on current workload and the customer could use the model to see what might happen if the array is wrongly configured.”
The graph on the left shows a neural network being trained with the back-propagation algorithm. In the forward phase, the neurons are activated in sequence from the input layer to the output layer, applying each neuron’s weights and activation function. In the backward phase, the network's output signal is compared to the true target value in the training data. The error is propagated backwards to modify the connection weights between neurons and to reduce future errors. The graph on the right shows a tree-growing algorithm. The data is randomly divided into a training set and a testing set. A basic tree-growing algorithm first grows the largest possible tree, and cross-validation is then used to prune it.
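For readers who like to see the idea in code, here is a minimal sketch of the forward and backward phases described above. It is not the team's actual model: the function name `train_tiny_network`, the single hidden layer, the sigmoid activation, the learning rate and the handful of workload/latency numbers are all assumptions made up for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_tiny_network(samples, hidden_size=3, learning_rate=0.5,
                       epochs=2000, seed=0):
    """Train a one-hidden-layer network with plain back-propagation.

    samples: list of (inputs, target) pairs, with targets between 0 and 1.
    Returns (predict, history); history holds the mean squared error
    measured during each epoch.
    """
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Each hidden neuron: one weight per input, plus a bias as the last entry.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                for _ in range(hidden_size)]
    # The single output neuron: one weight per hidden neuron, plus a bias.
    w_out = [rng.uniform(-1, 1) for _ in range(hidden_size + 1)]

    def forward(inputs):
        # Forward phase: activate the hidden layer, then the output neuron.
        hidden = [sigmoid(sum(w * x for w, x in zip(ws[:-1], inputs)) + ws[-1])
                  for ws in w_hidden]
        output = sigmoid(sum(w * h for w, h in zip(w_out[:-1], hidden))
                         + w_out[-1])
        return hidden, output

    history = []
    for _ in range(epochs):
        total_error = 0.0
        for inputs, target in samples:
            hidden, output = forward(inputs)
            # Backward phase: compare the output with the true target...
            error = output - target
            total_error += error ** 2
            # ...and propagate the error back through the sigmoid derivatives.
            delta_out = error * output * (1 - output)
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                            for j in range(hidden_size)]
            # Adjust the connection weights to reduce future errors.
            for j in range(hidden_size):
                w_out[j] -= learning_rate * delta_out * hidden[j]
            w_out[-1] -= learning_rate * delta_out
            for j in range(hidden_size):
                for i in range(n_inputs):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * inputs[i]
                w_hidden[j][-1] -= learning_rate * delta_hidden[j]
        history.append(total_error / len(samples))

    def predict(inputs):
        return forward(inputs)[1]

    return predict, history


# Made-up, normalized (workload, latency) pairs, purely for illustration.
toy_samples = [([0.1], 0.1), ([0.3], 0.2), ([0.5], 0.4),
               ([0.7], 0.7), ([0.9], 0.9)]
predict, history = train_tiny_network(toy_samples)
print("error in first epoch:", history[0])
print("error in last epoch: ", history[-1])
```

The error reported for the last epoch should be lower than for the first, which is the "reduce future errors" part of the description in action. A production model would of course use real array telemetry and a proper machine learning library rather than hand-written loops.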
In my discussion with Xiaoling, I was curious about her background specifically, and what it takes to become a Data Scientist. Here are the highlights from my interview:
Q. How did you become a data scientist?
A. Many considered me a child math prodigy. Not surprisingly, mathematics became my major in both college and graduate school. Mathematics is beautiful, as they say. I had a lot of fun just working on pure mathematics, even without thinking of its real-world consequences. I had a friend in the statistics department, and I sometimes went to statistics classes with her. It amazed me to see how mathematics could be applied to the real world through statistics, modeling and machine learning, and I knew that would be my future career.
Q. What excites you about data science?
A. Telling the story behind the data. Most of my job in data science is machine learning and predictive modeling. When working in Marketing Science, for example, I built models of propensity to buy and propensity to become a lead. It is a great feeling to gain deep insights from large amounts of data and to predict the future statistically, yet reliably. I also like to try different algorithms and compare the results. I really enjoy that process.
Q. What is your educational background?
A. I have a PhD in Mathematics and a Master's in Statistics.
Q. What skills do you think are necessary to become a successful data scientist?
A. 1) Science background: Mathematics, Statistics, Computer Science, or Engineering
2) Programming skills: Python, R, Java, Big Data technologies such as MapReduce, and any coding skills needed to manipulate both structured and unstructured data
3) Communication skills
4) Business acumen
5) Teamwork
6) Last but not least, a passion for numbers.
Q. What is your favorite part of being a data scientist?
A. Data Science is not only my job, but also part of my life. Working daily on something that I truly love and am passionate about is enjoyable.
What do you think of data science now?
Have any other questions for Xiaoling Huang? If so, send them my way!