Hu Yoshida

Forget the Rules, Listen to the Data

Blog Post created by Hu Yoshida on May 10, 2019


Rule-based fraud detection software is being replaced or augmented by machine-learning algorithms that do a better job of recognizing fraud patterns that can be correlated across several data sources. DataOps is required to engineer and prepare the data so that the machine learning algorithms can be efficient and effective.


Fraud detection software has traditionally been built on rule-based models. A 2016 CyberSource report claimed that over 90% of online fraud detection platforms use transaction rules to detect suspicious transactions, which are then directed to a human for review. We’ve all received that phone call from our credit card company asking if we made a purchase in some foreign city.


This traditional approach of using rules or logic statements to query transactions is still used by many banks and payment gateways today, and the bad guys are having a field day. In the past 10 years, incidents of fraud have escalated thanks to new technologies, like mobile, that banks have adopted to better serve their customers. These new technologies open up new risks such as phishing, identity theft, card skimming, viruses and Trojans, spyware and adware, social engineering, website cloning, cyber stalking, and vishing. (If you have a mobile phone, you’ve likely had to contend with the increasing number and sophistication of vishing scams.) Criminal gangs use malware and phishing emails to compromise customers’ security and personal details in order to commit fraud. Fraudsters can easily game a rule-based system. Rule-based systems are also prone to false positives, which can drive away good customers. They become unwieldy as more exceptions and changes are added, and they are overwhelmed by today’s sheer volume and variety of new data sources.
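To make the weaknesses of rules concrete, here is a minimal sketch of a rule-based check. The `Transaction` fields, the `MAX_AMOUNT` threshold, and the two rules are hypothetical, chosen only for illustration; they are not taken from any real fraud platform.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float
    country: str
    home_country: str


# Hypothetical threshold, chosen only for illustration.
MAX_AMOUNT = 1000.0


def rule_based_flag(tx: Transaction) -> bool:
    """Return True when any hard-coded rule fires."""
    if tx.amount > MAX_AMOUNT:
        return True
    if tx.country != tx.home_country:
        return True
    return False


# A fraudster who learns the threshold simply stays just under it.
gamed = Transaction(amount=999.99, country="US", home_country="US")
# A legitimate customer travelling abroad trips the foreign-country rule.
traveller = Transaction(amount=12.50, country="JP", home_country="US")

print(rule_based_flag(gamed))      # False: the fraud slips through
print(rule_based_flag(traveller))  # True: a false positive
```

Every new exception (frequent travellers, large but legitimate purchases) means another hand-written branch, which is how such rule sets grow unwieldy.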


For this reason, many financial institutions are converting their fraud detection systems to machine learning and advanced analytics and letting the data detect fraudulent activity. Today’s analytic tools, running on modern compute and storage systems, can analyze huge volumes of data in real time, integrate and visualize an intricate network of structured and unstructured data, generate meaningful insights, and provide real-time fraud detection.


However, in the rush to do this, many of these systems have been poorly architected to address the total analytics pipeline. This is where DataOps comes into play. A Big Data Analytics pipeline, from ingestion of data to embedding analytics, consists of three steps:


  1. Data Engineering: The first step is flexible data on-boarding that accelerates time to value. This requires a product that can ETL (Extract, Transform, Load) the data from the acquisition source, which may be a transactional database or sensor data, and load it in a data format that an analytics platform can process. Regulated data also needs to show lineage: a history of where the data came from and what has been done with it. This will require another product for data governance.
  2. Data Preparation: Data integration that is intuitive and powerful. Data typically goes through transforms to put it into an appropriate format. This is sometimes called data engineering and preparation, or colloquially, data wrangling. The data wrangling part requires another set of products.
  3. Analytics: Integrated analytics to drive business insights. This will require analytic products that may be specific to the data scientist or analyst depending on their preference for analytic models and programming languages.
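
The three steps above can be sketched as three small functions chained together. This is a toy illustration using only the Python standard library, not Pentaho's API; the sample CSV, the field names, and the `lineage` labels are all assumptions made up for the example.

```python
import csv
import io
from statistics import mean

# Made-up raw extract with typical problems: stray spaces, a missing
# amount, and inconsistent country casing.
RAW_CSV = """tx_id,amount,country
1, 25.00 ,US
2,,US
3,980.50,jp
"""


def engineer(raw_text):
    """Data engineering: extract rows and start a lineage record for each."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    for row in rows:
        row["lineage"] = ["extracted_from:raw_csv"]
    return rows


def prepare(rows):
    """Data preparation: drop incomplete rows, normalise types and casing."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows with no amount
        cleaned.append({
            "tx_id": int(row["tx_id"]),
            "amount": float(amount),
            "country": row["country"].strip().upper(),
            "lineage": row["lineage"] + ["prepared:types_normalised"],
        })
    return cleaned


def analyze(rows):
    """Analytics: a trivial summary standing in for a real model."""
    return {"count": len(rows), "mean_amount": mean(r["amount"] for r in rows)}


prepared = prepare(engineer(RAW_CSV))
summary = analyze(prepared)
print(summary)
```

Here each stage lives in the same process and shares one record format. When each stage is a separate product, every hand-off between them becomes a place where formats and lineage can be lost.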


A data pipeline that is architected around so many separate pieces will be costly, hard to manage, and very brittle as data moves from product to product.


Hitachi Vantara’s Pentaho Business Analytics can address DataOps for the entire Big Data Analytics pipeline with one flexible orchestration platform that can integrate different products and enable teams of data scientists, engineers, and analysts to train, tune, test and deploy predictive models.


Pentaho is open source-based and has a library of PDI (Pentaho Data Integration) connectors that can ingest structured and unstructured data, including MQTT (Message Queue Telemetry Transport) data flows from sensors. It supports a variety of data sources, processing engines, and targets, such as Spark, Cloudera, Hortonworks, MapR, Cassandra, Greenplum, Microsoft, and Google Cloud. It also has a data science pack that allows you to operationalize models trained in Python, Scala, R, Spark, and Weka, and it supports deep learning through a TensorFlow step. Since it is open, it can interface with products like Tableau if the user prefers them. Pentaho provides an intuitive drag-and-drop interface to simplify the creation of analytic data pipelines. For a complete list of the PDI connectors, data sources and targets, languages, and analytics, see the Pentaho Data Sheet.


Pentaho enables the DataOps team to streamline the data engineering, data preparation, and analytics process, and to enable more citizen data scientists, a role that Gartner defines in “Citizen Data Science Augments Data Discovery and Simplifies Data Science.” A citizen data scientist is a person who creates or generates models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics. Pentaho’s approach to DataOps has made it easier for non-specialists to create robust analytics data pipelines. It lets analytic and BI tools extend their reach by making both data and analytics more accessible. Citizen data scientists are “power users” who can perform both simple and moderately sophisticated analytical tasks that would previously have required more expertise. They do not replace the data science experts, as they do not have the specific, advanced data science expertise to do so, but they certainly bring their individual expertise around the business problems and innovations that are relevant.


In fraud detection, the data and scenarios are changing faster than a rule-based system can keep track of, leading to a rise in false positive and false negative rates that makes these systems no longer useful. The increasing volume of data can bog down a rule-based system, while machine learning gets smarter as it processes more data. Machine learning can solve this problem because it is probabilistic and uses statistical models rather than deterministic rules. The machine learning models need to be trained using historic data. The creation of rules is replaced by the engineering of features, which are input variables related to trends in historic data. In a world where data sources, compute platforms, and use cases are changing rapidly, unexpected changes in data structure and semantics (known as data drift) require a DataOps platform like Pentaho Machine Learning Orchestration to ensure the efficiency and effectiveness of machine learning.
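
As a rough sketch of what "features instead of rules" can look like, the example below trains a tiny logistic regression on historic transactions and returns a fraud probability rather than a yes/no flag. Everything here is an assumption for illustration: the labelled history is made up, the two features (spend relative to the customer's average, and a foreign-transaction flag) are hypothetical, and the plain gradient-descent trainer stands in for whatever model a real team would use.

```python
import math

# Made-up labelled history: (amount, customer_avg_amount, is_foreign, is_fraud).
HISTORY = [
    (20.0, 25.0, 0, 0),
    (30.0, 25.0, 0, 0),
    (15.0, 25.0, 1, 0),
    (40.0, 30.0, 0, 0),
    (35.0, 30.0, 1, 0),
    (900.0, 25.0, 1, 1),
    (600.0, 30.0, 0, 1),
    (750.0, 40.0, 1, 1),
]


def features(amount, avg_amount, is_foreign):
    """Engineered features: spend relative to the customer's norm, plus a flag."""
    return [math.log(amount / avg_amount), float(is_foreign)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(history, epochs=2000, lr=0.5):
    """Fit logistic regression weights with plain batch gradient descent."""
    xs = [features(a, avg, f) for a, avg, f, _ in history]
    ys = [label for *_, label in history]
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / len(xs) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(xs)
    return weights, bias


def fraud_probability(model, amount, avg_amount, is_foreign):
    """Score one transaction as a probability between 0 and 1."""
    weights, bias = model
    x = features(amount, avg_amount, is_foreign)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


model = train(HISTORY)
# A small purchase abroad by a customer who usually spends about 25.
traveller = fraud_probability(model, 12.50, 25.0, 1)
# A purchase 40 times larger than the same customer's usual spend.
big_spend = fraud_probability(model, 999.99, 25.0, 0)
print(f"traveller: {traveller:.3f}, large purchase: {big_spend:.3f}")
```

Because the score depends on how unusual the spend is for that customer rather than on a fixed threshold, a small purchase abroad scores low while an outsized purchase scores high. Retraining on fresh history updates the weights without anyone rewriting rules, which is also why drift in the incoming data needs to be watched.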


You can visit our website for a hands-on demo of building a data pipeline with Pentaho and see how easy Pentaho makes it to “listen to the data.”