
This was originally published by Dave Reinke & Kevin Haas on Monday, February 22, 2016

 

Our previous blog on the Rise of Enterprise Analytics (EA) created quite a stir. Many readers had strong reactions both for and against our perspective. Several pointed comments about the continued importance of centralized, enterprise data repositories (lakes, warehouses, marts) gave us pause. To summarize: “How dare you even consider throwing away years of best-practice data design and engineering and return us to an age of inconsistent spreadmarts and siloed Access databases. You should be ashamed!”

 

The critics will be heartened to learn we’re not advocating giving up entirely on IT-managed enterprise data. On the contrary, we believe the adoption of Enterprise Analytics mandates even more attention to and extension of best practice enterprise data management and engineering.

The Power of Analytic Producers Is Reforming How IT Manages Data

The subtle differentiation we’re making is between the data itself, and the analytics on that data. EA is about shifting analytic production toward those in the organization who drive innovation from data, i.e. the Analytic Producers. Analytic Producers are typically business analysts, data scientists and others responsible for measuring and forecasting performance and identifying new, data-driven products and opportunities.

 

Most of our recent projects have revolved around enabling Enterprise Analytics through a modern, extensible data architecture -- one that relies on a foundation of governed dimensional data warehousing and modern big data lakes, while simultaneously enabling analysts to create and blend their own datasets. As Analytics Producers find value in adjunct datasets, that data is then integrated into what we call the “enterprise data ecosystem” (EDE). In contrast to the traditional EBI ecosystem, the vitality of the EDE is driven by business priorities and analytic innovations -- not the other way around.

 

 

[Figure: ea-flow_0.png -- how the EBI and EA worlds integrate]

 

The picture above depicts how the old EBI and new EA worlds integrate. The blue elements should look very familiar to EBI colleagues. This is the domain of stewarded enterprise data and standardized access mechanisms for “Analytics Consumers”. Most are probably familiar with the classic reporting, OLAP and dashboards provided by legacy BI vendors.

 

New Analytics Technologies Have Also Upset Traditional Data Semantic Governance

In addition to core EBI, we’ve added boxes for the increasingly prevalent statistical tools and libraries such as R, Python and Spark used by data scientists. Further, analytical apps built with R/Shiny, Qlik, Tableau, etc. provide tailored, managed access to data via highly visual and ubiquitous web and mobile interfaces. Indeed, Inquidia’s business is now more focused on analytical app dev than it is on dashboards and reports enabled via legacy commercial EBI tools. (More on this in an upcoming blog…)

 

The orange elements of the diagram depict new architectural elements driven by Enterprise Analytics clients. The need is driven by ad-hoc data discovery and the ability to experiment with new data sources. Depending on the Analytics Producer, the new data sources range from simple spreadsheets to data scraped from the web -- curated using agile tools like Python, R and Alteryx, and even freely available ETL software such as Pentaho Data Integration. Additionally, for some of our clients, we help create “data sandboxes” where Analytics Producers combine extracts of enterprise data (which we are often asked to build) with their new, embellishing datasets for ease of blending.

 

A Modern Approach to Collaborative Enterprise Analytics Yields Benefits for Analysts and IT

Central to EA is the ability for Analytic Producers to share discoveries and collaborate. The Shared Analytics Results Repository provides this functionality. Many of our clients enable this sharing using Tableau Server, though the same results could be attained through other low-cost approaches including Tableau Desktop with file sharing, internal wikis, Google Drive & Docs, etc. There’s certainly no need to reinvent collaboration technology.

 

Inevitably, a new “hot” analytic will emerge from EA initiatives -- one that is in demand by traditional Analytics Consumers. This is where expert enterprise data architecture and engineering are critical -- and often where data integration expertise plays a helping role. The gray boxes depict the escalation process with outputs detailing new data integration and semantic requirements. The orange “New Sources” box represents the extensibility of the data ecosystem. Depending on the nature of the data, it may land in the classic data warehouse or become part of the big data lake (e.g. Hadoop). The orange “Integrated User Models” box shows the extension of the enterprise semantic driven by the newly integrated data. These data may manifest in cubes, ad-hoc report metadata, or new analytical app requirements.

 

We hope this deeper dive into the nature of emerging Enterprise Analytics will allay our colleagues’ fears that data architecture and engineering are no longer critical. The revolutionary concept of EA is not rampant decentralization of enterprise data, but rather an acknowledgement that for many business organizations (and perhaps yours), significant analytic expertise resides outside of IT. An enterprise that wishes to drive innovation through analytics must serve these constituents with more flexibility and agility.

This post was originally published by Dave Reinke & Kevin Haas on Wednesday, January 27, 2016

 

Is Enterprise BI dying? That’s the question our colleagues have been debating the past few months. We’re heavily wired into the business intelligence marketplace and have seen the nature of our projects change recently. Fewer clients are asking for classic ad-hoc query, reporting and analysis provided by enterprise BI platforms such as BusinessObjects, Cognos, Microstrategy and Pentaho.

 

Rather, clients are obsessed with providing data services to a growing potpourri of visual analytic tools and custom built analytic apps. More organizations expect data-driven, tangible evidence to support their decisions. The fundamental shift is from an IT-mandated common data semantic via a monolithic BI platform to an assortment of “BYO” analytics technologies that depend on the creativity and self-reliance of business analysts and data scientists to meet enterprise analytical needs. Perhaps we are seeing the rise of a new analytics philosophy. Are we witnessing the Rise of Enterprise Analytics?

 

A Brief History of Enterprise BI

Enterprise BI platforms began life in the 1990s when upstart disrupters like BusinessObjects and Cognos promised to “democratize data access” and introduce “BI to the masses.” For the most part, they delivered on these promises.

 

The core concept was a centralized, shared data semantic, enabling users to interact with data without requiring an understanding of the underlying database structure or writing their own SQL. Yes, SQL. All data for these platforms had to be queried from relational databases, preferably dimensionalized data warehouses that were designed and populated by IT.

 

The Enterprise BI platforms provided tremendous value to organizations that were starved for consistent data access. Once the underlying data was organized and a semantic defined, users were guaranteed conformed data access via ad-hoc and canned query, reporting and analysis modules. Additionally, complex reports and dashboards could be stitched together from common components. Nirvana...unless you were paying the license fees or wanted to switch to a new platform.

 

Excessive licensing fees and lock-in began the undoing of the monolithic BI platforms as open source technologies like Pentaho and Jaspersoft aggressively commoditized the market. However, even the open source options were still bottlenecked by a dependence on centralized IT to organize data and define a common semantic. Time for the next disruption…

 

The Trend is not Enterprise BI’s Friend: Five Trends Sparking the Rise of Enterprise Analytics

For context, consider how radically users’ expectations of technology have changed since the Enterprise BI platforms took shape in the 1990s. We’ve identified five “megatrends” that are particularly relevant for analytics. First, technology has become high touch and amazingly intuitive. High touch as in actually touching via tablets and phones. Apps and websites don’t come with binders or user manuals. You download and use, figuring it out along the way.

 

Second, technology is perpetually connected, enabling interaction with people and things anywhere. We expect to be able to do things at any time, any place and on any device. We change our home thermostat from across the country and speak with colleagues on the other side of the globe for little or no cost. Simply amazing if you stop to think about it.

 

Third, technology answers questions now. We’ve become impatient, no longer willing to wait even for the simple latency of an email exchange. Ubiquitous connectivity and Google are now taken for granted by a new generation of perpetually informed consumers.

 

Fourth, the increasing compute power in the hands of every business analyst is changing their problem solving approach. Data scientists can solve business problems by processing even more data with vastly more sophisticated algorithms than ever before. This has yielded technologies that visually depict these advanced analytics, resulting in greater experimentation, and an embrace of the scientific method.

 

Finally, technological sharing and collaboration is the new norm. Social networks have taught us that if we participate, then we will get more than we give. The open source software development model has spilled into just about every domain, harvesting the benefits of collaboration and improvement via derivative works. The trends empower and embolden the individual and stand in stark contrast to the command and control deployment inherent in classic Enterprise BI platforms.

 

Enter Enterprise Analytics

The legacy, centralized approach of Enterprise BI simply hasn’t recognized and responded to these trends.

 

Imagine an enterprise that leverages IT and engineering resources to provide a shared, secure data asset, but also fosters an ecosystem where analytics evolve through creativity, exploration, collaboration and sharing. An ecosystem where analytics take on a life of their own; where the “best-fit” analytics thrive and pass their “DNA” on to new generations built from an expanding data asset. Markets are won by data-driven organizations that learn faster and execute better than their competitors.

 

This is the vision for Enterprise Analytics.

 

As long-time BI veterans, we were taught to root out siloed analytics, spreadmarts and the like. One of the commonly argued benefits of Enterprise BI platforms is that reported metrics are guaranteed to be consistent no matter how many users ask for them. There would be no more arguing over whose numbers are right. Rather, energy was to be spent interpreting and acting. We found this to be true, with users rapidly absorbing the IT-managed metrics. However, just as quickly as we delivered the standardized, IT-governed metrics, users demanded new metrics requiring the rapid integration of increasingly varied and voluminous data. Few IT organizations could respond with the necessary agility, and innovation was stifled.

 

Adopting Enterprise Analytics changes the dynamic between user and IT. IT becomes an enabler, providing a shared and secure data infrastructure while users are empowered to create, share and improve analytics. For sure, the path is not straight. There are bumps along the way with arguments over whose metrics are more apt, etc. But the benefits of rapid innovation overpower the stagnation that comes from lack of analytical agility.

 

With a platform that enables collaboration, users are more apt to reuse and then extend metrics as they produce new analytics -- experimenting to improve an organization’s understanding. The costs of a little less governance are far outweighed by the benefits of rapidly improving actionable insight.

 

What's Next in Enterprise Analytics

Although the opportunity of Enterprise Analytics is staggering, Enterprise BI is not going to disappear overnight. We’ll still need pixel perfect and banded reports to satisfy regulations, official documents, operational norms and tailored communication. Privacy requirements still demand secured and managed access for wide swathes of enterprise data -- access that likely requires a stable, common semantic for which Enterprise BI platforms excel. And, we’ll increasingly see analytics delivered via “apps” with tightly scoped, but well-directed functionality to address a specific business process and targeted audience. Not everything will be 100% ad-hoc.

 

But, in our view, the reality of how business analysts and data scientists work, the tools they use, and the information they have access to is inciting a real change in the way that individuals are using information. Enterprise Analytics is at hand, and organizations that do not respond to this reality will find themselves increasingly irrelevant.

 

In future blogs, we’ll expand on the concepts introduced here, elaborating on the benefits of maintaining an Enterprise Analytics portfolio that consists of business-led Data Exploration, Data Science, Analytical Apps and governed data access and reporting. We’ll also discuss how to start and grow your own Enterprise Analytics ecosystem, covering technologies and techniques that work and the changed but still critical role of central IT. Along the way we’ll share insights and experiences as we enter this unquestionably exciting new age.

One of the most heavily discussed topics in machine learning and data mining today is sentiment analysis. For the uninitiated, sentiment analysis is the task of classifying text as positive or negative based only on previously classified text. In this article, I will attempt to classify the sentiment of Twitter comments about a certain movie, based only on a dataset of 10,662 movie reviews released in 2005. The solution will be demonstrated using two methods: first using only Pentaho Data Integration (with some R), and then a more sophisticated solution built using Weka.

Understanding the Naïve Bayes Classifier

Although many machine learning algorithms become very complex and difficult to understand very quickly, the Naïve Bayes classifier relies on one of the most fundamental rules in statistics, allowing its results to be highly interpretable while also maintaining a high degree of predictive power. It is based upon Bayes’ Rule, which can be used to compute conditional probability. The equation reads:

P(A|B) = P(B|A) × P(A) / P(B)

Applying Bayes’ Rule to sentiment analysis, classifying a movie as bad given a specific review of “I hated it” would be:

P(negative|“I hated it”) = P(“I hated it”|negative) × P(negative) / P(“I hated it”)

The classifier is called “naïve” because we will assume that each word in the review is independent. This is probably an incorrect assumption, but it allows the equation to be simplified and still solvable, while the results tend to hold their predictive power. Under that assumption, the equation expands to:

P(negative|“I hated it”) ≈ P(“I”|negative) × P(“hated”|negative) × P(“it”|negative) × P(negative) / ( P(“I”) × P(“hated”) × P(“it”) )

Applying Bayes’ Rule has allowed us to dramatically simplify our solution. To solve the above equation, the probability of each event will be calculated:
  • P("I"|negative) can be described as the total number of times “I” appears in negative reviews, divided by the total number of words in negative reviews
  • P(negative) is the total number of words that are in negative reviews divided by the total number of words in the training data
  • P("I") is the total number of times “I” occurs in all reviews divided by the total number of words in the training data

 

We can then evaluate the same equation, replacing the occurrences of negative with positive. Whichever probability is higher allows us to predict the movie review’s sentiment as negative or positive. The expectation would be that “hated” occurs significantly more often in the negative reviews, with the other terms being similar in both classes, thus allowing us to correctly classify this review as negative.
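To make the arithmetic concrete, here is a minimal Java sketch with made-up word counts (the class name and the numbers are purely illustrative, not taken from the movie review dataset). Because the denominator P(“I”) × P(“hated”) × P(“it”) is the same for both classes, it drops out of the comparison:

public class NaiveBayesToy {

    public static void main(String[] args) {
        // Hypothetical word counts, for illustration only.
        long negWords = 500_000, posWords = 520_000;      // total words in each class
        long totalWords = negWords + posWords;            // total words in the training data

        long[] negCounts = {20_000, 900, 25_000};         // "I", "hated", "it" in negative reviews
        long[] posCounts = {21_000, 60, 26_000};          // "I", "hated", "it" in positive reviews

        double negScore = score(negCounts, negWords, totalWords);
        double posScore = score(posCounts, posWords, totalWords);
        System.out.println(negScore > posScore ? "negative" : "positive");
    }

    // P(class) multiplied by P(word|class) for each word in the review. The shared
    // denominator P("I") * P("hated") * P("it") is omitted because it cancels out.
    static double score(long[] wordCountsInClass, long wordsInClass, long totalWords) {
        double p = wordsInClass / (double) totalWords;    // P(class)
        for (long count : wordCountsInClass) {
            p *= count / (double) wordsInClass;           // P(word | class)
        }
        return p;
    }
}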

 

Build a Naïve Bayes Model using Pentaho Data Integration

To build the model in Pentaho, there are a few steps involved. First, we must prepare the data by cleaning it. Once this is done, we build the terms for each word in the classifier. Lastly, we test the performance of the model using cross-validation.

Step 1: Cleaning and Exploring the Source Data

To perform the sentiment analysis, we’ll begin the process with two input files: one for negative reviews and one for positive reviews. Here is a sample of the negative reviews:

 

To clean the data for aggregation, punctuation is removed and words are made lowercase to allow for a table aggregated by class and word. Using the data explorer, we can start to see the word count differences for some descriptive words. These numbers intuitively make sense and help to build a strong classifier.

 

 

Step 2: Building Terms for the Classifier

Next, we build the various terms for the classifier. Using the calculator steps, we need the probabilities and conditional probabilities for each word that occurs either in a negative review or positive review (or both) in the training data. The output from these steps then creates the parameters for the model. These need to be saved, so eventually they can be used against testing data. Here is a sample:

 

It can be noted that some of these word counts are null (zero). In the training data, this only occurs if a word count is zero for one of the two classes. But in the test data, this can occur for both classes of a given word. You will notice that the conditional probabilities for these null words are nonzero. This is because Add-1 Smoothing is implemented. We “pretend” that this count is 1 when we calculate the classifier, preventing a zero-out of the calculation.

 

To calculate the classifier for a given review, as in the formula previously explained, we must apply the training parameters to the review -- that is, match each word in the review being classified with its probability parameters and apply the formula. Note that when we solve the equation, we take the log of both sides because the terms being multiplied are very small.
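As a rough sketch of that scoring arithmetic (again with hypothetical counts, and using the “pretend the count is 1” treatment described below for unseen words):

import java.util.Map;

public class LogScoreSketch {

    // log P(class) + sum of log P(word|class) over the words in the review.
    // Treating a zero count as 1 keeps an unseen word from zeroing out the product.
    static double logScore(String[] reviewWords,
                           Map<String, Long> wordCountsInClass,
                           long wordsInClass,
                           long totalTrainingWords) {
        double score = Math.log(wordsInClass / (double) totalTrainingWords);  // log P(class)
        for (String word : reviewWords) {
            long count = Math.max(wordCountsInClass.getOrDefault(word, 0L), 1L);
            score += Math.log(count / (double) wordsInClass);                 // log P(word|class)
        }
        return score;
    }
}

Whichever class produces the larger log score becomes the predicted sentiment.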

 

Step 3: Model Accuracy using Cross-Validation

You will notice there is a Section #3 on the transformation to see how well our classifier did. It turns out that this is not the best way to check the accuracy. Instead, we will check the results using cross-validation. When building a model, it is important not to test the model against the training data alone. This causes overfitting, as the model is biased toward the instances it was built upon. Instead, using cross-validation we can re-build the model exactly as before, except with only a randomly sampled subset of the data (say, 75%). We then test the model against the remaining instances to see how well it did. A subset of the predictions from cross-validation, with 4 correct predictions and 1 incorrect prediction, can be seen here:

 

 

Ultimately, using cross-validation, the model made the correct prediction 88% of the time.

 

Test the Naïve Bayes Model on Tweets using Pentaho Data Integration and R

To read tweets using R, we make use of two R libraries, twitteR and ROAuth. A more detailed explanation of how to create a Twitter application can be found here.

 

This allows for a stream of tweets to be read using the R Script Executor in PDI. We will test the model using Jumanji: Welcome to the Jungle, the movie leading the box office on MLK Jr. Day weekend. Using the following code, we can search for recent tweets on a given subject. The twitteR package allows us to specify options such as ignoring retweets and using only tweets in English.

 

library(twitteR)                                           # OAuth setup via ROAuth is assumed to be done already

tweetStream = searchTwitter('Jumanji', lang='en', n=100)   # pull up to 100 recent English tweets
dat = do.call("rbind", lapply(tweetStream, as.data.frame)) # flatten the list of tweets into a data frame
dat = dat[dat$isRetweet==FALSE,]                           # drop retweets
review = dat$text
Encoding(review) = "UTF-8"
review = iconv(review, "UTF-8", "UTF-8", sub='')           # remove any non-UTF characters
review = gsub("[\r\n;]", "", review)                       # strip newlines and semicolons

 

Here is a sample of the incoming tweets:

 

 

Clearly, these tweets are not of the same format as the training data of old movie reviews. To overcome this, we can remove all @ mentions. Most of these are unlikely to affect sentiment and are not present in the training data. We can also remove all special characters—this will treat hashtags as regular words. Additionally, we remove all http links within a tweet. To keep only tweets that are likely to reveal sentiment, we will only test tweets with 5+ words.

 

To get predictions, we now follow the same process as before, joining the individual words of a tweet to the training parameters and solving the classifier. Here is a sample of the results, along with my own subjective classification:

 

Predicted Class | Subjective Class | Tweet
negative | positive | I have to admit this was a fun movie to watch jumanji jumanjiwelcometothejungle action httpstcopXCGbOgNGf
negative | negative | Jumanji 2 was trash Im warning you before you spend your money to go see it If you remember the first you wont httpstcoV4TfNPHGpC
negative | positive | @TheRock @ColinHanks Well the people who have not seen JUMANJI are just wrong so
positive | positive | Finally managed to watch Jumanji today Melampau tak kalau aku cakap it was the best movie I have ever watched in my life
negative | negative | Is Jumanji Welcome to the Jungle just another nostalgia ploy for money Probably httpstcoDrfOEyeEW2 httpstcoRsfv7Q5mnH
positive | positive | Saw Jumanji today with my bro such an amazing movie I really loved it cant wait to see more of your work @TheRock
positive | positive | Jumanji Welcome to the Jungle reigns over MLK weekend httpstcoOL3l6YyMmt httpstcoLjOzIa4rhD

 

One of the major issues with grabbing tweets based on a simple keyword is that many tweets do not reveal sentiment. Of the 51 tweets that were tested (the other 49 were either retweets or did not contain at least 5 words), I subjectively determined that only 22 of them contained sentiment. The successful classification rate of these tweets is 68%. This is significantly less than the success rate in the cross-validation, but can be explained by the different use of language between the training set and the tweets. The slang, acronyms and pop culture phrasing used on Twitter is not prevalent in the movie review training data from 2005.

 

Enhancing the Model with Weka

The Naïve Bayes model can be greatly enhanced using Weka. Weka provides powerful features that can be applied within a simple interface and with fewer steps. Using its pre-built classifiers, the parameters can be easily tuned. Here, Multinomial Naïve Bayes is used. First, the reviews are split by word, as required by Naïve Bayes, using the StringToWordVector filter. Additionally, 10-fold cross-validation is used. Instead of building the model once as we did before, the data is randomly partitioned into 10 sets. The model is run 10 times, leaving one set out each time, and the ten models are then averaged to build the classifier. This reduces overfitting, making the model more robust to the tweets.
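The post works through the Weka GUI, but for reference, the same pipeline sketched with the Weka Java API might look roughly like this (the ARFF file name is a placeholder, and the reviews are assumed to be loaded as a string attribute with the nominal class as the last attribute):

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class WekaSentimentSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("reviews.arff");        // placeholder dataset path
        data.setClassIndex(data.numAttributes() - 1);            // class attribute: positive/negative

        FilteredClassifier classifier = new FilteredClassifier();
        classifier.setFilter(new StringToWordVector());          // split reviews into word features
        classifier.setClassifier(new NaiveBayesMultinomial());   // Multinomial Naive Bayes

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));  // 10-fold cross-validation
        System.out.println(eval.toSummaryString());
    }
}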

 

Here is the output from the model:

 

 

When the tweets are scored using the PDI Weka scoring step, the subjective successful prediction rate increased slightly to 73%.

Pentaho Data Integration provides a powerful, high-throughput framework for moving data through a network of transformation steps, turning data from one form into another. PDI is an excellent choice for data engineers because it provides an intuitive visual UI, so the engineer gets to focus on the interesting business details. PDI is also easily extensible.

 

PDI traditionally focuses on large scale re-aggregation of data for data warehousing, big data, and business intelligence, but there are lots of other interesting things you can contemplate doing with embedded PDI. Coming from a high frequency trading background, when I looked at the default list of transformation steps I was surprised not to see a generalized UDP packet processor. Data transmitted over UDP is everywhere in the trading world - market data often comes in over UDP, and so do custom, prompt trading signals. So I ended up creating toy implementations of the transformation steps UDPPacketSender and UDPPacketReceiver to send and receive data over UDP.

 

Part I of this blog post will introduce what UDP is and is not, packet handling in Java, and how to write transformation steps for PDI. It is intended to be an introduction and is not exhaustive. Where applicable, links are provided to further reading. Code is provided separately in a GitHub repository for educational purposes only; no warranty is expressed or implied. Part II of this blog post will include some Kettle transformations and demonstrations of the UDPSender and UDPReceiver in action.

 

What is UDP?

 

So what is UDP anyway, and why should we care about it? Information is generally sent over the internet in one of two protocols: UDP and TCP. UDP stands for User Datagram Protocol, and it is a lightweight, connectionless protocol that was introduced in 1980 by David Reed (RFC 768 - User Datagram Protocol) as an alternative to the comparatively high-overhead, connection-oriented Transmission Control Protocol, or TCP. Among the properties of UDP:

  • Connectionless
  • Does not guarantee reliable transmission of packets
  • Does not guarantee packet arrival in order
  • Supports multicast/broadcast

By contrast, TCP is connection oriented and has protocol mechanisms to guarantee packet transmission and ordered arrival of packets.  But it cannot do Multicast/Broadcast, and TCP has higher overhead associated with flow control and with maintaining connection state.  So UDP is a good candidate when latency is more important than TCP protocol guarantees, or when packet delivery guarantees can be ignored or engineered into an enterprise network or into the application level.

Much of the time, UDP packet loss occurs simply because intervening network devices are free to drop UDP packets in periods of high activity when switch and router buffers fill up.  (It's rarely due to an errant backhoe, although that does happen, and when it does it's a problem for TCP as well :-)

 

For many traditional UDP uses like VoIP, packet loss manifests as hiss and pop and is mitigated by the filtering and natural language processing capabilities of the human brain. For many other uses, like streaming measurement data, missing measurements can be similarly imputed. But, it is important to note, UDP packet loss may not be a problem at all on enterprise networks, provided that the hardware is deliberately selected and controlled to always have enough buffer. It should also be noted that socket communication is an excellent mode of inter-process communication (IPC) for endpoints on different platforms on the same machine.

 

Finally, if packet loss is a problem, a lightweight mitigation protocol may be implemented in the application on top of UDP.  More information on the differences between UDP and TCP can be found here: User Datagram Protocol or Transmission Control Protocol.

 

Creating PDI Transformation Steps

 

There is excellent documentation on how to extend PDI by creating your own transformation steps at https://help.pentaho.com/Documentation/8.0/Developer_Center/PDI. The most relevant information for creating additional PDI plugins is located at the bottom of the page under the heading "Extend Pentaho Data Integration".

 

In a nutshell, you'll need a working knowledge of Java and common Java tools like Eclipse and Maven. First we'll explore sending and receiving packets in Java, then discuss threading models, and finally walk through the implementation of the four required PDI interfaces.

 

Sending and Receiving UDP Packets in Java

 

Since buffer overflow is usually the main culprit behind lost UDP packets, it is important to grab packets from the network stack as quickly as possible. If you leave them hanging around long enough, they will get bumped by some other packet. So a common pattern for handling UDP packets is to have a dedicated thread listening for packets on a port, read the packets immediately, and queue them up for further processing on another thread. Depending on how fast packets are coming in and how bursty the incoming rate is, you may also need to adjust buffer size parameters in the software or even on the network device directly.

This implementation uses classes in the java.nio package. These classes are well suited to high performance network programming. Two classes of interest are ByteBuffer and DatagramChannel. You can think of ByteBuffer as a byte array on steroids: it has get and set methods for each basic data type and an internal pointer so that data can be read and written very conveniently. ByteBuffer can also use native memory outside of the Java heap via the allocateDirect() method. When passed to a DatagramChannel, received data can then be written to the native memory in a ByteBuffer without the extra step of first copying the data into the Java heap. For low latency applications this is very desirable, but it comes at the cost of allocation time and extra memory management. (For both of these reasons, it's a good idea to pre-allocate all of the anticipated ByteBuffers and use an object pool.)
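Here is a minimal sketch of that pattern using plain java.nio calls (the port and buffer size are arbitrary, and this is not the ggutils code):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

public class ReceiveOneDatagram {
    public static void main(String[] args) throws Exception {
        // allocateDirect() places the buffer in native memory outside the Java heap.
        ByteBuffer buffer = ByteBuffer.allocateDirect(2048);

        try (DatagramChannel channel = DatagramChannel.open()) {
            channel.bind(new InetSocketAddress(5555));   // arbitrary port
            channel.receive(buffer);                     // blocks until a datagram arrives
            buffer.flip();                               // switch the buffer from writing to reading
            byte[] payload = new byte[buffer.remaining()];
            buffer.get(payload);                         // copy out and hand off for processing
        }
    }
}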

 

Threading Model

 

A word of caution is in order here.  Multi-threaded application programming is challenging enough on its own, but when you are contemplating adding threads inside of another multi-threaded application that you didn't write, you should be extra careful!  For example, PDI transformations are usually runnable within Spoon, the UI designer for PDI transformations.  Spoon has start and stop buttons, and it would be extra bad to cause a transformation to not stop when the user hits the stop button because of a threading issue, or to leak resources on multiple starts and stops.  Above all, don't do anything to affect the user experience.

If at all possible, when designing a transformation step that involves some threading, look at a similar example or two.  When creating this implementation, I looked at the pentaho-mqtt implementation on GitHub located at pentaho-labs/pentaho-mqtt-plugin.  This was very useful because I could look at how the mqtt-plugin developers married a network implementation capable of sending/receiving packets to PDI.  The model I ultimately settled upon has the following characteristics:

  • A startable/stoppable DatagramChannel listener with a single-use, flagged thread loop, and an "emergency" hard stop in case a clean join timeout limit was exceeded.
  • Assumes that the putRow() method is thread safe.
  • Uses an object pool of pre-allocated ByteBuffers to avoid inline object construction costs.
  • Uses a Java thread pool (ExecutorService) to both process packets and to serve as a packet queue.

Looking at the pentaho-mqtt-plugin and understanding the flow of data and the initialization sequences greatly sped up my own development efforts.
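In outline, that model looks something like the sketch below (this is not the pentaho-mqtt or ggutils code; the class name, port, buffer size and timeout are placeholders):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class StoppableUdpListener {
    private final AtomicBoolean running = new AtomicBoolean(false);
    private final ExecutorService workers = Executors.newFixedThreadPool(2); // processes and queues packets
    private DatagramChannel channel;
    private Thread listenerThread;

    public void start(int port, Consumer<byte[]> handler) throws Exception {
        channel = DatagramChannel.open().bind(new InetSocketAddress(port));
        running.set(true);
        listenerThread = new Thread(() -> {
            ByteBuffer buffer = ByteBuffer.allocateDirect(2048);
            while (running.get()) {                               // single-use, flagged loop
                try {
                    buffer.clear();
                    channel.receive(buffer);                      // grab the packet immediately...
                    buffer.flip();
                    byte[] payload = new byte[buffer.remaining()];
                    buffer.get(payload);
                    workers.submit(() -> handler.accept(payload)); // ...and process on another thread
                } catch (Exception e) {
                    if (running.get()) e.printStackTrace();       // closed-channel errors during stop are expected
                }
            }
        }, "udp-listener");
        listenerThread.start();
    }

    public void stop() throws Exception {
        running.set(false);
        channel.close();                                          // unblocks the blocking receive()
        listenerThread.join(5_000);                               // clean stop...
        if (listenerThread.isAlive()) {
            listenerThread.interrupt();                           // ..."emergency" hard stop if the join times out
        }
        workers.shutdown();
    }
}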

 

Implementation

 

Each PDI transformation step has to implement all of the following four interfaces.

  • StepMetaInterface: Responsible for holding step configuration information and implementing persistence to XML and PDI repository.
  • StepDataInterface: Responsible for keeping processing state information.
  • StepInterface: Implements the processing logic of the transformation.
  • StepDialogInterface: Implements the configuration UI dialog for the step.

I'm not going to go into a lot of detail on the coding here, but the source code is provided for reference at the following GitHub repositories:

  • ggraham-412/ggutils v1.0.0 contains utility classes that implement logging that can be injected with alternative logging implementations at runtime, an object pool implementation, and UDP packet sender/receiver objects that wrap DatagramChannel and ByteBuffer objects together with a simple packet encoder/decoder.
  • ggraham-412/kettle-udp  v1.0.0 contains the implementations of the UDPReceiver and UDPSender steps

(Again, the code is provided for educational purposes only and no support or warranty is expressed or implied.)

The basic flow for UDPSender is simple. We're expecting data to come into the UDPSenderStep from PDI. So in the processRow() method (which is invoked by the framework on every row) we basically get a row of data, find the fields we're interested in sending, and send them on a UDPSender from the ggutils library. If the row we get is the first one, we open the UDPSender ahead of processing. If the row we get is null (end of data), then we close the UDPSender and signal that the step is finished by returning false.
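In skeleton form, that method body looks roughly like the following (the meta/data fields and the sender API are hypothetical stand-ins; getRow(), first, putRow() and setOutputDone() are the standard Kettle step idioms):

// Inside the UDPSender step class (a BaseStep subclass); sketch only.
public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
    Object[] row = getRow();                     // next incoming row, or null at end of data

    if (row == null) {
        data.sender.close();                     // hypothetical ggutils-style sender wrapper
        setOutputDone();
        return false;                            // signal to the framework that this step is finished
    }

    if (first) {
        first = false;
        data.sender.open(meta.getHost(), meta.getPort());  // open the sender ahead of processing
    }

    data.sender.send(meta.getFields(), row);     // pick out the configured fields and send them over UDP
    putRow(getInputRowMeta(), row);              // pass the row downstream unchanged
    return true;
}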

 

The flow in UDPReceiver is a little more complicated because, from the point of view of PDI, it is ingesting externally generated data. Following the pentaho-mqtt-plugin model, the processRow() method is used only to determine a stop condition to signal to the framework. Since there are no rows coming in, we use the init() method to start up a UDPReceiver from ggutils listening on the configured port. The UDPReceiver has a handler method so that whenever a packet is received, it is passed to the handler. The handler unpacks the packet into a row and calls putRow(). The packets are handled on a thread pool, so we do potentially process fast-arriving packets out of order. When the number of packets received hits a maximum, or when a time limit expires, processRow() returns false to the framework, stopping the transformation. The UDPReceiver is stopped with a boolean flag, but just in case the thread does not stop for unknown reasons, it is hard-stopped after a join timeout to avoid leaking resources.
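Sketched the same way (the listener, handler and accessor names are hypothetical; init(), putRow() and setOutputDone() are the standard Kettle hooks):

// Inside the UDPReceiver step class; sketch only.
public boolean init(StepMetaInterface smi, StepDataInterface sdi) {
    // No rows arrive from upstream, so start listening as soon as the step initializes.
    data.receiver = new UdpListener(meta.getPort(), packet -> {
        Object[] row = unpack(packet);                  // decode the datagram into a Kettle row
        try {
            putRow(data.outputRowMeta, row);            // hand the row to the framework
        } catch (KettleStepException e) {
            logError("Unable to put row", e);
        }
    });
    data.receiver.start();
    return super.init(smi, sdi);
}

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) {
    // processRow() only watches for the stop condition described above.
    if (data.receiver.packetCount() >= meta.getMaxPackets() || data.receiver.timeLimitExpired()) {
        data.receiver.stop();                           // flagged stop, hard-stopped after a join timeout
        setOutputDone();
        return false;
    }
    return true;
}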

 

The settings UI dialogs are created using the Standard Widget Toolkit (SWT), which is a desktop UI library available for Java.  Note that there are many excellent custom widgets that already exist for use with Spoon, such as the table input widget.  To maintain the look and feel of PDI, it is best to work from an existing example dialog class from GitHub using these widgets.

 

Build and Deploy

 

In order to build the PDI transformation steps, import both of the above ggutils and kettle-udp projects into Eclipse. kettle-udp uses Maven, so make sure that the m2e Eclipse extension is installed. Finally, if you're not familiar with building Pentaho code using Maven, you should visit the GitHub repository Pentaho OSS Parent Poms for important configuration help. (Big hint: you should put the settings.xml file provided in that repository's maven-support-files folder into your ~username/.m2 folder so that your build tools can find the appropriate Pentaho repositories.)

 

Once everything is building, export a jar file called UDPReceiver.jar. In order to see the new steps in Spoon, copy the jar file to the folder data-integration/plugins/steps/UDPReceiver. The folder should have the same name as the jar file it contains. You should be able to see the UDPReceiver step in the Input category and the UDPSender step in the Output category the next time you start up Spoon.

 

i18n and l10n

 

Pentaho contains facilities for internationalization and localization of strings and date formats. The repository above contains en_US strings, but alternates could easily be provided depending on the language/locale of your choice. In order to provide localized strings in the application, simply include a messages package in your project structure (it should match the name of the target package plus ".messages") containing a messages_en_US.properties file. The file should contain properties like "UDPSenderDialog.GeneralTab.Label=General", and the dialog code retrieves the string with a call to BaseMessages.getString(class, key), where class is a Java class in the target package and key is a string in the properties file. So for example, BaseMessages.getString(UDPSenderMeta.class, "UDPSenderDialog.GeneralTab.Label") will return the string "General" if the locale in Spoon is set to en_US.
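Put together, the pieces look like this (package layout and class names are illustrative):

// messages/messages_en_US.properties (in the ".messages" sub-package of the target package):
//   UDPSenderDialog.GeneralTab.Label=General

import org.pentaho.di.i18n.BaseMessages;

public class LabelLookup {
    // The class passed in must live in the package that contains the "messages" sub-package.
    private static final Class<?> PKG = UDPSenderMeta.class;  // the step's Meta class, as in the repo

    static String generalTabLabel() {
        // Resolves to "General" when Spoon's locale is en_US.
        return BaseMessages.getString(PKG, "UDPSenderDialog.GeneralTab.Label");
    }
}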

Note: to troubleshoot this during development, make sure that the messages properties file is actually deployed in the jar file (by inspecting it with an archiver like 7-Zip) and that the properties files are included in the build CLASSPATH. Also, you may have to switch the locale in Spoon away from and back to the desired locale to flush cached strings.

 

Conclusions

 

Data sent via the internet is generally sent using either the TCP or UDP protocol. In this blog post, we looked at the characteristics of UDP and highlighted the basic techniques for handling UDP packets in Java. Then we showed how those techniques could be incorporated into a PDI transformation step and provided a reference implementation (for educational purposes only).

 

In part II of this post, we will put these ideas into action with some demonstrations using Kettle transformations.