Basketball Reference, part of the Sports Reference LLC network, is a beautiful site. It’s uncluttered, organized, and comprehensive. There’s updated data from the NBA, along with statistics from the ABA and BAA days. There are basic basketball statistics (points, rebounds, and assists) along with advanced statistics like true shooting percentage, usage rate, and win shares. Such information can be exported as Excel or CSV files through simple pointing and clicking. However, there are currently 4,500 players in Basketball Reference’s database, and therefore 4,500 pages that would need to be perused to download a complete dataset. For in-depth analysis, it is much more efficient to scrape this data with Pentaho Data Integration.
The Process: Overview
Overall, scraping data from the web involves finding the relevant URLs, extracting the information, and then loading it into a database. This article focuses on scraping the data from the web; loading the database, along with cleaning and analyzing the data, is out of scope.
In this instance, two PDI transformations were used: one to retrieve the relevant URLs and another to extract and load the data. The entire process could have been undertaken in one transformation, but two were used for the sake of organization and ease of debugging.
The Process: Seeding and Gathering the URLs
The first step in use is the Data Grid step, where URLs can be stored by hand. Fortunately, Basketball Reference offers a handy index of players organized by last name, and each page of this index follows the same format: https://www.basketball-reference.com/players/a/, https://www.basketball-reference.com/players/b/, etc.
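Since the index pages differ only by letter, the list stored in the Data Grid step could equally be generated programmatically. A minimal sketch, assuming one index page per letter of the alphabet:

```javascript
// Generate the 26 letter-index URLs that the Data Grid step stores.
// Assumes one index page per letter, per the pattern shown above.
const base = 'https://www.basketball-reference.com/players/';
const letterUrls = [];
for (let c = 'a'.charCodeAt(0); c <= 'z'.charCodeAt(0); c++) {
  letterUrls.push(base + String.fromCharCode(c) + '/');
}
console.log(letterUrls[0]); // https://www.basketball-reference.com/players/a/
```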
Therefore, the data grid step appears as below:
Each URL in this list is fed into the HTTP Client step, where, upon receipt, the HTML is returned. Once the HTML is returned, it can be parsed for relevant information.
As seen in the HTML, the relevant portions begin with '<a href="/players/', followed by a letter and then an abbreviated combination of the player’s first and last names.
Note that the HTML returned by the HTTP Client step had all the newline and return characters removed. It’s unlikely that a newline or return character would be in the middle of the portion that we want, so this step is for the pessimists.
At this point, the URLs leading to each player’s personal statistics page have been extracted and exist in one row per letter, with each URL separated by a comma.
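The extraction those steps perform can be approximated in a few lines of JavaScript. This is a hedged sketch, not the transformation itself; the sample HTML below is an abbreviated, hypothetical stand-in for a real index page:

```javascript
// Sketch of the extraction: pull every href of the form
// /players/<letter>/<id>.html out of an index page's HTML and emit
// the results as one comma-separated row, as the transformation does.
const sampleHtml =
  '<th><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th>' +
  '<th><a href="/players/a/abdulza01.html">Zaid Abdul-Aziz</a></th>';

const snippets = [];
const re = /<a href="(\/players\/[a-z]\/\w+\.html)"/g;
let m;
while ((m = re.exec(sampleHtml)) !== null) {
  snippets.push(m[1]); // capture group: just the href value
}
const row = snippets.join(',');
console.log(row); // /players/a/abdelal01.html,/players/a/abdulza01.html
```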
This transformation could conceivably end here, but that would only leave more work for the next transformation to undertake. Hence, the next steps, such as reformatting the URLs and removing duplicates, make the second transformation simpler to execute.
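The tidy-up steps amount to splitting the comma-separated row back into individual snippets and dropping duplicates. A minimal sketch, with a hypothetical input row:

```javascript
// Split a comma-separated row of snippets and remove duplicates,
// approximating the reformatting and dedup steps. Input is hypothetical.
const row = '/players/a/abdelal01.html,/players/a/abdulza01.html,/players/a/abdelal01.html';
const unique = Array.from(new Set(row.split(',')));
console.log(unique.length); // 2 — each snippet now appears exactly once
```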
The final result is a TXT file with unique web address snippets stored on one row each.
There are separate transformations for creating this file of web snippets and for loading data based on these web addresses for the sake of ease of debugging. A quick glance at the output of the first transformation can reveal anything askew.
This second transformation loads the text file and fully forms a URL based on the web address snippet. It then sends this URL to the HTTP Client step, which retrieves the HTML of the web address. Please note that in between the creation and the sending of this URL is a Delay Row step, so that Basketball Reference doesn’t refuse the connection for excessive requests.
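Forming the full URL is simple concatenation of the site’s domain and the stored snippet. A sketch, with a hypothetical snippet (the Delay Row pause and the actual HTTP request are only noted in comments):

```javascript
// Prefix each stored snippet with the site's domain to form the full
// URL handed to the HTTP Client step. Snippet value is hypothetical.
const domain = 'https://www.basketball-reference.com';
const snippet = '/players/a/abdelal01.html';
const fullUrl = domain + snippet;
// In PDI, the Delay Row step would pause here (e.g. a second or two per
// row) before the HTTP Client step issues the request for fullUrl.
console.log(fullUrl); // https://www.basketball-reference.com/players/a/abdelal01.html
```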
After the HTML is retrieved, a process similar to the first transformation’s is undertaken. Again, regular expressions are used to extract relevant portions of the retrieved HTML, which are then reformatted.
The Process: Harvesting Statistics
This article is concerned with extracting the per game tab of the player statistics from Basketball Reference.
Looking at the relevant HTML reveals that extracting the lines from ‘<tr id="per_game.’ to ‘</tbody>’ returns the per game statistics in question, which is what the Modified Java Script Values step does.
The extracted lines are then split into rows, so that each row constitutes the HTML for one season of data, organizing the process and aligning the data with the grain of the database.
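The extract-then-split logic can be sketched as follows. This is an approximation of what the Modified Java Script Values step does, with a heavily abbreviated, hypothetical stand-in for a player page:

```javascript
// Grab the span from '<tr id="per_game.' up to '</tbody>', then split
// on '</tr>' so each row holds one season of data.
const playerHtml =
  '<tbody><tr id="per_game.1991"><td>13.2</td></tr>' +
  '<tr id="per_game.1992"><td>15.1</td></tr></tbody><tfoot></tfoot>';

const block = playerHtml.match(/<tr id="per_game\.[\s\S]*?(?=<\/tbody>)/)[0];
const seasons = block.split(/<\/tr>/).filter(s => s.trim().length > 0);
console.log(seasons.length); // 2 — one entry per season
```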
After organizing the data into the desired grain, the transformation extracts specific statistics and stores them as new columns or fields in the same row.
Unfortunately, the Modified Java Script Values step doesn’t support backreferences in regular expressions, so the first result extracts the data along with extraneous formatting.
Fortunately, the formatting can be easily cleaned up using regular expressions.
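The two-stage extraction can be sketched like so: first pull the whole cell, tags and all, then strip the markup with a second regex. The `data-stat` attribute and the sample season row are illustrative assumptions, not the article’s exact markup:

```javascript
// Stage 1: extract a statistic's cell with its surrounding tags
// (no backreference needed). Stage 2: clean away the tags with a
// second regex, leaving only the value. Sample row is hypothetical.
const seasonRow = '<tr id="per_game.1991"><td data-stat="pts_per_g">13.2</td></tr>';

const rawCell = seasonRow.match(/<td data-stat="pts_per_g">[^<]*<\/td>/)[0];
const pts = rawCell.replace(/<[^>]*>/g, '');
console.log(pts); // 13.2
```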
Using such regular expressions to clean the data leaves one with a neat set of results ready to be inserted into a database, analyzed, and then visualized. The techniques covered in this article are not limited to Basketball Reference, however; they can be adapted to scrape other sites as well. Let me know if these techniques have worked for you in other areas or what other variations you have tried!