Over at Beyond the Bets several people were interesting in learning to scrape stats off of websites.  This is called WebScraping or data mining.  This is something we’re pretty familiar with and wanted to help others out.  Our example is in PHP.  There is really no “correct” language for WebScraping.  Most languages will work, we just prefer PHP.  Python may be easier for beginners.  This first post has more to do with looking at actual code.  In future articles, we’ll discuss different approaches for running these scripts.

This particular tutorial pulls back offensive stats for all MLB teams.

Warning 

This isn’t necessairly something that is easy.  Remember, I’m a computer programmer for my 8-5 gig so it comes easy to me.  Advanced baseball sabermetrics isn’t something that comes easy to me but does to others.  So if you struggle to understand this, you’re not alone.  It takes research and a dedication to learn.  It’s also worth nothing that this isn’t a perfect solution by any means.  This is just 1 quick and dirty example in PHP.

Yet to Come

In Part 2, I plan to discuss some utilities I use to obtain stats, some tips I’ve learned, the easiest way to run this stuff, etc.

Show me the Code!

Let’s take a line by line look at what the code is doing.  I’ve split this into sections just to make the explanation a little easier.  The bold numbers correspond to each line of code.

1.  This just tells a browser or server that the following code is PHP.  There will be a similar “tag” as the very last line on the script.

2.  Statement 2 allows my program to use an html parser.  This is a third party script that makes life much easier when scraping data.  I believe Python has something similar.

3.  In PHP, the dollar sign represents a variable declaration.  I could have named this variable anything I wanted.   Think of a variable as just a temporary storage location.  I can use this temporary storage location other places in the program.  I defined the variable as an array.  An array is basically a set of rows and columns, similar to what you would see in an excel spreadsheet.  Please note that I’m not currently using $table for anything really.  I was going to use it for something else later, but ended up going a different route and not cleaning up my code.  Again, this could be deleted without any adverse affects.

4.  Again we have a variable declaration.  file_get_html is a part of simple_html_dom.php I mentioned earlier.  Basically this allows me to pass in a website address (which you see in the parenthesis) and refer to different html tags within the website.  The URL show in parenthesis  you could copy and paste into your internet browser and go to the stat page I’m scraping.  More on this later.

5.  Blank line.  Nothing is done.

6.  This opens a database connection for my script.  Before I can write to or update any of my databases or tables, I need to open a connection to let my script know which database I’m using.

7.  This is just an error catch.  This will let me know if I’ve miskeyed a password or something and can’t connect to the specified database.  Lines 9 and 11 are all error handling too, very similar to line 7.

8.  A database has multiple tables.  This line of code lets my script know which particular table I want to connect to in my script.  For example, I may have a basketball stat table and a football stat table.  For this particular script, I’d want to update my baseball table.

10.  This basically deletes my table before re-populating.  Another method to do this would be to update the contents.  For simplicity, I just wipe the table data and repopulate.

13.  This gets a little complicated but follow along.  Line 13 basically starts a loop.  This will loop through $offData until no more criteria is found.  If you recall, $offData basically imports the html code of the website I selected.  I’ve declared every place “.cell1″ is included as a $row.  So basically this line is going to loop through every instance where “.cell1″ is found in the statfox html code.  The curly bracket ‘{‘ indicates that everything within the the open and closing curling bracket is in the loop.

The HTML above is from the statfox website address I mention above.  You see the TR tag for a table row.  Each TD listing you see is basically a cell within the row.  These values match up with the output of the screen you see.  The first TD tag is Team number, the second is team name, the third TD tag is Runs, etc.

By looking at the TD tags above, you count 15.  Looking at the table screen shot from stat fox, you see 15.  (note the numbers vary between the html and the output.  That’s because I haven’t been running this script lately to obtain new stats.  Had I been running this every day, the values in the HTML and the screen output would be the same.)  Alright – back to the code.

14.  In programming terms, generally you start counting a 0 rather than 1.  In this HTML, cell 0 is just the team ranking/identifier.  I don’t need that, so I start with cell 1.  I assign the $team variable with the name of the team.  So basically, this code is telling the script to find cell 1 and copy the contents of that cell into my work variable.  Remember, if I wanted the team ranking/number, I would have started at 0.

15-27.  These are much like line 14.  I’m just taking the value of each cell and saving them to a work variable.  Each variable is named something that makes sense to me as far as what particular “stat” the field contains.

29-31.  This is basically just error checking.  When I ran this, I was picking up some garbage data.  This is basically saying “if the $team variable is blank, start my loop over because I want to omit this data”.  Continue tells the script to to back to the top of the loop.

33-35.  This is a SQL statement to insert this particular row into my database table.  As you notice, db_name and dbMLBOffStats are the connections we opened previously.  The first set of parenthesis is my field names within the table.  The parenthesis after the key word “values” tells which variables from the script I want to populate the table with.  Basically I perform a table insert for each row on the website.  Remember, my SQL statement doesn’t run unless the data is valid.

36.  Error catching in case my SQL statement is incorrect

38.  This just accumulates each variable into my array that we discussed earlier.

39.  This is the closing curly bracket that signifies the end of my loop.

41-67.  On StatFox, they use alternating table colors to make it easier to read.  If you recall from above, we looked for the .cell1 tag.  Well, the alternate rows are .cell2.  So the above image is just repeated code from above.  I could have done something different here rather than duplicating the code, but it is what it is.

69.  This just tells my script to close the database connection I had opened.

71.  End of PHP code.

 

There you have it.  71 lines (counting blanks) to scrape offensive stats for every MLB team.