Section D) How to scrape the internet for data
One of the most important aspects of research is the data you have. Without data, there can't be a model. Fortunately, most data is free -- unfortunately, most data isn't immediately in a computer-parsable format (like .csv or .xml). To get the data into a format we can use, we will need to "scrape" websites for it.
A couple of "packages" have been created that will greatly improve our ability to scrape webpages. It can certainly be done in python without them -- but they will make your life a whole lot easier:
Mechanize - This will allow us to open webpages easily (http://wwwsearch.sourceforge.net/mechanize/)
Beautiful Soup - This will allow us to parse apart the webpages (http://www.crummy.com/software/BeautifulSoup/)
Installing Beautiful Soup is pretty easy: just put the Beautiful Soup python file (http://www.crummy.com/software/Beaut...lSoup-3.0.0.py) in the same directory you are running your code from.
Installing Mechanize is a little tougher. On a *nix machine, cd to the directory where you downloaded it and extract it (tar -xzvf [filename]). Then cd into the extracted directory and install it by typing "sudo python setup.py install". It should install; you can post here if you have any problems. As far as Windows goes, you may be on your own -- I can't imagine it's very tough, and there's probably a tutorial somewhere online.
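The *nix install steps above look roughly like this as a shell session (the tarball name is a placeholder -- substitute whatever version you actually downloaded):

```shell
# Extract the mechanize tarball you downloaded (filename is a placeholder)
tar -xzvf mechanize-X.Y.Z.tar.gz
# Move into the extracted directory
cd mechanize-X.Y.Z
# Install it system-wide (sudo will prompt for your password)
sudo python setup.py install
```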
Now that the installation is out of the way, it's time to get down to business. I'll give you the basics here, and you should be able to refer to the documentation for more complicated examples. I'm going to assume you have a basic familiarity with HTML -- if you don't, you may want to search for a quick tutorial. For our first example, let's get a list of today's injuries from statfox for MLB baseball:
Code:
from BeautifulSoup import BeautifulSoup, SoupStrainer ## This tells python to use Beautiful Soup
from mechanize import Browser ## This tells python we want to use a browser (which is defined in mechanize)
import re ## This tells python that we will be using some regular expressions.
## .. Regular expressions allow us to search for a sequence of characters
## .. within a larger string
import time
import datetime
## The first step is to create our browser..
br = Browser()
## Now let's open the injuries page. This one line will open the page and retrieve the html.
response = br.open("http://www.sbrodds.com/StoryArchivesForm.aspx?ShortNameLeague=mlb&ArticleType=injury&l=3").read()
## Now we need to tell Beautiful Soup that we would like to search through the response.
## .. This next line will tell Beautiful Soup to only return links to the individual injuries.
## .. We know that all the links to the injuries have "ShortNameLeague=mlb&ArticleType=injury"
## .. in their url, so we search for these links. Each of these links has a title that describes
## .. the injury which we will use in the next line.
linksToInjuries = SoupStrainer('a', href=re.compile('ShortNameLeague=mlb&ArticleType=injury'))
## This will put the title of all links in the "linksToInjuries" into an array.
## We then call set() on our array to change it to a "set", which by definition has no duplicates.
injuryTitles = set([injuryPage['title'] for injuryPage in BeautifulSoup(response, parseOnlyThese=linksToInjuries)])
## Finally let's print all the injuries out that are for today's date.
today = datetime.date.today()
# the function strftime() (string-format time) produces nice formatting
# All codes are detailed at http://www.python.org/doc/current/lib/module-time.html
date = today.strftime("%m/%d")
## Now let's print out the injuries that we have.
for title in injuryTitles:
    ## See if the date is in the title; if it is, print it.
    if re.search(date, title):
        print title
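If the set() and re.search() steps at the end seem mysterious, here they are in isolation on some made-up titles (these are hypothetical, not real statfox data), so you can see the de-duplication and the date filter work without hitting the web at all:

```python
import re

## Hypothetical link titles in the "MM/DD ..." style the scraper expects
titles = [
    "06/14 Smith (hamstring) out",
    "06/13 Jones (elbow) questionable",
    "06/14 Smith (hamstring) out",  ## the same link appears twice on the page
]

## set() drops the duplicate title, leaving just 2 unique ones
unique_titles = set(titles)

## re.search finds the date anywhere inside the title string;
## in the real script, date comes from today.strftime("%m/%d")
todays = [t for t in unique_titles if re.search("06/14", t)]
```

After this runs, todays holds only the single unique "06/14" title -- the same logic the scraper uses to print today's injuries.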
It might seem like a lot at first, but it's not much code. Take it slow and use Google when you don't know what a function does. Googling "python [some piece of code you don't understand]" will work magic. Ask here and I can further break down any slice of code.
Sorry I haven't had much time -- if anyone posts an example of the kind of data they would like scraped, I will create one more example using both BeautifulSoup and Mechanize.