My first hands on experience with web scraping was one of uncertainty and a significant amount of ‘on-the-job’ learning. Initially I was working as a tech support agent but once the operations manager caught wind of the fact that I’m a programmer, I was moved to the offline team and tasked with writing some sort of script that would scrape a relatively large amount of data from one of the company’s sites and store it in a spreadsheet for easier analysis.
I knew that this was definitely possible to do, it was just a matter of finding out how. I was given a week to determine if I was up to the task. I started where any sane person would, the Google search bar.
The first potential candidate I found was a Chrome plugin called webscraper.io. The examples they include on their site are very helpful and things were looking hopeful after a few hours of testing. The issue that led to this idea being scrapped was that it is only designed to scrape data from a website and has no functionality that would allow it to populate search fields and submit a form before commencing a scraping session. Although this tool didn’t suit my specific requirements, it could be of great help to someone looking to scrape an online store of all its products, for example.
Digging deeper and deeper through pages of Stack Overflow, I kept coming to the same conclusion; that my best bet of automatically navigating through webpages, populating search fields and trawling through pages of results, documenting the contents as I went, was to use Visual Basic for Applications (VBA).
I had no prior experience with VBA but there’s nothing particularly interesting about it or its syntax that caused me any trouble. I simply added references to ‘Microsoft Internet Controls’ and ‘Microsoft HTML Object Library’ and I had everything I needed within Excel and its IDE for VBA. The following code shows how to create an Internet Explorer object, set it to visible and navigate to a URL.
Set IE = New InternetExplorerMedium IE.Visible = True IE.navigate ""
Okay so now that we’ve covered how to get to the site primed for scraping, we need to figure out how to interact with elements within the page. What better way to showcase this than to attempt to login. The first thing to consider is how to address a particular element within a webpage. To do this we get the Document Object Model (DOM) and from there we can drill down to identify specific elements. Depending on the website in question and the prerogative of the developer(s) of said website, the complexity of this varies. The following code snippet works given the assumption that each element has been given a unique id.. sometimes this isn’t the case but I’ll talk about that in a moment. First, the easy way:
Dim doc: Set doc = IE.document Dim passwordTextField = doc.getElementById("password") passwordTextField.Value = "hunter2" Dim submitBtn: Set submitBtn = doc.getElementById("submit") submitBtn.Click
Looking at the below image, it’s clear to see how elements in a webpage are related. Each nested element is considered a child of its parent, meaning we can access any element by traversing through this tree structure.
Let’s imagine that the case is the same as above but the submit button has not been given an id. The first place you ought to look is to the parent of that element. Let’s say that element is a
with an id of “submitBtnDiv”. In this instance we can still access the submit button a number of ways but it’s a little trickier:
' We can get the first element in the set of children elements Dim submitBtn: set submitBtn = IE.document.getElementById("submitBtnDiv").Children(0)
If we can guarantee that the element will always be the nth instance of a given class, then we can do the following as well:
' If we know that the element in question is the first instance of an input tag Dim submitBtn: set submitBtn = IE.document.getElementsByTagName("input")(0)
Note: using EI9 or later, you can also use the method
Now that we can access any individual element within a webpage, writing a web-scraper just comes down to the individual site in question. For this example, let’s use IrishRail.ie. Quickly scanning through the source (using Google Chrome’s nifty inspect element feature), we can quickly determine how difficult it’s going to be to access the table of data we want.
The first thing that jumps out at me here is that there’s a div with an id of ‘content’, that’s a starting point. The next thing I see is that the table tag is its 11th child. Already, we can access the table:
Dim table ' To get the 11th child element: Set table = IE.document.getElementById("content").Children(10)
Now that we have the table, it’s just a matter of accessing each individual td and placing that value into the corresponding cell in your spreadsheet now. The following code shows a simple loop that will extract all data from the above table:
Dim trs, tds, tbody tbody = table.getElementsByTagName("tbody")(0) trs = tbody.getElementsByTagName("tr") For r = 0 to trs.Length - 1 tds = trs(r).getElementsByTagName("td") For c = - to tds.Length - 1 ThisWorkbook.Sheets(1).Cells(r, c) = tds(c).Value Loop Loop
In the above snippet, the
getElementsByTagName method is used in two different ways. Firstly, we request the very first instance of tbody within the table tag. Secondly, we request a collection of elements of type tr. This gives us an ordered list of all tr tags nested within tbody. Looping through each row and then doing the same for each td, we can quickly grab all of the table’s contents without specifically referencing each individual piece of data.
And that’s basically all there is to it. There are other aspects one must think about when writing a web scraper that has to work reliably and consistently but this post should give you a basic understanding of the fundamental concepts used in web scraping.