Create a webpage monitor with Google Sheets using ImportXML
RSS feeds have completely changed the way we consume information on the web. You no longer have to visit the New York Times or CNN every hour of the day to check the news headlines, because the feed reader does it for you behind the scenes.
The only problem is that not all web content is available through feeds. For example, Amazon, eBay, and Google Product Search (Froogle) are good places to find discounts on books and electronic gadgets, but unfortunately none of those shopping sites offer feeds.
Monitor HTML Web Pages with Google Docs
Problem: Say you are looking for a discount on the iPod Nano. One option here is to open the Google Shopping page and search for iPod. If you can’t find the right price, repeat the same cycle the next day. It might sound easy, but imagine doing it for ten other products on five different shopping sites. It’s tedious, isn’t it?
Solution: What we can do here is create a simple spreadsheet in Google Docs that monitors the prices on all of these search pages and presents them in a table, so that you can not only track the prices but also compare them at a glance.
To get started, you need to have access to Google Docs and some basic knowledge of XPath. Don’t let this scare you – XPath is an easy way to address information in HTML web pages. For example, if you want all the link elements mentioned on a webpage, the XPath expression would be //a[@href]. Some more examples:
//strong selects all the elements of the web page with the strong HTML tag.
//@href selects all the href attributes on the web page, that is, the URLs of the page.
If you think writing XPath expressions is tricky work, get the XPath Checker add-on for Firefox that will help you easily determine the XPath of any element on a web page.
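As a quick sanity check, you can try one of these XPath expressions directly inside an ImportXML formula (the function is described in detail below). The URL here is only a placeholder – substitute any page you want to inspect:

=ImportXML("http://www.example.com", "//@href")

This fills a column with every URL found on that page, which is a handy way to confirm your XPath expression works before pointing it at a shopping site.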
Scrape web pages with Google Docs using ImportXML and XPath
This is the search page for “ipod nano” on Google Products. As you may have already noticed, the title of each result is formatted with the CSS class ps-large-t, while the price of the product uses the class ps-larger-t – you can easily find these class names with Firebug or in the HTML source.
We are now going to create a table in a Google Spreadsheet that will contain the name, price, and URL that will link to this list of products in Google Docs. You can use the same approach to get product data from other sites like Amazon, eBay, Buy.com, etc.
This is what the final spreadsheet looks like: All of this data is live and will be updated automatically if the corresponding information is updated on Google products.
Get external data in Google Docs with ImportXML
As you might have seen in the previous tutorial on Google Docs, there are built-in spreadsheet functions to help you easily import external data into Google Docs. One of these useful functions is ImportXML which, like ImportHTML, can be used for screen scraping.
The syntax is
=ImportXML("web page URL", "XPath Expression")
Going back to the spreadsheet, in order to retrieve the price of “ipod nano”, we type the following formula:
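A formula along the following lines does the job. The search URL format and the element tag in the XPath are assumptions based on the page markup described above; if Google changes its class names, adjust the expression accordingly:

=ImportXML("http://www.google.com/products?q=ipod+nano", "//b[@class='ps-larger-t']")

The XPath picks out every element carrying the ps-larger-t class – the prices – and ImportXML writes them down the column.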
You can replace “ipod nano” with any other product name like “harry + potter”, “nikon + d60”, etc.
To enter this function in Google Docs, click on a blank cell, press F2, and paste the formula.
Likewise, for the name of the product, we use this formula:
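The same pattern works, pointed at the title class instead. As above, the URL format and element tag are assumptions based on the markup described earlier:

=ImportXML("http://www.google.com/products?q=ipod+nano", "//a[@class='ps-large-t']")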
And for the URL (the product hyperlink), the formula extracts the href attribute of the same link. Since Google Products uses relative URLs, you must concatenate the result with http://www.google.com – easily fixed by adding another column with a CONCATENATE formula.
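Sketched out, with the same assumed URL format and class name as before, the href extraction looks like this:

=ImportXML("http://www.google.com/products?q=ipod+nano", "//a[@class='ps-large-t']/@href")

and, assuming those relative URLs land in column C starting at C2 (a hypothetical layout – use whatever cell your results actually occupy), the helper column would be:

=CONCATENATE("http://www.google.com", C2)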
Related: Import Data From HTML Web Pages Into Excel
Subscribe to webpage changes through feeds
You don’t need to manually check this Google Docs spreadsheet to see whether the prices have changed since yesterday – just choose Publish, tick “Automatically re-publish when changes are made”, and subscribe to the published document in your favorite RSS reader.
The author is an Excel genius and blogs at Chandoo.org. This site is a gold mine of advice related to manipulating and visualizing data using Excel and other spreadsheets.