I’m writing this post as a recap of what I uncovered while learning to scrape web pages for content on the Internet. I did a lot of research, and it all started here: Priceconomics sells data crawling as a service (DCAS). I’m not sure DCAS is a thing yet, but I’m pretty sure some people will start calling the service that. I looked at what Priceconomics was doing and figured it shouldn’t be hard to gain a basic understanding of web scraping.
There are many open source libraries and tools available, and it’s easy to be overwhelmed just surveying the landscape. My programming language of choice was Python, and the library I chose to use was lxml.
Why did I choose Python?
I believe in learning from others. I figured a good place to start was Hacker News. Hacker News is a place where many innovative ideas and solutions are shared. A quick search for web scraping yielded some great results.
Here’s the result from my search on Hacker News. The yellow highlights match the term ‘web scraping’ and the red underlines match ‘Python’. The search results are really good, and many of them are based on Python. That was my reasoning.
Why did I choose to use LXML?
I came across a post titled Scraping with Urllib2 & LXML, which a search on Google turned up. The post was very similar to what I wanted to accomplish. It felt like an easy win, so I decided to give it a try. lxml is used by many other libraries and software packages; you can check some of the uses on the lxml FAQ page.
lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
The problem I wanted to solve
I wanted to scrape Amazon search results. To compare prices, I would have to do it visually or copy and paste results somewhere else for analysis. That’s not very appealing to me.
Looking at the search results for ‘lego 21115’, you would see the following web page. I needed to find the div for each item; luckily, Chrome’s developer tools made that easy. I just inspected the first item. I had to walk up the tree, but I found the node. For me, it was a div tag with a class of a-fixed-left-grid-col a-col-right.
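As a minimal sketch of that step, here’s how lxml can locate those divs. The HTML fragment below is an invented stand-in for the real Amazon results page (which is far larger and will have changed since), but the class name is the one I found with the developer tools:

```python
from lxml import html

# Invented stand-in for the Amazon results page; each search result
# sits in a div with the class found via Chrome's developer tools.
SAMPLE = """
<html><body>
  <div class="a-fixed-left-grid-col a-col-right">
    <h2>LEGO Minecraft 21115 The First Night</h2>
    <span class="price">$34.99</span>
  </div>
  <div class="a-fixed-left-grid-col a-col-right">
    <h2>LEGO Minecraft 21115 (second listing)</h2>
    <span class="price">$29.50</span>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
# Select every div whose class attribute is exactly the one we found.
items = tree.xpath('//div[@class="a-fixed-left-grid-col a-col-right"]')
print(len(items))  # one element per search result
```

On the live page you would feed `html.fromstring` the response body of an HTTP GET instead of an inline string.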
Results - Titles
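Pulling the titles out of those item divs looks roughly like this. The fragment is again invented for illustration; on the real page the title may sit in a different tag, so the `h2` here is an assumption:

```python
from lxml import html

# Invented fragment standing in for two Amazon search results.
SAMPLE = """
<html><body>
  <div class="a-fixed-left-grid-col a-col-right">
    <h2>LEGO Minecraft 21115 The First Night</h2>
  </div>
  <div class="a-fixed-left-grid-col a-col-right">
    <h2>LEGO Minecraft 21115 (second listing)</h2>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
# The '//h2' after the div predicate searches inside each result div.
titles = [h2.text_content().strip()
          for h2 in tree.xpath('//div[@class="a-fixed-left-grid-col a-col-right"]//h2')]
print(titles)
```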
Results - Prices
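Prices work the same way, with one extra step: the scraped text includes a currency symbol, so it has to be stripped before the values can be compared numerically. The `span` and its `price` class below are invented for the sketch:

```python
from lxml import html

# Invented fragment; the real price markup on Amazon differs.
SAMPLE = """
<html><body>
  <div class="a-fixed-left-grid-col a-col-right">
    <span class="price">$34.99</span>
  </div>
  <div class="a-fixed-left-grid-col a-col-right">
    <span class="price">$29.50</span>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
# Pull the raw price strings, then turn them into numbers for comparison.
raw = tree.xpath('//div[@class="a-fixed-left-grid-col a-col-right"]'
                 '//span[@class="price"]/text()')
prices = [float(p.lstrip('$')) for p in raw]
print(prices)
```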
Results - Overall
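Putting the two together, each item div reduces to a (title, price) pair, which is the comparison I was after in the first place. Same caveat as above: the fragment and the inner tag/class names are invented for the sketch:

```python
from lxml import html

# Invented stand-in for the Amazon results page.
SAMPLE = """
<html><body>
  <div class="a-fixed-left-grid-col a-col-right">
    <h2>LEGO Minecraft 21115 The First Night</h2>
    <span class="price">$34.99</span>
  </div>
  <div class="a-fixed-left-grid-col a-col-right">
    <h2>LEGO Minecraft 21115 (second listing)</h2>
    <span class="price">$29.50</span>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
results = []
for item in tree.xpath('//div[@class="a-fixed-left-grid-col a-col-right"]'):
    # A relative XPath ('.//') searches within this item div only.
    title = item.xpath('.//h2')[0].text_content().strip()
    price = float(item.xpath('.//span[@class="price"]')[0]
                  .text_content().lstrip('$'))
    results.append((title, price))

# Sorting or taking the minimum gives the price comparison directly.
cheapest = min(results, key=lambda r: r[1])
print(cheapest)
```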
Within a few hours of reading and coding, I was able to accomplish my original goal. I now have a base to grow from to do more complex scraping.
Be careful not to abuse scraping. Most companies, Amazon included, have an API for you to use, which lets you sidestep design and style changes made to the site. Yes, my scraper will break when Amazon changes their page. It’s not a question of if, but when they will do this. Using an API puts you right next to the data. Always use an API when one is available.
This example is simple and pulls one page from Amazon’s site, so Amazon would probably not block my IP. If my script started to crawl Amazon’s site, that would be another story.
Learning the craft
Here are some additional resources that gave me insight into web scraping.
XPath and XSLT with lxml - This contained a great example of using XPath
Elements and Element Trees - While traversing the tree, you will receive elements. Here’s a good overview and examples.
Requests python library - Retrieving web pages shouldn’t be a chore, and the Requests library makes it super easy. The post also includes a great example of sending a multipart POST request.
HTML Scraping — The Hitchhiker’s Guide to Python - A good place to start if you want to get coding immediately and skip the stuff above.