I’m writing this post as a recap of what I uncovered while learning to scrape web pages for content on the Internet. I did a lot of research, and it all started here. Priceonomics sells data crawling as a service (DCAS). I’m not sure DCAS is a thing yet, but I’m pretty sure some people will start calling this kind of service that. I looked at what Priceonomics was doing and figured it shouldn’t be hard to gain a basic understanding of web scraping.
There are many open source libraries and tools available, and you can easily be overwhelmed just surveying the landscape. My programming language of choice was Python, and the library I chose to use was lxml.
Why did I choose Python?
I believe in learning from others, so I figured a good place to start was Hacker News, where many innovative ideas and solutions are shared. A quick search for web scraping yielded some great results.
Here’s the result of my search on Hacker News. The yellow highlights match the term ‘web scraping’, and the red underlines match ‘Python’. The search results are really good, and many of them are based on Python. That was the basis for my choice.
Why did I choose to use LXML?
I came across this gem of a post: Scraping with Urllib2 & LXML, which turned up in a Google search. The post was very similar to what I wanted to accomplish, so it felt like an easy win and I decided to give it a try. lxml is used by many other libraries and software packages; you can see some of its uses on the lxml FAQ page.
lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
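As a quick sanity check that lxml lives up to that description, here is a minimal sketch that parses a string of HTML and pulls out a node. The HTML snippet is made up for illustration:

```python
from lxml import html

# A tiny HTML snippet standing in for a downloaded page
snippet = "<html><body><p class='msg'>Hello, <b>lxml</b>!</p></body></html>"

# Parse the string into an element tree
tree = html.fromstring(snippet)

# Select the paragraph by its class attribute via XPath
paragraph = tree.xpath('//p[@class="msg"]')[0]
print(paragraph.text_content())  # → Hello, lxml!
```

That is the whole workflow in miniature: parse, select with XPath, read out the content.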
The problem I wanted to solve
I wanted to scrape Amazon search results. To compare prices, I would have to do it visually or copy and paste the results somewhere else for analysis. That’s not very appealing to me.
Looking at the search results for ‘lego 21115’, you would see the following web page. I needed to find the div for each item; luckily, Chrome’s developer tools made that easy. I inspected the first item and had to walk up the tree, but I found the node. For me, it was a div tag with a class of ‘a-fixed-left-grid-col a-col-right’.
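To verify that XPath expression before pointing it at a live page, you can run it against an inline fragment that mimics the structure of one result item. The markup below is a simplified stand-in I wrote for illustration, not Amazon’s actual page:

```python
from lxml import html

# Inline HTML mimicking one Amazon search result item
# (class names copied from the page as it looked at the time)
snippet = """
<div class="a-fixed-left-grid-col a-col-right">
  <a title="LEGO Minecraft 21115 The First Night" href="#">link</a>
  <span class="a-size-base a-color-price a-text-bold">$34.99</span>
</div>
"""

tree = html.fromstring(snippet)

# The same XPath used in the full script below
items = tree.xpath('//*[@class="a-fixed-left-grid-col a-col-right"]')
print(len(items))                       # one matching div
print(items[0].find("a").get("title"))  # the product title
```

If the expression matches here, the remaining work is just looping over every such div on the real page.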
from lxml import html
import requests

url = "http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=lego+21115"
page = requests.get(url)
tree = html.fromstring(page.content)

# Each search result lives in a div with this class
items = tree.xpath('//*[@class="a-fixed-left-grid-col a-col-right"]')

# Collect the product titles from the anchor tags
titles = []
for item in items:
    for title in item.getiterator("a"):
        if title.get("title") is not None:
            titles.append(title.get("title"))

# Collect the prices; some items have none
prices = []
for item in items:
    found = False
    for price in item.getiterator("span"):
        if price.get("class") == "a-size-base a-color-price a-text-bold":
            prices.append(price.text)
            found = True
    if not found:
        prices.append("no price")

print(titles)
print(prices)
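Since the point was comparing prices, a small follow-up sketch shows one way to pair the two lists and rank items by price. The sample titles and prices below are trimmed from the results further down, and to_float is a helper I’m introducing for illustration:

```python
# Sample data in the same shape the scrape produces
titles = ["LEGO Minecraft 21115 The First Night", "LEGO Minecraft 21114 The Farm"]
prices = ["$34.99", "$25.99"]

def to_float(price):
    # Convert "$34.99" to 34.99; "no price" entries sort last
    return float(price.lstrip("$")) if price.startswith("$") else float("inf")

# Pair titles with prices and sort cheapest-first
ranked = sorted(zip(titles, prices), key=lambda tp: to_float(tp[1]))
for title, price in ranked:
    print(price, title)
```

With the lists zipped together, the comparison I originally wanted becomes a one-liner instead of a copy-and-paste job.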
Results – Titles
['LEGO Minecraft 21115 The First Night', 'LEGO Minecraft 21114 The Farm', 'LEGO Minecraft 21116 Crafting Box', 'LEGO Minecraft 21121 the Desert Outpost Building Kit', 'LEGO Minecraft 21120 the Snow Hideout Building Kit', 'LEGO Minecraft 21117 The Ender Dragon', 'LEGO Minecraft 21118 The Mine', 'LEGO Minecraft Creative Adventures 21115 The First Night', '21115 Lego 408Pcs Minecraft The First Night Kids Building Playset', 'Lego Minecraft Toys Premium Educational Sets Creationary Game With Minifigures For 8 Year olds Childrens Farm Box', 'Minecraft The Farm Includes a Steve Minifigure with an Accessory, plus a Skeleton, Cow and a Sheep', 'LEGO Minecraft - The Creeper Minifigure from set 21115', 'Lego Minecraft Ultimate Collection (Cave 21113 ,Farm 21114, First Night 21115, Crafing Box 21116, Dragon 21117, Mine 21118 )', 'Bundle: LEGO Minecraft 21116 Crafting Box & LEGO Minecraft 21115 The First Night & LEGO Minecraft The Cave 21113 Playset', 'LEGO Minecraft The Cave 21113 Playset & LEGO Minecraft 21115 The First Night & LEGO Minecraft 21114 The Farm', u'Lego\xae Minecraft Terrain Ore Bundle "(1) Diamond" "(1) Emerald" "(1) Silver" "(1) Amethyst" "(1) Gold" "(1) Lapis" "(1) Redstone"']
Results – Prices
['$34.99', '$25.99', '$21.31', '$40.99', '$35.66', '$49.00', '$34.76', '$73.36', '$118.50', '$46.91', '$61.38', '$47.55', 'no price', '$5.40', '$568.99', 'no price', '$185.99', 'no price']
Results – Overall
Within a few hours of reading and coding, I was able to accomplish my original goal. I now have a base to build on for more complex scraping.
Be careful not to abuse scraping. Most companies like Amazon have APIs for you to use, which let you bypass design and style changes made to the site. Yes, my scrape will break when Amazon changes their page; it’s not a question of if, but when. Using an API puts you right next to the data. Always use an API when one is available.
This example is simple and pulls a single page from Amazon’s site, so Amazon would probably not block my IP. If my script started crawling Amazon’s site, that would be another story.
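If a script does start crawling multiple pages, throttling requests is the polite thing to do. Here is a minimal sketch of a rate limiter; the Throttle class and its two-second default are my own illustrative choices, not anything Amazon prescribes:

```python
import time

class Throttle:
    """Minimal rate limiter: guarantees at least `delay` seconds between calls.
    A sketch for polite crawling; the 2-second default is an arbitrary choice."""

    def __init__(self, delay=2.0):
        self.delay = delay
        self.last = 0.0

    def wait(self):
        # Sleep only for however much of the delay hasn't already elapsed
        elapsed = time.monotonic() - self.last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last = time.monotonic()
```

Calling throttle.wait() before each requests.get keeps the crawl from hammering the server, which makes a block far less likely.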
Learning the craft
Here are some additional resources that gave me insight into web scraping.
XPath and XSLT with lxml – This contained a great example of using XPath
Elements and Element Trees – While traversing the tree, you will receive elements. Here’s a good overview with examples.
Requests Python library – Retrieving web pages shouldn’t be a chore, and the Requests library makes it super easy. There’s a great example in the post if you need to send a multi-part POST request.
HTML Scraping — The Hitchhiker’s Guide to Python – A good place to start if you want to get coding immediately and skip the stuff above.
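As a small taste of why Requests is pleasant to work with, here is a sketch that builds the same Amazon search URL used in the script above, letting Requests handle the URL-encoding of the query parameters. No request is actually sent:

```python
import requests

# Build (but don't send) the search request; Requests encodes the
# query parameters, so we don't hand-write %3D or + ourselves
req = requests.Request(
    "GET",
    "http://www.amazon.com/s/ref=nb_sb_noss_1",
    params={"url": "search-alias=aps", "field-keywords": "lego 21115"},
).prepare()
print(req.url)
```

The prepared URL matches the hard-coded one in the script, which is a handy way to double-check query strings before fetching anything.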