
Welcome to the most interesting (and fun!) blog post on web scraping for dummies. Mind you, this is not a typical web scraping tutorial. You will learn the whys and hows of data scraping along with a few interesting use-cases and fun facts.
The goal is to extract or “scrape” information from the posts on the front page of a subreddit, e.g. http://reddit.com/r/learnpython/new/
You should know that Reddit has an API and PRAW exists to make using it easier.
- You use it, taking the blue pill—the article ends.
- You take the red pill—you stay in Wonderland, and I show you how deep a JSON response goes.
Remember: all I’m offering is the truth. Nothing more.
Reddit allows you to add a `.json` extension to the end of your request and will give you back a JSON response instead of HTML.
We’ll be using `requests` as our “HTTP client”, which you can install using `pip install requests --user` if you have not already.
We’re setting the `User-Agent` header to `Mozilla/5.0` as the default `requests` value is blocked.
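Putting that together, a minimal request looks something like this (the variable names are mine, and `Mozilla/5.0` is just an example browser-like value):

```python
import requests

# Appending .json asks Reddit for a JSON response instead of HTML.
url = 'https://www.reddit.com/r/learnpython/new/.json'

# The default requests User-Agent is blocked, so send a browser-like one.
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get(url, headers=headers)
print(r.status_code)
```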
r.json()
We know that we’re receiving a JSON response from this request, so we use the `.json()` method on a Response object, which turns a JSON “string” into a Python structure (also see `json.loads()`).
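Continuing from the request above:

```python
# Parse the JSON body into Python dicts and lists.
data = r.json()

# Roughly equivalent to parsing the text yourself:
# import json
# data = json.loads(r.text)
```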
To see a pretty-printed version of the JSON data we can use `json.dumps()` with its `indent` argument.
The output generated for this particular response is quite large, so it makes sense to write the output to a file for further inspection.
Note: if you’re using Python 2 you’ll need `from __future__ import print_function` to have access to the `print()` function that has the `file` argument (or you could just use `json.dump()`).
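For example (`output.json` here is just a placeholder filename):

```python
import json

# Pretty-print the parsed response and write it out for inspection.
with open('output.json', 'w') as f:
    print(json.dumps(data, indent=4, sort_keys=True), file=f)

# Or serialise straight to the file instead:
# with open('output.json', 'w') as f:
#     json.dump(data, f, indent=4, sort_keys=True)
```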


Upon further inspection we can see that `r.json()['data']['children']` is a list of dicts and each dict represents a submission or “post”.
There is also some “subreddit” information available.
These `before` and `after` values are used for result page navigation, just like when you click on the `next` and `prev` buttons.
To get to the next page we can pass `after=t3_64o6gh` as a GET param.
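With `requests` that might look like the following, re-using the `url` and `headers` from earlier (`t3_64o6gh` is just the example value above):

```python
# requests turns the params dict into ?after=t3_64o6gh for us.
params = {'after': 't3_64o6gh'}

r = requests.get(url, headers=headers, params=params)
next_posts = r.json()['data']['children']
```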
When making multiple requests however, you will usually want to use a session object.
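A quick sketch of the session approach:

```python
# A Session re-uses the underlying connection and keeps default headers
# (like our User-Agent) across requests.
s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0'})

r = s.get('https://www.reddit.com/r/learnpython/new/.json',
          params={'after': 't3_64o6gh'})
```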
So as mentioned, each submission is a dict and the important information is available inside the `data` key.
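Roughly, each child is shaped like this (a sketch with placeholder values, not real output):

```python
post = {
    'kind': 't3',
    'data': {
        'author': 'some_redditor',
        'selftext': '',
        'title': 'an example title',
        'url': 'https://example.com/cat.jpg',
        # ...plus many more keys
    },
}
```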
I’ve truncated the output here but important values include `author`, `selftext`, `title` and `url`.
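For example, to print those values for each post:

```python
posts = r.json()['data']['children']

for post in posts:
    print(post['data']['author'])
    print(post['data']['title'])
    print(post['data']['url'])
```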
It’s pretty annoying having to use `['data']` all the time, so we could have instead declared `posts` using a list comprehension.
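Something like:

```python
# Index into 'data' once, up front, instead of on every access.
posts = [post['data'] for post in r.json()['data']['children']]

for post in posts:
    print(post['author'], post['title'], post['url'])
```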

One example of why you may want to do this is to “scrape” the links from one of the “image posting” subreddits in order to access the images.
r/aww
One such subreddit is r/aww, home of “teh cuddlez”.
Some of these URLs would require further processing though, as not all of them are direct links to image files and some of them are not images at all.
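A rough sketch, re-using the session from earlier and assuming we only keep URLs that end in a common image extension:

```python
r = s.get('https://www.reddit.com/r/aww/new/.json')
posts = [post['data'] for post in r.json()['data']['children']]

# Keep only links that already point directly at an image file.
image_urls = [post['url'] for post in posts
              if post['url'].endswith(('.jpg', '.jpeg', '.png', '.gif'))]
```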
In the case of the direct image links we could fetch them and save the result to disk.
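A minimal sketch, assuming the filename can be taken from the last part of each URL:

```python
import os

for image_url in image_urls:
    filename = os.path.basename(image_url)
    resp = s.get(image_url)
    if resp.ok:
        with open(filename, 'wb') as f:
            f.write(resp.content)
```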
BeautifulSoup
You could of course just request the regular URL and process the HTML with `BeautifulSoup` and `html5lib`, which you can install using `pip install beautifulsoup4 html5lib --user` if you do not already have them.
BeautifulSoup’s `select()` method locates items using CSS Selectors, and `div.thing` here matches `<div>` tags that contain `thing` as a class name, e.g. `class='thing'`.
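A sketch of that approach (this assumes the listing page markup uses `div.thing` as described):

```python
from bs4 import BeautifulSoup

# Fetch the regular HTML page this time (no .json extension).
r = s.get('https://www.reddit.com/r/aww/new/')
soup = BeautifulSoup(r.text, 'html5lib')

# div.thing matches <div> tags with 'thing' in their class attribute.
things = soup.select('div.thing')
```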
We can then use dict indexing on a `BeautifulSoup` Tag object to extract the value of a specific tag attribute.
In this case the URL is contained in the `data-url='..'` attribute of the `<div>` tag.
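For example:

```python
# A Tag supports dict-style indexing on its attributes.
urls = [thing['data-url'] for thing in things]
```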
As already mentioned, Reddit does have an API with rules / guidelines, and if you’re wanting to do any type of “large-scale” interaction with Reddit you should probably use it via the PRAW library.
