Web Scraping For Profit Reddit




Welcome to the most interesting (and fun!) blog post on web scraping for dummies. Mind you, this is not a typical web scraping tutorial: you will learn the whys and hows of data scraping along with a few interesting use-cases and fun facts.

The goal is to extract or “scrape” information from the posts on the front page of a subreddit, e.g. http://reddit.com/r/learnpython/new/

You should know that Reddit has an API and PRAW exists to make using it easier.

  • You use it, taking the blue pill—the article ends.
  • You take the red pill—you stay in Wonderland, and I show you how deep a JSON response goes.

Remember: all I’m offering is the truth. Nothing more.

Reddit allows you to add a .json extension to the end of your request and will give you back a JSON response instead of HTML.

We’ll be using requests as our “HTTP client” which you can install using pip install requests --user if you have not already.

We’re setting the User-Agent header to Mozilla/5.0 as the default requests value is blocked.
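
Putting that together, a minimal sketch of the request (using the r/learnpython URL from above):

```python
import requests

# Adding .json to the URL asks Reddit for JSON instead of HTML
url = 'https://www.reddit.com/r/learnpython/new/.json'

# The default requests User-Agent is blocked, so we supply our own
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
```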

```python
r.json()
```

We know that we’re receiving a JSON response from this request so we use the .json() method on a Response object which turns a JSON “string” into a Python structure (also see json.loads()).

To see a pretty-printed version of the JSON data we can use json.dumps() with its indent argument.

The output generated for this particular response is quite large so it makes sense to write the output to a file for further inspection.

Note if you’re using Python 2 you’ll need from __future__ import print_function to have access to the print() function that has the file argument (or you could just use json.dump()).
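
A sketch of that, writing the pretty-printed JSON to output.json (the filename is just for illustration):

```python
import json

data = r.json()

# indent=4 pretty-prints the structure for easier reading
with open('output.json', 'w') as f:
    print(json.dumps(data, indent=4), file=f)
    # json.dump(data, f, indent=4) would achieve the same thing
```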


Upon further inspection we can see that r.json()['data']['children'] is a list of dicts and each dict represents a submission or “post”.
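
A quick check of that structure (continuing with r from above):

```python
posts = r.json()['data']['children']

print(type(posts))      # <class 'list'>
print(type(posts[0]))   # <class 'dict'>
print(posts[0].keys())  # dict_keys(['kind', 'data'])
```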

There is also some “subreddit” information available.

These before and after values are used for result page navigation, just like when you click on the next and prev buttons.

To get to the next page we can pass after=t3_64o6gh as a GET param.

When making multiple requests however, you will usually want to use a session object.
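
A sketch of paginating with a session (the after value comes from the previous response rather than being hard-coded):

```python
import requests

s = requests.Session()
# The session remembers our headers for every request
s.headers.update({'User-Agent': 'Mozilla/5.0'})

url = 'https://www.reddit.com/r/learnpython/new/.json'
r = s.get(url)

# Pass the after value back as a GET param to fetch the next page
after = r.json()['data']['after']
r = s.get(url, params={'after': after})
```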

So as mentioned each submission is a dict and the important information is available inside the data key:
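
For example, pretty-printing the first post’s data:

```python
from pprint import pprint

posts = r.json()['data']['children']

# Each post is a dict and the interesting fields live under 'data'
pprint(posts[0]['data'])
```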

The output is quite large so I won’t reproduce it all here, but important values include author, selftext, title and url.

It’s pretty annoying having to use ['data'] all the time so we could have instead declared posts using a list comprehension.
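
A sketch of that approach:

```python
posts = [post['data'] for post in r.json()['data']['children']]

# No more ['data'] indexing needed
print(posts[0]['title'])
```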


One example of why you may want to do this is to “scrape” the links from one of the “image posting” subreddits to access the images.

r/aww

One such subreddit is r/aww, home of “teh cuddlez”.

Some of these URLs would require further processing though, as not all of them are direct links to images and some of them are not images at all.


In the case of the direct image links we could fetch them and save the result to disk.
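
A sketch of that, using a simple file-extension check to spot direct image links (a heuristic, not a guarantee):

```python
import os
import requests

s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0'})

r = s.get('https://www.reddit.com/r/aww/new/.json')
posts = [post['data'] for post in r.json()['data']['children']]

for post in posts:
    url = post['url']
    # Only fetch URLs that look like direct image links
    if url.endswith(('.jpg', '.png', '.gif')):
        image = s.get(url)
        # Name the file after the last part of the URL
        with open(os.path.basename(url), 'wb') as f:
            f.write(image.content)
```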

BeautifulSoup

You could of course just request the regular URL and process the HTML with BeautifulSoup and html5lib, which you can install using pip install beautifulsoup4 html5lib --user if you do not already have them.

BeautifulSoup’s select() method locates items using CSS Selectors and div.thing here matches <div> tags that contain thing as a class name, e.g. class='thing'

We can then use dict indexing on a BeautifulSoup Tag object to extract the value of a specific tag attribute.


In this case the URL is contained in the data-url='..' attribute of the <div> tag.
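
Putting those pieces together, a sketch based on the markup described above (old.reddit.com still serves this old-style markup; the redesigned front end uses different HTML):

```python
import requests
from bs4 import BeautifulSoup

r = requests.get('https://old.reddit.com/r/aww/new/',
                 headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(r.text, 'html5lib')

# div.thing matches <div> tags with "thing" in their class attribute
for div in soup.select('div.thing'):
    # Dict indexing on a Tag pulls out the data-url attribute
    print(div['data-url'])
```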


As already mentioned Reddit does have an API with rules / guidelines and if you’re wanting to do any type of “large-scale” interaction with Reddit you should probably use it via the PRAW library.