Web Scraping with Python





The internet has an amazingly wide variety of information for human consumption. But this data is often difficult to access programmatically if it doesn't come in the form of a dedicated REST API. With Python tools like Beautiful Soup, you can scrape and parse this data directly from web pages to use for your projects and applications.

Let's use the example of scraping MIDI data from the internet to train a neural network with Magenta that can generate classic Nintendo-sounding music. In order to do this, we'll need a set of MIDI music from old Nintendo games. Using Beautiful Soup we can get this data from the Video Game Music Archive.

Getting started and setting up dependencies

Before moving on, you will need to make sure you have an up to date version of Python 3 and pip installed. Make sure you create and activate a virtual environment before installing any dependencies.

You'll need to install the Requests library for making HTTP requests to get data from the web page, and Beautiful Soup for parsing through the HTML.

With your virtual environment activated, run the following command in your terminal:
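
pip install requests beautifulsoup4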

We're using Beautiful Soup 4 because it's the latest version and Beautiful Soup 3 is no longer being developed or supported.

Using Requests to scrape data for Beautiful Soup to parse

First let's write some code to grab the HTML from the web page, and look at how we can start parsing through it. The following code will send a GET request to the web page we want, and create a BeautifulSoup object with the HTML from that page:
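
A minimal sketch of that, assuming the target is the Video Game Music Archive's NES page (the exact URL here is an assumption for illustration):

import requests
from bs4 import BeautifulSoup

# The page listing NES MIDI files (assumed URL)
vgm_url = 'https://www.vgmusic.com/music/console/nintendo/nes/'

# Send a GET request and grab the raw HTML of the page
html_text = requests.get(vgm_url).text

# Parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(html_text, 'html.parser')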

With this soup object, you can navigate and search through the HTML for data that you want. For example, if you run soup.title after the previous code in a Python shell you'll get the title of the web page. If you run print(soup.get_text()), you will see all of the text on the page.

Getting familiar with Beautiful Soup

The find() and find_all() methods are among the most powerful weapons in your arsenal. soup.find() is great for cases where you know there is only one element you're looking for, such as the body tag. On this page, soup.find(id='banner_ad').text will get you the text from the HTML element for the banner advertisement.

soup.find_all() is the most common method you will be using in your web scraping adventures. Using this you can iterate through all of the hyperlinks on the page and print their URLs:
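
For example, using the soup object from before:

for link in soup.find_all('a'):
    print(link.get('href'))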

You can also provide different arguments to find_all, such as regular expressions or tag attributes to filter your search as specifically as you want. You can find lots of cool features in the documentation.

Parsing and navigating HTML with BeautifulSoup

Before writing more code to parse the content that we want, let’s first take a look at the HTML that’s rendered by the browser. Every web page is different, and sometimes getting the right data out of them requires a bit of creativity, pattern recognition, and experimentation.

Our goal is to download a bunch of MIDI files, but there are a lot of duplicate tracks on this webpage as well as remixes of songs. We only want one of each song, and because we ultimately want to use this data to train a neural network to generate accurate Nintendo music, we won't want to train it on user-created remixes.

When you're writing code to parse through a web page, it's usually helpful to use the developer tools available to you in most modern browsers. If you right-click on the element you're interested in, you can inspect the HTML behind that element to figure out how you can programmatically access the data you want.

Let's use the find_all method to go through all of the links on the page, but use regular expressions to filter through them so we are only getting links that contain MIDI files whose text has no parentheses, which will allow us to exclude all of the duplicates and remixes.

Create a file called nes_midi_scraper.py and add the following code to it:
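
Something like this should work (a sketch: the VGM Archive URL and the exact regular expressions are assumptions based on the description above, not necessarily the original article's verbatim code):

import re
import requests
from bs4 import BeautifulSoup

# The page listing NES MIDI files (assumed URL)
vgm_url = 'https://www.vgmusic.com/music/console/nintendo/nes/'
html_text = requests.get(vgm_url).text
soup = BeautifulSoup(html_text, 'html.parser')

if __name__ == '__main__':
    # Only links whose href ends in .mid
    attrs = {'href': re.compile(r'\.mid$')}
    # Only link text containing no parentheses (skips duplicates and remixes)
    tracks = soup.find_all('a', attrs=attrs, string=re.compile(r'^((?!\().)*$'))

    count = 0
    for track in tracks:
        print(track)
        count += 1
    print('Found {} tracks'.format(count))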

This will filter through all of the MIDI files that we want on the page, print out the link tag corresponding to them, and then print how many files we filtered.

Run the code in your terminal with the command python nes_midi_scraper.py.


Downloading the MIDI files we want from the webpage

Now that we have working code to iterate through every MIDI file that we want, we have to write code to download all of them.

In nes_midi_scraper.py, add a function to your code called download_track, and call that function for each track in the loop iterating through them:
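
One way this might look, reusing the names from the sketch above (the filename scheme here is an assumption):

def download_track(count, track_element):
    # Build the filename from the unique count and the track title
    track_title = track_element.text.strip().replace('/', '-')
    download_url = '{}{}'.format(vgm_url, track_element['href'])
    file_name = '{}_{}.mid'.format(count, track_title)

    # Download the track and write its bytes to disk
    r = requests.get(download_url, allow_redirects=True)
    with open(file_name, 'wb') as f:
        f.write(r.content)

if __name__ == '__main__':
    attrs = {'href': re.compile(r'\.mid$')}
    tracks = soup.find_all('a', attrs=attrs, string=re.compile(r'^((?!\().)*$'))

    count = 0
    for track in tracks:
        download_track(count, track)
        count += 1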

In this download_track function, we're passing the Beautiful Soup object representing the HTML element of the link to the MIDI file, along with a unique number to use in the filename to avoid possible naming collisions.


Run this code from a directory where you want to save all of the MIDI files, and watch your terminal display all 2230 MIDI files it downloads (that was the count at the time of writing). This is just one specific practical example of what you can do with Beautiful Soup.

The vast expanse of the World Wide Web

Now that you can programmatically grab things from web pages, you have access to a huge source of data for whatever your projects need. One thing to keep in mind is that changes to a web page’s HTML might break your code, so make sure to keep everything up to date if you're building applications on top of this.

If you're looking for something to do with the data you just grabbed from the Video Game Music Archive, you can try using Python libraries like Mido to work with and clean up the MIDI data, use Magenta to train a neural network with it, or have fun building a phone number people can call to hear Nintendo music.

I’m looking forward to seeing what you build. Feel free to reach out and share your experiences or ask any questions.

  • Email: sagnew@twilio.com
  • Twitter: @Sagnewshreds
  • Github: Sagnew
  • Twitch (streaming live code): Sagnewshreds

The internet evolves fast, and modern websites often use dynamic content loading mechanisms to provide the best user experience. On the other hand, this makes it harder to extract data from such web pages, as it requires executing the page's internal Javascript while scraping. Let's review several conventional techniques that allow data extraction from dynamic websites using Python.

What is a dynamic website?

A dynamic website is a type of website that can update or load content after the initial HTML load. The browser receives basic HTML with JS and then loads the content using the received Javascript code. This approach increases page load speed and avoids reloading the same layout each time you'd like to open a new page.

Usually, dynamic websites use AJAX to load content dynamically, or even the whole site is based on a Single-Page Application (SPA) technology.

In contrast to dynamic websites, static websites contain all of the requested content at page load.

A great example of a static website is example.com:

The whole content of this website is loaded as plain HTML during the initial page load.

To demonstrate the basic idea of a dynamic website, we can create a web page that contains dynamically rendered text. It will not make any request to fetch information; it just renders different HTML after the page load:

<html>
<head>
<script>
window.addEventListener('DOMContentLoaded', function() {
    document.getElementById('test').innerHTML = 'I ❤️ ScrapingAnt'
})
</script>
</head>
<body>
<div id="test">Web Scraping is hard</div>
</body>
</html>

All we have here is an HTML file with a single <div> in the body that contains the text 'Web Scraping is hard', but after the page load, that text is replaced with the text generated by the Javascript:

window.addEventListener('DOMContentLoaded', function() {
    document.getElementById('test').innerHTML = 'I ❤️ ScrapingAnt'
})

To prove this, let's open this page in the browser and observe the dynamically replaced text.

Alright, so the browser displays the text, wrapped in HTML tags.
Can't we use BeautifulSoup or LXML to parse it? Let's find out.

Extract data from a dynamic web page

BeautifulSoup is one of the most popular Python libraries across the Internet for HTML parsing. Almost 80% of web scraping Python tutorials use this library to extract required content from the HTML.

Let's use BeautifulSoup for extracting the text inside <div> from our sample above.

import os
from bs4 import BeautifulSoup

# Open the local test HTML file and parse it
test_file = open(os.getcwd() + '/test.html')
soup = BeautifulSoup(test_file, 'html.parser')
print(soup.find(id='test').get_text())

This code snippet uses the os library to open our test HTML file (test.html) from the local directory and creates an instance of BeautifulSoup stored in the soup variable. Using the soup object, we find the tag with id test and extract the text from it.

We've already seen in the browser that the content of the test page is I ❤️ ScrapingAnt, but the code snippet's output is the following:

Web Scraping is hard

And the result is different from our expectation (unless you've already figured out what is going on there). Everything is correct from the BeautifulSoup perspective - it parsed the data from the provided HTML file, but we want to get the same result as the browser renders. The reason is that the dynamic Javascript has not been executed during HTML parsing.

We need the HTML to be run in a browser to see the correct values and then be able to capture those values programmatically.

Below you can find four different ways to execute a dynamic website's Javascript and provide valid data to an HTML parser: Selenium, Pyppeteer, Playwright, and a web scraping API.

Selenium: web scraping with a webdriver

Selenium is one of the most popular web browser automation tools for Python. It allows communication with different web browsers by using a special connector - a webdriver.

To use Selenium with Chrome/Chromium, we'll need to download the webdriver from the repository and place it into the project folder. Don't forget to install Selenium itself by executing:
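
pip install selenium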

The Selenium instantiation and scraping flow is the following:

  • define and set up the Chrome path variable
  • define and set up the Chrome webdriver path variable
  • define the browser launch arguments (to use headless mode, proxy, etc.)
  • instantiate a webdriver with the options defined above
  • load a webpage via the instantiated webdriver

In code, it looks like the following:

import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
# opts.add_argument('--headless')  # Uncomment if the headless version is needed
opts.binary_location = '<path to Chrome executable>'
# Set the location of the webdriver
chrome_driver = os.getcwd() + '/<Chrome webdriver filename>'
# Instantiate a webdriver
driver = webdriver.Chrome(options=opts, executable_path=chrome_driver)
# Load the HTML page
page_path = 'file://' + os.getcwd() + '/test.html'
driver.get(page_path)
# Parse the rendered page source
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(id='test').get_text())
# Quit the browser when done
driver.quit()

And finally, we'll receive the required result:

I ❤️ ScrapingAnt


Selenium usage for dynamic website scraping with Python is not complicated, and it allows you to choose a specific browser and version, but it consists of several moving components that have to be maintained. The code itself also contains some boilerplate parts like the setup of the browser, the webdriver, etc.

I like to use Selenium for my web scraping projects, but you can find easier ways to extract data from dynamic web pages below.

Pyppeteer: Python headless Chrome

Pyppeteer is an unofficial Python port of Puppeteer, the JavaScript (headless) Chrome/Chromium browser automation library. It is capable of doing mostly the same things Puppeteer can, but using Python instead of NodeJS.

Puppeteer is a high-level API to control headless Chrome, so it allows you to automate actions you'd otherwise do manually in the browser: copy a page's text, download images, save a page as HTML or PDF, etc.

To install Pyppeteer you can execute the following command:
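
pip install pyppeteer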

The usage of Pyppeteer for our needs is much simpler than Selenium:

import asyncio
import os
from bs4 import BeautifulSoup
from pyppeteer import launch

async def main():
    # Launch the browser
    browser = await launch()
    # Open a new browser page
    page = await browser.newPage()
    # Create a URI for our test file
    page_path = 'file://' + os.getcwd() + '/test.html'
    # Open our test file in the opened page
    await page.goto(page_path)
    # Extract the rendered HTML
    page_content = await page.content()
    # Process extracted content with BeautifulSoup
    soup = BeautifulSoup(page_content, 'html.parser')
    print(soup.find(id='test').get_text())
    # Close the browser
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

I've tried to comment on every atomic part of the code for a better understanding. However, generally, we've just opened a browser page, loaded a local HTML file into it, and extracted the final rendered HTML for further BeautifulSoup processing.

As we can expect, the result is the following:

I ❤️ ScrapingAnt

We did it again, this time without worrying about finding, downloading, and connecting a webdriver to a browser. However, Pyppeteer looks abandoned and not properly maintained. This situation may change in the near future, but I'd suggest looking at a more powerful library.

Playwright: Chromium, Firefox and Webkit browser automation


Playwright can be considered an extended Puppeteer, as it allows using more browser types (Chromium, Firefox, and Webkit) to automate modern web app testing and scraping. You can use the Playwright API in JavaScript & TypeScript, Python, C#, and Java. And it's excellent, as the original Playwright maintainers support Python.

The API is almost the same as Pyppeteer's, but it has both sync and async versions.

Installation is as simple as always:

pip install playwright
playwright install

Let's rewrite the previous example using Playwright.

import os
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    # Open a new browser page
    page = browser.new_page()
    # Create a URI for our test file
    page_path = 'file://' + os.getcwd() + '/test.html'
    # Open our test file in the opened page
    page.goto(page_path)
    # Extract the rendered HTML
    page_content = page.content()
    # Process extracted content with BeautifulSoup
    soup = BeautifulSoup(page_content, 'html.parser')
    print(soup.find(id='test').get_text())
    # Close the browser
    browser.close()
As a good tradition, we can observe our beloved output:

I ❤️ ScrapingAnt

We've gone through several different data extraction methods with Python, but is there a more straightforward way to do this job? How can we scale our solution and scrape data with several threads?

Meet the web scraping API!


Web Scraping API

The ScrapingAnt web scraping API provides the ability to scrape dynamic websites with only a single API call. It already handles headless Chrome and rotating proxies, so the response will already consist of the Javascript-rendered content. ScrapingAnt's proxy pool prevents blocking and provides a constant and high data extraction success rate.

Using a web scraping API is the simplest option and requires only basic programming skills.

You do not need to maintain the browser, library, proxies, webdrivers, or any other aspect of the web scraper, and you can focus on the most exciting part of the work - data analysis.

As the web scraping API runs on the cloud servers, we have to serve our file somewhere to test it. I've created a repository with a single file: https://github.com/kami4ka/dynamic-website-example/blob/main/index.html


To check it out as HTML, we can use another great tool: HTMLPreview

The final test URL to scrape dynamic web data looks like this: http://htmlpreview.github.io/?https://github.com/kami4ka/dynamic-website-example/blob/main/index.html

The scraping code itself is the simplest one of all four described libraries. We'll use the ScrapingAntClient library to access the web scraping API.

Let's install it first:
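
pip install scrapingant-client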

And use the installed library:

from bs4 import BeautifulSoup
from scrapingant_client import ScrapingAntClient

# Define URL with dynamic web content
url = 'http://htmlpreview.github.io/?https://github.com/kami4ka/dynamic-website-example/blob/main/index.html'

# Create a ScrapingAntClient instance
client = ScrapingAntClient(token='<YOUR-SCRAPINGANT-API-TOKEN>')

# Get the rendered content of the HTML page
page_content = client.general_request(url).content

# Parse content with BeautifulSoup
soup = BeautifulSoup(page_content, 'html.parser')
print(soup.find(id='test').get_text())

To get your API token, please visit the Login page to authorize in the ScrapingAnt user panel. It's free.

And the result is still the required one:

I ❤️ ScrapingAnt

All the headless browser magic happens in the cloud, so you only need to make an API call to get the result.

Check out the documentation for more info about ScrapingAnt API.

Summary


Today we've checked four free tools that allow scraping dynamic websites with Python. All these libraries use a headless browser (or an API with a headless browser) under the hood to correctly render the internal Javascript inside an HTML page. Check out each tool's documentation to find out more and choose the handiest one for your needs.


Happy web scraping, and don't forget to use proxies to avoid blocking 🚀