Course Abstract

Training duration: 2 hours

An algorithm is worse than useless without the right inputs. Learn how to supercharge your regressors and classifiers by acquiring and extracting data from nearly any website. Whether the data you're interested in is trapped inside an HTML table, rendered with JavaScript, or locked behind a login form, you'll learn how to extract it all in this on-demand course. You'll also learn how to schedule routine scraping tasks and how to send the results to a database or a Slack channel so that you can use the data in your model pipelines.


Learning Objectives

  • How to scrape nearly any website

  • How to automate some browser tasks (like clicking and scrolling)

  • How to schedule and repeat scraping jobs


Instructor Bio

Distinguished Faculty Member | General Assembly

Max Humber

Max Humber is the creator of gazpacho, the author of Personal Finance with Python, and a Distinguished Faculty Member at General Assembly.


Course Outline

Introduction ( 5 | 5 minutes ) 

  • Who am I, and who are you?

  • HTML/CSS Basics

  • Learning Agenda

Basic Web Scraping ( 15 | 20 minutes )

  • A quick review of how to fetch and parse HTML

  • How to target HTML element tags and attributes

  • Exercise: Scrape a “simple” website
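
To give a flavor of what targeting tags and attributes can look like, here is a minimal sketch that uses only Python's standard-library `html.parser`. It is not taken from the course materials; the `LinkCollector` class, the `extract_links` helper, and the sample HTML are hypothetical, and the course itself may use a different library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self.links.append(attrs["href"])


# Hypothetical page fragment; in practice this string would come from an HTTP fetch.
SAMPLE_HTML = """
<ul>
  <li><a class="product" href="/items/1">Widget</a></li>
  <li><a class="product featured" href="/items/2">Gadget</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
"""


def extract_links(html, css_class):
    """Return the href of every <a> tag in html that has css_class."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return parser.links


product_links = extract_links(SAMPLE_HTML, "product")
```

Here `product_links` holds only the two links tagged with the `product` class; the navigation link is skipped because its class does not match.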

HTML Parsing ( 15 | 35 minutes )

  • String manipulation techniques and list comprehensions for scraping

  • Looping, sleeping, and monitoring

  • Working with HTML tables

  • Exercise: Scrape a Wikipedia table
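
As a rough illustration of working with HTML tables and list comprehensions, the sketch below turns a small table into a list of dictionaries using only the standard library. The `TableParser` class, the `parse_table` helper, and the table contents are made up for this example and are not from the course.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample table; the names and counts are placeholders, not real data.
SAMPLE_TABLE = """
<table>
  <tr><th>Name</th><th>Count</th></tr>
  <tr><td>Alpha</td><td>1,200</td></tr>
  <tr><td>Beta</td><td>350</td></tr>
</table>
"""


def parse_table(html):
    """Return the table in html as a list of rows, each a list of cell strings."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows


rows = parse_table(SAMPLE_TABLE)
header, body = rows[0], rows[1:]

# String cleanup plus a list comprehension: strip thousands separators, cast to int.
records = [
    {"name": name, "count": int(count.replace(",", ""))}
    for name, count in body
]
```

The list comprehension at the end is where string manipulation earns its keep: scraped numbers usually arrive as formatted text and need cleaning before they can feed a model.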

Scraping JavaScript ( 15 | 50 minutes )

  • How to scrape data locked behind a login page

  • How to scrape data rendered with JavaScript

  • Exercise: Bypass a login page with credentials

Browser Automation ( 20 | 70 minutes )

  • Replicate scrolling and browser clicks to get at hard-to-parse data

  • How to scrape and download images

  • How to scrape and download video and audio

Scheduling ( 20 | 90 minutes )

  • How to put a scraper on a schedule 

  • How to send results to a Slack channel 

  • Exercise: Schedule a scraper locally
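
One simple way to think about local scheduling is a loop that runs a job, waits, and repeats. The sketch below assumes nothing beyond the standard library; `run_on_schedule` and `fake_scrape` are hypothetical names, and the course may well rely on a dedicated scheduling tool instead.

```python
import time


def run_on_schedule(job, interval_seconds, repeats, sleep=time.sleep):
    """Call job() `repeats` times, pausing interval_seconds between calls.

    The sleep function is injectable so the loop can be exercised without
    actually waiting.
    """
    results = []
    for i in range(repeats):
        results.append(job())
        # No need to wait after the final run.
        if i < repeats - 1:
            sleep(interval_seconds)
    return results


def make_fake_scrape():
    """Build a stand-in job that returns an incrementing run number."""
    state = {"runs": 0}

    def fake_scrape():
        state["runs"] += 1
        return {"run": state["runs"]}

    return fake_scrape


# Interval of 0 keeps the demonstration instant; a real job might use 3600 (hourly).
demo_results = run_on_schedule(make_fake_scrape(), interval_seconds=0, repeats=3)
```

In a real scraper, `fake_scrape` would be replaced by a function that fetches and parses a page, and the returned results could then be forwarded to a Slack channel or a database.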

Serverless ( 20 | 110 minutes )

  • How to schedule scrapers with AWS Lambda

  • How to save results to a database

  • Exercise: Use AWS to scrape a website

Conclusion ( 5 | 115 minutes )

Background knowledge

  • Required: Experience with Python

  • Nice-to-have: Some familiarity with BeautifulSoup