Course Abstract

Training duration: 2 hours

An algorithm is worse than useless without the right inputs. Learn how to supercharge your regressors and classifiers by acquiring and extracting data from nearly any website. Whether the data you're interested in is trapped inside an HTML table, rendered with JavaScript, or locked behind a login form, you'll learn how to extract it all in this on-demand course. You'll also learn how to schedule routine scraping tasks and how to send the results to a database or a Slack channel so that you can use the data in your model pipelines.

DIFFICULTY LEVEL: INTERMEDIATE

Learning Objectives

  • How to scrape nearly any website

  • How to automate some browser tasks (like clicking and scrolling)

  • How to schedule and repeat scraping jobs

Instructor(s)

Instructor Bio:

Distinguished Faculty Member | General Assembly

Max Humber

Max Humber is the creator of gazpacho, the author of Personal Finance with Python, and a Distinguished Faculty Member at General Assembly.

Course Outline

Introduction ( 5 | 5 minutes ) 

  • Who am I, and who are you?

  • HTML/CSS Basics

  • Learning Agenda


Basic Web Scraping ( 15 | 20 minutes )

  • A quick review of how to fetch and parse HTML

  • How to target HTML element tags and attributes

  • Exercise: Scrape a “simple” website
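
As a rough illustration of what targeting tags and attributes can look like, here is a minimal sketch using only Python's standard library. The course may well use a different library; the class name `LinkCollector`, the helper `extract_links`, and the sample HTML are all hypothetical and not taken from the course materials.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and "href" in attrs:
            self.links.append(attrs["href"])


# Hypothetical page content; in practice this string would come from an HTTP request.
SAMPLE_HTML = """
<ul>
  <li><a class="product" href="/items/1">Widget</a></li>
  <li><a class="nav" href="/about">About</a></li>
  <li><a class="product featured" href="/items/2">Gadget</a></li>
</ul>
"""


def extract_links(html, css_class):
    """Return the hrefs of all matching <a> tags, in document order."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return parser.links
```

Calling `extract_links(SAMPLE_HTML, "product")` keeps the two product links and skips the navigation link, because the filter checks both the tag name and the class attribute.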


HTML Parsing ( 15 | 35 minutes )

  • String manipulation techniques and list comprehensions for scraping

  • Looping, sleeping, and monitoring

  • Working with HTML tables

  • Exercise: Scrape a Wikipedia table
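
The topics above (HTML tables, string manipulation, list comprehensions) might combine along the following lines. This is a sketch under stated assumptions, using only the standard library: the table contents are invented, and `TableParser`, `table_to_records`, and `parse_populations` are hypothetical names rather than anything from the course.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Turn the <tr>/<th>/<td> structure of a table into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Invented sample data, standing in for a table fetched from a real page.
SAMPLE_TABLE = """<table>
<tr><th>City</th><th>Population</th></tr>
<tr><td>Springfield</td><td> 1,200 </td></tr>
<tr><td>Shelbyville</td><td>950</td></tr>
</table>"""


def table_to_records(html):
    """Use the first row as the header and return one dict per body row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def parse_populations(records):
    """Strip thousands separators and convert the Population column to int."""
    return [int(r["Population"].replace(",", "")) for r in records]
```

The list comprehensions do the cleanup in one pass each: one pairs header names with cell values, the other removes commas before converting to integers.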


Scraping JavaScript ( 15 | 50 minutes )

  • How to scrape data locked behind a login page

  • How to scrape data rendered with JavaScript

  • Exercise: Bypass a login page with credentials


Browser Automation ( 20 | 70 minutes )

  • Replicate scrolling and browser clicks to get at hard-to-parse data

  • How to scrape and download images

  • How to scrape and download video and audio


Scheduling ( 20 | 90 minutes )

  • How to put a scraper on a schedule 

  • How to send results to a Slack channel 

  • Exercise: Schedule a scraper locally
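
One simple way to picture local scheduling and Slack delivery is sketched below with the standard library only. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field; the function names `run_on_schedule` and `build_slack_request` are hypothetical, and the course may rely on a dedicated scheduling tool instead of a plain loop.

```python
import json
import time
import urllib.request


def run_on_schedule(job, interval_seconds, repeats, sleep=time.sleep):
    """Call job() `repeats` times, pausing between consecutive runs.

    The sleep function is injectable so the loop can be exercised
    without actually waiting.
    """
    results = []
    for i in range(repeats):
        results.append(job())
        if i < repeats - 1:  # no pause needed after the final run
            sleep(interval_seconds)
    return results


def build_slack_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a real webhook URL, passing the request to `urllib.request.urlopen` would deliver the message. Keeping request construction separate from sending makes the scraper easier to check offline.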


Serverless ( 20 | 110 minutes )

  • How to schedule scrapers with AWS Lambda

  • How to save results to a database

  • Exercise: Use AWS to scrape a website
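
Saving results to a database could look roughly like the sketch below. It uses SQLite from the standard library purely as a stand-in; the outline does not say which database the course uses, and the table name `prices`, its columns, and the `save_results` helper are hypothetical.

```python
import sqlite3


def save_results(conn, rows):
    """Insert (item, price) tuples into a results table, creating it if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "item TEXT NOT NULL, "
        "price REAL NOT NULL, "
        "scraped_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    # Parameterized queries keep scraped text from being interpreted as SQL.
    conn.executemany("INSERT INTO prices (item, price) VALUES (?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
```

For a quick local trial, `sqlite3.connect(":memory:")` provides a throwaway database. A scheduled serverless job would need a database that persists between runs.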


Conclusion ( 5 | 115 minutes )

Background knowledge

  • Required: Experience with Python

  • Nice-to-have: Some familiarity with BeautifulSoup