Modern Data Acquisition

This course is available only as part of a subscription.

Course Abstract

Training duration: 2 hours

An algorithm is worse than useless without the right inputs. Learn how to supercharge your regressors and classifiers by acquiring and extracting data from nearly any website. Whether the data you’re interested in is trapped inside an HTML table, rendered with JavaScript, or locked behind a login form, you’ll learn how to extract it all in this on-demand course. You’ll also learn how to schedule routine scraping tasks and how to send the results to a database or a Slack channel so that you can use the data in your model pipelines.

DIFFICULTY LEVEL: INTERMEDIATE

Learning Objectives

How to scrape nearly any website
How to automate some browser tasks (like clicking and scrolling)
How to schedule and repeat scraping jobs

Instructor(s)

Instructor Bio:

Distinguished Faculty Member | General Assembly

Max Humber

Max Humber is the creator of gazpacho, the author of Personal Finance with Python, and a Distinguished Faculty Member at General Assembly.

Course Outline

Introduction ( 5 | 5 minutes )

Who am I, and who are you?
HTML/CSS Basics
Learning Agenda

Basic Web Scraping ( 15 | 20 minutes )

A quick review of how to fetch and parse HTML
How to target HTML element tags and attributes
Exercise: Scrape a “simple” website
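
As a rough illustration of targeting tags and attributes, the sketch below collects every link that carries a given CSS class. It uses only Python's standard-library html.parser, which is an assumption made to keep the example self-contained; the course itself may use a different library. The names LinkCollector, extract_links, and SAMPLE_HTML are hypothetical, and the HTML is made-up sample data rather than a real page.

```python
from html.parser import HTMLParser

# Made-up sample page; a real scraper would fetch this text over HTTP.
SAMPLE_HTML = """
<ul>
  <li><a class="product" href="/items/1">Blue mug</a></li>
  <li><a class="nav" href="/about">About</a></li>
  <li><a class="product featured" href="/items/2">Red <b>mug</b></a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) for every <a> tag that has a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []        # finished (href, text) pairs
        self._inside = False   # True while a matching <a> is open
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._inside = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append((self._href, "".join(self._text).strip()))
            self._inside = False


def extract_links(html, css_class):
    """Return (href, text) pairs for links with the requested class."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    parser.close()
    return parser.links


product_links = extract_links(SAMPLE_HTML, "product")
```

For a live page, the HTML string would come from a request such as urllib.request.urlopen(url).read().decode(), after which extract_links works the same way.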

HTML Parsing ( 15 | 35 minutes )

String manipulation techniques and list comprehensions for scraping
Looping, sleeping, and monitoring
Working with HTML tables
Exercise: Scrape a Wikipedia table
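
The table, string-manipulation, and list-comprehension topics above can be sketched together as follows. This is a minimal example under stated assumptions: it uses the standard-library html.parser rather than whatever the course uses, it handles a single non-nested table, and TableParser, parse_table, to_int, and the sample rows are all hypothetical, made-up illustrations.

```python
from html.parser import HTMLParser

# Made-up table with the kind of noise scraped tables often contain:
# thousands separators, stray whitespace, and a footnote marker.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>1,234,567 </td></tr>
  <tr><td>Shelbyville</td><td>98,765<sup>[1]</sup></td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Turn the rows of one HTML table into lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the <tr> currently open
        self._cell = None   # text fragments of the open <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return one dict per body row, keyed by the header row's cells."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def to_int(text):
    """Drop a trailing footnote marker and separators, then convert."""
    return int(text.split("[")[0].replace(",", "").strip())


records = parse_table(SAMPLE_TABLE)
populations = [to_int(record["Population"]) for record in records]
```

When looping over many pages, a short time.sleep between requests keeps the scraper from hammering the site.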

Scraping JavaScript ( 15 | 50 minutes )

How to scrape data locked behind a login page
How to scrape data rendered with JavaScript
Exercise: Bypass a login page with credentials

Browser Automation ( 20 | 70 minutes )

Replicate scrolling and browser clicks to get at hard-to-parse data
How to scrape and download images
How to scrape and download video and audio

Scheduling ( 20 | 90 minutes )

How to put a scraper on a schedule
How to send results to a Slack channel
Exercise: Schedule a scraper locally
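
One possible shape for a locally scheduled scraper that reports to Slack is sketched below. It is not taken from the course: run_on_schedule and build_slack_request are hypothetical helpers, the webhook URL is a placeholder, and the example assumes a Slack incoming webhook, which accepts a JSON body with a "text" field. The request is only built here, not sent, so the sketch runs without network access.

```python
import json
import time
import urllib.request


def build_slack_request(webhook_url, text):
    """Prepare (but do not send) a POST for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_on_schedule(job, interval_seconds, repeats, sleep=time.sleep):
    """Call job() `repeats` times, pausing between runs; return the results.

    The sleep function is injectable so the loop can be tested
    without actually waiting.
    """
    results = []
    for i in range(repeats):
        results.append(job())
        if i < repeats - 1:   # no pause needed after the final run
            sleep(interval_seconds)
    return results


# Placeholder URL: a real one is issued by Slack when a webhook is created.
request = build_slack_request(
    "https://hooks.slack.com/services/EXAMPLE", "Scrape finished: 3 rows"
)
```

Passing the prepared request to urllib.request.urlopen would deliver the message. For long-running schedules, an operating-system scheduler such as cron is a common alternative to an in-process loop.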

Serverless ( 20 | 110 minutes )

How to schedule scrapers with AWS Lambda
How to save results to a database
Exercise: Use AWS to scrape a website
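
To make the serverless topics concrete, here is a minimal sketch of a function with the (event, context) signature that AWS Lambda calls for Python handlers, saving scraped rows to a database. Several parts are assumptions: scrape is a hypothetical stand-in that returns fixed sample rows, the prices table is invented for illustration, and an in-memory SQLite database stands in for whatever persistent database a real deployment would use.

```python
import sqlite3
from datetime import datetime, timezone


def save_results(conn, rows):
    """Insert (name, price) rows with a timestamp; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "name TEXT, price REAL, scraped_at TEXT)"
    )
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO prices (name, price, scraped_at) VALUES (?, ?, ?)",
        [(name, price, now) for name, price in rows],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]


def scrape():
    """Hypothetical stand-in for a real scraper; returns fixed sample rows."""
    return [("Blue mug", 12.5), ("Red mug", 14.0)]


def handler(event, context):
    """Entry point shaped like an AWS Lambda Python handler."""
    # Assumption: in-memory SQLite keeps the sketch self-contained.
    # Data stored this way disappears when the connection closes.
    conn = sqlite3.connect(":memory:")
    try:
        stored = save_results(conn, scrape())
    finally:
        conn.close()
    return {"statusCode": 200, "rows_stored": stored}


result = handler({}, None)
```

On AWS, a scheduled rule can invoke such a handler at a fixed interval, and the SQLite connection would be replaced by a connection to a database that outlives each invocation.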

Conclusion ( 5 | 115 minutes )

Background knowledge

Required: Experience with Python
Nice to have: Some familiarity with BeautifulSoup
