Learning Objectives

  • Clean text data with regular expressions and tokenization

  • Learn lemmatizing and stemming, including how and when to use these techniques

  • Transform data with CountVectorizer and TfidfVectorizer

  • Fit machine learning models in scikit-learn and evaluate their performance

  • Build pipelines and use GridSearchCV to search over NLP hyperparameters

Course Outline

Module 1: Introduction to Natural Language Processing (NLP)

- What is natural language processing?

- What are the applications of NLP?

- What is bias in NLP?


Module 2: Cleaning Text Data

- What is tokenizing, and how do we do it?

- What are regular expressions (RegEx), and how can they be used?

- What are lemmatizing and stemming?


Module 3: Converting Text Data to Model Features

- What is vectorizing?

- How do we properly construct training and testing sets when working with NLP vectors?

- What is CountVectorizer and when should it be used?

- What is TfidfVectorizer and when should it be used?


Module 4: Hyperparameters in NLP

- What are hyperparameters? What are NLP hyperparameters?

- What are stop words and how do they affect our model?

- What are n-grams and how do they affect our model?

- What are max_features, max_df, and min_df, and how do they affect our model?


Module 5: Machine Learning with Pipelines in NLP

- What should we consider when fitting machine learning models in NLP?

- What are pipelines and GridSearchCV?

- How do we automate model selection with pipelines?

Instructor's Bio: Matt Brems

Matt is currently Managing Partner and Principal Data Scientist at BetaVector. His full-time professional data work spans finance, education, consumer-packaged goods, and politics, and he earned General Assembly's 2019 "Distinguished Faculty Member of the Year" award. He holds a master's degree in statistics from Ohio State. Matt is passionate about responsibly putting the power of machine learning into the hands of as many people as possible and about mentoring folx in data and tech careers. He also volunteers with Statistics Without Borders and currently serves on its Executive Committee as the Marketing & Communications Director.

Who will be interested in this course?

  • All code is written in Python, so Python experience is helpful. However, solutions are provided, so a Python background is not required.

  • Some experience with machine learning is helpful but not necessary.

  • No background in NLP is required.