Course Abstract

Training duration: 4 hr (Hands-on)

How many times a day do you use a search engine or autocorrect? Do you translate text from one language to another? Getting computers to understand language the way humans do is the key to solving many problems. However, there is a lot to learn, and this course is the perfect place to start. We'll begin by defining natural language processing (NLP) and exploring its uses. We'll see how NLP can be biased and how to proactively reduce that bias. We'll walk through the process of tackling NLP problems, including cleaning text data and converting it into a form we can build models with. We'll cover vectorizers, hyperparameters, and pipelines. You'll come away with a full understanding of how to tackle an NLP problem. All of this will be done in Python, and we'll write the code together. If you don't have a strong Python background or don't know much about machine learning yet, that's OK! We'll assume no prior knowledge and get you set up. This course is perfect for beginners, for those who want to learn how to do these things in Python, and for those who want to refresh their skills.

DIFFICULTY LEVEL: BEGINNER

Learning Objectives

  • Clean text data with regular expressions and tokenization

  • Learn lemmatizing and stemming, including how and when to use these techniques

  • Transform data with CountVectorizer and TfidfVectorizer

  • Fit machine learning models in scikit-learn and evaluate their performance

  • Build pipelines and grid search over NLP hyperparameters with GridSearchCV

Instructor Bio:

Matt Brems

Global Lead Data Science Instructor | General Assembly

Matt is currently Managing Partner and Principal Data Scientist at BetaVector. His full-time professional data work spans finance, education, consumer packaged goods, and politics, and he earned General Assembly's 2019 "Distinguished Faculty Member of the Year" award. Matt earned his Master's degree in statistics from Ohio State. Matt is passionate about responsibly putting the power of machine learning into the hands of as many people as possible and about mentoring folx in data and tech careers. Matt also volunteers with Statistics Without Borders and currently serves on their Executive Committee as the Marketing & Communications Director.

Course Outline

Module 1: Introduction to Natural Language Processing (NLP)

- What is natural language processing?

- What are the applications of NLP?

- What is bias in NLP?


Module 2: Cleaning Text Data

- What is tokenizing, and how do we do it?

- What are regular expressions (RegEx), and how can they be used?

- What are lemmatizing and stemming?
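
As a rough illustration of the first two questions in this module, here is a minimal sketch that cleans and tokenizes a string using only Python's built-in re module. The function name clean_and_tokenize, the patterns, and the sample sentence are assumptions made up for this example, not the course's own code; lemmatizing and stemming usually rely on a library such as NLTK or spaCy and are not shown here.

```python
import re


def clean_and_tokenize(text):
    """Lowercase text, strip URLs, and split it into word tokens.

    Hypothetical helper for illustration only.
    """
    text = text.lower()
    # Remove anything that looks like a URL (a deliberately simple pattern).
    text = re.sub(r"https?://\S+", " ", text)
    # Keep runs of letters/digits, allowing one internal apostrophe ("don't").
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", text)


tokens = clean_and_tokenize("Don't miss it: https://example.com has 3 NEW courses!")
print(tokens)
```

A regular expression like this is only one way to tokenize; real text usually needs extra rules for hyphens, emoji, or accented characters.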


Module 3: Converting Text Data to Model Features

- What is vectorizing?

- How do we properly construct training and testing sets when working with NLP vectors?

- What is CountVectorizer, and when should it be used?

- What is TfidfVectorizer, and when should it be used?
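
The questions in this module can be sketched with scikit-learn's CountVectorizer and TfidfVectorizer. The tiny train and test documents below are invented for illustration. The point to notice is that the vectorizer is fit on the training text only and then reused to transform the test text, so words that appear only in the test set are ignored rather than leaking into the features.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, made up for this sketch.
train_docs = ["the cat sat", "the dog barked", "the cat purred"]
test_docs = ["the bird sang", "the cat barked"]

# Learn the vocabulary from the training documents only.
cvec = CountVectorizer()
X_train = cvec.fit_transform(train_docs)
# Reuse that vocabulary on the test documents; unseen words are dropped.
X_test = cvec.transform(test_docs)

vocab = sorted(cvec.vocabulary_)
print(vocab)
print(X_train.shape, X_test.shape)

# TfidfVectorizer follows the same fit/transform pattern, but it weights
# counts by how rare a word is and (by default) L2-normalizes each row.
tvec = TfidfVectorizer()
X_train_tfidf = tvec.fit_transform(train_docs)
X_test_tfidf = tvec.transform(test_docs)
print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Both training and test matrices end up with the same number of columns, which is what a downstream model needs.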


Module 4: Hyperparameters in NLP

- What are hyperparameters? What are NLP hyperparameters?

- What are stop words and how do they affect our model?

- What are n-grams and how do they affect our model?

- What are max_features, max_df, and min_df, and how do they affect our model?
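
To make these hyperparameters concrete, here is a small sketch showing how a few of them change the vocabulary that CountVectorizer learns. The three documents and the helper name vocab_for are assumptions for this example only; the parameter names (stop_words, ngram_range, min_df, max_df) are CountVectorizer's own.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, made up for this sketch.
docs = ["the movie was great", "the movie was terrible", "a great great cast"]


def vocab_for(**params):
    """Hypothetical helper: fit a CountVectorizer and return its sorted vocabulary."""
    return sorted(CountVectorizer(**params).fit(docs).vocabulary_)


# Default settings: single words of two or more characters.
print(vocab_for())

# stop_words="english" drops very common words such as "the".
print(vocab_for(stop_words="english"))

# ngram_range=(1, 2) keeps single words and adds two-word phrases.
print(vocab_for(ngram_range=(1, 2)))

# min_df=2 keeps only words that appear in at least two documents.
print(vocab_for(min_df=2))

# max_df=0.5 drops words that appear in more than half of the documents.
print(vocab_for(max_df=0.5))
```

Each setting trades vocabulary size against information kept, which is why these values are usually tuned rather than fixed in advance.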


Module 5: Machine Learning with Pipelines in NLP

- What should we consider when fitting machine learning models in NLP?

- What are pipelines and GridSearchCV?

- How do we automate model selection with pipelines?
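
One way to picture this module is a scikit-learn Pipeline that chains a vectorizer with a classifier, wrapped in GridSearchCV so the NLP hyperparameters are chosen by cross-validation. This is a minimal sketch, not the course's solution code: the eight labeled sentences, the step names "cvec" and "logreg", the choice of LogisticRegression, and the small parameter grid are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy sentiment data, made up for this sketch (1 = positive, 0 = negative).
texts = [
    "great movie loved it",
    "terrible movie hated it",
    "loved the great cast",
    "hated the terrible plot",
    "great fun loved the movie",
    "terrible bore hated the movie",
    "loved it great plot",
    "hated it terrible cast",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Step 1 vectorizes the text; step 2 fits a classifier on the vectors.
pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("logreg", LogisticRegression()),
])

# "<step name>__<parameter>" addresses a hyperparameter inside the pipeline.
param_grid = {
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "cvec__min_df": [1, 2],
}

# Try every combination with 2-fold cross-validation, then refit the best one.
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
prediction = search.predict(["loved this great movie"])
print(prediction)
```

Because the vectorizer lives inside the pipeline, it is refit on each training fold, so the cross-validation scores are not contaminated by vocabulary learned from the held-out fold.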

Background knowledge

  • All code is written in Python, so experience is helpful. However, solutions are provided, so a Python background is not required.

  • Some experience with machine learning is helpful, but not necessary.

  • No background in NLP is required.

Applicable Use-cases

  • Natural language processing is useful all over the place! Are you looking to build a chatbot? Are you working with text data like customer feedback that you need to clean and standardize? Just want to add a new skill to your resume?

  • Data scientists: If you're new to working with natural language data, this is for you.

  • Developers or engineers: If you know Python but don't have a ton of data science experience (especially with text data), this is for you.

  • Linguists and language enthusiasts: If you don't think you have the "technical background" of Python or machine learning, you'll be able to quickly level up. This is for you, too!