Course Abstract
Training duration: 4 hours (hands-on)
Learning Objectives
- Clean text data with regular expressions and tokenization
- Learn lemmatizing and stemming, including how and when to use these techniques
- Transform data with CountVectorizer and TfidfVectorizer
- Fit machine learning models in scikit-learn and evaluate their performance
- Build pipelines and run a GridSearch over NLP hyperparameters
Instructor Bio:
Matt Brems

Global Lead Data Science Instructor | General Assembly
Course Outline
Module 1: Introduction to Natural Language Processing (NLP)
- What is natural language processing?
- What are the applications of NLP?
- What is bias in NLP?
Module 2: Cleaning Text Data
- What is tokenizing, and how do we do it?
- What are regular expressions (RegEx), and how can they be used?
- What are lemmatizing and stemming? (See the sketch below.)
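To make these cleaning steps concrete, here is a minimal sketch using Python's built-in re module and NLTK. (NLTK is an assumption; the course may use a different toolkit, and the sample sentence is invented for illustration.)

```python
import re

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of tokenizer and lemmatizer data.
nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK versions
nltk.download("wordnet", quiet=True)    # lemmatizer dictionary

text = "The 3 runners were running faster than they had ever run!"

# RegEx: keep only letters and whitespace, then lowercase.
cleaned = re.sub(r"[^a-zA-Z\s]", "", text).lower()

# Tokenizing: split the cleaned string into individual tokens.
tokens = word_tokenize(cleaned)

# Stemming chops off suffixes to reach a crude root, while lemmatizing
# maps each word to its dictionary form (pos="v" treats tokens as verbs).
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])
```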
Module 3: Converting Text Data to Model Features
- What is vectorizing?
- How do we properly construct training and testing sets when working with NLP vectors?
- What is CountVectorizer and when should it be used?
- What is TfidfVectorizer and when should it be used? (Both vectorizers are sketched below.)
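As a preview, here is a minimal sketch of both vectorizers in scikit-learn, including the rule this module emphasizes: split first, then fit the vectorizer on the training set only. (The toy corpus and labels are invented for illustration.)

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

corpus = [
    "I love this movie",
    "I hate this movie",
    "This film was great",
    "This film was terrible",
]
labels = [1, 0, 1, 0]

# Split *before* vectorizing so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    corpus, labels, test_size=0.5, random_state=42
)

# CountVectorizer: raw token counts per document.
cvec = CountVectorizer()
X_train_counts = cvec.fit_transform(X_train)  # learn the vocabulary on train only
X_test_counts = cvec.transform(X_test)        # reuse that vocabulary on test

# TfidfVectorizer: counts re-weighted by inverse document frequency,
# which down-weights terms that appear in most documents.
tvec = TfidfVectorizer()
X_train_tfidf = tvec.fit_transform(X_train)
X_test_tfidf = tvec.transform(X_test)

print(cvec.get_feature_names_out())
```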
Module 4: Hyperparameters in NLP
- What are hyperparameters? What are NLP hyperparameters?
- What are stop words and how do they affect our model?
- What are n-grams and how do they affect our model?
- What are max_features, max_df, and min_df, and how do they affect our model? (See the sketch below.)
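All of these hyperparameters are arguments to the vectorizer itself. A minimal sketch with scikit-learn's CountVectorizer (the specific values are arbitrary illustrations, not recommendations):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    stop_words="english",  # drop common English stop words ("the", "and", ...)
    ngram_range=(1, 2),    # keep unigrams and bigrams (two-word phrases)
    max_features=5_000,    # cap the vocabulary at the 5,000 most frequent terms
    max_df=0.95,           # ignore terms that appear in over 95% of documents
    min_df=2,              # ignore terms that appear in fewer than 2 documents
)
# vec.fit_transform(documents) would then build the feature matrix
# using only the terms that survive these filters.
```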
Module 5: Machine Learning with Pipelines in NLP
- What are considerations of fitting machine learning models in NLP?
- What are pipelines and GridSearch?
- How do we automate model selection with pipelines? (See the sketch below.)
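A minimal sketch of that automation with scikit-learn's Pipeline and GridSearchCV. (The toy corpus, step names, and parameter grid are illustrative assumptions.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

corpus = [
    "I love this movie", "I hate this movie",
    "This film was great", "This film was terrible",
    "What a wonderful story", "What an awful story",
]
labels = [1, 0, 1, 0, 1, 0]

# Chain vectorizing and modeling so the search can tune both at once.
pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "vec__stop_words": [None, "english"],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(corpus, labels)
print(search.best_params_)
```

Because one grid addresses vectorizer and model parameters together, the cross-validated search compares every combination and returns the best pipeline, which is what makes model selection automatic.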
Background Knowledge
- All code is written in Python, so Python experience is helpful. However, solutions are provided, so a Python background is not required.
- Some experience with machine learning is helpful, but not necessary.
- No background in NLP is required.