- Clean text data with regular expressions and tokenization
- Learn lemmatization and stemming, including how and when to use each technique
- Transform text data with CountVectorizer and TfidfVectorizer
- Fit machine learning models in scikit-learn and evaluate their performance
- Build pipelines and use GridSearchCV to tune NLP hyperparameters
Module 1: Introduction to Natural Language Processing (NLP)
- What is natural language processing?
- What are the applications of NLP?
- What is bias in NLP?
Module 2: Cleaning Text Data
- What is tokenizing, and how do we do it?
- What are regular expressions (RegEx), and how can they be used?
- What are lemmatizing and stemming? (both are sketched in the example after this list)
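A minimal sketch of the Module 2 cleaning steps, assuming NLTK is the library used (the course may use another; the `nltk.download` resource names also vary by NLTK version). The sample sentence is invented for illustration:

```python
import re

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")    # tokenizer data (newer NLTK may also need "punkt_tab")
nltk.download("wordnet")  # lemmatizer dictionary

text = "The runners were running quickly through 3 parks!"

# Regular expressions: keep only letters and whitespace, then lowercase.
cleaned = re.sub(r"[^a-zA-Z\s]", "", text).lower()

# Tokenizing: split the cleaned string into individual word tokens.
tokens = word_tokenize(cleaned)

# Stemming chops suffixes by rule; lemmatizing maps words to dictionary forms.
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # e.g. 'running' -> 'run'
print([lemmatizer.lemmatize(t) for t in tokens])  # e.g. 'parks' -> 'park'
```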
Module 3: Converting Text Data to Model Features
- What is vectorizing?
- How do we properly construct training and testing sets when working with NLP vectors?
- What is CountVectorizer, and when should it be used?
- What is TfidfVectorizer, and when should it be used? (both vectorizers are sketched after this list)
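A minimal sketch of both Module 3 vectorizers in scikit-learn, on an invented four-document corpus. Splitting before vectorizing illustrates the training/testing question above: the vocabulary is learned from the training text only, so nothing from the test set leaks into the features.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "logs are made of wood",
]
labels = [0, 0, 1, 1]

# Split first, then vectorize, so the test vocabulary never leaks in.
X_train, X_test, y_train, y_test = train_test_split(
    corpus, labels, test_size=0.25, random_state=42
)

# CountVectorizer: raw token counts per document.
count_vec = CountVectorizer()
train_counts = count_vec.fit_transform(X_train)  # learn vocabulary + transform
test_counts = count_vec.transform(X_test)        # reuse the training vocabulary

# TfidfVectorizer: counts reweighted so terms common to every document
# contribute less than rare, distinctive terms.
tfidf_vec = TfidfVectorizer()
train_tfidf = tfidf_vec.fit_transform(X_train)

print(count_vec.get_feature_names_out())  # inspect the learned vocabulary
```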
Module 4: Hyperparameters in NLP
- What are hyperparameters? What are NLP hyperparameters?
- What are stop words, and how do they affect our model?
- What are n-grams, and how do they affect our model?
- What are max_features, max_df, and min_df, and how do they affect our model? (each appears in the sketch after this list)
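A minimal sketch of the Module 4 hyperparameters, set on CountVectorizer (TfidfVectorizer accepts the same arguments); the three-document corpus is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the movie was great",
    "the movie was terrible",
    "a great film with a great cast",
]

vec = CountVectorizer(
    stop_words="english",  # drop very common words like 'the' and 'was'
    ngram_range=(1, 2),    # keep single words and two-word phrases
    max_features=20,       # cap the vocabulary at the 20 most frequent terms
    max_df=0.95,           # ignore terms appearing in more than 95% of documents
    min_df=1,              # keep terms appearing in at least 1 document
)
X = vec.fit_transform(corpus)
print(vec.get_feature_names_out())  # inspect the resulting vocabulary
```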
Module 5: Machine Learning with Pipelines in NLP
- What are the considerations when fitting machine learning models in NLP?
- What are pipelines and GridSearchCV?
- How do we automate model selection with pipelines? (sketched after this list)
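A minimal sketch of Module 5, chaining a TfidfVectorizer and a LogisticRegression in a scikit-learn Pipeline so that GridSearchCV can tune vectorizer and model hyperparameters together; the toy corpus and labels are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

corpus = ["great movie", "terrible movie", "great film", "awful film",
          "loved it", "hated it", "wonderful story", "boring story"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Pipeline hyperparameters are addressed as <step name>__<parameter>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(corpus, labels)  # every grid cell is cross-validated end to end
print(search.best_params_, search.best_score_)
```

Because the vectorizer sits inside the pipeline, each cross-validation fold re-fits the vocabulary on that fold's training portion, avoiding the leakage discussed in Module 3.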
- All code is written in Python, so Python experience is helpful; however, solutions are provided, so a Python background is not required.
- Some experience with machine learning is helpful but not necessary.
- No background in NLP is required.