Gathering data
Scikit-learn
Retrieve data from Scikit-learn
Scikit-learn has many datasets for use in your modeling
Similar to the Kaggle and Reddit dataset repositories
https://scikit-learn.org/stable/datasets
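For instance, a minimal sketch of loading a bundled dataset (assumes scikit-learn >= 0.23 for the `as_frame` option):

```python
# A minimal sketch of pulling datasets that ship with scikit-learn.
from sklearn.datasets import load_iris, fetch_california_housing

# Small "toy" datasets are bundled with the library and load instantly.
iris = load_iris(as_frame=True)
print(iris.frame.head())

# Larger "real world" datasets are downloaded on first use.
housing = fetch_california_housing(as_frame=True)
print(housing.frame.shape)
```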
AWS services
Several AWS services to help gather data
- AWS Data Pipeline
- AWS Database Migration Service (DMS)
- AWS Glue
- Amazon SageMaker
- Amazon Athena
Labs
Handling Missing Data
Several approaches to the problem of handling missing data
- Do nothing
- Remove the entire record
- Mode/median/average value replacement
- Most frequent value
- Model-based imputation
- K-Nearest Neighbors
- Regression
- Deep Learning
- Interpolation / Extrapolation
- Forward filling / Backward filling
- Hot deck imputation
Do nothing
Let your algorithm either replace them through imputation (XGBoost) or just ignore them, as LightGBM does with its use_missing=false parameter
Some algorithms will throw an error if they find missing values (LinearRegression)
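A small sketch contrasting the two behaviours (assumes the xgboost package is installed alongside scikit-learn):

```python
# "Do nothing": some algorithms accept NaNs as-is, others refuse them.
import numpy as np
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# XGBoost learns a default direction for missing values, so NaNs are fine untouched.
XGBRegressor(n_estimators=10).fit(X, y)

# scikit-learn's LinearRegression rejects NaNs and raises a ValueError.
try:
    LinearRegression().fit(X, y)
except ValueError as err:
    print("LinearRegression rejected the NaNs:", err)
```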
Or, replace all missing values
But with what?
Remove the Entire Record
Remove the observations that have missing values
Risk losing data points with valuable information
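A minimal pandas sketch of dropping incomplete records:

```python
# Drop every row that contains at least one missing value.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 41], "income": [50_000, 62_000, np.nan]})
complete_rows = df.dropna()  # keeps only fully observed records
print(complete_rows)
```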
Labs
Median/Average Value Replacement
Replace the missing values with a simple median or mean of the feature
- Reflects the other values in the feature
- Doesn't factor in correlation between features
- Can't use on categorical features
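A minimal sketch of median imputation with scikit-learn's `SimpleImputer`:

```python
# Replace missing values with the column median (use strategy="mean" for the mean).
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.0], [np.nan], [9.0], [11.0]])
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(X))  # the NaN becomes the column median (9.0)
```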
Labs
Most Frequent Value
Replace missing values with the most frequently occurring value in the feature
- Doesn't factor in correlation between features
- Works with categorical features
- Can introduce bias into your model
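A minimal sketch reusing `SimpleImputer` with the most-frequent strategy, which also works on categorical features:

```python
# Replace missing values with the most frequently occurring value in the feature.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red"], ["blue"], [np.nan], ["blue"]], dtype=object)
imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(X))  # the missing entry becomes "blue"
```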
Model-Based Imputation
Use a machine learning algorithm to impute the missing values
- K-Nearest Neighbors
- Uses “feature similarity” to predict missing values
- Regression
- Predictors of the variable with missing values identified via correlation matrix
- Best predictors are selected and used as independent variables in a regression equation
- Variable with missing data is used as the target variable
- Deep Learning
- Works very well with categorical and non-numerical features
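A minimal sketch of the KNN and regression-style approaches using scikit-learn's imputers (assumes scikit-learn >= 0.22; `IterativeImputer` is still flagged experimental):

```python
# Model-based imputation: predict the missing values from the other features.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 8.0]])

# K-Nearest Neighbors: fill each gap from the most similar rows.
print(KNNImputer(n_neighbors=2).fit_transform(X))

# Regression-style: each feature with gaps is modelled from the other features.
print(IterativeImputer(random_state=0).fit_transform(X))
```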
Labs
Other Methods
- Interpolation / Extrapolation
- Estimate values from other observations: interpolation estimates within the range of the known data points, extrapolation estimates beyond it
- Forward filling / Backward filling
- Fill the missing value with the preceding value (forward fill) or the succeeding value (backward fill)
- Hot deck imputation
- Randomly choose the replacement value from similar records in the dataset
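A minimal pandas sketch of interpolation and forward/backward filling:

```python
# Fill gaps in a series from the values around them.
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

print(s.interpolate())  # linear estimate between the known neighbours
print(s.ffill())        # copy the preceding observed value forward
print(s.bfill())        # copy the succeeding observed value backward
```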
Feature Selection/Extraction
The Curse of Dimensionality
“Dimensionality” refers to the number of features (i.e. input variables) in your dataset
- A high feature-to-observation ratio causes some algorithms to struggle to train effective models
- Multi-dimensional datasets are much harder to visualize than two- or three-dimensional data
- Two primary methods for reducing dimensionality: Feature Selection and Feature Extraction
Feature Selection
Use feature selection to filter irrelevant or redundant features from your dataset
- Feature Selection requires normalization
- Feature Selection removes features from your dataset
- One simple technique: Variance Thresholds
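A minimal sketch of the variance-threshold approach with scikit-learn:

```python
# Drop features whose variance does not exceed the threshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 1.0, 2.1],
              [0, 1.2, 0.3],
              [0, 0.9, 4.5]])  # the first column never varies

selector = VarianceThreshold(threshold=0.0)  # remove zero-variance features
print(selector.fit_transform(X))             # the constant column is gone
```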
Feature Extraction
- Feature Extraction requires standardization
- Feature Extraction reduces the number of features while retaining most of the information
- By creating new features from your existing features, feature extraction builds a new, smaller set of features that still captures most of the useful information
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is an unsupervised algorithm that creates new features by linearly combining original features
- New features are uncorrelated, meaning they are orthogonal
- New features are ranked in order of “explained variance”. The first principal component (PC1) explains the most variance in your dataset, PC2 explains the second-most variance, etc.
- Explained variance tells you how much information (variance) can be attributed to each of the principal components
- You lose some of the variance (information) when you reduce your dimensional space
- Principal component analysis (PCA) can be used to assist in visualization of your data
- Principal component analysis (PCA) can also speed up training by reducing the number of features your machine learning algorithm has to process
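A minimal sketch of PCA on standardized data:

```python
# Standardize first so no single feature dominates the components by scale alone.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                      # 4 original features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                 # keep the 2 strongest components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)      # share of variance explained by PC1, PC2
```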
Labs
- [feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds_-lab_part_1.ipynb](https://github.com/Oscaner/Exam/blob/master/aws/mls-c01/whizlabs/code/02-data-engineering/feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds-lab_part_1.ipynb "feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds-_lab_part_1.ipynb")
- [feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds_-lab_part_2.ipynb](https://github.com/Oscaner/Exam/blob/master/aws/mls-c01/whizlabs/code/02-data-engineering/feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds-lab_part_2.ipynb "feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds-_lab_part_2.ipynb")
Encoding categorical values
- Binarizer Encoding: for features of a binary nature
- Label Encoding: maps categories to integers, which may imply an ordinality; use an Ordinal Encoder when the categories truly are ordered
- One Hot Encoding: Change nominal categorical values such as “true”, “false”, or “rainy”, “sunny” to numerical values
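A minimal sketch of the three encodings, assuming the scikit-learn counterparts (LabelBinarizer, OrdinalEncoder, OneHotEncoder):

```python
# Encode categorical values as numbers in three different ways.
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder, OrdinalEncoder

# Binarizer-style encoding for a two-valued feature.
print(LabelBinarizer().fit_transform(["true", "false", "true"]))

# Ordinal encoding when the categories have a natural order.
sizes = np.array([["small"], ["large"], ["medium"]])
print(OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(sizes))

# One-hot encoding for nominal categories with no order.
weather = np.array([["rainy"], ["sunny"], ["rainy"]])
print(OneHotEncoder().fit_transform(weather).toarray())
```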
Labs
Numerical engineering
- Transform numeric values so machine learning algorithms can better analyze them
- Change numeric values so all values are on the same scale
- Normalization: rescales the values into a range of [0, 1]
- Standardization: rescales data to have a mean of 0 and a standard deviation of 1 (unit variance); see the sketch after this list
- Binning
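A minimal sketch of both scalers with scikit-learn:

```python
# Normalization vs. standardization on a single numeric feature.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

print(MinMaxScaler().fit_transform(X))    # normalization: values squeezed into [0, 1]
print(StandardScaler().fit_transform(X))  # standardization: mean 0, unit variance
```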
Binning
AKA discretization or quantization
- Categorical Binning
- Group categorical values to gain insight into data: countries by geographical region
- Numerical Binning
- Divides a continuous feature into a specified number of categories or bins, thus making the data discrete
- Reduces the number of discrete intervals of a continuous feature
- Quantile Binning
- Divides the data into equal-sized bins
- Defines the bins using percentiles based on the distribution of the data
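A minimal pandas sketch of numerical vs. quantile binning:

```python
# Turn a continuous feature into discrete bins.
import pandas as pd

ages = pd.Series([3, 17, 25, 34, 52, 79])

# Numerical binning: equal-width intervals across the value range.
print(pd.cut(ages, bins=3))

# Quantile binning: equal-sized bins defined by percentiles.
print(pd.qcut(ages, q=3))
```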
Labs
Text Feature Editing
- Transform text within data so machine learning algorithms can better analyze it
- Splitting text into smaller pieces
- Used for text analysis of documents, streamed dialog, etc.
- Can use in a pipeline as steps in a machine learning analysis
Bag-of-Words
- Tokenizes raw text and creates a statistical representation of the text
- Breaks up text by whitespace into single words
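A minimal sketch with scikit-learn's `CountVectorizer` (assumes scikit-learn >= 1.0 for `get_feature_names_out`):

```python
# Bag-of-words: tokenize the text and count word occurrences per document.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox", "the lazy dog"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary of single words
print(counts.toarray())                    # per-document word counts
```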
N-Gram
- Extension of Bag-of-Words that produces groups of n consecutive words
- Breaks up text by whitespace into groups of words
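The same vectorizer can produce n-grams via its `ngram_range` parameter; a small sketch:

```python
# Count word pairs (bigrams) instead of single words.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize inside", "buy now please"]
bigrams = CountVectorizer(ngram_range=(2, 2))
bigrams.fit(docs)
print(bigrams.get_feature_names_out())  # ['buy now', 'free prize', 'now please', 'prize inside']
```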
Orthogonal Sparse Bigram
- Slides a window of size n over the text and returns every pair of words made from the first word in the window and one of the words that follows it
- Unlike N-Gram, the pairs it creates always include the first word of the window
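scikit-learn has no built-in OSB transform, so this is a rough pure-Python sketch of the idea; the underscore convention for skipped words follows the Amazon ML documentation and is illustrative only:

```python
# Orthogonal Sparse Bigram sketch: pair the first word of each sliding window
# with every later word in the window, marking skipped positions with underscores.
def osb(text, window=3):
    words = text.split()
    pairs = []
    for i, first in enumerate(words):
        for gap, other in enumerate(words[i + 1 : i + window], start=1):
            pairs.append(f"{first}{'_' * gap}{other}")
    return pairs

print(osb("the quick brown fox"))
# ['the_quick', 'the__brown', 'quick_brown', 'quick__fox', 'brown_fox']
```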
TF-IDF
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Shows how important a word or words are to a given set of text by providing appropriate weights to terms that are common and less common
- Shows the popularity of a word or words in text data by making common words like "the" or "and" less important
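A minimal sketch with scikit-learn's `TfidfVectorizer`:

```python
# TF-IDF: weight each term by how frequent it is in a document and how rare
# it is across all documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

# Words shared by every document ("the", "cat") get lower weights than words
# that are distinctive for one document ("mat", "dog").
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))
```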
Use Cases
Use Case | Transformation | Reason |
---|---|---|
Finding phrases in spam | N-Gram | Compare whole phrases such as “you’re a winner!” or “Buy now!” |
Finding subject of several PDFs | TF-IDF, Orthogonal Sparse Bigram | Filter less important words in the documents; find common word combinations repeated in the documents |
Labs