Data preparation
Seven steps to prepare your data for use in a machine learning model
- Gather your data
- Handle missing data
- Feature extraction
- Decide which features are important
- Encode categorical values
- Numeric feature engineering
- Split your data into training and testing datasets
Gather Data
Gather data for the problem at hand
- Gather unique data
- Publicly available data
  - Kaggle
  - Google Dataset Search
  - UCI Machine Learning Repository
- Scrape HTML pages
  - Beautiful Soup
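As a rough illustration of scraping, the sketch below pulls table rows out of an HTML fragment. It uses Python's standard-library `html.parser` rather than Beautiful Soup, purely to stay dependency-free; Beautiful Soup offers a higher-level API for the same job. The `TableCellParser` class and the `SAMPLE_HTML` fragment are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # row currently being built, or None
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:       # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment; a real scraper would download the page first.
SAMPLE_HTML = """
<table>
  <tr><th>city</th><th>temp_c</th></tr>
  <tr><td>Oslo</td><td>4</td></tr>
  <tr><td>Cairo</td><td>31</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
```

Here `rows` holds one list of strings per data row, which still needs type conversion (for example `int(row[1])`) before it is useful as model input.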
Handle Missing Data
Several approaches to the problem
- Null value replacement
- Mode/median/average value replacement
- Remove the entire record
- Model-based imputation
  - Regression
  - K-Nearest Neighbors
  - Deep Learning
- Interpolation / Extrapolation
- Forward filling / Backward filling
- Hot deck imputation
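A few of the simpler approaches above can be sketched in plain Python. This is a minimal illustration, assuming missing values are represented as `None`; the helper names and sample data are made up for the example. In practice a library such as pandas covers the same ground with `fillna`, `ffill`, and `dropna`.

```python
from statistics import median, mode


def fill_with(values, replacement):
    """Return a copy of values with every None replaced by replacement."""
    return [replacement if v is None else v for v in values]


def fill_with_median(values):
    """Replace None with the median of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return fill_with(values, median(observed))


def fill_with_mode(values):
    """Replace None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    return fill_with(values, mode(observed))


def forward_fill(values):
    """Replace each None with the most recent observed value before it.

    Leading None entries stay None because nothing precedes them.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def drop_incomplete(records):
    """Remove every record (a dict) that has at least one None field."""
    return [r for r in records if all(v is not None for v in r.values())]


# Made-up sample data.
ages = [34, None, 29, 41, None]
colors = ["red", None, "blue", "red"]
people = [{"age": 34, "color": "red"}, {"age": None, "color": "blue"}]

ages_median = fill_with_median(ages)
ages_ffill = forward_fill(ages)
colors_mode = fill_with_mode(colors)
complete_people = drop_incomplete(people)
```

Median replacement suits skewed numeric columns, mode replacement suits categorical columns, and forward filling only makes sense when the records are ordered, for example by time.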
Feature Extraction
- AKA Dimensionality Reduction
- Reduce the number of features by creating new features from existing features
Feature Selection
- Rank the importance of existing features
- Remove less important features
- Use Principal Component Analysis (PCA)
  - An unsupervised learning algorithm that reduces the number of features while still retaining as much information as possible
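To make the PCA idea concrete, here is a minimal sketch that assumes NumPy is available. It centers the data, takes a singular value decomposition, and keeps the directions with the most variance. The `pca` function and the sample matrix are hypothetical and meant only as an illustration, not a production implementation.

```python
import numpy as np


def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components.

    Returns (projected, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)          # PCA works on zero-mean features
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    variance = singular_values ** 2        # proportional to variance per direction
    ratio = variance[:n_components] / variance.sum()
    return projected, ratio


# Made-up data: the second feature is roughly twice the first, so a single
# component should capture nearly all of the variance.
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
reduced, ratio = pca(X, n_components=1)
```

The two original features collapse into one new feature (`reduced`), and `ratio` reports how much of the original variance that feature retains.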
Encode Categorical Values
- Encode categorical data to integers
- Be careful with ordinal values
- One-hot encoding: change nominal categorical values such as true/false or rainy/sunny to numerical values
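The difference between the two encodings can be sketched in plain Python. The helper names and sample values below are assumptions made for illustration. One-hot encoding gives every nominal category its own 0/1 column, while ordinal encoding maps ordered categories to integers that keep their order.

```python
def one_hot_encode(values):
    """Map each nominal value to a 0/1 vector with one slot per category."""
    categories = sorted(set(values))             # fixed, reproducible column order
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded


def ordinal_encode(values, order):
    """Map ordered categories to integers that preserve the given order."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]


# Nominal data: no category is "greater" than another, so use one-hot.
weather = ["rainy", "sunny", "sunny", "rainy"]
columns, weather_encoded = one_hot_encode(weather)

# Ordinal data: the order carries meaning, so it must be stated explicitly.
sizes = ["small", "large", "medium"]
sizes_encoded = ordinal_encode(sizes, order=["small", "medium", "large"])
```

Encoding nominal values as plain integers would suggest an order that does not exist (for example rainy < sunny), which is why the ordinal helper requires the order to be passed in.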
Numerical Feature Engineering
- Transform numeric values so machine learning algorithms can better analyze them
- Change numeric values so all values are on the same scale
  - Normalization: rescales the values into a range of [0, 1]
  - Standardization: rescales data to have a mean of 0 and a standard deviation of 1 (unit variance)
- Binning
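The three transformations can be sketched with the standard library alone. The function names, the choice of population standard deviation, and the sample values are assumptions made for this example.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]    # assumes hi > lo


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)         # population std dev
    return [(v - mu) / sigma for v in values]        # assumes sigma > 0


def bin_values(values, cut_points):
    """Return the bin index of each value given sorted cut points.

    With cut points [18, 65]: v < 18 -> 0, 18 <= v < 65 -> 1, v >= 65 -> 2.
    """
    return [bisect_right(cut_points, v) for v in values]


# Made-up sample data.
incomes = [20, 40, 60, 100]
ages = [12, 18, 40, 70]

incomes_normalized = normalize(incomes)
incomes_standardized = standardize(incomes)
age_bins = bin_values(ages, cut_points=[18, 65])
```

Normalization is sensitive to outliers because a single extreme value stretches the range; standardization is usually the safer choice when outliers are present.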
Training and Testing Datasets
- Split dataset into a training subset and a testing subset
- Typically an 80% training / 20% testing split
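A minimal sketch of such a split, using only the standard library. The `split_dataset` helper, its default seed, and the sample data are hypothetical. Shuffling first avoids a biased split when the records are stored in some meaningful order.

```python
import random


def split_dataset(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of records and split it into (train, test)."""
    shuffled = list(records)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)    # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up dataset of 100 records.
data = list(range(100))
train, test = split_dataset(data)
```

With the default `test_fraction=0.2`, 80 of the 100 records land in `train` and 20 in `test`, and no record appears in both.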
Labs
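This article was written by Oscaner and is licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as reposted or attributed to another source, all articles on this site are original works or translations by this site. Please credit the author before reposting.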