# [MLS-C01] [Data Engineering] Introduction

Posted by Oscaner on May 10, 2022

## Data preparation

Seven steps to prepare your data for use in a machine learning model:

1. Gather data
2. Handle missing data
3. Feature extraction
4. Feature selection (decide which features are important)
5. Encode categorical values
6. Numerical feature engineering
7. Split your data into training and testing datasets

## Gather Data

Gather data for the problem at hand:

• Gather unique data
• Publicly available data
  • Kaggle
  • Reddit
  • UCI Machine Learning Repository
• Scrape HTML pages
  • Beautiful Soup
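As a minimal sketch of scraping, the snippet below parses a small hypothetical HTML table with Beautiful Soup (the table contents are made up for illustration; a real scraper would fetch the page over HTTP first):

```python
# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a fetched page
html = """
<table>
  <tr><th>city</th><th>temp</th></tr>
  <tr><td>Seattle</td><td>12</td></tr>
  <tr><td>Austin</td><td>28</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract each table row as a list of cell strings
rows = [
    [cell.get_text() for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]
print(rows)  # header row followed by the data rows
```

The same pattern (find the repeating element, pull out its text) generalizes to most scraping tasks.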

## Handle Missing Data

Several approaches to the problem:

• Null value replacement
• Mode/median/average value replacement
• Remove the entire record
• Model-based imputation
  • Regression
  • K-Nearest Neighbors
  • Deep Learning
• Interpolation / Extrapolation
• Forward filling / Backward filling
• Hot deck imputation
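A few of the simpler approaches above can be sketched with pandas on a toy column (the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy column with two missing values
df = pd.DataFrame({"age": [25.0, np.nan, 31.0, np.nan, 40.0]})

median_filled = df["age"].fillna(df["age"].median())  # median replacement
ffilled = df["age"].ffill()                           # forward filling
interp = df["age"].interpolate()                      # linear interpolation
dropped = df.dropna()                                 # remove the entire record
```

Which approach is appropriate depends on the data: forward filling suits ordered time series, while median replacement is a common default for unordered records.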

## Feature Extraction

• AKA Dimensionality Reduction
• Reduce the number of features by creating new features from existing features

## Feature Selection

• Rank the importance of existing features
• Remove less important features

• Use Principal Component Analysis (PCA)
  • An unsupervised learning algorithm that reduces the number of features while still retaining as much information as possible
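A small sketch of PCA with scikit-learn, using synthetic data in which two of the five columns are linear combinations of the others (so three components recover essentially all the variance):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] * 2        # redundant feature
X[:, 4] = X[:, 1] - X[:, 2]  # another linear combination

# Project the 5 original features onto 3 principal components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

# explained_variance_ratio_ shows how much information each component retains
print(pca.explained_variance_ratio_.sum())  # close to 1.0 for this rank-3 data
```

In practice you choose `n_components` by inspecting the cumulative explained variance rather than knowing the rank in advance.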

## Encode Categorical Values

• Encode categorical data to integers
• Be careful with ordinal values
• One-hot encoding: converts nominal categorical values such as true/false or rainy/sunny into binary indicator columns
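Both cases can be sketched with pandas on made-up values: `get_dummies` handles nominal categories, while ordinal categories need an explicit mapping so their order is preserved:

```python
import pandas as pd

# Nominal category: one-hot encode into indicator columns
df = pd.DataFrame({"weather": ["rainy", "sunny", "rainy"]})
encoded = pd.get_dummies(df, columns=["weather"])

# Ordinal category: map to integers that respect the ordering
sizes = pd.Series(["small", "large", "medium"])
order = {"small": 0, "medium": 1, "large": 2}
sizes_encoded = sizes.map(order)
```

Using one-hot encoding on an ordinal feature would discard its ordering, which is why the two cases are treated differently.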

## Numerical Feature Engineering

• Transform numeric values so machine learning algorithms can better analyze them
• Change numeric values so all values are on the same scale
• Normalization: rescales the values into a range of [0, 1]
• Standardization: rescales data to have a mean of 0 and a standard deviation of 1 (unit variance)
• Binning
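The three techniques above can be sketched with scikit-learn and NumPy on a toy column (the bin edge of 2.5 is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

normalized = MinMaxScaler().fit_transform(X)      # rescales into [0, 1]
standardized = StandardScaler().fit_transform(X)  # mean 0, std 1
bins = np.digitize(X.ravel(), bins=[2.5])         # binning: below/above 2.5
```

Normalization is sensitive to outliers (they define the min/max), so standardization is often preferred when the data contains extreme values.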

## Training and Testing Datasets

• Split dataset into a training subset and a testing subset
• Typically an 80% / 20% split
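The split is a one-liner with scikit-learn; the array below is synthetic, and `random_state` is fixed only to make the example reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic dataset: 50 samples, 2 features each
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

The split is shuffled by default, which matters when the data is ordered (e.g. sorted by label).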