[MLS-C01] [Data Engineering] Gathering data

Posted by Oscaner on May 14, 2022

Gathering data


Retrieve data from Scikit-learn

Scikit-learn has many datasets for use in your modeling

Similar to the Kaggle and Reddit dataset repositories
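As a quick sketch, loading one of the bundled datasets:

```python
from sklearn.datasets import load_iris

# Each bundled dataset returns data, targets, and metadata
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)            # (150, 4)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
```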


AWS services

Several AWS services to help gather data

  • AWS Data Pipeline
  • AWS Database Migration Service (DMS)
  • AWS Glue
  • Amazon SageMaker
  • Amazon Athena


Handling Missing Data

Several approaches to the problem of handling missing data

  • Do nothing
  • Remove the entire record
  • Median/average value replacement
  • Most frequent value
  • Model-based imputation
    • K-Nearest Neighbors
    • Regression
    • Deep Learning
  • Interpolation / Extrapolation
  • Forward filling / Backward filling
  • Hot deck imputation

Do nothing

Let your algorithm either replace them through imputation (XGBoost) or just ignore them, as LightGBM does with its `use_missing=false` parameter

Some algorithms will throw an error if they find missing values (LinearRegression)

Or, replace all missing values

But with what?

Remove the Entire Record

Remove the observations that have missing values

Risk losing data points with valuable information
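A minimal pandas sketch of record removal (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0],
    "income": [50000.0, 60000.0, np.nan],
})

# dropna() discards every row with at least one missing value --
# here two of the three observations are lost
clean = df.dropna()
print(len(clean))  # 1
```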


Median/Average Value Replacement

Replace the missing values with a simple median, or mean

  • Reflection of the other values in the feature
  • Doesn’t factor in correlation between features
  • Can’t use on categorical features
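Sketched with scikit-learn's `SimpleImputer` (the median is robust to the outlier in the toy column below):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [100.0]])

# Fill missing entries with the column median (2.0), which the
# 100.0 outlier barely affects -- the mean would give ~34.3
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)
print(X_filled[2, 0])  # 2.0
```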


Most Frequent Value

Replace missing values with the most frequently occurring value in the feature

  • Doesn’t factor correlation between features
  • Works with categorical features
  • Can introduce bias into your model
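The same `SimpleImputer` handles categorical columns with `strategy="most_frequent"` (toy data below):

```python
import numpy as np
from sklearn.impute import SimpleImputer

colors = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# "red" is the mode, so it replaces the missing entry --
# note this skews the distribution further toward "red"
imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(colors)
print(filled[2, 0])  # red
```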

Model-Based Imputation

Use a machine learning algorithm to impute the missing values

  • K-Nearest Neighbors
    • Uses “feature similarity” to predict missing values
  • Regression
    • Predictors of the variable with missing values identified via correlation matrix
    • Best predictors are selected and used as independent variables in a regression equation
    • Variable with missing data is used as the target variable
  • Deep Learning
    • Works very well with categorical and non-numerical features
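A small `KNNImputer` sketch of the "feature similarity" idea: the missing value is filled from the rows most similar on the observed features:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [1.1, 2.1],
    [5.0, np.nan],   # similar to the row below on the first feature
    [5.1, 6.0],
])

# Distances use the non-missing coordinates; the two nearest
# neighbors of row 2 are rows 3 and 1, so the fill value is the
# mean of their second feature: (6.0 + 2.1) / 2 = 4.05
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[2, 1])  # 4.05
```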


Other Methods

  • Interpolation / Extrapolation
    • Estimate values from other observations within the range of a discrete set of known data points
  • Forward filling / Backward filling
    • Fill the missing value from the preceding value (forward fill) or the succeeding value (backward fill)
  • Hot deck imputation
    • Fill the missing value with one chosen at random from similar, related records
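Forward filling and interpolation are one-liners in pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan])

# Forward fill carries the last observed value onward
print(s.ffill().tolist())        # [1.0, 1.0, 3.0, 3.0, 3.0]

# Linear interpolation estimates from the surrounding points
# (trailing gaps are padded with the last value)
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 3.0, 3.0]
```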

Feature Selection/Extraction

The Curse of Dimensionality

“Dimensionality” refers to the number of features (i.e. input variables) in your dataset

  • A high feature-to-observation ratio causes some algorithms to struggle to train effective models
  • Multi-dimensional datasets are much harder to visualize than two- or three-dimensional ones
  • Two primary methods for reducing dimensionality: Feature Selection and Feature Extraction

Feature Selection

Use feature selection to filter irrelevant or redundant features from your dataset

  • Feature Selection requires normalization

  • Feature Selection removes features from your dataset, e.g. with Variance Thresholds
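A `VarianceThreshold` sketch on toy data: a constant feature carries no information and gets dropped:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0.0, 1.0, 10.0],
    [0.0, 2.0, 20.0],
    [0.0, 3.0, 30.0],
])

# The first column is constant (zero variance), so the default
# threshold of 0 removes it and keeps the other two features
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (3, 2)
```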

Feature Extraction

  • Feature Extraction requires standardization

Reduces Features - Retains Information

By creating new features from your existing ones, feature extraction produces a new, smaller set of features that still captures most of the useful information.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised algorithm that creates new features by linearly combining original features

  • New features are uncorrelated, meaning they are orthogonal
  • New features are ranked in order of “explained variance”. The first principal component (PC1) explains the most variance in your dataset, PC2 explains the second-most variance, etc.
    • Explained variance tells you how much information (variance) can be attributed to each of the principal components
    • You lose some of the variance (information) when you reduce your dimensional space

  • Principal component analysis (PCA) can be used to assist in visualization of your data

  • Principal component analysis (PCA) can also assist in speeding up your machine learning


  • [feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds_-lab_part_1.ipynb](https://github.com/Oscaner/Exam/blob/master/aws/mls-c01/whizlabs/code/02-data-engineering/feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds-lab_part_1.ipynb "feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds-_lab_part_1.ipynb")
  • [feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds_-lab_part_2.ipynb](https://github.com/Oscaner/Exam/blob/master/aws/mls-c01/whizlabs/code/02-data-engineering/feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds-lab_part_2.ipynb "feature_extraction_and_feature_selection_with_principal_component_analysis_and_variance_thresholds-_lab_part_2.ipynb")

Encoding categorical values

  • Binarizer Encoding: for features of a binary nature
  • Label Encoding: maps each category to an integer, which may imply an ordering that doesn’t exist; when the categories truly are ordered, use an Ordinal Encoder
  • One Hot Encoding: Change nominal categorical values such as “true”, “false”, or “rainy”, “sunny” to numerical values


Numerical engineering

  • Transform numeric values so machine learning algorithms can better analyze them
  • Change numeric values so all values are on the same scale
    • Normalization: rescales the values into a range of [0, 1]
    • Standardization: rescales data to have a mean of 0 and a standard deviation of 1 (unit variance)
  • Binning
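The two scalings above can be sketched on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

# Normalization: rescale into [0, 1]
X_norm = MinMaxScaler().fit_transform(X)
print(X_norm.ravel())  # [0.0, ~0.444, 1.0]

# Standardization: mean 0, standard deviation 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())  # ~0.0, 1.0
```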


Binning

AKA discretization or quantization

  • Categorical Binning
    • Group categorical values to gain insight into data: countries by geographical region
  • Numerical Binning
    • Divides continuous feature into a specified number of categories or bins, thus making the data discrete
    • Reduces the number of discrete intervals of a continuous feature
  • Quantile Binning
    • Divide up data into equal sized bins
    • Defines the bins using percentiles based on the distribution of the data
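Numerical and quantile binning are `pd.cut` and `pd.qcut` in pandas (toy ages below):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 40, 62, 80])

# Numerical binning: three fixed-width intervals over the age range
print(pd.cut(ages, bins=3).value_counts().sort_index())

# Quantile binning: three bins with equal numbers of observations
print(pd.qcut(ages, q=3).value_counts().sort_index())  # 2 per bin
```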


Text Feature Editing

  • Transform text within data so machine learning algorithms can better analyze it
  • Splitting text into smaller pieces
  • Used for text analysis of documents, streamed dialog, etc.
  • Can use in a pipeline as steps in a machine learning analysis


Bag-of-Words

  • Tokenizes raw text and creates a statistical representation of the text
  • Breaks up text by whitespace into single words


N-Gram

  • Extension of Bag-of-Words which produces groups of words of n size
  • Breaks up text by whitespace into groups of n words

Orthogonal Sparse Bigram

  • Slides a window of size n over the text and returns every pair of words in the window that includes the first word
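scikit-learn has no built-in OSB transform, so here is a minimal hand-rolled sketch, using underscores to mark skipped words in the style of the Amazon SageMaker docs:

```python
def osb(tokens, window):
    """Orthogonal sparse bigrams: pair the first word of each sliding
    window with every later word in it, padding skipped positions
    with underscores to preserve distance information."""
    pairs = []
    for i, first in enumerate(tokens):
        for dist in range(1, window):
            if i + dist < len(tokens):
                pairs.append(first + "_" * dist + tokens[i + dist])
    return pairs

print(osb("The quick brown fox".split(), 4))
# ['The_quick', 'The__brown', 'The___fox',
#  'quick_brown', 'quick__fox', 'brown_fox']
```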


Term Frequency-Inverse Document Frequency (TF-IDF)

  • Shows how important a word or words are to a given set of text by giving appropriate weights to terms that are common and less common
  • Shows the popularity of a word or words in text data by making common words like “the” or “and” less important

Use Cases

| Use Case | Transformation | Reason |
| --- | --- | --- |
| Finding phrases in spam | N-Gram | Compare whole phrases such as "you're a winner!" or "Buy now!" |
| Finding the subject of several PDFs | TF-IDF, Orthogonal Sparse Bigram | Filter less important words in the documents; find common word combinations repeated in the documents |


This article was written by Oscaner and is licensed under the Creative Commons Attribution 4.0 International License.
Unless noted as a repost or otherwise attributed, all articles on this site are original or translated here; please credit the author when reposting.