Defined
- Supervised learning algorithm
- Learns from training data then uses the result to classify new observations
- Two types: binary-class or multi-class
Use Cases
- Voter prediction
- Customer loan default
- Object detection
- Image classification
- Fraud detection
- Customer segmentation
- Product classification
SageMaker Algorithms
Linear Learner
- Input a set of high-dimensional vectors including a numeric target, or label
- Target is 0 or 1 for binary classification
- Learns a linear threshold function and maps a vector to an approximation of the target
- A good classification model based on Linear Learner optimizes
  - Discrete objectives suited for classification, such as F1 measure, precision, recall, or accuracy
- Requires a data matrix with one row per observation and one column per feature
- Also requires a target column across the observations
- Example use case: predict event outcome - win or lose
- Important Hyperparameters
  - `feature_dim`: number of features in the input
  - `predictor_type`: type of the target variable (`binary_classifier` or `multiclass_classifier`)
  - `loss`: specifies the loss function (`auto`, `logistic`, `hinge_loss`, `softmax_loss`)
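The linear threshold function described above can be sketched in plain Python. The weights below are hand-picked for illustration, not learned; this is a toy sketch, not SageMaker's implementation:

```python
# Sketch of a learned linear threshold function for binary classification:
# score = w . x + b, predict 1 when sigmoid(score) > 0.5.
import math

def predict(weights, bias, x):
    """Map a feature vector to an approximation of the 0/1 target."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    prob = 1.0 / (1.0 + math.exp(-score))  # logistic link
    return 1 if prob > 0.5 else 0

# Toy "win or lose" example with hand-picked (untrained) weights.
weights, bias = [0.8, -0.5], 0.1
print(predict(weights, bias, [3.0, 1.0]))  # high score -> 1
print(predict(weights, bias, [0.0, 4.0]))  # low score -> 0
```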
BlazingText
- Implements the Word2vec and text classification algorithms
- Useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation
- Also useful for applications that perform web searches, information retrieval, ranking, and document classification
- Maps words to vectors
- Resulting vector representation of a word is called a word embedding
- Words that are semantically similar correspond to vectors that are close together, so word embeddings capture the semantic relationships between words
- Example use case: spam detection using detection of common spam phrases like “you’re a winner!”
- Important Hyperparameters
  - `mode`: Word2vec architecture used for training (`batch_skipgram`, `skipgram`, `cbow`)
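The claim that semantically similar words map to nearby vectors can be illustrated with cosine similarity. The 3-d "embeddings" below are invented for the sketch; BlazingText learns real ones from a corpus:

```python
# Cosine similarity between toy word embeddings: semantically similar
# words should score closer to 1.0 than unrelated ones.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hand-made 3-d "embeddings" (purely illustrative, not trained).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

# "king" should be closer to "queen" than to "apple".
print(cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["apple"]))  # True
```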
XGBoost
- Implementation of gradient boosted trees algorithm
- Supervised learning algorithm for predicting a target by combining the estimates from a set of simpler models
- Requires a data matrix with one row per observation and one column per feature
- Also requires a target column across the observations
- Can differentiate the importance of features through weights
- Example use case: predict default on credit card payments
- Important Hyperparameters
  - `num_round`: number of rounds the training runs
  - `num_class`: the number of classes
  - `objective`: learning task and learning objective (e.g. `reg:logistic`, `multi:softmax`)
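The idea of predicting a target by combining estimates from simpler models can be sketched as boosting over depth-1 stumps. This is a toy squared-error version of gradient boosting, not the regularized tree implementation XGBoost actually uses:

```python
# Sketch of boosting: each round fits a tiny "stump" to the residuals of
# the current ensemble; the final prediction sums all stump outputs.
def fit_stump(xs, residuals):
    """Pick the threshold on x that best splits residuals into two means."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if x <= t else rm)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, num_round=3, lr=0.5):
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, preds)]  # what's left to explain
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

model = boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0], num_round=5)
print(round(model(1.5), 1), round(model(3.5), 1))
```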
K-Nearest Neighbors
- Finds the k closest points to the sample point and predicts the most frequent label among them (classification) or the average of their values (regression)
- Index based
- Objective: build k-NN index to allow for efficient determination of the distance between points
- Train to construct the index
- Use dimensionality reduction to avoid the “curse of dimensionality”
- Example use case: predict wilderness tree types from geological and forest service data
- Important Hyperparameters
  - `feature_dim`: number of features in the input
  - `k`: number of nearest neighbors
  - `predictor_type`: `classifier` for classification problems
  - `sample_size`: number of data points to be sampled from the training dataset
  - `dimension_reduction_target`: target dimension to reduce to
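A minimal majority-vote k-NN classifier, assuming Euclidean distance and no index. SageMaker builds an index so that distances can be computed efficiently; this sketch simply scans all training points:

```python
# Toy k-NN: sort training points by squared distance to the query and
# take a majority vote over the labels of the k closest.
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Two well-separated clusters with labels "A" and "B".
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_x, train_y, (1, 1)))  # "A"
print(knn_predict(train_x, train_y, (5, 4)))  # "B"
```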
Factorization Machines
- Extension of a linear model used on high-dimensional sparse datasets
- Typically used for sparse datasets such as click prediction and item recommendation
- Scored using Binary Cross Entropy (Log Loss), Accuracy (at threshold=0.5), and F1 Score (at threshold=0.5)
- Example use case: item recommendation
- Important Hyperparameters
  - `feature_dim`: number of features in the input
  - `num_factors`: dimensionality of factorization
  - `predictor_type`: `binary_classifier` for classification problems
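The factorization machine score extends the linear model with pairwise feature interactions, factorized through low-dimensional latent vectors. A sketch with invented weights and two latent factors per feature (the equivalent of `num_factors=2`):

```python
# Factorization machine score: bias + linear term + pairwise interactions,
# where each interaction weight is the dot product of two latent vectors.
def fm_score(w0, w, V, x):
    """w0: bias, w: linear weights, V: one latent vector per feature."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

# Invented weights; the sparse input mimics a user/item one-hot encoding.
w0 = 0.1
w = [0.2, -0.1, 0.3]
V = [[0.5, 0.1], [0.4, 0.2], [-0.3, 0.6]]
x = [1.0, 0.0, 1.0]  # "user 0 interacted with item 2"
print(round(fm_score(w0, w, V, x), 2))
```

Because most entries of `x` are zero, most pairwise terms vanish, which is why the model suits sparse data.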
Image Classification
- Supervised learning algorithm that supports multi-label classification
- Takes an image as input and outputs one or more labels assigned to that image
- Uses a convolutional neural network (ResNet) that can be trained from scratch or trained using transfer learning when a large number of training images are not available
- Recommended input format is RecordIO, but raw images in `.jpg` or `.png` format can also be used
- Example use case: classify image content as offensive or not for Twitter feed posts
- Important Hyperparameters
  - `num_classes`: the number of output classes
  - `num_training_samples`: number of training examples in the input dataset
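For context, a hyperparameter dictionary of the kind one would pass to this built-in algorithm. The values are purely illustrative, and `use_pretrained_model` (which switches on transfer learning when training images are scarce) is included on the assumption that it matches the algorithm's documented hyperparameter name:

```python
# Illustrative hyperparameters for the built-in image classification
# algorithm; values are made up for the sketch, not recommendations.
hyperparameters = {
    "num_classes": 2,             # e.g. offensive vs. not offensive
    "num_training_samples": 10000,
    "use_pretrained_model": 1,    # assumed name: transfer learning on
}
print(hyperparameters["num_classes"])
```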
Random Cut Forest
- Unsupervised algorithm for detecting anomalous data points within a data set
- Uses an anomaly score
- Low score indicates data point is considered normal, high score indicates the presence of an anomaly in the data
- Definitions of “low” and “high” depend on the application; a common practice is to treat scores beyond three standard deviations from the mean score as anomalous
- Training data does not require a target column; an optional labeled test channel can be supplied to compute evaluation metrics
- Example use case: find exceptions in streaming trade data
- Important Hyperparameters
  - `feature_dim`: number of features in the input
  - `eval_metrics`: list of metrics used to score a labeled test dataset (`accuracy`, `precision_recall_fscore`)
  - `num_trees`: number of trees in the forest
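The three-standard-deviation rule of thumb can be sketched directly. The scores below are fabricated so that one point clearly stands out:

```python
# Flag anomaly scores more than three standard deviations above the mean.
import statistics

def find_anomalies(scores):
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)  # population standard deviation
    cutoff = mean + 3 * std
    return [i for i, s in enumerate(scores) if s > cutoff]

# Mostly "normal" scores with one obvious spike at the end (index 18).
scores = [1.0, 1.1, 0.9] * 6 + [12.0]
print(find_anomalies(scores))  # [18]
```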
Labs
This article was created by Oscaner and is licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as reposted or credited to another source, articles on this site are original or translated; please credit the author when reposting.