Defined
- Both supervised and unsupervised learning algorithms
- Take text or documents as input and either categorize, sequence, or classify the text or documents
- Used as preprocessing for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation
- Text classification is used in applications that perform web searches, ranking, and document classification
Use Cases
- Sentiment analysis for social media streams
- Categorize documents by topic for law firms
- Language translation
- Speech-to-text
- Summarizing longer documents
- Conversational user interfaces
- Text generation
- Word pronunciation app
SageMaker Algorithms
BlazingText
- Implements the Word2vec and text classification algorithms
- Can use pre-trained vector representations that improve the generalizability of other models that are later trained on a more limited amount of data
- Can easily scale for large text datasets
- Can train a model on more than a billion words in minutes using a large multi-core CPU or a GPU
- Words that are semantically similar correspond to vectors that are close together, so the word embeddings capture the semantic relationships between words
- Example use case: market research using sentiment analysis
- Important Hyperparameters
  - mode: architecture used for training
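The point that semantically similar words map to nearby vectors can be illustrated with cosine similarity. This is a minimal sketch that uses hand-made three-dimensional toy vectors (an assumption for illustration only; real Word2vec embeddings are learned from data and have many more dimensions). It is not output from BlazingText.

```python
import math

# Toy vectors invented for illustration; they are NOT trained embeddings.
embeddings = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "terrible": [-0.7, 0.1, 0.6],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Words with similar meaning should score higher than words with opposite meaning.
sim_close = cosine_similarity(embeddings["good"], embeddings["great"])
sim_far = cosine_similarity(embeddings["good"], embeddings["terrible"])

print(f"good vs great:    {sim_close:.3f}")
print(f"good vs terrible: {sim_far:.3f}")
```

With trained embeddings, the same comparison is what lets a downstream sentiment model treat "good" and "great" as near-equivalents.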
Latent Dirichlet Allocation (LDA)
- Unsupervised learning algorithm that organizes a set of text observations into distinct categories
- Frequently used to discover a number of topics shared across documents within a collection of texts, or a corpus
- In an LDA-based model, each observation is a document and each feature is the count of a word in that document
- Topics are not specified in advance
- Each document is described as a mixture of topics
- Example use case: find common topics in call center transcripts
- Important Hyperparameters
  - num_topics: number of topics to find in the data
  - feature_dim: size of the vocabulary of the input document corpus
  - mini_batch_size: total number of documents in the input document corpus
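The input representation described for LDA (each document is an observation, each feature is a word count) can be sketched in plain Python. The toy corpus, the whitespace tokenization, and the choice of two topics are assumptions made for illustration. The key `mini_batch_size` follows the spelling used in the SageMaker LDA documentation, and the values are simply derived from the toy corpus.

```python
from collections import Counter

# Hypothetical call-center snippets, invented for illustration.
corpus = [
    "my card was charged twice",
    "the card reader is broken",
    "i was charged a late fee",
]

# Naive whitespace tokenization; a real pipeline would also normalize text.
tokenized = [doc.split() for doc in corpus]

# The vocabulary is the sorted set of all distinct tokens in the corpus.
vocabulary = sorted({tok for doc in tokenized for tok in doc})


def to_count_vector(tokens, vocab):
    """Map one document to a vector of word counts, ordered like vocab."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


# One row per document, one column per vocabulary word.
doc_term_matrix = [to_count_vector(doc, vocabulary) for doc in tokenized]

hyperparameters = {
    "num_topics": 2,                          # arbitrary choice for this sketch
    "feature_dim": len(vocabulary),           # vocabulary size
    "mini_batch_size": len(doc_term_matrix),  # number of documents
}

print(hyperparameters)
```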
Neural Topic Model (NTM)
- Unsupervised learning algorithm that organizes a corpus of documents into topics containing word groupings, based on the statistical distribution of the word groupings
- Frequently used to classify or summarize documents based on topics detected
- Also used to retrieve information or recommend content based on topic similarities
- Topics are inferred from observed word distributions in the corpus
- Used to visualize the contents of a large set of documents in terms of the learned topics
- Similar to LDA in purpose, but it is a distinct algorithm and can be expected to produce different results on the same input data
- Example use case: find the topics of newsgroup message posts
- Important Hyperparameters
  - feature_dim: vocabulary size of the dataset
  - num_topics: number of required topics
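To make "topics containing word groupings" concrete, the sketch below lists the highest-weighted words for each topic. The vocabulary and the topic-word weight matrix are hypothetical values written by hand, not the output of a trained NTM model. In this toy setup `feature_dim` would be the vocabulary length and `num_topics` the number of rows.

```python
# Hypothetical vocabulary; its length plays the role of feature_dim.
vocabulary = ["game", "team", "score", "cpu", "disk", "memory"]

# Hypothetical topic-word weights (one row per topic); NOT learned values.
topic_word_weights = [
    [0.30, 0.25, 0.20, 0.10, 0.05, 0.10],  # a sports-like topic
    [0.05, 0.05, 0.10, 0.35, 0.25, 0.20],  # a hardware-like topic
]


def top_words(weights, vocab, k=3):
    """Return the k vocabulary words with the largest weights."""
    ranked = sorted(zip(weights, vocab), key=lambda pair: pair[0], reverse=True)
    return [word for _, word in ranked[:k]]


topics = [top_words(row, vocabulary) for row in topic_word_weights]

feature_dim = len(vocabulary)
num_topics = len(topic_word_weights)

for index, words in enumerate(topics):
    print(f"topic {index}: {', '.join(words)}")
```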
Object2Vec
- General-purpose neural embedding algorithm that finds related clusters of words (words that are semantically similar)
- Embeddings can be used to find nearest neighbors of objects, and can also visualize clusters of related objects
- Besides word embeddings, Object2Vec can also learn the embeddings of other objects such as sentences, customers, products, etc.
- Frequently used for information retrieval, product search, item matching, customer profiling, etc. based on related topics
- Supports embeddings of paired tokens, paired sequences, and paired token to sequence
- Example use case: recommendation engine based on collaborative filtering
- Important Hyperparameters
  - enc0_max_seq_len: maximum sequence length for the enc0 encoder
  - enc0_vocab_size: vocabulary size of enc0 tokens
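For the collaborative-filtering use case, training data takes the form of paired inputs with a label. The sketch below turns hypothetical (user, item, rating) triples into JSON Lines records. The field names `in0`, `in1`, and `label` reflect my understanding of the Object2Vec input layout and should be checked against the SageMaker documentation; the ratings themselves are invented.

```python
import json

# Hypothetical (user_id, item_id, rating) triples, invented for illustration.
ratings = [
    (0, 10, 5.0),
    (0, 11, 1.0),
    (1, 10, 4.0),
]


def to_record(user_id, item_id, rating):
    """Build one paired-token record: user on encoder 0, item on encoder 1."""
    return {"in0": [user_id], "in1": [item_id], "label": rating}


# One JSON object per line (JSON Lines).
jsonl = "\n".join(json.dumps(to_record(*row)) for row in ratings)

# Values for the enc0 hyperparameters, derived from this toy data.
enc0_vocab_size = max(user for user, _, _ in ratings) + 1
enc0_max_seq_len = max(len(json.loads(line)["in0"]) for line in jsonl.splitlines())

print(jsonl)
print(enc0_vocab_size, enc0_max_seq_len)
```

Each side of the pair is a list, so the same record shape also covers paired sequences, where `in0` or `in1` holds several token ids.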
Sequence-to-Sequence (seq2seq)
- Supervised learning algorithm with input of a sequence of tokens (audio, text, radar data) and output of another sequence of tokens
- Can be used for translation from one language to another, text summarization, and speech-to-text
- Uses Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models
- Uses state-of-the-art encoder-decoder architecture
- Uses input of sequence data in recordio-protobuf format and JSON vocabulary mapping files
- Example use case: word pronunciation dictionary, a sequence of text as input and a sequence of audio as output
- Important Hyperparameters
- Has no required hyperparameters
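The JSON vocabulary mapping mentioned above pairs each token with an integer id. The sketch below builds such a mapping from a toy corpus and encodes a sentence with it. The special tokens and their ids are an assumption for illustration, and the conversion of the encoded sequences to recordio-protobuf is left out.

```python
import json
from collections import Counter

# Assumed special tokens; confirm the exact set against the algorithm's docs.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<s>", "</s>"]


def build_vocab(sentences):
    """Map tokens to integer ids: special tokens first, then by frequency."""
    counts = Counter(tok for sentence in sentences for tok in sentence.split())
    vocab = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    # Sort by descending count, then alphabetically, so ids are deterministic.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Replace each token with its id; unknown tokens map to <unk>."""
    unk_id = vocab["<unk>"]
    return [vocab.get(tok, unk_id) for tok in sentence.split()]


# Hypothetical source-side sentences.
source_sentences = ["the cat sat", "the dog sat down"]

vocab = build_vocab(source_sentences)
vocab_json = json.dumps(vocab, indent=2)  # contents of a vocabulary mapping file
encoded = encode("the bird sat", vocab)   # "bird" is out of vocabulary

print(vocab_json)
print(encoded)
```

A separate mapping would be built the same way for the target side, since source and target sequences generally use different vocabularies.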
Labs
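This article was written by Oscaner and is licensed under a Creative Commons Attribution 4.0 International License.
Unless marked as reprinted or sourced elsewhere, articles on this site are original works or translations by this site; please credit the author before reprinting.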