- Data Engineering
- Data Analysis
- Implementation and Operations
Datasets: collections of data used as the “fuel” for our machine learning models.
Features: the columns of a dataset; each feature describes one attribute of the data.
Observations: the rows of a dataset; each observation is one instance with a value for each feature.
- Datasets can be stored as comma-separated values (CSV) or JSON
- Datasets can also be images, audio, or video
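As a rough illustration of these terms, the sketch below loads the same small dataset from CSV and from JSON using only Python's standard library. The feature names (`sqft`, `bedrooms`, `price`) and all values are made up for the example.

```python
import csv
import io
import json

# Hypothetical dataset: three features (columns), two observations (rows).
csv_text = "sqft,bedrooms,price\n850,2,200000\n1200,3,310000\n"
json_text = (
    '[{"sqft": 850, "bedrooms": 2, "price": 200000},'
    ' {"sqft": 1200, "bedrooms": 3, "price": 310000}]'
)

# csv.DictReader yields one mapping per observation, keyed by feature name.
# CSV values arrive as strings, so they are converted to int here.
csv_rows = [
    {name: int(value) for name, value in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]

# JSON keeps the numeric types, so no conversion is needed.
json_rows = json.loads(json_text)

features = list(csv_rows[0].keys())
print("features:", features)
print("number of observations:", len(csv_rows))
print("same data in both formats:", csv_rows == json_rows)
```

Either format ends up as the same list of observations, each holding one value per feature.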
Ways to organize data
Structured data: has a schema
- Example: table of values such as a relational database table
Unstructured data: doesn’t have a schema
- Example: PDFs, images, video, audio, logs, tweets
Semi-structured data: contains tags to separate semantic elements and enforce hierarchies
- Examples: CSV, JSON, XML
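To make the difference between semi-structured and structured data more concrete, here is a minimal sketch that flattens a nested JSON record into a single flat row, the shape a table with a fixed schema would expect. The record contents and the `flatten` helper are hypothetical, written only for this example.

```python
import json

# Hypothetical semi-structured record: keys act as tags, nesting gives hierarchy.
record_text = '{"user": {"id": 7, "name": "Ana"}, "tags": ["ml", "data"]}'
record = json.loads(record_text)


def flatten(obj, prefix=""):
    """Turn nested dicts into one flat dict with dotted column names."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column prefix.
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


row = flatten(record)
print(row)
```

Lists such as `tags` are left as-is in this sketch; a real schema would need a decision on how to store them (for example, a separate table).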
Labeled data: Has a target attribute
Unlabeled data: No target attribute
Supervised Learning: Mostly used with labeled data
Unsupervised Learning: Mostly used with unlabeled data
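The labeled/unlabeled distinction can be sketched as follows: a labeled dataset carries a target attribute that is split off from the features before supervised training, while an unlabeled dataset has only the features. The column names (`num_links`, `has_attachment`, `spam`) and the `is_labeled` helper are assumptions made for this example.

```python
# Hypothetical labeled dataset: "spam" is the target attribute.
labeled = [
    {"num_links": 5, "has_attachment": 1, "spam": 1},
    {"num_links": 0, "has_attachment": 0, "spam": 0},
]
target = "spam"


def is_labeled(rows, target_name):
    """A dataset counts as labeled here if every row has the target attribute."""
    return all(target_name in row for row in rows)


# Supervised learning consumes the feature rows X together with the labels y.
X = [{k: v for k, v in row.items() if k != target} for row in labeled]
y = [row[target] for row in labeled]

# Without the target column, the same rows are unlabeled data,
# which is what unsupervised learning typically works with.
unlabeled = X

print("labeled:", is_labeled(labeled, target))
print("unlabeled:", not is_labeled(unlabeled, target))
print("X:", X)
print("y:", y)
```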
Categorical features
- Made up of a set of categories
- They describe a quality, not a quantity
- They are distinct
Continuous features
- They describe a quantity, not a quality
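Features that describe a quality (often called categorical) usually need to be encoded before a model can use them, whereas features that describe a quantity can be used as numbers directly. The sketch below shows one common encoding, one-hot encoding, on made-up data; the feature names `color` and `weight_kg` and the `one_hot` helper are hypothetical.

```python
# Hypothetical observations: "color" is a quality, "weight_kg" is a quantity.
rows = [
    {"color": "red", "weight_kg": 1.5},
    {"color": "blue", "weight_kg": 2.0},
    {"color": "red", "weight_kg": 0.7},
]

# The distinct categories, in a fixed (sorted) order.
categories = sorted({row["color"] for row in rows})


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


# Each encoded row is the one-hot vector followed by the numeric feature as-is.
encoded = [one_hot(row["color"], categories) + [row["weight_kg"]] for row in rows]

print("categories:", categories)
for vector in encoded:
    print(vector)
```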
Text data
- Data collected from text
- Used in Natural Language Processing (NLP)
- Examples: speech recognition, text-to-speech
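Text generally has to be broken into tokens and counted or otherwise converted to numbers before NLP methods can work with it. The sketch below shows a very simple version of that step using only the standard library; the two sentences and the `tokenize` helper are made up for the example, and real NLP pipelines typically use more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical two-document text collection.
corpus = [
    "The model learns from data.",
    "More data helps the model.",
]


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


# Count how often each token appears across all documents.
counts = Counter(token for doc in corpus for token in tokenize(doc))

for token, count in sorted(counts.items()):
    print(token, count)
```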
Ground Truth Data
Data that has been labeled, either by human labelers or by machine learning algorithms, and is treated as the correct reference for training and evaluating models.