[MLS-C01] [Exploratory Data Analysis] AWS Glue

Posted by Oscaner on June 23, 2022

Key Facts

  • A fully managed ETL service for categorizing, cleaning, enriching, and moving your data
  • Glue components
    • Central metadata repository: Glue Catalog
    • ETL engine that automatically generates Python or Scala code
    • Flexible scheduler for dependency resolution, job monitoring, and retries
  • Serverless
  • Can convert semi-structured schemas to relational schemas on the fly
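
To make the last point concrete, here is a minimal sketch in plain Python of what turning a semi-structured schema into a relational one can look like. It does not call Glue (Glue ships its own `Relationalize` transform for this). The function names `flatten` and `relationalize`, the `id`/`index`/`val` column names, and the sample `orders` data are all assumptions made up for illustration.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, using dotted column names."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        else:
            flat[column] = value
    return flat


def relationalize(records: list[dict[str, Any]], root: str = "root") -> dict[str, list[dict[str, Any]]]:
    """Split nested records into a root table plus one child table per list field.

    Hypothetical layout: each child row carries the parent's "id" as a foreign key.
    """
    tables: dict[str, list[dict[str, Any]]] = {root: []}
    for row_id, record in enumerate(records):
        row: dict[str, Any] = {"id": row_id}
        for column, value in flatten(record).items():
            if isinstance(value, list):
                # Lists cannot live in a flat row, so they move to a child table.
                child = tables.setdefault(f"{root}_{column}", [])
                for index, item in enumerate(value):
                    child.append({"id": row_id, "index": index, "val": item})
            else:
                row[column] = value
        tables[root].append(row)
    return tables


# Illustrative sample data (not from the original post).
orders = [
    {"customer": {"name": "Ana", "city": "Lima"}, "items": ["pen", "ink"]},
    {"customer": {"name": "Bo", "city": "Oslo"}, "items": ["pad"]},
]
tables = relationalize(orders, root="orders")
```

Here `tables["orders"]` holds one flat row per order and `tables["orders_items"]` holds one row per list element, linked back through `id`. Lists of nested objects are left unflattened in this sketch to keep it short.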

Terminology

  • Data Catalog: persistent metadata store
  • Classifier: determines the schema of your data
  • Connection: the properties required to connect to a data store
  • Crawler: connects to a data store and steps through a prioritized list of classifiers to determine the schema
  • Database: set of associated data catalog table definitions
  • Data store: repository for persistently storing data
  • Data source: data store used as input to transformation
  • Data target: data store that a transformation writes to
  • Job: ETL logic
  • Table: metadata definition that represents your data
  • Transform: code logic to change your data into a different format
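
The relationship between a crawler and its classifiers can be sketched in plain Python as follows. This is a simplified model and not the Glue API: the classifier functions, the `crawl` helper, and the rule "first classifier that recognizes the sample wins" are assumptions for illustration (Glue's real classifiers report a certainty score).

```python
import csv
import io
import json
from typing import Callable, Optional

Schema = dict[str, str]
Classifier = Callable[[str], Optional[Schema]]


def json_classifier(sample: str) -> Optional[Schema]:
    """Recognize JSON-lines data and infer column types from the first record."""
    try:
        first = json.loads(sample.splitlines()[0])
    except (json.JSONDecodeError, IndexError):
        return None
    if not isinstance(first, dict):
        return None
    return {key: type(value).__name__ for key, value in first.items()}


def csv_classifier(sample: str) -> Optional[Schema]:
    """Recognize CSV with a header row; every column is treated as a string."""
    rows = list(csv.reader(io.StringIO(sample)))
    if len(rows) < 2 or len(rows[0]) < 2:
        return None
    return {name: "str" for name in rows[0]}


def crawl(sample: str, classifiers: list[tuple[str, Classifier]]) -> dict:
    """Step through classifiers in priority order; the first match sets the schema."""
    for name, classify in classifiers:
        schema = classify(sample)
        if schema is not None:
            return {"classification": name, "columns": schema}
    return {"classification": "UNKNOWN", "columns": {}}


# Priority order: JSON is tried before CSV.
CLASSIFIERS = [("json", json_classifier), ("csv", csv_classifier)]

json_table = crawl('{"id": 1, "name": "Ana"}\n{"id": 2, "name": "Bo"}', CLASSIFIERS)
csv_table = crawl("id,name\n1,Ana\n2,Bo", CLASSIFIERS)
unknown_table = crawl("hello", CLASSIFIERS)
```

The dictionary returned by `crawl` plays the role of a table definition: metadata describing the data, which would then be stored in a database inside the Data Catalog.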

Components

  • Console: define and orchestrate ETL workflows
  • Data Catalog: persistent metadata store
  • Crawlers and Classifiers: crawlers scan data and classify it
  • ETL Operations: uses metadata in the Data Catalog to autogenerate Python or Scala code
  • Jobs System: managed infrastructure to orchestrate your ETL workflow
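
What a jobs system does when it resolves dependencies and retries failed jobs can be approximated with a small plain-Python sketch. It is only a toy model under stated assumptions, not how Glue is implemented: `run_workflow`, the job names, and the `max_retries` parameter are hypothetical, and the ordering relies on the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_workflow(
    jobs: dict[str, Callable[[], None]],
    dependencies: dict[str, set[str]],
    max_retries: int = 2,
) -> dict[str, int]:
    """Run jobs in dependency order, retrying each failed job up to max_retries times.

    `dependencies` maps a job name to the set of jobs that must finish first.
    Returns the number of attempts each job needed.
    """
    attempts: dict[str, int] = {}
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                jobs[name]()
                break
            except Exception:
                if attempt == max_retries + 1:
                    raise  # retries exhausted: fail the whole workflow
    return attempts


# Illustrative workflow: crawl -> transform -> load, with one transient failure.
log: list[str] = []
state = {"transform_failures_left": 1}


def crawl_job() -> None:
    log.append("crawl")


def transform_job() -> None:
    if state["transform_failures_left"] > 0:
        state["transform_failures_left"] -= 1
        raise RuntimeError("transient failure")
    log.append("transform")


def load_job() -> None:
    log.append("load")


jobs = {"crawl": crawl_job, "transform": transform_job, "load": load_job}
dependencies = {"transform": {"crawl"}, "load": {"transform"}}
attempts = run_workflow(jobs, dependencies)
```

In this run the transform job fails once and succeeds on its second attempt, and the load job starts only after the transform has finished.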

Labs


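This article was written by Oscaner and is licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as reposted or attributed to another source, all articles on this site are original works or translations by this site. Please credit the author before reposting.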