AWS Glue Cheat Sheet

  • Fully managed ETL service
  • Serverless
  • Pay as you go
  • ETL jobs run on a scalable Apache Spark environment
  • Supports encryption for data at rest and in transit
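Encryption at rest is configured through a security configuration, which is then attached to jobs, crawlers, or development endpoints. A minimal boto3 sketch; the configuration name and KMS key ARN are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Security configuration enabling encryption at rest for S3 job output,
# CloudWatch logs, and job bookmarks (name and KMS key ARN are placeholders)
glue.create_security_configuration(
    Name="glue-encrypted",
    EncryptionConfiguration={
        "S3Encryption": [{
            "S3EncryptionMode": "SSE-KMS",
            "KmsKeyArn": "arn:aws:kms:us-east-1:123456789012:key/example",
        }],
        "CloudWatchEncryption": {
            "CloudWatchEncryptionMode": "SSE-KMS",
            "KmsKeyArn": "arn:aws:kms:us-east-1:123456789012:key/example",
        },
        "JobBookmarksEncryption": {
            "JobBookmarksEncryptionMode": "CSE-KMS",
            "KmsKeyArn": "arn:aws:kms:us-east-1:123456789012:key/example",
        },
    },
)
```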
  • Data Sources supported by AWS Glue:
    • Amazon S3
    • Amazon RDS
    • Amazon DynamoDB
    • Amazon Redshift
    • Amazon Kinesis Data Streams
    • Third-party JDBC-accessible databases (e.g., Oracle, Microsoft SQL Server, MySQL, PostgreSQL)
    • MongoDB (via a MongoDB connection rather than JDBC)
    • Apache Kafka
  • Data Targets supported by AWS Glue:
    • Amazon S3
    • Amazon RDS
    • Third-party JDBC-accessible databases (e.g., Oracle, Microsoft SQL Server, MySQL, PostgreSQL)
    • MongoDB (via a MongoDB connection rather than JDBC); a minimal job sketch follows this list
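A minimal PySpark job sketch tying a source and a target together. It assumes a crawler has already cataloged S3 data as table `raw_orders` in database `sales_db`; those names and the output bucket path are hypothetical:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job boilerplate: resolve the job name, set up contexts
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Extract: read the Data Catalog table a crawler created over S3 data
# ("sales_db" and "raw_orders" are hypothetical names)
source = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Transform: rename/cast columns with the built-in ApplyMapping transform
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount",   "string", "amount",   "double"),
    ])

# Load: write Parquet to an S3 target (bucket path is a placeholder)
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet")

job.commit()
```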
  • Components of AWS Glue:
    • Data Catalog -> Persistent metadata repository where table definitions, job definitions, and other control information are stored
    • Crawler -> Program that scans a data store and creates metadata tables in the Data Catalog (see the boto3 sketch after this list)
    • Classifier -> Used by a crawler to determine the schema of the data in a data store
    • Database -> Logical group of associated table definitions within the Data Catalog
    • Connection -> Data Catalog object holding the properties (credentials, URL, VPC details) required to connect to a data store
    • Data store -> Repository the crawler scans, such as Amazon S3 or a relational database
    • Table -> Metadata definition in the Data Catalog that the crawler writes to; it describes the schema of the data, not the data itself
    • Data source -> Data store used as input to an ETL job
    • Data target -> Data store to which the ETL job writes its output, such as Amazon Redshift
    • Script -> PySpark or Scala code that extracts, transforms, and loads data
    • Transform -> Code logic used to manipulate your data into a different format
    • Trigger -> Starts an ETL job; can fire on a schedule, on demand, or in response to an event
    • Job -> The business logic (script, data sources, and data targets) that carries out the ETL work
    • Development endpoint -> Environment for developing and testing ETL scripts
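Several of these components can be created programmatically. A boto3 sketch that sets up a database, a crawler over an S3 data store, and a scheduled trigger for an existing job named `orders-etl`; all names, the role ARN, and the bucket path are hypothetical:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Database: a logical group of tables in the Data Catalog (name is hypothetical)
glue.create_database(DatabaseInput={"Name": "sales_db"})

# Crawler: scans the S3 data store and writes table metadata into sales_db
# (role ARN and bucket path are placeholders)
glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/orders/"}]},
)
glue.start_crawler(Name="orders-crawler")

# Trigger: kicks off the existing "orders-etl" job every day at 02:00 UTC
glue.create_trigger(
    Name="orders-nightly",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "orders-etl"}],
    StartOnCreation=True,
)
```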