- Fully managed ETL service
- Serverless
- Pay as you go
- ETL jobs are executed on a scalable Apache Spark environment
- Supports encryption for data at rest and in transit
- Data Sources supported by AWS Glue:
- Amazon S3
- Amazon RDS
- Amazon DynamoDB
- Amazon Redshift
- Amazon Kinesis Data Streams
- JDBC-accessible databases (Oracle Database, SQL Server, MySQL, PostgreSQL) and MongoDB
- Apache Kafka
- Data Targets supported by AWS Glue:
- Amazon S3
- Amazon RDS
- JDBC-accessible databases (Oracle Database, SQL Server, MySQL, PostgreSQL) and MongoDB (see the connection sketch after this list)
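
Below is a minimal boto3 sketch of how a JDBC connection to one of these stores could be registered in Glue; the connection name, JDBC URL, credentials, and VPC details are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Register a JDBC connection to a hypothetical PostgreSQL instance.
glue.create_connection(
    ConnectionInput={
        "Name": "my-postgres-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://db.example.com:5432/sales",
            "USERNAME": "glue_user",
            "PASSWORD": "replace-me",
        },
        # Networking details so Glue can reach the database (placeholders).
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```

Crawlers and ETL jobs can then reference this connection by name instead of repeating the credentials.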
- Components of AWS Glue:
- Data Catalog -> Persistent metadata repository that stores table definitions, job definitions, and other control information
- Crawler -> Program that connects to a data store, infers the schema, and creates metadata tables in the Data Catalog
- Classifier -> Used by the crawler to recognize the format of the data and infer its schema
- Database -> Logical grouping of associated table definitions within the Data Catalog
- Connection -> Data Catalog object that holds the properties (URL, credentials, VPC details) needed to connect to a data store
- Data store -> Repository where the data is persisted and that the crawler reads from, such as Amazon S3 or a relational database
- Table -> Metadata definition in the Data Catalog written by the crawler; it describes the schema of the data, not the data itself
- Data source -> Data store used as input to an ETL job
- Data target -> Data store that the ETL job writes transformed data to, such as Amazon Redshift or Amazon S3
- Script -> PySpark or Scala code that extracts, transforms, and loads the data (see the job sketch after this list)
- Transform -> The code logic used to manipulate the data into a different format
- Trigger -> Starts an ETL job; can fire on a schedule, on demand, or on a job-completion event
- Job -> The business logic (script, data source, and data target) that performs the ETL work; started by a trigger or on demand
- Development endpoint -> Environment for interactively developing and testing ETL scripts
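
Below is a minimal PySpark sketch showing how these components fit together in a Glue job script: it reads a table that a crawler registered in the Data Catalog (data source), applies a transform, and writes the result to Amazon S3 (data target). The database, table, column, and S3 path names are hypothetical placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Job bookkeeping: JOB_NAME is passed in when the trigger starts the job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Data source: a table the crawler created in the Data Catalog (placeholder names).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
)

# Transform: rename/cast columns with the built-in ApplyMapping transform.
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Data target: write the transformed data to S3 as Parquet (placeholder path).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-etl-bucket/curated/orders/"},
    format="parquet",
)

job.commit()
```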