- Fully managed ETL service
- Serverless
- Pay as you go
- ETL jobs are executed on a scalable Apache Spark environment
- Supports encryption for data at rest and in transit
- Data Sources supported by AWS Glue:
- Amazon S3
- Amazon RDS
- Amazon DynamoDB
- Amazon Redshift
- Amazon Kinesis Data Streams
- JDBC-accessible databases (Oracle Database, SQL Server, MySQL, PostgreSQL) and MongoDB
- Apache Kafka
- Data Targets supported by AWS Glue:
- Amazon S3
- Amazon RDS
- JDBC-accessible databases (Oracle Database, SQL Server, MySQL, PostgreSQL) and MongoDB (see the connection sketch after this list)
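
Below is a minimal boto3 sketch of how a JDBC connection to one of these stores could be registered in Glue; the connection name, JDBC URL, credentials, and VPC details are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Register a JDBC connection to a hypothetical PostgreSQL instance.
glue.create_connection(
    ConnectionInput={
        "Name": "my-postgres-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://db.example.com:5432/sales",
            "USERNAME": "glue_user",
            "PASSWORD": "replace-me",
        },
        # Networking details so Glue can reach the database (placeholders).
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```

Crawlers and ETL jobs can then reference this connection by name instead of repeating the credentials.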
- Components of AWS Glue:
- Data Catalog -> Persistent metadata repository that stores table definitions, job definitions, and other control information
- Crawler -> Program that connects to a data store, infers the schema, and creates metadata tables in the Data Catalog
- Classifier -> Used by the crawler to recognize the format of the data and infer its schema
- Database -> Logical grouping of associated table definitions within the Data Catalog
- Connection -> Data Catalog object that holds the properties (URL, credentials, VPC details) needed to connect to a data store
- Data store -> Repository where the data is persisted and that the crawler reads from, such as Amazon S3 or a relational database
- Table -> Metadata definition in the Data Catalog written by the crawler; it describes the schema of the data, not the data itself
- Data source -> Data store used as input to an ETL job
- Data target -> Data store that the ETL job writes transformed data to, such as Amazon Redshift or Amazon S3
- Script -> PySpark or Scala code that extracts, transforms, and loads the data (see the job sketch after this list)
- Transform -> The code logic used to manipulate the data into a different format
- Trigger -> Starts an ETL job; can fire on a schedule, on demand, or on a job-completion event
- Job -> The business logic (script, data source, and data target) that performs the ETL work; started by a trigger or on demand
- Development endpoint -> Environment for interactively developing and testing ETL scripts
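
Below is a minimal PySpark sketch showing how these components fit together in a Glue job script: it reads a table that a crawler registered in the Data Catalog (data source), applies a transform, and writes the result to Amazon S3 (data target). The database, table, column, and S3 path names are hypothetical placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Job bookkeeping: JOB_NAME is passed in when the trigger starts the job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Data source: a table the crawler created in the Data Catalog (placeholder names).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
)

# Transform: rename/cast columns with the built-in ApplyMapping transform.
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Data target: write the transformed data to S3 as Parquet (placeholder path).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-etl-bucket/curated/orders/"},
    format="parquet",
)

job.commit()
```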