The data.world catalog collector (dwcc) ships as a Docker image which can be loaded and run with a series of command line (CLI) options. It outputs a file with the extension *.dwec.ttl that you can upload to data.world manually, or you can have the catalog collector upload it automatically using an API token.
As of dwcc version 1.10, you can catalog Glue ETL jobs associated with an AWS account. There is no need to mount a JDBC drivers directory, as the Glue cataloger uses the Glue API rather than JDBC. As of dwcc version 1.12, the collector supports not only Glue ETL jobs but also Glue Data Catalog tables and columns.
- The computer running the catalog collector should have network access to the AWS Glue instance, and a minimum of 2 GB of memory and a 2 GHz processor.
- You need an AWS credentials file for authentication; the profiles defined in it determine which AWS account's instance to catalog. If you want to use a profile other than the default one, set it with the AWS_PROFILE environment variable. The credentials file is usually located in your home directory. To use the credentials file, mount it into the container (and, for safety, mount it read-only) as shown below.
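For reference, the credentials file (typically ~/.aws/credentials) is a plain INI file with one section per profile. A minimal example with a default profile and one named profile follows; the profile name glue-collector is illustrative, and the key values are placeholders:

```ini
[default]
aws_access_key_id = <ACCESS_KEY_ID>
aws_secret_access_key = <SECRET_ACCESS_KEY>

[glue-collector]
aws_access_key_id = <ACCESS_KEY_ID>
aws_secret_access_key = <SECRET_ACCESS_KEY>
```

To use the named profile instead of the default one, set AWS_PROFILE=glue-collector when running the container.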
- Request access to download links from your data.world representative for the Glue catalog collector. Once you receive the links, download the catalog collector Docker image (or programmatically download it with curl).
- Load the Docker image into the local computer's Docker list:
docker load -i <filename>.tar.gz
Running the collector
The command below uses the following variables:
- <PROFILE> - an alternate AWS profile if you do not wish to use the host computer's default profile.
- <COLLECTION> - the name you want to give to the catalog collection, typically "AWSGlue" or a more unique, business-friendly name that this system has within your organization.
- <REGION> - the AWS region where the AWS Glue instance resides (e.g., us-east-1).
- <ORG> - the name of the organization in data.world (e.g., democorp).
- <API_TOKEN> - you can have the catalog collector automatically upload your *.dwec.ttl file to data.world by using the -U flag and providing an API token with the -t parameter. The file is uploaded to the ddw-catalogs dataset within the organization specified by -a <ORG>. NOTE: When you use the -U flag, you will get an error if you do not specify an organization using -a. To find out how to get a token, see the article on generating an API token.
- <DATABASE_NAME> - the database name is optional. If this parameter is omitted, all databases are cataloged. Otherwise you can catalog a group of databases using a regular expression (e.g., nightowl.+ will catalog all databases that begin with nightowl), or a single database with an exact match (e.g., nightowl). To catalog only jobs and no databases, use the parameter --no-databases.
- <JOB_NAME> - the job name is optional. If this parameter is omitted, all jobs are cataloged. Otherwise you can catalog a group of jobs using a regular expression (e.g., prod.+ will catalog all jobs that begin with prod), or a single job with an exact match (e.g., nightowl-demo). To catalog only databases and no jobs, use the parameter --no-jobs.
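The pattern matching described above can be sketched with grep. This is a standalone illustration of the assumed semantics, not part of dwcc; anchoring the pattern as a full match is inferred from the "exact match" behavior described above, and the job names are made up:

```shell
# Which of these illustrative job names would the pattern prod.+ select?
for job in prod-etl prod-nightly nightowl-demo staging-etl; do
  if echo "$job" | grep -Eq '^(prod.+)$'; then
    echo "cataloged: $job"
  fi
done
# prints:
#   cataloged: prod-etl
#   cataloged: prod-nightly
```

An exact name such as nightowl-demo contains no regex metacharacters, so it simply matches itself.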
Replace the variables with the appropriate information, and run the Glue cataloger:
docker run --rm \
  --mount type=bind,source=/tmp,target=/dwcc-output \
  --mount type=bind,source=/tmp,target=/app/log \
  --mount type=bind,source=/path/to/local/.aws/credentials,target=/root/.aws/credentials,readonly \
  -e AWS_PROFILE=<PROFILE> \
  dwcc catalog-awsglue --name=<COLLECTION> \
  --aws-region=<REGION> -a <ORG> --database-name=<DATABASE_NAME> \
  --job-name=<JOB_NAME> -U -t <API_TOKEN> -o /dwcc-output
- Note: The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.
- If the catalog collector ran without issues, you should see no output in the terminal, but a new file matching *.dwec.ttl should appear in the /tmp directory (the directory mounted as /dwcc-output in the command above).
- If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Either can be sent to the data.world team for investigation if the errors aren't clear.
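A quick way to check the outcome from the host is to look in the mounted output directory. This small check is illustrative and not part of dwcc; it assumes /tmp was mounted as the output directory, as in the command above:

```shell
# Check the mounted output directory for a catalog file.
outdir=/tmp
if ls "$outdir"/*.dwec.ttl >/dev/null 2>&1; then
  echo "catalog file written:" "$outdir"/*.dwec.ttl
else
  echo "no catalog file found; check $outdir for a *.log file" >&2
fi
```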
You can also use cron, your Docker container orchestration, or another automation tool of choice to schedule the catalog collector to run on a recurring basis.
- For systems whose schemas change often and where surfacing the latest is business-critical, a daily run may be appropriate.
- For systems whose schemas change infrequently and are less critical, a weekly or even monthly run may make sense.
- Consult your data.world representative for more tailored recommendations on how best to put your catalog collector processes into production.
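For example, with the docker run command above saved in a wrapper script (the path /usr/local/bin/run-dwcc.sh is hypothetical), a weekly run every Sunday at 02:00 could be scheduled with a crontab entry like:

```
# min hour day-of-month month day-of-week  command
0     2    *            *     0            /usr/local/bin/run-dwcc.sh
```

Adjust the schedule fields to match how often the cataloged system's schemas actually change.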