The data.world catalog collector (dwcc) ships as a Docker image that can be loaded and run with a series of command-line (CLI) options. It outputs a file with the extension *.dwec.ttl that you can upload to data.world manually, or you can have the catalog collector upload it automatically using an API token.
As of dwcc version 1.10, you can catalog ETL jobs associated with an AWS account. There is no need to mount in a JDBC drivers directory, as the Glue cataloger uses the Glue API, not JDBC.
As of dwcc version 1.12, the collector supports not only Glue ETL jobs but also Glue Data Catalog tables and columns.
As of dwcc 1.14, you can change the amount of memory allocated to a dwcc Docker process. See our article on allocating additional memory to Docker for more information.
As of dwcc 1.18, you can specify alternate organization permissions and upload locations when performing an automatic upload of the metadata. If you are a single-tenant VPC customer, you will also need to add --api-host api.customername, where customername is your subdomain (confirm your specific subdomain with your data.world representative if you are unsure).
The computer running the catalog collector should have network access to the Glue instance, a minimum of 2 GB of memory, and a 2 GHz processor.
Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use Docker, we have a Java version available as well; contact us for more details.
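You can quickly confirm that Docker is installed and that the daemon is reachable before proceeding, for example:

docker --version
docker info > /dev/null 2>&1 && echo "Docker daemon is running" || echo "Docker daemon is not reachable"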
An AWS credentials file for authentication, which contains the user profile that determines which AWS account's instance to catalog. Typically this file is at [user's home directory]/.aws/credentials.
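For reference, a minimal credentials file looks like the following sketch. The profile name glue-catalog and the placeholder values are illustrative only; use your own profiles and keys:

[default]
aws_access_key_id = <ACCESS_KEY_ID>
aws_secret_access_key = <SECRET_ACCESS_KEY>

[glue-catalog]
aws_access_key_id = <ACCESS_KEY_ID>
aws_secret_access_key = <SECRET_ACCESS_KEY>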
Permissions for the GetDatabases, GetTables, ListJobs, and GetJob actions on AWS, to ensure that the collector is able to read the metadata it needs. For more information see the Amazon Glue API reference documentation.
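As a sketch, these permissions could be granted with an inline IAM policy like the one below. The user name dwcc-collector and the policy name are assumptions for illustration; adapt them to your own IAM setup:

# attach a read-only Glue policy to the IAM user the collector will authenticate as
aws iam put-user-policy --user-name dwcc-collector --policy-name dwcc-glue-read \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["glue:GetDatabases", "glue:GetTables", "glue:ListJobs", "glue:GetJob"],
      "Resource": "*"
    }]
  }'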
Request a download link for the catalog collector from your data.world representative. Once you receive the link, download the catalog collector Docker image (or programmatically download it with curl).
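For example, a scripted download with curl might look like this (the link itself comes from your data.world representative):

curl -L -o ddw-data-catalog-collector.tar.gz "<download link>"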
Load the Docker image into the local computer's list of Docker images:
docker load -i ddw-data-catalog-collector.tar.gz
The previous command will return an <image id>, which needs to be tagged as 'dwcc'. Copy the <image id> and use it in the docker tag command:
docker tag <image id> dwcc
The parameters used to run the collector are described below; values in angle brackets are variables you should replace:
-e AWS_PROFILE=<PROFILE> : Optional - An alternate profile if you do not wish to use the host computer's default profile. This is a Docker parameter, so it needs to come after the docker run command and before dwcc.
-n <COLLECTION> : The name you want to give to the catalog collection, typically "Glue" or a more unique, relevant business-friendly name that this system has within your organization.
-a <ORG> : The name of your organization in data.world (e.g. democorp).
--database-name <DATABASE> : Optional - If this parameter is omitted, all databases will be cataloged. Otherwise you can catalog a group of databases using a regular expression (e.g., prod.+ will catalog all databases whose names begin with prod), or a single database with an exact match (e.g., nightowl-demo). To catalog only databases and no jobs, use the parameter --no-jobs.
-r <REGION> : The AWS region where the AWS Glue instance resides. (e.g., us-east-1).
--job-name <JOB_NAME> : Optional - If this parameter is omitted, all jobs will be cataloged. Otherwise you can catalog a group of jobs using a regular expression (e.g., prod.+ will catalog all jobs whose names begin with prod), or a single job with an exact match (e.g., nightowl-demo). To catalog only jobs and no databases, use the parameter --no-databases.
-U : Optional - You can have the catalog collector automatically upload your *.dwec.ttl file onto data.world by using the -U flag. If you do, you'll also need to use the -t parameter to provide an API token. The file will be uploaded to the ddw-catalogs dataset within the organization specified by -a <ORG>. NOTE: When you use the -U flag, you will get an error if you do not specify an organization using -a. To find out how to get a token, see the article on generating an API token.
-t <API-TOKEN> : Optional - An API token is required if you want to use -U to automatically upload the *.dwec.ttl file onto data.world.
--upload-location <UPLOAD-LOCATION> : Optional - This parameter can be used to specify a different dataset from ddw-catalogs, a different organization from the default specified by -a, or both. The default is 'account/ddw-catalogs'. With --upload-location you could specify either same-org/new-dataset or new-org/new-dataset. The -U and -t parameters are also required in order to use --upload-location.
--api-host <API-HOST> : Optional - If you are a single-tenant/VPC customer and you want to automatically upload to a different API endpoint, you can use this parameter to set the base of the API URL to something other than api.data.world. This parameter also requires both -U and -t.
The script below is a copy-and-paste version for any Unix environment that's using a Bash shell (e.g., macOS and Linux). Replace the variables in the metadata catalog collector script below and then run it:
docker run --rm \
  --mount type=bind,source=/tmp,target=/dwcc-output \
  --mount type=bind,source=/tmp,target=/app/log \
  --mount type=bind,source=/path/to/local/.aws/credentials,target=/root/.aws/credentials,readonly \
  -e AWS_PROFILE=<PROFILE> \
  dwcc catalog-awsglue \
  -n <COLLECTION> -a <ORG> --database-name <DATABASE> -r <REGION> \
  --job-name <JOB_NAME> -U -t <API-TOKEN> \
  --upload-location <UPLOAD-LOCATION> --api-host <API-HOST> -o /dwcc-output
The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.
If the catalog collector ran without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should be in the /tmp directory from which you executed the command. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Either of those can be sent to the data.world team to investigate if the errors aren't clear.
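A quick way to confirm the run produced output is to list the expected file:

ls -l /tmp/*.dwec.ttl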
Keep your metadata catalog up to date by using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis (see the example cron entry at the end of this section). Considerations for frequency of scheduling include:
Frequency of changes to the schema
Business criticality of up-to-date data
For systems whose schemas change often and where surfacing the latest data is business critical, a daily run may be appropriate. For systems whose schemas do not change often and are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
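For example, a daily run scheduled with cron might look like the following sketch. It assumes the docker run command above has been saved as a script at /opt/dwcc/run-dwcc.sh; the path, schedule, and log location are illustrative only:

# run the dwcc Glue collector every day at 2:00 AM and append output to a log
0 2 * * * /opt/dwcc/run-dwcc.sh >> /var/log/dwcc-cron.log 2>&1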