The data.world catalog collector (dwcc) ships as a Docker image which can be loaded and run with a series of command line (CLI) options. It outputs a file with the extension *.dwec.ttl that you can upload to data.world manually, or you can have the catalog collector upload it automatically using an API token.
Tables in an Athena database often participate in Glue ETL jobs. The data.world catalog collector (dwcc) can catalog lineage information from Glue ETL jobs. For more information, see our article on the Glue catalog collector.
As of dwcc 1.14, you can change the amount of memory allocated to the dwcc Docker process. See our article on allocating additional memory to Docker for more information.
As of dwcc 1.18, you can specify alternate organization permissions and upload locations when performing an automatic upload of the metadata. If you are a single-tenant VPC customer, you will also need to add
--api-host api.customername, where customername is your subdomain (confirm your specific subdomain with your data.world representative if you are unsure).
The computer running the catalog collector should have network access to the Athena instance, a minimum of 2 GB of memory, and a 2 GHz processor.
Docker must be installed. For more information, see https://docs.docker.com/get-docker/. If you can't use Docker, a Java version is also available; contact us for more details.
An AWS credentials file is required for authentication. It contains the user profile that determines which AWS account's instance to catalog. Typically the AWS_CREDENTIALS_FILE is at [user’s home directory]/.aws/credentials.
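Before running the collector, it can help to confirm that the profile you plan to use actually exists in the credentials file. The following is a minimal sketch, not part of dwcc: the function name has_aws_profile is hypothetical, and it assumes the usual INI layout in which each profile begins with a [profile-name] header on its own line with no surrounding whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of dwcc): succeed if the named profile
# has a [profile] header line in the given AWS credentials file.
has_aws_profile() {
  local file="$1" profile="$2"
  # Fail early if the file is missing or unreadable.
  [ -r "$file" ] || return 1
  # -F: fixed string, -x: match the whole line, -q: no output
  grep -Fxq "[$profile]" "$file"
}

# Usage (path and profile name are placeholders):
#   has_aws_profile "$HOME/.aws/credentials" default \
#     || echo "profile not found in credentials file" >&2
```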
Request access to a download link from your data.world representative for the catalog collector. Once you receive the link, download the catalog collector Docker image (or programmatically download it with curl).
Load the Docker image into the local computer’s list of Docker images:
docker load -i ddw-data-catalog-collector.tar.gz
The previous command returns an <image id>, which needs to be tagged as 'dwcc'. Copy the <image id> and use it in the docker tag command:
docker tag <image id> dwcc
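If you want to script the load and tag steps instead of copying the image ID by hand, something like the following may work. It is a sketch under assumptions: it expects docker load to finish with a line of the form "Loaded image ID: sha256:..." (or "Loaded image: name:tag" for an archive that already carries a tag), and the function name loaded_image_ref is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `docker load` output on stdin and print the
# image reference taken from the last "Loaded image" line, if any.
loaded_image_ref() {
  # Strip the "Loaded image: " or "Loaded image ID: " prefix and print
  # only the lines where that prefix was present.
  sed -n 's/^Loaded image\( ID\)\{0,1\}: //p' | tail -n 1
}

# Usage (requires Docker and the downloaded archive):
#   ref=$(docker load -i ddw-data-catalog-collector.tar.gz | loaded_image_ref)
#   docker tag "$ref" dwcc
```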
The following parameters are used to run the collector. Replace each placeholder in angle brackets with your own value:
-e AWS_PROFILE=<PROFILE> : Optional - An alternate profile to use if you do not want the host computer's profile. This is a Docker parameter, so it must come after the docker run command and before dwcc.
-n <COLLECTION> : The name you want to give to the catalog collection, typically "Athena" or a more unique, relevant business-friendly name that this system has within your organization.
-a <ORG> : The name of your organization in data.world (e.g. democorp).
-d <DATABASE> : The name of the schema that contains the tables that will be cataloged.
-r <REGION> : The AWS region where the AWS Glue instance resides (e.g., us-east-1).
---S3-output-location <OUTPUT_LOCATION> : The Amazon S3 bucket where query results should be stored. The location should start with s3://. For example, to store results in a folder named "test-folder-1" inside an S3 bucket named "query-results-bucket", you would set the location to s3://query-results-bucket/test-folder-1.
-U : Optional - You can have the catalog collector automatically upload your *.dwec.ttl file to data.world by using the -U flag. If you do, you'll also need to use the -t parameter to provide an API token. The file will be uploaded to the ddw-catalogs dataset within the organization specified by -a <ORG>. NOTE: When you use the -U flag, you will get an error if you do not specify an organization using -a. To find out how to get a token, see the article on generating an API token.
-t <API-TOKEN> : Optional - An API token is required if you want to use -U to automatically upload the *.dwec.ttl file onto data.world.
--upload-location <UPLOAD-LOCATION> : Optional - Use this parameter to upload to a different dataset from ddw-catalogs, a different organization from the default specified by -a, or both. The default is 'account/ddw-catalogs'. With --upload-location you can specify either same-org/new-dataset or new-org/new-dataset. The -U and -t parameters are also required in order to use --upload-location.
--api-host <API-HOST> : Optional - If you are a single-tenant/VPC customer and you want to automatically upload to a different API endpoint, use this parameter to set the base of the API URL to something other than api.data.world. This parameter also requires both -U and -t.
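Several of the parameters above depend on one another: -U needs -t and -a, --upload-location and --api-host need both -U and -t, and the S3 output location should start with s3://. A pre-flight check along these lines can catch mistakes before the container starts. This is only a sketch: check_dwcc_args is a hypothetical helper, not part of dwcc, it spells the S3 option exactly as it is written in this article, and it does not try to handle option values that themselves look like flags.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of dwcc). Pass it the arguments
# you intend to place after `dwcc catalog-athena`. Returns 0 if the
# documented dependencies between parameters are satisfied, 1 otherwise.
check_dwcc_args() {
  local has_U=0 has_t=0 has_a=0 needs_upload="" s3="" prev="" arg status=0
  for arg in "$@"; do
    # Remember the value that follows the S3 output location option.
    case "$prev" in
      ---S3-output-location) s3="$arg" ;;
    esac
    case "$arg" in
      -U) has_U=1 ;;
      -t) has_t=1 ;;
      -a) has_a=1 ;;
      --upload-location|--api-host) needs_upload="$arg" ;;
    esac
    prev="$arg"
  done

  if [ "$has_U" -eq 1 ] && { [ "$has_t" -eq 0 ] || [ "$has_a" -eq 0 ]; }; then
    echo "-U requires both -t <API-TOKEN> and -a <ORG>" >&2
    status=1
  fi
  if [ -n "$needs_upload" ] && { [ "$has_U" -eq 0 ] || [ "$has_t" -eq 0 ]; }; then
    echo "$needs_upload requires both -U and -t" >&2
    status=1
  fi
  # If stripping the s3:// prefix changes nothing, the prefix was absent.
  if [ -n "$s3" ] && [ "${s3#s3://}" = "$s3" ]; then
    echo "S3 output location must start with s3://" >&2
    status=1
  fi
  return "$status"
}
```

For example, check_dwcc_args -a democorp -U reports that -t is missing and returns a non-zero status, so a wrapper script can stop before invoking docker run.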
The script below is a copy-and-paste version for any Unix environment that’s using a Bash shell (e.g., macOS and Linux). Replace the variables in the catalog collector script below and then run it:
docker run --rm \
  --mount type=bind,source=/tmp,target=/dwcc-output \
  --mount type=bind,source=/tmp,target=/app/log \
  --mount type=bind,source=/path/to/local/.aws/credentials,target=/root/.aws/credentials,readonly \
  -e AWS_PROFILE=<PROFILE> \
  dwcc catalog-athena \
  -n <COLLECTION> -a <ORG> -r <REGION> -d <DATABASE> \
  ---S3-output-location <OUTPUT_LOCATION> -U -t <API-TOKEN> \
  --upload-location <UPLOAD-LOCATION> --api-host <API-HOST> -o /dwcc-output
The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.
If the catalog collector ran without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should appear in the /tmp directory, which the command above mounts as the collector's output directory. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Both can be sent to the data.world team to investigate if the errors aren’t clear.
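Because a successful run prints nothing, a scripted run may want to confirm that an output file was actually written. The sketch below assumes the output directory is /tmp, as in the command above; find_catalog_output is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical post-run check: print every *.dwec.ttl file found in the
# given directory (default /tmp) and return non-zero if there is none.
find_catalog_output() {
  local dir="${1:-/tmp}" f status=1
  for f in "$dir"/*.dwec.ttl; do
    # When nothing matches, the pattern stays literal and -f is false.
    if [ -f "$f" ]; then
      printf '%s\n' "$f"
      status=0
    fi
  done
  return "$status"
}

# Usage:
#   find_catalog_output /tmp \
#     || echo "No *.dwec.ttl file found; look for a stack trace or *.log file" >&2
```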
Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for frequency of scheduling include:
Frequency of changes to the schema
Business criticality of up-to-date data
For systems whose schemas change often and where surfacing the latest data is business critical, a daily run may be appropriate. For systems whose schemas rarely change and that are less critical, weekly or even monthly runs may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
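If you schedule with cron, the daily, weekly, and monthly options map onto standard five-field cron expressions. The helper below is purely illustrative: the function name cron_schedule, the 02:00 run time, and the wrapper script path in the usage comment are all assumptions to adapt to your environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a cron schedule expression for a chosen
# frequency. The 02:00 run time is an arbitrary assumption.
cron_schedule() {
  case "$1" in
    daily)   echo "0 2 * * *" ;;  # every day at 02:00
    weekly)  echo "0 2 * * 0" ;;  # every Sunday at 02:00
    monthly) echo "0 2 1 * *" ;;  # first day of each month at 02:00
    *)       echo "unknown frequency: $1" >&2; return 1 ;;
  esac
}

# Usage: build a crontab entry for a wrapper script that contains your
# docker run command (the script path here is a placeholder).
#   printf '%s %s\n' "$(cron_schedule weekly)" "/path/to/run-athena-collector.sh"
```

The printed line can then be added with crontab -e on the machine that runs the collector.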