The data.world catalog collector (dwcc) ships as a Docker image that can be loaded and run with a series of command-line (CLI) options. It outputs a file with the extension *.dwec.ttl that you can upload to data.world manually, or you can have the catalog collector upload it automatically using an API token. As of dwcc version 1.10, it can catalog AWS Athena databases associated with an AWS account. You do not need to mount a JDBC drivers directory, because the Athena JDBC driver is bundled with dwcc.
- The computer running the catalog collector should have network access to the Athena instance, at least 2 GB of memory, and a 2 GHz processor.
- You need an AWS credentials file for authentication. The profiles defined in it determine which AWS account's instance is cataloged. To use a profile other than the default one, set it with the AWS_PROFILE environment variable. The credentials file is usually located in your home directory. To use it, mount it into the container (read-only, for safety) as shown in the docker run command below.
- Request a download link for the catalog collector from your data.world representative. Once you receive the link, download the catalog collector Docker image (or download it programmatically with curl).
- Load the Docker image into the local computer’s list of Docker images:
docker load -i <filename>.tar.gz
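Before mounting the credentials file, it can help to confirm that the profile you plan to pass in AWS_PROFILE is actually defined in it. The following is a minimal sketch, not part of dwcc: the function name profile_exists is hypothetical, and it assumes the standard AWS credentials layout in which each profile starts with a section header such as [default] on its own line.

```shell
# profile_exists FILE PROFILE   (hypothetical helper, not part of dwcc)
# Succeeds (exit status 0) when FILE is readable and contains a line that is
# exactly "[PROFILE]", which is how the AWS credentials file names a profile.
profile_exists() {
    file=$1
    profile=$2

    # A missing or unreadable file cannot contain the profile
    [ -r "$file" ] || return 1

    # -F: treat the pattern as a fixed string (brackets are literal)
    # -x: the whole line must match
    # -q: print nothing, report the result through the exit status
    grep -Fxq "[$profile]" "$file"
}
```

For example, `profile_exists "$HOME/.aws/credentials" "${AWS_PROFILE:-default}" || echo "profile not found" >&2` prints a warning when the profile is missing. Headers with trailing whitespace are not matched by this sketch.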
Running the collector
- The run command uses the following variables:
- <PROFILE> - an alternate profile, if you do not wish to use the host computer's default profile.
- <COLLECTION> - the name you want to give to the catalog collection, typically "Athena" or a more specific, business-friendly name that this system has within your organization.
- <REGION> - the AWS region where the Athena and AWS Glue instances reside (e.g., us-east-1).
- <OUTPUT_LOCATION> - the Amazon S3 location where query results should be stored. The location must start with s3://. For example, to store results in a folder named "test-folder-1" inside an S3 bucket named "query-results-bucket", set the location to s3://query-results-bucket/test-folder-1.
- <ORG> - the name of the organization in data.world (e.g., democorp).
- <DATABASE> - the name of the database to catalog.
- You can have the catalog collector automatically upload your *.dwec.ttl file to data.world by using the -U flag and providing an API token in the -t parameter. The file will be uploaded to the ddw-catalogs dataset within the organization specified by -a <ORG>. NOTE: When you use the -U flag, you will get an error if you do not specify an organization using -a. To find out how to get a token, see the article on generating an API token.
- Replace the variables with the appropriate information, and run the Athena cataloger:
docker run --rm \
  --mount type=bind,source=/tmp,target=/dwcc-output \
  --mount type=bind,source=/tmp,target=/app/log \
  --mount type=bind,source=/path/to/local/.aws/credentials,target=/root/.aws/credentials,readonly \
  -e AWS_PROFILE=<PROFILE> \
  dwcc catalog-athena \
  --name=<COLLECTION> --aws-region=<REGION> \
  --s3-output-location=<OUTPUT_LOCATION> -a <ORG> \
  --database=<DATABASE> -U -t <API_TOKEN> -o /dwcc-output
- Note: The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.
- The tables in an Athena database often participate in Glue ETL jobs. The data.world catalog collector (dwcc) can catalog lineage information in Glue ETL jobs. For more information, see our article on the Glue catalog collector.
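Two of the constraints described for the run command (the S3 location must start with s3://, and -U needs an organization and an API token) are easy to check before Docker is started. The sketch below is an illustration only: validate_dwcc_args is a hypothetical helper, not a dwcc feature, and the distinct return codes are an arbitrary choice made for this example.

```shell
# validate_dwcc_args OUTPUT_LOCATION UPLOAD ORG API_TOKEN   (hypothetical helper)
# UPLOAD is "yes" when the -U flag will be passed; any other value means no upload.
# Prints a message to stderr and returns non-zero on the first problem found.
validate_dwcc_args() {
    output_location=$1
    upload=$2
    org=$3
    api_token=$4

    # The S3 query-results location must start with s3:// and name something
    case $output_location in
        s3://?*) ;;
        *) echo "output location must start with s3://" >&2; return 1 ;;
    esac

    # Automatic upload (-U) needs an organization (-a) and an API token (-t)
    if [ "$upload" = "yes" ]; then
        if [ -z "$org" ]; then
            echo "-U requires an organization (-a)" >&2
            return 2
        fi
        if [ -z "$api_token" ]; then
            echo "-U requires an API token (-t)" >&2
            return 3
        fi
    fi
    return 0
}
```

A wrapper script could call `validate_dwcc_args "$OUTPUT_LOCATION" yes "$ORG" "$API_TOKEN" || exit 1` immediately before the docker run command, so that a mistyped value fails fast instead of after the container starts.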
- If the catalog collector ran without issues, you should see no output on the terminal, and a new file matching *.dwec.ttl should appear in the /tmp directory on the host, which the run command mounts as /dwcc-output.
- If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. If the errors aren’t clear, you can send either one to the data.world team to investigate.
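The success and failure checks described above can be scripted. The following sketch assumes the output and log directories are both bound to the same host directory, as in the run command in this article (/tmp); check_dwcc_output is a hypothetical helper name, not something dwcc provides.

```shell
# check_dwcc_output DIR   (hypothetical helper, not part of dwcc)
# Returns 0 and prints the first *.dwec.ttl file found directly inside DIR.
# Otherwise returns 1 and lists any *.log files in DIR on stderr.
check_dwcc_output() {
    dir=$1

    # If the glob matches nothing it stays literal, so test each candidate with -f
    for f in "$dir"/*.dwec.ttl; do
        if [ -f "$f" ]; then
            echo "found catalog file: $f"
            return 0
        fi
    done

    echo "no *.dwec.ttl file in $dir" >&2
    # Log files are what you would pass on to the data.world team
    for f in "$dir"/*.log; do
        if [ -f "$f" ]; then
            echo "log file: $f" >&2
        fi
    done
    return 1
}
```

Running `check_dwcc_output /tmp` after the collector finishes gives a quick pass or fail signal that an automation tool can act on.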
- Use cron, your Docker container, or another automation tool of your choice to schedule the catalog collector to run on a recurring basis.
- For systems whose schemas change often and where surfacing the latest metadata is business critical, a daily run may be appropriate.
- For systems whose schemas do not change often and that are less critical, a weekly or even monthly run may make sense.
- Consult your data.world representative for more tailored recommendations on how best to put your catalog collector processes into production.
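If you choose cron, the daily, weekly, and monthly cadences mentioned above map onto crontab schedules as in the sketch below. Everything here is an assumption made for illustration: the helper name dwcc_cron_line, the 02:00 run time, and the wrapper script path used in the usage note are not prescribed by data.world.

```shell
# dwcc_cron_line FREQUENCY SCRIPT   (hypothetical helper)
# Prints a crontab entry that runs SCRIPT at the given cadence.
# The 02:00 start time is an arbitrary off-hours choice for this example.
dwcc_cron_line() {
    frequency=$1
    script=$2

    case $frequency in
        daily)   schedule="0 2 * * *" ;;   # every day at 02:00
        weekly)  schedule="0 2 * * 0" ;;   # Sundays at 02:00
        monthly) schedule="0 2 1 * *" ;;   # first day of each month at 02:00
        *) echo "unknown frequency: $frequency" >&2; return 1 ;;
    esac

    # Quote the schedule so the asterisks are not expanded by the shell
    printf '%s %s\n' "$schedule" "$script"
}
```

For instance, `dwcc_cron_line weekly /opt/dwcc/run-athena.sh` prints an entry you could add with `crontab -e`, where /opt/dwcc/run-athena.sh stands for a script of your own that wraps the docker run command.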