The data.world catalog collector (dwcc) ships as a Docker image. You load the image, run it with a series of command-line (CLI) options, and it writes an output file with the extension *.dwec.ttl. Upload the *.dwec.ttl files directly to data.world.
- The computer running the catalog collector should have network access to the Snowflake instance, at least 2 GB of memory, and a 2 GHz processor.
- The computer should have the Snowflake JDBC driver on its filesystem (this documentation assumes the .jar file driver is in the ../jdbcdrivers directory).
- Request access to a download link from your data.world representative for the catalog collector. Once you receive the link, download the catalog collector Docker image (or programmatically download it with curl).
- Load the Docker image into the local computer’s list of Docker images:
docker load -i ddw-data-catalog-collector.tar.gz
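The download step can be scripted with curl. The following is a minimal sketch, not an official procedure: the function name download_collector is hypothetical, the URL is whatever link your data.world representative provides, and it assumes the link can be fetched without extra authentication headers.

```shell
#!/bin/sh
# Hypothetical helper: download the collector archive from a given URL.
# The URL itself comes from your data.world representative.
download_collector() {
    url="$1"
    dest="${2:-ddw-data-catalog-collector.tar.gz}"
    # -f: fail on HTTP errors, -sS: quiet but still report errors,
    # -L: follow redirects, -o: write to the named file.
    curl -fsSL -o "$dest" "$url"
}
```

After a successful download_collector "<DOWNLOAD_URL>", run docker load -i ddw-data-catalog-collector.tar.gz as shown above.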
Running the collector
- The placeholders in the command below represent:
- <SNOWFLAKE_USER>: the database username
- <SNOWFLAKE_PASSWORD>: the password for the above user
- <SNOWFLAKE_HOST>: the host/IP of the Snowflake instance
- <SNOWFLAKE_PORT>: optional; the port of the Snowflake instance
- Only include -p <SNOWFLAKE_PORT> if you want to use something other than the default port.
- <SNOWFLAKE_DATABASE>: the name of the database to catalog
- <SNOWFLAKE_SCHEMA>: the name of the schema to catalog
- <COLLECTION_NAME>: the name you want to give to the catalog collection, typically "Snowflake" or a more specific, business-friendly name that this system has within your organization
- <ORG_NAME>: the name of the organization in data.world (e.g. democorp)
- Replace the variables with the appropriate information, and run:
docker run \
--mount type=bind,source=/tmp,target=/dwcc-output \
--mount type=bind,source=/tmp,target=/app/log \
--mount type=bind,source=/jdbcdrivers,target=/usr/src/dwcc-config/lib \
dwcc catalog-snowflake -n <COLLECTION_NAME> \
-a <ORG_NAME> -d <SNOWFLAKE_DATABASE> -u <SNOWFLAKE_USER> \
-s <SNOWFLAKE_HOST> -S <SNOWFLAKE_SCHEMA> \
-P <SNOWFLAKE_PASSWORD> [-p <SNOWFLAKE_PORT>] -o /dwcc-output
- Note: The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.
- If the catalog collector runs without issues, you should see no output on the terminal, and a new file matching *.dwec.ttl should appear in the /tmp directory of the computer where you ran the command (the directory mounted at /dwcc-output).
- If there is an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Either can be sent to the data.world team to investigate if the errors aren’t clear.
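The run-and-check steps above can be wrapped in a small script. This is a sketch under stated assumptions: the function names run_collector and has_collector_output, the DOCKER_BIN and OUTPUT_DIR variables, and the use of environment variables for the placeholders are all illustrative and not part of the catalog collector. The collector options themselves are the ones shown in the command above.

```shell
#!/bin/sh
# Sketch only: run_collector, has_collector_output, DOCKER_BIN and OUTPUT_DIR
# are illustrative names, not part of the catalog collector.

DOCKER_BIN="${DOCKER_BIN:-docker}"   # set DOCKER_BIN=echo to print instead of run
OUTPUT_DIR="${OUTPUT_DIR:-/tmp}"     # host directory mounted for output and logs

run_collector() {
    # Abort with an error if any required value is unset or empty.
    : "${COLLECTION_NAME:?}" "${ORG_NAME:?}" "${SNOWFLAKE_DATABASE:?}"
    : "${SNOWFLAKE_USER:?}" "${SNOWFLAKE_PASSWORD:?}"
    : "${SNOWFLAKE_HOST:?}" "${SNOWFLAKE_SCHEMA:?}"

    # Build the argument list for `docker run` in the positional parameters.
    set -- run \
        --mount "type=bind,source=$OUTPUT_DIR,target=/dwcc-output" \
        --mount "type=bind,source=$OUTPUT_DIR,target=/app/log" \
        --mount type=bind,source=/jdbcdrivers,target=/usr/src/dwcc-config/lib \
        dwcc catalog-snowflake -n "$COLLECTION_NAME" \
        -a "$ORG_NAME" -d "$SNOWFLAKE_DATABASE" -u "$SNOWFLAKE_USER" \
        -s "$SNOWFLAKE_HOST" -S "$SNOWFLAKE_SCHEMA" \
        -P "$SNOWFLAKE_PASSWORD"

    # Add -p only when a non-default port is requested.
    if [ -n "${SNOWFLAKE_PORT:-}" ]; then
        set -- "$@" -p "$SNOWFLAKE_PORT"
    fi

    "$DOCKER_BIN" "$@" -o /dwcc-output
}

has_collector_output() {
    dir="${1:-$OUTPUT_DIR}"
    for f in "$dir"/*.dwec.ttl; do
        # When the glob matches nothing, $f is the literal pattern and -f fails.
        [ -f "$f" ] && return 0
    done
    return 1
}
```

With the variables exported, run_collector && has_collector_output reports success only when a *.dwec.ttl file is present. Note that a file left over from an earlier run also satisfies this check, so clearing or timestamping the output directory first may be worthwhile.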
- Use cron, your Docker container platform, or another automation tool of your choice to schedule the catalog collector to run on a recurring basis.
- For systems whose schemas change often and where surfacing the latest metadata is business critical, a daily run may be appropriate.
- For systems with schemas that do not change often and are less critical, weekly or even monthly may make sense.
- Consult your data.world representative for more tailored recommendations on how best to productionize your catalog collector processes.
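If you schedule with cron, the daily and weekly cadences above map to standard five-field cron schedules. The helper below is a hypothetical sketch that prints a crontab entry; the wrapper script path /opt/dwcc/run-dwcc.sh and the log path /tmp/dwcc-cron.log are placeholders, not paths the catalog collector provides.

```shell
#!/bin/sh
# Hypothetical helper: print a crontab entry that runs a wrapper script.
# /opt/dwcc/run-dwcc.sh and /tmp/dwcc-cron.log are placeholder paths.
dwcc_cron_entry() {
    schedule="$1"                          # five-field cron schedule
    script="${2:-/opt/dwcc/run-dwcc.sh}"   # script that runs the collector
    printf '%s %s >> /tmp/dwcc-cron.log 2>&1\n' "$schedule" "$script"
}
```

For example, dwcc_cron_entry "0 2 * * *" prints an entry that runs daily at 02:00, and dwcc_cron_entry "0 2 * * 0" prints one that runs weekly on Sunday at 02:00. Add the printed line to the crontab of the user who runs Docker (for example with crontab -e).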
Handling custom certificates
If the target Snowflake instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert with a Dockerfile like this (where ./ca.der is the name and location of the cert file, and dwcc is the name of the loaded collector image):
FROM dwcc
ADD ./ca.der ca.der
RUN keytool -importcert -alias startssl -keystore /etc/ssl/certs/java/cacerts -storepass changeit -noprompt -file ca.der
Then, in the directory with that Dockerfile:
docker build -t dwcc-cert .
And change the docker run command to use dwcc-cert instead of dwcc.