The DWCC is distributed as a Docker image, available via Docker Hub. To run the collector, you reference the image by its fully qualified name (datadotworld/dwcc:x.y, where x.y is the version of DWCC that you wish to run). The Docker client on your machine pulls the image if you don’t have it already, so there is no need to install it explicitly. The image is run with a series of command-line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to data.world manually, or you can have the catalog collector upload it automatically using an API token.
The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.
The user defined to run DWCC must have read access to all resources being cataloged.
The following parameters are used to run the DWCC. Where available, either the short form (e.g., -a) or the long form (--account) can be used.
Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.
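The version substitution described above can be scripted so that x.y is set in one place. The following is a minimal sketch, not an official tool: the function name dwcc_image is hypothetical. The version you pass must be a real DWCC release.

```shell
#!/bin/sh
# Hypothetical helper: build the fully qualified image name from a version.
# Only the repository name datadotworld/dwcc comes from the documentation;
# the function name and argument handling are illustrative assumptions.
dwcc_image() {
  version="$1"
  if [ -z "$version" ]; then
    echo "usage: dwcc_image <x.y>" >&2
    return 1
  fi
  printf 'datadotworld/dwcc:%s\n' "$version"
}

# Optional: pull the image ahead of time. This is not required, because
# docker run pulls the image on demand if it is missing.
#   docker pull "$(dwcc_image "$DWCC_VERSION")"
```

Calling dwcc_image with a version string prints the image reference to use in place of datadotworld/dwcc:x.y in the run command.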
For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.
The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.
If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).
FROM datadotworld/dwcc:x.y
ADD ./ca.der ca.der
RUN keytool -importcert -alias startssl -cacerts \
    -storepass changeit -noprompt -file ca.der
Then, in the directory with that Dockerfile:
docker build -t dwcc-cert .
Finally, change the docker run command to use dwcc-cert instead of datadotworld/dwcc:x.y.
Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:
Frequency of changes to the schema
Business criticality of up-to-date data
For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
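As one way to put the scheduling advice into practice with cron, the sketch below wraps the collector in a small shell function and shows sample crontab entries in comments. It is illustrative only: the function name run_dwcc, the variables DWCC_VERSION, DWCC_OPTIONS, and DRY_RUN, and the path /opt/dwcc/run_dwcc.sh are all assumptions. DWCC_OPTIONS must be filled in from the parameter list for your source, because no collector options are supplied here.

```shell
#!/bin/sh
# Hypothetical wrapper for scheduled collector runs.
# DWCC_VERSION : the x.y version of DWCC to run (placeholder).
# DWCC_OPTIONS : collector options taken from the parameter list (placeholder).
# DRY_RUN=1    : print the command instead of running it.
run_dwcc() {
  if [ -z "$DWCC_VERSION" ]; then
    echo "DWCC_VERSION is not set" >&2
    return 1
  fi
  image="datadotworld/dwcc:$DWCC_VERSION"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "docker run --rm $image catalog $DWCC_OPTIONS"
  else
    # DWCC_OPTIONS is intentionally unquoted so that it splits into
    # separate arguments.
    docker run --rm "$image" catalog $DWCC_OPTIONS
  fi
}

# Sample crontab entries (assumes this file is saved as
# /opt/dwcc/run_dwcc.sh with a final line that calls run_dwcc):
#   0 2 * * *   /opt/dwcc/run_dwcc.sh    # daily at 02:00
#   0 2 * * 0   /opt/dwcc/run_dwcc.sh    # weekly, Sundays at 02:00
#   0 2 1 * *   /opt/dwcc/run_dwcc.sh    # monthly, on the 1st at 02:00
```

Running the function once with DRY_RUN=1 lets you inspect the assembled command before handing it to cron.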