The computer running the catalog collector should have internet connectivity or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.
Docker must be installed. For more information, see https://docs.docker.com/get-docker/. If you can't use Docker, a Java version is available as well; contact us for more details.
Request access to a download link for the catalog collector from your data.world representative. Once you receive the link, download the catalog collector Docker image (or download it programmatically with curl).
Load the Docker image into the local computer's Docker image list:
docker load -i dataworld-dwcc-X.X.tar.gz
where X.X is the version number of the dwcc.
The previous command returns an <image id>, which needs to be tagged as 'dwcc'. Copy the <image id> and use it in the docker tag command:
docker tag <image id> dwcc
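If you script these two steps, the image ID can be captured from the `docker load` output instead of being copied by hand. The sketch below is illustrative only; it assumes `docker load` prints its usual "Loaded image ID: sha256:..." line:

```shell
# Sketch: pull the image ID out of the text printed by docker load.
# Assumes the output ends with a line like "Loaded image ID: sha256:<digest>".
extract_image_id() {
  # keep only the text after the final ": "
  printf '%s\n' "${1##*: }"
}

# Usage (requires Docker; X.X is your dwcc version):
#   load_output=$(docker load -i dataworld-dwcc-X.X.tar.gz)
#   docker tag "$(extract_image_id "$load_output")" dwcc
```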
The following parameters are used to run the dwcc. Where available, either short (e.g., -a) or long (--account) forms can be used. Required parameters are shown in bold.
| parameter | value | description |
| --- | --- | --- |
| --generate-uri-mapping | | generate statements in the catalog to associate dwcc 1.x URIs with their dwcc 2.x equivalents |
| --tableau-api-base-url | <baseUrl> | base URL of the Tableau API |
| --tableau-password | <password> | Tableau password for authentication |
| --tableau-project-id | <projectId> | ID of the Tableau project to catalog (if not provided, all projects will be cataloged) |
| --tableau-username | <username> | Tableau username for authentication |
| --upload-location | <uploadLocation> | the dataset to which the catalog is to be uploaded, specified as a simple dataset name (to upload to that dataset within the organization's account) or as [account/dataset] (to upload to a dataset in some other account); ignored if --upload is not specified |
| --use-v1-uris | | generate dwcc 1.x URIs for catalog records/objects (for historical compatibility) |
| -a, --account | <account> | the ID for the data.world account into which you will load this catalog; used to generate the namespace for any URIs generated |
| -b, --base | <base> | the base URI to use as the namespace for any URIs generated |
| -H, --api-host | <apiHost> | the host for the data.world API |
| -L, --no-log-upload | | do not upload the log of the dwcc run to the organization account's catalogs dataset or to another location specified with --upload-location (ignored if --upload is not specified) |
| -n, --name | <catalogName> | the name of the catalog; used to generate the ID for the catalog as well as the filename into which the catalog file will be written |
| -o, --output | <outputDir> | the output directory into which any catalog files should be written |
| -t, --api-token | <apiToken> | the data.world API token to use for authentication; by default, an environment variable named DW_AUTH_TOKEN is used |
| -U, --upload | | whether to upload the generated catalog to the organization account's catalogs dataset or to another location specified with --upload-location (requires --api-token) |
The script below is a copy-and-paste version for any Unix environment that uses a Bash shell (e.g., macOS and Linux). Choose the variables you wish to use and replace the values with your information as appropriate. When you are finished, run the script.
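A hedged sketch of such a script follows. The `catalog-tableau` command name, the variable names, and the mount paths are assumptions for illustration; check them against your dwcc version and substitute your own values. The assembled command is echoed for review rather than executed:

```shell
#!/bin/bash
# Illustrative sketch only; variable names and the "catalog-tableau"
# command name are assumptions, not official values.
DW_ACCOUNT="your-org"                  # -a / --account
CATALOG_NAME="tableau-catalog"         # -n / --name
DW_AUTH_TOKEN="your-api-token"         # -t / --api-token
TABLEAU_URL="https://your-tableau-host/api"
TABLEAU_USER="your-tableau-username"
TABLEAU_PASS="your-tableau-password"

# Assemble the dwcc arguments once so they are easy to review and reuse.
dwcc_args=(catalog-tableau
  -a "$DW_ACCOUNT" -n "$CATALOG_NAME" -o /dwcc-output -U
  -t "$DW_AUTH_TOKEN"
  --tableau-api-base-url "$TABLEAU_URL"
  --tableau-username "$TABLEAU_USER"
  --tableau-password "$TABLEAU_PASS")

# Echoed for review; remove "echo" to actually run the collector.
echo docker run -it --rm \
  --mount type=bind,source="$(pwd)",target=/dwcc-output \
  dwcc "${dwcc_args[@]}"
```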
Catalog collector runtime
The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.
If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).
Dockerfile:
FROM dwcc
ADD ./ca.der ca.der
RUN keytool -importcert -alias startssl -keystore /etc/ssl/certs/java/cacerts -storepass changeit -noprompt -file ca.der
Then, in the directory with that Dockerfile:
docker build -t dwcc-cert .
Finally, change the docker run command to use dwcc-cert instead of dwcc.
Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:
Frequency of changes to the schema
Business criticality of up-to-date data
For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
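As a hypothetical example, a crontab entry for a nightly run might look like the following. The wrapper-script path and log location are placeholders, and run-dwcc.sh is an assumed wrapper around the docker run command:

```shell
# Run the catalog collector every night at 2:00 AM (placeholder paths).
0 2 * * * /opt/dwcc/run-dwcc.sh >> /var/log/dwcc-cron.log 2>&1
```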