The data.world catalog collector (dwcc) ships as a Docker image that can be loaded and run with a series of command-line (CLI) options. It outputs a file with the extension *.dwec.ttl that you can upload to data.world manually, or you can have the catalog collector upload it automatically using an API token.
As of dwcc 1.14, you can change the amount of memory that gets allocated to a dwcc Docker process. See our article on allocating additional memory to Docker for more information.
As of dwcc 1.18, you can specify alternate organization permissions and upload locations when performing an automatic upload of the metadata. If you are a single-tenant VPC customer, you will also need to add --api-host api.customername, where customername is your subdomain (confirm your specific subdomain with your data.world representative if you are unsure).
The computer running the catalog collector should have network access to the Hive instance, a minimum of 2 GB of memory, and a 2 GHz processor.
Docker must be installed. For more information, see https://docs.docker.com/get-docker/. If you can't use Docker, we have a Java version available as well; contact us for more details.
The computer should have the Hive JDBC driver on its filesystem (this documentation assumes the driver's .jar file is in the ../jdbcdrivers directory).
Request a download link for the catalog collector from your data.world representative. Once you receive the link, download the catalog collector Docker image (or download it programmatically with curl).
Load the Docker image into the local computer's image list:
docker load -i ddw-data-catalog-collector.tar.gz
The previous command will return an <image id>, which needs to be tagged as 'dwcc'. Copy the <image id> and use it in the docker tag command:
docker tag <image id> dwcc
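The load-and-tag steps above can also be scripted. The sketch below assumes the tarball filename from the download step and that the last word of the final line of docker load output is the image ID (e.g., "Loaded image ID: sha256:..."); the helper function name is our own, not part of dwcc:

```shell
# Helper: extract the image ID from "docker load" output, whose last
# line typically looks like "Loaded image ID: sha256:..."
extract_image_id() {
  awk 'END {print $NF}'
}

# Usage (requires Docker; tarball name assumed from the download step):
#   image_id=$(docker load -i ddw-data-catalog-collector.tar.gz | extract_image_id)
#   docker tag "$image_id" dwcc
```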
The parameters and variables used to run the collector are:
-n <COLLECTION> : The name you want to give to the catalog collection, typically "Hive" or a more specific, business-friendly name that this system has within your organization.
-a <ORG> : The name of your organization in data.world (e.g. democorp).
-d <DATABASE> : The name of the database that will be cataloged.
-u <USER> : The database username.
-s <HOST> : The host/IP of the database instance.
-S <SCHEMA> : The schema to be cataloged.
-P <PASSWORD> : The user password for the source.
-p <PORT> : Optional - the port used by the database instance. Only include the port parameter if you want to use something other than the default port.
-U : Optional - You can have the catalog collector automatically upload your *.dwec.ttl file onto data.world by using the -U flag. If you do, you'll also need to use the -t parameter to provide an API token. The file will be uploaded to the ddw-catalogs dataset within the organization specified by -a <ORG>. NOTE: When you use the -U flag, you will get an error if you do not specify an organization using -a. To find out how to get a token, see the article on generating an API token.
-t <API-TOKEN> : Optional - An API token is required if you want to use -U to automatically upload the *.dwec.ttl file onto data.world.
--upload-location <UPLOAD-LOCATION> : Optional - This parameter can be used to specify a different dataset from ddw-catalogs, a different organization from the default specified by -a, or both. The default is 'account/ddw-catalogs'. Using --upload-location, you could specify either same-org/new-dataset or new-org/new-dataset. The -U and -t parameters are also required in order to use --upload-location.
--api-host <API-HOST> : Optional - If you are a single-tenant/VPC customer and you want to automatically upload to a different API endpoint, you can use this parameter to set the base of the API URL to something other than api.data.world. This parameter also requires both -U and -t.
The script below is a copy-and-paste version for any Unix environment that's using a Bash shell (e.g., macOS and Linux). Replace the variables in the metadata catalog collector script below and then run it:
docker run --rm \
  --mount type=bind,source=/tmp,target=/dwcc-output \
  --mount type=bind,source=/tmp,target=/app/log \
  --mount type=bind,source=/jdbcdrivers,target=/usr/src/dwcc-config/lib \
  dwcc catalog-hive -n <COLLECTION> -a <ORG> -d <DATABASE> \
  -u <USER> -s <HOST> -S <SCHEMA> -P <PASSWORD> -p <PORT> \
  -U -t <API-TOKEN> --upload-location <UPLOAD-LOCATION> \
  --api-host <API-HOST> -o /dwcc-output
The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.
If the catalog collector ran without issues, you should see no output on the terminal, but a new file matching the pattern *.dwec.ttl should be in the /tmp directory from which you executed the command. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Either of these can be sent to the data.world team to investigate if the errors aren't clear.
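As a quick post-run sanity check, a small shell helper (our own sketch, not part of dwcc) can report whether a catalog file landed in the output directory:

```shell
# Hypothetical helper: report whether a *.dwec.ttl catalog file
# exists in the given output directory (e.g., /tmp)
check_output() {
  dir="$1"
  if ls "$dir"/*.dwec.ttl >/dev/null 2>&1; then
    echo "catalog file generated"
  else
    echo "no catalog file found; check $dir for a stack trace or *.log file"
  fi
}

check_output /tmp
```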
If the target instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file):
FROM dwcc
ADD ./ca.der ca.der
RUN keytool -importcert -alias startssl -keystore /etc/ssl/certs/java/cacerts -storepass changeit -noprompt -file ca.der
Then, in the directory with that Dockerfile:
docker build -t dwcc-cert .
And change the docker run command to use dwcc-cert instead of dwcc.
Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for frequency of scheduling include:
Frequency of changes to the schema
Business criticality of up-to-date data
For systems whose schemas change often and where surfacing the latest data is business critical, a daily run may be appropriate. For systems whose schemas change rarely and are less critical, weekly or even monthly runs may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
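As one possible scheduling setup (the script path, log path, and time here are illustrative placeholders, not prescribed by dwcc), a crontab entry could invoke a wrapper script containing the docker run command from above:

```shell
# Illustrative crontab entry: run a wrapper script containing the
# docker run command every day at 02:00, appending output to a log
0 2 * * * /path/to/run-dwcc.sh >> /var/log/dwcc-cron.log 2>&1
```

Edit the schedule with crontab -e; a weekly or monthly schedule only needs a different time specification in the first five fields.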