The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.
Docker must be installed. For more information, see https://docs.docker.com/get-docker/. If you can't use Docker, a Java version is available as well -- contact us for more details.
Request access to a download link from your data.world representative for the catalog collector. Once you receive the link, download the catalog collector Docker image (or programmatically download it with curl).
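If you prefer to script the download, a curl invocation along these lines works. The URL below is a placeholder for the link your data.world representative provides, and X.X stands for the dwcc version number:

```shell
# Placeholder: substitute the download link you received from your
# data.world representative, and the actual version number for X.X.
DOWNLOAD_URL="<link-from-your-data.world-representative>"

# -L follows any redirects; -o saves the archive under its expected name.
curl -L -o dataworld-dwcc-X.X.tar.gz "${DOWNLOAD_URL}"
```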
Load the Docker image into the local computer's Docker image list:

```shell
docker load -i dataworld-dwcc-X.X.tar.gz
```

where X.X is the version number of the dwcc.
The previous command returns an `<image id>`, which needs to be tagged as `dwcc`. Copy the `<image id>` and use it in the docker tag command:

```shell
docker tag <image id> dwcc
```
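Taken together, the load and tag steps can be scripted as follows. This sketch assumes the final line of the `docker load` output has the form `Loaded image ID: sha256:...` (it may instead read `Loaded image: <name>` depending on how the archive was built, in which case copy the ID by hand); the version number is illustrative:

```shell
# Load the archive and capture the image ID from the last field of the
# "Loaded image ..." line that docker load prints on success.
IMAGE_ID=$(docker load -i dataworld-dwcc-X.X.tar.gz | awk '/Loaded image/ {print $NF}')

# Tag the loaded image as "dwcc" so later commands can refer to it by name.
docker tag "${IMAGE_ID}" dwcc
```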
The following parameters are used to run the dwcc. Where available, either the short (e.g., -a) or long (e.g., --account) form can be used. Required parameters are shown in bold.
| Parameter | Value | Description |
| --- | --- | --- |
| --generate-uri-mapping | | generate statements in the catalog to associate dwcc 1.x URIs with their dwcc 2.x equivalents |
| --manta-api-url | `<mantaAdminApiBaseUrl>` | URL of the MANTA Admin API |
| --manta-db-id-mapping | `<databaseIdMappings>` | Mappings of the form [server]/[database name]=[database-id], used to associate a database-id with a database found in the MANTA graph that has the specified server and database names. You only need to provide this if the database name in the MANTA graph is not sufficiently unique to completely identify the database. |
| --manta-export-file | `<batchExportFile>` | MANTA batch export output file to use instead of exporting via the API |
| --manta-password | `<mantaPassword>` | Password to use for API authentication |
| --manta-port-mapping | `<portMappings>` | Mappings of the form [server]/[database name]=[port], used to associate a port with a database found in the MANTA graph that has the specified server and database names. You only need to provide this if the database listens on a port other than the default port for that type of database. |
| --manta-scan | | If present, have MANTA perform an analysis scan to refresh the MANTA graph prior to exporting |
| --manta-user | `<mantaUser>` | MANTA user to use for API authentication |
| --manta-viewer-url | `<mantaViewerUrl>` | URL of the MANTA Viewer UI |
| --upload-location | `<uploadLocation>` | the dataset to which the catalog is to be uploaded, specified as a simple dataset name to upload to that dataset within the organization's account, or [account/dataset] to upload to a dataset in some other account (ignored if --upload is not specified) |
| --use-v1-uris | | generate dwcc 1.x URIs for catalog records/objects (for historical compatibility) |
| -a, --account | `<account>` | the ID for the data.world account into which you will load this catalog; this is used to generate the namespace for any URIs generated |
| -b, --base | `<base>` | the base URI to use as the namespace for any URIs generated; required unless --account is specified |
| -H, --api-host | `<apiHost>` | the host for the data.world API |
| -L, --no-log-upload | | do not upload the log of the dwcc run to the organization account's catalogs dataset or to another location specified with --upload-location (ignored if --upload is not specified) |
| -n, --name | `<catalogName>` | the name of the catalog; this is used to generate the ID for the catalog as well as the filename into which the catalog file will be written |
| -o, --output | `<outputDir>` | the output directory into which any catalog files should be written |
| -t, --api-token | `<apiToken>` | the data.world API token to use for authentication; the default is to use an environment variable named DW_AUTH_TOKEN |
| -U, --upload | | whether to upload the generated catalog to the organization account's catalogs dataset or to another location specified with --upload-location (requires --api-token) |
The script below is a copy-and-paste version for any Unix environment that uses a Bash shell (e.g., macOS and Linux). Choose the variables you wish to use and replace the values with your information as appropriate. When you are finished, run the script.
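A sketch of such a script is shown here. The collector subcommand name (`catalog-manta`), the container-side output path (`/dwcc-output`), and all variable values are assumptions for illustration -- replace them with the values from your environment and confirm the exact subcommand with your data.world representative:

```shell
#!/bin/bash
# Illustrative values -- replace each with your own information.
ORG="your-organization-id"            # -a/--account: your data.world account ID
CATALOG_NAME="manta-catalog"          # -n/--name: name of the catalog
DW_AUTH_TOKEN="your-api-token"        # read by dwcc by default (see -t/--api-token)
MANTA_API_URL="https://your-manta-host/manta-admin-gateway"   # --manta-api-url (placeholder)
MANTA_USER="your-manta-user"          # --manta-user
MANTA_PASSWORD="your-manta-password"  # --manta-password
OUTPUT_DIR="${HOME}/dwcc-output"      # host directory to receive catalog files

mkdir -p "${OUTPUT_DIR}"
export DW_AUTH_TOKEN

# Mount the host output directory into the container and pass the API token
# through as an environment variable. "catalog-manta" is an assumed
# subcommand name; verify it for your dwcc version.
docker run -it --rm \
  --mount type=bind,source="${OUTPUT_DIR}",target=/dwcc-output \
  -e DW_AUTH_TOKEN \
  dwcc catalog-manta \
  --account="${ORG}" \
  --name="${CATALOG_NAME}" \
  --output=/dwcc-output \
  --manta-api-url="${MANTA_API_URL}" \
  --manta-user="${MANTA_USER}" \
  --manta-password="${MANTA_PASSWORD}" \
  --upload
```

Because --upload requires an API token, the script exports DW_AUTH_TOKEN, which the collector reads by default when -t/--api-token is not given.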
Catalog collector runtime
The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.
Keep your metadata catalog up to date by using cron, a Docker container, or your automation tool of choice to run the catalog collector on a regular schedule. Considerations for how often to schedule it include:
Frequency of changes to the schema
Business criticality of up-to-date data
For organizations with schemas that change often and where surfacing the latest data is business critical, running the collector daily may be appropriate. For those with schemas that change infrequently and which are less critical, weekly or even monthly runs may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
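For example, to schedule a nightly run with cron, a crontab entry like the following works; the script path is hypothetical and stands in for wherever you saved your collector invocation:

```shell
# Hypothetical path: /usr/local/bin/run-dwcc.sh is your collector script.
# Run the catalog collector every day at 2:00 AM, appending output to a log.
0 2 * * * /usr/local/bin/run-dwcc.sh >> /var/log/dwcc.log 2>&1
```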