A dataset is the basic repository for data files and associated metadata, documentation, scripts, and any other supporting assets that should be stored alongside the data. Datasets are where all data is stored and documented for later sharing and use in projects. While there are many functional similarities between datasets and projects, we recommend that you store your data in a dataset and work with it--combine, query, analyze, draw insights, etc.--in a project.
You might create a dataset because you have a database or other tabular data that you want to analyze and share. But data from a database isn't the only kind of data you can put in a dataset: any file type can be saved there. Information about the various file types and how they are handled is in the article Supported file types.
There are several ways datasets can be created:
- Manually - we'll walk through that here
- Via our API - instructions available in our API docs
- Through super connectors like Stitch, KNIME, Knots, and Singer - instructions can be found in our integration documentation under super connectors
- With Sparklebot - for data portals or enterprise companies, contact data.world to find out more about our tools to automate creation and syncing of your data library. This can be full data and metadata mirroring, or simply a catalog of your data sources with metadata and sample data where you'd like.
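For the API route listed above, a request to create a dataset might be assembled roughly as follows. This is a minimal sketch rather than the documented client: the base URL, the `/datasets/{owner}` path, the `title` and `visibility` field names, and the `OPEN`/`PRIVATE` values are assumptions to verify against the API docs, and `build_create_dataset_request` is a hypothetical helper. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumed base URL for the REST API; confirm it against the API docs.
API_BASE = "https://api.data.world/v0"


def build_create_dataset_request(owner, title, visibility, token):
    """Build (but do not send) a request that would create a dataset.

    Hypothetical helper: the path and the JSON field names are assumptions.
    """
    if visibility not in ("OPEN", "PRIVATE"):
        raise ValueError("visibility must be 'OPEN' or 'PRIVATE'")
    body = json.dumps({"title": title, "visibility": visibility}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/datasets/{owner}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The `owner` argument plays the same role as the owner field in the manual flow: it can be your own username or an organization you belong to.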
Creating a dataset
To create a new dataset, click on +New on the right of the menu bar and you'll be prompted to choose either a dataset or a project:
Choose Create new dataset and you'll be prompted to name the dataset and set its ownership and accessibility. If you are in one or more organizations, the owner field will by default contain the name of one of them. You can also set the owner to be yourself or any of your other organizations by selecting the dropdown on the owner field:
Dataset owner and permissions
If the dataset is intended to be used in the organization, it should typically be created under the organization namespace. In this way the dataset benefits from the organization's service tier, permissions can be easily set based on the members of the organization, and datasets remain available within the organization even as individuals and permissions change. Permissions on a dataset owned by an organization can be set either to No one or to everyone in the organization:
If you are not in any data.world organizations, you will automatically be set as the owner of the dataset, and you can choose to keep the dataset private or to share it with the data.world community:
The number of private datasets you are allowed is determined by your user license--you can create as many public datasets as you would like. More information on account types and pricing is found on our pricing page. There are several factors to consider when deciding whether to make your dataset public or private:
- When you make a dataset public you allow others to use that dataset in their own projects and build from it. They can't change your dataset in any way or even save queries to it, but they can use and share it.
- Data that is public on data.world can be downloaded from data.world and used externally. If your data is proprietary or sensitive, it shouldn't be shared.
- Publicly shared datasets add to the amount of information that is available to everyone for analyzing, visualizing, and learning from.
More information on permissions can be found in the article Understanding permissions.
The permissions set on a dataset also pass through to any project that uses it. So if the dataset is shared with no one, only you will be able to use it in a project, and even if the project that includes it is open to everyone, no one else will be able to see that dataset. Permissions can always be edited later. After you create your dataset you can document your objective for it, add data to it, or continue on to the overview.
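The pass-through rule can be pictured with a small toy model. This is only an illustration of the behavior described above, not how data.world implements permissions; the `Dataset` class and both functions are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    owner: str
    # Users the dataset is shared with besides the owner; empty means "no one".
    shared_with: set = field(default_factory=set)
    public: bool = False


def can_see_dataset(user, dataset):
    """A user sees a dataset only if the dataset's own permissions allow it."""
    return dataset.public or user == dataset.owner or user in dataset.shared_with


def visible_datasets(user, project_datasets):
    """Names of the datasets a user can see inside a project they can open.

    In this model, access to the project never widens access to a dataset.
    """
    return [d.name for d in project_datasets if can_see_dataset(user, d)]
```

For example, if alice owns a private dataset `sales-raw` and a public dataset `census`, and both are included in a project that is open to everyone, `visible_datasets("bob", ...)` returns only `["census"]`, while alice still sees both.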
Adding files to a dataset
The files stored in a dataset can include more than just clean data. They can contain raw data, the scripts used to clean the data, the clean data, images, documentation--i.e., any information that would be useful for analyzing and understanding the data. See What file types can I upload? for more information about how data.world handles different file formats.
The easiest way to add files to your dataset is to drag and drop them into the Add data box, but there are many other mechanisms for adding data. For more detailed information, see the articles Adding data files and Adding files from a URL.
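If you add files from URLs programmatically rather than through the Add data box, the request body might look something like the sketch below. The `files`/`name`/`source`/`url` layout is an assumption about the API's JSON shape and is not confirmed by this article, and `build_add_files_body` is a hypothetical helper; check the API docs before relying on it.

```python
import json
from urllib.parse import urlparse


def build_add_files_body(files):
    """Return a JSON body listing files to be fetched from URLs.

    `files` maps a file name to its source URL. The JSON layout used here
    is an assumption, not a confirmed API contract.
    """
    entries = []
    for name, url in files.items():
        # Reject anything that is not a plain web URL before building the body.
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"not an http(s) URL: {url}")
        entries.append({"name": name, "source": {"url": url}})
    return json.dumps({"files": entries})
```

Each entry pairs the name the file should have in the dataset with the location it is fetched from, which mirrors what the URL option in the interface asks for.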
The overview tab on the metadata page is your main page for the dataset after creation. From here you can both add data to it and document the data:
Documenting your data consists of writing descriptions of the files, tables and columns, creating a summary, and adding tags. A starting point for information on all of these activities can be found in the article Documenting your data overview. Once you've created a dataset you can either create a project to work with the data or add it to an existing project. For more information on projects, see our article Creating a project.