A dataset is the basic repository for data files and associated metadata, documentation, scripts, and any other supporting assets that should be stored alongside the data. Datasets are where all data is stored and documented for later sharing and use in projects. While there are many functional similarities between datasets and projects, we recommend that you store your data in a dataset and work with it--combine, query, analyze, draw insights, etc.--in a project.
You might create a dataset because you have a database or other tabular data that you want to analyze and share. But data from a database isn't the only kind of data you can put in a dataset: any file type can be saved there. Information about various file types and the ways they are handled is in the article Supported file types.
There are several ways datasets can be created:
- Manually - we'll walk through that here
- Via our API - instructions available in our API docs
- Through super connectors like Stitch, KNIME, Knots, and Singer - instructions can be found in our integration documentation under super connectors
- With Sparklebot - for data portals or enterprise companies, contact data.world to find out more about our tools to automate creation and syncing of your data library. This can be full data and metadata mirroring, or simply a catalog of your data sources with metadata and sample data where you'd like.
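For the API route, the following Python sketch shows roughly what a dataset-creation request could look like. It only builds the request and does not send it. The base URL, the `/datasets/{owner}` path, the `title` and `visibility` fields, and the bearer-token header are assumptions based on the public REST API and should be checked against the API docs; the function name `build_create_dataset_request` is purely illustrative.

```python
import json
import urllib.request

# Assumed base URL; confirm against the API docs before use.
API_BASE = "https://api.data.world/v0"


def build_create_dataset_request(owner, title, visibility, token):
    """Build (but do not send) a request that would create a dataset.

    owner      -- user or organization ID that will own the dataset
    title      -- display name of the new dataset
    visibility -- "OPEN" (public) or "PRIVATE" (assumed field values)
    token      -- API token for the account making the request
    """
    if visibility not in ("OPEN", "PRIVATE"):
        raise ValueError("visibility must be 'OPEN' or 'PRIVATE'")

    payload = {"title": title, "visibility": visibility}
    return urllib.request.Request(
        url=f"{API_BASE}/datasets/{owner}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To actually send the request, pass the returned object to `urllib.request.urlopen`. Choosing an organization ID as `owner` corresponds to creating the dataset under the organization namespace, as described in the manual steps below.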
Creating a dataset
To create a new dataset, click on +New on the right of the menu bar and select dataset:
When the Create a new dataset window comes up, you'll be prompted to set the ownership, name, and accessibility. If you are in an organization, by default the owner field will contain a list of the organizations you're in:
If you prefer to own the dataset personally, you can select Switch to personal account below the owner field. If the dataset is intended to be used in the organization, it should typically be created under the organization namespace. That way the dataset benefits from the organization's service tier, permissions can be set easily based on the members of the organization, and the dataset remains available within the organization even as individuals and permissions change.
The accessibility options you can choose for your dataset include not shared with anyone, shared with everyone else in the organization (if you chose an organization as the owner), or public to the data.world community. Permissions can always be edited at a later time.
If you are not in any data.world organizations, you will automatically be set as the owner of the dataset, and you can choose to keep the dataset private or to share it with the data.world community:
The number of private datasets you are allowed is determined by your user license--you can create as many public datasets as you would like. More information on account types and pricing is found on our pricing page. There are several factors to consider when deciding whether to make your dataset public or private:
- When you make a dataset public, you allow others to use that dataset in their own projects and build from it. They can't change your dataset in any way or even save queries to it, but they can use and share it.
- Data that is public on data.world can be downloaded from data.world and used externally. If your data is proprietary or sensitive, it shouldn't be shared.
- Publicly shared datasets add to the amount of information that is available to everyone for analyzing, visualizing, and learning from.
More information on permissions can be found in the article Understanding permissions.
After you create your dataset you can document your objective for it, add data to it, or continue on to the overview.
The files stored in a dataset can include more than just clean data. They can contain raw data, the scripts used to clean the data, the clean data, images, documentation--i.e., any information that would be useful for analyzing and understanding the data. See What file types can I upload? for more information about how data.world handles different file formats.
The easiest way to add files to your dataset is to drag and drop them into the Add data box, but there are many other mechanisms for adding data. For more detailed information, see the articles Adding data files and Adding files from a URL.
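As one illustration of a non-manual mechanism, the sketch below builds a request that would ask the API to add a file to a dataset from a URL. It does not send anything. The endpoint path, the `files`/`name`/`source`/`url` payload shape, and the bearer-token header are assumptions to verify against the API docs, and `build_add_file_from_url_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumed base URL; confirm against the API docs before use.
API_BASE = "https://api.data.world/v0"


def build_add_file_from_url_request(owner, dataset_id, file_name, source_url, token):
    """Build (but do not send) a request to add one file from a URL.

    The payload shape below is an assumption, not a confirmed schema.
    """
    payload = {
        "files": [
            {"name": file_name, "source": {"url": source_url}},
        ]
    }
    return urllib.request.Request(
        url=f"{API_BASE}/datasets/{owner}/{dataset_id}/files",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Building the request separately from sending it makes it easy to inspect the URL and payload before anything is changed in the dataset.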
The overview tab on the metadata page is your main page for the dataset after creation. From here you can both add data to it and document the data:
Documenting your data consists of writing descriptions of the files, tables and columns, creating a summary, and adding tags. A starting point for information on all of these activities can be found in the article Documenting your data overview. Once you've created a dataset you can either create a project to work with the data or add it to an existing project. For more information on projects, see our article Creating a project.