The new workflow that data.world recently rolled out highlights the differences between datasets and projects and clearly delineates when to use each. Previously the two could be used more or less interchangeably, even though there was an intended functional difference. However, the ability to do all the same things in both places confused users and led to inconsistent data practices. The new workflow guides users toward better data management and replicable, consistent analysis. In this article we'll look at the fundamental differences between datasets and projects. For information on how to navigate the management of resources in the new workflow, see our overview of how the dataset and project workflows have changed.
What is a dataset?
To begin, we need to clarify what a dataset is. Datasets are the building blocks for projects. They contain data and metadata related to a topic. The files and tabular data in a dataset are meant to be used (queried and analyzed) in one or more projects. Datasets are meant to be reusable assets. They can be combined with other datasets in projects, or they can be a single source for querying and analysis in a project.
Datasets can be owned by an individual or an organization, and a dataset provides an additional layer of access permissions to the data in a project. Permissions are assigned at both the dataset level and the project level. As a result, an individual can create a project that is available to the public, but any datasets owned by an organization that the individual adds to that project won't be visible to the public. Only the other people in the organization can see them. The same applies to queries: any queries in the project written against such a dataset are visible only to members of the organization.
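The two-level permission check described above can be illustrated with a small sketch. This is not data.world's API or implementation. The class names, the `is_public` flag, the `members` sets, and the `visible_contents` function are all hypothetical, chosen only to model the rule that a viewer must pass both the project check and the dataset check.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    owner: str                       # individual or organization that owns it
    is_public: bool = False          # hypothetical flag: open to everyone?
    members: set = field(default_factory=set)  # users allowed to see a private dataset


@dataclass
class Query:
    title: str
    dataset: Dataset                 # the dataset this query is written against


@dataclass
class Project:
    name: str
    is_public: bool
    members: set = field(default_factory=set)
    datasets: list = field(default_factory=list)  # linked datasets
    queries: list = field(default_factory=list)


def can_see_dataset(dataset, user):
    # Dataset-level check: public, or the user belongs to the owner's members.
    return dataset.is_public or user in dataset.members


def visible_contents(project, user):
    """Return (dataset names, query titles) that the user can see in the project."""
    # Project-level check comes first.
    if not (project.is_public or user in project.members):
        return [], []
    datasets = [d.name for d in project.datasets if can_see_dataset(d, user)]
    # A query is hidden whenever the dataset it runs against is hidden.
    queries = [q.title for q in project.queries if can_see_dataset(q.dataset, user)]
    return datasets, queries


org_data = Dataset("org-sales", owner="acme", members={"alice", "bob"})
open_data = Dataset("census", owner="alice", is_public=True)
project = Project(
    "public-analysis",
    is_public=True,
    members={"alice"},
    datasets=[org_data, open_data],
    queries=[Query("Sales by region", org_data), Query("Population", open_data)],
)

print(visible_contents(project, "bob"))    # organization member
print(visible_contents(project, "carol"))  # member of the public
```

In this model, `bob` sees both datasets and both queries, while `carol` sees only the public `census` dataset and the query written against it, even though the project itself is public.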
Because datasets are linked to projects, any changes to the data or the metadata in the dataset show up automatically in the linked project. Linking data to a project instead of copying it into the project means that everything is kept up to date throughout your organization.
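The difference between linking and copying can be sketched by analogy with references and copies in code. The dictionaries below are purely illustrative stand-ins and do not reflect how data.world stores datasets. The point is only that a linked project reads the one shared dataset, while a copied project holds a snapshot that goes stale.

```python
import copy

# A stand-in for a dataset with some metadata and files (hypothetical structure).
dataset = {"description": "Monthly sales", "files": ["2023.csv"]}

# Linking: the project holds a reference to the same dataset.
linked_project = {"name": "linked", "datasets": [dataset]}

# Copying: the project holds its own independent snapshot.
copied_project = {"name": "copied", "datasets": [copy.deepcopy(dataset)]}

# The dataset owner updates the metadata and adds a file.
dataset["description"] = "Monthly sales in USD"
dataset["files"].append("2024.csv")

linked_view = linked_project["datasets"][0]
copied_view = copied_project["datasets"][0]

print(linked_view["description"], linked_view["files"])
print(copied_view["description"], copied_view["files"])
```

The linked project reflects the new description and the added file without any further action, whereas the copied project still shows the original description and file list.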
What is a project?
Projects bring datasets together with documentation and analysis. This is where work and collaboration happen. A project, as the name implies, likely has a beginning and an end. Data in it is shared and analyzed, and insights are derived from the analysis and written up in the project.
The biggest difference between a dataset and a project is that datasets can be linked to and included in projects, but projects cannot be linked to or included in other projects or datasets, and neither can the files that are added directly to a project. With a project you can run queries against the data, analyze it, share it, and create charts and visualizations from it. However, if you decide to start right away with a project and add your data files to it, neither you nor anyone else can link those data files to another project. The only way to reuse the data in another project is to download it file by file and re-upload it into a dataset or directly into another project. While there are times you'll want to download and re-upload files instead of just linking to them, you won't have a choice if you start by adding new data files directly to a project. One disadvantage of re-uploading is that you have to recreate all the metadata for the files (the descriptions and the data dictionary), which is a very cumbersome process.
Generally, if you are putting up data to share, or data that is private but that you might conceivably want to reuse in other projects, it's better to add the data to a dataset. If the data is in a dataset, all of its metadata will automatically show up in your project because the dataset is linked instead of copied. All changes to the original dataset, including automatic updates from the source and manual updates to the metadata by the dataset owner, will also carry over to the project.
The table below summarizes the differences between adding data files to a dataset vs. to a project:
|Dataset vs. Project|Dataset|Project|
|Can run and save queries against| |X|
|Can have charts/visualizations| |X|
|Can incorporate different file types|X|X|
|Can contain multiple files|X|X|
|Can be shared/have contributors|X|X|
|Can have a discussion thread|X|X|
|Can include insights| |X|
|Can use existing data.world datasets without having to download and reimport them and recreate the associated metadata| |X|
|Can be included in a project|X| |
|Can be shared for others to use in their own datasets and projects|X| |