data.world recently rolled out a new workflow to improve the user experience. The new workflow highlights the differences between datasets and projects and clearly delineates when to use each. Previously the two could be used more or less interchangeably, even though they were intended to serve different functions. Being able to do all the same things in both places confused users and led to inconsistent data practices. As the number of datasets and projects belonging to a user or organization grew, finding the right data in them quickly became unwieldy. The new workflow guides users toward data management and analysis that is replicable and consistent.
In this article we'll compare datasets to projects and show how the dataset and project workflows have changed, covering:
- What a dataset is
- What a project is
- What's new
- Queries and the new workspace
- When to use a dataset and when to use a project
What is a dataset?
To begin, we need to clarify what a dataset is. Datasets are the building blocks for projects. They contain data and metadata related to a topic. The files and tabular data in a dataset are meant to be used (queried and analyzed) in one or more projects. Datasets are meant to be reusable resources. They can be combined with other datasets in projects, or they can be a single source for querying and analysis in a project.
Datasets can be owned by an individual or an organization, and a dataset provides an additional layer of access permissions to the data in a project. Because permissions are assigned at both the dataset and the project level, an individual can create a project that is available to the public, but any datasets owned by an organization that the individual adds to the project won't be visible to the public, only to the other people in the organization. Not only is the dataset itself hidden, but any queries in the project written against that dataset are also hidden from everyone except members of the organization.
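The two permission layers described above can be pictured with a small toy model. This is only a sketch in plain Python, not the data.world API: the class names, the `restricted_to_org` field, and the `visible_datasets` helper are all hypothetical, and it assumes an organization-owned dataset is restricted to that organization's members.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    name: str
    # Hypothetical field: the organization whose members may see the dataset.
    # None means the dataset is open to everyone.
    restricted_to_org: Optional[str] = None


@dataclass
class Project:
    name: str
    is_public: bool
    linked: list = field(default_factory=list)


def visible_datasets(project, viewer_orgs):
    """Return the names of linked datasets a viewer can see in a project."""
    if not project.is_public:
        # Simplification: access to private projects is not modeled here.
        return []
    # Dataset-level check: each linked dataset keeps its own restriction.
    return [
        d.name
        for d in project.linked
        if d.restricted_to_org is None or d.restricted_to_org in viewer_orgs
    ]


open_data = Dataset("city-budgets")
org_data = Dataset("internal-sales", restricted_to_org="acme")
project = Project("budget-analysis", is_public=True, linked=[open_data, org_data])

print(visible_datasets(project, viewer_orgs=set()))     # a member of the public
print(visible_datasets(project, viewer_orgs={"acme"}))  # a member of the owning organization
```

In this sketch the public viewer sees only the open dataset, while a member of the hypothetical "acme" organization sees both, even though the project itself is public in both cases.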
Because datasets are linked to projects, any changes to the data or the metadata in the dataset show up automatically in the linked project. Linking data to a project instead of copying it into the project means that everything is kept up to date throughout your organization.
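The difference between linking and copying can be illustrated with a short sketch. It models a dataset as a plain Python dictionary, which is an assumption made purely for illustration and says nothing about how data.world stores data: a linked project holds a reference to the dataset, while a copied project holds a snapshot.

```python
import copy

# A dataset modeled as a plain dict (illustrative structure only).
dataset = {"description": "2017 sales", "files": ["sales.csv"]}

# Linking keeps a reference to the dataset; copying takes a snapshot of it.
linked_project = {"name": "linked", "data": dataset}
copied_project = {"name": "copied", "data": copy.deepcopy(dataset)}

# The dataset owner updates the metadata and adds a file.
dataset["description"] = "2017 sales, corrected"
dataset["files"].append("returns.csv")

print(linked_project["data"]["description"])  # reflects the owner's update
print(copied_project["data"]["description"])  # still the old snapshot
```

The linked project picks up the new description and file without any further action, while the copy stays frozen at the moment it was made.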
What is a project?
Projects bring datasets together with documentation and analysis. This is where work and collaboration happen. A project, as the name implies, likely has a beginning and an end. Data in it is shared and analyzed, and insights are derived from the analysis and written up in the project.
The biggest difference between a dataset and a project is that datasets can be linked to and included in projects, but projects cannot be linked to or included in other projects or datasets, and neither can the files that are added directly to a project. With a project you can run queries against the data, analyze it, share it, and create charts and visualizations from it. However, if you decide to start right away with a project and add your data files to it, neither you nor anyone else can link those data files to another project. The only way to reuse the data in another project is to download it file by file and re-upload it into a dataset or directly into another project. While there are times you'll want to download and re-upload files instead of just linking to them, you won't have a choice if you start by adding new data files directly to a project. One disadvantage of re-uploading is that you have to recreate all the metadata for the files (the descriptions and the data dictionary), which is a cumbersome process.
What's new
The main change in the layout and workflow of datasets and projects is that there are no longer separate dataset and project workspaces. Instead, datasets now have an option to explore the dataset in a new, untitled project window. In addition to exploring the dataset, you can also find out how many projects use it, link it directly to a project you have already created, or create a new project based on it.
If you select Explore this dataset, there is functionally no difference between the old dataset workspace and the new untitled project workspace. The button takes you to a workspace where you can browse or query, but to begin analysis (i.e., to save anything) you need to save the project by giving it a name.
Queries and the new workspace
The biggest change here for users accustomed to the previous workflow is that queries are no longer saved to datasets by default. The logic behind this change is that datasets are for storing files and tables, and projects are for querying and analyzing those files and tables. A dataset is meant to be reused in multiple projects, and if queries are saved to it instead of to the projects using it, the dataset can rapidly fill with irrelevant queries that make it difficult to use. If the queries specific to a project are all stored in that project, however, the linked dataset remains clean and ready for reuse.
The reasoning above covers 80% of the use cases, but what about the times you really do want to save a query to a dataset? Maybe you want to clean up the data, join tables and preserve the lineage of the original tables for reference, or just use the query in multiple projects without having to rewrite it (you might even want to parameterize it). In those cases it is useful to be able to save your query to the dataset, and you can still do that. After running your query, select the Save link and click the drop-down link to the right of the + New Project option. In addition to New project you'll also see the name of the dataset. Select it and the query will be saved to the dataset, and you'll still be in an untitled, unsaved project.
One thing to note about saving queries to the dataset instead of to the project is that they won't show up in the queries list of any project the dataset is used in. Instead they'll be displayed under the connected datasets info.
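Where a saved query ends up, and where it is displayed, can be sketched with another toy model. As before, this is plain Python under stated assumptions rather than anything from data.world itself: the `Dataset` and `Project` classes and their methods are hypothetical, and queries are represented only by their names.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    queries: list = field(default_factory=list)


@dataclass
class Project:
    name: str
    linked: list = field(default_factory=list)
    queries: list = field(default_factory=list)

    def query_list(self):
        """Queries shown in the project's own queries list."""
        return list(self.queries)

    def connected_dataset_queries(self):
        """Queries shown under each connected dataset's info."""
        return {d.name: list(d.queries) for d in self.linked}


sales = Dataset("sales")
project = Project("q3-review", linked=[sales])

project.queries.append("top_customers")  # default: saved to the project
sales.queries.append("cleaned_orders")   # explicit choice: saved to the dataset

print(project.query_list())
print(project.connected_dataset_queries())
```

Because the query saved to the dataset lives with the dataset, any other project that links the same dataset would list it under that dataset as well, while the project-specific query stays out of the way.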
For details on all the features in the updated project workspace see our quickstart to navigating the project workspace.
When to use a dataset and when to use a project
Generally, if you are putting up data to share, or data that is private but that you might conceivably want to reuse in other projects, it's better to add the data to a dataset. If the data is in a dataset, all of its metadata will automatically show up in your project because the dataset is linked instead of copied. All changes to the original dataset, including automatic updates from the source and manual updates to the metadata by the dataset owner, will also carry over to the project.
The table below summarizes the differences between adding data files to a dataset vs. to a project:
| Dataset vs. Project | Dataset | Project |
|---|---|---|
| Can run and save queries against | X | X |
| Can have charts/visualizations | | X |
| Can incorporate different file types | X | X |
| Can contain multiple files | X | X |
| Can be shared/have contributors | X | X |
| Can have a discussion thread | X | X |
| Can include insights | | X |
| Can use existing data.world datasets without having to download and reimport them and recreate the associated metadata | | X |
| Can be included in a project | X | |
| Can be shared for others to use in their own datasets and projects | X | |