This tutorial is part of the basic tutorial series for the data.world platform. See the article overview of basic tutorials for more information.
There are many ways to find data on the web, and the data you find comes in many formats, from text to tables to images. In this tutorial you will find a public database on the web, download data from it, upload that data to a new data.world dataset, and link the dataset to your project. You can also read about the process in our article on creating a dataset.
After working through the tutorial you should be able to:
- Find data on the web
- Prepare a file for upload to data.world
- Create a dataset
- Add a new dataset to a project
To complete this tutorial you need to have:
- A data.world login (available for free here if you don't have one).
- Your own tutorial project (you must create this yourself--it cannot be downloaded)
- The Bee Colony Statistics dataset linked to your project
If you need help creating the project or linking the dataset to it, detailed instructions are in the tutorial Create a project to work with data.
If you prefer to go straight to the exercises, click here.
When the original Bee Colony Statistics dataset was created, the 2017 data from the United States Department of Agriculture (USDA) wasn't available. Now that it is, we can download it, create a new dataset from it, and add it to our project. Because the original dataset is well-documented, it's easy to look up the source of the original data (the Quick Stats database of the National Agricultural Statistics Service) so we can get the latest statistics.
Find data on the web
The original dataset has more than one table in it, but for this tutorial we'll be looking at just the bee colony census data by state. The link to the Quick Stats database is in the Summary of the Bee Colony Statistics dataset, and the parameters used in it are shown in the file Search criteria for bee colony census by state.png.
However, to make getting the data a little easier, here is a link to the Quick Stats database with the parameters already filled in. All we have to do now is select the Get Data button at the bottom of the screen. The results should contain 50 rows. The number of rows is shown in the upper right corner of the window. If there aren't 50 rows, use the Back link at the bottom of the screen to return to the previous page and verify your parameters.
Once you have the results, you can download them to your computer so that you can upload them to data.world. They will be in CSV format, so if you have Excel, Google Sheets, or another spreadsheet program you could open the file after you've downloaded it, but that isn't necessary for this activity. Select Spreadsheet (shown in the image above) to download the file.
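If you would rather confirm the row count from the downloaded file than from the results page, a few lines of Python can do it. This is only a minimal sketch, not part of any data.world or USDA tooling. It assumes the download is a UTF-8 CSV file with a single header row; the function name count_data_rows and the file path you pass in are hypothetical.

```python
import csv
from pathlib import Path


def count_data_rows(csv_path):
    """Return the number of data rows in a CSV file, excluding the header row."""
    # Assumption: the file is UTF-8 encoded and its first row is a header.
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        # Blank lines come back as empty lists, so they are not counted.
        return sum(1 for row in reader if row)
```

Calling count_data_rows on the file you downloaded should return 50 if the query parameters match the ones used in this tutorial.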
Prepare a file for upload to data.world
The filename from the USDA will be a series of letters and numbers--nothing with any informational content. When files are uploaded to data.world, they keep the names they had on ingest. A name cannot be changed after uploading except by downloading the file, renaming it, re-uploading the renamed version, and deleting the original file. To make your data more useful, rename the file Bee Colony Census 2017 by State.csv.
When you upload a spreadsheet with multiple tabs, each tab is preserved as a separate table in data.world. Before uploading your file, it is a good idea to review the names on all the tabs, as each one will show up as a table name.
While you can upload any type of file to data.world, you might get an error if the file is too large, if it's corrupt, or if there is another issue with the file. For a complete list of the errors you might encounter when uploading a file, see the article on file upload status messages.
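The renaming step in this section can be done by hand in your file manager, but it can also be scripted. The sketch below is one hedged way to do it in Python. The helper name rename_download is hypothetical, and the path of the downloaded file is whatever the USDA site happened to give you. The helper refuses to overwrite an existing file with the target name.

```python
from pathlib import Path

# The descriptive name this tutorial asks you to give the file.
TARGET_NAME = "Bee Colony Census 2017 by State.csv"


def rename_download(downloaded_path, target_name=TARGET_NAME):
    """Rename a downloaded file in place and return its new path."""
    source = Path(downloaded_path)
    # Keep the file in the same folder; only the name changes.
    target = source.with_name(target_name)
    if target.exists():
        raise FileExistsError(f"{target} already exists; not overwriting it")
    source.rename(target)
    return target
```

Because the name a file has at upload is the name it keeps on data.world, running a step like this before uploading saves the download, rename, and re-upload cycle described above.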
Create a dataset
Creating a dataset is very similar to creating a project. From your homepage (or any page with a + New link in the header), select + New from the header and choose Create new dataset.
In the Create a new dataset dialog you can name the dataset, choose the owner, and set the access permissions. If you are in an organization, the organization's name will show as the owner by default. If you are not in an organization, you will be the default owner. From the dropdown on the Owner field you can change the owner, including proposing ownership to an organization that accepts proposals for ownership.
After you have put in a title and set the ownership, you need to set the permissions. By default, permissions are set to share with no one. If you set the ownership of the dataset to an organization, the other options are to share with everyone in the organization or to make it public to the data.world community. If you set yourself as the owner your only options are to share with no one or make public to the entire data.world community. Once you've set the permissions, select Create dataset and you can either add a description and/or upload your data file, or you can continue on to the dataset overview:
Add a new dataset to a project
To add this dataset to a project, select the arrow next to Explore this dataset and choose Add to existing project.
Note: If you did not create a project in the prior exercise, you can do it now by selecting Create a new project. If you need help creating the project, see the tutorial article Create a project to work with data.
At this point you'll be presented with a dialog box showing the dataset on the left and a list of the projects to which you have write permissions, owned either by you or by an organization you belong to.
Make your selection, and after you click Save you can either go back to your dataset or to the project:
This tutorial uses real-world, feral data--not a made-up, sanitized file. It begins with accessing a live, publicly accessible US government database on the web, running a query against it, and saving the results of the query to a file. The next step is to create a dataset and upload a file to it. If you prefer to skip right to creating the dataset and uploading the file, download the Bee Colony Census 2017 by State.csv file at the bottom of this tutorial and proceed directly to step 5 below.
1. Go to the Quick Stats database of the National Agricultural Statistics Service on the United States Department of Agriculture website (the parameters will be pre-populated for you).
2. Select Get Data.
3. Download the data as a spreadsheet.
4. Rename the downloaded file Bee Colony Census 2017 by State.csv.
5. Log in to your data.world account and create a dataset named Bee Colony Census 2017.
6. Upload and add the file to the dataset.
7. Add the new dataset to your project.
It is very easy to create a dataset and add new data to data.world, and there are some things you can do to make the data easy to use, too. See our handy list of tips for uploading data for more information.
Creating a dataset and creating a project are very similar activities, and both are intentionally structured to work together easily. You can create a dataset and add it to an existing project, or you can create a dataset and a project to work with it all at the same time. Uploading a file is only one way to add data to data.world. See the article on getting your data into data.world for information on other methods.
After you put your data onto data.world there are many things you can do to make it easy for others to find and use. Our help center has additional articles on how to license, document, verify, and tag your data, how to set file labels, and how different file types are handled. We encourage you to make use of them to get the most out of your experience putting data on data.world.
- Overview of basic tutorials - An overview of the basic tutorials
- Creating a dataset - All the ways to get data into data.world from uploading files to using a virtual connection
- Summary of the Bee Colony Statistics dataset - Shows the source of the original data
- Search criteria for bee colony census by state.png - A list of the criteria used to search the source database for the original data
- Formatted query for the Quick Stats database - A pre-formatted query to run against the source database to get the latest data
- File upload status messages - A list of error messages you might encounter when uploading a file to data.world
- Create a project to work with data - The previous exercise in this tutorial
- Tips for uploading data - Best practices and other information on good data practices
- Getting your data into data.world - All the different ways you can get data into data.world
- Setting a license type - Understanding how licensing works for data you own as well as data you own when it is uploaded to data.world
- Documenting your data - How to use the description, summary, data dictionary, and other documentation assets in your datasets
- Verifying your data with data inspectors - How to validate the state of your data in a data.world dataset
- Organizing a dataset with file labels - Using file labels to further identify files in a dataset
- Tagging - How to use tags to organize and group a dataset or project by topic, category, source, department, or team
- Supported file types - How data.world handles different file formats