When you run a query in data.world, one of the things you can do with the results is to save them as a new table in a dataset or project. The main reasons you might want to save your query as a new data file are security and ease of analysis. Saving the query allows you to:
- Present a subset of the data from the original file
- Reorder the data
- Eliminate the need for joins or subqueries in future queries
- Work with columns as rows and vice versa without having to include either a PIVOT or an UNPIVOT clause in your queries
In this article we'll look at the various things you accomplish by saving results from a query as a new data file and provide examples.
Present a subset of the data
When data files are streamed into data.world they include all the columns in the original file. Sometimes there are columns which are irrelevant to your analysis or which contain substandard or incomplete data. Some columns might contain sensitive information that can't be shared outside of the original audience of the dataset but also have data that would be useful to a wider group. While you could manually clean up the data by downloading, editing, and re-uploading it, you would lose the ability for it to automatically update from the original source if that is how the initial data was set up in data.world. However you can both clean the data and maintain automatic updates by running a query against the original data file and saving the results to another dataset or project.
For example, the data file USDA Honey Census data in the dataset USDA bee colony inventories and loss is automatically brought in and synced daily through a URL. Because the original dataset contains information on much more than bees and honey, there are many columns in the table that contain either identical data for all the rows or no data at all.
A quick look at the data pulled in from the URL looks like this:
All columns displayed above except the second one contain duplicate data. Scrolling through the rest of the columns reveals eight more which have either no values or all the same value. To make this data more useful we can run the following query:
and save it to another dataset (we could also have saved it to the current dataset). To save the query select Save to dataset or project from the Download link:
There you will be prompted to either select a dataset or project from your recently viewed items , type the name of a different dataset or project and the system will autofill items for you, or save the data as a new dataset:
When saving the query results as a table in a brand new dataset, you create what is called a derviative dataset. The new dataset updates from the original dataset which updates from the source, but it only pulls in the data you want to see:
To keep the file up-to-date you can either manually sync it (using the Sync now button on the right of the screen shown above), or by going to the dataset/project overview page and setting the Autosync on:
Note that you have the option upon saving to either allow the query underlying the table to update based on changes to the original table or not. If you allow the query to update then changes to the original table will be reflected in your results (e.g., column name changes, columns added, columns deleted, etc.).
Reorder the data
When you use the SQL SELECT * clause to return data from a dataset, the columns in the resulting table are presented in the order in which they are found in the original data source. If you would always like to see them in a different order, you can accomplish this by running a query to order them as you would like to see them and then saving the query as a table in the current dataset. This query would be similar to the one shown above for presenting a subset of the data and the steps to save it would be the same.
Eliminate the need for joins or subqueries
Sometimes all the data you need to analyze is not stored in one table. While you could still access it using a join or a subquery, it's easier and more convenient to be able to query one table. Another reason for saving the data into one table is that queries against large amounts of data run faster if there is no subquery. Finally, it makes it easier for the members of your team who are not well-versed in SQL to do their own analyses of the data.
For this example we'll use the dataset SQL CRM Example Data. With the way the dataset is structured we can see a lot of information about what’s in the pipeline for sales agents (in the table sales_pipeline), but we can’t easily see the connection between their managers (information stored in the table sales-teams) and the pipeline. To see which managers had the most closed deals and for which accounts, we could run a query like this:
And that's nice, but if someone else wanted different information the query would have to be modified and the person running the query would have to be comfortable enough with SQL to know how to do it. Far easier would be to to write another query with the
WITH clause and to create a new table from it. Here is an example of such a query:
The results would look like this:
By saving the query as a new table it could be queried directly without having to use the join. The first query in this example would now look like this:
Work with columns as rows and vice versa
SQL has two clauses that allow you to rearrange the data in a table: PIVOT and UNPIVOT. PIVOT lets you see rows as columns, and UNPIVOT shows columns as rows. These are both complicated clauses and do not allow certain other clauses to follow them (e.g., WHERE, GROUP BY, and HAVING). if you find yourself using PIVOT or UNPIVOT frequently, or if you'd like to be able to use one of the incompatible clauses, it would be far better to write the query using PIVOT or UNPIVOT once and then save the resulting table for further analysis. See our SQL documentation on PIVOT and UNPIVOT for more information.