There is no restriction on file types that can be uploaded or downloaded on data.world, and a dataset can consist of any combination of files added to it. There are some size limitations, and files are handled differently based on the extension as follows:
Formats: csv, tsv, xls, xlsx
Tabular files are presented in a spreadsheet-style preview and we perform basic analyses on each of the columns. The data is then queryable using SQL and SPARQL; take a look at this video for more info on getting started with querying.
To provide these querying capabilities, and in line with our mission to connect the world’s data by making it linkable, we convert tabular data to RDF triples (graph data) under the hood. To learn more, check out our blog post on the matter and the W3C primer on RDF.
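As a rough illustration of what "converting to RDF triples" means, the sketch below turns each non-empty cell of a CSV file into a (subject, predicate, object) statement. It is a simplified sketch and not data.world's actual conversion: the `BASE` IRI, the `row-N` and `col-NAME` naming, and the `rows_to_triples` helper are all assumptions made up for this example.

```python
import csv
import io

# Hypothetical base IRI; the IRI scheme data.world really uses is not described here.
BASE = "https://example.org/dataset/"


def rows_to_triples(csv_text, table_name):
    """Yield one (subject, predicate, object) tuple per non-empty cell."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        # One subject per row, one predicate per column (illustrative naming).
        subject = f"{BASE}{table_name}/row-{index}"
        for column, value in row.items():
            if value:  # skip empty cells
                predicate = f"{BASE}{table_name}/col-{column}"
                yield (subject, predicate, value)


# Made-up sample data for the illustration.
sample = "name,score\nalice,10\nbob,20\n"
triples = list(rows_to_triples(sample, "people"))
for triple in triples:
    print(triple)
```

Each row becomes a node in the graph and each column becomes a property of that node, which is what lets the same data be reached from a graph query language such as SPARQL.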
Excel files will include all of the underlying sheets in a tabbed interface. Only the tabular data will be included; other elements like pivot tables and charts will not be shown in the preview but they will still be available in the original file.
Database file formats
Database dumps will consist of multiple tables, and a schema that models the type information and the relationships between those tables. Each table will be represented as a data.world table, which can be previewed and queried naturally via our SQL engine.
Formats: rdf, rdfs, owl, jsonld, nt, ttl, n3
These formats are serializations of RDF data. Since RDF is the native data format for data.world’s platform, the statements in these files are simply loaded into the graph for the dataset or project that the file is added to. Once raw RDF data is uploaded into a dataset or project, it is searchable via the attached SPARQL endpoint. Take a look at this video for more info on getting started with queries. We show a preview of the contents of the file, including summaries of the classes, properties, and namespaces used in the file.
Formats: json, ND-JSON, other 'sufficiently tabular' json files
When a JSON file has a "sufficiently tabular" structure, we will attempt to produce a table of data that represents the contents of the file. Common logging formats that consist of a JSON array of simple objects, or of newline-separated JSON objects, will generally work great with this interpretation. If the structure of the file is too hierarchical or inconsistent, the file will instead be treated in its raw form: you can view or download the file, but it’s not queryable through our query engine.
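The exact rule for "sufficiently tabular" is not spelled out here, so the following is only one plausible reading, sketched for illustration: a file qualifies when it is a JSON array, or a newline-delimited sequence, of flat objects whose values are all simple scalars. The `is_sufficiently_tabular` and `load_records` names are hypothetical, and the real check may be more lenient or stricter.

```python
import json

# Value types treated as "simple" in this sketch (an assumption).
SIMPLE = (str, int, float, bool, type(None))


def load_records(text):
    """Parse a JSON array or newline-delimited JSON; return a list or None."""
    try:
        parsed = json.loads(text)
        return parsed if isinstance(parsed, list) else None
    except json.JSONDecodeError:
        pass  # not a single JSON document; try one document per line
    try:
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    except json.JSONDecodeError:
        return None


def is_sufficiently_tabular(text):
    """True when every record is a flat object of simple values."""
    records = load_records(text)
    if not records:
        return False
    return all(
        isinstance(record, dict)
        and all(isinstance(value, SIMPLE) for value in record.values())
        for record in records
    )


print(is_sufficiently_tabular('[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]'))
print(is_sufficiently_tabular('{"a": 1}\n{"a": 2}\n'))
print(is_sufficiently_tabular('[{"a": {"nested": [1, 2]}}]'))
```

Under this reading, the first two inputs map cleanly onto rows and columns, while the third is too hierarchical and would be kept in its raw form.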
Archive and compressed formats
Formats: zip, tar, tbz2, tbz, bz2, tgz, gz, -gz, z, -z
Archives that contain multiple files can be extracted, and the first 50 files are stored in the dataset. Each extracted file is then handled using the criteria established for its extension. Please note that archives are not extracted by default. To extract an archive, a Contributor must click on the ‘Extract’ button on the right-hand side of the archive.
Individual files that are compressed (e.g., foo.csv.gz) are decompressed and then treated as though the uncompressed file had been added directly.
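The two behaviors above can be mimicked locally with Python's standard library. This is a minimal sketch, not the platform's implementation: `extract_zip` and `decompress_single` are hypothetical helpers, only zip and gzip are covered, and "first 50" is assumed to mean the first 50 entries in archive order.

```python
import gzip
import io
import zipfile

MAX_EXTRACTED = 50  # limit stated in the text above


def extract_zip(data):
    """Return {name: bytes} for at most the first MAX_EXTRACTED files in a zip."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # Skip directory entries; keep archive order (an assumption).
        names = [info.filename for info in archive.infolist() if not info.is_dir()]
        return {name: archive.read(name) for name in names[:MAX_EXTRACTED]}


def decompress_single(filename, data):
    """Decompress a gzip-compressed file and drop its .gz suffix."""
    if filename.endswith(".gz"):
        return filename[: -len(".gz")], gzip.decompress(data)
    return filename, data


# foo.csv.gz is handled as though foo.csv had been added directly.
name, content = decompress_single("foo.csv.gz", gzip.compress(b"a,b\n1,2\n"))
print(name, content)
```

After decompression or extraction, each resulting file name carries its real extension, so it can be handled by the rules for that extension.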
Formats: jpg, jpeg, png, gif, svg
Images are displayed in-line.
Formats: ipynb (version 4 and higher), js, r, py, as, apl, bash, bas, bat, c, cpp, cs, css, d, dart, diff, go, ini, java, julia, kt, lua, matlab, nasm, ml, perl, php, ps1, rb, scala, sql, tcl, ts, vim, yaml, xml, asp, jade, tex, less, sass, scss, Dockerfile
Source files are presented with full syntax highlighting where appropriate.
Formats: txt, html, md, pdf
Document formats are rendered during preview.
All other file types can be uploaded and downloaded as long as they are within the supported size limits.
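To summarize the extension-based handling described on this page, here is a small sketch of a lookup from file extension to handling category. The category names, the `HANDLERS` table, and the `classify` function are invented for this example, and each set lists only a few of the extensions named above rather than the complete lists.

```python
from pathlib import PurePosixPath

# Abbreviated, illustrative mapping; see the "Formats:" lines above for the full lists.
HANDLERS = {
    "tabular": {"csv", "tsv", "xls", "xlsx"},
    "rdf": {"rdf", "owl", "nt", "ttl", "n3"},
    "json": {"json"},
    "archive": {"zip", "tar", "tgz", "tbz2"},
    "image": {"jpg", "jpeg", "png", "gif", "svg"},
    "source": {"ipynb", "py", "r", "js", "sql"},
    "document": {"txt", "html", "md", "pdf"},
}


def classify(filename):
    """Return the handling category for a file name, or "other" if none matches."""
    extension = PurePosixPath(filename).suffix.lstrip(".").lower()
    for category, extensions in HANDLERS.items():
        if extension in extensions:
            return category
    # Unlisted types can still be uploaded and downloaded, just without a special preview.
    return "other"


for example in ["sales.CSV", "photo.png", "model.bin"]:
    print(example, "->", classify(example))
```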