You can catalog, document, and query live data from Athena by configuring a virtual connection to it. With a virtual connection, your data stays where it is (it is not hosted on data.world), so every action runs against current data.
When you connect your data systems to data.world, your system or firewall may enforce a network policy that only allows access from specific IP addresses (your allowlist). In that case, you must add data.world to your network policy's IP allowlist.
Allow the following IP addresses:
If you are on your organization's home page, select Connection Manager. Otherwise, go to the integration page (https://data.world/integrations/athena) and click Enable integration to set up a connection.
Before configuring a virtual connection to Athena, you need to set up an IAM role in the AWS console.
To configure a virtual connection to Athena, you create a dedicated IAM role in your Amazon Web Services (AWS) console and enter its Amazon Resource Name (ARN) in the Add a new connection dialog. To create the role, however, you first need the AWS External ID shown at the bottom of the connection dialog. Follow the steps below to create the AWS role and the connection to Athena.
Go to the Athena integration page and select +Enable Integration.
Copy the External ID and do not close the dialog.
Leave the Add a new connection dialog open while you go to the AWS console and create the role for the connection. A new External ID is generated every time you open the dialog, so closing it invalidates the value you copied.
Go to the AWS console and select Create role.
Use the following parameters for the role:
Select type of trusted entity - Another AWS account
Account ID - 465428570792
Require external ID - checked
External ID - The value copied from the Add a new connection dialog in data.world
Select Next: Permissions.
Use the search bar to find the following two policies and add them:
You can be more fine-grained about exactly which buckets data.world may access. data.world needs write access only to the S3 output bucket location configured for the connection. For the buckets that back your tables, the minimum permissions required to query the data are sufficient.
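As a rough illustration of the fine-grained option, the sketch below builds an S3-only policy document that grants read access to the table-backing buckets and write access to the output bucket only. The function name build_s3_policy, the bucket name table-data-bucket, and the exact action lists are assumptions made for this example, not values published by data.world. Verify the actions against the AWS documentation for Athena before using them. The sketch does not cover the Athena and AWS Glue permissions the role also needs.

```python
import json
from typing import List


def _bucket_arns(bucket: str) -> List[str]:
    """Return the ARN of the bucket itself and of every object inside it."""
    return [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]


def build_s3_policy(output_bucket: str, data_buckets: List[str]) -> dict:
    """Build a hypothetical least-privilege S3 policy for an Athena connection.

    Read-only access on the buckets backing the tables, and read/write
    access on the query-results (output) bucket only.
    """
    read_resources = [arn for b in data_buckets for arn in _bucket_arns(b)]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadTableData",
                "Effect": "Allow",
                # Assumed minimum for querying: locate, list, and read objects.
                "Action": ["s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket"],
                "Resource": read_resources,
            },
            {
                "Sid": "WriteQueryResults",
                "Effect": "Allow",
                # Assumed set for writing and reading back query results.
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:PutObject",
                    "s3:AbortMultipartUpload",
                    "s3:ListBucketMultipartUploads",
                    "s3:ListMultipartUploadParts",
                ],
                "Resource": _bucket_arns(output_bucket),
            },
        ],
    }


if __name__ == "__main__":
    policy = build_s3_policy("query-results-bucket", ["table-data-bucket"])
    print(json.dumps(policy, indent=2))
```

The printed JSON could be pasted into a custom IAM policy and attached to the role in place of a broader S3 policy.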
Select Next: Tags and add any tags you would like.
Select Next: Review
Name the role, write a description, verify that the two policies shown above are present, and select Create role.
Find the role you have just created:
Copy its ARN, and paste the ARN into the dialog window you left open for adding a new Athena connection.
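The trusted-entity settings chosen in the steps above (another AWS account, account ID 465428570792, required external ID) correspond to a trust policy on the role. The sketch below shows the shape of that policy so you can compare it with what appears on the role's Trust relationships tab. The helper name build_trust_policy and the placeholder external ID are hypothetical. Use the External ID copied from your own open dialog.

```python
import json

# Account ID given in the role parameters above.
DATA_WORLD_ACCOUNT_ID = "465428570792"


def build_trust_policy(external_id: str) -> dict:
    """Return a trust policy matching the console choices described above."""
    if not external_id:
        raise ValueError("external_id must be the value copied from the dialog")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # "Another AWS account" trusts the root principal of that account.
                "Principal": {"AWS": f"arn:aws:iam::{DATA_WORLD_ACCOUNT_ID}:root"},
                "Action": "sts:AssumeRole",
                # "Require external ID" adds this condition.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder value; replace with the External ID from the dialog.
    print(json.dumps(build_trust_policy("EXAMPLE-EXTERNAL-ID"), indent=2))
```

If the External ID in the role's trust policy does not match the one shown in the dialog you left open, the connection test is likely to fail, which is why the dialog must stay open while you create the role.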
When you create a connection to use with a data source, you are asked to set the owner of the connection. By default, if you are in an organization, your organization is the owner of the connection. However, you can also set yourself as the owner, making it a personal connection.
There are two compelling reasons for having most connections owned by an organization:
There is no loss of access to data when an employee leaves and their account is deactivated.
Federation across data sources is faster and more efficient if it uses the same connection.
Organization-level connections are shared between admins of the organization and can be used by all of them to create new live tables. Non-administrator users can only query and preview existing live tables.
Organization-owned connections can only be used to add data to datasets owned by that organization. If you are in organizations A and B, you cannot add data to a dataset owned by B using a connection owned by A.
With a personal connection, only the connection owner may create new live tables with the connection, and other members of the organization can query and preview live tables.
To create a virtual connection to Athena, use the following parameters:
AWS Region: The AWS region where your Athena and AWS Glue instances reside (e.g., us-east-1).
S3 Output Bucket Location: The Amazon S3 bucket where query results should be stored. The location should start with s3://. For example, to store results in a folder named "test-folder-1" inside an S3 bucket named "query-results-bucket", you would set the location to s3://query-results-bucket/test-folder-1.
Workgroup: If your Athena instance is configured with different workgroups, you can assign your connection to a workgroup here.
AWS ARN: The ARN of the dedicated Identity and Access Management (IAM) role created specifically for data.world.
AWS External ID: Provided in the "Add a new Athena connection" dialog.
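Before entering these values in the dialog, you may find it useful to sanity-check their format. The sketch below is a local helper only. The class AthenaConnectionParams, the function validate, and the regular expressions are assumptions made for illustration and are not part of data.world or AWS tooling. The patterns are deliberately loose and do not confirm that a region, bucket, or role actually exists.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class AthenaConnectionParams:
    """Hypothetical container for the values entered in the dialog."""
    aws_region: str
    s3_output_location: str
    role_arn: str
    external_id: str
    workgroup: str = ""  # optional


def validate(params: AthenaConnectionParams) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Loose region shape such as us-east-1 or us-gov-west-1.
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d+", params.aws_region):
        problems.append("AWS Region does not look like a region name, e.g. us-east-1")
    # The output location should start with s3:// and name a bucket.
    if not re.fullmatch(r"s3://[^/\s]+(/.*)?", params.s3_output_location):
        problems.append("S3 Output Bucket Location should start with s3://")
    # IAM role ARNs have a 12-digit account ID followed by role/<name>.
    if not re.fullmatch(r"arn:aws[a-z-]*:iam::\d{12}:role/.+", params.role_arn):
        problems.append("AWS ARN does not look like an IAM role ARN")
    if not params.external_id.strip():
        problems.append("AWS External ID is empty")
    return problems


if __name__ == "__main__":
    example = AthenaConnectionParams(
        aws_region="us-east-1",
        s3_output_location="s3://query-results-bucket/test-folder-1",
        role_arn="arn:aws:iam::123456789012:role/example-role",  # placeholder ARN
        external_id="EXAMPLE-EXTERNAL-ID",  # placeholder
    )
    print(validate(example))
```

A check like this only catches formatting mistakes. The Test Athena configuration button remains the way to confirm that the connection works.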
Choose a Connection owner (yourself or your organization) and set the Display name for your connection. The display name is the name everyone in your organization will see for the connection. Then enter your user credential information into the dialog screen:
Click Test Athena configuration to make sure it works, and then save it by selecting Configure.
After your connection is configured you can use it anytime you select Add data:
Choose the connection you want to use from My data sources:
Go to the integration page (easily found under My integrations on our Integrations page) and select Manage:
From here you can edit your current connection or add a new one:
You will need your original credentials (password or key file) to make changes to an existing connection.
When executing queries against virtualized data sources in data.world, we will translate those queries to the proper SQL dialect of the target system and run them on the target system whenever possible.
When a function cannot be translated directly, data.world fetches the necessary data from the target system and executes the function locally. We refer to those as emulated functions or aggregations below.
Additionally, some functions may not be supported on either the target system or locally; those are noted as unavailable functions below.
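The three cases described above (translated, emulated, unavailable) can be pictured as a simple lookup. The sketch below is a conceptual illustration only and does not reflect how data.world is implemented. The names Execution and classify, and the placeholder function names FUNC_A and FUNC_B, are hypothetical. The actual lists of emulated and unavailable functions are the ones given in this documentation.

```python
from enum import Enum
from typing import Set


class Execution(Enum):
    TRANSLATED = "translated and run on the target system"
    EMULATED = "data fetched and function run locally"
    UNAVAILABLE = "not supported on the target system or locally"


def classify(function_name: str, translatable: Set[str], emulated: Set[str]) -> Execution:
    """Decide where a function would run, preferring the target system."""
    name = function_name.upper()
    if name in translatable:
        return Execution.TRANSLATED
    if name in emulated:
        return Execution.EMULATED
    return Execution.UNAVAILABLE


if __name__ == "__main__":
    # Placeholder sets; real membership comes from the documented function lists.
    translatable = {"FUNC_A"}
    emulated = {"FUNC_B"}
    for fn in ("func_a", "func_b", "func_c"):
        print(fn, "->", classify(fn, translatable, emulated).name)
```

The ordering of the checks mirrors the text: running on the target system is preferred, local emulation is the fallback, and anything else is unavailable.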
Database connectors - A full list of our connection integrations