

Use the Amazon Athena data catalog as the metadata store, and create an external schema named “spectrum”. The external tables are then defined in that schema:

CREATE external table spectrum.LINEITEM_PART_PARQ (

CREATE external table spectrum.LINEITEM_CSV (
Location 's3:////lineitem_csv/'

Querying data
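Once external tables such as spectrum.LINEITEM_CSV and spectrum.LINEITEM_PART_PARQ exist, they can be queried from Amazon Redshift with schema-qualified names. The following is a minimal sketch, not the post's own test queries. It assumes TPC-H-style column names (`l_shipdate`, `l_extendedprice`) and that the partitioned table is partitioned on `l_shipdate`; neither detail is stated in the text above. It runs the same aggregate against both formats and then checks how much S3 data the last query scanned.

```sql
-- Assumption: TPC-H-style columns; l_shipdate is the partition column
-- of the partitioned Parquet table. Adjust to your own table layout.

-- Aggregate over the CSV table: every file must be read and parsed.
SELECT COUNT(*)             AS line_count,
       SUM(l_extendedprice) AS revenue
FROM spectrum.lineitem_csv
WHERE l_shipdate = '1998-01-01';

-- Same aggregate over the partitioned Parquet table: the filter on the
-- partition column lets Redshift Spectrum skip non-matching partitions,
-- and only the referenced columns are read from the Parquet files.
SELECT COUNT(*)             AS line_count,
       SUM(l_extendedprice) AS revenue
FROM spectrum.lineitem_part_parq
WHERE l_shipdate = '1998-01-01';

-- Inspect the S3 scan statistics for the most recent query in this session.
SELECT query, segment, elapsed, s3_scanned_rows, s3_scanned_bytes, files
FROM svl_s3query_summary
WHERE query = pg_last_query_id()
ORDER BY query, segment;
```

Running the statistics query after each aggregate gives a like-for-like comparison of scanned bytes across the file formats.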

For more information on how this can be done, check out the following resources:
We recommend using Amazon Redshift on large sets of structured data. Amazon Redshift Spectrum gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it. With Amazon Redshift Spectrum, you don’t have to worry about scaling your cluster. It lets you separate storage and compute, allowing you to scale each independently. You can even run multiple Amazon Redshift clusters against the same Amazon S3 data lake, enabling limitless concurrency. Amazon Redshift Spectrum automatically scales out to thousands of instances, so queries run quickly, whether they are processing a terabyte, a petabyte, or even an exabyte.

Set up the test environment

For information about prerequisites and steps to get started in Amazon Redshift Spectrum, see Getting Started with Amazon Redshift Spectrum. You can use any data set to perform the tests to validate the best practices we have outlined in this blog post. One important requirement is that the S3 files for the largest table need to be in three separate data formats: CSV, non-partitioned Parquet, and partitioned Parquet. How to convert from one file format to another is beyond the scope of this blog post.
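To make the three-format requirement concrete, the sketch below registers an external schema named "spectrum" and one external table per format. It is illustrative only. The database name `spectrumdb`, the IAM role ARN, the bucket `my-example-bucket`, the table name `lineitem_parq`, the reduced TPC-H-style column list, and the choice of `l_shipdate` as partition column are all assumptions, not values from this post. The files are assumed to already exist in S3 in each format.

```sql
-- All names below (database, IAM role, bucket, columns) are hypothetical.

-- External schema backed by the Athena data catalog.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Format 1: CSV (comma-delimited text files).
CREATE EXTERNAL TABLE spectrum.lineitem_csv (
    l_orderkey      BIGINT,
    l_quantity      DECIMAL(12,2),
    l_extendedprice DECIMAL(12,2),
    l_shipdate      DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-example-bucket/lineitem_csv/';

-- Format 2: non-partitioned Parquet.
CREATE EXTERNAL TABLE spectrum.lineitem_parq (
    l_orderkey      BIGINT,
    l_quantity      DECIMAL(12,2),
    l_extendedprice DECIMAL(12,2),
    l_shipdate      DATE
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/lineitem_parq/';

-- Format 3: Parquet partitioned on l_shipdate. The partition column is
-- declared in PARTITIONED BY, not in the regular column list.
CREATE EXTERNAL TABLE spectrum.lineitem_part_parq (
    l_orderkey      BIGINT,
    l_quantity      DECIMAL(12,2),
    l_extendedprice DECIMAL(12,2)
)
PARTITIONED BY (l_shipdate DATE)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/lineitem_part_parq/';

-- Each partition has to be registered before it becomes queryable.
ALTER TABLE spectrum.lineitem_part_parq
ADD IF NOT EXISTS PARTITION (l_shipdate = '1998-01-01')
LOCATION 's3://my-example-bucket/lineitem_part_parq/l_shipdate=1998-01-01/';
```

Keeping the column definitions identical across the three tables means any difference in query behavior can be attributed to the file format and partitioning alone.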

Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against data that is stored in Amazon S3. With Amazon Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond the data that is stored on local disks in your data warehouse. You can query vast amounts of data in your Amazon S3 “data lake” without having to go through a tedious and time-consuming extract, transform, and load (ETL) process. Amazon Redshift Spectrum applies sophisticated query optimization and scales processing across thousands of nodes to deliver fast performance. In this blog post, we have collected 10 important best practices for Amazon Redshift Spectrum, organized into several functional groups. These guidelines are the product of many interactions and direct project work with Amazon Redshift customers.

Amazon Athena

AWS customers often ask us: Amazon Athena or Amazon Redshift Spectrum? When should I use one over the other?

Amazon Athena supports a use case in which you want to run interactive ad-hoc SQL queries against data that is stored in Amazon S3. The serverless architecture in Amazon Athena frees you from having to provision a cluster to perform queries. You are charged based on the amount of S3 data scanned by each query. You can get significant cost savings and better performance by compressing, partitioning, or converting your data into a columnar format, which reduces the amount of data that Amazon Athena needs to scan to execute a query. All the major BI tools and SQL clients that use JDBC can be used with Amazon Athena. You can also use Amazon QuickSight for easy visualization.
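Because Amazon Athena charges by the amount of S3 data scanned, the shape of a query matters as much as the file format. The following sketch shows an Athena query written to scan as little as possible. The database `weblogs`, the table `requests_parquet`, its columns, and the partition column `dt` are hypothetical names chosen for illustration.

```sql
-- Hypothetical Athena table: weblogs.requests_parquet, stored as Parquet
-- and partitioned by dt (a date string such as '2017-11-01').

-- Reads one partition and one data column:
--   * the dt predicate prunes every other partition,
--   * naming only "status" lets the columnar format skip all other columns.
SELECT status,
       COUNT(*) AS requests
FROM weblogs.requests_parquet
WHERE dt = '2017-11-01'
GROUP BY status
ORDER BY requests DESC;

-- By contrast, a query such as
--   SELECT * FROM weblogs.requests_parquet;
-- has no partition filter and no column projection, so it scans
-- (and is billed for) the whole table.
```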
