AWS re:Invent Introduces the World to S3 Select & Glacier Select

AWS launched S3 Select and Glacier Select at re:Invent 2017. Use S3 Select to retrieve specific data from object contents, and Glacier Select to run queries on data stored in Amazon Glacier. With Amazon S3 Select and Glacier Select, you can retrieve only a subset of the data in an object by using simple SQL expressions. What does this mean? It means that you can query the contents of a compressed CSV file stored in S3 or Glacier without having to download and decompress the file. This is possible through a binary wire protocol in the updated S3 SDK, which accordingly requires the addition of a deserialization library. Glacier Select works just like any other Glacier retrieval job, except that an additional set of parameters can be passed to the initiate job request.

S3 Select

S3 Select is an Amazon S3 feature that lets you retrieve specific data from the contents of an object using simple SQL expressions. The biggest attraction is that you can do this without having to retrieve the entire object. Here is what you can do with S3 Select:

  1. Retrieve a subset of data using SQL clauses, like SELECT and WHERE, from delimited text files and JSON objects in Amazon S3.
  2. Retrieve a smaller, targeted data set from an object using simple SQL statements.
  3. Use S3 Select with AWS Lambda to build serverless applications that retrieve only the data they need from Amazon S3 instead of retrieving and processing the entire object.
  4. Use S3 Select with Big Data frameworks, such as Presto, Apache Hive, and Apache Spark to scan and filter the data in Amazon S3.
  5. Retrieve specific data using SQL statements from the contents of an object stored in Amazon S3 without having to retrieve the entire object.
  6. Simplify scanning and filtering object content into a smaller, targeted dataset, and improve its performance by up to 400%.
  7. Perform operational investigations on log files in Amazon S3 without the need to operate or manage a compute cluster.

S3 Select is currently in limited preview. You can apply for access by completing the Amazon S3 Select Preview application form. During the preview, you can use Amazon S3 Select through the available Presto connector, with AWS Lambda, or from any other application using the S3 Select SDK for Java or Python.
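As a rough illustration of the idea, here is a minimal Python sketch of an S3 Select call. It assumes the `select_object_content` operation as exposed by the generally available boto3 S3 client; the preview SDK may differ. The helper names (`build_select_request`, `collect_records`, `run_select`) and the bucket, key, and column names in the usage comment are hypothetical.

```python
def build_select_request(bucket, key, sql, compression="GZIP"):
    """Build keyword arguments for an S3 Select call on a CSV object.

    Assumes the object is a CSV file with a header row, compressed
    with gzip by default.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {
            # Use the header row so columns can be referenced by name.
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": compression,
        },
        "OutputSerialization": {"CSV": {}},
    }


def collect_records(event_stream):
    """Join the payloads of all Records events into one string.

    The response is a stream of events; only Records events carry
    result bytes. Other events (such as Stats or End) are skipped.
    """
    chunks = []
    for event in event_stream:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Decode after joining, since a chunk boundary may split a character.
    return b"".join(chunks).decode("utf-8")


def run_select(s3_client, bucket, key, sql):
    """Run the query with an S3 client and return the matching rows as text."""
    response = s3_client.select_object_content(
        **build_select_request(bucket, key, sql)
    )
    return collect_records(response["Payload"])


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   rows = run_select(
#       boto3.client("s3"),
#       "example-bucket",
#       "logs/2017-11.csv.gz",
#       "SELECT s.name FROM S3Object s WHERE s.city = 'Seattle'",
#   )
```

Keeping the request builder separate from the client call makes it possible to inspect or test the parameters without touching AWS.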

Glacier Select

Amazon Glacier Select is a feature that allows you to run queries and perform analysis on your data stored in Amazon Glacier. The incentive here is that you can do this without having to restore the entire object to a hotter tier like Amazon S3. This makes it cheaper, faster, and easier to gather insights from your cold data in Amazon Glacier. It also unlocks business value from your archives, opening up scenarios for using Amazon Glacier in Big Data, IoT, and custom analytics workloads.

Compare this with legacy archival solutions like on-premises tape libraries, which have highly restricted data retrieval throughput and seldom have idle compute capacity nearby. The problem is exacerbated if tapes have been sent to an off-site storage facility. Running any kind of analysis on these solutions can cost you one of your most valuable resources: time. You can easily lose weeks or even months in the process. In contrast, Amazon Glacier Select lets you analyze your Amazon Glacier data in place, quickly and inexpensively, at latencies you choose, ranging from minutes to hours.

Glacier provides three retrieval options: Expedited, Standard, and Bulk. Each option has a different retrieval time and cost. Amazon Glacier Select works with all three, allowing you to choose the one best suited to the speed at which you want your query to return results. For all but the largest archives (250 MB+), data accessed using Expedited retrievals is usually available within 1 to 5 minutes. Standard retrievals complete within 3 to 5 hours, whereas Bulk retrievals complete within 5 to 12 hours.
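To make the trade-off concrete, the following Python sketch picks a retrieval option for a given deadline using the upper bounds of the retrieval times quoted above. It assumes that Bulk is the cheapest option and Expedited the most expensive; the function and constant names are hypothetical, and the times are typical figures rather than guarantees.

```python
from datetime import timedelta

# Upper bounds of the typical retrieval times quoted above,
# ordered from cheapest to most expensive (assumed cost ordering).
TIERS_CHEAPEST_FIRST = [
    ("Bulk", timedelta(hours=12)),
    ("Standard", timedelta(hours=5)),
    ("Expedited", timedelta(minutes=5)),
]


def pick_tier(deadline):
    """Return the cheapest tier whose typical upper bound fits the deadline.

    Returns None when even Expedited is not expected to finish in time.
    """
    for name, upper_bound in TIERS_CHEAPEST_FIRST:
        if upper_bound <= deadline:
            return name
    return None


# An audit answer due in six hours rules out Bulk but not Standard.
print(pick_tier(timedelta(hours=6)))
```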

Here are a few more things you can use Glacier Select for:

  1. Perform filtering and basic querying using a subset of SQL directly against your data in Amazon Glacier.
  2. Provide a SQL query and list of Amazon Glacier objects, and Amazon Glacier Select will run the query in-place and write the output results to a bucket you specify in Amazon S3.
  3. Perform pattern matching or custom analytics on archived data stored in Glacier. For example, you may need to filter for specific keys in response to an audit that requires an answer within a few hours.
  4. Use the Amazon Glacier Select APIs in higher-level Big Data applications, like Amazon Athena, to provide Amazon Glacier as an additional data source, so that you can use tools and languages against Glacier data.

Want to know how to create an Amazon Glacier Select job? Just use the Amazon S3 API, Amazon Glacier API, AWS SDK, or AWS CLI.
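Since a Glacier Select job is an ordinary retrieval job with extra parameters, a request might look like the Python sketch below. It assumes the `initiate_job` operation of the boto3 Glacier client with a `select` job type; the helper names, vault name, archive ID, and bucket in the usage comment are hypothetical.

```python
def build_select_job(archive_id, sql, output_bucket, output_prefix,
                     tier="Standard"):
    """Build jobParameters for a Glacier Select job on a CSV archive.

    Assumes the archive is a CSV file with a header row. Results are
    written under the given prefix in the output S3 bucket.
    """
    return {
        "Type": "select",
        "ArchiveId": archive_id,
        "Tier": tier,  # "Expedited", "Standard", or "Bulk"
        "SelectParameters": {
            "InputSerialization": {"csv": {"FileHeaderInfo": "USE"}},
            "ExpressionType": "SQL",
            "Expression": sql,
            "OutputSerialization": {"csv": {}},
        },
        "OutputLocation": {
            "S3": {"BucketName": output_bucket, "Prefix": output_prefix},
        },
    }


def start_select_job(glacier_client, vault_name, job_parameters):
    """Submit the job with a Glacier client and return its job ID."""
    response = glacier_client.initiate_job(
        vaultName=vault_name,
        jobParameters=job_parameters,
    )
    return response["jobId"]


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   params = build_select_job(
#       "EXAMPLE-ARCHIVE-ID",
#       "SELECT * FROM archive",
#       "example-results-bucket",
#       "audit-2017",
#       tier="Expedited",
#   )
#   job_id = start_select_job(boto3.client("glacier"), "example-vault", params)
```

The job runs asynchronously, so the returned job ID is what you would use to check on its status later.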

So what are you waiting for? It’s time to enhance your applications and build new ones with these newly launched capabilities. Glacier Select is available in any commercial region that offers Glacier, so you can get started right away. As for S3 Select, you can apply for the preview at the S3 Select Preview Application Page. Amazon also plans to integrate Athena with Glacier using Glacier Select in 2018. Just follow us on Twitter and LinkedIn, and we’ll let you know about any future updates.