Greetings interested Redshift users!
Having gained access to the Redshift Limited Preview, I’m documenting my experiences here to help others who are contemplating its use.
The official description of Redshift is: “A fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift offers you fast query performance when analyzing virtually any size data set using the same SQL-based tools and business intelligence applications you use today.”
The announcement has already attracted plenty of press coverage:
- ZDNet: Amazon Redshift: ParAccel in, costly appliances out
- InformationWeek: Amazon Redshift Leaves On-Premises Opening, Says ParAccel
- Gartner: Amazon Redshift Disrupts DW Economics – But Nothing Comes Without Costs
The claimed $1,000 per TB per year is based on a 3-year Reserved Instance of their XL (2TB) or 8XL (16TB) nodes. If paying per hour, the price works out to $3,723/TB/year, and since the minimum cluster is a single 2TB XL node, you’re up for $7,446/year if you run it full-time (that’s $7,446 ÷ 8,760 hours ≈ $0.85/hour). Of course, the benefit of cloud-based systems is that you can turn the cluster off (and pay nothing) when you don’t need it, and add nodes when you want more computing power.
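A quick sanity check on those numbers (prices as quoted above; check current AWS pricing before relying on them), runnable in any Postgres-compatible client:

```sql
-- On-demand cost of the minimum 2 TB cluster, run 24x365:
SELECT 3723 * 2                    AS dollars_per_year,          -- 7446
       ROUND(3723 * 2 / 8760.0, 2) AS implied_dollars_per_hour;  -- 0.85
```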
What is Redshift, really?
I haven’t used Redshift much yet, but I’ve read through the documentation, and I basically see Redshift as:
- Postgres (connect to it and use it as you would any Postgres database; see the connection sketch just after this list)
- Column-based storage (very fast when working across many rows, since only the columns a query needs are read)
- Compressed storage (takes less space, so there is less I/O and it is faster to access)
- Split across Node Slices — “A compute node is partitioned into slices; one slice per disk for each core of the node’s multi-core processor. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node.”
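As a quick sketch of that Postgres compatibility, this is how you would connect with the stock psql client (the endpoint, user, and database names below are made up; Redshift listens on port 5439 by default):

```
psql -h examplecluster.abcd1234.us-east-1.redshift.amazonaws.com \
     -p 5439 -U masteruser -d dev
```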
The node slices are the most interesting part: a query is split across multiple threads and even multiple systems, similar to the way Hadoop processes a MapReduce job. Tables have distribution keys that define how the data is split across slices. The recommended practice is to distribute on commonly-joined columns, so that joined data resides on the same slice, thus avoiding the need to move data between systems.
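Here is a minimal sketch of what that looks like in practice (table and column names are made up). Both tables are distributed on customer_id, so rows that join to each other sit on the same slice and the join needs no data shuffling:

```sql
CREATE TABLE customers (
    customer_id BIGINT DISTKEY,   -- distribution key: rows are placed by customer_id
    name        VARCHAR(100)
);

CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT DISTKEY,   -- same key, so matching rows co-locate
    amount      DECIMAL(12,2)
);

-- This join can be resolved slice-locally, with no data movement:
SELECT c.name, SUM(s.amount) AS total_spend
FROM sales s
JOIN customers c ON c.customer_id = s.customer_id
GROUP BY c.name;
```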
Data is also stored pre-sorted (by a per-table sort key), which enables faster querying when ranges of data are requested, such as date ranges or ID ranges. This is similar to traditional ‘partitioning’ concepts, without the data actually being partitioned. To keep this working, regular VACUUM runs are recommended to re-sort the data after new rows have been added.
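A sketch of the sort-key side, again with made-up names. Because rows are stored in event_date order, a date-range scan only touches the blocks that cover that range:

```sql
CREATE TABLE events (
    event_id   BIGINT,
    event_date DATE SORTKEY,      -- rows are stored sorted by this column
    payload    VARCHAR(256)
);

-- Range-restricted scans can skip blocks outside the requested dates:
SELECT COUNT(*)
FROM events
WHERE event_date BETWEEN '2013-01-01' AND '2013-01-31';

-- After loading new data, re-sort the table and refresh statistics:
VACUUM events;
ANALYZE events;
```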
What is it good for?
I see Redshift as a super-powerful database. It allows regular SQL to be used against large amounts of data, with the expectation that queries will run quickly. (I’ll be testing this as I experiment with the system.)
Hadoop is great for operating across large quantities of data, but Hive (the SQL-like layer on top of Hadoop) is somewhat limited. I’m hoping that Redshift will give us the ability to play with large amounts of data like a ‘normal’ database.
Oh, and being a normal-ish database, Redshift should be good for handling transactional data — that is, data that is inserted and updated frequently. Hive does not offer the ability to run an UPDATE at all, so data there is effectively append-only.
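For example, an everyday statement like this (made-up table) runs fine on Redshift but has no Hive equivalent:

```sql
-- Update rows in place, something Hive cannot do:
UPDATE accounts
SET    balance = balance - 100
WHERE  account_id = 42;
```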
The Bottom Line
- Redshift is based on ParAccel, but with changes to fit the Amazon Web Services model
- It’s Postgres at the interface layer, so people can get going quickly
- It should offer Hadoop-like capabilities within the framework of a traditional database
- I’ll be blogging about my Redshift experiments to share my experiences