Redshift User Blog – Intro

Greetings interested Redshift users!

Having gained access to the Redshift Limited Preview, I’m using this opportunity to document my experiences with Redshift to help others who are contemplating its use.

The official description of Redshift is: “A fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift offers you fast query performance when analyzing virtually any size data set using the same SQL-based tools and business intelligence applications you use today.”

Redshift is based on ParAccel, a company in which Amazon invested $20m.

The claimed $1000 per TB per year is based on a 3-year Reserved Instance of their XL (2TB) or 8XL (16TB) nodes. If paying by the hour, the price is $3723/TB/year, and since the minimum is 2TB you’re up for $7446 if you run it full-time. Of course, the benefit of cloud-based systems is that sometimes you’ll turn it off (and pay nothing) and sometimes you’ll add nodes for extra computing power.
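
For reference, the hourly arithmetic works out roughly as follows (the implied on-demand rate of about $0.85/hour for an XL node is my inference from these figures, not an official number quoted here):

$0.85/hour × 8,760 hours/year ≈ $7,446 per year for one 2TB XL node
$7,446 ÷ 2TB ≈ $3,723 per TB per year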

What is Redshift, really?

I haven’t used Redshift much yet, but I’ve read through the documentation and I basically see Redshift as:

  • Postgres (connect to it and use it as you would with Postgres)
  • Column-based storage (meaning it’s very fast when working across many rows but only a few columns)
  • Compression-based storage (taking less space, therefore faster to access)
  • Split across Node Slices — “A compute node is partitioned into slices; one slice per disk for each core of the node’s multi-core processor. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node.”

The node slices are the most interesting — it means that a query is split across multiple threads and even multiple systems, similar to the way Hadoop processes a MapReduce job. Tables have ‘keys’ that define how the data is split across slices. The recommended practice is to split based upon commonly-joined columns, so that joined data resides on the same slice, thus avoiding the need to move data between systems.
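
As a rough illustration of that practice, here’s a minimal sketch (the table and column names are hypothetical, not taken from any real schema):

CREATE TABLE customers (
    customer_id BIGINT,
    name        VARCHAR(100)
)
DISTKEY (customer_id);

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id);

-- Rows sharing a customer_id land on the same slice, so this join
-- needs no data movement between nodes:
SELECT c.customer_id, SUM(o.amount)
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id;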

Data is also pre-sorted, which enables faster querying when ranges of data are used, such as date ranges or ID ranges. This is similar to traditional ‘partitioning’ concepts, without the data actually being partitioned. To maintain this, regular ‘vacuums’ are recommended to re-sort the data after new rows have been added.
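
In SQL terms, the sorted-data idea looks something like this (again a sketch with hypothetical names): declare a SORTKEY on the date column, query by range, and VACUUM after loading new data.

CREATE TABLE events (
    event_id   BIGINT,
    event_date DATE,
    detail     VARCHAR(200)
)
SORTKEY (event_date);

-- Range restrictions can skip blocks that fall outside the sorted range:
SELECT COUNT(*)
FROM events
WHERE event_date BETWEEN '2013-01-01' AND '2013-01-31';

-- After appending new data, re-sort the table and reclaim space:
VACUUM events;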

What is it good for?

I see Redshift as being a super-powerful database. It allows regular SQL to be used against large amounts of data, with the expectation of operating quickly. (I’ll be testing this out as I experiment with the system.)

Hadoop is great for operating across large quantities of data, but Hive (the SQL-like portion of Hadoop) is somewhat limited. I’m hoping that Redshift will give us the ability to play with large amounts of data like a ‘normal’ database.

Oh, and being a normal-ish database, Redshift should be good for handling transactional data — that is, data that is inserted and updated frequently. Hive does not offer the ability to run an UPDATE command.
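
For example, a plain transactional statement like the following (using the hypothetical orders table from the earlier sketch) is perfectly normal in Redshift but has no equivalent in Hive:

BEGIN;
UPDATE orders
SET    amount = amount * 1.1
WHERE  order_date >= '2013-01-01';
COMMIT;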

The Bottom Line

  • Redshift is based on Paraccel, but with changes to fit the Amazon Web Services model
  • It’s Postgres at the interface layer, so people can get going quickly
  • It should offer Hadoop-like capabilities within the framework of a traditional database
  • I’ll be blogging about my Redshift experiments to share my experiences

Book Review: Getting Started with Amazon Redshift

I just posted a review of Getting Started with Amazon Redshift:

All of the official documentation for Redshift is available for free from Amazon, so this book is not attempting to be an official user manual. Rather, it passes information to the reader in a friendly manner, with additional commentary on how things differ from other systems on the market. This is helpful, since the official documentation focuses on what Redshift does, while this book also highlights what it doesn’t do (such as unsupported SQL commands). The book also makes it easier to come up to speed with Redshift because it is written by someone who has already climbed the learning curve and now explains things in a less formal manner, with frequent emphasis on difficult or ‘different than normal’ topics.

The author’s style is informal and friendly, but sometimes overly talkative. For example, some paragraphs occupy a whole page and read as though they were written for a blog, without much formal editing. Many concepts are introduced without much background, so the book may be difficult for somebody without prior experience in databases and data warehousing, although the typical reader interested in Redshift probably comes from such a background anyway.

If you’re looking to learn Redshift and you’ve found the official documentation a bit dry or difficult, this book gives you similar information in a friendlier, more ‘personal’ manner. Please note that, since Redshift is an online service with new features being added continually, some portions of this book are already outdated.

Here’s a quick outline of the book:

The book begins with a fairly informal chat about Redshift, its pricing and how it will likely affect the Data Warehouse marketplace.

This chapter talks through the creation of a Redshift cluster and then gets down into technical detail, showing what standard SQL functions are NOT supported by Redshift. It also covers accessing Redshift via command-line and via SQL tools.

As the author says, “this is where the fun starts”. Topics include schema and table creation, loading data from Amazon S3 (which was the only method available at the time of publishing), plus a bit on performance monitoring and tips on optimising table design (using Redshift’s Sort and Distribution Keys). Truth be told, data loading is where typical Redshift users end up spending most of their time.

This covers snapshots, resizing, vacuuming of tables and workload management (which is the low-level management of job queues). These topics are important once your cluster is up and running and you’d like to improve performance or move into ‘production’ mode for your system.

Not so much a lesson in SQL, but rather information for people already familiar with SQL but considering a move to Redshift. Special tables and the EXPLAIN command are covered, plus tips on JOINs, INSERTs, etc. You won’t learn data warehousing from this chapter, but if you’re familiar with an existing system then you’ll come to understand the subtle differences with Redshift. Some SQL code samples are included in this chapter.

This final chapter is a random collection of remaining topics such as security and monitoring.


Bucket Explorer is great for managing files in S3

Getting data into Redshift has meant that I’ve spent a lot of time uploading files into S3 and moving them between buckets.

I’ve written previously about Uploading data by using Boto, which is now the official AWS Python SDK. However, once data is in S3, I’ve been trying various tools to list and move files around. Here’s a quick recap of some of the utilities I’ve used.

The AWS S3 Console

The S3 console does a pretty good job of viewing buckets, directories and files. Directories, of course, don’t actually exist within S3 but it’s pretty standard that filenames contain slashes to create ‘pretend’ directories. The S3 Console handles this concept very nicely, showing hierarchies of directories when viewing buckets.

The Console also handles a large number of files in a bucket. Buckets can have an unlimited number of files, so this is always a challenge when listing files in S3. The Console shows files sorted by key name and uses ‘infinite scrolling’ to keep showing more files when scrolling down.

Some nice features of the Console include the ability to:

  • Create buckets
  • Delete files
  • Cut & Paste files to ‘move’ them between directories and buckets
  • Copy & Paste to copy files
  • Change the Storage Class to Reduced Redundancy Storage (RRS), which is 20% cheaper
  • Restore from Glacier (or at least it has a ‘Restore’ option, so I assume that’s what it does)

However, the Console is rather slow when handling a large number of files in one directory, since it loads extra files only after scrolling down.

Extended S3 Browser for Chrome

This add-on to the Chrome web browser provides quick, convenient access to S3 buckets from within the browser with the ability to:

  • Create buckets
  • Create folders
  • Upload files
  • Delete files

It’s pretty basic and seems to only show a limited number of files (which confused me when I pointed it to a large directory). On the up-side, it’s very fast in the browser.

Bucket Explorer

Whoa! Bucket Explorer is a heavy-duty app that seems to know all the ins & outs of S3. It has far more functions than I’d ever need! Things I’ve found include:

  • File counts & sizes (eg measuring the total size of a bucket)
  • Retrieving Access Control Lists
  • Updating metadata
  • Comparing files and buckets
  • Activating logging
  • Dragging files between the local disk and S3

It has an insanely huge list of capabilities such as configuring CloudFront, setting up SNS notifications (eg when an RRS file is lost) and creating an AWS Import/Export job.

One problem I discovered was that I moved some files between buckets and they lost their Reduced Redundancy (RRS) storage class. This might have been due to a configuration setting I should have changed, but I was surprised it didn’t keep the same storage settings.

Bucket Explorer is a highly-capable S3 management tool. The downside is that it’s a $70 commercial product. There’s also an interesting ‘Team Edition’ that has the ability to selectively allow access to data amongst the team. A 30-day trial is available.

The Bottom Line

  • The AWS S3 Console is adequate for simple S3 needs
  • Bucket Explorer is a highly capable tool
  • There’s probably plenty of other tools out there, too!

Rebalancing AWS Redshift clusters

I’m continually learning more use-cases for cloud-based, scalable infrastructure. The best stories on this topic come from Netflix, who have some great videos online from the recent AWS re:Invent conference.

I heard a rumor about Redshift that gives another great example of infrastructure in the cloud, and it concerns the rebalancing of data when nodes are added to or removed from a Redshift cluster.

First, I should mention that Redshift is designed to work across nodes in a cluster. Work is divided between the nodes, then sent back to the Leader node for final combining, much like MapReduce works in Hadoop.

One difference, however, is how Redshift stores its data. Rather than distributing the same data across each node (as is done by default with HDFS in Hadoop), Redshift puts data on different nodes based on a column that is marked as the Distribution Key (DISTKEY). All data relating to a particular value of the DISTKEY is stored on the same server, which makes JOIN queries very fast if both tables have the same DISTKEY. However, it can also slow things down if you’re just working on one value of the DISTKEY.

This is contrary to some traditional concepts of partitioning data. Normally, partitioning data is used to speed up queries, by reducing the amount of data that needs to be scanned during the query. A typical example is partitioning by date range, so that only a subset of data needs to be scanned if you’re querying data for a given month.

Redshift, on the other hand, can actually be slower if data is partitioned in this manner. Distributing by month would mean that queries on a single month could only operate from one node (the one where that month is stored). The system would run faster if the data was not partitioned by date.

The best DISTKEY is a field that is used in JOIN clauses, since all the data to be joined resides on the same node. Dates are rarely used in JOIN clauses, so pick a field that’s typically used to uniquely identify records.
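
In DDL terms, that advice looks something like this (a sketch with hypothetical names): the join column becomes the DISTKEY, while the date becomes the SORTKEY rather than the DISTKEY.

CREATE TABLE transactions (
    transaction_id BIGINT,
    account_id     BIGINT,
    txn_date       DATE,
    amount         DECIMAL(12,2)
)
DISTKEY (account_id)   -- commonly joined to an accounts table
SORTKEY (txn_date);    -- date ranges are handled by the sort order instead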

Rebalancing Nodes

Redshift, being an on-demand and scalable service, offers the additional intriguing capability of being able to add and remove nodes within a cluster. This has an impact on the DISTKEY.

For example, a 2-node cluster will distribute data evenly between two nodes based on a hash of the DISTKEY. If a 3rd node is added, the data needs to be rebalanced amongst the 3 nodes. It’s not just a matter of sending all new data to the 3rd node, because that would require lookups to figure out where data is stored. Rather, the cluster needs to be rebalanced by redistributing the data across the 3 nodes. Redshift takes care of this automatically. Just add the nodes and the data moves.

This is where the fun kicks in!

If the rumor I heard is true, then Redshift redistributes the data as follows:

  • A ‘new’ set of nodes is created (in the above example, 3 nodes would be created)
  • Redshift moves the data from the 2-node cluster to the 3-node cluster, rebalancing the data during the copy
  • Users are then flicked across from the ‘old’ 2-node cluster to the ‘new’ 3-node cluster

This is a great example of scalable cloud infrastructure — rather than having to ‘upgrade’ an existing system, it is much more efficient to provision a new system, copy data and then decommission the old system. This is a wonderful, fresh way of looking at infrastructure that is quite different to the old way of thinking in terms of physical boxes.

Would you believe, Netflix even fires up 500-node Hadoop clusters for short-term analysis!

The Bottom Line

  • Redshift automatically rebalances data between nodes
  • Scalable cloud infrastructure provides opportunities to rethink the way IT is run

AWS Redshift query comparison times against Hadoop and Postgres

I did some timing experiments with the same set of data across three systems:

  • AWS Redshift dw.hs1.xlarge x 2 nodes (see node description)
  • Hive running on a 3-node Hadoop cluster in my company’s data centre
  • PostgreSQL 9.2 in my company’s data centre

Sorry, I don’t have the system specs for the last 2, but they’re on pretty good hardware.

I performed the queries against 1.3 billion rows of data.

Query                                    PostgreSQL    Hive          Redshift
count(*)                                 7.5 minutes   6.5 minutes   39 seconds
Top 20 grouping                          10 minutes    11 minutes    62 seconds
Top 20 grouping for one month of data *  10 minutes    5 minutes     23 seconds

* In Hadoop, data is partitioned by month. In Redshift, the time dimension is the SORTKEY.
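
For context, the three queries were along these lines (a sketch only; the real table and column names differ):

-- count(*)
SELECT COUNT(*) FROM table1;

-- Top 20 grouping
SELECT field1, COUNT(*) AS cnt
FROM table1
GROUP BY field1
ORDER BY cnt DESC
LIMIT 20;

-- Top 20 grouping for one month of data
SELECT field1, COUNT(*) AS cnt
FROM table1
WHERE date_field BETWEEN '2013-01-01' AND '2013-01-31'
GROUP BY field1
ORDER BY cnt DESC
LIMIT 20;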

The results are pretty startling, but understandable.

PostgreSQL was not indexed, so it always did a full-table scan. Performance could have been improved by adding indexes and partitioning the data. Partitions within Postgres are great for reducing the amount of data scanned when the partitioning field is used in the WHERE clause, such as only having to scan a subset of records when searching for recent transactions.
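
For comparison, partitioning in PostgreSQL 9.2 is done with table inheritance plus CHECK constraints. A minimal sketch (hypothetical names):

CREATE TABLE events (
    event_id   BIGINT,
    event_date DATE,
    detail     TEXT
);

CREATE TABLE events_2013_01 (
    CHECK (event_date >= DATE '2013-01-01' AND event_date < DATE '2013-02-01')
) INHERITS (events);

-- With constraint_exclusion enabled (the default 'partition' setting),
-- a filter on event_date lets the planner skip non-matching child tables:
SELECT COUNT(*)
FROM events
WHERE event_date >= DATE '2013-01-01'
  AND event_date <  DATE '2013-02-01';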

It was actually quite scary loading 1.3 billion rows into PostgreSQL because I’ve never had more than a few million in there. Fortunately, it handled the data volume quite gracefully. Since Redshift presents itself as a PostgreSQL database, I performed the load by using DBLINK to ‘pull’ the data from Redshift into PostgreSQL. The two databases effectively spoke to each other to exchange information.
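
The dblink approach looked roughly like this (a sketch; the connection string, target table and column definitions are placeholders):

CREATE EXTENSION IF NOT EXISTS dblink;

INSERT INTO local_table (id, event_date, value)
SELECT *
FROM dblink(
       'host=mycluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=mydb user=master password=...',
       'SELECT id, event_date, value FROM table1'
     ) AS t(id BIGINT, event_date DATE, value VARCHAR(100));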

Hive had files in Snappy format, partitioned by month. That’s why the 3rd query was so much faster. However, it still had to scan a considerable amount of data.

Redshift had a Distribution key (DISTKEY) that didn’t feature in the query, and a Sort Key (SORTKEY) on the date dimension, which is why it was faster on the 3rd query.

So, why was Redshift so much quicker? Several reasons:

  • The queries were very ‘columnar’, which is to Redshift’s benefit. At most two columns of data were accessed (the grouping field and the date field). Plus, those columns were compressed, which makes reading the data even quicker (see the encoding sketch after this list).
  • The two nodes split the work, similar to Hive. This is an advantage over PostgreSQL, which was running on just one node.
  • These are the sorts of queries that Redshift is well-suited to run. It is less optimal for doing complex computations against all data, for which Hive Streaming would be well-placed. Redshift does not implement stored procedures, so iterating through all data to perform complex logic would need to be done by external code that simply calls Redshift for data.
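
On the compression point, column encodings are declared per column, and Redshift can also recommend encodings for an existing table. A sketch (the table is hypothetical and the encodings shown are merely illustrative):

CREATE TABLE events_encoded (
    customer_id BIGINT      ENCODE delta,    -- numeric, slowly-varying values
    category    VARCHAR(20) ENCODE bytedict, -- low-cardinality text
    event_date  DATE                         -- left raw; it's the sort key
)
DISTKEY (customer_id)
SORTKEY (event_date);

-- Ask Redshift to suggest encodings based on a sample of existing data:
ANALYZE COMPRESSION table1;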

The other benefit that Redshift offers over Hive is that it presents a more friendly interface. Being able to connect to Redshift using your favourite SQL client or reporting tool (I love using dbVisualizer) is a godsend compared to playing with Hive on the command line. While Hive can be accessed via JDBC, it is not provided out-of-the-box. With Redshift, it’s there the moment the cluster starts.

The Bottom Line

  • With large data, Redshift easily beats PostgreSQL (yet it presents itself as PostgreSQL)
  • With the right sorts of queries, it can easily beat Hive/Hadoop (but I admit this can be improved by tuning the way data is stored in Hadoop, so no flame wars, please!)
  • You should try these comparisons for yourself — Redshift is now publicly available and you can use it on-demand for $0.215 per hour per node

AWS Redshift bulk load times

As part of the pre-release “preview”, Redshift was upgraded this week so I had to reload data from S3.

Whereas previously I had loaded data in sections, this was a great opportunity to bulk-load my entire data set:

  • Rows of data: 13,122,397,463 (yes, 13 billion!)
  • Number of Files: 16,912
  • Data size: 135GB (compressed)

After all, we are talking about Big Data!

All my data was already in S3, so I simply had to issue one command to get it all imported. The command was similar to this:

COPY table1 FROM 's3://bucket1/' credentials 'aws_access_key_id=abc;aws_secret_access_key=xyz' delimiter '|' gzip removequotes truncatecolumns maxerror 1000

Some explanation:

  • AWS Redshift can load multiple files in parallel — just give a path ‘prefix’ and any files matching that prefix are loaded. In the above command, it was actually the whole bucket, but you could do subdirectories too.
  • The input data was pipe-delimited and compressed with gzip
  • Some of the fields contained the pipe character within the data itself. Fortunately, these fields were quoted (much like Excel outputs CSV files and quotes any fields that contain a comma). The removequotes parameter tells Redshift to strip the quotes, and it nicely handles the delimiter appearing within those fields.
  • truncatecolumns avoids load failures by simply truncating columns that are too long. Yes, data is lost, but it’s better than a complete load failure!
  • maxerror 1000 also tells Redshift to continue if up to 1000 errors occur. This is good for letting through small errors but it would stop the load if something was majorly wrong.

Here’s some statistics related to this particular load:

  • Time taken: 28 hours (this includes time for sorting the data at the end of the load)
  • Errors detected: 58

Redshift also provides some pretty graphs during, and after, execution of queries. Here’s some samples:

AWS Redshift query charts

The charts show the 28 hour load job. The change about two-thirds of the way through the job marks the start of the Sort phase, where data is sorted according to the sortkey. This allows faster querying over ranges of data and takes the place of a normal database index (indexes are not available in Redshift).

Strangely, the query monitor reported 19b rows before the sort, and only 13b afterwards.

Also, three systems are represented in the charts: a leader node and two compute nodes. From what I’ve observed, the leader node is the PostgreSQL-like frontend while the compute nodes are the Hadoop-like data processors. Additional compute nodes can be added (with a corresponding cost increase) to perform jobs faster. Changing the number of nodes will require the system to redistribute data that is distributed with a distkey parameter. The ability to distribute data is one of Redshift’s benefits — it spreads the workloads amongst multiple compute nodes and then joins the data back together for final output.

As it happens, most of the data I loaded was irrelevant for our needs, so I performed a standard SQL ‘delete’ command. I then issued a ‘vacuum’ command to remove the deleted data and optimise storage. Here’s some stats:

  • Before delete: 13 billion rows
  • After delete: 1.4 billion rows
  • Time taken for delete: 15 minutes
  • Time taken for vacuum: 3 hours
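
The delete-then-vacuum step was essentially the following (a sketch; the actual filter condition is hypothetical):

DELETE FROM table1
WHERE  record_type <> 'relevant';

-- Physically remove the deleted rows and re-sort what's left:
VACUUM table1;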

For the record, here’s the time taken for some simple queries on the 1.4 billion rows:

  • Min & Max on the sortkey: 95 seconds
  • Group by a field that has 3 values and count the rows: 49 seconds

Not all queries are that fast, but it is impressive given the huge number of rows!

The Bottom Line

  • Data can be loaded in one big batch
  • Include parameters to avoid errors that abort the whole load
  • It can take some time, but that’s understandable with huge quantities of data

AWS Redshift is lightning fast compared to Hadoop

I just had a need to find distinct values across my 550-million row table:

select distinct(field1) from table1

I couldn’t do it in our Hadoop cluster because it’s currently busy processing four other queries that seem to have consumed all the map and reduce slots. So, I gave it a try in our Redshift copy of the data and… it took just 20 seconds!

That’s barely more than the overhead of simply starting a Hive query!

That particular query only had 3 distinct values, so I tried it again on a field that I knew had many more values:

select count(distinct field2) from table1

It gave the answer of 242,000 distinct values (which is quite high!) and took only 29 seconds. I’m very impressed.

Of course, Redshift isn’t a complete replacement for Hive. I’m currently writing a Streaming function for Hive, which isn’t something that Redshift can do. The closest capability would be PostgreSQL stored procedures, but the Redshift documentation doesn’t mention any such ability.

The Bottom Line

  • Querying in Redshift is FAST!
  • Redshift isn’t a complete replacement for a Hadoop system

Using Boto to load data into AWS Redshift

I’ve started work on loading a significant amount of data into Redshift. There are only three ways to get data into Redshift:

  • COPY from S3 (delimited text files)
  • COPY from DynamoDB (NoSQL datastore)
  • JDBC/ODBC transactions (not efficient for bulk loading)

My source system generates a few dozen files a day, making it easy for me to upload into Redshift a little bit at a time. So, I’ve started writing a Python script that will:

  • Extract data files from my source system (which is actually a Hoop server fronting Hadoop’s HDFS)
  • Load them into S3 (via Boto)
  • Run the COPY command in Redshift to load the data (via JDBC)

I was searching for a way to upload files to S3 and found many utilities, but they all seemed to be in various states of disrepair. Fortunately, I came across Boto, which works very nicely with many AWS services. The documentation is very good and it is written by Mitch Garnaat, who has written an O’Reilly book titled Python & AWS Cookbook and who, it seems, is now working for AWS. His blog has a wealth of information, too.

Using boto, I managed to automate the uploading of files to S3. I also found a handy blog post on Upload a file to S3 with Boto showing how to use a callback function during the upload.

I was also pleased to see that Boto could mark files for Reduced Redundancy Storage, which is 20% cheaper. Data is apparently replicated a little less on S3, which slightly increases the possibility of data loss. However, this is fine for my purposes since I have a copy of the source data anyway.

Next, I did a test load of the data by performing a manual COPY command in Redshift. Unfortunately, I ran into an error while loading the data. It’s worth mentioning that Redshift is great for diagnosing load errors (see the sketch after this list). Specifically:

  • A system table (stl_load_errors) is populated with error information
  • I can set the MAXERRORS parameter to allow the system to continue loading, even when errors are generated
  • I can use the NOLOAD feature to do a “trial load” of data. It parses the data for errors, but does not load the data. That’s great for checking whether the data will load.
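
Here’s a sketch of how those features combine (the bucket, credentials and table are placeholders, as in the earlier bulk-load post):

-- 'Trial load': parse and validate the files without actually loading them
COPY table1 FROM 's3://bucket1/'
CREDENTIALS 'aws_access_key_id=abc;aws_secret_access_key=xyz'
DELIMITER '|' GZIP REMOVEQUOTES
NOLOAD MAXERROR 1000;

-- Then inspect what went wrong, and where:
SELECT filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;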

In this particular case, the error was “Extra column(s) found“. I soon discovered that some of my data contained the delimiter. The data uses the pipe “|” character to separate columns, but several rows of data also contained a pipe character within their fields, even though they were within quotation marks.

Fortunately, this was solved by using the REMOVEQUOTES option on the COPY command. The documentation doesn’t mention it, but this avoids the error when delimiters are within quoted strings.

The Bottom Line

  • Boto is good for uploading data to S3 via Python (and to other AWS services)
  • Debugging imports is quite friendly
  • Use REMOVEQUOTES to avoid import problems