Using Boto to load data into AWS Redshift

I’ve started work on loading a significant amount of data into Redshift. There’s only three ways to get data into Redshift:

  • COPY from S3 (delimited text files)
  • COPY from DynamoDB (NoSQL datastore)
  • JDBC/ODBC transactions (not efficient for bulk loading)

My source system generates a few dozen files a day, making it easy for me to upload into Redshift a little bit at a time. So, I’ve started writing a Python script that will:

  • Extract data files from my source system (which is actually a Hoop server fronting Hadoop’s HDFS)
  • Load them into S3 (via Boto)
  • Run the COPY command in Redshift to load the data (via JDBC)

I was searching for a way to upload files to S3 and found many utilities, but they all seemed into various states of disrepair. Fortunately, I came across Boto, which works very nicely with many AWS services. The documentation is very good and it is written by Mitch Garnaat, who has written an O’Reilly book titled Python & AWS Cookbook and who, it seems, is now working for AWS. His blog has a wealth of information, too.

Using boto, I managed to automate the uploading of files to S3. I also found a handy blog post on Upload a file to S3 with Boto showing how to use a callback function during the upload.

I was also pleased to see that Boto could mark files for Reduced Redundancy Storage, which is 20% cheaper. Data is apparently replicated a little less on S3, which makes increases the possibility of data loss. However, this is fine for my purposes since I have a copy of the source data anyway.

Next, I did a test load of the data by performing a manual COPY command in Redshift. Unfortunately, I raw across an error while loading the data. It’s worth mentioning that Redshift is great for diagnosing load errors. Specifically:

  • A system table (stl_load_errors) is populated with error information
  • I can set the MAXERRORS parameter to allow the system to continue loading, even when errors are generated
  • I can use the NOLOAD feature to do a “trial load” of data. It parses the data for errors, but does not load the data. That’s great for checking whether the data will load.

In this particular case, the error was “Extra column(s) found“. I soon discovered that some of my data contained the delimiter. The data uses the pipe “|” character to separate columns, but several rows of data also contained a pipe character within their fields, even though they were within quotation marks.

Fortunately, this was solved by using the REMOVEQUOTES option on the COPY command. The documentation doesn’t mention it, but this avoids the error when delimiters are within quoted strings.

Bottom Line

  • Boto is good for uploading data to S3 via Python (and to other AWS services)
  • Debugging imports is quite friendly
  • Use REMOVEQUOTES to avoid import problems
This entry was posted in Redshift. Bookmark the permalink.

6 Responses to Using Boto to load data into AWS Redshift

  1. njaffer says:

    Were you able to figure out how to load a file that had headers in it as well, typically in postgres one would use the ‘HEADER’ option, but doesn’t seem to work here.

  2. Pingback: Combining Hadoop/Elastic Mapreduce with AWS Redshift Data Warehouse

  3. Pingback: Bucket Explorer is great for managing files in S3 | AWS Redshift User Blog

  4. Scott Bollinger says:

    can we use boto to execute the copy command from s3 to redshift?

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s