I’ve started work on loading a significant amount of data into Redshift. There are only three ways to get data into Redshift:
- COPY from S3 (delimited text files)
- COPY from DynamoDB (NoSQL datastore)
- JDBC/ODBC transactions (not efficient for bulk loading)
My source system generates a few dozen files a day, making it easy to load data into Redshift a little at a time. So, I’ve started writing a Python script that will:
- Extract data files from my source system (which is actually a Hoop server fronting Hadoop’s HDFS)
- Load them into S3 (via Boto)
- Run the `COPY` command in Redshift to load the data (via JDBC)
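The core of those steps can be sketched roughly as follows. The table name, bucket, key prefix, and delimiter are assumptions for illustration; the Hoop extraction and the JDBC connection are omitted.

```python
import os

def s3_key_for(local_path, prefix="incoming"):
    """Derive the S3 key under which a local data file will be staged."""
    return "%s/%s" % (prefix, os.path.basename(local_path))

def copy_statement(table, bucket, key, access_key, secret_key):
    """Build the Redshift COPY statement for a file staged in S3."""
    return (
        "COPY %s FROM 's3://%s/%s' "
        "CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' "
        "DELIMITER '|'"
        % (table, bucket, key, access_key, secret_key)
    )
```

In the real script, each extracted file would be uploaded with Boto and the generated statement then executed over the JDBC connection.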
I was searching for a way to upload files to S3 and found many utilities, but they all seemed to be in various states of disrepair. Fortunately, I came across Boto, which works very nicely with many AWS services. The documentation is very good and it is written by Mitch Garnaat, who has written an O’Reilly book titled Python & AWS Cookbook and who, it seems, is now working for AWS. His blog has a wealth of information, too.
Using Boto, I managed to automate the uploading of files to S3. I also found a handy blog post on ferrouswheel.me, Upload a file to S3 with Boto, showing how to use a callback function during the upload.
I was also pleased to see that Boto could mark files for Reduced Redundancy Storage, which is 20% cheaper. Data is apparently replicated a little less within S3, which increases the possibility of data loss. However, this is fine for my purposes, since I have a copy of the source data anyway.
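A minimal sketch of that upload, combining the progress callback with the Reduced Redundancy flag. The bucket and key names are placeholders, and this assumes Boto’s older (v2) S3 API with credentials available in the environment.

```python
def report_progress(done, total):
    """Progress callback that Boto invokes periodically during an upload."""
    msg = "%d of %d bytes transferred" % (done, total)
    print(msg)
    return msg

def upload_to_s3(local_path, bucket_name, key_name):
    # Imported here so the sketch stays readable without boto installed.
    import boto
    from boto.s3.key import Key

    conn = boto.connect_s3()  # picks up AWS credentials from the environment
    bucket = conn.get_bucket(bucket_name)
    k = Key(bucket)
    k.key = key_name
    # reduced_redundancy=True requests the cheaper Reduced Redundancy
    # Storage class; cb/num_cb drive the progress callback during upload.
    k.set_contents_from_filename(local_path, cb=report_progress, num_cb=10,
                                 reduced_redundancy=True)
```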
Next, I did a test load of the data by running a manual `COPY` command in Redshift. Unfortunately, I ran across an error while loading the data. It’s worth mentioning that Redshift is great for diagnosing load errors. Specifically:
- A system table (`stl_load_errors`) is populated with error information
- I can set the `MAXERROR` parameter to allow the load to continue, even when errors are generated
- I can use the `NOLOAD` feature to do a “trial load” of data: it parses the data for errors but does not load it, which is great for checking whether the data will load cleanly
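A sketch of what a trial load and the follow-up error query might look like. The table, bucket, and credential strings are placeholders; the `stl_load_errors` columns come from the Redshift system-table documentation.

```python
# Trial load: parse the files and report up to 100 errors, loading no rows.
TRIAL_LOAD = (
    "COPY events FROM 's3://mybucket/incoming/' "
    "CREDENTIALS 'aws_access_key_id=XXX;aws_secret_access_key=YYY' "
    "DELIMITER '|' MAXERROR 100 NOLOAD"
)

# Inspect the most recent load errors that Redshift recorded.
ERROR_QUERY = (
    "SELECT starttime, filename, line_number, colname, err_reason "
    "FROM stl_load_errors ORDER BY starttime DESC LIMIT 10"
)
```

Running the trial first means the real `COPY` can be issued only once the files parse cleanly.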
In this particular case, the error was “Extra column(s) found”. I soon discovered that some of my data contained the delimiter. The data uses the pipe (“|”) character to separate columns, but several rows also contained a pipe character within a field, even though the field was enclosed in quotation marks.
Fortunately, this was solved by using the `REMOVEQUOTES` option on the `COPY` command. The documentation doesn’t mention it, but this avoids the error when delimiters appear within quoted strings.
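With that option added, the command looks roughly like this (table, bucket, and credentials are placeholders):

```python
# COPY with REMOVEQUOTES: quoted fields are unwrapped, so pipes inside
# quotation marks no longer split a field into extra columns.
COPY_WITH_REMOVEQUOTES = (
    "COPY events FROM 's3://mybucket/incoming/' "
    "CREDENTIALS 'aws_access_key_id=XXX;aws_secret_access_key=YYY' "
    "DELIMITER '|' REMOVEQUOTES"
)
```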
In summary:
- Boto is good for uploading data to S3 via Python (and for working with other AWS services)
- Redshift makes debugging imports quite friendly
- Use `REMOVEQUOTES` to avoid import problems