gsandie online notebook

multipart uploads + @fog + threads = win

Recently I’ve found myself playing with Fog[1] quite a lot. If you don’t know it, it is a really nice library for working with different cloud providers. If you’re coding in Ruby and using multiple cloud computing vendors, you should check it out.

One of the nice things about fog is that it supports Amazon's S3 multipart uploads[2]. This is a feature of S3 that Amazon recommends you use if the files you want to upload are larger than 100 MB. It just so happened I had a bunch of files that fit the bill.

Multipart uploads are neat because the upload does not expire: it stays open until you either complete it or abort it. This would let you schedule part uploads during times when your network traffic is quiet. You are also able to recover from a single part failing without it affecting the whole file upload.

How do multipart uploads work?

The basic steps are:

  • get a file

  • split the file into chunks; each part except the last must be at least 5 MB in size

  • get the Base64 encoded MD5 sum of each part

  • initiate a multipart upload and get an upload id

  • upload each part, identifying it with a part number and the upload id, and save the ETag returned for each part

  • if you are happy, use the tags and the upload id to complete the upload

  • if you are NOT happy you have to abort the upload
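
The splitting and checksum steps above can be sketched without touching the network. This is a minimal sketch rather than the script from the gist: the method name split_into_parts and the hash keys are made up for illustration, and it reads every chunk into memory, which a real uploader for large files would probably want to avoid.

```ruby
require 'digest/md5'

# Minimum size for every part except the last one: 5 MB.
MIN_PART_SIZE = 5 * 1024 * 1024

# Hypothetical helper: reads an IO in fixed-size chunks and returns one hash
# per part with its 1-based part number, its data and its Base64 encoded MD5.
def split_into_parts(io, part_size = MIN_PART_SIZE)
  raise ArgumentError, 'part_size must be positive' unless part_size.positive?

  parts = []
  part_number = 1
  # IO#read(length) returns nil at end of file, which ends the loop.
  while (chunk = io.read(part_size))
    parts << {
      number: part_number,
      data: chunk,
      md5: Digest::MD5.base64digest(chunk)
    }
    part_number += 1
  end
  parts
end
```

Something like File.open(path, 'rb') { |f| split_into_parts(f) } would produce the parts for a real file; the small part_size argument is only there to make the helper easy to try out on short strings.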

After some hacking about I had a basic script that would take a file, split it, get the Base64 encoded MD5 of the parts and upload them (the hacky results are in https://gist.github.com/907430). This worked well, but I really wanted to upload multiple parts at once to increase the speed, so I investigated threading in Ruby.

Results

The results are presented below as a proof-of-concept script. The main thing that had me scratching my head was completing the upload. Originally I had been pushing the ETag from each part onto an array, but because the threads can run in any order and finish at different times, there was no guarantee that the tags would end up in part order. Once I realised this, I explicitly set each array element to its corresponding tag, indexed by part, and the uploads would complete.
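
To make the ordering fix concrete, here is a small sketch of the idea, not the original script. The method upload_parts_in_threads and the uploader callable are hypothetical: in real use the callable would wrap whatever fog request uploads a part and returns its ETag, which is an assumption here, so a stub stands in for it.

```ruby
# Hypothetical helper: uploads every part in its own thread and returns the
# ETags in part order, no matter which thread finishes first.
# `uploader` is any callable taking (part_number, data) and returning an ETag.
def upload_parts_in_threads(parts, uploader)
  # Pre-size the array so each thread writes only to its own slot.
  etags = Array.new(parts.length)

  threads = parts.each_with_index.map do |data, index|
    Thread.new(data, index) do |chunk, i|
      # Part numbers start at 1, array slots at 0.
      etags[i] = uploader.call(i + 1, chunk)
    end
  end

  # Wait for all uploads; join re-raises any exception from a thread.
  threads.each(&:join)
  etags
end

# Stub uploader (assumption): earlier parts take longer, so threads finish
# in reverse order, which is exactly the case that broke the push approach.
stub_uploader = lambda do |part_number, data|
  sleep(0.01 * (4 - part_number))
  "etag-#{part_number}-#{data}"
end

ordered_etags = upload_parts_in_threads(%w[a b c], stub_uploader)
```

This spawns one thread per part, which is fine for a handful of parts but would want a cap for very large files. Collecting threads.map(&:value) would give the same ordering; the indexed assignment is used here because it mirrors the fix described above.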

The above is far from perfect, but it is working for me and I hope it gets the general idea across. I now plan on taking this base and turning it into a system that can perform a single upload on small files and a multipart upload on large files.

[1] - http://fog.io

[2] - http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?uploadobjusingmpu.html

— Gavin