Joe Ray

Software and Infrastructure Development.

GNU Parallel or How to list millions of S3 objects

GNU Parallel is a great tool for parallelising command-line tasks, and the AWS CLI is a great tool for interacting with S3.

Put them together and you have everything you might need to list millions of objects in an S3 bucket efficiently.

The problem: Listing a lot of objects in an S3 bucket takes a seriously long time when using aws s3 ls

The solution: GNU parallel can parallelise the process

Let’s say your bucket contains a set of log files named by date, and you’ve followed the S3 performance tips by prefixing each key with one of the sixteen hexadecimal digits (0–9 and a–f).
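
To make the setup concrete, the keys might look something like this (these exact names are hypothetical, not from a real bucket — one hex-digit prefix, then a date-based file name):

```shell
# Print a few illustrative key names: a single hex-digit prefix,
# a slash, then the date-named log file.
printf '%s\n' a/2016-01-01.log 3/2016-01-02.log f/2016-01-03.log
```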

We can use parallel to split up the object listing into separate tasks which can then run simultaneously:

$ parallel aws s3api list-objects \
    --bucket my-bucket \
    --prefix {1} \
    ::: {a..f} {0..9}

What we’re doing here is running 16 list-objects commands, one for each hexadecimal digit. s3api list-objects is more efficient than s3 ls because it makes the raw API call and returns the result to you directly; the AWS CLI also handles pagination for you, so each command returns the complete listing for its prefix. The {1} is a placeholder that parallel replaces with each digit, the digits themselves are generated by shell brace expansion, and the ::: syntax tells parallel to treat them as input arguments.
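
Since the whole trick hinges on brace expansion, it’s worth seeing exactly what the shell hands to parallel as its argument list:

```shell
# Brace expansion in bash produces the 16 hexadecimal digits
# that parallel then distributes across its tasks:
echo {a..f} {0..9}
# a b c d e f 0 1 2 3 4 5 6 7 8 9
```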

If we had a two-character prefix, we could further parallelise the work by using the following command:

$ parallel aws s3api list-objects \
    --bucket my-bucket \
    --prefix {1}{2} \
    ::: {a..f} {0..9} \
    ::: {a..f} {0..9}

This uses the same technique as before but will now parallelise all combinations of the hexadecimal digits, producing commands like this:

aws s3api list-objects --bucket my-bucket --prefix ab
aws s3api list-objects --bucket my-bucket --prefix ac
aws s3api list-objects --bucket my-bucket --prefix ad
aws s3api list-objects --bucket my-bucket --prefix ae
aws s3api list-objects --bucket my-bucket --prefix af
aws s3api list-objects --bucket my-bucket --prefix a0
aws s3api list-objects --bucket my-bucket --prefix a1
aws s3api list-objects --bucket my-bucket --prefix a2
aws s3api list-objects --bucket my-bucket --prefix a3
aws s3api list-objects --bucket my-bucket --prefix a4
aws s3api list-objects --bucket my-bucket --prefix a5
aws s3api list-objects --bucket my-bucket --prefix a6
aws s3api list-objects --bucket my-bucket --prefix a7
aws s3api list-objects --bucket my-bucket --prefix a8
aws s3api list-objects --bucket my-bucket --prefix a9
aws s3api list-objects --bucket my-bucket --prefix ba
aws s3api list-objects --bucket my-bucket --prefix bb
...
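
Two argument lists of 16 digits each give 16 × 16 = 256 prefixes, so parallel runs 256 tasks in total. You can confirm the count with plain brace expansion, no parallel required:

```shell
# Nesting the same braces reproduces every two-character prefix
# combination; counting the words confirms there are 16 * 16 = 256.
echo {{a..f},{0..9}}{{a..f},{0..9}} | wc -w
```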

You could combine this technique with jq and awk to count the objects and sum their total size (the ? after .Contents[] makes prefixes with no matching objects produce nothing, rather than a jq error):

$ parallel aws s3api list-objects \
    --bucket my-bucket \
    --prefix {1}{2} \
    ::: {a..f} {0..9} \
    ::: {a..f} {0..9} | \
  jq '.Contents[]?.Size' --raw-output | \
  awk '{ count += 1; size += $1 } END
       { print "Count:", count, "Size:", size }'
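
The awk stage is independent of S3: it simply counts lines and sums the first field of whatever arrives on stdin, so you can check it with a few sample sizes:

```shell
# Feed three fake object sizes through the same awk program
# used in the pipeline above:
printf '100\n200\n50\n' |
  awk '{ count += 1; size += $1 } END { print "Count:", count, "Size:", size }'
# Count: 3 Size: 350
```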

Or use parallel again to copy the objects locally:

$ parallel aws s3api list-objects \
    --bucket my-bucket \
    --prefix {1}{2} \
    ::: {a..f} {0..9} \
    ::: {a..f} {0..9} | \
  jq '.Contents[]?.Key' --raw-output | \
  parallel -j20 aws s3 cp s3://my-bucket/{} .

Here we use the -j flag to increase the level of parallelisation. By default, parallel runs one job per CPU core, but if your workload spends most of its time waiting on network I/O (as accessing S3 does), you can usually raise the job count well beyond that to speed things up.
