You are a data scientist, a developer, or some kind of data engineer who needs computing power from AWS. You upload your files with scp from your local machine to your EC2 instance, and after some computation, you download the results back to your machine.
If this sounds familiar, you may have run into some of the following problems:
- Download speed in this coffee shop is SO SLOW!
- Upload speed in this coffee shop is SO SLOW!
- My Spot instance was shut down, now I have to upload all files again!
- My Spot instance was shut down, now all results are gone!
- Hey, can you send me all 400 GB of data we used for this project?
Behold, there is a solution for this! Simply share your files between co-workers and/or instances with AWS S3 Buckets! How? That’s pretty easy if you follow 3 simple steps, explained below.
Install AWS CLI Tools
The first step is to download the AWS command line tools. Assuming you’re on a good old Ubuntu machine, simply type the following command and you’re good to go.
$ sudo apt-get install awscli
Set up your Credentials
Now that we have the necessary binaries installed, it is time to connect them to our AWS Account. For that, we simply execute the following command.
$ aws configure
You will be asked some questions. The important parts are your access key ID and secret access key. To get them, you have to
- open the AWS IAM Users page
- click on your user name
- select the Security credentials tab
- create a new access key or use an existing one whose secret you have saved
When this is done, you should be able to see your buckets. You can confirm that everything works as expected by listing all your buckets with the following command.
$ aws s3 ls
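If you want to script that check, for example at the top of a job script, something like the following sketch could work. Note that `check_aws_setup` is a name made up for illustration, not part of the AWS CLI, and the check assumes your user is allowed to list buckets.

```shell
# Sketch: confirm the CLI is installed and the credentials work.
# check_aws_setup is a hypothetical helper name, not an AWS CLI command.
check_aws_setup() {
    # Is an aws command reachable at all?
    if ! command -v aws >/dev/null 2>&1; then
        echo "aws CLI not found; install it first" >&2
        return 1
    fi
    # Listing buckets only succeeds with working credentials
    # (and permission to list buckets).
    if ! aws s3 ls >/dev/null 2>&1; then
        echo "aws s3 ls failed; try running 'aws configure' again" >&2
        return 1
    fi
    echo "AWS CLI is ready"
}
```

Calling `check_aws_setup || exit 1` before any transfer lets a script stop early instead of failing halfway through a sync.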
Sync your data
# Download a directory recursively to EC2
$ aws s3 cp s3://your_bucket/dir local_dir --recursive

# Sync a full directory (download only new or changed files)
# to your local EC2 directory
$ aws s3 sync s3://your_bucket/dir local_dir

# Upload a local EC2 directory recursively to S3
$ aws s3 cp --recursive local_dir/ s3://your_bucket/dir

# Sync a full directory (upload only changes)
# to your S3 Bucket
$ aws s3 sync local_dir/ s3://your_bucket/dir
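Before moving a lot of data, it can be reassuring to see what a sync would do. The `--dryrun` flag of `aws s3 sync` prints the planned operations without transferring anything. Here is a minimal sketch that previews first and then asks for confirmation; `preview_then_sync` is a hypothetical helper name, and the prompt is just one possible way to do it.

```shell
# Sketch: preview a sync before running it for real.
# preview_then_sync is a hypothetical helper; --dryrun is the CLI's own flag.
preview_then_sync() {
    src=$1
    dst=$2
    # Show the planned copies without transferring anything.
    aws s3 sync "$src" "$dst" --dryrun || return 1
    printf 'Run this sync for real? [y/N] '
    read -r answer
    case "$answer" in
        y|Y) aws s3 sync "$src" "$dst" ;;
        *)   echo "Skipped." ;;
    esac
}
```

Usage would look like `preview_then_sync local_dir/ s3://your_bucket/dir`. By default, sync does not remove files at the destination that are missing at the source, so a preview mostly matters for large uploads or when you add options that change that behavior.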
A common use case would be to upload data files to your S3 Bucket. After that, you spin up a machine, download all data files, run some computations and finally upload the results. The next day, you can simply pull these results to another machine in a few seconds and go on with your rocket science. This scenario would look like this in the terminal.
# start EC2 instance and ssh into it
$ sudo apt-get install awscli
$ aws configure
# copy and paste credentials
$ aws s3 sync s3://your_bucket/input_files input_data
# execute rocket science python script
# to crunch numbers from input_data
# and write world saving results to output_data
$ aws s3 sync output_data/ s3://your_bucket/output_files
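If you repeat this often, the download, compute and upload steps could be wrapped in a small shell function. The following is only a sketch under a few assumptions: `run_job` is a made-up name, the bucket layout (`input_files`, `output_files`) is the one from the example, and the job command is whatever script you use to turn `input_data` into `output_data`.

```shell
# Sketch of the workflow as a reusable function.
# run_job and its arguments are hypothetical, for illustration only.
run_job() {
    bucket=$1      # bucket name, e.g. your_bucket
    job_cmd=$2     # command that reads input_data and writes output_data

    # Pull the inputs; stop if the download fails so we never
    # compute on partial data.
    aws s3 sync "s3://$bucket/input_files" input_data || return 1

    mkdir -p output_data

    # Run the computation; skip the upload if it fails.
    sh -c "$job_cmd" || return 1

    # Push the results back so they survive an instance shutdown.
    aws s3 sync output_data/ "s3://$bucket/output_files"
}
```

A call might look like `run_job your_bucket "python rocket_science.py"`, where `rocket_science.py` stands in for your own script. Uploading right after the computation is what protects the results if a Spot instance is shut down later.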
I would like to see the AWS CLI tools pre-installed on EC2 machines, like on GCP instances. Nevertheless, I find the process very pleasant and easy to use.
The download and upload speed between EC2 and S3 is so good that you will never again waste valuable time uploading and downloading local files.
If you know even faster solutions or other mechanisms, please let me know in the comments. I’m always curious how to improve my daily life as a developer. ❤️