Direct S3 file upload

4 Sep 2020

Problem

Users could not upload all of their products to the marketplace because most of those products were either files larger than 200 MB or video files. Neither could be uploaded: digital files were subject to a size limit, and video files were not supported at all.

Goals and challenges

Allow users to upload files that are larger than 200 MB.
Let users resume a file upload after a network error on their side.
Scan uploaded files for malware.

How WE helped

Direct upload of files to Amazon S3 was designed and rolled out in the existing service (the marketplace) in less time than the previous upload system had taken: half a month compared to 2 months. It also improved the security of downloaded files.

Results

Users could upload files of up to 500 MB or up to 1 GB, depending on their user group.
A new item type, video files, became available for upload.
Using S3 reduced the time and resources needed for server maintenance compared to the hardware servers used before.
Downloaded files became more secure.

Background

The existing service allowed users to upload files and attach them to products, but it enforced a hard limit on file size: a file larger than 200 MB could not be uploaded. In addition, when an upload failed because of a network error, the user had to start the whole process again. This was a major problem because many users had files larger than 200 MB, so we received many requests to support them. Finally, the NGINX module used for file upload was very old and could no longer be supported.

Implementation process

Old upload schema

Previously, when a user uploaded a file, we relied on an old NGINX module. The module simply read the entire content of the file and stored it on our server in a temporary directory under a temporary file name. This caused the following problems:

The mechanism read the whole file in a single operation. This did not work correctly for large files and put a significant load on our servers.
The "unique" temporary file name was not actually unique. When many users were on the website at once, two or more uploads were sometimes given exactly the same temporary file name and conflicted.
Our progress bar covered the upload only. We could not show progress for the virus check, so users just saw a stalled progress bar while the file was being scanned. We also generated preview files (reduced versions of the user’s original files) at this stage, so users could not follow the progress of the process as a whole.
After the upload, we stored all files on our own data servers, which took extra attention and resources to maintain. It also complicated our codebase: the item properties table in the database needed an extra field, which we called the mask of servers, recording where the file for a given item was stored. This made it harder to select all items whose files were stored on specific data servers.
This upload process was not secure:

If someone hacked our system, they could intercept and replace files.
We checked files for viruses with a cron job after the upload was completed. Between the upload and the check, an infected file could compromise our system and harm other users who bought the product and downloaded the infected file.
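
Name collisions like the one described above are typical of temporary names built from guessable inputs such as a timestamp or a process ID. The original naming scheme is not documented here, so the following is only a sketch of one common remedy: deriving the name from a random UUID. The helper name `unique_upload_name` and the `upload_` prefix are hypothetical.

```python
import uuid
from pathlib import PurePosixPath


def unique_upload_name(original_name: str, prefix: str = "upload_") -> str:
    """Return a collision-resistant temporary name that keeps the file extension.

    uuid4() is built from 122 random bits, so concurrent uploads of files with
    the same original name get different temporary names without any
    coordination between requests.
    """
    suffix = PurePosixPath(original_name).suffix.lower()
    return f"{prefix}{uuid.uuid4().hex}{suffix}"


# Two users uploading a file with the same name at the same moment.
first = unique_upload_name("demo.ZIP")
second = unique_upload_name("demo.ZIP")
```

Because the random part does not depend on the clock or on the original file name, the two names differ even when both requests arrive in the same instant.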

Figure: direct S3 file upload, old scheme

Solution

New upload schema

We decided to upload files directly to S3 and no longer store them on our servers. Rather than write our own library, we chose a third-party JavaScript library for the direct upload to S3: EvaporateJS (https://github.com/TTLabs/EvaporateJS). This library had already implemented and tested the features we needed for all known web browsers:

Secured direct S3 file upload.
Resume failed file upload.
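
Secured direct upload usually means that the AWS secret key never reaches the browser: the client library asks the application server to sign each request and sends only the resulting signature to S3. The sketch below shows what such a server-side signing step could look like under AWS Signature Version 4, assuming the browser sends the string to sign and the request date. The function names, the region, and the example credentials are hypothetical and are not taken from the project described here.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of message under key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(
    secret_key: str, date_stamp: str, region: str, service: str = "s3"
) -> bytes:
    """Derive the Signature Version 4 signing key.

    date_stamp is the request date as YYYYMMDD. The key is scoped to one
    day, one region, and one service.
    """
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign_string(
    secret_key: str, date_stamp: str, region: str, string_to_sign: str
) -> str:
    """Return the hex signature the browser attaches to its S3 request."""
    signing_key = derive_signing_key(secret_key, date_stamp, region)
    return hmac.new(
        signing_key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()


# Hypothetical values: a real handler reads the secret from server-side
# configuration and receives the string to sign from the upload library.
signature = sign_string(
    secret_key="EXAMPLE-SECRET-KEY",
    date_stamp="20200904",
    region="us-east-1",
    string_to_sign="AWS4-HMAC-SHA256\n20200904T000000Z\nexample-scope\nexample-hash",
)
```

A production signing endpoint would also authenticate the user and check what it is asked to sign before returning a signature; that part is omitted here.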

All files were now uploaded to S3, into a temporary bucket for each user’s item, under unique paths and file names. We split each file into smaller parts before upload and uploaded the parts to S3 in parallel. Each part was signed by our server so that it could not be intercepted and replaced by someone else. Now, when a user selects a file to upload to S3:

EvaporateJS uploads the file directly into a temporary bucket. During the upload, it calls our servers only to sign each part of the file, so that nobody can intercept a part of the uploaded file and replace it.
After the file is uploaded into the temporary bucket, we process it in the following order:
- Check for viruses. If a virus is detected, we stop processing the file, move it into a quarantine bucket (which is cleared automatically after 4 days), and notify the user.
- Generate preview files.
- Collect information about the file (size, checksum, type of file, etc.).
The user fills in the required text fields and clicks the submit button.
The item is pushed into a queue as a job. When a user wants to create several items, we put all of the work into a queue: once the first item is submitted for creation or update, the user can select the next one without waiting for the first to finish. When an item has been created or updated, we display a message to inform the user.
In the background, a worker process takes jobs from the queue and runs them in parallel threads; each job saves the item details into the database.
If the background save fails (network issue, Amazon issue, etc.), the queue restarts the failed task a couple of times.
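
The processing and retry steps above can be sketched as a small worker. This is an illustration only: the virus scanner, the quarantine move, and the database write are stand-in callables, the retry limit of 3 is an assumed reading of "a couple of times", and none of the names come from the original system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class UploadJob:
    """One uploaded file waiting in the temporary bucket (hypothetical shape)."""
    key: str
    events: List[str] = field(default_factory=list)


def process_upload(
    job: UploadJob,
    is_infected: Callable[[str], bool],
    quarantine: Callable[[str], None],
    generate_preview: Callable[[str], None],
    collect_metadata: Callable[[str], None],
) -> bool:
    """Run the post-upload steps in order; return False if the file was quarantined."""
    if is_infected(job.key):
        # Stop here: an infected file gets no preview and no metadata.
        quarantine(job.key)
        job.events.append("quarantined")
        return False
    generate_preview(job.key)
    collect_metadata(job.key)
    job.events.append("processed")
    return True


def run_with_retries(task: Callable[[], None], max_attempts: int = 3) -> int:
    """Call task until it succeeds and return the number of attempts used.

    The last error is re-raised once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
    raise AssertionError("unreachable")


# A save that fails twice (for example, on a network error) and then succeeds.
calls = {"count": 0}


def flaky_save() -> None:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network issue")


attempts_used = run_with_retries(flaky_save)
```

A real worker would more likely catch only the errors it knows to be transient and wait between attempts; the broad `except` keeps the sketch short.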


Results

The new upload process was designed and successfully rolled out. Users can now upload files of up to 500 MB, and premium users can upload files of up to 1 GB. We also decided to allow video files as a new item type. The process was put into production in the existing service (the marketplace) in less time than the previous upload system had taken: half a month compared to 2 months. S3 also takes less time and effort to maintain than the hardware servers used before, and the change reinforced the security of downloaded files.

Wise Engineering team helps businesses solve complex tech challenges like migration to cloud, internal search integration, optimization for high load, and more. Check out our expertise in custom software development and contact us to discuss your future project.
