Direct S3 File Upload

Problem

Users could not upload all of their products to the marketplace, because many of the files were larger than 200 MB or were video files. Neither was supported for upload: digital files were subject to a size limit, and video files could not be uploaded at all.

Goals and challenges

  • Allow users to upload files larger than 200 MB
  • Enable users to resume a file upload after a network error on their side
  • Scan uploaded files for malware

How Wise Engineering helped

Direct upload of files to Amazon S3 was designed and successfully rolled out in the existing service (the marketplace) in half a month, compared with the two months the previous upload system had taken. It also enhanced the security of downloaded files.

Results

  • Users can upload files up to 500 MB, or up to 1 GB, depending on their user group
  • A new item type (video files) became available for upload
  • Using S3 reduced the time and resources required for server maintenance compared with the hardware servers used before
  • Enhanced security of downloaded files

Background

The existing service allowed users to upload files and attach them to products. However, it enforced a hard limit on the size of files selected for upload, so it was not possible to upload a file larger than 200 MB. It also worked in such a way that when a file upload failed (for example, due to a network error), the user had to start the entire process over again. A major problem was that many users have files larger than 200 MB, so there were a lot of requests to support such files. On top of that, the NGINX module used for file upload was very old and no longer supported.

Implementation process

Old upload schema

Previously, when a user uploaded a file, we relied on an old NGINX module. The module simply read the entire content of the file and stored it on our server in a temporary directory under a temporary file name. The following problems emerged:

  1. This mechanism read the whole file in a single operation.
    This approach does not work well for large files and put a significant load on our servers.
  2. The supposedly unique temporary file name was not, in fact, unique.
    Sometimes, when there were a lot of users on the website, two or more users uploaded files that ended up with exactly the same temporary file name, causing conflicts.
  3. Our upload progress bar covered the upload step only.
    We could not show progress for the virus check, so users just saw a stalled progress bar while we scanned the file. We also generated preview files (reduced versions of the user’s original files) at this stage, so users could not monitor how far the overall process had progressed.
  4. After upload, we stored all files on our own data servers.
    Maintaining them required extra attention and resources from us. It also made our codebase more complicated: we had to keep an extra field in the item properties table (we called it the mask of servers) that recorded which data servers stored the files for a given item. This added complexity whenever we needed to select all items whose files were stored only on specific data servers.
  5. This upload process was not really secure:
    • If someone hacked our system, they could intercept and replace files
    • We checked files for viruses with a cron job after the upload was completed. In the window between upload and check, an infected file could corrupt our system and harm other users who later bought the product and downloaded the infected file.

Solution

New upload schema

We decided to upload files directly to S3 and no longer store them on our servers. We also chose to use a third-party JavaScript library to upload files directly to S3 instead of writing our own, and selected https://github.com/TTLabs/EvaporateJS. This library already implements, and has tested across all major web browsers, the features we needed:

  • Secure direct S3 file upload
  • Resuming a failed file upload

Now all files are uploaded to S3 into a temporary bucket for each user’s item, using unique paths and file names. We split files into smaller parts before upload and upload the parts to S3 in parallel; each part is signed by our server so it cannot be intercepted and replaced by someone else. A minimal client-side sketch is shown below.
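
The sketch below shows roughly how the browser side could be wired up with EvaporateJS (v2 API). The access key, bucket name, region, signing endpoint, part size and key layout are illustrative placeholders rather than our production configuration; depending on the signature version, hash helpers such as computeContentMd5, cryptoMd5Method and cryptoHexEncodedHash256 may also need to be supplied.

```typescript
import Evaporate from "evaporate";

// Placeholder configuration: bucket, region, key id and signer endpoint are illustrative.
const uploaderReady = Evaporate.create({
  aws_key: "AKIA...PLACEHOLDER",        // public access key id only, never the secret
  bucket: "marketplace-upload-tmp",     // temporary bucket for in-progress uploads
  awsRegion: "eu-west-1",
  signerUrl: "/api/s3/sign",            // our endpoint that signs each part (sketched later)
  partSize: 6 * 1024 * 1024,            // split files into ~6 MB parts
  maxConcurrentParts: 5,                // upload several parts in parallel
});

export async function uploadToTempBucket(userId: number, itemId: number, file: File) {
  const evaporate = await uploaderReady;
  // Unique path per user and item, so temporary files can never collide.
  const key = `tmp/${userId}/${itemId}/${Date.now()}-${file.name}`;
  return evaporate.add({
    name: key,
    file,
    progress: (ratio: number) => console.log(`uploaded ${Math.round(ratio * 100)}%`),
    error: (msg: string) => console.error("upload failed:", msg),
  });
}
```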

Now when a user selects a file to upload to S3, the following occurs:

  1. EvaporateJS uploads the file directly into a temporary bucket
    During the upload, it calls our servers only to sign each part of the file. This is done to prevent somebody from intercepting a part of the uploaded file and replacing it (see the signing sketch after this list).
  2. After the files land in the temporary bucket, we process them in the following order:
    1. Check for viruses (see the scanning sketch after this list):
      If a virus is detected, we stop processing the file, move it into a quarantine bucket (which is cleared automatically after 4 days) and notify the user about this event.
    2. Generate preview files
    3. Collect information about the file (size, checksum, file type, etc.)
      During this processing, the user sees a progress bar rather than the frozen 100% upload of the old system.
  3. The user fills in the required text fields and clicks the submit button.
  4. The process pushes a job for the item into a queue:
    When the user wants to create several items, we organize all of them into a queue. Once the first item is submitted for creation/update, the user can select the next one without waiting for the first item’s create/update process to finish. Once an item has been created or updated, we display a message to inform the user.
  5. In the background, a worker process takes jobs from the queue and performs them in parallel threads:
    1. Save the item details into the database
    2. Move the uploaded files from the temporary bucket into the regular one (see the copy sketch after this list).
      In the regular S3 bucket, we use the MD5 of the file content as the file name. This lets us know when a file has changed and store every version on S3 along with the file’s upload history. As a result, users get a new feature: the ability to download and edit older versions of a file.
  6. If the background saving process fails (network issue, Amazon issue, etc.), the queue always retries failed tasks a couple of times.
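
For step 1, the following is a minimal sketch of a part-signing endpoint, assuming EvaporateJS’s signerUrl flow (the client sends the string to sign and the request datetime as query parameters) and AWS Signature V4. The Express route, region and environment variable name are illustrative, and real code must also authenticate the user and validate the key being uploaded.

```typescript
import crypto from "node:crypto";
import express from "express";

const app = express();

const hmac = (key: crypto.BinaryLike, data: string) =>
  crypto.createHmac("sha256", key).update(data).digest();

// Signs each upload part so that S3 only accepts parts authorised by our server.
app.get("/api/s3/sign", (req, res) => {
  // TODO: check the user's session and that the key being uploaded belongs to them.
  const toSign = String(req.query.to_sign);
  const datetime = String(req.query.datetime); // e.g. "20240101T000000Z"
  const dateStamp = datetime.slice(0, 8);

  // Standard AWS Signature V4 signing-key derivation.
  const kDate = hmac("AWS4" + process.env.AWS_SECRET_ACCESS_KEY, dateStamp);
  const kRegion = hmac(kDate, "eu-west-1");
  const kService = hmac(kRegion, "s3");
  const kSigning = hmac(kService, "aws4_request");

  res.send(crypto.createHmac("sha256", kSigning).update(toSign).digest("hex"));
});

app.listen(3000);
```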
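
For step 2.1, here is a rough sketch of the virus check, assuming the uploaded object has first been streamed from the temporary bucket to a local file and that a scanner such as ClamAV’s clamscan CLI is available (the scanner we used is not named above). An infected file would then be copied to the quarantine bucket using the same kind of S3 copy shown in the next sketch.

```typescript
import { execFile } from "node:child_process";

// Runs the scanner against a local copy of the uploaded file.
// clamscan exits with 0 when the file is clean, 1 when a virus is found, 2 on errors.
export function isFileClean(localPath: string): Promise<boolean> {
  return new Promise((resolve, reject) => {
    execFile("clamscan", ["--no-summary", localPath], (error) => {
      if (!error) return resolve(true);            // clean: continue processing
      if (error.code === 1) return resolve(false); // infected: move to the quarantine bucket
      reject(error);                               // scanner failure: retry or alert
    });
  });
}
```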
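
For steps 5.2 and 6, the sketch below shows how a worker could move a file from the temporary bucket into the regular one using the AWS SDK for JavaScript v3, with the content MD5 collected in step 2.3 used as the permanent object key. The bucket names, region and retry count are illustrative.

```typescript
import { S3Client, CopyObjectCommand, DeleteObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "eu-west-1" });

// Moves an uploaded file from the temporary bucket to the regular one,
// using the MD5 of the file content as the permanent object key.
export async function moveToRegularBucket(tempKey: string, contentMd5: string): Promise<string> {
  const maxAttempts = 3; // the queue retries failed jobs a couple of times
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await s3.send(new CopyObjectCommand({
        CopySource: `marketplace-upload-tmp/${encodeURIComponent(tempKey)}`, // placeholder buckets
        Bucket: "marketplace-files",
        Key: contentMd5,
      }));
      await s3.send(new DeleteObjectCommand({ Bucket: "marketplace-upload-tmp", Key: tempKey }));
      return contentMd5;
    } catch (err) {
      if (attempt === maxAttempts) throw err; // give up and let the queue reschedule the job
    }
  }
  throw new Error("unreachable");
}
```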

Results

The new upload process was designed and successfully rolled out. Users can now upload files up to 500 MB, and premium users gained a new option to upload files up to 1 GB. We also decided to allow video files to be uploaded as a new item type. The new upload process was delivered to the existing service (the marketplace) in half a month, compared with the two months the previous upload system had taken. S3 also takes less time and effort to maintain than the hardware servers used before. Moreover, it reinforced the security of downloaded files.