Direct S3 File Upload

Problem

Users could not upload all of their products to the marketplace, because many of those products were files larger than 200 MB or were video files. Neither was supported: there was a size limit for digital files, and video files were not supported at all.

Goals and challenges

  • Allow users to upload files larger than 200 MB
  • Enable users to resume a file upload after a network error on the user's side
  • Scan uploaded files for malware

How Wise Engineering helped

Direct upload of files to Amazon S3 was designed and successfully rolled out in the existing service (the marketplace) in half a month, compared to the two months the previous upload system had taken to build. It also enhanced the security of downloaded files.

Results

  • Users can upload files up to 500 MB, or up to 1 GB depending on their user group
  • A new item type (video files) became available for upload
  • Using S3 reduced the time and resources spent on server maintenance compared to the hardware servers used before
  • Enhanced security of downloaded files

Background

The existing service allowed users to upload files and attach them to products, but it imposed a hard limit on the size of the selected files: it was not possible to upload a file larger than 200 MB. In addition, when an upload failed (due to a network error), the user was forced to start the entire process all over again. This was a major problem because many users have files larger than 200 MB, so we received a lot of requests to support such uploads. On top of that, the NGINX module we used for file upload was very old and no longer maintained.

Implementation process

Old upload schema

Previously, when a user uploaded a file, we relied on an old NGINX module. It simply read the entire content of the file and stored it on our server in a temporary directory under a temporary file name. The following problems emerged from this approach:

  1. The mechanism read the whole file in a single operation.
    This does not work well for big files and caused significant load on our servers.
  2. The "unique" temporary file name was not in fact unique.
    When the site had many users, we sometimes got collisions where two or more users' uploads ended up with the exact same temporary file name.
  3. Our progress bar covered the upload itself only.
    We could not show progress for the virus check, so users saw a stuck progress bar while we scanned the file. We also generated preview files (reduced versions of the user's original files), and the user could not monitor the progress of that step either.
  4. After upload, we stored all files on our own data servers.
    This demanded extra attention and resources to maintain them. It also complicated our codebase: the item_properties table needed an extra field, which we called the "mask of servers", recording where the file for a given item was stored. That made queries such as selecting all items whose files are stored on specific data servers more complex.
  5. The upload process was not really secure:
    • if someone hacked our system, they could intercept and replace a file
    • we checked files for viruses with a cron job after the upload had completed; in the window between upload and check, an infected file could corrupt our system and harm users who later bought the product and downloaded the infected file

Solution

New upload schema

We decided to upload files directly to S3 and stop storing them on our own servers. We also chose a third-party JavaScript library for direct S3 uploads instead of writing our own: https://github.com/TTLabs/EvaporateJS. This library already implements, and has tested across all major web browsers:

  • Secure direct S3 file upload
  • Resuming a failed file upload

Now all files are uploaded to a temporary S3 bucket, under a unique path and file name for each user's item. We split each file into smaller parts before upload and upload the parts to S3 in parallel; each part is signed by our server, so it cannot be intercepted and replaced by someone else. A minimal client-side sketch of this setup is shown below.
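For illustration, here is a hypothetical EvaporateJS configuration along these lines. The bucket name, signer URL, object key and MD5 helper are placeholders rather than our production values, and the exact option set depends on the EvaporateJS version and the AWS signature version in use:

```typescript
import Evaporate from 'evaporate';
import SparkMD5 from 'spark-md5'; // hypothetical choice of MD5 helper

// All names below (bucket, signer URL, object key) are placeholders.
Evaporate.create({
  signerUrl: '/api/s3/sign',          // our server signs each uploaded part here
  aws_key: 'AKIA...',                 // public access key id only; the secret stays server-side
  bucket: 'marketplace-tmp-uploads',  // temporary bucket for in-progress uploads
  awsRegion: 'us-east-1',
  partSize: 6 * 1024 * 1024,          // split the file into ~6 MB parts
  maxConcurrentParts: 5,              // upload parts in parallel
  computeContentMd5: true,
  cryptoMd5Method: (data: ArrayBuffer) => btoa(SparkMD5.ArrayBuffer.hash(data, true)),
}).then((evaporate) => {
  const input = document.querySelector<HTMLInputElement>('#file-input')!;
  const file = input.files![0];

  evaporate.add({
    // unique per-user, per-item path inside the temporary bucket
    name: `user-123/item-456/${file.name}`,
    file,
    progress: (ratio: number) => console.log(`uploaded ${(ratio * 100).toFixed(1)}%`),
  }).then(
    (awsObjectKey: string) => console.log('upload finished:', awsObjectKey),
    (reason: string) => console.error('upload failed:', reason),
  );
});
```

The important property is that only the public access key id ever reaches the browser; the secret key stays on the server that answers the signerUrl requests.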

Now, when a user selects a file to upload to S3, the process is as follows:

  1. EvaporateJS uploads the file directly into the temporary bucket.
    During the upload it contacts our servers only to sign each part of the file. This prevents somebody from intercepting a part of the upload and replacing it (see the signing sketch after this list).
  2. After the file lands in the temporary bucket, we process it in the following order:
    1. Check it for viruses
      If a virus is detected, we stop processing the file, move it into a quarantine bucket (which is cleared automatically after 4 days) and notify the user.
    2. Generate preview files
    3. Collect information about the file (size, checksum, file type, etc.)
      During this processing the user sees a progress bar, instead of the frozen 100% upload of the old system.
  3. The user fills in the required text fields and clicks the submit button.
  4. The process pushes an item-creation job onto a queue.
    When a user wants to create several items, the queue lets them select the next item as soon as the first one has been submitted for creation/update, without waiting for the first create/update to finish. Once an item has been created or updated, we display a message to inform the user.
  5. In the background, a worker process takes jobs from the queue and performs the following in parallel threads:
    1. Save the item details into the database
    2. Move the uploaded files from the temporary bucket into the regular bucket (see the worker sketch after this list)
      In the regular S3 bucket we use the MD5 of the file content as the file name. This lets us know when a file has changed and keep the whole file on S3 along with its upload history, which enabled a new feature: users can download and edit older versions of a downloaded file.
  6. If the background saving process fails (a network issue, an Amazon issue, etc.), the queue automatically retries the failed task a couple of times.
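As a sketch of the signing step in item 1: with EvaporateJS and AWS Signature Version 2 (an assumption on our part; with Signature Version 4 the endpoint would return a hex-encoded HMAC-SHA256 computed with a derived key instead), the signerUrl endpoint receives the string to sign as a to_sign query parameter and returns the base64-encoded HMAC-SHA1 of that string computed with the AWS secret key. A hypothetical Express handler, not our production code:

```typescript
import express from 'express';
import { createHmac } from 'crypto';

const app = express();
const AWS_SECRET_KEY = process.env.AWS_SECRET_KEY!; // never exposed to the browser

// EvaporateJS calls this URL for every part it uploads; we only sign the
// string it sends, so the secret key never leaves our server.
app.get('/api/s3/sign', (req, res) => {
  const toSign = String(req.query.to_sign ?? '');

  // In a real handler we would also verify that the request comes from an
  // authenticated user and that the object key matches that user's
  // temporary upload path.

  const signature = createHmac('sha1', AWS_SECRET_KEY)
    .update(toSign)
    .digest('base64');

  res.send(signature);
});

app.listen(3000);
```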
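And a rough sketch of the "move into the regular bucket" job from step 5.2, using the AWS SDK for JavaScript. The bucket names are placeholders, and the in-memory MD5 computation is only illustrative; for files up to 1 GB the hash would be computed over a stream:

```typescript
import * as AWS from 'aws-sdk';
import { createHash } from 'crypto';

const s3 = new AWS.S3();
const TMP_BUCKET = 'marketplace-tmp-uploads'; // placeholder bucket names
const REGULAR_BUCKET = 'marketplace-files';

// Moves one uploaded object: name it after the MD5 of its content,
// copy it into the regular bucket, then remove the temporary copy.
async function moveToRegularBucket(tmpKey: string): Promise<string> {
  const obj = await s3.getObject({ Bucket: TMP_BUCKET, Key: tmpKey }).promise();
  const md5 = createHash('md5').update(obj.Body as Buffer).digest('hex');

  await s3.copyObject({
    Bucket: REGULAR_BUCKET,
    CopySource: `${TMP_BUCKET}/${encodeURIComponent(tmpKey)}`,
    Key: md5, // content-addressed name: same content => same key
  }).promise();

  await s3.deleteObject({ Bucket: TMP_BUCKET, Key: tmpKey }).promise();
  return md5;
}
```

Because the key is derived from the content, re-uploading an unchanged file maps to the same object, while any edit produces a new key and therefore a new entry in the file's history.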

Results

The new upload process was designed and successfully applied. We gave our users the ability to upload files up to 500 MB, and added a new feature for premium users: uploads up to 1 GB. We also decided to allow video files to be uploaded as a new kind of item. The new upload process was rolled out in the existing service (the marketplace) in half a month, compared to the two months the previous upload system had taken. S3 is also easier to maintain, taking less time and effort than the hardware servers used before. Moreover, it reinforced the security of downloaded files.