Uploading Files

As mentioned previously (in the Usage section), after submit-metadata-bundle processes the main submission metadata file, it will (after prompting) upload to AWS S3 any files referenced within that metadata file.

These files should reside on your local file system in the same directory as your submission file. If they do not, you must specify the directory where these files can be found, like this:

submit-metadata-bundle your_metadata_file.xlsx --env <environment-name> --directory <path-to-files>

The above command will only look for the files to upload directly within the specified directory (and not within any sub-directories therein). To look recursively within sub-directories, do:

submit-metadata-bundle your_metadata_file.xlsx --env <environment-name> --directory <path-to-files> --sub-directories

Resuming Uploads

When using submit-metadata-bundle, you can choose not to upload any referenced files when prompted. In that case, you will probably want to upload them manually later; or you may want to update a previously uploaded file. You can do either using the resume-uploads command.

You can resume the upload part of the submission by doing:

resume-uploads --env <environment-name> <uuid>

where the <uuid> argument is the UUID (e.g. 0ad28518-2755-40b5-af51-036042dd099d) for the submission, which should have been displayed in the output of the submit-metadata-bundle command (e.g. see screenshot); this will upload all of the files referenced for the given submission UUID.
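
For example, using the example submission UUID above, and assuming data is the environment name defined in your ~/.smaht-keys.json file:

resume-uploads --env data 0ad28518-2755-40b5-af51-036042dd099d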

Or, you can upload individual files referenced in the original submission separately by doing:

resume-uploads --env <environment-name> <referenced-file-uuid> --uuid <item-uuid>

where the <referenced-file-uuid> argument is the UUID of the individual referenced file (e.g. b5a7999e-d614-4deb-b98d-b784925ab910), or its accession ID (e.g. SMAURL8WB1ZS), or its accession ID based file name (e.g. SMAURL8WB1ZS.fastq). This UUID, accession ID, and accession ID based file name are all included in the output of submit-metadata-bundle, specifically in the Upload Info section of that output (e.g. see screenshot).
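
For example, using the accession ID based file name above (again assuming the data environment name; the --uuid value is the relevant item UUID taken from the Upload Info output):

resume-uploads --env data SMAURL8WB1ZS.fastq --uuid <item-uuid>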

For both of the commands above, you will be asked to confirm whether you would like to continue with the stated action. If you would like to skip these prompts so that the commands can be run by a scheduler or in the background, you can pass the --no_query (or -nq) argument.

As with submit-metadata-bundle, use the --directory argument to explicitly specify the directory in which to look for the files to upload (and --sub-directories if you want that directory searched recursively).
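
For example, a resume-uploads invocation that looks for the upload files under a hypothetical /path-to-files directory, searches its sub-directories, and skips the confirmation prompts might look like:

resume-uploads --env data 0ad28518-2755-40b5-af51-036042dd099d --directory /path-to-files --sub-directories -nq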

Other Upload Considerations

Since smaht-submitr will only upload files found on the local computer running the package, if your files are not stored locally and are instead in Cloud storage or on a local cluster, you need to consider other options for uploading such files.

Uploading Files Locally

This default option works well for uploading a small number of files or files of small size. Files can be transferred to your local computer from Cloud storage or a computing cluster in several ways.

Alternatively, the files can be downloaded directly from a remote location; for example, files on AWS S3 can be downloaded using the AWS command-line tool (awscli).
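
For example, assuming awscli is installed and configured, a single file could be copied from a (hypothetical) S3 bucket to the current directory with:

aws s3 cp s3://your-s3-bucket-name/your-file.fastq .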

However, note that the methods above require enough free disk space on your local computer to store the files to upload. As such files can be rather large, we recommend performing the upload from a Cloud or cluster instance when uploading many files or larger files.

Uploading from Google Cloud Storage

If your data files are stored in Google Cloud Storage (GCS), we support the ability to upload (or, more precisely, transfer) files directly from GCS to AWS S3. The smaht-submitr command-line tools (submit-metadata-bundle and resume-uploads) accomplish this by leveraging third-party software called rclone.

The advantage of this is that you needn't download the entire data file to your local machine, which may well not have enough disk space. The rclone facility transfers the data file from GCS to AWS S3 by way of your machine, i.e. using it as an intermediary, so that only a small portion of the data actually travels through your machine at any one time.

And there is no need to worry about the details of rclone - its installation, usage, and so on - the smaht-submitr tools automatically install it and hide the details of its workings from you. To take advantage of this you merely need to specify a couple of command-line options, specifically --rclone-google-source and --rclone-google-credentials, for example:

submit-metadata-bundle your-metadata.xlsx --submit \
    --rclone-google-source your-gcs-bucket \
    --rclone-google-credentials your-gcp-service-account-file
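
The same options should also work with resume-uploads; this is a sketch (check resume-uploads --help for the exact usage), using a hypothetical submission UUID:

resume-uploads your-submission-uuid --env <environment-name> \
    --rclone-google-source your-gcs-bucket \
    --rclone-google-credentials your-gcp-service-account-file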

Mounting AWS S3 Files

If your files are stored on AWS S3, tools such as s3fs or goofys facilitate mounting of S3 buckets as local file systems that can be readily accessed by smaht-submitr. Similar tools exist for Google Cloud Storage and Microsoft Azure.
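
For example, assuming goofys is installed and your AWS credentials are configured (e.g. in ~/.aws/credentials), a bucket might be mounted like this:

mkdir /path-to-your-mount-directory
goofys your-s3-bucket-name /path-to-your-mount-directory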

Caution

If you are working on a Mac M1 or M2 system (i.e. using the ARM-based chip), you may encounter problems using these kinds of mounting tools. More guidance about this will (hopefully) be forthcoming.

Running Submission Remotely

File submission can be scripted to accommodate running on a remote server other than your own. Once an instance has been launched with appropriate storage requirements for the files to upload, the files can either be mounted or downloaded as before, smaht-submitr can be installed, and the remainder of the upload process can continue as on your local computer.

Note that your smaht-submitr keys (residing by default in ~/.smaht-keys.json) will also have to be copied to this server for successful file upload.
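
For example, the keys file could be copied to a (hypothetical) remote server using scp:

scp ~/.smaht-keys.json ec2-user@your-remote-server:~/.smaht-keys.json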

For example, if using an AWS EC2 instance running Amazon Linux 2 with files in AWS S3 and an appropriate IAM role and associated access/secret keys, executing the below will mount the indicated bucket(s) and upload the appropriate files to the DAC if found within the buckets:

# Install s3fs for mounting S3 buckets locally.
sudo amazon-linux-extras install epel -y
sudo yum install s3fs-fuse -y

# Setup your AWS credentials.
echo 'your-aws-access-key-id:your-aws-secret-access-key' > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs

# Setup your SMaHT credentials.
echo '{"data": {"key": "your-smaht-access-key-id", "secret": "your-smaht-secret-key", "server": "https://data.smaht.org"}}' > ~/.smaht-keys.json
chmod 600 ~/.smaht-keys.json

# Mount buckets on your local /path-to-your-mount-directory directory.
mkdir /path-to-your-mount-directory
s3fs your-s3-bucket-name /path-to-your-mount-directory -o passwd_file=~/.passwd-s3fs

# Run smaht-submitr with mounted files (assuming you have python and pip installed).
pip install smaht-submitr
resume-uploads your-upload-file-uuid --directory /path-to-your-mount-directory --sub-directories -nq

For further support or questions regarding file submission, please contact the SMaHT DAC Team at smhelp@hms-dbmi.atlassian.net.