Submitting Metadata
Once you have installed the smaht-submitr package (per the Installation section) and set up your access keys (per the Credentials section), you should be ready to use the submit-metadata-bundle command.
What follows are detailed instructions for how to format your metadata submission files, and how to validate and submit them, and upload your files, to SMaHT Portal using this command.
Formatting Metadata Files
The recommended (and most common) file format for metadata submission to SMaHT Portal is an Excel spreadsheet file (e.g. your_metadata_file.xlsx), comprising one or more sheets.
Note these important aspects of using the Excel spreadsheet format:
The spreadsheet must have a file suffix of .xls or .xlsx; there are no other requirements for the name of this file.
Each sheet name must be the exact name of a SMaHT Portal item or object defined within the system (e.g. AlignedReads).
Each sheet must have as its first row a special header row, which enumerates, one per column, the exact names of the Portal object properties as the column names; order does not matter.
Each sheet may contain any number of data rows (directly below the header row), each representing an instance of the Portal object.
The values in the cells/columns of each data row correspond to property names in the same column of the (first) header row.
Note these important rules defining exactly which parts of the spreadsheet are relevant for metadata submission:
The first row which is entirely empty marks the end of the data, and any subsequent rows will be entirely ignored; this means you can include comments in your spreadsheet in rows after (below) the first blank row indicating the end of data input.
The first column in the header row which is empty marks the end of the header, and any subsequent columns will be entirely ignored.
Sheets which are marked as hidden will be ignored; this provides a way of including sheets with other auxiliary information without their contents interfering with the submission tool.
Sheets which have a name enclosed in parentheses, for example (My Comments), will similarly be ignored, just like hidden sheets as described above.
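The rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not the actual smaht-submitr implementation; the function names are hypothetical, a sheet is modeled as a plain list of rows, and it is an assumption that a row counts as "entirely empty" only when every one of its cells is empty.

```python
from typing import List, Optional, Tuple


def is_empty(cell: Optional[str]) -> bool:
    """A cell is considered empty if it is None or only whitespace."""
    return cell is None or str(cell).strip() == ""


def is_ignored_sheet(name: str, hidden: bool = False) -> bool:
    """Hidden sheets and sheets named like "(My Comments)" are ignored."""
    name = name.strip()
    return hidden or (name.startswith("(") and name.endswith(")"))


def relevant_cells(rows: List[List[Optional[str]]]) -> Tuple[List[str], List[list]]:
    """Return (header, data_rows) restricted to the relevant part of a sheet."""
    if not rows:
        return [], []
    # The first empty column in the header row marks the end of the header.
    header: List[str] = []
    for cell in rows[0]:
        if is_empty(cell):
            break
        header.append(str(cell).strip())
    # The first entirely empty row marks the end of the data.
    data: List[list] = []
    for row in rows[1:]:
        if all(is_empty(cell) for cell in row):
            break
        # Keep only the cells under the header, padding short rows with None.
        kept = list(row[:len(header)])
        kept += [None] * (len(header) - len(kept))
        data.append(kept)
    return header, data
```

Anything after the first empty header column or the first empty row (comments, notes, scratch data) simply never reaches the returned header or data rows.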
Despite the rather dense chunk of text here, it is actually pretty intuitive, straightforward, and almost self-explanatory. Here is a screenshot of a simple example Excel spreadsheet:
Notice that the first row comprises the property/column header, defining properties named submitted_id, submission_centers, filename, and so on. (N.B. Though submission_centers is shown in the above screenshot, that particular field is not actually required to be specified, as it’s automatically added by the smaht-submitr tool if needed.)
Notice the multiple tabs at the bottom for the different sheets within the spreadsheet, representing (in this example) data for the Portal objects CellCultureSample, Analyte, Library, and so on.
Note
For an actual example, as well as a template, please see the Metadata section below.
Tip
Other file formats besides Excel actually are supported; see the Advanced Usage section for more information.
SMaHT object properties have different types. Many of the types are simply text (or strings). Other types are described below.
Object Reference Properties
Some Portal object properties are defined as being references to other Portal objects (also known as linkTo properties). The values of these in the spreadsheet should be the unique identifying value for that object.
It is important to know that the smaht-submitr
tool and SMaHT will ensure that the referenced
objects actually exist within the SMaHT Portal, or are defined within the spreadsheet itself;
if this is not the case then an error will result.
Tip
For the database savvy, such references can be thought of as being analogous to foreign keys.
The identifying property for an object varies depending on the specific object type, though the uuid property is common to all objects; other common identifying properties are submitted_id and accession. The identifying properties for each object type (and other relevant info) can be found in the Object Model section.
Date/Time Properties
For Portal object properties which are defined as date values, the required format is YYYY-MM-DD, for example 2024-02-09.
For Portal object properties which are defined as date-time values, the required format is YYYY-MM-DD hh:mm:ss, for example 2024-02-09 08:25:10. This will default to your local timezone; if you want to specify a timezone, use a suffix like +hh:mm where hh and mm are the hour and minute offsets (respectively) from GMT.
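If you generate your spreadsheet programmatically, the formats above map directly onto Python's standard datetime formatting. The following is a minimal sketch; the helper names are hypothetical, and the use of a minus sign for offsets west of GMT is an assumption (the text above only shows the +hh:mm form).

```python
from datetime import date, datetime, timedelta, timezone


def format_date(value: date) -> str:
    """Format a date as YYYY-MM-DD."""
    return value.strftime("%Y-%m-%d")


def format_datetime(value: datetime) -> str:
    """Format a date-time as YYYY-MM-DD hh:mm:ss (no timezone suffix)."""
    return value.strftime("%Y-%m-%d %H:%M:%S")


def format_utc_offset(value: datetime) -> str:
    """Return the offset from GMT as +hh:mm (or -hh:mm), or "" if naive."""
    offset = value.utcoffset()
    if offset is None:
        return ""
    total_minutes = int(offset.total_seconds() // 60)
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{sign}{hours:02d}:{minutes:02d}"


if __name__ == "__main__":
    stamp = datetime(2024, 2, 9, 8, 25, 10, tzinfo=timezone(timedelta(hours=2)))
    print(format_date(stamp.date()))
    print(format_datetime(stamp))
    print(format_utc_offset(stamp))
```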
Boolean Properties
For Portal object properties which are defined as boolean values, meaning either true or false, simply use these values, i.e. true or false (case-insensitive).
Array Properties
Some Portal object properties are defined to be lists (or arrays) of values.
To define the values for such array properties, separate the individual array values by a pipe character (|).
For example, if an object defines a molecules property as an array type, then to set this value to an array with the two elements DNA and RNA, use the value DNA|RNA in the associated spreadsheet cell.
Less common, but still supported, is the ability to set values for individual array elements.
This is accomplished by the convention of suffixing the property name in the column header with a pound sign (#) followed by an integer representing the zero-indexed array element.
For example, to set the first element of the molecules property (using the example above), use the column header value molecules#0.
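To make the two array conventions concrete, here is a small Python sketch of how such cells and column headers could be interpreted. It illustrates the conventions only and is not the smaht-submitr parser; the function names are hypothetical, and returning an empty list for an empty cell is an assumption.

```python
from typing import List, Optional, Tuple


def split_array_value(cell: str) -> List[str]:
    """Split a pipe-separated cell value into its array elements."""
    if cell == "":
        return []
    return cell.split("|")


def parse_column_header(header: str) -> Tuple[str, Optional[int]]:
    """Split "name#index" into (name, index); index is None if absent."""
    name, separator, index = header.partition("#")
    if separator and index.isdigit():
        return name, int(index)
    return header, None


if __name__ == "__main__":
    print(split_array_value("DNA|RNA"))
    print(parse_column_header("molecules#0"))
```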
Nested Properties
Some Portal object properties are defined to contain other nested objects. Since an Excel spreadsheet inherently defines a “flat” structure, rather than the more hierarchical structure supported by Portal objects (which are actually JSON objects), a special syntactic convention is needed to reference the properties of these nested objects.
For this we use a dot-notation whereby dots (.) are used to separate a parent property from its child property.
For example, if an object (e.g. ReferenceFile) defines an extra_files property which itself refers to an object containing a file_format property, then to reference that nested file_format property, the spreadsheet column header would need to be extra_files.file_format.
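The relationship between a flat spreadsheet row and the nested JSON object it describes can be sketched as follows. This is an illustrative Python sketch of the dot-notation only, under the simplifying assumption that every dotted segment names a nested object; the function name and the example values are hypothetical.

```python
from typing import Any, Dict


def nest_row(flat_row: Dict[str, Any]) -> Dict[str, Any]:
    """Expand dotted column names into nested dictionaries."""
    nested: Dict[str, Any] = {}
    for column, value in flat_row.items():
        *parents, leaf = column.split(".")
        target = nested
        for parent in parents:
            # Create the intermediate object on first use, then descend into it.
            target = target.setdefault(parent, {})
        target[leaf] = value
    return nested


if __name__ == "__main__":
    row = {"submitted_id": "EXAMPLE_ID", "extra_files.file_format": "bam"}
    print(nest_row(row))
```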
Implicit Properties
Some Portal objects require (or support) the specific submission_centers property.
If you do not specify this, though, smaht-submitr will automatically supply this particular property; it will implicitly be set to the submission center to which you belong.
Property Deletions
A column value within a (non-header) data row may be empty, but this only means that the value for the corresponding property will be ignored when creating or updating the associated object. In order to actually delete a property value from an object, the special value *delete* should be used as the property value.
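The difference between an empty cell and a deletion can be sketched like this. It is a hypothetical Python illustration of the rule described here, not the tool's own code; the function name and the example property names are assumptions.

```python
from typing import Dict, List, Optional, Tuple

DELETE_MARKER = "*delete*"  # the special value described above


def plan_row_changes(row: Dict[str, Optional[str]]) -> Tuple[Dict[str, str], List[str]]:
    """Split a data row into properties to set and properties to delete."""
    to_set: Dict[str, str] = {}
    to_delete: List[str] = []
    for name, value in row.items():
        if value is None or value == "":
            # Empty cell: the property is ignored and left untouched.
            continue
        if value == DELETE_MARKER:
            to_delete.append(name)
        else:
            to_set[name] = value
    return to_set, to_delete
```

An empty cell therefore contributes to neither result, while a cell containing the marker is the only thing that requests a deletion.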
Metadata
A thorough discussion of the metadata semantics is beyond the scope of this document, but there is a reference guide to the metadata objects supported by SMaHT Portal, provided at the link below. You can quickly view important aspects of each of the object types, such as the required and reference properties for each type, as well as each property type, and more.
Tip
More savvy command-line oriented users may find the view-portal-object command useful. This is described in the Advanced Usage section.
There is also a metadata submission template which you may find useful, from which to start your spreadsheet, as well as an example spreadsheet:
Submission
The type of submission supported is called a “metadata bundle”, or accessioning.
The name of the command-line tool to initiate a submission is submit-metadata-bundle.
A brief tour of this command, its arguments, and its function is given below.
To get help about the command, do:
submit-metadata-bundle --help
To submit your metadata, run submit-metadata-bundle with your metadata file, the --env option with the SMaHT environment name (e.g. data) from your keys file (as described in the Credentials section) as its argument, and the --submit option.
For example:
submit-metadata-bundle your_metadata_file.xlsx --env data --submit
This will first validate your metadata, and if no errors are encountered, it will do the actual metadata submission; you will be prompted for confirmation before the submission is started. If errors are encountered, the submission will not commence; you will not be able to submit until you fix the errors.
Tip
You can omit the --env option entirely if your keys file has only a single entry, or if you have your SMAHT_ENV environment variable set up (see the Credentials section).
Note
If you opted to use a file other than ~/.smaht-keys.json to store your credentials, you will need to use the --keys option with the path name of your alternate file as an argument, or have your SMAHT_KEYS environment variable set up (see the Credentials section).
This command should do everything, including uploading any referenced files, prompting first for confirmation; see the Uploading Files section for more on this.
If you belong to multiple consortia and/or submission centers, you can also add the --consortium <consortium> and --submission-center <submission-center> options; if you belong to only one, the command will automatically detect (based on your user profile) and use those.
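If you drive submissions from a script, the options described so far can be assembled into an argument list as in the Python sketch below. The helper name build_submit_command is hypothetical; only the command name and the flags shown in this section (--env, --keys, --consortium, --submission-center, --submit) are used, and omitted options are simply left off.

```python
import shlex
from typing import List, Optional


def build_submit_command(
    metadata_file: str,
    env: Optional[str] = None,
    keys: Optional[str] = None,
    consortium: Optional[str] = None,
    submission_center: Optional[str] = None,
) -> List[str]:
    """Assemble a submit-metadata-bundle command line as a list of arguments."""
    command = ["submit-metadata-bundle", metadata_file]
    if env:
        command += ["--env", env]
    if keys:
        command += ["--keys", keys]
    if consortium:
        command += ["--consortium", consortium]
    if submission_center:
        command += ["--submission-center", submission_center]
    command.append("--submit")
    return command


if __name__ == "__main__":
    # Print the command rather than running it; review it before executing.
    print(shlex.join(build_submit_command("your_metadata_file.xlsx", env="data")))
```

Because the command prompts for confirmation before submitting, it is best run from an interactive terminal rather than from an unattended job.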
Tip
You may wonder: Is it okay to submit the same metadata file more than once? The answer is: Yes. And, if you have made any changes to the file, updates will be applied as expected.
Validation
As mentioned in the previous section, using the --submit option will validate your metadata before submitting it (and you are prompted for confirmation before the submission starts).
But if you want to only run validation, without the possibility of submitting the metadata to SMaHT Portal, then invoke submit-metadata-bundle with the --validate option, like this:
submit-metadata-bundle your_metadata_file.xlsx --env <environment-name> --validate
Tip
This feature basically constitutes a sort of “dry run” facility.
To be more specific about the validation checks, they include the following:
Ensures the basic integrity of the format of the metadata submission file.
Validates that objects defined within the metadata submission file conform to the corresponding Portal schemas for these objects.
Confirms that any objects referenced within the submission file can be resolved; i.e. either they already exist within the Portal, or are defined within the metadata submission file itself.
Verifies that referenced files (to be subsequently uploaded) actually exist on the file system.
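The last check in the list above is easy to reproduce yourself before running the tool. The following Python sketch lists referenced file names that are not present on disk; the function name and the directory parameter are hypothetical, and it is an assumption that the files are looked up relative to a single directory (see the Uploading Files section for where the tool actually looks).

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_files(filenames: Iterable[str], directory: str = ".") -> List[str]:
    """Return the file names that do not exist as regular files in directory."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]
```

An empty result means every referenced file was found in the given directory.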
Note
If you get validation errors, fix them, and then try again, it is possible that you will get new, additional errors. That is, it is not necessarily the case that all validation errors will be reported all at once. This is because there are two kinds (or phases) of validation: local client-side and remote server-side. You can learn more about the details of this validation process in the Advanced Usage section.
If you’re getting a ton of validation errors dumped to your terminal screen, you may want to use the --output FILE option, which will cause all output to be saved to the specified file; this will also refrain from writing lengthy content to the terminal.
Getting Submission Info
To view relevant information about a submission use the check-submission
command like this:
check-submission --env <environment-name> <uuid>
where the <uuid> argument is the UUID for the submission, which should have been displayed in the output of the submit-metadata-bundle command (e.g. see screenshot).
Listing Recent Submissions
To view a list of recent submissions (with submission UUID and submission date/time), in order of most recent first, use the list-submissions command like this:
list-submissions --env <environment-name>
Use the --verbose option to list more information for each of the recent submissions shown.
You can control the maximum number of results output using the --count option with an integer count argument.
Use the --mine option to see only your submissions, and use the --user EMAIL option to see only submissions from the named user (by email).
Screenshots
Here is a visual of a spreadsheet snippet featuring reference properties:
Here is a visual of a spreadsheet snippet featuring date/time and array properties:
The output of a successful submit-metadata-bundle --submit will look something like this:
Notice the Submission tracking ID value in the output, as well as the Upload File ID values; these may be used in a subsequent resume-uploads invocation; see the Uploading Files section for more on this.
When instead specifying the --validate option, the output will look something like this:
And if you additionally specify the --verbose option, the output will look something like this: