Good Citizenship: RNA-Seq

A step-by-step guide to submitting RNA-Seq reads and assemblies to the National Center for Biotechnology Information.

RNA-Seq has facilitated an explosion of transcriptome studies in non-model organisms and contributed to our understanding of diverse aspects of evolutionary biology. Given that many RNA-Seq studies rely on public funding, we regard the open sharing of raw data as an ethical imperative. In addition, most journals have data availability requirements that oblige researchers to publicly share their data.

The databases housed at the National Center for Biotechnology Information (NCBI) are a popular choice for data deposition, but the submission of RNA-Seq data to NCBI can be challenging. We have often felt the need for a detailed, all-in-one-place presentation of the steps required to submit RNA-Seq reads and assemblies to NCBI, so we decided to create one.

From the creation of an NCBI user account to the submission of an assembled transcriptome, this step-by-step guide will show you how it's done. We hope this contribution will allow scientists of any experience level to easily share the data they have worked so hard to generate, and make NCBI an even more valuable resource than it is today.

Submission Checklist

Download this checklist, then make sure you've got the following resources at hand.

Checklist: RNA-Seq Data Submission

Necessary Resources

Hardware

Unix, Windows, or Macintosh workstation with an Internet connection and web browser (Firefox, Chrome, Safari or Internet Explorer)

Software

Command-line user interface (CLI). Macs have a built-in CLI called Terminal, which is usually in Macintosh HD/Applications/Utilities/Terminal. Some users may prefer iTerm. Windows machines do not include a Unix-style CLI by default, but there are many tools available for download, including: ConEmu, Console2, cmder, Cygwin, and PuTTY.
Text editor. The default Mac text editor is TextEdit (Macintosh HD/Applications/TextEdit), but better functionality is found with BBEdit or SublimeText. The Windows default is Notepad (C:\Windows\System32\notepad.exe). An alternative for Windows is Emacs.
Aspera connect high-speed file transfer software: Download a command line version or a web-based plugin.
tbl2asn - only needed for TSA submission.

Data

Sequence read data in standard FASTQ format
Biological information about the sequenced organism
Technical information about how the sequencing library was prepared
A transcriptome assembly based on the raw reads
Technical information about how the transcriptome assembly was generated
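
Before starting a submission, it can be worth confirming that each read file really is in standard four-line FASTQ format. The following is a minimal sketch and not part of the original protocol: check_fastq is a hypothetical helper name, and the check assumes uncompressed input in which each sequence and each quality string sits on a single line.

```shell
# Hypothetical helper: basic structural check of FASTQ data.
# Reads the file given as an argument, or standard input if none is given.
# Prints a one-line summary and returns non-zero if a problem is found.
check_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            printf "line %d: header does not start with @\n", NR; bad = 1; exit
        }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf "line %d: separator does not start with +\n", NR; bad = 1; exit
        }
        NR % 4 == 2 { seqlen = length($0) }     # remember the sequence length
        NR % 4 == 0 && length($0) != seqlen {
            printf "line %d: quality length differs from sequence length\n", NR; bad = 1; exit
        }
        END {
            if (bad) exit 1
            if (NR == 0) { print "no reads found"; exit 1 }
            if (NR % 4 != 0) { printf "file ends mid-record (%d lines)\n", NR; exit 1 }
            printf "%d records look well formed\n", NR / 4
        }
    ' "$@"
}
```

For a compressed file, the data can be streamed in, for example: gzip -dc reads_R1.fastq.gz | check_fastq (the file name here is only a placeholder).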

Protocol 1: Sequence Read Archive (SRA) Submission

1. NCBI User Account

Create an NCBI user account. One account is used for all submissions by an individual.

2. Create a BioSample

Go to the BioSample submission page and sign in with your user account.

Click on the BioSample link and choose New submission.

Provide as much information as possible about the studied organism. It is difficult to edit this information after the process is complete, so carefully proofread all fields before submitting. The submission process progresses through seven fillable forms (presented as tabs at the top of the page). In order, these are:

Submitter: Provide information about the person submitting the data and the submitting organization (typically, this will be the submitter's organizational affiliation). An email address from the submitting organization's domain is required. If desired, a shared submission group can be created, allowing multiple authors to access and contribute to the submission.
General info: Assign a release date for the data, which can occur immediately upon submission or be delayed until publication (or until a specified future date). In addition, you must choose between Single BioSample or Batch/Multiple BioSamples submission. If a batch submission is selected, only samples that are part of the same project should be included.
Sample type: Choose among ten options giving a general description of the sample type. Researchers working with non-model invertebrates should choose Invertebrate, those working on model or non-model plants should choose Plants, those working on canonical model animals (e.g. D. melanogaster and C. elegans) should choose Model organism or animal sample, and those working on any non-model non-invertebrate animal should also choose Model organism or animal sample. There are a variety of more specialized descriptors available for metagenomes or pathogens that should be chosen if appropriate.
Attributes: If Single BioSample was chosen, this page is a fillable form. If Batch/Multiple BioSamples was selected, you will be prompted to download a fillable template file. In either case, the following fields are mandatory:
- sample_name: a short, unique descriptor of the sequenced sample;
- organism: the scientific name of the organism to the most specific level available (standard 'Genus species' if possible);
- collection_date: the date when the sample was collected, from the field or lab as appropriate (basically, the date the organism was sacrificed);
- geo_loc_name: the site where the specimen was collected, in the general format Country:State:City;
- tissue: the specific tissue from which RNA was extracted for sequencing.
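
For batch submissions, the mandatory fields listed above can be checked before the template is uploaded. The sketch below is an illustration only: check_biosample_attributes is a hypothetical helper, and it assumes the filled-in template has been saved as a tab-delimited text file whose first non-comment line is the header row. A leading asterisk on a column name, if present, is ignored.

```shell
# Hypothetical helper: report missing columns, empty mandatory fields, and
# duplicated sample names in a tab-delimited BioSample attributes table.
check_biosample_attributes() {
    awk -F '\t' -v required="sample_name organism collection_date geo_loc_name tissue" '
        { sub(/\r$/, "") }                       # tolerate Windows line endings
        /^#/ || /^[\t ]*$/ { next }              # skip comment and blank lines
        !header_seen {
            header_seen = 1
            for (i = 1; i <= NF; i++) { name = $i; sub(/^\*/, "", name); col[name] = i }
            n = split(required, req, " ")
            for (j = 1; j <= n; j++)
                if (!(req[j] in col)) { printf "missing column: %s\n", req[j]; bad = 1 }
            next
        }
        {
            for (j = 1; j <= n; j++) {
                c = col[req[j]]
                if (c && $c == "") { printf "line %d: empty %s\n", NR, req[j]; bad = 1 }
            }
            s = col["sample_name"]
            if (s && $s != "") {
                if (($s) in seen) { printf "line %d: duplicate sample_name %s\n", NR, $s; bad = 1 }
                seen[$s] = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

The function prints nothing and returns zero when every mandatory field is filled in, so it can be used as a quick gate before uploading the file.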

Batch/Multiple BioSamples

Overview: Look over all the submission parts and decide if it is ready to submit. Check carefully for any errors – once the submission is complete, changes can only be made by emailing the BioSample help desk.

3. Submit

Once the request is processed, you will receive a confirmation email containing the BioSample ID(s) in the format SAMNxxxxx.

4. Create a BioProject

Go to the BioProject submission page and sign in with your user account.

Provide information about the sequencing project, including organism information, funding sources, etc. The submission process progresses through seven fillable forms (presented as tabs at the top of the page). In order, these are:

Submitter: Provide information about the person submitting the data and the submitting organization (typically, this will be the submitter's organizational affiliation). An email address from the submitting organization's domain is required. If desired, a shared submission group can again be created.
Project type: Many Project data types are available; at this stage, you select Raw sequence reads. Also on this tab, you must define the "sample scope", which will depend upon the biological entities included in the study. Monoisolate studies involve a single individual or projects that result in a single transcriptome assembly, even if multiple individuals were pooled for sequencing; Multiisolate studies involve multiple individuals or strains from a single species; Multi-species studies involve individuals from 2+ different species; Environment studies involve environmental samples (water, soil, etc.) whose species content is unknown; Synthetic studies involve synthetically created samples; and Other is a catchall that allows you to manually define the sample scope.
Target: Enter information about the organism (or group of organisms) sequenced. The Organism name should be broad enough to cover all the included species (if multiple species within Genus are included, use Genus as the organism name).
General info: Assign a release date for the data, which can occur immediately upon submission, or be delayed until publication (or until a specified future date). In addition, you must choose a Project title and a Public description. These fields will be visible to the public once your submission is complete and released for public viewing. Define the Relevance of the BioProject (agricultural, evolution, etc.). Additional (optional) fields are available on this tab: External Links, for relevant web pages (such as a project page for the sequencing project, or a personal research page describing the research goals); Select Your Grants, for funding information; Consortium name, if the study is part of a consortium project; and Data provider, if the data were not generated by the submitter.
BioSample: Provide a previously created BioSample ID. However, only single BioSamples can be registered this way. If your project involves multiple samples (i.e. if it was a Batch/Multiple BioSamples submission), you should complete BioProject creation without including any BioSample information and then submit the BioSamples separately, including the BioProject ID in the submission. Click Continue without selecting a BioSample to skip this step.
Publications: If publications related to the BioProject already exist, they can be entered here. Future publications can be added at a later date.
Overview: Review all the submission parts to confirm it is ready to submit. Check carefully for any errors – once the submission is complete, changes can only be made by emailing the BioProject help desk!

5. Submit

Once the request is processed, you will receive a confirmation email containing your Submission Title, Submission ID, and BioProject ID (in the form PRJNAxxxxxx).

6. SRA Submission

At this point, you are ready to submit the raw reads that are associated with your BioProject. Sequences can be submitted exactly as received from the sequencing center. Alternatively, processed reads (i.e. those that have been subjected to quality trimming, contaminant removal, etc.) may be submitted. If processed reads are submitted, it is important to include information about all the processing procedures applied.

Go to the Sequence Read Archive submission page and sign in with your user account.

Choose New submission. Provide information about the sequencing project, including organism information, funding sources, etc. The submission process progresses through five fillable forms (presented as tabs at the top of the page). In order, these are:

Submitter: Provide information about the person submitting the data and the submitting organization (typically, this will be the submitter's organizational affiliation). An email address from the submitting organization's domain is required. If desired, a shared submission group can be created, allowing multiple authors to access and contribute to the submission.
General info: Enter the BioProject ID created earlier, indicate whether BioSample IDs have been created, and choose a release date. If Release immediately following processing is selected, the raw data will become publicly accessible right away. If you wish to delay release, a future release date must be selected.
SRA metadata: Provide a metadata file containing information about the sequencing procedures used. A template file can be downloaded from the SRA site as a tab-delimited TXT file or as an Excel file. The Excel file is easier to work with and provides helpful details. In either case, you must save the edited template (the sheet called "SRA_Data" in the Excel file) as a new tab-delimited file. To save the SRA_Data worksheet as a tab-delimited file, use "Save As" and choose "Tab Delimited Text (.txt)." The template includes the following fields:
- bioproject_accession: The BioProject ID associated with the raw reads.
- biosample_accession: The BioSample ID(s) associated with the raw reads.
- library_ID: A user-defined unique identifier. Each sequencing library must have its own unique ID.
- title: A short, publicly viewable description of the data. NCBI recommends the format "<methodology> of <organism>: <sample info>" (e.g. "RNA-Seq of Drosophila melanogaster: adult female antennae").
- library_strategy: You must choose from a provided set of options. For most transcriptome studies, choose "RNA-Seq".
- library_source: The nucleic acid type that was used to prepare the library. For most transcriptome studies, choose "transcriptomic," but "metatranscriptomic" and "single cell transcriptomic" are also available choices.
- library_selection: The method of selection or enrichment used in preparing the sequencing library. For RNA-Seq studies using polyA selection for enrichment for messenger RNA (mRNA), choose "PolyA." For other RNA-Seq methods, such as Total RNA, choose "cDNA." More specialized options are available as appropriate (e.g. "cDNA_oligo_dT").
- library_layout: Specify whether paired-end or single-end sequencing was done.
- platform: The sequencing platform used (Illumina, PacBio, etc.).
- instrument_model: The specific model of the sequencing instrument.
- design_description: A short methods section describing how the libraries were prepared. You are encouraged to provide all relevant details, including, e.g., specific tissues extracted or whether sequencing represents pooled individuals. If processed reads are being submitted, describe the filtering or other steps carried out.
- filetype: The format of the raw reads file (typically FASTQ).
- filename: The exact file name, including extension. This must match the name of the file(s) you upload in the next step. In the case of paired read files (R1 and R2), the submitter will enter filenames in the columns filename and filenam
Files: Upload the raw reads. There are two main ways to do this: (i) Web-based direct upload of files stored on your local computer. Files smaller than 2GB can be uploaded directly; larger files will require the Aspera connect plugin. (ii) Command line-based upload of files via FTP or the Aspera connect command line. The command line options are preferred, especially for larger data sets. Each option is described more thoroughly here:
- Direct upload: On the "Files" page, choose I will upload all the files now via HTTP/Aspera, then use the Browse button to select files stored locally. After all the files are transferred, click on Continue to move to the final tab.
- Aspera plugin: If you have installed the Aspera connect plugin, choosing I will upload all the files now via HTTP/Aspera will automatically launch an Aspera dialogue and let the user select a locally stored file. Multiple files can be selected at once. After selecting the files, the user must confirm the transfer by allowing Aspera to connect to the NCBI web page. After all the files are transferred, click on Continue to move to the final tab.
- Command line FTP: The command ftp is standard in Unix and Linux environments. On the "Files" page, choose I have all files preloaded for this submission, then click on FTP upload instructions. This will create a temporary NCBI user directory and display all the information required for logging in. The information is a numbered list (1 through 7) and provides variable names that will be entered as shown below. Keep this webpage open (or copy all the information). Now the user should open a session on their local server and proceed as shown below. If there are difficulties in having the password accepted, try copying it from the SRA page into a text editor, then copying it from the text editor to the terminal.
  
  After all the files are transferred, return to the SRA submission webpage. Click on Select preload folder to see and select the folder that was just created.
  
  Click on Use selected folder and Continue to move to the final tab.
- Aspera command line: If you have installed the Aspera connect software on the server where the FASTQ files are stored, a single command can be used to transfer all the FASTQ files at once. On the "Files" page, choose I have all files preloaded for this submission, then click on Aspera command line upload instructions.
  
  Click on the link Get the key file to download a file called "aspera.openssh." This file must be transferred to the server where the FASTQ reads are stored, which can be done with the command scp from the terminal or using free FTP software such as FileZilla. Make sure that all the FASTQ files for upload are in a single folder that contains nothing else, then execute the following:
  
  After all the files are transferred, return to the SRA submission webpage. Click on Select preload folder to see and select the folder that was just created.
  
  Click on Use selected folder and Continue to move to the final tab.
Overview: Review all the provided information and check for errors.
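
Because every file name entered in the SRA metadata must exactly match an uploaded file, a quick cross-check before uploading can save a failed submission. This is a minimal sketch under stated assumptions: check_sra_filenames is a hypothetical helper, the metadata is the tab-delimited file saved from the template with the header on its first line, and any column whose header begins with "filename" is treated as holding file names.

```shell
# Hypothetical helper: list file names that appear in the SRA metadata table
# but are absent from the folder of reads to be uploaded.
# Usage: check_sra_filenames <metadata.txt> <folder_with_reads>
check_sra_filenames() {
    awk -F '\t' '
        { sub(/\r$/, "") }                       # tolerate Windows line endings
        NR == 1 { for (i = 1; i <= NF; i++) if ($i ~ /^filename/) use[i] = 1; next }
        { for (i = 1; i <= NF; i++) if ((i in use) && $i != "") print $i }
    ' "$1" | (
        missing=0
        while IFS= read -r name; do
            if [ ! -f "$2/$name" ]; then
                echo "listed in metadata but not found: $name"
                missing=1
            fi
        done
        exit "$missing"                          # becomes the function status
    )
}
```

The function returns zero only when every listed file is present in the folder, so mismatched names or extensions show up before any data are transferred.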

7. Submit

Once satisfied with all entries, click Submit. When the process is complete, you will receive an email containing the final SRA ID in the form SRAxxxxxx.

Protocol 2: Transcriptome Sequence Assembly (TSA) Submission

Preparation Steps

1. For submission to TSA, certain formatting conventions must be adhered to. Failure to follow these conventions will result in a failed submission, so it is important to properly format the data before submitting.

Contig criteria: Contigs must be longer than 199 bp; must not contain more than 10% N’s; must not start or end with N; and must not contain stretches of more than 14 N’s in a row. Contigs that do not meet these requirements should be removed from the assembly before attempting TSA submission.
Sequence definition line: This is the header line of each contig, which starts with “>” and ends with a newline character. The definition line must not be longer than 50 characters, including spaces, and must begin with a unique identifier (e.g. “>contig_001”). Additional modifiers can be used - see the complete list. These follow the format “[modifier=text]” and can include organism, sex, and other details. NCBI advises that all TSA submissions include “[moltype=mRNA]” and “[tech=TSA]” in the definition line. Many assembly programs will produce definition lines that contain information about contig length, assembly path, etc. All these values must be removed. See the accompanying webpage for more details.

To see the current definition line:
head -n 1 Gcitizenii_Spec7_Trinity.fasta
To modify the definition line to meet NCBI recommendations:

Commands used:
sed 's/len.*$/[organism=Good citizenii] [bioproject=PRJNA472791] [moltype=transcribed_RNA] [tech=TSA]/g' Gcitizenii_Spec7_Trinity.fasta \
> Gcitizenii_Spec7_assembly.fasta

To see new definition line:
head -n 1 Gcitizenii_Spec7_assembly.fasta

File name: The assembly must have the extension .fsa, not .fasta. To rename, do:
mv Gcitizenii_Spec7_assembly.fasta Gcitizenii_Spec7_assembly.fsa

File format: The user must decide between submitting the assembly as a FASTA file or as an ASN.1 file. The ASN.1 format is mandatory if the submitter also plans to provide annotation; otherwise, either method can be chosen. We recommend the ASN.1 format because it embeds data in the submission that otherwise (if FASTA is chosen) must be entered manually on the TSA submission page.

A TSA submission can contain only one assembly. Thus, if a BioProject contains, for example, three BioSamples (with corresponding SRA files for each), three distinct TSA submissions will be required.
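
The contig criteria given in the preparation steps (longer than 199 bp, no more than 10% N's, no N at either end, no run of more than 14 N's) can be applied to an assembly before submission. The following is a rough sketch rather than an NCBI tool: filter_contigs is a hypothetical helper that writes passing contigs to standard output with each sequence on a single line, and reports rejected definition lines on standard error.

```shell
# Hypothetical helper: keep only the contigs that meet the four criteria.
# Usage: filter_contigs assembly.fsa > filtered_assembly.fsa
filter_contigs() {
    awk '
        BEGIN { nrun = sprintf("%15s", ""); gsub(/ /, "N", nrun) }   # 15 Ns in a row
        function emit(   s, n) {
            if (header == "") return
            s = toupper(seq)
            n = gsub(/N/, "N", s)                # number of Ns in the contig
            if (length(s) >= 200 && n * 10 <= length(s) &&
                substr(s, 1, 1) != "N" && substr(s, length(s), 1) != "N" &&
                index(s, nrun) == 0) {
                printf "%s\n%s\n", header, seq
            } else {
                print "rejected: " header > "/dev/stderr"
            }
        }
        { sub(/\r$/, "") }                       # tolerate Windows line endings
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }                         # join wrapped sequence lines
        END { emit() }
    ' "$1"
}
```

Redirecting standard error to a file (for example, adding 2> rejected_contigs.txt) keeps a record of which contigs were dropped, which is useful for the assembly description.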

Protocol Steps

2. Once the assembly file is properly prepared, go to the Submission Portal to begin the TSA submission. Sign in to NCBI with your user account, choose TSA and click on New Submission.

During this step, the submitter will provide information about the assembly, including the BioSample, BioProject, and SRA identifiers. The user must decide between submitting the assembly as a FASTA file or as an ASN.1 file. An ASN.1 file is mandatory if the submitter also plans to provide annotation; otherwise, either method can be chosen. If FASTA upload is selected, the user will manually provide some additional data; if ASN.1 is chosen, the required information is embedded in the ASN.1 file. The submission process progresses through either five (ASN.1) or six (FASTA) fillable forms, depending on the sequence format chosen. We first present the steps for FASTA files, followed by the steps for ASN.1 files.

3. For FASTA files:

Submitter: As in earlier examples.
General Info: Provide the BioProject identifier (PRJNAxxxxxx), the BioSample identifier (SAMNxxxxxx), the release date, the data type (EST or NGS), and the SRA accession(s) (SRRxxxxxx).
- Information about the assembly method: Provide the name of the assembly program used (e.g. Trinity, Abyss, etc.) and the version number (or date of assembly, if program version is not known);
- Assembly name (optional);
- Assembly coverage (optional);
- Description of Assembly method (required): This should be as detailed as possible and include any read processing steps, whether default program settings were used, and any other information that would be required to exactly reproduce the assembly process; and,
- Sequencing Technology: The platform used for sequencing (e.g. Illumina HiSeq, PacBio, etc.).

File: Choose between submitting the assembly as a FASTA file or as an ASN.1 file. Choose File type FASTA and click Continue.
Sequence: Click on Browse to select an assembly.fsa file stored on the user's local machine. An Aspera connect window will open to display the progress of the upload. Once the upload is complete, the message "Please wait! Processing the data" is displayed as an initial TSA validation check is conducted. If errors are displayed, click on report.txt for more information. If no report.txt link is shown, simply copy all the errors from the webpage and save as a file called TSA_report.txt. The report contains a list of all the problem contigs (see TSA_report.txt for example). Use the following steps to remove any problematic contigs:
- Create a list of all the contig IDs in the assembly:
References: Provide the name(s) of the "Sequence authors", i.e. the people responsible for generating the raw reads upon which the assembly is based, and information about publications (unpublished, in-press, or published) that include the assembly.
Overview: Look over all the provided information and decide if changes are needed. After you click Submit, the submission will undergo further assessment at NCBI and a complete VecScreen analysis will occur. This process can take 12 or more hours. If there are no problems, NCBI will send a confirmation email with a TSA accession number in the format GAAxxxxxx.
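
Removing the problem contigs flagged during validation can be scripted. The sketch below is one possible approach, not the original commands: list_contig_ids and remove_contigs are hypothetical helpers, and remove_contigs assumes the problem contig IDs have first been collected into a plain text file with one ID per line (the layout of the actual report may differ, so the IDs may need to be extracted from it by hand or with a text editor).

```shell
# Hypothetical helper: print every contig ID in an assembly (the first word of
# each definition line, without the leading ">").
list_contig_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1"
}

# Hypothetical helper: write the assembly to standard output without the
# contigs whose IDs are listed in the ID file.
# Usage: remove_contigs problem_ids.txt assembly.fsa > assembly_filtered.fsa
remove_contigs() {
    awk -v ids="$1" '
        BEGIN {
            # Load the IDs to drop; only the first word on each line is used.
            while ((getline line < ids) > 0) {
                split(line, f, " ")
                if (f[1] != "") drop[f[1]] = 1
            }
        }
        /^>/ { id = substr($1, 2); keep = !(id in drop) }
        keep                                      # print lines of retained contigs
    ' "$2"
}
```

Comparing the output of list_contig_ids before and after filtering is a simple way to confirm that only the intended contigs were removed. As noted elsewhere in this guide, save the filtered assembly under a new name.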

4. For ASN.1 files:

Before submitting an ASN.1 file, some additional preparations are required. The following files must be generated for each assembly that will be submitted:

Create a "GenBank Submission Template" (SBT file): Go to https://submit.ncbi.nlm.nih.gov/genbank/template/submission/ and complete the required fields. The same BioProject identifier can be used for as many assemblies as are part of the project, but each BioSample identifier should have its own SBT file. After filling all fields, click Create Template to download the SBT file. Save the file with a name that reflects which assembly it is for.
Create a "Structured Comment - Non Genome Submissions" (CMT file): Go to https://submit.ncbi.nlm.nih.gov/structcomment/nongenomes/ and complete the required fields. The Assembly name field should be specific to a single assembly. Click Download to download the CMT file. Save the file with a name that reflects which assembly it is for.
Place the following files in a single folder (either on the user's local machine or on a server): The correctly formatted assembly.fsa file, as described in the beginning of section 3; the .sbt and .cmt files just created; and an appropriate version of the tbl2asn executable (e.g. linux64.tbl2asn, mac.tbl2asn, etc.).
From within the folder, run the tbl2asn command as follows: The resulting output files will be (i) assembly.val, a validation file that will report errors (if this file is empty, no errors were detected); and (ii) assembly.sqn, which is the ASN.1 file for TSA submission.
Once the assembly.sqn file is prepared, go to https://submit.ncbi.nlm.nih.gov/ to begin the TSA submission.
Submitter: As in earlier examples
General info: As in earlier examples
File: Choose File type ASN and click Continue.
Sequence: Click on Browse to select an assembly.sqn file stored on the user's local machine. An Aspera connect window will open to display the progress of the upload. Once the upload is complete, the message "Please wait! Processing the data" is displayed as an initial TSA validation check is conducted. If no errors are displayed, click Continue. If there are errors, see the previous section "For FASTA files: d" for instructions on removing problematic contigs. Be sure to save the filtered assembly with a new name.
Overview: Look over all the provided information and decide if changes are needed.
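
Two small checks can bracket the tbl2asn step described above: confirming that the working folder holds the expected input files before the run, and confirming that the validation file is empty afterwards. This is only a sketch; check_tbl2asn_inputs and val_reports_errors are hypothetical helper names, and the checks rely solely on the file extensions mentioned in this guide (.fsa, .sbt, .cmt, .val).

```shell
# Hypothetical helper: confirm the folder contains at least one file with each
# of the extensions expected as tbl2asn input.
# Usage: check_tbl2asn_inputs /path/to/folder
check_tbl2asn_inputs() (
    status=0
    for ext in fsa sbt cmt; do
        found=0
        for f in "$1"/*."$ext"; do
            if [ -f "$f" ]; then found=1; break; fi
        done
        if [ "$found" -eq 0 ]; then
            echo "no .$ext file found in $1"
            status=1
        fi
    done
    exit "$status"                               # body runs in a subshell
)

# Hypothetical helper: succeeds if the validation file is non-empty,
# i.e. if errors were reported.
val_reports_errors() {
    [ -s "$1" ]
}
```

For example, after the run: val_reports_errors assembly.val && echo "check assembly.val before submitting".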

5. Submission: After you click Submit, the submission will undergo further assessment at NCBI and a complete VecScreen analysis will occur. This process can take 12 or more hours. If there are no problems, NCBI will send a confirmation email with a TSA accession number in the format GAAxxxxxx.

If the assembly fails to pass the more thorough quality checks that occur post-submission, the submitter will receive an email stating that the submission failed.
An included link directs the user to the submission portal. Click on the Fix link.
This will lead you to a downloadable Contamination.txt file describing the type of contaminant identified and listing the corresponding sequence identifiers.
See Gcitizenii_sqn_contamination.txt for an example.

6. See the previous section "For FASTA files: d" for instructions on removing problematic contigs. Be sure to save the filtered assembly with a new name.

Once the assembly has been filtered, return to the Sequence tab and upload the filtered assembly.

If all is well, you will receive a confirmation email.

How to Cite

Feindt, W., Oppenheim, S. J., DeSalle, R., & Mehr, S. (2018). Good citizenship made easy: A step-by-step guide to submitting RNA-Seq data to NCBI. Current Protocols in Bioinformatics, e67. doi: 10.1002/cpbi.67

Disclaimer

This guide is provided "as is." At the time of publication, all of the information and commands function as described, but future changes to software programs or in NCBI procedures could affect the performance of this guide.

Any expressed or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this guide, even if advised of the possibility of such damage.