GoTriple - Data from: De novo transcriptome assembly databases in the butterfly orchid Phalaenopsis equestris

Abstract

Orchids are renowned for their spectacular flowers and ecological adaptations. After the sequencing of the genome of the tropical epiphytic orchid Phalaenopsis equestris, we combined Illumina HiSeq2000 for RNA-Seq and Trinity for de novo assembly to characterize the transcriptomes of 11 diverse P. equestris tissues representing the root, stem, leaf, flower buds, column, lip, petal, sepal and three developmental stages of seeds. Our aims were to contribute to a better understanding of the molecular mechanisms driving the characteristics of the analysed tissues and to enrich the available data for P. equestris. Here, we present three databases. The first dataset is the RNA-Seq raw reads, which can be used to run new experiments with different analysis approaches. The other two datasets allow different types of searches for candidate homologues. The second dataset includes the sets of assembled unigenes and predicted coding sequences and proteins, enabling a sequence-based search. The third dataset consists of the annotation results from aligning the unigenes against the Nonredundant (Nr) protein database, Kyoto Encyclopaedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) databases with low e-values, enabling a name-based search.

P. equestris genome assembly

The P. equestris genome scaffolds and the file describing the positional relationship between the superscaffolds and the scaffolds or contigs.

Pha_1213.scafSeq.FG2_superscaffold.tar.gz

P. equestris genome repeat annotation

The P. equestris genome repeat annotation, which contains the repeat annotation files produced by proteinmasker, RepeatMasker and TRF, the GFF-format files of the repeat annotation by proteinmasker, RepeatMasker and TRF, the GFF-format file of the de novo repeat annotation and the xlsx-format file of the repeat annotation statistics.

pequ_repeat_dataset1.tar.gz

P. equestris genome gene models

The P. equestris genome gene models contain the predicted coding sequences, the proteins and a GFF-format file.

pequ_gene_models_dataset1.tar

P. equestris genome functional annotation

The P. equestris genome functional annotation dataset contains the BLAST results against the KEGG, InterPro, Swiss-Prot and TrEMBL databases.

pequ_functional_annotation_dataset1.tar

The transcriptome assembly

The dataset contains the unigenes, taken as the longest contig per transcript generated by Trinity. The fb.flower bud.Unigene.fa file contains unigenes from the flower bud of P. equestris, the L5.root.Unigene.fa file contains unigenes from the root, the L6.stem.Unigene.fa file contains unigenes from the stem and the PHA.leaf.Unigene.fa file contains unigenes from the leaf. The 12_day.unigene.fasta, 7_day.unigene.fasta and 4_day.unigene.fasta files contain unigenes from seeds collected 12 days, 7 days and 4 days after sowing on 1/2 MS medium, respectively. The sepal.unigene.fasta, petal.unigene.fasta, lip.unigene.fasta and column.unigene.fasta files contain unigenes from the sepal, petal, lip and column, respectively.

unigene_dataset3.tar

The transcriptome functional annotation

The dataset contains functional annotation and coding sequence annotation for the 11 tissues. There are five annotation files per tissue: three functional annotation files (the KEGG, COG and Nr database annotation files) and two structural annotation files (the cds and pep files). The cds and pep files are in FASTA format; each header contains the unigene name of the predicted coding sequence, the locus and the coding direction.

annotation_dataset4.tar.gz

HSP gene family in the eleven transcriptomes

We tested full-length transcripts against the HSP90 and HSP70 gene families to examine the completeness of the data by comparing the transcriptomes of the 11 tissues with the P. equestris genome. PEQU means P. equestris; flower bud, root, stem and leaf are labelled fb, L5, L6 and PHA, respectively. 4_day_seed, 7_day_seed and 12_day_seed are seeds collected 4 days, 7 days and 12 days after sowing on 1/2 MS medium, respectively.

HSP_dataset5.tar

100 CEGs for checking transcript assembly completeness

The alignment results for 100 randomly selected conserved core eukaryotic genes (CEGs) across Arabidopsis thaliana, the P. equestris genome and the eleven transcriptomes, used to examine the completeness of the transcript assemblies. Eighty-two CEG sequences (82%) were perfectly reconstructed, showing high consistency. Some sequences suggest that parts are missing from the PEQU genome sequence, such as the homologues of At2g36880.1 and At1g12840.1, and some transcriptome sequences should be merged, such as the homologues of At4g39280.1.

CEGs_dataset6.tar
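
As a starting point for working with the unigene files, the following is a minimal Python sketch that counts the sequences in a FASTA file and reports basic length statistics. It assumes the unigene files are standard FASTA, as their .fa and .fasta extensions suggest. The function names `read_fasta` and `summarize` and the toy records are hypothetical and are not part of the deposited data.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle):
    """Return the sequence count and simple length statistics for a FASTA handle."""
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    total = sum(lengths)
    return {
        "n_sequences": len(lengths),
        "total_length": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "max_length": max(lengths, default=0),
    }


# Toy input (hypothetical records, not taken from the dataset).
toy = StringIO(">toy_unigene_1\nATGC\nATGC\n>toy_unigene_2\nGGCC\n")
stats = summarize(toy)
print(stats)
```

With the archive unpacked, the same function could be applied to a real file, for example `with open("L5.root.Unigene.fa") as fh: print(summarize(fh))`, adjusting the path to wherever the file was extracted.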