
Dataset

Undefined

ID: 50|dedup_wf_001::4a42ba8a25b7a26adb431780f505bbe3 · DOI: 10.5061/dryad.p4s57

Data from: The constrained maximal expression level owing to haploidy shapes gene content on the mammalian X chromosome

Abstract

X chromosomes are unusual in many regards, not least of which is their nonrandom gene content. The causes of this bias are commonly discussed in the context of sexual antagonism and the avoidance of activity in the male germline. Here, we examine the notion that, at least in some taxa, functionally biased gene content may more profoundly be shaped by limits imposed on gene expression owing to haploid expression of the X chromosome. Notably, if the X, as in primates, is transcribed at rates comparable to the ancestral rate (per promoter) prior to the X chromosome formation, then the X is not a tolerable environment for genes with very high maximal net levels of expression, owing to transcriptional traffic jams. We test this hypothesis using The Encyclopedia of DNA Elements (ENCODE) and data from the Functional Annotation of the Mammalian Genome (FANTOM5) project. As predicted, the maximal expression of human X-linked genes is much lower than that of genes on autosomes: on average, maximal expression is three times lower on the X chromosome than on autosomes. Similarly, autosome-to-X retroposition events are associated with lower maximal expression of retrogenes on the X than seen for X-to-autosome retrogenes on autosomes. Also as expected, X-linked genes have a lesser degree of increase in gene expression than autosomal ones (compared to the human/chimpanzee common ancestor) if highly expressed, but not if lowly expressed. The traffic jam model also explains the known lower breadth of expression for genes on the X (and the Z of birds), as genes with broad expression are, on average, those with high maximal expression. As then further predicted, highly expressed tissue-specific genes are also rare on the X and broadly expressed genes on the X tend to be lowly expressed, both indicating that the trend is shaped by the maximal expression level, not the breadth of expression per se.
Importantly, a limit to the maximal expression level explains biased tissue of expression profiles of X-linked genes. Tissues whose tissue-specific genes are very highly expressed (e.g., secretory tissues, tissues abundant in structural proteins) are also tissues in which gene expression is relatively rare on the X chromosome. These trends cannot be fully accounted for in terms of alternative models of biased expression. In conclusion, the notion that it is hard for genes on the Therian X to be highly expressed, owing to transcriptional traffic jams, provides a simple yet robustly supported rationale of many peculiar features of X’s gene content, gene expression, and evolution.

chicken.all_samples.galGal3.tpm.refgene.osc
Data for the analysis of the chicken chromosome Z. FANTOM5 chicken libraries consisted of 25 CAGE libraries including: chicken aortic smooth muscles, hepatocytes, mesenchymal stem cells, leg buds, wing buds, embryo extra-embryonic tissue (day 7 and day 15), and whole body developmental time course (from 5 hours 30 minutes to 20 days). The number of available datapoints to which TPM was normalized was limited by the number of annotated chicken RefSeq transcripts (which was approximately six times smaller than human, N = 4,426 on autosomes, and N = 241 on chromosome Z). Consequently, the cutoff for a gene to be classified as “on” was adjusted six times higher, to 60 TPM.

human.primary_cell.hCAGE.hg19.tpm.refgene.osc
The FANTOM5 dataset for human primary cells.

human.cell_line.hCAGE.hg19.tpm.refgene.osc
The FANTOM5 dataset for human cancer cell-lines.

human.tissue.hCAGE.hg19.tpm.refgene.osc
The FANTOM5 dataset for human tissue. CAGE tags were mapped to RefSeq transcripts +/-500 base pairs (bps) from their TSSes and normalized to tags per million (TPM), as previously described [37,45]. The signal of ten TPM was chosen as the cutoff for a gene to be classified as “on” (this cutoff was accepted as the standard for human data throughout the consortium).
FANTOM5 is the most comprehensive expression dataset ever generated, including 952 human and 396 mouse tissues, primary cells and cancer cell-lines. FANTOM5 is based on cap analysis of gene expression (CAGE), a unique technology that characterizes TSSes across the entire genome in an unbiased fashion and at a single-base resolution level [21]. CAGE automatically sums expression levels of all transcripts beginning at a given transcription start site.

raw_Z_Exp_Anc_L
Data for Fig 2, "The comparison of change in gene expression (Z) since the human-chimpanzee common ancestor for five somatic tissues."

SUPPLEMENTARY TABLES
Data in Table S3 underlie Figure 4. Data in Table S7 partially underlie Fig 1. Data in Table S4 underlie Fig 3. Data in Tables S10-12 underlie Fig S1.

data for Fig1
R environment containing data underlying Fig 1. The environment contains the following variables, sorted identically to the gene list in refSeqs: chromosome (chromosomal location), chromosome_short (location on autosomes, chrX, or chrY), data_matrix (F5 data matrix in TPM for human tissues), MAX (maximal expression for each RefSeq), max (maximal expression for each tissue), strata_classification (strata classification for genes on chromosome X), refSeqs_2entrezIDs (entrez ids mapped to refseqs), boe (the breadth of expression).
File: env_fig1

GC-contents data for Fig S6 and S7
This R environment contains GC-content data for either proximal promoters or the isochore around the TSS (marked as big). The data are calculated for either masked or unmasked genome sequence.
File: env_gc_contents

data for Fig S3
Numbers of ENCODE transcription factor binding sites mapped to TSSes of RefSeq genes in symmetrical windows of different sizes (from 250 to 20000 bps), depending on the ENCODE quality cut-off (strict or all).
File: FigS3_data.txt

data underlying Fig S8
Breadth of expression and maximal expression are compared in three groups of observations: (1) autosomal paralogs of X-linked genes, (2) other autosomal paralogs matched by age, (3) X-linked paralogs. Newly formed paralogs are defined as those mapped by phylogenetic timing to taxa Theria or younger. Pre-existing duplications are defined as those descending from duplication nodes mapped by phylogenetic timing to taxa Amniota or older.
File: FigS8_data.txt

data underlying Fig7
File: Fig7_data.txt

TreeFam data for timing of gene duplications in R environments
These files are R environments. Use load() to load them into your R session. Use ls() to view the contents. You may use attach() to load the namespace, or access data members of the environment using the "$" reference operator. There is no warranty for this software.
File: env_duplicator_base

Additional TreeFam gene duplication data with duplication timing
File: env_duplicator_vectors
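
The file descriptions above refer to three per-gene quantities: whether a gene is "on" in a sample (10 TPM for human data, 60 TPM for chicken), its maximal expression, and its breadth of expression. The sketch below shows one way these could be computed from a TPM matrix. It is an illustration only, not the authors' code: the gene names and values are made up, treating "on" as TPM at or above the cutoff is an assumption, and so is reporting breadth as the fraction of samples in which the gene is on.

```python
from typing import Dict, List, Tuple

# Cutoffs stated in the dataset description.
HUMAN_ON_CUTOFF_TPM = 10.0
CHICKEN_ON_CUTOFF_TPM = 60.0


def maximal_expression(tpm: List[float]) -> float:
    """Highest TPM value observed for one gene across all samples."""
    return max(tpm)


def breadth_of_expression(tpm: List[float],
                          cutoff: float = HUMAN_ON_CUTOFF_TPM) -> float:
    """Fraction of samples in which the gene is 'on'.

    Assumption: 'on' means TPM >= cutoff, and breadth is a fraction
    rather than a raw count of samples.
    """
    return sum(1 for value in tpm if value >= cutoff) / len(tpm)


def summarize(expression: Dict[str, List[float]],
              cutoff: float = HUMAN_ON_CUTOFF_TPM
              ) -> Dict[str, Tuple[float, float]]:
    """Map each gene to (maximal expression, breadth of expression)."""
    return {
        gene: (maximal_expression(tpm), breadth_of_expression(tpm, cutoff))
        for gene, tpm in expression.items()
    }


# Hypothetical toy matrix: gene -> TPM in four samples.
toy_expression = {
    "geneA": [0.0, 12.0, 150.0, 9.9],   # tissue-restricted, high maximum
    "geneB": [25.0, 30.0, 11.0, 10.0],  # broad, modest maximum
}

for gene, (max_tpm, breadth) in summarize(toy_expression).items():
    print(f"{gene}: max={max_tpm} TPM, breadth={breadth:.2f}")
```

Passing `cutoff=CHICKEN_ON_CUTOFF_TPM` applies the stricter chicken threshold to the same functions.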
