Dissertação apresentada para obtenção do grau de doutor thesis... · 2016. 7. 25. · Os...
Dissertation presented to obtain the degree of Doctor
in Evolutionary Biology
from the Instituto de Tecnologia Química e Biológica
of the Universidade Nova de Lisboa.
This work was financially supported by FCT and FSE
within the scope of the Community Support Framework,
PhD fellowship no. SFRH/BD/15856/2005.
Acknowledgments
I would like to thank Arcadi Navarro and Isabel Gordo for agreeing to supervise this
PhD, and Arcadi Navarro for the opportunity to collaborate in other projects, three of
which resulted in the publications found in the Appendices section.
I am also grateful to the Unitat de Biologia Evolutiva of the Universitat Pompeu
Fabra, now part of the Institut de Biologia Evolutiva, for hosting me during this work,
and its members for making me feel welcome.
A very special thank you to the many people I was lucky to meet over
these years, who took the time to discuss my (and their own) projects with me and
contributed helpful comments and ideas that greatly improved my work, even
if that work didn’t make it into the thesis.
The work presented here would not have been possible without the financial support
from the Portuguese Fundação para a Ciência e a Tecnologia through a PhD
fellowship (SFRH/BD/15856/2005), and the excellent training and education provided
by the Programa Gulbenkian de Doutoramento em Biomedicina.
Index
Summary
Resumo
INTRODUCTION
  Historical perspective
    Genes in pieces
    Not much room for doubt
    First impressions
  Evolutionary perspective
    Four kinds of introns
      tRNA and archaeal introns
      Self-splicing introns
      Spliceosomal introns
    Introns early vs late
    Mechanisms of intron gain and loss
      Intron loss
      Intron gain
  Splicing
    The spliceosome
      Splicing signals and the assembly of the spliceosome
      The minor form of spliceosome
      Finding the correct pair of splice sites
    Alternative splicing
  Why should we care about introns?
    Boost mRNA quality
    Increase recombination
    Source of functional diversity
    Repositories of functional elements
  References
RESULTS
  Publication I: Intronic mutational constraints in Primates
    Abstract
    Introduction
    Materials and Methods
    Results
    Discussion
    Conclusions
    Acknowledgments
    References
  Publication II: Accelerated evolution in Human introns
    Abstract
    Introduction
    Materials and Methods
    Results
    Discussion
    Acknowledgments
    Supplementary Tables
    References
GENERAL DISCUSSION AND CONCLUSIONS
  Constraints on the evolution of intronic sequences
  Accelerated evolution of intronic sequences
  References
Summary
Spliceosomal introns, the most common class of introns in eukaryotes, found in the
protein coding genes in the nucleus of these organisms, are commonly described as
regions in the primary transcript that need to be excised in order to produce the
functional mRNA molecule. Yet, they are also regions in the RNA transcript, and the
corresponding genomic regions, with a high number of functional elements that act
either at the RNA or DNA level and help regulate important cellular processes such as
splicing and gene expression.
With the exception of the core splicing signals, whose sequence motifs and location
within the intron are relatively well defined, most of the other cis-acting functional
elements in introns are located at variable distances from the splice sites and contain
degenerate sequence motifs with low information content, which makes them much
harder to locate within the introns. Given the critical roles played by these elements,
it is likely that many evolve under selective pressure to maintain function, which will
affect intron sequence conservation levels. Thus, sequence conservation can help in
the task of finding these cis-regulatory elements, as the most constrained regions in
introns are their most likely location.
In our first study we examined the sequence conservation along primate introns
(human, chimpanzee and macaque) and identified regions where functional elements
involved in splicing (within 400 base pairs from the splice sites) and transcription
regulation (up to several kilobase pairs from the donor splice site in the first intron)
are more likely to occur, and intronic regions which evolve mostly unconstrained
(central portions of introns left after removing the constrained regions described
above). The results from this study are of particular interest for defining target
regions in studies of functional elements present in introns (either computational
scans of over-represented motifs or functional experiments), and for studies using
introns as neutrally evolving sequences in order to, for instance, estimate genetic
distances between species or detect selective events.
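The partition of an intron described above can be illustrated with a minimal sketch. It is not the procedure used in the study: the function name `partition_intron`, the region labels, and the rule of splitting short introns at their midpoint are assumptions made here for illustration. Only the 400 base pair flank is taken from the text. The longer transcription-related region at the start of first introns is not modelled.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    start: int  # 0-based, inclusive, counted from the first intron base
    end: int    # 0-based, exclusive


def partition_intron(length: int, flank: int = 400) -> list[Region]:
    """Split an intron into splice-site-proximal flanks and a central portion.

    `flank` is the number of bases next to each splice site treated as
    potentially constrained (400 bp in the text). Introns too short to
    have a central portion are split at their midpoint, which is an
    illustrative assumption and not a rule stated in the thesis.
    """
    if length < 0 or flank < 0:
        raise ValueError("length and flank must be non-negative")
    if length <= 2 * flank:
        mid = length // 2
        regions = [
            Region("donor_flank", 0, mid),
            Region("acceptor_flank", mid, length),
        ]
    else:
        regions = [
            Region("donor_flank", 0, flank),
            Region("central", flank, length - flank),
            Region("acceptor_flank", length - flank, length),
        ]
    # Drop empty regions (for example, for a zero-length input).
    return [r for r in regions if r.end > r.start]


if __name__ == "__main__":
    for region in partition_intron(1000):
        print(region.name, region.start, region.end)
```

Under these assumptions, only the `central` region would be a candidate for use as a neutrally evolving sequence, while the two flanks would be the regions to scan for splicing-related elements.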
Given the potential of alternative splicing to generate proteins with diverse functions
(sometimes even opposite roles) from the same gene, and the contribution of both
tissue-specific alternative splicing events and transcription regulation to organism
complexity, it is plausible that some of the cis-acting functional elements found in
introns evolved under positive selection and are responsible for organismal
differences between species.
In our second study we performed a genome-wide scan for introns with evidence of
having evolved under positive selection in the human branch and found 86
candidates, mostly belonging to different genes. Our results indicate that functional
sequences in these fast evolving introns are more likely to have a role in the control
of transcription and gene expression than in the regulation of alternative splicing.
Since our functional analysis of the genes containing our candidate introns did not
identify any particular biological process or molecular function, we suggest that
positive selection acting upon introns has been largely decoupled from the functions
of the genes to which these introns belong. In contrast, it is possible that a significant
portion of the fast evolving elements in our candidate introns are distant
transcription regulatory elements acting on neighboring genes, which often have
unrelated functions.
Resumo
Spliceosome-dependent introns, the most common class of introns in eukaryotes,
present in the protein-coding genes found in the nucleus of these organisms, are
frequently described as the regions of primary transcripts that must be removed to
form a functional messenger RNA molecule. However, introns are also regions in the
RNA transcript, and in the corresponding genomic regions, that contain a large number
of functional elements, which act at the RNA or DNA level and contribute to the
regulation of important cellular processes such as splicing and gene expression.
Except for the main splicing signals, whose sequence patterns and location within the
intron are relatively well defined, most of the functional elements present in introns
lie at variable distances from the splice sites and contain degenerate sequence
patterns with low information content, which makes them difficult to identify. Given
their importance, it is likely that many evolve under selective pressure to maintain
their function, which will be reflected in the conservation levels along introns.
Conservation levels can therefore help in the task of finding these regulatory
elements, since the most conserved regions in introns are the most likely to contain
them.
In a first study we examined conservation along intronic sequences of primates
(human, chimpanzee and macaque) and identified regions more likely to contain
functional elements involved in the regulation of splicing (in the 400 base pairs
adjacent to the splice sites) and of transcription (several kilobases from the 5’ splice
site of the first intron), as well as regions that evolve mostly without constraint (the
central portions of introns that remain after excluding the constrained regions
described above). The results of this work are of particular importance both for
defining regions of interest in studies of functional elements present in introns
(computational searches for over-represented motifs or functional experiments) and
for studies that use introns as neutrally evolving sequences to, for example, estimate
genetic distances between species or detect selection events.
Given that the mechanism of alternative splicing can generate different proteins from
the same gene (sometimes even with antagonistic functions), and given the
contribution of both tissue-specific alternative splicing events and transcription
regulation to the complexity of organisms, it is possible that some of the functional
elements present in introns evolved under positive selection and are responsible for
organism-level differences between species.
In the second study we searched the whole genome for introns with evidence of
having evolved under positive selection in the human branch, and found 86 candidate
introns, most of which belong to distinct genes. Our results indicate that the
functional sequences present in these introns are more likely to be involved in the
control of transcription and gene expression than in the regulation of alternative
splicing. Since the functional analysis of the genes to which our candidate introns
belong did not highlight any particular biological process or molecular function, we
suggest that positive selection acting on introns is largely decoupled from the
functions of the genes to which the introns belong. It is possible that a significant
portion of the fast-evolving elements in our candidate introns are involved in the
long-range regulation of transcription of neighbouring genes, which often have
distinct functions.
Introduction
"A week of hard work can sometimes save you an hour of thought."
Unknown author.
Historical perspective
(Stumbling into introns)
The word intron was coined in 1978 by Walter Gilbert (Gilbert 1978)
as an abbreviation for intragenic region. The discovery of introns itself, however, was
made the year before, in 1977, and is now commonly attributed to Richard J. Roberts
and Phillip A. Sharp, who, in 1993, were awarded the Nobel Prize in Physiology or
Medicine for their discovery of "split genes".
Genes in pieces?
By the mid 1970s, genes were seen as “transcribed code” (Gerstein et al. 2007) –
continuous stretches of DNA that are copied into RNA – and messenger RNA (mRNA)
was thought to be a direct copy of the gene sequence. This view was based mainly on
studies with bacteria and bacteriophages, which dominated the field at the time, but
the collinearity and continuity of the DNA, RNA and protein sequences was assumed
to be universal. Therefore, the finding that mRNA can derive from physically separate
sections along the DNA came as a shock¹, and at the time it looked like genes were
split in pieces by introns, which were initially referred to as intervening DNA, inserts,
spacer sequences or spacers.
By 1976 it was already known that the primary transcripts of all major classes of RNA
(ribosomal, transfer and messenger) undergo some processing before they become
the functionally competent, mature forms of RNA. There was also considerable
evidence that eukaryotic mRNAs are initially transcribed as much larger molecules –
the heterogeneous nuclear RNAs (hnRNAs) – that are subsequently shortened. Based
on observations that mRNA and hnRNA share the same polyadenylation site it was
proposed that the mRNA segment was placed at the 3’-end of the hnRNA. When it
was later found that caps are also present at the 5’-end of both mRNA and hnRNA
researchers reasoned that, in some cases, the mRNA segment was located at the 5’
terminus of its precursor (Perry 1976). It was assumed that one or the other end of the
initial transcript was cut off; no one expected that the discarded segments could
come from the middle of the RNA (Marx 1977; Rogers 1978; Marx 1978).
¹ James Watson actually used the word “bombshell” to describe this finding in the ‘Foreword’ to the 1977 Cold Spring Harbor Symposia on Quantitative Biology, where the first results were presented a few months before they were published, and words such as ‘amazing’ and ‘baroque’ were used in the titles of scientific articles and communications.
Not much room for doubt
Not only was the discovery of introns surprising and unexpected, it also happened at
a breathtaking pace (Figure 1).
The finding was first reported at the Cold Spring Harbor Symposia on Quantitative
Biology, in the beginning of June 1977. Several groups of investigators, including
Sharp’s and Roberts’ groups, presented their independent discovery that a number
of mRNAs of animal viruses consist of sequences complementary to widely separated
portions of the viral genome. The importance of these works was immediately
recognized and featured in the News sections of magazines such as Nature and
Science (Sambrook 1977; Marx 1977) even before the original research articles
(Berget et al. 1977; Chow et al. 1977; Klessig 1977; Dunn and Hassell 1977; Lewis et
al. 1977; Aloni et al. 1977; Kitchingman et al. 1977; Hsu and Ford 1977) were
published.
Although the discovery was made in viral messengers, researchers suspected that the
same could be happening with mRNAs from animal cells, since the viruses use the
enzymes of the nucleated cells they infect to produce their own mRNA. Their
hypothesis was confirmed by other groups in November (Doel et al. 1977;
Breathnach et al. 1977), only three months after the first publication of the discovery
in viruses. As with the work on viruses, the discovery of introns in eukaryotic
messengers was made almost simultaneously by several independent groups.
Figure 1 Timeline of events regarding the discovery of introns. Above the timeline are events discussed
in the main text and below are some of the main advances that allowed the discovery of introns. Temin,
Baltimore, Smith, Berg, Sharp and Roberts were all later awarded Nobel prizes for the discoveries
mentioned in the figure. Line width is proportional to the number of publications. *The study on
Drosophila rRNA was published in February 1977.
[Timeline spanning 1970 to 1979, with the period from February 1977 to February 1978 shown month by month.]
Above the timeline: at the Cold Spring Harbor Symposia several groups studying animal viruses present evidence that mRNAs are complementary to noncontiguous regions of the viral genome, and the surprising discovery is featured in the News section of Nature and Science; Sharp and Roberts are among the first to publish their results (in August and September, respectively), closely followed by the other groups; studies on ovalbumin, beta-globin, immunoglobulin, rRNA and tRNA genes soon demonstrate that the phenomenon is widespread among eukaryotes and is not limited to messenger RNA; Walter Gilbert coined the terms intron and exon.
Below the timeline: Howard Temin and David Baltimore simultaneously discover reverse transcriptase; Hamilton Smith purified a restriction enzyme (HindII) and first showed that it cuts DNA with a specific sequence; Paul Berg constructed the first recombinant-DNA molecule; the R-loop technique is described; the Southern blotting technique is developed.
In the following months the list of species in which introns were observed grew
quickly and introns were found to be present in the precursors not only of mRNA but
also of ribosomal (rRNA) and transfer (tRNA) RNA. It soon became clear that in
eukaryotes genes with introns were not the exception but the rule.
First impressions
Remarkably, introns were immediately assumed to have a function. Very early on,
just as the first examples in eukaryotes were found, and even before it was known
for sure that introns are transcribed, researchers postulated that introns could have
regulatory functions, including determining chromatin conformation during the
control of transcription (Williamson 1977), and regulating protein synthesis after
transcription (Marx 1978).
Another early speculation was that introns would be important for the evolution of
the genome. Perhaps the most influential article on this matter was Walter Gilbert’s
“news and views” piece early in 1978 (Gilbert 1978). In just about one thousand
words, Gilbert coins the terms intron and exon, predicts that introns account for far
more DNA than exons and foresees the disappearance of the one gene-one
polypeptide dogma. He also proposes that the presence of introns in genes can
speed up evolution by allowing rearrangements of the coding regions (also proposed
by Rogers, 1978), or by enabling single base pair changes to generate novel proteins
(instead of only changing a single amino acid), due to the deletion or addition of
whole sequences of amino acids, if those mutations occur near the splice sites and
alter the splicing pattern. He continues by speculating that splicing does not need to
be a hundred per cent efficient so that, in his own words, “evolution can seek new
solutions without destroying the old”.
Evolutionary perspective
(Learning to live with introns)
Soon after the discovery of introns it became apparent that, although they had never
been observed in bacteria, they were widespread in eukaryotes. But when and how
introns appeared and why they became so successful in eukaryotic genomes was a
mystery. Three decades later there are many models, several hypotheses, but no
definitive answers.
Four kinds of introns
In the literature (and the remainder of this book) the word intron is frequently used
to refer to the prolific nuclear mRNA spliceosomal introns. There are however three
other less abundant classes of introns, known as group I, group II and tRNA and/or
archaeal introns, which differ in the mechanism by which they are spliced out.
tRNA and archaeal introns
Introns in tRNA, rRNA and mRNA genes of archaea and in tRNA genes in the nucleus
of eukaryotes share a splicing mechanism with a characteristic that sets them apart
from all the other classes of introns: they are spliced by protein enzymes, without
any RNA catalyst (Calvin and Li 2008).
First, a splicing endonuclease excises the intron, probably guided not by sequences in
the RNA, but by RNA structural features, and then a ligase joins the two exons.
Although the ligation reaction differs, the cleavage step is conserved in eukaryotes
and archaea. The similarity of the cleavage reaction, the sequence homology of the
splicing endonucleases and the shared preferential location of the intron in the tRNA
genes all support a common origin for these introns in the two
domains/superkingdoms of cellular organisms (Archaea and Eukaryota) (Lykke-
Andersen et al. 1997).
Organisms from the other domain/superkingdom (Bacteria) have neither this class of
introns nor the splicing endonuclease. In these organisms, introns found in tRNA
genes belong to the group I class of self-splicing introns (Fujishima et al. 2010).
Self-splicing introns
Group I and II introns were originally described in fungal mitochondrial genes (Michel
et al. 1982) but have since been found in mitochondria from other eukaryotes and
also in chloroplast and bacterial genomes. Group I introns are also present in the
nuclear genome of eukaryotes from diverse phyla and, in fact, most of the
approximately 2,900 group I introns described so far are found in rRNA genes in the
nucleus, mainly of fungi. On the other hand, group II introns have been found in a
genus of archaea (Lambowitz and Zimmerly 2004), and most of the approximately 750 group
II introns are found in the chloroplasts of green plants and algae (Cannone et al.
2002).
Introns in these two classes are capable of self-splicing, that is, they can extract
themselves from the RNA molecule without the help of proteins or other RNAs². They
do so by folding themselves into specific three-dimensional structures that bring the
intron-exon junctions into close proximity and allow precisely positioned reactive
groups to perform the splicing reactions³. The folding itself occurs due to the
presence of conserved partially complementary sequence stretches in the RNA
molecules (Alberts et al. 2002, 6).
Group I and group II introns can be distinguished based on their conserved sequences
and secondary structures, on the splicing reaction requirements (group I introns use
a free guanosine, while group II introns use an especially reactive adenine residue in
the intron sequence itself to initiate self-splicing) and on the structure of the released
intron, which has the shape of a lariat in group II introns (Cech and Bass 1986; Vicens and
Cech 2006). These fundamental differences, besides justifying their classification into
separate groups, suggest that the two groups originated independently.
Self-splicing group I introns have very well conserved primary and secondary
structures, which supports the idea that they share a common origin. Their
widespread but sporadic distribution in nature suggested that they may have spread
by horizontal transfer. Phylogenetic analyses confirmed this hypothesis when it was
shown that introns located at homologous gene sites in different organisms tend to
be more closely related than those at heterologous sites within the same organism
(Hoshina and Imamura 2009).
² Nonetheless, in the cell, self-splicing introns are normally aided by proteins that speed up the reaction.
³ Self-splicing introns were actually the first example of RNA molecules with catalytic function. Up to then all known biocatalysts were proteins and RNA was seen simply as the transmitter of genetic information from DNA to protein. The discovery that RNA can also be a biocatalyst earned Sidney Altman and Thomas Cech the 1989 Nobel Prize in Chemistry and made the RNA world hypothesis plausible.
In contrast, the observed distribution of group II introns (mainly in bacteria,
mitochondria and chloroplasts) suggests that they originated in bacteria and have
been retained since the bacterial endosymbionts that gave rise to those organelles. The
few group II introns found in archaea are likely to be the result of
lateral transfer from bacteria (Lambowitz and Zimmerly 2004).
Both groups of introns are still capable of horizontal transfer through homing (a
process by which an intron spreads to a homologous position in an intronless allele)
and reverse splicing, and are thus currently viewed also as mobile genetic elements.
Group II introns, in particular, have been proposed to be ancestors of non-LTR
retrotransposons (Lambowitz and Zimmerly 2004).
About one-third of the introns in each group contain internal open reading frames
(ORFs) that may still code for proteins with endonuclease (group I) and/or reverse
transcriptase activity (group II), which promote their mobility. Interestingly, some of
those genes embedded in the self-splicing introns, particularly homing endonuclease
genes (HEGs), are mobile genetic elements themselves. By inserting into introns
they avoid disrupting host gene function, while the introns in turn see their
mobility increased. What’s more, this intron-HEG relationship seems to have
strengthened during evolution, since some of those intron-encoded proteins have
evolved to function also as maturases that assist in the splicing of their host intron.
Because the ability of self-splicing introns to remove themselves from the RNA
transcript in a precise manner potentially renders them neutral to the host, and so
partially explains their (and their embedded ORFs’) success in spreading to new genes
and new species, this was a change that benefited both partners (Lambowitz and Zimmerly
2004; Haugen et al. 2005; Stoddard 2005).
Spliceosomal introns
Introns have been found in all three domains/superkingdoms of cellular organisms
(Archaea, Bacteria and Eukaryota), in different types of genes (protein, rRNA and tRNA
coding genes) and in various eukaryotic organelles (nucleus, mitochondria and
chloroplast). Each of the previous classes of introns can be found in at least two domains,
types of gene and/or organelles, but spliceosomal introns are only found in nuclear,
protein coding, eukaryotic genes. Yet, they are present in most, if not all, nuclear
eukaryotic genomes characterized to date and are by far the most common class of
introns in these organisms, reaching hundreds of thousands per genome in
vertebrates and plants (Roy and Gilbert 2006).
Contrary to group I and group II introns, spliceosomal introns do not fold into specific
three-dimensional structures and they are completely dependent on both proteins
and other RNAs (which form a large complex that gives them their name: the
spliceosome) for their extraction. Nevertheless, the chemistries of their splicing
reactions are very similar to those of group II introns, with spliceosomal introns being also
released in a lariat structure, and the RNA molecules at the core of the spliceosome
closely resemble a number of critical RNA domains of group II introns (Valadkhan and
Jaladat 2010). Because of the striking similarities between these two classes of
introns it has been proposed that spliceosomal introns evolved from group II introns
(Cech 1986) by the transfer of the splicing ability to other molecules and loss of the
conserved sequences that formed the typical secondary structures. As a
consequence, much more of the intron sequence is left free to diverge and many
more RNAs could be spliced (Alberts et al. 2002, 6).
Introns early vs late (when and where)
Two main theories have been proposed regarding the origin of (spliceosomal⁴)
introns, and they are the object of a long-standing debate.
⁴ As explained before, spliceosomal introns will frequently be referred to simply as introns.
According to the Introns Early (IE) theory, introns were present in the ancestor of
prokaryotes and eukaryotes: the last universal common ancestor (LUCA). In this
ancestor, introns were initially just the genomic regions between genes that coded for
small proteins, which were later concatenated to form modern multi-domain proteins. It
was hypothesized that in this primitive organism the information copying
mechanisms were error prone and, in order to prevent information loss, LUCA’s
genome had to be highly redundant. Therefore, coding sequences would be present
in multiple copies undergoing rapid information decay, and recombination within
introns would enable the joining of functional exon copies. As the fidelity of the
information copying mechanisms improved, introns became less relevant and
were eventually lost in prokaryotes as these evolved towards increased metabolic
economy. In eukaryotes they were retained because they gained new functions (Rodríguez-Trelles
et al. 2006).
This origin of introns in LUCA would avoid the deleterious effect of inserting
functionless sequences into previously continuous genes. Yet, it implies that massive
intron losses occurred independently across all prokaryote lineages.
A more parsimonious explanation, that spliceosomal introns only appeared in
eukaryotes, is defended by the Introns Late (IL) theory. This theory proposes that
some of the many genes transferred to the nucleus from the bacterial endosymbionts
that gave rise to eukaryotic organelles contained self-splicing group II
intron-like elements. In the eukaryotic nucleus these elements spread, and the spliceosome
evolved through the fragmentation of a group II intron (Belshaw and Bensasson
2006).
The debate on whether spliceosomal introns were present in the eukaryote-
prokaryote ancestor, and were then extensively lost, or rather evolved from group II
introns after these invaded the nucleus, and greatly increased in numbers, is still
active (Coulombe-Huntington and Majewski 2007; Basu et al. 2008). What seems to
have reached consensus is that spliceosomal introns and the spliceosome arose
before the most recent common ancestor of living eukaryotes and that since then
introns have been gained and lost to different extents in different lineages, making
it hard to infer the ancestral condition.
Mechanisms of intron gain and loss
Intron loss
Two main models of intron loss have been proposed. The first, genomic deletion, can
remove parts of introns, and sometimes of adjacent coding regions, or, if it occurs by
nonhomologous recombination between short direct repeats at both ends of the
intron, it can excise introns exactly. The second, recombination with a reverse-
transcribed copy of mRNA, will delete one or more adjacent introns in an exact
manner.
Because an mRNA intermediate is needed in the second model, it should mainly
affect genes expressed in germline cells. Additionally, since reverse transcription
occurs from the 3’ end to the 5’ end of the RNA template and often terminates
prematurely, intron loss by this mechanism is predicted to be 3’ biased. Finally,
because recombination can involve regions spanning multiple intron positions,
concerted loss of adjacent introns is expected under the second model. Despite the
different predictions made by the two models of intron loss, results have not been
conclusive on the relative contribution of each mechanism. Some studies have found
concerted loss of adjacent introns and a 5’ bias in intron location in intron-sparse
genes and genomes (as expected if losses are concentrated at the 3’ end), which
supports the model of recombination with reverse-transcribed mRNA. Yet this biased
location of the introns could have resulted from the preferential retention of 5’
introns if they are particularly enriched in functional elements, and many studies,
particularly with intron-rich organisms, find neither pattern (Belshaw and Bensasson
2006; Rodríguez-Trelles et al. 2006; Roy and Gilbert 2006).
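The 3’ bias and the concerted loss predicted by the second model can be illustrated with a minimal sketch. It assumes, purely for illustration, that reverse transcription always starts at the 3’ end of the transcript, stops at a uniformly random point, and that recombination removes every intron covered by the resulting cDNA. The function names, the intron positions and the gene length are hypothetical and do not come from the studies cited above.

```python
import random


def introns_lost(intron_positions, gene_length, cdna_length):
    """Introns removed when a cDNA covering the last `cdna_length`
    nucleotides of the transcript recombines with the gene."""
    cdna_start = gene_length - cdna_length
    return [p for p in intron_positions if p >= cdna_start]


def simulate_loss(intron_positions, gene_length, trials=10_000, seed=0):
    """Count how often each intron is lost when reverse transcription
    starts at the 3' end and stops at a uniformly random point
    (a simplifying assumption, not a measured distribution)."""
    rng = random.Random(seed)
    counts = {p: 0 for p in intron_positions}
    for _ in range(trials):
        cdna_length = rng.randint(0, gene_length)
        for p in introns_lost(intron_positions, gene_length, cdna_length):
            counts[p] += 1
    return counts


# Hypothetical intron positions, in nucleotides from the 5' end of the transcript.
positions = [200, 500, 800, 1100]
counts = simulate_loss(positions, gene_length=1200)
for p in positions:
    print(p, counts[p])
```

Under these assumptions an intron can only be lost together with all introns downstream of it, so the loss counts never decrease towards the 3’ end. Deletion by genomic rearrangement, the first model, would not impose this ordering.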
Intron gain
Five main models have been proposed to explain the origin of new introns.
The most popular one, the intron transposition model, involves the duplication of an
existing intron in a way similar to how group II introns self-propagate. According to
this model, an RNA intron sequence that has been spliced out of a transcript is
reverse-spliced into a new position of the same or a different mRNA. The new intron is
finally inserted into the genome by recombination of a reverse-transcribed copy of the
intron-acquiring transcript with its genomic template. As with the second model of
intron loss described in the previous section, this mechanism should be 3’ biased.
Yet, because the recombination of the reverse-transcribed mRNA carrying the new
intron can at the same time involve loss of neighboring existing introns, it would not
necessarily lead to a bias in intron location towards the 3’ end of genes. According to
this model, though, new introns should show sequence similarity to their source
introns. So far, however, studies that have found new introns resembling older introns
in the same genome are scarce, and the regions showing inter-intron homology are
generally enriched in palindromic repetitive sequences that are also found in
intergenic regions, which suggests that they may instead have resulted from the
spread of transposons with palindromic sequences.
Other models for intron gain include: Transposon insertion, in which a transposable
element inserts into an exonic portion of a gene and is removed from the RNA
transcript by the spliceosome, thus becoming a spliceosomal intron; Tandem
genomic duplication, in which the duplicated region contains cryptic splice signals
within an AGGT sequence, and the two copies of this sequence are recognized by the
spliceosome as the donor and acceptor splice sites, so that splicing restores the
original coding sequence; and Intron transfer among paralogs, in which an
intron-containing paralog transfers a copy of its intron to a paralog previously
lacking an intron at that site through homologous recombination.
Of these four models only the intron transposition and intron transfer mechanisms
ensure that the inserted sequence includes the necessary signals for correct splicing.
Although all of these models can explain current spliceosomal intron proliferation,
none of them can account for how spliceosomal introns first arose, since all require
the existence of a functional spliceosome. Only a fifth model for intron gain,
Conversion of group II introns, includes a mechanism for the origin of the
spliceosome. According to this model, group II introns from organellar genes were
transferred to the nucleus, where they were inserted into previously intronless sites.
With time, their splicing ability was transferred to trans-acting RNAs and other
molecules, with consequent degradation of their internal RNA structure and loss of
their ORFs, rendering them dependent on a common splicing apparatus: the
spliceosome (Rodríguez-Trelles et al. 2006; Roy and Gilbert 2006).
Splicing
(Getting rid of introns)
Most protein coding genes in the nucleus of eukaryotes produce transcripts with
intronic sequences that need to be removed in order to form a functional mRNA
molecule. The process by which they are extracted, pre-mRNA splicing, involves two
consecutive phosphoryl-transfer reactions, known as transesterifications, which join
the two exons and release the intron in the shape of a lariat.
In the first reaction, the 2’-OH of a specific adenine nucleotide in the intron attacks
the 5’ (donor) splice site breaking the sugar-phosphate backbone of the RNA and
thus separating the upstream exon from the intron. In the process the 5’ end of the
intron gets covalently linked to the adenine nucleotide, creating the loop in the lariat.
In the second reaction, the free 3’-OH at the end of the upstream exon attacks the 3’
(acceptor) splice site separating the intron from the downstream exon, and joining
the two exons together. After this second reaction the intron is released in the shape
of a lariat that ultimately gets degraded.
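The two transesterification steps described above can be mimicked on a plain string, as a minimal sketch. The function name `splice`, the 0-based coordinate convention and the toy sequence are assumptions introduced here for illustration only.

```python
def splice(pre_mrna, donor, branch, acceptor):
    """Toy model of the two transesterification steps.

    Assumed (hypothetical) 0-based coordinates:
      donor    - index of the first nucleotide of the intron
      branch   - index of the branch-point adenosine inside the intron
      acceptor - index of the first nucleotide of the downstream exon
    """
    if not (donor < branch < acceptor <= len(pre_mrna)):
        raise ValueError("expected donor < branch < acceptor")
    if pre_mrna[branch] != "A":
        raise ValueError("the branch point must be an adenosine")

    # Step 1: the 2'-OH of the branch-point A attacks the 5' splice site.
    # The upstream exon is freed and the 5' end of the intron is joined
    # to the branch-point A, closing the loop of the lariat.
    upstream_exon = pre_mrna[:donor]
    lariat_loop = pre_mrna[donor:branch + 1]

    # Step 2: the free 3'-OH of the upstream exon attacks the 3' splice
    # site. The exons are joined and the intron leaves as loop + tail.
    lariat_tail = pre_mrna[branch + 1:acceptor]
    downstream_exon = pre_mrna[acceptor:]
    return upstream_exon + downstream_exon, lariat_loop, lariat_tail


# Toy transcript: 6-nt exon, 18-nt intron (GU...AG), 6-nt exon.
pre = "AUGGCC" + "GUAAGUCUAACUUUUCAG" + "GGAUAA"
mrna, loop, tail = splice(pre, donor=6, branch=15, acceptor=24)
print(mrna)        # AUGGCCGGAUAA
print(loop, tail)  # GUAAGUCUAA CUUUUCAG
```

The sketch only tracks which nucleotides end up in which product. It says nothing about how the three positions are found, which is the job of the spliceosome.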
The spliceosome
These splicing reactions are performed by the spliceosome: one of the largest
molecular machines in the cell, a complex assembly of RNA and protein molecules
whose composition and structure change along the splicing process.
As with self-splicing introns, RNA, not protein, plays the main role in splicing.
These RNA molecules, known as snRNAs (small nuclear RNAs), lie at the core of the
spliceosome and both recognize the splice sites and participate in the chemistry of
splicing. In the major form of the spliceosome (the minor form is described
later in this section) there are five snRNAs, named U1, U2, U4, U5, and U6, and each
forms a complex with at least seven protein subunits. Together, an snRNA and its
associated proteins form an snRNP (small nuclear ribonucleoprotein). Including the
proteins that form the snRNPs, over 150 proteins make up the spliceosome in
humans (Alberts et al. 2002, 6; Valadkhan and Jaladat 2010).
This large machine is assembled on the pre-mRNA as its snRNAs find complementary
sequences in the pre-mRNA, the splicing signals.
Splicing signals and the assembly of the spliceosome
There are three main splicing signals: the 5’ splice site, where the upstream exon
ends and the intron starts; the branch site, containing the adenine nucleotide
involved in the first transesterification, which forms the branch point of the lariat
produced by splicing; and the polypyrimidine tract/3’ splice site, at the 3’ end of the
intron, just before the downstream exon (Schwartz et al. 2008).
The spliceosome recognizes them largely by base-pairing between the snRNAs and
conserved sequence motifs in the splicing signals. This recognition is done multiple
times along the process of the spliceosome assembly, as new components join the
ribonucleoprotein complex and replace previously bound molecules, so that the RNA
sequences are checked multiple times before the chemical reaction takes place.
In mammals there are four distinct spliceosomal complexes that vary in their snRNP
and auxiliary protein composition: E, A, B and C (in temporal order).
Early in the spliceosomal assembly pathway the U1 snRNA and U1C, a U1-specific
protein, recognize the 5’ splice site, and the U2 snRNA, together with the U2 auxiliary
factor U2AF, recognizes the branch site and the polypyrimidine tract and 3’ splice site.
At this point, before the use of ATP, the interaction of the U2 snRNP with this region
of the pre-mRNA is loose; this is splicing complex E. With the use of ATP
the association of U2 with this region is remodeled and strengthened, forming
splicing complex A.
In the next step the U4/U6•U5 tri-snRNP joins the spliceosome, giving rise to the B
complex. In this triple snRNP, the U4 and U6 snRNAs are held firmly together by
base-pair interactions that keep U6 in an inactive conformation, and the U5 snRNP is
more loosely associated. Once the tri-snRNP joins the spliceosome, several RNA-RNA
rearrangements break the U4-U6 base pairing, U1 and U4 leave the complex, U2
replaces U4 as the base-pairing partner of U6, and U6 replaces U1 at the 5’ splice
junction as the B complex becomes catalytically active.
After the first transesterification reaction, major structural rearrangements lead to
the formation of spliceosomal complex C. In this step the U5 snRNA forms base-pair
interactions with exon sequences at both the 5’ and 3’ splice sites, bringing the two
exons into close proximity for the second transesterification.
Once the second splicing step is completed, the spliceosome complex disassembles,
the spliced mRNA and the excised intron are released and the spliceosome
components are recycled for further rounds of splicing, closing the spliceosomal cycle
of assembly, catalysis, disassembly and recycling (Valadkhan and Jaladat 2010; Will
and Lührmann 2001).
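The assembly pathway just described can be summarized as an ordered table of complexes and their snRNP content, as a rough sketch. The stage label "B (active)" and the helper name `transitions` are introduced here for illustration; the snRNP sets are a simplified reading of the paragraphs above, not a complete inventory of spliceosomal components.

```python
# (complex, snRNPs present, what characterizes the stage)
ASSEMBLY = [
    ("E", {"U1", "U2"}, "U1 at the 5' splice site; U2 loosely bound, no ATP used"),
    ("A", {"U1", "U2"}, "ATP-dependent remodeling strengthens U2 binding"),
    ("B", {"U1", "U2", "U4", "U5", "U6"}, "the U4/U6.U5 tri-snRNP joins"),
    ("B (active)", {"U2", "U5", "U6"}, "U1 and U4 leave; U6 pairs with U2"),
    ("C", {"U2", "U5", "U6"}, "formed after the first transesterification"),
]


def transitions(stages):
    """Yield (from, to, snRNPs that joined, snRNPs that left)
    for each pair of consecutive stages."""
    for (a, before, _), (b, after, _) in zip(stages, stages[1:]):
        yield a, b, sorted(after - before), sorted(before - after)


for src, dst, joined, left in transitions(ASSEMBLY):
    print(f"{src} -> {dst}: joined={joined} left={left}")
```

Written this way, the E to A and B (active) to C transitions show no change in snRNP content, which reflects that they are driven by remodeling rather than by components joining or leaving.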
The minor form of spliceosome
A small fraction of introns in more complex eukaryotes, such as flies, mammals and
plants, have different conserved splicing motifs and are removed by a distinct
spliceosome. At the core of this spliceosome there are also five snRNPs, of which only
one, the U5 snRNP, is shared by both spliceosomes. The other four, U11, U12, U4atac
and U6atac, are low-abundance snRNPs functionally analogous to the major
spliceosome U1, U2, U4 and U6 snRNPs, respectively, making the same types of RNA-
RNA interactions with the pre-mRNA and with each other as do the major snRNPs.
This functional correspondence between major and minor class snRNPs is reflected in
the similarity of their secondary structures, but not of their nucleotide sequences. It
thus seems that the low-abundance minor snRNPs are not simply a variant of the major
snRNPs and that the similarities reflect analogy rather than homology. In fact, both
models proposed for the origin of these two splicing systems assume they evolved
from self-splicing group II introns but that the differences existed already in the
progenitor of higher eukaryotes. According to one of the models the two
spliceosome types derive from two different group-II-like introns, while the other
model proposes that they evolved in separate lineages that later fused in the
ancestor of higher eukaryotes.
The introns spliced out by this spliceosome are known both as U12-type introns –
due to their dependency on that snRNP for splicing (while introns extracted by the
major form of spliceosome are named U2-type) – and as AT-AC introns – after the
first examples of this class of introns (which turned out not to be representative of
this class) that started with an AT and ended with an AC dinucleotide instead of the
canonical GT-AG. Although these introns are scarce nowadays, with only a few in the
genome of any given species, it is thought that they were much more frequent earlier
in evolution and have been either lost or converted to U2-type introns over time. Yet,
their persistence in homologous genes in highly diverged species and presence in
virtually all of metazoan evolution indicate that they must have an important
cellular function (Patel and Steitz 2003).
Finding the correct pair of splice sites
Even though the conserved sequence motifs in the splicing signals are read multiple
times by different components of the spliceosome – which increases the accuracy of
splice site selection – these motifs are short and degenerate enough that, if splice
site recognition relied on them alone, there would be numerous
splicing errors. The pairing of non-consecutive splice sites, for instance, would lead to
the exclusion of one or more exons from the spliced mRNA, an error known as exon
skipping, and the use of cryptic splice sites (locations in the pre-mRNA whose
nucleotide sequence resembles the one found in true splice sites) would lead to exon
truncation or incorporation of intronic sequence in the mature mRNA.
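How little information the core motifs carry can be seen by counting every span of a sequence that merely starts with the canonical GT and ends with the canonical AG. The sketch below is only illustrative: it works on the DNA alphabet, ignores the branch site and the polypyrimidine tract, and the function name `candidate_introns`, the minimum length of 70 and the random sequence are assumptions, not values taken from the text.

```python
import random


def candidate_introns(seq, min_len=70):
    """Return (start, end) pairs such that seq[start:end] begins with GT
    and ends with AG, i.e. every span compatible with the terminal
    dinucleotides alone. `min_len` is an arbitrary illustrative cutoff."""
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j + 2 for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


# A tiny hand-checkable case: one GT and one AG, eight nucleotides apart.
print(candidate_introns("CCGTAAACAGCC", min_len=4))  # [(2, 10)]

# A random 2,000-nt sequence contains many GT and AG dinucleotides,
# so the number of compatible pairs is large.
rng = random.Random(1)
random_seq = "".join(rng.choice("ACGT") for _ in range(2000))
print(len(candidate_introns(random_seq)))
```

Each of these pairs is a potential cryptic splice-site combination, which is why additional signals and mechanisms are needed to pick the correct one.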
Besides the classical splicing signals there are other cis-acting elements with less
clearly identifiable consensus sequences, found both in introns (ISR, intronic splicing
regulators) and exons (ESR, exonic splicing regulators), which are important for
correct splice site identification. These elements are recognized by SR proteins
(serine- and arginine-rich proteins), hnRNPs (heterogeneous nuclear
ribonucleoproteins) and other proteins, which interact with the spliceosome either
enhancing or silencing splicing (Cartegni et al. 2002).
Two other factors are thought to help in choosing the correct splice site pair: co-
transcriptional assembly of the spliceosome and pairing of the splice sites across an
exon. As with other pre-mRNA processing factors (involved in 5’ end capping and 3’
end polyadenylation) some splicing factors are carried on the RNA polymerase II tail
during transcription and get transferred onto the nascent RNA at appropriate
locations. This way, an snRNP at the donor splice site has only one acceptor splice site
to choose from, because the acceptor sites further downstream have not yet been synthesized
(Alberts et al. 2002, 6).
The second mechanism that helps identify the correct pair of splice sites has been
proposed particularly for large introns. While exon size tends to be fairly uniform
across eukaryotes, with an average of approximately 150 nucleotides, introns tend to
be much longer, typically hundreds to thousands of nucleotides or more, and vary
enormously in size even within a single organism. This makes pairing splice sites
across long introns considerably harder than pairing them across evenly sized
exons. Thus, the exon definition model proposes that splice sites are first
paired across the exons and that consecutive exon units are then paired as the
spliceosome machinery assembles on the intervening intron. The pairing of splice
sites across exons is helped by SR proteins that bind to exonic sequence and help
recruit spliceosomal components and stabilize interactions (Berget 1995; Lim and
Burge 2001; Wang and Burge 2008).
Alternative splicing
What the correct pair of splice sites is can actually change with time and tissue.
The use of different splice site pairs can lead to complete exons being skipped or
included in the mature mRNA, exons being shortened or elongated by the use of
alternative 3’ and/or 5’ splice sites and introns being kept in the processed transcript.
This variation on how a particular RNA transcript is spliced, named alternative
splicing, leads to different parts of the primary transcript being present in the mature
mRNA and can thus generate diverse peptides from a single gene.
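As a minimal sketch of how alternative splicing multiplies the products of one gene, the snippet below enumerates the mature mRNAs obtained when some exons may be either included or skipped. It models exon skipping only, not alternative 3’ or 5’ splice sites or intron retention, and the function name `isoforms` and the toy exon sequences are assumptions made for illustration.

```python
from itertools import product


def isoforms(exons, cassette):
    """Enumerate mature mRNAs obtained by including or skipping each
    cassette exon.

    exons    - ordered list of exon sequences
    cassette - set of indices of exons that may be skipped
    """
    choices = [(True, False) if i in cassette else (True,)
               for i in range(len(exons))]
    return ["".join(e for e, keep in zip(exons, kept) if keep)
            for kept in product(*choices)]


exons = ["AUGGCU", "GAAUUC", "CCGGGA", "UGGUAA"]
variants = isoforms(exons, cassette={1, 2})
for v in variants:
    print(v)
```

With two cassette exons the gene yields four transcripts, and in general n independent cassette exons allow up to 2^n, which is why a modest number of alternative exons can generate a large repertoire.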
Alternative splicing, which may have been present already in early eukaryotes,
gained prominence along eukaryotic evolution: it is more abundant in higher
eukaryotes than lower eukaryotes and occurs in more genes in higher vertebrates
than in invertebrates (Keren et al. 2010). In humans, microarray profiling studies
estimate that about two-thirds of our genes contain one or more alternatively spliced
exons, and studies using high-throughput sequencing, a more sensitive technology,
bring the estimate of alternatively spliced genes to more than 90% (Castle et al.
2008; Pan et al. 2008). This can dramatically increase the number of proteins a
genome is capable of synthesizing.
Some of these genes are constitutively alternatively spliced and the different mRNA
isoforms are present in all the tissues in which that gene is expressed, but the
majority (over 60%) of alternative splicing events are tissue-specific, lending support
to the hypothesis that alternative splicing is a major contributor to phenotypic
complexity in higher vertebrates (Wang et al. 2008).
This flexibility in the pairing of the acceptor and donor splice sites that allows for
alternative splicing is achieved by relying less on the classic splice site motifs, which
tend to be weaker in alternatively spliced exons, and depending more on exonic and
intronic splicing regulators (ESR and ISR, described in the previous section), which
tend to be more conserved in these exons (Keren et al. 2010).
Although it is not clear what portion of alternative transcripts is functional, there is
no doubt that alternative splicing is a highly regulated process, as producing the
wrong transcript in the wrong place or at the wrong time can be deleterious to the
cell.
Why should we care about introns?
While it is still widely debated whether introns flourished in eukaryotes due to
selection for some advantageous trait, like their potential to speed up evolution
initially proposed by Gilbert in 1978, or by a neutral process involving random genetic
drift (Lynch 2006), it is clear that introns now carry out many functions that are under
selection, most of which are probably the result of intron ‘domestication’ by
eukaryotic genomes and thus not the reason for their initial spread.
Boost mRNA quality
DNA rearrangements, frameshifts, nonsense mutations, transcriptional errors or
incorrect splicing can all lead to the production of mRNAs with premature
termination codons (PTCs) that could generate non-functional or deleterious
truncated proteins. Cells, from yeast to human, have an mRNA surveillance
mechanism, known as nonsense-mediated decay (NMD), which targets these
prematurely terminated mRNAs for degradation, thus increasing mRNA quality.
Introns play a role in this process because NMD recognition of PTCs relies on the
spatial relationship between the stop codon and the introns: generally a termination
codon should only occur after all the introns. When introns are removed by splicing,
proteins in the nucleus bind to and thereby mark the exon-exon junctions. If one of
these junctions is found after a termination codon it triggers NMD (Cartegni et al.
2002).
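The positional rule stated above, that a termination codon followed by an exon-exon junction triggers NMD, can be written down as a small check. This is a deliberately simplified sketch: it assumes the reading frame starts at the first nucleotide of the transcript, and it ignores untranslated regions and any distance threshold between the stop codon and the junction. The function names and toy exons are hypothetical.

```python
STOP_CODONS = ("UAA", "UAG", "UGA")


def find_stop_end(mrna):
    """Return the index just past the first in-frame stop codon, or None.
    The frame is assumed to start at position 0."""
    for i in range(0, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i + 3
    return None


def triggers_nmd(exons):
    """True if at least one exon-exon junction lies downstream of the
    first in-frame stop codon of the spliced transcript."""
    stop_end = find_stop_end("".join(exons))
    if stop_end is None:
        return False
    junctions, pos = [], 0
    for exon in exons[:-1]:          # the end of the last exon is not a junction
        pos += len(exon)
        junctions.append(pos)
    return any(j >= stop_end for j in junctions)


normal = ["AUGGCU", "GAAUUC", "UGGUAA"]     # stop codon in the last exon
with_ptc = ["AUGGCU", "UAAUUC", "UGGUAA"]   # premature stop in the second exon
print(triggers_nmd(normal), triggers_nmd(with_ptc))  # False True
```

In the first transcript the only stop codon lies after both junctions, so nothing marks it for decay. In the second, the junction between the second and third exons lies downstream of the premature stop.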
Increase recombination
Linked loci interfere with each other's response to selection (Hill-Robertson effect),
which can lead to the loss of beneficial mutations – since beneficial mutations
occurring in different haplotypes have to compete among each other – and to the
long-term accumulation of deleterious mutations (Muller's ratchet). By breaking
down linkage disequilibrium, recombination increases the efficacy of natural
selection (Felsenstein 1974).
Introns increase the rate of intragenic meiotic crossing over, generally reducing
linkage disequilibrium between adjacent exons, and thus allow for more efficient
selection of mutations within the gene (Duret 2001).
Source of functional diversity
Thanks to alternative splicing, a single gene can encode many proteins. For instance,
in humans, the approximately 24,000 protein-coding genes in the genome are
estimated to produce around 100,000 different proteins (Keren et al. 2010). In fact,
through alternative splicing, a single gene can generate more transcripts than the
number of genes in an entire genome (Graveley 2001).
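A worked example of this last claim is the commonly cited Drosophila gene Dscam. The figures below are not given in the text above; they are the numbers usually reported for its four clusters of mutually exclusive exons, and the comparison assumes a Drosophila gene count on the order of 14,000.

```python
from math import prod

# Mutually exclusive alternatives reported for the four variable exon
# clusters of Drosophila Dscam (values assumed from the wider literature).
dscam_alternatives = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

# One alternative is chosen per cluster, so the counts multiply.
potential_isoforms = prod(dscam_alternatives.values())
print(potential_isoforms)  # 38016
```

If every combination is used, this single gene could give rise to 38,016 transcripts, more than twice the assumed number of genes in the fly genome.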
Many cases of alternative splicing are tissue specific, and the alternative transcript
isoforms are differentially expressed in at least one tissue (Castle et al. 2008; Pan et
al. 2008; Wang et al. 2008), which greatly contributes to organism complexity.
Repositories of functional elements
Introns contain several regulatory elements, highly conserved sequences, and even
other genes.
Many noncoding RNAs, including microRNAs and small nucleolar RNAs (snoRNAs), are
encoded in introns of protein-coding genes. After transcription, the intron removed
by splicing is processed to form these untranslated RNAs that play a role in a number
of cellular regulatory mechanisms (Brown et al. 2008).
Introns also contain about half of the ultraconserved elements found in genes. These
DNA sequences, more than 200 base pairs long and perfectly conserved for more than
85 million years, are thought to play a role in the regulation
of early development (Bejerano et al. 2004; Visel et al. 2008).
Finally, the most common functional elements found in introns are involved in
regulating splicing and transcription. Splicing regulatory elements are essential for
regulating alternative splicing in a developmental and/or cell-type-specific
fashion, as this complexity cannot be achieved by the classical splicing signals alone.
They are also needed to recognize legitimate splice sites in general, particularly in
species with long introns, and thus they must be present in the majority of introns in
species like ours (Cartegni et al. 2002).
As to the elements that regulate gene expression, their presence in introns was
noticed just after introns themselves were discovered (Gruss et al. 1979) when it was
observed that the expression profile of intronless versions of genes differed from that
of the original intron-containing version. It is now known that introns influence many stages
of mRNA metabolism besides splicing, such as transcription, editing and
polyadenylation, nuclear export, translation and mRNA decay, all of which can affect
the expression of a gene (Le Hir et al. 2003).
Interestingly, some of these elements in introns are functional at the DNA level (like
the ultraconserved elements) while others function at the RNA level (noncoding
RNAs, for example), and some of the processes introns help regulate require
elements from both levels. For instance, in regulating transcription, intronic
transcription regulatory elements in the form of cis-acting transcription factor
binding sites, as well as nucleosome-positioning elements (which can regulate
transcription by controlling DNA accessibility), act at the DNA level, while splicing
signals in the introns act after transcription, and thus at the RNA level, and can affect
both transcription initiation and elongation (Le Hir et al. 2003). Also in splicing, both levels
seem to play a role, as it was recently proposed that introns contain pentamers that
disfavor nucleosome binding (Schwartz et al. 2009) and thus help position
nucleosomes preferentially in exons (at the DNA level). This in turn may help exon
recognition and selection in the RNA transcript either by slowing RNA polymerase II
as it reaches the exon and thus facilitating the transfer of splicing factors carried by
the RNA polymerase II tail onto the nascent RNA, or by the interaction of particular
histone modifications on the nucleosomes located in the exons with the splicing
machinery, thus influencing its function (Tilgner et al. 2009; Keren et al. 2010). This
last model has in fact been proposed to explain how alternative splicing patterns
are established during development and cell differentiation in the same way that the
level of activity of a gene is determined: through the epigenetic memory contained in
histone modifications (Luco et al. 2010).
* * *
In summary, introns’ functions start before transcription and do not end with their
removal from the transcript. Despite the diversity of functions attributed to them
so far, it is possible that new, surprising functions remain to be
discovered, as the new-found interest in non-coding sequences continues to bear
fruit.
Given the critical roles introns play in several mechanisms in the cell, it is expected
that selection modulates their evolution. Intron spatial distribution can be under
pressure to maximize NMD (Lynch and Kewalramani 2003), intron size under
selection, for instance, for its effect on recombination (Duret 2001), and intron
sequence influenced by the great variety of regulatory motifs and other functional
elements introns harbor.
This thesis concerns this last level of selection on introns, looking, in the first chapter
of the Results section, at sequence conservation to identify general intronic regions
with higher density and/or higher impact functional elements, and, in the second
chapter, at individual accelerated introns in the human lineage which may set us
apart from other primates.
References
Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. 2002. Molecular Biology of
the Cell. 4th ed. Garland Science.
Aloni Y, Dhar R, Laub O, Horowitz M, Khoury G. 1977. Novel mechanism for RNA maturation: the leader sequences of simian virus 40 mRNA are not transcribed adjacent to the coding sequences. Proc. Natl. Acad. Sci. U.S.A. 74: 3686-3690.
Basu MK, Rogozin IB, Deusch O, Dagan T, Martin W, Koonin EV. 2008. Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues. Mol. Biol. Evol. 25: 111-119.
Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D. 2004. Ultraconserved elements in the human genome. Science. 304: 1321-1325.
Belshaw R, Bensasson D. 2006. The rise and falls of introns. Heredity. 96: 208-213.
Berget SM. 1995. Exon recognition in vertebrate splicing. J. Biol. Chem. 270: 2411-2414.
Berget SM, Moore C, Sharp PA. 1977. Spliced segments at the 5’ terminus of adenovirus 2 late mRNA. Proc. Natl. Acad. Sci. U.S.A. 74: 3171-3175.
Breathnach R, Mandel JL, Chambon P. 1977. Ovalbumin gene is split in chicken DNA. Nature. 270: 314-319.
Brown JWS, Marshall DF, Echeverria M. 2008. Intronic noncoding RNAs and splicing. Trends Plant Sci. 13: 335-342.
Calvin K, Li H. 2008. RNA-splicing endonuclease structure and function. Cell. Mol. Life Sci. 65: 1176-1185.
Cannone JJ, Subramanian S, Schnare MN, Collett JR, D’Souza LM, Du Y, Feng B, Lin N, Madabusi LV, Müller KM, et al. 2002. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics. 3: 2.
Cartegni L, Chew SL, Krainer AR. 2002. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nat. Rev. Genet. 3: 285-298.
Castle JC, Zhang C, Shah JK, Kulkarni AV, Kalsotra A, Cooper TA, Johnson JM. 2008. Expression of 24,426 human alternative splicing events and predicted cis regulation in 48 tissues and cell lines. Nat. Genet. 40: 1416-1425.
Cech TR. 1986. The generality of self-splicing RNA: relationship to nuclear mRNA splicing. Cell. 44: 207-210.
Cech TR, Bass BL. 1986. Biological catalysis by RNA. Annu. Rev. Biochem. 55: 599-629.
Chow LT, Gelinas RE, Broker TR, Roberts RJ. 1977. An amazing sequence arrangement at the 5’ ends of adenovirus 2 messenger RNA. Cell. 12: 1-8.
Coulombe-Huntington J, Majewski J. 2007. Characterization of intron loss events in mammals. Genome Res. 17: 23-32.
Doel MT, Houghton M, Cook EA, Carey NH. 1977. The presence of ovalbumin mRNA coding sequences in multiple restriction fragments of chicken DNA. Nucleic Acids Res. 4: 3701-3713.
Dunn AR, Hassell JA. 1977. A novel method to map transcripts: evidence for homology between an adenovirus mRNA and discrete multiple regions of the viral genome. Cell. 12: 23-36.
Duret L. 2001. Why do genes have introns? Recombination might add a new piece to the puzzle. Trends Genet. 17: 172-175.
Felsenstein J. 1974. The evolutionary advantage of recombination. Genetics. 78: 737-756.
Fujishima K, Sugahara J, Tomita M, Kanai A. 2010. Large-scale tRNA intron transposition in the archaeal order Thermoproteales represents a novel mechanism of intron gain. Mol. Biol. Evol. 27: 2233-2243.
Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M. 2007. What is a gene, post-ENCODE? History and updated definition. Genome Res. 17: 669-681.
Gilbert W. 1978. Why genes in pieces? Nature. 271: 501.
Graveley BR. 2001. Alternative splicing: increasing diversity in the proteomic world. Trends Genet. 17: 100-107.
Gruss P, Lai CJ, Dhar R, Khoury G. 1979. Splicing as a requirement for biogenesis of functional 16S mRNA of simian virus 40. Proc. Natl. Acad. Sci. U.S.A. 76: 4317-4321.
Haugen P, Simon DM, Bhattacharya D. 2005. The natural history of group I introns. Trends Genet. 21: 111-119.
Le Hir H, Nott A, Moore MJ. 2003. How introns influence and enhance eukaryotic gene expression. Trends Biochem. Sci. 28: 215-220.
Hoshina R, Imamura N. 2009. Phylogenetically close group I introns with different positions among Paramecium bursaria photobionts imply a primitive stage of intron diversification. Mol. Biol. Evol. 26: 1309-1319.
Hsu MT, Ford J. 1977. Sequence arrangement of the 5’ ends of simian virus 40 16S and 19S mRNAs. Proc. Natl. Acad. Sci. U.S.A. 74: 4982-4985.
Keren H, Lev-Maor G, Ast G. 2010. Alternative splicing and evolution: diversification, exon definition and function. Nat. Rev. Genet. 11: 345-355.
Kitchingman GR, Lai SP, Westphal H. 1977. Loop structures in hybrids of early RNA and the separated strands of adenovirus DNA. Proc. Natl. Acad. Sci. U.S.A. 74: 4392-4395.
Klessig DF. 1977. Two adenovirus mRNAs have a common 5’ terminal leader sequence encoded at least 10 kb upstream from their main coding regions. Cell. 12: 9-21.
Lambowitz AM, Zimmerly S. 2004. Mobile group II introns. Annu. Rev. Genet. 38: 1-35.
Lewis JB, Anderson CW, Atkins JF. 1977. Further mapping of late adenovirus genes by cell-free translation of RNA selected by hybridization to specific DNA fragments. Cell. 12: 37-44.
Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc. Natl. Acad. Sci. U.S.A. 98: 11193-11198.
Luco RF, Pan Q, Tominaga K, Blencowe BJ, Pereira-Smith OM, Misteli T. 2010. Regulation of alternative splicing by histone modifications. Science. 327: 996-1000.
Lykke-Andersen J, Aagaard C, Semionenkov M, Garrett RA. 1997. Archaeal introns: splicing, intercellular mobility and evolution. Trends Biochem. Sci. 22: 326-331.
Lynch M. 2006. The origins of eukaryotic gene structure. Mol. Biol. Evol. 23: 450-468.
Lynch M, Kewalramani A. 2003. Messenger RNA surveillance and the evolutionary proliferation of introns. Mol. Biol. Evol. 20: 563-571.
Marx JL. 1978. Gene structure: more surprising developments. Science. 199: 517-518.
Marx JL. 1977. Viral messenger structure: some surprising new developments. Science. 197: 853-923.
Michel F, Jacquier A, Dujon B. 1982. Comparison of fungal mitochondrial introns reveals extensive homologies in RNA secondary structure. Biochimie. 64: 867-881.
Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. 2008. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40: 1413-1415.
Patel AA, Steitz JA. 2003. Splicing double: insights from the second spliceosome. Nat. Rev. Mol. Cell Biol. 4: 960-970.
Perry RP. 1976. Processing of RNA. Annu. Rev. Biochem. 45: 605-630.
Rodríguez-Trelles F, Tarrío R, Ayala FJ. 2006. Origins and evolution of spliceosomal introns. Annu. Rev. Genet. 40: 47-76.
Rogers J. 1978. Genes in pieces. New Scientist. 5 January: 18-20.
Roy SW, Gilbert W. 2006. The evolution of spliceosomal introns: patterns, puzzles and progress. Nat. Rev. Genet. 7: 211-221.
Sambrook J. 1977. Adenovirus amazes at Cold Spring Harbor. Nature. 268: 101-104.
Schwartz SH, Silva J, Burstein D, Pupko T, Eyras E, Ast G. 2008. Large-scale comparative analysis of splicing signals and their corresponding splicing factors in eukaryotes. Genome Res. 18: 88-103.
Schwartz S, Meshorer E, Ast G. 2009. Chromatin organization marks exon-intron structure. Nat. Struct. Mol. Biol. 16: 990-995.
Stoddard BL. 2005. Homing endonuclease structure and function. Q. Rev. Biophys. 38: 49-95.
Tilgner H, Nikolaou C, Althammer S, Sammeth M, Beato M, Valcárcel J, Guigó R. 2009. Nucleosome positioning as a determinant of exon recognition. Nat. Struct. Mol. Biol. 16: 996-1001.
Valadkhan S, Jaladat Y. 2010. The spliceosomal proteome: at the heart of the largest cellular ribonucleoprotein machine. Proteomics. 10: 4128-4141.
Vicens Q, Cech TR. 2006. Atomic level architecture of group I introns revealed. Trends Biochem. Sci. 31: 41-51.
Visel A, Prabhakar S, Akiyama JA, Shoukry M, Lewis KD, Holt A, Plajzer-Frick I, Afzal V, Rubin EM, Pennacchio LA. 2008. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat. Genet. 40: 158-160.
Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. 2008. Alternative isoform regulation in human tissue transcriptomes. Nature. 456: 470-476.
Wang Z, Burge CB. 2008. Splicing regulation: from a parts list of regulatory elements to an integrated splicing code. RNA. 14: 802-813.
Will CL, Lührmann R. 2001. Spliceosomal UsnRNP biogenesis, structure and function. Curr. Opin. Cell Biol. 13: 290-301.
Williamson B. 1977. DNA insertions and gene structure. Nature. 270: 295-297.
Results
“All models are wrong but some are useful.”
George E. P. Box, 1979.
Box GEP. 1979. Robustness in the strategy of scientific model building. In Launer RL,
Wilkinson GN, eds. Robustness in statistics. New York: Academic Press. p. 201-236.
PUBLICATION I
Intronic mutational constraints in Primates
Olga Fernando¹,², Arcadi Navarro¹,³,⁴
¹ Institut de Biologia Evolutiva (CSIC-UPF), Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Barcelona, Spain.
² Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Oeiras, Portugal.
³ National Institute for Bioinformatics, Universitat Pompeu Fabra, Barcelona, Spain.
⁴ Institució Catalana de Recerca i Estudis Avançats (ICREA), Catalonia, Spain.
[Submitted]
The author of the thesis collected the data, performed the analyses and drafted the
manuscript.
ABSTRACT
Introns are known to contain a variety of functional elements, the most common
being related to splicing and transcription. Many of them are present at variable
locations within the intron, have sequence motifs with low information content, and
act in a context-dependent way, which hinders their identification and
characterization. In the present study we look at the frequency of substitutions along
human-chimpanzee-macaque orthologous introns in order to define regions in which
these elements are more likely to occur. We find a clear sign of the core splicing
elements present in the first and last few base pairs of introns, but also a significant
signal of the presence of other conserved elements, most likely related to splicing, up
to 400 bp from the closest splice site. We show that first introns, defined as the 5’-
most intron in the gene, form a separate class with a distinct substitution pattern and
biological role. In these introns conservation extends for several kilobases from the
donor splice site, most likely due to the presence of elements involved in
transcription. The regions here described can be used for defining target regions
when studying functional elements present in introns (either computational scans of
over-represented motifs or functional experiments), and for selecting intronic
regions in studies using introns as neutrally evolving sequences, from which these
more conserved regions should be excluded.
INTRODUCTION
Although the first sequence motifs involved in splicing were found almost at the
same time as introns themselves (Breathnach et al. 1978), 30 years later we are still a
long way from being able to predict splicing accurately from the DNA sequence alone
(Guigó et al. 2006). This is partly because the relatively easy-to-identify core
splicing signals – 5’ splice site, branch site, polypyrimidine tract (PPT), and 3’ splice
site – contain only about half of the information necessary to locate even short
human introns (Lim & Burge 2001). Much of the remaining information is
expected to come from a large variety of short cis-acting sequence elements that
are much harder to identify.
These splicing regulatory elements (SREs) are located at variable distances from
splice sites (SSs) in both introns and exons, and enhance or inhibit splicing in a
context dependent way (i.e. the same element can act as an enhancer or an inhibitor
depending on its location) (Wang & Burge 2008). This complex regulation of splicing,
together with the low information content of their motifs, makes it hard to locate SREs
accurately, despite their high frequency in human genes (Fairbrother et al. 2002).
Defining regions in which these elements are more likely to occur would facilitate
their study with both experimental approaches and computational screens for
overrepresented motifs.
The presence of functional elements should affect sequence conservation, which in
turn could be used to predict regions where they are more likely to be found. In this
study, we take advantage of levels of conservation along primate introns to locate
highly conserved regions that are more likely to be of functional relevance.
We focus on introns because little attention has been given to intronic SREs in
comparison with exonic elements (Sorek and Ast 2003) and, more importantly,
because introns may contain a higher proportion of SREs, just as they contain the
great majority of sequence information at splice junctions (Stephens and Schneider
1992).
Nonetheless, introns contain other functional elements besides splicing-related
sequences that can also affect conservation. Transcriptional regulatory elements are
common, mainly in first introns (Majewski and Ott 2002), and recently it has been
proposed that introns also contain sequences that help position nucleosomes
preferentially in exons (Schwartz et al. 2009). Thus, the patterns we obtain will also
reflect the presence of these and possibly other unidentified functional elements in
introns. Additionally, knowing which regions within introns have higher probability of
containing functional elements is also of extreme importance for population
genetics, historical inference, and other studies that use introns as neutrally evolving
sequences (Hare and Palumbi 2003).
MATERIALS AND METHODS
Genomic Sequences and Gene Annotations
Whole genome DNA sequences for human (hg18), chimpanzee (panTro2) and
macaque (rheMac2), together with chimpanzee and macaque sequence quality
scores, were downloaded from the UCSC Genome Browser
(http://genome.ucsc.edu/).
Human gene annotations and one-to-one orthology information were obtained from
Ensembl (http://www.ensembl.org/) release 48.
Gene Alignments
Full sequences of genes with at least one intron in the human gene annotation were
extracted from the corresponding chromosome sequence file of each species
according to the one-to-one orthology information. Nucleotides with quality scores
of less than 40 were masked in the chimpanzee and macaque gene sequences, which
leaves a high confidence sequence with an error rate of less than 1/10,000. A three-
species alignment was then produced with TBA (Blanchette et al. 2004) for each
gene.
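The masking step can be illustrated with a minimal sketch. It assumes that the quality scores are Phred-scaled, so that a score Q implies an error probability of 10^(-Q/10) and a threshold of 40 corresponds to one error in 10,000 bases. The function names and the use of N as the masking character are illustrative assumptions, not the pipeline actually used in this study.

```python
def phred_error_probability(q):
    """Error probability implied by a Phred-scaled quality score q (assumed scale)."""
    return 10 ** (-q / 10)


def mask_low_quality(sequence, qualities, min_quality=40, mask_char="N"):
    """Replace every base whose quality score is below min_quality with mask_char."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    return "".join(
        base if q >= min_quality else mask_char
        for base, q in zip(sequence, qualities)
    )


# Hypothetical six-base example: the second and fourth bases fall below 40.
masked = mask_low_quality("ACGTAC", [50, 39, 40, 12, 60, 45])
```

Masking rather than deleting low-quality bases keeps the coordinates of each gene sequence unchanged, which is convenient for the alignment step that follows.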
Data Filtering
For those genes with multiple transcripts the transcript with highest exon coverage
(that is, the one with the longest sequence resulting from the concatenation of all its
exons) was chosen to represent the gene. As a further measure to ensure that each
locus is present only once in the final dataset, overlapping genes were excluded from
the analysis.
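The choice of a representative transcript can be sketched as follows. The data layout (a dictionary mapping hypothetical transcript identifiers to lists of half-open exon intervals) is an assumption made for illustration only; ties are resolved here in favor of the first transcript encountered, which the text does not specify.

```python
def exon_coverage(exons):
    """Total length of the concatenated exons, given (start, end) pairs, end exclusive."""
    return sum(end - start for start, end in exons)


def representative_transcript(transcripts):
    """Return the id of the transcript with the longest concatenated exon sequence."""
    return max(transcripts, key=lambda tid: exon_coverage(transcripts[tid]))


# Hypothetical gene with two annotated transcripts.
transcripts = {
    "tx_a": [(0, 100), (200, 350)],              # 250 bp of exons
    "tx_b": [(0, 100), (200, 300), (400, 520)],  # 320 bp of exons
}
chosen = representative_transcript(transcripts)
```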
In order to avoid possible annotation errors, genes with incorrect splice sites, coding
sequences (CDS) whose length is not a multiple of three, no start or stop codon,
nonsense mutations, or introns smaller than 20 bp⁵ were excluded. Additionally, in
genes suspected to have incomplete annotation because they are missing a 5’ or 3’
UTR, the first or last intron of the gene, respectively, was excluded to avoid possible
misclassifications in the first, last and single intron classes.
Finally, introns whose aligned chimpanzee or macaque sequence contained more
than 50% of Ns and/or gaps were excluded as a measure to avoid possible false
orthology, leaving 9,106 genes with 74,756 introns for analysis.
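The last filter can be sketched as below. The sketch assumes that each intron is available as a pair of aligned chimpanzee and macaque strings in which masked bases appear as N and alignment gaps as "-"; the function names are hypothetical. An intron is kept when neither sequence exceeds the 50% threshold.

```python
UNINFORMATIVE = set("Nn-")


def uninformative_fraction(aligned_seq):
    """Fraction of alignment columns that are N or gap characters."""
    if not aligned_seq:
        raise ValueError("empty alignment")
    return sum(ch in UNINFORMATIVE for ch in aligned_seq) / len(aligned_seq)


def keep_intron(chimp_aln, macaque_aln, max_fraction=0.5):
    """Keep an intron unless either non-human sequence exceeds max_fraction."""
    return all(
        uninformative_fraction(seq) <= max_fraction
        for seq in (chimp_aln, macaque_aln)
    )


kept = keep_intron("ACGT--NN", "ACGTACGT")     # exactly 50% uninformative: kept
dropped = keep_intron("A---NNNN", "ACGTACGT")  # 87.5% uninformative: excluded
```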
Data Analysis and Plotting
We studied introns on a position-by-position basis. Each position along an intron was
labeled with the distance of that nucleotide from the closest splice site (SS). The total
number of introns in which that nucleotide was present in our alignments was
counted (alignment columns with Ns or gaps were deemed uninformative) and the
percentage of introns in which the sequence of at least one of the other species differed
from the human sequence at that position was measured. That percentage is an inverse
estimate of the degree of conservation of each nucleotide along an intron: the lower
the percentage, the more conserved the position.
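A minimal sketch of this position-wise count is given below. It assumes that each intron is supplied as three aligned strings (human, chimpanzee, macaque), that positions are indexed along the human sequence, and that distances from the acceptor SS are stored as negative numbers. Assigning the central position of odd-length introns to the donor side is an arbitrary choice made for the sketch.

```python
from collections import defaultdict

UNINFORMATIVE = set("N-")


def signed_distance(pos, length):
    """1-based distance to the closest splice site; negative on the acceptor side."""
    from_donor = pos + 1
    from_acceptor = length - pos
    return from_donor if from_donor <= from_acceptor else -from_acceptor


def substitution_profile(alignments):
    """Map signed distance -> (informative introns, introns with a substitution)."""
    counts = defaultdict(lambda: [0, 0])
    for human, chimp, macaque in alignments:
        length = sum(1 for base in human if base != "-")
        pos = 0  # 0-based position along the human intron
        for h, c, m in zip(human.upper(), chimp.upper(), macaque.upper()):
            if h == "-":
                continue  # column absent from the human intron
            if not {h, c, m} & UNINFORMATIVE:
                cell = counts[signed_distance(pos, length)]
                cell[0] += 1
                cell[1] += (c != h) or (m != h)
            pos += 1
    return {d: (n, s) for d, (n, s) in counts.items()}


def percent_substituted(profile):
    """Percentage of informative introns with a substitution at each distance."""
    return {d: 100.0 * s / n for d, (n, s) in profile.items()}


# Two hypothetical toy introns (human, chimpanzee, macaque).
alignments = [("ACGTAC", "ACGTAC", "ATGTAC"), ("ACGT", "ACNT", "ACGA")]
profile = substitution_profile(alignments)
percent = percent_substituted(profile)
```

In this toy example the second position from the donor SS is informative in both introns and substituted in one of them, so its percentage is 50.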
Fisher's exact test was performed with the R (R Development Core Team 2009)
function fisher.test, and the resulting estimates of the odds ratio and p-value under a
two-sided alternative hypothesis were used to produce Figure 4 and Figure 6. The
number of substitutions observed in a given window of size k is simply the sum of the
number of substitutions observed for each of the k nucleotides in that window. To
account for multiple comparisons resulting from testing several windows on the
same intron classes, p-values were conservatively adjusted using the Bonferroni
correction. The significance thresholds for the 50, 100 and 500 bp window analyses
were, respectively, 0.05, 0.01 and 0.001, to allow for the fact that counts for
wider windows tend to be higher, being the sum of a larger number of observations
(nucleotides), and thus yield smaller p-values.
⁵ 20 bp is approximately the length of the smallest spliceosomal introns described (Gilson and McFadden 1996) and the minimum sequence length containing essential splicing signals (Wieringa et al. 1984).
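The window comparison can be sketched in Python as follows. This is an illustration under stated assumptions, not the R code used in the study: the 2x2 table is assumed to contrast substituted and unsubstituted informative sites in two windows, the two-sided p-value sums the probabilities of all tables no more likely than the observed one, and the odds ratio shown is the simple sample odds ratio, which differs from the conditional maximum likelihood estimate reported by fisher.test. All function names and the toy profile are hypothetical.

```python
from math import comb


def window_counts(profile, start, size):
    """Sum (informative sites, substitutions) over `size` consecutive positions."""
    cells = [profile.get(d, (0, 0)) for d in range(start, start + size)]
    return sum(n for n, _ in cells), sum(s for _, s in cells)


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # A small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-7)))


def sample_odds_ratio(a, b, c, d):
    """Unconditional sample odds ratio (not the conditional MLE)."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)


def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical profile: distance from SS -> (informative sites, substitutions).
profile = {1: (100, 2), 2: (100, 3), 3: (100, 10), 4: (100, 12)}
n1, s1 = window_counts(profile, 1, 2)
n2, s2 = window_counts(profile, 3, 2)
table = (s1, n1 - s1, s2, n2 - s2)
p_adjusted = bonferroni(fisher_two_sided(*table), n_tests=3)
odds = sample_odds_ratio(*table)
```

An odds ratio below 1 indicates that the first window has proportionally fewer substitutions than the window it is compared with.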
Sequence logos (Schneider and Stephens 1990) were created with WebLogo (Crooks
et al. 2004) from intronic sequence aligned at the closest SS.
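The information content displayed in a sequence logo can be approximated with the short sketch below. It computes, for each alignment column, 2 bits minus the Shannon entropy of the observed base frequencies. The small-sample correction of the original method is omitted, and characters other than A, C, G and T are ignored, both simplifying assumptions of this sketch.

```python
from collections import Counter
from math import log2


def information_content(column, alphabet="ACGT"):
    """Information (bits) of one alignment column, without small-sample correction."""
    counts = Counter(base for base in column.upper() if base in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * log2(n / total) for n in counts.values())
    return log2(len(alphabet)) - entropy


def logo_information(sequences):
    """Per-position information for equal-length sequences aligned at a splice site."""
    return [information_content("".join(col)) for col in zip(*sequences)]


# Four hypothetical donor-site sequences aligned at the first intronic base.
bits = logo_information(["GTAA", "GTGA", "GTAC", "GTGG"])
```

Invariant positions reach the maximum of 2 bits, whereas a position with two equally frequent bases carries 1 bit.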
RESULTS
Conservation at the ends of introns extends up to 400 bp
The percentage of substitutions observed in human-chimpanzee-macaque
orthologous introns is shown up to 1 kb from the SSs in Figure 2. A low percentage of
introns with substitutions at a given position implies that the nucleotide at that
location has been conserved along the evolution of the three species in almost all the
introns, independently of what that nucleotide is.
Two general patterns stand out in Figure 2. First, the 3’ and 5’ ends of introns are
approximately symmetrical, except for the ~100 bp closest to the nearest SS. Second,
after a sharp initial increase, the number of substitutions continues to accumulate
steadily up to 400 bp towards the center of introns, where it stabilizes.
Given that the ends of introns contain sequence motifs essential for splicing, and that
these motifs are not equally distributed between the two ends, they could be causing the
asymmetry found in Figure 2. We thus compared sequence conservation across
human introns due to the presence of the 5’ SS, PPT and 3’ SS sequence motifs (right
y-axis in Figure 3, measured in bits of information), with conservation across species
(left y-axis in Figure 3, measured as the percentage of introns without substitutions,
which is the complement of the percentage of introns with substitutions in the y-axis
of Figure 2). Only the human sequence logos are shown in Figure 3, since they are
identical to the chimpanzee and macaque logos and to previously published human
logos (e.g. Stephens & Schneider 1992). Thus, this striking relationship between the
two measures is present in all three species.
Figure 2 Distribution of substitutions in the first and last 1,000 bp of introns. Positions along the intron
are given as a distance from the closest SS, either the donor (red) or the acceptor (blue) SS. The inset
shows a close-up of the outermost 70 bp of introns; grey was used when the two colors overlapped.
To confirm the second pattern drawn from Figure 2, we compared the number of
substitutions observed in consecutive windows along the introns and found that, up
to the expected 400 bp from the closest SS, windows tend to have significantly fewer
substitutions than the next/previous window (Figure 4, “All” intron class).
Figure 3 Conserved motifs in human intron ends and sequence conservation in the three species
(human, chimpanzee and macaque). The total height of each stack of letters corresponds to the amount
of information at that position measured in bits (y-axis on the right). Within each stack letters are sorted
so that the most frequent appear on top, and their height within the stack is proportional to their
relative frequency. Black dashes mark the percentage (y-axis on the left) of introns with the same
nucleotide in the three species (regardless of what the actual base, A, C, T or G, is) in the first ten and
last 30 nucleotides of introns.
First introns have a different substitution profile
Because first introns are reported to have more regulatory elements than other
introns (Majewski and Ott 2002; Keightley and Gaffney 2003) and have been shown
to have different substitution rates from other introns (Gazave et al. 2007), we
looked at their substitution profile separately. Contrary to the pattern seen with all
introns, in first introns, after the sharp increase within the first 50 bp, the number of
substitutions starts dropping until, at around 750 bp from the 5’ SS, it begins to
increase slowly (Figure 5, top panel, and Figure 4, “1st” intron class).
These differences in substitution profiles translate into significant differences
between the two classes of introns (Figure 6, “1st_x_Rest” series). First introns have
on average more substitutions for the first 200 bp, and fewer from that point up to 3.5
kb, although only the first 2.5 kb are significantly different from other introns.
Figure 4 Differences in the number of substitutions in consecutive windows along introns belonging to
several classes. The number of substitutions in 500 bp (top panel), 100 bp (middle panel) and 50 bp windows
(bottom panel) is compared with the subsequent window (left of the dashed vertical line) or previous
window (right of the dashed vertical line). The magnitude of the differences, represented by the odds
ratio (see Materials and Methods), is color coded according to the thresholds indicated in the figure
legend. Windows colored in blue have fewer substitutions than the contiguous window they were
compared to and windows colored in orange have more substitutions. Black borders were drawn around
windows with significant differences according to Fisher's exact test. Windows which could not be
studied (involving short introns) were colored grey, and windows with a mean of fewer than 100 intron
alignments were hatched. Grey polygons between panels emphasize the overlap in the x-axes. As in
previous plots, distance from the acceptor SS is given in negative values. Intron classes: All, all introns in
the study; Long, introns longer than 1455 bp; Short, introns shorter than 1456 bp; 1st, the first intron in
a gene; 1st_long, first intron in a gene if longer than 1455 bp; 1st_short, first intron in a gene if shorter
than 1456 bp; 1st_CDR, first intron in a gene if located in the coding region; 1st_5'UTR, first intron in a
gene if located in the 5’ UTR; CDR'1st_other, the first intron found in the coding region but not first in
the gene; 5'UTR_other, introns in the 5’ UTR other than the first; 4th, the fourth intron in the gene; Last,
last intron in the gene; Single, introns from genes with only two exons. Single introns were not included
in the first or the last intron classes.
At their 3’ end, first introns are not strikingly different from other introns (Figure 5,
top panel, and Figure 6, “1st_x_Rest” series), except for a tendency towards a higher
number of substitutions, which is also present, and more evident, in the central part of
large (> 8 kb long) first introns.
To check that the profile we see in our ‘first introns’ class is not actually characteristic
of coding-region (CDR) first introns – that is, the first intron found after the start
codon, which constitute the majority (73%) of our ‘first intron’ class and could thus
be driving the pattern – we focused on the first introns found in the CDR and
separated them into two groups, depending on whether or not they were also the
first intron in the gene. While CDR first introns that are also gene first introns show
the same pattern as our ‘first intron’ class, CDR first introns that come after the gene
first intron do not (Figure 5, middle panel, and Figure 4, “1st_CDR” and
“CDR'1st_other” intron classes), and there are significant differences between these
two classes (Figure 6, “CDR'1st--1st_x_Other”).
Figure 5 Distribution of substitutions along the first and last 5 Kb of introns. On the top panel introns
were separated into two classes, one with the first introns of genes and the other with the rest of the
introns. On the middle panel only the first introns in the CDR are shown, separated into two classes
depending on whether or not they are also the first intron in the gene. On the bottom panel first introns
of genes are separated according to their location, the 5’UTR or the CDR. Open circles represent the raw
data, one circle each bp, to which a LOESS curve was fitted.
Another possibility was that first introns in the 5’ UTR showed a different pattern
from those in the CDR, perhaps common to all the introns in the 5’ UTR. As shown in
the bottom panel of Figure 5, their substitution pattern is very similar to that of CDR
first introns that are also gene first introns, except for the 3’ end which shows
significantly fewer substitutions (Figure 6, “1st--5'UTR_x_CDR”). Moreover, 5’ UTR first
introns (which, by definition, are also gene first introns) are different from other
introns in the 5’ UTR (Figure 6, “5'UTR--1st_x_Other”).
Short introns evolve faster
As done by other authors (Haddrill et al. 2005; Gazave et al. 2007), we classified
introns as short or long according to the median length of all the introns studied. In
our current dataset, that median was 1,455 bp, which of course differs from the
median in other organisms. As when all introns were considered, in both short and
long intron classes there is an increase in the number of substitutions up to 400 bp
from each SS (Figure 4, “Short” and “Long”), but when compared to each other, short
introns exhibit significantly more substitutions in virtually all comparable windows
along their length (Figure 6, “Short_x_Long”).
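The median-based classification described above can be sketched as follows. This is a minimal illustration in Python, not the code used in the study: the function name and the toy lengths are hypothetical, and placing introns whose length equals the median in the short class is an assumption chosen to agree with the 1,455/1,456 bp boundaries quoted in the legend of Figure 6.

```python
from statistics import median


def split_by_median(lengths):
    """Split intron lengths into short (<= median) and long (> median) classes."""
    threshold = median(lengths)
    short = [n for n in lengths if n <= threshold]
    long_ = [n for n in lengths if n > threshold]
    return threshold, short, long_


# Hypothetical toy lengths in bp, not data from the study.
toy_lengths = [90, 600, 1455, 3200, 12000]
threshold, short, long_ = split_by_median(toy_lengths)
```

Because the threshold is the median of the dataset itself, it has to be recomputed for every dataset and is not transferable between organisms.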
When we divided first introns into long and short based on the same length
threshold, we found that the substitution profile of long first introns is essentially the
same as that of the whole first intron class, but in short first introns there is no clear
pattern (Figure 4, “1st_long” and “1st_short”). Yet, when compared to long first
introns, there is a tendency for short first introns to have more substitutions up to
half of their length and fewer in the second half (Figure 6, “1st—Short_x_Long”).
Figure 6 Differences in the number of substitutions in equivalent windows of distinct classes of introns.
Plot annotations are as in Figure 4. Comparisons: 1st_x_Rest, the first introns in a gene compared with
introns in other positions along the gene; CDR'1st--1st_x_Other, from the first introns found in the
coding region those that are also the first intron in the gene compared with those that are not (1st_CDR
vs CDR'1st_other); 5'UTR--1st_x_Other, from the introns found in the 5’ UTR those that are the first
intron in the gene compared with those that are not (1st_5'UTR vs 5'UTR_other); 1st--5'UTR_x_CDR,
first introns in the gene located in the 5’ UTR compared with those located in the CDR (1st_5'UTR vs
1st_CDR); Single_x_Rest, single introns compared to all the other introns; Single_x_1st, single introns vs
first introns in the gene; last_x_NON_1st, last introns in the gene compared with the other introns in the
gene except first; 4th_x_NON_1st, fourth intron in the gene compared with the other introns in the
gene except first; Short_x_Long, introns shorter than 1456 bp compared with introns longer than 1455
bp; 1st--Short_x_Long, from the first intron in a gene those shorter than 1456 bp vs those longer than
1455 bp.
Other intron classes
The first kilobases in single introns are more conserved than in the rest of the introns
studied, including first introns (Figure 6, “Single_x_Rest” and “Single_x_1st”). In fact,
although few significant differences are found between single and first introns, single
introns do not even show the higher number of substitutions in the initial 200 bp
that is typical of first introns when compared to other introns.
Last introns do not differ significantly from other non-first introns in the gene, as
expected for a random non-first intron, such as, for example, the fourth intron in the
gene (Figure 6, “Last_x_NON_1st” and “4th_x_NON_1st”). Still, last introns longer
than 3 kb seem to accumulate fewer substitutions in their central portion.
DISCUSSION
We looked at intron conservation to find regions where functional elements are
more likely to occur, and found signs of evolutionary constraints up to 400 bp from
both SSs. This distance is considerably longer than that found in many previous reports
that used different methods (e.g. 200 bp in Majewski & Ott 2002), but it is still reasonable
in light of studies on conserved intronic SREs (Yeo et al. 2007), some of which were found
throughout the 400 bp regions.
In these first and last 400 bp, the percentage of introns with substitutions increases
gradually with the distance to the closest SS, except for the SS neighboring
nucleotides where the increase is steep. Looking at intron sequence conservation at
single base pair resolution, we see that this sharp increase and the high conservation
in the first 6 and last 20 bp of an intron are explained by the presence of core splicing
motifs that are shared by the three species (Figure 3).
Due to its variable distance from the 3’ SS, the core splicing motif corresponding to
the branch site is not apparent in our sequence logos. Nevertheless, there is a clear
local decrease in the number of substitutions upstream of, and marginally
overlapping, the PPT motif (inset of Figure 2, and Figure 3) which almost perfectly
coincides with the reported preferential location of the branch site 18-37 nucleotides
upstream of the 3' SS (Green 1986). Likewise, the several SREs, which are also
present at variable distances from the SSs, are expected to increase sequence
conservation at their preferred locations.
Accordingly, we interpret the slow increase in the number of substitutions following
the core splicing signals as the result of a gradual decrease in the combined
frequency of distinct SREs. In fact, both SREs (Majewski and Ott 2002) and intronic
sequences disfavoring nucleosome binding (Schwartz et al. 2009) are expected to
have higher frequency close to the SSs. Two not necessarily mutually exclusive
scenarios can explain the observed pattern. If the majority of the motifs decrease in
frequency with the distance to the SS this would produce the gradual decrease in
conservation we found. Alternatively, the same result can be obtained if different
SREs have a frequency peak at different distances from the SS but there is a negative
correlation between the distance to the SS and the number of SREs that peak at that
distance.
The 5’ end of first introns is an exception to this 400 bp rule. In the intron closest to
the transcription start site, the first 2.5 kb are significantly more conserved than the
corresponding region in other introns. The fact that these are the intronic regions
closest to the start of transcription immediately suggests a role, not in splicing, but
instead in transcription regulation. In fact, first introns are known to be enriched in
transcriptional regulatory elements, especially in their 5’ end (Majewski and Ott
2002). Thus, according to our data, cis-regulatory elements involved in transcription
are frequent in primate introns up to 2.5 kb from the 5’ SS, a distance similar to that
found by Keightley & Gaffney (2003) comparing rat and mouse.
There is some confusion in the literature on what the term ‘exon’ refers to. The word
was first used to name the regions left after the removal of introns (Gilbert 1978),
but it has since also been used as a synonym for coding sequence (Zhang 2002).
The latter usage fails to account for exons in UTRs, with implications for what is called
the first intron. According to our data showing that first introns, defined as the 5’-most
intron in the gene, form a class with a distinct substitution pattern, the original
definition of exon makes more sense from a biological point of view.
We classified introns into short or long based on the median intron length. The
conservation up to 400 bp from each SS is present in both classes, suggesting that the
same mechanism is used to recognize short and long introns. At first sight this might
seem unexpected, as short introns are thought to be spliced via an “intron definition”
and long introns via an “exon definition” mechanism (McGuire et al. 2008; Lim &
Burge 2001). However, our threshold length is somewhat artificial and, if two such
classes of introns do exist in human genes, the threshold is likely to be much
lower (less than 134 bp; Lim & Burge 2001). This would mean that the majority of
introns in our short intron class actually function as long introns, which would explain
the lack of difference between these two classes in this respect.
Among first introns, those that belong to the short intron class do not exhibit the
substitution pattern typical of first introns. Since first introns tend to be longer
(Hawkins 1988), it is possible that our short first intron class is enriched with introns
misclassified as first in the gene, despite our efforts to identify genes with incomplete
annotation. Still, true short first introns will not display the substitution pattern
described for all first introns, since the extent of conservation at the 5’ end is almost
twice the length of the longest intron in the short class.
Finally, we find that short introns evolve faster than long introns, both within and
outside the extreme-most 400 bp. Besides primates (in the present study), rodents
(Gaffney and Keightley 2006), Drosophila (Haddrill et al. 2005) and rice (Guo et al.
2007) also show higher conservation in longer introns, which seems to indicate that
this is a general trend among eukaryotes. A simple explanation could be that shorter
introns need fewer regulatory motifs to be correctly removed by splicing. Additionally,
long introns may harbor a higher number of other regulatory motifs not necessarily
related to splicing, such as the multispecies conserved sequence (MCS) elements
found mainly in longer introns by Sironi et al. (2005).
Overall, introns contain a variety of functional elements that constrain their evolution.
Some elements are present in all introns (splicing related) while others are present
only in some, such as transcriptional regulatory elements, present mainly in first
introns, and a great variety of genes for non-coding RNAs, encoded in the occasional intron. By
pooling introns together, our method detects mainly elements shared by many of
those introns, which produce general trends of sequence conservation. This
information is useful for defining target regions when studying functional elements
present in introns, but also for selecting intronic regions in studies using introns as
neutrally evolving sequences.
Based on this assumption of neutrality, introns have been used to estimate genetic
distances between species (Castresana 2002), estimate the neutral rate of nucleotide
substitution (Hoffman and Birney 2007) and detect positive selection in exons (Resch et
al. 2007; Ke et al. 2008), among other uses. Many of these studies recognized the existence
of conserved regions in introns and excluded them from the rest of the analysis. Yet,
according to our study, they greatly underestimated the length of those regions, thus
failing to exclude a large portion of constrained sequence.
CONCLUSIONS
We find that sequence constraints at the 5’ and 3’ ends of introns in primates extend
further than was found in most previous reports, up to 400 bp from each
splice site in most introns and for several kilobases from the donor splice site in first
introns. Knowing the extent of these regions is crucial for studies using introns as
neutrally evolving sequences, since including these regions can lead to wrong
estimates of the neutral mutation rate and generate false positives in tests of
positive selection. Because these regions are also the most likely location of intronic
regulatory sequences, involved, for instance, in splicing and transcription regulation,
our results are also relevant for defining target regions when studying functional
elements present in introns and for interpreting results of association studies when
the phenotype-causing variant is found in introns past the core splicing signals.
ACKNOWLEDGMENTS
OF was supported by a PhD fellowship (SFRH/BD/15856/2005) from the Fundação
para a Ciência e a Tecnologia (Portugal). Financial support was provided by the
Spanish Ministry of Science and Innovation (Grant BFU2009-13409-C02-02 to AN) and
the Spanish National Institute for Bioinformatics (INB, www.inab.org).
REFERENCES
Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14: 708-715.
Breathnach R, Benoist C, O’Hare K, Gannon F, Chambon P. 1978. Ovalbumin gene: evidence for a leader sequence in mRNA and DNA sequences at the exon-intron boundaries. Proc. Natl. Acad. Sci. U.S.A. 75: 4853-4857.
Castresana J. 2002. Estimation of genetic distances from human and mouse introns. Genome Biology. 3: research0028.1 - research0028.7.
Crooks GE, Hon G, Chandonia J-M, Brenner SE. 2004. WebLogo: a sequence logo generator. Genome Res. 14: 1188-1190.
Fairbrother WG, Yeh R-F, Sharp PA, Burge CB. 2002. Predictive Identification of Exonic Splicing Enhancers in Human Genes. Science. 297: 1007-1013.
Gaffney DJ, Keightley PD. 2006. Genomic selective constraints in murid noncoding DNA. PLoS Genet. 2: e204.
Gazave E, Marqués-Bonet T, Fernando O, Charlesworth B, Navarro A. 2007. Patterns and rates of intron divergence between humans and chimpanzees. Genome Biol. 8: R21.
Gilbert W. 1978. Why genes in pieces? Nature. 271: 501.
Gilson PR, McFadden GI. 1996. The miniaturized nuclear genome of eukaryotic endosymbiont contains genes that overlap, genes that are cotranscribed, and the smallest known spliceosomal introns. Proc. Natl. Acad. Sci. U.S.A. 93: 7737-7742.
Green MR. 1986. Pre-mRNA splicing. Annu. Rev. Genet. 20: 671-708.
Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, et al. 2006. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 7: S2.1-31.
Guo X, Wang Y, Keightley P, Fan L. 2007. Patterns of selective constraints in noncoding DNA of rice. BMC Evolutionary Biology. 7: 208.
Haddrill PR, Charlesworth B, Halligan DL, Andolfatto P. 2005. Patterns of intron sequence evolution in Drosophila are dependent upon length and GC content. Genome Biol. 6: R67.
Hare MP, Palumbi SR. 2003. High intron sequence conservation across three mammalian orders suggests functional constraints. Mol. Biol. Evol. 20: 969-978.
Hawkins JD. 1988. A survey on intron and exon lengths. Nucleic Acids Res. 16: 9893-9908.
Hoffman MM, Birney E. 2007. Estimating the neutral rate of nucleotide substitution using introns. Mol. Biol. Evol. 24: 522-531.
Ke S, Zhang XH-F, Chasin LA. 2008. Positive selection acting on splicing motifs reflects compensatory evolution. Genome Res. 18: 533-543.
Keightley PD, Gaffney DJ. 2003. Functional constraints and frequency of deleterious mutations in noncoding DNA of rodents. Proc. Natl. Acad. Sci. U.S.A. 100: 13402-13406.
Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc. Natl. Acad. Sci. U.S.A. 98: 11193-11198.
Majewski J, Ott J. 2002. Distribution and characterization of regulatory elements in the human genome. Genome Res. 12: 1827-1836.
McGuire AM, Pearson MD, Neafsey DE, Galagan JE. 2008. Cross-kingdom patterns of alternative splicing and splice recognition. Genome Biol. 9: R50.
R Development Core Team. 2009. R: A Language and Environment for Statistical Computing. Vienna, Austria. http://www.R-project.org.
Resch AM, Carmel L, Mariño-Ramírez L, Ogurtsov AY, Shabalina SA, Rogozin IB, Koonin EV. 2007. Widespread positive selection in synonymous sites of mammalian genes. Mol. Biol. Evol. 24: 1821-1831.
Schneider TD, Stephens RM. 1990. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18: 6097-6100.
Schwartz S, Meshorer E, Ast G. 2009. Chromatin organization marks exon-intron structure. Nat. Struct. Mol. Biol. 16: 990-995.
Sironi M, Menozzi G, Comi GP, Bresolin N, Cagliani R, Pozzoli U. 2005. Fixation of conserved sequences shapes human intron size and influences transposon-insertion dynamics. Trends Genet. 21: 484-488.
Sorek R, Ast G. 2003. Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome Res. 13: 1631-1637.
Stephens RM, Schneider TD. 1992. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J. Mol. Biol. 228: 1124-1136.
Wang Z, Burge CB. 2008. Splicing regulation: from a parts list of regulatory elements to an integrated splicing code. RNA. 14: 802-813.
Wieringa B, Hofer E, Weissmann C. 1984. A minimal intron length but no specific internal sequence is required for splicing the large rabbit beta-globin intron. Cell. 37: 915-925.
Yeo GW, Van Nostrand EL, Liang TY. 2007. Discovery and analysis of evolutionarily conserved intronic splicing regulatory elements. PLoS Genet. 3: e85.
Zhang MQ. 2002. Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet. 3: 698-709.
PUBLICATION II
Accelerated evolution in Human introns
Olga Fernando1,2, Arcadi Navarro1
1Institut de Biologia Evolutiva (CSIC-UPF), Departament de Ciències Experimentals i
de la Salut, Universitat Pompeu Fabra, Barcelona, Spain.
2Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Oeiras,
Portugal.
[In preparation]
The author of the thesis collected the data, performed the analyses and drafted the
manuscript.
ABSTRACT
Non-protein-coding regions of the genome contain the majority of the nucleotides
under selection in mammals and have been proposed to harbor a great part of the
differences that separate humans from other hominoids. Within non-protein-coding
regions, introns contain a variety of functional elements which when disrupted can
have dramatic effects. Many of these functional elements are involved in the
regulation of splicing and gene expression and could thus be responsible for some of
the organismal differences between great apes.
We performed a genome-wide scan for introns with evidence of having evolved
under positive/directional selection in the human lineage (PSIs) by performing a
maximum likelihood test using the models described in Haygood et al. (2007), with
chimpanzee and macaque as the background lineages, and found 86 candidate
introns in 83 genes. Analysis of the distribution of these introns along the gene and
comparisons with the results of an independent study of positive selection on
promoter regions indicate that the functional sequences in these fast-evolving
introns are likely to have a role in the control of transcription and gene expression.
Regulation of alternative splicing, on the other hand, does not seem to be a major
source of PSIs. Functional analysis of genes containing these introns did not identify
any particular biological process or molecular function of interest, which can happen
if these sequences in the intron are selected for the effect they have on a neighboring
gene instead of the gene where the intron lies.
INTRODUCTION
Perhaps partially because most of the biochemical methods available at the time of
the first evolutionary studies calculating genetic distances were based on comparing
proteins, much of the attention since then has been dedicated to non-synonymous
variation. Yet, as noticed by King and Wilson already in the 1970s (King and Wilson
1975), genetic distances between humans and chimpanzees seemed too small to
account for all the organismal differences observed between these species, which led
them to propose that most of those differences could be due to changes in the
expression of genes rather than in the sequence of the protein.
Current results, based on DNA sequencing techniques and thus not limited to protein
coding regions of the genome, seem to support a smaller role for protein sequence
changes. For instance, top signals in genome-wide association studies of human
diseases and variable traits often occur at DNA sites that do not encode amino acids
(Lomelin et al. 2010) and, although only around 1.2% of the genome encodes
proteins, the estimated fraction of constrained nucleotides in mammals is 3 to 6
percent, meaning that the majority of these sites under selection do not encode
amino acids (Koonin and Wolf 2010).
Among non-protein-coding sequences, introns are a likely location for a good portion
of these nucleotides since they harbor a variety of functional elements involved in
critical processes such as splicing and gene expression, both processes highly
regulated in the cell.
Incorrect splicing is estimated to account for at least 15% (Krawczak et al. 1992),
considering only changes in canonical splice signals, and up to 50% (Wang and Cooper
2007) of human diseases caused by mutations. This translates into 8% to 27% of human
deaths being the result of mutations that affect splicing (Lynch 2010). Most
of the core splice site motifs (5’ splice site, branch point sequence, polypyrimidine
tract and 3’ splice site) and cis-regulatory elements (both enhancers and silencers)
that regulate splicing are found in the intronic portions of the transcript.
The effect of introns on gene expression was noticed soon after introns themselves
were discovered and they are now known to affect directly or indirectly, in the act of
their removal, almost every step of mRNA metabolism (Le Hir et al. 2003). Through
their regulation of splicing they can suppress gene expression by introducing
premature termination codons that activate nonsense-mediated mRNA decay, a
process that can be quite common since one-third of alternatively spliced transcripts
is estimated to contain a premature termination codon (Wang and Cooper 2007).
Last but not least, the ENCODE project (Birney et al. 2007) revealed that sequences
involved in regulating transcription, such as transcription factor binding sites, are
symmetrically distributed around transcription start sites and can be found thousands
of base pairs away from the transcription start site. This means that a good portion of
the information we usually associate with promoters is actually present in the first
introns of genes.
Another, more surprising, finding of the ENCODE project was that many of the
experimentally found functional elements are not evolutionarily constrained in
mammals and may serve as a reservoir of elements for natural selection to shape in
a lineage-specific way. This would mean that differences between species, some of
them adaptive, would accumulate in regulatory regions, supporting King and
Wilson’s initial proposal.
In the present study we apply a maximum likelihood test, performed using the Null
and Alternative Models described in Haygood et al. (2007) and with chimpanzee and
macaque as the background lineages, to compare rates of evolution along the human
lineage between an intron and nearby putatively neutral intronic sequences. Our aim
is to find introns with fast-evolving sites in the human lineage, since these can contain
regulatory elements under positive selection that could account for part of the
organismal differences between humans and our closest relatives that cannot be
explained by similar studies focused on protein sequence evolution. We are
encouraged by a similar study done on promoter regions, which found evidence for
positive selection in human promoters of neural- and nutrition-related genes
(Haygood et al. 2007), by recent findings that a considerable portion of fast-evolving
regions is located in introns (Pollard et al. 2006; Kim and Pritchard 2007), and by the
classical example of positive selection in human populations for the ability to digest
lactose into adulthood. This lactase persistence trait, lactase being the enzyme that
breaks down lactose into absorbable sugars, results from the continued expression of
its gene, LCT, which would normally become inactive around the age of 12 (Wooding
2007). The mutations responsible for this phenotype eluded researchers for decades
after the mapping of the LCT gene and they were finally found to be located in the
introns of a neighboring gene, MCM6, with unrelated functions (Tishkoff et al. 2007;
Ingram et al. 2009).
MATERIALS AND METHODS
Gene alignments
We downloaded whole genome DNA sequences for human (hg18), chimpanzee
(panTro2) and macaque (rheMac2), and sequence quality scores for chimpanzee and
macaque, from the UCSC Genome Browser (http://genome.ucsc.edu/). Human gene
annotations and one-to-one orthology information were retrieved from Ensembl
(http://www.ensembl.org/) release 48 using BioMart
(http://www.ensembl.org/biomart/).
For all genes with one-to-one orthologs in all three species, and at least one intron
annotated in humans (14,286 genes), the full sequence was extracted from the
corresponding chromosome sequence file in each species. Gene sequences were
then aligned with TBA (Blanchette et al. 2004), after masking all nucleotides with
quality scores of less than 40 (finished sequence standard, comparable to human
(Schmutz et al. 2004)) in the chimpanzee and macaque sequences.
Reference and Test sets
Since our method is based on comparing 'test' introns against carefully selected
neutral intron fragments, we used the annotation of all the human genes in Ensembl
release 48 to produce a list of coordinates of central parts of introns, for the
Reference Set (RS), and a list of coordinates of full introns for the Test Set (TS). We
define the central part of an intron as the part that is left after excluding 400 bp from
each end of the intron and, in the case of first introns, after excluding another 3,100
bp (3,500 bp in total) from the 5’ end (see Figure 7), since these regions tend to be more
constrained (Fernando and Navarro Submitted). From the coordinates for the RS we
removed all positions that were annotated as exons or non-central parts of introns in
other transcripts. After discarding duplicated entries, the list for the RS consisted of
non-overlapping genomic coordinates for strict central parts of introns.
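The trimming rule that defines the central part of an intron can be sketched as follows. This is an illustrative Python sketch, not the code used in the study: the function name, the 0-based half-open coordinate convention and the strand encoding are assumptions, and the removal of positions annotated as exons or non-central intron parts in other transcripts is not shown.

```python
END_TRIM = 400            # bp excluded from each end of every intron
FIRST_INTRON_TRIM = 3500  # total bp excluded from the 5' end of first introns


def central_part(start, end, strand, is_first):
    """Return (start, end) of the central part of an intron, or None if nothing is left.

    Coordinates are assumed to be 0-based, half-open, on the forward strand.
    """
    five_prime = FIRST_INTRON_TRIM if is_first else END_TRIM
    three_prime = END_TRIM
    if strand == "+":
        new_start, new_end = start + five_prime, end - three_prime
    else:
        # On the reverse strand the 5' end of the intron lies at the higher coordinate.
        new_start, new_end = start + three_prime, end - five_prime
    return (new_start, new_end) if new_start < new_end else None
```

Under these assumptions, a non-first intron of 800 bp or less, or a first intron of 3,900 bp or less, contributes nothing to the Reference Set.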
Figure 7 Schematic representation of a portion of the genome. In the upper part of the figure white
boxes represent genes. In the bottom part, a close-up on Gene B, taller boxes represent exons and
shorter ones introns. After removing the intron portions defined in the main text the red intronic
portions remain. These were used to construct the Reference Set.
When several transcripts included the same intron, only one set of coordinates was
kept in the list for the TS, but the information regarding the transcripts containing
that intron was kept. Overlapping introns were kept as long as at least one of the
start or end coordinates was different.
Both lists of coordinates were then filtered to include only coordinates represented
in the gene alignments. To minimize possible false orthologs in the TS, gene
alignments with less than 75% of the CDS aligned in all three species were not used.
In order to avoid false positives, the TS was further filtered to exclude introns
without support from any valid transcript after checking for possible annotation
errors, namely: incorrect splice sites, CDS length not a multiple of three, lack of the start or
the stop codon, presence of nonsense mutations, or introns smaller than 20 bp⁶.
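The kind of annotation checks listed above can be sketched as follows. This is a hypothetical Python illustration and not the procedure used in the study: the function names are invented, sequences are assumed to be uppercase DNA strings, and treating only GT...AG boundaries as correct splice sites is a simplifying assumption that the text does not state.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_INTRON_LENGTH = 20  # bp


def cds_problems(cds):
    """Return a list of annotation problems found in a coding sequence."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of three")
    if not cds.startswith("ATG"):
        problems.append("missing start codon")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("premature stop codon")
    return problems


def intron_problems(intron):
    """Return a list of annotation problems found in an intron sequence."""
    problems = []
    if len(intron) < MIN_INTRON_LENGTH:
        problems.append("intron shorter than 20 bp")
    # Assumption: only the canonical GT...AG boundaries are accepted here.
    if not (intron.startswith("GT") and intron.endswith("AG")):
        problems.append("non-canonical splice sites")
    return problems
```

A transcript would count as valid support for an intron only when both functions return an empty list for all of its coding sequence and introns.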
Each intron left in the TS was extracted from the corresponding gene alignment and
windows of 51 ungapped and unmasked sites with at least 12 differences between
human and chimpanzee or 17 differences between human and macaque were
masked (similar to Haygood et al. 2007). Introns with more than 0.06% of bases masked
in this way, more than 30% gaps, or more than 10% low quality score nucleotides
were excluded, again with the aim of avoiding false positives in our results.
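The divergence masking step can be sketched as a sliding window over the usable alignment columns. The Python function below is only a minimal illustration under stated assumptions: gaps are written as "-" and previously masked bases as "N", every window is evaluated on the original sequences, and the union of all flagged windows is masked. The function name and these conventions are hypothetical; the original implementation may differ in such details.

```python
def divergence_masked_columns(human, chimp, macaque,
                              window=51, min_hc_diffs=12, min_hm_diffs=17):
    """Return the sorted alignment columns that fall in over-divergent windows."""
    if not (len(human) == len(chimp) == len(macaque)):
        raise ValueError("aligned sequences must have the same length")
    unusable = set("-Nn")  # assumption: '-' marks gaps, 'N' marks masked bases
    # Columns that are ungapped and unmasked in all three species.
    usable = [i for i in range(len(human))
              if not {human[i], chimp[i], macaque[i]} & unusable]
    hc = [human[i].upper() != chimp[i].upper() for i in usable]
    hm = [human[i].upper() != macaque[i].upper() for i in usable]
    masked = set()
    # Slide a window of `window` consecutive usable sites along the intron.
    for start in range(len(usable) - window + 1):
        stop = start + window
        if (sum(hc[start:stop]) >= min_hc_diffs
                or sum(hm[start:stop]) >= min_hm_diffs):
            masked.update(usable[start:stop])
    return sorted(masked)
```

The fraction of masked columns returned by such a function is the quantity that would then be compared with the exclusion threshold given in the text.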
A reference sequence alignment was constructed for each intron in the TS by
concatenating all segments in the RS within a 100 kb window centered on that intron
excluding all segments overlapping the intron itself.
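A minimal Python sketch of this selection is given below. It assumes half-open (start, end) coordinates on a single chromosome, takes the window as 50 kb to each side of the intron midpoint, and keeps only RS segments that lie entirely inside the window; these choices, like the function name, are assumptions made for illustration and are not stated in the text.

```python
def reference_segments(intron, rs_segments, window=100_000):
    """Select RS segments for the reference alignment of one intron.

    intron: (start, end) of the tested intron.
    rs_segments: iterable of (start, end) Reference Set segments.
    """
    start, end = intron
    centre = (start + end) // 2
    low, high = centre - window // 2, centre + window // 2
    chosen = []
    for seg_start, seg_end in sorted(rs_segments):
        inside_window = low <= seg_start and seg_end <= high
        overlaps_intron = seg_start < end and start < seg_end
        if inside_window and not overlaps_intron:
            chosen.append((seg_start, seg_end))
    return chosen
```

The sequences of the selected segments would then be concatenated in coordinate order to give the reference alignment for that intron.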
Finally, all columns in the alignments of both the Reference and Test Sets with gaps
or masked bases were removed, and only introns with alignments longer than 20 bp
and corresponding reference alignment longer than 7,000 bp were analyzed.
Footnote 6: 20 bp is approximately the length of the smallest spliceosomal introns described (Gilson and
McFadden 1996) and the minimum sequence length containing essential splicing signals (Wieringa et al. 1984).
Accelerated evolution in Human introns ● 73
Positive selection test
A maximum likelihood test was performed using the Null and Alternative Models
described in Haygood et al. (2007), fitted with HyPhy (Pond et al. 2005) to our introns
in the TS and corresponding reference sequence in the RS. The two models, of single-
nucleotide substitutions, allow for different classes of intron sites, so that the test
can detect positive selection even if it is acting on only a limited number of sites, and
can also distinguish between positive selection and relaxation of negative selection
(accommodated for in the Null Model).
Following the strategy of Haygood et al. (2007), we fitted each model to our data ten
times, starting from random points, to guard against local maxima of the likelihood
function. The likelihood ratio test was done by comparing twice the difference
between the best log likelihoods of the two models with a χ² distribution with one degree
of freedom. Additionally, for each intron in the TS, we constructed 100 bootstrap
replicates over the corresponding reference sequence in the RS. For each bootstrap
replicate we fitted the two models ten times and calculated the P value as described
for the original reference sequence. The median of all P values was then chosen as
the representative P value for that intron.
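The arithmetic of the test can be sketched in a few lines of Python. The model fitting itself was done in HyPhy and is not reproduced here; the sketch only assumes that the log likelihoods of the repeated fits are already available, and the function names and input layout are hypothetical. For one degree of freedom, the upper tail of the χ² distribution at x equals erfc(√(x/2)), which avoids any dependency outside the standard library.

```python
import math
from statistics import median


def lrt_p_value(lnl_null, lnl_alt):
    """P value of the likelihood ratio test with one degree of freedom."""
    stat = 2.0 * (lnl_alt - lnl_null)
    stat = max(stat, 0.0)  # guard against small negative values from optimisation noise
    # Upper tail of a chi-square distribution with 1 df.
    return math.erfc(math.sqrt(stat / 2.0))


def intron_p_value(bootstrap_fits):
    """Median P value over bootstrap replicates of the reference sequence.

    bootstrap_fits: list of (null_lnls, alt_lnls) pairs, one per replicate,
    each holding the log likelihoods of the repeated fits of that model.
    """
    return median(lrt_p_value(max(null_lnls), max(alt_lnls))
                  for null_lnls, alt_lnls in bootstrap_fits)
```

Taking the maximum over the restarts corresponds to using the best fit of each model, and the median summarises the replicates as described above.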
To account for multiple testing, false discovery rate (FDR) Q values were calculated
with the qvalue package in R (R Development Core Team 2009) using the bootstrap
method, and we considered introns to have significant evidence of positive selection
when Q < 0.05.
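To make the idea of an FDR-adjusted value concrete, the Python sketch below computes Benjamini-Hochberg adjusted P values. This is deliberately a simpler stand-in and not a reimplementation of the qvalue package: qvalue first estimates the proportion of true null hypotheses (here with its bootstrap method), whereas the Benjamini-Hochberg adjustment corresponds to assuming that this proportion is one and is therefore more conservative. The function name is hypothetical.

```python
def bh_adjusted(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Introns whose adjusted value falls below 0.05 would be the ones called significant under this simplified procedure.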
Data Analysis and Plotting
Fisher's exact test, Spearman's rank correlation and Mann-Whitney tests were
performed with R (R Development Core Team 2009).
Functional analysis
We used PANTHER’s “Gene Expression Data Analysis” tools (Thomas et al. 2006),
both the binomial statistics tool and the Mann-Whitney U Test tool, and GOstat
(Beissbarth and Speed 2004) and its variant Rank GOstat, to look for statistically over-
and under-represented biological processes, molecular functions, cellular
components and pathways among the genes whose introns were analyzed in this
study. The “Gene Expression Data Analysis” tools use the PANTHER database
(Thomas et al. 2003) while GOstat and its variants use the Gene Ontology (GO)
database (Ashburner et al. 2000) annotations. The multiple testing correction option
was used in all tools.
RESULTS
After applying several filters to control for potential annotation errors and for the
quality of the alignments (see Methods) we were left with 87,631 introns in 17,859
valid transcripts belonging to 8,979 genes, all of which with an associated reference
alignment of at least 7,000 bp coming from less than 50 kb to each side of the intron.
For more than half of the introns the reference alignment contained sequences
coming from at least two different genes.
P values showed a weak correlation with intron length (Spearman's rank correlation
rS = -0.101, two-tailed P << 0.001), but no or very weak correlation with the
percentage of possible indicators of bad sequence or alignment quality, namely gaps
(rS = -0.076, two-tailed P << 0.001), divergence masked bases (rS = -0.003, P = 0.373)
and low quality score nucleotides (rS = -0.056, P << 0.001), or with the length of the
reference alignment (rS = -0.013, P << 0.001). The frequency of GC and of CG-susceptible
sites (Keightley and Gaffney 2003) also showed essentially no correlation with P values
(both rS = -0.007, two-tailed P = 0.043 and 0.049, respectively).
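For readers unfamiliar with the statistic, Spearman's rS is the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The Python sketch below shows the calculation; the analyses reported here were run in R, so this is only an illustration and the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman's rank correlation coefficient of two equal-length lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

With tens of thousands of introns, even coefficients very close to zero can reach nominal significance, which is why the size of rS matters more here than its P value.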
Because genes can have more than one transcript and overlap other genes, some of
the tested introns belong to more than one transcript (or gene) and others overlap to
various degrees. Introns shared between transcripts were tested only once, but
overlapping introns (12,040) were tested for positive selection independently.
Positively selected introns
The likelihood ratio test (LRT) based on the branch-site models described in Haygood
et al. (2007) identified 86 introns with evidence for positive selection in the human
branch (PSIs) after correcting for multiple testing (Q < 0.05; Supplementary Table 1).
These introns are distributed over 83 genes, with three genes containing two PSIs
each.
Since some of the introns tested for positive selection overlap, their results are
expected to be correlated. In fact, considering all 9,549 possible pairs of introns that
overlap, there is a negative correlation between the percentage of overlap and the
absolute difference in HyPhy parameter estimates (such as the transition to
transversion ratio: rS = -0.469, two-tailed P << 0.001), and also the absolute
difference in P values (rS = -0.324, two-tailed P << 0.001) of those introns. Among the
86 PSIs there are two pairs of overlapping introns, each pair belonging to the same
gene. In other words, the PSIs in two of the genes with multiple PSIs overlap.
If for some reason overlapping introns tended to have smaller or larger P values than
non-overlapping introns, our number of PSIs could be overestimated or
underestimated, respectively. This was not the case: the observed number of
overlapping introns with Q < 0.05 was slightly lower than, but not significantly different
from, the expected number (Fisher's exact test, two-tailed P = 0.753), nor was there a
significant difference in Q values between the overlapping and non-overlapping sets
(Mann-Whitney test, two-tailed P = 0.338). Repeating the analysis with introns with P
< 0.05 (4,185 high scoring introns, HSIs), we reached the same conclusions (Fisher's
exact test, two-tailed P = 0.241, and Mann-Whitney test, two-tailed P = 0.192).
First introns have lower P and Q values
Several reports indicate that first introns are enriched in functional elements
(Chamary and Hurst 2004 and references therein) and previous results from our
group (see previous chapter) show that first introns have a distinct conservation
profile. With this in mind we tested if there was an enrichment of first introns in PSIs
or HSIs and if P or Q values are different in first introns compared to other introns.
We repeated this analysis with second introns, for comparison purposes, and other
classes of introns of interest, namely, last introns and introns in UTRs. The results are
summarized in Table 1.
Table 1 Distribution of P and Q values by several classes of introns.
                     HSIs (a)                      PSIs (a)
Class (b)   N        OR (c)     ∆ Mean P (d)       OR       ∆ Mean Q (d)
First       6696     1.23 **    -3.39 × 10⁻² **    1.24     -3.64 × 10⁻³ **
Second      8435     1.06       -1.15 × 10⁻² **    0.96     7.58 × 10⁻⁴
Last        9119     0.98       1.04 × 10⁻³        0.76     4.33 × 10⁻⁴
5’UTR       7412     1.06       -1.23 × 10⁻² **    1.26     -1.07 × 10⁻³
3’UTR       3361     0.90       7.88 × 10⁻³        0.60     3.27 × 10⁻⁴
(a) Introns with P (HSIs) or Q (PSIs) < 0.05 compared to the remaining introns.
(b) First, second and last introns in the gene and introns in the 5’ or 3’ UTRs compared to introns in other locations in the gene.
(c) Odds Ratio. A value larger than one indicates that more HSIs or PSIs were found in that class (see b) than expected. Significant Fisher’s exact tests are marked with asterisks.
(d) Difference between the mean P or Q values in that class of introns and the mean of all the other introns not in that class. Significant Mann-Whitney tests are marked with asterisks.
Asterisks: Fisher or Mann-Whitney two-tailed P < 0.05 (*) or < 0.001 (**).
Contrary to other intron classes, first introns have significantly more HSIs than
expected and lower P and Q values. Second introns and introns in the 5’UTR (the
majority of which are first and second introns in the gene) also have significantly
lower P values.
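The odds ratios and Fisher's exact tests reported in Table 1 are based on 2 x 2 contingency tables (introns in the class or not, with P or Q below 0.05 or not). The Python sketch below shows both calculations for a generic table. It is an illustration only: the tests reported here were run in R, whose fisher.test function reports a conditional maximum likelihood estimate of the odds ratio rather than the simple cross-product ratio used below, and the text does not state which of the two is tabulated. The function names are hypothetical.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample (cross-product) odds ratio of the 2 x 2 table [[a, b], [c, d]].

    Raises ZeroDivisionError if b or c is zero.
    """
    return (a * d) / (b * c)


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table sharing the observed margins.
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / total
             for x in range(lowest, highest + 1)}
    observed = probs[a]
    # Sum the tables that are no more probable than the observed one
    # (the small tolerance absorbs floating point ties).
    p = sum(q for q in probs.values() if q <= observed * (1 + 1e-7))
    return min(1.0, p)
```

Here a and b would be the numbers of introns in the class with and without P (or Q) < 0.05, and c and d the corresponding numbers among all other introns.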
Since the introns being studied can belong to more than one transcript, intron
classification is not always straightforward. The results reported in Table 1 were
obtained by including introns in a given class as long as at least one of its transcripts
supported that classification. We repeated the analysis using a different classification
criterion in which introns were put in a given class only if all the transcripts they
belong to support that decision. The results, in Supplementary Table 2, are very
similar to the ones presented here.
Functional analysis
We used both the PANTHER and GO ontologies to explore the function of the genes
containing PSIs.
In a first approach we used PANTHER’s binomial statistics tool and GOstat to
compare the list of genes with PSIs against the list of the other genes with analyzed
introns. With the PANTHER annotation no term was significantly over- or under-
represented in the group of genes with PSIs after correction for multiple testing.
Using GOstat, 14 biological process terms were significantly overrepresented in the 79
genes with at least one GO annotation out of the 83 genes with PSIs. Eleven of those
terms are parents of two of the significant terms: “positive regulation of
interleukin-10 biosynthetic process” (GO:0045082) and “T-helper cell differentiation”
(GO:0042093). The remaining significant term is “pyrimidine deoxyribonucleotide
metabolic process” (GO:0009219). However, all the significant immunity related
terms contain only two genes (BCL3 and IRF4) and the remaining significant term is
also due to the presence of only two genes (TYMS and DUT).
We thus tried another common strategy which, for each term with analyzed genes,
tests if there is an enrichment in lower or higher P values relative to the overall P
value distribution and is implemented in both PANTHER’s “Gene Expression Data
Analysis” tools and Rank GOstat. In order to do this, each gene must have a single P
value so, in genes with multiple introns, one needs to choose one P value to
represent the gene.
Our first approach was to choose the lowest P value among the introns in the gene,
which resulted in several significant terms both with PANTHER and GO. One problem
with this approach is that genes with more introns analyzed tend to have lower P
values (the median number of analyzed introns per gene in genes with P < 0.05 is 12,
twice the median in other genes; Mann-Whitney test, two-tailed P << 0.001) and the
two variables are strongly correlated (rS = -0.565, two-tailed P << 0.001). The number
of analyzed introns itself is very strongly correlated with the total number of introns
in the gene (rS = 0.872, two-tailed P << 0.001), so that the genes with P < 0.05 are
more intron-rich (median of 17 introns per gene versus 10 in the other genes; Mann-
Whitney test, two-tailed P << 0.001) and there is also a strong correlation between
the number of introns a gene has and its P value (rS = -0.478, two-tailed P << 0.001).
In an attempt to reduce this bias we multiplied each gene P value by the number of
analyzed introns in the gene. This all but removed the correlation between gene P values and
the number of analyzed introns per gene (rS = -0.036, two-tailed P < 0.001), but genes
with smaller P values still have more introns (median of 10 versus 7 analyzed introns
per gene; Mann-Whitney test, two-tailed P << 0.001).
Finally, we corrected the gene P value taking into account the number of analyzed
introns in the gene (N) by sampling N introns, without replacement, 1,000,000 times,
from the total 87,631 introns analyzed, and keeping the smallest of the N sampled P
values. The proportion of times this sampled minimum was smaller than or equal to the
uncorrected gene P value was then used as the corrected gene P value, which is no longer associated
(median 8 versus 8, Mann-Whitney test, two-tailed P = 0.382) or correlated (rS =
0.017, two-tailed P = 0.112) with the number of analyzed introns.
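The resampling correction can be sketched as follows in Python. The sketch reads the corrected value as an empirical P value, that is, the proportion of resampled minima that are at or below the observed gene P value; this direction of the comparison is an assumption of the example. The function name and the default number of draws (far smaller than the 1,000,000 used in the study) are also illustrative choices.

```python
import random


def resampled_gene_p(gene_p, n_introns, all_p_values, draws=10_000, rng=None):
    """Empirical P value for the smallest of n_introns intron P values.

    gene_p: uncorrected gene P value (lowest P among the gene's introns).
    n_introns: number of analyzed introns in the gene (N).
    all_p_values: list with the P values of all analyzed introns.
    """
    rng = rng or random.Random()
    hits = 0
    for _ in range(draws):
        # Draw N intron P values without replacement and keep the smallest.
        sampled_min = min(rng.sample(all_p_values, n_introns))
        if sampled_min <= gene_p:
            hits += 1
    return hits / draws
```

Because every gene is compared with the minimum of exactly as many randomly drawn introns as it contains, intron-rich genes no longer gain an advantage simply from being tested more often.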
With the PANTHER annotation only "Other homeostasis activities" in the "Biological
Process" ontology was marginally significant (P = 0.040) with an enrichment in genes
with lower P values. With the GO annotation "RNA metabolic process" (GO:0016070)
and "regulation of metabolic process" (GO:0019222) in the "Biological process"
ontology showed a marginally significant (P = 0.035) enrichment in genes with higher
P values, and in the "Cellular component" ontology, "intracellular part" (GO:0044424)
and two of its parental terms were enriched for genes with low P values.
Additionally, first introns are enriched in sequences involved in the
regulation of transcription of the gene, so elements under positive selection in
these introns are more likely to affect the gene the intron belongs to than elements
in other introns more distant from the gene’s transcription start site (TSS). We therefore
performed a functional analysis based only on the information from first introns, so that
the gene P value is the first intron’s P value.
Of the 5,271 genes with a first intron analyzed, after correcting for the number of
genes tested by FDR, 11 genes had Q < 0.05. The only significant result with
PANTHER’s binomial statistics tool was an enrichment of genes with Q < 0.05 in the
"De novo pyrimidine deoxyribonucleotide biosynthesis" Pathway, but only 2 of the 11
genes fitted in that category. With GOstat, 65 biological process terms were
significantly enriched in genes with Q < 0.05, including the eleven terms identified
initially using all genes with PSIs. Yet, except for two terms related to nucleotide
metabolic process (GO:0055086 and GO:0009117) which contained the same four
genes, all other significant terms were due to a single gene each. Eight of these 65
terms had also significantly lower P values according to Rank GOstat, but all of them
were due to gene IRF4. Another 15 “Molecular function” GO terms were significantly
enriched in genes with Q < 0.05, all of them again with only one gene, except for
“magnesium ion binding” (GO:0000287) and “pyrophosphatase activity”
(GO:0016462) plus two of its parental terms, with three genes each (two of them
shared by all four terms). In the "Cellular component" ontology we got the same
results as when the resample corrected P values were used. With PANTHER’s Mann-
Whitney U Test tool no term was significantly enriched in higher or lower P values.
Overlap with other non-coding regions under positive selection
We compared our results in introns with those obtained by Haygood et al. (2007) for
promoters on the human branch, since elements that regulate transcription can be
found in both types of non-coding sequences. At first sight positive selection seems
to affect introns and promoters independently, as neither the number of genes with
both introns and promoters under positive selection, nor the number of genes which
have HSIs and also P < 0.05 in the promoter study, are significantly different from that
expected if the two were independent (Fisher's exact test, two-tailed P = 1 and 0.839,
respectively). Yet, when we consider only first introns, there are significantly more
genes with P < 0.05 in both studies than expected by chance (odds ratio = 2.607;
Fisher's exact test, two-tailed P < 0.001).
DISCUSSION
Although, in absolute terms, non-protein-coding regions have more nucleotides in
functional elements compared to protein coding regions, the relative frequency of
these nucleotides is much lower in the former. It is thus not surprising that the
number of PSIs is relatively small considering the number of tested introns. The fact
that the P values were not correlated with the percentage of gaps, low quality, or
divergence masked bases, which could indicate poor sequence or alignment quality,
or with the frequency of GC or CG-susceptible sites, gives us confidence that these
are true PSIs. The weak negative correlation found between P values and intron length
may actually be expected since as more intronic sites are analyzed, more sites under
positive selection may be included. We note though that, due to the presence of
overlapping introns in our test set, our correlation estimates may be inflated.
Since most of our overlapping introns result from alternative splicing, the lack of a
significant difference in the P or Q values between the overlapping and
non-overlapping sets of introns, and between the expected and observed numbers of PSIs and HSIs
in these two sets, indicates that introns involved in alternative splicing events are not
contributing disproportionately to the PSI and HSI classes and, thus, that regulation of
alternative splicing is not a particular target of positive selection in introns.
Our finding that the 5’-most introns in the gene (first, second and 5’ UTR introns)
have significantly lower P values and first introns in particular have also significantly
lower Q values and more HSIs than other introns indicates instead that these fast
evolving intronic sequences are more likely to be involved in the control of gene
expression, as elements involved in the regulation of gene expression are more
frequent in those introns closer to the transcription start site (Majewski and Ott
2002).
The comparison of the results from this study with those from Haygood et al. (2007)
provided additional compelling evidence for the role of the accelerated elements in
first introns in regulating gene expression. In that other study the authors identified
genes whose promoter region upstream of the TSS showed evidence of positive
selection. Since elements involved in regulating transcription are also found
downstream of the TSS, mainly in the first intron, and positive selection on the
regulation of gene expression may act simultaneously on multiple regulatory
elements of the same gene, one might expect to find a significant overlap of genes
with high scoring (P < 0.05) promoters and first introns in particular, which is exactly
what was found.
In order to determine if our PSIs belonged to genes with particular functions we had
to take into account that functional information in the PANTHER and GO databases is
provided per gene but, by studying the genes’ introns, we are testing genes with
multiple introns several times, such that the more introns a gene has, the more likely
it is to contain a PSI and a lower P value. We found that applying a resampling
strategy to correct gene P values effectively cleared both the association between
low gene P values and high number of analyzed introns and the correlation between
these two variables.
Only a few PANTHER and GO terms stood out from our functional analysis, mostly
related to nucleotide metabolism and immunity. Yet, they were all supported by a
very small number of genes and, thus, are not reliable. This lack of association
between the selection in introns and the function of the protein coded by the gene
they are in is consistent with previous observations that the evolution of protein
sequences is decoupled from the evolution of non-protein-coding sequences (Resch
et al. 2007). It is possible that the accelerated elements in PSIs act on a neighboring
gene of unrelated function (Kleinjan and van Heyningen 2005), either close to the
gene containing the PSI, such as in the case of introns in MCM6 affecting the
activation of the LCT gene (Tishkoff et al. 2007), or even a distant gene, as in the case
of intron 5 of LMBR1 which contains a long-range regulatory element of the SHH
gene (He et al. 2008).
ACKNOWLEDGMENTS
We thank Ralph Haygood and Olivier Fedrigo for providing their HyPhy Batch Language scripts
and the HyPhy team for teaching OF how to use their software.
OF was supported by a PhD fellowship (SFRH/BD/15856/2005) from the Fundação
para a Ciência e a Tecnologia (Portugal).
Supplementary Table 2 Distribution of P and Q values by several classes of introns.
                     HSIs (a)                      PSIs (a)
Class (b)   N        OR (c)     ∆ Mean P (d)       OR       ∆ Mean Q (d)
First       5763     1.24 **    -3.66 × 10⁻² **    1.06     -3.70 × 10⁻³ *
Second      5897     1.05       -1.04 × 10⁻² *     0.86     8.32 × 10⁻⁴
Last        8164     0.98       -6.88 × 10⁻⁴       0.60     5.42 × 10⁻⁴
5’UTR       2838     1.08       -1.72 × 10⁻² **    1.08     -1.80 × 10⁻³
3’UTR       388      0.86       3.36 × 10⁻² *      2.65     -1.84 × 10⁻³
(a) Introns with P (HSIs) or Q (PSIs) < 0.05 compared to the remaining introns.
(b) First, second and last introns in the gene and introns in the 5’ or 3’ UTRs compared to introns in other locations in the gene.
(c) Odds Ratio. A value larger than one indicates that more HSIs or PSIs were found in that class (see b) than expected. Significant Fisher’s exact tests are marked with asterisks.
(d) Difference between the mean P or Q values in that class of introns and the mean of all the other introns not in that class. Significant Mann-Whitney tests are marked with asterisks.
Asterisks: Fisher or Mann-Whitney two-tailed P < 0.05 (*) or < 0.001 (**).
REFERENCES
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. 2000. Gene Ontology: tool for the unification of biology. Nat. Genet. 25: 25-29.
Beissbarth T, Speed TP. 2004. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics. 20: 1464-1465.
Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 447: 799-816.
Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14: 708-715.
Chamary J-V, Hurst LD. 2004. Similar rates but different modes of sequence evolution in introns and at exonic silent sites in rodents: evidence for selectively driven codon usage. Mol. Biol. Evol. 21: 1014-1023.
Fernando O, Navarro A. Submitted. Intronic mutational constraints in Primates.
Gilson PR, McFadden GI. 1996. The miniaturized nuclear genome of eukaryotic endosymbiont contains genes that overlap, genes that are cotranscribed, and the smallest known spliceosomal introns. Proc. Natl. Acad. Sci. U.S.A. 93: 7737-7742.
Haygood R, Fedrigo O, Hanson B, Yokoyama K-D, Wray GA. 2007. Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution. Nat. Genet. 39: 1140-1144.
He F, Wu D-D, Kong Q-P, Zhang Y-P. 2008. Intriguing balancing selection on the intron 5 region of LMBR1 in human population. PLoS ONE. 3: e2948.
Le Hir H, Nott A, Moore MJ. 2003. How introns influence and enhance eukaryotic gene expression. Trends Biochem. Sci. 28: 215-220.
Ingram C, Raga T, Tarekegn A, Browning S, Elamin M, Bekele E, Thomas M, Weale M, Bradman N, Swallow D. 2009. Multiple Rare Variants as a Cause of a Common Phenotype: Several Different Lactase Persistence Associated Alleles in a Single Ethnic Group. J. Mol. Evol. http://www.ncbi.nlm.nih.gov/pubmed/19937006 (Accessed November 26, 2009).
Keightley PD, Gaffney DJ. 2003. Functional constraints and frequency of deleterious mutations in noncoding DNA of rodents. Proc. Natl. Acad. Sci. U.S.A. 100: 13402-13406.
Kim SY, Pritchard JK. 2007. Adaptive evolution of conserved noncoding elements in mammals. PLoS Genet. 3: 1572-1586.
King MC, Wilson AC. 1975. Evolution at two levels in humans and chimpanzees. Science. 188: 107-116.
Kleinjan DA, van Heyningen V. 2005. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am. J. Hum. Genet. 76: 8-32.
Koonin EV, Wolf YI. 2010. Constraints and plasticity in genome and molecular-phenome evolution. Nat. Rev. Genet. 11: 487-498.
Krawczak M, Reiss J, Cooper DN. 1992. The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: causes and consequences. Hum. Genet. 90: 41-54.
Lomelin D, Jorgenson E, Risch N. 2010. Human genetic variation recognizes functional elements in noncoding sequence. Genome Res. 20: 311-319.
Lynch M. 2010. Rate, molecular spectrum, and consequences of human mutation. Proc. Natl. Acad. Sci. U.S.A. 107: 961-968.
Majewski J, Ott J. 2002. Distribution and characterization of regulatory elements in the human genome. Genome Res. 12: 1827-1836.
Pollard KS, Salama SR, King B, Kern AD, Dreszer T, Katzman S, Siepel A, Pedersen JS, Bejerano G, Baertsch R, et al. 2006. Forces shaping the fastest evolving regions in the human genome. PLoS Genet. 2: e168.
Pond SLK, Frost SDW, Muse SV. 2005. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 21: 676-679.
R Development Core Team. 2009. R: A Language and Environment for Statistical Computing. Vienna, Austria. http://www.R-project.org.
Resch AM, Carmel L, Mariño-Ramírez L, Ogurtsov AY, Shabalina SA, Rogozin IB, Koonin EV. 2007. Widespread positive selection in synonymous sites of mammalian genes. Mol. Biol. Evol. 24: 1821-1831.
Schmutz J, Wheeler J, Grimwood J, Dickson M, Yang J, Caoile C, Bajorek E, Black S, Chan YM, Denys M, et al. 2004. Quality assessment of the human genome sequence. Nature. 429: 365-368.
Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A. 2003. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 13: 2129-2141.
Thomas PD, Kejariwal A, Guo N, Mi H, Campbell MJ, Muruganujan A, Lazareva-Ulitsky B. 2006. Applications for protein sequence-function evolution data: mRNA/protein expression analysis and coding SNP scoring tools. Nucl. Acids Res. 34: W645-650.
Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat. Genet. 39: 31-40.
Wang G-S, Cooper TA. 2007. Splicing in disease: disruption of the splicing code and the decoding machinery. Nat. Rev. Genet. 8: 749-761.
Wieringa B, Hofer E, Weissmann C. 1984. A minimal intron length but no specific internal sequence is required for splicing the large rabbit beta-globin intron. Cell. 37: 915-925.
Wooding SP. 2007. Following the herd. Nat. Genet. 39: 7-8.
General discussion and conclusions
Discussion and conclusions ● 99
At the time of their discovery, introns were already expected to have a role in
many cellular functions and even in the evolution of genomes (Williamson 1977;
Marx 1978; Gilbert 1978). The three decades that have passed since have confirmed
many of those hypotheses and expanded the repertoire of intronic functions beyond
what was initially imagined.
We now know that introns contain a variety of functional elements and even other
genes. Besides the majority of the core splicing signals, introns also contain
regulatory elements essential for splicing and transcription, which are expected to
shape the evolution of these sequences as targets of negative (purifying) or
positive (directional) selection.
Constraints on the evolution of intronic sequences
Several studies have found that intron nucleotides closer to the splice sites show a
higher degree of conservation, but the reported length of these conserved regions
varies greatly in the literature (Majewski and Ott 2002; Hare and Palumbi 2003; Sorek
and Ast 2003; Kaufmann et al. 2004). Inconsistencies among the different studies are
likely to be the result of differences in the methods used to estimate conservation,
the species studied and the subsets of introns used.
We were interested in determining the length of these constrained regions at the 5’
and 3’ ends of introns in primates because they are the most likely location of
intronic regulatory sequences, and because defining these regions also identifies
the complementary regions, in the middle of the intron, that are most likely
to be evolving neutrally.
To do so, we examined the frequency of substitutions along human-chimpanzee-macaque
orthologous introns as a function of distance from each splice site and found that
sequence constraints extend farther than reported in most previous studies, up to
400 bp from each splice site. In the first (5’-most) intron of the gene,
conservation of the 5’ end extends up to several kilobases from the donor splice site,
most likely due to the presence of regulatory elements involved in transcription,
which tend to be located close to the transcription start site.
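As an illustration only, the idea of measuring substitution frequency as a function of distance from a splice site can be sketched as follows. This is a simplified pairwise version, not the procedure used in this work: the function name `substitution_profile`, the bin size and the maximum distance are hypothetical choices, and the input is assumed to be a pair of already aligned intron sequences starting at the first intronic base.

```python
from collections import defaultdict


def substitution_profile(human, other, bin_size=50, max_dist=1000):
    """Fraction of mismatched aligned sites per distance bin from the 5' splice site.

    `human` and `other` are aligned intron sequences of equal length
    (gaps written as '-'), starting at the first intronic base.
    Distances are counted in ungapped human coordinates.
    """
    if len(human) != len(other):
        raise ValueError("aligned sequences must have equal length")
    diffs = defaultdict(int)  # mismatches per bin
    sites = defaultdict(int)  # comparable sites per bin
    pos = 0
    for h, o in zip(human.upper(), other.upper()):
        if h == "-":
            continue  # insertion relative to human: no human coordinate
        if pos >= max_dist:
            break
        if h in "ACGT" and o in "ACGT":  # skip gaps and ambiguous bases
            b = pos // bin_size
            sites[b] += 1
            diffs[b] += h != o
        pos += 1
    # Keys are the start coordinate of each bin.
    return {b * bin_size: diffs[b] / sites[b] for b in sorted(sites)}
```

To profile the 3’ end instead, both aligned sequences can be reversed before the call, so that distance is counted from the acceptor splice site. Averaging such profiles over many introns would show where the substitution frequency levels off.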
Knowing the extent of these regions is useful for defining target regions when
studying functional elements present in introns (through either computational scans
for over-represented motifs or functional experiments). It is also useful for
selecting intronic regions in studies that use introns as neutrally evolving
sequences, for example to estimate genetic distances between species or to detect
positive selection; the more conserved regions should be excluded from such analyses.
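A minimal sketch of how the conserved flanks could be excluded when selecting putatively neutral intronic sequence is given below. The 400 bp default comes from the constraint length reported above; the function name `central_region` and its parameters are hypothetical, and for first introns a much larger 5’ flank would have to be supplied.

```python
def central_region(intron, five_prime_flank=400, three_prime_flank=400):
    """Return the intron interior after trimming both flanks.

    Returns an empty string when the intron is too short to leave
    any sequence between the two trimmed flanks.
    """
    if five_prime_flank < 0 or three_prime_flank < 0:
        raise ValueError("flank lengths must be non-negative")
    end = len(intron) - three_prime_flank
    if end <= five_prime_flank:
        return ""  # flanks cover the whole intron
    return intron[five_prime_flank:end]
```

Introns for which the function returns an empty string contribute no neutral sequence and would simply be left out of the reference set.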
Accelerated evolution of intronic sequences
It has been suggested that the majority of changes that separate humans from their
closest relatives lie in regulatory regions rather than in protein coding sequences, and
it is possible that many of these changes are adaptations. Since introns carry so many
regulatory elements involved in several steps of splicing and transcription control,
they are a promising location for these adaptive changes in different lineages.
We performed a genome-wide scan for introns showing evidence of positive selection
in the human lineage. As the neutrally evolving reference against which we compared
substitution rates in the test introns, we used the central part of introns (after
excluding the constrained regions identified in our previous study).
Traditionally, synonymous sites in protein-coding regions and ancestral repeats have
been used for this purpose, but evidence has been accumulating that selection also
acts on these regions (Lomelin et al. 2010; Hellmann et al. 2003; Hirsh et al. 2005;
Chamary et al. 2006; Imamura et al. 2009; Faulkner and Carninci 2009). Our decision
to use the central portions of introns comes from our observations in the previous
study, from examples of successful use of intronic sequences in independent studies
(Haygood et al. 2007; Parsch et al. 2010; Hoffman and Birney 2007; Resch et al. 2007;
Ke et al. 2008) and from the need to use sequences from the same genomic region as
the sequences being tested to minimize differences in the mutation rate, which can
vary along the genome.
We found evidence for positive selection in 86 human introns, most of which belong
to different genes. Our functional analysis of the genes containing these introns
did not reveal any biological process or molecular function particularly enriched
among them. This is not unexpected if the selected sequences in these introns act
on a neighboring gene of unrelated function, most likely as distant transcriptional
regulatory elements. In fact, there is evidence that many genes require
distant cis-regulatory elements for their correct spatial and temporal expression, and
that these elements can be found up to one megabase from the gene, often
embedded within another gene, generally within its introns, that fulfills a very
different function from the regulated gene (Kleinjan and van Heyningen 2005).
We were still able to infer that transcription regulation is a more likely target of
positive selection in introns than regulation of alternative splicing: overlapping
introns (which mainly result from alternative splicing events) were not particularly
enriched in PSIs, whereas introns closer to the TSS (which are enriched for
transcription regulatory elements), especially the first intron, were. The fact that
genes with fast-evolving promoter regions were also more likely to have fast-evolving
first introns further supports the notion that accelerated elements in first introns
are likely regulating gene expression.
REFERENCES
Chamary JV, Parmley JL, Hurst LD. 2006. Hearing silence: non-neutral evolution at synonymous sites in mammals. Nat. Rev. Genet. 7: 98-108.
Faulkner GJ, Carninci P. 2009. Altruistic functions for selfish DNA. Cell Cycle. 8: 2895-2900.
Gilbert W. 1978. Why genes in pieces? Nature. 271: 501.
Hare MP, Palumbi SR. 2003. High intron sequence conservation across three mammalian orders suggests functional constraints. Mol. Biol. Evol. 20: 969-978.
Haygood R, Fedrigo O, Hanson B, Yokoyama K-D, Wray GA. 2007. Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution. Nat. Genet. 39: 1140-1144.
Hellmann I, Zollner S, Enard W, Ebersberger I, Nickel B, Paabo S. 2003. Selection on human genes as revealed by comparisons to chimpanzee cDNA. Genome Res. 13: 831-837.
Hirsh AE, Fraser HB, Wall DP. 2005. Adjusting for selection on synonymous sites in estimates of evolutionary distance. Mol. Biol. Evol. 22: 174-177.
Hoffman MM, Birney E. 2007. Estimating the neutral rate of nucleotide substitution using introns. Mol. Biol. Evol. 24: 522-531.
Imamura H, Karro J, Chuang J. 2009. Weak preservation of local neutral substitution rates across mammalian genomes. BMC Evol. Biol. 9: 89.
Kaufmann D, Kenner O, Nurnberg P, Vogel W, Bartelt B. 2004. In NF1, CFTR, PER3, CARS and SYT7, alternatively included exons show higher conservation of surrounding intron sequences than constitutive exons. Eur. J. Hum. Genet. 12: 139-149.
Ke S, Zhang XH-F, Chasin LA. 2008. Positive selection acting on splicing motifs reflects compensatory evolution. Genome Res. 18: 533-543.
Kleinjan DA, van Heyningen V. 2005. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am. J. Hum. Genet. 76: 8-32.
Lomelin D, Jorgenson E, Risch N. 2010. Human genetic variation recognizes functional elements in noncoding sequence. Genome Res. 20: 311-319.
Majewski J, Ott J. 2002. Distribution and characterization of regulatory elements in the human genome. Genome Res. 12: 1827-1836.
Marx JL. 1978. Gene structure: more surprising developments. Science. 199: 517-518.
Parsch J, Novozhilov S, Saminadin-Peter SS, Wong KM, Andolfatto P. 2010. On the utility of short intron sequences as a reference for the detection of positive and negative selection in Drosophila. Mol. Biol. Evol. 27: 1226-1234.
Resch AM, Carmel L, Mariño-Ramírez L, Ogurtsov AY, Shabalina SA, Rogozin IB, Koonin EV. 2007. Widespread positive selection in synonymous sites of mammalian genes. Mol. Biol. Evol. 24: 1821-1831.
Sorek R, Ast G. 2003. Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome Res. 13: 1631-1637.
Williamson B. 1977. DNA insertions and gene structure. Nature. 270: 295-297.