5. Discussion

5.3. Selection of candidate genes using different methods

In this study, several approaches were used to identify the most promising candidate genes for further functional analysis, with the functional annotation and the digital gene expression as the main tools. The first approach was based on the digital gene expression and used the top 20 differentially expressed contigs from every comparison as candidate genes (tables 13-15).

The second approach used an enrichment analysis to identify possible candidate genes within the enriched GO terms (figures 22-24). The third approach was based on my own classification system, in which all differentially expressed genes served as the basis for classification (table 11, figure 19 and figure 20). In this study, candidate genes were selected based on the annotation in combination with literature research. Based on the logFoldChanges, the top 60 DE genes were determined as candidate genes. Starting from the putative descriptions, a further literature search was performed to obtain more information and to get an impression of the processes during regrowth. In the following, the results of the literature research for all top 20 DE genes are presented and evaluated. The DE genes or contigs were identified with the DESeq2 analysis. These contigs can indicate which processes and pathways are involved in the regrowth process and which genes are characteristic for the different transcriptomes.
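
The selection of top candidates by logFoldChange can be sketched as follows. This is a minimal illustration only, not the procedure used in this study: the record layout, the significance cutoff of 0.05, the ranking by absolute logFoldChange and all example values are assumptions made for the sketch.

```python
from typing import NamedTuple


class DEResult(NamedTuple):
    contig: str              # hypothetical contig identifier
    log2_fold_change: float  # logFoldChange between two conditions
    padj: float              # adjusted p-value of the comparison


def top_candidates(results, n=20, alpha=0.05):
    """Return the n significant contigs with the largest absolute logFoldChange.

    alpha and the ranking by absolute value are assumptions of this sketch.
    """
    significant = [r for r in results if r.padj < alpha]
    ranked = sorted(significant, key=lambda r: abs(r.log2_fold_change), reverse=True)
    return ranked[:n]


# Made-up example values, not data from this study.
example = [
    DEResult("contig_a", 3.2, 0.001),
    DEResult("contig_b", -4.1, 0.0004),
    DEResult("contig_c", 5.0, 0.20),  # large change, but not significant
    DEResult("contig_d", 1.1, 0.01),
]

print([r.contig for r in top_candidates(example, n=2)])  # ['contig_b', 'contig_a']
```

Applying such a function to each comparison separately and pooling the results would give one candidate list per comparison, comparable in spirit to the per-comparison tables.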

To give a better impression of the processes, pathways and genes involved, all top 20 contigs from every comparison are described and their potential role during the regrowth process is discussed in the following paragraphs. For a better overview, the discussion is separated by transcriptome comparison and, within those groups, additionally by the conditions “mown” and “not mown”. The information is based on the descriptions provided by Phytozome and TAIR and relates to the T. pratense description, the description of the closest homologous species and the A. thaliana description.

5.4. Evaluation of different classification systems to structure RNA-Seq data for candidate selection

RNA-Seq experiments create a large amount of data. Several steps in the downstream analysis of such large datasets are necessary to extract the information relevant to the experiment. The digital gene expression and the functional annotation provide a good overview of the generated data, and enrichment analysis is thought to further facilitate candidate gene selection. In the following, the steps used to extract relevant information from the large dataset are discussed and evaluated: the digital gene expression, the functional annotation (including the classification into defined groups) and the GO enrichment analysis. For the digital gene expression, the DESeq2 analysis was chosen. In contrast to the rough estimation of transcript abundance using TPM normalization, the DESeq2 analysis includes a statistical evaluation of the differentially expressed contigs. It therefore provides a p-value for each transcript within a comparison, which makes the results more reliable. The digital gene expression showed the number of differentially expressed contigs for each analysis and revealed that, with a logFoldChange threshold of 2, the number of DE contigs as well as the numbers of up- and downregulated contigs are very similar across all analyses (table 10). With a logFoldChange threshold of 1 for the greenhouse samples, many more contigs were differentially expressed, and the largest proportion was upregulated within the mown samples (table 10). This suggests that contigs that are only slightly differentially expressed between the treatments coordinate the processes during regrowth, rather than contigs with very strong differential expression.
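
The effect of the logFoldChange threshold on the number of up- and downregulated contigs can be illustrated with a short sketch. It assumes that each contig is represented by a (logFoldChange, adjusted p-value) pair, that a positive logFoldChange means higher expression in the first condition of a comparison, and that 0.05 is used as significance cutoff; none of these details are taken from the analysis itself, and the values are made up.

```python
def count_de(results, lfc_threshold, alpha=0.05):
    """Count significant up- and downregulated contigs for a given threshold.

    results: iterable of (log2_fold_change, adjusted_p_value) pairs.
    Returns a tuple (n_up, n_down).
    """
    up = sum(1 for lfc, padj in results if padj < alpha and lfc >= lfc_threshold)
    down = sum(1 for lfc, padj in results if padj < alpha and lfc <= -lfc_threshold)
    return up, down


# Made-up example values, not data from this study.
results = [
    (2.5, 0.01),
    (1.4, 0.02),
    (-1.2, 0.03),
    (-3.0, 0.001),
    (1.8, 0.30),   # not significant, never counted
    (0.4, 0.01),   # significant, but below both thresholds
]

print(count_de(results, lfc_threshold=2))  # (1, 1)
print(count_de(results, lfc_threshold=1))  # (2, 2)
```

Lowering the threshold can only add contigs to the lists, so comparing the counts at both thresholds shows how many contigs have moderate rather than strong expression differences.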

In addition, more contigs seemed to be upregulated within the mown plants, so more genes or processes appear to be involved in the regrowth process. To explore the DE contigs from every analysis further and to determine whether they could be responsible for the processes during regrowth or are merely involved in maintaining general cell functions, a classification system was created based on literature research. It comprises 16 main classes and was used to describe the list of upregulated and downregulated contigs for every analysis (table 11, figure 19, figure 20). The hypothesis was that genes related to biotic stress, transcription, development, growth, signaling and phytohormones are involved in the regrowth processes. I therefore checked what proportion of the mown/not mown lists of DE contigs for each analysis falls into those groups. The results showed that 54% of the contigs within the mown greenhouse samples are in those groups (figure 19). For the field samples, only 38% and 41% of the contigs are in the classes suspected to be involved in the regrowth process (figure 19). A possible explanation is that the field plants are more exposed to environmental conditions and, for example due to the loss of protective biomass, have to cope with this exposure as a consequence of mowing (Gastal and Lemaire 2015). Therefore, the greenhouse samples are assumed to reflect the processes that take place during regrowth more closely. Extending the list of DE contigs by repeating the analysis with a lower logFoldChange for the greenhouse samples resulted in the same proportion of contigs (54%) grouped in classes related to regrowth (figure 20). This supports the former impression and justifies the decision to select candidate genes preferentially from the differentially expressed genes of the greenhouse samples. Although the analysis of the shared contigs between all lists of DE contigs showed that no contigs are shared between all analyses, exploring the gene descriptions and the classification results gave a different impression (table 12, e-Appendix TpT_06_Classes_DEG). Based on those results, it is possible to state that similar genes are expressed within all transcriptomes. The classification of the DE contigs helped to describe the long gene lists and provided a good overview. However, compared to “ready to use” functional annotations such as those provided by the GO system, it is more time consuming. Nevertheless, it allows groups to be defined individually in relation to the questions of the experiment, which provides a defined vocabulary that is understandable and project related. In addition, many genes have more than one function or are involved in more than one pathway, depending on factors such as abiotic or biotic stresses or the species; a gene can therefore be placed in the most likely correct class based on knowledge and context. For example, a gene that is involved in reproduction in A. thaliana can be involved in growth processes in M. truncatula. As M. truncatula is more closely related to T. pratense, it makes more sense to place the gene in the group of growth rather than reproduction (development).
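
The proportion of contigs falling into the classes hypothesised to be related to regrowth can be computed as in the following sketch. The six class names are the ones named in the hypothesis above; the mapping of contigs to classes, the contig identifiers and the two additional class names are hypothetical and serve only as an example.

```python
from collections import Counter

# Classes hypothesised to be involved in regrowth (from the text above).
REGROWTH_CLASSES = {
    "biotic stress",
    "transcription",
    "development",
    "growth",
    "signaling",
    "phytohormones",
}


def regrowth_share(class_by_contig):
    """Return the percentage of contigs assigned to a regrowth-related class.

    class_by_contig: dict mapping contig identifier -> assigned main class.
    """
    counts = Counter(class_by_contig.values())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for cls, n in counts.items() if cls in REGROWTH_CLASSES)
    return 100.0 * hits / total


# Hypothetical assignments, not data from this study.
example = {
    "contig_a": "growth",
    "contig_b": "transcription",
    "contig_c": "general cell functions",  # hypothetical class name
    "contig_d": "transport",               # hypothetical class name
}

print(regrowth_share(example))  # 50.0
```

Because every contig is assigned to exactly one class in this sketch, the shares of all classes add up to 100%; a system that allows several classes per contig would need a different denominator.
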
Especially for non-model organisms, it makes sense to define own classes that, where possible, include the results of studies on species other than A. thaliana. In my opinion, this is a disadvantage of GO annotations. It is not always clear how the annotations were made, and finding this out for each annotation would be time consuming. Moreover, some GO terms cannot be interpreted intuitively or are too general. Nevertheless, the advantage of GO terms is that they are more global and can be compared directly to other studies, as they provide controlled vocabularies of defined terms based on gene product characteristics. These cover three domains: cellular component, the parts of a cell or its extracellular environment; molecular function, the elemental activities of a gene product at the molecular level, such as binding or catalysis; and biological process, operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms (Ashburner et al. 2000; The Gene Ontology Consortium 2017). The GO ontology is structured as a directed acyclic graph in which each term has defined relationships to one or more other terms in the same domain, and sometimes in the other domains. The examples of contigs within the GO terms and the corresponding annotations given in the results section show that GO terms and enrichment analysis should be handled with caution (see e-Appendix TpT_12_GO_enrichment_Goseq for further details). Especially for non-model organisms like T. pratense, the GO terms led to misinterpretation or were too general to draw conclusions or to select the most promising candidate genes. Therefore, the other approach, searching the literature for basic information on each annotation and grouping the annotations into self-defined classes, seems more promising. In both cases the information should be checked, and neither approach replaces further thorough literature research. Nevertheless, the own classification system worked better and is more suitable for drawing conclusions or selecting candidate genes. Carnielli et al. (2015) reviewed the functional annotation of large datasets with enrichment analysis and noted that the top enriched GO terms can differ depending on the GO annotations used, because the results can change when GO annotations are updated. The enrichment analysis would therefore have to be repeated with every update of the GO database to guarantee current results that can be interpreted globally. Huang et al. (2009) performed GO enrichment analyses with different tools and found that the top ten enriched GO terms differed depending on the tool used. Even though this says nothing about the quality of any of the tools, it does mean that results of GO enrichment analyses can only be interpreted globally when the compared studies used the same tool.
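
Why an enrichment result depends on the annotation release can be illustrated with a plain hypergeometric over-representation test. This is a simplified sketch and not a reproduction of the enrichment tool used in this study, which may apply a different statistical model; all numbers are made up.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of observing at least `hits` annotated contigs in the sample.

    population: number of contigs in the background set
    annotated:  number of background contigs carrying the GO term
    sample:     number of DE contigs
    hits:       number of DE contigs carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Made-up scenario: the same DE list is tested against two annotation releases.
# In the newer release more background contigs carry the term.
p_old = hypergeom_enrichment_p(population=100, annotated=10, sample=20, hits=5)
p_new = hypergeom_enrichment_p(population=100, annotated=25, sample=20, hits=5)

print(p_old < p_new)  # True: the term looks less enriched with the newer release
```

With unchanged expression data, the term appears less enriched once more background contigs are annotated with it, which is one way in which the ranking of top enriched GO terms can shift between database releases.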

5.5. Top 20 DE contigs of field and greenhouse transcriptomes show