6.2. Multivariate mutual information in the context of inter-regional cooperating TFs

The ability of two gene regulatory regions, such as an enhancer and a promoter, to interact is determined by the transcription factors binding to these regions. Some transcription factors are more important for this interaction than others and stay in association with other factors on the pairing DNA region. Consequently, their binding behaviour, and thus their binding site distributions, are associated with each other. I measured the level of dependence using different mutual information metrics that consider three random variables. The third random variable indicates the origin of the underlying data, since I created a background sequence set. This background set is required to reduce the effect of false positive TFBS predictions, which can lead to false positive associations between two TFBSs; in this way, noise is separated from the signal arising from real TFBS associations. The background sequences were created using the uShuffle algorithm [36], which preserves the general nucleotide composition as well as the frequency of k-mers in the input sequences. I evaluated the performance of the MMI for different k. For k ∈ {1, 2}, the performance appeared to be consistently high. However, this can be explained by the fact that some binding sites did not occur in the background sequences at all, so that the corresponding pairs were ranked high merely because of this absence. In turn, k ∈ {4, 5} kept the sequences too similar to the original ones, and no differences in the binding site counts could be determined. Following these findings, I kept k = 3 (as in the extended version of my first approach), so that TFBSs still occur in the background sequences but their count value distributions differ from those of the original input sequences if the binding sites have any biological importance.

To avoid correlations between TFBSs that arise merely from zero count values on both sides, I filtered out all TFBSs that have more than 50% zero entries in the input sequence set.
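The zero-entry filter described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the matrix layout (rows are sequences, columns are TFBSs), the function name `filter_sparse_tfbs` and the toy data are assumptions.

```python
import numpy as np

def filter_sparse_tfbs(counts, names, max_zero_fraction=0.5):
    """Drop TFBS columns whose share of zero counts exceeds the threshold.

    counts: 2-D array, rows = sequences, columns = TFBSs (assumed layout).
    names:  one label per column.
    """
    counts = np.asarray(counts)
    # Fraction of sequences in which each TFBS does not occur at all.
    zero_fraction = (counts == 0).mean(axis=0)
    # "More than 50% zero entries" is removed, so exactly 50% is kept.
    keep = zero_fraction <= max_zero_fraction
    kept_names = [name for name, flag in zip(names, keep) if flag]
    return counts[:, keep], kept_names

# Toy data (hypothetical): column A is zero in 3 of 4 sequences.
toy_counts = np.array([[0, 2, 0],
                       [0, 1, 3],
                       [0, 0, 1],
                       [4, 5, 0]])
filtered, kept = filter_sparse_tfbs(toy_counts, ["A", "B", "C"])
```

In the toy data, column A is removed, while column C, with exactly half of its entries equal to zero, is retained.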

For the purpose of a proper comparison between the count value distributions of the TFBSs, I first normalized all count values and, second, assigned them to predefined intervals.

As normalization strategy, I chose the min-max normalization using the global minimum and global maximum count values. Using the column minimum/maximum led to a Poisson distribution of the normalized count values in the range between zero and one, which complicated the differentiation between the individual distributions. Using the global minimum/maximum relocated the original count value distribution to the range between zero and one while keeping its original properties. After normalization, the count values were assigned to q + 1 intervals: q intervals of equal width covering [0, 1] and one additional interval. All values of zero are assigned to this additional interval in order to differentiate between a low count and the absence of the binding site in a sequence.
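The normalization and interval assignment can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `discretize_counts` is hypothetical, the intervals are assumed to be right-closed, and the extra interval (index 0) is assigned by raw count zero.

```python
import numpy as np

def discretize_counts(counts, q):
    """Global min-max normalization followed by assignment to q + 1 intervals.

    Interval 0 is reserved for sequences without the binding site (count 0).
    Intervals 1..q split (0, 1] into q pieces of equal width (assumed to be
    right-closed, i.e. interval j covers ((j - 1) / q, j / q]).
    """
    counts = np.asarray(counts, dtype=float)
    lo, hi = counts.min(), counts.max()  # global, not per-column, extremes
    if hi == lo:
        normalized = np.zeros_like(counts)
    else:
        normalized = (counts - lo) / (hi - lo)
    bins = np.clip(np.ceil(normalized * q), 1, q).astype(int)
    bins[counts == 0] = 0  # absence of the binding site gets its own interval
    return normalized, bins

# Toy count matrix (hypothetical): rows = sequences, columns = TFBSs.
toy = np.array([[0, 1, 2],
                [4, 8, 0]])
normalized, bins = discretize_counts(toy, q=4)
```

Because the extremes are taken over the whole matrix, all columns share one scale, so a low count in one TFBS column lands in the same interval as the same count in any other column.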

I performed a comparative study of four different mutual information measures for three random variables: dual total correlation (DTC), conditional mutual information (CMI), joint mutual information (JMI) and multivariate mutual information (MMI). The definitions of these measures are given in Chapter 3. For this study, I generated synthetic paired sequence sets and inserted three TFBS pairs into these sets, constructing different conditions regarding the TFBS frequency and the association strength of each pair. Although the CMI performed well in most cases on the synthetic sequence set, it performed poorly on the small starting example. It turned out that the CMI is not able to predict perfectly associated pairs if the label does not offer additional information.

The JMI performed well in most cases on both sets. However, it ranks some pairs high that do not show any association with each other, but for which one binding site distribution depends on the label distribution (see TFBS pair TE1−TP3 in Table 5.9). The MMI and the DTC clearly outperformed the other two measures. Although the DTC consistently identified the inserted pair, a closer look at its predictions revealed that its results strongly depend on the distribution of just one TFBS. Since the DTC is built from the CMI and the JMI, its prediction performance is not reliable, and it ranks binding sites high that do not show any association linked to the origin of the data (input and background set).

Therefore, I stayed with the MMI for the further analysis of real biological data.
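The relation between the four measures can be sketched for discretized data. The code assumes the common textbook definitions, which may differ from those in Chapter 3, in particular in the sign convention of the MMI: CMI = I(X;Y|Z), JMI = I(X,Y;Z), MMI = I(X;Y) − I(X;Y|Z) and DTC = H(X,Y) + H(X,Z) + H(Y,Z) − 2H(X,Y,Z), with Z taken as the label variable. All function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Joint Shannon entropy in bits of one or more equally long discrete columns."""
    joint = Counter(zip(*columns))
    n = sum(joint.values())
    return -sum(c / n * log2(c / n) for c in joint.values())

def three_variable_measures(x, y, z):
    """Plug-in estimates of CMI, JMI, MMI and DTC (assumed textbook definitions)."""
    hx, hy, hz = entropy(x), entropy(y), entropy(z)
    hxy, hxz, hyz = entropy(x, y), entropy(x, z), entropy(y, z)
    hxyz = entropy(x, y, z)
    cmi = hxz + hyz - hxyz - hz        # I(X;Y|Z)
    jmi = hxy + hz - hxyz              # I(X,Y;Z)
    mmi = (hx + hy - hxy) - cmi        # I(X;Y) - I(X;Y|Z), one sign convention
    dtc = hxy + hxz + hyz - 2 * hxyz   # algebraically equal to cmi + jmi
    return {"CMI": cmi, "JMI": jmi, "MMI": mmi, "DTC": dtc}

# Toy example (not thesis data): z is the XOR of two independent binary variables.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
z = [0, 1, 1, 0]
measures = three_variable_measures(x, y, z)
```

Under these definitions the DTC equals the sum of CMI and JMI, which matches the observation that it inherits the weaknesses of both. For the XOR toy example, CMI and JMI are each 1 bit, the DTC is 2 bits and the MMI is −1 bit.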

In order to evaluate my method, I compared its performance with MotifHyades [15], a tool published by Wong in 2017. MotifHyades performed quite well on the synthetic sequence sets. However, for low numbers of TFBSs or low association strength of the TFBS pairs, MotifHyades performed poorly and the MMI approach clearly outperformed it. As explained by Wong, the algorithm was developed for predicting statistically significant over-represented TFBS pairs in enhancer and promoter sequences. Thus, weakly associated pairs are not targeted.

I decided against a statistical analysis based on databases like BioGRID, TransCOMPEL or STRING, since my approach does not target the direct physical interaction between two transcription factors. It is much more likely that the cooperation is mediated by other factors such as co-factors. Extending the definition of true interactions so that direct as well as third-party interactions are included would mean that every factor can interact with every other factor via a few highly connected factors such as EP300. Consequently, such a statistical analysis would not be meaningful.

I analyzed known promoter-enhancer interactions based on ChIA-PET data of six human cell lines in order to determine the transcription factors that play important roles in the formation of these interactions. It turned out that the single TFBSs forming the identified pairs show a large overlap between the different cell lines. In contrast, the overlap of the determined TFBS pairs themselves is rather small, suggesting that the differences in gene regulation lie at the level of paired transcription factor interactions rather than at the level of single factors. This finding is in line with that of my first approach, where the overlap between the intra-regional TF cooperations was much higher than that of the single binding sites themselves. Thus, both results support the hypothesis that single TFs and their binding sites are re-used for different purposes, e.g. cellular contexts, and that, consequently, a flexible and specific gene regulation is mainly based on the combinatorial binding of TFs.
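The comparison between the overlap of single TFBSs and the overlap of TFBS pairs can be illustrated with a small sketch. The Jaccard index is an illustrative choice and not necessarily the overlap measure used in this work; the function names, cell line labels and TFBS names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two collections, defined as 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_summary(pairs_by_cell_line):
    """Compare every two cell lines on single TFBSs and on TFBS pairs.

    pairs_by_cell_line: dict mapping a cell line name to a list of
    (enhancer_tfbs, promoter_tfbs) tuples (assumed input format).
    """
    summary = {}
    for (name_a, pairs_a), (name_b, pairs_b) in combinations(
            pairs_by_cell_line.items(), 2):
        singles_a = {tfbs for pair in pairs_a for tfbs in pair}
        singles_b = {tfbs for pair in pairs_b for tfbs in pair}
        summary[(name_a, name_b)] = {
            "single": jaccard(singles_a, singles_b),
            "pair": jaccard(pairs_a, pairs_b),
        }
    return summary

# Toy input: the same four TFBSs are combined differently in each cell line.
toy_pairs = {
    "line1": [("A", "B"), ("C", "D")],
    "line2": [("A", "D"), ("C", "B")],
}
summary = overlap_summary(toy_pairs)
```

In the toy input, both cell lines use exactly the same single TFBSs but share no pair, which is the extreme case of the pattern described above.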

As shown for cell line K562 as an example, the degree distribution of the nodes follows a power-law distribution, and thus the network is scale-free. This, in turn, reveals that some TFs participate in many pairings and are represented as hub nodes in the cooperation network, while the majority of TFs are involved in only a few pairings. Consequently, some TFs are of major importance for the regulation of the underlying gene set but have to cooperate with other factors in order to fulfill their regulatory functions. The biological evaluation of these factors showed that some of them have already been linked to the analyzed cell lines or their corresponding phenotypes (e.g. leukemia).
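How node degrees and hub nodes can be derived from a list of TFBS pairs is sketched below. This is a hedged illustration: modelling enhancer and promoter binding sites as separate node types, the degree threshold for hubs and all names are assumptions, and the sketch does not test whether the distribution actually follows a power law.

```python
from collections import Counter

def node_degrees(pairs):
    """Degree of every node in the network built from unique TFBS pairs.

    pairs: iterable of (enhancer_tfbs, promoter_tfbs) tuples (assumed format).
    Enhancer and promoter binding sites are kept as separate node types.
    """
    degrees = Counter()
    for enhancer_tfbs, promoter_tfbs in set(pairs):
        degrees[("enhancer", enhancer_tfbs)] += 1
        degrees[("promoter", promoter_tfbs)] += 1
    return degrees

def degree_distribution(degrees):
    """Map each degree k to the number of nodes having that degree."""
    return dict(sorted(Counter(degrees.values()).items()))

def hub_nodes(degrees, min_degree):
    """Nodes whose degree reaches a user-chosen (hypothetical) threshold."""
    return sorted(node for node, k in degrees.items() if k >= min_degree)

# Toy network with hypothetical TFBS names.
toy_pairs = [("E1", "P1"), ("E1", "P2"), ("E1", "P3"), ("E2", "P1")]
degrees = node_degrees(toy_pairs)
distribution = degree_distribution(degrees)
hubs = hub_nodes(degrees, min_degree=3)
```

In a scale-free network, plotting the degree distribution on log-log axes gives an approximately straight line. A formal check would require fitting and comparing candidate distributions, which is outside the scope of this sketch.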

Comparing the hub nodes of the different cell lines reveals that there are no overlapping hubs representing enhancer TFs. In turn, there are two transcription factors identified in promoter sequences whose binding sites represent hub nodes in at least two cell lines: TOPORS and MEF2A. Members of the MEF2A family are known to be involved in the upregulation of genes in cell lines K562 and GM12878 [88, 89]. TOPORS, in turn, is known to be involved in promyelocytic leukemia [76].

A biological evaluation of the identified associated TFBS pairs for cell line K562, a chronic myeloid leukemia (CML) cell line, shows that, for example, the identified transcription factors YY1 and ATF2 are both enhancer-binding factors known to be involved in CML [95, 96, 97, 98, 106].

6.3. Complementarity of PMI and MMI in a biological