
5.2. Identification of inter-regional associated TFs using multivariate mutual information

5.2.1. Example dataset

To illustrate the basic idea and to get a first impression of the general performance of the different multivariate mutual information metrics, I constructed two small TFBS count matrices, one for enhancer and one for promoter sequences (see Figure 5.9). Each row of a matrix corresponds to an enhancer or promoter sequence, respectively, and the two sequences in the same row belong to the same PEI. Columns represent the TFBS names, and an entry in the matrix gives the number of predicted PWM matches. The first four PEIs (E1/P1, ..., E4/P4) are defined as real/input PEIs, while the others are treated as background (E1sh/P1sh, ..., E4sh/P4sh). The type of the pairing L ∈ {I, B} is given in the vector V^lab. Three TFBS types are predicted in the enhancer regions (T_E1, T_E2 and T_E3) and three in the promoters (T_P1, T_P2 and T_P3). With three predictable binding site motifs in each of the enhancer and promoter sequences, there are nine pairwise enhancer-promoter TFBS combinations in total.
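The layout described above can be sketched in a few lines of Python. This is only an illustration of the data structure: the variable names (`m_enh`, `m_prom`, `v_lab`, `column`) are chosen here for convenience, and the count matrices are filled with placeholder zeros because the actual entries of Figure 5.9 are not part of the running text.

```python
from itertools import product

# TFBS names as used in the text.
enhancer_tfbs = ["T_E1", "T_E2", "T_E3"]
promoter_tfbs = ["T_P1", "T_P2", "T_P3"]

# Four input PEIs followed by four background PEIs.
pei_ids = [f"E{i}/P{i}" for i in range(1, 5)] + [f"E{i}sh/P{i}sh" for i in range(1, 5)]
v_lab = ["I"] * 4 + ["B"] * 4  # label vector V^lab

# Placeholder count matrices (all zeros, NOT the values of Figure 5.9):
# one row per PEI, one column per TFBS.
m_enh = [[0] * len(enhancer_tfbs) for _ in pei_ids]
m_prom = [[0] * len(promoter_tfbs) for _ in pei_ids]


def column(matrix, names, tfbs):
    """Return the count vector of one TFBS across all PEIs."""
    j = names.index(tfbs)
    return [row[j] for row in matrix]


# All pairwise enhancer-promoter TFBS combinations.
pairs = list(product(enhancer_tfbs, promoter_tfbs))
print(len(pairs))  # 3 * 3 combinations
```

Each metric discussed in this section is then evaluated on one enhancer column, one promoter column and the label vector.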

A closer look at the count distributions of the individual TFBSs gives hints about their general binding behaviour in the sequence set under study and a first insight into the pairwise association between TFBSs of enhancer and promoter sequences.

In this synthetic example, the binding behaviour of T_E1 and T_P1 appears to be associated in the real PEIs as well as in the background set. In my view this pair is perfectly associated, but it has to be pointed out that the association modelled in the background sequences is unlikely to arise in a background set generated by my approach. The motif pair T_E2 and T_P2 is associated in the real PEIs but not in the background set, and it is therefore the second best associated TFBS pair in this example. In contrast, T_E3 and T_P3 show an associated behaviour that is unrelated to the label (input or background) of the PEI; this pair is referred to as the non-associated pair.

Figure 5.9.: Example dataset: Synthetically generated TFBS count matrices M^enh and M^prom and label vector V^lab. The rows of M correspond to PEIs and the columns to TFBS names; an entry in the matrix is the frequency of predicted TFBSs in the respective sequence. V^lab indicates the label of the interaction type (I refers to real/input PEIs, B indicates the background).

Assuming that the count matrices already contain the interval identifiers assigned in Phase 5, I applied the information theoretic measures to this example. Afterwards, I normalized the outcomes of the different quantities using the logarithm of the maximal alphabet size in order to reduce the influence of alphabet size and to enable a proper comparison between the different quantities (see Table 5.13).
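One possible reading of this normalization step is sketched below. It assumes that a value measured in bits is divided by log2 of the largest alphabet size among the variables passed in; which variables enter this maximum (single TFBS columns, the label, or joint variables) is not specified in this section and is left to the caller. The function names and the example vectors are hypothetical.

```python
from math import log2


def alphabet_size(values):
    """Number of distinct symbols observed in a discrete vector."""
    return len(set(values))


def normalize(value_bits, *variables):
    """Divide a value given in bits by log2 of the largest alphabet size."""
    largest = max(alphabet_size(v) for v in variables)
    if largest < 2:
        return 0.0  # constant variables carry no information
    return value_bits / log2(largest)


# Hypothetical interval vectors, not the values of Figure 5.9.
x = [0, 0, 1, 1, 0, 0, 1, 1]  # two distinct symbols
y = [0, 1, 2, 3, 0, 1, 2, 3]  # four distinct symbols
print(normalize(1.0, x, y))   # 1 bit divided by log2(4)
```

Dividing by the logarithm of the alphabet size bounds the entropy of a single variable by one, which makes values obtained from variables with different numbers of intervals easier to compare.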

Using the MMI, the best associated pair T_P1-T_E1 has the highest value with MMI(T_E1;T_P1;L) = 1. The non-associated pair results in MMI(T_E3;T_P3;L) = 0, indicating that the three variables T_E3, T_P3 and L do not share any common information. The second best associated pair gets a value of MMI(T_E2;T_P2;L) = 0.43 and thus takes the intermediate position.
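The two extreme cases can be reproduced with a small sketch. It assumes the sign convention MMI(X;Y;Z) = I(X;Y) - I(X;Y|Z), which is one of two conventions in use for three-variable mutual information and is the one that gives a positive value for fully coupled variables. The vectors are hypothetical stand-ins for a pair that follows the label and a pair that ignores it; they are not taken from Figure 5.9.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy in bits of one or more discrete vectors."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())


def mmi(x, y, z):
    """I(X;Y;Z) = I(X;Y) - I(X;Y|Z), expressed through joint entropies."""
    mi = entropy(x) + entropy(y) - entropy(x, y)
    cmi = entropy(x, z) + entropy(y, z) - entropy(z) - entropy(x, y, z)
    return mi - cmi


# Hypothetical vectors for four input (I) and four background (B) PEIs.
label = ["I", "I", "I", "I", "B", "B", "B", "B"]
follows_label = [1, 1, 1, 1, 0, 0, 0, 0]  # counts fully determined by the label
ignores_label = [0, 0, 1, 1, 0, 0, 1, 1]  # same distribution in both groups

mmi_coupled = mmi(follows_label, follows_label, label)    # both TFBSs tied to the label
mmi_unrelated = mmi(ignores_label, ignores_label, label)  # TFBSs tied to each other only
print(mmi_coupled, mmi_unrelated)
```

With binary vectors the values are already in the range from 0 to 1, so no further normalization is applied in this sketch.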

Using the JMI, I successfully identified the best associated pair as top ranking. The non-associated pair gets a value of 0, indicating that the joint distribution of T_P3-T_E3 and the label L do not share any commonality. The second best associated pair results in JMI(T_E2,T_P2;L) = 0.43 and is therefore in the intermediate position of the three considered pairs. Looking at the other potential pairings between the TFBSs of enhancer and promoter, it is remarkable that the pairs T_P3-T_E1 and T_P1-T_E3 show a higher JMI(T_E,T_P;L) value than the second best associated pair, although the two columns do not depend on each other at all. However, the joint distributions of both pairs depend strongly on the label vector, which results in the high value of this quantity. This in turn implies that a dependence between the two TFBSs is not required for a high value of this quantity as long as one or both distributions share information with the label vector.
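This effect is easy to reproduce. The sketch below assumes the standard definition JMI(X,Y;L) = H(X,Y) + H(L) - H(X,Y,L) and uses hypothetical vectors (not the values of Figure 5.9): one column is fully determined by the label, the other is independent of both the label and the first column.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy in bits of one or more discrete vectors."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())


def jmi(x, y, z):
    """I(X,Y;Z): information the pair (X, Y) shares with Z."""
    return entropy(x, y) + entropy(z) - entropy(x, y, z)


def mi(x, y):
    """Pairwise mutual information I(X;Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


# Hypothetical vectors for four input (I) and four background (B) PEIs.
label = ["I", "I", "I", "I", "B", "B", "B", "B"]
follows_label = [1, 1, 1, 1, 0, 0, 0, 0]  # depends only on the label
ignores_label = [0, 0, 1, 1, 0, 0, 1, 1]  # independent of label and first column

jmi_value = jmi(follows_label, ignores_label, label)  # in bits, not normalized
mi_value = mi(follows_label, ignores_label)
print(jmi_value, mi_value)
```

The JMI equals the full label entropy although the pairwise mutual information between the two columns is zero, because a single label-dependent column is enough to make the joint variable informative about L.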

The CMI yields CMI(T_E1;T_P1|L) = 0 for the best associated pair, indicating that T_E1 and T_P1 do not share any additional information about each other if the label is known.

Table 5.13.: Results for the synthetically generated count matrices. Shown are the values for all TFBS pairs for the joint mutual information (JMI), multivariate mutual information (MMI), conditional mutual information (CMI), dual total correlation (DTC) and pairwise mutual information (I). While the first four metrics consider the TFBS distributions of T_E and T_P as well as the label L of the PEIs (input or background PEI), the pairwise mutual information focuses only on the TFBS distributions in the input sequences, neglecting the generated background. All values are normalized by the alphabet size.

TFBS enhancer  TFBS promoter  JMI(T_E,T_P;L)  MMI(T_E;T_P;L)  CMI(T_E;T_P|L)  DTC(T_E,T_P,L)  I(T_E;T_P)
T_E1           T_P1           1.0             1.0             0.0             1.0             0.0
T_E1           T_P2           0.43            0.43            0.0             0.43            0.0
T_E1           T_P3           0.5             0.0             0.0             0.5             0.0
T_E2           T_P1           0.43            0.43            0.0             0.43            0.0
T_E2           T_P2           0.43            0.43            0.43            0.86            0.0
T_E2           T_P3           0.43            0.0             0.43            0.86            0.0
T_E3           T_P1           0.5             0.0             0.0             0.5             0.0
T_E3           T_P2           0.43            0.0             0.43            0.86            0.0
T_E3           T_P3           0.0             0.0             1.0             1.0             1.0

The non-associated pair gets the highest value for this quantity, with CMI(T_E3;T_P3|L) = 1. The second best associated pair results in a value of CMI(T_E2;T_P2|L) = 0.43 and is thus on position two in the ranking of the CMI values. Therefore, the CMI ranks the pairs in reverse order.
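The reversed ranking follows directly from what the CMI measures. The sketch below assumes the standard definition CMI(X;Y|L) = H(X,L) + H(Y,L) - H(L) - H(X,Y,L) and again uses hypothetical vectors rather than the values of Figure 5.9.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy in bits of one or more discrete vectors."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())


def cmi(x, y, z):
    """I(X;Y|Z): information X and Y share once Z is known."""
    return entropy(x, z) + entropy(y, z) - entropy(z) - entropy(x, y, z)


# Hypothetical vectors for four input (I) and four background (B) PEIs.
label = ["I", "I", "I", "I", "B", "B", "B", "B"]
follows_label = [1, 1, 1, 1, 0, 0, 0, 0]  # fully explained by the label
ignores_label = [0, 0, 1, 1, 0, 0, 1, 1]  # identical pattern in both groups

cmi_explained = cmi(follows_label, follows_label, label)    # label explains everything
cmi_unexplained = cmi(ignores_label, ignores_label, label)  # label explains nothing
print(cmi_explained, cmi_unexplained)
```

A pair whose association is completely explained by the label has nothing left to share once L is known, whereas a pair that is associated independently of the label keeps its full dependence.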

The DTC yields a value of one for the best associated as well as for the non-associated pair. The second best associated pair gets a value of DTC(T_E2,T_P2,L) = 0.86, which implies that the DTC does not reflect the order of the associated pairs.
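The behaviour of the DTC can be traced back to the two preceding quantities. Assuming the standard entropy-based definitions (an assumption, since the definitions themselves are given outside this section), the DTC of the three variables decomposes into the sum of JMI and CMI:

```latex
\begin{align*}
\mathrm{JMI}(T_E,T_P;L) &= H(T_E,T_P) + H(L) - H(T_E,T_P,L),\\
\mathrm{CMI}(T_E;T_P\mid L) &= H(T_E,L) + H(T_P,L) - H(L) - H(T_E,T_P,L),\\
\mathrm{DTC}(T_E,T_P,L) &= H(T_E,T_P,L) - \bigl[H(T_E\mid T_P,L) + H(T_P\mid T_E,L) + H(L\mid T_E,T_P)\bigr]\\
&= H(T_E,T_P) + H(T_E,L) + H(T_P,L) - 2\,H(T_E,T_P,L)\\
&= \mathrm{JMI}(T_E,T_P;L) + \mathrm{CMI}(T_E;T_P\mid L).
\end{align*}
```

The values in Table 5.13 are consistent with this decomposition in every row, for example 1.0 + 0.0 = 1.0 for T_E1/T_P1, 0.43 + 0.43 = 0.86 for T_E2/T_P2 and 0.0 + 1.0 = 1.0 for T_E3/T_P3. Since the DTC adds a quantity that rewards label dependence to one that rewards label-independent association, it cannot separate the best associated pair from the non-associated pair.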

Finally, I applied the pairwise mutual information (I) to the example dataset in order to demonstrate the importance of the background set and, consequently, the need for the third variable L. For the calculation of the pairwise mutual information I(T_E;T_P), I considered only the input PEIs, since this quantity offers no differentiation between the input and the background. This yields I(T_E3;T_P3) = 1.0 for the non-associated pair, while all other pairs have a value of 0, which would indicate that the two binding sites do not offer any information about each other. This implies that the association between two TFBSs can in some cases only be captured by considering the background set.
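The restriction to the input PEIs can be sketched as follows. The helper `pairwise_mi_input` and the example vectors are hypothetical; the sketch only illustrates why a pair whose counts are constant within the input set receives a value of zero, while a pair that varies jointly within the input set receives a high value regardless of the background.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy in bits of one or more discrete vectors."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())


def pairwise_mi_input(x, y, labels, input_label="I"):
    """I(X;Y) computed on the rows labelled as input PEIs only."""
    keep = [i for i, lab in enumerate(labels) if lab == input_label]
    xs = [x[i] for i in keep]
    ys = [y[i] for i in keep]
    return entropy(xs) + entropy(ys) - entropy(xs, ys)


# Hypothetical vectors for four input (I) and four background (B) PEIs.
label = ["I", "I", "I", "I", "B", "B", "B", "B"]
follows_label = [1, 1, 1, 1, 0, 0, 0, 0]  # constant within the input PEIs
ignores_label = [0, 0, 1, 1, 0, 0, 1, 1]  # varies within the input PEIs

mi_constant = pairwise_mi_input(follows_label, follows_label, label)
mi_varying = pairwise_mi_input(ignores_label, ignores_label, label)
print(mi_constant, mi_varying)
```

Without the background rows, a column that is constant across the input PEIs has zero entropy and therefore cannot contribute any mutual information, no matter how strongly it differs from the background.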

Summarizing my findings, the MMI and JMI arrange the pairs in the correct order according to their association strength.
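The rankings of the three designed pairs can be checked directly against the values reported in Table 5.13. The short script below copies these values and sorts the pairs by each metric; the dictionary layout and the function name `ranking` are chosen here for illustration only.

```python
# Values of the three designed pairs, copied from Table 5.13.
table = {
    "T_E1/T_P1": {"JMI": 1.0, "MMI": 1.0, "CMI": 0.0, "DTC": 1.0},
    "T_E2/T_P2": {"JMI": 0.43, "MMI": 0.43, "CMI": 0.43, "DTC": 0.86},
    "T_E3/T_P3": {"JMI": 0.0, "MMI": 0.0, "CMI": 1.0, "DTC": 1.0},
}

# Designed order: best associated, second best associated, non-associated.
expected = list(table)


def ranking(metric):
    """Pairs sorted by decreasing value of the given metric."""
    return sorted(table, key=lambda pair: table[pair][metric], reverse=True)


for metric in ("JMI", "MMI", "CMI", "DTC"):
    order = ranking(metric)
    print(metric, order, "matches designed order:", order == expected)
```

Only the three designed pairs are compared here; as noted for the JMI, two of the remaining six pairings exceed the second best associated pair when all nine combinations are ranked.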

Table 5.14.: Inserted associated TFBSs in enhancer and promoter sequences with the corresponding logo plots.

Pair  TFBS enhancer   TFBS promoter
1     V$NFAT5_Q5_02   V$ROAZ_01
2     V$E2_01         V$GZF1_01
3     V$ZNF143_03     V$IRF1_01
