
5.2. Identification of inter-regional associated TFs using multivariate mutual information

5.2.2. Simulation datasets

For a more thorough evaluation of the different mutual information quantities, I constructed a collection of simulated sequence sets into which I artificially inserted associated TFBS pairs. In the first step, I trained two Markov chain models for the generation of synthetic enhancer and promoter sequences whose nucleotide distribution is close to that of natural sequences. For the enhancer sequences, I trained the Markov chain model on p300 ChIP-Seq peaks provided by ENCODE (https://www.encodeproject.org/) for the cell lines MCF7, IMR90 and K562 that do not overlap with known promoter regions. The model-generated enhancer sequences have an average length of 600 bp and can vary in length by +/-100 bp. The Markov chain model for the promoter sequences was trained on a non-overlapping set of promoter sequences of genome-wide RefSeq genes, using the promoter region from -1000 bp to +100 bp relative to the transcription start site (TSS). The model-generated promoter sequences have a length of 1100 bp.
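The generation step can be sketched as follows. This is a minimal illustration only: the Markov order `k`, the pseudocount, the uniform draw of the enhancer length and all function names are assumptions, since the text does not state them.

```python
import random
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(sequences, k=3, pseudocount=1.0):
    """Count k-mer -> next-base transitions (k >= 1; the order k is an assumption)."""
    counts = defaultdict(lambda: {b: pseudocount for b in ALPHABET})
    for seq in sequences:
        seq = seq.upper()
        for i in range(k, len(seq)):
            context, base = seq[i - k:i], seq[i]
            # Skip windows containing ambiguous symbols such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1
    return dict(counts)

def generate(model, k, length, rng):
    """Sample one sequence; contexts never seen in training fall back to uniform."""
    uniform = {b: 1.0 for b in ALPHABET}
    seq = [rng.choice(ALPHABET) for _ in range(k)]
    while len(seq) < length:
        weights = model.get("".join(seq[-k:]), uniform)
        seq.append(rng.choices(ALPHABET, weights=[weights[b] for b in ALPHABET])[0])
    return "".join(seq[:length])

def synthetic_set(enh_model, prom_model, k, n=1000, seed=0):
    """Return n (promoter, enhancer) pairs; the i-th promoter belongs to the i-th enhancer."""
    rng = random.Random(seed)
    # Enhancers: 600 bp +/- 100 bp (uniform draw assumed); promoters: fixed 1100 bp.
    enhancers = [generate(enh_model, k, 600 + rng.randint(-100, 100), rng) for _ in range(n)]
    promoters = [generate(prom_model, k, 1100, rng) for _ in range(n)]
    return list(zip(promoters, enhancers))
```

In practice the two models would be trained on the p300 peak sequences and the RefSeq promoter sequences, respectively.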

Using these models, I generated sequence sets, each of which consisted of 1000 synthetic enhancer and 1000 synthetic promoter sequences, and defined the i-th promoter sequence in a set to be paired with the i-th enhancer sequence.

In the next step, I inserted associated TFBSs into the sequences. In total, I chose three TFBS pairs (see Table 5.14), which are indicated by the names of the corresponding PWMs with an additional index for enhancer (enh) or promoter (prom): V$ROAZ_01 (prom) - V$NFAT5_Q5_02 (enh), V$GZF1_01 (prom) - V$E2_01 (enh) and V$IRF1_01 (prom) - V$ZNF143_03 (enh).

For each pair, I constructed a synthetic set of 1000 PEIs and inserted only this one TFBS pair into the sequences of the set, in order to avoid cross effects between different pairs.

In order to bring the synthetic example in line with realistic scenarios, I incorporated the “association strength” as an additional parameter. I defined the association strength as the proportion of PEIs for which the inserted pair is important, that is, the fraction of promoter and enhancer sequences in which the two TFBS motifs show a dependence in their binding behaviour. I assume that, within a set of PEIs, the association strength of TFs on enhancer and promoter sequences varies: some TF pairs are important for a large number of PEIs of the entire set, so that the TFBSs of enhancer and promoter sequences are strongly associated, whereas other TF pairs are involved in only a minority of PEIs, so that the association strength of their binding site distributions is lower. To address this point, I incorporated the association strength into my analysis as a discrete variable with three states (low, medium and high): for a low association strength the pair is associated in 20% of all PEIs, and for medium and high it is important for 50% and 90% of all input PEIs, respectively (see Table 5.15). I analyzed each of the three TFBS pairs for all three association strengths, resulting in a total of nine synthetically generated input sequence sets.
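As a small sketch of how the three association strength states could translate into a concrete PEI set, the following code marks which PEIs carry the associated pair. Choosing the associated PEIs uniformly at random and the function name are assumptions; only the fractions 20%, 50% and 90% come from the text.

```python
import random

# Fractions of PEIs in which the inserted pair is associated (from the text).
ASSOCIATION_STRENGTH = {"low": 0.2, "medium": 0.5, "high": 0.9}

def associated_mask(n_peis, strength, seed=0):
    """Return a list of booleans: True where the PEI carries the associated pair."""
    rng = random.Random(seed)
    n_assoc = round(ASSOCIATION_STRENGTH[strength] * n_peis)
    chosen = set(rng.sample(range(n_peis), n_assoc))  # random choice is an assumption
    return [i in chosen for i in range(n_peis)]
```

For a set of 1000 PEIs this marks 200, 500 or 900 PEIs as associated for low, medium and high, respectively.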

Motivated by the idea that the interplay between a certain TF in enhancer regions and a TF in promoter regions can be important for the PEI even though one or both of the TFs occur with a low frequency, I added the parameter “TFBS frequency” to the creation of the synthetic sets. For each TFBS pair, I determined the number of single TFBS instances per sequence for the TFBS frequency states low, medium and high. Because a fixed number of TFBS instances per sequence is unrealistic, I further incorporated a variability term. Thus, the number of TFBS instances per sequence is given by the TFBS frequency +/- the variability term, where the TFBSs show a low variability in the PEIs in which they are associated and a larger variability in the PEIs in which they are not associated (see Table 5.16).
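The rule “TFBS frequency +/- variability” can be illustrated with a short sketch. The uniform integer draw, the clipping at zero, the overwrite-style insertion and all names are assumptions made for illustration; the text does not specify how the instances were drawn or placed.

```python
import random

def sample_instance_count(n_motifs, variability, rng):
    """Draw the number of TFBS instances as n_motifs +/- variability, never below zero."""
    return max(0, n_motifs + rng.randint(-variability, variability))

def insert_motif(sequence, motif, rng):
    """Overwrite a random window of the sequence with the motif (length is preserved)."""
    start = rng.randrange(len(sequence) - len(motif) + 1)
    return sequence[:start] + motif + sequence[start + len(motif):]

def plant(sequence, consensus, n_motifs, variability, rng):
    """Insert a randomly drawn number of copies of a consensus sequence."""
    for _ in range(sample_instance_count(n_motifs, variability, rng)):
        # Later insertions may overlap earlier ones in this simplified sketch.
        sequence = insert_motif(sequence, consensus, rng)
    return sequence
```

A call such as `plant(enhancer, consensus, 3, 2, rng)` would insert between one and five copies of a hypothetical consensus string into an enhancer sequence.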

To summarize, I have three different TFBS pairs, each of which occurs with three different association strengths and three different TFBS frequency levels, resulting in 27 synthetically generated sequence sets, each of which consists of 1000 PEIs.
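The count of 27 sets follows directly from the full factorial design, as this short enumeration shows (the dictionary keys are illustrative names, not taken from the original workflow):

```python
from itertools import product

PAIRS = ["Pair 1", "Pair 2", "Pair 3"]
LEVELS = ["low", "medium", "high"]

# One synthetic sequence set per (pair, TFBS frequency, association strength).
conditions = [
    {"pair": p, "tfbs_frequency": f, "association_strength": a}
    for p, f, a in product(PAIRS, LEVELS, LEVELS)
]
print(len(conditions))  # 3 * 3 * 3 = 27
```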

Table 5.15.: Visualization of the different states of the “association strength” variable. The yellow and the blue TFs are important for the establishment of the underlying PEIs (←→) or are not involved in the PEIs (←→). Regarding the different association strengths: for low, the TF pair is important for/associated in 20% of all input PEIs, for medium in 50% and for high in 90% of all PEIs under study.

[Panels: High, Medium, Low]

Table 5.16.: Numbers of inserted TFBS instances for each artificially inserted associated TFBS pairing. The number of insertions (# Motifs) varies according to the column “Variability”, and the numbers further differ between the PEIs the pair is important for/associated in (←→) and those it is not (←→).

TFBS frequency  ←→                                            ←→
                V$ROAZ_01              V$NFAT_Q5_01           V$ROAZ_01              V$NFAT_Q5_01
                # Motifs  Variability  # Motifs  Variability  # Motifs  Variability  # Motifs  Variability
Low             1         1            0         0            2         0            2         1
Medium          3         2            1         0            5         1            4         1
High            5         3            2         2            7         1            6         1

                V$GZF1_01              V$E2_01                V$GZF1_01              V$E2_01
                # Motifs  Variability  # Motifs  Variability  # Motifs  Variability  # Motifs  Variability
Low             1         1            0         0            1         0            2         0
Medium          2         1            1         1            3         0            3         0
High            3         2            2         1            7         2            6         1

                V$IRF1_01              V$ZNF143_03            V$IRF1_01              V$ZNF143_03
                # Motifs  Variability  # Motifs  Variability  # Motifs  Variability  # Motifs  Variability
Low             3         2            2         1            4         0            3         0
Medium          3         2            2         1            7         2            6         1
High            3         3            2         1            9         3            8         2

Comparison of the different multivariate mutual information metrics. For the analysis of the 27 synthetic datasets, I set the parameters of the overall workflow as follows: I removed all columns of the matrix that had more than 50% zero entries by setting t = 0.5, and I set the number of intervals into which the count values were assigned to q = 30. I further used a PWM library of 166 matrices and ran the Match™ algorithm with the parameter setting that minimizes the number of false positive predictions (minFP). After applying our approach to these datasets, I considered a TFBS pair to be significant if its mutual information value is ≥ 0.
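The two preprocessing parameters t and q can be illustrated with a small sketch. It assumes that the matrix holds one row per sequence and one column per PWM with TFBS counts as entries, and that the q intervals are of equal width; neither detail is stated in this section, and the function names are hypothetical.

```python
def filter_sparse_columns(matrix, t=0.5):
    """Return the indices of columns whose fraction of zero entries is at most t."""
    n_rows = len(matrix)
    keep = []
    for j in range(len(matrix[0])):
        zeros = sum(1 for row in matrix if row[j] == 0)
        if zeros / n_rows <= t:  # columns with MORE than t zeros are dropped
            keep.append(j)
    return keep

def discretize(values, q=30):
    """Assign values to q equal-width intervals, labelled 0 .. q-1 (binning rule assumed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / q
    # The maximum value is placed in the last interval.
    return [min(int((v - lo) / width), q - 1) for v in values]
```

With t = 0.5, a column with exactly half of its entries equal to zero is kept, because only columns with more than 50% zero entries are removed.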

I applied all four information theoretic quantities to these simulated sets and show the number of significant pairs of the MMI in Table 5.17. The number of significant pairs strongly depends on the association strength as well as on the TFBS frequency. Considering my findings on the determination of specific intra-regional cooperating TFs in the simulation dataset (see Section 5.1.2), the different numbers of significant pairs can be explained by the unintentional insertion of additional TFBS pairs whose PWMs also match the inserted consensus sequences; their frequency of occurrence depends on the TFBS frequency and association strength constraints. The differences among the three pairs can in turn be attributed to the number of PWMs that match the individual consensus sequences.

Table 5.17.: Number of significant pairs identified by MMI for the simulation dataset of each condition.

TFBS frequency  Association strength
                Low     Medium  High
Pair 1
Low             3       5       12
Medium          6       25      70
High            34      56      106
Pair 2
Low             0       3       0
Medium          5       6       9
High            3       8       22
Pair 3
Low             3       5       12
Medium          33      55      90
High            46      65      124

Using a library of 166 PWMs, there are 27556 possible TFBS pairs between enhancers and promoters. Table 5.18 shows the position of the inserted pair in the ranking of the different mutual information measures. For example, considering Pair 1 with a low TFBS frequency and a low association strength, the MMI places Pair 1 at rank one, indicating that it shows the highest MMI value among all potential TFBS pairs.
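The number of candidate pairs and the notion of a rank can be made concrete with a short sketch. The function name, the toy scores and the tie handling (ties keep their input order) are illustrative assumptions.

```python
N_PWMS = 166
N_PAIRS = N_PWMS ** 2  # every promoter PWM combined with every enhancer PWM: 27556

def rank_of(pair, scores):
    """1-based position of `pair` when pairs are sorted by descending score.

    Returns None if the pair received no score, i.e. was not identified.
    """
    if pair not in scores:
        return None
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(pair) + 1

# Toy scores keyed by (promoter PWM, enhancer PWM); the values are made up.
toy_scores = {
    ("V$IRF1_01", "V$ZNF143_03"): 0.42,
    ("V$GZF1_01", "V$E2_01"): 0.17,
    ("V$ROAZ_01", "V$E2_01"): 0.05,
}
```

Here `rank_of(("V$IRF1_01", "V$ZNF143_03"), toy_scores)` gives 1, and a pair without a score yields `None`.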

Application of DTC. Considering the performance of DTC for Pair 1, the pair is ranked highly in six cases. For a low or medium TFBS frequency with a low association strength, it was not identified at all. For a high association strength in combination with a high TFBS frequency, it is at ranking position 13. In the analysis of Pair 2, the DTC correctly ranks it highly in all cases except for a low TFBS frequency with a low association strength, for which the pair was not identified. For Pair 3, it is in the top position in all cases.

Application of CMI. For the CMI, the performance on the dataset of Pair 1 is quite diverse; it is successful for low and medium TFBS frequency combined with medium and high association strength, as well as for high TFBS frequency with low and medium association strength. Pair 2 has successfully been identified as important in all cases except for a low association strength combined with a low and medium TFBS frequency, and for a high association strength with a medium TFBS frequency. Pair 3 is at the top of the pair ranking for all combinations with a medium and high TFBS frequency.

Table 5.18.: Results for the simulation dataset. The table gives the position of the inserted pair in the ranking of the underlying metrics for all condition combinations. In total, there are 27556 TFBS pairings participating in each ranking. Further, the number of intervals was set to q = 30 and the threshold for zero-entry filtering was set to t = 0.5. A dash (-) marks conditions in which the pair was not identified.

TFBS frequency  Association strength  Rank DTC(tE,tP,tL)  Rank CMI(tE;tP|tL)  Rank JMI(tE,tP;tL)  Rank MMI(tE;tP;tL)
Pair 1
Low             Low                   -                   -                   -                   -
Low             Medium                1                   1                   1                   2
Low             High                  1                   1                   1                   1
Medium          Low                   -                   -                   -                   -
Medium          Medium                1                   1                   1                   1
Medium          High                  1                   11                  4                   1
High            Low                   1                   14                  17                  1
High            Medium                1                   1                   14                  1
High            High                  13                  222                 24                  1
Pair 2
Low             Low                   -                   -                   -                   -
Low             Medium                1                   1                   1                   6840
Low             High                  1                   1                   1                   6745
Medium          Low                   1                   324                 1                   1
Medium          Medium                1                   1                   1                   1
Medium          High                  1                   423                 1                   1
High            Low                   1                   1                   2                   1
High            Medium                1                   1                   9                   1
High            High                  2                   2                   80                  1
Pair 3
Low             Low                   1                   767                 1                   1
Low             Medium                1                   533                 1                   1
Low             High                  1                   1588                1                   1
Medium          Low                   1                   1                   1                   1
Medium          Medium                1                   1                   1                   1
Medium          High                  1                   3                   47                  1
High            Low                   1                   1                   1                   1
High            Medium                1                   1                   1                   1
High            High                  1                   1                   56                  1

Application of JMI. The JMI in general shows a mixed performance. Pair 1 was successfully identified as most important in three cases: low TFBS frequency with medium and high association strength, and medium TFBS frequency with medium association strength. Pair 2 was ranked highly for all combinations with a low and medium TFBS frequency except low TFBS frequency with low association strength. The JMI shows its best performance in the analysis of Pair 3, where it correctly places the pair in the first position for all combinations except medium TFBS frequency with high association strength and high TFBS frequency with high association strength.

Application of MMI. The MMI identifies Pair 1 as the top candidate pair for all combinations of TFBS frequency and association strength except low TFBS frequency with low association strength and medium TFBS frequency with low association strength. Considering Pair 2, my approach was not able to identify it as the most important pair for the condition of low TFBS frequency, but identified it correctly in all other cases. Regarding Pair 3, my approach successfully ranked the inserted pair highly throughout all conditions.

To summarize, the DTC and the MMI show the best performance in the analysis of the simulation datasets. However, in the first small example the performance of the DTC was inscrutable to some degree, since it is a combination of CMI and JMI. Therefore, I decided to use the MMI as the preferred metric for the determination of associated TFBSs between enhancer and promoter regions.
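The remark that the DTC combines CMI and JMI can be checked numerically with a minimal sketch based on the standard textbook definitions for three discrete variables. These are assumptions here: the definitions used in this work are given in an earlier section and may differ, in particular in the sign convention of the MMI. The variables x, y and z stand in for tE, tP and tL.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more equally long discrete columns."""
    joint = Counter(zip(*columns))
    n = sum(joint.values())
    return -sum(c / n * log2(c / n) for c in joint.values())

def mi(x, y):
    """Mutual information I(X;Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def cmi(x, y, z):
    """Conditional mutual information I(X;Y|Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)

def jmi(x, y, z):
    """Joint mutual information I(X,Y;Z)."""
    return entropy(x, y) + entropy(z) - entropy(x, y, z)

def mmi(x, y, z):
    """Multivariate mutual information as I(X;Y) - I(X;Y|Z) (sign convention assumed)."""
    return mi(x, y) - cmi(x, y, z)

def dtc(x, y, z):
    """Dual total correlation: H(X,Y,Z) minus each variable's entropy given the other two."""
    h = entropy(x, y, z)
    return h - (h - entropy(y, z)) - (h - entropy(x, z)) - (h - entropy(x, y))
```

Under these definitions DTC(X,Y,Z) = I(X,Y;Z) + I(X;Y|Z) holds for any data, so a high DTC can stem from either term, which matches the observation that its behaviour is harder to interpret than that of the MMI.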

In document: Information theoretical approaches for the identification of potentially cooperating transcription factors (pages 106-111)