
analysis of cooperating TFs

4.1. Identification of intra-regional cooperating TFs using pointwise mutual information

4.1.1. Cooperating TFs

In this section, I introduce the idea of using pointwise mutual information to identify cooperating transcription factors (TFs) based on the co-occurrence of their binding sites in a set of sequences.

Pre-processing work. In the first step, I obtain all promoter regions for the set of RefSeq genes under study based on their annotated transcription start site (TSS) using the UCSC Table Browser [45]. I use the hg19 release of the human genome and consider only annotations on chromosomes chr1-chr22, chrX and chrY. Because alternative promoter regions of the same gene tend to overlap in the underlying RefSeq annotations, I filter redundant promoters based on their TSS by randomly picking one of the redundant promoter sequences. Consequently, only sequences without overlap enter the analysis.

Afterwards, I predict all potential transcription factor binding sites (TFBSs) in the obtained sequences and their reverse complements using the Match™ program, setting the profile parameters as specified by [4]. I further use the PWM library of TRANSFAC® release 2014.1 proposed by [4].

Workflow. The algorithm for the determination of co-occurring TFBSs comprises six phases, which are explained in detail in the following.

Phase 1: Construction and filtering of the TFBS-sequence matrix. Based on the number of predicted TFBSs in each sequence under study, a TFBS-sequence matrix M is generated in which rows correspond to the sequence IDs and columns to the names of the PWMs. An entry in M is defined as follows: let t_j be a TFBS predicted by PWM j (j ∈ 1, ..., n, where n is the number of PWMs in the library) and let s_i (i ∈ 1, ..., m, where m is the number of sequences under study) be a promoter sequence. The entry f_{ij} in M then corresponds to the frequency of t_j in s_i (see Figure 4.1). It turned out that some TFBSs are highly over-represented, while other TFBSs occur only rarely and in a minority of the sequences. To reduce the bias of highly represented TFBSs and the noise arising from insufficient data, the corresponding columns are filtered by removing all columns that i) contain more zero entries than average and ii) have a column sum ≤ 3×σ, where σ is the standard deviation of all column sums in M.
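The construction of M can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `build_matrix` and the input format (one `(sequence_id, pwm_name)` tuple per predicted TFBS) are assumptions made for illustration, and the column filter is deliberately left out.

```python
from collections import Counter


def build_matrix(predictions, sequence_ids, pwm_names):
    """Build the TFBS-sequence count matrix M.

    predictions: iterable of (sequence_id, pwm_name) tuples, one tuple per
                 predicted TFBS (hypothetical input format).
    Returns a list of rows; row i belongs to sequence_ids[i] and column j
    to pwm_names[j], so M[i][j] is the frequency f_ij.
    """
    counts = Counter(predictions)
    return [[counts[(s, t)] for t in pwm_names] for s in sequence_ids]


# Toy predictions mirroring the first two rows of the matrix in Figure 4.1.
predictions = [
    ("S1", "T1"), ("S1", "T2"), ("S1", "T3"), ("S1", "T4"),
    ("S2", "T1"), ("S2", "T3"), ("S2", "T4"), ("S2", "T4"),
]
M = build_matrix(predictions, ["S1", "S2"], ["T1", "T2", "T3", "T4"])
```

A `Counter` returns 0 for missing keys, so PWMs without any hit in a sequence yield zero entries without special handling.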

Phase 2: Identification of important TFBSs in each sequence. Following the idea used in linguistics for document summarization, I characterize the important TFBSs for each sequence based on the filtered M by calculating the pointwise mutual information (PMI^st) between a sequence s_i and a TFBS t_j as

$$\mathrm{PMI}(s_i, t_j) = \log_2 \frac{p(s_i, t_j)}{p(s_i)\,p(t_j)}, \tag{4.1.1}$$

[Figure 4.1: a) promoter sequences S1, ..., Sm around the TSS are scanned with Match™; b) resulting TFBS-sequence matrix:]

      T1  T2  T3  T4
S1     1   1   1   1
S2     1   0   1   2
S3     1   1   1   0
...
Sm     2   1   2   0

Figure 4.1.: Construction of the TFBS-sequence matrix. a) In a first step, all potential transcription factor binding sites are predicted for all sequences under study using the Match™ program [35]. b) In the next step, the TFBS-sequence matrix is generated, where rows correspond to the promoter sequences and columns to the PWMs used for TFBS prediction. An entry in the matrix refers to the frequency of predicted TFBSs in the corresponding sequence. For example, for PWM 4, one corresponding TFBS is identified in sequence s_1 and two TFBSs in s_2.

where p(s_i, t_j) is the joint probability for TFBS t_j occurring in sequence s_i with respect to the entire sequence set. It is defined as follows:

$$p(s_i, t_j) = \frac{f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}} \tag{4.1.2}$$

p(s_i) and p(t_j) are the marginal probabilities of s_i and t_j, respectively. They are defined as:

$$p(s_i) = \frac{\sum_{j=1}^{n} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}} \tag{4.1.3}$$

and

$$p(t_j) = \frac{\sum_{i=1}^{m} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}} \tag{4.1.4}$$

Finally, a TFBS t_j is considered to be important for sequence s_i if PMI(s_i, t_j) > 0, indicating that t_j occurs in s_i more often than expected by pure chance. In the following analysis steps, only those TFBSs are considered that have been identified as important by this criterion.
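The importance criterion of Phase 2 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the code used in this work: the function name `important_tfbs` is hypothetical, zero entries (for which the PMI is undefined) are treated as not important, and the toy matrix reuses the four rows shown in Figure 4.1 with S_m acting as a fourth sequence.

```python
from math import log2


def important_tfbs(matrix):
    """Return a boolean matrix that is True where PMI(s_i, t_j) > 0."""
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]          # numerators of p(s_i)
    col_sums = [sum(col) for col in zip(*matrix)]    # numerators of p(t_j)
    important = []
    for i, row in enumerate(matrix):
        flags = []
        for j, f in enumerate(row):
            if f == 0:
                # log2(0) is undefined; assumption: such entries are not important.
                flags.append(False)
                continue
            # p(s_i,t_j) / (p(s_i) * p(t_j)) simplifies to
            # f_ij * total / (row_sum_i * col_sum_j).
            pmi = log2(f * total / (row_sums[i] * col_sums[j]))
            flags.append(pmi > 0)
        important.append(flags)
    return important


example = [
    [1, 1, 1, 1],
    [1, 0, 1, 2],
    [1, 1, 1, 0],
    [2, 1, 2, 0],
]
flags = important_tfbs(example)
```

In this toy matrix, the two hits of T4 in the second sequence are flagged as important, whereas the single hits of T1 and T3 in the same sequence are not, because T1 and T3 are frequent across all sequences.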

Phase 3: Filter to avoid overlaps. The Match™ algorithm predicts all potential TFBSs based on a PWM library, which can result in multiple predictions for the same sequence region and, thus, in overlapping TFBSs. These overlaps can be explained by i) the similarity of some PWMs, ii) the palindromicity of TFBSs (the reverse complement is the same as the original sequence) and iii) the fact that some PWMs are longer than the real binding sites of the TFs. Two TFBSs can overlap partially, or one TFBS can be totally included in another binding site (see Figure 4.2). In analogy to [46], I define two TFBSs to be overlapping if their overlapping region exceeds a length of 4 bp.

[Figure 4.2 panels: partially overlapping TFBSs; partially overlapping TFBSs of the same type; totally overlapping TFBSs]

Figure 4.2.: Different scenarios for overlapping TFBSs. On the left side, the binding sites of the blue and the gray transcription factors share a few overlapping nucleotides, while the two gray TFBSs have a large overlapping region. On the right, the binding site of the green TF is totally included in the binding site of the yellow one, indicating that the binding of the two TFs is mutually exclusive.

The overlap of TFBSs of the same type can result in their over-representation in the following analysis steps. Thus, overlapping TFBSs of the same type are filtered such that the TFBS closer to the transcription start site (TSS) survives, since functionally important TFBSs are located closer to the TSS [47]. This filtering process is depicted in Figure 4.3.

Figure 4.3.: Filter to avoid overlaps. Overlapping TFBSs of the same type (marked by dashed circles) are filtered in a way that the TFBS survives that has a closer distance to the transcription start site (TSS) in order to avoid the overestimation of a certain TFBS.
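The overlap filter can be sketched as follows. This is a minimal illustration rather than the procedure actually implemented: the site representation `(pwm_name, start, end)` with half-open coordinates, the use of the site center to measure the distance to the TSS, and the greedy resolution of chained overlaps are all assumptions.

```python
def overlap_length(a, b):
    """Number of shared positions of two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def filter_same_type_overlaps(sites, tss, max_overlap=4):
    """Drop same-type TFBSs that overlap by more than max_overlap bp.

    sites: list of (pwm_name, start, end) tuples (hypothetical format).
    Among overlapping sites of the same type, the one closer to the TSS
    is kept. Sites of different types are never filtered here.
    """
    def distance_to_tss(site):
        _, start, end = site
        return abs((start + end) / 2 - tss)

    kept = []
    # Visit sites from closest to farthest from the TSS, so the closer
    # site of an overlapping same-type group is always kept first.
    for site in sorted(sites, key=distance_to_tss):
        clashes = any(
            k[0] == site[0] and overlap_length(k[1:], site[1:]) > max_overlap
            for k in kept
        )
        if not clashes:
            kept.append(site)
    return sorted(kept, key=lambda s: s[1])


sites = [("A", 100, 112), ("A", 105, 117), ("B", 106, 118), ("A", 130, 142)]
filtered = filter_same_type_overlaps(sites, tss=200)
```

In this toy input, the two "A" sites at 100 and 105 overlap by 7 bp, so only the one closer to the TSS at position 200 is retained; the "B" site overlaps both but is of a different type and therefore stays.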

Phase 4: Construction of TFBS pairs. The distance d_{t_A,t_B} between two TFBSs t_A and t_B is defined as the distance of their centers C_{t_A} and C_{t_B}:

$$d_{t_A,t_B} = |C_{t_A} - C_{t_B}| \tag{4.1.5}$$

The center C_{t_A} of a TFBS t_A is defined as ⌊length_A / 2⌋, where length_A indicates the length of t_A (see Figure 4.4).

Two TFBSs form a pair if d_min ≤ d_{t_A,t_B} ≤ d_max, where d_min and d_max are pre-defined minimal and maximal distance thresholds. A slight overlap of the TFBSs of at most 4 bp is allowed, as suggested in [46]. In this thesis, I set d_min = 5 bp, which is about half the length of an average TFBS, and tested several different d_max constraints. In the analysis, I have to deal with homotypic clusters, i.e. accumulations of TFBSs of the same type in a certain DNA region that are not necessarily overlapping. Such an accumulation of a certain TFBS results in a multitude of false positive pairs containing this TFBS. To avoid such over-estimations, a TFBS instance can only participate in one pairing of a specified TFBS pair (see the example on the homotypic cluster problem for details).
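The distance criterion can be written down as a small sketch. It rests on two assumptions that go beyond the text: the center coordinate is taken as the start position of the site plus ⌊length / 2⌋, and the default `d_max=50` is an arbitrary placeholder, since several d_max values were tested. The function names are hypothetical.

```python
def center(start, length):
    # Assumption: center coordinate = start position + floor(length / 2).
    return start + length // 2


def forms_pair(site_a, site_b, d_min=5, d_max=50):
    """Check d_min <= d <= d_max for two sites given as (start, length).

    d_min=5 follows the text; d_max=50 is only a placeholder value.
    """
    d = abs(center(*site_a) - center(*site_b))
    return d_min <= d <= d_max
```

For example, two sites of length 11 starting at positions 100 and 103 have centers 105 and 108, so their distance of 3 bp falls below d_min and they do not form a pair.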

Example: Homotypic cluster problem

The green TFBSs t_green form a homotypic cluster and the gray TFBS t_gray is included in the cluster. Simply counting all possible pairs of t_green-t_gray results in four pair instances.

A homotypic cluster is the accumulation of TFBSs of the same type in a certain DNA region.

In the example above, the green TFBSs t_green build a homotypic cluster and the gray TFBS t_gray is incorporated in this cluster, which leads to four pair instances of t_green-t_gray. However, this number of pairings is an overestimation for the considered pair, since i) the green binding sites are not all occupied by TFs at the same time and ii) the gray TF cannot interact with all green TFs at the same time. To avoid this overestimation of TFBS pairs, pair instances are identified by considering a certain pair of TFBSs t_A and t_B and scanning the DNA in 5'-3' direction to detect instances of this pair. Once a certain TFBS t_A or t_B is incorporated in a pair instance, it is blocked for additional pairings and cannot participate in another pair instance. Applying this strategy to the example above for the pair t_green-t_gray results in one pair instance instead of four.

Considered pair: t_green-t_gray. Scanning the sequence from left to right for instances of the pair t_green-t_gray, the first green TFBS is paired to the gray one (depicted by red lines). Afterwards, the gray TFBS is blocked for additional pairings, resulting in just one pair instance of t_green-t_gray instead of four (depicted by gray lines).

Considered pair: t_green-t_green. Scanning the sequence from left to right for instances of the pair t_green-t_green, the first and the second green TFBSs are paired and afterwards blocked for additional pairs. However, the third and the fourth green TFBSs are not blocked yet and can form an additional pair, which results in two pair instances of t_green-t_green.
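The blocking strategy of the homotypic cluster example can be sketched as a greedy left-to-right scan. This is an illustration under assumptions, not the original implementation: sites are given as hypothetical `(type, center)` tuples, the coordinates and `d_max=20` are invented so that the toy input reproduces the counts of the example, and each unblocked site is paired with its nearest admissible unblocked partner to the right.

```python
def naive_pair_count(sites, type_a, type_b, d_min, d_max):
    """Count all admissible pairs of two different types without blocking."""
    a_centers = [c for t, c in sites if t == type_a]
    b_centers = [c for t, c in sites if t == type_b]
    return sum(d_min <= abs(ca - cb) <= d_max
               for ca in a_centers for cb in b_centers)


def pair_instances(sites, type_a, type_b, d_min, d_max):
    """Greedy 5'-3' scan; every site takes part in at most one pair instance."""
    ordered = sorted(sites, key=lambda s: s[1])
    blocked = [False] * len(ordered)
    pairs = []
    for i, (t_i, c_i) in enumerate(ordered):
        if blocked[i] or t_i not in (type_a, type_b):
            continue
        # Partner type: the other member of the pair (same type if homotypic).
        wanted = type_b if t_i == type_a else type_a
        for j in range(i + 1, len(ordered)):
            t_j, c_j = ordered[j]
            d = c_j - c_i
            if d > d_max:
                break  # all further sites are even farther away
            if blocked[j] or t_j != wanted or d < d_min:
                continue
            blocked[i] = blocked[j] = True
            pairs.append((ordered[i], ordered[j]))
            break
    return pairs


# Four green sites forming a homotypic cluster with one gray site inside.
cluster = [("green", 10), ("green", 20), ("gray", 25),
           ("green", 30), ("green", 40)]
naive = naive_pair_count(cluster, "green", "gray", 5, 20)
green_gray = pair_instances(cluster, "green", "gray", 5, 20)
green_green = pair_instances(cluster, "green", "green", 5, 20)
```

With these toy coordinates, the naive count yields four green-gray pairs, while the scan with blocking yields one green-gray instance and two green-green instances, matching the example.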

Phase 5: Weighted cumulative pointwise mutual information. For the identification of potentially collaborating TF pairs, the PMI between all TFBSs t_A and t_B is calculated as follows:

$$\mathrm{PMI}(t_A; t_B) = \log_2 \frac{p(t_A, t_B)}{p(t_A)\,p(t_B)} \tag{4.1.6}$$


Figure 4.4.: TFBS pair construction. The TFBS pairs were identified based on the distance of their centers and are marked by red lines.

where p(t_A, t_B) is the joint probability for TFBSs t_A and t_B, and p(t_A) and p(t_B) are the respective marginal probabilities. The PMI in general is rather susceptible to low counts [41]. To overcome this property to some extent, PMI(t_A; t_B) is scaled by the joint probability p(t_A, t_B) and the weight w_s of the corresponding sequence s, resulting in the weighted pointwise mutual information PMI^sp(t_A; t_B).

$$\mathrm{PMI}^{sp}(t_A; t_B) = w_s \cdot p(t_A, t_B) \cdot \mathrm{PMI}(t_A; t_B) \tag{4.1.7}$$

The weight w_s of a sequence s is defined as the number of TFBS pairs N_s in s divided by the total number of TFBS pairs in the sequence set S:

$$w_s = \frac{N_s}{\sum_{s_i \in S} N_{s_i}} \tag{4.1.8}$$

Finally, in order to determine the important pairs in S, PMI^sp(t_A; t_B) for each TFBS pair t_A and t_B is summed up over all sequences, resulting in the cumulative pointwise mutual information PMI^pc(t_A; t_B):

$$\mathrm{PMI}^{pc}(t_A; t_B) = \sum_{s \in S} \mathrm{PMI}^{sp}(t_A; t_B) \tag{4.1.9}$$
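Equations 4.1.6 to 4.1.9 can be combined in a short sketch for a single TFBS pair. Since this section does not spell out how the probabilities are estimated, the sketch simply takes them as given per-sequence inputs; this input format, the dictionary keys and the function names are assumptions, as is the convention that a sequence in which the pair never occurs contributes zero.

```python
from math import log2


def pmi(p_ab, p_a, p_b):
    """Pointwise mutual information of a TFBS pair (Eq. 4.1.6)."""
    return log2(p_ab / (p_a * p_b))


def cumulative_weighted_pmi(sequences):
    """Cumulative weighted PMI of one TFBS pair over a sequence set.

    sequences: list of dicts with the hypothetical keys
        'n_pairs' -> N_s, the number of TFBS pairs in sequence s
        'p_ab', 'p_a', 'p_b' -> probabilities used for this sequence
    """
    total_pairs = sum(s["n_pairs"] for s in sequences)
    cumulative = 0.0
    for s in sequences:
        if s["p_ab"] == 0:
            continue  # assumption: a pair that never occurs contributes 0
        w_s = s["n_pairs"] / total_pairs                              # Eq. 4.1.8
        weighted = w_s * s["p_ab"] * pmi(s["p_ab"], s["p_a"], s["p_b"])  # Eq. 4.1.7
        cumulative += weighted                                        # Eq. 4.1.9
    return cumulative


toy = [
    {"n_pairs": 3, "p_ab": 0.25, "p_a": 0.5, "p_b": 0.25},
    {"n_pairs": 1, "p_ab": 0.125, "p_a": 0.25, "p_b": 0.5},
]
score = cumulative_weighted_pmi(toy)
```

In the toy input, the first sequence has a PMI of 1 and a weight of 3/4, whereas the second has a PMI of 0, so only the first sequence contributes to the cumulative value.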

Phase 6: Background noise reduction of TFBSs using average product correction. To reduce the effect of false positive TFBS pairs, I apply the average product correction (APC) procedure [48] to the PMI^pc(t_A; t_B) values. For each TFBS pair t_A and t_B, I estimate the background noise APC(t_A, t_B) as follows:

$$\mathrm{APC}(t_A, t_B) = \frac{\overline{\mathrm{PMI}^{pc}}(t_A; t_x) \cdot \overline{\mathrm{PMI}^{pc}}(t_B; t_x)}{\overline{\mathrm{PMI}^{pc}}}, \tag{4.1.10}$$

where $\overline{\mathrm{PMI}^{pc}}(t_A; t_x)$ is the average PMI^pc value of t_A with all other binding sites and $\overline{\mathrm{PMI}^{pc}}$ is the overall mean of all calculated PMI^pc values. The average $\overline{\mathrm{PMI}^{pc}}(t_A; t_x)$ is calculated as:

$$\overline{\mathrm{PMI}^{pc}}(t_A; t_x) = \frac{1}{n-1} \sum_{x=1,\, x \neq A}^{n} \mathrm{PMI}^{pc}(t_A; t_x), \tag{4.1.11}$$

where x = 1, ..., n and x ≠ A. This estimated background noise APC(t_A, t_B) is then subtracted from the original PMI^pc(t_A; t_B) value, resulting in the final PMI^APC_pc(t_A; t_B) value.

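$$\mathrm{PMI}^{APC}_{pc}(t_A; t_B) = \mathrm{PMI}^{pc}(t_A; t_B) - \mathrm{APC}(t_A, t_B) \tag{4.1.12}$$

Based on the final PMI^APC_pc(t_A; t_B) values, the z-score for each TFBS pair t_A and t_B is calculated, and a pair is considered significant if its z-score(t_A, t_B) ≥ 3.

The z-score is calculated as:

$$\text{z-score}(t_A, t_B) = \frac{\mathrm{PMI}^{APC}_{pc}(t_A; t_B) - \overline{\mathrm{PMI}^{APC}_{pc}}}{\sigma_{\mathrm{PMI}^{APC}_{pc}}}, \tag{4.1.13}$$

where $\overline{\mathrm{PMI}^{APC}_{pc}}$ is the overall mean of the PMI^APC_pc values and $\sigma_{\mathrm{PMI}^{APC}_{pc}}$ is the corresponding standard deviation.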
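The correction and scoring steps of Phase 6 can be sketched as follows. This is a minimal illustration under assumptions, not the original implementation: the cumulative PMI values are given as a symmetric n×n list of lists, only pairs of different TFBS types are considered, the overall mean is taken over these pairs, and the population standard deviation is used. The function name and the toy values are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def apc_zscores(pmi_pc):
    """z-scores of APC-corrected cumulative PMI values.

    pmi_pc: symmetric n x n matrix; pmi_pc[a][b] is the cumulative PMI of
            the TFBS types a and b (the diagonal is ignored).
    Returns a dict mapping (a, b) with a < b to the z-score of the pair.
    Raises ZeroDivisionError if the overall mean or the standard
    deviation is zero.
    """
    n = len(pmi_pc)
    # Average value of each TFBS type with all other types (Eq. 4.1.11).
    row_mean = [sum(pmi_pc[a][x] for x in range(n) if x != a) / (n - 1)
                for a in range(n)]
    pairs = list(combinations(range(n), 2))
    overall = mean([pmi_pc[a][b] for a, b in pairs])
    # Subtract the estimated background noise (Eqs. 4.1.10 and 4.1.12).
    corrected = {(a, b): pmi_pc[a][b] - row_mean[a] * row_mean[b] / overall
                 for a, b in pairs}
    values = list(corrected.values())
    mu, sigma = mean(values), pstdev(values)
    # Standardize the corrected values (Eq. 4.1.13).
    return {pair: (v - mu) / sigma for pair, v in corrected.items()}


toy = [
    [0, 4, 1, 1],
    [4, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
z = apc_zscores(toy)
best_pair = max(z, key=z.get)
```

In the toy matrix, the pair (0, 1) stands out from the background and receives the highest z-score. With only six pairs no z-score can reach the threshold of 3, so the toy data illustrate the ranking only.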

In document: Information theoretical approaches for the identification of potentially cooperating transcription factors (pages 60-66)