
analysis of cooperating TFs

4.1. Identification of intra-regional cooperating TFs using pointwise mutual information

4.1.1. Cooperating TFs

In this section, I introduce the idea of using pointwise mutual information to identify cooperating transcription factors based on the co-occurrence of their binding sites in a set of sequences.

Pre-processing work In the first step, I obtain all promoter regions for the set of RefSeq genes under study based on their annotated transcription start site (TSS) using the UCSC Table Browser [45]. I use the hg19 release of the human genome and consider only annotations on the chromosomes chr1-chr22, chrX and chrY. Because alternative promoter regions of the same gene tend to overlap in the underlying RefSeq annotations, I filter redundant promoters based on their TSS by randomly picking one of the redundant promoter sequences. Consequently, only sequences without overlap enter the analysis.

Afterwards, I predict all potential transcription factor binding sites (TFBSs) in the obtained sequences and their reverse complements using the Match™ program, setting the profile parameters as specified by [4]. I further use the PWM library of TRANSFAC® release 2014.1 proposed by [4].

Workflow The algorithm for the determination of co-occurring TFBSs comprises six phases that are explained in detail in the following.

Phase 1: Construction and filtering of TFBS-sequence matrix Based on the number of predicted TFBSs in each sequence under study, a TFBS-sequence matrix $M$ is generated, where rows correspond to the sequence IDs and columns to the names of the PWMs. An entry in $M$ is defined as follows: let $t_j$ be a TFBS predicted by PWM $j$ ($j \in \{1, \dots, n\}$, where $n$ is the number of PWMs in the library) and let $s_i$ ($i \in \{1, \dots, m\}$, where $m$ is the number of sequences under study) be a promoter sequence. The entry $f_{ij}$ in $M$ then corresponds to the frequency of $t_j$ in $s_i$ (see Figure 4.1). It turned out that some TFBSs are highly over-represented, while other TFBSs occur rarely and only in a minority of the sequences. In order to reduce the bias of highly represented TFBSs and the noise arising from insufficient data, all columns are removed that i) contain more zero entries than average and ii) have a column sum $\le 3\sigma$, where $\sigma$ is the standard deviation of all column sums in $M$.
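The construction of the count matrix can be sketched as follows. This is a minimal illustration and not the implementation used in this thesis: the function name `build_tfbs_sequence_matrix` and the input format (a list of hypothetical `(sequence ID, PWM name)` hits) are assumptions, and the column filter is deliberately left out.

```python
from collections import Counter


def build_tfbs_sequence_matrix(hits, sequence_ids, pwm_names):
    """Return M as a list of rows, where M[i][j] is the frequency f_ij.

    hits         -- iterable of (sequence_id, pwm_name), one per predicted TFBS
                    (hypothetical input format, assumed for this sketch)
    sequence_ids -- row order (s_1, ..., s_m)
    pwm_names    -- column order (PWM 1, ..., PWM n)
    """
    counts = Counter(hits)  # (sequence_id, pwm_name) -> number of predicted TFBSs
    return [[counts[(s, p)] for p in pwm_names] for s in sequence_ids]


# Hits that reproduce the first two rows of the example matrix in Figure 4.1.
example_hits = [
    ("S1", "T1"), ("S1", "T2"), ("S1", "T3"), ("S1", "T4"),
    ("S2", "T1"), ("S2", "T3"), ("S2", "T4"), ("S2", "T4"),
]
example_matrix = build_tfbs_sequence_matrix(
    example_hits, ["S1", "S2"], ["T1", "T2", "T3", "T4"]
)
```

Missing combinations default to zero because `Counter` returns 0 for unseen keys.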

Phase 2: Identification of important TFBSs in each sequence Following the idea from linguistics for document summarization, I characterize the important TFBSs for each sequence based on the filtered $M$ by calculating the pointwise mutual information ($\mathrm{PMI}_{st}$) between a sequence $s_i$ and a TFBS $t_j$ as

$$\mathrm{PMI}(s_i, t_j) = \log_2 \frac{p(s_i, t_j)}{p(s_i)\,p(t_j)}, \tag{4.1.1}$$

[Figure 4.1, panel a): TFBS prediction with Match™ in the promoter sequences S1, S2, S3, ..., Sm (TSS marked). Panel b): example TFBS-sequence matrix:]

      T1  T2  T3  T4
S1     1   1   1   1
S2     1   0   1   2
S3     1   1   1   0
...
Sm     2   1   2   0

Figure 4.1.: Construction of TFBS-sequence matrix. a) In a first step, all potential transcription factor binding sites are predicted for all sequences under study using the Match™ program [35]. b) In the next step, the TFBS-sequence matrix is generated, where rows correspond to the promoter sequences and columns to the PWMs used for TFBS prediction. An entry in the matrix refers to the frequency of predicted TFBSs in the corresponding sequence. For example, for PWM 4, one corresponding TFBS is identified in sequence $s_1$ and two TFBSs are identified in $s_2$.

where $p(s_i, t_j)$ is the joint probability of TFBS $t_j$ occurring in sequence $s_i$ with respect to the entire sequence set. It is defined as follows:

$$p(s_i, t_j) = \frac{f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}} \tag{4.1.2}$$

$p(s_i)$ and $p(t_j)$ are the marginal probabilities of $s_i$ and $t_j$, respectively. They are defined as:

$$p(s_i) = \frac{\sum_{j=1}^{n} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}} \tag{4.1.3}$$

and

$$p(t_j) = \frac{\sum_{i=1}^{m} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}} \tag{4.1.4}$$

Finally, a TFBS $t_j$ is considered to be important for sequence $s_i$ if $\mathrm{PMI}(s_i, t_j) > 0$, indicating that $t_j$ occurs more often in $s_i$ than expected by pure chance. In the following analysis steps, only those TFBSs are considered that have been identified as important by this criterion.
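Equations 4.1.1 to 4.1.4 can be evaluated directly on the count matrix. The sketch below is only an illustration under assumptions: the function names are hypothetical, the four rows of Figure 4.1 are treated as if they were the complete sequence set, and zero entries are assigned $-\infty$ so that they never count as important.

```python
import math


def sequence_tfbs_pmi(matrix):
    """PMI(s_i, t_j) for every entry of the count matrix (eqs. 4.1.1-4.1.4)."""
    total = sum(sum(row) for row in matrix)        # sum over all f_ij
    row_sums = [sum(row) for row in matrix]        # numerators of p(s_i)
    col_sums = [sum(col) for col in zip(*matrix)]  # numerators of p(t_j)
    pmi = []
    for i, row in enumerate(matrix):
        pmi_row = []
        for j, f in enumerate(row):
            if f == 0:
                # log2(0) is undefined; such a TFBS is never "important".
                pmi_row.append(float("-inf"))
            else:
                # p(s_i,t_j) / (p(s_i) p(t_j)) simplifies to f * total / (row * col).
                pmi_row.append(math.log2(f * total / (row_sums[i] * col_sums[j])))
        pmi.append(pmi_row)
    return pmi


def important_tfbs(matrix):
    """Boolean mask: True where PMI(s_i, t_j) > 0."""
    return [[value > 0 for value in row] for row in sequence_tfbs_pmi(matrix)]


# Example matrix from Figure 4.1, rows S1, S2, S3, Sm (assumed to be the full set).
figure_matrix = [[1, 1, 1, 1], [1, 0, 1, 2], [1, 1, 1, 0], [2, 1, 2, 0]]
figure_pmi = sequence_tfbs_pmi(figure_matrix)
figure_mask = important_tfbs(figure_matrix)
```

In this toy setting, the two occurrences of T4 in S2 yield a positive PMI, whereas the single occurrence of the frequent T1 in S1 does not.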

Phase 3: Filter to avoid overlaps The Match™ algorithm predicts all potential TFBSs based on a PWM library, which can result in multiple predictions for the same sequence region and, thus, in overlapping TFBSs. These overlaps can be explained by i) the similarity of some PWMs, ii) the palindromicity of TFBSs (the reverse complement is the same as the original sequence) and iii) PWMs that are larger than the real binding sites of the TFs. Two TFBSs can overlap partially, or one TFBS can be totally included in another binding site (see Figure 4.2). In analogy to [46], I define two TFBSs to be overlapping if their overlapping region exceeds a length of 4 bp.

[Figure 4.2, panel labels: partially overlapping TFBSs; partially overlapping TFBSs of the same type; totally overlapping TFBSs]

Figure 4.2.: Different scenarios for overlapping TFBSs. On the left side, the binding sites of the blue and the gray transcription factors share a few overlapping nucleotides, while the two gray TFBSs have a large overlapping region. On the right, the binding site of the green TF is totally included in the binding site of the yellow one, indicating that the binding of the two TFs is mutually exclusive.

The overlap of TFBSs of the same type can result in their over-representation in the following analysis steps. Thus, overlapping TFBSs of the same type are filtered such that only the TFBS closer to the transcription start site (TSS) survives, since functionally important TFBSs tend to be located closer to the TSS [47]. This filtering process is depicted in Figure 4.3.

Figure 4.3.: Filter to avoid overlaps. Overlapping TFBSs of the same type (marked by dashed circles) are filtered such that only the TFBS closer to the transcription start site (TSS) survives, in order to avoid the overestimation of a certain TFBS.
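One possible way to realize this filter is a greedy pass over the sites ordered by their distance to the TSS. The sketch below is an illustration under assumptions that are not stated in the text: sites are given as half-open intervals, the distance to the TSS is measured from the site center, and the names `Site`, `overlap_length` and `filter_same_type_overlaps` are hypothetical.

```python
from typing import List, NamedTuple


class Site(NamedTuple):
    pwm: str    # name of the PWM that predicted the site
    start: int  # first position (inclusive)
    end: int    # last position + 1 (exclusive)


def overlap_length(a: Site, b: Site) -> int:
    """Number of positions shared by two sites (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def site_center(site: Site) -> int:
    return site.start + (site.end - site.start) // 2


def filter_same_type_overlaps(sites: List[Site], tss: int, max_overlap: int = 4) -> List[Site]:
    """Keep, among overlapping sites of the same PWM, the one closer to the TSS."""
    kept: List[Site] = []
    # Visit sites from closest to farthest from the TSS (assumption: center distance).
    for site in sorted(sites, key=lambda s: abs(site_center(s) - tss)):
        clashes = any(
            k.pwm == site.pwm and overlap_length(k, site) > max_overlap for k in kept
        )
        if not clashes:
            kept.append(site)
    return sorted(kept, key=lambda s: s.start)


example_sites = [Site("A", 100, 112), Site("A", 105, 117), Site("B", 106, 116), Site("A", 130, 142)]
filtered_sites = filter_same_type_overlaps(example_sites, tss=150)
```

Here the two overlapping sites of PWM A share 7 bp, so only the one closer to the TSS is kept, while the overlapping site of the different PWM B is left untouched.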

Phase 4: Construction of TFBS pairs The distance $d_{t_A,t_B}$ between two TFBSs $t_A$ and $t_B$ is defined as the distance of their centers $C_{t_A}$ and $C_{t_B}$:

$$d_{t_A,t_B} = |C_{t_A} - C_{t_B}| \tag{4.1.5}$$

Thereby, the center $C_{t_A}$ of a TFBS $t_A$ is defined as $\lfloor \mathrm{length}_A / 2 \rfloor$, where $\mathrm{length}_A$ indicates the length of $t_A$ (see Figure 4.4).

Two TFBSs form a pair if $d_{min} \le d_{t_A,t_B} \le d_{max}$, where $d_{min}$ and $d_{max}$ are pre-defined minimal and maximal distance thresholds. A slight overlap of the TFBSs of at most 4 bp is allowed, as suggested in [46]. In this thesis, I set $d_{min} = 5$ bp, which is about half of the length of an average TFBS, and tested several different $d_{max}$ constraints. In the analysis, I have to deal with homotypic clusters, that is, accumulations of TFBSs of the same type in a certain DNA region that are not necessarily overlapping. Such an accumulation of a certain TFBS results in a multitude of false positive pairs containing this TFBS. In order to avoid such over-estimations, a TFBS instance can only participate in one pairing of a specified TFBS pair (see the example "Homotypic cluster problem" for details).

Example: Homotypic cluster problem

The green TFBSs $t_{green}$ form a homotypic cluster and the gray TFBS $t_{gray}$ is included in the cluster. Simply counting all possible pairs of $t_{green}$-$t_{gray}$ results in four pair instances.

A homotypic cluster is the accumulation of TFBSs of the same type in a certain DNA region.

In the example above, the green TFBSs $t_{green}$ build a homotypic cluster and the gray TFBS $t_{gray}$ is incorporated in this cluster, which leads to four pair instances of $t_{green}$-$t_{gray}$. However, this number is an overestimation for the considered pair, since i) the green binding sites are not all occupied by TFs at the same time and ii) the gray TF cannot interact with all green TFs at the same time. In order to avoid this overestimation of TFBS pairs, I identify the pair instances by considering a certain pair of TFBSs $t_A$ and $t_B$ and scanning the DNA in 5'-3' direction to detect instances of this pair. After a certain TFBS $t_A$ or $t_B$ has been incorporated in a pair instance, it is blocked for additional pairings and cannot participate in another pair instance. Applying this strategy to the example above for the pair $t_{green}$-$t_{gray}$ results in one pair instance instead of four.

Considered pair: $t_{green}$-$t_{gray}$. Scanning the sequence from left to right for instances of the pair $t_{green}$-$t_{gray}$, the first green TFBS is paired to the gray one (depicted by red lines). Afterwards, the gray TFBS is blocked for additional pairings, resulting in just one pair instance of $t_{green}$-$t_{gray}$ instead of four (depicted by gray lines).

Considered pair: $t_{green}$-$t_{green}$. Scanning the sequence from left to right for instances of the pair $t_{green}$-$t_{green}$, the first and the second green TFBSs are paired and afterwards blocked for additional pairs. However, the third and the fourth green TFBSs are not blocked yet and can form an additional pair, which results in two pair instances of $t_{green}$-$t_{green}$.
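The scanning and blocking strategy of the example can be sketched as follows. This is a minimal sketch rather than the implementation used in this thesis, and it rests on several assumptions: sites are hypothetical `(name, start, end)` tuples with an exclusive end, the center is taken as `start + length // 2`, the value `d_max=50` is only a placeholder, and the check for the allowed overlap of at most 4 bp is omitted.

```python
from typing import List, Tuple

TfbsSite = Tuple[str, int, int]  # (PWM name, start, end), end exclusive


def tfbs_center(site: TfbsSite) -> int:
    _, start, end = site
    return start + (end - start) // 2


def find_pair_instances(sites: List[TfbsSite], type_a: str, type_b: str,
                        d_min: int = 5, d_max: int = 50) -> List[Tuple[TfbsSite, TfbsSite]]:
    """Scan 5'->3' and pair each site at most once for the pair (type_a, type_b)."""
    relevant = sorted((s for s in sites if s[0] in (type_a, type_b)), key=tfbs_center)
    blocked = [False] * len(relevant)
    wanted = {(type_a, type_b), (type_b, type_a)}
    pairs = []
    for i, first in enumerate(relevant):
        if blocked[i]:
            continue
        for j in range(i + 1, len(relevant)):
            if blocked[j]:
                continue
            second = relevant[j]
            distance = tfbs_center(second) - tfbs_center(first)
            if distance > d_max:
                break  # sites are sorted, so all later sites are even farther away
            if (first[0], second[0]) in wanted and distance >= d_min:
                blocked[i] = blocked[j] = True  # both sites are used up for this pair
                pairs.append((first, second))
                break
    return pairs


# Toy version of the homotypic cluster example: four green sites around one gray site.
cluster = [("green", 0, 10), ("green", 12, 22), ("gray", 24, 34),
           ("green", 36, 46), ("green", 48, 58)]
green_gray = find_pair_instances(cluster, "green", "gray")
green_green = find_pair_instances(cluster, "green", "green")
```

For this toy cluster the sketch yields one green-gray instance and two green-green instances, in line with the example.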

Phase 5: Weighted cumulative pointwise mutual information For the identification of potentially collaborating TF pairs, the PMI between all TFBSs $t_A$ and $t_B$ is calculated as follows:

$$\mathrm{PMI}(t_A; t_B) = \log_2 \frac{p(t_A, t_B)}{p(t_A)\,p(t_B)} \tag{4.1.6}$$


Figure 4.4.: TFBS pair construction. The TFBS pairs were identified based on the distance of their centers and are marked by red lines.

where $p(t_A, t_B)$ is the joint probability of the TFBSs $t_A$ and $t_B$, and $p(t_A)$ and $p(t_B)$ are the corresponding marginal probabilities. The PMI in general is rather susceptible to low counts [41]. In order to overcome this property to some extent, $\mathrm{PMI}(t_A; t_B)$ is scaled by the joint probability $p(t_A, t_B)$ and the weight $w_s$ of the corresponding sequence $s$, resulting in the weighted pointwise mutual information $\mathrm{PMI}_{sp}(t_A; t_B)$:

$$\mathrm{PMI}_{sp}(t_A; t_B) = w_s \cdot p(t_A, t_B) \cdot \mathrm{PMI}(t_A; t_B) \tag{4.1.7}$$

The weight $w_s$ of a sequence $s$ is defined as the number of TFBS pairs $N_s$ in $s$ divided by the total number of TFBS pairs in the sequence set $S$:

$$w_s = \frac{N_s}{\sum_{s_i \in S} N_{s_i}} \tag{4.1.8}$$

Finally, in order to determine the important pairs in $S$, $\mathrm{PMI}_{sp}(t_A; t_B)$ is summed up over all sequences for each TFBS pair $t_A$ and $t_B$, resulting in the cumulative pointwise mutual information $\mathrm{PMI}_{pc}(t_A; t_B)$:

$$\mathrm{PMI}_{pc}(t_A; t_B) = \sum_{s \in S} \mathrm{PMI}_{sp}(t_A; t_B) \tag{4.1.9}$$
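The weighting and summation of equations 4.1.6 to 4.1.9 can be sketched as follows. The sketch assumes that the per-sequence probabilities are already available; how they are estimated is not part of it. The function name and the input layout (a list of dictionaries with the hypothetical keys `n_pairs` and `probs`) are illustrative only.

```python
import math
from collections import defaultdict


def cumulative_weighted_pmi(sequences):
    """Sum the weighted PMI of every TFBS pair over all sequences.

    sequences -- list of dicts (hypothetical layout), one per sequence s:
                 "n_pairs": N_s, the number of TFBS pairs found in s
                 "probs":   {(tA, tB): (p_joint, p_a, p_b)} for the pairs in s
    """
    total_pairs = sum(seq["n_pairs"] for seq in sequences)  # denominator of eq. 4.1.8
    pmi_pc = defaultdict(float)
    for seq in sequences:
        w_s = seq["n_pairs"] / total_pairs                  # eq. 4.1.8
        for pair, (p_joint, p_a, p_b) in seq["probs"].items():
            pmi = math.log2(p_joint / (p_a * p_b))          # eq. 4.1.6
            pmi_pc[pair] += w_s * p_joint * pmi             # eqs. 4.1.7 and 4.1.9
    return dict(pmi_pc)


toy_sequences = [
    {"n_pairs": 3, "probs": {("A", "B"): (0.5, 0.5, 0.5)}},   # PMI = 1
    {"n_pairs": 1, "probs": {("A", "B"): (0.25, 0.5, 0.5)}},  # PMI = 0
]
toy_pmi_pc = cumulative_weighted_pmi(toy_sequences)
```

In the toy input, the first sequence contributes $0.75 \cdot 0.5 \cdot 1 = 0.375$ and the second contributes nothing, because its pair occurs exactly as often as expected under independence.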

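Phase 6: Background noise reduction of TFBSs using average product correction To reduce the effect of false positive TFBS pairs, I apply the average product correction (APC) procedure [48] to the $\mathrm{PMI}_{pc}(t_A; t_B)$ values. For each TFBS pair $t_A$ and $t_B$, I estimate the background noise $\mathrm{APC}(t_A, t_B)$ as follows:

$$\mathrm{APC}(t_A, t_B) = \frac{\overline{\mathrm{PMI}_{pc}(t_A; t_x)} \cdot \overline{\mathrm{PMI}_{pc}(t_B; t_x)}}{\overline{\mathrm{PMI}_{pc}}}, \tag{4.1.10}$$

where $\overline{\mathrm{PMI}_{pc}(t_A; t_x)}$ is the average $\mathrm{PMI}_{pc}$ value of $t_A$ with all other binding sites and $\overline{\mathrm{PMI}_{pc}}$ is the overall mean of all calculated $\mathrm{PMI}_{pc}$ values. $\overline{\mathrm{PMI}_{pc}(t_A; t_x)}$ is calculated as:

$$\overline{\mathrm{PMI}_{pc}(t_A; t_x)} = \frac{1}{n-1} \sum_{\substack{x=1 \\ x \neq A}}^{n} \mathrm{PMI}_{pc}(t_A; t_x), \tag{4.1.11}$$

where $x = 1, \dots, n$ and $x \neq A$. This estimated background noise $\mathrm{APC}(t_A, t_B)$ is then subtracted from the original $\mathrm{PMI}_{pc}(t_A; t_B)$ value, resulting in the final $\mathrm{PMI}_{pc}^{\mathrm{APC}}(t_A; t_B)$ value:

$$\mathrm{PMI}_{pc}^{\mathrm{APC}}(t_A; t_B) = \mathrm{PMI}_{pc}(t_A; t_B) - \mathrm{APC}(t_A, t_B) \tag{4.1.12}$$

Based on the final $\mathrm{PMI}_{pc}^{\mathrm{APC}}(t_A; t_B)$ values, the z-score for each TFBS pair $t_A$ and $t_B$ is calculated, and a pair is considered significant if its z-score$(t_A, t_B) \ge 3$.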

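The z-score is calculated as:

$$z\text{-score}(t_A, t_B) = \frac{\mathrm{PMI}_{pc}^{\mathrm{APC}}(t_A; t_B) - \overline{\mathrm{PMI}_{pc}^{\mathrm{APC}}}}{\sigma_{\mathrm{PMI}_{pc}^{\mathrm{APC}}}}, \tag{4.1.13}$$

where $\overline{\mathrm{PMI}_{pc}^{\mathrm{APC}}}$ is the overall mean of the $\mathrm{PMI}_{pc}^{\mathrm{APC}}$ values and $\sigma_{\mathrm{PMI}_{pc}^{\mathrm{APC}}}$ is the corresponding standard deviation.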
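Equations 4.1.10 to 4.1.13 can be sketched in a few lines. The sketch below makes assumptions that the text leaves open: TFBSs are indexed 0 to n-1, a cumulative PMI value is available for every pair of distinct TFBSs, the overall mean is taken over these pairs only, and the population standard deviation is used. The function name `apc_z_scores` is hypothetical.

```python
import statistics


def apc_z_scores(pmi_pc, n):
    """Average product correction followed by z-score standardization.

    pmi_pc -- dict {(a, b): value} with a < b for ALL pairs of the n TFBSs
              (assumption: no pair is missing, homotypic pairs are excluded)
    Returns (corrected, z), both dicts keyed like pmi_pc.
    """
    def value(a, b):
        return pmi_pc[(a, b) if a < b else (b, a)]

    # Eq. 4.1.11: mean PMI of each TFBS with all other TFBSs.
    row_mean = [sum(value(a, x) for x in range(n) if x != a) / (n - 1) for a in range(n)]
    # Overall mean of all calculated values (must be non-zero for eq. 4.1.10).
    overall_mean = statistics.fmean(list(pmi_pc.values()))

    # Eqs. 4.1.10 and 4.1.12: subtract the estimated background noise.
    corrected = {
        (a, b): v - row_mean[a] * row_mean[b] / overall_mean
        for (a, b), v in pmi_pc.items()
    }

    # Eq. 4.1.13, using the population standard deviation (assumption).
    values = list(corrected.values())
    mean_c = statistics.fmean(values)
    sd_c = statistics.pstdev(values)
    z = {pair: (v - mean_c) / sd_c for pair, v in corrected.items()}
    return corrected, z


toy_values = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 3.0}
toy_corrected, toy_z = apc_z_scores(toy_values, n=3)
# Pairs passing the significance criterion of the text (z-score >= 3).
toy_significant = [pair for pair, score in toy_z.items() if score >= 3]
```

With only three pairs, no z-score can reach the threshold of 3, so the toy example yields no significant pair. It serves only to make the order of the steps explicit.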