• Keine Ergebnisse gefunden

Modeling molecular signaling and gene expression using Dynamic Nested Effects Models

N/A
N/A
Protected

Academic year: 2022

Aktie "Modeling molecular signaling and gene expression using Dynamic Nested Effects Models"

Copied!
132
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

Modeling molecular signaling and gene expression using Dynamic

Nested Effects Models

DISSERTATION ZUR ERLANGUNG DES DOKTORGRADES DER NATURWISSENSCHAFTEN(DR.RER.NAT.) DER FAKULT ¨AT F ¨UR BIOLOGIE

UND VORKLINISCHE MEDIZIN DER UNIVERSIT ¨AT REGENSBURG

vorgelegt von

Benedict Nchang Anchang

aus

Yaounde, Kamerun im Jahr

2011

(2)
(3)

Modeling molecular signaling and gene expression using Dynamic

Nested Effects Models

Benedict Nchang Anchang

Department of Statistical Bioinformatics University of Regensburg

A thesis submitted for the degree of

Doctor of Natural Sciences (Dr.rer.nat.) in the Faculty of Biology and Preclinical Medicine.

October 2011

(4)

1. Supervisor: Prof. Dr. Rainer Spang

2. Supervisor: Prof. Dr. Wolfram Gronwald

Date of application for admission: 02-11-2011

Day of the defense: 09-12-2011

Signature from Author:

Signature from head of PhD committee:

(5)

Abstract

Cellular decision making in differentiation, proliferation or apoptosis is me- diated by molecular signaling processes, which control the regulation and expression of genes. Vice versa, the expression of genes can trigger the activity of signaling pathways. I summarize methodology by Markowetz et al. known as the Nested Effects Models (NEMs) to reconstruct static non-transcriptional networks using subset relationships from perturbation data and bring out its limitation to model slow-going biological processes like cell differentiation. I introduce and describe new statistical methodolo- gies called Dynamic Nested Effects Models (DNEMs) and Cyclic Dynamic Nested Effects Models (CDNEMs) for analyzing the temporal interplay of cell signaling and gene expression. DNEMs and CDNEMs are Bayesian models of signal propagation in a network. They decompose observed time delays of multiple step signaling processes into single steps. Time delays are assumed to be exponentially distributed. Rate constants of signal propaga- tion are model parameters, whose joint posterior distribution is assessedvia Gibbs sampling. They hold information on the interplay of different forms of biological signal propagation: Molecular signaling in the cytoplasm acts at high rates, direct signal propagationvia transcription and translation at intermediate rates, while secondary effects operate at low rates. I evaluate my methods in simulation experiments and demonstrate their practical ap- plications to embryonic stem cell development in mice. The results from these models explain how stem cells could succeed to carry out differentia- tion to specialized cells of the body such as muscle cells or neurons, a process that goes in one direction. The inferred molecular communication underly- ing such a process proposes how organisms protect themselves against the reversal of cell differentiation and thereby against cancer.

(6)

Zusammenfassung

Die zellul¨are Entscheidungsfindung in der Differenzierung, der Zellprolifera- tion oder der Apoptose wird durch molekulare Signalprozesse, die die Gen- regulation und -expression steuern, vermittelt. Andersherum kann die Gen- expression die Aktivit¨at der Signalverl¨aufe ausl¨osen. Ich fasse die Methodik von Markowetz et al., bekannt als Nested Effects Models (NEMs), zusam- men um statische nicht-transkriptionelle Netzwerke anhand von Teilmen- genbeziehungen aus Perturbationsdaten zu rekonstruieren. Dabei zeige ich die Anwendungsgrenzen dieser Methodik zur Modellierung langsam- laufender biologischer Prozesse wie z.B. Zelldifferenzierung. In dieser Ar- beit f¨uhre ich neue statistische Methoden namens “Dynamic Nested Ef- fects Models” (DNEMs) und “Cyclic Dynamic Nested Effects Models” (CD- NEMs) mitsamt deren Beschreibung f¨ur die Analyse des zeitlichen Zusam- menspiels von zellul¨arer Signal¨ubertragung und Genexpression ein. DNEMs und CDNEMs sind Bayessche Modelle der Signalweiterleitung in einem Net- zwerk. Sie zerlegen beobachtete Zeitverz¨ogerungen der Signalprozesse von mehreren Schritten in einzelne Schritte. Zeitverz¨ogerungen werden als expo- nential verteilt angenommen. Geschwindigkeitskonstanten der Signalweit- erleitung sind Modellparameter, deren gemeinsame posteriori-Verteilung

¨

uber Gibbs-Sampling beurteilt wird. Sie enthalten Informationen ¨uber das Zusammenspiel der verschiedenen Arten von biologischer Signalweit- erleitung: Molekulare Signalweiterleitung ins Zytoplasma findet mit ho- her, direkte Signalweiterleitung ¨uber Transkription und Translation mit mittlerer und sekundaere Effekte mit niedriger Geschwindigkeit statt. Ich beurteile meine Methoden mit numerischen Simulationsexperimenten und zeige ihre praktische Anwendbarkeit anhand von Daten aus murinen embry- onischen Stammzellen. Die Ergebnisse aus diesen Modellen erl¨autern wie es Stammzellen gelingt zu spezialisierten Zellen des K¨orpers wie Muskelzellen oder Nervenzellen zu differenzieren. Der Prozess im wesentlichen in eine Richtung. Die hieraus abgeleiteten molekularen Kommunikationsmechanis- men eines solchen Prozesses stellen dar, wie sich ein Organismus vor der Umkehrung der Zelldifferenzierung und damit vor Krebs sch¨utzen kann.

(7)

To my Father and late Mother

(8)

Acknowledgements

This work was carried out at the Institute of Functional Genomics and Bioinformatics in the department of Statistical Bioinformatics, all part of the University of Regensburg. It was supported by the Bavarian Genome Network BayGene and the ReForM-M program of the Regensburg School of Medicine. I would like to thank all past and present colleagues for their moral, friendly and technical support.

Special thanks goes to my supervisorRainer Spangfor suggesting the topic, his scientific support, and the opportunity to write this thesis under his guidance. Further thanks goes toWolfram Gronwald andPeter Oefner for their supervision and counsel during my PhD work.

The successful completion of this thesis is also the result of scientific advice and recommendations of other collaborators. I am particularly grateful to Achim Tresch,Florian Markowetz, and Marcel O Vlad who is of late.

Final gratitude goes to the moral support of my family and friends especially Kilian Awemo who did not fully understand my research topic yet went through the pains of reading my work and offering a few recommendations.

Publications Part of this thesis has been published in peer-reviewed jour- nals. The main content of Chapter 4 has been published in the Proceedings of National Academy of Science(PNAS) and has been presented at the Inter- national Conference on Intelligent Systems for Molecular Biology(ISMB), Boston, USA, 9-13 July, 2010. ISMB is the leading Bioinformatics con- ference world wide. The program committee has selected the paper for their highlight track which consists of the most significant Bioinformatics contributions of the year.

(9)

Figures This thesis reproduces/adapts figures from other publications.

Figures 1.1 and 1.3 from chapter 1 are adapted with permission from Al- bertset al.(2008) (1). Figure 1.4 has been taken from Wikipedia.org, where it is published under GNU Free Documentation Licence. Figure 2.5 is repro- duced from Vaskeet al.(2009) (2). Figure 1.7 is reproduced from Alon(2007) (3) and finally the adapted Figures 1.2 and 2.1 are from Niwa(2007) (4) and Wagner 2001 respectively.

(10)
(11)

Contents

List of Figures xi

List of Tables xiii

Glossary xv

1 Introduction 1

1.1 Cell decision making in biological processes . . . 1

1.1.1 Key players in the molecular mechanism in early stem cell differ- entiation in mouse . . . 3

1.2 Properties of cellular decision making processes . . . 5

1.3 Cellular decision making processes can be represented by hierarchies . . 6

1.4 Statistical methods for analyzing decision making processes . . . 8

1.4.1 Learning from experimental interventions . . . 8

1.5 Motivation for dynamic modeling of cell decision making processes . . . 9

1.5.1 The Feed-Forward Loop Network motif . . . 9

1.5.2 Feed-Back Loop Network motif . . . 10

1.6 Thesis Organization . . . 12

1.6.1 Nested Effects Models . . . 12

1.6.2 Gibbs sampling and Nested Effects Models . . . 12

1.6.3 Dynamic Nested Effects Models . . . 12

1.6.4 Cyclic Dynamic Nested Effects Models . . . 13

2 Nested Effects Models 15 2.1 Nested Effects Models(NEMs) . . . 16

2.1.1 Marginal likelihood scoring . . . 18

(12)

CONTENTS

2.1.2 Maximum likelihood scoring . . . 20

2.1.3 NEMs as a Bayesian network . . . 21

2.1.4 Factor graph NEMs . . . 23

2.1.4.1 Structure of factor graph NEMs and Network inference 24 2.2 Network learning algorithms in NEMs . . . 27

2.2.1 Pairwise and triple search . . . 27

2.2.2 Module networks . . . 28

2.2.3 Greedy hill search . . . 28

2.3 Overview and differences on Nested effects models . . . 28

3 Gibbs sampling 31 3.1 Background on Gibbs sampling . . . 31

3.2 Gibbs sampling . . . 32

3.2.1 The Markov chain property . . . 33

3.2.2 The Monte Carlo property . . . 34

3.2.3 Variations of Gibbs sampling . . . 34

3.2.3.1 Blocked Gibbs sampler . . . 35

3.2.3.2 Collapsed Gibbs sampler . . . 35

3.2.3.3 Gibbs sampler with ordered overrelaxation . . . 35

3.2.4 Extensions of Gibbs sampling . . . 35

3.2.5 Assessing convergence . . . 36

3.2.6 Monitoring the convergence of each parameter of interest . . . . 36

3.2.6.1 Trace plots . . . 37

3.2.6.2 Autocorrelation . . . 37

3.2.6.3 Gelman and Rubin diagnostic . . . 37

3.2.6.4 Geweke diagnostic . . . 38

3.2.6.5 Raftery and Lewis diagnostic . . . 38

3.2.6.6 Heidelberg and Welch diagnostic . . . 39

3.3 Motivation of Gibbs sampling in modeling the dynamics of a cell decision process . . . 39

(13)

CONTENTS

4 Dynamic Nested Effects Models 41

4.1 Dynamic Nested Effects Models(DNEMs) . . . 41

4.1.1 DNEM algorithm . . . 44

4.1.1.1 Model parameters . . . 44

4.1.1.2 Prior distributions for model parameters . . . 45

4.1.1.3 Probability density of signal propagation between input and output signal . . . 46

4.1.1.4 Probability density of signal times generalized to phase- type distributions . . . 47

4.1.1.5 Probability density function when alternative paths do not share edges . . . 48

4.1.1.6 Probability density function when alternative paths share edges . . . 49

4.1.1.7 Sensitivity analysis between independent and depen- dent signals . . . 50

4.1.1.8 Marginal likelihood for discrete models . . . 50

4.1.1.9 Marginal likelihood for continuous models . . . 52

4.1.1.10 Discrete Gibbs sampling . . . 52

4.1.1.11 Inference of E-gene positions . . . 53

4.1.1.12 Inference of signal flow . . . 54

4.1.1.13 Model comparisons using DNEMs . . . 55

4.1.2 Speed up by stochasticity effects in signaling networks . . . 56

4.2 Simulation results . . . 58

4.2.1 Data generation . . . 58

4.2.2 A sparse network . . . 58

4.2.3 Dense networks and detection of shortcut pathways . . . 61

4.3 Application to cell differentiation in embryonic stem cells . . . 61

4.3.1 Data preprocessing . . . 62

4.3.2 Binary data . . . 62

4.3.3 Time series analysis . . . 63

4.3.4 Stability analysis . . . 63

4.3.5 Stability of network topologies in the static NEM analysis . . . . 64 4.3.6 Stability of feed-forward loop detection in the DNEM analysis . 65

(14)

CONTENTS

4.3.7 Decision betweenOct4 andTcl1 direction in network . . . 65

4.3.8 Convergence and Mixing analysis of the Gibbs sampler . . . 67

4.3.9 Inference of signaling in Network . . . 67

5 Cyclic Dynamic Nested Effects Models(CDNEMs) 71 5.1 Model parameterization of CDNEMs . . . 71

5.2 Probability density of signal propagation in a directed cyclic graph . . . 72

5.2.1 Algorithm to enumerate all paths in a directed cyclic graph be- tween two nodes . . . 72

5.2.2 Probability density of signal propagation in CDNEMs . . . 73

5.3 Simulation results . . . 75

5.3.1 Data generation . . . 75

5.3.2 Results . . . 75

5.3.2.1 Fully connected network with equal time delays . . . . 77

5.3.2.2 Cyclic dense network with varying time delays . . . 77

5.3.2.3 Directed acyclic networks . . . 79

5.3.2.4 Network with only one cycle . . . 79

5.4 Application of CDNEMs to cell differentiation in embryonic stem cells . 79 5.4.1 Inference of signaling in Network during early ESC differentiation 80 6 Impact of Dynamic Nested Effects Models 85 6.1 Fast Cyclic Dynamic Nested Effects Models(FCDNEMs) . . . 85

6.1.1 Model parameters for FCDNEMs . . . 86

6.1.2 Marginal likelihood for discrete model . . . 86

6.1.3 Using Priors for network structures and time delays . . . 88

6.1.4 Network Learning for FCDNEMs . . . 89

6.2 Application of FCDNEMs to cell differentiation in embryonic stem cells 90 7 Summary and Outlook 93 7.1 Conclusions . . . 93

7.2 Future directions . . . 94

7.2.1 Combinatorial perturbations . . . 95

7.2.2 NEMs and drug interventions . . . 95

7.3 Food for thought . . . 96

(15)

CONTENTS

8 Appendix 97

8.1 Derivation of the Probability density of signal propagation along a linear path . . . 97

References 99

(16)

CONTENTS

(17)

List of Figures

1.1 Cell decision making processes . . . 2

1.2 Pluripotent lineages in the mouse embryo with key Transcription factors 5 1.3 Properties of cell decision making . . . 6

1.4 Properties of cell decision making . . . 7

1.5 cellular decision processes can be modeled as hierarchies . . . 7

1.6 Transitive edge and Feed-Forward Loop . . . 10

1.7 Feed-Forward Loops(FFLs) . . . 11

2.1 Cellular networks underlying observable phenotypes . . . 16

2.2 Cellular pathways can be reconstructed from the nested structure of downstream effects . . . 17

2.3 Bayesian Nested effects models . . . 22

2.4 Bayesian network next to corresponding factor graph . . . 25

2.5 Structure of factor graph for network inference in Factor graph NEMs . 26 2.6 A guide to the literature on NEMs . . . 30

4.1 Time delays in a given network . . . 42

4.2 Idea of DNEMs in an elementary example . . . 43

4.3 Parameterization of DNEM. . . 45

4.4 Graph of 5 S-genes with alternative paths sharing an edge. . . 49

4.5 QQ-plots comparing distributions of signal times between two nodesS1 and S5 in the graph in Figure 4.4 under dependent and independent assumptions. . . 51

4.6 Updating E-gene positions inside Gibbs. . . 54

4.7 Speed up by stochasticity effect of DNEMs . . . 57

(18)

LIST OF FIGURES

4.8 The transitive network used for simulation . . . 59

4.9 Heat map of the posterior distribution of average time delays . . . 60

4.10 Stem cell data analysis . . . 64

4.11 Stability Analysis . . . 66

4.12 Diagnostic Plots for the Gibbs sampler . . . 68

4.13 DNEM inference on signal propagation . . . 69

5.1 A directed cyclic graph . . . 72

5.2 A directed cyclic graph with parameters . . . 76

5.3 Cyclic DNEM models with 3 nodes and varying rates . . . 77

5.4 Heat map of the posterior distribution of CDNEM model . . . 78

5.5 Fully connected directed input graph with 6 key TFs . . . 80

5.6 Heatmap of posterior distribution and output graph of signal flow under non-stochastic assumptions . . . 81

5.7 Heatmap of posterior distribution with output graph of signal flow under stochastic signaling . . . 82

6.1 Standard NEM with 3 nodes . . . 87

6.2 Inferred network for murine stem cell development . . . 91

(19)

List of Tables

2.1 The distribution of binary data . . . 18 5.1 Algorithm to enumerate all paths between two nodes in a graph . . . 73 5.2 Simulated time delays for edges in cyclic graph . . . 76

(20)

GLOSSARY

(21)

Glossary

α Probability to observe a false positive effect β Probability to observe a false negative effect κ Discretization threshold parameter

Ei The set of E-genes attached to the same S-geneSi

Θ Parameter for E-gene positions

θk = i Position parameter for E-geneEk linked toSi

D Perturbation data matrix

Dikls Observed E-gene expression level Ek after perturbation Si for replicate experiment l in time pointts. Sometimes also represented aseikls

E Effect genes known as E-genes. Also known as observablesO H0 Set of Hidden nodes in a graph

K Set of rate parameters between S-S-genes and S-E-genes S Signaling genes known as S-genes

BIC Bayesian information criterion; a criterion for model selection among a class of parametric models with different numbers of parameters

CDNEMs Cyclic Dynamic Nested Effects Models DAG Directed acyclic graph

DCG Directed cyclic graph

DIC Deviance Information Criteria ; a criterion for model selection particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation

DNEMs Dynamic Nested Effects Models

EM Expectation-Maximization; an algorithm for solving incomplete data problems ESC Embryonic stem cells

FBLs Feed-Back Loops

(22)

GLOSSARY

FCDNEMs Fast Cyclic Dynamic Nested Effects Models FFLs Feed-Forward Loops

HMG High-Mobility Group; a group of chromosomal proteins that help with transcription, repli- cation, recombination, and DNA repair

ICM Inner cell mass

LIF Leukemia Inhibitory Factor; a cytokine that affects cell growth and development

MAP Maximum a posteriori probability. The MAP estimate is a mode of the posterior distribu- tion

MAP kinase Mitogen-activated protein kinase; a serine/threonine-specific protein kinase that re- sponds to extracellular stimuli (mitogens, osmotic stress, heat shock and proinflammatory cytokines) and regulate various cellular activities, such as gene expression, mitosis, differ- entiation, proliferation, and cell survival/apoptosis

MCMC Markov chain Monte Carlo; a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution

NEMs Nested Effects Models; a class of models for the analysis of non-transcriptional signalling networks. NEMs infer the graph of upstream/downstream relations for a set of signalling genes from perturbation effects.

PH The phase-type distribution of the time until an absortion state in a Markov chain

RA Retinoic Acid; is a metabolite of vitamin A (retinol) that mediates the functions of vitamin A required for growth and development

RNA Ribonucleic acid RNAi RNA interference

TE Trophectoderm

TFs Transcription factors

(23)

1

Introduction

Cellular decision making in biological processes such as differentiation, proliferation or cell death is mediated by molecular signaling processes, which control the regulation and expression of genes. Changes in gene expression can activate further signaling pro- cesses, leading to secondary effects, which themselves give rise to tertiary effects and so on. The result is an intricate interplay of cell signaling and gene regulation. While pro- tein modification in the cytoplasm can propagate signals in seconds, transcription and translation processes last hours, and secondary effects often become visible only after days. I develop statistical methodology that models the processes of cellular decision making using data, which reports downstream effects of molecular perturbations. In addition, I discuss which role such a model can play in biology and biomedicine. The first chapter gives some background on cell decision making processes and introduces key molecular players involved in stem cell differentiation. I focus on those properties which make it possible to systematically analyze and model such a process using high throughput perturbation data. I go further and discuss the methodology available for analyzing perturbation data with particular emphasis on limitation to time series data and provide motivation for taking the time into account.

.

1.1 Cell decision making in biological processes

Cellular decision making is involved in several biological processes such as cell division, cell proliferation, apoptosis or cell differentiation (Figure 1.1). Each of these processes

(24)

1. INTRODUCTION

Figure 1.1: Cell decision making processes - Cell decision making is involved in cell growth, cell proliferation, apoptosis, and differentiation. Mitogens and growth factors induce the process of cell growth and cell division respectively. Both actions usually occur simultaneously. Death receptors trigger extrinsic apoptosis. The entire programming of a single death cell involves cell shrinkage, cell membrane blebbing, nuclear collapse, and apoptotic body formation. Stem cell development is mediated by both self renewal and differentiation. Nanog, Sox2, and Oct4 play a key role in driving stem cells from a self renewal state into early differentiation. The figure is a modification from figures in (1).

(25)

1.1 Cell decision making in biological processes

is tightly regulated and controlled both by intracellular programs and extracellular sig- naling molecules whose mechanisms are still not clear. For example, the process of cell growth and cell division is stimulated by chemical substances like mitogens and growth factors respectively (1). Mitogens interact with cell surface receptors to trig- ger multiple intracellular signaling pathways during cell division. Although extensive research has been carried out in this area (1), its still not clear how a proliferating cell coordinates its growth with cell division so as to maintain its appropriate size. Alter- natively, a process such as apoptosis which is useful for the elimination of unwanted cells in the body is triggered by death receptors on their surface. For example, Fas ligand, a transmembrane protein on the surface of a killer lymphocyte binds to the Fas receptor on the cell surface to trigger the extrinsic apoptosis pathway (5). Apoptosis can also be triggered from within the cell (6) and in some cases a combination of both external and internal signaling are involved to amplify the process (1). The mechanism underlying such a coordination is still not fully understood. In addition during the process of cell differentiation, stem cells need to decide when and how to move from the state of self renewal into differentiation. Such a complex process is governed by transcription networks known as developmental transcription networks(7), which need to make irreversible decisions on a slow time scale of one or more cell generations.

1.1.1 Key players in the molecular mechanism in early stem cell dif- ferentiation in mouse

The zygote can give rise to a complex organism through cell division, growth(proliferation) and cell specialization(differentiation). The first differentiation event, is the segrega- tion of the trophectoderm (TE) and the inner cell mass (ICM) in the blastocyst. See the adapted Figure 1.2 from Niwa(2007) (4). The zygote is totipotent, developing into not only the fetus but also the placenta. The totipotency is maintained in cells known as blastomeres of the two-cell stage embryo. After mechanical separation of the blas- tomeres of the two-cell stage embryo, each blastomere is able to give rise to an adult organism, for example a mouse(8). These cells which have the ability to self-renew as well as differentiate into different cell types of the vertebrate embryo leading to the for- mation of an entire organism are known as embryonic stem cells(ESC), and the cells of the embryonic inner cell mass from which mouse ESC are derived are called pluripotent

(26)

1. INTRODUCTION

because of their ability to give rise to all of the cells of an embryo and adult(9). Pluripo- tency is maintained during ESC self-renewal through the prevention of differentiation and the promotion of proliferation. In fact, ESC can self-renew continuously for years if they are cultured under conditions that prevent their differentiation; for example, in the presence of leukemia inhibitory factor (LIF), a growth factor that is necessary for main- taining mouse ESC in a proliferative, undifferentiated state (10). Studies over the past few years have revealed the role that transcription factor networks play in the mainte- nance of ESC pluripotency (11, 12, 13, 14, 15, 16, 17, 18). From these studies we have transcription factors(TFs) that are pivotal for maintaining ESC in their self-renewal state when overexpressed such as Nanog, a homeobox transcription factor expressed throughout the pluripotent cells of the ICM with the particular goal to prevent en- doderm differentiation (13, 19); Oct3/4, also called Pou5f1, an important regulator of pluripotency that acts as a gatekeeper to prevent ESC differentiation(16); and Sox2, a member of the Sox (SRY-related HMG box) family of proteins that bind to DNA through the 79-amino acid HMG(high mobility group) domain. Sox2 is co-expressed withOct4 in the ICM (20). These TFs form a core transcriptional network associated with pluripotency in ESC (15, 18, 21, 22). Alternatively, the differentiation of mouse ESC can be induced by the expression of certain transcription factors. For example, the expression of the transcription factorGata6 in ESC results in their differentiation into primitive endoderm(12). Likewise, the expression of the caudal-type homeobox transcription factor 2 (Cdx2) induces ESC to differentiate into trophectoderm (23).

Model relative to roles played by the above transcription factors during early embry- onic development is shown in Figure 1.2. We have two types of transcription factors which play a role in ESC. (i) TFs with target genes that are expressed in undifferen- tiated ESC. (ii) TFs with target genes that are not expressed in undifferentiated cells but induced in differentiated ESC. The overexpression of type (i) TFs will maintain ESC in their self renewal state while overexpression of type (ii) will likely trigger the differentiation of ESC. These transcription factors function in combination with other processes and on the accessibility of their target genes, which are made more or less accessible by the modification of their DNA, histones, or chromatin structure. The chal- lenge would be to understand how these TFs interact dynamically with each other to regulate the processes between self-renewal and differentiation. More so, understanding

(27)

1.2 Properties of cellular decision making processes

the mechanisms underlying the processes of pluripotency, self-renewal and subsequent differentiation in embryonic stem cells is central to utilizing them therapeutically.

Figure 1.2: Pluripotent lineages in the mouse embryo with key Transcription factors- Model relative to roles played byOct4,Nanog,Sox2,foxD3,Cdx2,Gata6 during early mouse preimplantation development. Pluripotent stem cells (green) are imaged in a morula as the inner cells, which then form the inner cell mass (ICM) of the blastocyst.

Oct3/4 is essential in the first embryonic lineage specification. Nanogfunction is to prevent endoderm differentiation of ICM.Sox2 and FoxD3 are essential in the maintainance of a pluripotent epiblast. The figure is adapted from Niwa(2007) (4).

1.2 Properties of cellular decision making processes

All of the processes mentioned in the last section take time to completion. In early murine embryonic development for example, it takes about 1 week for stem cells to move from a pluripotent state to a differentiated state. Early stage differentiation actually starts after about 2 days (18). More so, the entire cellular decision process proceed in multiple steps controlled by different signals. For example the different stages on the way from a single stem cell to a specialized cell are controlled by different signals as shown in (Figure 1.3) modified from Alberts et al.(2008)(1). Cellular decision processes are controlled by complex signaling networks. For example the Wnt signaling pathway which plays a key role in the development of tissues and organs in multicellular organism involves a complex signaling mechanism taking place at the cell membrane, cytoplasm and inside the nucleus(Figure 1.4). Once the pathway is active traces in gene expression profiles can be observed reflecting the hierarchy of events along the pathway.

My goal is to model the temporal interplay of signaling and expression in such complex

(28)

1. INTRODUCTION

biological processes involving several signaling pathways and spanning multiple rounds of cell signaling, gene regulation, and gene expression.

Figure 1.3: Properties of cell decision making- Cell decision making takes time and operate in multiple steps. Stem cell differentiation to a specialized cell fate involves a series of decision-making steps. Each decision making step is triggered by external or internal signals. The figure is a modification from (1).

1.3 Cellular decision making processes can be represented by hierarchies

Cellular decision processes can be modelled as hierarchies. A given hierarchy of signal- ing steps can be represented by a graph or network of upstream / downstream relations where nodes can be steps or controlling signaling genes and edges indicate upstream / downstream relations. Based on such a relationship we would expect a transitive closed graph. If S1 is upstream of S2 and S2 is upstream of S3, then by definition S1 is upstream of S3. Figure 1.5 shows a decision process comprising of 5 steps S1-S5. Steps S2-S5 can only occur after step S1 has occurred. For example, the MAP kinase cascade is activated byRas which further leads to the activation of other important regulatory proteins such as Myc. So Ras acts upstream of Myc and this property is represented by a directed edge from Ras toMyc. The biological meaning of a network component depends on what kind of data we analyze. We mostly speak of network components as genes. However statistical methods available for gene regulatory networks can also be generalized for protein data (24, 25, 26).

(29)

1.3 Cellular decision making processes can be represented by hierarchies

Figure 1.4: Properties of cell decision making- Wnt signaling involves a very com- plex signaling mechanism at the cell membrane, cytoplasm and nucleus. Wnt proteins bind to receptors on the cell surface to induce several intracellular signal transduction pathways.

Through several cytoplasmic relay molecules, the signal is transduced toβ-catenin, which enters the nucleus and forms a complex withTCF protein to activate transcription of Wnt target genes. This Figure has been taken from Wikipedia.org.

Figure 1.5: cellular decision processes can be modeled as hierarchies- The figure shows a cellular decision process comprising of 5 steps S1-S5. Steps S2-S5 can only occur after step S1 has occurred.

(30)

1. INTRODUCTION

1.4 Statistical methods for analyzing decision making pro- cesses

With the advent of high throughput genomic technologies such as microarrays that can capture the expression of thousands of genes and the availability of powerful com- putational approaches, the practice of studying cellular processes at the systems level has greatly improved in the last decades. Numerous statistical methods have been suggested for the analysis and reconstruction of regulatory networks. Among the most widely used are relevance networks (27), graphical Gaussian models (28, 29), meth- ods from information theory (30), Bayesian networks (31) including dynamic Bayesian networks (32, 33), Boolean networks (34, 35, 36) and methods based on ordinary differ- ential equations (37, 38, 39). All these methods employ pure observational data, where the network was not perturbed experimentally. Relevance networks, graphical Gaus- sian models, Information theory approaches, Boolean networks, Bayesian and dynamic Bayesian networks are probabilistic in design motivated by the fact that signal trans- duction, gene expression and its regulation are stochastic processes (40, 41, 41). They mainly account for transcriptional effects in the cell. Apart from dynamic Bayesian and Boolean networks, they infer static transcriptional regulatory networks. Ordinary differential equations(ODEs) on the other hand are deterministic making strong as- sumptions on the network structure and interactions. The famous Michealis-Menten equation in the context of enzyme kinetics is an example (42). Similar to dynamic Bayesian networks ODEs allow for changes over time. A comprehensive overview of these methods in relation to transcriptional regulatory networks can be found in (7, 43).

1.4.1 Learning from experimental interventions

Simulation (44, 45) and experimental studies (24, 44) show that perturbation experi- ments greatly improve performance in network reconstruction. Rung et al. (46) built a directed disruption graph by connecting two genes where perturbation of the first gene resulted in expression changes in the other gene. However, disruption networks do not separate direct from indirect effects. Wagner (47) uses transitive reductions to find parsimonious subgraphs explaining a disruption network. The framework of Bayesian networks was also extended to account for perturbation data (48, 49). Yeang et al. search for topologies that are consistent with observed downstream effects of

(31)

1.5 Motivation for dynamic modeling of cell decision making processes

interventions (50). Although this algorithm is not confined to the transcriptional level of regulation, it requires that most signaling genes show effects when perturbing others.

The methods described exhaustively in this thesis build on Nested Effects Models (NEMs), which were first proposed by Markowetz et al. (49) for the analysis of non- transcriptional signaling networks. They differ from other statistical approaches like Bayesian networks or Boolean Networks by encoding subset relations instead of partial correlations. NEMs infer the graph of upstream/downstream relations for a set of signaling genes from perturbation effects. Since non-transcriptional signaling is too fast to be analyzed by delays of downstream effects, time series are not used. This changes when analyzing slow-going biological processes like cell differentiation.

1.5 Motivation for dynamic modeling of cell decision mak- ing processes

Note that there is a difference between the upstream/downstream relations of a network and the actual signal flow: If geneS1 is upstream ofS2 and geneS2 is upstream ofS3, consistency requires thatS1 is also upstream of S3. While the consistency argument is valid for upstream/downstream relations, it does not hold for signal flows. Assume we have a linear cascade of signaling genes where the signal flows fromS1via S2 toS3(See Figure 1.6). Whether there is an alternative signal flow from S1 directly to S3 does not follow from upstream/downstream relations. However, evidence of the alternative signal flow comes from time delays of downstream effects. Assume that the time spent to propagate a downstream effect from S1 to S2 plus the time spent to propagate it from S2 to S3 is larger than the time to propagate the effect from S1 to S3 directly, then there must exist an alternative short cut pathway fromS1 toS3. Thus, temporal expression measurements yield additional insight into the cellular signal flow.

1.5.1 The Feed-Forward Loop Network motif

Signaling networks that regulate the responses of living cells were recently found to obey recurring circuit modules that carry out key functions (3, 7, 51). They contain several biochemical wiring patterns, termed network motifs, which recur throughout the network (52). One of these motifs is the Feed-Forward Loop (FFL) (7). The FFL, a three-gene pattern, is composed of two input transcription factors(regulators), one

(32)

1. INTRODUCTION

Figure 1.6: Transitive edge and Feed-Forward Loop - Transitive edges represent feed forward loops. The mode of interaction in the FFL can be activation or repression with an AND or OR input function atS3. S1 is activated by input signal(s).

of which regulates the other, both jointly regulating a target gene. The FFL has eight possible structural types (Figure 1.7) , because each of the three interactions in the FFL can be activating or repressing. Uri Alon and colleagues (3) found out that four of the FFL types, termed incoherent FFLs(Figure 1.7) act as sign-sensitive accelerators:

they speed up the response time of the target gene expression following stimulus steps in one direction (e.g., off to on) but not in the other direction (on to off). The other four types, coherent FFLs, act as sign-sensitive delays. Thus they carry out specific dynamical functions. In addition each of the coherent and incoherent types of FFLs can have an AND or OR input function at the promoter of the target gene depending upon whether both or only one of the two regulators are needed to regulate the target gene (Figure 1.6). The transitive triple representation in Figure 1.6 shows that the transitive edge(S1S3) combine with the indirect edges(S1S2 and S2S3) to form a coherent feed forward loop(FFL). Nature uses these FFLs in several organisms to cause time delays so that the cell can function properly by filtering out random fluctuations. We may be able to reconstruct FFLs in a network if we can measure or estimate time delays in the signaling network.

1.5.2 Feed-Back Loop Network motif

Biological networks are all known to contain Feed-Back Loops(FBLs) and cycles (1, 7). For example in regulatory networks, TFs are known to be both negatively and positively auto-regulated. Negative auto-regulation occurs when a TF represses its own transcription. Such a simple circuit has been used to show the speed of response time

(33)

1.5 Motivation for dynamic modeling of cell decision making processes

Figure 1.7: Feed-Forward Loops(FFLs)- The eight types of feedforward loops (FFLs) are shown. Arrows denote activation andasymbols denote repression. In coherent FFLs, the sign of the direct path from input factor X to output Z is the same as the overall sign of the indirect path through factor Y. Incoherent FFLs have opposite signs for the two paths. This figure is reproduced from (3).

(34)

1. INTRODUCTION

and reduction of the cell-cell variation in protein levels (53, 54, 55). Similarly, positive auto-regulation occurs when a TF activates its own transcription by up-regulating itself.

Positive auto-regulatory circuits have been shown to have opposite effects as to those of negative auto-regulatory feedback loops.(56). Modeling the cell cycle or autoregulation with an acyclic model (31) may not be the best idea due to loss of useful information.

Fortunately, the cycle problem can be solved by assuming that the system evolves over time.

1.6 Thesis Organization

There are two main goals involve in analyzing time series RNA interference (RNAi) data for reverse engineering purposes. First, how to infer signaling pathways if direct observations of gene silencing effects on other network components may not be visible in the data. Second, to infer the signaling dynamics between pathway components from the data. This thesis summarizes methodology to address the first question and proposes a novel methodology to answer the second question. It is organized mainly as follows.

1.6.1 Nested Effects Models

Chapter 2 gives an overview of Nested effects models and their implementations. The theory of NEMs has been applied and extended in several studies. I give an in depth overview of all NEMs from literature in this chapter. Chapter 2 also works out the similarities, differences and limitations of all the methods.

1.6.2 Gibbs sampling and Nested Effects Models

In chapter 3, the concept of Gibbs sampling is reviewed. I discuss how such an estima- tion algorithm is used in several bioinformatics applications with particular enfancy on how it fits in within the context of Nested Effects Models.

1.6.3 Dynamic Nested Effects Models

In chapter 4, I develop a novel theory of learning from time series gene perturbations in the framework of Nested Effects Models (NEMs), called Dynamic Nested Effects Models(DNEMs). Chapter 4 goes further to demonstrate how DNEMs can be used

(35)

1.6 Thesis Organization

to estimate time delays in a given network as well as make inferences on signal flow in a given network. The practical use is exemplified in an application to molecular mechanism in early stem cell differentiation. Finally a section on the speed up by stochasticity effects in dynamic networks is introduced as a by product of DNEMs.

1.6.4 Cyclic Dynamic Nested Effects Models

An extension of DNEM to handle cycles is discussed in chapter 5 with the help of simulations and an application to stem cell differentiation. In chapter 6 I discuss the impact of DNEMs by making a comparison to another complementary and faster modeling approach which also has the ability to unravel the regulatory networks across time.

(36)

1. INTRODUCTION

(37)

2

Nested Effects Models

In modern biology, the key to inferring gene function and regulatory pathways are ex- periments with interventions into the normal functioning in a cell. A common technique is to perturb a gene of interest experimentally and to study which other genes activi- ties are affected using gene expression monitoring. However, one of the key problems of analyzing perturbation screens is that the observed phenotypes occur downstream of the perturbed pathway and may not be able to show the direct influence of one par- ticular pathway component on another. This is illustrated here by the cartoon path- way adapted from Wagner 2001(Figure 2.1) showing a cascade of five genes/proteins (S1-S5). Proteins S1-S3 form a kinase cascade, S4 is a transcription factor acting on S5. Up-regulation of S1 starts information flow in the cascade and results in S5 be- ing turned on. In gene expression data this is visible as a correlation between S1 and S5(represented as an undirected edge in the model). Experimentally perturbing a gene, say S3, removes the corresponding protein from the cascade, breaks the information flow, and results in an expression change at S5 (represented as an arrow in the model).

However, the different phosphorylation and activation states of proteins S2-S4 are not visible as changes in gene expression. Thus, because of the pathway mostly acting on the protein level most parts of the cascade (dashed arrows in the model) can not be inferred from gene expression data directly. One class of models that was developed to handle indirect information and high-dimensional phenotypes are Nested Effects Models (49).

(38)

2. NESTED EFFECTS MODELS

Figure 2.1: Cellular networks underlying observable phenotypes- Global molec- ular phenotypes like gene expression allow a view inside the cell but also have limitations.

In gene expression data a correlation between proteins S1 and S5(represented as an undi- rected edge in the model) is visible. In addition, experimentally perturbing a gene, say S3, breaks the information flow, and results in an expression change at S5 (represented as an arrow in the model). However, the different phosphorylation and activation states of proteins S2-S4 are not visible as changes in gene expression. If the pathway is mostly acting on the protein level most parts of the cascade (dashed arrows in the model) can not be inferred from gene expression data directly. The figure is adapted from Wagner 2001.

2.1 Nested Effects Models(NEMs)

Following Markowetz et al. (49), we call the perturbed genes S-genes for signaling genes and denote them by S = S1, . . . , Sn. The genes that change expression after perturbation are called E-genes and we denote them by E =E1, . . . , EN. We further denote the set of E-genes displaying expression changes in response to the perturbation of Si by Di. In a nutshell: NEMs infer thatS1 acts upstream of S2:

S1 −→S2 if D2⊂D1

All downstream effects of a perturbation in S2 can also be triggered by perturbingS1 (Figure 2.2). This suggests that the perturbation ofS1 causes a perturbation ofS2 and acts upstream ofS2. In a general setting with more than two S-genes, we call the subset of S-genes, which are in an active state when S-geneSj is silenced, the influence region of Sj. The set of all influence regions is called a silencing scheme Φ. It summarizes the effects of interventions predicted from the pathway hypothesis. This is mathemat- ically represented as a transitively closed graph. The graph of upstream/downstream relations is estimated from the nested structure of downstream effects. Due to noise in

(39)

2.1 Nested Effects Models(NEMs)

the data, we do not expect strict super-/subset relations. Instead, NEMs recover rough nesting. Following Markowetz et al. (49) we assume only directed acyclic graphs. In

Figure 2.2: Cellular pathways can be reconstructed from the nested structure of downstream effects- If the target genes of S2 are a subset of the target genes of S1 then S1 acts upstream of S2. So in this sense all the target genes of S2 can be triggered by perturbing S1. Information on the target gene expressions can be obtained on a microarray.

the context of NEMs the most often used scoring metric is the posterior probability of a network Φ given dataD,P(Φ|D). According to Bayes rule, the posterior probability can be written as

P(Φ|D) = P(D|Φ)P(Φ)

P(D) , (2.1)

whereP(D) is a constant that does not depend on Φ. Consequently, the (marginal) likelihood P(D|Φ) together with the network prior P(Φ) play the central role in the inference.

In practice, we do not know which target genes or E-genes are being controlled by which S-genes. We first need to estimate the E-gene positions before scoring the graph. Closely following the presentation of Markowetzet al.(2005) (49) we denote the

(40)

2. NESTED EFFECTS MODELS

parameter for the E-gene positions as Θ. If we let Θ ={θi}mi=1 withθi ∈ {1, ..., n}and θi=jifEi is attached to Sj. In a perturbation experiment we predict effects at all E- genes, which are attached to an S-gene in the influence region. Expected effects can be compared with observed effects in the data to choose the topology, which fits the data best. Owing to measurement noise we cannot expect to find an expected topology to be in complete agreement with all observations. We allow deviation from predicted effects by introducing error probabilities α and β for false positive and negative situations, respectively. We model the expression levels of E-genes on the various perturbation experiments kas binary random variables Eik . The distribution ofEik is determined by the silencing scheme Φ and the error probabilities α and β. For all E-genes and targets of intervention, the conditional probability of E-gene state eik given silencing scheme Φ can then be written in tabular form as:

Table 2.1: The distribution of binary effect data- The distribution ofEikis deter- mined by the silencing scheme Φ and the error probabilitiesαandβ.

eik= 1 ei = 0

P(eik|Φ, θi =j) =n α 1−α if Φ predictsno effect 1−β β if Φ predictseffect

This means that if Ei is not in the influence region of the S-gene silenced in ex- periment k, the probability of observing Eik=1 is α(probability of false alarm); the probability to miss an effect and observe Eik = 0 even though Ei lies in the influence region isβ (probability of missed signal). In the following, we summarize NEMs based on their statistical approach for dealing with Θ in scoring a given network.

2.1.1 Marginal likelihood scoring

In the Bayesian framework of Markowetz et al.(2005) (49), networks are scored by marginal posterior probabilities which depend on the marginal likelihood of the param- eter space. The marginal likelihood involves marginalization over the whole parameter space Θ .

P(D|Φ) = Z

Θ

P(D|Φ,Θ)P(Θ|Φ)dΘ. (2.2)

The marginal likelihoodP(D|Φ,Θ) is based on the following assumptions given in (49):

(41)

2.1 Nested Effects Models(NEMs)

1. Given silencing scheme Φ, and fixed positions of E-genes Θ, the observations in Dare sampled independently and distributed identically:

P(D|Φ,Θ) =

m

Y

i=1

P(Di|Φ, θi) =

m

Y

i=1 l

Y

k=1

p(eik|Φ, θi), (2.3)

whereDi is the ith row in data matrix D.

2. Parameter independence. The position of one E-gene is independent of the posi- tions of all the other E-genes at any given time:

P(Θ|Φ) =

m

Y

i=1

P(θi|Φ) (2.4)

3. Uniform prior. The prior probability to attach an E-gene is uniform over all S-genes:

P(θi=j|Φ) = 1

n for all i and j (2.5)

However prior existing biological knowledge about regulatory modules can be incorporated (57, 58).

With the above assumptions the marginal likelihood can be calculated thus. The num- bers above the equality sign indicate which assumption was used in each step.

P(D|Φ) = Z

Θ

P(D|Φ,Θ)P(Θ|Φ)dΘ

[1,2]

=

m

Y

i=1

Z

θi

P(Di|Φ, θi)P(θi|Φ)dθi

[3]= 1 nm

m

Y

i=1 n

X

j=1

P(Di|Φ, θi =j)

[1]= 1 nm

m

Y

i=1 n

X

j=1 l

Y

k=1

pα,β(eik|Φ, θi =j) (2.6)

Note here that we can sum over all E-gene positions since we have a finite number of S-genes.

(42)

2. NESTED EFFECTS MODELS

2.1.2 Maximum likelihood scoring

It is also possible to maximize the joint posterior distribution of Φ and Θ. We can represent both the silencing scheme and the E-gene positions as adjacency matrices whose entries represent edges between S-S-genes and S-E-genes respectively. For the purpose of consistency we denote both matrices as Φ and Θ. Tresch (59) defined a Nested Effects Model (NEM)F as a product of Φ and Θ:

F = ΦΘ (2.7)

Using the same formulation as in the previous section and assuming data independence, the likelihood of the modelF is represented as P(D|F) and factors out as follows:

P(D|F) = P(D|ΦΘ)

= Y

(j,i)∈Φ×Θ

P(Dj,i|j=Fji) orlog(P(D|F)) = X

(j,i)∈Φ×Θ

logP(Dj,i|j=Fji) +const, (2.8)

if we assume equal probability for observing a 1 or 0 and (j, i) ∈Φ×Θ with j =Fji

interpreted as S-gene j is linked to E-gene i . The quantity log(P(D|F)) can also be expressed as a likelihood ratio for convenience using matrix algebra as follows:

log(P(D|F))−log(P(D|N)) =tr(F R), (2.9) whereR = logP(DP(Dji|eij=1)

ji|eij=0), “tr” denoting the trace function of a quadratic matrix and N the NULL matrix corresponding to the model predicting no effects at all. Hence the marginal likelihood of the data becomes

log(P(D|Φ,Θ)) =tr(ΦΘR) +const (2.10) This form provides a flexible way of handling input data binary values, p-values, or any other arbitrary statistic as long as it can be converted to a likelihood ratio. The aim of the NEMs is to find the optimal silencing scheme ˆΦ. The posterior model (Φ,Θ) written in log form is

log(P(Φ,Θ|D)) = log(P(D|Φ,Θ)) + log(P(Φ)) + log(P(Θ)) +const, (2.11)

(43)

2.1 Nested Effects Models(NEMs)

and the task is to find the maximum aposteriori(MAP) estimate for log(P(Φ,Θ|D)), ( ˆΦ,Θ) = argmaxˆ φ,Θ(log(P(D|Φ,Θ)) + log(P(Φ)) + log(P(Θ))), (2.12) So in order to maximize the graph we need to maximize the E-gene positions and vice versa.

2.1.3 NEMs as a Bayesian network

Zelleret al. (60) introduced a Bayesian network view on NEMs. A Bayesian network is defined by a graphical model structure Φ and a family of conditional distributions F0 and their parametersΘ(61, 62). The model structure Φ consists of a set of nodesV and a set of directed edgesE connecting the nodes such that the resulting directed graph is acyclic (DAG). The nodes represent random variables in the network whereas the edges encode a set of conditional dependencies. In the parametric setting, the family of conditional distributionsF0 is assumed to be known and hence is fully described by its parameters. Let X =X1, X2, ..., Xn denote a set of random variables that correspond to the nodes V in the network. Lower-case letters x1, x2, ..., xn are used to denote the value of the corresponding variables. Let Pa(Xi) denote the random variables corresponding to the parents of nodeiin the DAG. Then, the network structure Φ and the parameters Θ of the conditional distributions together define a joint distribution over the random variablesX as

P(xi, x2, ..., xn) =

n

Y

i=1

P(xi|pa(Xi)) (2.13) In the context of NEMs following the presentation in (59), we have to model a deter- ministic signaling hierarchy, in which some components (E) can be probed by measure- ments, and some components (S) are perturbed in order to measure the reaction of the system as a whole. LetH0 be the nodes of an acyclic graph representing a combination of the S-S-genes and S-E-genes connection Figure 2.3. A,B,C represent the S-genes and X1, X2, Y1, Y2, Z1, Z2 represent the effect nodes. We assume H0 as hidden in the sense that no observation will be available forH0. In order to account for the data, we introduce an additional layer of observable variables (observables, O) in the following way: each effect nodee∈E has an edge pointing to a unique observable nodee0 ∈O (Figure 2.3). Hence, O={e0|e∈E}, and we calle0 the observation of e. Similar like

(44)

2. NESTED EFFECTS MODELS

Figure 2.3: Bayesian Nested effects models - Example of a Nested effects model in its Bayesian network formulation. The bold arrows determine the graph Φ, the solid thin arrows encode Θ. Dashed arrows connect the effects to their reporters. This figure is reproduced from (59).

before we let pa(x) be the set of parents of a node x and for notational convenience add a zero node z, p(z = 0) = 1, which has no parents, and which is a parent of all hidden nodes (but not of the observable measurements). For the hidden nodes, define local probabilities corresponding to deterministic relationships as follows :

p(x= 1|pa(x)) = { 1 if any parent is active 0 otherwise,

= max(pa(x)) forx∈H0, (2.14)

All hidden nodes are set to 0 or 1 deterministically, given their parents. The local probabilitiesp(e0|e∈E),e∈E can come from both discrete or continuous distributions (59). From Equation 2.13 the Bayesian network NEM is parameterized by its topology Φ and its local probability distributions, which we assume to be given by a set of local parameters Θ. The ultimate goal is to maximize P(Φ|D). In the presence of prior knowledge and if we assume independent priors for the topology and the local

(45)

2.1 Nested Effects Models(NEMs)

parameters, we can write

P(Φ,Θ|D) = P(D|Φ,Θ)P(Φ)P(Θ) P(D)

∝ P(D|Φ,Θ)P(Φ)P(Θ) (2.15) from which it follows that

P(Φ|D) = Z

P(Φ,Θ|D)dΦ

∝ P(Φ) Z

P(D|Φ,Θ)P(Θ)dΘ (2.16) Its difficult to solve the integral analytically. We resort to a simultaneous maximum a posteriori estimation of Φ and Θ (59). Thus

( ˆΦ,Θ)ˆ = argmaxΦ,ΘP(Φ,Θ|D)

= argmaxΦ(argmaxΘP(D|Φ, θ)P(Θ))P(Φ). (2.17) 2.1.4 Factor graph NEMs

Finally, a signed version of the Nested Effects Model and an associated efficient struc- ture inference method, named Factor Graph-Nested Effects Model(FG-NEM) (2) was developed to distinguish between activating and inhibiting regulation in a pathway.

Recall that the original NEM by Markowetzet al. (49) include two sets of parameters.

The parameter set Φ records all pair-wise interactions among the S-genes and the pa- rameter set Θ describes how each E-gene is attached to the network of S-genes. Φ is a binary matrix with entry φAB set to one if S-gene A acts above S-gene B and zero otherwise. Φ must also be transitively closed. The model by Markowetz et al. (49) does not distinguish between stimulatory and inhibitory interactions. To tackle this drawback, Vaskeet al. (2) suggest a model, in which φAB takes six possible values for each unique unordered S-gene pair A,B also known as interaction modes. The possible values are: 1) A activates B,A →B; 2) A inhibits B,AaB; 3) A is equivalent to B, A=B; 4) A does not interact with B,A6=B; 5) B activates A,B→A; and 6) B inhibits A, B a A. The Factor graph NEMs allow for the reconstruction of a broader set of S-gene interactions from the secondary effects of E-gene expression corresponding to the observed data denoted as D. Similarly like the other NEM approaches discussed so far, a maximum aposteriori is used to identify the Φ that maximizes the posterior

(46)

2. NESTED EFFECTS MODELS

P(Φ|D) represented as :

Φˆ = argmaxΦP(Φ|D)

= argmaxΦX

Θ

X

H

P(Φ,Θ, H|D). (2.18)

where Θ refers to the attachment point of each E-gene into the network and H refers to the hidden E-gene states corresponding to up, down regulations or no change. Applying the same assumptions as in Markowetz et al. (49) we have:

Φˆ = argmaxΦP(Φ)X

Θ

P(Θ|Φ)X

H

P(H|Φ,Θ)P(D|H)

= argmaxΦP(Φ)X

Θ

X

H

P(H|Φ,Θ)P(D|H)

= argmaxΦP(Φ)Y

e∈E

X

Θ

X

H

P(He|Φ, θe)P(De|He) given independence of E-genes, E.

= argmaxΦP(Φ)Y

e∈E

Le(Φ) (2.19)

where De and He are the row vectors of data matrix and hidden states for E-genes respectively andθerecords the attachment of an E-gene to an S-gene andLesummarizes the marginal likelihood of the data restricted only to E-genee under a given model Φ and θe. Note that Le can be reformulated as a product of pair-wise S-gene terms(2).

2.1.4.1 Structure of factor graph NEMs and Network inference

Scoring a given S-gene graph can be achieved based on max-sum message passing in a factor graph (63) which provides an efficient means for estimating highly probable S-gene configurations. The parameters that determine the S-gene interactions, Φ, are explicitly represented as variables in the factor graph. Identifying a high-scoring S- gene network is therefore converted to the task of identifying likely assignments of the Φ variables in the factor graph. A factor graph is a probabilistic graphical model whose likelihood function can be factorized into smaller terms (factors) representing local constraints on a set of random variables. A factor graph can be represented as an undirected, bi-partite graph with two types of nodes: variables and factors. A variable is adjacent to a factor if the variable appears as an argument of the factor. Figure 2.4 shows the factor graph representation of a Bayesian network. Factor graphs represent

(47)

2.1 Nested Effects Models(NEMs)

both the variables as nodes and the factors as nodes, with edges from each factor to the variables in that factor’s domain, resulting in a bipartite graph. Factor graphs general- ize probability mass functions as the joint likelihood function requires no normalization and the factors need not be conditional probabilities. Each factor encodes a local con- straint pertaining to a few variables. In the factor graph NEM a Φ that maximizes

A

B C

D

A

B

C

D

P(A)

P(B|A)

P(C|A)

P(D|B, C)

Figure 2.4: Bayesian network next to corresponding factor graph- A Bayesian network (left) and the corresponding factor graph (right). The decomposition of the joint probability, P(A, B, C, D) = P(D|B, C) P(B|A) P(C|A) P(A) is made explicit in the bi- partite factor graph.

the posterior is found using max-sum message passing using all terms from Equation 2.19 in log space. The complete model of a factor graph NEM by Vaske et al.(2009) contains three types of variables and three classes of factors. The variables include:

the continuous random observation of E-gene expression under a given intervention and replicate experiment, the unknown hidden state of E-gene under a particular inter- vention which is a discrete variable with domain {1,0,−1} and the interaction modes between two S-genes. The factors consists of: the Expression factors which model ex- pression as a mixture of Gaussian distributions, the Interaction Factors which constrain E-gene states to five possible types of interaction modes between two S-genes and the Transitivity factors which constrain pair-wise interactions to form consistent triplets of S-genes. During message passing, messages which are simply local belief potentials associated with variable interactions are passed between all nodes(variables) in the

(48)

2. NESTED EFFECTS MODELS

Figure 2.5: Structure of factor graph for network inference in Factor graph NEMs- The factor graph consists of three classes of variables (circles) and three classes of factors (squares). XeAr is a continuous observation of E-gene e’s expression under interventionA and replicater. YeAis the hidden state of E-gene eunder interventionA, and is a discrete variable with domain{1,0,−1}. φAB is the interaction between two S- genes A and B. In this figure red, green and white shading denotes activation, inhibition and no interaction respectively. Expression Factors model expression as a mixture of Gaussian distributions. Interaction Factors constrain E-gene states to interaction modes between two S-genes. Transitivity Factors constrain pair-wise interactions to form consistent triangles.

The arrows labeledµandµ0 are messages encoding local belief potentials onφAB and are propagated during factor graph inference. This figure is reproduced from (2).

(49)

2.2 Network learning algorithms in NEMs

graph using two inference steps. In the first step, messages from observation nodes are passed through the expression factors and hidden E-gene state variables, to calculate all messages in a single upward pass . In the second step, messages are passed between only the interaction variables and transitivity factors until convergence using Equation 2.19. The final S-gene network is derived by transitive reduction of all redundant edges from Φ. Figure 2.5 reproduced from (2) gives an overview of the structure of factor graph NEM with expression factors, interaction factors, and transitivity factors. For acyclic factor graphs, the marginal, max-marginal and conditional probabilities of sin- gle or multiple variables can be calculated exactly with the max-sum algorithms (63).

Message-passing algorithms have been shown to demonstrate excellent empirical results in various practical problems even on graphs containing cycles such as feed-forward and feed-back loops(64, 65, 66).

2.2 Network learning algorithms in NEMs

Recall in the Bayesian framework of Markowetzet al.(2005) (49), networks are scored by posterior probabilities. By enumerating all network topologies, the maximum posterior network is selected. The exhaustive search limits the method to small networks of up to 6 S-genes. Thus, exhaustive enumeration is infeasible even for medium-sized studies.

For large-scale screens, search heuristics are used to explore model space. Several approaches to this problem have been proposed by Fr¨olich et al. (2, 57, 58, 59), all of which concentrate on small sub-models involving only pairs, triples, or quadruples of nodes. The final S-gene graph is scored by combining the scores from these sub models.

2.2.1 Pairwise and triple search

The division into subgraphs can either be into all pairs or triples of nodes (57). In the pairwise approach, for every pair of S-genes (A,B), we compute the Bayesian score detailed in section 2.1 and select the maximum aposteriori (MAP) modelMA,B∈ {A→ B, B → A, A =B, A 6= B}. The advantage of this approach is the increase in speed and the possibility to infer networks involving a very large number of nodes. However, the reconstruction accuracy of networks based on the pairwise method is rather low due to the pairwise independence assumption used in scoring the network which is not true in real biological networks. To improve on this limitation the triple search

Referenzen

ÄHNLICHE DOKUMENTE

Looking at static solutions and static hosting platforms, the emergence of SSGs like Jekyll, Hexo and Hugo, and smart static hosting services like GitHub Pages and Netlify

In this paper, we have proposed the use of Markov chains and transition matrices to model transitions between databases, and used them to define a probabilistic metric space for

a Biochemical Section of Key Laboratory of Functional Polymer Materials, The Ministry of Education of China, Institute of Polymer Chemistry, Chemical School of Nankai

(A) Alignment of the part nucleotide sequence of four FKBP23 species: human (HFKBP23), mouse (MFKBP23), lizard (LFKBP23), and fi sh (FFKBP23); the regions used for the initial design

Starzec (1999) investigated the same relationship for a set of igneous and metamorphic rocks from Sweden, and came up with regression coefficient of 0.82 for the rock types.

Managing rapidly changing situations in business service offerings is a serious challenge. Today´s enterprises adjust their services on the implementation level, which requires

A filmstrip model aims to describe a sequence of system state transi- tions from the application model as a single object diagram: a set of application object diagrams and

Extensive simulations on artificial data and application of our module network approach to infer the signaling network between 10 genes in the ER-α pathway in hu- man MCF-7 breast