

6.2.3 Data-Analysis Model for PlugIns

Yes Grant the permission; equivalent to a logical TRUE value.

No Forbid the action so that it can never be performed; equivalent to a logical FALSE value.

Undefined Neither grant nor forbid the action; this setting is ignored, equivalent to a logical missing value.

A combination of all permissions with an assigned value, a user or group, and an object constitutes a complete ACL. The resulting permission is derived from a combination of all ACLs assigned to the user for the given object; ACLs can be either directly assigned to a user or assigned via group memberships. As a result, many ACLs may apply to a given combination of user and object. In case of a conflict, the effective permissions are computed from all ACLs using a logical AND operation. Undefined values are simply ignored; if only undefined values are present, the default access policy is used.

The default access policy needs to be defined for each action in the action set. If the permission settings for an action consist only of missing values, or if there are no applicable ACLs, the default ACL is substituted. For reasons of data security, a deny-by-default policy is applied whenever no specific default policy has been defined. This results in the often desired behavior that a user cannot perform any action until it is explicitly allowed.
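To illustrate the combination rule, the following Python sketch resolves a set of ACL values for one combination of user, object, and action. It is a minimal illustration only, assuming hypothetical names (Permission, resolve_permission); it does not reproduce the actual EMMA2 implementation.

    from enum import Enum

    class Permission(Enum):
        # Hypothetical names, for illustration only.
        YES = "yes"              # grant, equivalent to a logical TRUE value
        NO = "no"                # forbid, equivalent to a logical FALSE value
        UNDEFINED = "undefined"  # ignore, equivalent to a logical missing value

    def resolve_permission(acl_values, default=Permission.NO):
        """Combine all ACL values that apply to one (user, object, action) triple.

        Undefined values are ignored; the remaining values are combined with a
        logical AND, so a single 'No' overrides any number of 'Yes' entries.
        If nothing is defined, the default policy (deny by default) applies.
        """
        defined = [v for v in acl_values if v is not Permission.UNDEFINED]
        if not defined:  # no applicable ACLs or only undefined values
            return default
        return Permission.YES if all(v is Permission.YES for v in defined) else Permission.NO

    # A directly assigned 'No' overrides a 'Yes' inherited from a group membership.
    print(resolve_permission([Permission.YES, Permission.NO]))   # Permission.NO
    print(resolve_permission([Permission.UNDEFINED]))            # Permission.NO (default)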

The model also defines a composite kind of tool, which serves as a means to combine several analysis methods in a single consecutive execution. These tools are termed 'Queue' within the model (in the presentation to the user they are called 'pipelines', in resemblance to an assembly line). Queues may contain a sequence of other tools and also other queues, which are then executed consecutively. A queue is independent of the actual data set but is often restricted to a specific data type, which is determined by the first function in the pipeline.

Jobs are a combination of tools and the actual data sets to be analyzed. A job can be executed via a scheduling mechanism that dispatches jobs to a multi-host compute cluster. Jobs reference the designated input data and the output they produce, and they can be assigned to an experiment where appropriate.
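The composite structure of tools, queues, and jobs described above can be sketched as follows. The Python classes and attribute names are hypothetical and merely illustrate the relationships; they are not the persistence classes of the model.

    from dataclasses import dataclass, field
    from typing import List, Optional, Union

    @dataclass
    class Function:
        name: str                         # e.g. an R or Perl analysis function

    @dataclass
    class Tool:
        functions: List[Function] = field(default_factory=list)

    @dataclass
    class Queue:
        # A queue may contain tools as well as nested queues, executed consecutively.
        steps: List[Union[Tool, "Queue"]] = field(default_factory=list)

    @dataclass
    class Job:
        tool: Union[Tool, Queue]          # the tool or pipeline to execute
        input_data: List[str]             # references to the designated input data
        output_data: List[str] = field(default_factory=list)
        experiment: Optional[str] = None  # optional assignment to an experiment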

Four different types of functions can be specified, each serving a different purpose (a brief sketch follows the list):

General analysis functions are R or Perl functions or binary executables performing computations.

Export functions retrieve additional annotation information from MAGE-OM, BRIDGE, or a web service; this can be, for example, pathway information from GenDB.

Writer functions put data back into the MAGE-OM. They can also be used to store images or graphics generated within a pipeline.

Importer functions put data into the database, such as by reading in raw-data files or array-layout definitions. Unlike writer functions, they do not require data from a previous computation.
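As a brief illustration of this taxonomy, the four kinds of functions could be represented as follows; the enum and the dictionary are hypothetical and only restate the distinctions made above.

    from enum import Enum

    class FunctionKind(Enum):
        # Hypothetical enum, for illustration only.
        GENERAL = "general"    # R/Perl functions or binary executables performing computations
        EXPORT = "export"      # retrieve annotation from MAGE-OM, BRIDGE, or a web service
        WRITER = "writer"      # put data, images, or graphics back into MAGE-OM
        IMPORTER = "importer"  # read raw-data files or array-layout definitions

    # Per the description above, writer functions consume data from a previous
    # computation, whereas importer functions do not.
    REQUIRES_UPSTREAM_DATA = {FunctionKind.WRITER: True, FunctionKind.IMPORTER: False}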

Data structures were designed to represent the analysis modules within the database. The corresponding classes serve to set up pipelines and parameters of the contained functions. An overview of the extension classes that supplement MAGE-OM is given in UML in Figure 6.3 on the next page.

The ability to add new functions to the system is a key feature of the data analysis system. As such, it allows functions to be added whose behavior was not known during the implementation phase. To avoid meaningless pipelines, a type system has been included in the model. An example of the application of a type system can be given as follows: a normalization function operates on the measured data that have been imported from the image analysis software; applying a normalization function to already normalized data or to the results of a significance test would not be sensible. As a consequence, creating such pipeline configurations has to be prevented.

The first step in the definition of a type system is to identify the types. From a computational perspective, microarray data are represented by multidimensional arrays of numeric and factorial values. However, neither the atomic data types nor the dimensionality of the arrays provide a suitable type system for functions, as they do not yield a logical classification of the data types: a numeric array of a given size might be the result of many different analysis algorithms.

Figure 6.3: Simplified UML diagram of the supplementary persistence classes of the analysis model. The diagram depicts the core classes Job, Tool, and Function together with derived classes. The Observation class is introduced to store all resulting data that do not fit into MAGE-OM. The diagram is simplified for clarity and readability: only the most important subclasses are depicted, and not all associations are named or show their cardinalities.

In particular, the dimension indicating the number of spots, genes, or other design elements of a microarray should be neglected by the type system, as a function like a normalization function should be applicable to data sets regardless of the specific array design used. On the other hand, the choice of a normalization function might reflect the technological platform of the array, because multi-channel and one-channel microarrays require different processing. In addition, different image quantification programs produce diverse quantitation types.

It is necessary to base the type system on a higher level of logical annotation, because MAGE-OM together with its extensions constitutes the complete view of the world for the software. Any analysis function has to operate on the data structures which exist within the model. Therefore, the possible types of the analysis functions must reflect the data structures in MAGE as closely as possible.

Four classes representing steps of the data analysis serve as input and output of data analysis functions. These classes are descendants of the BioAssayData class and are used to determine the basic data type of a function in the analysis model (a small sketch follows the list of classes):

PhysicalBioAssayData (PBAD) represent data measured with hardware equipment like scanners. The only possible data type found with PBAD is images.

Image analysis is currently not incorporated into EMMA pipelines but carried out by external applications. The presence of the PBAD data type provides the possibility to include image analysis directly in a pipeline.

MeasuredBioAssayData (MBAD) are data resulting directly from an image quantification software. MBAD consist of tabular data containing raw intensities and quality statistics for each feature on the array. These data have to be further processed by normalization.

DerivedBioAssayData (DBAD) represent the output of a transformation process. A transformation process takes MeasuredBioAssayData or DerivedBioAssayData as input and creates one or more numerical datasets of type DBAD as output. Normalization is an example of a transformation from MBAD to DBAD. Other functions, such as significance tests, operate on normalized data of type DBAD and yield a table of significance statistics for each gene, also of type DBAD.

BioAssayDataClusters (BADC) are the results of a higher-level analysis which provides a grouping of individual design elements or of entire microarrays on the basis of MeasuredBioAssayData or DerivedBioAssayData. A typical example is a cluster analysis algorithm that calculates a grouping of the data into clusters. The results of a classification algorithm producing a mapping of design elements into disjoint classes may also be represented by this data type.
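The four categories and the example transformations mentioned above can be condensed into a small sketch. The enum and the signature table are hypothetical illustrations, not classes of the model.

    from enum import Enum, auto

    class BioAssayDataType(Enum):
        PBAD = auto()  # PhysicalBioAssayData: scanner output, i.e. images
        MBAD = auto()  # MeasuredBioAssayData: raw intensities and quality statistics
        DBAD = auto()  # DerivedBioAssayData: output of a transformation
        BADC = auto()  # BioAssayDataClusters: groupings of design elements or arrays

    # Examples from the text: normalization maps MBAD to DBAD, a significance test
    # maps DBAD to DBAD, and a cluster analysis maps DBAD to BADC.
    EXAMPLE_SIGNATURES = {
        "normalization":     (BioAssayDataType.MBAD, BioAssayDataType.DBAD),
        "significance_test": (BioAssayDataType.DBAD, BioAssayDataType.DBAD),
        "cluster_analysis":  (BioAssayDataType.DBAD, BioAssayDataType.BADC),
    }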

The BioAssayData classes are not exhaustive, as specific analysis tasks require more sophisticated data types than those found in MAGE; images, clickable maps, files, and lists of gene names are the most frequently required examples. The Observation class hierarchy is introduced to provide such supplementary data types. It is also suited for deriving further subclasses for new analysis methods without affecting the core MAGE-OM classes.

There are further properties that help to classify the type of a data matrix, in particular the type of the rows of a data matrix. As an example, the rows of a quantification table from an image analysis software represent measured values for each individual spot on the microarray. Likewise, normalization produces normalized intensities or intensity ratios for each spot, represented by the Feature class in MAGE-OM. Individual spots may be seen as repeated measurements of a common polymer of nucleotides physically present on the array, called a Reporter. Reporters may be further grouped into a logical sequence representing a genomic region.

A common example is a gene represented by several different oligonucleotide sequences, which are subsequences of its coding region. These logical sequences are called CompositeSequence in MAGE terminology.

A normalization function operates solely on the Feature level of the data matrix, while a function which computes an expected expression value, e.g. the mean over all replicates of a sequence, operates on Features but bases its output on the Reporter or CompositeSequence assignment of the Features. In conclusion, the DesignElement type of a function may be one of Feature, Reporter, CompositeSequence, or Any. The type Any is applicable for functions which are indifferent to the type of design element present, such as plotting functions.
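The aggregation from the Feature level to the Reporter or CompositeSequence level described above can be illustrated with a short sketch; the spot and reporter identifiers are invented for the example, and the real system operates on MAGE-OM structures rather than plain dictionaries.

    from collections import defaultdict
    from statistics import mean

    # Feature-level values (one per spot) and their assignment to Reporters.
    feature_values = {"spot_1": 2.1, "spot_2": 1.9, "spot_3": 0.5}
    feature_to_reporter = {"spot_1": "rep_A", "spot_2": "rep_A", "spot_3": "rep_B"}

    # Average all replicate spots assigned to the same Reporter.
    grouped = defaultdict(list)
    for feature, value in feature_values.items():
        grouped[feature_to_reporter[feature]].append(value)

    reporter_values = {reporter: mean(values) for reporter, values in grouped.items()}
    print(reporter_values)  # {'rep_A': 2.0, 'rep_B': 0.5}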

The third categorization of data in the analysis process relies on the data types of the columns of a data matrix, called the QuantitationType. In MAGE it is mandatory to assign a QuantitationTypeDimension to a data set, which defines the type of measurements found in the columns. For a measured data set, the QuantitationTypes correspond to the column headers of the quantification table; examples are the foreground intensity and the background intensity of the estimated spot intensities. The names of the quantitation types may vary between different image quantification programs and analysis functions and are not known to the implementation of the analysis system in advance. Therefore, the quantitation types have to be specified for each function present in the database during run-time of the system.

In summary, the type of each function in EMMA2 is defined by a tuple of its input and output types, such that each function constitutes a mapping:

f : (B_i, D_i, Q_i) ↦ (B_o, D_o, Q_o)    (6.1)

where B ∈ {BAD, PBAD, MBAD, DBAD, BADC} is the class of the BioAssayData in MAGE, D ∈ {Feature, Reporter, CompositeSequence, Any} is the type of the DesignElement, and Q is the QuantitationTypeDimension in MAGE. The type of a function is modeled in the supplementary class
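To make the typing rule in Equation 6.1 concrete, the following sketch checks whether the output type of one pipeline step is compatible with the input type of the next. The class FunctionType, the function is_compatible, and the quantitation-type name 'ratio' are hypothetical and serve only to illustrate the mapping; they are not part of the actual EMMA2 classes.

    from dataclasses import dataclass
    from typing import FrozenSet

    @dataclass(frozen=True)
    class FunctionType:
        bad: str                            # BioAssayData class: BAD, PBAD, MBAD, DBAD, or BADC
        design_element: str                 # Feature, Reporter, CompositeSequence, or Any
        quantitation_types: FrozenSet[str]  # names of the required or produced columns

    def is_compatible(produced: FunctionType, consumed: FunctionType) -> bool:
        """Check whether one step's output (B_o, D_o, Q_o) fits the next step's input (B_i, D_i, Q_i)."""
        if produced.bad != consumed.bad:
            return False
        # 'Any' acts as a wildcard for functions indifferent to the design element.
        if "Any" not in (produced.design_element, consumed.design_element) \
                and produced.design_element != consumed.design_element:
            return False
        # Every quantitation type required by the consumer must be produced upstream.
        return consumed.quantitation_types <= produced.quantitation_types

    # Example: a normalization produces DBAD on the Feature level with a 'ratio' column,
    # which a significance test indifferent to the design element can consume.
    norm_out = FunctionType("DBAD", "Feature", frozenset({"ratio"}))
    test_in = FunctionType("DBAD", "Any", frozenset({"ratio"}))
    assert is_compatible(norm_out, test_in)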

CHAPTER 7

Implementation

In the previous Chapters, the required functionality of EMMA2 has been specified, and from that the structural components and their interactions have been derived.

Now, it is time to assemble them into an operational piece of software. This can be accomplished by using programming languages, code libraries, and database management systems. A prototype version of the software has been built from scratch to refine and improve the specification and design.

An overview of the component structure of EMMA2 is given in Figure 7.1 on the following page. The implementation and adaptation of these data structures, algorithms, and novel visualization methods are explicated in the following Chapter.

7.1 Choice of Core Development Tools

There are manifold development environments, database tools, and programming languages that support the process of implementing software. A major criterion for selecting these tools is their reliability and efficiency; another is free availability, to guarantee that the system is distributable and extensible for everyone within the academic community. For microarrays, additional problems such as dealing with very large and noisy datasets and very complex data structures need to be addressed.