

4.3 Architecture and Design Patterns

4.3.4 Parallelism and Cloud

• Communication between components (unknown to each other): This requirement supports the loose coupling of components by avoiding direct dependencies between components.

• Asynchronous communication: Avoids blocking the application.

• Sharing of data to be processed: Data should be easily transportable to other services.

Patterns

• Blackboard (Buschmann et al. [22]) enables communication between different services by providing an "open" storage space that can be used to share data, results, etc.

• Whiteboard (Kriens and Hargrave [70]) is a variant of the Observer pattern (Gamma et al. [46]) that eliminates the need to register an Observer manually.

• Shared Repository (Lalanda [73]) allows sharing different data structures.

• Service Locator (Alur et al. [5]) provides a way to reduce the coupling of components; a minimal sketch is given below.

• Dependency Injection (Fowler [44]) is comparable to the Service Locator, but without the need for an active lookup of components.

• Listener / Observer (Gamma et al. [46]) allows registering for notifications on an event and thereby reduces the tight coupling of components; a minimal sketch follows this list. A variant of this pattern is Publish-Subscribe (see Buschmann et al. [22]).

• Pipes and Filters (Buschmann et al. [22]) describes how to integrate systems operating on a data stream.
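To make the decoupling concrete, the following minimal Java sketch shows a Listener / Observer registration and notification; all names (AnalysisService, ResultListener, onResult) are illustrative assumptions, not taken from the cited catalogues.

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Hypothetical listener interface; components depend only on this abstraction.
    interface ResultListener {
        void onResult(String result);
    }

    // Notifies registered listeners without knowing their concrete types.
    class AnalysisService {
        private final List<ResultListener> listeners = new CopyOnWriteArrayList<>();

        void register(ResultListener listener) {
            listeners.add(listener);
        }

        void finishAnalysis(String result) {
            // Push-style notification: all observers are informed of the event.
            for (ResultListener listener : listeners) {
                listener.onResult(result);
            }
        }
    }

    public class ObserverDemo {
        public static void main(String[] args) {
            AnalysisService service = new AnalysisService();
            service.register(r -> System.out.println("Received: " + r));
            service.finishAnalysis("42 facts derived");
        }
    }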

The main selection criteria for these patterns are an abstraction from the concrete component and the ability to integrate different kinds of components. Also, loose coupling of components, achieved by abstracting from their concrete implementations, must be enforced. The mapping of the patterns to the identified requirements is presented in Table 4.3.
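As an illustration of the lookup-based decoupling that Service Locator provides, here is a minimal Java sketch; the registry design and the KnowledgeStore interface are assumptions made for this example.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical service interface shared by several components.
    interface KnowledgeStore {
        void put(String key, Object value);
    }

    // Central registry: clients depend on interfaces, not on concrete classes.
    final class ServiceLocator {
        private static final Map<Class<?>, Object> REGISTRY = new ConcurrentHashMap<>();

        static <T> void register(Class<T> type, T implementation) {
            REGISTRY.put(type, implementation);
        }

        static <T> T lookup(Class<T> type) {
            return type.cast(REGISTRY.get(type));
        }
    }

    public class LocatorDemo {
        public static void main(String[] args) {
            ServiceLocator.register(KnowledgeStore.class,
                    (key, value) -> System.out.println(key + " -> " + value));
            // Active lookup; with Dependency Injection the instance would be handed in.
            KnowledgeStore store = ServiceLocator.lookup(KnowledgeStore.class);
            store.put("fact", 42);
        }
    }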

Columns: Loose coupling | Extensibility | Runtime Reconfiguration | Communication with "unknown" | Async. communication | Data Sharing

Blackboard             X X X X X
Shared Repository      X X X
Service Locator        X X X X
Dependency Injection   X X X X
Listener / Observer    X X X X X
Whiteboard             X X X X
Pipes and Filters      X X

Table 4.3: Mapping of patterns to requirements for internal integration.

Parallelism

This collection describes patterns for introducing parallel algorithms.

Requirements

Key requirements for parallelism are easy integration into a software system by using technologies that are already present in a modern programming language framework (e.g., the ExecutorService in Java). Moreover, the processing of big data should be supported by using a scalable platform. Also, results should be returned to the caller via a callback after an asynchronous call, to avoid polling.

• Processing of large amounts of data: Especially in knowledge processing systems, the amount of data can become very large and needs many resources.

• Fast processing of data: An analysis should be done as fast as possible. This can conflict with the processing of large amounts of data (e.g., MapReduce (Dean and Ghemawat [27]) on the Hadoop framework (http://hadoop.apache.org/) is suitable for big data, but processing is often slow).

• Simple integration of parallel algorithms into existing systems: Parallel algorithms tend to be more complex (because of synchronisation tasks and concurrent access to data).


• Elastic scaling: If a parallel algorithm needs more resources from time to time, these should be easy to obtain.

• Loose coupling: Here, too, loose coupling supports the extensibility and exchangeability of code.

• Flexibility for multiple types of algorithms: In knowledge processing, different variants of algorithms exist (for example, Rete (Forgy [41]) or Phreak [61]). For testing, they should be easy to integrate and exchange.

• Notification of finish (no polling): Asynchronous calls can produce a heavy load when polling is used to ask for results regularly. An asynchronous notification is favourable; a minimal Java sketch follows this list.

• Control of single tasks: Fine-tuning of the parallel execution should be possible (for example, defining a custom execution strategy). Sometimes only global mechanisms are provided, which take away control over the parallel execution from the developer (see, for example, OpenMP, http://openmp.org/wp/).

• Physically distributed processing: An advantage of parallel processing is the ability to distribute it over multiple physical machines. This avoids being limited by the resources of a single machine.
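To sketch the no-polling requirement with the ExecutorService mentioned above, the following Java example attaches a callback to an asynchronous task via CompletableFuture; the analysis task itself is a made-up stand-in.

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class CallbackDemo {

        // Stand-in for a long-running knowledge-processing task.
        private static String expensiveAnalysis(String input) {
            return input.toUpperCase();
        }

        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(4);

            // Start the analysis asynchronously; the caller is not blocked.
            CompletableFuture
                    .supplyAsync(() -> expensiveAnalysis("input data"), pool)
                    // The callback fires once the result is ready: no polling loop.
                    .thenAccept(result -> System.out.println("Done: " + result));

            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
    }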

Patterns

• Task Parallelism (Mattson et al. [79]) tries to split the work up into multiple tasks that can be executed in parallel.

• Pipeline (McCool et al. [80]) lets multiple tasks execute in parallel, communicating through queues (see the sketch after the selection criteria below).

• Single Program, Multiple Data (SPMD) (Mattson et al. [79]) describes the parallel execution of multiple instances of the same program, each on its own chunk of data.

• Fork / Join (McCool et al. [80]) allows splitting a sequential execution into a parallel one and later joining it back into a sequential one (see the sketch after Table 4.4).

• Actors (Agha [2]) also allow distributed calculation over multiple processing nodes (either virtual or physical).

• Loop Parallelism (Mattson et al. [79]) allows applying the same task in parallel to the elements of a collection.

• Master / Worker (Mattson et al. [79]) splits a task up among multiple parallel workers, which are coordinated by one master.


• Map-Reduce (Dean and Ghemawat [27]) is a pattern for operating on large amounts of data. The data is split up, processed in parallel, and the partial results are reduced to one final result.

• Future [85] allows receiving results from parallel tasks (e.g., results from a thread processing data in the background).

Columns: Big data | Fast processing | Simple integration | Scaling | Loose coupling | Multiple algorithms | No polling | Control in tasks | Physical distribution

Task Parallelism       X X X X
Pipeline               X X X X X
SPMD                   X X X X
Fork / Join            X X X X
Actors                 X X X X X X
Loop Parallelism       X X X X X
Master / Worker        X X X X X X X
Map-Reduce             X X X X
Future                 X X X

Table 4.4: Mapping of patterns to requirements for parallelism.
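The following minimal Java sketch illustrates Fork / Join with the standard ForkJoinPool; the summation task is an invented example, not one of the knowledge-processing algorithms discussed here.

    import java.util.Arrays;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Recursively splits the work, computes chunks in parallel,
    // and joins the partial sums back into one sequential result.
    class SumTask extends RecursiveTask<Long> {
        private static final int THRESHOLD = 1_000;
        private final long[] data;
        private final int from, to;

        SumTask(long[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {           // small enough: solve sequentially
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }
            int mid = (from + to) / 2;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                            // execute the left half in parallel
            return right.compute() + left.join();   // join back into one result
        }
    }

    public class ForkJoinDemo {
        public static void main(String[] args) {
            long[] data = new long[1_000_000];
            Arrays.fill(data, 1L);
            long sum = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
            System.out.println("Sum: " + sum);      // prints 1000000
        }
    }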

Some common criteria for selecting one of these patterns are the ability to start an asynchronous calculation, either with or without the option to wait for the result. Parallel algorithms should also be distributable and work on large amounts of data. Moreover, simple support in modern programming languages is favourable.
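To illustrate the queue-based communication of the Pipeline pattern, here is a minimal Java sketch with two stages connected by a BlockingQueue; the stage contents and the end-of-stream marker are assumptions for this example.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Two pipeline stages run in parallel and communicate through a queue.
    public class PipelineDemo {
        private static final String POISON = "<eof>";   // hypothetical end-of-stream marker

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);

            Thread producer = new Thread(() -> {
                try {
                    for (int i = 0; i < 5; i++) queue.put("item-" + i);
                    queue.put(POISON);                  // signal that the stream has ended
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            Thread consumer = new Thread(() -> {
                try {
                    for (String item = queue.take(); !item.equals(POISON); item = queue.take()) {
                        System.out.println("processed " + item);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            producer.start();
            consumer.start();
            producer.join();
            consumer.join();
        }
    }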

Cloud

The benefits of a cloud platform for a knowledge processing system are described by Nadschläger et al. [90]. A cloud platform can support knowledge processing by providing an infrastructure for computationally intensive ad-hoc calculations, but also by supporting the processing of large amounts of knowledge via elastic scaling. Cloud patterns relevant to the implementation of a knowledge processing system are described by Homer et al. [56].

Requirements

Executing a program in a cloud environment means that communication happens over a network. Communication in a network must be fault-tolerant, as it can break at any time, resulting in the possible loss of data. Moreover, the transport of large amounts of data can be slow, depending on the quality of the network.

• Ensure communication with good performance: Communication in a network can become slow, depending on the quality of the network and the amount of data.

• Fault tolerance: To counter possible network failures, applications should support meaningful error handling and provide alternative solution strategies.

• Handle large data: It should also be possible to process large amounts of data. Therefore, strategies have to be developed to maintain good performance.

• Avoid data loss: Data sent over a network can get lost (e.g., when a path in the network is broken).

• Resilience: The application should be able to recover from errors.

• Ease of development: A cloud application brings some difficulties in development (e.g., the application has to be deployed to a cloud infrastructure). Therefore, strategies that make development easier are welcome.

Patterns

• Cache-aside improves the performance of loading data from a cloud environment by caching it locally (see the sketch after the selection criteria below).

• Circuit-Breaker handles faults and improves performance in the presence of faulty communication by avoiding repeated calls to a failing service (after a timeout or an error).

• Compensating Transaction can be compared to the transaction concept of a database: it tries to undo all completed steps of a process if one step fails.

• Command and Query Responsibility Segregation (CQRS) is a pattern that improves performance by separating read queries from write operations.

• Event Sourcing and Materialized View, on the one hand, help to avoid data loss; on the other hand, they increase performance. Both patterns optimise the data representation so that no data can be lost and access performs well.

• Sharding helps to manage a large amount of data by spreading it over multiple nodes.

• Retry provides a strategy to retry failing requests (see the sketch after this list).

• Runtime Reconfiguration allows changing application parameters at runtime without having to redeploy the whole application.
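A minimal Java sketch of the Retry pattern follows; the bounded attempt count and the linear backoff are assumptions chosen for illustration, not a strategy prescribed by Homer et al. [56].

    import java.util.concurrent.Callable;

    // Generic retry helper: re-executes a failing call a bounded number of times.
    public class RetryDemo {

        static <T> T withRetry(Callable<T> call, int maxAttempts, long backoffMillis)
                throws Exception {
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return call.call();
                } catch (Exception e) {
                    last = e;                               // transient failure: back off and retry
                    Thread.sleep(backoffMillis * attempt);
                }
            }
            throw last;                                     // all attempts failed: give up
        }

        public static void main(String[] args) {
            try {
                String result = withRetry(() -> {
                    if (Math.random() < 0.7) throw new RuntimeException("network glitch");
                    return "response";
                }, 3, 100);
                System.out.println("got: " + result);
            } catch (Exception e) {
                System.out.println("giving up: " + e.getMessage());
            }
        }
    }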

Criteria for the usage of a cloud environment are the ability to have fallback strategies in case of an error in communication. Moreover, the performance should not decrease. Also, the deployment should be kept as easy as possible.
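The Cache-aside pattern can be sketched in a few lines of Java; the in-memory map standing in for the cache and the remote-store stub are assumptions made for this example.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Cache-aside: read from the local cache first, load from the (slow)
    // remote store only on a miss, and populate the cache with the result.
    public class CacheAsideDemo {
        private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

        static String get(String key) {
            return CACHE.computeIfAbsent(key, CacheAsideDemo::loadFromRemoteStore);
        }

        // Stand-in for a request to a cloud data service.
        private static String loadFromRemoteStore(String key) {
            System.out.println("remote fetch for " + key);
            return "value-of-" + key;
        }

        public static void main(String[] args) {
            System.out.println(get("rule-base"));   // miss: hits the remote store
            System.out.println(get("rule-base"));   // hit: served from the local cache
        }
    }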

Columns: Communication performance | Fault tolerance | Big data | Avoid data loss | Resilience | Ease of development

Cache-aside              X X
Circuit-Breaker          X X
Compensating Transaction X X
CQRS                     X X
Event Sourcing           X
Materialized View        X X
Sharding                 X X X X
Retry                    X X
Runtime Reconfiguration  X

Table 4.5: Mapping of patterns to requirements for cloud.