
Fachbereich für Informatik und Informationswissenschaft
Nycomed Stiftungs-Lehrstuhl für Angewandte Informatik

Bioinformatik und Information Mining

Master's Thesis

Diversity Driven Parallel Data Mining

for the attainment of the academic degree of Master of Science (M.Sc.)

Oliver Sampson

July 10, 2013

Reviewers:

Prof. Dr. Michael Berthold
Dr. Barbara Pampel

Universität Konstanz

Fachbereich für Informatik und Informationswissenschaft
D–78457 Konstanz

Germany

Konstanzer Online-Publikations-System (KOPS)
URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-264633


Sampson, Oliver:

Diversity Driven Parallel Data Mining
Master's thesis, Universität Konstanz, 2013.


Abstract

With increasing availability and power of parallel computational resources, attention is drawn to the question of how best to apply those resources. Instead of simply finding the same answers more quickly, this thesis describes how parallel computational resources can be used to explore disparate regions of a solution space by using diversity to steer the solution paths away from each other, thereby discouraging strictly greedy behavior. The formulation of models in a concept/solution space and its relationship to a search space are described, as are common search algorithms with heuristics for searches that are computationally prohibitive in time or space. Measures of diversity are introduced, and the application of a beam search to the solution space of the Krimp algorithm for frequent itemset mining is described. Experimental results show that it is indeed possible to get better results on real-world datasets with these methods.


Acknowledgments

I would like to thank the entire Nycomed Chair for Bioinformatics, my fellow master students on the chair, and the employees of KNIME.com AG for their time and understanding for my many questions, not only during the completion of this thesis, but also during the entire time of my re-entry into the world of Computer Science at the University of Konstanz. In particular, special thanks to my two advisers Prof. Dr. Michael Berthold and Dr. Barbara Pampel for their patience; to Christian Dietz and Martin Horn for the magnificent Table Cell Viewer Node for KNIME; to Dr. Thorsten Meinl and Iris Adä for help with algorithms and general guidance; to Dr. Thomas Gabriel, Peter Ohl and Dr. Bernd Wiswedel of KNIME.com AG for answers to implementation questions; to Peter Burger for setting up and administering servers to run the experiments on; and to Heather Fyson for all things administrative. A big thank you to KNIME.com AG for sponsoring the wonderful coffee and coffee machine, without which this thesis wouldn't have been possible – really.


Contents

List of Figures
List of Tables
List of Algorithms

1 Introduction
1.1 More. Better.
1.2 Diverse Results
1.3 Overview

2 Concept Learning as a Search Space
2.1 Concept Learning
2.2 Version Space
2.2.1 More-general-than
2.2.2 Find-S
2.2.3 Candidate Elimination
2.3 Iterative Refinement
2.4 Hierarchy of Searches
2.4.1 Searching a Model Space
2.4.2 Prohibitively Large Search Spaces
2.4.3 Breadth-First and Depth-First
2.4.4 Search Space Limiting Heuristics
2.5 Summary

3 Diverse Parallel Machine Learning
3.1 Parallelized Machine Learning
3.1.1 Parallel vs Serial
3.1.2 Quality through Quantity
3.1.3 Related Work
3.2 Measuring Diversity
3.2.1 (Unordered) Sets
3.2.2 Distance Measurements
3.2.3 Ordered Sets
3.3 Diverse Parallel
3.3.1 Top-k Widening
3.3.2 Diverse top-k Widening
3.3.3 Communication-free Widening
3.3.4 Diverse Communication-free Widening
3.4 Summary

4 Itemset Compression
4.1 Frequent Itemset Mining
4.2 The KRIMP Algorithm
4.2.1 KRIMP Algorithm Overview
4.2.2 KRIMP Algorithm Details
4.2.3 Applications
4.3 Summary

5 KRIMP as a Solution Space Model
5.1 Parallel Diverse KRIMP
5.1.1 Candidate Selection Diversity
5.1.2 Cover Order Diversity
5.1.3 Model Selection Diversity
5.2 Summary

6 Experimental Results
6.1 Experiments
6.1.1 p-dispersion-min-sum
6.1.2 p-dispersion-sum
6.1.3 k-broad-stepwise, Closed Itemsets
6.1.4 k-broad-stepwise, Frequent Itemsets
6.1.5 Directed Placement Code Table
6.2 Summary

7 Conclusion
7.1 Future Work

A Implementation of KRIMP for KNIME
A.1 Verification of Krimp_KNIME
A.2 Compression and Pruning

References

List of Figures

1.1 Diverse Solutions

2.1 Refining with two-stage Selection
2.2 Simple Solution State Graph
2.3 Rubik's Cube
2.4 Search Hierarchy

3.1 Parallel Processing Taxonomy

4.1 Frequent Closed Itemsets
4.2 Standard Code Table
4.3 A transaction database and its corresponding code table

5.1 Refine and Select Diversity with KRIMP
5.2 k-broad Stepwise
5.3 Directed Placement

6.1 p-dispersion-min-sum, Closed Itemsets
6.2 p-dispersion-sum, Closed Itemsets
6.3 k-broad-stepwise, Closed Itemsets
6.4 k-broad-stepwise, Frequent Itemsets
6.5 Directed Placement, Closed Itemsets

A.1 Itemset Mining Algorithms
A.2 Minsup Comparison
A.3 Comparison of Execution Times with and without Pruning

List of Tables

2.1 Sample Dataset for Version Space

6.1 Experimental Summary for p-dispersion-min-sum
6.2 Experimental Summary for p-dispersion-sum
6.3 Experimental Summary for k-broad-stepwise with Closed Itemsets
6.4 Experimental Summary for k-broad-stepwise with Frequent Itemsets
6.5 Experimental Summary for Directed Placement

A.1 Implementation Comparison of L(D|ST) and |F|
A.2 Implementation Comparison of Compression
A.3 Improvement of Pruning over non-Pruning

List of Algorithms

2.1 The Find-S Algorithm
2.2 The Candidate Elimination Algorithm
2.3 The Implicit Graph Search Algorithm
2.4 The k-Best-First-Search Algorithm
2.5 The Beam Search Algorithm

3.1 The Greedy Construction Algorithm
3.2 The First Pairwise Interchange Heuristic Algorithm
3.3 The Erkut Heuristic Algorithm
3.4 The Levenshtein Distance Algorithm

4.1 The Krimp Algorithm
4.2 The Standard Code Table Algorithm
4.3 The Standard Cover Algorithm
4.4 The Krimp Pruning Algorithm

Chapter 1

Introduction

Faster. In most things, "faster" means "better." Faster race cars, faster trains, faster Internet connections, faster time to marketplace. New measures of time have been invented to reflect modernity's preoccupation with "faster": "Webtime," "real-time." "Take your time" is now "once upon a time." Faster does indeed mean better in some cases; sometimes speed is simply the goal and the objective, as with race cars, trains and planes, marathon runners, and downhill ski times. Faster, in a human sense, means that time that had been spent waiting for one thing can instead be spent waiting for something else. The quest for faster-as-better has inspired wondrous innovations like jet airplanes and disastrous adaptations like cigarettes [Gle00]. But not all ways of spending time are equal; faster in the personal sense is not necessarily better, as anyone forced to watch any of the countless reality TV shows or wait at the dentist's office with a toothache can attest.

The inexorable march of Moore's Law [Moo65, Moo75] has brought faster computing and more computing power per unit time, and therefore a perceived "better" computing, to the masses. A corollary to Moore's Law (actually a description of its limitations) is seen in the performance advances due to the recent renewed interest in and availability of parallel computing: the widespread adoption of Cloud Computing at the largest scales in the network, and general-purpose graphics processing units (GPGPUs) and multicore CPUs in PCs and handheld devices at the edge of the network. The coming age of seemingly infinite computing resources at the fingertips in an all-connected world is irresistible, as new applications burst onto the scene that utilize faster computing for a better user experience and satisfy the need for speed. The pace of the increase continues, even to levels not imagined by Moore in the nascent days of integrated microchips, with the recent announcement of Google's new Quantum Artificial Intelligence lab.1

But what if we pause for a minute, so to speak, and think about different applications for these powerful systems? What would it mean if we were able to take the available parallel processing capabilities and apply them differently, so that instead of reaching

1http://bits.blogs.nytimes.com/2013/05/16/google-buys-a-quantum-computer/


the same answer in a faster time, we actually used the parallel capabilities to look for better answers in the same or faster time? "More," similarly to "faster," and seemingly obviously enough, can also mean "better."

1.1 More. Better.

Using parallel computing to find better answers, e.g., improving accuracy, in Machine Learning is not new [Akl00, CS93a], but with the notable exception of ensemble methods, there has been a lack of concentrated and formal research into its application until recently [AIB12, BI13, IB13]. Many machine learning problems can be likened to searching in a solution space for an appropriate model that accurately represents a set of observed data, with the intention of making a prediction about some as-yet-unseen data. The solution space for most interesting machine learning problems is generally too large to be searched exhaustively, so some heuristic optimization is necessary to restrict the search space to a manageable size. A greedy heuristic that finds a first solution is commonly employed, and these heuristics impose a bias on the search. However, complex solution spaces can have many local optima, and the heuristic may find a particularly poor (or good) local optimum by chance. The enlightened machine learning algorithm will have some mechanism to account for this, which usually involves repeating the algorithm with a different starting point, with the hope of reaching either the same answer (an indicator of having found a more global optimum) or a better answer (a different optimum from a larger area of the search space). Even the schedule of repetitions of this process can be heuristically defined [MTZF04].
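The restart strategy described above can be sketched concretely. The following is an illustrative Python sketch, not code from the thesis; the interface (a `refine` function returning neighboring models and a `score` function to maximize) is our assumption:

```python
def hill_climb(start, refine, score):
    """Pure greedy search: repeatedly move to the best-scoring refinement
    until no refinement improves on the current model (a local optimum)."""
    current = start
    while True:
        neighbors = refine(current)
        if not neighbors:
            return current
        best = max(neighbors, key=score)
        if score(best) <= score(current):
            return current  # local (or global) optimum reached
        current = best

def greedy_with_restarts(starts, refine, score):
    """Repeat the greedy search from several starting points and keep the
    best local optimum found -- the common remedy described in the text."""
    return max((hill_climb(s, refine, score) for s in starts), key=score)
```

Each restart may land in a different basin of attraction, but the restarts only help if the starting points actually lead to different regions of the space, which is exactly the weakness the diversity measures below address.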

Employing parallel resources to explore different parts of the solution space is not, however, just a matter of launching several instances of the same greedy algorithm and comparing the answers. For different algorithms or datasets, there may be a tendency for solution paths to converge quickly to some optimum when there is no method to ensure that different regions of the solution space are in fact explored. A solution to this problem is to include a diversity measure between solution paths that requires solution paths to begin exploring and remain in disparate regions of the solution space. It is this role of diversity and parallelism that is explored in [AIB12, BI13, IB13] and in this document.

Diversity is not a concept that lends itself to easy definition [Mei10]. Diversity requires at least two points, because nothing is diverse (or similar) without a point of comparison. Defining diversity is difficult, and the difficulty starts with defining a meaningful dis-/similarity measure for the data. What is clear is that with each datum accepted for learning by a machine learning algorithm, some kind of decision process has to be implemented that guarantees that the model adapted to that datum is somehow significantly different from other models already learned.
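For set-valued models, one concrete dis-/similarity measure of the kind described above is the Jaccard distance. The sketch below is illustrative; the names `jaccard_distance` and `is_diverse` and the threshold-based acceptance rule are our assumptions, not the thesis':

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a ∩ b| / |a ∪ b|.
    0.0 means identical, 1.0 means disjoint."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def is_diverse(candidate, kept, threshold):
    """Accept a candidate model only if it lies at least `threshold` away
    from every model already kept (hypothetical acceptance rule)."""
    return all(jaccard_distance(candidate, m) >= threshold for m in kept)
```

A decision rule of this form is one way to guarantee that a newly learned model is "somehow significantly different" from the models already kept.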

A common heuristic for prohibitively large search spaces is Beam Search. Beam Search is a type of breadth-first search algorithm, where the heuristic compares all of the solutions found at a particular depth and picks some number as worth pursuing, typically referred to in the literature as k-best, and discards the rest of the solutions from that


Figure 1.1: Parallel resources exploring diverse areas of a solution space.

depth in the search. In this manner, a constant number of solutions is explored at each step, conceptually represented by a constant-width beam. With the application of a diversity scoring measure to ensure that solutions are diverse in addition to, or even instead of, some measure of "best" for the beam, a conceptual framework for how to investigate parallel diversity is in place: the beam search with width k iteratively searches a solution space in parallel, and a diversity measure ensures that each of the solution paths is significantly different from the other solutions being explored.
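Putting the pieces together, a diversity-constrained beam search of this kind can be sketched as follows. This is a minimal illustration of the conceptual framework, not the implementation evaluated in later chapters; the callables `refine`, `score`, `distance` and the `min_div` threshold are our assumptions:

```python
def diverse_beam_search(start, refine, score, distance, k, min_div,
                        iterations=100):
    """Beam search of width k in which each kept candidate must also lie
    at least min_div away from every other candidate kept at that depth."""
    beam = [start]
    best = start
    for _ in range(iterations):
        # expand every solution in the beam by one refinement step
        candidates = [c for s in beam for c in refine(s)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        # greedily keep the best candidates subject to the diversity constraint
        new_beam = []
        for c in candidates:
            if all(distance(c, kept) >= min_div for kept in new_beam):
                new_beam.append(c)
            if len(new_beam) == k:
                break
        beam = new_beam
        best = max([best] + beam, key=score)
    return best
```

With `min_div = 0` this degrades to a plain k-best beam search; raising the threshold forces the k paths apart, at the cost of sometimes discarding high-scoring but redundant candidates.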

1.2 Diverse Results

Given that the combination of data and a model has a solution space with a complex topography with many local optima, the results of a parallel-diverse exploration should provide better solutions than a purely greedy algorithm. And the wider the beam used to explore that solution space, the more likely it is to find a better solution.

Hypothesis 1 As the search beam through the solution space widens, the tendency to find a better solution increases.

Ideally, the right diversity measure and the correct width for the data-algorithm combination will lead directly to the best solution as quickly as possible. Unfortunately, for complex problems such a direct path can only be verified when the search completes in short amounts of time, but a wider beam is more likely to find a better solution sooner.

Hypothesis 2 As the size of the search beam through the solution space widens, paths to better solutions have fewer iterations.

Unfortunately, measuring diversity has a computational cost associated with it, which means that in the parallel iteration cycle, diversity heuristics that are not communication-free may be computationally slower than a comparable pure greedy algorithm, which


may have found that solution directly. To be truly effective, diversity heuristics that require communication between parallel resources need to find a solution more quickly and account for the time lost due to the diversity measurement.

1.3 Overview

This document describes the concepts necessary to understand machine learning algorithms that explore diverse regions of a solution space. Chapters 2 and 3 describe the necessary components upon which Chapter 4 builds. Machine learning algorithms induce understanding from a set of observations by creating a model to match those observations. Chapter 2 begins with induction, describes the formulation of models used by machine learning algorithms, which are used to describe relationships in data, and how the differences between models form a solution space. After a discussion of the process of refining models to reflect newly observed data, search spaces are covered in general, along with some common search algorithms used to search for the best models to match observed data. Greedy algorithms quit upon finding an optimum, which may be local or indeed global. Better solutions, even optimal solutions, can be found by searching diverse regions of a solution space in parallel, while using diversity between parallel searches to steer the searches and guarantee exploration of disparate regions. Chapter 3 covers computational parallelism in general with a specific focus on parallelism in machine learning, and then discusses diversity in general with some common dis-/similarity metrics. Algorithms for discovering diverse elements from a larger population follow, and parallelism and diversity are then covered together in a common framework to describe parallel diversity. A specific application for parallel diversity is found in the machine learning task of frequent itemset mining with the Krimp algorithm [VvLS11], described in Chapter 4. Krimp finds the best representative subset of itemsets from a larger dataset using the Minimum Description Length principle. The application of parallel diversity to the Krimp algorithm is discussed in Chapter 5, with specific examples of how diversity-based heuristics can be used to augment or replace Krimp's heuristics.

The experimental results of the application of parallel diversity to Krimp are covered in Chapter 6.


Chapter 2

Concept Learning as a Search Space

Induction is "the process of inferring or verifying a general law or principle from the observation of particular instances" [Bro93]. In the field of machine learning, inductive learning is the process of using a computer program to make predictions on a set of data based on information gleaned from another dataset, called the training data set.

There are some assumptions made both in the creation of the computer program and about the data used to make predictions. For example, when generating a classifier that will predict a value or class for some unseen data from a set of possible values, it is assumed that the unseen data will take values from the same set of possible values seen in the training data. Furthermore, different implementations of classifiers will have different assumptions upon which they are built, and these provide the bias from which the learning algorithm is able to generalize. Mitchell in [Mit97, p. 40–42] describes a hypothetical case for a classification algorithm where no bias is assumed and draws the conclusion that no generalization without bias is possible.

Inductive learning is predicated on the inductive learning hypothesis.

The inductive learning hypothesis. Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples [Mit97].

Mitchell also compares three classification learning algorithms and ranks them according to the amount of bias.

1. Rote-Learner. The algorithm learns based only on the data to which it has been exposed. For data instances that the algorithm has not seen, no classification is possible.

2. Candidate Elimination. The algorithm is able to classify new instances where there are no conflicts in the version space.


3. Find-S. The algorithm generates the most specific model possible and uses it for classification [Mit97].

Recent trends and the hype in so-called "Big Data" analysis seem to belie the notion of, and the effort towards, better model generation and inductive learning algorithms. In a widely cited article, Chris Anderson of Wired magazine stated, "With enough data, the numbers speak for themselves."1 With more and more data, the bias ladder is descended, and inductive learning devolves to the mere execution of Rote-Learner. It is important to note that despite the potential for simplification, the increase in the amount of data and the use of that data require improvement in the performance of inductive machine learning algorithms and, more importantly, require better tuned algorithms and implementations in what Anderson calls the "Petabyte Age." With increasingly larger data mountains to climb, it becomes even more important to be able to intelligently analyze the data. It is exactly the biases in the data models that require understanding and consideration while creating the model, and attempts to bypass that step in the writing of algorithms are certain to be a source of predictive errors.

This section starts with a discussion of Mitchell's view of concept learning as a search through a solution space [Mit82, Mit97], covers search spaces in general with some common search algorithms, and finishes with beam search.

2.1 Concept Learning

Herbert Simon defined "learning" in the sense of machine learning as "any change to a system that allows it to perform better the second time on repetition of the same task or on another task drawn from the same population" [Sim80]. Mitchell formalizes this in [Mit97]:

Definition 2.1 Learning. "A computer program is said to learn from experience, E, with respect to some class of tasks, T, and performance measure, P, if its performance at tasks in T, as measured by P, improves with experience, E." [Mit97, p. 2]

The tasks are as varied as the problems to which computers can be applied, and the experiences are certainly just as varied. The tasks can be thought of as problems to be solved, and similarly, the experiences can be thought of as data for the problem. Each new datum provides new information to the problem-solving algorithm, represented by a computer program, from which the algorithm is potentially better able to perform. The process of taking a new datum or data and creating, via inference, a potential solution to the problem is the creation of a model.

A model is represented by m, and its predicted value for a data example x is m(x). m belongs to a family of models, ℳ; x and its concept, c(x), written as an ordered pair ⟨x, c(x)⟩, belong to a database, 𝒟.

1http://www.wired.com/science/discoveries/magazine/16-07/pb_theory


Time of Day  Day of Week  Driving  DrinkBeer
Morning      Weekday      Yes      No
Morning      Weekend      No       Yes
Morning      Weekday      No       No
Afternoon    Weekend      No       Yes
Evening      Weekday      No       Yes

Table 2.1: Sample Dataset for Version Space

In Machine Learning, Mitchell defines concept learning:

Definition 2.2 Concept Learning. Concept learning is the inference of a boolean-valued function, m, from a set of training data, D, of the function's input, x, and output, c(x) [Mit97, p. 21].

m(x) ⊨ ∀⟨x, c(x)⟩ ∈ D : D ⊂ 𝒟, m ∈ ℳ    (2.1)

A model is said to be consistent, and therefore a potential candidate solution function for the concept learning problem for a dataset, when the model generates the same output for the same input from the training data.

Definition 2.3 Consistence, Consistent. A model, m, is consistent with a training dataset, D, if and only if the concept output by function m is the same as the concept for each datum, x, in D [Mit97, p. 29].

Consistent(m, D) ≡ ∀⟨x, c(x)⟩ ∈ D, m(x) = c(x) [Mit97, p. 29]    (2.2)

This document is only concerned with consistent models.
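Definition 2.3 translates directly into code. A minimal sketch, assuming a model is a callable and a dataset is a list of (x, c(x)) pairs:

```python
def consistent(m, D):
    """Definition 2.3: m is consistent with D iff m(x) equals the recorded
    concept c(x) for every pair in D."""
    return all(m(x) == c_x for x, c_x in D)
```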

2.2 Version Space

Mitchell presents in [Mit78, Mit82, Mit97] his concept of a Version Space, which systematically describes the induction of models from a dataset. Consider the dataset in Table 2.1. A model that accurately describes the attribute DrinkBeer for the given data is a vector of three values, where a "?" represents that any value is acceptable, a "∅" represents that no value is acceptable, or a specific named value (e.g., Weekend) is acceptable.

For example, a model that states that drinking beer at any time on the weekend is a good idea is represented by the model vector ⟨?, Weekend, ?⟩.

When the model, m(x), correctly classifies the example, x, it is a positive example, i.e., DrinkBeer = Yes (= 1). The most general model that can be defined is the vector


⟨?, ?, ?⟩, which classifies every example as a positive example. Correspondingly, the model represented by the vector ⟨∅, ∅, ∅⟩ is the most specific possible, which would classify every example as a negative example, i.e., DrinkBeer = No (= 0) [Mit97].

The task of the learner is to develop a model based on a set of training data that accurately predicts the concept, c(x), for each x, i.e., ∀x ∈ D : m(x) = c(x) [Mit97].

2.2.1 More-general-than

There is a natural ordering of models from general to specific which provides an advantageous structure when considering a model space as a search space. A model, mj, is more general than or equal to another model, mk, when every example modeled by mk is also modeled by mj. Mitchell introduces a notation to describe this condition: mj ≥g mk. Similarly, a model may be strictly more general than another model, represented by mj >g mk, when (mj ≥g mk) ∧ (mk ≱g mj) [Mit97].
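For the finite attribute-vector models used in this chapter, the ≥g relation can be checked by enumerating the instance space. The following is an illustrative sketch; the helper names are ours:

```python
from itertools import product

def matches(model, example):
    """An attribute-vector model matches an example iff each constraint is
    '?' (anything) or equal to the example's value; '∅' matches nothing."""
    return all(c != '∅' and (c == '?' or c == v)
               for c, v in zip(model, example))

def more_general_or_equal(mj, mk, instances):
    """mj >=_g mk: every instance matched by mk is also matched by mj."""
    return all(matches(mj, x) for x in instances if matches(mk, x))
```

For Table 2.1 the instance space is the cross product of the three attribute domains, so ⟨?, Weekend, ?⟩ ≥g ⟨Morning, Weekend, No⟩ holds, but not the converse.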

2.2.2 Find-S

Algorithm 2.1: The Find-S Algorithm [Mit97]

Input: Training Data, D
Output: m, a maximally specific model

1  Initialize m to the most specific model in M
2  for each x : c(x) = 1 do
3      for each attribute constraint, ai, in m do
4          if ai is satisfied by x then
5              do nothing
6          end
7          else
8              replace ai in m by the next more general constraint that is satisfied by x
9          end
10     end
11 end
12 return m

The Find-S Algorithm [Mit82, Mit97] finds the maximally specific model for the examples in a training dataset. It does this by starting with the most specific model possible, i.e., no possible values are acceptable. In the example shown in Table 2.1, this would be

mS = ⟨∅, ∅, ∅⟩

The first positive example from the dataset in Table 2.1, in the second row, indicates that the model is too specific and needs to be modified to apply to this dataset. The


values in the model are replaced with the next more general constraint that applies to the model, which are the values from this example.

m′S = ⟨Morning, Weekend, No⟩

The next positive example, ⟨Afternoon, Weekend, No⟩, indicates that the refined model, m′S, is still too specific to match the training data. The model is further refined to include a "?" for the Time of Day attribute, indicating that any value is appropriate.

The further refined model is:

m″S = ⟨?, Weekend, No⟩

It should be noted that the Find-S Algorithm ignores all negative examples and is only concerned with the positive examples.

The third and final positive example from Table 2.1, ⟨Evening, Weekday, No⟩, shows that the model is again too specific in the second attribute, Day of Week, and the model is refined again to

m‴S = ⟨?, ?, No⟩

giving the sensible notion that drinking beer is fine at any time when not driving.

With each training example from a dataset, Find-S refines the model to be more general to accommodate the new information if necessary. The final model is the most specific model that fits the training data.
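The walkthrough above can be reproduced with a short sketch of Find-S for nominal attribute vectors; this is an illustrative implementation, using the data of Table 2.1:

```python
def find_s(D):
    """Find-S for nominal attribute vectors: start maximally specific and
    generalize just enough to cover each positive example; negatives are
    ignored.  For a nominal attribute, the next more general constraint
    after a conflicting specific value is '?'."""
    m = None  # stands for the all-∅ model before any positive example
    for x, positive in D:
        if not positive:
            continue
        if m is None:
            m = list(x)  # first generalization: adopt the example's values
        else:
            m = [a if a == v else '?' for a, v in zip(m, x)]
    return m

# Table 2.1 as (attribute vector, DrinkBeer) pairs
D = [
    (('Morning',   'Weekday', 'Yes'), False),
    (('Morning',   'Weekend', 'No'),  True),
    (('Morning',   'Weekday', 'No'),  False),
    (('Afternoon', 'Weekend', 'No'),  True),
    (('Evening',   'Weekday', 'No'),  True),
]
```

Running `find_s(D)` yields `['?', '?', 'No']`, i.e., the model m‴S = ⟨?, ?, No⟩ derived step by step above.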

2.2.3 Candidate Elimination

Similar to Find-S is the Find-G algorithm, which determines the set of most general models that match the training data. In analogy to Find-S, Find-G is strictly concerned with the negative examples from the training data [Mit97]. The algorithm starts with the maximally general model

mG = {⟨?, ?, ?⟩}

Again from the dataset in Table 2.1, the first negative example (in the first row) indicates that this set of models needs to be refined to

mG = {⟨Morning, ?, ?⟩, ⟨?, Weekday, ?⟩, ⟨?, ?, Yes⟩}

Upon encountering the second negative example in the dataset, ⟨Morning, Weekday, No⟩, Find-G finds a model in conflict with the third feature, Driving, which has to be eliminated from the model space.

The maximally general model space after processing the two data items is

mG = {⟨Morning, ?, ?⟩, ⟨?, Weekday, ?⟩}


Algorithm 2.2: The Candidate Elimination Algorithm [Mit97]

Input: Training Data, D
Output: M, a set of models that represent the training data

1  Initialize mS to the most specific model in M; initialize mG to the most general model in M
2  forall x ∈ D do
3      if x is a positive example then
4          remove from mG any model inconsistent with x
5          foreach model s in mS do
6              remove s from mS
7              add to mS all minimal generalizations of s that are consistent with x
8              remove any model from mS more general than any other model in mS
9          end
10     end
11     if x is a negative example then
12         remove from mS any model inconsistent with x
13         foreach model g in mG do
14             remove g from mG
15             add to mG all minimal specializations of g that are consistent with x
16             remove any model from mG less general than any other model in mG
17         end
18     end
19 end
20 return mS ∪ mG

The Candidate Elimination Algorithm (see Algorithm 2.2) captures how Find-S and Find-G work together to find the set of hypotheses that describe the dataset.

The only modification to Find-S and to Find-G in Candidate Elimination is that each has to take into account modifications to its own model space when a counterexample is encountered in the training data (Lines 4 and 12 in Algorithm 2.2).

In summary, Find-S iteratively refines a model to a more specific model based on the data used for learning, whereas Find-G iteratively refines a model to a more general model. Together they describe the entire model space. For discussions in this document, analogs to Find-S are more applicable than those to Find-G or Candidate Elimination.

2.3 Iterative Refinement

As seen in Find-S in Section 2.2.2, with each new datum processed iteratively by the algorithm, the model may need to be changed to fit the new information. This is more generally termed refining. Refining uses a refinement operator on a base model, or starting point, and generates k models, where k ∈ ℕ. These models are used as the input to a selection operator, whose output models are then further refined and selected until some algorithmic quit criterion is reached [BI13].

This iterative process is represented by:

m′ = s(r(m)) [BI13]    (2.3)

Definition 2.4 Refine, Refining, Refinement. A refinement operator, r, generates a set of models (m′1, . . . , m′k) from each model m ∈ ℳ for the data D ∈ 𝒟, where k ∈ ℕ [BI13].

r(m(x)) → m′1, . . . , m′k [BI13]    (2.4)

The refinement process has to start somewhere, and this is on one or more base models.

Definition 2.5 Base Model. Let m ∈ ℳ be a model from a set of models and r(·) a refinement operator. m is called a base model under refinement operator r if and only if m does not belong to the set of refined models [BI13].

BaseModel(m) ≡ ∀m′ ∈ ℳ : m ∉ r(m′) [BI13]    (2.5)

A model or models that meet a selection criterion are selected, and are either refined further or used as the final model. It is this final model or set of models that is used for making predictions.

Definition 2.6 Selection. The act of comparing models at an algorithmic stage to a selection criterion and limiting use to only those models meeting the criterion [AIB12, BI13].

s(m′) ≡ ∀m′ ∈ M′, m′ ∈ s(M′) [BI13]    (2.6)

There does not have to be a one-to-one relationship between refining and selecting stages; there could be a selection based on the first level of refinement, and a second-level selection based on potentially different criteria on the resulting set. (This could indeed be repeated ad infinitum until one or no models remain, but the benefits of such a strategy are not obvious.) Figure 2.1 graphically shows the refine-and-select process with two-stage selection. In Figure 2.1a, two base models are each refined to five refined models. In Figure 2.1b, each set of five is reduced by selecting the three best models. These models are in turn compared, from which the best two are selected and used as the models for the next stage of refinement, if any.


(a) Two solutions, each with five explored next solutions.

(b) From each group of five explored solutions, three are selected for group selection.

(c) From the six selected solutions in one set, two are selected for the next iteration.

Figure 2.1: Refining with two-stage Selection
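The refine-and-select iteration of Equation 2.3 can be sketched generically. This is an illustrative sketch in which models, the refinement operator, and the selection operator are supplied as parameters; the function name is ours:

```python
def iterate_refine_select(base_models, refine, select, iterations):
    """The iterative process m' = s(r(m)) of Equation 2.3: each round,
    refine every current model and select from the pooled refinements."""
    models = list(base_models)
    for _ in range(iterations):
        refined = [m2 for m in models for m2 in refine(m)]
        models = select(refined)
    return models
```

A two-stage selection as in Figure 2.1 corresponds to a `select` that first picks per-group winners and then compares the winners against each other.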

2.4 Hierarchy of Searches

For some algorithms, concept learning can be viewed as a type of search through a set of potential models, with a search strategy determined by the algorithm and the data [Mit97, p. 15, p. 20][Mit82, ES11]. The refinement of models based on new data is analogous to searching a model space for a model that fits all the data encountered by an algorithm.

Searches are related to each other in how they select the next region in the search space, whether they are heuristically optimized, and whether they search the complete space, among other properties.

2.4.1 Searching a Model Space

A formalization of a search space is required for the types of concept learning algorithms which have an analog to searching through a model space. There is a set of models which differ from one another only by some action, e.g., the processing of a datum from a dataset, with a subset of final models that describe the final state of the search process.

Definition 2.7 Model Space Problem. A model space problem, P = (M, A, m, M_F), is described by a set of models M, a base model m ∈ M, a set of final models M_F ⊆ M, and a finite set of actions A = {a_1, . . . , a_n}, where a transformation from one model to another is described by a_i : m_k → m_j [ES11, p. 12].

The search space can be represented as a directed graph, where each node is an iteration in the search and where an edge represents a transitional action. For concept learning a node represents a model and an edge represents a refinement to that model.

Figure 2.2 shows a simple example, where models are refined to both local and global optima.

Definition 2.8 Model Space Problem Graph. A problem graph, G = (V, E, m, M_F), for the model space problem P = (M, A, m, M_F) is defined by the set of nodes V = M, the initial node, m ∈ M, the set of final nodes, M_F, and the set of node-to-node connecting edges, E ⊆ V × V, where (u, v) ∈ E if and only if there exists an a ∈ A with a(u) = v [ES11, p. 13].

Figure 2.2: Simple Solution State Graph of refined models (green) reaching global (blue) and local (yellow) optima.

Furthermore, the iterative nature of model refinement is more aptly described by an implicit graph, where edges and nodes are not all stored explicitly but rather are implicitly generated as needed [ES11, p. 47].

Definition 2.9 Implicit Model Space Graph. In an implicit model space graph, there is an initial node, m ∈ V, a set of final nodes described by a termination condition V → B = {false, true}, and a refinement procedure, r(m) → M′ [ES11].

In consideration here are graph search algorithms that iteratively lengthen solution candidates (u_0, . . . , u_n = u) one edge at a time, until a goal model is found [ES11, p. 15]. The iterative process, model expansion, or refining, generates all descendant models. The descendant models are logically more specific than the ancestor models and follow the Find-S algorithm (See Section 2.2.2).

Definition 2.10 Descendant Model. A descendant model, m′, is a model which is the result of a refining step: m′ ∈ r(m), where m ≥_g m′ [Mit97].

The number of children for a given model is the branching factor [ES11, p. 15].

Definition 2.11 Branching Factor. The branching factor of a model is the number of its descendants. Where M′ is the set of descendant models of a model m ∈ M, the branching factor is |M′| [ES11, p. 15].

The branching factor is especially relevant and important when discussing search heuristics that restrict the number of refinements from a given model.



Figure 2.3: A Rubik’s Cube2

2.4.2 Prohibitively Large Search Spaces

A simple and well-known example of a problem search space is that afforded by a Rubik's Cube (See Figure 2.3). Each side of the cube consists of nine squares of one color and can be rotated 90°, 180° or 270° independently of the other two planes. The goal in this search space is to align all nine squares on each side of the cube to the same color. Of the 27 subcubes or cubies, 26 are visible. Of these there are 8 corner cubies, 12 edge cubies, and 6 middle cubies. The initial state has a branching factor of 6 × 3 = 18, but because the same face is not rotated twice in a row (that would merely be a different single move), all subsequent branching factors are 5 × 3 = 15. There are 8! × 3^8 × 12! × 2^12 / 12 ≈ 43 × 10^18 possible cube configurations [ES11, p. 22]. Evaluating one million configurations per second would require ≈ 1.4 million years to evaluate them all.

Another well-known example of a problem search space is the Traveling Salesman Problem. Stated succinctly: find the shortest distance a traveling salesman must travel to visit all cities on his list once, when also given the distances between the cities. It is a combinatorial problem with a possible number of states of (n−1)!/2; for n = 101 cities there are ≈ 4.7 × 10^157 different solutions [ES11, p. 25], which even at one billion evaluations per second would require ≈ 1.5 × 10^141 years to evaluate.

Similarly, the brute-force approach to finding groups of items commonly purchased together in a store is combinatorial and exponential. When given a database of products available in a store and a transaction database detailing which items have been purchased together, the problem is to extract the groups of items that have been purchased together frequently enough to make a recommendation to a user based on items in his shopping cart. For a product inventory of n there are 2^n possible combinations of purchases. For a database of 1000 items, there are 2^1000 ≈ 10^301 different possible combinations of products. There is no recourse in selecting only a limited number of products, e.g., five, from a database of 1000; the number of possible solutions is still prohibitively large: (n choose k) = n! / (k!(n−k)!) = 1000! / (5!(1000−5)!) ≈ 8.3 × 10^12 possible combinations.

2 Image by Booyabazooka from http://en.wikipedia.org/wiki/File:Rubik’s_cube.svg. Used under the Creative Commons Attribution-Share Alike 3.0 Unported license.

Figure 2.4: A Hierarchical Representation of Some Search Algorithms and Heuristics in UML3 Notation (relating Breadth-First search with OpenList as Queue, Depth-First with OpenList as Stack, Enforced Hill-Climbing, k-Best-First, Beam, and Hill-Climbing searches via their OpenList and ClosedList structures).
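These counts can be checked directly; a small Python sketch reproducing the three search-space sizes quoted above:

```python
import math

# Reproducing the search-space sizes quoted above.

# Rubik's Cube: 8!*3^8 corner states times 12!*2^12 edge states, divided by 12.
rubik = math.factorial(8) * 3**8 * math.factorial(12) * 2**12 // 12
rubik_years = rubik / 1e6 / (3600 * 24 * 365)   # at one million evaluations/s

# Traveling Salesman with n = 101 cities: (n-1)!/2 distinct tours.
tsp = math.factorial(100) // 2

# Frequent itemsets: choosing 5 products out of 1000.
itemsets = math.comb(1000, 5)

print(f"Rubik: {rubik:.2e} states, {rubik_years:.1e} years")
print(f"TSP:   {tsp:.2e} tours")
print(f"Sets:  {itemsets:.2e} 5-itemsets")
```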

Even with the fastest hardware or the most massively parallel systems available today, a brute force or exhaustive attempt to find solutions to these problems in a reasonable time is infeasible. Additionally, exhaustive search spaces can be prohibitively large in that they require too much computational memory, but there are algorithmic and heuristic methods to counteract these problems by directing the search or limiting the search space. Learning functions that can feasibly find solutions are operational.

Definition 2.12 Operational, Nonoperational. An operational learning function is one that can be used within realistic time constraints [Mit97]. Similarly, a learning function that is not operational is nonoperational.

Figure 2.4 captures the relationships of the different types of searches discussed here.

3http://www.omg.org/spec/UML/2.4.1/



2.4.3 Breadth-First and Depth-First

The models to be refined and selected by the Implicit Graph Search Algorithm (See Algorithm 2.3) are maintained by search algorithms in two lists, an Open List and a Closed List. The Open List contains the models found in the exploration process, but not yet refined, and the Closed List contains the models already refined, and thus removed from further consideration. An algorithm starts with an initial model and refines it, placing all of the descendant models into the Open List and placing the start model into the Closed List. The algorithm then removes one model from the Open List, refines it according to some set of training data, D_training, places all of the descendants into the Open List, and places that model into the Closed List [ES11, p. 49]. The process continues until a termination condition is met, such as the set of training data being empty.

Algorithm 2.3: The Implicit Graph Search Algorithm [ES11, p. 49]

Input: Initial model m, training dataset D_training
Output: M_F, the set of final refined models

1  ClosedList ← ∅
2  OpenList ← {m}
3  while D_training ≠ ∅ do
4      m = OpenList.get()
5      ClosedList.put(m)
6      x = next data element from D_training
7      D_training = D_training \ x
8      m′ = refine(m, x)
9      OpenList.add(m′)
10 end
11 M_F = OpenList.getAll()
12 return M_F

Breadth-First

A Breadth-First Search (BFS) algorithm behaves as if the Open List were a Queue or FIFO (first-in, first-out) data structure. In this manner, the descendants of a given node are all explored and expanded before any of the descendant’s descendants are explored and expanded [ES11, p. 49].

Depth-First

A Depth-First Search (DFS) algorithm behaves as if the Open List were a Stack or LIFO (last-in, first-out) data structure. Each model is refined and one of the model's descendants is continually expanded until no more model refinements are possible, after which the previous model's siblings are refined and explored in turn. At each level in the search, the algorithm continually refines as deep as possible, only returning up to a higher level when no more refined descendant models can be explored [ES11, p. 49].
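The choice of Open List data structure is the entire difference between the two searches. A minimal sketch over a hypothetical refinement graph, using collections.deque as either queue (BFS) or stack (DFS):

```python
from collections import deque

# A hypothetical refinement graph: node -> descendant models.
GRAPH = {"m": ["a", "b"], "a": ["c", "d"], "b": ["e"], "c": [], "d": [], "e": []}

def search(start, fifo=True):
    """Traverse the graph; fifo=True makes the Open List a queue (BFS),
    fifo=False makes it a stack (DFS)."""
    open_list, closed_list, order = deque([start]), set(), []
    while open_list:
        m = open_list.popleft() if fifo else open_list.pop()
        if m in closed_list:
            continue
        closed_list.add(m)
        order.append(m)
        open_list.extend(GRAPH[m])
    return order

print(search("m", fifo=True))   # breadth-first: level by level
print(search("m", fifo=False))  # depth-first: one branch at a time
```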

Algorithms using Weights

BFS and DFS as described operate on unweighted graphs, i.e., all expanded models are marginally equal with respect to each other, differing only in whether a final model has been found or not. When considering model graphs that include a weight along the model refinement edge, other data structures that utilize the weight to order the models can be considered. In particular, the Open List can be considered a Priority Queue, where entries are added to the queue and dynamically reordered according to the weight. Prim's Algorithm and Kruskal's Algorithm for finding Minimum Spanning Trees and Dijkstra's Algorithm for finding shortest paths are examples where edge weights are used in the search decision process for finding the best search path from a set of possible search paths [SW11].
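As a sketch of the weighted case, the Open List can be an actual priority queue via Python's heapq; the graph and its edge costs here are hypothetical:

```python
import heapq

# Hypothetical weighted refinement graph: node -> [(cost, descendant), ...].
GRAPH = {"m": [(4, "a"), (1, "b")], "a": [(1, "g")], "b": [(2, "a"), (7, "g")], "g": []}

def dijkstra(start, goal):
    """Open List as a priority queue ordered by accumulated path cost."""
    open_list, closed_list = [(0, start)], set()
    while open_list:
        cost, node = heapq.heappop(open_list)   # cheapest entry first
        if node in closed_list:
            continue
        if node == goal:
            return cost
        closed_list.add(node)
        for edge_cost, child in GRAPH[node]:
            heapq.heappush(open_list, (cost + edge_cost, child))
    return None

print(dijkstra("m", "g"))  # → 4, via the cheapest path m -> b -> a -> g
```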

BFS, DFS, Prim’s Algorithm, Kruskal’s Algorithm, and Dijkstra’s Algorithm guar- antee finding the optimal solution. This is the optimality of an algorithm.

Definition 2.13 Optimality. The optimality of a search algorithm describes its ability to guarantee finding the best solution.

2.4.4 Search Space Limiting Heuristics

Because the number of model refinement paths increases exponentially in time and space, i.e., as 2^k, where k is the branching factor (See Definition 2.11) based on the number of possible refinements for each model, the number of potential paths to be evaluated can be prohibitively large in terms of computational time or memory. Some search algorithms are modified with heuristics for performance reasons, but they are then not guaranteed to find the optimal solution [ES11, p. 240]. What these heuristics have in common is that they evaluate intermediate models and discard models that do not meet a performance criterion.

Heuristics are used to limit or direct the solution search paths to the most promising candidates. Some heuristic-based algorithms which limit the search space include Hill-Climbing, Enforced Hill-Climbing, k-Best-First Search, and Beam Search [ES11].

Additionally, methods of duplicate detection are used to limit the amount of storage consumed by the Open List and the Closed List, as well as to prevent repeatedly following previously searched paths.

Hill-Climbing and Enforced Hill-Climbing

Hill-climbing (also called best-first search) is a strictly greedy evaluation heuristic, which merely selects the best descendant model of the current model, refines it, and selects that model's best successor continually until a performance measure is met, or until no further refinement is possible. This heuristic has the disadvantage that the algorithm can be fooled into a dead end where the refined model performs more poorly than the ancestor model. Enforced Hill-Climbing modifies this heuristic into a BFS that selects a refined model only when it strictly performs better than the current model. If the model's descendants all perform more poorly than the model itself, they are discarded and the model's siblings are refined and searched for strictly better performing models. If no better performing models are found, the search is halted [ES11, pp. 240–241][HN01].
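A minimal sketch of this greedy behavior on a hypothetical one-dimensional score landscape, including the dead end the text warns about: from each position the "refinements" are the two neighboring positions, and the climb stops as soon as no strictly better neighbor exists.

```python
# Hill-climbing sketch on a hypothetical 1-D score landscape.
SCORES = [1, 3, 2, 5, 9, 4]

def refine(i):
    """Toy refinement: the neighbouring positions."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(SCORES)]

def hill_climb(i):
    """Greedy: move to the best neighbour only while it is strictly better."""
    while True:
        best = max(refine(i), key=lambda j: SCORES[j])
        if SCORES[best] <= SCORES[i]:   # no strictly better descendant: stop
            return i
        i = best

print(hill_climb(0))  # → 1: stuck at the local optimum (score 3)
print(hill_climb(2))  # → 4: reaches the global optimum (score 9)
```

Started at index 0, the climb dead-ends at the local optimum at index 1 and never reaches the global optimum at index 4; the result depends entirely on the starting model.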

k-Best-First Search

A modification to hill-climbing or best-first search is k-Best-First Search [Fel01], which evaluates the k best models, after which each of the k models' refinements is added to the Open List for consideration in the next iteration (See Algorithm 2.4). For search heuristics that may lead to suboptimal solutions, k-best-first search has the advantage of potentially exploring diverse parts of the model space in each iteration, because the best models at any given iteration may not necessarily be grouped together in the search space [ES11, pp. 250–251].

Algorithm 2.4: The k-Best-First-Search Algorithm [ES11, p. 250]

Input: Implicit problem graph P, initial model m, training dataset D_training
Output: M_F, the set of final refined models

1  ClosedList ← ∅
2  OpenList ← {m}
3  while D_training ≠ ∅ do
4      m_1, . . . , m_k = OpenList.get()   // get() dequeues from a Priority Queue
5      ClosedList.put(m_1, . . . , m_k)
6      foreach i ∈ {1, . . . , k} do
7          x = next data element from D_training
8          D_training = D_training \ x
9          m′ = refine(m_i, x)   // m′ contains a weight for use by the Priority Queue
10         OpenList.add(m′)   // add() enqueues into a Priority Queue
11     end
12 end
13 M_F = OpenList.getAll(k)   // dequeues the top k
14 return M_F


Beam Search

Algorithm 2.5: The Beam Search Algorithm [ES11, p. 252]

Input: Implicit problem graph P, initial model m, training dataset D_training
Output: M_F, the set of final refined models

1  ClosedList ← ∅
2  OpenList ← {m}
3  while D_training ≠ ∅ do
4      m_1, . . . , m_k = OpenList.get()   // get() dequeues from a Priority Queue
5      ClosedList.put(OpenList.getAll())   // moves the entire OpenList to the ClosedList
6      foreach i ∈ {1, . . . , k} do
7          x = next data element from D_training
8          D_training = D_training \ x
9          m′ = refine(m_i, x)   // m′ contains a weight for use by the Priority Queue
10         OpenList.add(m′)   // add() enqueues into a Priority Queue
11     end
12 end
13 M_F = OpenList.get(k)   // dequeues the top k
14 return M_F

Beam search [Low76] (or k-beam-search) is a variation of k-Best-First Search where only k models are evaluated at each iteration, and the rest of the Open List is discarded (See Algorithm 2.5). Beam searches are useful when memory for use by the algorithm is restricted, but the method sacrifices optimality for being operational (See Definition 2.12). The simplest type of beam search is the best-k heuristic, where all models other than the best k, with reference to some performance metric, are discarded at each iteration. A beam of width 1 (k = 1) is equivalent to the Hill-Climbing algorithm, and an infinite beam width (k = ∞) is equivalent to a simple BFS. By restricting the width of the search space to k, the complexity of the search becomes linear, O(kd), where k is the width and d is the depth of the search. Extensions to beam search include beam-stack search [ZH05] and BULB (Beam search Using Limited discrepancy Backtracking) [Fur05], both of which guarantee optimality by including a backtracking mechanism. Iterative broadening is another variant, which relaxes the k constraint during successive iterations, allowing for broader beams of models to be refined [ES11, p. 251].
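The best-k heuristic can be sketched level by level: at each depth only the k best models are kept and refined. The refinement operator and score below are hypothetical toys, not any algorithm from the literature.

```python
# Beam search sketch: models are numbers, refinement doubles the branching,
# and the (hypothetical) score prefers models close to 11.

def refine(m):
    return [2 * m, 2 * m + 1]          # toy branching factor of 2

def score(m):
    return -abs(m - 11)                # selection criterion: closer to 11 is better

def beam_search(start, k, depth):
    beam = [start]
    for _ in range(depth):
        candidates = [c for m in beam for c in refine(m)]
        # best-k heuristic: keep only the k best, discard the rest of the Open List
        beam = sorted(candidates, key=score, reverse=True)[:k]
    return beam

print(beam_search(1, k=2, depth=3))  # → [12, 13]: the optimum 11 was pruned away
print(beam_search(1, k=4, depth=3))  # a wider beam recovers 11
```

With k = 2 the beam discards model 5 at depth two, so the optimum 11 (reachable as 1 → 2 → 5 → 11) is missed; widening the beam to k = 4 recovers it, illustrating the trade-off between optimality and operationality.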

Beam searches have been found to be useful in many applications including planning [ZH04], scheduling [HM02, OM88], speech recognition [HAH+01, p. 615][Koe04, JL02], and discovering rule patterns in historical linguistics [RAP98].

Duplicate Detection

As a computational time and memory efficiency heuristic, duplicate detection is employed for types of breadth-first heuristic searches [ZH06]. Refined models are checked against both the Open List and Closed List for matching models, and a new model is discarded when it is found to have been previously explored. Frontier searches [KZTH05, Kor99, ES11] are searches that maintain no Closed List, i.e., all explored nodes are discarded; this is a more general memory-reduction technique for BFS. Discarding the Closed List complicates the search by potentially reintroducing models that lead to paths previously evaluated and discarded. These require alternate methods of duplicate detection, such as detecting the presence of a parent on the Open List, to compensate [KZTH05].

2.5 Summary

Machine learning via induction on a set of training data is a powerful method for extracting predictive models for use on a set of unknown data. The entire model refinement process can be thought of as a model space where each model is a node in a graph, linked by edges representing a refinement of that model based on additional data. Refining a model changes it, i.e., it becomes more specific (or more general) to fit the new data, and multiple different models can be refined simultaneously from a base model, which is analogous to following a path from node to node along edges in a search graph. Standard search algorithms such as BFS or DFS can be employed to search this solution space, but prohibitively large model spaces, which cannot be exhaustively searched due to computational time or memory restrictions, call for heuristic methods which sacrifice optimality for operationality.


Chapter 3

Diverse Parallel Machine Learning

The meaning of “diverse parallel” with respect to machine learning, or perhaps any other subject, remains veiled without context. The jump from conceptualizing machine learning algorithms as a search through a solution space to conceptualizing parallel searches through a search space is small. The reasoning behind requiring “diverse” is not quite as clear, and is explained in this chapter.

Parallel computation has evolved from a years-long research project to hardware available as standard in (seemingly) disposable handheld portable computers and common desktop and laptop computers [PH08]. Parallel computing has many forms, but they all share the same characteristics of a start point, some instructions that are computed in parallel, and an end point, where the results of the parallel computations are handled. How the instructions are computed in parallel depends on computer and network architectures, which vary in terms of whether the memory is shared and whether the parallel processors are colocated on the same chip or device or networked, among other variations.

In contrast, a Google search on “diversity” returns results where the range of topics is strikingly broad. Just the Wikipedia page on the topic “Diversity”1 alone lists a dizzying variety of subjects using “diversity” as a mantle. It comes therefore as no surprise that the use of “diversity” within the computing community is also used in different ways.

Because of the myriad ways the term “diversity” is used, it is important to describe and define just what “diversity” means. Perhaps most confusing is the use of the term to describe both the cause and the effect of exploring different parts of the solution space, i.e., differentiating between algorithms that inject diversity with a purpose, e.g., the Diversity Beam Search Algorithm [SHRQB95], and algorithms that merely describe diversity as a by-product of their behavior, e.g., KWA* [Fur04].

Diversity in solution search spaces in machine learning appears in the literature with some regularity, but the role of diversity in the method of parallel solution space exploration has remained relatively untouched as a topic unto itself until recently. This document builds directly on the work of Akbar, et al. [AIB12], Berthold and Ivanova [BI13], and Ivanova and Berthold [IB13].

1http://en.wikipedia.org/wiki/Diversity



Figure 3.1: A Taxonomy of Parallel Processing Implementations (axes: Code, homogeneous vs. heterogeneous; Data, monolithic vs. subsets; example entries include CUDA/OpenCL, Random Forests and other ensemble methods, Stacked Generalization, Parallel k-Means, CudaRF, Multistrategy Hypothesis, Bagging, Boosting, Meta-Learning, and Multi-Agent systems).

3.1 Parallelized Machine Learning

Parallelism in electronic computing has a history nearly as long as that of electronic computing itself, with parallelism in artificial intelligence (AI) and machine learning coming along shortly thereafter with investigations into artificial neural networks [EJ77].

Parallelism has been introduced into machine learning with two primary goals: accelerating execution and increasing accuracy. There is a third use of parallelism in the AI community for multi-agent systems, often under the moniker Distributed Artificial Intelligence, but that is beyond the scope of this discussion [SV00].

The goal of brute-force execution acceleration by subdividing an algorithm and/or the data into sub-units that can be executed in parallel and recombined is the more obvious of the two approaches and was the focus of early parallel computing research.

Neural networks were early adopters of parallelization in the 80s [RM87], with clustering [RIZ90, Li90] catching the wave in the 90s. The growth of ensemble techniques and metalearning in the 90s led to the interesting case where researchers published two papers about their experiments with metalearning systems, where the stated goal of one paper was increased accuracy, whereas the other paper's goal was improved processing performance. Ensemble techniques are a natural application for parallel processing, irrespective of the goal being either faster execution or improved accuracy.

Different types of parallel processing implementations of data mining applications are categorized in the taxonomy in Figure 3.1.2 The taxonomy divides the implementations along two axes: the nature of the executable code and the nature of the data upon which it is executed. The code axis is designated either as “homogeneous,” where the same code is executed in each parallel instance, or as “heterogeneous,” where different algorithms are executed in each parallel instance. Similarly, the data axis is designated as “monolithic,” where all parallel instances act on the same body of data simultaneously, or as “subsets,” where each parallel instance acts on a subset of the data, with some sort of combiner program to unify the disparate results.

2 The taxonomy contrasts with Talia's taxonomy of three categories [Tal06].

3.1.1 Parallel vs Serial

Vanilla parallelization, if there were such a thing, would simply be an attempt at applying a divide-and-conquer approach to algorithm implementation, where dividing a problem into n equal parts would enable a factor-of-n execution acceleration. However, not all problems are equally suited to parallelization, with potential differences in algorithms, access to shared data structures, and communication overhead between parallel processes all detrimentally affecting what would be the ideal execution performance improvement [ES11]. Even the ideal maximally parallelized algorithm requires some sequential (i.e., non-parallel) steps for the coordination of the parallelized portion.

The relationship between the speedup achieved by a number, n, of parallel processes and the fraction of processing performed serially, s, is given by Amdahl's Law [Amd67, TLNC10]:

Speedup_Amdahl(n) = 1 / (s + (1 − s)/n)   (3.1)

which is to say that the speedup is not linear in n and is strongly influenced by the fraction of the algorithm that can be parallelized.

Gustafson-Barsis’s Law [Gus88] is another attempt to formulate the relationship in acceleration when performed on n processors. For problems that are large, the law states that the serial portion of the calculation, s as a fraction, becomes a smaller part of the overall execution time, so that for very large problems, the increase in processing speed approachesn. Gustafson-Barsis’s Law is less about describing the limits of parallel computing, and more about describing the amount of data that can be processed by n processors, because the serial portion for larger datasets decreases.

SpeedupGustaf son−Barsis(n) = n−s(n−1)[Gus88, TLNC10] (3.2)
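The two laws are easy to compare numerically; a small sketch evaluating both for the same serial fraction (the 10% serial fraction and 1024 processors are illustrative values, not from the text):

```python
# Evaluating Amdahl's Law and Gustafson-Barsis's Law for a serial fraction s.

def amdahl(n, s):
    """Speedup limited by the serial fraction s (Eq. 3.1)."""
    return 1.0 / (s + (1.0 - s) / n)

def gustafson_barsis(n, s):
    """Scaled speedup when the problem grows with n (Eq. 3.2)."""
    return n - s * (n - 1)

# With 10% serial code and 1024 processors:
print(amdahl(1024, 0.1))            # ~9.9: capped near 1/s = 10
print(gustafson_barsis(1024, 0.1))  # ~921.7: near-linear for scaled problems
```

The contrast shows both points made above: under Amdahl's Law the speedup saturates near 1/s no matter how many processors are added, while under Gustafson-Barsis's assumption of a growing problem the speedup stays close to n.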

3.1.2 Quality through Quantity

Akl in [Akl00] provides a model for analyzing real-time computation problems that can be performed better in the same amount of time by virtue of parallelization. Although rigorous in his analysis of how a parallel system provides better answers than a serial system in the same time, his fundamental assumption is not that better answers cannot be found by a serial computer given enough time, but rather that for some classes of problems, particularly in real-time streaming applications, better answers are found when the parallel resources are applied to refine some iterative process. Examples include optimization, cryptography and numerical computation, where “better” is defined differently for each class of problems, i.e., closer to optimal, more cryptographically secure, and higher numerical accuracy, respectively. Unfortunately, he limits his analysis to that of real-time computational models, but does presciently imagine that there may be other computational problems where parallel computation provides better answers than sequential ones.

3.1.3 Related Work

Applications of parallel computing in machine learning have a long history with two different goals: faster execution times or better performance with respect to some performance measure. As early as 1989, Stolfo, et al. applied parallel computing techniques to speech recognition [SGMM89] for better accuracy. This was later formalized as ensemble techniques and metalearning in 1993 with the work by Chan and Stolfo [CS93b] specifically on metalearning. In [CS93b] Chan and Stolfo specifically cite a “divide-and-conquer” approach for “higher speed” while “maintain[ing] the prediction accuracy that would be achieved by the sequential version.” Also in 1993, Chan and Stolfo [CS93a] touted the increased accuracy of ensemble techniques with and without a metalearning stage, while Wolpert worked with a type of parallelism for stacked generalization in 1992 [Wol92]. In [ZH04] Zhou and Hansen discuss methods of using a divide-and-conquer approach to BFS (See Section 2.4.3) and beam search (See Section 2.4.4), although it is not strictly described as being implemented for parallel processing. A parallel implementation with modifications is given by Korf and Schultze in [KS05].

Parallelized versions of popular clustering algorithms such as k-Means [SB99, ZMH09, KC00, FRCC08] and DBSCAN [AC01, XJK02, DM00, BNP+09] are often reimplemented with each new wave of parallelizing technology, e.g., parallel processing (multithreading and multicore), networks of workstations (NOWs) with messaging (Message Passing Interface (MPI)3), and general-purpose graphics processing units (GPGPUs) with CUDA4 or OpenCL5 technologies.

Parallel mining of association rules and frequent itemsets, by Park, et al. [PCY95] in 1995 and by Agrawal and Shafer [AS96] in 1996 on a shared-nothing (separate memory, separate disks) architecture, began almost immediately after the seminal work [AIS93] on association rules in 1993 from Agrawal, et al. Other work [PZOL01] on shared memory systems includes that of Parthasarathy, et al., and from Agrawal on heterogeneous systems [Agr08].

3 http://www.mpi-forum.org/
4 http://www.nvidia.com/object/cuda_home_new.html
5 http://www.khronos.org/opencl/

The relatively recent rise of the use of GPGPUs for non-graphics-related programming has generated more research in parallel association rule mining, with efforts concentrated on improving execution performance. Fang, et al. in [FLL+08] describe the use of an |I| × |D|6 bit matrix (0 for absence, 1 for presence) to represent the presence of an item in a transaction. Presence of two items together in a transaction is determined by a bitwise logical-AND and a count of the number of ones in the result. The generation of the data structure and the recursive calls for the calculation of support are performed in hybrid CPU-GPU code, due to programming limitations of the GPU. Zhang, et al. in [ZZB11] describe GPApriori, which uses a similar data structure to [FLL+08] for representing the items and transactions. However, Zhang, et al. use a parallel reduction algorithm [BT97, p. 17] to calculate support for each candidate set.
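The bit-matrix idea can be sketched with Python integers as bit vectors; the transaction data here is hypothetical, and [FLL+08] and GPApriori perform this at scale on the GPU rather than per candidate in a scripting language:

```python
# Bit-vector support counting (sketch of the bit-matrix representation):
# one bit per transaction, one bit vector per item.
# Transactions: t0={A,B}, t1={A,C}, t2={A,B,C}, t3={B}; bit i = transaction ti.
ROWS = {
    "A": 0b0111,  # A occurs in t0, t1, t2
    "B": 0b1101,  # B occurs in t0, t2, t3
    "C": 0b0110,  # C occurs in t1, t2
}
NUM_TRANSACTIONS = 4

def support(itemset):
    """Bitwise-AND the items' rows, then count the remaining one-bits."""
    bits = (1 << NUM_TRANSACTIONS) - 1  # start with all transactions present
    for item in itemset:
        bits &= ROWS[item]
    return bin(bits).count("1")

print(support({"A", "B"}))       # → 2: together in t0 and t2
print(support({"A", "B", "C"}))  # → 1: all three only in t2
```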

Teodoro, et al. describe in [TMMF10] parallel implementations for both multicore CPUs and GPUs based on tree projection methods of itemset mining. In their implementation they employ a parallelism based on the data, where each transaction is handled by a computing entity, and all elements are calculated for a given depth in the search tree, i.e., a breadth-first search (See Section 2.4.3), before the next level is begun. Each item in the transaction updates the count value for that item in main memory, after which the tree representation is adjusted and the next level of processing is performed by the computing units in parallel.

Research on the application of randomized search heuristics provides an illustrative contrast in method to the diverse parallelism presented in this document, although the goal is the same. Friedrich, et al. describe in [FHH+10] the amount of improvement in approximations for multiobjective combinatorial optimization problems using randomized heuristics when compared to single-objective optimization problems, and demonstrate that good approximations are possible for both the VertexCover and SetCover problems. Using SetCover as an example, they demonstrate that in the worst case the single-objective formulation is non-approximable with randomized search heuristics.

3.2 Measuring Diversity

Similarity, and therefore diversity, is an important issue in bioinformatics and chemoinformatics regarding protein and molecule similarity [EJT00]. In subfields of information retrieval there has been renewed interest in diversity by showing users diverse results on the first page of search query results to minimize search abandonment [GS09]. Diversity has been used to balance learning algorithms for datasets where some classes heavily outweigh other classes in the dataset [WTY09], and is used for balancing outcomes among classifiers in ensemble methods by measuring differences in the results among the classifiers [TSY06], although the usefulness in real-world pattern recognition applications has been disputed [KW03]. Diversity within a solution space has also been directly addressed by algorithms solving Pareto-optimal multi-objective optimization functions [MA04, LTDZ02].

6 |I| is the number of unique items in the database, and |D| is the number of transactions.



Shell, et al. describe in [SHRQB95] CRESUS, an “expert system for cash management,” which used diversity directly in a beam search (See Section 2.4.4) for avoiding local optima and accelerating the search for global optima, and van Leeuwen and Knobbe employed a beam search with diversity with positive results in subgroup discovery [vLK12].

An early attempt at solution space refinement and selection is found in [HMZM96].

Holte, et al. explore solution spaces as graph search problems, and try to take advantage of properties of graphs to speed up solutions. Rather than comparing several different solutions at a given depth in the search directly, they categorize the nodes and refer to one type of refinement as “opportunistic,” where nodes along the search path belong to an abstraction layer. This is not an example of diversity, per se, but describes an early attempt at evaluating nodes along the search path during refinement in an indirect manner.

Measuring diversity and selecting a diverse subset can depend on whether the set is ordered or not.

3.2.1 (Unordered) Sets

The k-diversity Problem is to select a subset of k items from a larger set of n items such that some measure of diversity is maximized. This is a variation of the p-dispersion problem and has been shown to be NP-hard [Mei10, DP09, Erk90, EÜY94]. Meinl describes in [Mei10] six diversity measures, some with heuristic optimizations, which are used as diversity measures for the k-diversity Problem. These yield different types of diverse sets, which reinforces the observation that diversity does not lend itself to a single definition. Two are described here: p-dispersion-sum and p-dispersion-min-sum.

Note 1 Even though it is formally called the “p-dispersion problem,” “k” is used in the following description and definitions for consistency with the rest of the document.

Definition 3.1 (p-dispersion-sum). Given a set $P = \{p_1, \ldots, p_n\}$ of $n$ items, a $k \in \mathbb{N}$ with $k \leq n$, and a distance measure $d(p_i, p_j)$ between items $p_i, p_j \in P$, the k-diversity problem is to select the set $S \subseteq P$ such that

$$S = \operatorname*{arg\,max}_{\substack{S \subseteq P \\ |S| = k}} f(S), \quad \text{where } f(S) = \frac{1}{k(k-1)} \sum_{i=1}^{k} \sum_{j>i}^{k} d(p_i, p_j) \qquad \text{[DP09, Mei10]} \tag{3.3}$$

p-dispersion-sum selects a set of k items which are maximally far away from each other. As described more fully by Meinl, this has the effect of pushing the k-set away from the center of the distribution, leaving a diverse but unrepresentative k-set.
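As a small illustration of Definition 3.1 (a sketch for this text, not code from the thesis implementation), $f(S)$ and an exact, exponential-time solution of the k-diversity problem can be written as follows; the function names and the toy distance measure are assumptions:

```python
from itertools import combinations

def f(S, d):
    # f(S) = 1/(k(k-1)) * sum_{i<j} d(p_i, p_j), as in Equation 3.3.
    k = len(S)
    return sum(d(p, q) for p, q in combinations(S, 2)) / (k * (k - 1))

def k_diverse(P, k, d):
    # Exact solution by enumerating all k-subsets of P -- exponential
    # in |P|, which is why the problem is NP-hard and heuristics are
    # needed in practice.
    return max(combinations(P, k), key=lambda S: f(S, d))
```

For points on a line with $d(a,b) = |a - b|$, the selected k-set contains the extremes of the distribution, illustrating how p-dispersion-sum yields a diverse but unrepresentative k-set.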

Drosou and Pitoura describe and experimentally compare and evaluate in [DP09] four main heuristic approaches for finding a k-set using the p-dispersion-sum measure, each with several variants. Each of these methods is predicated on a particular definition of distance between an item and all other items in a set.
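To make the flavor of such heuristics concrete, the following is a sketch of one common greedy construction (an illustrative assumption, not necessarily one of the four approaches from [DP09]): seed the result with the farthest pair, then repeatedly add the candidate whose summed distance to the already-selected items is largest.

```python
import math
from itertools import combinations

def greedy_k_diverse(P, k, d):
    # Seed with the farthest pair, then grow greedily: in each round,
    # add the remaining item whose summed distance to the current
    # selection S is largest.  Fast, but only an approximation of the
    # optimum defined in Equation 3.3.
    S = list(max(combinations(P, 2), key=lambda pair: d(*pair)))
    rest = [p for p in P if p not in S]
    while len(S) < k:
        best = max(rest, key=lambda p: sum(d(p, s) for s in S))
        S.append(best)
        rest.remove(best)
    return S

# Toy example: 2-D points under the Euclidean distance.
points = [(0, 0), (10, 0), (0, 10), (1, 1), (5, 5)]
dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
```

On the toy points, the greedy selection for k = 3 keeps the three mutually distant corner points and discards the interior ones, matching the intuition behind p-dispersion-sum.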
