
Wolfgang Münst

Prediction of Driver Behavior and Decision Strategies for Autonomous Driving

Using Machine Learning and Decision Theory

Dissertation

Fakultät für Mathematik und Informatik


Prediction of Driver Behavior and Decision Strategies for Autonomous Driving

Using Machine Learning and Decision Theory

Wolfgang Münst

Department of Mathematics and Computer Science
FernUniversität in Hagen

This dissertation is submitted for the degree of Dr.-Ing.

December 2019


For Annette, Lena Sonsee and the tiny one.

Abstract

Autonomous and automated driving development has been making quick progress in recent years, and many large automotive companies have announced initiatives to offer self-driving cars soon, some as early as 2020. During the first decades, these automated cars will have to share the road with human drivers. A way to increase safety and comfort for these automated vehicles and to improve current advanced driver-assistance systems (ADAS) is to take the expected future behavior of other traffic participants into account. This increases the time a vehicle can use to react to changes in its environment and therefore allows for more comfortable or safer actions to be chosen.

Relevant research areas to achieve this goal include computer vision, machine learning, situational awareness, decision making and many more. In this thesis, we contribute and evaluate new ideas for some of the challenges in the toolchain required for predicting the behavior of other traffic participants and deciding what to do with these.

We used supervised machine learning techniques for both reactive predictions (short-term) and motivation-based predictions (long-term) and contribute to the question of how the data used for machine learning can be labeled, comparing manual to semi-automated to automated approaches. This is especially interesting when subjective data is involved, e.g. at which point in time a motivation arose in a driver, even though he did not act on it yet.

How to use predictions and what to actually do with them in vehicles equipped with ADAS, or in automated vehicles, is the next contribution. We evaluated many heuristics and a new, complex decision model based on decision theory. As a use case, an adaptive cruise control system was enhanced by our prototype implementation of predictors and decision algorithms.

We used an automated prototype research car in a case study and compared the effects on safety and comfort using objective data evaluation and subjective feedback from the study participants.

We are able to show that even simple prediction and decision algorithms improve the status quo considerably and that the more advanced models work even better, but at the cost of a substantial increase in complexity.


Zusammenfassung

The development of autonomous and automated driving functions has progressed rapidly in recent years, and many large vehicle manufacturers have announced that they will soon introduce self-driving cars, some as early as 2020. For at least the first decades, these automated vehicles will have to share the road with human drivers. One way to increase safety and comfort in these vehicles and in current advanced driver assistance systems (ADAS) is to take the expected future behavior of other traffic participants into account. This allows the vehicle to gain valuable time to react earlier to changes in its environment and makes it possible to choose more comfortable and safer behaviors.

In order to achieve this goal, findings from different research areas have to be drawn upon, including computer vision, machine learning, situation assessment, decision theory and many others. In this dissertation, new ideas are presented and evaluated that serve to predict the behavior of other traffic participants, together with ways of using these predictions in decision making.

We used supervised machine learning methods for reactive (very short-term) and motivation-based (somewhat longer-term) predictions and contribute to the question of how the data for machine learning can be labeled. In doing so, we compared manual, semi-automatic and automatic approaches. This was particularly interesting when subjective data was involved, e.g. the exact point in time at which the motivation to do something arose in a driver, even if he did not immediately change his behavior because of it.

How existing predictions can be used in vehicles that have driver assistance systems or automation functions is a further contribution. We evaluate many heuristics and a new, complex decision model based on decision theory. As a use case, we extended an adaptive cruise control driver assistance system (ACC) with prediction and decision algorithms into a new prototype system. We used this system in a prototype vehicle in a participant study to evaluate the influence on safety and comfort both objectively and subjectively.

This allows us to show that even simple predictions and decision algorithms can considerably improve current systems and that even more advanced systems work better still, although this is bought at the price of considerably higher complexity.

Contents

1 Introduction 1

1.1 Autonomous Driving is Near . . . 1

1.2 Predicting Human Behavior . . . 3

1.3 Methodology: Learning Human Behavior with Machine Learning Techniques . . . 5
1.4 Decisions based on Uncertain Predictions . . . 6

1.5 Use Cases . . . 7

2 Development Methodology 11
2.1 Development approach . . . 11

2.2 Source of data . . . 12

2.2.1 Prototype research car . . . 12

2.2.2 Driving simulators . . . 12

2.2.3 Data for short-term lane change prediction . . . 14

2.2.4 Data for long-term lane change prediction . . . 14

2.2.5 Data for zip merge situations . . . 15

2.3 Ground truth determination . . . 16

2.3.1 Labeling Methods . . . 17

2.3.2 Driving Data . . . 21

2.3.3 Maneuver Prediction . . . 21

2.3.4 Comparison . . . 22

2.4 Feature Selection . . . 25

2.4.1 Feature selection overview . . . 25

2.4.2 Information theory . . . 27

2.4.3 Mutual information and Symmetric uncertainty . . . 28

2.4.4 Fast correlation-based filter . . . 29

2.4.5 Entropy-Based Discretization . . . 31

2.5 Logistic Regression . . . 33


2.6 Bayesian Networks . . . 35

2.6.1 Overview . . . 35

2.6.2 Inference . . . 37

2.6.3 Learning . . . 39

2.7 Decision Theory . . . 41

2.7.1 Decision model . . . 41

2.7.2 Expected utility . . . 43

2.7.3 Objective and subjective expected utility . . . 43

3 Model description 47
3.1 General . . . 47

3.2 Prediction Models . . . 49

3.2.1 Short-Term Lane Change Prediction . . . 49

3.2.2 Long-Term Lane Change Prediction . . . 50

3.3 Vehicle Reaction Models . . . 60

3.3.1 Decision heuristics . . . 60

3.3.2 Decision Theory . . . 61

4 Use Case Implementation 65
4.1 Improving safety and comfort by using short-term predictions . . . 65

4.1.1 Introduction . . . 66

4.1.2 Prediction of a Lane Change . . . 67

4.1.3 Approaches to Decision Making . . . 67

4.1.4 Decider Configurations . . . 68

4.1.5 Case Study . . . 72

4.1.6 Data Analysis and Results . . . 73

4.1.7 Conclusions . . . 76

4.2 Improving comfort by using longer-term predictions . . . 78

4.2.1 Introduction . . . 78

4.2.2 Lane Change Prediction . . . 79

4.2.3 Long-term Prediction Issues . . . 79

4.2.4 Motivation-based Action Decider . . . 81

4.2.5 Evaluation Setup . . . 83

4.2.6 Data Analysis and Results . . . 85

4.2.7 Conclusion . . . 88

4.3 Zip-merge situations for automated or autonomous cars . . . 90


4.3.1 Gap selection for safe merging . . . 91
4.3.2 Immediate cut-in prediction to improve cooperation . . . 95
4.3.3 Conclusion . . . 98

5 Conclusions 99

Bibliography 101

Appendix A: ACC decider case-study questionnaire 109


Chapter 1

Introduction

1.1 Autonomous Driving is Near

Personal transportation is about to change: autonomous vehicles are under development by many automotive industry companies like General Motors, Ford, Tesla, Daimler or BMW and will gradually become part of our daily lives in the coming years [1–4]. Tech giants with an artificial intelligence/machine learning background, like Google (now Waymo), Apple or Baidu, are also part of the scene. With the potential to provide a cheap and efficient solution to personal transportation, they have the power to transform society. Companies from related transportation service industries, like Uber, seek to replace their drivers with computer systems, which will change the costs of on-demand individual transportation dramatically.

Astonishing technological advances have been made on all frontiers relevant to autonomous driving since the first pioneering work of Dickmanns and Zapp [5] in the 1980s.

The research in the field of autonomous driving and advanced driver assistance systems (ADAS) has already led to assistance systems with series maturity, e.g. traffic sign recognition and automated parking systems, though there are doubts that the way to fully autonomous vehicles will occur incrementally via improvements in ADAS technology [6]. Some fundamental open questions have yet to be resolved before robotic cars can populate the roads and outperform human drivers in terms of safe driving. Relevant research areas include computer vision, machine learning, decision making, control theory and many more.

This future perspective promises many improvements, both for individuals and society [7]:


• Improved safety. According to [8], about 93% of traffic accidents are caused by human error. Changing to autonomous transportation will significantly reduce the number of crashes. In the US, this would lead to about 30,000 fewer deaths per year [9].

• Healthcare system relief: more than 2 million drivers each year are treated for injuries in the USA alone [9]. Fewer and less severe injuries will reduce health care costs significantly, since around 10% of all hospitalisation injuries are due to traffic accidents [7].

• Fewer traffic jams and more efficient use of the road infrastructure, because autonomous cars need a reduced safety distance between them. This would help save around 7.2 billion liters of fuel in the US alone [9], and also decrease time spent in traffic jams.

• Improved fuel efficiency and consequently also reduced pollution.

• Significantly reduced costs of ownership: if autonomous taxis cost a fraction of current ones and are available 24/7, there is no need to own a vehicle, especially in urban areas. Current cars are, on average, not in use for 22 hours per day [9]. The massive capital expenditure and maintenance costs related to this can then be put to better use [7]. This, of course, will be a challenge to the traditional car manufacturers, as there will be fewer cars sold. The manufacturers are therefore already working on future mobility business models for a shared vehicle economy, e.g. DriveNow (BMW) or Car2Go (Daimler), which merged in 2019 to create SHARE NOW [10].

• The time currently spent behind the wheel can in the future be used productively, e.g. for answering emails, reading books or watching movies. In 2018, each US American spent on average 97 hours in travel delays because of congestion, costing around 87 billion dollars in total [11]. Germany fares even worse, with an average of 120 hours spent in congestion, leading to a total cost of around 5.1 billion euros [11].

But there are also voices which consider some of these projected improvements too rosy, e.g. arguing that crashes or injuries would not necessarily be reduced so drastically, because the humans still in the loop would adjust to the changed risk and act more carelessly. Refer to [12] for a more detailed discussion with additional references.

In the next decades, automated vehicles and human drivers will share the same roads. In 2018, Waymo (formerly called the Google self-driving car project) already launched its public ride-sharing service 'Waymo One' in metropolitan Phoenix [13, 14]. While humans and AI share the road, we need algorithms that enable automated cars to react optimally to human drivers.

One main challenge is the extremely dynamic traffic environment, which can only be perceived through noisy sensors with their respective limitations. Additionally, only partial perception of the surroundings is possible, e.g. because of occlusions or when some relevant information is impossible to observe, like the intentions of other traffic participants. An important prerequisite to anticipating situation developments is the capability to assess other drivers and their intentions. Since these cannot be directly measured, they have to be inferred from observations. The behavior of traffic participants is highly coupled, so reliable predictions can only be made with an understanding of how traffic participants influence each other. Human drivers acquire the ability to interpret traffic situations and to predict the likely developments through experience. "Drivers" of autonomous vehicles expect the machines to show an equivalent foresight.

The goal of this thesis is to contribute to building this foresight into current advanced driver assistance systems and future autonomous vehicles. The focus will be on using predictions of lane changes or merge operations in different situations and on deciding how to react to the different potential outcomes, depending on their probabilities. This is done by splitting the reactions into short-term and long-term, because there are additional parameters to take into account in the longer-term situations, e.g. mutual influencing in feedback loops.

We take a data-driven, probabilistic approach and learn the probabilities with machine learning techniques. This has a few advantages over manual methods, for example that the predictive accuracy improves with the amount of data the model learns from.

1.2 Predicting Human Behavior

Human driving is a challenging task to model. The actual physical process involves human perception, information processing, deciding and the physical execution of the resulting action, all of which are massively complex to model and not completely understood.

On the way to autonomous driving, the roads will be shared by human drivers and robotic vehicles for at least the first decades. This leads to an intermediate challenge: the robotic cars need to be able to react to sometimes irrational behavior of the human drivers. Superior reaction times of the automated systems will already reduce the number and severity of crashes [15].


Therefore, automated systems should be able to predict human behavior and react early to potential conflicts, so that the system can react even more safely and comfortably. The psychological component also plays a role here: passengers in automated cars expect the car to react similarly to how they would have reacted themselves. (Experienced) drivers are good at predicting what other drivers are about to do within a certain time frame. Passengers will be and feel more secure if the automated car displays similar behavior characteristics and hints that it understands the current traffic situation and will react accordingly.

This will change over time, as the share of autonomous vehicles increases and manual driving slowly declines. There are even estimates that manual driving might be prohibited in the future, once it becomes evident from data that it harms people while autonomous cars are much safer [16, 17].

Until then, to predict human behavior in traffic, many different approaches could be used.

• Simple, inflexible heuristics, e.g. rule-based if/then/else-statements.

• Using Game Theory to find the most likely outcomes of different situations.

• Applying a data-driven approach by utilising machine learning techniques, e.g. simple regression models, Markov Models, Bayesian Networks or Neural Networks.

The approaches can also be distinguished by the sophistication and complexity of the situations they are trying to predict. The simplest and easiest one is to extrapolate the next positions of cars by looking at their physical movements. This could be done, e.g., using a constant velocity or constant acceleration assumption. The trajectories of vehicles can then simply be calculated. These models do work, but only over very short prediction horizons, and they fail in complex situations. An overview can be found in [18].
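The sketch below (not code from this thesis; all function and variable names are illustrative) shows how such constant-velocity and constant-acceleration extrapolation could be implemented:

```python
import numpy as np

def predict_constant_velocity(pos, vel, horizon, dt=0.1):
    """Extrapolate future 2D positions assuming the velocity stays constant."""
    steps = np.arange(dt, horizon + dt, dt)
    return pos + np.outer(steps, vel)                       # shape (n_steps, 2)

def predict_constant_acceleration(pos, vel, acc, horizon, dt=0.1):
    """Extrapolate future 2D positions assuming the acceleration stays constant."""
    steps = np.arange(dt, horizon + dt, dt)
    return pos + np.outer(steps, vel) + 0.5 * np.outer(steps ** 2, acc)

# Illustrative example: a vehicle driving 30 m/s along the lane while
# drifting 0.3 m/s towards the neighbouring lane.
trajectory = predict_constant_velocity(np.array([0.0, 0.0]),
                                       np.array([30.0, 0.3]),
                                       horizon=2.0)
```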

The next kind of traffic prediction is maneuver-based, in which potential static influence factors of the environment will be incorporated into the prediction model. The idea is to recognise or predict high-level maneuvers first and use trajectory prediction in a later step based on the identified maneuver. This could be, for example, identifying a lane change or turn maneuver by looking at the physical movement of a car towards a lane marking or a T-intersection. The physical movements combined with this maneuver information can improve the prediction horizon and accuracy. Many specific maneuver predictions for different situations will be necessary, plus a recognition of which general situation the car is currently in. Models like these are still limited, because they do not take other dynamic objects into account, which is required for many complex traffic situations [19].


More sophisticated models exist as well, which not only take the static environment into account, but can be interaction-aware and predict the behavior based on other dynamic traffic participants. These models can be more accurate, but usually come with a higher computational complexity, which increases greatly with every new vehicle which needs to be taken into account (see e.g. [20–23]). Some complexity-reduction methods have been proposed as well, e.g. by only taking pairwise dependencies or asymmetric dependencies into account [24–27]. An example for interaction-aware models could be a fast car approaching a slower car on a highway or the expected give-way behavior at an intersection.

The outcome of all these models can also be divided into groups: deterministic, stochastic or scenario-based (see e.g. [19, 28]). Deterministic predictions return exactly one outcome, but cannot capture the significant uncertainties, which exist especially in longer time frames. Stochastic models could give several potential outcomes and their likelihoods. In most publications, only one outcome is returned though, the one with the highest likelihood, e.g. [29–31]. Scenario-based models do not deal with explicit probability distributions, but generally work with a discrete number of potential future scenarios.

All these models have to deal with the inherent trade-off between accuracy, reliability and complexity.

1.3 Methodology: Learning Human Behavior with Machine Learning Techniques

Human behavior is very complex and cannot be easily captured by a hand-designed rule system. Therefore we used machine learning techniques to approximate and statistically learn the way humans behave in traffic.

As a precondition to let machines learn behavior from humans directly, a lot of data showing the intended behavior is required, with which the learning algorithms can be trained.

Once data has been collected, the data needs to be labeled to distinguish between different kinds of human behavior. There are different labeling strategies: fully manual, fully automatic or something in-between. How different labeling strategies influence the predictive accuracy of a machine learning algorithm will be discussed in Section 2.3.

After the data has been collected and labeled, there are nearly infinitely many possibilities for features to look at as an input for the machine learning algorithm. The collected data consists of many objects, status of objects (indicator on? relative position to the lane? distance to preceding vehicle? windshield wiper active?) and environment data (weather, road curvature, distance to next intersection, ...). To increase prediction accuracy and learning speed, data types (features) which have no influence on the prediction need to be eliminated from the learning process. This can either be done by an expert, e.g. obviously the windshield wiper status does not tell us anything about whether a vehicle will change lane soon. There are also automated and semi-automated ways to achieve this, e.g. filters and wrappers which can test different features or combinations of features for their predictive power. We will get into this in Section 2.4.

After the features have been selected, we need to decide which type of machine learning to use: regression or classification, and also the concrete technique, e.g. Neural Networks, Bayesian Networks, SVMs, logistic regression or others. There are many options to choose from. We decided to use Bayesian Networks because they still allow experts to understand what is happening in the different layers while improving the results by learning from observed data. This can be more easily aligned with safety verification and automotive safety standards, as opposed to black-box approaches like deep neural networks. This is an active area of research; for an overview with many more related references see [32, 33].

Then, the algorithm's structure needs to be defined and the available data must be split into sets used for training and model validation, to make sure that the model generalises well and does not overfit the training data.

1.4 Decisions based on Uncertain Predictions

Even if the steps above were performed perfectly and a predictor announces the probability of certain actions of a driver, it is not easy to use these predictions optimally.

A 60% probability that a vehicle will cut in ahead of the own car does not offer any recommendation on what to do with it. If, for example, a trajectory planning system proposes an evasion maneuver onto the lane where the predicted vehicle will no longer be with a probability of 60% (because it is predicted to change lanes), it could sometimes crash into the vehicle when the less likely event, with its 40% chance, occurs.

Reacting to probable behavior can be done either by naive approaches, e.g. simply acting as if the prediction was certain to become true once a certain threshold is exceeded, or by systems which evaluate the current situation and react differently to the same prediction probability based on the current environment, e.g. by using decision theory. Different approaches for reacting to cut-in probabilities have been presented by us at the VTC 2017 [34] and are discussed in Subsection 3.3.1 and Section 4.1.
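To make the difference concrete, here is a hedged sketch (not the decision model developed later in this thesis) that contrasts a naive threshold heuristic with an expected-utility decision for a given cut-in probability; the action names and utility values are made up for illustration:

```python
def threshold_decider(p_cut_in, threshold=0.5):
    """Naive approach: act as if the prediction were certain once it exceeds a threshold."""
    return "brake_early" if p_cut_in >= threshold else "keep_speed"

def expected_utility_decider(p_cut_in, utilities):
    """Decision-theoretic approach: pick the action with the highest expected utility.
    utilities[action][outcome] holds illustrative, hand-tuned utility values."""
    def expected_utility(action):
        u = utilities[action]
        return p_cut_in * u["cut_in"] + (1.0 - p_cut_in) * u["no_cut_in"]
    return max(utilities, key=expected_utility)

# Illustrative utilities: braking early is safe but slightly uncomfortable,
# keeping speed is comfortable unless the other vehicle actually cuts in.
utilities = {
    "brake_early": {"cut_in": 0.8, "no_cut_in": 0.6},
    "keep_speed":  {"cut_in": -1.0, "no_cut_in": 1.0},
}
# At p = 0.3 the threshold decider keeps the speed, while the expected-utility
# decider still prefers braking early because a missed cut-in is very costly.
print(threshold_decider(0.3), expected_utility_decider(0.3, utilities))
```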


There might also be other prediction differences. While it is mostly straightforward to react to a prediction about what is going to happen within 1 second, it is non-obvious how to handle predictions over longer time frames, e.g. if something is probably going to happen within the next 10 seconds, but it is uncertain when. A possible approach to this problem with long-term predictions, again for cut-in vehicles, has been proposed by us at the ITSC 2016 [35] and is elaborated in Section 4.2.

Bringing all these use cases, usually developed in simulations, into the real world poses another challenge. Not only are sensor readings obviously noisy; depending on how the predictions and reactions were evaluated in the simulations, the in-car system could still show barely useful real-world results, e.g. because of inept true positive/false positive definitions or large amounts of false positive detections which disturb the users of such systems, even though the ROC curves look great. These challenges, especially the false positive detections, were also examined by [36], who suggested a suppression technique to combine multiple detections within a short time frame to make the systems more reliable in real-world scenarios.

1.5 Use Cases

Applying the steps outlined in the methodology, we can present implementations for some use cases, as we will show in Chapter 4.

The mobility standards organisation SAE described different levels of automation which can be achieved by automotive systems [37], ranging from level 0 (no automation) to 5 (full automation). See Figure 1.1 for a breakdown [38]. In the automation levels 0-2, up to partial automation, which rely on the driver being in the loop, no legal changes are needed, because the human is responsible for supervising the system at all times. Tesla's 'Autopilot' falls into this category, for example. When the driver does not need to monitor the traffic and environment conditions any more, as is the case in conditional automation (level 3) and higher, then these systems are currently not legal in Europe, because the Vienna Convention of 1968 does not permit the driver to disengage even partially from supervising the vehicle [39, 40].

While the prediction part of this thesis can be used as a warning system for human drivers at level 0 and beyond, the decision of what to do with predictions only makes sense at level 1 and higher: it can improve existing level 1 systems like adaptive cruise control (ACC) by altering their logic to take predictions into account when deciding which vehicle to follow. The same functionality will be useful for higher levels of automation. For example, if a level 2 system takes over both acceleration/braking and lateral steering in a vehicle, the system's decision space could be expanded to allow changing lanes as well. In even higher levels of automation, the algorithms will have to be implemented in the system in a more integrated way than in ACC systems.

Figure 1.1 Automation levels according to SAE [38]

The first two use cases presented in this thesis to show improvements over the current status quo involve modification of the existing adaptive cruise control system (ACC, SAE level 1). ACC was first described in 1980 [41], but the necessary sensors for series production were not affordable at that time [40]. A series-production vehicle was first presented by Mitsubishi with its Diamante model in 1995 [42]. The first two use cases we present modify when the ACC switches to a new lead vehicle. In current cars, this is only done when a new vehicle is already in front of the vehicle equipped with the ACC system.

In use case one, which relies on short-term lane change prediction by monitoring the physical movement directions of potential cut-in vehicles, the ACC system is extended to decide to react to the cut-in vehicle earlier, depending on the prediction certainty and the decision algorithms. This increases the system's safety, as very close cut-ins can lead to crashes in current systems if the human does not manually intervene. We will describe this use case in Section 4.1. This functionality will be needed in all higher levels of automation as well.

The second use case uses a longer-term prediction presented in Section 4.2, which tries to derive the intentions of the driver to predict lane changes based on the current environment and other traffic participants. It is inherently more uncertain and, due to that, needs a different reaction. It switches to a coasting mode, anticipating a potential need to brake in the near future, to increase comfort.

The last use case looks at zip-merge situations in Section 4.3, when a lane change is inevitable. We evaluate this from two different perspectives: one of them takes the position of a vehicle on the lane which persists. This is basically a case of cut-in prediction with the constraint that a lane change of the cars on the adjacent lane will definitely occur. The other sub-case is for the cars on the disappearing lane - which gap should they try to change into?

We took a preliminary look at how this could be done by also learning from human drivers via machine learning techniques.

What all these use cases have in common: they are only needed when humans and automated systems share the same road. When all vehicles are automated or autonomous systems, they can coordinate via wireless communication to share their future actions, so that all systems can react appropriately. Then, uncertain predictions will be replaced by announced future behavior.


Chapter 2

Development Methodology

2.1 Development approach

The general idea of our approach is to use a data-driven development methodology combined with supervised machine learning:

• First, we want to identify situations for which we want to predict the future outcomes; in our case, these were lane changes in different traffic situations.

• Then we collect data from these situations (section 2.2),

• identify the ground truth for the classes to be predicted (section 2.3) and then

• analyse it to identify the input data features with the most predictive potential by using automated feature selection methods (section 2.4).

• In the next step, we use the selected features and some of the data to build and train a model using logistic regression (section 2.5) and Bayesian networks (section 2.6).

• In the last step, we evaluate the model's performance using objective metrics such as the F1 score, precision or recall (see the small sketch after this list). The models applied in the use cases, which were integrated into the prototype research car, were also evaluated subjectively in studies.
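As a reminder of how these objective metrics are computed, here is a minimal sketch with made-up counts (not the evaluation code of this thesis):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and the F1 score from raw classification counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example with made-up counts: 80 true positives, 20 false positives, 10 false negatives
print(precision_recall_f1(80, 20, 10))   # -> (0.8, 0.888..., 0.842...)
```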


2.2 Source of data

First, we need data to learn from. We collected data from several different sources, e.g. from a prototype research car which was equipped with many different sensors, and from driving simulators with human test subjects when the data of interest was not reliably reproducible in the real world or when the data collection was in conflict with ethics or laws. The next subsections explain why this might be the case.

2.2.1 Prototype research car

The prototype research vehicle was a BMW i328 pre-series model equipped with a 360-degree environment surround view consisting of six LIDAR sensors to track other traffic objects and lane markings. A highly precise digital map was available that delivered information about roads and lanes. A highly accurate GPS/RTK module was used to position the ego car on that map with around 2 cm accuracy. Three radar modules, one forward-facing and two facing sideways, were built in as well. There were also several cameras: a front-facing Mobileye EPM3 to assist lane and object detection and two UEye cameras, one forward-facing and one backward-facing, which are not used in the environment model but just record video data to make the offline data analysis easier for the researchers.

Figure 2.1 Prototype research car BMW i328

2.2.2 Driving simulators

We used two different driving simulators to collect driving data.


Using these was necessary due to some constraints in real-world data collection: for example, to learn when a driver's lane change wish really arises, we need to be able to distinguish that wish from the possibility that a lane change can be executed right now.

But human drivers might not be able to tell exactly when their lane change wish arose independently of its feasibility, due to their traffic experience (sometimes many years). Our solution to this problem was to take away the driver's chance to check whether a lane change is possible right now, by removing the mirrors of the simulated car; that way we obtain the real point in time of the lane change wish. Doing the same in real traffic with a research car would be dangerous, therefore ethically questionable, and also illegal.

Another advantage of driving simulators is the reproducibility of certain traffic situations. Reproducing scenarios in which different vehicles' relation towards each other is important, e.g. the distance towards the vehicle in front and at the same time the gap size on the left, is impossible to do reliably in real-world traffic, so the behavior of the test drivers in such situations would not be comparable with enough certainty. In simulators, such situations can be recreated.

The first simulator we used is the "Spider" simulator used by BMW's Research and Technology department. It is intended to give an accurate look and feel of driving a BMW consumer-class model. On the outside, the simulator consists of a very detailed reconstruction of the driver's seat of a BMW car, including important elements such as the steering wheel, the gas and brake pedals, and the center console including the shift stick. Additionally, the driver's view of the street as well as the rear-view and side-view mirror views are displayed on three large monitors set up in front of the dashboard. On the inside, the simulator consists of a cluster of 5 computers accepting steering inputs from the simulator controls and accurately integrating these into the simulator scenario, including calculating other vehicles' positions and behavior by using a psychological driver model. Figure 2.2 shows a picture of the setup.

This Spider simulator can also be used as software-in-the-loop, which we used for different scenario evaluations. We implemented our different deciders into BMW's highly automated driving framework and let this framework replace the human driver in the Spider simulator. That way we could directly test the effect of different deciders and decider settings. This also has the advantage that the simulator can be run on any powerful desktop machine and does not require the expensive and complicated hardware setup which is needed to create a realistic scenario for human test subjects.

Figure 2.2 Spider simulator located at BMW in Munich

The second simulator is located at the University of Duisburg-Essen and was also used to collect data for some specific scenarios: a simulator study was conducted to extract labels for lane change intentions while the study participants were driving. To do so, the simulator was set up as a partially automated driving system that took discrete maneuver wishes as input (lane change left or lane change right). The maneuver wishes were triggered through buttons on the simulator's steering wheel. If the traffic situation permitted it, the requested maneuver was executed automatically by the system. If the targeted lane was not safe, the system executed the request when the lane was free again.

2.2.3 Data for short-term lane change prediction

Data for short-term prediction was collected exclusively from the real world with the prototype research vehicle, because both sensor quality and situation complexity are suitable. It was collected from different drivers on the German autobahn and a BMW test track [43, 34].

Several hundred lane changes from more than 15 different drivers were recorded this way.

2.2.4 Data for long-term lane change prediction

For motivation-based lane changes, we collected and used data from simulator studies and one real-world data collection.

Two datasets were collected with the simulator at the University of Duisburg-Essen. In these studies, driving data and labels of more than 500 km of simulated highway driving were recorded [44, 45]. We used lane change intentions to the left lane only, and the data was reduced to sequences where a left lane actually existed. Otherwise, data in which no left lane existed would corrupt the null hypothesis, because no study participant would have requested a lane change to the left lane when no such lane existed, even if the relational data to the front vehicle might suggest a lane change motivation. More data was collected from different test subjects using BMW's Spider simulator.

Another dataset was collected with the research prototype vehicle on the German autobahn. It includes 294 km of real-world driving data from a driver study, using manual online labeling of the intentions to change lane [46]. During the study, the driver took over the role of a chauffeur, keeping the current lane and driving at the (advisory) speed limit.

The co-driver sat in the co-driver's seat and was instructed to take over the responsibility for deciding when to change lane. He or she had to indicate lane change intentions by pressing a special button, which was then shown to the driver in the head-up display. The driver had to execute the lane change wish as soon as it was safe to do so. Because the mirrors of the vehicle were still adjusted for the driver, the co-driver could not check if his lane change wish was realisable. This study setup was intended to deliver clean labels of the start of the subjective lane change intentions of the study participants.

2.2.5 Data for zip merge situations

The data set for zip merge situations was recorded on BMW's Spider simulator. A custom scenario was built in which a lane disappears; the driver in the simulation always started on the soon-to-disappear lane, so he was forced to merge onto the remaining lane at some point in time. The other vehicles were controlled by a psychological driver model, but they are of minor interest, as the data is used to predict which of the potential gaps the human driver changes into, and when.

For this, we recorded 210 situations with this specific scenario, each lasting between 40 s and 75 s.


2.3 Ground truth determination

To learn the model parameters in supervised classification methods like Logistic Regression or Bayesian Networks, class labels in the data set(s) are needed. Often these labels are defined by human experts, but labeling is not standardised in any way, leading to every expert defining the labels in a slightly different way.

Depending on the type of class we need to label, labels can be either subjective or objective. Use Figure 2.3 as an example scenario for the times of interest of the different labels: in case of lane changes, the exact point in time of the lane change (t0) can be defined objectively, as there can be a clear definition of when a vehicle is matched to a specific lane and when that matching changes, e.g. when the middle of its front crosses the lane marking. Even the beginning of a lane change (tchange) can be determined in an objective manner by taking into account the lateral movement during the full lane change, although sensor noise makes it hard for any algorithm to label this point in time robustly. But when exactly should we label the point in time when the driver first had a lane change intention, even if he did not act on it immediately, because e.g. traffic flow did not permit a lane change at this point in time (tmotiv)? There exists no distinct definition, as the motivation to change the lane depends on a driver's individual awareness of the situation and on the personally accepted minimal time distances to surrounding vehicles. Consequently, this is a subjective label.

We have evaluated the significance of the efforts spent on labeling time series data and its effects on the performance of learning the parameters used for maneuver prediction. For both the labels for the objective start of the lane change maneuver and the subjective start of the lane change motivation, different labeling strategies are compared in terms of applicability to different labeling tasks, classification quality, time spent in the labeling process and scalability of the process depending on a data set's size.

Figure 2.3 Relevant times for lane change labeling


2.3.1 Labeling Methods

For labeling driving maneuvers, like lane changes, several methods exist to label time series data. An overview of the most frequently used labeling methods follows.

Manual offline labeling

Labeling data for supervised learning is often done offline as a post-processing step by a human expert who has a deep understanding of the affected domain. One way of reviewing and labeling the driving data manually is to process the data first to show all relevant sensor measurements in a highly compressed form. The expert can then zoom in on relevant sequences and define the labels precisely (see Figure 2.4 for an example).

However, when tagging subjective labels, deep awareness of the recorded driving situation is needed: the data has to be played back in real time so that the expert is able to see things from the driver's perspective. Sometimes inferring the driver's intentions just from data might not even be possible. The quality of the labels depends on the immersion into the driving situation, on the similarity of the thought patterns of the expert and the driver, on the quality of the recorded data and on the representation of that data.

Figure 2.4 Example of a lane change scenario with manual online label for tmotiv and semi-automatic offline label for tchange (solid vertical lines). The lane change takes place at t0 = 0; the dashed vertical lines mark the time reference for the camera images shown at the top.


One main advantage of manual labeling is that measurement errors, e.g. self-localisation errors, can be recognised easily by a human expert and then marked as invalid data points.

Moreover, an expert usually is able to detect relevant characteristics in the data even if the measurements are noisy. The disadvantage of the manual labeling process is the amount of time it takes to review all data and define all labels by hand. Consequently, manual labeling does not scale with the amount of driving data for the training and evaluation of the learning algorithms.

Automatic offline labeling

Some characteristics in the recorded data can be used to automatically find the start and the end of a lane change. An expert manually sets up an algorithmic definition of the maneuver at first. Based on that, more maneuvers can then be found automatically. As an example, the end of a lane change can be defined as the point in time when the center of the vehicle's front crosses the lane marking (see t0 in Figure 2.3). Assumptions about the lane change can then be used, e.g. a constant average duration Tmotiv of the lane change motivation and a constant average duration Tchange of the actual changing of the lane, to compute the start time of the motivation or the actual lane change with

$$t_{\mathrm{motiv}} = t_0 - T_{\mathrm{motiv}} \qquad \text{and} \qquad t_{\mathrm{change}} = t_0 - T_{\mathrm{change}} \tag{2.1}$$

The lane change duration to be used for Tmotiv and Tchange can either be defined by an expert (see [47–49]) or be computed in some way from recorded data. One possibility for that is defining a maximum entropy for the separation of the two classes C = {lane change, keep lane} and then calculating the mean lane change duration with the help of mutual information [43].

For the latter, let D be the set of all values of the lateral position dlat relative to the lane the vehicle is driving on at time tk, where tk < t0. Presuming every value of the N measurement instances of dlat at time tk is unique, there exist N − 1 possible cut-points s ∈ S to split D into two partitions D1 with D < sn and D2 with D > sn. Every cut-point sn now induces a class information entropy [50]

$$H_I(D, s) = \frac{|D_1|}{|D|}\, H_C(D_1) + \frac{|D_2|}{|D|}\, H_C(D_2) \tag{2.2}$$

in which HC(D) is the class entropy defined as


$$H_C(D) = -\sum_{i} P(C_i, D)\, \log_2 P(C_i, D) \tag{2.3}$$

The cut-point sn minimising the class information entropy HI is the entropy-optimal cut-point s* with HI,min = HI(D, s*). HI,min can be seen as a quality indicator of the class split. As long as HI,min is lower than a defined threshold δH,max, D can be split well enough into the two classes of C, which means that the average lane change started even before tk. Computing HI,min is iteratively repeated for a decreasing tk until HI(D, s*) ≥ δH,max, which marks the average start time Tchange of a lane change relative to t0 (see Figure 2.5).
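A minimal sketch of equations (2.2) and (2.3) and of this backward search for Tchange follows; the data layout (one row per recorded sequence, columns being time offsets relative to t0) and the threshold value are illustrative assumptions, not the implementation used in the thesis:

```python
import numpy as np

def class_information_entropy(values, labels, cut):
    """Equations (2.2)/(2.3): class entropy after splitting `values` at `cut`."""
    def class_entropy(mask):
        if mask.sum() == 0:
            return 0.0
        p = np.bincount(labels[mask], minlength=2) / mask.sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    left, right = values < cut, values >= cut
    n = len(values)
    return left.sum() / n * class_entropy(left) + right.sum() / n * class_entropy(right)

def optimal_split_entropy(values, labels):
    """H_I,min: the lowest class information entropy over all candidate cut-points."""
    cuts = np.unique(values)[1:]                      # up to N-1 candidate cut-points
    if len(cuts) == 0:
        return 0.0
    return min(class_information_entropy(values, labels, c) for c in cuts)

def estimate_t_change(d_lat, labels, times, delta_h_max=0.2):
    """Walk backwards from t0 until the two classes can no longer be separated well.
    d_lat:  array (n_sequences, n_timesteps) of lateral positions,
    labels: 0 = keep lane, 1 = lane change (one per sequence),
    times:  time offsets relative to t0 (negative, ascending towards 0)."""
    for k in range(d_lat.shape[1] - 1, -1, -1):
        if optimal_split_entropy(d_lat[:, k], labels) >= delta_h_max:
            return -times[k]                          # average duration T_change
    return -times[0]
```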

Figure 2.5 Top: Lateral position of all recorded lane change sequences (black) and keep lane sequences (gray). Bottom: Optimal split class information entropy (black solid), entropy threshold δH,max = 0.2 (black dashed).

The main advantage of the automatic labeling process is that it is fast and scales well.

Compared to defining the lane change duration for the automatic labeling by hand, the estimation of Tchange with the help of the class information entropy relies on the true objective properties in the data. The disadvantage of the automatic labeling methods shown here is that all maneuvers are labeled with the same maneuver duration, regardless of the characteristic of the individual maneuver, e.g. a slow and a fast lane change by different drivers are both labeled using the same Tchange. Another downside is that there are parameters (the entropy threshold δH,max or the label duration Tchange itself) that need to be tuned manually, once.

Moreover, estimating the class information entropy has high computational costs, which hurts especially in high-dimensional feature spaces.


Semi-automatic offline labeling

It might be helpful to improve and speed up the manual labeling process by combining manual and automatic labeling. For example, the end time t0 of a lane change maneuver can be easily found automatically in the data, but the duration Tchange of an individual lane change is hard to determine algorithmically. This is where a human supervisor can be employed to manually adjust the maneuver start time for each automatically found instance.

At this point it also has to be mentioned that there exist research fields called semi-supervised learning (with the distinction of inductive and transductive techniques) and active learning (which is sometimes also called query learning), but we do not investigate the applicability of these techniques here, as their focus is on learning algorithms instead of the labelling process. For an in-depth view on semi-supervised learning see [51], for a literature review on active learning see [52].

Manual online labeling

Sometimes data labelling is hard when only watching recorded videos or analysing sensor measurements, because driving scenarios can be very complex. To deal with this, labels can also be acquired directly while driving and while recording data. The driver or a co-driver can either define labels by recording audio comments or by pressing buttons. While audio comments need to be processed manually offline in order to generate class values for the classifier's training, buttons can be evaluated programmatically, which simplifies the post-processing a lot.

The advantage of labeling while driving is that there is a good situational awareness while defining the label. A negative impact of labeling while driving is that the driver might be distracted from the driving task, which on the one hand is safety-critical and on the other hand affects the quality of the labels. Depending on the number of different labels that need to be defined and on the complexity of the current traffic situation, the driver can be overstrained with labeling while driving. However, based on the problem statement (such as only evaluating satisfaction or comfort of longitudinal distances), a co-driver can be employed to do the labeling. The labeling of the co-driver does not negatively affect safety and he can concentrate fully on the labeling process. But it remains an open question whether the traffic situation can be assessed from the perspective of the co-driver's seat just as well as from the driver's point of view, especially because the co-driver is not responsible for steering the vehicle.


2.3.2 Driving Data

We used the 294 km data set described in 2.2.4 to compare the different labeling strategies.

The recorded lane changes have been re-labeled several times to test the different labeling strategies. As features for the detection of already started maneuvers (short term prediction) we use lane relational features, in particular the lateral distance dlat and lateral velocity vlat relative to the lane the vehicle is currently driving on. For the anticipation of lane change motivations we use inter-vehicle relational features, namely the time to collision ttc = ∆dpre/∆vpre and the time gap tgap = ∆dpre/vego relative to the preceding vehicle on each lane. For a better scaling of the features' values, the inverse of both the ttc and the tgap is used. Figure 2.4 shows an exemplary recording of a lane change to the left in a situation similar to the scenario shown in Figure 2.3.
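For illustration, the two relational features could be computed as in the following sketch (variable names and the small epsilon guard are assumptions, not thesis code):

```python
def inverse_ttc_and_tgap(d_pre, v_ego, v_pre, eps=1e-6):
    """ttc = d_pre / (v_ego - v_pre), tgap = d_pre / v_ego; the inverses scale better.
    A non-positive inverse ttc means the gap to the preceding vehicle is not closing."""
    ttc_inv = (v_ego - v_pre) / max(d_pre, eps)
    tgap_inv = v_ego / max(d_pre, eps)
    return ttc_inv, tgap_inv

# Example: 40 m behind the preceding vehicle, ego at 30 m/s, preceding at 25 m/s
print(inverse_ttc_and_tgap(40.0, 30.0, 25.0))        # -> (0.125, 0.75)
```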

To train and test the maneuver prediction described in the next section based on the different labeling strategies, the recorded time series data has at first been cut into smaller sequences of a maximum of 5 minutes driving time each and has afterward been split into a training and a test set in the ratio of 0.6 to 0.4. With an average sampling time of 20 ms, the ratio of the label "keep lane" to the label "lane change" is approximately 0.97 to 0.03.

The ratio for the training data of the maneuver intention is, at 0.89 to 0.11, a little less skewed, which is because tmotiv ≤ tchange relative to t0 for the manual online labeling.

But in general, training data with an unbalanced ratio of the class distribution causes most predictors to underestimate the likelihood of the class that is less represented in the data.

Therefore, the data has been under-sampled to synthesise equally distributed data according to common practice.
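A hedged sketch of this preprocessing (sequence-wise 0.6/0.4 split and random under-sampling of the majority class); the numpy data layout and the random seed are assumptions for illustration:

```python
import numpy as np

def train_test_split(sequences, labels, train_ratio=0.6, seed=0):
    """Split whole sequences (not single samples) into training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(sequences))
    n_train = int(train_ratio * len(sequences))
    return (sequences[order[:n_train]], labels[order[:n_train]]), \
           (sequences[order[n_train:]], labels[order[n_train:]])

def undersample(features, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equally frequent."""
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(labels == 1)
    idx_neg = np.flatnonzero(labels == 0)
    n = min(len(idx_pos), len(idx_neg))
    keep = np.concatenate([rng.choice(idx_pos, n, replace=False),
                           rng.choice(idx_neg, n, replace=False)])
    rng.shuffle(keep)
    return features[keep], labels[keep]
```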

2.3.3 Maneuver Prediction

We distinguish between the prediction of a lane change maneuver that already started (LC - short term prediction) and the anticipation of a driver's lane change intention (LCI). For both the prediction of LC and LCI we propose a Bayesian Network (BN), which uses the labels defined by the different labeling strategies described in section 2.3.1. The two separate net structures for inferring the probabilities for LC and LCI are depicted in Figure 2.6. While the structure for inferring LC is straightforward, to compute the probability for LCI, for every drivable lane the ttc−1 and tgap−1 to the next vehicle in front on that lane are used to compute the driver's contentedness Ci on an individual lane i. The probability for the LCI is then computed depending on the driver's contentedness on all available lanes. A more detailed description of this particular prediction framework will be shown in section 3.2.2.


Figure 2.6 Overview of the structures of the two Bayesian Networks used for short term lane change prediction LC (left) and longer term lane change intention LCI (right).

2.3.4 Comparison

Labeling the start of the lane change maneuver at tchange has been done automatically with different values of Tchange defined manually (1 s and 3 s) and with Tchange = 1.66 s, which was found by defining a maximum class information entropy of δ = 0.2 in the data (see subsection 2.3.1). Moreover, every lane change has been labeled semi-automatically offline, so t0 has been found in the data automatically and an expert defined the corresponding tchange for every label manually (as described in subsection 2.3.1). This resulted in an average labeled lane change duration of Tchange = 3.05 s with a standard deviation of σ = 0.80. The semi-automatic labeling process took about 2 hours for the nearly 3 hours of driving data.

Because the labels for the maneuver intention at tmotiv are of subjective type, the labeling strategies being compared are manual online labeling (see subsections 2.3.1 and 2.3.2) and automatic offline labeling with different values of Tmotiv defined manually (1 s, 5 s and 10 s).

The labels of the manual online labeling process show an average lane change motivation duration of Tmotiv = 5.45 s with a standard deviation of σ = 2.42.

Figure 2.7 shows the Receiver Operating Characteristic (ROC) curves of the resulting Bayesian network's performance for the different labeling strategies. TPR is the true positive rate and FPR is the false positive rate of the maneuver classifications. On top, the results for the prediction of the start of the lane change maneuver (LC) are shown; the curves at the bottom show the results for the lane change intention (LCI). To pick the optimal classification threshold from the ROC, [53] proposed (among other metrics) the minimum distance to the optimal ROC point at [0, 1], which can be found by computing

$$d_{\mathrm{opt}} = \sqrt{(1 - TPR)^2 + (FPR)^2} \tag{2.4}$$
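A small sketch of picking the classification threshold via equation (2.4); the ROC arrays are assumed to be given, e.g. from sweeping the classifier's output threshold:

```python
import numpy as np

def optimal_roc_threshold(fpr, tpr, thresholds):
    """Pick the threshold whose ROC point lies closest to the ideal point (FPR=0, TPR=1)."""
    d_opt = np.sqrt((1.0 - tpr) ** 2 + fpr ** 2)
    best = np.argmin(d_opt)
    return thresholds[best], fpr[best], tpr[best]

# Example with made-up ROC points
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.7, 0.85, 1.0])
thresholds = np.array([1.0, 0.8, 0.5, 0.0])
print(optimal_roc_threshold(fpr, tpr, thresholds))   # -> (0.8, 0.1, 0.7)
```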

As we can see in the top row of Figure 2.7, the entropy-based strategy performs best with the lowest false positive rate, while maintaining a good short term prediction (LC) time horizon of 1.66 s.


Figure 2.7 Receiver Operating Characteristic (ROC) curves for different labeling strategies. The optimal threshold is marked with a circle. First row: results of predicting short term LC. Second row: results of predicting LCI.

Using the semi-automatic expert labeling increases the average label's prediction time to 3.05 s, but at the cost of introducing more false positives. Depending on the use case, this might be an acceptable trade-off.

The bottom row shows the results for the LCI. When choosing a very short prediction time frame, the results are quite good. But when choosing a more reasonable time frame, we notice that a manually fixed Tmotiv = 5 s still gets better results in both TPR and FPR, compared to using the times reported by the traffic participants during the study (manual online labeling).

Conclusion

For labeling the start of a lane change for the purpose of predicting a lane change maneuver as soon as it starts, it could be shown that automatic labeling, which is fast, is favorable over semi-automatic labeling, in which a lot of work needs to be done manually.

The main challenge in labeling the start of the maneuver is to choose an adequate constant maneuver duration that trades off a long prediction time versus a high prediction accuracy.

Choosing the labeling duration based on the class information entropy can be helpful, because it leaves the expert with only one parameter that needs to be tuned manually and that indicates the expected classification quality.


The results of the manual online labeling, which we investigated with a driver study, turned out to be interesting. We wanted to know if a manual labeling process is needed to gather labels that are normally hidden and subjective in nature, like the lane change intention of a driver. We found that manual labeling is not necessarily needed, and maybe not even recommended, because the labels can only be acquired in a tedious process and the prediction performance does not seem to profit from the manually defined labels at all. But the reason for the relatively poor results of the manual online labeling could also be differing individual driving behavior of the study participants or a poor situational awareness of the traffic situation from the co-driver's perspective. These open questions could be investigated in future research.

For these reasons, we will be using automatic, entropy-based labeling methods for our predictors in the use cases.


2.4 Feature Selection

In the recorded driving data, there are many potential data points which can be measured about the own vehicle or the outside environment, e.g. own speed, speed of the preceding vehicle, distance to the succeeding vehicle, current speed limit, traffic density, vehicle indicators and hundreds of others. They are also only observable with a varying degree of data quality depending on the sensors. The development of a number of different classifiers, which are better suited for various types of data, was the first step towards overcoming limitations caused by the high feature dimensionality. However, it later turned out that the use of unnecessary features as input for classifiers, and not the types of classifiers, was the main issue [54]. Which of these potential data points ("features") should be used as input for the different models can be answered by using feature selection algorithms.

Feature selection techniques aim to identify a subset of features relevant for some specific task from a complete data set (dimension reduction).

For this thesis, we are particularly interested in feature selection for a classification problem. Starting from the initial set of features, we want to extract the ones which are relevant for the classification by removing those which are redundant, noisy or which in general do not significantly increase the "information content" of the feature set.

2.4.1 Feature selection overview

Feature selection problems can be split into four basic steps:

• subset generation;

• subset evaluation;

• stopping criterion;

• result validation.

Subset generation: the choice of the best subset candidate for feature selection. It can be generated by an exhaustive search of the complete initial set of features (almost never used, due to the exponential growth of the number of subsets with the dimension of the initial set) or by heuristic methods.

Subset evaluation: the comparison of the subsets generated in the previous step in order to select the best one according to an evaluation criterion. Depending on the choice of the evaluation criterion, feature selection methods can be classified into filter models, wrapper models and hybrid models. The next section explains this in more detail.

Stopping criterion: a set of conditions that tell when the feature selection algorithm should stop, either because the best subset was found or because a certain limit in accuracy, time or subset dimension was reached.

Result validation: in this step the found subset is evaluated on a different data set.

To evaluate the subsets, two main categories of feature selection algorithms exist: filters and wrappers, plus an additional method which is a mix of the two.

Filter methods rely on the properties of the data without using any classification algorithm; they are typically composed of two steps:

• Rank features based on some criterion;

• Select the ones with the highest scores.

For the first step there are two possible choices: analysing features independently or grouping them in sets, which is especially useful to identify redundant features.

These algorithms are usually fast (compared to wrapper methods) but might not be as accurate, since features are chosen without directly checking their influence on classification performance.
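As a minimal illustration of this two-step filter idea (with a placeholder feature matrix X and labels y; the ranking criterion and the choice of keeping the top three features are arbitrary for the example):

import numpy as np
from sklearn.feature_selection import mutual_info_classif

X = np.random.rand(200, 10)                  # 200 samples, 10 candidate features
y = np.random.randint(0, 2, size=200)        # binary class label

scores = mutual_info_classif(X, y)           # step 1: rank each feature independently
top_k = np.argsort(scores)[::-1][:3]         # step 2: keep the three highest-scoring features
X_reduced = X[:, top_k]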

Wrapper models use classifiers to find the subset of features which maximises the classification result. Once a classifier is selected, wrapper methods usually follow three steps:

• Select a subset of features;

• Evaluate the performance of this subset by training and evaluating the classifier;

• Repeat previous steps until a desired quality is reached.

Wrappers typically take a long time to run, as their speed is influenced by the chosen classification method, which needs to be trained and evaluated for each combination of features. In return, this results in a higher classification performance if all feature combinations are evaluated. There are $2^n - 1$ non-empty combinations for $n$ features, so this quickly becomes intractable if there are many feature candidates.
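To avoid enumerating all $2^n - 1$ subsets, wrappers are usually combined with a search heuristic. A greedy forward-selection sketch is shown below, using a k-nearest-neighbour classifier and cross-validation as the evaluation step; the concrete classifier and stopping rule are illustrative choices, not fixed by the method.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, max_features=5):
    # Greedy wrapper: repeatedly add the feature that improves CV accuracy the most.
    selected, remaining = [], list(range(X.shape[1]))
    best_score = 0.0
    clf = KNeighborsClassifier()
    while remaining and len(selected) < max_features:
        scored = [(cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean(), f)
                  for f in remaining]
        score, best_f = max(scored)
        if score <= best_score:   # stop when no additional feature helps
            break
        best_score = score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected, best_score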

Hybrid models combine the previous two methods in order to improve both time performance and classification accuracy [55, 56].

For our work, we used the filter-based feature selection algorithm FCBF [57, 43, 58], which uses information-content correlation as its selection criterion. This allows us to avoid methods which transform the input features into new meta-features that are hard to interpret, as for example principal component analysis (PCA) does. It also avoids the long runtime of wrapper-based selection. An explanation of how this works is given in the next sub-sections; for more details refer to [59, 56, 57].

2.4.2 Information theory

The first two concepts of information theory are information and uncertainty [60]: these are two sides of the same coin; the former expresses the degree of knowledge about some phenomenon, the latter derives from the lack of it.

Information content quantifies the amount of information which is gained from a particular outcome. Consider a random event: before the outcome is known, we have a certain level of information on that event. Once we observe the result, the information content increases.

Three conditions have to be satisfied by the information content:

• Events with lower probabilities increase the information content more, and vice versa;

• Events which will happen for sure make no contribution to the information content (an extreme case of the previous point);

• The information content of multiple events is the sum of the contributions of each single event, if they are uncorrelated.

A widely adopted formula for the information content $h$ of an event $x_i$ is:

$$ h(x_i) = \log_2 \frac{1}{p_i} \qquad (2.5) $$

in which $p_i$ is the probability of the event $x_i$; the unit is "bit" due to the base-2 logarithm.

Information content $h(x_i)$ is computed for a single outcome and quantifies the amount of information gained if that event happens; a more general concept is the information entropy, which quantifies the uncertainty of a random experiment, so it takes all possible outcomes into account. Essentially it is the expected value of the information content:

$$ H(X) = \sum_{i=1}^{N} p_i \cdot \log_2 \frac{1}{p_i} \qquad (2.6) $$

Since

$$ \lim_{a \to 0} a \cdot \log \frac{1}{a} = 0, \qquad (2.7) $$

if an event has probability $p_i = 0$, then its contribution to $H(X)$ is null.
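Equation 2.6 can be computed directly from an empirical probability distribution. The following small sketch (the function name and the convention of skipping zero-probability outcomes are our own) illustrates this:

import numpy as np

def entropy(p):
    # Shannon entropy H(X) in bits for a probability vector p (Eq. 2.6).
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # p_i = 0 contributes nothing (Eq. 2.7)
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin toss
print(entropy([1.0, 0.0]))   # 0.0 bit: a certain event carries no uncertainty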

Figure 2.8 (a) shows the behaviour of the information content over probability; (b) shows the entropy gain of two variables with probabilities $p$ and $1-p$, respectively.

Moreover, for two random variables $X$ and $Y$, their joint contribution is of interest:

$$ H(X,Y) = \sum_{i,j} P(x_i, y_j) \cdot \log_2 \frac{1}{P(x_i, y_j)} \qquad (2.8) $$

which becomes simply

$$ H(X,Y) = H(X) + H(Y) \qquad (2.9) $$

if $X$ and $Y$ are uncorrelated random variables. If we want to know what effect knowing one variable has on the entropy of another variable, we can use:

$$ H(X|Y) = \sum_{j=1}^{M} P(y_j) \sum_{i=1}^{N} P(x_i|y_j) \cdot \log_2 \frac{1}{P(x_i|y_j)} \qquad (2.10) $$

This is called the conditional entropy.
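Equation 2.10 can be evaluated from a joint probability table. A short sketch, following the same conventions as the entropy function above:

import numpy as np

def conditional_entropy(p_xy):
    # H(X|Y) in bits from a joint probability table p_xy[i, j] = P(x_i, y_j) (Eq. 2.10).
    p_xy = np.asarray(p_xy, dtype=float)
    p_y = p_xy.sum(axis=0)                      # marginal P(y_j)
    h = 0.0
    for j, py in enumerate(p_y):
        if py == 0:
            continue
        p_x_given_y = p_xy[:, j] / py           # P(x_i | y_j)
        p_x_given_y = p_x_given_y[p_x_given_y > 0]
        h += py * np.sum(p_x_given_y * np.log2(1.0 / p_x_given_y))
    return float(h)

# For independent variables, H(X|Y) equals H(X):
p_xy = np.outer([0.5, 0.5], [0.3, 0.7])
print(conditional_entropy(p_xy))   # 1.0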

2.4.3 Mutual information and Symmetric uncertainty

With the concept of conditional entropy and how one variable influences another, we can introduce a new quantity, the mutual information (MI), which estimates the correlation of two random variables:

$$ MI(X|Y) = H(X) - H(X|Y) \qquad (2.11) $$

It can be demonstrated that the MI is symmetric, so it can also be written as:

$$ \begin{aligned} MI(X|Y) &= H(X) - H(X|Y) && (2.12) \\ &= H(Y) - H(Y|X) && (2.13) \\ &= H(X) + H(Y) - H(X,Y) && (2.14) \end{aligned} $$

Finally, the introduction of the concept of symmetric uncertainty (SU) allows the estimation of how much one random variable predicts the other:

$$ SU(X,Y) = 2 \cdot \frac{MI(X|Y)}{H(X) + H(Y)} \qquad (2.15) $$

This last quantity is key in many feature selection algorithms for measuring the correlation of different features to identify redundancies.
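Building on Eq. 2.11 and Eq. 2.15, symmetric uncertainty can be estimated empirically for two discrete variables. The sketch below assumes the entropy and conditional_entropy helpers from above and integer-coded (discretized) values; the helper names are our own:

import numpy as np

def joint_probability(x, y):
    # Empirical joint probability table of two discrete (integer-coded) variables.
    table = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        table[xi, yi] += 1
    return table / len(x)

def symmetric_uncertainty(x, y):
    # SU(X, Y) = 2 * MI(X|Y) / (H(X) + H(Y))  (Eq. 2.15), values in [0, 1].
    p_xy = joint_probability(x, y)
    h_x = entropy(p_xy.sum(axis=1))            # marginal entropy H(X)
    h_y = entropy(p_xy.sum(axis=0))            # marginal entropy H(Y)
    mi = h_x - conditional_entropy(p_xy)       # MI(X|Y) = H(X) - H(X|Y)  (Eq. 2.11)
    return 0.0 if h_x + h_y == 0 else 2.0 * mi / (h_x + h_y)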

2.4.4 Fast correlation-based filter

A feature is considered good if it provides a high contribution to the classification and if it is not redundant to any of the other relevant features [57, 61]. Since correlation is an appropriate measure for estimating feature goodness, a feature should be kept if it has a high correlation to the class and, at the same time, as low a correlation as possible to any other feature.

We try to capture not only linear dependencies, but also those based on shared information among the features, which is exactly what symmetrical uncertainty measures. The Fast Correlation-Based Filter (FCBF) successfully deals with the task of reducing the feature dimensionality by building on this symmetrical uncertainty idea.

In this section we presume that our data is represented with the help of the features $F_1, \ldots, F_N$ and stored in a data set for later use. The symmetrical uncertainty between two features $F_i$ and $F_j$ will be denoted by $SU_{i,j}$ for short; $C$ is the class label.

The FCBF algorithm proceeds in two steps:

Step 1: Compute the contribution of a single feature $F_i$ to the classification result. This way, all important features can be found with the help of symmetrical uncertainty, which determines how well feature values predict the class label. For a feature $F_i$ to be considered relevant, $SU_{i,C}$ needs to be greater than the user-selected threshold value $\delta$. The subset of all relevant features is given by

$$ S = \{ F_i \mid SU_{i,C} \geq \delta, \; \forall F_i, \, 1 \leq i \leq N \} \qquad (2.16) $$

Using a user-defined threshold is a common practice in machine learning. In order to select the best threshold, a series of tests with different threshold values is conducted.

This process can also be automated in order to save time and effort and to exclude the possibility of human error.
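A minimal sketch of this relevance filtering, assuming the symmetric_uncertainty helper from above and discretized features; delta corresponds to the threshold in Eq. 2.16:

def relevant_features(X, y, delta):
    # FCBF step 1: keep features whose SU with the class label exceeds delta (Eq. 2.16),
    # sorted in descending order of SU_{i,C}.
    su_with_class = [symmetric_uncertainty(X[:, i], y) for i in range(X.shape[1])]
    relevant = [i for i, su in enumerate(su_with_class) if su >= delta]
    relevant.sort(key=lambda i: su_with_class[i], reverse=True)
    return relevant, su_with_class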

Step 2: Decide if a relevant feature contains unique information among all other features.

The degree of redundancy between two features $F_i$ and $F_j$ can be computed by their symmetrical uncertainty $SU_{i,j}$. Non-redundant features can be identified in the same straightforward manner as in step 1. But this approach is only reasonable for single-class problems.

In multi-label classification tasks, an issue may arise if we do not take both concepts (feature and class correlation) into account. In a situation in which the features $F_i$ and $F_j$ are uniquely relevant for the class labels $C_1$ and $C_2$, respectively, but at the same time mutually redundant, eliminating one of the two causes a loss of information and negatively affects the classification of $C_1$ or $C_2$. In consequence, the concept of predominant correlation is needed:

Predominant Correlation: The correlation between a feature $F_i$ and the class $C$ is called predominant if this feature $F_i$ is more determinative for the estimation of the class concept than for the prediction of any other feature: iff $SU_{i,C} \geq \delta$ and $SU_{i,j} < SU_{i,C}$ holds for all $F_j \in S$ ($j \neq i$). A feature $F_j$ which does not fulfil this condition is called a redundant peer to $F_i$ and is a candidate to be removed by FCBF. $S_{P_i}$ stands for the set of all redundant peers to $F_i$. All redundant peers of $F_i$ which are more strongly correlated to the class concept than $F_i$ are denoted by $S^{+}_{P_i}$ (positive closure). Analogously, $S^{-}_{P_i}$ (negative closure) contains all redundant peers to $F_i$ which are not able to predict the class label as well as $F_i$:

$$ S^{+}_{P_i} = \{ F_j \mid F_j \in S_{P_i},\; SU_{j,C} > SU_{i,C} \} \qquad (2.17) $$
$$ S^{-}_{P_i} = \{ F_j \mid F_j \in S_{P_i},\; SU_{j,C} \leq SU_{i,C} \} \qquad (2.18) $$

Predominant Feature: A predominant feature is a feature that has a predominant correlation to the class or gains such a correlation after removing its redundant peers.
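Step 2 can then be sketched as follows: walk through the relevant features in descending order of $SU_{i,C}$ and remove every remaining feature that turns out to be a redundant peer of a stronger feature. This is an illustrative reconstruction based on the definitions above, not the exact FCBF pseudocode:

def remove_redundant_peers(X, y, delta):
    # Approximate FCBF step 2: drop features that are predominated by a stronger feature.
    order, su_c = relevant_features(X, y, delta)      # step 1, sorted by SU_{i,C}
    keep = []
    candidates = list(order)
    while candidates:
        fi = candidates.pop(0)                        # most class-relevant remaining feature
        keep.append(fi)
        # Remove every F_j in the negative closure of F_i, i.e. SU_{i,j} >= SU_{j,C}.
        candidates = [fj for fj in candidates
                      if symmetric_uncertainty(X[:, fi], X[:, fj]) < su_c[fj]]
    return keep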

The main objective of feature dimensionality reduction can thus be simplified to the task of finding all predominant features, because a predominant feature $F_i$ contributes significantly to the determination of the class. On the other hand, all features which belong to its negative closure $S^{-}_{P_i}$ do not bring new information to the classification process and can therefore be removed. In order to identify these features, FCBF uses three simple heuristics:
