
3.3 Neural Networks

3.3.6 Machine Learning Tasks

In the previous sections, most aspects of deep neural network design and training have been covered. Here, we briefly discuss how neural network models differ at the output for different learning tasks. The three most common learning tasks are classification, segmentation, and regression.

Classification. The most common deep learning problem for medical diagnosis tasks is classification. Here, the network needs to distinguish between two or more classes. In terms of network design, a common deep learning model uses a structure as described in Section 3.3.2, where the network input is gradually resized by several consecutive layers until a GAP layer is reached. Afterward, a fully-connected output layer maps features to classes. To obtain a categorical output that matches the classification problem, a one-hot approach paired with a softmax activation function can be used. One-hot encoding is an approach where $N_c$ classes are represented by a vector $o_{class} \in \mathbb{R}^{N_c}$, whose dimension is chosen as the size of the output layer. While $N_c$ classes could also be represented by a scalar that takes values from $0$ to $N_c - 1$, the one-hot encoding enables us to formulate the classification problem in terms of approximating our target data distribution. By applying the softmax function

$$\mathrm{softmax}(x) = \frac{\exp(x)}{\sum_{j=1}^{N_c} \exp(x_j)} \tag{3.41}$$

to the output of our last layer, we obtain an approximation of a probability distribution over all target classes. Now we can formulate a loss for approximating our target distribution by using the negative log-likelihood, also called the cross-entropy between the training data distribution and the model's predicted distribution. For the binary case with two classes, the cross-entropy loss is defined as

$$J_{CEB} := -\left(y \log(\hat{y}) + (1-y)\log(1-\hat{y})\right) \tag{3.42}$$

where $\hat{y}$ is a softmax model output. Similarly, for the multi-class case, the cross-entropy loss is defined as

$$J_{CEM} := -\sum_{j=1}^{N_c} y_j \log(\hat{y}_j). \tag{3.43}$$

Tab. 3.1: Example of a confusion matrix for binary classification where class A is the positive class and class B is the negative class.

Prediction \ Ground truth | Class A | Class B
Class A | TP | FP
Class B | FN | TN

Note that softmax is used to approximate distributions with mutually exclusive classes. For some classification problems, an example can belong to multiple classes at the same time. In this multi-label setting, each model output can be mapped to a probability independently using a sigmoid function.
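The relationship between one-hot targets, the softmax in Eq. 3.41, and the cross-entropy in Eq. 3.43 can be illustrated with a minimal NumPy sketch. The function names, the example logits, and the small constant `eps` are assumptions made for illustration and are not taken from the text; subtracting the maximum logit is a common numerical safeguard that does not change the softmax result.

```python
import numpy as np

def one_hot(label, num_classes):
    """One-hot vector of length num_classes for an integer label in [0, num_classes - 1]."""
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec

def softmax(logits):
    """Softmax as in Eq. 3.41; the max is subtracted only for numerical stability."""
    exp = np.exp(logits - np.max(logits))
    return exp / np.sum(exp)

def cross_entropy(target, probs, eps=1e-12):
    """Multi-class cross-entropy as in Eq. 3.43; eps (an assumption) guards against log(0)."""
    return -np.sum(target * np.log(probs + eps))

def sigmoid(logits):
    """Element-wise sigmoid for the multi-label case (classes not mutually exclusive)."""
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical outputs of the last layer for a three-class problem.
logits = np.array([2.0, 0.5, -1.0])
target = one_hot(0, num_classes=3)

probs = softmax(logits)               # sums to one over the classes
loss = cross_entropy(target, probs)   # reduces to -log(probs[0]) for a one-hot target
multi_label_probs = sigmoid(logits)   # each entry lies in (0, 1); no sum constraint

print(probs, loss, multi_label_probs)
```

With a one-hot target, only the term of the true class contributes to the sum, so the loss is the negative log-probability assigned to that class.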

For evaluation, the cross-entropy is usually not employed as it is difficult to interpret. Instead, metrics such as accuracy, sensitivity, specificity, precision, and F1-score are often used. These metrics are defined as

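$$\mathrm{Acc} := \frac{TP + TN}{TP + TN + FN + FP} \tag{3.44}$$

$$\mathrm{Sens} := \frac{TP}{TP + FN} \tag{3.45}$$

$$\mathrm{Spec} := \frac{TN}{TN + FP} \tag{3.46}$$

$$\mathrm{Prec} := \frac{TP}{TP + FP} \tag{3.47}$$

$$F_1 := 2\,\frac{\mathrm{Prec} \cdot \mathrm{Sens}}{\mathrm{Prec} + \mathrm{Sens}} \tag{3.48}$$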

for the case of binary classification. True positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) can be derived from a confusion matrix, as shown in Table 3.1. These metrics can be extended to a multi-class problem with a one-vs-all approach. The accuracy indicates how often a model classifies an example correctly in relation to all examples available. Under class imbalance, this metric can be misleading: if one class is far more common than the others, a model that always predicts the majority class still achieves a high accuracy. In such cases, sensitivity and specificity provide more information. Intuitively, the sensitivity indicates how well a model detects the positive class, while the specificity indicates how well it recognizes the negative class. A high sensitivity is often traded for a lower specificity: a model that finds class A very consistently but often classifies examples of class B as class A has a high sensitivity and a low specificity. The F1-score is a more balanced metric as it combines precision and sensitivity in a single metric.
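As a worked illustration of Eqs. 3.44 to 3.48, the following sketch computes the metrics from the four confusion matrix entries. The function name and the counts are hypothetical and chosen only to show how accuracy can look favorable on an imbalanced dataset; the sketch assumes that no denominator is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Binary classification metrics from confusion matrix counts (Eqs. 3.44 to 3.48).

    Assumes every denominator is nonzero.
    """
    acc = (tp + tn) / (tp + tn + fn + fp)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    return {"acc": acc, "sens": sens, "spec": spec, "prec": prec, "f1": f1}

# Hypothetical imbalanced test set: 10 positive and 990 negative examples.
# The model detects only 2 of the 10 positives and raises 5 false alarms.
metrics = binary_metrics(tp=2, fp=5, fn=8, tn=985)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this hypothetical case the accuracy is 0.987 although the sensitivity is only 0.2, which is the situation the paragraph above warns about.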

Segmentation. In many medical applications, the goal is not to assign a class to an image but to segment different structures within the images. For example, for multiple sclerosis, lesions in the brain need to be segmented in MRI scans. Thus, the model input is an MRI image, and the output is a classification of every pixel or voxel, which corresponds to a segmentation map. As given in Equation 3.1, the output needs to be spatially resolved. Thus, CNN architecture design needs to be different for this task. Encoder-decoder models are a popular approach for this problem, which were popularized by U-Net [409]. An example structure, including the input image and the output segmentation map, is shown in Figure 3.10.

Fig. 3.10: An example for a segmentation task using an encoder-decoder CNN. [Diagram labels: Input Image; Conv; ConvDown (three times); ConvUp (three times); Conv; Convolutional Blocks; Res.Block/2; Input with Segmentation; Segmentation.]

In terms of the network structure, the encoder processes the image similarly to a standard CNN. This results in a compressed representation that captures important information in the image. Then, the decoder upsamples this representation until it reaches the same spatial size as the input.

Segmentation is similar to classification as we perform a pixel- or voxel-wise classification. Therefore, the loss functions $J_{CEB}$ and $J_{CEM}$ can be applied to every pixel or voxel output independently, and the results are averaged into a final, scalar loss value. A problem that commonly occurs is a severe imbalance between classes within the image. For example, for multiple sclerosis, most of the voxels in the image belong to the background class, and only a few correspond to the foreground class. Therefore, other loss functions have been developed. A popular approach is the Dice loss function [339]. Here, we minimize a loss based on a variant of the Sørensen-Dice coefficient [107, 461], which is defined as

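$$J_{DB} := 1 - \frac{y\hat{y} + \epsilon}{y + \hat{y} + \epsilon} - \frac{(1-y)(1-\hat{y}) + \epsilon}{2 - y - \hat{y} + \epsilon} \tag{3.49}$$

for the binary classification case where $\epsilon$ is a small scalar for numerical stability. For the multi-class case, the generalized Dice loss is more popular, which is defined as

$$J_{DM} := 1 - 2\,\frac{\sum_{j=1}^{N_c} \alpha_j y_j \hat{y}_j}{\sum_{j=1}^{N_c} \alpha_j (y_j + \hat{y}_j)} \tag{3.50}$$

where $\alpha_j$ are class weightings, often chosen as the inverse volume of class $j$ [97, 467].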

The Dice coefficient is also commonly used as an evaluation metric for segmentation tasks. Besides Dice coefficient variants, classification metrics such as accuracy, sensitivity, and specificity are often used for evaluating segmentation problems.
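The two Dice losses in Eqs. 3.49 and 3.50 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: the products and sums are accumulated over all voxels of an image, the class weights are computed as the inverse class volume in the target, and the function names, the value of `eps`, and the toy data are hypothetical.

```python
import numpy as np

def dice_loss_binary(y, y_hat, eps=1e-6):
    """Binary Dice loss following Eq. 3.49, with terms summed over all voxels (assumption)."""
    foreground = (np.sum(y * y_hat) + eps) / (np.sum(y + y_hat) + eps)
    background = (np.sum((1 - y) * (1 - y_hat)) + eps) / (np.sum(2 - y - y_hat) + eps)
    return 1.0 - foreground - background

def generalized_dice_loss(y, y_hat, eps=1e-6):
    """Generalized Dice loss following Eq. 3.50.

    y and y_hat have shape (num_classes, num_voxels). The class weights alpha
    are the inverse target volume per class (assumption); eps avoids division by zero.
    """
    alpha = 1.0 / (np.sum(y, axis=1) + eps)
    numerator = np.sum(alpha * np.sum(y * y_hat, axis=1))
    denominator = np.sum(alpha * np.sum(y + y_hat, axis=1))
    return 1.0 - 2.0 * numerator / (denominator + eps)

# Toy 1D "image" with 10 voxels, of which only 2 belong to the foreground.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=float)
all_background = np.zeros_like(y)  # a prediction that ignores the lesion entirely

loss_perfect = dice_loss_binary(y, y)
loss_background = dice_loss_binary(y, all_background)

# One-hot layout for the multi-class version: row 0 = background, row 1 = foreground.
y_mc = np.stack([1 - y, y])
pred_mc = np.stack([1 - all_background, all_background])
gdl_perfect = generalized_dice_loss(y_mc, y_mc)
gdl_background = generalized_dice_loss(y_mc, pred_mc)

print(loss_perfect, loss_background, gdl_perfect, gdl_background)
```

In this toy case the all-background prediction labels 80% of the voxels correctly, yet both losses remain clearly above zero because the missed foreground class is weighted explicitly.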

Regression. For a regression task, the goal is to predict a continuous value instead of a binary or nominal output. While digital processing requires discretization, the outputs for a regression task are much more fine-grained. In terms of network design, we can use architectures similar to those used for classification problems. Only the output layer needs to be changed where we employ linear neurons without an activation function.

A suitable loss function for this problem punishes the amount of deviation from the true value. A popular example is the mean squared error (MSE) which is defined as

$$J_{MSE} := \frac{1}{N_b} \sum_{i=1}^{N_b} (y_i - \hat{y}_i)^2$$

where $N_b$ is the number of examples being evaluated. Squaring the error leads to increased punishment of large deviations from the desired value, which can help faster convergence. As an alternative, the mean absolute error (MAE) can be used which is defined as

$$J_{MAE} := \frac{1}{N_b} \sum_{i=1}^{N_b} |y_i - \hat{y}_i|$$

where $N_b$ is the number of examples being evaluated. As the MSE gives more weight to outliers, the MAE can lead to different results after optimization. The MAE is also useful as a metric for evaluation as it reflects the deviation from the target values without a squared distortion.

Another useful metric for evaluation is Pearson's correlation coefficient (PCC). The PCC is defined as

$$\mathrm{PCC}_j := \frac{\sum_{i=1}^{N_b} (y_{ij} - \bar{y}_j)(\hat{y}_{ij} - \bar{\hat{y}}_j)}{\sqrt{\sum_{i=1}^{N_b} (y_{ij} - \bar{y}_j)^2}\,\sqrt{\sum_{i=1}^{N_b} (\hat{y}_{ij} - \bar{\hat{y}}_j)^2}}$$

where $N_b$ is the number of examples being evaluated, $\bar{y}_j$ is the mean target value for output $j$, and $\bar{\hat{y}}_j$ is the mean predicted value for output $j$. This metric provides a relative estimate of how similar predictions and targets are without depending on a unit. As an alternative, the MAE can be normalized to achieve independence from the target's unit and range. When dividing the MAE by the target's standard deviation, we obtain the relative MAE (rMAE) [55].
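The regression metrics named in this section (MSE, MAE, PCC, and rMAE) can be sketched for a single output as follows. The function names and example values are hypothetical, and the use of the population standard deviation (NumPy's default, `ddof=0`) for the rMAE is an assumption, since the text does not specify the estimator.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error over all evaluated examples."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error over all evaluated examples."""
    return np.mean(np.abs(y - y_hat))

def pcc(y, y_hat):
    """Pearson's correlation coefficient between targets and predictions."""
    y_centered = y - np.mean(y)
    pred_centered = y_hat - np.mean(y_hat)
    return np.sum(y_centered * pred_centered) / np.sqrt(
        np.sum(y_centered ** 2) * np.sum(pred_centered ** 2)
    )

def rmae(y, y_hat):
    """MAE divided by the targets' standard deviation (population estimate, an assumption)."""
    return mae(y, y_hat) / np.std(y)

# Hypothetical targets and predictions with a constant offset of 0.5.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = y + 0.5

mse_value = mse(y, y_hat)
mae_value = mae(y, y_hat)
pcc_value = pcc(y, y_hat)
rmae_value = rmae(y, y_hat)

print(mse_value, mae_value, pcc_value, rmae_value)
```

The example shows why the PCC is usually reported alongside an error metric: a constant offset leaves the correlation at 1 while the MAE and MSE are nonzero.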

Regression is a typical task for deep learning applications in the context of computer-assisted interventions, for example, for force estimation, pose estimation, or tracking.

For diagnostic tasks, regression is employed for estimating quantitative tissue parameters such as the size of the left cardiac ventricle.

3.4 Summary

In this chapter, we introduced the fundamentals of neural networks and deep learning. Deep learning is a method to learn an unknown function from data. While optimization explicitly fits the model parameters to a training dataset, the overall goal is to achieve generalization. This property describes a model's ability to perform well on new data that was not part of the training process. The deep learning problem can be extended to a bilevel optimization problem, where hyperparameter selection is incorporated as well. The hyperparameter search space is usually designed by a domain expert and depends on prior knowledge.

Neural networks are the basis of most modern deep learning methods. In their standard version, fully-connected neural networks learn a nonlinear transformation of a feature vector for predicting an example's class or a continuous value. The feature vector's nonlinear transformation consists of a linear matrix multiplication, where matrix entries are learnable parameters, followed by a nonlinear activation function such as a rectified linear unit.

Medical images are difficult to process when represented by a feature vector. Therefore, neural networks have been extended for directly processing images using convolutions instead of matrix multiplications. Here, the convolution's filters are learnable parameters. Also, medical image data can have a temporal component, for example, in terms of a series of images capturing a process, including motion or several images for longitudinal analysis. For this type of data, recurrent neural networks are often used. Temporal information is aggregated through recurrent connections and sequential processing. In particular, gating mechanisms are effective for capturing long-term information within sequential data.

Neural networks are typically trained using a labeled training dataset and gradient descent. The neural network’s parameters are optimized such that they minimize a loss function. In general, this optimization process is difficult as the problem is nonconvex, and the actual goal of generalization is not explicitly captured in the process. Recent gradient descent algorithms such as Adam improve the optimization procedure by using adaptive learning rates and exponential moving averages of the gradients’ statistical moment estimates. Regularization methods such as dropout, batch normalization, and data augmentation can also be used to improve generalization. For learning tasks such as classification, segmentation, and regression, different loss functions and evaluation metrics are used for assessing generalization performance.

4 Multi-Dimensional Deep Learning Methods

In this chapter, we adapt and propose a range of methods for deep learning-based medical image analysis in the context of multi-dimensional data and different data representations. A method's suitability for multi-dimensional data can be characterized by two requirements. First, the method needs to be scalable to lower and higher dimensions, for example, for considering multiple spatial data dimensions. Second, the method needs to be well-tailored to the different dimensions' characteristics. For example, spatial, short-term temporal, and long-term temporal dimensions might benefit from a different treatment. As a result, deep learning methods that are used for exploring multi-dimensional data characteristics need to be both adaptable and well-tailored for the problem at hand.

The first set of methods we consider are multi-dimensional CNNs. Traditionally, CNNs have been used for 2D natural image data. As described in Chapter 2, the discrete convolution can be easily extended from 1D to 4D, making CNNs very versatile.

Yet, architecture design is very challenging as naive extensions from lower to higher data dimensions are accompanied by substantial increases in model parameters and computational cost. We address this problem by careful handcrafting of architectures, automated architecture search, and multi-dimensional transfer learning strategies. While useful for many different problems, convolutions are not immediately suitable for small data dimensions that are part of 2.5D or 3.5D data. For this purpose, we propose different types of Siamese-like CNNs that can be applied to a large set of problems where one of the dimensions only encompasses two samples. In particular, we introduce a novel attention-based information exchange mechanism for two-time point data. Furthermore, CNNs treat all data dimensions equally, which might not be optimal in some cases. For this reason, we consider recurrent-convolutional models that process the temporal data dimension differently, using learnable gating mechanisms. In this context, we propose a novel convolutional-recurrent model that demonstrates promising results across multiple applications.
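The parameter growth mentioned above can be made concrete with a short calculation. For a single convolutional layer with an isotropic kernel of size k in each of d data dimensions, the number of weights is k^d times the product of input and output channels, plus one bias per output channel. The function name, the kernel size of 3, and the channel count of 64 are assumptions chosen only for illustration.

```python
def conv_layer_params(kernel_size, num_dims, channels_in, channels_out, bias=True):
    """Number of learnable parameters of one convolutional layer.

    Assumes the same kernel size along each of the num_dims data dimensions.
    """
    weights = (kernel_size ** num_dims) * channels_in * channels_out
    return weights + (channels_out if bias else 0)

# Hypothetical layer: kernel size 3, 64 input channels, 64 output channels.
param_counts = {
    d: conv_layer_params(kernel_size=3, num_dims=d, channels_in=64, channels_out=64)
    for d in (1, 2, 3, 4)
}

for d, count in param_counts.items():
    print(f"{d}D convolution: {count} parameters")
```

Under these assumptions, each additional data dimension multiplies the weight count of the layer by the kernel size, so the same layer has roughly 27 times as many weights in 4D as in 1D. The memory for feature maps also grows with every added dimension, which this sketch does not model.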

Throughout this chapter, we describe general architecture concepts that are not necessarily tied to a specific implementation or application. We implement these concepts for different applications in Chapter 6.

4.1 1D, 2D, 3D and 4D CNNs

Convolutional neural networks have been studied extensively for medical image analysis. As elaborated in Chapter 5, a key component that is missing is a multi-dimensional perspective. In other words, how should CNNs be transferred between different data dimensions? We propose a number of methods to address this question, starting at 1D and 2D CNNs. Due to the low data dimensionality, we propose an automatic search strategy to obtain suitable architectures that are effective both for 1D and 2D data. Most CNNs are designed and proposed for 2D data, as this is the standard natural image dimensionality. Therefore, we address the question of how to extend 2D CNN concepts to 3D next. Here, we also propose a multi-dimensional transfer learning approach, which allows for effective architecture reuse in higher dimensions. Last, we consider the extension of 3D CNNs to 4D, which is a field that has hardly been addressed.