
Recurrent-Convolutional Models

CNN-GRU. Often, high-dimensional data is a combination of spatial and temporal dimensions. In the natural image domain, videos typically have a shape of $x \in \mathbb{R}^{n_t \times n_h \times n_w \times n_c}$, where $n_c = 3$ is the number of RGB color channels. While convolutions can be used for any type of data dimension, recurrent architectures, which we introduced in Section 3.3.3, have also been shown to be particularly effective at processing temporal dimensions. The original design of RNNs is meant for time series, that is, sequences of feature vectors $x \in \mathbb{R}^{n_t \times n_c}$ where $n_c$ is the feature vector's length. Thus, they are not suitable for multi-dimensional data, where we have to deal with sequences of spatial images.

The conventional approach for this problem was to use feature extraction methods to obtain a feature sequence from individual images. This approach has been extended to the deep learning domain by using a CNN as the feature extractor [108]. Here, a CNN is trained on individual 2D images $x \in \mathbb{R}^{n_h \times n_w \times n_c}$, for example, for a conventional image classification task. Then, the CNN is used for the target spatio-temporal sequence by performing forward passes through the CNN for the individual images in the sequence.

Typically, the features after the CNN's GAP layer, before the classification layer, are extracted from the forward pass of each image. As a result, we obtain a sequence of feature vectors $x' \in \mathbb{R}^{n_t \times n'_c}$, encoding the spatio-temporal sequence $x \in \mathbb{R}^{n_t \times n_h \times n_w \times n_c}$. The sequence $x'$ is now used to train a conventional RNN architecture for a target task such as action recognition or tracking.

Fig. 4.15: The RN2D-GRU model we employ, which is inspired by typical CNN-RNN models for video processing in the natural image domain [380]. S GAP refers to spatial GAP. $\sigma$ and $\tanh$ refer to dense neural network layers with sigmoid and tanh activation functions, respectively.

An extension of this approach is to train a single deep learning model in an end-to-end fashion instead of pretraining the CNN for a different task [380]. The concept is shown in Figure 4.15. Here, the CNN and the RNN architecture are directly connected after the CNN's GAP layer. The full architecture takes the full spatio-temporal tensor $x \in \mathbb{R}^{n_t \times n_h \times n_w \times n_c}$ as the input. Each of the $n_t$ CNN paths processes one image $x \in \mathbb{R}^{n_h \times n_w \times n_c}$. Typically, all CNN paths share parameters, as we expect the individual images in the sequence to have similar spatial features and intend to focus inter-frame relationship learning on the temporal RNN part of the model. In practice, this can be implemented by using a single CNN that processes all images in the sequence in parallel through the batch dimension. Thus, with a batch dimension of size $N_b$, the single-path CNN processes multiple sequences $x \in \mathbb{R}^{N_b \times n_t \times n_h \times n_w \times n_c}$ as a reshaped tensor $\hat{x} \in \mathbb{R}^{N_b n_t \times n_h \times n_w \times n_c}$. Notice that this approach is similar to our proposed MP-CNN4D architecture in the previous section, except that there, the GAP layer and RNN are replaced by a full spatio-temporal convolutional processing block.
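To make this batch-folding implementation concrete, the following minimal PyTorch sketch applies a single, shared 2D CNN to all frames by folding the temporal dimension into the batch dimension and then aggregates the resulting feature sequence with a GRU. The small convolutional stack is only a stand-in for the ResNet- or DenseNet-based backbones used in this thesis, and all class and parameter names are illustrative.

```python
# Minimal sketch of the CNN-GRU concept with a shared CNN path: the temporal
# dimension is folded into the batch dimension so a single 2D CNN processes
# all frames in parallel, and a GRU then models the temporal relationships
# between the resulting feature vectors.
import torch
import torch.nn as nn


class CNNGRU(nn.Module):
    def __init__(self, in_channels=3, feat_dim=64, hidden_dim=128, num_outputs=10):
        super().__init__()
        # Placeholder spatial feature extractor (stands in for RN2D/DN2D).
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)          # spatial GAP
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_outputs)

    def forward(self, x):
        # x: (N_b, n_t, n_c, n_h, n_w)
        nb, nt, nc, nh, nw = x.shape
        x = x.reshape(nb * nt, nc, nh, nw)          # fold time into the batch
        f = self.gap(self.cnn(x)).flatten(1)        # (N_b * n_t, feat_dim)
        f = f.reshape(nb, nt, -1)                   # unfold into a feature sequence
        _, h_n = self.gru(f)                        # temporal aggregation
        return self.head(h_n[-1])                   # prediction from last hidden state


if __name__ == "__main__":
    model = CNNGRU()
    video = torch.randn(2, 8, 3, 64, 64)            # batch of two 8-frame clips
    print(model(video).shape)                       # torch.Size([2, 10])
```

Note that PyTorch uses a channels-first layout, so the tensor shapes differ from the channels-last notation used in the text only in the position of the channel dimension.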

Both the CNN and the RNN part of this architecture concept can follow any design principle. For example, the CNN can be based on a ResNet, Inception-V3, or DenseNet, and the RNN can be a vanilla RNN, an LSTM, or a GRU. Throughout this thesis, we generally employ ResNet- or DenseNet-based CNNs for the spatial processing part and a GRU for the temporal processing part. For other multi-dimensional problems, RN2D-GRU can be converted to an RN1D-GRU or RN3D-GRU for sequences of 1D images or 3D volumes, respectively, when using a ResNet basis.

Fig. 4.16: The cGRU-RN2D model we propose. S GAP refers to spatial GAP. $\sigma$ and $\tanh$ refer to convolutional neural network layers with sigmoid and tanh activation functions, respectively. Note that the use of convolutional layers instead of dense layers is a key difference to RN2D-GRU shown in Figure 4.15.

cGRU-CNN. As a next step, we extend the idea of using both CNNs and RNNs in one architecture. Most approaches for processing videos rely on the concept introduced previously, where a CNN is followed by an RNN. This is largely motivated by the idea of first obtaining abstract feature representations and then finding temporal relationships between these abstract representations. While this is a reasonable approach for tasks such as action recognition, obtaining abstract representations of the spatial images first might not be optimal for other tasks. Consider a problem where temporal information is available, but not strictly necessary, for example, in the context of force estimation.

Here, a temporal history of previous images largely serves the purpose of obtaining more consistent estimates from a sequence of images instead of a single estimate. Thus, capturing small, local variations in the full spatio-temporal sequence might be preferable to searching for context between abstract feature vectors.

This motivates the idea of reversing the order of processing steps for combined CNN and RNN architectures. We propose to let an RNN architecture perform temporal processing first, producing a single aggregated spatial representation for further processing by the CNN. Taking the example from the previous section where a video is being processed, the RNN architecture receives a sequence $x \in \mathbb{R}^{n_t \times n_h \times n_w \times n_c}$ and produces an aggregated spatial representation $x' \in \mathbb{R}^{n'_h \times n'_w \times n'_c}$. Notice that this requires the RNN architecture to be able to perform spatial processing in order to keep the spatial structure intact for subsequent CNN processing. This can be enabled by utilizing convolutional RNN modules, which have been proposed for weather forecasting [556]. In that work, the authors replaced the conventional dense neural network layers that act as the gates with convolutional layers in an LSTM model. We apply this approach to the more efficient GRU architecture. As a result, the computations performed by this cGRU are described by

$$
\begin{aligned}
z_{t_i} &= \sigma\left(K_z * h_{t_{i-1}} + L_z * \mathrm{RBN}(x_{t_i})\right)\\
r_{t_i} &= \sigma\left(K_r * h_{t_{i-1}} + L_r * \mathrm{RBN}(x_{t_i})\right)\\
c_{t_i} &= \tanh\left(K_c * (r_{t_i} h_{t_{i-1}}) + L_c * \mathrm{RBN}(x_{t_i})\right)\\
h_{t_i} &= z_{t_i} c_{t_i} + (1 - z_{t_i}) h_{t_{i-1}}
\end{aligned}
\qquad (4.13)
$$

where $h_{t_i}$ is the hidden state, $x_{t_i}$ is the input, $K$ and $L$ are filters, $*$ denotes a convolution, $\sigma$ denotes the sigmoid activation function, and $\mathrm{RBN}(\cdot)$ denotes recurrent batch normalization [94]. In most applications, we also employ recurrent dropout [378] at the cell input with probability $p_{di}$ and at the cell output with probability $p_{do}$. Combining this cGRU with a ResNet-based CNN results in our final architecture cGRU-RN2D, which is shown in Figure 4.16 for the example of 3D spatio-temporal data. Similar to RN2D-GRU, this architecture can be applied in a multi-dimensional context by converting RN2D into an RN1D or RN3D for sequences of 1D images or 3D volumes, respectively.
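As an illustration, the following PyTorch sketch implements a convolutional GRU cell along the lines of Eq. (4.13). For brevity, recurrent batch normalization and recurrent dropout are omitted, so the identity is used in place of $\mathrm{RBN}(\cdot)$; the class and function names are illustrative and not part of a reference implementation.

```python
# Convolutional GRU cell: the gates use 2D convolutions instead of dense
# layers so that the spatial structure of the input is preserved.
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # K_* act on the hidden state, L_* act on the input.
        self.K_z = nn.Conv2d(hidden_channels, hidden_channels, kernel_size, padding=pad)
        self.L_z = nn.Conv2d(in_channels, hidden_channels, kernel_size, padding=pad)
        self.K_r = nn.Conv2d(hidden_channels, hidden_channels, kernel_size, padding=pad)
        self.L_r = nn.Conv2d(in_channels, hidden_channels, kernel_size, padding=pad)
        self.K_c = nn.Conv2d(hidden_channels, hidden_channels, kernel_size, padding=pad)
        self.L_c = nn.Conv2d(in_channels, hidden_channels, kernel_size, padding=pad)

    def forward(self, x_t, h_prev):
        # x_t: (N, in_channels, H, W), h_prev: (N, hidden_channels, H, W)
        z_t = torch.sigmoid(self.K_z(h_prev) + self.L_z(x_t))   # update gate
        r_t = torch.sigmoid(self.K_r(h_prev) + self.L_r(x_t))   # reset gate
        c_t = torch.tanh(self.K_c(r_t * h_prev) + self.L_c(x_t))
        return z_t * c_t + (1.0 - z_t) * h_prev                  # new hidden state


def run_cgru(cell, x_seq, hidden_channels):
    """Aggregate a sequence (N, n_t, C, H, W) into a single spatial map."""
    n, nt, _, h, w = x_seq.shape
    h_t = x_seq.new_zeros(n, hidden_channels, h, w)
    for t in range(nt):
        h_t = cell(x_seq[:, t], h_t)
    return h_t


if __name__ == "__main__":
    cell = ConvGRUCell(in_channels=3, hidden_channels=16)
    seq = torch.randn(2, 5, 3, 32, 32)
    print(run_cgru(cell, seq, 16).shape)   # torch.Size([2, 16, 32, 32])
```

Running the cell step by step over the temporal dimension yields a single aggregated spatial representation, which is then passed on to the subsequent CNN in cGRU-RN2D.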

The idea of using convolutions in the processing gates of a recurrent unit can also be directly applied to our CNN-GRU model. To form a CNN-cGRU model, we swap the order of the GAP layer and the GRU layers. Thus, the cGRU layers receive the input $x' \in \mathbb{R}^{n_t \times n'_h \times n'_w \times n'_c}$ and produce an output $x'' \in \mathbb{R}^{n'_h \times n'_w \times n''_c}$. Then, the GAP layer is applied to obtain a feature vector $\hat{x}'' \in \mathbb{R}^{n''_c}$ that is processed by the output layer for producing a prediction.
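The reordering can be sketched as follows: given per-frame feature maps produced by the CNN, the cGRU aggregates them over time before spatial GAP and the output layer are applied. The `cgru_cell` and `output_layer` arguments are assumed to be a convolutional GRU cell (such as the one sketched above) and a dense prediction layer; the function name is illustrative.

```python
import torch.nn.functional as F


def cnn_cgru_head(feature_maps, cgru_cell, hidden_channels, output_layer):
    # feature_maps: (N_b, n_t, n'_c, n'_h, n'_w), i.e., the CNN output per frame
    nb, nt, _, h, w = feature_maps.shape
    h_t = feature_maps.new_zeros(nb, hidden_channels, h, w)
    for t in range(nt):                                  # cGRU runs before GAP
        h_t = cgru_cell(feature_maps[:, t], h_t)
    pooled = F.adaptive_avg_pool2d(h_t, 1).flatten(1)    # spatial GAP -> (N_b, n''_c)
    return output_layer(pooled)                          # final prediction
```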

cGRU-CNN-U. In Section 4.2, we discussed 2.5D and 3.5D architectures in the context of segmentation problems where a segmentation map is derived from two input images. For problems such as MS lesion activity segmentation, an entire temporal sequence of input volumes can be available, instead of just two volumes representing two states. Sequence processing can also be handled with recurrent models integrated into a U-Net-like CNN architecture.

Following the idea of fusion between encoder and decoder leads to the architecture depicted in Figure 4.17. Here, the model receives an input of size $x \in \mathbb{R}^{n_t \times n_h \times n_w \times n_d \times n_c}$. Then, $n_t$ encoder paths process the $n_t$ volumes individually, similar to the RN2D-GRU model. Again, parameter sharing can be enabled by processing the individual image volumes as part of the batch dimension.

After the encoder, all $n_t$ volumes are aggregated using a recurrent architecture. As the volumes are spatial in nature, we use cGRUs once again in order to preserve the spatial structure for the decoder. Thus, one or several cGRU layers receive the input $x' \in \mathbb{R}^{n_t \times n'_h \times n'_w \times n'_d \times n'_c}$ and output a single spatial representation $x'' \in \mathbb{R}^{n'_h \times n'_w \times n'_d \times n'_c}$. Then, $x''$ is processed by a normal decoder that upsamples the representation into the final segmentation prediction $\hat{y} \in \mathbb{R}^{n_h \times n_w \times n_d}$.

Another aspect that is unique to segmentation architectures and needs to be considered is the presence of long-range connections in the architecture. Here, a multi-time-point encoder feature tensor needs to be connected to the single spatial representation in the decoder. We opt for the same strategy as for the connection between encoder and decoder by using cGRUs for temporal fusion in each long-range connection before the decoder.

Fig. 4.17: The cGRU-CNN-U model we propose. The model takes a sequence of volumes as input and predicts a single volumetric segmentation map. All convolutional layers use 3D convolutions in this case.

Here, we described the cGRU-CNN-U for the case of a sequence of volumes. Thus, all convolutional layers employ 3D convolutions for spatial processing. In our applications we employ ResNet blocks within the architecture.
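The following structural sketch summarizes the resulting data flow, under the assumption that the encoder stages, the cGRU cells, and the decoder are provided as callables (for example, 3D ResNet blocks and 3D convolutional GRU cells analogous to the 2D cell sketched earlier, with hidden channels matching the respective skip channels). All names are illustrative, and details such as normalization and the output activation are omitted.

```python
def cgru_cnn_u(x, encoder_stages, cgru_cells, decoder):
    """Structural sketch of cGRU-CNN-U for an input x of shape
    (N_b, n_t, n_c, n_d, n_h, n_w)."""
    nb, nt = x.shape[:2]
    feats = x.reshape(nb * nt, *x.shape[2:])        # fold time into the batch
    skips = []
    for stage in encoder_stages:                    # shared multi-scale 3D encoder
        feats = stage(feats)
        skips.append(feats)
    fused = []
    for cell, skip in zip(cgru_cells, skips):       # temporal fusion at every scale
        s = skip.reshape(nb, nt, *skip.shape[1:])   # unfold the time dimension
        h_t = s.new_zeros(nb, *skip.shape[1:])      # hidden state matches skip shape
        for t in range(nt):
            h_t = cell(s[:, t], h_t)                # cGRU aggregates the sequence
        fused.append(h_t)
    # The decoder upsamples the deepest fused representation and uses the
    # remaining fused maps as long-range (skip) connections to predict a
    # single volumetric segmentation map.
    return decoder(fused[-1], fused[:-1])
```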

Summary. Multi-dimensional medical image data usually comes with several spatial dimensions and a temporal dimension. While convolutions can be used universally for any type of dimension, recurrent concepts such as LSTMs have been shown to be effective for temporal processing. Thus, multi-dimensional spatio-temporal data can be processed by joint CNN and recurrent architectures. For this type of architecture, we consider several different variants. First, we design a model that follows design principles from the natural image domain, with a CNN followed by a recurrent GRU. Second, we propose a cGRU-CNN concept where spatially consistent temporal processing and aggregation is performed first, followed by spatial processing. Third, we apply this concept to segmentation problems and propose a cGRU-CNN-U concept where a cGRU aggregates a temporal sequence produced by an encoder, followed by a decoder that processes and predicts a single spatial representation.

4.4 Summary

Overall, we propose a large number of deep learning methods that can be employed for different tasks where multi-dimensional aspects are relevant. We start with CNNs, the most common type of deep learning model for image processing. The majority of CNNs are designed for standard 2D image data. Thus, we first address the problem of moving from 2D models to 1D. In terms of computational and memory requirements, this step is not complicated, as 1D architectures are very lightweight. This opens up the option to implement an optimization procedure where we automatically search for the optimal CNN architecture. We study this idea in the context of image segmentation, where we search for a U-Net's building blocks using neural architecture search.

The next challenging step is to move from 2D standard architectures to 3D models for processing 3D image volumes. Here, the extension is not trivial, as the higher-dimensional convolutions require more computational power, and the increase in trainable parameters for capturing higher-dimensional context also increases the risk of overfitting. Our first approach to this problem is careful architecture engineering, where we design 3D CNN extensions for four different architecture concepts we adopted from the 2D image domain. Our second approach to this problem assumes that the issue of computational requirements is solvable by better hardware. Thus, we address the remaining risk of overfitting with a multi-dimensional transfer learning strategy. Here, we directly use pretrained models from the 2D domain and isotropically expand all their 2D operations to 3D. This process is enabled by a weight rescaling strategy that ensures consistent feature ranges for the higher-dimensional computations. As a next step, we also extend 3D CNNs to 4D. This encompasses the same problems as the 2D to 3D extension, in an even more severe way. Thus, we take the route of architecture engineering and propose three different 4D CNN variants that use full 4D convolutions and efficient alternatives such as factorized convolutions.

Not all image data can be meaningfully assigned to, for example, the 2D or 3D category if one of the data dimensions is very small. Often, input data only consists of two states captured by two images. For this type of 2.5D and 3.5D image data, we propose Siamese two-path architectures that we adapt from the natural image domain. For this type of architecture, we study concepts such as early and late feature fusion, different fusion strategies, and modifications that enable transfer learning. For 3.5D segmentation problems, we also propose attention-guided interaction modules between processing paths.

Last, we also consider recurrent architectures that are particularly suited for processing temporal data. Many multi-dimensional learning problems have a spatio-temporal data structure and thus a temporal dimension that needs to be processed. Therefore, joint CNN and RNN architectures can be designed that perform spatial processing with convolutions and temporal processing with recurrent modules. In the natural image domain, CNN-RNN concepts are popular, where a CNN produces abstract spatial representations that are processed by an RNN. We extend this concept by performing temporal processing first with a convolutional RNN module that produces a single spatial representation for CNN processing. This follows the idea of aggregating a sequence, where all elements are very similar, into a single representation before extracting abstract features. We also extend this concept to segmentation problems, where we propose an encoder-decoder architecture with recurrent aggregation of volumes in a temporal sequence.

An overview of all deep learning architectures and some of their implementations in this thesis is given in Table 4.2. Next, we introduce the different application scenarios, where we apply the architecture concepts described in this chapter.

Tab. 4.2: An overview of all architectures we employ, example implementations, and their respective properties and application task. Tasks include regression or classification (R./C.) and segmentation (Seg.). The keyword custom indicates that the model is a custom implementation.

Data Dim.  Implementation  Description                                Task

Convolutional Neural Networks
1D         RN1D            Custom ResNet                              R./C.
           RN1D-U          Custom ResNet-based U-Net                  Seg.
           ENAS1D-U        Custom U-Net found by ENAS                 Seg.
2D         RN2D            Custom ResNet                              R./C.
           IN2D            Custom Inception                           R./C.
           RN2D-50         Adapted ResNet50 [193]                     R./C.
           DN2D-121        Adapted DenseNet121 [208]                  R./C.
           SENet2D-154     Adapted SENet154 [205]                     R./C.
           RN2D-U          Custom ResNet-based U-Net                  Seg.
           ENAS2D-U        Custom U-Net found by ENAS                 Seg.
3D         RN3D-A          Custom ResNet w/o bottlenecks              R./C.
           RN3D-B          Custom ResNet w/ bottlenecks               R./C.
           IN3D            Custom Inception                           R./C.
           RX3D            Custom ResNext                             R./C.
           DN3D-121        Adapted DenseNet121 [208]                  R./C.
           SE-RN3D-101     Adapted SE-ResNet101 [205]                 R./C.
           facRN3D         Custom ResNet w/ factorized convolutions   R./C.
           RN3D-U          Custom ResNet-based U-Net                  Seg.
4D         RN4D            Custom ResNet                              R./C.
           DN4D            Custom DenseNet                            R./C.
           facRN4D         Custom ResNet w/ factorized convolutions   R./C.
           MP-DN4D         Custom multi-path DenseNet                 R./C.

Siamese Models
2.5D       TP-RN2D-50      Two-path CNN based on ResNet50 [193]       R./C.
           TP-DN2D-121     Two-path CNN based on DenseNet121 [208]    R./C.
3.5D       TP-DN3D         Custom two-path DenseNet                   R./C.
           MP-DN3D         Custom multi-path DenseNet                 R./C.
           TP-RN3D-U       Custom two-path ResNet                     Seg.
           MP-RN3D-U       Custom multi-path ResNet                   Seg.

Recurrent-Convolutional Models
2D         RN1D-GRU        Custom ResNet-based CNN-GRU                R./C.
           RN1D-cGRU       Custom ResNet-based CNN-cGRU               R./C.
           GRU-RN1D        Custom ResNet-based GRU-CNN                R./C.
           cGRU-RN1D       Custom ResNet-based cGRU-CNN               R./C.
3D         RN2D-GRU        Custom ResNet-based CNN-GRU                R./C.
           RN2D-cGRU       Custom ResNet-based CNN-cGRU               R./C.
           cGRU-RN2D       Custom ResNet-based cGRU-CNN               R./C.
4D         RN3D-GRU        Custom ResNet-based CNN-GRU                R./C.
           RN3D-cGRU       Custom ResNet-based CNN-cGRU               R./C.
           cGRU-RN3D       Custom ResNet-based cGRU-CNN               R./C.

5 Application Scenarios and Previous Work

In this chapter, we introduce the application problems that we address throughout this thesis. We start by reviewing previous work and its development over time, where we focus on the use of data representations and machine learning methods. Then, we highlight aspects and open questions concerning data representations and deep learning that have not been sufficiently addressed in the literature so far.

First, we discuss the problem of vision-based force estimation. We address both image-based force estimation with higher-dimensional data representations and needle-based force estimation approaches with low-dimensional data. Second, we review work on OCT-based tissue segmentation and classification regarding the two major clinical applications of OCT, ophthalmology and intravascular imaging of the heart. Third, we address left-ventricle quantification, another deep learning problem that is focused on the heart as an anatomical region. Fourth, we examine the problems of pose and motion estimation with deep learning methods and higher-dimensional data.

Fifth, we introduce the longitudinal disease tracking problem of multiple sclerosis lesion activity segmentation. Finally, we review work on multi-dimensional deep learning that is not associated with any of our application problems. We conclude the chapter by highlighting general trends and open questions across all applications and imaging modalities.

5.1 Vision-Based Force Estimation

5.1.1 2D and 3D Image-Based Force Estimation

Vision-based force estimation is primarily motivated by the idea of estimating forces without the need for a mechatronic force sensor, for example, for robot-assisted interventions. Robot-assisted minimally invasive surgery (MIS) has become increasingly popular as it addresses various shortcomings of conventional MIS [540]. Robotic systems allow for motion scaling, tremor compensation, and more degrees of freedom for tool movement, which improves precision and reduces physical trauma [263]. However, these systems often lack force feedback [105], which would be helpful in controlling the instrument-tissue interaction during surgery. Typically, haptic feedback is generated on the patient side with haptic sensors, such as force sensors [101]. The information is fed back to a haptic interface that delivers the information to the human operator, for example, as vibrotactile or kinesthetic feedback [333]. One of the key challenges of generating reliable haptic feedback is accurate sensing of the forces at the patient [361].

Lack of haptic feedback may lead to complications, increased completion time, or severe injuries [369]. Although various approaches to realize force feedback have been proposed, the problem is still considered an open research challenge [37].

One approach is to directly incorporate force-sensing devices into the robotic setup [388]. The devices can be placed inside or outside of the patient. If the device is placed outside the patient, for example, between the tool and the robot, only indirect measurement is possible: in addition to the forces at the tool tip, forces acting on the tool, for example due to friction, are measured and cannot be separated [133]. When placing the device closer to the tool-tissue interaction point, for example, inside the tool tip, problems such as sterilization and biocompatibility arise [457].

Due to these shortcomings, vision-based force estimation procedures have been proposed. Vision-based force estimation is the task of estimating forces that are acting on tissue based only on images, which are typically provided by RGB-D cameras. This problem was initially not considered a spatio-temporal data processing problem, as early methods relied on deformable template matching methods [179] or finite element modeling techniques [233]. These methods were largely motivated by applications to microassembly and biomanipulation of individual biological cells [310]. Similarly, subsequent methods relied on mechanical deformation models [247, 249, 360]. Kim et al. considered the problem of haptic feedback in MIS [247]. Based on a-priori material knowledge and geometric estimates from camera images, the interaction forces between a tool and material were estimated. Noohi et al. considered the problem of estimating deformation from a monocular endoscopic camera without the use of a known, undeformed template [360]. Another research direction in vision-based force estimation
