
4.1.4 Extending 3D CNNs to 4D

The next step we consider is the extension of 3D CNNs to 4D. Intuitively, this is similar to the extension from 2D to 3D, except that all of the associated issues are aggravated.

Computational costs and memory requirements are even higher. For example, extending a 2D CNN with $n_{params}$ parameters to 4D by isotropically expanding the kernels would lead to an architecture with $n_{params}^2$ parameters. Thus, carefully designed solutions are required, and, similar to handcrafted 3D CNN architecture design, architecture hyperparameters $h_{AM} \subset h_{M}$ are manually selected. Furthermore, current deep learning frameworks lack native support for operations such as 4D convolutional layers.
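As a back-of-the-envelope illustration of this quadratic growth (ignoring channel dimensions): a single isotropic $3 \times 3$ kernel has $9$ weights, whereas its isotropic 4D counterpart of size $3 \times 3 \times 3 \times 3$ has $81 = 9^2$ weights, so the per-kernel parameter count squares.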

In this work, we consider several different approaches for extending CNNs to 4D. We start with an extension where we employ normal 4D convolutions. Then, we build a more efficient variant using factorized convolutions. Finally, we take a different approach to the extension problem by considering multi-path architectures. These methods can be employed for 4D data, which typically consists of a temporal sequence of image volumes: $x \in \mathbb{R}^{n_t \times n_h \times n_w \times n_d}$.

CNN4D. For this architecture, we replace 3D convolutions by their 4D counterpart. The mathematical extension of a discrete convolution to $N_d$ dimensions is described by Equation 3.10 in Chapter 3. In a 4D convolutional layer, the kernel $K^l \in \mathbb{R}^{k_t \times k_h \times k_w \times k_d \times k_c}$ of layer $l$ is applied to feature maps $x^{l-1} \in \mathbb{R}^{n_t \times n_h \times n_w \times n_d \times n_c}$, excluding the batch dimension. $n_t$ is the size of the temporal dimension, $n_h$, $n_w$, and $n_d$ are the spatial extents of the feature map, and $n_c$ is its channel dimension. Currently, there are no native 4D convolution operations available in standard environments such as PyTorch and TensorFlow. To keep the 4D convolution as efficient as possible, we implement a custom version in TensorFlow, which uses the native 3D convolution operation inside two loops. The operation can be described as

$(K^l \ast x^{l-1}) = \sum_{i}^{k_t} \sum_{j}^{n_t} \mathrm{Conv3D}\bigl(K^l(i), x^{l-1}(j)\bigr)$ (4.9)

with correct padding and a stride of one assumed. Based on this layer, we build different implementations of 4D CNNs. This includes a ResNet-based RN4D architecture and a DenseNet-based DN4D architecture. Both architectures share a similar structure with an initial feature map size of $d_{c_{init}}$ that is doubled each time the spatial dimensions are reduced by a convolutional layer with stride 2. As $n_t$ is usually small for most of our problems, the temporal dimension is not reduced by strides.
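As a minimal sketch of this loop-based construction of Equation 4.9 (the function name, argument layout, and the 'SAME' temporal padding are illustrative assumptions, not the exact thesis implementation), the operation can be written in TensorFlow as:

```python
import tensorflow as tf

def conv4d(x, kernel):
    """Naive 4D convolution built from native 3D convolutions (sketch).

    x:      (batch, n_t, n_h, n_w, n_d, c_in)
    kernel: (k_t, k_h, k_w, k_d, c_in, c_out)

    Assumes 'SAME' padding and a stride of one in all dimensions.
    """
    k_t, n_t = kernel.shape[0], x.shape[1]
    pad = k_t // 2  # temporal half-width for 'SAME' padding

    frames_out = []
    for t in range(n_t):           # outer loop: output time steps
        acc = 0.0
        for i in range(k_t):       # inner loop: temporal kernel taps
            j = t + i - pad        # input frame addressed by this tap
            if j < 0 or j >= n_t:  # zero padding outside the sequence
                continue
            acc += tf.nn.conv3d(x[:, j], kernel[i],
                                strides=[1] * 5, padding="SAME")
        frames_out.append(acc)
    return tf.stack(frames_out, axis=1)  # (batch, n_t, n_h, n_w, n_d, c_out)
```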

facCNN4D. A more efficient variant of RN4D or DN4D can be constructed by making use of factorized convolutions. In each ResBlock or DenseBlock, each convolution is split into two convolutions with a spatial kernel $K_S^l \in \mathbb{R}^{1 \times k_h \times k_w \times k_d \times k_c}$ and a temporal kernel $K_T^l \in \mathbb{R}^{k_t \times 1 \times 1 \times 1 \times k_c}$. This modification leads to a reduced number of parameters and decomposes spatial and temporal computations. Thus, the architecture is more efficient but does not have the same representational power as a CNN4D model, as the decomposed convolutions can only represent separable 4D kernels.
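A single factorized block could be sketched as follows, reusing the conv4d helper from above (kernel shapes and the intermediate channel size c_mid are illustrative assumptions):

```python
def factorized_conv4d(x, spatial_kernel, temporal_kernel):
    """Factorized spatial + temporal 4D convolution (sketch).

    x:               (batch, n_t, n_h, n_w, n_d, c_in)
    spatial_kernel:  (k_h, k_w, k_d, c_in, c_mid)  -- the 1 x k_h x k_w x k_d part
    temporal_kernel: (k_t, 1, 1, 1, c_mid, c_out)  -- the k_t x 1 x 1 x 1 part
    """
    # Spatial step: apply the same 3D convolution to every volume in the sequence.
    spatial = tf.stack(
        [tf.nn.conv3d(x[:, t], spatial_kernel, strides=[1] * 5, padding="SAME")
         for t in range(x.shape[1])],
        axis=1,
    )
    # Temporal step: a 4D convolution whose kernel only extends along time.
    return conv4d(spatial, temporal_kernel)
```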

MP-CNN4D. Instead of decomposing spatial and temporal processing in every single layer, as done for facCNN4D, we can also decompose spatial and temporal processing globally. For MP-CNN4D, we first perform spatial processing for several layers while reducing all spatial dimensions. Then, the smaller 4D tensor is processed by a 4D CNN. In this way, costly 4D spatio-temporal processing is only performed after the spatial dimensions have already been reduced.

In detail, spatial processing is performed by using a multi-path 3D CNN where each path processes one 3D volume within the sequence individually. For efficiency, all 3D CNN paths share their weights. Intuitively, learning the same features for each volume should be effective as the volumes should share a lot of similarities. Afterward, higher-dimensional temporal dependencies can be captured by the 4D CNN.
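The weight sharing across paths can be sketched with a single 3D encoder applied to every volume of the sequence (the encoder layers below are illustrative placeholders, not the actual MP-CNN4D configuration):

```python
class SharedPathEncoder(tf.keras.layers.Layer):
    """Weight-shared 3D CNN applied to each volume of a 4D sequence (sketch)."""

    def __init__(self):
        super().__init__()
        # One 3D CNN whose weights are reused for every time step.
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Conv3D(16, 3, strides=2, padding="same", activation="relu"),
            tf.keras.layers.Conv3D(32, 3, strides=2, padding="same", activation="relu"),
        ])

    def call(self, x):
        # x: (batch, n_t, n_h, n_w, n_d, c); encode each volume with shared weights.
        volumes = tf.unstack(x, axis=1)
        encoded = [self.encoder(v) for v in volumes]
        # Spatially reduced 4D tensor, handed to the subsequent 4D CNN.
        return tf.stack(encoded, axis=1)
```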

Summary. Our three 4D CNN architecture concepts are shown in Figure 4.11 for a ResNet architecture. Overall, 4D CNNs are very challenging to design as their high-dimensional nature requires a large amount of memory and computational resources.

Also, extending the architecture by an additional dimension with normal convolutional layers would lead to a very high number of parameters, risking overfitting. Therefore, we also consider two more efficient variants of 4D CNNs. The first variant employs decomposed spatial and temporal convolutions in each layer throughout the entire network. The second variant splits spatial and temporal processing globally by performing spatial processing first to obtain spatially smaller feature maps. Then, the smaller volume sequence can be processed by a 4D CNN while still capturing the full 4D spatio-temporal context.

4.2 2.5D and 3.5D CNNs

Fig. 4.11: The three 4D CNN concepts we propose for 4D spatio-temporal data: (a) RN4D, (b) facRN4D, and (c) MP-RN4D. Here, we show the concepts in combination with ResBlocks. Each architecture receives a sequence of volumes $x \in \mathbb{R}^{n_t \times n_h \times n_w \times n_d \times 1}$ as input. T+S GAP refers to GAP over the temporal and spatial dimensions. Red indicates spatial processing, green indicates temporal processing, and yellow indicates joint processing.

Siamese CNNs. A lot of deep learning applications in biomedical research come with data that cannot be assigned to 2D or 3D, as one of the data dimensions is very small, usually with a size of two. In this case, the data is referred to as 2.5D or 3.5D if 2D or 3D images from two states or two time points are used. Thus, input examples have a size of $x \in \mathbb{R}^{2 \times n_h \times n_w \times n_d}$ for the case of 3D volumes. An example is the detection of disease progression between two images taken at two different time points. Intuitively, considering 2.5D data as 3D does not appear reasonable when using CNNs, as convolving over a dimension of size two with a kernel size $k \geq 2$ simply results in a fully-connected structure. Instead, another intuitive approach is to treat the two states similarly to color channels.

Here, the two states can be stacked in the CNN’s input channel dimension. We will demonstrate that this approach can be seen as a special case of a class of Siamese CNNs that we propose.

Originally, Siamese CNNs were introduced in the natural image domain for tasks such as person tracking [277, 580]. The idea is to use two parallel CNNs where each path processes one image. At some point, the two processed data representations are fused, for example, in a loss function that expresses similarity between the two input images. There are a lot of different possibilities for modifying this concept and adapting it for different applications. Next, we introduce a Siamese CNN, where we consider different fusion strategies for the two paths, parameter sharing, and transfer learning for this particular type of architecture.

Our Siamese CNN architecture is shown in Figure 4.12.


Fig. 4.12: The Siamese CNN architecture we propose. The model takes two images representing two different states as input. We consider both 2D and 3D images with 2D and 3D CNN variants. In the initial part, the two images are processed independently up to a fusion point $f_n$. At this point, the feature maps are aggregated and processed jointly. We consider different fusion points. Here, $f_n = 3$ is shown. At the output, GAP is applied to the remaining feature maps, and an FC-layer leads to the output. ResBlock refers to residual blocks.

Siamese CNN architectures take two images to be processed as input [277]. Then, the images are initially processed independently by the same set of learnable filters. At a fusion point, the feature maps of both images are aggregated and processed jointly by the remaining network layers [109, 580]. We consider the fusion point $f_n$ as a hyperparameter that we study. Performing fusion early in the network relates to the assumption that the two images are very similar.

The most extreme case is feature fusion at the input, by stacking the two states in the channel dimension. Conversely, late fusion is required if the two images are dissimilar and need to be processed individually first.

For fusion itself, we also consider several different operations. Straightforward fusion techniques include pixel- or voxel-wise addition and subtraction. Concatenation along the feature map dimension is another option, where the combination of the two paths' features can be learned. The concatenation of the two paths is defined as follows. For path 1, consider a feature map tensor $fm_1 \in \mathbb{R}^{N_b \times n_h \times n_w \times n_d \times n_c^1}$ where $N_b$ is the batch size, $n_h$, $n_w$, and $n_d$ are the feature maps' height, width, and depth, and $n_c^1$ is the number of feature maps. The tensor $fm_2$ is defined analogously. Thus, the concatenated tensor $fm_3 = fm_1 \,\|\, fm_2$ has a shape of $fm_3 \in \mathbb{R}^{N_b \times n_h \times n_w \times n_d \times (n_c^1 + n_c^2)}$. In our case, this doubles the number of feature maps, which is why we keep the number of feature maps constant in the following spatial reduction block in the network. This keeps the overall feature map sizes within the network at a reasonable level, despite the concatenation.
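As a small sketch of this fusion step (layer sizes are illustrative), the concatenation followed by a spatial reduction convolution that keeps the channel count constant could look as follows:

```python
def concat_fusion(fm1, fm2, n_c):
    """Concatenation fusion of two path feature maps (sketch).

    fm1, fm2: (N_b, n_h, n_w, n_d, n_c) feature maps from the two paths.
    """
    fm3 = tf.concat([fm1, fm2], axis=-1)  # channel dimension becomes 2 * n_c
    # The following spatial reduction block keeps n_c feature maps instead of
    # doubling them, so feature map sizes stay at a reasonable level.
    reduce_conv = tf.keras.layers.Conv3D(filters=n_c, kernel_size=3,
                                         strides=2, padding="same")
    return reduce_conv(fm3)
```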

Regarding general network design, after the initial convolution in the network, we employ blocks derived from different architecture concepts such as ResNet or DenseNet.

For spatial reduction within the network, we employ convolutions with a stride of 2.

For most tasks, we consider a purely contracting architecture that outputs a vector for classification or regression. Here, the spatial dimensions are repeatedly reduced up to a GAP layer, followed by an FC-layer. For segmentation tasks, we also consider Siamese architectures with an encoder-decoder structure. This is detailed further in the next section.

Siamese architectures can also be challenging in terms of computational and memory requirements, in particular for 3.5D data. One way to overcome this issue is to follow the approach of carefully designing efficient architectures, as we outlined for the case of extending 2D CNNs to 3D and for the design of 4D CNNs. Thus, architecture hyperparameters $h_{AM} \subset h_{M}$ are chosen by the machine learning engineer. The other approach we proposed is to utilize standard architectures and overcome the immense increase in model size by transfer learning. This resembles our multi-dimensional transfer learning strategy, where we reuse most architecture hyperparameters $h_{AM}$ from the original architecture.

This approach can also be adopted for Siamese CNN design. For example, we can make use of a standard ResNet or DenseNet model and transform it into a Siamese architecture. For this purpose, we clone the initial part of the CNN, before the fusion point. After the fusion point, we use a single path of the rest of the original CNN.

The CNN part before the fusion point does not need to be adjusted, apart from being cloned to obtain two parallel paths. However, after the fusion point, if concatenation is employed, the original model changes slightly, as twice the number of features needs to be processed. A majority of the network's weights can be kept identical after concatenation by immediately reducing the number of feature maps back to the size of the original pretrained reference architecture.

Still, the convolutional layer that downsamples the feature map dimension after the concatenation needs to be adjusted for effective transfer learning. This layer performs downsampling along the feature map dimension with a kernel size of 1 along all spatial dimensions. Consider two spatial 2D feature maps before concatenation from the two paths, each with shape $fm_1 \in \mathbb{R}^{N_b \times n_h \times n_w \times n_c^1}$ and $fm_2 \in \mathbb{R}^{N_b \times n_h \times n_w \times n_c^1}$. Thus, the downsampling convolution's weight tensor has the shape $w_s \in \mathbb{R}^{1 \times 1 \times 2n_c^1 \times n_c^2}$, where $n_c^2$ is the downsampled feature map size. We assign the original, pretrained weight tensor of shape $w_o \in \mathbb{R}^{1 \times 1 \times n_c^1 \times n_c^2}$ both to the sliced tensor of shape $w_s^{(1)} \in \mathbb{R}^{1 \times 1 \times (1 \dots n_c^1) \times n_c^2}$ and to $w_s^{(2)} \in \mathbb{R}^{1 \times 1 \times (n_c^1 \dots 2n_c^1) \times n_c^2}$. We show that this initialization does have a significant impact on performance.
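A minimal sketch of this weight reuse for the 2D case (names are illustrative; w_pretrained is assumed to be the 1x1 downsampling kernel of the original single-path model):

```python
import numpy as np

def duplicate_pretrained_1x1_weights(w_pretrained):
    """Initialize the enlarged 1x1 downsampling convolution after concatenation.

    w_pretrained: (1, 1, n_c1, n_c2) pretrained kernel of the reference model.
    Returns a (1, 1, 2 * n_c1, n_c2) kernel in which the pretrained weights are
    assigned to both halves of the enlarged input-channel dimension.
    """
    return np.concatenate([w_pretrained, w_pretrained], axis=2)
```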

Attention-Guided Interactions. The described architecture concept for regression and classification tasks is also applicable to segmentation problems with encoder-decoder architectures. Here, an intuitive approach is to use the point between encoder and decoder as the fusion point $f_n$. For the fusion point, we also consider different methods, including subtraction, addition, and concatenation. Another modification is the use of long-range connections, which are typical for segmentation architectures. Here, every long-range connection also needs to be augmented by a fusion strategy, similar to the normal fusion point.

In contrast to regression or classification problems, predicting a segmentation map requires accurate spatial correspondences between the two input images, as spatially consistent outputs need to be generated. We hypothesize that an additional focus on relevant, corresponding regions could improve the learning problem.


Fig. 4.13: The Siamese CNN architecture we propose for segmentation problems. The interaction modules are shown in Figure 4.14.

Fig. 4.14: The three types of attention-guided interactions we employ: attention methods A, B, and C. After the convolutions, a sigmoid activation function is applied.

Therefore, we also incorporate targeted information exchange between the encoder paths, see Figure 4.13. For this purpose, we propose a trainable attention-guided interaction block, which controls information flow between the two paths. Inspired by squeeze-and-excitation [205, 418] and spatial attention [522], the attention module learns a map that suppresses spatial locations to guide the network's focus on salient regions. We consider blocks at three different locations in the network and propose three different variants for information exchange, see Figure 4.14.

For the first block, each path guides the other path's focus independently (attention method A). Consider feature maps $fm_1$ and $fm_2$ of size $fm \in \mathbb{R}^{n_h \times n_w \times n_d \times n_c}$ at layer $l$ of each path. $n_h$, $n_w$, and $n_d$ are the spatial feature map sizes and $n_c$ is the number of feature maps at layer $l$. We compute the attention map for the first path as

$a_1 = \sigma(\mathrm{conv}(fm_1))$ (4.10)

where $\sigma$ is a sigmoid activation function and $\mathrm{conv}$ is a $1 \times 1 \times 1$ convolutional layer with learnable weights $w_M^1$. The map $a_1$ is of size $a_1 \in \mathbb{R}^{n_h \times n_w \times n_d \times n_c}$ and is multiplied element-wise with $fm_1$. Following the concept of residual attention [522], we finally add the modified feature map to the original feature map $fm_1$. Thus, we obtain

$\hat{fm}_1 = a_1 fm_1 + fm_1.$ (4.11)

In this way, the information in the original feature maps is preserved while the attention maps provide additional focus on relevant regions. The attention map for path 2 and $fm_2$ is computed symmetrically.
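The residual attention computation of Equations 4.10 and 4.11 can be sketched as follows (function and argument names are illustrative; for attention method A, the source and target feature maps are chosen per path as described above):

```python
def residual_attention(fm_source, fm_target, conv1x1x1):
    """Residual attention-guided interaction (sketch of Eqs. 4.10 and 4.11).

    fm_source: feature map the attention map is computed from.
    fm_target: feature map the attention map is applied to.
    conv1x1x1: Conv3D layer with kernel size 1 and n_c filters (weights w_M).
    """
    a = tf.sigmoid(conv1x1x1(fm_source))  # Eq. 4.10: attention map in [0, 1]
    return a * fm_target + fm_target      # Eq. 4.11: residual attention
```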

For the second block, we use both feature maps $fm_1$ and $fm_2$ for the computation of both attention maps (attention method B). Thus, we concatenate both feature tensors along the last dimension and compute

$a_1 = \sigma(\mathrm{conv}(fm_1 \,\|\, fm_2))$ (4.12)

with weights $w_M^1$. The attention map has the same dimensions as the ones computed for attention method A. Similar to method A, we use residual attention. We multiply the attention map with the original feature map $fm_1$ and add the original feature map to the result. The map $a_2$ is computed similarly with weights $w_M^2$.

Third, we consider a jointly learned attention map $a_1 = a_2$ (attention method C). We perform the same computation as for method B and share the weights $w_M^1 = w_M^2$. In this way, method C is more efficient in terms of the number of parameters, but the two paths receive attention towards the same regions.
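Methods B and C can be sketched together (class and argument names are illustrative assumptions); with shared_weights=True the two convolutions share their weights, giving the joint map of method C:

```python
class ConcatAttention(tf.keras.layers.Layer):
    """Attention methods B and C (sketch): maps computed from fm1 || fm2."""

    def __init__(self, n_c, shared_weights=False):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv3D(n_c, kernel_size=1)
        # Method C shares the weights (w_M^1 = w_M^2), so a_1 = a_2.
        self.conv2 = self.conv1 if shared_weights else tf.keras.layers.Conv3D(n_c, kernel_size=1)

    def call(self, fm1, fm2):
        joint = tf.concat([fm1, fm2], axis=-1)  # fm1 || fm2, as in Eq. 4.12
        a1 = tf.sigmoid(self.conv1(joint))
        a2 = tf.sigmoid(self.conv2(joint))
        # Residual attention applied per path.
        return a1 * fm1 + fm1, a2 * fm2 + fm2
```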

Summary. Several application scenarios we consider throughout this work come with 2.5D and 3.5D image data. For processing this type of data, we propose Siamese CNN architectures, which we adopted from the natural image domain. These models process the two images individually up to a fusion point, where we consider both the position of the fusion point and the fusion method as important hyperparameters. We employ this architecture for regression, classification, and segmentation problems. For the latter, we also introduce attention-guided interaction blocks for improved information exchange between the two images’ processing paths and an effective focus on relevant image regions.