

Fig. 4.7: Two types of long-range connections for the modules of IN3D are shown. Left, the transfer of features between stages is shown with a concatenation of features from different levels. Right, feature transfer through a long-range residual connection is shown. $n_{c1}$ denotes the number of input feature maps and $n_{c2}$ the number of output feature maps of the module. Pool indicates $2\times2\times2$ average pooling to match the module's spatial dimensionality reduction. Conv $1^3$ indicates $1\times1\times1$ convolutions for adjustment of the number of feature maps. For the residual connection, the convolution is applied with a stride of two to match the module's spatial dimensionality reduction.

of parameters. We augment IN3D further with multi-path blocks and long-range residual connections for improved performance. Lastly, RX3D shows how a network with little design effort compares to our similar but carefully tuned IN3D architecture. These architectures highlight how different design principles affect performance. Compared to the automated architecture learning approach for 1D data, our approach for 3D CNNs requires a lot of architecture engineering. This is traded off for more control in the design process, interpretable architecture variations, and computationally efficient models.

4.1.3 Multi-Dimensional Transfer Learning

While carefully handcrafting efficient architectures for 3D deep learning is a viable approach, it comes with downsides such as significant engineering effort. Another downside is the difficulty of employing transfer learning, which is very popular in the medical image analysis domain [447]. Here, the idea is to copy a 2D CNN architecture that was trained for a task in the natural image domain and (partially) retrain it for a target task in the medical image domain. As natural image datasets are generally large, the pretrained model has often learned very generic feature representations that can be effectively reused for a different task. In this type of setting, we reuse architecture hyperparameters $h_A$ that have been chosen for other problems. For 3D deep learning problems, this concept is difficult to use, as 3D image data is rare in the natural image domain. One approach for enabling transfer learning in 3D or higher data dimensions is to transfer 2D architectures to 3D by reusing weights. In the following, we propose an approach for this type of multi-dimensional transfer learning for 2D spatial and 3D spatio-temporal image data.

2D CNN Approaches. Similar to the extension of spatial 2D CNNs to 3D, we consider a regression or classification task where images are assigned to a category or continuous values are regressed from the image.


Fig. 4.8: The architecture of the RX3D model is shown. The modules contain four and five residual blocks, respectively, where the first block in each module reduces the spatial dimension by half and increases the feature map dimension by a factor of two. Conv $1^3$ and $3^3$ indicate filter sizes of $1\times1\times1$ and $3\times3\times3$, respectively. $n_c$ is the number of feature maps in a block or layer.

Instead of designing efficient 2D and 3D CNNs based on existing architecture concepts, we directly make use of already proposed architectures. As outlined in the previous section, this approach is generally problematic for medical image analysis tasks, as medical datasets are usually much smaller and there is a substantial risk of overfitting. This problem can be partially circumvented by making use of transfer learning. Here, the CNN is pretrained on the ImageNet dataset, which contains 1 200 000 images that have to be classified into $N_c = 1000$ classes. Then, the CNN is trained for the target task. For this purpose, the CNN's last layer is removed and replaced by a layer with the target number of outputs.
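As a minimal sketch of this retraining setup, assuming PyTorch and torchvision (version 0.13 or newer) with DenseNet-121 as one architecture from the pool introduced below, the last-layer replacement could look as follows; the target number of outputs is a placeholder assumption.

```python
# Minimal sketch of the 2D transfer learning setup described above.
# num_target_outputs is a hypothetical target task size, not from the text.
import torch.nn as nn
from torchvision import models

num_target_outputs = 1  # e.g., a single regression target (assumption)

# Load DenseNet-121 with ImageNet-pretrained weights.
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)

# Remove the 1000-class ImageNet classifier and replace it with a layer
# matching the target number of outputs; all other weights are retained
# and can be (partially) retrained on the target task.
model.classifier = nn.Linear(model.classifier.in_features, num_target_outputs)
```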

We consider a pool of pretrained CNN architectures, including the previously introduced ResNet [193] and ResNeXt (RX) [553], as well as the newer architectures DenseNet (DN) [208] and Squeeze-and-Excitation Networks (SE) [205]. The DenseNet model is focused on efficiency even more than previous architectures, as it relies on heavy feature reuse in each block. The idea of this architecture is to use all features from previous layers for the current layer $l$. The output of the $l$-th layer is computed as

$x_l = H([x_0, x_1, \ldots, x_{l-1}])$ (4.8)

where $H$ is the layer's transformation, including convolutions. In this way, the output feature maps of all previous layers are reused. This allows for significantly smaller feature maps overall and a reduced number of parameters. Due to this structure, the number of feature maps grows linearly with a growth rate $g_k$. Furthermore, the architecture uses compression layers in between dense blocks in order to keep the feature map sizes low. A $1\times1$ convolution reduces the number of feature maps by a factor $n_{c_{red}}$. This architecture also makes use of bottlenecks, as described for the ResNet model in the previous section.
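The following is a minimal sketch of this dense connectivity, assuming a BN-ReLU-Conv structure for $H$ as in the original DenseNet; the compression between blocks is included as a simple transition function, and all sizes are illustrative.

```python
# Minimal sketch of Eq. (4.8): layer l transforms the concatenation of all
# previous feature maps and contributes g_k (growth_rate) new maps.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, n_in, growth_rate, n_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm2d(n_in + l * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(n_in + l * growth_rate, growth_rate, 3, padding=1),
            )
            for l in range(n_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # x_l = H([x_0, x_1, ..., x_{l-1}])
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

# Compression between dense blocks: a 1x1 convolution reduces the number of
# feature maps by a factor n_c_red (value assumed), followed by pooling.
def transition(n_in, n_c_red=2):
    return nn.Sequential(
        nn.Conv2d(n_in, n_in // n_c_red, kernel_size=1),
        nn.AvgPool2d(2),
    )
```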


Fig. 4.9: The DenseNet and the SENet architecture are shown. FC-$\sigma$ is an FC-layer with a sigmoid activation function. Note that the original implementation of SENet uses a bottleneck with 2 FC-layers to keep the number of trainable parameters in a moderate range [205].

The SENet principle is an architecture augmentation that can be added to any existing design principle such as ResNets or DenseNets. The proposed SE-blocks perform a recalibration of a CNN's feature maps. In a normal convolutional layer, new feature maps are computed by individually convolving each previous map with a separate filter, followed by summation over the results. Thus, the combination of previous feature maps is implicit in the summation. Instead, SE-blocks explicitly learn a reweighting of a layer's feature maps. First, the feature maps are pooled into a feature vector with spatial GAP. Then, a sigmoid FC-layer performs a nonlinear transformation of the feature vector. The same-sized output vector is then multiplied with the original feature tensor, thus reweighting each feature map. For large feature map sizes $n_c$, the authors also use a bottleneck-like setup for the FC-layer where two subsequent FC-layers with $n_c/n_{c_{red}}$ neurons and $n_c$ neurons, respectively, are used.
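A minimal sketch of such an SE-block, using the bottleneck variant with two FC-layers and an assumed reduction factor $n_{c_{red}} = 16$, could look as follows.

```python
# Minimal sketch of an SE-block: spatial GAP, an FC bottleneck ending in a
# sigmoid, and channel-wise reweighting of the original feature maps.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, n_c, n_c_red=16):  # n_c_red = 16 is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_c, n_c // n_c_red),
            nn.ReLU(inplace=True),
            nn.Linear(n_c // n_c_red, n_c),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (N, n_c, H, W); pool spatially into one descriptor per channel
        s = x.mean(dim=(2, 3))           # spatial GAP -> (N, n_c)
        w = self.fc(s)                   # learned channel weights in (0, 1)
        return x * w[:, :, None, None]   # reweight each feature map
```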

The DenseNet and SENet architecture principles are shown in Figure 4.9.

3D CNN Approaches. As a next step, we propose an extension of the previously described state-of-the-art 2D CNNs to 3D for a 3D spatio-temporal learning task. Thus, instead of processing a single 2D image, the CNN receives a temporal sequence of 2D images. While the data is higher-dimensional, the 3D image tensor still consists of the same 2D images. This property opens up the idea of directly reusing 2D CNN architectures for 3D data. The straightforward extension can be performed by extending each convolutional kernel isotropically with an additional dimension. The identical architecture also opens the opportunity to perform multi-dimensional transfer learning.

Thus, we propose to initialize each 3D kernel with its 2D counterpart by copying the 2D kernel multiple times. In detail, the 3D convolutional kernels are initialized by copying the 2D kernels of size $K_{2D} \in \mathbb{R}^{k_h \times k_w \times k_c}$ that were pretrained on ImageNet exactly $k_d$ times into the new kernel of size $K_{3D} \in \mathbb{R}^{k_h \times k_w \times k_d \times k_c}$. Intuitively, this should work well, as a 3D convolution on a 3D stack of 2D slices now has the same effect as applying several stacked 2D convolutions to the 3D image stack.

However, the copied weights still need to be rescaled. Considering an individual voxel's value $\hat{i}_{v_j} = 1$ being computed by a 3D kernel that is convolved over a tensor containing only the value $1$, the resulting voxel value would be $\hat{i}_{v_j} = k_h k_w k_d$. With 2D processing and no inter-slice processing, the resulting value would be $\hat{i}_{v_j} = k_h k_w$.



Fig. 4.10: Overview of our approach. We use both 2D CNNs with 2D slices (top) and 3D CNNs with temporally stacked slices (bottom). Both are initialized with pretrained weights from ImageNet. The initial and Conv2D/Conv3D blocks have a different structure based on the respective architecture.

During the initial phases of training, this would lead to exploding values. To recover consistent value ranges, we therefore multiply the 3D kernels' weights by a factor of $1/k_d$. We show the proposed 3D extension approach in Figure 4.10 for the example application of LVQ. Here, the 2D image data consists of spatial 2D MRI slices, and the 3D spatio-temporal data consists of a temporal sequence of 2D MRI slices throughout a cardiac cycle. We also employ our method for transfer learning with weight initialization in the context of Siamese CNNs, which we introduce in the next section.
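A minimal sketch of this weight transfer, assuming kernels stored in PyTorch's (n_out, n_in, k_h, k_w) layout, could look as follows; applying it to every convolutional layer of a pretrained 2D CNN yields the initialization described above.

```python
# Minimal sketch of the 2D-to-3D kernel initialization: copy the pretrained
# 2D kernel k_d times along the new temporal dimension and rescale by 1/k_d
# so that activation value ranges stay consistent.
import torch

def inflate_kernel_2d_to_3d(w2d: torch.Tensor, k_d: int) -> torch.Tensor:
    # w2d: (n_out, n_in, k_h, k_w) -> w3d: (n_out, n_in, k_d, k_h, k_w)
    w3d = w2d.unsqueeze(2).repeat(1, 1, k_d, 1, 1)
    return w3d / k_d

# Example usage (conv2d pretrained, conv3d freshly constructed):
#   conv3d.weight.data.copy_(inflate_kernel_2d_to_3d(conv2d.weight.data, k_d=3))
```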

The 3D CNN input is of size $x \in \mathbb{R}^{n_t \times n_h \times n_w}$, where $n_h$ and $n_w$ are the sizes of the spatial slice dimensions and $n_t$ is the number of temporal slices. Throughout the entire network, we do not change the temporal dimension, as we produce $n_t$ predictions for the $n_t$ input slices in a single forward pass. For this purpose, we replace the linear output layer by a convolutional layer with kernel size $1$, which is able to handle arbitrarily-sized inputs.
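One way to realize such an output head, sketched here under the assumption that the spatial dimensions have already been pooled away so that features of shape $(N, n_c, n_t)$ remain, is a 1D convolution with kernel size 1; all sizes are illustrative.

```python
# Minimal sketch of the convolutional output head: a kernel-size-1
# convolution maps n_c feature maps to one prediction per temporal slice.
import torch
import torch.nn as nn

n_c, n_t, n_outputs = 2048, 10, 1  # feature maps, slices, targets (assumed)

head = nn.Conv1d(n_c, n_outputs, kernel_size=1)  # replaces the linear layer

features = torch.randn(4, n_c, n_t)  # (N, n_c, n_t), e.g., after spatial GAP
preds = head(features)               # -> (4, n_outputs, n_t): n_t predictions
```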

Table 4.1 shows different CNN architectures in their 2D and 3D configurations with the associated memory requirements and trainable parameters. Notice that our carefully handcrafted 3D CNNs introduced in the previous section came with 3 to 6 million parameters, while the straightforward extension performed here leads to 10 to 20 times that number of parameters. Also, the memory requirements often exceed currently available hardware, where the most commonly used GPUs have a memory size of 11 GB.

Due to the huge increase in memory and computational requirements, we only consider 3D variants of the smaller CNNs in our pool.
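Parameter counts like those in Tab. 4.1 can be reproduced for any PyTorch model by summing over its parameter tensors, as sketched below; exact numbers depend on the output head, so small deviations from the table are expected.

```python
# Count trainable parameters of a model (here DenseNet-121 as an example).
from torchvision import models

model = models.densenet121()
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"DN-121: {n_params / 1e6:.3f} M trainable parameters")
```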

Summary. We propose a method for the straightforward extension of 2D spatial CNNs to 3D spatio-temporal CNNs. As outlined in the previous section, this approach leads to a substantial increase in memory, computational requirements, and trainable parameters.

Tab. 4.1: Overview of several different CNN architectures for the extension from 2D to 3D. We show the number of trainable parameters $n_{params}$ in millions as well as memory requirements (Memory) in gigabytes for a batch size of $N_b = 10$ and a temporal dimension of $n_t = 10$.

                      2D                          3D
            nparams (M)  Memory (GB)    nparams (M)  Memory (GB)
DN-121          6.969        3.16          11.26        15.29
DN-161         26.51         5.78          39.47        27.88
DN-169         12.59         3.74          18.57        18.07
SE-RN-50       26.07         3.27          48.72        16.02
SE-RN-101      47.31         5.05          90.02        24.64
SE-RN-152      64.80         7.21         124.0         35.22
SE-RX-50       25.54         4.14          28.39        20.31
SE-RX-101      46.93         6.31          52.29        30.81
SENet         113.1         11.21         170.2         54.49

We try to overcome the problem of increased trainable parameters with a multi-dimensional transfer learning approach where we copy pretrained, scaled 2D kernels into a 3D CNN.