
In this section, we give an overview of distributed deep learning. In the first subsection, we discuss deep learning and deep neural networks. In the next subsection, we highlight the common distribution schemes that are applied when training deep learning models in parallel.

2.2.1 Deep Learning

To avoid confusion between artificial intelligence (AI), machine learning (ML), and deep learning (DL), we first explain these three terms, which are often used interchangeably.

Artificial intelligence is a branch of computer science that aims to build and program computers to perform tasks that require human intelligence [83]. In recent times, AI has made practical achievements in several fields, such as automated reasoning [103], robotics [70], and natural language processing [119]. Machine learning is a core part of AI that includes statistical techniques and algorithms to empower machines to learn from user input data independently, and then use what they have learned to make informed decisions [18].

Narrowing further, DL is a specific approach of ML in which algorithms permit machines to learn and extract features from vast amounts of data by themselves [18]. Moreover, deep learning algorithms can adapt, through repetitive training, to discover hidden features in the data. To do so, deep learning predominantly relies on artificial neural networks (ANNs). ANNs progressively learn using neurons arranged in many layers, similar to how a human brain learns. In the next subsections, we highlight the architecture of the neural network model and the deep learning phases. Additionally, we look at the schemes that are followed in distributing deep learning.

Deep Neural Networks

Deep neural networks (DNNs) represent a class of machine learning models that has been rapidly evolving over the last couple of years. DNNs have proven that they can solve complex problems such as image understanding and machine translation. A DNN is an artificial neural network (ANN) with multiple layers of interconnected neurons between the input and the output layers. Figure 2.7 shows the architecture of a deep neural network. The DNN in the figure has an input layer of n neurons and an output layer of m neurons, with layers in between. The input layer is the first layer that receives the training data, e.g., the pixel values of an image, while the output layer represents the results that the trained DNN is intended to give, e.g., the category of the given image.

A layer in a DNN is shown in Figure 2.8. Each neuron in this layer represents a mathematical function that transforms a set of inputs into a set of outputs. The neural network is organized such that each neuron's output in one layer provides the input to the neurons in the following layer. Via this layering, the overall network represents a function f : x ↦ y that maps an input x entering the input layer to an output y leaving the output layer. The purpose of f is to provide an estimate of a target function f′, e.g., a classifier that can map an image (as input) to a category (as output).
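To make this layered composition concrete, the following minimal Python sketch (our illustration, not taken from the referenced works; the layer sizes and random values are arbitrary assumptions) builds f as a composition of two dense layers:

```python
import numpy as np

def dense_layer(x, W, b):
    # Each neuron computes a weighted sum of its inputs plus a bias.
    return W @ x + b

# Illustrative sizes: n = 4 input neurons, a hidden layer of 3, m = 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)

def f(x):
    # The network is the composition of its layers: f : x -> y.
    h = dense_layer(x, W1, b1)
    return dense_layer(h, W2, b2)

y = f(rng.standard_normal(4))  # maps an input vector to an output vector
```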

Figure 2.7: DNN Architecture. Figure 2.8: DNN Operator.

To train a DNN, the set of parameters known as the model parameters, i.e., the weights, the biases, and the thresholds of every artificial neuron, is adjusted such that the function f approximates the target function f′ with the best accuracy. The training process consists of two main steps. The first step is the forward propagation [74]: at the beginning of the training process, a set of weights and biases is randomly initialized and then used to calculate the output signal of each neuron for each example in a training batch (a batch is a subset of the whole dataset that is used for training). In each neuron, the weighted sum is calculated and summed up with the bias, as shown in Figure 2.8, before the summed value is passed to an activation function, e.g., Rectified Linear Units (ReLU). ReLU is a common activation function for deep learning models: if the summed value is negative, the activation function returns 0, otherwise it returns the value itself. This non-linearity allows the network to capture two aspects of the learning process: (1) interaction effects between the variables (inputs), i.e., when an input's value affects the predicted value differently depending on another input's value, and (2) non-linear effects, which result from the different values of the biases coming from the previous layer (nodes). The forward propagation step produces an approximated value in the output neurons for each input example. This approximated value is then used to evaluate the performance of the network by calculating the difference between the output of the target function f′ and the predicted output of the estimate f. The error function is defined as follows:

Error = ½ (f − f′)²    (2.1)
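The following is a minimal NumPy sketch of the forward propagation step and the error function (2.1) for a network with one hidden layer; the layer sizes, the random initialization scale, and the single-example setting are illustrative assumptions:

```python
import numpy as np

def relu(z):
    # ReLU returns 0 for negative inputs and the value itself otherwise.
    return np.maximum(0, z)

rng = np.random.default_rng(0)
# Randomly initialized weights and biases, as at the start of training.
W1, b1 = rng.standard_normal((16, 8)) * 0.1, np.zeros(16)
W2, b2 = rng.standard_normal((1, 16)) * 0.1, np.zeros(1)

def forward(x):
    # Weighted sum plus bias in each neuron, then the ReLU activation.
    h = relu(W1 @ x + b1)
    return W2 @ h + b2

x, target = rng.standard_normal(8), np.array([1.0])
f_x = forward(x)
# Error function (2.1): half the squared difference between estimate and target.
error = 0.5 * np.sum((f_x - target) ** 2)
```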

The second step is the back-propagation [96]. The goal of the back-propagation is to reduce the error in the estimate, i.e., the difference between the estimate f and the actual output f′. Since the actual output is constant, the only way to reduce the error is to change the predicted value. The predicted value depends on the weights and biases, and thus, in the back-propagation, gradient descent algorithms are used to calculate the gradient of the error function (2.1) with respect to the neural network's weights and biases [95]. The gradients are then propagated back through the network layer by layer to update the weights and biases. The adjusted weights and biases are then used in the forward propagation step of the next training iteration (batch).
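The back-propagation step can be sketched for the same one-hidden-layer network from the previous snippet; the hand-derived chain-rule steps and the learning rate below are simplifying assumptions for illustration, not a production implementation:

```python
# Continues the forward-pass sketch above (reuses W1, b1, W2, b2, relu).
lr = 0.01  # learning rate for the gradient descent update

def train_step(x, target):
    global W1, b1, W2, b2
    # Forward propagation, keeping intermediate values for the backward pass.
    z1 = W1 @ x + b1
    h = relu(z1)
    f_x = W2 @ h + b2
    # Gradient of Error = 0.5 * (f - f')**2 with respect to the estimate f.
    d_f = f_x - target
    # Propagate the gradient back layer by layer (chain rule).
    dW2 = np.outer(d_f, h)
    db2 = d_f
    d_h = W2.T @ d_f
    d_z1 = d_h * (z1 > 0)      # derivative of ReLU
    dW1 = np.outer(d_z1, x)
    db1 = d_z1
    # Gradient descent: adjust weights and biases against the gradient.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```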

2.2.2 Distribution Schemes

Training DNNs with vast amounts of training data is a resource- and time-expensive task. Thus, it is often performed on a distributed infrastructure that consists of several compute nodes, each of which might be provided with multiple GPUs to reduce the runtime of the training process [74]. DL comes with many possibilities for parallelization, and accordingly, several machine learning frameworks such as TensorFlow [15] and MXNet [21] support distributed DL by implementing several parallelization schemes. In this thesis, we mainly focus on the predominant schemes, model and data parallelism. Other parallelization schemes such as hybrid or pipeline parallelism are out of the scope of this thesis.

Figure 2.9: Model Parallelism. Figure 2.10: Data Parallelism.

Model Parallelism

Model parallelism refers to logically splitting a model into several parts (i.e., some layers in one part and some in another) and placing the parts on different nodes, as shown in Figure 2.9. In model parallelism, each worker node trains a part of the DL model on the full training dataset, i.e., the dataset is not sharded [74]. The worker that comprises the input layer of the DL model is fed with the training data. As explained in Section 2.2.1, in the forward propagation step, the output signal of each neuron is calculated and forwarded to the next layer. In model parallelism, the output signals of each layer are forwarded to the worker that comprises the next layer of the DL model. In the back-propagation, gradients are calculated at the worker that holds the output layer and propagated back to the workers that hold the hidden and input layers.
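As an illustration of this layer-wise split, the following hedged TensorFlow sketch places the two halves of a small model on two different devices; the device names, layer sizes, and two-GPU setup are assumptions for illustration (in a cluster, the parts would be placed on different worker nodes rather than local GPUs):

```python
import tensorflow as tf

# Hypothetical two-way split of one model: the first layers live on one
# device, the remaining layers on another (assumes two visible GPUs).
with tf.device('/device:GPU:0'):
    W1 = tf.Variable(tf.random.normal((784, 256)))
    b1 = tf.Variable(tf.zeros(256))

with tf.device('/device:GPU:1'):
    W2 = tf.Variable(tf.random.normal((256, 10)))
    b2 = tf.Variable(tf.zeros(10))

def forward(x):
    with tf.device('/device:GPU:0'):
        # Output signals of this part are sent to the device
        # that holds the next layer.
        h = tf.nn.relu(x @ W1 + b1)
    with tf.device('/device:GPU:1'):
        return h @ W2 + b2
```

Note that the intermediate activations h cross the device boundary on every forward pass; this is precisely the communication that makes an ineffective split expensive.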

As mentioned above, in model parallelism, the DL model is split. Thus, less memory is needed on each worker to store the parameters. That makes model parallelism the preferable choice when the complete DL model is too large to fit on a single node. However, model parallelism has disadvantages associated with the heavy communication between the workers. Additionally, according to Mirhoseini et al. [78], an ineffective split of the DL model between workers can lead to the stalling of workers due to communication overhead and synchronization delays. Consequently, increasing model parallelism might not result in a training speedup.

Data Parallelism

In contrast to model parallelism, each worker node in a data parallelism scheme has a replica of the DL model [74]. The training dataset is divided into distinct shards and fed into the model replicas of the workers, as shown in Figure 2.10. In data parallelism, each worker trains the model on its shard of data, and accordingly, each worker will have its own updates of the model parameters.

Figure 2.11: Parameter Server Architecture.

Since in data parallelism each worker trains the model parameters independently, a parameter synchronization between all workers is mandated. Parameter synchronization poses several challenges regarding how and when the parameters should be synchronized. According to R. Mayer et al. [74], the parameter server architecture is the most prominent architecture of parallel DL systems that is implemented to manage the parameter update process. The system roots of the parameter server architecture date back to the blackboard architecture [102] and MapReduce [35], as reported by Alex Smola [20]. In the parameter server architecture, there are two types of entities, i.e., workers and servers [59]. As shown in Figure 2.11, the model parameters are divided into shards and distributed to the parameter server(s), which can then be updated in parallel. Among the systems that use the parameter server architecture are TensorFlow [15], Apache MXNet [21], DistBelief [37], and SparkNet [80].
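To illustrate the interaction pattern, the following toy Python sketch implements a single parameter server with a pull/push interface and workers that compute gradients on their data shards; the linear model, the single server, and the sequential "workers" loop are simplifying assumptions (real systems shard the parameters across several servers and run the workers in parallel over the network):

```python
import numpy as np

class ParameterServer:
    # Holds (a shard of) the model parameters and applies updates.
    def __init__(self, params, lr=0.01):
        self.params, self.lr = params, lr

    def pull(self):
        # Workers fetch the current parameters before computing gradients.
        return self.params.copy()

    def push(self, grad):
        # Workers send gradients; the server updates the parameters.
        self.params -= self.lr * grad

def worker_gradient(params, shard_x, shard_y):
    # Gradient of a linear least-squares model on this worker's data shard.
    pred = shard_x @ params
    return shard_x.T @ (pred - shard_y) / len(shard_y)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 5)), rng.standard_normal(100)
shards = np.array_split(np.arange(100), 4)   # one data shard per worker
server = ParameterServer(np.zeros(5))

for step in range(10):
    for idx in shards:                       # each "worker" in turn
        params = server.pull()
        server.push(worker_gradient(params, X[idx], y[idx]))
```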

The main advantage of data parallelism is that it can be applied to any DL model without further domain knowledge of the model, as is required in model parallelism. Additionally, data parallelism scales well for DL models that are compute-intensive but have few parameters. However, data parallelism can be limited for those DL models that have many parameters, as the parameter synchronization becomes a bottleneck [64, 67].

2.2.3 Discussion

There are several machine learning frameworks, such as Google TensorFlow or Apache MXNet, that support distributed deep learning. However, they are hard to use: performing a scalable training job is difficult even for a data scientist. This is different for (D)DBMSs, which implement a distributed optimizer that finds an optimal execution strategy for each query [30]. In distributed DL, all users, including data scientists and inexperienced users, need to manually define the distributed execution strategy. Many users cannot do so, since they are required to go through a long and complex process to find the optimal servers-to-workers ratio. A user needs to take several aspects into consideration, such as the model, the dataset size, and the infrastructure used.

This intense user involvement makes these machine learning frameworks not yet ready for the cloud.

Thus, in the scope of our thesis work, we wanted to provide scalable support for AI in the cloud similar to that of a DBMS. In XAI, we introduced optimizers that perform jobs similar to those of the optimizers in a DBMS. In this case, the user of the machine learning framework no longer spends considerable time manually setting the training and distribution strategies of the deep learning job; XAI solves all these issues and delivers to the DL-framework users not only the optimal distribution strategy, but also the optimal hyperparameters, which are required for model accuracy and influence the efficiency of the training process.

Part II.

Scalable Data Analytics in the