attracting increasing interest. We discuss the parallelizable nature of our proposed architecture in both training and inference contexts and show how this could be helpful when deploying the model in distributed environments.

We demonstrate the effectiveness of the multi-sequence-to-sequence network on three datasets. Two of these datasets contain climatological data measured at sensor stations spread across Quebec and Alberta. The third dataset contains energy load profiles from multiple zones of a smart energy grid. In our experiments on the sensor data, we show that the proposed architecture outperforms both purely local models for each agent and a single central model of the whole system. This can be explained by the fact that the local sub-models learn to adapt to the peculiarities of their respective sensor station while, at the same time, integrating relevant information from other stations through the interconnection layer, which allows the model to exploit cross-correlations between the data streams of multiple stations.

The remainder of this chapter is organized as follows. In Section 2.2, we explain both the architecture of our proposed model and the attention-based fusion mechanism. Section 2.3 elaborates on the model's distributed training and inference. In Section 2.4, we discuss related work. Section 2.5 shows the experimental settings and results for the different experiments. Section 2.6 concludes our work and discusses possible directions of future research.

of models end-to-end.

2.2.1 Multi-Encoder-Decoder RNNs

We consider the task of predicting multiple multivariate output sequences from multiple multivariate input sequences. The input sequences are represented by a three-way tensor $X \in \mathbb{R}^{E \times T_{enc} \times F_{enc}}$, where $E$ denotes the number of encoder devices, $T_{enc}$ denotes the encoder sequence length, and $F_{enc}$ is the number of encoder features. Similarly, the output sequences are represented by a three-way tensor $Y \in \mathbb{R}^{D \times T_{dec} \times F_{dec}}$, where $D$ denotes the number of decoder devices, $T_{dec}$ denotes the decoder sequence length, and $F_{dec}$ is the number of decoder features. In the case of multivariate streaming data from a sensor network, the value $X(i, t, j)$ corresponds to the $j$-th feature measured at the $i$-th sensor station at time $t$. Similarly, the value $\hat{Y}(i, t, j)$ corresponds to the predicted value of the $j$-th feature at the $i$-th output node at time $t$. If we consider, for example, the task of predicting the features of the next $T_{dec}$ values for all stations in a sensor network, then $D$ is the number of stations, $F_{dec}$ is the number of features, and $T_{dec}$ is the time period for which forecasts are made. The input and output feature spaces may be identical, i.e., a prediction of all the sensor values per sensor node, or they may differ, e.g., when a central control station makes predictions for larger parts of the system.
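To make the tensor layout concrete, the following minimal PyTorch sketch builds random input and target tensors with the shapes defined above; the concrete sizes are illustrative assumptions, not the settings used in the experiments of this chapter.

```python
import torch

# Illustrative sizes only (assumptions, not the experimental settings)
E, T_enc, F_enc = 5, 24, 3   # number of encoder devices, input length, input features
D, T_dec, F_dec = 5, 6, 3    # number of decoder devices, forecast horizon, output features

# X(i, t, j): j-th feature measured at the i-th sensing device at time t
X = torch.randn(E, T_enc, F_enc)
# Y(i, t, j): target value of the j-th feature at the i-th output node at time t
Y = torch.randn(D, T_dec, F_dec)

print(X.shape, Y.shape)  # torch.Size([5, 24, 3]) torch.Size([5, 6, 3])
```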

We propose a neural network architecture that extends the encoder-decoder framework. The general architecture consists of multiple encoder functions, an interconnection layer, and possibly multiple decoder functions. Each input-sensing device is modeled by an encoder function

$$f_{enc,i}(X(i,:,:)) = e_i, \quad \text{with } i \in \{1, 2, \ldots, E\}, \qquad (2.1)$$

which takes the data measured at the $i$-th sensing device as input and outputs a latent representation $e_i$. For each output device, an interconnection function $f_{con,j}$ combines the representations $\{e_i\}_{i=1}^{E}$ as

$$f_{con,j}(\{e_i\}_{i=1}^{E}) = c_j, \quad \text{with } j \in \{1, 2, \ldots, D\}. \qquad (2.2)$$

Figure 2.1: Unfolded multi-encoder-decoder recurrent neural network for multiple sequence-to-sequence prediction. (The diagram shows the encoders $f_{enc}$, the attention-based interconnection layer $f_{con}$, and the decoders $f_{dec}$, unrolled over the input and output time steps.)

Finally, for each output device, a decoder function $f_{dec,j}$ models the prediction using the respective combined representation $c_j$ as

$$f_{dec,j}(c_j) = \hat{Y}(j,:,:), \quad \text{with } j \in \{1, 2, \ldots, D\}. \qquad (2.3)$$

In this way, information between the different input and output sequences can be exchanged through the interconnection layer.

In order to predict the sequence of future sensor behavior given the previous observations, we implement the functions $f_{enc}$ and $f_{dec}$ using recurrent neural networks, as explained in Equation 1.6 in Section 1.2.2. Figure 2.1 presents the architecture of a multi-encoder-decoder recurrent neural network model. For the sequence-to-sequence prediction, we model each encoder and each decoder function using an RNN. The rectangles describe the hidden states of the RNNs, while the arrows indicate the transformations between the states. Each encoder RNN iterates over the sequence produced by the respective sensing node. Thus, the input of the $i$-th encoder RNN is $X(i, t, :)$. We define the last hidden state of the $i$-th encoder RNN to be the encoder output $e_i$. For each decoder RNN, a combined representation is computed by the respective interconnection function, which is represented by the large vertical box in the illustration. The interconnection layer outputs a combined representation for each decoder. These representations are then used as the initial hidden states of the decoder RNNs. The decoder output $\hat{Y}(i, t-1, :)$ is passed as an input to the $i$-th decoder RNN at the next time step $t$.
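One possible PyTorch realization of this architecture is sketched below, assuming GRU cells for the encoders and decoders and a zero vector as the first decoder input; class and parameter names are illustrative and not taken from the authors' implementation. The interconnection layer $f_{con}$ is passed in as a callable so that different fusion mechanisms (the mean and attention variants discussed below) can be plugged in.

```python
import torch
import torch.nn as nn

class MultiEncoderDecoder(nn.Module):
    """Sketch of a multi-encoder-decoder RNN: E encoder GRUs, an interconnection
    layer f_con, and D autoregressive decoder GRUs (Eqs. 2.1-2.3)."""

    def __init__(self, E, D, F_enc, F_dec, hidden_size, interconnect):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.GRU(F_enc, hidden_size, batch_first=True) for _ in range(E)])
        self.decoders = nn.ModuleList(
            [nn.GRUCell(F_dec, hidden_size) for _ in range(D)])
        self.readouts = nn.ModuleList(
            [nn.Linear(hidden_size, F_dec) for _ in range(D)])
        self.interconnect = interconnect  # f_con: list of e_i -> list of c_j

    def forward(self, X, T_dec):
        # X: (batch, E, T_enc, F_enc)
        # e_i: last hidden state of the i-th encoder RNN (Eq. 2.1)
        e = [enc(X[:, i])[1].squeeze(0) for i, enc in enumerate(self.encoders)]
        # c_j: combined representation for the j-th decoder (Eq. 2.2)
        c = self.interconnect(e)
        predictions = []
        for j, (dec, readout) in enumerate(zip(self.decoders, self.readouts)):
            h = c[j]                                     # initial decoder hidden state
            y_prev = torch.zeros(X.size(0), readout.out_features, device=X.device)
            steps = []
            for _ in range(T_dec):                       # autoregressive decoding (Eq. 2.3)
                h = dec(y_prev, h)                       # previous prediction as input
                y_prev = readout(h)
                steps.append(y_prev)
            predictions.append(torch.stack(steps, dim=1))   # (batch, T_dec, F_dec)
        return torch.stack(predictions, dim=1)              # (batch, D, T_dec, F_dec)
```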

In order to maintain the scalability of the architecture, the dimensionality of the merged decoder representation should be independent of the number of input channels. One canonical way to keep the dimensionality of the decoder representations the same as that of the encoder representations is to use a sum or mean operation, such that

$$c_j = \frac{1}{E} \sum_{i=1}^{E} e_i \quad \forall j \in \{1, \ldots, D\}. \qquad (2.4)$$

This implementation for fusing the representation vectors does not require additional learned parameters. Moreover, it can deal with a variable number of input channels, which is especially useful in settings where the number of input channels is not constant over time, e.g., moving devices that appear and disappear over time, or input devices that do not send any data, e.g., broken sensors. However, it computes the same representation for all decoders. Thus, a more advanced implementation of the interconnection layer is desired; it is described in the next section.
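As a minimal sketch, the mean fusion of Equation 2.4 could be implemented as follows and passed as the `interconnect` callable of the sketch above; for brevity it assumes that the number of decoders equals the number of encoders.

```python
import torch

def mean_interconnect(e):
    """Eq. 2.4: every decoder receives the same parameter-free mean of the encoder outputs."""
    c = torch.stack(e, dim=1).mean(dim=1)   # (batch, hidden); works for any number of encoders
    return [c] * len(e)                     # assumes D == E here; otherwise repeat D times
```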

2.2.2 Spatial Attention Mechanism

We propose an implementation of the interconnection layer using an attention mechanism. In this way, the combination of latent representations is not fixed for every prediction but changes depending on the current context, which is encoded in the input representations. The attention mechanism assesses the importance of the representations of the encoding devices $e_i$ and computes a weighted sum over the latent vectors

$$c_j = \frac{1}{E} \sum_{i=1}^{E} w_{ji} e_i. \qquad (2.5)$$

Figure 2.2: Illustration of the attention-based fusion layer with five encoder inputs and a single (the $j$-th) decoder output. The attention network on the left is shared for all inputs.

The vectors $e_i$ are multiplied element-wise with the weights $w_{ji} \in \mathbb{R}$. To adjust the weights $w_{ji}$ dynamically, we compute them using an additional attention function $f_{att}$. The attention function is implemented as a multi-layer perceptron, as described in Equation 1.1 in Section 1.2.2. The outputs of the attention function are normalized through a softmax function, such that

$$z_{ji} = f_{att,j}(e_i), \qquad (2.6a)$$

$$w_{ji} = \frac{\exp(z_{ji})}{\sum_{k=1}^{E} \exp(z_{jk})}. \qquad (2.6b)$$

Whether or not attention is put on a representation $e_i$ can vary for each prediction, depending on the information encoded in $e_i$. The approach draws inspiration from the attention-based machine translation model [8]; however, the attention is not applied across time but spatially across sensing devices, in order to address the dynamic fusion problem. The only parameters that need to be learned are those of the attention function. As all encoder representations are passed independently to the same attention function, the number of parameters is independent of the number of encoders.
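A possible PyTorch sketch of this spatial attention layer is shown below: one small scoring MLP per decoder (here two linear layers with a tanh nonlinearity, an assumed design not specified in the text) is applied to every encoder representation, the scores are softmax-normalized over the encoders, and the weighted mean forms $c_j$ as in Equation 2.5. All names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionInterconnect(nn.Module):
    """Sketch of the spatial attention fusion (Eqs. 2.5-2.6): one scoring MLP per
    decoder, shared across all encoder representations."""

    def __init__(self, D, hidden_size, att_hidden=32):
        super().__init__()
        self.att = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, att_hidden),
                          nn.Tanh(),
                          nn.Linear(att_hidden, 1))      # f_att,j
            for _ in range(D)])

    def forward(self, e):
        e = torch.stack(e, dim=1)                        # (batch, E, hidden)
        c = []
        for att_j in self.att:
            z = att_j(e).squeeze(-1)                     # z_ji            (Eq. 2.6a)
            w = torch.softmax(z, dim=1)                  # w_ji, normalized over encoders (Eq. 2.6b)
            c.append((w.unsqueeze(-1) * e).mean(dim=1))  # weighted combination incl. 1/E (Eq. 2.5)
        return c
```

The attention weights $w_{ji}$ produced in the forward pass can also be inspected to see which encoders a given decoder currently relies on.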

Figure 2.2 illustrates the attention-based interconnection layer for an architecture with five encoders and a single decoder. The attention network on the left is specific to the decoder. All representations $e_i$ are passed to the attention network separately; in practice, this can be implemented as a batch of inputs. For each encoder representation, a weight $w_{j,1}, \ldots, w_{j,5}$ is derived and then used in the fusion mechanism on the right. The output of the fusion mechanism is a weighted combination of the encoder representations, which is then passed to a decoder model.

Note that this mechanism can deal with a variable number of input devices. This is especially useful in settings where the number of input devices is not constant over time, e.g., moving devices that appear and disappear over time, or input devices that do not send any data, e.g., due to broken sensors.
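Because the scoring MLP is applied to each encoder representation independently and the softmax normalizes over however many representations are present, the hypothetical `AttentionInterconnect` sketched above accepts a varying number of encoder outputs without any changes:

```python
import torch

# Hypothetical usage of the AttentionInterconnect sketched above
att = AttentionInterconnect(D=3, hidden_size=64)

e_full = [torch.randn(8, 64) for _ in range(5)]   # all five sensing devices report
e_part = e_full[:3]                               # two devices missing or broken

c_full, c_part = att(e_full), att(e_part)         # both calls work without changes
print(c_full[0].shape, c_part[0].shape)           # torch.Size([8, 64]) torch.Size([8, 64])
```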

2.2.3 Model Training

The model is trained end-to-end in a supervised fashion by minimizing the negative log-likelihood of a historical training set $\mathcal{D} = \{(X^{(n)}, Y^{(n)})\}_{n=1}^{N}$ with respect to the model parameters of all encoders and decoders. The negative log-likelihood of i.i.d. training examples can be written as

$$l = -\sum_{n=1}^{N} \log p(Y^{(n)} \mid X^{(n)}, \Theta), \qquad (2.7)$$

where $\Theta = \{\Theta_{enc,i}\}_{i=1}^{E} \cup \{\Theta_{dec,j}\}_{j=1}^{D}$ is the set of parameters of all encoders and decoders. We assume the parameters of the $j$-th interconnection function to be part of $\Theta_{dec,j}$. Thus, we obtain

$$l = -\sum_{n=1}^{N} \log p(Y^{(n)} \mid X^{(n)}; \{\Theta_{enc,i}\}_{i=1}^{E}, \{\Theta_{dec,j}\}_{j=1}^{D}). \qquad (2.8)$$

Due to the conditional independence of the decoders, we can simplify (2.8) to

$$l = -\sum_{j=1}^{D} \sum_{n=1}^{N} \log p(Y^{(n)}(j,:,:) \mid X^{(n)}; \{\Theta_{enc,i}\}_{i=1}^{E}, \Theta_{dec,j}) = \sum_{j=1}^{D} l_j, \qquad (2.9)$$

obtaining a sum of negative log-probabilities that we refer to as $l_1, \ldots, l_D$. Since all components of the complete encoder-decoder architecture are differentiable, the network can be trained via backpropagation. By training the complete model end-to-end, each encoder learns to encode the relevant information for all decoding tasks, rather than solely learning a good model for its specific input sequence.
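As a concrete illustration, if one assumes a Gaussian output likelihood with fixed variance, the negative log-likelihood in Equation 2.9 reduces to a sum of per-decoder squared errors; a training step for the hypothetical modules sketched above could then look as follows (the likelihood choice, optimizer, and learning rate are assumptions for illustration).

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, X, Y):
    """One end-to-end update; the loss is a sum of per-decoder terms l_j (Eq. 2.9),
    here squared errors, corresponding to a fixed-variance Gaussian likelihood."""
    # X: (batch, E, T_enc, F_enc), Y: (batch, D, T_dec, F_dec)
    Y_hat = model(X, T_dec=Y.size(2))
    loss = sum(F.mse_loss(Y_hat[:, j], Y[:, j]) for j in range(Y.size(1)))
    optimizer.zero_grad()
    loss.backward()          # gradients flow through decoders, interconnection, and encoders
    optimizer.step()
    return loss.item()

# Hypothetical usage with the modules sketched earlier:
# model = MultiEncoderDecoder(E, D, F_enc, F_dec, hidden_size=64,
#                             interconnect=AttentionInterconnect(D, hidden_size=64))
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = training_step(model, optimizer, X_batch, Y_batch)
```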