Residual connection-based graph convolutional neural networks for gait recognition

(1)

https://doi.org/10.1007/s00371-021-02245-9 O R I G I N A L A R T I C L E

Residual connection-based graph convolutional neural networks for gait recognition

Md Shopon¹ ·A. S. M. Hossain Bari¹·Marina L. Gavrilova¹

Accepted: 3 July 2021 / Published online: 16 July 2021

Abstract

The walking manner of a person, also known as gait, is a unique behavioral biometric trait. Existing methods for gait recognition predominantly utilize traditional machine learning. However, the performance of gait recognition can deteriorate under challenging conditions including environmental occlusion, bulky clothing, and different viewing angles. To provide an effective solution to gait recognition under these conditions, this paper proposes a novel deep learning architecture using Graph Convolutional Neural Network (GCNN) that incorporates residual connections for gait recognition from videos. The optimized feature map of the proposed GCNN architecture exhibits the invariant property to viewing angle and subject’s clothing. The residual connection is used to capture both spatial and temporal features of a gait sequence. The kinematic dependency extracted from shallower network layer is propagated to deeper layer using residual connection-based GCNN architecture. The proposed method is validated on CASIA-B gait dataset and outperforms all recent state-of-the-art methods.

Keywords Gait Recognition·Behavioral biometric·Graph Convolutional Neural Network·Video Processing

1 Introduction

Over years, extensive attention was given to a person identification task in order to prevent fraudulent activities. Several techniques have been developed for verifying persons from video or image by considering various biometrics such as face, palm print, iris, face, and gait [1]. Among all the afore- mentioned traits, gait is unobtrusive and easily collectable biometric, that can be observed without any hindrance to the subject’s activity [2,3]. Gait recognition has been studied extensively in the domains of cybersecurity [4], computer graphics [5], assisted living [6], virtual reality [7], action recognition [8,9], emotion recognition [10], and medicine [11,12].

Domains of image processing, data analytics, and information security traditionally relied on the use of hand-crafted features. However, this reliance has recently been addressed

B

^{Md Shopon}

md.shopon@ucalgary.ca A. S. M. Hossain Bari asmhossain.bari@ucalgary.ca Marina L. Gavrilova mgavrilo@ucalgary.ca

1 University of Calgary, Calgary, Alberta T2N 1N4, Canada

by deep learning architectures including Recurrent Neu- ral Networks (RNNs) [13], Convolutional Neural Networks (CNNs) [14], and Autoencoders [15]. For example, images can be represented in the Euclidean space, and CNN can handle the composition of data, local connectivity, and shift- invariance [16] of data. Although CNN can effectively extract latent patterns from data, a CNN architecture is unable to extract feature maps from data represented in graph form. A citation network is a directed graph, where vertices represent documents and edges represent references of one publica- tion to another. Traditional machine learning algorithms fail to work on graph-like data because graphs can be irregular.

For instance, one of the crucial operations of CNN is convolution, and it is straightforward to apply this operation in image data but challenging to apply it in the graph. Due to this deficiency, deep learning methods for graphs have gained considerable importance. However, despite the evidence that Graph Convolutional Neural Networks (GCNN) are fit for various problems, there has not been a successful application of GCNN-based architectures for biometric gait recognition from videos. Moreover, the susceptibility of GCNN to varied walking conditions and low-quality video data has not been investigated. In addition, the spatiotemporal relationship between body joints has not been utilized so far for gait recognition. Such relationships contain distinguishable fea-

(2)

tures which can enhance the performance of gait recognition algorithms. This paper fills the research gaps identified above in the gait recognition research.

The contributions of this work can be summarized as follows. First, the proposed architecture utilizes the dynamic modality of the skeleton sequence to achieve high resistance to low-quality data, varied walking conditions, and other artifacts. Second, a novel architecture of graph convolutional neural network is proposed to capture the distinctive kinematic features for gait recognition. Third, residual connections are integrated within the architecture of the graph convolutional neural network to enhance the performance of the gait recognition by amplifying the high-level feature map of the deeper layer by adding the low-level feature map.

Thus, the required feature map does not suffer from diminishing gradient problems during optimization. Fourth, the global attention sum pooling layer is incorporated to ensure faster convergence while training and to overcome the issue of overfitting of graph convolutional neural network. Global attention sum pooling uses a sum operation to pool sets of features, leaving a smaller and distinctive number of features.

Publicly available CASIA-B gait dataset is used to evaluate the performance of the proposed method. Sets of experiments are conducted to confirm that the proposed method outperforms the recent state-of-the-art methods in terms of recognition accuracy.

2 Related work

Deep learning-based methods gained interest in gait recognition research in the last couple of years. Liao et al. [17]

developed a Pose-based Temporal-spatial Network (PTSN) architecture to handle specifically carrying and clothing variations and He et al. [18] utilized Multi-task Generative Adversarial Networks (MGAN) for gait recognition. More- over, Wu et al. [19] proposed a CNN with hand-crafted histogram and introduced deep features to fine-tune the network. In [20], autoencoder was used to extract discriminative features, and Long Short Term Memory (LSTM) was used for modeling spatiotemporal features. Instead of considering gait sequence as continuous, Chao et al. [21] regarded gait as a set of independent silhouettes. Their proposed method was capable of extracting invariant features such as speed and step distance from the set. Wolf et al. [22] proposed a 3D convolutional neural network for gait recognition which was capable of learning the gait features from different viewing angles. Zhan et al. [23] used Siamese Neural Networks to transform a sequence of images into gait energy images.

Authors employed a contrastive loss function for the given inputs, which allows the system to reduce the loss for similar- looking inputs and increase for different ones. Batistone et al. [24] proposed a time-graph-based LSTM network. This

method used a fully connected neural network (FCNN) and LSTM to extract skeleton points from a person’s image and then learned the joint features. A combination of hand-crafted geometric features and artificial neural network has proven to be effective for Kinect-based gait recognition [25]. A multi- temporal 3D Convolution Neural Network (3D-CNN) and a frame pooling method were recently proposed to handle gait recognition from videos [26]. Another work designed 3D-CNN-based method with spatial and temporal fusion for action recognition [27]. The advantage of the deep learning- based approaches is that spatial and temporal information can be gathered due to the specialized structure of the network.

However, the training costs of these deep learning-based methods are expensive. In addition, these methods do not explore the kinematic relationship between joints.

Pose estimation algorithms [28–30] extract body joint coordinates of a person from a video. Although numerous computer vision and image processing-based pose estimation algorithms have been proposed, the recently introduced deep learning pose estimation algorithms can identify body joint location better than earlier methods. Li et al. [31] used pose estimation to generate eighteen 2D coordinates of body joints and proposed a loss function to make the features cross view- invariant. Shaik et al. [32] used OpenPose [28] to extract hand-crafted features from the body joint coordinates, and train a deep learning model. Liao et al. [33] also exploited OpenPose to transform the video-based gait sequence to skeleton sequence and to extract static and dynamic spatiotemporal handcrafted features. CNN was then trained with the Softmax and center loss functions for the optimization.

Similarly, Mao et al. [34] used the same loss functions to train two GCNN architectures for gait recognition. Although these methods utilize the body joint relationships of a human body, the architectures have a large number of trainable parameters. Moreover, these methods consider only adjacent frames, whereas it is important to propagate the information into later frames to successfully identify a person. As the prior works were unable to address this issue, many existing methods suffer from performance deterioration. Our method overcomes this problem by leveraging GCNN to create high-level feature map. The proposed GCNN architecture utilizes residual connection to aggregate body joint relationships using identity mapping to not only adjacent frames but also the later frames.

Over the last few years, Graph Convolutional Neural Net- work has been applied in palmprint recognition [35], object segmentation [36], image denoising [37], hand gesture recognition [38], and person re-identification [39]. Zhang et al.

[40] worked on human activity recognition from skeleton sequence by combining spatial and temporal feature maps extracted by the two-stream GCNN architecture. Shi et al.

[41] represented the body joints of the human body as a Directed Acyclic Graph (DAG) and later, a GCNN was

(3)

trained using the graph as an input for the human activity recognition from video. The motivation behind representing the body joints as graph is based on the kinematic relationship between joints and bones in the human body. In [42], authors trained a GCNN architecture to learn the topo- logical structure from skeleton sequences by introducing a context encoding network for action recognition. While those works appeared recently, they were not applied to the gait identification problem from video sequences. Moreover, challenging conditions such as bulky clothing, accessories, and varied viewing angles have not been investigated. In [43], authors argued about utilizing recurrent unit instead of residual connection in GCNN. However, in their work, they have demonstrated the results only on datasets that are not related to computer vision or biometric recognition domains. There- fore, it is not guaranteed that incorporating recurrent units in GCNN will provide a better result for the gait recognition task.

Earlier, GCNN was considered to be applicable only on a dataset that has a textual relationship or has a graph-like form.

In this work, we demonstrate that GCNN can be applied to gait analysis as well. The proposed architecture is capable of gait recognition by utilizing the body joints relationship. In addition, the proposed method is also able to handle challenging conditions such as different viewing angles and carrying accessories while walking.

3 Methodology

In this paper, a skeleton-based method for gait recognition using Graph Convolutional Neural Network is proposed. This work introduces the transformation methodology in order to convert gait sequences into a graph-based structure. Resid- ual connection-based Graph Convolutional Neural Network (RGCNN) is proposed, which introduces residual connection in the graph convolutional layer. This allows to extract spatiotemporal features from joints and aggregates distinguishable body joint information from one frame to another to enhance the recognition accuracy. In addition to the residual connection, global attention sum pooling is incorporated after the last convolutional layer which drastically reduces the number of parameters of the network and allows training of the GCNN to converge faster. The overall architecture of the proposed system is presented in Fig.1.

3.1 Body joints and gait cycles extraction

We have used a pre-trained model of OpenPose [28] pose estimation network to extract the 25 body joints from videos (see Fig.2).

OpenPose is one of the fast multi-purpose pose estimation algorithms. The residual connections and global attention

Fig. 1 Overall architecture of the proposed system

Fig. 2 Skeleton joints extracted from videos using OpenPose algorithm

sum pooling layer contribute to the performance, as the high-level feature map in the deeper layer is amplified by adding the low-level feature map, and the resulting architecture does not suffer from diminishing gradient problems during optimization. The use of OpenPose in the proposed method further incorporates the dynamic modality of the skeleton sequence to achieve high resistance to the low- quality gait sequence, varied walking conditions, and other artifacts. Recent research [2] demonstrates increase in performance for varied walking conditions using OpenPose.

Other data normalization techniques, such as the silhouette- based method, are found to be more suitable for CNN-based architectures [19,21]. They are not optimal for the proposed Graph-based architecture because the transformation of the silhouette to graph representation requires adjacency matrix of higher dimension. Since the walking pattern of a human shows repeated patterns, the distinctive behavioral charac- teristic of an individual is revealed through the gait cycle [44,45]. Previous works [46] proved that the double support

(4)

Fig. 3 Body joint estimation of CASIA-B dataset videos for different walking conditions

phase method where the distance between both is maximized contains more relevant gait information in comparison to the mid-stance method. Thus, it is used in this work. Figure3 depicts pose estimation for different walking conditions.

3.2 Data representation

For skeleton-based gait recognition, existing methods model the data as a vector sequence. However, such representation ignores the kinematic dependency between bones and joints which we have utilized in this work. To fit the data into the GCNN layer, it first needs to be converted into a graph. In this study, the skeleton data is processed to represent a gait cycle as an undirected acyclic graph. In the graph, the body joints are represented as vertices, and bones are as edges. The edges are composed of two different subsets. The first subset represents the connection between intra-skeleton joints in each frame, denoted asES=

(vti, vt j)|(i,j)H , where H is the set of the body joints andvti andvt j are the vertices of current frame and next frame, respectively. The first subset of edges contains the spatial information of the skeleton sequence. The second subset of edges are the intra-frame edges, denoted asEF =

(vti, v₍t+1)i)|iH

. Here, all edges inEF represent its trajectory over the frame sequence. The second subset of edges contains the temporal information of the skeleton sequence. One benefit of the proposed graph construction method is that it retains the hierarchical representation of the skeleton sequences. This construction will represent the body joints as graph nodes and natural con- nectivities in both human body structures and movements as graph edges. The connections in this setup are thus naturally defined without the need for manual assignment.

3.3 Graph convolutional neural network and residual connection

Graph convolutional neural network is a type of convolutional neural network that is capable of taking advantage of graph structural information. Convolutional neural networks aim to extract features by convolving filters over the images,

whereas a GCNN passes a filter over the graph and extracts discriminative information to classify the graph. There are two types of Graph Convolutional Neural Networks. First, Spectral Graph Convolutional Network performs convolution operation by altering the representation of nodes into spectral domains [47]. Second, Spatial Graph Convolutional Neural Network performs convolution operation by considering the neighboring nodes representing the spatial features [48]. In the context of gait recognition, the changes in the neighboring nodes of skeleton joints play an important role in distinguishing individual subjects. Thus, Spatial Graph Con- volutional Neural Network is chosen to extract and propagate the relative changes of the neighboring nodes of joints. Con- sider an undirected GraphG = {V,E,A}, whereV is the set of vertices,E is the set of edges, andAis the adjacency matrix which represents the connection between the vertices and edges. Then, to model the representation of graphG, a graph Fourier transform is performed on the graph so that the fundamental operations such as convolution can be performed.

Feature propagation makes the GCNN a highly versatile architecture. At the beginning of each layer, the featureXiof nodeNiis averaged with the feature vectors of its neighborhood nodes, and this operation can be expressed by equation (1).

H⁽^l⁺¹⁾=σ

H⁽^l⁾+⁽^l⁾

(1)

H⁽^l⁾ is the activation matrix of thel^{t h} layer and ⁽^l⁾ is the feature matrix of thel^{t h} layer. After this operation, the feature vector of node Ni will contain information of its neighborhood nodes. However, the aggregated representation of a node does not include its own features. To resolve this problem, a self-loop is added to normalize the adjacency matrix and include node Ni’s features in the feature aggre- gation step.

A=D⁻¹^/²AD⁻¹^/² (2)

(5)

In equation (2)Dis the degree matrix ofAandAis denoted by A = A+ I where A is the adjacency matrix without self-loops andIis the identity matrix. The first block of convolutional layer extracts the spatial features which results a 1×Lconvolutional layer. Later this output is used to extract temporal information for the final feature map. HereLis the size of the temporal window. The concept of residual connection (ResNet) was first proposed by He et al. [49]. Prior works in the computer vision domain established that residual connection is guaranteed to provide high performance on video sequences [50]. Residual connection propagates information in targeted layers. Thus, the discriminative information is amplified using the residual connection. We introduce the residual connection in our architecture to extract spatiotemporal features from joints and to propagate spatiotemporal features to a deeper layer using residual learning. The identity map that is used in GCNN layer is different from the original ResNet [49]. In the original paper, the Hadamard product is used for concatenating the output of the activation layer with another layer. However, the dot product is used in the proposed method to concatenate the feature maps. In our method, the output of a layer is computed by performing a dot product between the actual output of that layer and the information that layer receives for identity mapping.

Applying this additive property results in faster convergence.

The idea of residual connection is to help with backpropagation. Since the skip connections propagate information to skipped layers, in the forward pass, information from layer l can directly be passed into layerl+t, fort ≥2, and, during the backward pass, the gradients can remain unchanged from layerl+t to layerl. Therefore, it stops the gradient from diminishing. The architecture of the GCNN can be considered a two-stream structure, where the first one learns only the kinematic features, and the other one learns the important spatiotemporal features. Therefore, the proposed GCNN architecture extracts the hierarchical representation of both kinematic and spatiotemporal features.

3.4 Proposed GCNN architecture

We propose a Residual Connection-based Graph Convo- lutional Neural Network (RGCNN) architecture for gait recognition shown in Fig. 4. The architecture takes two inputs, the first one is the adjacency matrix of the gait sequence, and the second one is the feature vector, comprised of the coordinates of each of the body joints. The input is fed into a RGCNN layer followed by a batch normalization regularizer. The purpose of batch normalization layer is to normalize each layer’s inputs values to have a mean output activation of zero and standard deviation of one. The activation function of the RGCNN layer is a Rectified Linear Unit (ReLU). Advantage of using ReLU is the reduced likelihood of the gradient to vanish. The constant gradient of ReLU

results in faster learning. These RGCNN layers learn the hierarchical representation which captures the spatiotemporal dynamics of the gait pattern. After that, the normalized inputs and the adjacency matrix are fed into subsequent RGCNN layer, which later goes into another batch normalization layer.

This feature map becomes the input of the pooling layer. In this study, Global Attention Sum Pooling (GASP) method is used to eliminate the contribution of the irrelevant nodes and to reduce the computation cost by reducing the size of the graph. The pooling layer can be expressed as follows:

x= N

i=1

σ (xw1+b1)

(xw2+b2)

i (3)

We choose to use sigmoid activation function denoted asσ in Equ. 3. Here, x is the feature map, N is the number of data samples, w is the weight, b is the bias, and x is the feature map after pooling. The output of the pooling layer is a one-dimensional vector which becomes the input of the Multi-Layer Perceptron (MLP) architecture. MLP architecture is designed with a stack of three layers. There are 512, 256, and 128 fully connected nodes in each of the layers of the MLP architecture. Moreover, dropout layer is added in between each fully connected layer to introduce regularization. Due to the incorporation of Global Attention Sum Pooling, the number of parameters in the proposed RGCNN architecture is low. The type and number of parameters of the proposed method is presented in Table1. Since the proposed architecture has low number of model parameters, it does not suffer from overfitting by extracting generalized character- istics of the walking pattern for the person identification. To calculate the loss between the predicted and original label, Categorical Cross-Entropy (CCE) loss is used for training.

The cross-entropy loss is defined as follows:

CC E Loss = N

i=1

C

c=1

τ (yi,c)×log P(c|Si1, ....,Si T)(4)

τ(yi,C)= 1, ifyi =c;

0, ifot herwi se (5)

Equation4produces the mean negative logarithm value of the predicted probability for theNtraining samples andSi1,Si2, Si3,....,Si T denotes the factorized input graphs ofi^{t h}training sample and their corresponding label yi, C is the number class and N denotes number of data sample, andc is the original label of the samples. Furthermore, Triplet Loss (TL) function is also employed for attaining better generalization.

Triplet loss function is defined as follows:

T L Loss= N

i=1

f_i^a− f_i^p²−f_i^a− f_iⁿ²+α (6)

(6)

Fig. 4 Architecture of the proposed RGCNN model

In Equ.6, f_i^a denotes anchor data sample. Anchor is the current data. f_i^p denotes a positive data sample which is similar to anchor data. Finally f_iⁿis a negative data sample which is different from the positive data.αis a bias value which is set to 0.7. The objective of this function is to keep the distance between the anchor and positive data samples smaller than the distance between the anchor and negative data samples.

Choosing proper hyperparameters in a deep learning architecture is crucial. First, RMSProp optimizer [51] is used to update the weights of the RGCNN model using backpropagation method. It is worth mentioning that Adam optimizer was tested with both the categorical cross-entropy and triplet-loss function, and was found to lead to similar performance. Second, learning rate is updated adaptively for better recognition accuracy. During the training, the learning rate is adjusted according to the validation loss and accuracy. Third, momentum for RMSProp optimizer is set to 0.05. Fourth, training the RGCNN architecture is conducted in mini-batches. The batch size is set to 64. Fifth, dropout rate is set 25% to introduce regularization with less oscillation.

3.5 Dataset

CASIA-B gait dataset, a standard popular dataset for gait recognition, is chosen for the experimentation. This dataset contains gait sequences of 124 subjects in 3 walking conditions and each recorded in 11 different viewing angles (0^◦, 18^◦, 36^◦, 54^◦, 72^◦, 90^◦, 108^◦, 126^◦, 144^◦, 162^◦, 180^◦).

The walking conditions are, (1) Normal Condition (NC)—6 Sequences Per Subject, (2) Bag Carrying (BC) Condition—2 Sequences Per Subject, (3) Cloth Carrying Condition (CC)—

2 Sequences Per Subject. Each subject has 110 (11 Viewing angle * (6 Normal Walking + 2 Bag Carrying + 2 Cloth Car- rying) different video sequences. In this work, the dataset is partitioned in three different settings according to [21] as follows:

• Small-Sample Training: In this setting, the first 24 subjects are used for training, and the rest 100 are used for testing.

• Medium-Sample Training: In this setting, the first 62 subjects are used for training, and the rest 62 are used for testing.

• Large-Sample Training: In this setting, the first 74 subjects are used for training, and the rest 50 are used for testing.

In all three settings, the first four normal walking conditions are kept in the gallery set (NC #1-4) and last two normal walking conditions for sequences 5 and 6 (NC #5-6) , bag carrying condition (BC #1-2), cloth carrying condition (CC

#1-2) are kept in the probe set. The output embedding vector of the second last MLP layer is utilized for testing purposes.

The training set is only used for the adjustment of the model weights. The data of the gallery set is first passed into the model to output the corresponding embedding vector then the data of the probe set is also put in the model to obtain the corresponding embedding vector, and the two embedding vectors are compared.

Si milar i t y(A,B)=

_N

i=1Ai×Bi

N

i=1A²_i ×N i=1B_i²

(7)

The distance between the embedding vectors of gallery and probe sets is obtained using cosine similarity measure (see Equ.7). In Equ.7,AandBare the feature vectors of the two images. The gallery data that results in the minimum distance is considered to be the class of that probe data.

3.6 Compared architectures

We test our method in a large-sample setting. Three architectures are utilized to prove that the proposed GCNN architecture performs better:

(7)

Table 1 Type and number of parameters of the proposed method

Type of parameter Number of parameter

Trainable parameters 204,491

Non-trainable parameters 1,536

Total Parameters 202,955

Table 2 Comparison of rank-1 accuracies with PTSN [17], MGAN [18], CNN-LB [19], GaitNet [20], Gaitset [21], GaitGraph [52] on CASIA-B dataset for normal, bag carrying and cloth carrying conditions

Gallery NC#1-4 0^◦- 180^◦ Mean

Probe 0^◦ 18^◦ 36^◦ 54^◦ 72^◦ 90^◦ 108^◦ 126^◦ 144^◦ 162^◦ 180^◦

NC #5-6 PTSN [17] 49.3 61.5 64.4 63.6 63.7 58.1 59.9 66.5 64.8 56.9 44.0 59.3

MGAN [18] 54.9 65.9 72.1 74.8 71.1 65.7 70.0 75.6 76.2 68.6 53.8 68.1

CNN-LB [19] 82.6 90.3 96.1 94.3 90.1 87.4 89.9 94.0 94.7 91.3 78.5 89.9

GaitNet [20] 91.2 92.0 90.5 95.6 86.9 92.6 93.5 96.0 90.9 88.8 89.0 91.6

GaitSet [21] 90.8 97.9 99.4 96.9 93.6 91.7 95.0 97.8 98.9 96.8 85.8 95.0

GaitGraph [52] 85.3 88.5 91.0 92.5 87.2 86.5 88.4 89.2 87.9 85.9 81.9 87.7

Proposed RGCNN + RMSProp + CCE 93.1 98.2 95.4 97.9 95.2 97.3 95.2 97.1 95.2 93.8 94.8 95.7 Proposed RGCNN + RMSProp + TL 94.75 98.48 96.85 98.31 96.78 98.88 96.92 98.77 97.90 93.87 95.85 97.03

BC #1-2 PTSN [17] 29.8 37.7 39.2 40.5 43.8 37.5 43.0 42.7 36.3 30.6 28.5 37.2

MGAN [18] 48.5 58.5 59.7 58.0 53.7 49.8 54.0 61.3 59.5 55.9 43.1 54.7

CNN-LB[19] 64.2 80.6 82.7 76.9 64.8 63.1 68.0 76.9 82.2 75.4 61.3 72.4

GaitNet [20] 83.0 87.8 88.3 93.3 82.6 74.8 89.5 91.0 86.1 81.2 85.6 85.7

GaitSet [21] 83.8 91.2 91.8 88.8 83.3 81.0 84.1 90.0 92.2 94.4 79.0 87.2

GaitGraph [52] 75.8 76.7 75.9 76.1 71.4 73.9 78.0 74.7 75.4 75.4 69.2 74.8

CC #1-2 PTSN [17] 18.7 21.0 25.0 25.1 25.0 26.3 28.7 30.0 23.6 23.4 19.0 24.2

MGAN [18] 23.1 34.5 36.3 33.3 32.9 32.7 34.2 37.6 33.7 26.7 21.0 31.5

CNN-LB [19] 37.7 57.2 66.6 61.1 55.2 54.6 55.2 59.1 58.9 48.8 39.4 54.0

GaitNet [20] 42.1 58.2 65.1 70.7 68.0 70.6 65.3 69.4 51.5 50.1 36.6 58.9

GaitSet [21] 61.4 75.4 80.7 77.3 72.1 70.1 71.5 73.5 73.5 68.4 50.0 70.4

GaitGraph [52] 69.6 66.1 68.8 67.2 64.5 62.0 69.5 65.6 65.7 66.1 64.3 66.3

Original GCNN: In the first experiment, original GCNN without residual connection or any pooling layer is imple- mented. This architecture is considered to illustrate the performance of the original GCNN on skeleton-based gait recognition.

GCNN with residual connection: In the second architec- ture studied, residual connection in the GCNN layer was integrated. However, no pooling layer was incorporated. The purpose of considering this architecture is to show the sig- nificance of the pooling layer in RGCNN architecture.

GCNN with residual connection and global attention sum pooling: The third architecture is our proposed architecture where the global attention sum pooling layer after the last RGCNN layer is integrated.

3.6.1 Convergence, accuracy, and loss experiments

The proposed RGCNN architecture is separately optimized by minimizing the triplet loss and categorical cross-entropy loss functions (see Table 2). Triplet loss provides better performance on each of the walking conditions, and thus shows better generalization than categorical cross-entropy loss function. The performance comparison between categorical cross-entropy loss and triplet loss is shown in Table 3. Besides, Fig.5depicts the learning curves of training and validation sets on three proposed deep learning architectures using triplet loss. The learning curve depicts the performance of the proposed method over 1000 iterations during training. One key observation in the proposed RGCNN is that it

(8)

Table 3 Comparison of Triplet Loss (TL) function and Categorical Cross-entropy (CCE) loss function

Walking Condition CCE Loss TL Loss

Normal condition 96.75% 97.03%

Bag carrying Condition 89.56% 90.77%

Cloth condition 87.20% 89.90%

exhibits low fluctuation of loss and accuracy compared to original GCNN and GCNN with residual connection. In the loss curve of our proposed model shown in Fig.5, we can observe that loss starts at 5.0812 cost and gradually decreased to 0.444 cost over time. It is worth mentioning that with the reduction in loss, the accuracy increases. During the training process, with the training loss the validation loss also decreases gradually which states that our method is not overfitting. Therefore, our proposed method performs better than the original GCNN.

Table4shows the achieved results for the three architectures mentioned earlier. The proposed GCNN architecture attained 14.71 and 11.64% higher accuracy for normal walking condition by the RGCNN over the original GCNN and GCNN with the residual connection, respectively. For bag carrying condition, 14.31 and 9.01% higher accuracy was obtained over the original GCNN and GCNN with residual connection. For cloth carrying condition, the proposed method attained 14.67 and 9.84% higher accuracy than original GCNN and GCNN with residual connection, respectively.

3.6.2 Cumulative match curve (CMC) for different conditions

The CMC score is computed to demonstrate the ranking ability of the proposed identification method. The CMC curve represents a biometric identification system’s accuracy for different ranks. Rank K recognition rate is the probability that a prediction is accurate when considering the top K matches.

The different performances at each rank on different walking conditions are shown in Fig.6. Typically, this curve will converge to 99.9% when considering higher ranks, as the cri- teria becomes more lenient. For normal walking condition, the Rank-10 accuracy of the proposed method is 100%. For bag carrying condition the Rank-10 accuracy is 98.10% and for cloth carrying condition the Rank-10 accuracy is 96.20%.

3.7 Comparison against state-of-the-art methods Most of the existing video-based methods use silhouettes as features. Although these features contain essential information for classifying individuals, they do not model the kinematics relationship between different body parts. The

proposed method is compared against the five most recent methods particularly developed to handle exceptional conditions: the 2017 Pose-based Temporal-Spatial Network (PTSN) [17], the 2018 Multi-task GAN (MGAN)[18], the 2016 CNN-LB[19], the 2019 GaitNet [20], the 2019 GaitSet [21], and the 2021 GaitGraph [52]. The above compara- tors were chosen because they applied deep learning-based architectures on video-based dataset and showed superior performance to previous methods. As we have used Open- Pose algorithm that extracts the 3D body joints regardless of the carried accessories, our proposed method does not depend on the silhouette. Table5shows the precision, recall and F score of the proposed method for normal, bag carrying and cloth carrying condition. The recognition accuracy of the proposed method for normal walking condition is 97.03%; the precision is 96.08%; the recall is 95.40%; the F-score is 94.84%. For bag carrying condition recognition accuracy is 90.77%; the precision is 89.90%; the recall is 89.36%; the F-score is 89.08%. For cloth carrying condition the recognition accuracy is 89.90%; the precision is 87.52%;

the recall is 87.90%; the F-score is 86.0%. For normal walking condition the proposed method achieved on average 2%

higher accuracy over best performing GaitSet [21] and 5, 7, 24, 7, and 9% over GaitNet, CNN-LB, MGAN, PTSN, and GaitGraph, respectively. Recognition accuracy for the bag carrying condition of the proposed GCNN is on average 3% higher than previously best performing GaitSet and 5, 18, 36, 53, and 15% higher than GaitNet, CNN-LB, MGAN, PTSN, and GaitGraph, respectively. For the cloth carrying conditions, the proposed method attained on average 19%

higher accuracy than previously best performing GaitSet and 31, 35, 58, 65, and 22% higher accuracy than Gait- Net, CNN-LB, MGAN, PTSN, and GaitGraph, respectively.

The method shows consistent performance across all conditions and views, proving that the system is view-invariant.

The comparison of rank-1 accuracy of gait recognition methods on CASIA-B gait dataset for normal, bag carrying and cloth carrying condition is shown in Fig.7. Furthermore, our proposed method not only achieves higher recognition accuracy but also secures higher precision, recall, and F-score on CASIA-B dataset.

Table2 also depicts a thorough comparison of the proposed architecture for different viewing angles against the compared methods. For normal condition, the proposed

(9)

Table 4 Rank-1 accuracies of original GCNN, GCNN with residual connection and proposed RGCNN model for normal (NC), bag carrying (BC) and cloth carrying (CC) conditions

Testing Accuracy

Settings NC#5-6 BC#1-2 CC#1-2

Original GCNN 82.32% 76.46% 75.23%

GCNN+

Residual Connection 85.39% 81.76% 80.06%

Proposed RGCNN 97.03% 90.77% 89.90%

Fig. 5 Accuracy and loss curve for original GCNN, GCNN with residual connection and proposed RGCNN

Table 5 Precision, recall and F score of proposed method for normal, bag carrying and cloth carrying conditions on CASIA-B dataset

Walking condition Precision Recall F Score

Normal Condition 96.08% 95.40% 94.84%

Bag Carrying Condition 89.90% 89.36% 89.08%

Cloth Condition 87.52% 87.90% 86.06%

method attained 98.48% for 18^◦, 98.77% for 126^◦ and 98.88% for 90^◦angles. For bag carrying condition the method achieved 93.81% for 18^◦, 91.92% for 108^◦and 92.36% 144^◦ angles. For cloth carrying condition the method achieved 91.14% for 18^◦, 89.21% for 108^◦and 91.02% 144^◦angles.

Thus, it can be concluded that the proposed method is capable of handling challenging walking conditions better than all recent state-of-the-art methods.

3.8 Reasons for improved performance

It is evident from Table2 that the proposed GCNN architecture shows a significant performance gain over the state- of-the-art methods. One key reason for such improvement is the ability of the proposed architecture to extract distinguishable spatial and temporal patterns. Besides, GCNN finds the weights of each of the body joints (vertex) using convolutional operation which is optimized using the RMSProp optimizer to minimize the triplet loss. Furthermore, GCNN

(10)

Fig. 6 CMC curve for Rank-1 to Rank-10 accuracy for normal, bag carrying and cloth carrying conditions of the proposed method

with residual learning ensures the low model parameter count and helps overcome the vanishing gradient problem. Since the number of parameters of the proposed architecture is very low, the overfitting problem is mitigated, and running time is significantly reduced. During optimization, the most contributing feature map is propagated to subsequent layers using identity mapping. This increases the performance during the challenging condition, whereas the silhouette is hampered with occlusion.

4 Conclusion

In this work, we proposed a novel residual connection-based graph convolutional neural network with global attention sum pooling method. The RGCNN method aggregates the feature map from one frame to another by utilizing the kinematics relationship among different body joints to enhance recognition accuracy. Performance comparison on publicly available CASIA-B dataset establishes that the proposed method is superior to all recent state-of-the-art methods. The proposed method shows performance improvement not only for the normal walking condition but also the challenging scenarios such as bag-carrying and cloth-carrying conditions.

It also performs extremely well under varying view angles.

In the future, we will study other architectures to extract more discriminative feature maps, and we will investigate method performance without retaining hierarchical feature map on low-quality videos. Furthermore, the introduction of the recurrent unit to GNN architecture can be investigated. Incorporating video sequences in addition to skeleton

sequences is another direction for future research. ^{Fig. 7} Comparison of rank-1 recognition accuracy (%) of gait recognition methods on CASIA-B dataset for normal, bag carrying and cloth carrying conditions

(11)

Acknowledgements The authors acknowledge the Natural Sciences and Engineering Research Council (NSERC) Discovery Grant funding, as well as the NSERC Strategic Partnership Grant (SPG) and the Innovation for Defense Excellence and Security Network (IDEaS) for the partial funding of this project.

References

1. Xiao, Q.: Technology review-biometrics-technology, application, challenge, and computational intelligence solutions. IEEE Com- put.l Intell. Mag.2(2), 5–25 (2007)

2. Boulgouris, N.V., Hatzinakos, D., Plataniotis, K.N.: Gait recognition: a challenging signal processing technology for biometric identification. IEEE Sig. Process. Mag.22(6), 78–90 (2005) 3. Ahmed, F., Paul, P.P., Gavrilova, M.L.: DTW-based kernel and

rank-level fusion for 3d gait recognition using kinect. The Vis.

Comput.31(6), 915–924 (2015)

4. Obaidat, M.S., Traore, I., Woungang, I.: Biometric-Based Physical and Cybersecurity Systems. Springer, Berlin (2019)

5. Zhang, Y., Zheng, J., Magnenat-Thalmann, N.: Example-guided anthropometric human body modeling. The Vis. Comput.31(12), 1615–1631 (2015)

6. Seifert, A.-K., Zoubir, A. M., Amin, M. G.: Radar-based human gait recognition in cane-assisted walks, in2017 IEEE Radar Con- ference (RadarConf), pp. 1428–1433, IEEE, (2017)

7. Thalmann, N. M., Thalmann, D.: Modeling behaviour for social robots and virtual humans, in SIGGRAPH ASIA 2016 Courses, pp. 1–231, ACM, (2016)

8. Rahman, M. W., Gavrilova, M. L.: Kinect gait skeletal joint feature-based person identification, in 2017 IEEE 16th international conference on cognitive informatics & Cognitive Computing (ICCI* CC), pp. 423–430, IEEE, (2017)

9. Ahmed, F., Paul, P. P., Gavrilova, M. L.: Joint-triplet motion image and local binary pattern for 3d action recognition using kinect,’

in Proceedings of the 29th international conference on computer animation and social agents, pp. 111–119, (2016)

10. Ahmed, F., Bari, A.H., Gavrilova, M.L.: Emotion recognition from body movement. IEEE Access8, 11761–11781 (2019)

11. Cho, C.-W., Chao, W.-H., Lin, S.-H., Chen, Y.-Y.: A vision-based analysis system for gait recognition in patients with Parkinson’s disease. Exp. Syst. Appl.36(3), 7033–7039 (2009)

12. Gavrilova, M. L., Ahmed, F., Bari, A. H., Liu, R.,Liu, T., Maret, Y., Sieu, B. K., Sudhakar, T.: Multi-modal motion-capture-based biometric systems for emergency response and patient rehabilitation, in Research Anthology on Rehabilitation Practices and Therapy, pp. 653–678, IGI global, (2021)

13. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput.9(8), 1735–1780 (1997)

14. LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. The Handbook Brain Theory Neural Netw.

3361(10), 1995 (1995)

15. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.- A., Bottou, L.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.

J. Mach. Learn. Res. vol. 11, no. 12, (2010)

16. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: going beyond euclidean data. IEEE Sig. Process. Mag.34(4), 18–42 (2017)

17. Liao, R., Cao, C., Garcia, E. B., Yu, S., Huang, Y.: Pose-based temporal-spatial network (PTSN) for gait recognition with carrying and clothing variations, in Chinese conference on biometric recognition, pp. 474–483, Springer, (2017)

18. He, Y., Zhang, J., Shan, H., Wang, L.: Multi-task gans for view- specific feature learning in gait recognition. IEEE Trans. Inform.

Foren. Secur.14(1), 102–113 (2018)

19. Wu, Z., Huang, Y., Wang, L., Wang, X., Tan, T.: A comprehen- sive study on cross-view gait based human identification with deep CNNs. IEEE Trans. Pattern Anal. Mach. Intell.39(2), 209–226 (2016)

20. Zhang, Z., Tran, L., Yin, X., Atoum, Y., Liu, X., Wan, J., Wang, N.:

Gait recognition via disentangled representation learning, in IEEE conference on computer vision and pattern recognition, pp. 4710–

4719, (2019)

21. Chao, H., He, Y., Zhang, J., Feng, J.: Gaitset: regarding gait as a set for cross-view gait recognition. AAAI Conf. Artif. Intell.33, 8126–8133 (2019)

22. Wolf, T., Babaee, M.,Rigoll, G.: Multi-view gait recognition using 3d convolutional neural networks, in 2016 IEEE international conference on image processing (ICIP), pp. 4165–4169, IEEE, (2016) 23. Zhang, C., Liu, W., Ma, H., Fu, H.: Siamese neural network based gait recognition for human identification, in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 2832–2836, IEEE, (2016)

24. Battistone, F., Petrosino, A.: TGLSTM: a time based graph deep learning approach to gait recognition. Pattern Recog. Lett.126, 132–138 (2019)

25. Bari, A.H., Gavrilova, M.L.: Artificial neural network based gait recognition using kinect sensor. IEEE Access7, 162708–162722 (2019)

26. Lin, B., Zhang, S., Bao, F.: Gait recognition with multiple- temporal-scale 3d convolutional neural network, in 28th ACM international conference on multimedia, pp. 3054–3062, (2020) 27. Tu, Z., Xie, W., Qin, Q., Poppe, R., Veltkamp, R.C., Li, B., Yuan,

J.: Multi-stream CNN: learning representations based on human- related regions for action recognition. Pattern Recog.79, 32–43 (2018)

28. Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., Sheikh, Y.: Openpose:

realtime multi-person 2d pose estimation using part affinity fields, arXiv:1812.08008, (2018)

29. Alp Güler, R., Neverova, N., Kokkinos, I.: Densepose: dense human pose estimation in the wild, in IEEE conference on computer vision and pattern recognition, pp. 7297–7306, 2018

30. Toshev, A., Szegedy, C.: Deeppose: human pose estimation via deep neural networks, in IEEE conference on computer vision and pattern recognition, pp. 1653–1660, (2014)

31. Li, N., Zhao, X., Ma, C.: A model-based gait recognition method based on gait graph convolutional networks and joints relationship pyramid mapping,arXiv:2005.08625, (2020)

32. Shaik, S.: OpenPose based gait recognition using triplet loss architecture. PhD thesis, Dublin, National College of Ireland, (2020) 33. Liao, R., Yu, S., An, W., Huang, Y.: A model-based gait recognition

method with body pose and human prior knowledge. Pattern Recog.

98, (2020)

34. Mao, M.,Song, Y.: Gait recognition based on 3d skeleton data and graph convolutional network, in International joint conference on biometrics, pp. 1–8, IEEE, 2020

35. Shao, H., Zhong, D.: Few-shot palmprint recognition via graph neural networks. Electron. Lett.55(16), 890–892 (2019) 36. Wang, W., Lu, X., Shen, J., Crandall, D. J., Shao, L.: Zero-shot

video object segmentation via attentive graph neural networks, in ieee international conference on computer vision, pp. 9236–9245, (2019)

37. Valsesia, D., Fracastoro, G.,Magli, E.: Image denoising with graph-convolutional neural networks, in 2019 IEEE international conference on image processing (ICIP), pp. 2399–2403, IEEE, (2019)

38. Zhang, W., Lin, Z., Cheng, J., Ma, C., Deng, X., Wang, H.:

STA-GCN: two-stream graph convolutional network with spatial-

(12)

temporal attention for hand gesture recognition. The Vis. Comput.

36(10), 2433–2444 (2020)

39. Shen, Y., Li, H., Yi, S., Chen, D., Wang, X.: Person re-identification with deep similarity-guided graph neural network, in European conference on computer vision (ECCV), pp. 486–504, 2018 40. Zhang, J., Ye, G., Tu, Z., Qin, Y., Zhang, J., Liu, X., Luo, S.: A spa-

tial attentive and temporal dilated. SATD) gcn for skeleton-based action recognition, CAAI Transactions on Intelligence Technology (2020)

41. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Skeleton-based action recognition with directed graph neural networks, in IEEE conference on computer vision and pattern recognition, pp. 7912–7921, (2019) 42. Ye, F., Pu, S., Zhong, Q., Li, C., Xie, D., Tang, H.: Dynamic GCN:

context-enriched topology learning for skeleton-based action recognition, in 28th ACM international conference on multimedia, pp. 55–63, (2020)

43. Huang, B., Carley, K. M.: Residual or gate? towards deeper graph neural networks for inductive graph representation learning, arXiv:1904.08035, (2019)

44. Murray, M.P.: Gait as a total pattern of movement: including a bibliography on gait. Am. J. Phys. Med. Rehab.46(1), 290–333 (1967)

45. Murray, M.P., Drought, A.B., Kory, R.C.: Walking patterns of normal men. JBJS46(2), 335–360 (1964)

46. BenAbdelkader, C., Cutler, R.G., Davis, L.S.: Gait recognition using image self-similarity. EURASIP J. Adv. Sig. Process.

2004(4), (2004)

47. Kipf, T. N., Welling, M.: Semi-supervised classification with graph convolutional networks,arXiv:1609.02907, (2016)

48. Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs,arXiv:1312.6203, (2013)

49. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition, in IEEE conference on computer vision and pattern recognition, pp. 770–778, (2016)

50. Pouyanfar, S., Wang, T., Chen, S.-C.: Residual attention-based fusion for video classification, in IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 0–0, (2019) 51. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: Divide the gradient

by a running average of its recent magnitude, Coursera: neural networks for machine learning

52. Teepe, T., Khan, A., Gilg, J., Herzog, F., Hörmann, S., Rigoll, G.:

GaitGraph: graph convolutional network for skeleton-based gait recognition,arXiv:2101.11228, (2021)

Publisher’s Note Springer Nature remains neutral with regard to juris- dictional claims in published maps and institutional affiliations.

Md Shopon received B.Sc. in computer science and engineering from the University of Asia Pacific in 2018. He worked in the University of Asia Pacific as a Lecturer from September 2018 to September 2020. He is cur- rently pursuing his M.Sc. degree in computer science at the Univer- sity of Calgary, Canada, under the supervision of Professor Marina L. Gavrilova. His research interests lie in the field of computer vision, deep learning, and behavioral biometrics. In addition, he has authored over 12 international conference papers and journals.

A. S. M. Hossain Barireceived the B.Sc. degree in Computer Science and Information Technology from the Islamic University of Technol- ogy in 2010. He worked in Sam- sung R&D Institute Bangladesh (SRBD) from November 2010 to August 2018. He completed his M.Sc. degree in Computer Sci- ence at the University of Calgary, Canada, under the supervision of Professor Marina L. Gavrilova in 2020. He has authored over 15 international journal and conference papers. He has secured U.S.

Patents while working in SRBD. His research interests are behavioral biometrics, computer vision, and machine learning.

Marina L. Gavrilova is a Full Professor in the Department of Computer Science, University of Calgary, and a head of the Bio- metric Technologies Laboratory.

Her publications include over 200 journal and conference papers, edited special issues, books and book chapters in the areas of image processing, pattern recognition, machine learning, biometric and online security. She is Founding Editor-in-Chief of LNCS Transactions on Compu- tational Science Journal. Dr.

Gavrilova has given over 50 keynotes, invited lectures and tutorials at major scientific gatherings and industry research centers, including Stanford University, SERIAS Center at Purdue, Microsoft Research USA, Oxford University UK, Samsung Research South Korea and oth- ers. Dr. Gavrilova serves as an Associate Editor for IEEE Access, IEEE Transactions on Computational Social Systems, the Visual Com- puter and the International Journal of Biometrics, and was appointed by the IEEE Biometric Council to serve on IEEE Transactions on Biometrics, Behavior, and Identity Science Committee. She is a pas- sionate promoter of diversity, equity and inclusion at her workplace and in the society as a whole.