ORIGINAL ARTICLE
https://doi.org/10.1007/s43674-021-00008-6

Semi-supervised multi-label feature selection with local logic information preserved

Yao Zhang · Yingcang Ma · Xiaofei Yang · Hengdong Zhu · Ting Yang

Received: 14 June 2021 / Revised: 29 July 2021 / Accepted: 13 August 2021 / Published online: 6 September 2021

© The Author(s), under exclusive licence to Springer Nature Switzerland AG 2021

Abstract

In reality, as with single-label data, multi-label data sets often have labels for only part of the samples. This poses a great challenge for multi-label feature selection. This paper combines the logistic regression model with graph regularization and sparse regularization to form a joint framework (SMLFS) for semi-supervised multi-label feature selection. First, feature graph regularization is used to explore the geometric structure of the features and to obtain a better regression coefficient matrix, which reflects the importance of the features. Second, label graph regularization is used to extract the available label information and to constrain the regression coefficient matrix, so that it better fits the label information. Third, the L2,p-norm (0 < p ≤ 1) constraint is used to ensure the sparsity of the regression coefficient matrix, making it easier to distinguish the importance of features. In addition, an iterative updating algorithm is designed to solve the resulting problem, and its convergence is proved. Finally, the proposed method is validated on eight classic multi-label data sets, and the experimental results show the effectiveness of the proposed algorithm.

Keywords: Feature selection · Semi-supervised learning · Multi-label learning · L2,p-norm · Logistic regression

Yingcang Ma (corresponding author): mayingcang@126.com · Yao Zhang: 820927725@qq.com · Xiaofei Yang: yangxiaofei2002@163.com · Hengdong Zhu: ZhuHengDong1997@163.com · Ting Yang: 15929121393@126.com

School of Science, Xi'an Polytechnic University, Xi'an, China

1 Introduction

With the development of science and technology, high-dimensional data are becoming more and more common in real life. At the same time, however, high-dimensional data bring many problems to data analysis, decision-making, screening, and prediction (Cai et al. 2018). The curse of dimensionality is a common problem in practice and, given the immaturity of existing technology, it poses significant challenges to machine learning and related fields (Yang et al. 2013; Li et al. 2016). Moreover, processing and analyzing these high-dimensional data directly costs much time, space, workforce, and material resources (Elguebaly and Bouguila 2015; Yan et al. 2016). In addition, high-dimensional data often contain a lot of noise and redundant features, which creates tremendous obstacles for pattern recognition on such data. According to Yang et al. (2013), there are two methods to alleviate this problem: feature extraction and feature selection. Feature extraction mainly changes the relationship between attributes; for example, combining different attributes yields new attributes, which changes the original feature space. In contrast, feature selection selects the most representative subset from the original feature set; it is an inclusion relationship and does not change the original feature space. For feature extraction, it is difficult to explain the relationship between the extracted data and the original data, whereas feature selection has many advantages for learning algorithms, including reduced measurement cost and storage requirements, shorter training time, avoidance of the curse of dimensionality, and less overfitting (Bermingham et al. 2015; Sun et al. 2012; Zhang et al. 2018). At the same time, there is clear theoretical support for explaining the relationship between the data after feature selection and the original data. Therefore, multi-label feature selection has become a research hot spot.

Based on the presence or absence of labels, existing feature selection methods can be roughly divided into three categories: supervised feature selection (Taher et al. 2019), unsupervised feature selection (Shi et al. 2018), and semi-supervised feature selection (Wang and Wang 2017). Supervised feature selection can take advantage of both feature information and label information to achieve high accuracy when all data have labels. Since supervised feature selection is the earliest and most mature line of research, many supervised methods are available in many fields; it is the commonly used approach in Zhang et al. (2019), Cai and Zhu (2018), and Chen and Zhang (2019). A multi-label feature selection method based on mutual information and label correlation is used in Zhang et al. (2019). A multi-label feature selection method that measures the consistency between the feature space and the label space for feature ranking and selection, achieving automatic learning and processing of the importance of labels, is used in Cai and Zhu (2018). A multi-label feature selection method based on a linear regression model, the feature manifold structure, and L2,1-norm sparse regularization is used in Chen and Zhang (2019). However, the premise of supervised feature selection is that all data labels are collected, which is a costly task for high-dimensional, large-sample data. Unsupervised feature selection, in contrast, processes unlabeled data (Yang et al. 2010; Chang et al. 2016); it often extracts information through data similarity or data variance to select features. For example, the unsupervised feature selection method of Ren et al. (2012) takes advantage of data variance to extract data information and selects the features with maximum variance. However, unsupervised feature selection methods often find it difficult to achieve the desired results due to the lack of label information.

Semi-supervised feature selection is a trade-off between supervised and unsupervised feature selection. It can avoid the use of many human and material resources while providing a certain amount of label information for feature selection through a small number of labeled data. However, this also raises a serious problem for semi-supervised feature selection: how to design a framework that can fully exploit the data and the partial labels to improve learning performance. In recent years, many semi-supervised feature selection frameworks have been proposed and improved. Among these frameworks, using the graph Laplacian to explore the feature manifold structure or the label manifold structure for feature selection has attracted wide attention (Tang and Zhang 2020; Kawano 2013; Kawano et al. 2012). A local preserving logic i-relief algorithm (LPLIR) for multi-class semi-supervised feature selection is proposed in Tang and Zhang (2020); LPLIR maximizes the expected margin of the given labeled data and preserves the local structure information of all given data. A semi-supervised logistic regression framework for classification problems based on the covariate shift adaptation technique is proposed in Kawano (2013); however, the manifold structure of the data features is not considered in that framework. A semi-supervised logistic model based on Gaussian basis functions is proposed in Kawano et al. (2012), where a graph-based regularization method is used to explore the feature manifold structure.

However, in actual classification tasks, high-dimensional data contain not only single-label data but also a large amount of multi-label data. Unlike single-label data, in multi-label data a sample may belong to multiple labels simultaneously, and the labels are cross-related. This brings another serious challenge to the feature selection task. A unified semi-supervised multi-label feature selection framework based on the Laplacian integral is proposed in Alalga et al. (2016). Although that framework can deal with the constraint relationship between labels and data, it ignores the manifold structure of the features and does not extract enough data information. A robust semi-supervised multi-label dimensionality reduction method (READER) is proposed in Sun et al. (2017); READER selects the most discriminative features for all labels in a semi-supervised way from the perspective of empirical risk minimization, but it relies on predicted label correlations and ignores the manifold structure of the existing labels. A semi-supervised multi-label feature selection method that puts image annotation with visual similarity and label correlation into a semi-supervised framework is proposed in Yang et al. (2012). Similar to Yang et al. (2012), a semi-supervised multi-label feature selection method that uses label correlation and imposes a new group sparsity constraint is proposed in Chang et al. (2014). However, all of these methods ignore the clarity of the basic manifold structure of the data features.

In this paper, we propose a semi-supervised multi-label feature selection algorithm based on logistic regression, which considers both the basic feature manifold structure and the local label manifold structure. Considering that the sparsity of the fixed L2,1-norm sparse regularization constraint is not sufficient for all data sets, we use the L2,p-norm (0 < p ≤ 1) sparse regularization constraint to relax the adaptability of the sparse constraint. The main contributions of this paper are as follows:

1. The algorithm combines the feature manifold structure and the local label manifold structure into a semi-supervised multi-label feature selection learning framework, which can make full use of labeled and unlabeled data as well as of the local label information.

(3)

2. The construction of the Laplacian matrix of the feature graph and the Laplacian matrix of the label graph is constrained by L2,p-norm (0 < p ≤ 1) sparse regularization, which can learn more clearly the basic feature manifold structure and the local label manifold structure of various data; the L2,p-norm can also relax the degree of the sparsity constraint.

3. Because the objective function is non-convex, it is challenging to solve. We propose an effective iterative algorithm to optimize the objective function and prove the algorithm's convergence. The proposed algorithm is applied to different data sets, and the experimental results verify its effectiveness.

The rest of this paper is organized as follows. In Sect. 2, the semi-supervised multi-label feature selection model is presented. In Sect. 3, the model is solved, the iterative optimization algorithm is proposed, and its convergence is proved. In Sect. 4, experiments on eight classical data sets show that the proposed algorithm is superior to the compared algorithms. Finally, the conclusion and outlook are given in Sect. 5.

2 Problem description

2.1 Notations of this paper

In this paper, $X \in \mathbb{R}^{n \times d}$ is the data matrix, $x_i$ is the $i$th row (sample) vector of $X$, and $f_i$ is the $i$th column (feature) vector of $X$; $Y \in \mathbb{R}^{n \times m}$ is the label matrix, $y_i$ is the $i$th row (label) vector of $Y$, representing the labels corresponding to the sample vector $x_i$, and each $y_{ij}$ is 0 or 1; $W \in \mathbb{R}^{d \times m}$ is the weight matrix. For any matrix $M$, its transpose is denoted by $M^T$, and the $L_{2,p}$-norm of $M$ is defined as
$$\|M\|_{2,p}^p=\sum_{i=1}^{d}\Big(\sum_{j=1}^{m} M_{ij}^2\Big)^{p/2}=\sum_{i=1}^{d}\|M_i\|_2^p.$$
For any square matrix $B$, its trace is denoted by $Tr(B)$, and $diag(\ast)$ denotes the diagonal matrix formed from a vector.
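As a quick illustration, the $L_{2,p}$-norm term can be evaluated directly from this definition. The following NumPy sketch (function and variable names are ours, not from the paper) computes $\|M\|_{2,p}^p$ as the sum of the $p$th powers of the row norms.

```python
import numpy as np

def l2p_norm_p(M, p=0.8):
    """Return ||M||_{2,p}^p = sum_i (sum_j M_ij^2)^(p/2), for 0 < p <= 1."""
    row_norms = np.linalg.norm(M, axis=1)   # ||M_i||_2 for every row i
    return np.sum(row_norms ** p)

# Example: a random 5x3 coefficient matrix
W = np.random.randn(5, 3)
print(l2p_norm_p(W, p=0.6))
```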

2.2 Local logistic regression model

Suppose that a multi-label data set $D=\{d_i\}_{i=1}^{n}$ is composed of $n$ independent and identically distributed samples. Let $X=[X_L;X_U]\in\mathbb{R}^{n\times d}$ be the new data matrix, where $X_L=[x_1;x_2;\cdots;x_l]\in\mathbb{R}^{l\times d}$ is the labeled data matrix, $X_U=[x_{l+1};x_{l+2};\cdots;x_n]\in\mathbb{R}^{(n-l)\times d}$ is the unlabeled data matrix, and $x_i=[1,d_i]\in\mathbb{R}^{1\times d}$ is the $i$th $d$-dimensional sample variable. $Y=[y_1;y_2;\cdots;y_l]\in\mathbb{R}^{l\times m}$ is the known label matrix, and $y_i\in\{0,1\}^{1\times m}$ is the $m$-dimensional label vector corresponding to the sample $x_i$. The value of $y_{ij}$ indicates whether the $i$th sample belongs to the $j$th class: $y_{ij}=0$ indicates that it does not, and $y_{ij}=1$ indicates that it does. According to the definition of logistic regression, the posterior probability of $y_{ij}=1$ in local logistic regression is
$$Pr(y_{ij}=1\mid x_i)=g(x_iw_j)=\frac{\exp(x_iw_j)}{1+\exp(x_iw_j)}, \quad (1)$$
while the posterior probability of $y_{ij}=0$ is
$$Pr(y_{ij}=0\mid x_i)=1-g(x_iw_j)=\frac{1}{1+\exp(x_iw_j)}, \quad (2)$$
where $w_j\in\mathbb{R}^{d\times 1}$ is the $j$th column vector of $W=[w_1,w_2,\cdots,w_m]\in\mathbb{R}^{d\times m}$ and reflects the importance of the features for the $j$th label.

The most commonly used method to solve for the coefficient matrix $W$ in logistic regression is maximum likelihood estimation. The likelihood function of multi-label logistic regression is
$$P(W)=\max_{W}\ \prod_{j=1}^{m}\prod_{i=1}^{l} g(x_iw_j)^{y_{ij}}\,\big(1-g(x_iw_j)\big)^{1-y_{ij}}. \quad (3)$$
Because (3) is not easy to solve directly, $W$ is estimated by minimizing (4):
$$L(W)=-\sum_{j=1}^{m}\sum_{i=1}^{l}\Big[y_{ij}\ln\big(g(x_iw_j)\big)+(1-y_{ij})\ln\big(1-g(x_iw_j)\big)\Big]
=-\sum_{j=1}^{m}\sum_{i=1}^{l}\Big[y_{ij}(x_iw_j)+\ln\big(1-g(x_iw_j)\big)\Big]. \quad (4)$$
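For concreteness, a small NumPy sketch of the negative log-likelihood (4) is given below; `X_L`, `Y`, and `W` follow the notation of Sect. 2.1, and the clipping constant is our own numerical safeguard, not part of the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(W, X_L, Y):
    """Negative log-likelihood L(W) of Eq. (4) over the labeled data.

    X_L : (l, d) labeled data matrix, Y : (l, m) 0/1 label matrix,
    W   : (d, m) coefficient matrix.
    """
    G = sigmoid(X_L @ W)                 # g(x_i w_j) for all i, j
    G = np.clip(G, 1e-12, 1 - 1e-12)     # avoid log(0)
    return -np.sum(Y * np.log(G) + (1 - Y) * np.log(1 - G))
```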

2.3 Characteristic base manifold structure

Features are sampled from some manifold, called the feature manifold (Gu and Zhou 2009). One of the major challenges in feature selection is how to explore the clarity of the basic manifold structure of the data features during the feature selection process. In single-label feature selection, the optimal parameter range of $w_j$ can be found by trial adjustment of the algorithm. However, when this is extended to multi-label feature selection, the optimal parameter range of each coefficient vector may be different, so it is not feasible to find a parameter range suitable for all coefficient vectors by trial adjustment. Nevertheless, the coefficient vector can represent the importance of its corresponding features, so we can assume that the similarity of $W_i$ and $W_j$ is positively related to the similarity between the corresponding features $f_i$ and $f_j$; that is, $F(W_i,W_j)$ is proportional to $F(f_i,f_j)$. $X$ can also be expressed as $X=[f_1,f_2,\ldots,f_d]\in\mathbb{R}^{n\times d}$. The parameters of the coefficient vectors can then be adjusted through the similarity between features to reflect the clarity of the basic manifold structure of the features. The specific expression is as follows:

$$\begin{aligned}
\frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d}\|W_i-W_j\|_2^2\,S_{ij}
&=\frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d}(W_i-W_j)(W_i-W_j)^T S_{ij}\\
&=\sum_{i=1}^{d} W_iW_i^T D_{ii}-\sum_{i=1}^{d}\sum_{j=1}^{d} W_iW_j^T S_{ij}\\
&=Tr\big(W^T(D-S)W\big)=Tr\big(W^T L_S W\big),
\end{aligned} \quad (5)$$

where $S$ is the similarity matrix between features and $S_{ij}$ represents the similarity between $f_i$ and $f_j$. It is calculated as follows, where $t\in\mathbb{R}$:
$$S_{ij}=\begin{cases}\exp\!\Big(-\dfrac{\|f_i-f_j\|_2^2}{t}\Big), & \text{if } f_i\in N_k(f_j)\ \text{or}\ f_j\in N_k(f_i),\\[4pt] 0, & \text{otherwise},\end{cases} \quad (6)$$
where $N_k(\ast)$ denotes the $k$-nearest-neighbor set of $\ast$. $D\in\mathbb{R}^{d\times d}$ is the diagonal matrix with $D_{ii}=\sum_{j=1}^{d}S_{ij}$ as its $i$th diagonal element, and $L_S=D-S$ is the Laplacian matrix of $S$.
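The feature graph of (6) can be built with a k-nearest-neighbor heat kernel. The sketch below is one possible construction; the bandwidth `t`, the neighbor count `k`, and the symmetrization step are our own choices, not prescribed by the paper.

```python
import numpy as np

def knn_heat_kernel_graph(V, k=5, t=1.0):
    """Similarity matrix S of Eq. (6) for the row vectors of V, plus L = D - S.

    For the feature graph, pass V = X.T so that each row is a feature vector f_i.
    """
    n = V.shape[0]
    dist2 = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)   # squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist2[i])[1:k + 1]      # k nearest neighbors (skip self)
        S[i, nbrs] = np.exp(-dist2[i, nbrs] / t)
    S = np.maximum(S, S.T)      # keep S_ij if i is a neighbor of j or vice versa
    D = np.diag(S.sum(axis=1))
    return S, D - S             # similarity matrix and Laplacian L_S

# Example: feature graph for a small data matrix X (n samples, d features)
X = np.random.randn(20, 8)
S, L_S = knn_heat_kernel_graph(X.T, k=5, t=1.0)
```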

The objective function with semi-supervised information is constructed by combining local logistic regression with the characteristic manifold structure:
$$obj(W)=\min_{W}\ L(W)+\frac{\lambda}{2}Tr\big(W^T L_S W\big). \quad (7)$$

2.4 Label base manifold structure

However, the basic feature manifold structure alone is not enough to select the optimal feature subset. It is also required that the combination of the learned coefficient matrix and the data matrix fit the available local label information appropriately; that is, the agreement between $g(XW)$ and $Y$ should increase during feature selection. Accordingly, we can assume that:

1. The similarity of the posterior probability vectors $g(x_iW)$ and $g(x_jW)$ is positively correlated with the similarity of the label vectors $y_i$ and $y_j$, so $F(g(x_iW),g(x_jW))$ is proportional to $F(y_i,y_j)$.

2. According to the definition of logistic regression, there is a positive correlation between the similarity of the predicted label vectors $x_iW$, $x_jW$ and that of the probability vectors $g(x_iW)$, $g(x_jW)$, so $F(x_iW,x_jW)$ is proportional to $F(g(x_iW),g(x_jW))$.

3. From the two assumptions above, the similarity between the predicted label vectors $x_iW$, $x_jW$ and the label vectors $y_i$, $y_j$ is positively correlated, so $F(x_iW,x_jW)$ is proportional to $F(y_i,y_j)$.

Thus, we can construct a label graph to explore the clarity of the basic manifold structure of the labels, so that the semi-supervised multi-label feature selection framework constructed in this paper still fits the label information moderately during feature selection and makes full use of the label information. The specific expression is as follows:

$$\begin{aligned}
\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\|x_iW-x_jW\|_2^2\,A_{ij}
&=\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(x_iW-x_jW)(x_iW-x_jW)^T A_{ij}\\
&=\sum_{i=1}^{l} x_iW(x_iW)^T P_{ii}-\sum_{i=1}^{l}\sum_{j=1}^{l} x_iW(x_jW)^T A_{ij}\\
&=Tr\big(W^T X_L^T(P-A)X_L W\big)=Tr\big(W^T X_L^T L_A X_L W\big)=Tr\big(W^T \mathcal{L}_A W\big),
\end{aligned} \quad (8)$$

where $A$ is the similarity matrix between the local labels and $A_{ij}$ represents the similarity between $y_i$ and $y_j$. It is calculated as follows, where $t\in\mathbb{R}$:
$$A_{ij}=\begin{cases}\exp\!\Big(-\dfrac{\|y_i-y_j\|_2^2}{t}\Big), & \text{if } y_i\in N_k(y_j)\ \text{or}\ y_j\in N_k(y_i),\\[4pt] 0, & \text{otherwise},\end{cases} \quad (9)$$
where $P\in\mathbb{R}^{l\times l}$ is the diagonal matrix with $P_{ii}=\sum_{j=1}^{l}A_{ij}$ as its $i$th diagonal element, $L_A=P-A$ is the Laplacian matrix of $A$, and $\mathcal{L}_A=X_L^T L_A X_L$.
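The only new ingredient relative to the feature graph is the projection of the label-graph Laplacian back into feature space. A minimal sketch is given below; the helper name is ours, and it assumes the label similarity matrix A has already been built as in (9), for example with the kNN heat-kernel construction shown earlier.

```python
import numpy as np

def projected_label_laplacian(X_L, A):
    """Given the label similarity matrix A of Eq. (9), form L_A = P - A and
    return the d x d matrix X_L^T L_A X_L used in Eqs. (8) and (10)."""
    P = np.diag(A.sum(axis=1))      # P_ii = sum_j A_ij
    return X_L.T @ (P - A) @ X_L
```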

Combined with (7), the semi-supervised multi-label feature selection framework is constructed. The specific objective function is as follows:
$$obj(W)=\min_{W}\ L(W)+\frac{\lambda}{2}Tr\big(W^T L_S W\big)+\frac{\gamma}{2}Tr\big(W^T \mathcal{L}_A W\big). \quad (10)$$


2.5 Sparse regularization

To select the optimal feature subset more accurately, the coefficient matrix must be constrained so that it is as stable as possible within rows and as sparse as possible between rows. Moreover, the logistic regression model may suffer from overfitting, multi-collinearity, infinitely many solutions, and other ill-posed problems, resulting in incorrect estimation of the coefficient matrix (Liu et al. 2014). To solve these problems, the common method is to impose an L2,1-norm sparsity constraint on the coefficient matrix. However, the fixed L2,1-norm is not suitable for all data sets, and the sparsity constraint may not be strong enough for some of them, which makes it impossible to select the optimal feature subset accurately. To avoid a fixed sparsity constraint of insufficient strength, we choose the L2,p-norm (0 < p ≤ 1) to constrain the coefficient matrix. For the L2,p-norm, the degree of the sparsity constraint decreases as p increases. Although the L2,p-norm is non-convex when 0 < p < 1, in some cases the feature subset selected at a local optimum is better than the feature subset selected when p = 1 (Zhang et al. 2014; Zhu et al. 2016). Therefore, we can adjust the degree of the sparsity constraint by adjusting the value of p, making the method suitable for more data sets. The formula of the L2,p-norm is as follows:

$$\|W\|_{2,p}^p=\sum_{i=1}^{d}\Big(\sum_{j=1}^{m} w_{ij}^2\Big)^{p/2}=\sum_{i=1}^{d}\|W_i\|_2^p,\qquad \text{s.t. } 0<p\le 1. \quad (11)$$

Combined with (10), the objective function becomes:
$$obj(W)=\min_{W}\ L(W)+\frac{\lambda}{2}Tr\big(W^T L_S W\big)+\frac{\gamma}{2}Tr\big(W^T \mathcal{L}_A W\big)+\frac{\beta}{2}\|W\|_{2,p}^p. \quad (12)$$

The first and third terms of the objective function involve only labeled data; the second term involves both labeled and unlabeled data; the fourth term is the relaxed sparsity-constrained regularization term. At this point, our semi-supervised multi-label feature selection framework is complete.
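For clarity, the value of objective (12) for a given coefficient matrix W can be evaluated as in the sketch below; `L_S` is the feature-graph Laplacian, `LA_proj` is the projected label-graph matrix $X_L^T L_A X_L$, and all function and variable names are ours, not the paper's.

```python
import numpy as np

def smlfs_objective(W, X_L, Y, L_S, LA_proj, lam, gamma, beta, p):
    """Objective (12): logistic loss + feature-graph, label-graph and L_{2,p} terms."""
    G = np.clip(1.0 / (1.0 + np.exp(-X_L @ W)), 1e-12, 1 - 1e-12)
    loss = -np.sum(Y * np.log(G) + (1 - Y) * np.log(1 - G))            # L(W), Eq. (4)
    feat_term = 0.5 * lam * np.trace(W.T @ L_S @ W)                     # feature manifold
    label_term = 0.5 * gamma * np.trace(W.T @ LA_proj @ W)              # label manifold
    sparse_term = 0.5 * beta * np.sum(np.linalg.norm(W, axis=1) ** p)   # ||W||_{2,p}^p
    return loss + feat_term + label_term + sparse_term
```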

3 Problem solving and proof of convergence

3.1 Problem solving

Because $0<p\le 1$ in the $L_{2,p}$-norm, it becomes the $L_{2,1}$-norm when $p=1$. Although the $L_{2,1}$-norm is convex, it is non-smooth; a concrete solution has been given in Nie et al. (2010). However, when $0<p<1$ the $L_{2,p}$-norm is non-convex, so we use the iteratively reweighted least squares method to solve it. Given the current iterate $W^{t}$, the diagonal weighting matrix can be defined as
$$H=diag\Big(\frac{p}{2\|W_i^{t}\|_2^{2-p}}\Big), \quad (13)$$

where $i=1,2,\cdots,d$. Thus, the objective function is transformed into:
$$obj(W)=\min_{W}\ L(W)+\frac{\lambda}{2}Tr\big(W^T L_S W\big)+\frac{\gamma}{2}Tr\big(W^T \mathcal{L}_A W\big)+\frac{\beta}{2}Tr\big(W^T H W\big). \quad (14)$$

To solve (14), we first set $H$ to the identity matrix, then calculate $W$ according to $H$, and finally update $H$ according to the new $W$; this cycle is repeated until the optimal $W$ is found.

According to the differentiability of (14), we can use the Newton–Raphson algorithm to solve it. The first derivative of (14) with respect to $W$ is:
$$\frac{\partial\,obj(W)}{\partial W}=-X_L^T\big[Y-g(X_LW)\big]+\lambda L_S W+\gamma \mathcal{L}_A W+\beta H W. \quad (15)$$
The second derivative of (14) with respect to $W$ is:
$$\frac{\partial^2 obj(W)}{\partial W\,\partial W^T}=-X_L^T U X_L+\lambda L_S+\gamma \mathcal{L}_A+\beta H, \quad (16)$$
where
$$U=diag\Big(\sum_{j=1}^{m}\big[(1-g(x_iw_j))\,g(x_iw_j)\big]\Big), \quad (17)$$
with $i=1,2,\ldots,l$. Thus, we obtain the update formula for $W$:
$$W^{t+1}=W^{t}-\Big(\frac{\partial^2 obj(W)}{\partial W\,\partial W^T}\Big)^{-1}\frac{\partial\,obj(W)}{\partial W}. \quad (18)$$

(6)

Algorithm 1: Semi-supervised multi-label feature selection with local logic information preserved (SMLFS)

Input: Data matrix $X=[X_L;X_U]\in\mathbb{R}^{n\times d}$, label matrix $Y\in\mathbb{R}^{n\times m}$, regularization parameters $\lambda$, $\beta$ and $\gamma$, the number of selected features $K$, and a fixed value of $p$, $0<p\le 1$.
Output: The feature selection result $I$.

1. According to (6), calculate the feature similarity matrix $S$, and calculate $L_S$.
2. According to (9), calculate the label similarity matrix $A$, and calculate $L_A$ and $\mathcal{L}_A$.
3. Set $t=0$. Initialize the matrix $H$ and the coefficient matrix $W$.
4. Repeat:
5. Calculate $U^{t+1}=diag\big(\sum_{j=1}^{m}[(1-g(x_iw_j^{t}))\,g(x_iw_j^{t})]\big)$.
6. Calculate $\frac{\partial\,obj(W^t)}{\partial W^t}=-X_L^T[Y-g(X_LW^t)]+\lambda L_S W^t+\gamma \mathcal{L}_A W^t+\beta H^t W^t$.
7. Calculate $\frac{\partial^2 obj(W^t)}{\partial W^t\,\partial W^{tT}}=-X_L^T U^{t+1} X_L+\lambda L_S+\gamma \mathcal{L}_A+\beta H^t$.
8. Calculate $W^{t+1}=W^{t}-\big(\frac{\partial^2 obj(W^t)}{\partial W^t\,\partial W^{tT}}\big)^{-1}\frac{\partial\,obj(W^t)}{\partial W^t}$.
9. Calculate $H^{t+1}=diag\big(\frac{p}{2\|W_i^{t+1}\|_2^{2-p}}\big)$.
10. $t=t+1$.
11. Until the convergence criterion is satisfied.
12. Calculate and sort $\|W_i\|_2\ (i=1,2,\cdots,d)$ and select the $K$ features with the largest values as the result $I$.
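A compact NumPy sketch of the whole iteration is given below. It follows Algorithm 1 under two choices of our own: the curvature matrix is assembled with a positive sign on the $X_L^T U X_L$ term so that the linear system in step 8 is positive definite, and `np.linalg.solve` with a small ridge `eps` replaces the explicit matrix inverse.

```python
import numpy as np

def smlfs(X_L, Y, L_S, LA_proj, lam=1.0, gamma=1.0, beta=1.0, p=0.8,
          n_iter=50, eps=1e-8):
    """Sketch of the SMLFS iterations (Algorithm 1): Newton-style updates of W
    with iterative reweighting of H. Returns W and the feature ranking."""
    l, d = X_L.shape
    m = Y.shape[1]
    W = np.zeros((d, m))
    H = np.eye(d)                                    # step 3: H initialized as identity
    for _ in range(n_iter):
        G = np.clip(1.0 / (1.0 + np.exp(-X_L @ W)), 1e-12, 1 - 1e-12)
        # Gradient, Eq. (15)
        grad = -X_L.T @ (Y - G) + lam * L_S @ W + gamma * LA_proj @ W + beta * H @ W
        # Curvature approximation (positive-definite variant of Eq. (16), our choice)
        U = np.diag(np.sum(G * (1 - G), axis=1))     # Eq. (17)
        M = X_L.T @ U @ X_L + lam * L_S + gamma * LA_proj + beta * H + eps * np.eye(d)
        W = W - np.linalg.solve(M, grad)             # Newton-style step, Eq. (18)
        # Reweighting of Eq. (13): H_ii = p / (2 ||W_i||_2^(2-p))
        row_norms = np.maximum(np.linalg.norm(W, axis=1), eps)
        H = np.diag(p / (2.0 * row_norms ** (2.0 - p)))
    ranking = np.argsort(-np.linalg.norm(W, axis=1))  # step 12: sort ||W_i||_2
    return W, ranking

# Usage (shapes only): X_L is (l, d), Y is (l, m), L_S and LA_proj are (d, d).
```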

3.2 Proof of convergence

In this section, we prove that the iterative process of Algorithm 1 converges. In the $t$th iteration, we know that
$$W^{t+1}=\arg\min_{W}\ L(W)+\frac{\lambda}{2}Tr\big(W^T L_S W\big)+\frac{\beta}{2}Tr\big(W^T H^t W\big)+\frac{\gamma}{2}Tr\big(W^T \mathcal{L}_A W\big), \quad (19)$$
where $H_{ii}^t=\frac{p}{2\|W_i^t\|_2^{2-p}}$ and $i=1,2,\ldots,d$. Therefore:
$$\begin{aligned}
&L(W^{t+1})+\frac{\lambda}{2}Tr\big((W^{t+1})^T L_S W^{t+1}\big)+\frac{\beta}{2}Tr\big((W^{t+1})^T H^t W^{t+1}\big)+\frac{\gamma}{2}Tr\big((W^{t+1})^T \mathcal{L}_A W^{t+1}\big)\\
&\quad\le L(W^{t})+\frac{\lambda}{2}Tr\big((W^{t})^T L_S W^{t}\big)+\frac{\beta}{2}Tr\big((W^{t})^T H^t W^{t}\big)+\frac{\gamma}{2}Tr\big((W^{t})^T \mathcal{L}_A W^{t}\big).
\end{aligned} \quad (20)$$
This can be transformed into:
$$\begin{aligned}
&L(W^{t+1})+\frac{\lambda}{2}Tr\big((W^{t+1})^T L_S W^{t+1}\big)+\frac{\gamma}{2}Tr\big((W^{t+1})^T \mathcal{L}_A W^{t+1}\big)+\frac{\beta}{2}\sum_{i=1}^{d}\frac{p\,\|W_i^{t+1}\|_2^{2}}{2\|W_i^{t}\|_2^{2-p}}\\
&\quad\le L(W^{t})+\frac{\lambda}{2}Tr\big((W^{t})^T L_S W^{t}\big)+\frac{\gamma}{2}Tr\big((W^{t})^T \mathcal{L}_A W^{t}\big)+\frac{\beta}{2}\sum_{i=1}^{d}\frac{p\,\|W_i^{t}\|_2^{2}}{2\|W_i^{t}\|_2^{2-p}}.
\end{aligned} \quad (21)$$
It can be further transformed into:
$$\begin{aligned}
&L(W^{t+1})+\frac{\lambda}{2}Tr\big((W^{t+1})^T L_S W^{t+1}\big)+\frac{\gamma}{2}Tr\big((W^{t+1})^T \mathcal{L}_A W^{t+1}\big)+\frac{\beta}{2}\|W^{t+1}\|_{2,p}^{p}-\frac{\beta}{2}\Big(\|W^{t+1}\|_{2,p}^{p}-\sum_{i=1}^{d}\frac{p\,\|W_i^{t+1}\|_2^{2}}{2\|W_i^{t}\|_2^{2-p}}\Big)\\
&\quad\le L(W^{t})+\frac{\lambda}{2}Tr\big((W^{t})^T L_S W^{t}\big)+\frac{\gamma}{2}Tr\big((W^{t})^T \mathcal{L}_A W^{t}\big)+\frac{\beta}{2}\|W^{t}\|_{2,p}^{p}-\frac{\beta}{2}\Big(\|W^{t}\|_{2,p}^{p}-\sum_{i=1}^{d}\frac{p\,\|W_i^{t}\|_2^{2}}{2\|W_i^{t}\|_2^{2-p}}\Big).
\end{aligned} \quad (22)$$
Because $\|W_i\|_2^{p}=\big(\sum_{j=1}^{m}w_{ij}^2\big)^{p/2}$, we have $\|W_i\|_2^{p}\ge 0$. According to the inequality $\sqrt{a}-\frac{a}{2\sqrt{b}}\le\sqrt{b}-\frac{b}{2\sqrt{b}}$ for positive numbers $a$ and $b$, together with its extension to the exponent $0<p\le 1$ (which follows from the concavity of $x^{p/2}$), we have:
$$\|W_i^{t+1}\|_2^{p}-\frac{p\,\|W_i^{t+1}\|_2^{2}}{2\|W_i^{t}\|_2^{2-p}}\le\|W_i^{t}\|_2^{p}-\frac{p\,\|W_i^{t}\|_2^{2}}{2\|W_i^{t}\|_2^{2-p}}. \quad (23)$$
By summation, we can get:
$$\sum_{i=1}^{d}\Big(\|W_i^{t+1}\|_2^{p}-\frac{p\,\|W_i^{t+1}\|_2^{2}}{2\|W_i^{t}\|_2^{2-p}}\Big)\le\sum_{i=1}^{d}\Big(\|W_i^{t}\|_2^{p}-\frac{p\,\|W_i^{t}\|_2^{2}}{2\|W_i^{t}\|_2^{2-p}}\Big). \quad (24)$$
Thus, it can be concluded that:
$$\|W^{t+1}\|_{2,p}^{p}-\sum_{i=1}^{d}\frac{p\,\|W_i^{t+1}\|_2^{2}}{2\|W_i^{t}\|_2^{2-p}}\le\|W^{t}\|_{2,p}^{p}-\sum_{i=1}^{d}\frac{p\,\|W_i^{t}\|_2^{2}}{2\|W_i^{t}\|_2^{2-p}}. \quad (25)$$
Combining (22) with (25) shows that $obj(W^{t+1})\le obj(W^{t})$, so the objective value of (12) is non-increasing over the iterations.


In the SMLFS algorithm, the total number of iterations is $t$ (the value of $t$ is usually not large), and the time complexity of each update of $W$ is $O(d^2n)$. Therefore, the total time complexity of the algorithm is $O(td^2n)$, which is mainly affected by the number of samples $n$ and the sample dimension $d$.

In summary, the convergence of Algorithm 1 is proved.

A visual illustration of the convergence is shown in Fig. 1.

4 Experiments and results

We verify the effectiveness of the SMLFS algorithm by comparing it with a baseline and several state-of-the-art algorithms on eight public data sets. The experiments use ML-KNN (Zhang and Zhou 2007) as the representative multi-label classification algorithm for evaluation.

4.1 Dataset and experimental environment

Eight public data sets were used in the experiments: Emotion is a music data set; Computers, Health, Business, and Bibtex are text data sets; Yeast is a biological data set; Scene and Image are image data sets. All data sets are from http://mulan.sourceforge.net/datasets.html. Table 1 shows the specific parameters of the data sets.

In terms of the experimental environment, all experiments were run on Microsoft Windows 7 with an Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz 2.40GHz and 4.00GB of memory, using Matlab R2016a.

Fig. 1 The change of the objective function value on Emotion and Scene (the proportion of labeled data was 0.1). Panels show the objective function value versus the number of iterations (0-50) for Image (p = 0.6, p = 0.8), Emotion (p = 0.6, p = 0.8), and Scene (p = 0.6, p = 0.8).

(8)

Table 1 Dataset information

Data set    Samples  Features  Labels  Training samples  Test samples
Yeast       2417     103       14      1500              917
Scene       2407     294       6       1211              1196
Computers   5000     681       33      2000              3000
Image       600      294       5       400               200
Bibtex      7395     1836      159     4880              2515
Emotion     593      72        6       391               202
Health      5000     612       32      2000              3000
Business    5000     438       30      2000              3000

4.2 Experimental setup

To verify the effectiveness of the proposed feature selection method, we compare it with the following state-of-the-art feature selection algorithms:

1. Baseline: Without any feature selection, the data set is directly learned by ML-KNN, and the results on each evaluation index are obtained.

2. SCLS (Lee and Kim 2017): Multi-label feature selection method based on a scalable criterion.

3. MDMR (Lin et al. 2015): Multi-label feature selection combining mutual information with maximum dependency and minimum redundancy.

4. PMU (Lee et al. 2012): Multi-label feature selection algorithm based on mutual information, which selects features according to the dependency between the selected features and the labels.

5. FIMF (Lee and Kim 2015): A fast multi-label feature selection method based on information-theoretic feature ranking. A scoring function for evaluating the importance of features is derived from information theory, and its computational cost is analyzed.

6. DRMFS Hu et al. (2020): A multi-label feature selection algorithm based on label graph, feature graph, and sparse regularization.

The experimental parameters are set as follows, where (1) and (2) are the default settings of the corresponding algorithms.

1. In the ML-KNN algorithm, the smoothing parameter is set to S = 1 and the neighbor parameter to k = 10.

2. In the FIMF, MDMR, and PMU algorithms, the data sets are discretized using equal-width intervals (Duda 1995). In the FIMF algorithm, Q = 10 and b = 2.

Table 2 Average precision comparison of different algorithms under each data set (The proportion of labeled data was 0.2)

Data set    SMLFS (p = 0.6)    SMLFS (p = 0.8)    SMLFS (p = 1)    DRMFS    SCLS    MDMR    PMU    FIMF    Baseline

Scene 0.8303 0.8289 0.8329 0.8158 0.8163 0.7633 0.8034 0.6906 0.8512

Yeast 0.7648 0.7641 0.7641 0.7653 0.7563 0.7579 0.7562 0.7552 0.7585

Emotion 0.7865 0.7865 0.7865 0.7917 0.7496 0.7551 0.7143 0.7510 0.6938

Computers 0.6324 0.6315 0.6347 0.6344 0.6317 0.6304 0.6093 0.6203 0.6334

Image 0.7568 0.7626 0.7568 0.7434 0.7437 0.7058 0.7002 0.6791 0.7214

Bibtex 0.4444 0.4355 0.4122 0.3800 0.2306 0.2247 0.1960 0.2252 0.3449

Health 0.7239 0.7231 0.7238 0.7245 0.7159 0.7256 0.6593 0.7090 0.6812

Business 0.8748 0.8749 0.8746 0.8689 0.8758 0.8757 0.8690 0.8730 0.8798

Table 3 Average precision comparison of different algorithms under each data set (The proportion of labeled data was 0.4)

Data set    SMLFS (p = 0.6)    SMLFS (p = 0.8)    SMLFS (p = 1)    DRMFS    SCLS    MDMR    PMU    FIMF    Baseline

Scene 0.8258 0.8292 0.8323 0.8158 0.8163 0.7633 0.8034 0.6906 0.8512

Yeast 0.7681 0.7669 0.7685 0.7653 0.7563 0.7579 0.7562 0.7552 0.7585

Emotion 0.8085 0.8085 0.8085 0.7917 0.7496 0.7551 0.7143 0.7510 0.6938

Computers 0.6388 0.6394 0.6381 0.6344 0.6317 0.6304 0.6093 0.6203 0.6334

Image 0.7563 0.7591 0.7655 0.7434 0.7437 0.7058 0.7002 0.6791 0.7214

Bibtex 0.4648 0.4509 0.4572 0.3800 0.2306 0.2247 0.1960 0.2252 0.3449

Health 0.7306 0.7323 0.7285 0.7245 0.7159 0.7256 0.6593 0.7090 0.6812

Business 0.8793 0.8776 0.8770 0.8689 0.8758 0.8757 0.8690 0.8730 0.8798

(9)

Fig. 2 Average precision comparison of different feature selection methods when ML-KNN is used as the basic classifier (the proportion of labeled data was 0.2). Panels show average precision versus the number of selected features (10-50) on Scene, Yeast, Emotion, Computers, Image, and Bibtex for SMLFS (p = 0.6), SMLFS (p = 0.8), SMLFS (p = 1), FIMF, MDMR, DRMFS, SCLS, PMU, and Baseline.

3. Since the comparison algorithms are supervised rather than semi-supervised multi-label feature selection algorithms, the proportion of labeled data in their training sets is set to 1. In the SMLFS algorithm, the proportion of labeled data in the training set is set to 0.2 or 0.4, and the value of p is set to 0.6, 0.8, or 1. The comparison algorithms therefore have a clear advantage in terms of available label information.

4. The nearest-neighbor parameter of all feature selection algorithms is set to k = 5, and the maximum number of iterations is set to 50.

5. The regularization parameters of all methods are tuned by a grid-search strategy. The search scope is set to $[10^{-3}, 10^{-2}, 10^{-1}, 1, 10^{1}, 10^{2}, 10^{3}]$, and the number of selected features is set to [10, 15, 20, 25, 30, 35, 40, 45, 50].

For all algorithms, the reported results are the best ones obtained with the optimal parameters; a sketch of this tuning loop is given below.
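The grid search of item 5 above is a plain exhaustive loop over the three regularization parameters and the number of selected features. In this sketch, `evaluate_average_precision` is a hypothetical helper standing in for "run SMLFS, keep the top-K features, classify with ML-KNN, and report average precision"; it is not a function from the paper.

```python
import itertools

param_grid = [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3]
feature_counts = [10, 15, 20, 25, 30, 35, 40, 45, 50]

def grid_search(evaluate_average_precision):
    """Return the best (lam, gamma, beta, K) found by exhaustive grid search."""
    best_score, best_cfg = -1.0, None
    for lam, gamma, beta in itertools.product(param_grid, repeat=3):
        for K in feature_counts:
            score = evaluate_average_precision(lam, gamma, beta, K)
            if score > best_score:
                best_score, best_cfg = score, (lam, gamma, beta, K)
    return best_cfg, best_score
```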

4.3 Evaluation index

Unlike the performance evaluation of a single-label learning system, the evaluation of a multi-label learning system is more complex. In the experiments, we use average precision as the evaluation index; it measures, for each instance, the proportion of relevant labels that are ranked above each particular relevant label. The higher the value, the better the performance.
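For reference, the instance-based average precision commonly used with ML-KNN style evaluation can be computed as follows. This is our own sketch of the standard metric, not code from the paper; `scores` are the predicted label confidences and `Y` the binary ground truth.

```python
import numpy as np

def average_precision(scores, Y):
    """Instance-averaged precision for multi-label predictions.

    scores : (n, m) predicted confidences, Y : (n, m) binary ground truth.
    For each instance, ranks its relevant labels by score and averages the
    fraction of relevant labels found at or above each relevant label's rank.
    """
    n, m = Y.shape
    ap = []
    for i in range(n):
        relevant = np.flatnonzero(Y[i] == 1)
        if relevant.size == 0:
            continue
        order = np.argsort(-scores[i])            # labels sorted by confidence
        rank = np.empty(m, dtype=int)
        rank[order] = np.arange(1, m + 1)         # 1-based rank of every label
        prec = [np.sum(rank[relevant] <= rank[j]) / rank[j] for j in relevant]
        ap.append(np.mean(prec))
    return float(np.mean(ap))
```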
