
7.3 Joint Facial Expression Recognition and Point Localization

7.3.1 Developed Models for Both Facial Expressions and Points

I evaluate each potential location for each facial point using four models, two trained across the facial expressions and the other two per expression. Configural and shape features are important bases for robust facial analysis approaches [107]. Those features encode the relative locations of the facial components, defined here as the geometry-based features. To this end, the following two models were developed. Let $\mathbf{p}$ denote a potential set of the eight facial points, as numbered in Figure 7.12.

$$\mathbf{p} = \{p_1, \ldots, p_8\}, \qquad p_i \in \mathbb{R}^{2 \times 1}. \tag{7.17}$$

The first model was designed to evaluate each $\mathbf{p}$ according to its distance to the expression-specific prior location $\mathbf{p}^p$, the mean location across the training data.

$$p(\mathbf{p} \mid \mathbf{p}^p, c) = p(\mathbf{p} \mid \Phi_{\mathbf{p}^p}^{c}) = \sum_{i=1}^{m} \alpha_i \, p_i(\mathbf{p} \mid \phi_i), \tag{7.18}$$

where $\phi_i = (\mu_i, \Sigma_i)$, and $\Phi_{\mathbf{p}^p}^{c} = (\alpha_1, \ldots, \alpha_m, \phi_1, \ldots, \phi_m)$ is the points-prior GMM model for the facial expression $c$, which is estimated via the Expectation-Maximization (EM) algorithm. Each $p_i$ is a 16-dimensional multivariate Gaussian distribution given by

$$p_i(\mathbf{p} \mid \phi_i) = \frac{1}{(2\pi)^{\frac{16}{2}} \, |\Sigma_i|^{\frac{1}{2}}} \exp\!\left\{ -\frac{1}{2} (\mathbf{p} - \mu_i)^T \Sigma_i^{-1} (\mathbf{p} - \mu_i) \right\}, \tag{7.19}$$

where $\mu_i \in \mathbb{R}^{16 \times 1}$ is the mean vector of the $i$th subpopulation and $\Sigma_i$ is its $16 \times 16$ covariance matrix. $\alpha_i \in [0, 1]$ for all $i$, and the $\alpha_i$'s are constrained to sum to one.
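As an illustration of how the points-prior model of Eq. (7.18) can be estimated and evaluated, the following is a minimal sketch assuming the eight annotated points of each training face are stacked into a 16-dimensional vector per expression; scikit-learn's `GaussianMixture` stands in for the EM-fitted GMM, and all names are illustrative.

```python
# Sketch: per-expression points-prior model of Eq. (7.18), fitted with EM.
# Assumes train_points[c] is an (N, 16) array of eight stacked (x, y) facial
# points from training faces labelled with expression c.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_points_prior(train_points, n_components=3):
    """Fit one 16-D GMM per expression (the model Phi_{p^p}^c)."""
    priors = {}
    for c, X in train_points.items():          # X: (N, 16)
        priors[c] = GaussianMixture(n_components=n_components,
                                    covariance_type='full').fit(X)
    return priors

def log_prior_likelihood(priors, p, c):
    """log p(p | p^p, c) for a candidate point set p of shape (8, 2)."""
    return priors[c].score_samples(p.reshape(1, -1))[0]
```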

Figure 7.12: The relative location of the eight facial points is measured via two-point models in the depicted sequence (Start → $p_1$ → $p_2$ → … → $p_8$ → End).

Next, a cascade-regression approach was built to locate only the eight facial points. Unlike the SDM used in [170], I exploited SVR here for the nonlinear mapping of the features to the point locations. The training was performed across the facial expressions. Each set of points $\mathbf{p}$ was then evaluated with respect to the set $\mathbf{p}^r$ that was detected via the cascade-regression approach as follows.

$$p(\mathbf{p} \mid \mathbf{p}^r) = p(\mathbf{p} \mid \Phi_{\mathbf{p}^r}) \tag{7.20}$$

$\Phi_{\mathbf{p}^r}$ is a GMM whose mean is the detected points and whose covariance is the identity matrix.
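A small sketch of the evaluation in Eq. (7.20), assuming `p_r` holds the eight points returned by the SVR-based cascade regressor and that the single Gaussian with identity covariance is evaluated directly; the names are illustrative.

```python
# Sketch of Eq. (7.20): a candidate point set is scored against the points
# detected by the cascade regressor, using a Gaussian centred on them with
# identity covariance. p_r is assumed to be the (8, 2) cascade output.
import numpy as np
from scipy.stats import multivariate_normal

def log_regression_likelihood(p, p_r):
    """log p(p | p^r) with mean p^r and 16-D identity covariance."""
    return multivariate_normal.logpdf(p.reshape(-1),
                                      mean=p_r.reshape(-1),
                                      cov=np.eye(16))
```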

The facial wrinkles, bulges, and furrows carry significant cues about both the facial expression and the point location, which can be inferred using appearance-based features. To this end, I built the next two models. First, I constructed a model to evaluate the texture around each potential point $I_p$ according to its distance to the ground-truth location. In particular, HoG features were extracted from patches surrounding the potential points, each of size 20% of the cropped face size. This model $\psi$ was trained per expression $c$ using SVR, where the likelihood of each set is calculated as follows.

$$p(\mathbf{p} \mid I_p, c) \propto \psi(I_p, c). \tag{7.21}$$
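To make the role of $\psi$ concrete, here is a minimal sketch of one per-expression patch-appearance model; the regression target (the negative distance to the annotated location) and the RBF kernel are assumptions for illustration, not necessarily the setup used in the thesis.

```python
# Sketch of the patch-appearance model psi of Eq. (7.21). One SVR per
# expression maps HoG features of a patch around a candidate point to a
# proximity score; its response is used as an unnormalised likelihood.
import numpy as np
from sklearn.svm import SVR

def train_patch_model(patch_feats, distances):
    """patch_feats: (N, D) HoG features, distances: (N,) to ground truth."""
    return SVR(kernel='rbf').fit(patch_feats, -distances)

def patch_score(model, feat):
    """Unnormalised score standing in for p(p | I_p, c) of Eq. (7.21)."""
    return model.predict(feat.reshape(1, -1))[0]
```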

The fourth model $\Psi$ was built to estimate the probability of a facial expression $c$ given texture features extracted from the entire face patch $I$ through an SVM classifier [28] as follows.

$$p(c \mid I) \propto \Psi(I, c). \tag{7.22}$$

Throughout this section, the HoG descriptor was used to encode the face appearance, as it is one of the most effective texture descriptors. The cropped face is scaled to a fixed size before applying the aforementioned models. For the fourth model, the face was scaled to 160×160 pixels and then divided into cells of 20×20 pixels. Next, I constructed an eight-bin orientation histogram for each cell, in which each pixel orientation was weighted by its magnitude. Normalized histograms over block regions were concatenated to form the final feature vector of length 1568. I used blocks of 2×2 cells with a block stride of one cell, the same setup as in Sec. 7.1.3.
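The whole-face descriptor described above can be reproduced approximately with scikit-image; the sketch below assumes a grayscale face crop, and skimage's block normalization may differ in detail from the setup of Sec. 7.1.3.

```python
# Sketch of the whole-face HoG descriptor: a 160x160 face, 20x20-pixel cells,
# 8 orientation bins, 2x2-cell blocks with a one-cell stride, giving a
# 1568-dimensional feature vector.
from skimage.feature import hog
from skimage.transform import resize

def face_hog(face_gray):
    face = resize(face_gray, (160, 160))
    feat = hog(face, orientations=8,
               pixels_per_cell=(20, 20),
               cells_per_block=(2, 2),
               block_norm='L2-Hys')
    # 7x7 block positions x (2x2 cells) x 8 bins = 1568 dimensions
    assert feat.shape[0] == 1568
    return feat
```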

The final task is to jointly recognize the facial expression $c$ and locate the facial points $\mathbf{p}$ as follows.

$$\mathbf{p}, c = \arg\max_{\mathbf{p}, c} \; p^{n_1}(\mathbf{p} \mid \mathbf{p}^p, c) \, p^{n_2}(\mathbf{p} \mid \mathbf{p}^r) \, p^{n_3}(\mathbf{p} \mid I_p, c) \, p^{n_4}(c \mid I).$$

$n_1, \ldots, n_4$ assign the importance of each model in the joint inference. All the parameters (the model importance weights and those of the SVC, SVR, and HoG) were estimated using a grid search along with cross-validation experiments conducted on the training set, aiming for more accurate point localization and a higher expression recognition rate at reasonable processing and resource cost.
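A minimal sketch of how the four responses can be combined in the log domain, where the exponents $n_1, \ldots, n_4$ become importance weights; the grid values shown are placeholders, not the weights selected by the thesis's cross-validation.

```python
# Sketch of the joint objective above, evaluated in the log domain.
from itertools import product

def joint_log_score(log_prior, log_reg, log_patch, log_expr,
                    n=(1.0, 1.0, 1.0, 1.0)):
    n1, n2, n3, n4 = n
    return n1 * log_prior + n2 * log_reg + n3 * log_patch + n4 * log_expr

# Candidate importance weights that a grid search could sweep:
weight_grid = list(product((0.5, 1.0, 2.0), repeat=4))
```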

7.3.2 Data Fusion

It is computationally expensive to evaluate all possible combinations of the eight facial points from their potential locations. Seeking a more efficient system, I evaluated the potential locations of each facial point independently (not as a set). The relative relation of the points was ignored in Eqs. 7.20 and 7.21, which I reformulated under the assumption of point independence as follows.

$$p(\mathbf{p} \mid \mathbf{p}^r) \approx \prod_{i=1}^{8} p(p_i \mid p^r_i) \tag{7.23}$$

$$p(\mathbf{p} \mid I_p, c) \approx \prod_{i=1}^{8} p(p_i \mid I_{p_i}, c). \tag{7.24}$$

The model in Eq. (7.18) was reformulated as well, but in a way preserving the relative location of the facial points.

$$p(\mathbf{p} \mid \mathbf{p}^p, c) \approx \prod_{i=1}^{8} p(p_i \mid p^p_i, c) \prod_{k=2}^{8} p(p_k \mid p_{k-1}, c) \tag{7.25}$$

$p(p_i \mid p^p_i, c)$ is the likelihood of the potential location $p_i$ given the corresponding prior location, hypothesizing that the facial expression is $c$. $p(p_k \mid p_{k-1}, c)$ evaluates the relative location of $p_k$ and $p_{k-1}$, hypothesizing that the facial expression is $c$; the sequential order of the points is illustrated in Figure 7.12. All the aforementioned point-wise models are GMMs.
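A short sketch of the factorized prior of Eq. (7.25) in the log domain; the pairwise term is assumed here to be a GMM over the offset $p_k - p_{k-1}$, which is one possible parameterization and may differ from the one used in the thesis.

```python
# Sketch of the factorised prior of Eq. (7.25) in the log domain.
# point_priors[c][i] is a 2-D GMM for p_i given its prior location, and
# pair_models[c][k] is assumed to model the offset p_k - p_{k-1}.
import numpy as np

def log_chain_prior(points, c, point_priors, pair_models):
    """points: (8, 2) candidate locations, ordered as in Figure 7.12."""
    score = sum(point_priors[c][i].score_samples(points[i:i+1])[0]
                for i in range(8))
    score += sum(pair_models[c][k].score_samples(
                     (points[k] - points[k-1]).reshape(1, -1))[0]
                 for k in range(1, 8))
    return score
```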

Above, I prepared each potential point set to be evaluated in a sequence; in what follows, I adapt the Viterbi algorithm to perform this evaluation efficiently. For a hidden Markov model (HMM) of $N$ states, characterized by initial probabilities and a stationary transition matrix, the Viterbi approach is used to find the most likely state sequence that produces a given observation sequence of length $T$, where the computational complexity drops from $O(T N^T)$ to $O(T N^2)$. Following the same idea, I built seven networks, one for each facial expression. Every network has a sequence of eight steps, one for each facial point, where the point arrangement is depicted in Figure 7.12. Each step contains a variable number of states $N_{p_s}$ representing the probable locations of the corresponding point $p_s$; e.g., $p_s^i$ is the potential location $i$ of the point $p_s$. The evaluation of both the states (point locations) and the transitions between consecutive steps was carried out with respect to the aforementioned four models.

At step 1, each potential location is locally evaluated using $p^{n_1}(p_1 \mid p^p_1, c)\, p^{n_2}(p_1 \mid p^r_1)\, p^{n_3}(p_1 \mid I_{p_1}, c)$, and the response is stored as a metric value corresponding to this location. Starting from step two, each state bonds with the state from the earlier step that has the maximum value resulting from multiplying the state metric by $p^{n_1}(p_i \mid p_{i-1}, c)$, $i > 1$. The result of multiplying this maximum value by the local response of the underlying state is taken as its metric value. Each state bonds with only one state from the previous step and a varying number from the next step. In the last step, I choose the state with the maximum metric and multiply it by $p^{n_4}(c \mid I)$ to produce the network response. Finally, I consider the maximum response across the networks as our joint estimation. The winning network corresponds to the recognized facial expression, and, backtracking recursively from the maximum metric of its last step, I obtain the locations of the facial points. This entire fusion process is summarized in Algorithm 4, and a compact sketch of the dynamic-programming evaluation is given below.
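The following sketch runs this dynamic-programming evaluation for a single expression network, assuming the local log-scores and transition terms have been precomputed from the four models; names and structure are illustrative and do not reproduce Algorithm 4 verbatim.

```python
# Viterbi-style fusion over one expression network, in the log domain.
# local[i]      : 1-D array of local log-scores for the candidates of point i
#                 (n1*prior + n2*regression + n3*patch terms).
# trans(i, a, b): n1 * log p(p_i = candidate b | p_{i-1} = candidate a, c).
# log_expr_term : n4 * log p(c | I) for this network's expression.
import numpy as np

def run_network(local, trans, log_expr_term):
    """Return (best log-score, chosen candidate index per point)."""
    n_steps = len(local)                       # eight facial points
    metric = [np.asarray(local[0], dtype=float)]
    backptr = []
    for i in range(1, n_steps):
        prev = metric[-1]
        cur = np.empty(len(local[i]))
        ptr = np.empty(len(local[i]), dtype=int)
        for b in range(len(local[i])):
            # best predecessor: previous metric plus transition into candidate b
            t = prev + np.array([trans(i, a, b) for a in range(len(prev))])
            ptr[b] = int(np.argmax(t))
            cur[b] = t[ptr[b]] + local[i][b]
        metric.append(cur)
        backptr.append(ptr)
    # add the expression-level term and pick the best final state
    best_last = int(np.argmax(metric[-1]))
    score = metric[-1][best_last] + log_expr_term
    # backtrack to recover the chosen candidate of every point
    path = [best_last]
    for ptr in reversed(backptr):
        path.append(int(ptr[path[-1]]))
    return score, path[::-1]
```

The joint estimate is then the expression whose network returns the maximum score, together with the candidate point locations recovered by the backtracking step.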