Parameter optimization of histogram-based local descriptors for facial expression recognition

PeerJ Computer Science

Introduction

Facial expressions are essential for humans to communicate effectively, and performing facial expressions is the most common way humans convey their emotions. This window into the hidden emotions of humans has drawn much attention from the research community toward developing facial expression recognition systems. Facial expression recognition (FER) has several applications, such as pain assessment (Rathee & Ganotra, 2015), health monitoring (Altameem & Altameem, 2020), market research (Garbas et al., 2013), and online learning (Bhadur, 2021), to name but a few. A facial expression recognition system typically operates in three stages: face detection, feature extraction, and classification.

Feature extraction is one of the key stages in FER because it provides discriminative features related to a particular expression. The features extracted from a facial image must be robust to changes in illumination, pose variation, noise, occlusions, scale variation, and low resolution. The recognition of expressions is also difficult because of the complexity of facial expressions. Local descriptors can solve this problem by capturing the local appearance of image features (such as edges, lines, and spots) related to various facial expressions. Jia et al. (2015) provided a solution for image noise, low resolution, and multi-intensity expressions. Hesse et al. (2012) handled different face views, while Makhmudkhujaev et al. (2019) studied the recognition of noisy and position-varied facial expressions.

This article investigates four local descriptors which demonstrate powerful characteristics for facial expression recognition. It highlights how a proper parameter selection of these descriptors can increase the efficiency of FER systems in terms of recognition rates and speed. This article further shows the importance of correct face registration to preserve only facial features and discard unnecessary regions.

Figure 1 summarizes the steps followed by the system, from processing an input image to classifying a facial expression.


Figure 1: Proposed facial expression recognition approach.

This work contributes to the field of facial expression recognition in the following ways:

  • It proposes a novel face registration algorithm that works for color and gray-scale images.

  • It measures the impact of face registration on the efficiency of facial expression recognition through a quantitative analysis.

  • It presents a simple method for optimizing the recognition performance of local descriptors. The method is illustrated using four existing local descriptors and evaluated on two publicly available datasets.

  • It shows how different feature extraction parameters affect a local descriptor’s ability to efficiently represent facial features.

The following is a summary of the remainder of this article: “Background and Related Works” presents the background of this research and some related works. “Materials and Methods” discusses the methodology adopted for facial expression recognition, alongside the datasets used. “Experimental Results and Discussion” assesses the impact of face registration, and the optimization of local descriptors. The findings and directions for future work are summarized in “Conclusion”.

Background and related works

Extracting robust features from a facial image is one of the most pressing issues for FER systems today. Even the best classifier could fail if the features used do not appropriately represent facial expressions. It has been identified that face registration/pre-processing and feature extraction stages significantly impact the recognition performance of FER systems. The following paragraphs present some developments in the research in these two areas.

Face registration

Face registration is a procedure that aims to remove image regions containing no information related to facial expression (such as background and hair) while giving the face a predefined size and pose. Many existing approaches include a normalization or registration step before feature extraction. For instance, Carcagnì et al. (2015) used Viola-Jones-based algorithms and ellipse fitting to detect and align the face image while removing background objects. However, the method depends on facial feature color models, limiting its applicability to gray-scale images. Zhang et al. (2018) used the Viola-Jones algorithm to detect faces and crop the images to 100 × 100 pixels. A similar approach was followed by Mandal et al. (2019). Hassan & Suandi (2019) used Cascade Linear Regression to detect facial landmarks; the distance between the eyes was used as a reference to align the face and normalize it to 110 × 150 pixels. Liu et al. (2021) used a facial landmark detector that marks 68 points on the face, 27 of which permit the face region to be extracted from the background. Note that this approach does not align the face image.

Based on the above works, face pre-processing plays an essential role in FER systems. However, the impact of face registration on the recognition rate has not sufficiently been quantified. Our work investigates the impact of face registration in more detail. We also propose a novel automatic face registration algorithm inspired by the work in Carcagnì et al. (2015) that can also be used with gray-scale images.

Feature extraction

The robustness of extracted features depends on their discriminative power, their ease of computation, their dimensionality, their tolerance to noise such as illumination changes, and their intra-class variation, which should be low (Turan & Lam, 2018). Hence, extensive work has been done to achieve robust feature extraction. Geometric features have been investigated to represent facial expressions (Ghimire & Lee, 2013; Loconsole et al., 2014; Lin, Cheng & Li, 2014). Though geometric features can offer high discriminability, they depend heavily on the accurate detection and alignment of fiducial points in a face. Appearance features can solve this problem because they extract features related to the appearance of the entire face or of local patches. Lu et al. (2015) proposed combining Gabor features and sparse representation-based classification to recognize facial expressions; the method is robust to illumination changes, different resolutions, and orientation. Sajjad et al. (2018) fused two appearance and texture features, Histogram of Oriented Gradients and Uniform Local Ternary Patterns, and classified them using support vector machines; experiments showed that the method performed well in the presence of occlusions. Deep learning methods have also been applied to facial expression recognition; notable works include Zhang et al. (2019), Bougourzi et al. (2020), and Dong, Wang & Hang (2021). Though deep learning methods can achieve high recognition rates, they often require complex networks and expensive pre-processing computations to produce these results.

Local descriptors have proven to be a powerful way of characterizing expression-related information while being easy to compute. In particular, the histogram-based local descriptors can represent the occurrences of micro-patterns (such as edges, lines, and bumps) that form facial expressions. However, the aggregation of these micro-patterns into histograms leads to a loss of information about the location of the micro-patterns (Shan, Gong & McOwan, 2009; Moore & Bowden, 2011). To address this issue, weighted histogram representations have been proposed to give higher weights to regions with higher discriminability (e.g., eyes and mouth) and lower weights to regions with lower discriminability (e.g., image borders). Kabir, Jabid & Chae (2012) proposed the Local Directional Patterns with variance (LDPv) descriptor to characterize the texture and contrast information of facial components. The novel descriptor adds weights to an LDP (Local Directional Patterns) code based on its variance across the image. Recently, a similar approach called the Weighted Statistical Binary Pattern descriptor was developed by Truong, Nguyen & Kim (2021). The new descriptor combines the histograms of sign and magnitude components of the mean moments of an image. The histograms are weighted based on a new variance moment that contains distinctive facial features. Though the weighted methods achieve high recognition rates and robustness, they can suffer from high dimensionality and implementation complexity.

A popular technique for enhancing histogram-based local descriptors is to divide the face into overlapping or non-overlapping sub-regions. Shan, Gong & McOwan (2009) studied facial expression recognition with the Local Binary Patterns (LBP) descriptor by extracting features from sub-regions of a face image: a 59-bin LBP histogram was extracted from 18 × 21-pixel sub-regions of a 110 × 150 image. However, the extraction parameters were not optimized for facial expression recognition. To the best of our knowledge, Carcagnì et al. (2015) were the first to optimize the cell size and histogram bins of the Histogram of Oriented Gradients (HOG) descriptor for facial expression recognition. Their results show that the best HOG parameters are independent of the datasets used. The study goes further to optimize three other local descriptors and compare them with the HOG descriptor. However, the optimization of the other descriptors is not described in detail, leaving open questions such as whether those descriptors can also be optimized in a way that is independent of the datasets used.

Benitez-Garcia, Nakamura & Kaneko (2017) proposed a FER system that extracts appearance and geometric features around specific facial regions. The approach segments three facial regions (eye-eyebrows, nose, and mouth) and divides them into sub-regions. Local Fourier Coefficients and Facial Fourier Descriptors are extracted from each sub-region, and the feature vectors of all the sub-regions are concatenated. It should be noted that the resulting feature vector is a linear difference between the expressive and neutral features; hence, both the expressive and the neutral face are needed to recognize a given expression, which is not always practical in real life. The approach could also be negatively affected by pose variations. Turan & Lam (2018) studied the performance of several local descriptors when the resolution and the number of sub-regions of the face are varied. Five image resolutions were used, and the face was divided into l × l-pixel sub-images, where l was varied from 3 to 11 in steps of 1. The study reports that higher resolutions and more sub-regions lead to better classification performance. However, the results do not show a clear trend in recognition performance as the number of sub-regions varies, and there is little discussion of how the resolution and sub-region size affect the descriptor's ability to represent facial features.

Most of the above approaches extract local descriptors by dividing the face images into sub-regions. However, the approaches do not optimize the extraction parameters for FER, except for a few works which attempt to optimize either the sub-block size, image resolution, histogram bins, or a combination of these (Carcagnì et al., 2015; Turan & Lam, 2018). The current work aims to study the importance of optimizing the sub-block size and histogram bins of local descriptors. We believe that these are the most critical parameters for striking a balance between recognition rate and the feature vector length.

Materials and Methods

Datasets

Two publicly available face datasets are used in this study, organised as described by Badi Mame & Tapamo (2022). The first is the CK+ dataset (Lucey et al., 2010), one of the most widely used datasets for evaluating FER solutions because it contains subjects of various ethnicities, ages, and genders. The dataset comprises 593 image sequences from 123 subjects, captured under controlled and uncontrolled conditions and in both posed and non-posed settings. The experiments carried out in this study used only the posed images, in which the subjects perform six expressions. To obtain a balanced dataset from the available sequences, the following images are selected: the last image in each sequence for anger, disgust, and happiness; the last and fourth images for the first 68 sequences related to surprise; and the last and fourth-from-last images for fear and sadness. The second dataset is the Radboud Faces Database (RFD) compiled by Langner et al. (2010). RFD is popular among researchers because it contains a variety of expressions, gaze directions, and head orientations. The dataset contains pictures of 67 subjects (both adults and children) displaying eight expressions. A subset of 402 images was created, labeled with six expressions: anger, disgust, fear, happiness, sadness, and surprise. Figures 2 and 3 give some image samples from the CK+ and RFD datasets, respectively. Table 1 summarizes the datasets.


Figure 2: Sample of images from the CK+ dataset.


Figure 3: Sample of images from the Radboud Faces database.

Table 1:
Summary of datasets.
Dataset   No. of images   No. of expressions
CK+ 347 6
RFD 402 6
DOI: 10.7717/peerj-cs.1388/table-1

Face detection and registration

An important step in recognizing a facial expression is to locate a face within the input image. Many robust face detectors have been developed (Yang, Kriegman & Ahuja, 2002; Viola & Jones, 2004). In the present study, a general frontal face detector is sufficient since recognition is done on frontal faces only. The face detector is based on the Viola-Jones detector implemented by the OpenCV library. A face region contains all the information necessary to observe a facial expression. However, much of this information can be discarded so that the model uses only the parts of the image containing facial components. Face registration aims at giving detected faces a predefined pose, shape, and size. According to Carcagnì et al. (2015), the best features are usually extracted from images where the eyes are in a consistent position relative to the image. To address this issue, we propose a novel face registration algorithm that can be used on gray-scale images. First, the image is enhanced by applying contrast-limited adaptive histogram equalization to the gray-scale image. Next, the face blob is located by performing binary segmentation followed by two filtration techniques (detailed further below). Binary segmentation separates an image into two regions based on an intensity interval; the challenge here is to find the correct interval within which the face pixels lie. We propose an algorithm that automatically selects suitable thresholds quickly and efficiently. Consider an image with pixel intensities I(x,y); the segmented image B is defined as

B(x,y) = \begin{cases} 1 & \text{if } T_1 < I(x,y) < T_2 \\ 0 & \text{otherwise} \end{cases} \qquad (1)

where $T_1$ and $T_2$ are the lower and upper thresholds respectively. These thresholds are defined as

T_1 = \mu - \sigma \qquad (2)

T_2 = \mu + 2\sigma \qquad (3)

where $\mu$ and $\sigma$ are the mean and standard deviation of the intensity values respectively. Specifically, the mean and standard deviation are obtained by creating a histogram from the face center pattern. They are defined by Eqs. (4) and (5), respectively:

\mu = \frac{1}{N} \sum_{x,y} W(x,y) \times I(x,y) \qquad (4)

\sigma = \sqrt{\frac{1}{N} \sum_{x,y} \big(I(x,y) - \mu\big)^2} \qquad (5)

where $N$ is the number of pixels in the image and $W(x,y)$ is the frequency of the bin (or interval) within which the intensity $I(x,y)$ lies.
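As a concrete illustration, a minimal NumPy sketch of this segmentation is given below. The central patch used to build the histogram and the bin count are illustrative assumptions (the article specifies only that the histogram is built from the face center pattern), and the normalized weighted mean is one reading of Eq. (4).

```python
import numpy as np

def segment_face(gray, n_bins=32):
    """Binary face segmentation with mean/std thresholds (Eqs. (1)-(5)).

    `gray` is a 2D uint8 array. The central patch and `n_bins` are
    illustrative choices, not values prescribed by the article.
    """
    h, w = gray.shape
    # Patch assumed to contain mostly face pixels (the "face center pattern").
    patch = gray[h // 4:3 * h // 4, w // 4:3 * w // 4].astype(np.float64)

    # W(x, y): relative frequency of the histogram bin containing I(x, y).
    counts, edges = np.histogram(patch, bins=n_bins, range=(0, 256))
    bin_idx = np.clip(np.digitize(patch, edges) - 1, 0, n_bins - 1)
    weights = counts[bin_idx] / patch.size

    mu = np.sum(weights * patch) / np.sum(weights)  # weighted mean, Eq. (4)
    sigma = np.sqrt(np.mean((patch - mu) ** 2))     # Eq. (5)

    t1, t2 = mu - sigma, mu + 2 * sigma             # Eqs. (2) and (3)
    return ((gray > t1) & (gray < t2)).astype(np.uint8)  # Eq. (1)
```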

The binary segmentation operation is not perfect, so unwanted regions must be filtered out. The first filter is an ellipse computed from the face's dimensions and location: elements inside the ellipse are retained, while elements outside it are discarded. The second filter searches for the largest connected component (Bolelli et al., 2019) of the graph-encoded binary image. The blob's shape is then refined by estimating the polygonal shape around it. Specifically, a polygon is estimated by finding a set of contour points (Lim, Zitnick & Dollár, 2013) enclosing the blob, computing its convex hull (Sklansky, 1982), and approximating a polygonal shape. Polygon estimation is important because it fills any holes in the blob and makes the ellipse-fitting step more reliable.
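The two filtration steps and the shape refinement can be sketched with standard OpenCV primitives as follows. The individual OpenCV calls are real, but their composition here is our illustrative reading of the procedure, and `face_rect` is assumed to come from the Viola-Jones detector.

```python
import cv2
import numpy as np

def refine_face_blob(binary, face_rect):
    """Filter a binary segmentation down to a clean face blob.

    `binary` is a 0/1 uint8 image; `face_rect` = (x, y, w, h) from the
    face detector. The composition is a sketch, not the exact pipeline.
    """
    x, y, w, h = face_rect

    # Filter 1: keep only pixels inside an ellipse fitted to the
    # detected face rectangle.
    mask = np.zeros_like(binary)
    cv2.ellipse(mask, (x + w // 2, y + h // 2), (w // 2, h // 2),
                0, 0, 360, 1, thickness=-1)
    blob = binary * mask

    # Filter 2: keep the largest connected component (skip background).
    n, labels, stats, _ = cv2.connectedComponentsWithStats(blob)
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    blob = (labels == largest).astype(np.uint8)

    # Refinement: contour -> convex hull -> approximate polygon -> fill,
    # which closes any holes before the ellipse-fitting step.
    contours, _ = cv2.findContours(blob, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    hull = cv2.convexHull(max(contours, key=cv2.contourArea))
    poly = cv2.approxPolyDP(hull, 0.01 * cv2.arcLength(hull, True), True)
    refined = np.zeros_like(blob)
    cv2.fillPoly(refined, [poly], 1)
    return refined
```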

A vertical rotation of the face is carried out by fitting an ellipse around the face and using its angle of inclination (i.e., the angle between the minor axis and the x-axis). To refine the process, the image is rotated a second time using the eye locations. Elliptical cropping followed by rectangular cropping is then performed to generate an image in which the eyes are in a predefined position relative to the sides of the image. Finally, the output image is resized. The approach used to perform face detection and registration is shown in Fig. 4.


Figure 4: Face detection and registration approach.

Feature extraction

The goal of feature extraction is to generate a feature vector from a facial image for classification. This study is interested in local descriptors used for facial expression recognition. Four local descriptors were examined: Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Compound Local Binary Patterns (CLBP), and Weber’s Local Descriptor (WLD). As previously introduced in Badi Mame & Tapamo (2022), this study uses a method that divides the face into a number of non-overlapping and equally-sized sub-blocks, and a local descriptor is extracted from each sub-block, producing several histograms using Eq. (6). The resulting feature vector is formed by normalizing the histograms and concatenating them into a 1D feature vector using Eq. (7). Histograms of sub-blocks are calculated separately and normalized such that the frequencies sum to one. This rule is applied to all the types of descriptors except the HOG descriptor, where the blocks are normalized in groups using L2-Hys (Dalal & Triggs, 2005). Suppose Ij is a matrix, of dimension WS×WS, of intensities that represents the j-th sub-block in the image, and the function Dj(x,y) calculates the descriptor at a position (x,y) in Ij, then the histogram associated with this sub-block is defined as

H^j = \{h_i^j\}_{i=0,1,\ldots,N_b-1} \qquad (6)

h_i^j = \sum_{x=0}^{WS-1} \sum_{y=0}^{WS-1} \delta\big(D_j(x,y),\, i\big)

where $\delta(d,i)$ is defined as

\delta(d,i) = \begin{cases} 1 & \text{if } L(i) \le d < U(i) \\ 0 & \text{otherwise} \end{cases}

with

L(i) = \frac{i \times n}{N_b} \quad \text{and} \quad U(i) = \frac{(i+1) \times n}{N_b}

where $n$ is the number of possible descriptor values and $N_b$ is the number of histogram bins.

The resulting feature vector is defined as

H = \left\{ \frac{H^0}{(WS)^2}, \frac{H^1}{(WS)^2}, \ldots, \frac{H^{m-1}}{(WS)^2} \right\} \qquad (7)

where $m$ is the number of sub-blocks¹.
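A minimal sketch of this block-histogram scheme (Eqs. (6) and (7)) is shown below. It assumes the per-pixel descriptor codes have already been computed into `desc_img`, and applies the simple sum-to-one normalization used for all descriptors except HOG.

```python
import numpy as np

def block_histogram_features(desc_img, ws, n_bins, n_values):
    """Concatenate normalized sub-block histograms (Eqs. (6)-(7)).

    `desc_img` holds the descriptor code of every pixel (e.g., LBP
    codes); `ws` is the sub-block side length WS, and `n_values` is
    the number of possible descriptor values (n in the text).
    """
    h, w = desc_img.shape
    feats = []
    for by in range(0, h - ws + 1, ws):
        for bx in range(0, w - ws + 1, ws):
            block = desc_img[by:by + ws, bx:bx + ws]
            # Bin i covers [i*n/Nb, (i+1)*n/Nb), as in the text.
            hist, _ = np.histogram(block, bins=n_bins,
                                   range=(0, n_values))
            feats.append(hist / float(ws * ws))  # frequencies sum to 1
    return np.concatenate(feats)
```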

Histogram of Oriented Gradients

The Histogram of Oriented Gradients (HOG) descriptor was first proposed as a local-histogram descriptor for human detection in Dalal & Triggs (2005) and has since been successfully applied to facial expression recognition (Chen et al., 2014; Kumar, Happy & Routray, 2016; Hussein et al., 2022). In this study, Scikit-image's implementation (https://scikit-image.org/docs/dev/auto_examples/features_detection/plot_hog.html) of the HOG descriptor was used. The HOG descriptor is calculated as follows. The first stage computes first-order image gradients by differentiating along the x and y directions. In the second stage, the first-order gradients are used to calculate the magnitude and orientation of the gradient images. In the third stage, the images are divided into cells, and a histogram is computed for each cell. Each cell histogram divides the gradient orientation range into a fixed number of intervals (or bins), and the gradient magnitudes of the pixels in the cell are used to vote into the histogram (see more details in the next paragraph). In the fourth stage, the cells are grouped into larger spatially connected groups called blocks, and block descriptors are calculated. A block descriptor is created by concatenating the cell histograms in that block and normalizing them (further explained below). Note that the blocks overlap, so cells are shared between blocks and appear more than once (but with different normalizations) in the final feature vector. Finally, all the block descriptors are concatenated to form the final HOG feature vector.

Consider $L(x,y)$ to be the intensity function of an $M \times N$ gray-scale image. The first-order image gradients along the x and y directions at coordinate $(x,y)$ are defined as

I_x = L(x-1, y) - L(x+1, y) \qquad (8)

I_y = L(x, y-1) - L(x, y+1). \qquad (9)

Given the differential components $I_x$ and $I_y$, the gradient magnitudes and orientations of the image are defined as

M(x,y) = \sqrt{I_x^2 + I_y^2} \qquad (10)

\alpha = \frac{180}{\pi} \big( \arctan2(I_y, I_x) \bmod \pi \big) \qquad (11)

where arctan2 is the four-quadrant inverse tangent, which yields values between $-\pi$ and $\pi$. Using the gradient magnitudes and orientations, the histogram of oriented gradients for each cell is defined as

H_{cell}(i,j,k) = \sum_{(u,v) \in C_{i,j},\ \alpha(u,v) \in \theta_k} M(u,v) \qquad (12)

0 \le i \le \frac{M}{cellSize} - 1, \qquad 0 \le j \le \frac{N}{cellSize} - 1

where $C_{i,j}$ is the set of pixel coordinates within the $(i,j)$ cell and $\theta_k$ is the range of each bin, e.g., for 6 orientations, $\theta_1 = [165, 180) \cup [0, 15)$, $\theta_2 = [15, 45)$, $\theta_3 = [45, 75)$, $\theta_4 = [75, 105)$, $\theta_5 = [105, 135)$ and $\theta_6 = [135, 165)$. Therefore, $H_{cell}(i,j,k)$ represents the histogram of oriented gradients at the $(i,j)$ cell. After obtaining cell histograms, the cells are grouped into blocks and normalized as follows:

  • First, the block descriptor is built by concatenating cell histograms within the block:

H_{block} = \big( H_{cell_1}, H_{cell_2}, \ldots, H_{cell_{n_c}} \big) \qquad (13)

where $n_c$ is the number of cells per block. In this study $n_c = 3$.

  • Normalize the descriptor using L2 normalization:

\hat{h}_l = \frac{h_l}{\sqrt{\sum_l h_l^2 + e^2}} \qquad (14)

where $h_l$ is the $l$-th element of the block descriptor, $\hat{h}_l$ is the corresponding normalized value, and $e$ is a small constant that prevents division by zero.

  • Limit the maximum value to 0.2:

h_l = \begin{cases} 0.2 & \text{if } \hat{h}_l > 0.2 \\ \hat{h}_l & \text{otherwise} \end{cases} \qquad (15)

  • Repeat L2 normalization using Eq. (14).

  • Concatenate all the block descriptors to produce the final HOG feature vector:

H_{hog} = \big( H_{block_1}, H_{block_2}, \ldots, H_{block_{n_b}} \big) \qquad (16)

where $n_b$ is the number of blocks.

Algorithm 1 summarizes the calculation of a HOG feature vector.

Algorithm 1 :
Histogram of oriented gradients.
Inputs:
   I: input image
Outputs:
   H_hog: feature vector
1: Compute first order image gradients using Eqs. (8) and (9).
2: Compute gradient orientations and gradient magnitudes using Eqs. (10) and (11).
3: Build the histograms of oriented gradients for all the cells using Eq. (12).
4: Build a descriptor for all the blocks using the normalization in Eqs. (13)–(15).
5: Return a feature vector by concatenating all the block descriptors using Eq. (16).
DOI: 10.7717/peerj-cs.1388/table-7
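Because the study relies on scikit-image's HOG implementation, the extraction can be reproduced roughly as follows. The file name is a placeholder; the parameter values shown are the best CK+ settings reported later in this article, with $n_c = 3$ expressed as cells_per_block=(3, 3).

```python
from skimage import io
from skimage.feature import hog

# Load a registered gray-scale face image (path is illustrative).
face = io.imread('registered_face.png', as_gray=True)

# Best CK+ parameters from this study: 9 orientation bins, 8x8 cells,
# 3x3 cells per block, L2-Hys block normalization.
features = hog(face,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(3, 3),
               block_norm='L2-Hys')
print(features.shape)  # 1D HOG feature vector
```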

Local Binary Pattern

The Local Binary Pattern descriptor was first introduced to describe textural appearance in pattern recognition problems. The LBP histogram contains information about the distribution of local micro-patterns, such as flat areas and edges, in an image. The original LBP descriptor is computed by considering each pixel g_c of an input image as the 'threshold' of a 3 × 3 grid and assigning a value of 1 or 0 to each of the eight surrounding pixels, using the following rule: if the neighboring pixel is less than the threshold, it is assigned the value 0; otherwise, it is assigned 1. A binary code is then read from the sequence of 0s and 1s and converted to a decimal number, which is assigned to the center (or threshold) pixel. Once this operation has been performed on the entire image, the LBP image is obtained. The histogram of the LBP image is then used as a feature vector (Ojala, Pietikäinen & Mäenpää, 2000). Figure 5 gives an example of computing the LBP code of a pixel in a 3 × 3 neighborhood block.

Figure 5: An example of the computation of LBP on a 3 × 3 neighborhood. The threshold is 25 and the labels are read anticlockwise to produce a binary pattern of 0.

LBP can be extended to allow any number of neighbors in a circular neighborhood and bilinear interpolation of pixel values (Ojala, Pietikainen & Maenpaa, 2002). In this study, the LBP descriptor was implemented with three neighbors and a radius of 1. The new LBP function is defined as

Algorithm 2 :
Local binary patterns.
Input Image: I
Output Feature vector: H = (H_0, H_1, \ldots, H_{N_b-1})
Divide the input image into blocks $(I_b)_{b=0,1,\ldots,N_b-1}$.
for each block b do
  Compute the LBP codes $f_c^b$ for block b using Eq. (17).
  Compute the histogram $H_b$ using Eq. (18).
end for
Concatenate the histograms, $H = (H_b)_{b=0,1,\ldots,N_b-1}$.
DOI: 10.7717/peerj-cs.1388/table-8

f_c = \sum_{j=0}^{P-1} S(g_j - g_c)\, 2^j \qquad (17)

where $1 \le x \le w$, $1 \le y \le h$, and $S(A)$ is defined as

S(A) = \begin{cases} 1 & \text{if } A \ge 0 \\ 0 & \text{else} \end{cases}

$g_j$ corresponds to the neighboring intensity values, $g_c$ is the center value, and $w$ and $h$ are the width and height of the input image respectively.

After computing all the LBP codes $f_c(x,y)$, a histogram $(H_i)_{i=0,1,\ldots,n-1}$ is created that counts the occurrences of each range of LBP codes:

H_i = \sum_{x=0}^{w-1} \sum_{y=0}^{h-1} I\big(f_c(x,y) = i\big) \qquad (18)

where $n$ is the number of bins in the histogram and $I(z)$ is defined as

I(z) = \begin{cases} 1 & \text{if } z \text{ is true} \\ 0 & \text{else} \end{cases}

Figure 6 shows the steps followed to compute LBP features of an image.

Figure 6: Computing the LBP features of an image.
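A block-wise LBP extraction along these lines can be sketched with scikit-image. Note that the study reports three neighbors at a radius of 1, whereas the classical eight-neighbor configuration is shown here, with P left as a parameter; the block size and bin count mirror the best CK+ values reported later.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_features(gray, ws=7, n_bins=16, P=8, R=1):
    """Sketch of block-wise LBP histogram extraction.

    `P` is the number of sampling points (the study reports three
    neighbors at radius 1; P=8 is the classical choice) and `R` the
    radius of the circular neighborhood.
    """
    codes = local_binary_pattern(gray, P, R, method='default')
    h, w = codes.shape
    feats = []
    for by in range(0, h - ws + 1, ws):
        for bx in range(0, w - ws + 1, ws):
            block = codes[by:by + ws, bx:bx + ws]
            # One histogram per sub-block, normalized to sum to one.
            hist, _ = np.histogram(block, bins=n_bins, range=(0, 2 ** P))
            feats.append(hist / float(ws * ws))
    return np.concatenate(feats)
```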

Compound local binary pattern

Compound Local Binary Patterns (CLBP) extends the LBP descriptor. CLBP differs from LBP in that it encodes both the sign and the magnitude of the differences between a center (or threshold) pixel and its P neighbors. Each pixel is treated as a center with P neighbors, and each neighbor is given a 2-bit code as follows: the first bit is set to 0 if the neighboring pixel is less than the center pixel and to 1 otherwise; the second bit is set to 1 if the magnitude of the difference between the neighboring pixel and the center pixel is greater than a threshold M_avg, and to 0 otherwise. The threshold is defined as the average magnitude of the differences between the center pixel and its P neighbors. The rest of this discussion considers P = 8, since the same is used in our implementation. Figure 7 illustrates the computation of CLBP features of an image.

Algorithm 3 :
Compound local binary patterns.
Input Image: I
Output Feature vector: H = (H_0, H_1, \ldots, H_{N_b-1})
Divide the input image into blocks $(I_b)_{b=0,1,\ldots,N_b-1}$.
for each block b do
  Compute the sub-CLBP codes using Eqs. (20) and (21), respectively.
  Create a histogram for each set of codes, $H1_b$ and $H2_b$.
  Concatenate the sub-histograms, $H_b = \{H1_b, H2_b\}$.
end for
Concatenate the 1D histograms, $H = \{H_b\}_{b=0,1,\ldots,N_b-1}$.
DOI: 10.7717/peerj-cs.1388/table-9
Figure 7: Computing the CLBP features of an image.

Let fc(x,y) be the function representing the CLBP code of a pixel positioned at (x,y) defined as:

f_c(x,y) = \sum_{p=0}^{P-1} s(i_p, i_c)\, 2^{2p} \qquad (19)

where $s(i_p, i_c)$ is defined as

s(i_p, i_c) = \begin{cases} 00 & \text{if } i_p - i_c < 0 \text{ and } |i_p - i_c| \le M_{avg} \\ 01 & \text{if } i_p - i_c < 0 \text{ and } |i_p - i_c| > M_{avg} \\ 10 & \text{if } i_p - i_c \ge 0 \text{ and } |i_p - i_c| \le M_{avg} \\ 11 & \text{otherwise} \end{cases}

$i_c$ and $i_p$ represent the gray values of the center $c$ and a neighbor $p$, and $M_{avg}$ is the average magnitude of the differences between the $(i_c, i_p)$ pairs in the local neighborhood.

The CLBP binary code is further subdivided into two sub-CLBP patterns, $f_{c1}$ and $f_{c2}$. Hence, during implementation, Eq. (19) is not used; instead, Eqs. (20) and (21) are used. The sub-CLBP patterns can be defined as follows:

f_{c1} = \sum_{p \in O,\, k \in E} \big( S1_p(i_p, i_c)\, 2^{k+1} + S2_p(i_p, i_c)\, 2^{k} \big) \qquad (20)

f_{c2} = \sum_{p \in E,\, k \in E} \big( S1_p(i_p, i_c)\, 2^{k+1} + S2_p(i_p, i_c)\, 2^{k} \big) \qquad (21)

where $O = \{1, 3, 5, 7\}$, $E = \{0, 2, 4, 6\}$, and $S1_p(i_p, i_c)$ and $S2_p(i_p, i_c)$ are defined as

S1_p(i_p, i_c) = \begin{cases} 0 & \text{if } i_p - i_c < 0 \\ 1 & \text{if } i_p - i_c \ge 0 \end{cases}, \qquad S2_p(i_p, i_c) = \begin{cases} 0 & \text{if } |i_p - i_c| \le M_{avg} \\ 1 & \text{if } |i_p - i_c| > M_{avg} \end{cases}

with

M_{avg} = \frac{1}{8} \sum_{p=0}^{7} |i_p - i_c|
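The following NumPy sketch computes the two sub-CLBP codes for every interior pixel. The split of the compound code into f_c1 (odd-indexed neighbors) and f_c2 (even-indexed neighbors) is our reading of Eqs. (20) and (21), and the neighbor layout follows the G_s arrangement used later for WLD.

```python
import numpy as np

def clbp_codes(gray):
    """Sketch of the two sub-CLBP codes for each interior pixel.

    The odd/even split of the neighbors between fc1 and fc2 is our
    interpretation of Eqs. (20) and (21).
    """
    g = gray.astype(np.int32)
    h, w = g.shape
    # Offsets of the 8 neighbors x0..x7, clockwise from the top-left,
    # matching the layout of Gs used elsewhere in the article.
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    center = g[1:h - 1, 1:w - 1]
    diffs = np.stack([g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx] - center
                      for dy, dx in offs])          # shape (8, H-2, W-2)
    m_avg = np.mean(np.abs(diffs), axis=0)          # M_avg

    s1 = (diffs >= 0).astype(np.int32)              # sign bit (S1)
    s2 = (np.abs(diffs) > m_avg).astype(np.int32)   # magnitude bit (S2)

    fc1 = np.zeros_like(center)
    fc2 = np.zeros_like(center)
    for k, p in enumerate([1, 3, 5, 7]):            # odd neighbors -> fc1
        fc1 += s1[p] * 2 ** (2 * k + 1) + s2[p] * 2 ** (2 * k)
    for k, p in enumerate([0, 2, 4, 6]):            # even neighbors -> fc2
        fc2 += s1[p] * 2 ** (2 * k + 1) + s2[p] * 2 ** (2 * k)
    return fc1, fc2  # each an 8-bit code in [0, 255]
```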

Weber’s Local Descriptor

Weber’s Local Descriptor (WLD) is inspired by the human perception of a pattern, which depends not only on the change of a stimulus (such as sound or lighting) but also on the original intensity of the stimulus (Chen et al., 2009). WLD consists of two components: a differential excitation (ξ) and a gradient orientation (θ). The differential excitation is expressed as the ratio between two terms: the relative intensity differences of a current pixel against its neighbors, and the intensity of the current pixel itself. The second component is the gradient orientation. A feature vector is generated by first rearranging the differential excitations into subgroups that are used to form histograms, following a particular rule discussed further below. The histograms are ultimately reordered and concatenated to form a single feature vector. Figure 8 illustrates how WLD feature vectors are extracted from an input image, and Algorithm 4 summarizes the extraction of WLD features.

Figure 8: Extraction of WLD features from an input image.
Algorithm 4 :
Weber’s Local Descriptor.
Input Image: I
Output Feature vector: H = (H_0, H_1, \ldots, H_{N_b-1})
Divide the input image into blocks $(I_b)_{b=0,1,\ldots,N_b-1}$.
for each block b do
  Compute the differential excitations $\xi_j$ using Eq. (22).
  Compute the gradient orientations $\theta_j$ using Eq. (23).
  Map the gradient orientations $\theta_j$ using Eq. (24).
  Quantize the mapped orientations into T dominant orientations $\Phi_t$ using Eq. (25).
  Create sub-histograms $H_{m,t}$ using Eq. (26).
  Concatenate the sub-histograms, $H_b = \{H_{m,t}\}_{t=0,1,\ldots,T-1;\ m=0,1,\ldots,M-1}$.
end for
Concatenate the 1D histograms, $H = \{H_b\}_{b=0,1,\ldots,N_b-1}$.
DOI: 10.7717/peerj-cs.1388/table-10

Consider a 3 × 3 pixel block G_s taken from an input image G; the WLD operator is applied as follows:

  • Compute the differential excitation component which is expressed as:

\xi(x_c) = \tan^{-1}\left( \frac{V_s^{00}}{V_s^{01}} \right) = \tan^{-1}\left[ \sum_{i=0}^{p-1} \left( \frac{x_i - x_c}{x_c} \right) \right] \qquad (22)

  • The gradient orientation θ(xc) which is the second component of WLD is computed as follows:

\theta(x_c) = \tan^{-1}\left( \frac{V_s^{11}}{V_s^{10}} \right) = \tan^{-1}\left( \frac{x_5 - x_1}{x_7 - x_3} \right) \qquad (23)

where the neighborhood is arranged as

G_s = \begin{bmatrix} x_0 & x_1 & x_2 \\ x_7 & x_c & x_3 \\ x_6 & x_5 & x_4 \end{bmatrix}

and both components lie in the interval

\xi(x_c) \in \left[ -\frac{\pi}{2}, \frac{\pi}{2} \right], \qquad \theta(x_c) \in \left[ -\frac{\pi}{2}, \frac{\pi}{2} \right].

  • Map the gradient orientations from the interval $[-\frac{\pi}{2}, \frac{\pi}{2}]$ to the interval $[0, 2\pi]$ by applying the following function:

\theta' = \arctan2\big(V_s^{11}, V_s^{10}\big) \qquad (24)

\arctan2\big(V_s^{11}, V_s^{10}\big) = \begin{cases} \theta & \text{if } V_s^{11} > 0 \text{ and } V_s^{10} > 0 \\ \pi - \theta & \text{if } V_s^{11} > 0 \text{ and } V_s^{10} < 0 \\ \theta - \pi & \text{if } V_s^{11} < 0 \text{ and } V_s^{10} < 0 \\ 2\pi + \theta & \text{if } V_s^{11} < 0 \text{ and } V_s^{10} > 0 \end{cases}

where

V_s^{11} = x_5 - x_1; \qquad V_s^{10} = x_7 - x_3

  • The mapped orientations θ′ are further quantized into T dominant orientations using the following function:

\Phi_t = f_q(\theta') = \frac{2t}{T}\pi \qquad (25)

t = \mathrm{mod}\left( \frac{\theta'}{2\pi / T},\, T \right)

where $t$ maps a differential excitation with orientation $\theta'$ to a particular dominant orientation $\Phi_t$. Next, a 2D histogram $\{WLD(\xi_j, \Phi_t)\}_{j=0,1,\ldots,N-1;\ t=0,1,\ldots,T-1}$ is created, where $N$ is the number of pixels in the image and $T$ is the number of dominant orientations. The 2D histogram has $C$ rows and $T$ columns, where $C$ is the number of cells in each orientation. The value of $C$ is explained in the subsequent part of this section. Intuitively, each cell of the 2D histogram represents the frequency of a differential excitation interval on a dominant orientation.

The 2D histogram is further encoded into a 1D histogram $H$ as follows. First, each column of the 2D histogram is projected to form a 1D histogram $(H(t))_{t=0,1,\ldots,T-1}$. Each histogram is produced by grouping the differential excitations $\xi_j$ corresponding to a dominant orientation $\Phi_t$.

Next, each sub-histogram $H(t)$ is divided into $M$ segments $(H_{m,t})_{m=0,1,\ldots,M-1}$. The sub-histograms $H_{m,t}$ collectively form a histogram matrix: each row corresponds to a differential excitation segment and each column to a dominant orientation. The histogram matrix is reorganized as a 1D histogram by first concatenating the rows of the matrix into histograms $H_m = \{H_{m,t}\}_{t=0,1,\ldots,T-1}$. The resulting $M$ sub-histograms are then concatenated into a single 1D histogram $H = \{H_m\}_{m=0,1,\ldots,M-1}$.

Note that each sub-histogram $H(t)$ is divided into $M$ segments $H_{m,t}$ by dividing the interval of differential excitations $l = [-\frac{\pi}{2}, \frac{\pi}{2}]$ into $M$ segments $l_m$. Furthermore, each sub-histogram $H_{m,t}$ is an $S$-bin histogram defined as:

H_{m,t} = \{h_{m,t,s}\} \qquad (26)

h_{m,t,s} = \sum_j \delta(S_j = s)

where $S_j$ and $\delta(A)$ are defined as

S_j = \mathrm{mod}\left( \frac{\xi_j - \eta_{m,l}}{(\eta_{m,u} - \eta_{m,l}) / S},\, S \right)

\delta(A) = \begin{cases} 1 & \text{if } A \text{ is true} \\ 0 & \text{otherwise} \end{cases}

with

\eta_{m,l} = \left( \frac{m}{M} - \frac{1}{2} \right)\pi \quad \text{and} \quad \eta_{m,u} = \left( \frac{m+1}{M} - \frac{1}{2} \right)\pi.
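A compact NumPy sketch of the two WLD components is given below. It uses np.arctan2 in place of the piecewise mapping of Eq. (24) (which arctan2 implements), approximates image borders by wrapping, and the floor-based quantization is one reading of Eq. (25).

```python
import numpy as np
from scipy.signal import convolve2d

def wld_components(gray, T=8):
    """Sketch of WLD's two per-pixel components (Eqs. (22)-(25)).

    Returns the differential excitation xi and the dominant
    orientation index t for each pixel (borders wrap; the interior
    pixels are the ones of interest).
    """
    g = gray.astype(np.float64) + 1e-6  # avoid division by zero

    # V^00 = sum of differences to the 8 neighbors; V^01 = the pixel.
    k00 = np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]], dtype=np.float64)
    v00 = convolve2d(g, k00, mode='same')
    xi = np.arctan(v00 / g)                        # Eq. (22)

    # V^11 = x5 - x1 (below minus above), V^10 = x7 - x3 (left minus
    # right), following the Gs neighbor layout.
    v11 = np.roll(g, -1, axis=0) - np.roll(g, 1, axis=0)
    v10 = np.roll(g, 1, axis=1) - np.roll(g, -1, axis=1)
    theta = np.arctan2(v11, v10) % (2 * np.pi)     # mapped to [0, 2*pi)

    # Quantize into T dominant orientations (one reading of Eq. (25)).
    t = np.floor(theta / (2 * np.pi / T)).astype(int) % T
    return xi, t
```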

Support vector machines

The support vector machine (SVM) is a machine learning algorithm whose objective is to learn how to classify input features by defining a separating hyperplane. Considering $l$ training feature vectors $x_1, x_2, \ldots, x_l$ with their corresponding labels $y_1, y_2, \ldots, y_l$, where $y_i \in \{-1, 1\}$ for $i = 1, 2, \ldots, l$, the following primal optimization problem can be defined:

\min_{w, b, \xi} \left\{ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \right\}

subject to

y_i \big( w^T \Phi(x_i) + b \big) \ge 1 - \xi_i

\xi_i \ge 0, \quad i = 1, \ldots, l

where $\xi_i$ and $\xi$ are the classification error for the $i$-th training vector and the total classification error, respectively; $w$ is the normal vector to the hyperplane; $\frac{b}{\|w\|}$ represents the offset of the hyperplane from the origin along the normal vector $w$ ($\|\cdot\|$ being the norm operator); $\Phi(x_i)$ is the kernel function that maps $x_i$ into a higher-dimensional space; and $C$ is the regularization parameter. The classical SVM approach handles only two classes, so a set of SVMs combined with a one-against-one strategy makes support vector machines suitable for multi-class problems such as facial expression recognition (Knerr, Personnaz & Dreyfus, 1990). The experiments discussed here use this multi-class SVM strategy. More precisely, scikit-learn's implementation of the LIBSVM library (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) was used.

To train an SVM classifier, a set of hyper-parameters must be selected: the kernel function, the regularization parameter (C), and the kernel coefficient (gamma). For each classifier, a parameter optimization procedure is performed; specifically, grid-search cross-validation is employed (Hsu, Chang & Lin, 2003). In grid-search cross-validation, every combination of the candidate parameter values is used to train a classifier with 10-fold cross-validation, and the combination that produces the best average recall is selected. Table 2 gives the selected parameters when each type of feature is used to train an SVM classifier.
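A sketch of this grid-search procedure with scikit-learn follows. The parameter grid and the synthetic data are illustrative assumptions, while the 10-fold cross-validation and macro-averaged recall criterion mirror the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in for the real feature matrix X and expression labels y.
X, y = make_classification(n_samples=300, n_features=100, n_classes=6,
                           n_informative=20, random_state=0)

# Candidate hyper-parameter values (illustrative grid).
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [1, 10, 100, 1000],
    'gamma': [0.001, 0.01, 0.05, 0.1],
}

# 10-fold CV, selecting the combination with the best average recall.
search = GridSearchCV(SVC(), param_grid, cv=10, scoring='recall_macro')
search.fit(X, y)
print(search.best_params_)  # e.g., {'C': 1000, 'gamma': 0.05, ...}
```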

Table 2:
Calibration of the SVM training hyper-parameters.
Descriptor C Gamma Kernel
HOG 1,000 0.05 RBF
LBP 1,000 0.05 RBF
CLBP 1,000 0.05 Linear
WLD 1,000 0.05 Linear
DOI: 10.7717/peerj-cs.1388/table-2

Experimental results and discussion

Experimental phase 1: the effect of face registration

This section studies the effect of face registration on the FER system. To achieve this, the system is first trained on features extracted from unregistered faces, and the performance is recorded. Then, the system is trained on features of registered faces. Note that the non-registered images are resized to 65 × 65 pixels, while the registered images are resized to 59 × 65 pixels to strike a balance between performance and feature vector length. Also, the parameters used for feature extraction are those recorded in Table 2.

Table 3 shows each expression's recognition rate before and after registration, showing that registration generally either increases the recognition rate (e.g., fear) or leaves it unchanged (e.g., happiness with HOG). The most significant increase is observed for the LBP descriptor on the anger expression. The results reveal that face registration improves the average recognition rate regardless of the descriptor used. These results suggest that face registration has a positive (and non-negligible) impact on the feature extraction phase and, ultimately, on the recognition performance. Figure 9 shows an example of a registered image compared to a non-registered one.

Table 3:
Recognition performance per class (%) before and after registration.
HOG LBP CLBP WLD
NR RE NR RE NR RE NR RE
Anger 88.9 93.3 68.9 91.1 84.4 80 77.8 93.3
Disgust 91.5 96.6 94.9 93.2 83.1 94.1 93.2 94.9
Fear 92 100 90 100 96 98 92 98
Happiness 98.6 98.6 95.7 98.6 91.3 97.1 95.7 98.6
Sadness 96.4 98.2 91.1 96.4 91.1 96.4 98.2 98.2
Surprise 98.5 98.5 97.1 97.1 95.6 98.5 95.6 98.5
Average 94.1 97.4 89.3 96.1 90.1 94.1 92.2 96.9
DOI: 10.7717/peerj-cs.1388/table-3

Note:

RE, Registered; NR, Non-registered.


Figure 9: Sample of non-registered and registered images.

Experimental phase 2: optimizing spatial local descriptors

This section evaluates the performance of four local descriptors: HOG, LBP, CLBP, and WLD. Several classifiers are trained on features from different descriptors, and the average recalls during 10-fold cross-validation are recorded. The classifiers are trained on features extracted from two datasets: the CK+ dataset of six expressions and the RFD dataset of six expressions. A parameter selection is performed on each descriptor to find the configuration that gives the best performance. There are two parameters for the extraction of LBP and CLBP features: the number of histogram bins and the block size. Bin counts of 8, 16, and 32, and block sizes from 2 to 24 (inclusive) are available for selection (the block size refers to the side length of a square block). For the HOG descriptor, the number of orientation bins and the cell size are tuned: the orientations used are 3, 5, 7, 9, 12, 15, and 55, and the cell size is varied between 4 and 15 (inclusive). For the WLD descriptor, four parameters must be tuned: T, M, S, and the block size. Block sizes from 2 to 24 (inclusive) are available for selection, while T, M, and S can take values of 4, 6, and 8. The sweep itself can be sketched as a simple loop over the candidate grids, as shown below.
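In the sketch, `extract` stands for any block-histogram extractor such as the LBP sketch above, and the fixed SVM hyper-parameters are taken from Table 2 for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def sweep_lbp_params(images, labels, extract):
    """Sketch of the parameter sweep for LBP/CLBP-style descriptors.

    `extract(img, ws, n_bins)` is a block-histogram extractor; the
    candidate grids mirror the ranges described in the text.
    """
    best = (0.0, None)
    for n_bins in (8, 16, 32):
        for ws in range(2, 25):
            X = np.array([extract(img, ws, n_bins) for img in images])
            # Average recall over 10-fold cross-validation.
            score = cross_val_score(SVC(kernel='rbf', C=1000, gamma=0.05),
                                    X, labels, cv=10,
                                    scoring='recall_macro').mean()
            if score > best[0]:
                best = (score, {'bins': n_bins, 'block': ws})
    return best
```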

Figures 10A–10D give the FER results (expressed as average recall) when each descriptor is optimized on the CK+ dataset of six expressions. The best parameters for extracting the HOG descriptor are nine orientations and a cell size of 8, giving a recognition rate of 97.4%. For the LBP features, the recognition rate reaches a maximum of 96.1% with 16 bins and a block size of 7 pixels. The best recognition rate for the CLBP descriptor is 94.1% with eight bins and a block size of 11. Finally, for the WLD features, the best performance (96.9%) is obtained with a block size of 8, T=8, M=4, and S=4. The same parameter optimization procedure is repeated with features extracted from the RFD dataset of six expressions. Figure 11A shows that the best parameters for HOG are 7 orientations and a cell size of 8 (RR = 97.2%). From Fig. 11B, the best LBP parameters are 16 bins and a block size of 8 (RR = 94.6%). The best parameters for CLBP are 8 bins and 6-pixel blocks, as shown in Fig. 11C (RR = 95.6%). Finally, Fig. 11D shows the best parameters for WLD as a block size of 7, T=6, M=4, and S=4 (RR = 96.5%). The results show that the recognition rate rises and then falls as the block size varies. This suggests that smaller blocks capture overly fine-grained information, which is less useful for representing Action Units; larger blocks, on the other hand, may merge multiple Action Units, mixing discriminatory and non-discriminatory information. A similar finding was presented in Moore & Bowden (2011).


Figure 10: LBP results.


Figure 11: HOG results.

These results show that an appropriate choice of parameter values improves the local descriptors' performance. Tables 4 and 5 give the best parameter selection and feature vector length for each descriptor on each dataset. The most frequent value for the block size is 8, and the other values are close (e.g., 6 and 7). This suggests that a block size of 8 × 8 pixels could be the best window for extracting features at the current resolution. The number of histogram bins, on the other hand, differs across descriptors.

Table 4:
Best parameters for the extraction on the CK+ dataset.
Descriptor   No. of bins   Block size   T   M   S   Feature vector length
HOG          9             8            −   −   −   2,430
LBP          16            7            −   −   −   1,440
CLBP         8             11           −   −   −   576
WLD          −             8            8   4   4   9,216
DOI: 10.7717/peerj-cs.1388/table-4
Table 5:
Best parameters for the extraction on the RFD dataset.
Descriptor   No. of bins   Block size   T   M   S   Feature vector length
HOG          7             8            −   −   −   1,890
LBP          16            8            −   −   −   1,152
CLBP         8             6            −   −   −   1,760
WLD          −             7            6   4   4   8,640
DOI: 10.7717/peerj-cs.1388/table-5

Comparison with the state-of-the-art

Comparing different FER systems in the literature is challenging because many studies omit or poorly document their evaluation strategy, and the training-testing data split often varies from one study to another. For this reason, only approaches evaluated using 10-fold cross-validation and reporting the recognition rate (or average recall) are compared with this study. Table 6 lists various recent state-of-the-art approaches evaluated on the extended Cohn-Kanade (CK+) dataset (when a value is not available, it is replaced by '−'). All the approaches use SVM for classification, reflecting its strong performance, especially on small datasets. The proposed approach using the HOG descriptor achieved the best recognition rate, improving on the results in Carcagnì et al. (2015) by 1.6 percentage points. The proposed approach also gives better recognition rates for the LBP, CLBP, and WLD descriptors than previously reported. This improvement is probably due to this study's optimal selection of extraction parameters.

Table 6:
Performance comparison of our approach against various state-of-the-art approaches (CK+ 6 expressions).
Approach                     Year   Feature   Feature vector length   Classifier   Face registration   No. images   Recognition rate (%)
Kabir, Jabid & Chae (2012)   2012   LDPv      2,352                   SVM          No                  1,224        96.7
Carcagnì et al. (2015)       2015   HOG       −                       SVM          Yes                 347          95.8
Carcagnì et al. (2015)       2015   LBP       −                       SVM          Yes                 347          91.7
Carcagnì et al. (2015)       2015   CLBP      −                       SVM          Yes                 347          92.3
Carcagnì et al. (2015)       2015   WLD       −                       SVM          Yes                 347          86.5
Zhang et al. (2018)          2018   CLGDNP    −                       SVM          No                  1,482        95.3
Mandal et al. (2019)         2019   DRADAP    −                       SVM          No                  1,043        90.6
Hassan & Suandi (2019)       2019   LBP       2,478                   SVM          Yes                 150          96.2
Proposed                     2022   HOG       2,430                   SVM          Yes                 347          97.4
Proposed                     2022   LBP       1,440                   SVM          Yes                 347          96.1
Proposed                     2022   CLBP      576                     SVM          Yes                 347          94.1
Proposed                     2022   WLD       9,216                   SVM          Yes                 347          96.9
DOI: 10.7717/peerj-cs.1388/table-6

As observed in the experimental results, the recognition rates first rise and then fall as the block sizes are varied; hence, the block size must be selected carefully when extracting each descriptor. Unlike previous studies, which fixed the extraction parameters for LBP, CLBP, and WLD by trial and error, the current approach determines them empirically. Another factor behind the superiority of the proposed method is histogram normalization. The HOG descriptor provides different histogram normalization methods that enhance its performance (Dalal & Triggs, 2005). Like previous methods, the current approach uses L2-Hys normalization, which has proven superior to the alternatives. However, the better performance shown by this approach could also be attributed to the choice of the number of cells per block (n_c) used during normalization: different n_c values produce a measurable impact on the performance of HOG, and in this study n_c = 3 gave the best results. The other local descriptors were previously not normalized; hence, we introduced a simple normalization method² which also enhanced their performance. This is evident because the results obtained with normalization were better than those obtained without it.

Factors affecting facial expression recognition performance

The performance of the system is mainly affected by three factors: the selection of datasets, face registration, and the feature extraction parameters. Incorrect labeling also tends to introduce ambiguity between expressions and can therefore reduce prediction accuracy. The registration step is a crucial component for increasing classification performance: a correctly registered frontal face image must contain all the facial components to guarantee the extraction of the most important features and high prediction accuracy. This finding was pointed out in Carcagnì et al. (2015), and our results confirm that recognition performance is higher with registered images than with non-registered ones.

Conclusion

This article presented a novel method for optimized facial expression recognition. The proposed approach recognizes facial expressions in three stages: (1) face detection and registration, (2) feature extraction, and (3) classification. The experiments show that face registration is necessary to improve the performance of a FER system. The study was validated on two popular facial expression datasets, and the performance obtained is comparable to the state-of-the-art. Four local descriptors were optimized, resulting in improved recognition rates. The actual values of the best parameters differed depending on the dataset used; however, it is worth mentioning that the best block sizes were very similar, taking values close to 8 × 8 pixels. Though the objectives of this study were achieved, there is still room for improvement. The proposed FER system is limited to processing frontal-view faces only; future work should study the recognition of expressions on multi-view face datasets such as the BU-3DFE dataset (Eroglu Erdem, Turan & Aydin, 2015). Moreover, the high recognition rates came at the expense of the feature vector length. Hence, future studies should explore feature space reduction techniques, such as AdaBoost, principal component analysis, and linear discriminant analysis, that can reduce the size of the feature vector while achieving similar or even better recognition rates.

Supplemental Information

Code that implements the proposed approach for optimized facial expression recognition.

The impact of face registration and of the extraction parameters is also investigated.

DOI: 10.7717/peerj-cs.1388/supp-1
¹ Note that the superscripts in H⁰, H¹, … are not exponents but orders. Hence, H⁰ denotes the first histogram, H¹ is the second histogram, and so on.
² More details can be found in the “Feature Extraction” section.