# "An Exploratory Framework for Mapping Facial Preferences: Utilizing Paired Comparison Tasks and Gaussian Process Preference Learning"(仮)
Running head 案
Exploratory Mapping of Facial Preferences
Facial Preference Exploration with GPPL
# Draft of the paper by 城下
# 1. Introduction
Faces are complex visual objects that exhibit numerous morphological and surface properties that vary between individuals. These properties are critical for forming subjective impressions, recognizing individuals, and discerning emotional expressions. The cognitive system responsible for face perception represents individual faces as points in a high-dimensional space, where each dimension corresponds to a different aspect of facial properties (known as the face space model; Valentine, 1991). Social judgments are made by combining multiple cues derived from this face space.
Facial preference is a crucial aspect of facial perception that has a significant impact on real-world social interactions. Subjective preference plays a key role in forming impressions of faces and influences a range of social judgments (Oosterhof & Todorov, 2008; Sutherland et al., 2012). Over the years, various morphometric and skin properties have been identified as critical cues for facial preference, including the size and shape of facial features, the overall configuration of the face, and skin tone (Little, 2014; Thornhill & Gangestad, 1999). Consequently, exploring the quantitative relationship between multidimensional facial cues and facial preference is essential for gaining insight into the mechanisms underlying the perception of facial preference.
The psychological process underlying facial preference can be conceptualized as a utility function denoted by $f(\mathbf{x})$, which maps a multivariate input $\mathbf{x}$ (i.e., a presented face) to a scalar value representing the utility of that face. Each individual possesses a unique utility function for facial preference, which can be inferred using data-driven models. Typically, multiple faces that vary along several facial feature dimensions are presented, and preference ratings are collected using Likert scales. A linear regression model is then constructed to estimate the contribution of each facial feature dimension to facial preference (Oosterhof & Todorov, 2008; Nakamura & Watanabe, 2019). While such models have identified and visualized some critical cues to subjective preference, several issues remain to be considered.
The first issue is that traditional data-driven studies have investigated shared tastes in facial preference at the population level while neglecting individual variation in facial preference judgments. Recent research has highlighted that facial preference judgments vary across individuals and cultures, and that an individual's private taste can substantially shape these judgments. In fact, private tastes are as crucial as shared tastes in predicting facial preference judgments. Given the idiosyncratic nature of facial preference judgments, more sophisticated modeling methods are therefore needed to account for their individual specificity.
The second issue pertains to the non-linear relationship between facial features and subjective preferences. Past research (Komori & Kawamura, 2009) examined the attractiveness of primary facial feature components using non-linear regression models and revealed that the utility function is non-linear and not simply U-shaped. Therefore, the shape of the utility function cannot be assumed a priori. Utility functions may have multiple peaks or complex shapes, making linear regression models inappropriate for analyzing the relationship. To overcome these issues, more sophisticated modeling methods are required that account for the individual specificity of facial preference judgments and capture the non-linear relationship between facial features and subjective preferences.
One potential approach to uncovering the shape of an unknown utility function is to employ a grid search method, which involves collecting evaluations for all possible combinations of facial features. However, this method can be expensive and time-consuming, especially when dealing with many facial feature dimensions. The sheer number of possible stimuli combinations makes it practically challenging to use a grid search approach for revealing the utility function.
Conjoint analysis is a multivariate analysis technique that investigates the influence of each attribute and attribute level on preferences based on the preferences for stimuli with multiple attributes. However, one difficulty with using conjoint analysis for studying facial attractiveness is that it is hard to investigate the response to continuous differences in facial features, as there cannot be too many levels for each attribute. Additionally, if a model that considers high-order interaction terms is adopted to investigate the effects of combinations of facial features, the benefit of using conjoint analysis is lost due to the significant increase in the number of required trials.
Therefore, in this study, we adopted Gaussian process regression (GPR) as an alternative method to estimate the utility function $f$. GPR is a non-parametric Bayesian approach to regression analysis that uses a kernel function to estimate a function from a set of input and output data. Unlike linear regression analysis, GPR can handle nonlinear functions.
However, applying GPR to estimate utility functions in psychological research poses challenges. In a typical GPR approach, the function being estimated is expected to have continuous responses, while psychological research often uses methods that require discrete responses, such as Likert scales, dichotomous choices, and paired comparison methods. Komori et al. (2022) used Gaussian Process Ordinal Regression (GPOR) to investigate the relationship between the perceived cuteness of faces and their morphological features based on responses from Likert scales. However, it has been suggested that paired comparison methods are more accurate and easier to evaluate than Likert scales (Phelps et al., 2015).
In this study, we employed Gaussian Process with Preference Learning (GPPL) to estimate nonlinear functions from sparse pairwise preference datasets. GPPL combines Thurstone's pairwise comparison model with Gaussian process regression, enabling the examination of the relationship between multidimensional facial features and facial preferences. This approach is based on a two-alternative forced-choice task where participants choose their preferred face from a pair of facial images. To our knowledge, GPPL has not been previously used to investigate judgments involving complex facial stimuli.
In addition, this study employed a subspace within the latent space of the deep learning model StyleGAN2 to represent facial features in a face space. The stimuli presented to participants were generated using StyleGAN2, effectively addressing the issue of image unnaturalness often encountered when employing warping techniques in previous psychological research on faces. This approach ensures the generation of non-existent yet highly realistic images.
## 2. Analytical Models
### 2.1 Gaussian Process with Preference Learning
In this study, we assume that each participant has a utility function $f:\mathcal{X}\rightarrow\mathbb{R}$ that maps a multivariate input $\mathbf{x}_i\in \mathbb{R}^d$ in the set $\mathcal{X}=\{\mathbf{x}_i \mid i=1,\ldots,n\}$ to a scalar value $f(\mathbf{x}_i)$ representing the latent evaluation of a given stimulus.
We consider a dataset $\mathcal{D}$ consisting of $m$ pairwise preference relations between stimuli, denoted as
\begin{align*}
\mathcal{D}=\{\mathbf{v}_{k}\succ \mathbf{u}_{k};k=1,\ldots,m\}
\end{align*}
where $\mathbf{v}_{k}\in\mathcal{X},\mathbf{u}_{k}\in\mathcal{X}$, and $\mathbf{v}_{k}\succ \mathbf{u}_{k}$ indicates the stimulus $\mathbf{v}_{k}$ is preferred to $\mathbf{u}_{k}$.
To model the noise in participants' responses, we assume additive Gaussian noise $\mathcal{N}(\delta;0,\sigma^2)$ with zero mean and variance $\sigma^2$. Under this assumption, the likelihood of a pairwise comparison $\mathbf{v}_{k}\succ \mathbf{u}_{k}$ becomes
\begin{align*}
&\mathcal{P}\left(\mathbf{v}_{k}\succ \mathbf{u}_{k} \mid f(\mathbf{v}_{k}),f(\mathbf{u}_{k})\right)\\
&=\iint\mathcal{P}_{\mathrm{ideal}}\left(\mathbf{v}_{k}\succ \mathbf{u}_{k} \mid f(\mathbf{v}_{k})+\delta_v,f(\mathbf{u}_{k})+\delta_u\right) \\
&\qquad\qquad\mathcal{N}(\delta_v;0,\sigma^2)\mathcal{N}(\delta_u;0,\sigma^2)d\delta_vd\delta_u\\
&=\mathrm{\Phi}(z_k)
\end{align*}
\begin{align*}
z_k=\frac{f(\mathbf{v}_{k})-f(\mathbf{u}_{k})}{\sqrt{2}\sigma}
\end{align*}
\begin{align*}
\mathrm{\Phi}(z)=\int^{z}_{-\infty}\mathcal{N}(\gamma;0,1)\,d\gamma
\end{align*}
where $\mathrm{\Phi}(z)$ is the cumulative Gaussian with zero mean and unit variance.
The likelihood, i.e., the joint probability of the observed preference relations given the latent function values, can be written as
\begin{align*}
\mathcal{P}(\mathcal{D} \mid f) =\prod_{k=1}^{m} \mathcal{P}\left(\mathbf{v}_{k}\succ \mathbf{u}_{k} \mid f(\mathbf{v}_{k}),f(\mathbf{u}_{k})\right)
\end{align*}
We assume that the utility function $f$ is modeled by a zero-mean Gaussian process (GP) prior with kernel function $\mathcal{K}$, i.e., $f \sim \mathcal{GP}(\mathbf{0},\mathcal{K})$. Based on Bayes' theorem, the posterior distribution $\mathcal{P}(f \mid \mathcal{D})$ can be defined; however, it is not analytically tractable. Following previous studies \cite{chu2005preference,brochu2010tutorial}, we approximated the posterior distribution $\mathcal{P}(f\mid\mathcal{D})$ as Gaussian using a Laplace approximation.
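For concreteness, the mode-finding step of the Laplace approximation can be sketched in Python as follows. This is a minimal illustration, not our exact implementation: the function and variable names are ours, and hyperparameter optimization and the Hessian-based posterior covariance are omitted.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def laplace_mode(K, v_idx, u_idx, sigma=1.0, jitter=1e-6):
    """MAP estimate of the latent utilities f under a GP prior with Gram
    matrix K and the probit pairwise-comparison likelihood above.
    v_idx[k] / u_idx[k] index the preferred / non-preferred stimulus of pair k."""
    n = K.shape[0]
    K_inv = np.linalg.inv(K + jitter * np.eye(n))
    c = np.sqrt(2.0) * sigma  # denominator of z_k

    def neg_log_posterior(f):
        z = (f[v_idx] - f[u_idx]) / c
        return 0.5 * f @ K_inv @ f - norm.logcdf(z).sum()

    def grad(f):
        z = (f[v_idx] - f[u_idx]) / c
        # d log Phi(z_k) / d f, computed in log space for numerical stability
        r = np.exp(norm.logpdf(z) - norm.logcdf(z)) / c
        g = K_inv @ f
        np.add.at(g, v_idx, -r)
        np.add.at(g, u_idx, r)
        return g

    return minimize(neg_log_posterior, np.zeros(n), jac=grad, method="L-BFGS-B").x
```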
### 2.2 Linear Regression Model
We employed a probit linear regression model as a comparative benchmark for the GPPL. In this model, the utility $u_i$ generated by the $d$-dimensional feature vector $\mathbf{x}_i$ describing stimulus $i$ is a linear function of the features. The probability $P_{ij}$ of preferring stimulus $i$ over stimulus $j$ is determined by a function $\Phi$ of the difference between the utilities of $i$ and $j$, where $\Phi$ denotes the cumulative distribution function (CDF) of the standard normal distribution; when the utilities of $i$ and $j$ are equal, either is chosen with equal probability. This formulation is grounded in Thurstone's law of comparative judgment.
We model the decision $Y_{ij}\in \{0,1\}$ when comparing stimulus $i$ to stimulus $j$ as a Bernoulli random variable with probability $P_{ij}$, where $Y_{ij}=1$ indicates that stimulus $i$ is preferred to stimulus $j$ and $Y_{ij}=0$ indicates the reverse. Let $\boldsymbol{\beta}\in \mathbb{R}^d$ be the vector of regression coefficients and $\varepsilon_i$ be Gaussian noise. The probability $P_{ij}$ is obtained by applying $\Phi$ to the difference between $u_i$ and $u_j$:
$u_i=\mathbf{x}_i^{\mathrm{T}}\boldsymbol{\beta}+\varepsilon_i,\quad \varepsilon_i\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,\sigma^2)$
$P_{ij}=\Phi(u_i-u_j)$
$Y_{ij}\sim \rm{Bernoulli}(P_{ij})$
We placed a half-Cauchy prior on $\sigma$ and a non-informative normal prior on $\boldsymbol{\beta}$.
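For illustration, the model and priors can be expressed in a probabilistic programming language such as PyMC. The following is a sketch under our own conventions; the weakly informative scale of the normal prior, the response coding, and the $\sqrt{2}\sigma$ scaling (marginalizing the Gaussian utility noise) are assumptions rather than details fixed by the text.

```python
import numpy as np
import pymc as pm

def fit_probit_linear(X, i_idx, j_idx, y):
    """X: (n_stimuli, d) feature matrix; i_idx, j_idx: (m,) stimulus indices
    per pair; y[k] = 1 if stimulus i was preferred in pair k (assumed coding)."""
    with pm.Model():
        sigma = pm.HalfCauchy("sigma", beta=1.0)                        # half-Cauchy prior on sigma
        beta = pm.Normal("beta", mu=0.0, sigma=10.0, shape=X.shape[1])  # vague normal prior on beta
        u = pm.math.dot(X, beta)                                        # u_i = x_i^T beta
        # noise on both utilities marginalized into the sqrt(2)*sigma scaling
        p = pm.math.invprobit((u[i_idx] - u[j_idx]) / (np.sqrt(2.0) * sigma))
        pm.Bernoulli("y", p=p, observed=y)
        # four chains of 4,000 iterations, first 2,000 discarded (cf. the Results section)
        return pm.sample(draws=2000, tune=2000, chains=4)
```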
## 3. Method
### 3.1 Face Space Construction
Karras et al.\cite{karras2020analyzing}, the authors of StyleGAN2, introduced Flickr-Faces-HQ (FFHQ), a dataset of 70,000 high-quality human-face images of various races at a resolution of $1024\times1024$. We used the StyleGAN2 network trained on FFHQ for our study. In the StyleGAN2 latent face space, each vector has $18\times512$ dimensions, far exceeding the number of dimensions that can be searched with GPPL. Additionally, the latent space is not well suited to experiments in East Asia because much of it is composed of male and female faces of other races. Therefore, we created a low-dimensional subspace of the latent space consisting only of Asian female faces using reference facial images.
Fifty Japanese female university students provided facial photographs at a resolution of $1024\times1024$. The neutral expression of each face was captured using a digital camera. A gray screen was set up in the background. The foreheads of the models were exposed using a headband after removing accessories such as eyeglasses. In order to examine features related to facial sexual dimorphism in the analysis, facial photographs of 50 Japanese male students were also taken in the same manner.
First, the facial images of the 50 females were embedded into the latent space of StyleGAN2 using the method of Karras et al.\cite{karras2020analyzing}. Then, principal component analysis (PCA) was performed on the resulting $18\times512$-dimensional latent representations. Since the contribution ratios of the 9th and higher PCs were very small, we retained the first eight PCs; the cumulative contribution ratio up to the 8th PC was 29.5\%. Changes in the facial images along each principal component are illustrated in Figure~\ref{face} to identify the facial features linked to each component. The 1st PC was related to the height of the forehead and the shape of the eyes and eyebrows. The 2nd PC was linked to the thickness of the eyebrows and the length of the face. The 3rd through 6th PCs were related to the shape of the cheekbones, eyes, eyelids, and jaw. The 7th PC was linked to the width of the face. The 8th PC was related to the shape of the face and the color and texture of the face. In this study, the 8-dimensional space obtained by the above procedure is referred to as the female face subspace. The facial images of 50 males were embedded in the latent space in the same manner, and the position of the male average face in the female face subspace was also calculated.
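Concretely, constructing the subspace amounts to a PCA over the flattened latent codes. The following is a minimal sketch; the file name and array shapes are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# latents: (50, 18, 512) StyleGAN2 W+ codes of the embedded female faces (assumed shape)
latents = np.load("female_latents.npy")            # hypothetical file name
flat = latents.reshape(len(latents), -1)           # flatten to (50, 9216)

pca = PCA(n_components=8)
scores = pca.fit_transform(flat)                   # 8-dimensional face-subspace coordinates
print(pca.explained_variance_ratio_.cumsum()[-1])  # cumulative contribution ratio (0.295 in the text)

def subspace_to_latent(z):
    """Map an 8-dimensional subspace coordinate z back to a W+ latent
    that can be rendered with the StyleGAN2 generator."""
    return (pca.mean_ + z @ pca.components_).reshape(18, 512)
```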
\begin{figure}[p!]
\centering
\includegraphics[width=\linewidth]{image/facial_changs_along_PCs.pdf}
\caption{Facial image variation along each dimension of the face subspace (-2SD/Mean/+2SD).}
\label{face}
\end{figure}

### 3.2 Assessment of Perceived Attractiveness
#### 3.2.1 Participants
Five participants (mean age $= 22.4$ years, $SD = 1.0$) took part in the experiment. All procedures employed in this study were approved by the Ethics Committee of Osaka Electro-Communication University (18-004) and the Institutional Review Board of Waseda University (2015-033).
#### 3.2.2 Procedure
A two-alternative forced-choice (2AFC) task was used for the assessment. Participants were instructed to select their preferred face from two images displayed side by side in 1,000 trials. They were asked to respond in terms of their preference for the face's owner as a sexual partner or a companion, based on their first impressions of the images, i.e., without deliberating over their responses.
We prepared 1,000 pairs of face parameters in the 8-dimensional face subspace, sampled randomly within $\pm2SD$ on each dimension. The order of the pairs and the display position (right or left) were determined randomly. The session consisted of 20 practice trials and 1,000 test trials. Stimulus images were generated for each trial using the StyleGAN2 network and displayed on an LCD monitor. Participants responded using a keyboard, and the experimental application was implemented in a PsychoPy environment \cite{peirce2007psychopy}.
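The trial loop can be sketched in PsychoPy roughly as follows. This is a simplified illustration, not the actual experimental script: the file names, key bindings ('f' for left, 'j' for right), window size, and the use of pre-rendered images are our assumptions (in the experiment itself, stimuli were generated with StyleGAN2 on each trial).

```python
from psychopy import core, event, visual

# trials: list of dicts with paths to the two pre-rendered images per pair (hypothetical)
trials = [{"pair_id": 0, "left_img": "pair000_L.png", "right_img": "pair000_R.png"}]

win = visual.Window(size=(1920, 1080), color="gray", units="pix")
left = visual.ImageStim(win, pos=(-300, 0), size=(512, 512))
right = visual.ImageStim(win, pos=(300, 0), size=(512, 512))

responses = []
for trial in trials:
    left.image, right.image = trial["left_img"], trial["right_img"]
    left.draw()
    right.draw()
    win.flip()
    key = event.waitKeys(keyList=["f", "j"])[0]  # 'f' = left face, 'j' = right face
    responses.append((trial["pair_id"], key))
    core.wait(0.5)                               # brief inter-trial interval
win.close()
```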
## 4. Results
### 4.1 Individual utility functions
Based on all pairwise preferences, each participant's psychological utility function $f$ was estimated using GPPL. We assumed a zero-mean Gaussian process prior and set the noise parameter $\sigma$ to 1.0. We used a radial basis function (RBF) kernel with automatic relevance determination (ARD) \cite{mackay1996bayesian}, which allows a different length scale $\sigma_d$ for each dimension:
\begin{align}
\mathcal{K}_{\text{RBF}}(\mathbf{x}_i, \mathbf{x}_j) = \exp \left( -\frac{1}{2} \sum_{d=1}^D \frac{(x_{id}-x_{jd})^2}{\sigma_d^2} \right)
\end{align}
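As a concrete illustration, the ARD RBF kernel can be computed as follows (a minimal NumPy sketch; function and variable names are ours). The resulting Gram matrix can serve as the $\mathcal{K}$ of the GP prior, e.g., as the `K` argument of the mode-finding sketch in Section 2.1.

```python
import numpy as np

def rbf_ard(X1, X2, lengthscales):
    """ARD RBF kernel matrix; `lengthscales` holds one sigma_d per input dimension."""
    D = (X1[:, None, :] - X2[None, :, :]) / lengthscales  # pairwise scaled differences
    return np.exp(-0.5 * np.sum(D ** 2, axis=-1))
```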
The predicted means $\mu_{s\mathbf{x}}$ and variances $\sigma_{s\mathbf{x}}^{2}$ of the utility function $f_s(\mathbf{x})$ for each participant were obtained within $\pm 2SD$ of each PC dimension at an interval of 0.25. The estimated utility functions (predicted means) for all participants are shown in Figure XX; within the search range of this study, the utility functions were unimodal or monotonic for all participants. Judgment tendencies were relatively similar across participants for the 1st and 7th PCs, whereas they varied more across participants for the other PCs.

**Figure XX Predicted mean function of attractiveness on each PC.**
For each participant, the predicted mean maximum $f_s(\mathbf{x}^*_s)$ was obtained, where $\mathbf{x}^*_s = \arg \max_{\mathbf{x} \in A} f_s(\mathbf{x})$ in face space $A$. For each participant, the maximum $f_s(\mathbf{x}^*_s)$ was significantly greater than zero ($p(f_s(\mathbf{x}^*_s) \leq 0) < .01$) based on the posterior tail-area probability, indicating that differences in face shape had a consistent influence on individual judgments. The images corresponding to $\mathbf{x}^*_s$ were generated for all participants (Figure XX).
Furthermore, we estimated the parameters of the probit linear regression model using MCMC simulations with four independent chains of 4,000 iterations each, discarding the first 2,000 iterations of each chain as burn-in. Convergence was checked using the Gelman–Rubin statistic (Gelman & Rubin, 1992), which was less than 1.10 for all parameters, indicating that the simulations converged. The posterior estimates of $\boldsymbol{\beta}$ were used for the predictions of the probit linear model.
### 4.2 k-fold cross validation
We computed learning curves using $k$-fold cross-validation to examine how the accuracy of the GPPL model's predictions of individual preferences depends on training sample size, and to compare the prediction accuracy of the GPPL and probit linear regression models. Because this study used a 2AFC task, accuracy was defined as the percentage of correctly predicted responses $\mathbf{x}_i\succ \mathbf{x}_j$ or $\mathbf{x}_i\prec \mathbf{x}_j$ for a given stimulus pair $\mathbf{x}_i$ and $\mathbf{x}_j$ (i.e., $\mathrm{accuracy} = \mathrm{hits}/(\mathrm{hits}+\mathrm{misses})$). Models were fitted for each participant. In $k$-fold cross-validation, the dataset is partitioned randomly into $k$ folds; $k-1$ folds are used to train the statistical model, and the remaining fold is used to evaluate prediction accuracy. We set $k = 5$ (i.e., 5-fold cross-validation).
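The evaluation loop can be sketched as follows, assuming hypothetical `fit` and `predict` wrappers around the GPPL and probit linear models.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_accuracy(pairs, choices, fit, predict, k=5, seed=0):
    """Mean 2AFC prediction accuracy over k folds.
    pairs: (m, 2, d) array of feature pairs; choices: (m,) observed responses;
    `fit` and `predict` wrap either model (hypothetical interfaces)."""
    accs = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(pairs):
        model = fit(pairs[train_idx], choices[train_idx])
        pred = predict(model, pairs[test_idx])            # predicted preferred side per pair
        accs.append(np.mean(pred == choices[test_idx]))   # hits / (hits + misses)
    return np.mean(accs)
```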
The learning curves for the GPPL and the comparative probit linear model are shown in Figure XX. The learning curves plot the average accuracy for each training-data size. To avoid any possible temporal-order effect, the training and test sets each contained trials drawn from throughout the experiment.

The prediction accuracy of GPPL for the test data reached a near plateau at a training sample size of 200, with a final accuracy of XX%. The GPPL showed higher prediction accuracy than the probit linear model when the sample size was 100 or larger. The final accuracy of the probit linear model was XX.
### 4.3 Averaged Utility Function
Next, for each coordinate $\mathbf{x}$ of the face space, we calculated the across-participant mean $\mu_{S\mathbf{x}}$ of the predicted means and the corresponding variance $\sigma^2_{S\mathbf{x}}$, where $M$ denotes the number of participants in this study.
\begin{align}
\mu_{S \mathbf{x}}=\frac{1}{M} \sum_{s=1}^{M} \mu_{s \mathbf{x}}
\end{align}
\begin{align}
\sigma_{S \mathbf{x}}^{2}=\frac{1}{M} \sum_{s=1}^{M}\left(\mu_{s \mathbf{x}}^{2}+\sigma_{s \mathbf{x}}^{2}\right)-\mu_{S \mathbf{x}}^{2}
\end{align}
Here, $\mu_{S\mathbf x}$ is an estimate of the average perceived attractiveness for a given facial feature $\mathbf x$, and we refer to the set of $\mu_{S\mathbf x}$ as the average utility function $f_S$. Figure \ref{fig:PCAmu_plot} shows the average utility function $f_S$ and its variance, averaged over each pair of face-subspace dimensions up to the 8th PC. As the figure shows, the mean utility function is also unimodal.
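In code, the two equations above reduce to two lines of NumPy (array names are ours; one row per participant, one column per grid coordinate).

```python
import numpy as np

# mu: (M, G) predicted means; var: (M, G) predicted variances
mu_S = mu.mean(axis=0)                            # across-participant mean utility
var_S = (mu ** 2 + var).mean(axis=0) - mu_S ** 2  # mixture variance across participants
```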


The maximum of the average utility function, $f_S(\mathbf{x}^*_S)$, was obtained, where $\mathbf{x}^*_S = \arg \max_{\mathbf{x} \in A} f_S(\mathbf{x})$ in face space $A$. The maximum $f_S(\mathbf{x}^*_S)$ was significantly greater than zero ($p(f_S(\mathbf{x}^*_S)<0)<.01$) according to the posterior tail-area probability. Moreover, based on the predicted covariances and the differences in the predicted means between the maximum coordinate $\mathbf{x}^*_S$ and all other coordinates $\mathbf{x}$, the coordinates for which $p(f_S(\mathbf{x}^*_S)<f_S(\mathbf{x}))<.05$ constituted $1.30\%$ ($559{,}159$ of $43{,}046{,}721$ points) of the total coordinates. These results suggest that averaging the participants' utility functions is valid for examining average tendencies in utility and discovering the features that maximize it. The images corresponding to $\mathbf{x}^*_S$ and $\mathbf{x}^-_S = \arg \min_{\mathbf{x} \in A} f_S(\mathbf{x})$ were generated (Figure XX).
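The tail-area comparison against $\mathbf{x}^*_S$ follows directly from the joint Gaussian posterior; a sketch is given below (in practice the grid would be processed in blocks, since the full covariance over all coordinates is too large to hold in memory).

```python
import numpy as np
from scipy.stats import norm

def prob_better(mu, cov, star):
    """p(f(x*) < f(x)) for every grid point x, given the posterior mean `mu`
    and covariance `cov` over the grid, and the index `star` of x*."""
    diff_mean = mu - mu[star]                                  # mean of f(x) - f(x*)
    diff_var = np.diag(cov) + cov[star, star] - 2 * cov[:, star]
    return norm.cdf(diff_mean / np.sqrt(np.maximum(diff_var, 1e-12)))
```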

The shape of the utility function offers various insights for comprehending the determinants of facial attractiveness. To facilitate the interpretation of the utility, principal component scores were acquired by embedding the average male face into the female face subspace (Figure XX). PC1, which accounts for the most significant facial variation in face space, is the component that represents differences in relative forehead size and eye shape. The considerable distance between the female and male mean faces on the PC1 axis suggests that this axis represents facial sexual dimorphism. Given that the 1st PC scores for the male average face were positive and the peak of attractiveness was in the negative zone, this result indicates a preference for female faces with low masculinity in terms of PC1. Likewise, the difference in the PC scores for the female and male average faces suggests that PC4 also strongly reflects sexual dimorphism, with PC4 representing differences in the shape of cheekbones, eyes, eyelids, and jaw. The sign of peak attractiveness is opposite to that of the average face, implying a preference for more feminine (less masculine) faces over the average female face. PC5 is another principal component reflecting sexual dimorphism; however, its maximum attractiveness is on the same side as that of the male average face, which is inconsistent with the results of PC1 and PC4.

## 5. Discussion
In this study, we explored the applicability of Gaussian Process with Preference Learning (GPPL), an extension of Gaussian process regression, to psychological tasks. By investigating the relationship between multidimensional facial features and preferences using GPPL, we demonstrated the numerous advantages of this approach in the context of psychological research.
This methodology capitalizes on the strengths of Gaussian process regression. Unlike analysis of variance approaches such as conjoint analysis, Gaussian process regression is not confined to predetermined levels for exploration points, allowing for a more flexible analysis. Additionally, it can comprehensively handle the nonlinear relationships between multidimensional physical features and preferences. These advantages led to the superior prediction accuracy of GPPL over the linear models compared in this study, as corroborated by cross-validation results. The ability to assess the certainty of predictions within a Bayesian framework is also a notable benefit for psychological research. As a result, this method facilitated the identification of both individual and group-level characteristics in preference judgments. The advantages of Gaussian process regression are particularly valuable when examining human intuitive judgments for complex stimuli, such as facial evaluations.
Furthermore, GPPL's ability to perform estimations based on the results of pairwise comparisons, which are more straightforward for experimental participants to judge, distinguishes it from conventional Gaussian process regression. Although Gaussian process regression has been underutilized in psychological research, GPPL, like Gaussian Process Ordinal Regression, is expected to gain traction in the examination of preference judgments across various modalities (e.g., taste, timbre) in the future.
However, this approach is not without limitations. Firstly, employing the Thurstonian pairwise comparison method may necessitate a greater number of trials than conventional Gaussian process regression to achieve the desired prediction accuracy. This issue could become more pronounced as the number of stimulus dimensions increases. One potential solution is the utilization of acquisition functions. In Bayesian optimization, an application of Gaussian process regression, the number of trials required for optimization is reduced by applying acquisition functions, such as Expected Improvement (EI) and Upper Confidence Bound (UCB), to response histories. The use of acquisition functions is considered an effective method for estimating the maximum/minimum of individual utility functions. Another approach to decrease the number of trials is the adoption of multiple-choice tasks, in which participants select the most preferable stimulus from a set of multiple stimuli. The extension to multiple-choice tasks could prove to be a valuable tool in psychological research.
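For example, the EI acquisition function mentioned above requires only the posterior mean and standard deviation of the candidate stimuli. The following is a minimal sketch (variable names are ours), intended to illustrate how trial selection could be driven by the current GPPL posterior.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best):
    """EI of each candidate: the expected gain of its utility over the
    current best estimate `best`, under the Gaussian posterior (mu, sigma)."""
    z = (mu - best) / np.maximum(sigma, 1e-12)
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

# choose the next stimulus to present, e.g.:
# next_idx = np.argmax(expected_improvement(mu_grid, np.sqrt(var_grid), mu_grid.max()))
```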
Another challenge is the difficulty of interpreting the results. The obtained results merely describe the shape of the utility function and its confidence interval, necessitating careful examination from various perspectives for interpretation. The relationship between the point of maximum predicted preference and sexual dimorphism, shown in the results, is one such case. Consequently, this method should be regarded as an exploratory approach for generating hypotheses rather than a means of confirming existing hypotheses.