Personality Trait Prediction Based on Smartphone Usage in the Wild

Research Collection

Master Thesis

Author(s):

Kovacevic, Nikola

Publication Date:

2021

Permanent Link:

https://doi.org/10.3929/ethz-b-000477330

Rights / License:

In Copyright - Non-Commercial Use Permitted

Nikola Kovačević

Master Thesis March 2021

Prof. Dr. Markus Gross

Supervisors:

Rafael Wampfler

Prof. Dr. Christian Holz

Prof. Dr. Tobias Günther

Abstract

Imbuing computers with the ability to perceive and interpret a user’s affective traits and characteristics has been a key objective in interactive technologies, as it allows devices to adapt their behaviour to the user’s particular needs. For example, personality-aware recommender systems could provide better-suited product offerings, and employers’ interview systems could find tailored roles for applicants based on their traits. To become personality-aware, devices need to infer the user’s traits through the available modalities for sensing input from the user, ideally unobtrusively and integrated with the user’s regular interaction. They also need to produce such predictions in real time. In this work, we present a learning-based method for predicting the Big Five personality traits and the depression level of a person from the time-stamped touch coordinates captured while using a mobile phone. Our implementation is built around a support vector machine and achieves a classification accuracy of up to 84% when predicting whether each trait and the depression level are expressed at a low or high level. Our findings show that the most predictive features vary greatly across personality traits, indicating that each trait should be considered separately. While typing speed alone is most relevant for predicting extraversion, we found that spatial properties of typing behavior are more significant for the depression level. This suggests that, to predict depression, the typed content matters more than higher-level features such as typing speed or touch frequency. We compare our model to previous approaches and show that it outperforms them in both accuracy and prediction speed. We believe that our unobtrusive approach to estimating affective traits is an important step towards personality-aware devices and user-centered computing.

Summary

Equipping computers with the ability to perceive and interpret a user’s affective traits and characteristics is a central goal of interactive technologies because it enables devices to adapt their behaviour to the user’s needs. For example, recommender systems can display better-suited product offerings, and employers’ interview systems can find tailored roles for their applicants. To become aware of a user’s personality traits, devices must be able to infer the traits through the available modalities. Ideally, this would happen unobtrusively and integrated into the normal user interaction. In addition, the predictions should be made in real time.

In this work, we present a learning-based method for predicting the Big Five personality traits and the depression level of a person from time-stamped touch coordinates that are collected while a mobile phone is being used. Our implementation is built on a support vector machine and achieves a classification accuracy of up to 84%, predicting whether a trait is expressed at a high or low level. Our results show that the most predictive features differ strongly from trait to trait, which indicates that each trait should be considered separately. While typing speed was most relevant for predicting extraversion, spatial typing characteristics turned out to be more relevant for the depression level. This suggests that, for predicting the depression level, the typed content is of greater importance than higher-order features such as typing speed or typing frequency. We compare our model with previous approaches and show that it not only achieves a higher prediction accuracy but also requires less prediction time. We are convinced that our approach to determining affective traits is an important step towards context-aware and user-centered devices.

Project Description

Personality traits and mental health (i.e., depression level) influence our thinking, mood, and behavior, and they have an impact on everyday life. We have conducted an extensive user study in which we collected smartphone data (i.e., sensor, touch and context data) in the wild. In addition, each user filled in a survey capturing personality traits and mental health. In this thesis, we want to build a predictive model of the personality and mental health of the users based on the smartphone data and provide unobtrusive visual feedback to the users on Android devices (e.g., as part of a calendar). Such a model, together with an appropriate visualization, can help to detect and treat mental health problems early and to adapt applications to the personality traits of a user, and it can be used as a supporting tool in a therapeutic setting.

Tasks

The main task of the thesis is the development of a data-driven model for predicting the personality traits of a user based on smartphone data gathered in the wild. The model should be able to predict the Big Five personality traits (i.e., openness, conscientiousness, extraversion, agreeableness, and neuroticism), including the different facets, and mental health (i.e., depression level). Furthermore, a visualization of the personality traits and mental health should provide meaningful feedback to the user. The four main tasks of this thesis are:

• Validation and statistical analysis of the smartphone and survey data.

• Creating a predictive model for the personality traits based on features extracted from the recorded smartphone data (i.e., sensor data, touch data, and context data).

• Analyzing the best performing model and adapting the model to predict mental health based on the smartphone data.

• Designing a visualization for the different personality traits and mental health and implementing the visualization in an Android application and connecting the visualization with the prediction from the best performing model (e.g., by incorporating the model into the Android application).

Remarks

The thesis is conducted by Nikola Kovacevic. A written report and an oral presentation conclude the

Acknowledgment

I thank my supervisor Rafael Wampfler for his enormous support and guidance throughout the different stages of this thesis. Furthermore, I thank the co-supervisors Tobias Günther and Christian Holz for their valuable input and very helpful advice during this project.

List of Figures

List of Tables

1. Introduction

1.1. Focus of This Work

1.2. Thesis Organization

2. Related Work

2.1. Personality Traits

2.2. Personality Tests

2.2.1. Big Five Inventory (BFI)

2.2.2. Big Five Inventory 2 (BFI-2)

2.2.3. Patient Health Questionnaire (PHQ-8)

2.2.4. Limitations of Personality Tests

2.3. Personality Trait Classification

2.3.1. Touch-based Trait Classification

2.3.2. Context-based Trait Classification

2.3.3. Sensing-based Trait Classification

2.3.4. NLP-based Trait Classification

2.3.5. Summary and Limitations

2.4. Depression Level Classification

3. Dataset

3.1. User Study

3.1.1. Data Collection

3.1.2. Data Labeling

3.2. Data Validation

3.2.1. Personality Traits and Depression Level

3.2.2. Smartphone Usage

4. Methods

4.1. Overview

4.2. Data Pre-processing

4.3. Feature Extraction

4.4. Feature Selection

4.5. Model Selection

5. Results

5.1. Model Performance

5.1.1. Feature Selection Pipeline

5.1.2. Window Averaging

5.2. Feature Importance

5.3. Timing Experiments

6. Conclusion

7. Future Work

A. Appendix

A.1. Big-Five Inventory 2

A.2. Patient Health Questionnaire PHQ-8

A.3. List of Collected Data

A.4. List of Best Parameters

Bibliography

List of Figures

1.1. Process Overview

3.1. Custom Keyboard

3.2. Demographics

3.3. Distribution of Personality Characteristics

3.4. Radial Personality Plots

3.5. Absolute Changes of Traits

3.6. Heat Map Example

4.1. Feature Selection Thresholding Example

5.1. Average Typing Duration

5.2. Window Averaging for SVM

5.3. Feature Importance

List of Tables

2.1. The Big Five Traits and some Adjectives

2.2. Traits and Facets

2.3. Mean Personality Traits

2.4. Performance Overview from Related Work

3.1. Overview of Collected Data

3.2. Mean Personality Traits

4.1. Overview of Available Features

4.2. Overview of Time Series

5.1. Influence of Feature Selection

5.2. Performance Overview

A.1. BFI-2 Questionnaire

A.2. PHQ-8 Questionnaire

A.3. Exhaustive List of Collected Data

A.4. Best Parameters

1. Introduction

Personality traits reflect an individual’s tendencies to think, feel and behave in a certain way. They have a direct influence on our behavior and often remain stable over decades [HBJD16]. Knowledge of a person’s personality traits can be useful in various scenarios. In personnel recruitment, for example, an employer could define the personality profile of interest beforehand according to the company’s needs and then screen out candidates based on their personality test scores, or find tailored roles for the applicants based on their traits [Bla95]. Furthermore, personality traits are of interest for context-aware recommender systems because, based on the intensity of the traits, recommendations can be made more accurately for single users or even for groups of users [BER15, RGJDSRDA09]. Knowing the personality traits can also be helpful in a therapeutic setting where the goal is to improve a trait that is considered negative or that hinders a person in everyday life. Using a personality test, progress can then be tracked over time [MOC14]. Apart from personality traits, there is also interest in assessing a person’s depression level. Depression is one of the most common mental disorders worldwide and can have a large impact on a person’s social and physical well-being, especially for adolescents. And yet, depression often remains undetected [YMK10]. Hence, early detection of an elevated depression level would be beneficial.

Personality traits and depression level can be assessed using pen-and-paper personality tests. A broadly used model for assessing the traits is the Five Factor Model [MJ92], which describes personality as a combination of five traits, namely openness to experience, conscientiousness, extraversion, agreeableness and neuroticism, often abbreviated with the acronym OCEAN. The intensity of each trait is assessed by posing a number of questions to the examinee. The test answers are then, depending on the test in use, aggregated to yield a final intensity score per trait. However, the problem with common pen-and-paper personality tests is that people can be dishonest and distort their scores in high-stakes situations like job interviews by figuring out which trait is being assessed by which question and then answering according to the result they want to obtain [Fah18]. Furthermore, people might subconsciously distort their scores by answering according to an ideal they have and not according to their actual personality due to self-deception phenomena [MCD⁺07, ZMR11]. This calls for an approach that makes the purpose of the questions less obvious and the testing procedure more reliable.

One way to achieve this is to use data from an input modality that is influenced by a person’s personality but where the causal relationship is non-trivial, and then infer the traits based on information extracted from this data. Smartphones and other smart devices have become an integral part of our everyday life and produce a vast amount of data on a daily basis. Using such data to infer a person’s personality has been the goal of several studies [KTVA18, BTK⁺19, CBGP11, KBFH08, MWKG20]. It has been shown that, based on smartphone usage patterns, it is possible to assess whether the intensity of a certain personality trait lies below or above a global median with an accuracy ranging from 68% (conscientiousness) to 76% (extraversion) [CBGP11]. This indicates that there is a causal relationship between a person’s personality traits and the way people use their smartphones.

1.1. Focus of This Work

Although several studies have found a relationship between personality traits and user data collected from smartphones and other smart devices, some have yet to be validated in an in-the-wild setting. Furthermore, well-performing approaches such as the work by Chittaranjan et al. [CBGP11] rely on data that has been aggregated on a monthly level, which might not be feasible for most of the applications mentioned at the beginning, and it is not clear how shorter data collection periods influence the performance of the suggested model. Other approaches such as the work by Mesot et al. [MWKG20] rely on analyzing the semantics of chat conversations, which could be considered privacy-invasive and may not be tolerated in a real-world application. Furthermore, manual feature engineering and feature selection play an important role in all of the mentioned approaches, which requires some domain expertise and knowledge about the types of features that might be predictive for the task. However, we believe that an automated feature engineering and feature selection approach might be beneficial since it could capture predictive features where the causal link between the feature and the trait may seem counter-intuitive.

Figure 1.1 visualizes the pipeline of this work, divided into four stages. The first stage consists of collecting the training data and the corresponding labels. In a user study, various types of smartphone data were collected over a period of 10 weeks, and two personality tests were used to obtain ground truth labels for the personality traits and depression level. The second stage consists of validating and pre-processing the data, which includes feature extraction and selection. In the third stage, a machine learning model is trained on the cleaned and validated dataset. In the final stage, a personality model is generated from the model’s predictions on unseen data from a new user, whereby the resulting personality model should be as close as possible to the outcome of a personality test had this new user taken the test beforehand. This is why the rightmost yellow path in Figure 1.1 is not a direct link from the user to the personality, but signals that the personality test constitutes an intermediate step and that the prediction is not directly connected to the person, but to the outcome of a potential personality test.

[Figure 1.1: pipeline diagram with the four stages Data Collection, Data Evaluation / Validation, Model Training and Prediction, connecting the elements Participants, Personality Tests, Smartphone Data, Personality Model, Dataset, New User and Predicted Model.]

Figure 1.1.: Process Overview. The first stage consists of collecting the smartphone data and the corresponding ground truth labels using personality tests. In the next stage, the tests are evaluated and the collected data is pre-processed and structured. Then, a machine learning model is trained on the collected dataset to classify the traits of the participants into two classes (high and low) from usage patterns in the smartphone data. Based on unseen data of a new user, the model should be able to output a prediction that is as close as possible to the personality model that would have been obtained if the new user had taken the personality tests.

In this work, we show how an SVM classifier can be used to infer personality traits from typing patterns while addressing the limitations of related work. Concretely, we present a way to classify the OCEAN personality traits and the depression level into two classes, obtained by using the population median as the separation margin. Our contributions are:

1. automating the feature engineering and feature selection process,

2. using less privacy-invasive data collected in an in-the-wild setting, and

3. requiring only a few days of data collection.

We show that short time windows of only one day can already be predictive, with a mean accuracy of up to 69%. Furthermore, we explain how the predictions of multiple 1-day windows can be combined to increase the accuracy substantially, reaching a peak accuracy of 84%. Additionally, we conclude that the most predictive features vary greatly across the traits, indicating that a separate model should be considered for each trait. High-level features such as typing speed or typing frequency are highly relevant for extraversion, agreeableness and conscientiousness, while the raw x and y coordinates and their sum are most relevant for neuroticism, openness and depression level.
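As a rough illustration of how the predictions of several 1-day windows could be combined, the following sketch averages per-window scores and thresholds the mean. The scores, the 0.5 threshold and the function name `combine_windows` are assumptions made for this example only; window averaging as used in this work is the subject of Section 5.1.2.

```python
from statistics import mean


def combine_windows(window_scores, threshold=0.5):
    """Average per-window scores for the high class and threshold the mean.

    window_scores: one score in [0, 1] per 1-day window (assumed format).
    Returns the averaged score and the combined label (1 = high, 0 = low).
    """
    avg = mean(window_scores)
    return avg, int(avg >= threshold)


# Made-up per-day scores of the "high" class for one user and one trait.
daily = [0.62, 0.41, 0.70, 0.55, 0.48]
avg, label = combine_windows(daily)
```

Averaging lets confident windows outweigh borderline ones; a majority vote over the per-window labels would be an alternative with the same interface.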

1.2. Thesis Organization

In Chapter 2, we examine different methods for measuring personality traits and explain how information about a person’s personality traits can be useful in practice. Furthermore, an overview of existing trait prediction methods is given. In Chapter 3, we explain our data collection process and how the user study was conducted. We also give a detailed analysis and validation of the collected data. In Chapter 4, we present our full feature extraction and selection pipeline and explain the design choices step by step. The chapter concludes by introducing our best performing model. In Chapter 5, we present quantitative results by comparing our model to other approaches from related work and discuss alternatives to our approach. Chapter 6 summarizes the preceding discussions, and Chapter 7 focuses on future work.

2. Related Work

This chapter covers the theory behind personality traits and gives an overview of the results from previous work. First, the notion of personality traits is discussed in general, before concrete models for assessing the traits are presented. Afterwards, various approaches for predicting a user’s personality traits and depression level based on different input modalities are discussed. The chapter is concluded with a quantitative overview that puts the individual achievements of previous work into context.

2.1. Personality Traits

In the 1920s, trait theory emerged in the field of psychology [All27]. It tried to explain differences in personality by introducing so-called personality traits, which were believed to be indicative of a person’s behaviour and to explain individual differences in how a person experiences and copes with different situations. Personality traits have a direct influence on our thoughts, feelings and behaviour, and they often remain stable over decades [HBJD16]. Personality is therefore considered to be a linear combination of personality traits. In order to reduce the number of traits and unify trait theory into an integral personality model, a factor analysis was performed that sought a minimal set of traits that best covers the variability across personalities.

This factor analysis resulted in the so-called Five Factor Model [MJ92], which consists of five traits: openness to experience, conscientiousness, extraversion, agreeableness and neuroticism, often abbreviated with the acronym OCEAN. Each trait influences different aspects of our personality, but the traits are not mutually exclusive, i.e., some behavioral characteristics can be part of two traits at once. This is because the factor analysis does not result in a perfectly orthogonal space. A set of adjectives that describe each trait in more detail can be found in Table 2.1.

Trait              Adjectives
Extraversion       active, assertive, energetic, enthusiastic, outgoing, talkative
Agreeableness      appreciative, forgiving, generous, kind, sympathetic, trusting
Conscientiousness  efficient, organized, planful, reliable, responsible, thorough
Neuroticism        anxious, self-pitying, tense, touchy, unstable, worrying
Openness           artistic, curious, imaginative, insightful, original, widely interested

Table 2.1.: The Big Five personality traits and some describing adjectives introduced by McCrae et al. [MJ92].

2.2. Personality Tests

The use of tests for evaluating a person’s skills and abilities dates back to the Han dynasty in China (202 BC to 220 AD) [KS08]. In Europe, however, such tests did not become relevant until World War I. The Woodworth Personal Data Sheet (WPDS) was one of the first personality tests used on a large scale [Woo32]. It was designed to assess a military recruit’s emotional stability with an easy and fast pen-and-paper test instead of a time-consuming psychiatric interview. This sudden demand for large-scale testing gave rise to various approaches for creating, evaluating and interpreting such tests. As a consequence, many different testing schemes emerged in various fields [KS08].

2.2.1. Big Five Inventory (BFI)

One of the personality tests that assesses the OCEAN traits is the Big Five Inventory (BFI) [Gol92]. It quantifies the OCEAN traits by assigning each trait a score denoting its intensity in a person’s personality. It consists of 44 statements that are rated on a 5-point Likert scale, indicating how strongly a person agrees or disagrees with the statement. Each statement belongs to one trait, with 8 to 10 statements per trait. Some of the statements are semantically inverted, meaning that for these statements a high rating yields a low intensity in the respective trait. This is necessary because having only equally polarized statements has been shown to be a disruptive element in questionnaires [DAR15]. The ratings of all statements belonging to a trait are aggregated (summed up, averaged or normalized to [0, 1]) to obtain the final test score, whereby the ratings of semantically inverted statements are inverted as well in order to compensate for the change in polarization. The personality is eventually represented as a 5-tuple consisting of one intensity score per trait.
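The aggregation described above can be sketched in a few lines. The item identifiers, the example answers and the function names are hypothetical and serve only as illustration; the sketch uses averaging as the aggregation and is not the official BFI scoring key.

```python
from statistics import mean

LIKERT_MIN, LIKERT_MAX = 1, 5  # 5-point Likert scale


def score_trait(ratings, reversed_items):
    """Average the ratings of one trait after inverting reverse-keyed items.

    ratings: dict mapping item id -> rating in [1, 5]
    reversed_items: set of item ids that are semantically inverted
    """
    adjusted = []
    for item, rating in ratings.items():
        if not LIKERT_MIN <= rating <= LIKERT_MAX:
            raise ValueError(f"rating for item {item} out of range: {rating}")
        if item in reversed_items:
            # Inversion maps 1<->5 and 2<->4 and leaves 3 unchanged.
            rating = LIKERT_MIN + LIKERT_MAX - rating
        adjusted.append(rating)
    return mean(adjusted)


def normalize(score):
    """Map a mean score from [1, 5] to [0, 1]."""
    return (score - LIKERT_MIN) / (LIKERT_MAX - LIKERT_MIN)


# Hypothetical item ids and answers, for illustration only.
example = {"e1": 4, "e2": 2, "e3": 5, "e4": 1}
extraversion = score_trait(example, reversed_items={"e2", "e4"})
```

Applying `score_trait` once per trait yields the 5-tuple of intensity scores mentioned above.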

2.2.2. Big Five Inventory 2 (BFI-2)

The original BFI did not include a hierarchical structure that would allow for subdividing the traits into sub-traits, so-called facets. However, this subdivision has become more relevant in recent years because facets enable a more detailed and comprehensive analysis of a personality. Furthermore, the number of semantically inverted and non-inverted statements was unequal, partially due to the fact that the number of statements was odd for some traits. In order to overcome these limitations, the Big Five Inventory 2 (BFI-2) was introduced [SJ17]. The statements were revised and adapted in order to make them more understandable and easier to interpret. The number of statements was increased to 60 (12 statements per trait), and half of the statements were semantically inverted to balance the polarity of the statements. In a study with two sample groups (a heterogeneous population with N = 1000 and college students with N = 470), the validity and reliability of the revised questionnaire were confirmed [SJ17]. Apart from the five OCEAN traits, the BFI-2 includes fifteen facets (see Table 2.2) and thereby constitutes a hierarchical construct. The statements have been translated into numerous languages such as German, Chinese or Spanish, which made it possible to investigate cultural differences around the world. Table 2.3 shows the mean for the OCEAN traits globally and with respect to western Europe [SAM⁺07], indicating that there are slight differences.

Extraversion   Agreeableness   Conscientiousness  Neuroticism           Openness
Sociability    Compassion      Organization       Anxiety               Intellectual Curiosity
Assertiveness  Respectfulness  Productiveness     Depression            Aesthetic Sensitivity
Energy Level   Trust           Responsibility     Emotional Volatility  Creative Imagination

Table 2.2.: Personality traits and their facets according to the Big Five Inventory 2 [SJ17].

2.2.3. Patient Health Questionnaire (PHQ-8)

Depression is one of the facets of neuroticism, as shown in Table 2.2. However, this facet is not meant to assess clinical depression levels: it is merely an indicator of a general inclination within a personality and is not a valid diagnostic for depressive disorders. This is why a questionnaire specifically for assessing a patient’s mental health, the Patient Health Questionnaire (PHQ-8) [KSS⁺09], was developed. It is an 8-item inventory with a structure comparable to the BFI. The eight statements are rated on a 4-point scale from zero (this statement never applies) to three (this statement applies daily). The ratings are then summed up to yield the final depression level on the range [0, 24]. A study with almost 200,000 participants showed that the PHQ-8 is a valid diagnostic tool for assessing depression. The authors suggest that a depression level of 10 or higher indicates major depressive disorder or similar depressive disorders [KSS⁺09]. In total, 8.57% of the participants showed signs of depressive disorder (i.e., a PHQ-8 level ≥ 10). In the rest of this report, depression level always refers to the PHQ-8 and should not be confused with the eponymous facet of neuroticism.
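The PHQ-8 scoring rule described above is a plain sum followed by a cutoff. The following minimal sketch illustrates it; the function names and the example answers are hypothetical, and the sketch is not a diagnostic tool.

```python
PHQ8_ITEMS = 8
PHQ8_CUTOFF = 10  # cutoff suggested by the authors of the PHQ-8


def phq8_level(ratings):
    """Sum eight item ratings (each 0..3) into a depression level in [0, 24]."""
    if len(ratings) != PHQ8_ITEMS:
        raise ValueError("PHQ-8 expects exactly eight ratings")
    if any(r not in (0, 1, 2, 3) for r in ratings):
        raise ValueError("each rating must be an integer from 0 to 3")
    return sum(ratings)


def is_elevated(level, cutoff=PHQ8_CUTOFF):
    """Return True if the level reaches the cutoff."""
    return level >= cutoff


answers = [1, 2, 0, 3, 1, 1, 2, 0]  # made-up answers
level = phq8_level(answers)
```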

                     Extraversion  Agreeableness  Conscientiousness  Neuroticism  Openness
Mean World           3.34          3.69           3.52               3.01         3.64
Mean Western Europe  3.38          3.56           3.39               2.99         3.71

Table 2.3.: Mean personality traits globally and in western Europe on the range [1, 5] [SAM⁺07].

2.2.4. Limitations of Personality Tests

Although pen-and-paper personality tests are cost- and time-effective compared to psychiatric evaluations, there are a number of limitations. First of all, most tests like the BFI-2 are designed to be easily interpretable and understandable, which makes it easy for the person being tested to distort the score in any desired way. Especially in high-stakes situations like job interviews this is a problem since the employer does not get an accurate picture of the applicant [Fah18]. Also, people might unconsciously distort their answers because they lack objectivity when assessing their own personality due to self-deception phenomena [MCD⁺07, ZMR11]. Taking the test repeatedly (e.g., in a therapeutic setting where the test is retaken multiple times to track progress) may introduce an additional bias because people might remember their answers and be influenced by them when retaking the test. These limitations call into question the tests’ validity and reliability, and call for an alternative to the traditional pen-and-paper tests.

2.3. Personality Trait Classification

Making the structure of the test less obvious means finding a more complex way to infer a person’s personality that prevents the person from directly and easily manipulating the result. In other words, the goal is to find an input modality that provides data containing a causal link to a person’s personality traits. Smartphones and other smart devices produce a large amount of unstructured data obtained from active touch interactions with the user, from context information like applications and their usage statistics, and from various sensors such as accelerometers, proximity sensors and gyroscopes. Several studies suggest that there is a relationship between personality traits and smartphone or computer usage patterns [KTVA18, BTK⁺19, CBGP11, MWKG20, KBFH08]. All of them used the Five Factor Model [MJ92] or a similar model as the underlying personality model. The ground truth for the prediction task was collected using a pen-and-paper personality test like the aforementioned BFI-2. Using these personality tests as ground truth is problematic because they suffer from the limitations described in Section 2.2.4. Thus, the suggested causal link between traits and usage patterns is actually a link between usage patterns and the outcome of a personality test had the respective person taken such a test. However, under the assumption that the participants have no motivation to consciously distort their scores, the result of the personality test can still be considered a good approximation of the actual ground truth. The problem of unconsciously distorting the result remains and cannot be circumvented unless a psychiatric evaluation is performed, which is usually infeasible.

In the following, we present four works based on four different approaches and analyze their benefits and limitations. Küster et al. [KTVA18] focused on data related to touch events; Chittaranjan et al. [CBGP11] addressed data extracted from usage statistics, application information, and different types of logs such as call logs and messages; Berkovsky et al. [BTK⁺19] focused on inferring personality traits based on eye-tracking data; and Mesot et al. [MWKG20] built a language model based on text messages in order to infer the traits. Both classification and regression are viable options for the given task and have been explored in the past (e.g., Khan et al. [KBFH08] performed a regression analysis). However, classification has been favoured by all four mentioned works. On the one hand, this is because using the population median as a separation margin results in a balanced split, while splitting at other values may result in significant class imbalance because the personality traits are neither evenly distributed nor centered on the underlying scale. On the other hand, none of the works achieved a classification accuracy higher than 76% for any trait, which is significantly below perfect separation of the two classes and suggests that classifying the personality traits into two classes is already a hard problem.
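The median split used to derive the two target classes can be sketched as follows. The scores are made up, the function name is hypothetical, and assigning scores equal to the median to the low class is an assumption of this example, since the handling of ties is not specified here.

```python
from statistics import median


def median_split(scores):
    """Label each score 1 (high) if it lies above the population median, else 0 (low).

    Scores equal to the median are assigned to the low class (assumption).
    """
    threshold = median(scores)
    return threshold, [1 if s > threshold else 0 for s in scores]


# Made-up extraversion scores on the range [1, 5].
scores = [2.1, 3.4, 3.9, 2.8, 4.2, 3.0]
threshold, labels = median_split(scores)
```

Because the threshold is the median of the population itself, the two classes are of (nearly) equal size regardless of how skewed the score distribution is.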

2.3.1. Touch-based Trait Classification

In a controlled lab experiment, Küster et al. [KTVA18] investigated the touchscreen usage of 75 participants. The participants were asked to play a spelling game on a tablet for several sessions of two to four minutes while all their touch interactions with the screen were recorded. Additionally, all participants filled out the NEO-FFI questionnaire [CM89], a personality test similar to the BFI-2 that also uses the Five Factor Model as the underlying personality model. The scores of the questionnaire were used as ground truth for the classification task, whereby all scores were split into two classes using the population median. From the collected data, two classes of features were extracted. On the one hand, features related to the touch interaction such as swipe speed, time between touches, touch accuracy, touch duration and their statistical derivatives were extracted. On the other hand, features related to the participant’s performance in the spelling game were calculated. In total, 110 features related to touch behaviour and performance were extracted and fed into different classifiers. Among the 10 classification algorithms, Support Vector Machines (SVMs), Random Forests (RFs) and Logistic Regression (LR) performed best, with a mean accuracy between 0.62 (agreeableness) and 0.67 (extraversion). This suggests that touch characteristics are to some extent predictive of a person’s personality.
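To make the notion of touch-timing features more concrete, the following sketch computes the touch duration and the time between touches, together with their mean and standard deviation. The event format, the feature names and the example values are assumptions made for illustration and do not reproduce the feature set of Küster et al.

```python
from statistics import mean, pstdev


def touch_features(events):
    """Compute simple timing features from touch events.

    events: list of (down_ms, up_ms) tuples sorted by down_ms, i.e. the
    time stamps at which a finger touched and left the screen.
    """
    durations = [up - down for down, up in events]
    # Time between the end of one touch and the start of the next one.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(events, events[1:])]
    return {
        "duration_mean": mean(durations),
        "duration_std": pstdev(durations),
        "gap_mean": mean(gaps) if gaps else 0.0,
        "gap_std": pstdev(gaps) if gaps else 0.0,
    }


# Made-up touch events in milliseconds.
events = [(0, 80), (200, 290), (450, 520), (600, 700)]
features = touch_features(events)
```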

2.3.2. Context-based Trait Classification

In an in-the-wild experiment, Chittaranjan et al. [CBGP11] recorded different types of anonymized logs via software installed on the phones of 83 participants. Four types of logs (calls, SMS, Bluetooth and application usage) were recorded over a period of eight months and aggregated on a monthly level. The personality was assessed using the TIPI questionnaire [GRS03], which is a very brief 10-item inventory that assesses the Big Five personality traits similarly to the BFI-2. Depending on the type of log, different features were extracted. For the application logs, the number of uses of a specific app was counted. From the call logs, different statistics like the number of outgoing calls or the average duration of the calls were calculated. The Bluetooth logs consisted of statistics about the number of unique Bluetooth IDs and how often these IDs were seen in a specific time slot. From the SMS logs, average and median word lengths and the number of sent messages were recorded. In a first step, the features were de-correlated and normalized. Then, an SVM was used to classify the data points using two target classes based on the population median. Leave-one-out cross-validation was performed to obtain the average classification accuracy. An average accuracy between 0.693 (openness) and 0.759 (extraversion) was achieved. To the best of our knowledge, this is the best performance achieved so far on this particular task of classifying the Big Five personality traits into two classes by using smartphone usage data.


2.3.3. Sensing-based Trait Classification

In the two previous sections, the focus was on inferring traits based on usage patterns, i.e., how the user interacts with the device. However, smartphones and other smart devices are also able to collect data unrelated to the actual usage, but rather about how the user is responding physiologically to surrounding stimuli. Various sensors such as accelerometers, gyroscopes, cameras and proximity sensors are nowadays present in most devices and provide a new modality of collecting user data. Berkovsky et al. [BTK⁺19] investigated how eye-tracking data can be used to detect personality traits. In a controlled lab experiment with 21 participants, they first assessed the participants’ personality by using multiple personality tests. Unlike the studies presented in the previous sections, this study did not assess neuroticism among the Big Five personality traits. The participants were shown a series of image and video stimuli for 55 minutes. The images and video snippets were selected from annotated databases that categorize the stimuli based on the affective response they trigger in a user, i.e., based on the feeling or characteristics of the feeling that an average user experiences when looking at the stimuli. Eye-tracking glasses were used to record the participant’s saccades and gaze direction.

From the resulting data collection, three types of high-order features were extracted, namely, eye-blinking measures, eye movement and pupillary response measures. Various standard classification algorithms like Support Vector Machines (SVMs), Random Forests (RFs), Logistic Regression (LR) and Naïve Bayes (NB), to name only a few, were used to classify the traits into two target classes based on the population median. Leave-one-out cross-validation resulted in a high accuracy for all assessed traits, ranging from 0.857 (openness) to 0.948 (agreeableness).

These results outperform the other two presented approaches in terms of accuracy. However, it has to be noted that the setting is different. In this work, the focus was not to find a general usage pattern that would reveal the traits, but to find a direct link between traits and how the human body responds physiologically to external stimuli. These stimuli were aligned in time during the study, i.e., all participants were shown the same stimuli in the same order. Hence, given a physiological response to a stimulus and the stimulus itself that caused the response, the traits can be predicted with high accuracy. This is not the case for the aforementioned works, which is why quantitatively, these results cannot be compared directly.

2.3.4. NLP-based Trait Classification

Mesot et al. [MWKG20] also focused on touch data, but unlike Küster et al. [KTVA18] (see Section 2.3.1), they solely focused on touch events from a chat conversation and exploited the semantics of the written text by using Natural Language Processing (NLP). In a controlled lab experiment with 70 participants, an experimenter held a chat conversation with the participants on a smartphone. The touch events were recorded in a session of 60 to 70 minutes. From the typed text, a language model using Bidirectional Encoder Representations from Transformers (BERT) [DCLT19] was built for each user. Furthermore, touch heat maps indicating where on the screen the user touched the most were extracted. These heat maps were fed into a variational autoencoder to obtain a low-dimensional latent representation of the heat maps. The latent vectors of the language model and those of the variational autoencoder showed particular strengths for individual traits when considered separately. Combining them into one model by learning the combination weights for each trait separately resulted in a considerably high classification accuracy for all traits (from 0.61 for extraversion up to 0.73 for neuroticism). Unlike other approaches, Mesot et al. also reported the area under the curve (AUC), which describes the performance of a classifier by integrating the true-positive rate as a function of the false-positive rate. The benefit of this metric is that it considers all potential thresholds t that can be used to separate predicted probabilities into the two target classes. This is in contrast to the accuracy metric reported by all previous works, where a separation threshold was chosen implicitly.
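To make the difference between the two metrics concrete, the following sketch computes the AUC by trapezoidal integration of the true-positive rate over the false-positive rate, next to the accuracy at one fixed threshold t. The labels and predicted probabilities are made-up illustration values, the function names are our own, and the code assumes that both classes are present.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, sweeping the threshold over all distinct scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Lowering the threshold step by step corresponds to visiting scores in descending order.
    order = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    points = [(0.0, 0.0)]  # (false-positive rate, true-positive rate)
    i = 0
    while i < len(order):
        t = order[i][0]
        # Samples tied at the same score cross the threshold together.
        while i < len(order) and order[i][0] == t:
            if order[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    # Trapezoidal rule over consecutive ROC points.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def accuracy(labels, scores, t=0.5):
    """Accuracy for one fixed separation threshold t."""
    preds = [1 if s >= t else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

labels = [0, 0, 1, 1]            # hypothetical ground truth (low/high)
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities

print(roc_auc(labels, scores))        # independent of any single threshold
print(accuracy(labels, scores, 0.5))  # depends on the chosen threshold
print(accuracy(labels, scores, 0.9))
```

The accuracy changes with the threshold passed to `accuracy`, while `roc_auc` summarizes the classifier over all thresholds at once.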

                                 Extraversion   Agreeableness   Conscientiousn.   Neuroticism    Openness
                                 acc    auc     acc    auc      acc    auc        acc    auc     acc    auc
Küster et al. [KTVA18] ∗         0.67   -       0.62   -        0.63   -          0.66   -       0.65   -
Chittaranjan et al. [CBGP11] †   0.76   -       0.70   -        0.68   -          0.72   -       0.69   -
Berkovsky et al. [BTK⁺19] ∗      0.81   -       0.90   -        0.81   -          -      -       0.86   -
Mesot et al. [MWKG20] †          0.61   0.60    0.69   0.70     0.66   0.66       0.73   0.73    0.68   0.69

∗ p unknown, † p < 0.05

Table 2.4.: Mean classification accuracy (acc) and area under the curve (auc) on the Big Five personality traits using different approaches. p-values result from repeating the experiment 100 times.

2.3.5. Summary and Limitations

Several studies suggest that a person’s Big Five personality traits can be classified into two classes (low and high) with an accuracy of up to 76% (see Table 2.4) given general usage patterns like context information or direct touch interactions. Furthermore, sensing technologies can be used to very accurately detect personality traits based on physiological responses triggered by external stimuli. Although not directly comparable to the other approaches, this method achieves the highest accuracy (see Table 2.4).

The approaches by Küster et al. [KTVA18] and Mesot et al. [MWKG20] report results in a controlled lab experiment, and in-the-wild performance data has yet to be collected to see how well the approaches work in practice. Additionally, the approach by Mesot et al. [MWKG20] might be subjectively considered to be privacy-invasive since it interprets the written text and understands the semantics of what was written. It is also limited to languages where a language model can be built accurately. This is not the case for languages like Swiss-German where no universal grammar is defined. Chittaranjan et al. [CBGP11] achieve a high classification accuracy, but at the cost of needing one month of data to perform the actual prediction since the training data was aggregated on a monthly level. Thus, it is not clear how shorter aggregation windows would impact the classification performance. Depending on the application, a user might not be willing to wait a whole month for a first prediction. Also, privacy-invasiveness might be an issue. Although it is subjective to say what is privacy-invasive and what is not, there might be users who would not agree to the collection of their messages, phone logs, application names and usage times, etc. Berkovsky et al. [BTK⁺19] used eye-tracking data and achieved a very high accuracy, however at the cost of needing the user’s attention in order to assess the traits. Unlike the other two works, the traits are not assessed passively in the background.


Although Küster et al. [KTVA18] and Chittaranjan et al. [CBGP11] use very different approaches, they both rely on hand-crafted features that imply (at least to some extent) domain knowledge and expertise. Depending on the type of collected data, it may not be clear a priori which features are predictive, and domain knowledge might be missing. It would therefore be desirable to have an automatic feature extraction and selection process, independent of the type of data that is being researched. In the following chapters we show what such a process looks like.

2.4. Depression Level Classification

Similar to trait classification, depression level classification can be tackled using various approaches. De Choudhury et al. [DCGCH13] classified Twitter users into classes of high and low depression level by analyzing their posts using different linguistic approaches. They achieved a prediction accuracy of up to 72.4% using SVM classifiers. A potential drawback of this approach is that social media data may be sensitive, and analyzing them can be considered privacy-invasive, which may not be tolerated by some users. Furthermore, one year of Twitter posts was analyzed, which would further complicate the use of this approach in a real-world application because this amount of data might not be available and the depression level may change multiple times during this period. Farhan et al. [FYM⁺16] predicted depression by using activity data obtained from smartphone sensors and GPS trackers. Depending on the amount of activity and patterns in the GPS location and movement collected over 14 days, depression was classified with a mean F1-score of up to 0.82. However, a professional psychiatric evaluation was performed to obtain the ground truth, which is time-consuming and costly in practice. While this is not a problem at the time when a potential application is deployed, it may be a problem during training when a bigger data set is used. Additionally, 14 consecutive days with good data quality were required to make a prediction, which could in practice result in much longer prediction intervals. Ware et al. [WYM⁺20] predicted not only depression but also depression symptoms using a very similar approach to Farhan et al., i.e., they also used smartphone sensing data and location data. However, the location data was not obtained from GPS trackers but from WiFi infrastructure. Predictions were again made on 14-day intervals if the data quality and amount of data were sufficient. An F1-score of up to 0.83 was achieved for some depression symptoms, which is comparable to previous work from Farhan et al. [FYM⁺16]. The ground truth was not assessed by a professional; instead, an oral face-to-face assessment using health questionnaires was conducted. While this is less costly than a professional psychiatric evaluation, it is still time-consuming. Furthermore, the problem with potentially long prediction intervals still remains.

In summary, several approaches for predicting depression level have been developed, but the limitations of these approaches are similar to the problems regarding personality traits described in Section 2.2.4. We aim to eliminate long prediction intervals and privacy-invasive data collection by proposing a way to infer the depression level in a fast, passive and non-privacy-invasive way. The following chapters show a way of achieving these goals.


3. Dataset

This chapter presents the dataset in detail. First, the data collection process is explained, covering the collection of the ground truth using personality tests, as well as the recording of various types of context, interaction and sensor data. In a next step, the collected data is statistically analyzed and validated. In a final step, the dataset is cleaned by excluding participants for whom the data validation revealed anomalies.

3.1. User Study

As has been shown in Chapter 2, there are several ways to tackle the problem of finding a data source that reveals information about a user’s personality traits. Context information, interaction data and sensor data have all been shown to be indicative of personality traits. Our goal is to use a data source where limitations such as long data collection time, privacy-invasiveness and additional user effort (see Section 2.3.5) do not arise. Hence, we seek a data source that is predictive for the personality traits, non-privacy-invasive, and can also be collected unobtrusively without additional effort from the user in a short period of time. We rule out all types of data related to context information because of potential privacy issues. Also, we forego NLP-based approaches using touch data for the same reason. However, we do not discard touch data entirely since raw touch data without analyzing any semantics may still be of use. Sensor data obtained from built-in sensors are also a viable data source since they are readily accessible and do not suffer from the mentioned limitations. The data of interest consists therefore of raw touch data and sensor data from built-in phone sensors.


Figure 3.1.: Screenshots from the custom keyboard showing the main alphabet in (a), the additional symbols and numbers in (b) and (c), and the main alphabet in private mode (d). The private mode is visualized by displaying a purple bar with a lock.

3.1.1. Data Collection

The dataset used in this work was collected as part of a broader research context which is not limited to the purpose of this work. The data collection was carried out in an in-the-wild setting with 82 participants over a period of 10 weeks by using a data collection app implemented for this study. From the diverse types of collected data, only a subset is relevant for this work and is described in Table 3.1. A full list of all collected types of data can be found in Appendix A.

It has to be noted that comparability easily becomes an issue when many different devices come into play, which is why it should be considered beforehand. Screen size, different keyboard layouts and diverse built-in sensors can complicate direct comparability. As for the keyboards, all participants were asked to switch to a custom keyboard provided by the collection app (see Figure 3.1). Depending on the screen size and the device’s orientation, we normalized the touch events to the range [0, 1] × [0, 1] in a post-processing step, which makes all touch events directly comparable. Another reason for using a custom keyboard is that direct access to the touch events is otherwise impeded unless the phone is rooted, because the event handling is carried out entirely by the system. With the custom keyboard, the event handling becomes visible to the keyboard, which can then record the events. Although it was stressed in advance that only raw x and y coordinates were going to be logged during the study and that the corresponding keys were neither going to be logged nor analyzed, it would generally still be possible to retrieve the typed text by mapping the touch coordinates onto the keyboard. Since this may have raised privacy issues, a private mode for entering passwords, e-mail addresses or other sensitive information was added to the keyboard in order to give the participants full control over the data collection. The private mode was active by default whenever password fields or other text fields related to sensitive information were being filled out. Additionally, the users could activate the private mode manually by pressing the lock in the header of the keyboard depicted in Figure 3.1 (d) for three seconds. The header of the keyboard was shown in purple to signal that the private mode was active. The data collection was stopped entirely every time the private mode was activated, either manually or automatically, and was resumed as soon as the private mode was deactivated again. As for the sensors, fortunately, most built-in physical sensors like accelerometers or gyroscopes nowadays use the same units to express measurements, which usually makes sensor data directly comparable.
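The normalization of touch events to the unit square can be sketched as follows. This is only an illustration under stated assumptions: the function name, the pixel-based input and the convention that the screen dimensions are given for portrait orientation and swap roles in landscape are hypothetical and not taken from the data collection app.

```python
def normalize_touch(x, y, width, height, landscape=False):
    """Map a raw touch point in pixels to the unit square [0, 1] x [0, 1].

    width and height are the screen dimensions in portrait orientation
    (an assumption of this sketch); in landscape they swap roles.
    """
    if landscape:
        width, height = height, width
    # Clamp to guard against events reported slightly outside the screen.
    nx = min(max(x / width, 0.0), 1.0)
    ny = min(max(y / height, 0.0), 1.0)
    return nx, ny

# The screen center maps to (0.5, 0.5) regardless of device size or orientation.
print(normalize_touch(540, 1170, 1080, 2340))
print(normalize_touch(1170, 540, 1080, 2340, landscape=True))
print(normalize_touch(360, 640, 720, 1280))
```

After this step, touch points from devices with different resolutions share one coordinate system, which is what makes them directly comparable.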

Type                  Description

Physical Sensors
  Accelerometer       Measures the applied acceleration force (in m/s²) to the device on all three physical axes (x, y, and z). The values are given in units of the earth’s gravitational acceleration g (g ≈ 9.81 m/s²).
  Gyroscope           Measures a device’s rate of rotation in rad/s around each of the three physical axes (x, y, and z).
  Light               Measures the current ambient light level (illumination) in lx.

Fused Sensors
  Rotation Vector     Measures the rotation of the phone in world coordinates over time. It uses the input from the accelerometer, gyroscope and magnetometer to account for gyroscope drift.

Touch Data
  Keyboard Touch      Raw x and y coordinates and timestamps of the touch events happening on the custom keyboard.

Context Data
  Foreground Tracker  Package names of all apps that have been opened including start and end time of the active period. Note that this information is not part of our model and is only used for data validation purposes (see Section 3.2).
  Time Change         Log entries indicating time zone changes.

Table 3.1.: Overview of the types of collected data that are relevant in this work.

3.1.2. Data Labeling

As a ground truth measure for the personality traits and the depression level, a German version of the BFI-2 [SJ17], as introduced in Section 2.2.2, and the PHQ-8, as introduced in Section 2.2.3, were used, respectively. The statements used in the questionnaires can be found in Appendix A.


[Figure 3.2: left panel, histogram of age (x-axis: Age, y-axis: Count) by gender (Male, Female); right panel, bar chart of enrollment (Bachelor, Master, PhD, Not A Student; y-axis: Count).]

Figure 3.2.: The age of the participants (left) and the university program they are currently enrolled in (right).

During the recruiting of the participants, it was stated that only German-speaking participants were accepted so as to avoid language problems with the personality tests. Since personality traits remain very stable over time [HBJD16], the ground truth does not need to be assessed multiple times and a single personality test result can be used as a label for all training instances collected for one particular user. However, it has been shown that personality traits can be intentionally changed in mental coaching sessions in only 10 to 12 weeks [MOC14]. Therefore, the ground truth was assessed twice, once at the beginning of the study and once at the end of the study. This not only ensures that the personality traits remained constant over time (and thus, that the ground truth is accurate for the entire period of the study), but it also provides a form of validation: we can omit participants who showed big changes in a personality trait because big changes would indicate that the test was answered dishonestly.

3.2. Data Validation

A total of 82 participants completed the study, which consisted not only of the data collection and the four personality tests (2x BFI-2, 2x PHQ-8) described in the previous section, but also of a short survey querying demographics and information about the participants’ profession. Three participants were excluded because of anomalies in their test answers (see Section 3.2.1) and one participant was excluded because of corrupted timestamp values (see Section 3.2.2). Two more participants were excluded because they did not fill out the personality tests and the survey on time. Of the remaining 76 participants, 36 were male and 40 were female. The participants were 23 years old on average (SD = 3.77) and 68 of them were university students, most of whom were currently enrolled in a Bachelor’s program (see Figure 3.2). 74 participants were native German or Swiss-German speakers and also indicated in the survey that they type in the respective language most of the time. In the following, we analyze the personality traits and depression levels obtained from evaluating the questionnaires and statistically validate the dataset. In a next step, the smartphone usage is analyzed, which is not relevant to the final model itself but can be useful in the context of data validation.


                      Extraversion   Agreeableness   Conscientiousness   Neuroticism   Openness
Our Dataset           3.42           3.92            3.58                2.58          3.79
Mean World            3.34           3.69            3.52                3.01          3.64
Mean Western Europe   3.38           3.56            3.39                2.99          3.71

Table 3.2.: Mean personality traits from our dataset, globally and in western Europe on the range [1, 5] [SAM⁺07].

3.2.1. Personality Traits and Depression Level

In Table 3.2 and Figure 3.3 we present the results of the BFI-2 and the PHQ-8. The evaluation of the BFI-2 showed that the mean personality traits are very similar to the world average and the western Europe average [SAM⁺07], but that there is a notable difference for neuroticism. A possible explanation for this is that neuroticism is the only negatively associated trait since it is also known as negative emotionality [DAR15]. Thus, a low score in this trait is generally considered desirable, which is why the participants might have distorted their scores with respect to this trait and therefore scored lower than average. In general, our set of participants presents a high variability in all traits and spreads widely over the range of possible values, which benefits our models because it indicates that the set is representative. This might not seem to be the case for the depression level because the highest depression level in our dataset is 13 whereas the highest possible level is 24. However, recall that a depression level above 9 already indicates depressive disorders [KSS⁺09], which is the case for 17 users. Hence, a wide spread is not expected for the depression level, which suggests that our set of users is also representative with respect to the depression level.

[Figure 3.3: (a) scores from 1 to 5 for Extraversion, Agreeableness, Conscientiousness, Neuroticism and Openness; (b) Depression Level (PHQ-8), axis shown from 0 to 12.]

Figure 3.3.: Distribution of personality characteristics. (a) Mean score on the Big Five personality traits per trait and per user according to the BFI-2 questionnaire [SJ17]. (b) Mean rating of the statements in the PHQ-8 depression level questionnaire [KSS⁺09] per participant. Note that the y-axis for the PHQ-8 is re-scaled for readability and actually ranges from 0 to 24.

The BFI-2 and PHQ-8 were carried out twice in order to ensure that the personality traits did not change during the study, since it can be assumed that the traits remain stable over this period [HBJD16, MOC14]. A difference between the first and the second outcome of the questionnaires can be an indicator of dishonest or random answers, which should be filtered out beforehand. Naturally, the results of the two personality tests were not identical, hence a tolerance had to be taken into account before excluding users. First of all, there is a discretization error of 12.5% in the BFI-2 [SJ17] induced by the 5-point Likert scale: the scale is four units long (distance from 1 to 5) and the true value might lie exactly in between two levels, which would result in an error of half a unit, i.e., 12.5% of the whole range. So, a difference smaller than that must be tolerated in any case. Furthermore, Martin et al. [MOC14] showed that mental coaching can change a personality trait by up to 20% (± 5%) in 10 to 12 weeks. Since this coincides with the duration of the study, the tolerance cannot be lower than 20% because we cannot rule out that participants were in such a process of personality change. We added a small margin to account for the variance, which results in a final tolerance of 25% (i.e., we accept an absolute change of ± 1.0 or smaller). Three participants were excluded based on this tolerance threshold. Figure 3.4 shows the personality test results of a participant who has been excluded, and of a participant who has been kept based on these thresholds. For the depression level, a higher tolerance should be taken into account because the PHQ-8 assesses the depression level of the last two weeks before the test and is therefore potentially less stable over a period as long as the user study. Hence, larger changes over time might be possible, which is why we allowed a change of up to 40%. However, this threshold did not have an influence on the user set because all users showed a change smaller than this tolerance threshold.
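The arithmetic behind the tolerance thresholds can be reproduced with a short sketch. The scale ranges and tolerances are the ones stated in the text; the function names and the example scores of the two participants are hypothetical.

```python
BFI_MIN, BFI_MAX = 1.0, 5.0    # BFI-2 trait scores lie in [1, 5]
PHQ_MIN, PHQ_MAX = 0.0, 24.0   # PHQ-8 depression levels lie in [0, 24]

def max_allowed_change(lo, hi, tolerance):
    """Absolute change that corresponds to a relative tolerance on the scale [lo, hi]."""
    return tolerance * (hi - lo)

def is_consistent(pre, post, lo, hi, tolerance):
    """True if every score changed by at most the allowed absolute amount."""
    limit = max_allowed_change(lo, hi, tolerance)
    return all(abs(b - a) <= limit for a, b in zip(pre, post))

# Half a unit on a scale that is four units long.
discretization_error = 0.5 / (BFI_MAX - BFI_MIN)

# Hypothetical pre-study and post-study trait scores (E, A, C, N, O).
pre = [3.2, 4.0, 3.5, 2.5, 3.9]
post_small_change = [3.4, 3.8, 3.5, 2.6, 4.1]
post_large_change = [3.4, 2.7, 3.5, 2.6, 4.1]   # agreeableness drops by 1.3

kept = is_consistent(pre, post_small_change, BFI_MIN, BFI_MAX, 0.25)
excluded = not is_consistent(pre, post_large_change, BFI_MIN, BFI_MAX, 0.25)

print(discretization_error)                        # relative discretization error
print(max_allowed_change(BFI_MIN, BFI_MAX, 0.25))  # allowed BFI-2 change
print(max_allowed_change(PHQ_MIN, PHQ_MAX, 0.40))  # allowed PHQ-8 change
print(kept, excluded)
```

A tolerance of 25% on the four-unit BFI-2 scale corresponds to the absolute change of 1.0 mentioned above, and 40% on the PHQ-8 scale corresponds to roughly 9.6 points.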

In general, we could not detect any trends in the change of the personality traits (see Figure 3.5).

The mean changes in absolute terms for the OCEAN traits were 0.03 (extraversion), -0.02 (agreeableness), 0.03 (conscientiousness), -0.01 (neuroticism) and -0.01 (openness), which shows that the traits remained constant on average. Due to the epidemiological situation in Switzerland caused by the COVID pandemic and the resulting restrictions, one could have expected to see a slight increase in depression level. However, the depression level did not increase much either (mean absolute change of 0.28, i.e., a 1.17% increase). It has to be noted that the study coincided with the lifting of many restrictions and that the participants may not have been exposed to restrictions as severe as those in other countries.

3.2.2. Smartphone Usage

Although context information is irrelevant for our model, it was still part of the data collection as listed in Table 3.1 because it is well-suited as a data validation method. For example, we would expect to see an increased phone usage in the mornings, lunch breaks, evenings and weekends when participants usually have free time and are not working. We would also expect to see little or no activity during the night when users are sleeping, and the activity should also be evenly distributed over the duration of the study. The background service that performs the data recording is stopped by the system every time the phone is locked. However, the end time of a session may not be logged if the data collection app is not closed correctly or closed abruptly.


[Figure 3.4: three radial plots with axes E, A, C, N, O on a scale from 1 to 5; (a) Our Dataset, World, Western Europe; (b) and (c) Pre-Study, Post-Study.]

Figure 3.4.: Radial personality plots obtained from evaluating the BFI-2 [SJ17] on our dataset. The axis labels denote extraversion, agreeableness, conscientiousness, neuroticism and openness, respectively. (a) The mean personality of our dataset compared to the world average and the western Europe average [SAM⁺07]. (b) The personality traits from the first and second personality test of a user that has been excluded. (c) The personality traits from the first and second personality test of a user that has been kept.

Hence, if the data recording was performed correctly, there should not be many sessions that last multiple hours because users usually lock their phones when they put them away, or the screen is locked automatically after some time. Cases where the recording was stopped but the end time was not logged should be detected and filtered out because, although data was recorded, the participants were not using their phone during this period.

The foreground tracker logged the package names of all used applications together with an associated start and end time. Since we are not interested in the used applications themselves, only the timestamps were extracted. The timestamps report the current time in milliseconds with respect to the user’s current time zone. Of course, the time zone may change, which would affect comparability across users. An additional log storing time changes (i.e., the last time zone, the new time zone and the time the change happened with respect to the old time zone) was used to align all touch events to UTC time, which ensures further comparability. In order to validate the data, we generated a usage heat map for each user as shown in Figure 3.6. User A goes to bed rather late, between 1am and 4am. The bed time shifts towards the end of the study but so does the wake-up time. User B is more constant with a very regular bed time and wake-up time throughout the whole study. Interestingly, the weekends are directly visible from the plot. There is a big increase in the usage time two days a week every 7 days between 10am and 4pm, and it is exactly on those days that the user does not wake up at 7am as usual. Also, this user is less active on the smartphone compared to user A.

As demonstrated by this example, heat maps are useful to validate whether the recorded usage time is plausible, and to ensure that there are no negative values nor any durations lasting longer than 60 minutes per cell. One user was excluded based on this reasoning because the heat map showed negative periods and activity durations over 60 minutes per hour, which is an indicator for corrupted timestamps.
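The plausibility check described above can be sketched as follows. This is a minimal illustration and not the implementation used in the study: the session format (start and end time in milliseconds since a reference midnight, already aligned to UTC) and all function names are assumptions.

```python
from collections import defaultdict

MS_PER_MINUTE = 60_000
MS_PER_HOUR = 3_600_000

def usage_heatmap(sessions):
    """Accumulate active minutes per (day index, hour of day) cell.

    sessions: iterable of (start_ms, end_ms) integer pairs (hypothetical format).
    """
    cells = defaultdict(float)
    for start_ms, end_ms in sessions:
        if end_ms < start_ms:
            # Keep the negative duration visible so that validation can flag it.
            hour_index = start_ms // MS_PER_HOUR
            cells[(hour_index // 24, hour_index % 24)] += (end_ms - start_ms) / MS_PER_MINUTE
            continue
        t = start_ms
        while t < end_ms:
            # Split the session at hour boundaries and credit each hour separately.
            hour_index = t // MS_PER_HOUR
            chunk_end = min(end_ms, (hour_index + 1) * MS_PER_HOUR)
            cells[(hour_index // 24, hour_index % 24)] += (chunk_end - t) / MS_PER_MINUTE
            t = chunk_end
    return dict(cells)

def is_plausible(cells):
    """A heat map is plausible if every cell lies within [0, 60] minutes."""
    return all(0.0 <= minutes <= 60.0 for minutes in cells.values())

# A session from 00:30 to 01:30 on day 0 contributes 30 minutes to two cells.
ok = usage_heatmap([(30 * MS_PER_MINUTE, 90 * MS_PER_MINUTE)])
# Overlapping sessions (e.g., from corrupted timestamps) exceed 60 minutes in one cell.
too_long = usage_heatmap([(0, 50 * MS_PER_MINUTE), (10 * MS_PER_MINUTE, 60 * MS_PER_MINUTE)])
# An end time before the start time produces a negative cell.
negative = usage_heatmap([(60 * MS_PER_MINUTE, 30 * MS_PER_MINUTE)])

print(ok, is_plausible(ok))
print(is_plausible(too_long), is_plausible(negative))
```

Splitting sessions at hour boundaries ensures that a single valid session can never contribute more than 60 minutes to a cell, so any violation points to overlapping or otherwise inconsistent timestamps.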


[Figure 3.5: six panels (Extraversion, Agreeableness, Conscientiousness, Neuroticism, Openness, Depression Level), each with lines marking the mean and the median.]

Figure 3.5.: Absolute changes of the OCEAN traits and the depression level before and after the study per participant. Participants are shown on the x-axis and absolute trait changes are shown on the y-axis.


[Figure 3.6: heat maps for (a) User A and (b) User B; x-axis: Day Of The Year, y-axis: Hour Of The Day, color scale: Minutes (0 to 60).]

Figure 3.6.: Usage heat maps for two users. Each cell indicates the number of minutes that the user was active on the smartphone at that specific hour of the day on that day. The days are labeled as days of the year to allow for comparability since not all participants started on the same day.

In a final validation step, the typing speed of the participants was evaluated in order to ensure that the touch events were distributed in a plausible manner. We report an average typing speed of 46 words per minute, which is above the average of 36 words per minute (± 13.22) reported by Palin et al. [PFK⁺19]. However, taking into account that most participants were students and are probably better at typing than the average population, the obtained average is plausible.

In conclusion, our data validation excluded six participants for the aforementioned reasons, reducing the total number of participants from the initial 82 to 76 as stated at the beginning of this section.


4. Methods

This chapter explains the methods used in detail: data pre-processing, feature extraction, and model selection. It concludes with a short description of our best model.

4.1. Overview

Information from touch interactions can be predictive for all personality traits, as demonstrated by Küster et al. [KTVA18] (see Section 2.3.1). In that work, the touch events were recorded over a short session of less than an hour while the user was playing a spelling game. We believe that touch data from typing interactions, such as browser queries or chat conversations, aggregated over a longer time period might be more predictive. First, personality traits are long-term behavioral characteristics that reveal themselves as trends in our behavior; this does not rule out short-term emotions or moods that contradict a given trait. For example, there are certainly situations in which a generally neurotic person is calm and relaxed. A prediction based on a short-term window that only captures features from such a calm phase would not indicate high neuroticism, although the person is generally neurotic. Intuitively, a longer period of time should therefore capture personality characteristics better, because it is more robust to outliers such as the calm phase described above. This line of thought is also reflected in previous work, e.g., by Chittaranjan et al. [CBGP11], where the data was aggregated on a monthly level, or by De Choudhury et al. [DCGCH13], where a whole year of data was used (see Section 2.4). Second, if the data necessary to make the predictions has to be collected in an obtrusive way that requires additional effort from the user, this could discourage the user from actually using the app in a real-world setting. In the case of the aforementioned work by Küster et al. [KTVA18], the data source was the spelling game (see
