Max Pellert (https://mpellert.at)
Currently: Group Leader at the Barcelona Supercomputing Center in the Department for Computational Social Science and Humanities
Before: Professor of Social and Behavioural Data Science (interim, W2) at the University of Konstanz
Assistant Professor (Business School of the University of Mannheim)
Industry experience at SONY Computer Science Laboratories in Rome, Italy
PhD in Computational Social Science from the Complexity Science Hub Vienna and the Medical University of Vienna
Studies in Psychology and History and Philosophy of Science
MSc in Cognitive Science and BSc in Economics (both University of Vienna)
One example: Linguistic Inquiry and Word Count, LIWC (pronounced “Luke”)
Simple word-matching method (see the sketch after this list)
Generated and validated by psychologists (Pennebaker et al., 2001-today)
Examples of LIWC classes:
Positive Affect, Negative Affect
Anxiety, Sadness, Anger
Social processes
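To illustrate the word-matching approach, here is a minimal Python sketch in the spirit of LIWC; the tiny lexicon below is invented for illustration and is not the licensed LIWC dictionary, which is far larger and defines many more categories.

```python
import re

# Illustrative mini-lexicon; entries ending in '*' are stems that match any
# suffix, as in the real LIWC dictionary (which is licensed and much larger).
LEXICON = {
    "posemo": {"happy", "good", "love*"},
    "negemo": {"sad", "bad", "angr*"},
}

def liwc_style_scores(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {category: 0 for category in LEXICON}
    for token in tokens:
        for category, entries in LEXICON.items():
            for entry in entries:
                if entry.endswith("*") and token.startswith(entry[:-1]):
                    counts[category] += 1
                elif token == entry:
                    counts[category] += 1
    # LIWC reports each category as a percentage of all words in the text
    return {c: 100 * n / max(len(tokens), 1) for c, n in counts.items()}

print(liwc_style_scores("I love this, but the angry replies made me sad."))
```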
More advanced examples using deep learning
Classifiers based on transformer architectures (RoBERTa)
Large general-purpose language models adapted to the task of emotion classification
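A minimal sketch of such a classifier using the Hugging Face transformers pipeline is shown below; the checkpoint name is one publicly available example of a RoBERTa-based emotion model, not necessarily the exact model used in the work discussed here.

```python
from transformers import pipeline

# Assumed example checkpoint: a DistilRoBERTa model fine-tuned for emotion
# classification; any similar emotion model would work the same way.
classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=None,  # return scores for all emotion classes, not just the top one
)

texts = ["I can't believe they cancelled the event again."]
for scores in classifier(texts):
    for entry in sorted(scores, key=lambda e: e["score"], reverse=True):
        print(f"{entry['label']}: {entry['score']:.3f}")
```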
Sentiment analysis has gotten a somewhat bad name: “Why don’t we run something on the text?”
Many of the available tools were created with conceptual flaws, noisy data, and inadequate annotation schemes
Results can be cherry-picked by optimizing over the choice of tool
But, we argue, used right, it can be a valuable research instrument
Individual text level (for example a single tweet): not reliable; sarcasm, irony, and the performative nature of social media mean we need a substantial number of texts to get through the noise (especially with dictionary methods, where base rates are also low)
Individual person level: associations with (rather) stable personality traits are sometimes higher (for example for depression: Eichstaedt et al., 2018) and sometimes lower (PANAS scale: Beasley & Mason, 2015)
Group level (geographical): debated, for example the Twitter heart disease study (Eichstaedt et al., 2015); methods have to be validated and checked for robustness (Jaidka et al., 2020)
Metzler, H., Pellert, M., & Garcia, D. (2022). Using Social Media Data to Capture Emotions Before and During COVID-19 (World Happiness Report 2022). https://worldhappiness.report/ed/2022/using-social-media-data-to-capture-emotions-before-and-during-covid-19/
derstandard.at
An internet pioneer in the German-speaking area (centered on Austria)
Popular page: almost 57 million visits in November 2020
Active forum with many postings below news articles
Twitter
Tweets from Austria (data on location from Brandwatch)
Survey on yesterday’s emotional state run for 20 days in November 2020
“How was your last day” (“Wie war der letzte Tag?”)
Displayed in between the article text in a low-barrier manner; could be answered anonymously
In a collaboration with derstandard.at, we obtained the survey results
The data allows us to investigate the relationship between the explicit survey measure and the results of methods that extract sentiment indirectly from text
Combination of dictionary-based and deep-learning-based (RoBERTa) sentiment analysis on the text of postings (in German): LIWC and German Sentiment
These were the only two tools used, with no cherry-picking of methods (see the preregistration we discuss later)
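For the deep-learning side, a minimal sketch is shown below, assuming the germansentiment Python package (Guhr et al.), which wraps a German BERT sentiment model; the posting texts are invented placeholders.

```python
from germansentiment import SentimentModel

# Loads a pre-trained German sentiment model (oliverguhr/german-sentiment-bert)
model = SentimentModel()

postings = [
    "Das war ein wirklich guter Artikel.",   # "That was a really good article."
    "Schon wieder ein Lockdown, furchtbar.", # "Another lockdown, terrible."
]
print(model.predict_sentiment(postings))  # e.g. ['positive', 'negative']
```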
268,128 survey responses between November 11th and 30th, 2020
11,082 unique users and 743,003 postings on derstandard.at during the survey period
11,237 unique users and 635,185 tweets for Twitter
We subtract baseline-corrected negative from baseline-corrected positive on the texts of each day
Baseline period from 2020-03-16 to 2020-04-20, the first COVID-19 lockdown in Austria
To match the time frame of the survey question, we take a three-day rolling average (right-aligned)
This way we account for people answering the survey in the evening/night with different reference points for “yesterday”
Compare to the percentage of positive answers in the survey (see the sketch below)
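A sketch of this aggregation on synthetic data follows; the column names and the exact form of the baseline correction (shown here as division by the baseline mean; subtraction would work analogously) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily means of per-text positive/negative classifier scores
dates = pd.date_range("2020-03-16", "2020-11-30", freq="D")
rng = np.random.default_rng(0)
daily = pd.DataFrame(
    {"positive": rng.uniform(1, 3, len(dates)),
     "negative": rng.uniform(1, 3, len(dates))},
    index=dates,
)

# Baseline correction relative to the first COVID-19 lockdown in Austria
baseline = daily.loc["2020-03-16":"2020-04-20"].mean()
corrected = daily / baseline

# Daily sentiment: corrected positive minus corrected negative, smoothed with
# a right-aligned three-day rolling average (pandas' default alignment)
sentiment = (corrected["positive"] - corrected["negative"]).rolling(3).mean()

# Compare to the survey: correlate with the daily share of positive answers
# (a synthetic placeholder series covering the 20-day survey period)
survey_pos = pd.Series(rng.uniform(40, 60, 20),
                       index=pd.date_range("2020-11-11", periods=20, freq="D"))
print(sentiment.reindex(survey_pos.index).corr(survey_pos))
```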
We planned an extension of the analysis to another platform (Twitter)
To see if this is a platform effect or if the correspondence between text analysis and the explicit survey generalizes
We pre-registered the same study design as before but with Twitter data
Generally, the negative components of text analysis results could be improved
LIWC negative on derstandard fails (dialect words that are not included in the dictionary?)
We showed that macroscopes of emotions are possible
Here for Austria (for the UK and a number of other countries, see the World Happiness Report 2022 chapter)
Digital traces from social media can be a complementary data source to traditional surveys
We find strong relationships between both signals
Social media data has a number of advantages: cheap, large-scale data; longitudinal and temporally fine-grained; “always-on”; people are observed indirectly and unobtrusively
Pellert, M., Metzler, H., Matzenberger, M., & Garcia, D. (2022). Validating daily social media macroscopes of emotions. Scientific Reports, 12(1), 11236. https://doi.org/10.1038/s41598-022-14579-y
Book chapter outlining the connected research program:
Performance tests of “human intelligence” have played a role since the beginning of AI (for example Evans, 1964)
The idea of psychometric AI has been brought up prominently roughly once per decade since then, but no major works followed
We show that LLMs nowadays can be psychometrically assessed in a rich way using different approaches; we propose to adapt a Natural Language Inference (NLI) task
We use the entailment score between a premise (the psychometric item text) and each hypothesis (the possible answers according to the psychometric survey specifications), as sketched below
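A minimal sketch of this NLI scoring, using a public NLI checkpoint; the item wording and answer options are illustrative placeholders rather than text from a licensed inventory, and the hypothesis template is one plausible choice.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed example NLI model; its label order is contradiction, neutral, entailment
name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

premise = "I see myself as someone who is talkative."  # illustrative item text
answers = ["strongly disagree", "disagree", "neither agree nor disagree",
           "agree", "strongly agree"]

for answer in answers:
    hypothesis = f"I {answer} with this statement."  # assumed template
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax over the three NLI labels; index 2 is the entailment probability
    entailment = torch.softmax(logits, dim=-1)[0, 2].item()
    print(f"{answer}: {entailment:.3f}")
```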
Standard psychometric inventories can be repurposed as diagnostic tools for large language models (LLMs)
Psychometric profiling enables researchers to study and compare LLMs in terms of non-cognitive traits, thereby providing a window into the personalities, values, beliefs, and biases these models exhibit (or mimic)
We conclude by highlighting open challenges and future avenues of this novel research perspective
Big Five Inventory
Dark Tetrad
Revised Portrait Values Questionnaire
Moral Foundations Questionnaire
Gender/Sex Diversity Beliefs Scale
Our approach is very flexible: a large number of questionnaires can be applied
Unlike testing with ad-hoc examples, this enables systematic and rich investigations that build on existing, theoretically underpinned resources from psychometrics
In humans, uncovered psychometric traits often have a systematic link to behavior (for example risk aversion and neuroticism)
It is a big, open empirical question whether psychological profiles (e.g. personality or value orientation) of LLMs have a consistent, predictable link to their behavior, i.e. model outputs
Examples: LLMs determining financing or housing eligibility or screening CVs
We can expect LLMs to take part in ever more societal decision-making
Serapio-García, G., Safdari, M., Crepy, C., Sun, L., Fitz, S., Romero, P., Abdulhai, M., Faust, A., & Matarić, M. (2023). Personality Traits in Large Language Models (arXiv:2307.00184). arXiv. http://arxiv.org/abs/2307.00184
Hagendorff, T. (2023). Machine Psychology: Investigating Emergent Capabilities and Behavior in Large Language Models Using Psychological Methods. https://doi.org/10.48550/ARXIV.2303.13988
Mapping out the space of model traits’ influence on an example task:
CV screening (Updated Resume Dataset)
“AI systems used to evaluate the credit score or creditworthiness of natural persons” are a special risk area in the coming EU regulations, because of the far-reaching consequences of this assessment for access to financial resources or essential services such as housing, electricity, and telecommunication services
Can we build taxonomies of the (causal) effect of controlled model traits such as openness?
So far the evidence is more anecdotal, but who would have thought that emotional appeals increase model performance in question-answering tasks?
Developing related research examples: LLMs can tailor targeted texts to specific personality types → points to some consistent internal representation of personality?
https://transformer-circuits.pub/2024/scaling-monosemanticity/
Ideas of “model lobotomy” may be coming closer
Or at least something like brain imaging of neural nets (detecting functional partitions)
Clamping personality or value-orientation features? (instead of the Golden Gate Bridge feature, for example)
To craft specific non-cognitive model traits in a “hard” way (as with adapters, actually changing model weights) instead of “softly” with prompting