UPF Talk 02/2025: Studies, Methods & Outlook

Max Pellert (https://mpellert.at)

https://mpellert.at/upf_talk_02_25/upf_talk_02_25.pdf

Currently: Group Leader at the Barcelona Supercomputing Center in the Department for Computational Social Science and Humanities

Before: Professor of Social and Behavioural Data Science (interim, W2) at the University of Konstanz

Before that: Assistant Professor at the Business School of the University of Mannheim

I worked in industry at SONY Computer Science Laboratories in Rome, Italy

PhD from the Complexity Science Hub Vienna and the Medical University of Vienna in Computational Social Science

Studies in Psychology and History and Philosophy of Science

MSc in Cognitive Science and BSc in Economics (both University of Vienna)

Basics: Extracting Signals from Text

One example: Linguistic Inquiry and Word Count, LIWC (pronounced “Luke”)

Simple word-matching method (a minimal code sketch follows below)

Generated and validated by psychologists (Pennebaker et al., 2001-today)

Examples of LIWC classes:
Positive Affect, Negative Affect
Anxiety, Sadness, Anger
Social processes
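
A minimal sketch of the word-matching idea; the word lists are invented for illustration (the real LIWC dictionaries are licensed and far larger):

```python
import re

# Toy category word lists; "*" is a prefix wildcard, as in LIWC dictionaries.
# These lists are invented for illustration, not taken from the real lexicon.
lexicon = {
    "posemo": ["happy", "good", "love*"],
    "negemo": ["sad", "bad", "hate*"],
}

def matches(token, pattern):
    if pattern.endswith("*"):
        return token.startswith(pattern[:-1])
    return token == pattern

def liwc_like_scores(text):
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    counts = {cat: 0 for cat in lexicon}
    for token in tokens:
        for cat, patterns in lexicon.items():
            if any(matches(token, p) for p in patterns):
                counts[cat] += 1
    total = max(len(tokens), 1)
    # LIWC reports relative frequencies: category hits per total word count
    return {cat: n / total for cat, n in counts.items()}

print(liwc_like_scores("I love this, it makes me happy, but the weather is bad."))
```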

Basics: Extracting Signals from Text

More advanced examples use deep learning:
Classifiers based on transformer architectures (RoBERTa)
Large general-purpose language models adapted to the task of emotion classification (see the sketch below)
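
A minimal sketch using the Hugging Face `pipeline` API; the model name is one publicly available RoBERTa-family emotion classifier, chosen for illustration rather than taken from our studies:

```python
from transformers import pipeline

# Load a pretrained transformer-based emotion classifier
classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=None,  # return scores for all emotion classes
)

print(classifier("I can't believe they cancelled the concert again."))
```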

Sentiment Analysis

Has gotten a somewhat bad name: “Why don’t we run something on the text?”

Often conceptually flawed: noisy data and inadequate annotation schemes have been used to create many different tools

Results can be cherry-picked by optimizing on the tool

But, we argue, used right it can be a valuable research instrument

Sentiment Analysis Evidence

Individual text level (for example, a single tweet): not reliable due to sarcasm, irony, and the performative nature of social media. We need a substantial number of texts to get through the noise (especially with dictionary methods; base rates are also low)

Individual person level: associations with (rather) stable personality traits are sometimes higher (for example for depression: Eichstaedt et al., 2018) and sometimes lower (PANAS scale: Beasley & Mason, 2015)

Group level (geographical): debated, for example the Twitter heart disease study (Eichstaedt et al., 2015); methods have to be validated and checked for robustness (Jaidka et al., 2020)

Our contribution: macroscopically validating whether we are able to capture the momentary feeling of a population at a daily level

World Happiness Report

Metzler, H., Pellert, M., & Garcia, D. (2022). Using Social Media Data to Capture Emotions Before and During COVID-19 (World Happiness Report 2022). https://worldhappiness.report/ed/2022/using-social-media-data-to-capture-emotions-before-and-during-covid-19/

Data sources

derstandard.at
An internet pioneer in the German-speaking area (centered on Austria)

Popular page: almost 57 million visits in November 2020

Active forum with many postings below news articles

Twitter
Tweets from Austria (data on location from Brandwatch)

Mood Survey on derstandard.at

Survey on yesterday’s emotional state run for 20 days in November 2020

“How was your last day?” (“Wie war der letzte Tag?”)

Displayed in between the article text in a low-barrier manner; could be answered anonymously

In a collaboration with derstandard.at, we obtained the survey results

The data allows us to investigate the relationship of the explicit survey measure with the results of methods that extract sentiment indirectly from text

Text analysis

Combination of dictionary-based and deep-learning-based (RoBERTa) sentiment analysis on the text of postings (in German): LIWC and German Sentiment

These were the only two tools used, so there was no cherry-picking of methods (see the preregistration that we discuss later)
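
A minimal usage sketch, assuming the `germansentiment` Python package that wraps the German Sentiment model (Guhr et al.); the example inputs are invented:

```python
from germansentiment import SentimentModel

# Load the pretrained German sentiment classifier
model = SentimentModel()

texts = [
    "Der Artikel ist ausgezeichnet recherchiert.",  # "The article is excellently researched."
    "Das ist wirklich ärgerlich.",                  # "That is really annoying."
]
print(model.predict_sentiment(texts))  # e.g. ['positive', 'negative']
```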

Text analysis

268,128 survey responses between November 11th and 30th, 2020

11,082 unique users and 743,003 postings on derstandard.at during the survey period

11,237 unique users and 635,185 tweets for Twitter

We subtract the baseline-corrected negative share from the baseline-corrected positive share of the texts of each day (a sketch follows below)

Baseline period from 2020-03-16 to 2020-04-20, the first COVID-19 lockdown in Austria
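
A minimal sketch of the baseline correction, with invented numbers standing in for the daily shares of positive and negative texts:

```python
import pandas as pd

# Illustrative daily shares of texts classified positive/negative
# (column names and values are made up, not the study's data)
df = pd.DataFrame({
    "date": pd.date_range("2020-11-11", periods=5),
    "pos_share": [0.32, 0.30, 0.35, 0.33, 0.31],
    "neg_share": [0.21, 0.24, 0.20, 0.22, 0.23],
})

# Mean shares over the baseline period (2020-03-16 to 2020-04-20),
# hard-coded here for illustration
pos_baseline, neg_baseline = 0.29, 0.25

# Daily signal: baseline-corrected positive minus baseline-corrected negative
df["sentiment"] = (df["pos_share"] - pos_baseline) - (df["neg_share"] - neg_baseline)
print(df)
```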

Text analysis

To match the range of the survey question, we take a three-day rolling average (right-aligned); see the sketch below

This way we account for people answering the survey in the evening or at night, with different reference points for “yesterday”

Compare to: % of positive in the survey
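
A minimal sketch of the smoothing and the comparison with the survey signal, with synthetic numbers standing in for the real daily series:

```python
import pandas as pd

# Synthetic daily signals for illustration
text_sentiment = pd.Series([0.07, 0.02, 0.10, 0.06, 0.04, 0.08, 0.09])
survey_pos = pd.Series([0.55, 0.52, 0.58, 0.56, 0.53, 0.57, 0.59])

# pandas rolling windows are right-aligned by default:
# the value for day t averages days t-2, t-1 and t
smoothed = text_sentiment.rolling(window=3).mean()

# Pearson correlation on the overlapping days (NaN pairs are dropped)
print(smoothed.corr(survey_pos))
```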

Close correspondence between explicit survey and text analysis (same platform)

Preregistration

We planned an extension of the analysis to another platform (Twitter)

To see if this is a platform effect or if the correspondence between text analysis and explicit survey generalizes

We pre-registered the same study design as before but with Twitter data

Close correspondence between explicit survey and text analysis also for Twitter

Components

Generally, the negative components of text analysis results could be improved

LIWC negative fails on derstandard (dialect words that are not included in the dictionary?)

Longer term trend of the two text sentiment signals

External Validations

Summary

We showed that macroscopes of emotions are possible

Here for Austria (for UK and a number of other countries see World Happiness Report 2022 chapter)

Digital traces from social media can be a complementary data source to traditional surveys

We find strong relationships between both signals

Social media data has a number of advantages: cheap, large-scale data; longitudinal and temporally fine-grained; “always-on”; people are observed indirectly and unobtrusively

Publications

Pellert, M., Metzler, H., Matzenberger, M., & Garcia, D. (2022). Validating daily social media macroscopes of emotions. Scientific Reports, 12(1), 11236. https://doi.org/10.1038/s41598-022-14579-y

Book chapter outlining the connected research program:

Language technologies…

https://doi.org/10.1177/17456916231214460

Performance tests of “human intelligence” have played a role since the beginning of AI (for example, Evans 1964)

The idea of psychometric AI has been prominently brought up roughly once per decade since then, but no major works followed

We show that LLMs can nowadays be psychometrically assessed in a rich way using different approaches; we propose to adapt a Natural Language Inference (NLI) task

We use the score for entailment between a premise (the psychometric item text) and each hypothesis (the possible answers according to the psychometric survey specifications); a sketch follows below
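
A minimal sketch of this scoring scheme, assuming an off-the-shelf NLI model (`roberta-large-mnli`) and a BFI-style stand-in item; neither is necessarily the exact setup from the paper:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # off-the-shelf NLI model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "I see myself as someone who is talkative."  # psychometric item text
answers = [  # possible answers per the survey specification
    "I disagree strongly",
    "I disagree a little",
    "I neither agree nor disagree",
    "I agree a little",
    "I agree strongly",
]

scores = {}
for hypothesis in answers:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Label order for roberta-large-mnli: 0=contradiction, 1=neutral, 2=entailment
    scores[hypothesis] = logits.softmax(dim=-1)[0, 2].item()

print(max(scores, key=scores.get))  # answer with the highest entailment score
```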

AI Psychometrics

Standard psychometric inventories can be repurposed as diagnostic tools for large language models (LLMs)

Psychometric profiling enables researchers to study and compare LLMs in terms of non-cognitive traits, thereby providing a window into the personalities, values, beliefs, and biases these models exhibit (or mimic)

We conclude by highlighting open challenges and future avenues of this novel research perspective

We demonstrate the approach with several questionnaires:

Big Five Inventory

Dark Tetrad

Revised Portrait Values Questionnaire

Moral Foundations Questionnaire

Gender/Sex Diversity Beliefs Scale

Our approach is very flexible: a large number of questionnaires can be applied

Unlike testing with ad hoc examples, this enables systematic and rich investigations building on existing, theoretically underpinned resources from psychometrics

Downstream applications

In humans, uncovered psychometric traits often have a systematic link to behavior (for example, risk aversion and neuroticism)

It is a big, open empirical question whether the psychological profiles (e.g., personality or value orientation) of LLMs have a consistent, predictable link to their behavior, i.e., their outputs.

Examples: LLMs determining financing or housing eligibility or screening CVs

We can expect more and more societal decision-making by LLMs

Mapping out how model traits influence an example task (a sketch follows below):

CV screening (Updated Resume Dataset)
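
A hypothetical sketch of such an experiment: everything here (the `query_llm` stub, the persona texts, the prompt wording) is invented for illustration and not part of our published work:

```python
# Condition the model on different trait personas and compare its decisions
# on the same CV; differences would indicate trait-dependent behavior.
def query_llm(system_prompt: str, user_prompt: str) -> str:
    return "YES"  # stubbed response; plug in a real chat-completion client here

personas = {
    "high_openness": "You are very open to new experiences, curious and imaginative.",
    "low_openness": "You are conventional, cautious and prefer the familiar.",
}

cv_text = "..."  # a resume from the Updated Resume Dataset would go here

for trait, persona in personas.items():
    decision = query_llm(
        system_prompt=persona,
        user_prompt="Should this candidate be invited to an interview? "
                    f"Answer YES or NO.\n\nCV:\n{cv_text}",
    )
    print(trait, decision)
```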

“AI systems used to evaluate the credit score or creditworthiness of natural persons” are a special risk area in the coming EU regulations, because of the far-reaching consequences of this assessment for access to financial resources or essential services such as housing, electricity, and telecommunication services

Can we build taxonomies of the (causal) effect of controlled model traits such as openness?

So far the evidence is more anecdotal, but who would have thought that emotional appeals increase model performance in question-answering tasks?

A developing line of related research: LLMs can tailor targeted texts to specific personality types → does this point to some consistent internal representation of personality?

https://transformer-circuits.pub/2024/scaling-monosemanticity/

Locating non-cognitive model traits

Ideas of “model lobotomy” may be coming closer

Or at least something like brain imaging of neural nets (detecting functional partitions)

Clamping up personality or value-orientation features? (instead of the Golden Gate Bridge feature, for example)

To craft specific non-cognitive model traits in a “hard” way (similar to adapters, actually changing model weights) instead of “softly” with prompting

Going more macro: Synthetic Surveys