CSSMethods Talk 03/2025: Synthetic Surveys for Population Insights

Max Pellert (https://mpellert.at)


Classical traditional surveys

Beloved standard tool of the social sciences

Often considered the gold standard, “ground truth” (especially when working with large representative samples of a population)

But classical survey methodologies increasingly suffer from problems

First line of the 2024 book “Polling at a Crossroads: Rethinking Modern Survey Research”: “Survey research is in a state of crisis”

Latest example: the 2024 US presidential election

(Generally, I think strong method conservatism in the social sciences does not make sense)


Ann Selzer had an excellent multi-decade track record of accurate polling

For example: In 2008, predicting that a virtually unknown senator, Barack Obama, would beat frontrunner Hillary Clinton in the Iowa caucuses

The widely publicized final poll of Iowa by Selzer & Company showed Harris leading by 3 percentage points (in Iowa!)

On Election Day, Trump won the state by 13 points

A laudable public effort at error analysis by the pollster: “To cut to the chase, I found nothing to illuminate the miss.”

In defence

“Within the margin of error”, “We said it’s close”, “Predicting it at almost 50-50 means that this can happen”

The 2024 election was not close; it was a decisive victory on all metrics

Relevance of survey research? You don’t need much sophisticated machinery (or money) to predict that a two-party system like the US, with deeply ingrained political beliefs in the population and a very peculiar electoral system, will produce a tight race

Main issue? Non-response

Fewer than 1% of people respond, even in respected, well-established surveys (the NYT, for example)

Non-response

By now, any change in the traditional polls may just mean a new pattern of non-response

There are many reasons; spam and ping calls are a recent addition

Statistics can correct for some problems, but you need some basis to work from
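To make the “basis to work from” concrete, here is a small illustration (my own example, not from the talk) of post-stratification weighting, a standard statistical correction for uneven response: it can reweight the groups that did respond, but it cannot recover groups that never respond at all.

```python
# Post-stratification in miniature: reweight respondents to known population
# shares. The shares below are assumed, illustrative numbers.
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}  # assumed census margins
sample_share = {"18-34": 0.10, "35-54": 0.30, "55+": 0.60}      # who actually answered

weights = {group: population_share[group] / sample_share[group]
           for group in population_share}
print(weights)  # young respondents upweighted ~3x, older respondents downweighted
# If a group's sample share approaches zero, its weight explodes; if it is
# exactly zero, no reweighting can recover information that was never observed.
```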

At the same time that they don’t respond to surveys, people are extremely expressive on other channels, such as social media

We have massive amounts of text available: found data, digital traces or whatever you like to call it

Synthetic Surveys

Britain’s mood, measured weekly

One example of an easily accessible, representative survey (UK) not directly in the political domain

Results

We can recreate the survey dynamics with this approach of longitudinal adapters

Not equally well for all constructs

Remember, our approach is just self-supervised next-token prediction (no labels are involved, unlike, for example, the supervised text classification method of TweetNLP)
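As an illustration of what self-supervised next-token prediction with longitudinal adapters could look like, here is a minimal sketch that trains one lightweight LoRA adapter per week of text. The talk does not show code, so the base model (“gpt2”), the LoRA configuration, and all hyperparameters are illustrative stand-ins, not the actual pipeline.

```python
# Minimal sketch: one LoRA adapter per week, trained with plain
# self-supervised next-token prediction on that week's texts.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "gpt2"  # stand-in base model
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default


def train_weekly_adapter(texts: list[str], week_tag: str) -> str:
    """Next-token training on one week's texts; returns the adapter directory."""
    model = AutoModelForCausalLM.from_pretrained(BASE)
    model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8))

    dataset = Dataset.from_dict({"text": texts}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"adapter-{week_tag}",
                               num_train_epochs=1,
                               per_device_train_batch_size=8,
                               report_to="none"),
        train_dataset=dataset,
        # mlm=False: labels are just the input ids, i.e. next-token prediction
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained(f"adapter-{week_tag}")
    return f"adapter-{week_tag}"


# Hypothetical usage: one adapter per calendar week of collected text
# train_weekly_adapter(["a post written in that week", "..."], "2023-W01")
```

The appeal of adapters here is that the expensive base model is shared across all weeks, while each week contributes only a small set of additional weights.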

Our approach is very flexible: we can in principle ask any question and get survey-like responses for each week
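To illustrate how asking a question can yield survey-like responses, the sketch below scores the answer options of a survey-style item by the log-probabilities a causal language model assigns to them via plain next-token prediction. The question wording, answer options, and model are hypothetical placeholders; the talk’s actual prompt format is not specified here.

```python
# Minimal sketch: turn next-token probabilities into a survey-like response
# distribution over a fixed set of answer options.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; in practice this would be the week's adapted model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

question = "Taken overall, how did you feel this past week? I felt"
options = [" happy", " sad", " stressed", " calm"]  # hypothetical answer scale


def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities of the option's tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # Option tokens start where the prompt tokens end (holds for GPT-2's BPE
    # because each option starts with a leading space).
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total


scores = {opt.strip(): option_logprob(question, opt) for opt in options}
# Normalising over the options gives one survey-like response distribution
probs = torch.softmax(torch.tensor(list(scores.values())), dim=0)
for answer, p in zip(scores, probs):
    print(f"{answer}: {p.item():.1%}")
```

Running the same question against each week’s adapted model would then produce one response distribution per week, i.e. a synthetic time series.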

Why does that work?

Wrap-up

I don’t think we should be replacing survey research

Even with complementary synthetic methods, we will still need classical approaches, for example to learn about the sampling frame

But we should be making use of the text that people are producing (and potentially other modalities too)

These are first steps for now, and we absolutely have to validate what we are doing

Huge potential: Low costs, scalability, unobtrusive observation, high temporal resolution, …

Bridging the gap between “qualitative” data and quantitative insights