The Twitter developer agreement prohibits sharing the full text of tweets:
“If you provide Twitter Content to third parties, including downloadable datasets of Twitter Content or an API that returns Twitter Content, you will only distribute or allow download of Tweet IDs, Direct Message IDs, and/or User IDs.” (“Developer Agreement and Policy – Twitter Developers,” n.d., Point I.F.2)
As an example, consider Pfeffer & Morstatter (2016), who collected tweets with geo-location information from the United States in two six-month periods in 2014 and 2015. The data can be accessed through https://data.gesis.org/sharing/ by creating an account and requesting the data set. If the request is authorized, all Tweet IDs of the data set can be downloaded. The list of Tweet IDs can then be used to “rehydrate” (CITE) the full Tweet JSON objects from the Twitter API. This process will produce fewer full tweets than the original data set contained, because Twitter monitors and removes bots from its services. Additionally, tweets of users who have left the platform can no longer be retrieved. Over time, the rehydration procedure thus ensures that fewer bots, spammers and other automated accounts are included and that user privacy is respected.
The excellent twarc package (https://twarc-project.readthedocs.io/en/latest/) makes rehydration a one-liner. After running twarc configure once to set up access credentials for Twitter, we just have to run the following, where the input file contains one Tweet ID per line:
twarc hydrate ONETWEETPERLINE.TXT > FULLTWEETOBJECTS
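Because rehydration returns fewer tweets than were originally collected, it can be useful to quantify the loss. The following is a minimal Python sketch, not part of twarc itself. It assumes that the output file contains one JSON tweet object per line and that each object carries its ID in an id_str field; the function names and the file names in the usage note are hypothetical.

```python
import json
from pathlib import Path


def read_requested_ids(ids_path):
    """Read the Tweet IDs that were sent to the API (one ID per line)."""
    lines = Path(ids_path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def read_hydrated_ids(tweets_path):
    """Collect the IDs of the tweets that came back.

    Assumption: one JSON object per line, each with an "id_str" field.
    """
    ids = set()
    with open(tweets_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                tweet = json.loads(line)
                ids.add(tweet["id_str"])
    return ids


def hydration_report(ids_path, tweets_path):
    """Compare requested and rehydrated IDs and summarize the loss."""
    requested = read_requested_ids(ids_path)
    hydrated = read_hydrated_ids(tweets_path) & requested
    share = len(hydrated) / len(requested) if requested else 0.0
    return {
        "requested": len(requested),
        "hydrated": len(hydrated),
        "missing": len(requested - hydrated),
        "share_recovered": share,
    }
```

Calling hydration_report("ONETWEETPERLINE.TXT", "FULLTWEETOBJECTS") would then return the number of requested, recovered and missing tweets, along with the share that could still be retrieved.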
twarc2 supports the Academic API for Twitter, which gives access to all historical tweets. To get all tweets of a user (and not only the most recent 3200), run the following:
twarc2 timeline USERID OUTFILE
The syntax is valid for twarc==v2.4.3.
Web scraping gives us access to online data without using an API.
Start by reverse engineering the page with the built-in developer console of a browser such as Chromium or Firefox to get an overview of the parts of the webpage. In the best case, you discover an internal API to query.
Otherwise, scrape the whole page; in the worst case, use a headless browser.
Always respect robots.txt. Put an identifier in the user agent (like “robot”) and inform the page administrators about your project. Inform yourself about rate limits and other rules to follow.
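Checking robots.txt can be automated before any page is requested. The following is a minimal sketch using urllib.robotparser from the Python standard library. The user agent string, the contact address and the sample rules are made up for illustration; a real project would use the rules published by the site in question.

```python
from urllib import robotparser

# Hypothetical user agent: it identifies the scraper as a robot and
# tells the page administrators how to reach the project.
USER_AGENT = "research-robot (contact: you@example.org)"

# Made-up rules for illustration; real rules come from the site's robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent=USER_AGENT):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def seconds_between_requests(parser, user_agent=USER_AGENT, default=1.0):
    """Use the site's Crawl-delay if it states one, else a polite default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

To check a live site, the parser can instead be pointed at the published file with set_url(...) followed by read(). The same user agent string should then be sent in the User-Agent header of every request, with pauses of at least seconds_between_requests(...) between requests.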
Developer Agreement and Policy – Twitter Developers. (n.d.). Retrieved January 30, 2020, from https://developer.twitter.com/en/developer-terms/agreement-and-policy
Pfeffer, J., & Morstatter, F. (2016). Geotagged Twitter posts from the United States: A tweet collection to investigate representativeness [Data set]. GESIS Data Archive. doi: 10.7802/1166