We are making publicly available VoterFraud2020, a multi-modal Twitter dataset of 7.6M tweets and 25.6M retweets from 2.6M users. The tweets contain key phrases and hashtags related to voter fraud claims and were posted between October 23rd and December 16th, 2020. The dataset also includes the full set of URLs and YouTube video links shared in these tweets, with data about their spread in different Twitter sub-communities. Key takeaways from our initial analysis of the data are listed below.

The methodology for our data collection, including the tracked keywords, is detailed in the paper: A. Abilov, Y. Hua, H. Matatov, O. Amir and M. Naaman. (2021). VoterFraud2020: a Multi-modal Dataset of Election Fraud Claims on Twitter. International Conference on Web and Social Media (ICWSM 2021).

The dataset is enhanced with sub-community labels to enable quick study of how URLs, images and YouTube videos spread within these sub-communities. The navigation on this page provides a quick summary of the top users, tweets, videos and links shared. Of the users in our data, we found that 99,884 were later suspended by Twitter. The suspension status of each user is included, allowing researchers to investigate Twitter’s response to voter fraud claims. The dataset also includes perceptual hash values for all images in the data. With these values, researchers can easily find duplicate and near-duplicate images in tweets to identify popular images. This page presents the top images shared in the dataset.
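
As a rough illustration of how perceptual hash values can be used to find near-duplicate images, the sketch below compares hashes by Hamming distance (the number of differing bits) and groups images whose hashes are close. It is a minimal sketch, not the method used in the paper. It assumes each hash is stored as a hexadecimal string of equal length; the image IDs, hash values, and the threshold of 4 bits are hypothetical and should be adapted to the actual format of the released files.

```python
from typing import Dict, List, Tuple


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def group_near_duplicates(
    image_hashes: Dict[str, str], max_distance: int = 4
) -> List[List[str]]:
    """Greedily group images whose hash is within max_distance bits of a
    group's first (representative) hash. Largest groups come first."""
    groups: List[Tuple[str, List[str]]] = []
    for image_id, image_hash in image_hashes.items():
        for representative, members in groups:
            if hamming_distance(representative, image_hash) <= max_distance:
                members.append(image_id)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append((image_hash, [image_id]))
    return sorted((members for _, members in groups), key=len, reverse=True)


# Hypothetical example values, not taken from the dataset.
example_hashes = {
    "img_a": "ffd8e0c0a0908884",
    "img_b": "ffd8e0c0a0908885",  # differs from img_a in one bit
    "img_c": "0123456789abcdef",
}
popular_groups = group_near_duplicates(example_hashes)
```

The greedy pass compares each image only against one representative per group, which keeps the sketch short. For millions of images, an index structure built for Hamming-distance search would be a more practical choice.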

The dataset can be used by researchers to study the spread, reach, and dynamics of the campaign involving voter fraud claims on Twitter. The data can help expose, for example, how different public figures spread different claims, and the types of engagement various narratives received.

The paper, dataset, and this web page are the result of a collaboration between the Social Technologies Lab at Cornell Tech and The Technion.

The project was led by Anton Abilov, Yiqing Hua, and Hana Matatov. Inquiries should go to Professor Mor Naaman at mor.naaman@cornell.edu.

Key Takeaways

Privacy and Ethical Considerations

The dataset was collected and made available according to Twitter’s Terms of Service for academic researchers, following established guidelines for ethical Twitter data use. We do not directly share the content of individual tweets. By using Tweet IDs as the main data element, the dataset does not expose information about users whose data has been removed from the service.

Anton Abilov (@AntonAbilov), Yiqing Hua (@yiqqqing), Hana Matatov (@HanaMatatov), Ofra Amir (@ofraam) & Mor Naaman (@informor).

Figure 1: Five communities in the retweet graph of people posting about voter-fraud claims; the blue cluster on the left side includes mainly detractors of voter-fraud claims.

Figure 2: Location of suspended users in the retweet graph (orange); they mostly came from one specific sub-community of claim promoters (yellow in Figure 1).