External Data Sources

A curated collection of publicly available platform data sources from research repositories, data archives, and transparency reports across major social media platforms.

About This Guide

This comprehensive list contains external datasets that researchers have collected and published for academic and public use. These datasets span multiple platforms and cover various research themes including sentiment analysis, misinformation, content moderation, and platform transparency.

Note: Click any column header to sort the table. Most datasets include links to their source repositories where you can access documentation and download the data.

Available Datasets (54 total)

Title	Dataset	Platform	Source	Themes	Date
#Coronavirus on TikTok: User engagement with misinformation as a potential threat to public health behavior 166 TikTok videos were identified with #coronavirus	Kaggle	Tiktok	Scraped	COVID-19, Misinformation	09/01/2020
#Disgusted: Identifying Potential Sub-Factors of Moral Disgust through Qualitative Analysis of Tweets Twitter API	Open Science Framework	X	API	Morality	05/01/2019
2M Transcribed Videos 400K videos with transcriptions	Hugging Face	Youtube	Scraped	Big Data	2017 to 2024
32M Tiktok Metadata Dataset 32,489,068 TikTok videos, 200GB	Reddit Comment	Tiktok	Scraped	Big Data	July 2020 to October 2020
All Publicly Available Reddit Comments 1.7 billion Reddit comments. Over a TB uncompressed. Represents all public comments prior to 2015	Reddit Comment Explaining Process of Downloading It, BigQuery	Reddit	Scraped	Big Data	2015
Anti-Asian Hate Speech Evolution from Pre-COVID to Post-COVID on Reddit Content-level training data for sentiment analysis. Extracted using PRAW	Open Science Framework	Reddit	API	Hate Speech, Sentiment	January 2018 to December 2023
Characterizing Clickbaits on Instagram InstaLooter API	Harvard Dataverse	Meta	API	Clickbait	07/01/2017
Clubhouse Dataset 9.7M User data of 9.7M Clubhouse users	Kaggle	Clubhouse	Scraped		2021
COVID-19 Vaccine Perceptions on Reddit Content level data from Pushshift API and the Python Reddit API Wrapper	Open Science Framework	Reddit	API	Misinformation	04/01/2021
Customer Support on Twitter Large dataset of customer support content on Twitter. Scraped using PointScrape.	Kaggle	X	Scraped	Big Data, Support	2014
Decoding Reddit Memes Virality Memes that did or did not go viral, along with extracted and generated features of the images themselves. PRAW	Open Science Framework, Github	Reddit	API	Virality	05/01/2024
Dehydrated Twitter data on the #MeTwo movement Dehydrated data (only contains the post_id) using Twitter API	Open Science Framework	X	API		July 2018 to August 2018
Do Differences in Values Influence Disagreements in Online Discussions? PRAW	Open Science Framework	Reddit	API	Sentiment	September 2015 to April 2022
Emotional expression on social media support forums for substance cessation: Observational study of Reddit posts and discussions Sentiment data from 2 million posts from 394 forums. Pushshift.io	Open Science Framework	Reddit	API	Sentiment	November 2019 to January 2020
Evaluating narrative-driven movie recommendations on Reddit Extracted comments Pushshift	Open Science Framework	Reddit	API	Recommender Systems	03/01/2019
Facebook News - 1M Comments & 20K Posts 19,850 posts from 83 various news organizations & personalities representing up to the last 250 page posts. Each post has up to 100 comments for a total of 1,025,403 comments.	BigQuery, Github	Meta	Scraped	Big Data, News	07/01/2017
Facebook Privacy-Protected Full URLs Data Set Data on the demographics of people who viewed, shared, and otherwise interacted with web pages (URLs) shared on Facebook. 68 million URLs, over 3.1 trillion rows, and over 71 trillion cell values	Harvard Dataverse	Meta	Public		January 2017 to October 2022
GeoCoV19 dataset Large dataset (>500M tweets) of multilingual COVID related tweets	Crisis NLP	X	API	COVID-19, Multilingual	February 2020 to March 2020
Gifted Education in Social Media: A Sentiment Analysis Sentiment data from 4 subreddits	Open Science Framework	Reddit	Scraped	Sentiment	11/01/2021
Illegal loot box advertising on social media: an empirical study using the Meta and TikTok ad transparency repositories [UK] Content analysis was conducted on the ads libraries provided by Meta (https://www.facebook.com/ads/library)	Open Science Framework	Meta	API	Lootboxes, Advertising, Video Games	September 2021 to May 2024
Illegal loot box advertising on social media: an empirical study using the Meta and TikTok ad transparency repositories [UK] Content analysis was conducted on the ads libraries provided by TikTok (https://library.tiktok.com/ads/)	Open Science Framework	Tiktok	API	Lootboxes, Advertising, Video Games	September 2021 to May 2024
Influencer Data (Instagram) User data on the top 1000 influencers on Instagram (2022)	Kaggle	Meta	API	Influencer	March 2022 to Dec 2022
Influencer Data (Tiktok) User data on the top 1000 influencers on Tiktok (2022)	Kaggle	Tiktok	API	Influencer	March 2022 to Dec 2022
Influencer Data (Youtube) User data on the top 1000 influencers on Youtube (2022)	Kaggle	Youtube	API	Influencer	March 2022 to Dec 2022
LinkedIn Influencer Posts This dataset contains LinkedIn influencers' post details and other details (post-dependent as well as independent) per post.	Kaggle	Linkedin	Scraped	Influencer	2019 to 2021
Linkedin Job Postings Dataset This dataset contains information about job postings on LinkedIn.	Kaggle	Linkedin	Scraped	Job Descriptions	2024
LinkedIn Profile Data Anonymized data from profiles scraped on LinkedIn. Contains data from about 15000 profiles.	Kaggle	Linkedin	Scraped		2018
Linkedin Transparency Center Linkedin's official transparency center	Linkedin Transparency Report	Linkedin	Public	Transparency	2019 to 2023
Meta Transparency Center Meta's official transparency center	Transparency Report	Meta	Public	Transparency	2024
Partisans neither expect nor receive reputational rewards for sharing falsehoods over truth online Collected using the Twitter API	Open Science Framework	X	API	Partisan, Misinformation	2023
Pfizer Vaccine Tweets Pfizer data on Twitter	Kaggle	X	API	COVID-19, Vaccine	December 2020 to November 2021
Political Ads on Facebook 160K Political Ads on FB collected via a browser plugin	Kaggle	Meta	Scraped	Advertisements	July 2017 to May 2019
Political Astroturfing on Twitter: How to Coordinate a Disinformation Campaign Collected from Twitter using the rtweet library in R	Open Science Framework	X	API	Disinformation	2006 to 2012
Political Social Media Posts Data was provided by the Data For Everyone Library on Crowdflower.	Kaggle	Meta	API	Politics	08/01/2015
Reddit Transparency Center Reddit's official transparency center	Transparency Report	Reddit	Public	Transparency	2023
Russian Ad Dataset 3500+ ads created by the Internet Research Agency between 2015 and 2017. Released by House Democrats	Github	Meta	Public	Russia, Advertisements, USA	2015 to 2017
Snap Transparency Center Snapchat's official transparency center	Snap Transparency Report	Snap	Public	Transparency	2014 to 2023
Speculator and Influencer Evaluation in Stock Market by Using Social Media 3M tweets on the top 500 companies from 2015-2020	Kaggle	X	API	USA, Stock Market	2015 to 2020
Stanford Large Network Dataset Collection The SNAP library has been actively developed since 2004 and is organically growing as a result of our research pursuits in analysis of large social and information networks. The largest network we have analyzed so far using the library was the Microsoft Instant Messenger network from 2006 with 240 million nodes and 1.3 billion edges.	Stanford Research		Scraped	Big Data	2006 to 2024
The Manifestation of Affective Polarization on Social Media: A Cross-Platform Supervised Machine Learning Approach Crowdtangle	Open Science Framework	Meta	API	Polarization	January 2020 to May 2020
The Manifestation of Affective Polarization on Social Media: A Cross-Platform Supervised Machine Learning Approach Twitter API for Academic Research	Open Science Framework	X	API	Polarization	January 2020 to May 2020
TikTok Hashtag Dataset Dataset of popular hashtags on TikTok, including author name, author ID, author signature, comment count, hashtag details, URL, and share count. Scraped hashtags include meme, funny, humor, comedy, education, lol, dance, song, music, etc.	Kaggle	Tiktok	Scraped		07/01/2022
TikTok Trending Videos First 1000 trending videos on TikTok	Kaggle	Tiktok	Scraped	Trending	2021
Tiktok User Data Tiktok user data	Kaggle	Tiktok	Scraped		July 2023 to August 2023
TikTok User Engagement Data Each row represents a different published TikTok video in which a claim/opinion has been made.	Kaggle	Tiktok	Scraped	Engagement	2023
Top Instagram Influencers Data (Cleaned) Influencer (top 200 accounts) data on Instagram	Kaggle	Meta	Scraped	Influencer	2022
Tweeting about alcohol: Exploring differences in Twitter sentiment during the onset of the COVID-19 pandemic Twitter content data looking at alcohol and COVID-19. GeoCoV19 dataset	Open Science Framework	X	API	COVID-19, Sentiment	February 2020 to April 2020
Ukraine Twitter Data Academic Twitter API. Daily posts on Ukraine in various languages	Open Science Framework	X	API	Ukraine, Misinformation	February 2022 to May 2023
US Elections 2020 Dataset Dataset containing around 1.7M tweets about US Election 2020	Kaggle	X	API	USA, Elections	October to November 2020
Wikipedia Transparency Center Wikipedia's official transparency center	Wikipedia Transparency Report	Wikipedia	Public	Transparency	2012 to 2023
X Transparency Report X's official transparency center	Transparency Report	X	Public	Transparency	2012 to 2021
Youtube Transparency Report Youtube's official transparency report	Google	Youtube	Public	Transparency	2018 to 2024
Youtube Trending Videos 235,187 trending videos	Gigasheet	Youtube	Scraped	Big Data	2020 to 2023
Youtube-8M Segments Dataset 237K segments on 1000 classes	Google Research	Youtube	Public	Big Data	2019

💡 Tips:

Click any dataset name to search for it on Google Dataset Search
Use your browser's search (Ctrl+F or Cmd+F) to find specific platforms, themes, or keywords
Click any column header to sort the table by that column
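If you prefer to work with the list offline, the same kind of search can be scripted. The following is a minimal sketch, assuming you have copied the table into tab-separated text with the header row shown above; the sample rows are taken from the table, while the function names `load_rows` and `filter_rows` are hypothetical helpers introduced only for illustration.

```python
import csv
import io

# A few rows copied from the table above, stored as tab-separated text.
# In practice you would read the full table from a file you saved yourself.
SAMPLE_TSV = (
    "Title\tDataset\tPlatform\tSource\tThemes\tDate\n"
    "GeoCoV19 dataset\tCrisis NLP\tX\tAPI\tCOVID-19, Multilingual\tFebruary 2020 to March 2020\n"
    "Pfizer Vaccine Tweets\tKaggle\tX\tAPI\tCOVID-19, Vaccine\tDecember 2020 to November 2021\n"
    "TikTok Trending Videos\tKaggle\tTiktok\tScraped\tTrending\t2021\n"
    "Clubhouse Dataset 9.7M\tKaggle\tClubhouse\tScraped\t\t2021\n"
)


def load_rows(tsv_text):
    """Parse tab-separated text into a list of dicts keyed by column header."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def filter_rows(rows, platform=None, theme=None):
    """Return rows matching the given platform and/or theme (case-insensitive)."""
    matches = []
    for row in rows:
        # The Themes column holds a comma-separated list and may be empty.
        themes = [t.strip().lower() for t in row["Themes"].split(",") if t.strip()]
        if platform is not None and row["Platform"].lower() != platform.lower():
            continue
        if theme is not None and theme.lower() not in themes:
            continue
        matches.append(row)
    return matches


rows = load_rows(SAMPLE_TSV)
covid_on_x = filter_rows(rows, platform="X", theme="COVID-19")
for row in covid_on_x:
    print(row["Title"], "|", row["Dataset"], "|", row["Date"])
```

Matching is case-insensitive because platform names in the table are not capitalized consistently (for example, "Tiktok" in the Platform column). Rows with an empty Themes cell are returned only when no theme filter is given.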

Submit a Dataset

Know of a dataset that's not listed here? Help expand this collection by suggesting new data sources!

How to Submit:

Submit your dataset through this Google Form.

What to include:

Dataset title
Source/repository name (e.g., OSF, Kaggle, Harvard Dataverse)
Direct link to the dataset
Platform(s) covered (Reddit, X, Meta, etc.)
Brief description of the data
Date range or year of data collection
Research themes (optional)
