Show Me The Data
Created by Matt Motyl

© 2025 Matt Motyl. All rights reserved.


External Data Sources

A curated collection of publicly available platform data sources from research repositories, data archives, and transparency reports across major social media platforms.

About This Guide

This comprehensive list contains external datasets that researchers have collected and published for academic and public use. These datasets span multiple platforms and cover various research themes including sentiment analysis, misinformation, content moderation, and platform transparency.

Note: Click any column header to sort the table. Most datasets include links to their source repositories where you can access documentation and download the data.

Available Datasets (54 total)

| Title | Dataset | Repository | Platform | Source | Themes | Date |
| --- | --- | --- | --- | --- | --- | --- |
| #Coronavirus on TikTok: User engagement with misinformation as a potential threat to public health behavior | 166 TikTok videos were identified with #coronavirus | Kaggle | Tiktok | Scraped | COVID-19, Misinformation | 09/01/2020 |
| #Disgusted: Identifying Potential Sub-Factors of Moral Disgust through Qualitative Analysis of Tweets | Twitter API | Open Science Framework | X | API | Morality | 05/01/2019 |
| 2M Transcribed Videos | 400K videos with transcriptions | Hugging Face | Youtube | Scraped | Big Data | 2017 to 2024 |
| 32M Tiktok Metadata Dataset | 32,489,068 TikTok videos, 200GB | Reddit Comment | Tiktok | Scraped | Big Data | July 2020 to October 2020 |
| All Publicly Available Reddit Comments | 1.7 billion Reddit comments. Over a TB uncompressed. Represents all public comments prior to 2015 | Reddit Comment Explaining Process of Downloading It, BigQuery | Reddit | Scraped | Big Data | 2015 |
| Anti-Asian Hate Speech Evolution from Pre-COVID to Post-COVID on Reddit | Content-level training data for a sentiment analysis. Extracted using PRAW | Open Science Framework | Reddit | API | Hate Speech, Sentiment | January 2018 to December 2023 |
| Characterizing Clickbaits on Instagram | InstaLooter API | Harvard Dataverse | Meta | API | Clickbait | 07/01/2017 |
| Clubhouse Dataset 9.7M | User data of 9.7M Clubhouse users | Kaggle | Clubhouse | Scraped | | 2021 |
| COVID-19 Vaccine Perceptions on Reddit | Content-level data from the Pushshift API and the Python Reddit API Wrapper | Open Science Framework | Reddit | API | Misinformation | 04/01/2021 |
| Customer Support on Twitter | Large dataset of customer support content on Twitter. Scraped using PointScrape. | Kaggle | X | Scraped | Big Data, Support | 2014 |
| Decoding Reddit Memes Virality | Memes that did or did not go viral, along with extracted and generated features about the images themselves. PRAW | Open Science Framework, Github | Reddit | API | Virality | 05/01/2024 |
| Dehydrated Twitter data on the #MeTwo movement | Dehydrated data (only contains the post_id) using Twitter API | Open Science Framework | X | API | | July 2018 to August 2018 |
| Do Differences in Values Influence Disagreements in Online Discussions? | PRAW | Open Science Framework | Reddit | API | Sentiment | September 2015 to April 2022 |
| Emotional expression on social media support forums for substance cessation: Observational study of Reddit posts and discussions | Sentiment data from 2 million posts from 394 forums. Pushshift.io | Open Science Framework | Reddit | API | Sentiment | November 2019 to January 2020 |
| Evaluating narrative-driven movie recommendations on Reddit | Extracted comments. Pushshift | Open Science Framework | Reddit | API | Recommender Systems | 03/01/2019 |
| Facebook News - 1M Comments & 20K Posts | 19,850 posts from 83 various news organizations & personalities representing up to the last 250 page posts. Each post has up to 100 comments for a total of 1,025,403 comments. | BigQuery, Github | Meta | Scraped | Big Data, News | 07/01/2017 |
| Facebook Privacy-Protected Full URLs Data Set | Data on the demographics of people who viewed, shared, and otherwise interacted with web pages (URLs) shared on Facebook. 68 million URLs, over 3.1 trillion rows, and over 71 trillion cell values | Harvard Dataverse | Meta | Public | | January 2017 and October 2022 |
| GeoCoV19 dataset | Large dataset (>500M tweets) of multilingual COVID-related tweets | Crisis NLP | X | API | COVID-19, Multilingual | February 2020 to March 2020 |
| Gifted Education in Social Media: A Sentiment Analysis | Sentiment data from 4 subreddits | Open Science Framework | Reddit | Scraped | Sentiment | 2021-11-01 |
| Illegal loot box advertising on social media: an empirical study using the Meta and TikTok ad transparency repositories | [UK] Content analysis was conducted on the ads libraries provided by Meta (https://www.facebook.com/ads/library) | Open Science Framework | Meta | API | Lootboxes, Advertising, Video Games | September 2021 to May 2024 |
| Illegal loot box advertising on social media: an empirical study using the Meta and TikTok ad transparency repositories | [UK] Content analysis was conducted on the ads libraries provided by TikTok (https://library.tiktok.com/ads/) | Open Science Framework | Tiktok | API | Lootboxes, Advertising, Video Games | September 2021 to May 2024 |
| Influencer Data (Instagram) | User data on the top 1000 influencers on Instagram (2022) | Kaggle | Meta | API | Influencer | March 2022 to Dec 2022 |
| Influencer Data (Tiktok) | User data on the top 1000 influencers on Tiktok (2022) | Kaggle | Tiktok | API | Influencer | March 2022 to Dec 2022 |
| Influencer Data (Youtube) | User data on the top 1000 influencers on Youtube (2022) | Kaggle | Youtube | API | Influencer | March 2022 to Dec 2022 |
| LinkedIn Influencer Posts | This dataset contains LinkedIn Influencers' post details and other details (post dependent as well as independent) per post. | Kaggle | Linkedin | Scraped | Influencer | 2019 to 2021 |
| Linkedin Job Postings Dataset | This dataset contains information about job postings on LinkedIn. | Kaggle | Linkedin | Scraped | Job Descriptions | 2024 |
| LinkedIn Profile Data | Anonymized data from profiles scraped on LinkedIn. Contains data from about 15000 profiles. | Kaggle | Linkedin | Scraped | | 2018 |
| Linkedin Transparency Center | Linkedin's official transparency center | Linkedin Transparency Report | Linkedin | Public | Transparency | 2019 to 2023 |
| Meta Transparency Center | Meta's official transparency center | Transparency Report | Meta | Public | Transparency | 2024 |
| Partisans neither expect nor receive reputational rewards for sharing falsehoods over truth online | Collected using the Twitter API | Open Science Framework | X | API | Partisan, Misinformation | 2023 |
| Pfizer Vaccine Tweets | Pfizer data on Twitter | Kaggle | X | API | COVID-19, Vaccine | December 2020 to November 2021 |
| Political Ads on Facebook | 160K political ads on FB collected via a browser plugin | Kaggle | Meta | Scraped | Advertisements | July 2017 to May 2019 |
| Political Astroturfing on Twitter: How to Coordinate a Disinformation Campaign | Twitter API via the rtweet library in R | Open Science Framework | X | API | Disinformation | 2006-2012 |
| Political Social Media Posts | Data was provided by the Data For Everyone Library on Crowdflower. | Kaggle | Meta | API | Politics | 08/01/2015 |
| Reddit Transparency Center | Reddit's official transparency center | Transparency Report | Reddit | Public | Transparency | 2023 |
| Russian Ad Dataset | 3500+ ads created by the Internet Research Agency between 2015 and 2017. Released by House Democrats | Github | Meta | Public | Russia, Advertisements, USA | 2015 to 2017 |
| Snap Transparency Center | Snapchat's official transparency center | Snap Transparency Report | Snap | Public | Transparency | 2014 to 2023 |
| Speculator and Influencer Evaluation in Stock Market by Using Social Media | 3M tweets on the top 500 companies from 2015-2020 | Kaggle | X | API | USA, Stock Market | 2015 to 2020 |
| Stanford Large Network Dataset Collection | The SNAP library is being actively developed since 2004 and is organically growing as a result of our research pursuits in analysis of large social and information networks. Largest network we analyzed so far using the library was the Microsoft Instant Messenger network from 2006 with 240 million nodes and 1.3 billion edges. | Stanford Research | | Scraped | Big Data | 2006 to 2024 |
| The Manifestation of Affective Polarization on Social Media: A Cross-Platform Supervised Machine Learning Approach | Crowdtangle | Open Science Framework | Meta | API | Polarization | January 2020 to May 2020 |
| The Manifestation of Affective Polarization on Social Media: A Cross-Platform Supervised Machine Learning Approach | Twitter API for Academic Research | Open Science Framework | X | API | Polarization | January 2020 to May 2020 |
| TikTok Hashtag Dataset | Dataset of popular hashtags on TikTok, including the author name, author id, author signature, comment count, hashtag details, URL, and share count. The scraped hashtags are meme, funny, humor, comedy, education, lol, dance, song, music, etc. | Kaggle | Tiktok | Scraped | | 07/01/2022 |
| TikTok Trending Videos | First 1000 trending videos on TikTok | Kaggle | Tiktok | Scraped | Trending | 2021 |
| Tiktok User Data | Tiktok user data | Kaggle | Tiktok | Scraped | | July 2023 to August 2023 |
| TikTok User Engagement Data | Each row represents a different published TikTok video in which a claim/opinion has been made. | Kaggle | Tiktok | Scraped | Engagement | 2023 |
| Top Instagram Influencers Data (Cleaned) | Influencer (top 200 accounts) data on Instagram | Kaggle | Meta | Scraped | Influencer | 2022 |
| Tweeting about alcohol: Exploring differences in Twitter sentiment during the onset of the COVID-19 pandemic | Twitter content data looking at alcohol and COVID-19. GeoCoV19 dataset | Open Science Framework | X | API | COVID-19, Sentiment | February 2020 to April 2020 |
| Ukraine Twitter Data | Academic Twitter API. Daily posts on Ukraine in various languages | Open Science Framework | X | API | Ukraine, Misinformation | February 2022 to May 2023 |
| US Elections 2020 Dataset | Dataset containing around 1.7M tweets about US Election 2020 | Kaggle | X | API | USA, Elections | October to November 2020 |
| Wikipedia Transparency Center | Wikipedia's official transparency center | Wikipedia Transparency Report | Wikipedia | Public | Transparency | 2012 to 2023 |
| X Transparency Report | X's official transparency center | Transparency Report | X | Public | Transparency | 2012 to 2021 |
| Youtube Transparency Report | Youtube's official transparency report | Google | Youtube | Public | Transparency | 2018 to 2024 |
| Youtube Trending Videos | 235,187 trending videos | Gigasheet | Youtube | Scraped | Big Data | 2020 to 2023 |
| Youtube-8M Segments Dataset | 237K segments on 1000 classes | Google Research | Youtube | Public | Big Data | 2019 |

💡 Tips:

  • Click any dataset name to search for it on Google Dataset Search
  • Use your browser's search (Ctrl+F or Cmd+F) to find specific platforms, themes, or keywords
  • Click any column header to sort the table by that column
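For readers who would rather filter and sort the list in a script than in the browser, the following is a minimal Python sketch, not part of the site itself. The four sample rows are copied from the table on this page. The `DatasetEntry` class, its field names (including `repository`), and the `find_entries` and `sort_entries` helpers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetEntry:
    """One table row; field names are illustrative, not an official schema."""
    title: str
    repository: str
    platform: str
    source: str
    themes: Tuple[str, ...]
    date: str


# A handful of rows copied from the table on this page.
ENTRIES = [
    DatasetEntry("GeoCoV19 dataset", "Crisis NLP", "X", "API",
                 ("COVID-19", "Multilingual"), "February 2020 to March 2020"),
    DatasetEntry("TikTok Trending Videos", "Kaggle", "Tiktok", "Scraped",
                 ("Trending",), "2021"),
    DatasetEntry("Reddit Transparency Center", "Transparency Report", "Reddit",
                 "Public", ("Transparency",), "2023"),
    DatasetEntry("Pfizer Vaccine Tweets", "Kaggle", "X", "API",
                 ("COVID-19", "Vaccine"), "December 2020 to November 2021"),
]


def find_entries(entries, platform: Optional[str] = None,
                 theme: Optional[str] = None):
    """Return entries matching the platform and/or theme, ignoring case."""
    results = []
    for entry in entries:
        if platform is not None and entry.platform.lower() != platform.lower():
            continue
        if theme is not None and theme.lower() not in (
                t.lower() for t in entry.themes):
            continue
        results.append(entry)
    return results


def sort_entries(entries, column: str):
    """Sort by any field name, mimicking a click on a column header."""
    return sorted(entries, key=lambda e: str(getattr(e, column)).lower())


if __name__ == "__main__":
    for entry in find_entries(ENTRIES, platform="x", theme="covid-19"):
        print(entry.title, "|", entry.repository, "|", entry.date)
```

Matching is case-insensitive so that a query such as `tiktok` still finds rows whose platform is written `Tiktok`.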

Submit a Dataset

Know of a dataset that's not listed here? Help expand this collection by suggesting new data sources!

How to Submit:

Submit your dataset through this Google Form.

What to include:
  • Dataset title
  • Source/repository name (e.g., OSF, Kaggle, Harvard Dataverse)
  • Direct link to the dataset
  • Platform(s) covered (Reddit, X, Meta, etc.)
  • Brief description of the data
  • Date range or year of data collection
  • Research themes (optional)
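Before filling in the form, it can help to check that a submission covers every item in the list above. The sketch below is one hypothetical way to do that in Python. The field names and the `missing_fields` helper are assumptions, as is treating every item except research themes as required (the list marks only themes as optional). The link in the example is a placeholder, not a real dataset URL.

```python
# Field names are illustrative; the Google Form itself may label them differently.
REQUIRED_FIELDS = (
    "title",
    "repository",
    "link",
    "platforms",
    "description",
    "date_range",
)
OPTIONAL_FIELDS = ("themes",)


def missing_fields(submission: dict) -> list:
    """Return the required fields that are absent or blank in a submission."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = submission.get(name)
        # Treat None, empty lists, and whitespace-only strings as missing.
        if not value or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    draft = {
        "title": "Reddit Transparency Center",
        "repository": "Transparency Report",
        "link": "https://example.org/placeholder",  # placeholder URL
        "platforms": ["Reddit"],
        "description": "Reddit's official transparency center",
        "date_range": "",  # left blank on purpose
    }
    print(missing_fields(draft))
```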