A curated collection of publicly available platform data sources from research repositories, data archives, and transparency reports across major social media platforms.
This comprehensive list contains external datasets that researchers have collected and published for academic and public use. These datasets span multiple platforms and cover various research themes including sentiment analysis, misinformation, content moderation, and platform transparency.
Note: Most datasets include links to their source repositories where you can access documentation and download the data.
Title and description | Dataset | Platform | Source | Themes | Date |
|---|---|---|---|---|---|
#Coronavirus on TikTok: User engagement with misinformation as a potential threat to public health behavior 166 TikTok videos were identified with #coronavirus | Kaggle | Tiktok | Scraped | COVID-19, Misinformation | 09/01/2020 |
#Disgusted: Identifying Potential Sub-Factors of Moral Disgust through Qualitative Analysis of Tweets Twitter API | Open Science Framework | X | API | Morality | 05/01/2019 |
2M Transcribed Videos 400K videos with transcriptions | Hugging Face | Youtube | Scraped | Big Data | 2017 to 2024 |
32M Tiktok Metadata Dataset 32,489,068 TikTok videos, 200GB | Reddit Comment | Tiktok | Scraped | Big Data | July 2020 to October 2020 |
All Publicly Available Reddit Comments 1.7 billion Reddit comments. Over a TB uncompressed. Represents all public comments prior to 2015 | Reddit Comment Explaining Process of Downloading It, BigQuery | Reddit | Scraped | Big Data | 2015 |
Anti-Asian Hate Speech Evolution from Pre-COVID to Post-COVID on Reddit Content-level training data for sentiment analysis. Extracted using PRAW | Open Science Framework | Reddit | API | Hate Speech, Sentiment | January 2018 to December 2023 |
Characterizing Clickbaits on Instagram InstaLooter API | Harvard Dataverse | Meta | API | Clickbait | 07/01/2017 |
Clubhouse Dataset 9.7M User data of 9.7M Clubhouse users | Kaggle | Clubhouse | Scraped | | 2021 |
COVID-19 Vaccine Perceptions on Reddit Content-level data from the Pushshift API and the Python Reddit API Wrapper | Open Science Framework | Reddit | API | Misinformation | 04/01/2021 |
Customer Support on Twitter Large dataset of customer support content on Twitter. Scraped using PointScrape. | Kaggle | X | Scraped | Big Data, Support | 2014 |
Decoding Reddit Memes Virality Memes that did or did not go viral, along with extracted and generated features about the images themselves. PRAW | Open Science Framework, Github | Reddit | API | Virality | 05/01/2024 |
Dehydrated Twitter data on the #MeTwo movement Dehydrated data (only contains the post_id) using Twitter API | Open Science Framework | X | API | | July 2018 to August 2018 |
Do Differences in Values Influence Disagreements in Online Discussions? PRAW | Open Science Framework | Reddit | API | Sentiment | September 2015 to April 2022 |
Emotional expression on social media support forums for substance cessation: Observational study of Reddit posts and discussions Sentiment data from 2 million posts from 394 forums. Pushshift.io | Open Science Framework | Reddit | API | Sentiment | November 2019 to January 2020 |
Evaluating narrative-driven movie recommendations on Reddit Extracted comments. Pushshift | Open Science Framework | Reddit | API | Recommender Systems | 03/01/2019 |
Facebook News - 1M Comments & 20K Posts 19,850 posts from 83 news organizations & personalities, representing up to the last 250 page posts. Each post has up to 100 comments for a total of 1,025,403 comments. | BigQuery, Github | Meta | Scraped | Big Data, News | 07/01/2017 |
Facebook Privacy-Protected Full URLs Data Set Data on the demographics of people who viewed, shared, and otherwise interacted with web pages (URLs) shared on Facebook. 68 million URLs, over 3.1 trillion rows, and over 71 trillion cell values | Harvard Dataverse | Meta | Public | | January 2017 to October 2022 |
GeoCoV19 dataset Large dataset (>500M tweets) of multilingual COVID related tweets | Crisis NLP | X | API | COVID-19, Multilingual | February 2020 to March 2020 |
Gifted Education in Social Media: A Sentiment Analysis Sentiment data from 4 subreddits | Open Science Framework | Reddit | Scraped | Sentiment | 11/01/2021 |
Illegal loot box advertising on social media: an empirical study using the Meta and TikTok ad transparency repositories [UK] Content analysis was conducted on the ads libraries provided by Meta (https://www.facebook.com/ads/library) | Open Science Framework | Meta | API | Lootboxes, Advertising, Video Games | September 2021 to May 2024 |
Illegal loot box advertising on social media: an empirical study using the Meta and TikTok ad transparency repositories [UK] Content analysis was conducted on the ads libraries provided by TikTok (https://library.tiktok.com/ads/) | Open Science Framework | Tiktok | API | Lootboxes, Advertising, Video Games | September 2021 to May 2024 |
Influencer Data (Instagram) User data on the top 1000 influencers on Instagram (2022) | Kaggle | Meta | API | Influencer | March 2022 to Dec 2022 |
Influencer Data (Tiktok) User data on the top 1000 influencers on Tiktok (2022) | Kaggle | Tiktok | API | Influencer | March 2022 to Dec 2022 |
Influencer Data (Youtube) User data on the top 1000 influencers on Youtube (2022) | Kaggle | Youtube | API | Influencer | March 2022 to Dec 2022 |
LinkedIn Influencer Posts This dataset contains LinkedIn influencers' post details and other details (post-dependent as well as post-independent) per post. | Kaggle | LinkedIn | Scraped | Influencer | 2019 to 2021 |
Linkedin Job Postings Dataset This dataset contains information about job postings on LinkedIn. | Kaggle | LinkedIn | Scraped | Job Descriptions | 2024 |
LinkedIn Profile Data Anonymized data from profiles scraped on LinkedIn. Contains data from about 15000 profiles. | Kaggle | LinkedIn | Scraped | | 2018 |
Linkedin Transparency Center Linkedin's official transparency center | Linkedin Transparency Report | LinkedIn | Public | Transparency | 2019 to 2023 |
Meta Transparency Center Meta's official transparency center | Transparency Report | Meta | Public | Transparency | 2024 |
Partisans neither expect nor receive reputational rewards for sharing falsehoods over truth online Collected using the Twitter API | Open Science Framework | X | API | Partisan, Misinformation | 2023 |
Pfizer Vaccine Tweets Pfizer data on Twitter | Kaggle | X | API | COVID-19, Vaccine | December 2020 to November 2021 |
Political Ads on Facebook 160K Political Ads on FB collected via a browser plugin | Kaggle | Meta | Scraped | Advertisements | July 2017 to May 2019 |
Political Astroturfing on Twitter: How to Coordinate a Disinformation Campaign Collected with the rtweet library in R | Open Science Framework | X | API | Disinformation | 2006 to 2012 |
Political Social Media Posts Data was provided by the Data For Everyone Library on Crowdflower. | Kaggle | Meta | API | Politics | 08/01/2015 |
Reddit Transparency Center Reddit's official transparency center | Transparency Report | Reddit | Public | Transparency | 2023 |
Russian Ad Dataset 3500+ ads created by the Internet Research Agency between 2015 and 2017. Released by House Democrats | Github | Meta | Public | Russia, Advertisements, USA | 2015 to 2017 |
Snap Transparency Center Snapchat's official transparency center | Snap Transparency Report | Snap | Public | Transparency | 2014 to 2023 |
Speculator and Influencer Evaluation in Stock Market by Using Social Media 3M tweets on the top 500 companies from 2015-2020 | Kaggle | X | API | USA, Stock Market | 2015 to 2020 |
Stanford Large Network Dataset Collection The SNAP library has been actively developed since 2004 and is organically growing as a result of our research pursuits in analysis of large social and information networks. The largest network we analyzed so far using the library was the Microsoft Instant Messenger network from 2006, with 240 million nodes and 1.3 billion edges. | Stanford Research | | Scraped | Big Data | 2006 to 2024 |
The Manifestation of Affective Polarization on Social Media: A Cross-Platform Supervised Machine Learning Approach Crowdtangle | Open Science Framework | Meta | API | Polarization | January 2020 to May 2020 |
The Manifestation of Affective Polarization on Social Media: A Cross-Platform Supervised Machine Learning Approach Twitter API for Academic Research | Open Science Framework | X | API | Polarization | January 2020 to May 2020 |
TikTok Hashtag Dataset Dataset of popular hashtags on TikTok. It includes the author name, author ID, author signature, comment count, hashtag details, URL, and share count. The scraped hashtags include meme, funny, humor, comedy, education, lol, dance, song, music, etc. | Kaggle | Tiktok | Scraped | | 07/01/2022 |
TikTok Trending Videos First 1000 trending videos on TikTok | Kaggle | Tiktok | Scraped | Trending | 2021 |
Tiktok User Data Tiktok user data | Kaggle | Tiktok | Scraped | | July 2023 to August 2023 |
TikTok User Engagement Data Each row represents a different published TikTok video in which a claim/opinion has been made. | Kaggle | Tiktok | Scraped | Engagement | 2023 |
Top Instagram Influencers Data (Cleaned) Influencer (top 200 accounts) data on instagram | Kaggle | Meta | Scraped | Influencer | 2022 |
Tweeting about alcohol: Exploring differences in Twitter sentiment during the onset of the COVID-19 pandemic Twitter content data looking at alcohol and COVID-19. GeoCoV19 dataset | Open Science Framework | X | API | COVID-19, Sentiment | February 2020 to April 2020 |
Ukraine Twitter Data Academic Twitter API. Daily posts on Ukraine in various languages | Open Science Framework | X | API | Ukraine, Misinformation | February 2022 to May 2023 |
US Elections 2020 Dataset Dataset containing around 1.7M tweets about US Election 2020 | Kaggle | X | API | USA, Elections | October to November 2020 |
Wikipedia Transparency Center Wikipedia's official transparency center | Wikipedia Transparency Report | Wikipedia | Public | Transparency | 2012 to 2023 |
X Transparency Report X's official transparency center | Transparency Report | X | Public | Transparency | 2012 to 2021 |
Youtube Transparency Report Youtube's official transparency report | | Youtube | Public | Transparency | 2018 to 2024 |
Youtube Trending Videos 235,187 trending videos | Gigasheet | Youtube | Scraped | Big Data | 2020 to 2023 |
Youtube-8M Segments Dataset 237K segments on 1000 classes | Google Research | Youtube | Public | Big Data | 2019 |
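
If you copy rows out of the table and want to filter them offline by platform or theme, something along these lines may help. It is only a minimal sketch: `Entry`, `parse_row`, and `filter_entries` are hypothetical helper names, not part of any tool offered here. It assumes Python 3.9 or later and that every row has exactly six pipe-separated cells in the column order of the table header. The sample rows are copied from the table above.

```python
from typing import Iterable, NamedTuple, Optional


class Entry(NamedTuple):
    """One table row; field order follows the table header."""
    title: str      # title and description share the first cell
    dataset: str    # where the dataset is hosted
    platform: str
    source: str     # how the data was obtained (API, Scraped, Public)
    themes: str     # comma-separated themes, possibly empty
    date: str       # free-form date or date range


def parse_row(line: str) -> Entry:
    """Split one pipe-delimited row into six cells (assumed layout)."""
    # Drop at most one leading and one trailing pipe so empty cells survive.
    text = line.strip().removeprefix("|").removesuffix("|")
    cells = [cell.strip() for cell in text.split("|")]
    if len(cells) != len(Entry._fields):
        raise ValueError(
            f"expected {len(Entry._fields)} cells, got {len(cells)}: {line!r}"
        )
    return Entry(*cells)


def filter_entries(
    entries: Iterable[Entry],
    platform: Optional[str] = None,
    theme: Optional[str] = None,
) -> list[Entry]:
    """Keep entries matching the given platform and/or theme, ignoring case."""
    selected = []
    for entry in entries:
        themes = {t.strip().lower() for t in entry.themes.split(",") if t.strip()}
        if platform is not None and entry.platform.lower() != platform.lower():
            continue
        if theme is not None and theme.lower() not in themes:
            continue
        selected.append(entry)
    return selected


# Sample rows copied verbatim from the table.
SAMPLE_ROWS = [
    "TikTok Trending Videos First 1000 trending videos on TikTok | Kaggle | Tiktok | Scraped | Trending | 2021 |",
    "TikTok User Engagement Data Each row represents a different published TikTok video in which a claim/opinion has been made. | Kaggle | Tiktok | Scraped | Engagement | 2023 |",
    "Partisans neither expect nor receive reputational rewards for sharing falsehoods over truth online Collected using twitter API | Open Science Framework | X | API | Partisan, Misinformation | 2023 |",
    "Youtube-8M Segments Dataset 237K segments on 1000 classes | Google Research | Youtube | Public | Big Data | 2019 |",
]

entries = [parse_row(row) for row in SAMPLE_ROWS]

# List the TikTok entries, sorted by title.
for entry in sorted(filter_entries(entries, platform="tiktok"), key=lambda e: e.title):
    print(f"{entry.date}: {entry.title}")
```

Rows with a different number of cells raise a `ValueError` rather than being silently misaligned, so a shifted column shows up immediately instead of producing a wrong platform or theme.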
💡 Tips:
Know of a dataset that's not listed here? Help expand this collection by suggesting new data sources!
Submit your dataset through this Google Form.