Skip to main content
Show Me The Data
HomeIntro
About

Newsletter

Get insights on platform data and research

Subscribe

YouTube Channel

Video tutorials and insights

Subscribe

Support on Patreon

Help create more content

Become a Patron

Buy Me a Coffee

One-time support

Buy Coffee

Created by Matt Motyl

© 2025 Matt Motyl. All rights reserved.

On This Page

Submit Feedback
Back to Home

SQL Guide for Platform Data

Master SQL queries to analyze platform data effectively. Practice with our interactive SQL playground using synthetic platform data.

What is SQL? And, why?

When people take an introductory course in research methods or statistics, especially in the social sciences, they are often taught to work with data that have been cleaned and organized into a simple, singular spreadsheet. Typically, each row in the spreadsheet will correspond to the subjects of the research (e.g., participants, customers, products), and each column will correspond to information about each subject (e.g., their survey responses, purchases, symptoms, prices).

This is a great way to start, and is often perfect for the relatively small datasets that university labs or think tanks may use for their individual research projects that are well-structured with relatively few specified goals in mind. For example, if you were studying whether a medication is effective at treating a medical condition, you might have a spreadsheet listing each patient, their dosage, and their symptoms over time. There is no need for multiple spreadsheets because the data are small enough and straightforward.

However, when data are collected at massive scale with less clear uses of the data, a single spreadsheet quickly becomes impractical and inefficient. So, very large online platforms, marketplaces, and search engines tend to store data in a bunch of different spreadsheets (typically called "tables") within a massive database. Some tables might focus on users, others might focus on content (e.g., posts, videos, images), others might focus on interactions (e.g., likes, shares, purchases, searches), and others might focus on metadata (time, location, device type, language, etc.).

If all of these data were stored in a single table, the file size could well be measured in zettabytes or exabytes (1 EB = 1 MILLION terabytes, or 1 BILLION gigabytes 🤯) instead of terabytes or gigabytes, which would make it impossible to open the data on even the most powerful personal computers, let alone doing anything with those data once the file is opened. Now, these companies typically use massive data centers running high-performance computing clusters that can handle much larger data, but even then, it's expensive, time-consuming, and unnecessary to process all of the data when only a few variables might be needed.

This is where SQL (Structured Query Language) comes in. SQL is a simpler programming language specifically designed for:

  • Working with platform data, because they commonly store their data in relational databases
  • Selecting, filtering, aggregating, and joining large datasets efficiently

Additionally, SQL is useful because:

  • It can convert outputs from APIs, which many platforms use to grant external parties access to internal data, into formats that are easier to analyze (as opposed to JSON or XML)
  • Many research databases (like Social Science One) require SQL knowledge to extract the desired data from the more than 71 trillion cells of data in them

Key SQL Concepts

Fortunately, SQL is one of the simpler programming languages to learn. It can do complex things, but there are just 4 main commands that you need to know to accomplish most of what you would want to do with platform data as an external researcher or auditor. Additionally, SQL syntax is designed to be very close to plain English, making it easier to read, understand, and write.

I describe these basic commands in the boxes below with a functioning example query for each.

🫴 SELECT Queries

All SQL queries include a SELECT statement and a FROM statement.

  • 'SELECT' is a special term in SQL that means "get me this data." After writing SELECT, you write out the names of whatever columns you want to retrieve. If you want to retrieve all data in columns named username and follower_count, you simply spell out those column names.
  • 'FROM' is another special SQL term that means "from this table." After writing FROM, you write the name of the table where the data live. In this case, the table is named users.
SELECT username, follower_count
FROM users

If you were to translate the above code into plain English, it would read "Get me the username and follower count data from the users table."

🔗 JOIN Operations

Often, you'll want data from more than one table. To connect data from multiple tables, you'll use the JOIN and ON statements.

  • 'JOIN' is a special term in SQL that means "combine data from these two tables." After writing JOIN, you write the name of the second table you want to connect to the first table.
  • 'ON' is another special SQL term that specifies some variable that exists in both tables that you want to use to connect the data. If we have a table containing data on users and another table containing data on posts, both of those tables would contain a variable like user_id identifying the user (in the users table) and the author (in the posts table). So, you'll need to say what the variable is that you want to join the data on.
SELECT u.username, p.content
FROM posts p
JOIN users u ON p.user_id = u.user_id;

If you were to translate the above code into plain English, it would read "Get the username data from the users table and the content data from the posts table, and then give me each user's content."

Note: In the example above, I used table aliases (the single-letter names after the table names) to make the query easier to read and write. This is optional, but recommended when working with multiple tables because it is possible that multiples tables contain variables with the same names, even if they refer to different things. If you try to SELECT variables with the same names from multiple tables without using aliases, SQL will return an error. By using aliases (like "u" for users and "p" for posts), you can specify exactly which table each variable comes from and avoid getting an error I encountered too many times when I first started learning SQL!

🤏 Filtering & Sorting

When working with large data sets, you'll typically want (and potentially need) to reduce the total volume of data you're requesting. The main command for filtering data is the WHERE statement. You might also want to sort the data in a particular order, which you can do with the ORDER BY statement.

  • 'WHERE' is a special term in SQL that means "only give me data meeeting this condition." After writing WHERE, you write the name of the variable you want to filter along with what condition the variable values must meet to be included.
  • 'ORDER BY' is another special SQL term that reorders the output of your request on some variable. After writing ORDER BY, you write the name of the variable you want to sort by, along with whether you want the data sorted in ascending (ASC) or descending (DESC) order.

SELECT content, likes_count FROM posts
WHERE likes_count > 1000
ORDER BY post_date DESC;

If you were to translate the above code into plain English, it would read "Get me the content and likes count data from the posts table, but only for posts that have more than 1000 likes, and sort the results by post date from most recent to oldest."

Note: In the example above, I sorted the data using a variable I didn't include in the SELECT statement (i.e., post_date) to illustrate that you can filter on variables even if you aren't requesting them for the output the post_date variable. This is useful when you want to filter or sort on variables that aren't necessary for your analysis.

🧮 Aggregations

You might also want to calculate summary statistics across your data, such as counts, averages, sums, minimums, and maximums. SQL has built-in functions for these common calculations, and you can use them directly in your SELECT statements to get those statistics quickly.

You can do more complex statistics, too, like comparing groups of data with the GROUP BY statement, but the examples below illustrate the basic idea. Personally, I would recommend doing more complex statistics (like regressions or sentiment modeling) outside of SQL in a statistical programming language like R or Python, but simple aggregations are fine.

SELECT COUNT(*) as total_posts,
       AVG(likes_count) as avg_likes
FROM posts;

If you were to translate the above code into plain English, it would read "Get me the total number of posts and the average likes count across all posts from the posts table."

Note: In the example above, the asterisk (*) in COUNT(*) means "count all rows." I used the COUNT() and AVG() functions to calculate the total number of posts and the average likes count, respectively.

Understanding Fact Tables

Platform data is typically organized using a star schema (described in more detail further down on this page), with fact tables at the center storing event-level data and dimension tables providing contextual information. Fact tables are the foundation of platform analytics, containing metrics and measurements of user behaviors and interactions.

Example Fact Table Data Dictionary

A fact table stores event-level data with metrics and foreign keys to dimension tables. Each row represents a single event or interaction on the platform.

Variable NameDescriptionData TypeRetention Period
user_idUnique identifier for the userVARCHAR(255)90 days
session_idUnique identifier for the user sessionVARCHAR(255)90 days
event_idUnique identifier for the eventBIGINT90 days
post_idUnique identifier for the post interacted withBIGINT90 days
event_timeTimestamp when the event occurredTIMESTAMP90 days
event_typeType of event (view, click, like, share, comment)VARCHAR(50)90 days
time_spent_secondsTime spent on content in secondsINT90 days
device_typeDevice used (mobile, desktop, tablet)VARCHAR(50)90 days

Example Fact Table Data

This example shows user interactions with posts, including views, likes, shares, and comments.

user_idsession_idevent_idpost_idevent_timeevent_typetime_spent_secondsdevice_type
101sess_a1b2c3100150012024-10-21 09:15:23view45mobile
101sess_a1b2c3100250012024-10-21 09:16:08like0mobile
102sess_d4e5f6100350022024-10-21 09:17:45view120desktop
102sess_d4e5f6100450022024-10-21 09:19:45share0desktop
103sess_g7h8i9100550032024-10-21 09:22:10view30mobile
104sess_j1k2l3100650012024-10-21 09:25:33view67tablet
104sess_j1k2l3100750012024-10-21 09:26:40comment0tablet
105sess_m4n5o6100850042024-10-21 09:28:15view95desktop

Extracting Metrics from Fact Tables

Here are common SQL queries for analyzing event data from fact tables:

1. Calculate Total Engagement by Post

SELECT post_id,
       COUNT(*) as total_events,
       SUM(CASE WHEN event_type = 'like' THEN 1 ELSE 0 END) as likes,
       SUM(CASE WHEN event_type = 'share' THEN 1 ELSE 0 END) as shares,
       SUM(CASE WHEN event_type = 'comment' THEN 1 ELSE 0 END) as comments,
       AVG(time_spent_seconds) as avg_time_spent
FROM events_fact
GROUP BY post_id
ORDER BY total_events DESC;

2. Analyze User Behavior by Device Type

SELECT device_type,
       COUNT(DISTINCT user_id) as unique_users,
       COUNT(*) as total_events,
       AVG(time_spent_seconds) as avg_time_spent
FROM events_fact
WHERE event_type = 'view'
GROUP BY device_type
ORDER BY unique_users DESC;

3. Track User Activity Over Time

SELECT DATE(event_time) as event_date,
       COUNT(DISTINCT user_id) as daily_active_users,
       COUNT(DISTINCT session_id) as total_sessions,
       COUNT(*) as total_events
FROM events_fact
GROUP BY DATE(event_time)
ORDER BY event_date DESC;

💡 Tip: Fact tables are optimized for aggregation queries. They typically contain millions or billions of rows, so always use WHERE clauses to filter by time ranges and use indexes on frequently queried columns like user_id, post_id, and event_time.

Understanding Dimension Tables

Dimension tables complement fact tables by providing descriptive context. When you join fact tables with dimension tables, you can answer questions like "Which countries have the most engaged users?" or "How does content performance vary by language?"

What Are Dimension Tables?

Dimension tables contain descriptive attributes about entities in your data. They provide context for the metrics stored in fact tables. Common dimension tables include users, posts, locations, and time periods.

Key Difference: While fact tables store measurements (likes, views, time spent), dimension tables store descriptions (who, what, where, when). Dimension tables are typically much smaller than fact tables but provide essential context for analysis.

Example Users Dimension Table

Contains descriptive information about each user on the platform.

user_idusernameaccount_createdcountryverifiedfollower_count
101alice_research2022-03-15FranceYes1,250
102bob_analyst2021-08-22GermanyNo450
103carol_data2023-01-10SpainYes3,200
104dave_insights2020-11-05ItalyNo890
105eve_metrics2022-07-18NetherlandsYes2,100

Example Posts Dimension Table

Contains descriptive information about each post or piece of content on the platform.

post_idauthor_idcontent_typecontent_lengthlanguagecreated_datehashtags
5001101Video120French2024-10-20#research #data
5002102Text280English2024-10-20#analytics #EU
5003103Image150Spanish2024-10-21#dataviz
5004105Video45English2024-10-21#DSA #platforms

Joining Fact and Dimension Tables

The real power comes from joining fact tables with dimension tables to answer complex questions:

1. Engagement by User Country

SELECT u.country,
       COUNT(DISTINCT e.user_id) as active_users,
       COUNT(*) as total_events,
       AVG(e.time_spent_seconds) as avg_time_spent
FROM events_fact e
JOIN users_dim u ON e.user_id = u.user_id
WHERE e.event_type = 'view'
GROUP BY u.country
ORDER BY total_events DESC;

2. Content Performance by Type and Language

SELECT p.content_type,
       p.language,
       COUNT(*) as total_interactions,
       SUM(CASE WHEN e.event_type = 'like' THEN 1 ELSE 0 END) as likes,
       SUM(CASE WHEN e.event_type = 'share' THEN 1 ELSE 0 END) as shares
FROM events_fact e
JOIN posts_dim p ON e.post_id = p.post_id
GROUP BY p.content_type, p.language
ORDER BY total_interactions DESC;

3. Verified vs Non-Verified User Engagement

SELECT u.verified,
       COUNT(DISTINCT e.user_id) as user_count,
       AVG(e.time_spent_seconds) as avg_time_spent,
       COUNT(*) / COUNT(DISTINCT e.user_id) as events_per_user
FROM events_fact e
JOIN users_dim u ON e.user_id = u.user_id
GROUP BY u.verified;

💡 Best Practice: Always join dimension tables to fact tables using the primary key (like user_id or post_id). This ensures accurate results and optimal query performance. Dimension tables are typically small enough to fit in memory, making joins very efficient.

The Star Schema Design

Now that you understand fact tables and dimension tables, let's see how they connect together in a star schema. This design pattern is used by most major platforms for their data warehouses, including Facebook, TikTok, YouTube, and X (Twitter).

What is a Star Schema?

A star schema is a database design pattern commonly used for platform data warehouses. It consists of:

  • A central fact table containing measurements and metrics (events, interactions, views)
  • Multiple dimension tables radiating outward like points of a star, providing context
  • Foreign key relationships connecting the fact table to each dimension table

Platform Data Star Schema

EVENTS_FACT
• event_id (PK)
• user_id (FK)
• post_id (FK)
• session_id (FK)
• time_id (FK)
• event_type
• time_spent
• device_type
USERS_DIM
• user_id (PK)
• username
• country
• verified
• created_date
POSTS_DIM
• post_id (PK)
• author_id
• content_type
• language
• hashtags
SESSIONS_DIM
• session_id (PK)
• start_time
• end_time
• ip_address
• location
TIME_DIM
• time_id (PK)
• date
• hour
• day_of_week
• month

The central fact table connects to dimension tables via foreign keys (FK). Primary keys (PK) uniquely identify each record.

Benefits of Star Schema

  • ✓ Fast queries: Simplified joins make aggregations very efficient
  • ✓ Easy to understand: Intuitive structure for analysts and researchers
  • ✓ Flexible analysis: Slice and dice data by any dimension
  • ✓ Optimized storage: Dimension data is not duplicated in fact table

Real-World Example

When you want to analyze "How much time do verified users from France spend viewing video content?"

You join the events fact table with the users dimension (for country and verification status) and posts dimension (for content type), then filter and aggregate.

Star Schema Query Example

This query demonstrates how to join the fact table with multiple dimension tables:

-- Analyze video viewing time by country and verification status
SELECT
    u.country,
    u.verified,
    COUNT(*) as total_views,
    AVG(e.time_spent_seconds) as avg_viewing_time,
    SUM(e.time_spent_seconds) as total_viewing_time
FROM events_fact e
JOIN users_dim u ON e.user_id = u.user_id
JOIN posts_dim p ON e.post_id = p.post_id
JOIN time_dim t ON e.time_id = t.time_id
WHERE e.event_type = 'view'
  AND p.content_type = 'Video'
  AND t.date BETWEEN '2024-10-01' AND '2024-10-31'
GROUP BY u.country, u.verified
ORDER BY total_viewing_time DESC;

💡 Note: This single query joins the central fact table with three dimension tables, filters by event type, content type, and time period, then aggregates by country and verification status. This is the power of the star schema design!

Try It Yourself

Practice SQL queries with our interactive playground. The database contains synthetic platform data with users, posts, and interactions tables.

Interactive SQL Playground

Practice SQL with synthetic platform data

Example Queries:

SELECT * FROM posts LIMIT 5;
Database Schema & Data Dictionary

users (Dimension Table)

Contains information about platform users

VariableDescriptionData Type
user_idUnique identifier for user (Primary Key)INTEGER
usernameUser's handle/usernameTEXT
display_nameUser's display nameTEXT
follower_countNumber of followersINTEGER
following_countNumber of accounts followedINTEGER
account_createdDate account was createdDATE
verifiedWhether account is verified (1=Yes, 0=No)BOOLEAN

posts (Content Table)

Contains posts/content published by users

VariableDescriptionData Type
post_idUnique identifier for post (Primary Key)INTEGER
user_idID of user who created the post (Foreign Key)INTEGER
contentText content of the postTEXT
post_dateTimestamp when post was createdDATETIME
likes_countNumber of likes/reactionsINTEGER
shares_countNumber of shares/repostsINTEGER
comments_countNumber of comments/repliesINTEGER

interactions (Fact Table)

Records individual user interactions with posts

VariableDescriptionData Type
interaction_idUnique identifier for interaction (Primary Key)INTEGER
post_idID of post being interacted with (Foreign Key)INTEGER
user_idID of user performing interaction (Foreign Key)INTEGER
interaction_typeType of interaction (like, share, comment)TEXT
interaction_dateTimestamp when interaction occurredDATETIME

Working with Query Results

Once you retrieve data with SQL, you'll often want to sort and analyze it. Here's an example of platform statistics that you can sort interactively:

Example Platform Statistics (Click headers to sort)

PlatformTotal EU UsersAvg Daily PostsData Formats
TikTok15000000045000000JSON
Instagram12000000038000000JSON
Facebook18000000052000000CSV, JSON
YouTube9500000012000000JSON, XML
X (Twitter)7500000028000000JSON

Click column headers to sort. 5 total rows.

Common SQL Patterns for Platform Analysis

Finding Top Content

-- Find posts with the most engagement
SELECT p.post_id, p.content,
       (p.likes_count + p.shares_count + p.comments_count) as total_engagement
FROM posts p
ORDER BY total_engagement DESC
LIMIT 10;

Time-Based Analysis

-- Count posts by day
SELECT DATE(post_date) as day,
       COUNT(*) as post_count
FROM posts
GROUP BY DATE(post_date)
ORDER BY day DESC;

User Behavior Analysis

-- Find most active users
SELECT u.username,
       COUNT(DISTINCT p.post_id) as posts_created,
       COUNT(DISTINCT i.interaction_id) as interactions_made
FROM users u
LEFT JOIN posts p ON u.user_id = p.user_id
LEFT JOIN interactions i ON u.user_id = i.user_id
GROUP BY u.user_id, u.username
ORDER BY posts_created DESC;

Additional Resources

  • W3Schools SQL Tutorial - Comprehensive SQL basics
  • Mode SQL Tutorial - Advanced SQL for data analysis
  • Select Star SQL - Interactive SQL book
Previous: Mapping DataNext: API Guide