Master SQL queries to analyze platform data effectively. Practice with our interactive SQL playground using synthetic platform data.
When people take an introductory course in research methods or statistics, especially in the social sciences, they are often taught to work with data that have been cleaned and organized into a simple, singular spreadsheet. Typically, each row in the spreadsheet will correspond to the subjects of the research (e.g., participants, customers, products), and each column will correspond to information about each subject (e.g., their survey responses, purchases, symptoms, prices).
This is a great way to start, and is often perfect for the relatively small datasets that university labs or think tanks may use for their individual research projects that are well-structured with relatively few specified goals in mind. For example, if you were studying whether a medication is effective at treating a medical condition, you might have a spreadsheet listing each patient, their dosage, and their symptoms over time. There is no need for multiple spreadsheets because the data are small enough and straightforward.
However, when data are collected at massive scale with less clear uses of the data, a single spreadsheet quickly becomes impractical and inefficient. So, very large online platforms, marketplaces, and search engines tend to store data in a bunch of different spreadsheets (typically called "tables") within a massive database. Some tables might focus on users, others might focus on content (e.g., posts, videos, images), others might focus on interactions (e.g., likes, shares, purchases, searches), and others might focus on metadata (time, location, device type, language, etc.).
If all of these data were stored in a single table, the file size could well be measured in zettabytes or exabytes (1 EB = 1 MILLION terabytes, or 1 BILLION gigabytes 🤯) instead of terabytes or gigabytes, which would make it impossible to open the data on even the most powerful personal computers, let alone do anything with those data once the file is opened. Now, these companies typically use massive data centers running high-performance computing clusters that can handle much larger data, but even then, it's expensive, time-consuming, and unnecessary to process all of the data when only a few variables might be needed.
This is where SQL (Structured Query Language) comes in. SQL is a simpler programming language specifically designed for retrieving, filtering, combining, and summarizing data stored across the many tables of a database. Rather than loading entire tables, a SQL query asks the database to return only the rows and columns you actually need. Additionally, SQL is useful because it is supported by virtually every major database system, so the same basic skills transfer across platforms and tools.
Fortunately, SQL is one of the simpler programming languages to learn. It can do complex things, but there are just 4 main commands that you need to know to accomplish most of what you would want to do with platform data as an external researcher or auditor. Additionally, SQL syntax is designed to be very close to plain English, making it easier to read, understand, and write.
I describe these basic commands in the boxes below with a functioning example query for each.
All SQL queries include a SELECT statement and a FROM statement.
```sql
SELECT username, follower_count
FROM users;
```

If you were to translate the above code into plain English, it would read "Get me the username and follower count data from the users table."
Often, you'll want data from more than one table. To connect data from multiple tables, you'll use the JOIN and ON statements.
```sql
SELECT u.username, p.content
FROM posts p
JOIN users u ON p.user_id = u.user_id;
```

If you were to translate the above code into plain English, it would read "Get the username data from the users table and the content data from the posts table, and then give me each user's content."
Note: In the example above, I used table aliases (the single-letter names after the table names) to make the query easier to read and write. This is optional, but recommended when working with multiple tables because it is possible that multiple tables contain variables with the same names, even if they refer to different things. If you try to SELECT variables with the same names from multiple tables without using aliases, SQL will return an error. By using aliases (like "u" for users and "p" for posts), you can specify exactly which table each variable comes from and avoid an error I encountered far too many times when I first started learning SQL!
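To make the ambiguity concrete, here is a small sketch reusing the users and posts tables from the example above. Both tables contain a user_id column (that's what the JOIN matches on), so the alias prefix is what tells SQL which one you mean:

```sql
-- Both tables have a user_id column; the "u." and "p." prefixes
-- disambiguate them. Writing bare user_id in the SELECT would be
-- rejected as ambiguous by most database systems.
SELECT u.user_id, u.username, p.post_id
FROM posts p
JOIN users u ON p.user_id = u.user_id;
```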
When working with large data sets, you'll typically want (and potentially need) to reduce the total volume of data you're requesting. The main command for filtering data is the WHERE statement. You might also want to sort the data in a particular order, which you can do with the ORDER BY statement.
```sql
SELECT content, likes_count
FROM posts
WHERE likes_count > 1000
ORDER BY post_date DESC;
```

If you were to translate the above code into plain English, it would read "Get me the content and likes count data from the posts table, but only for posts that have more than 1000 likes, and sort the results by post date from most recent to oldest."
Note: In the example above, I sorted the data using a variable I didn't include in the SELECT statement (i.e., post_date) to illustrate that you can sort or filter on variables even if you aren't requesting them in the output. This is useful when you want to filter or sort on variables that aren't necessary for your analysis.
You might also want to calculate summary statistics across your data, such as counts, averages, sums, minimums, and maximums. SQL has built-in functions for these common calculations, and you can use them directly in your SELECT statements to get those statistics quickly.
You can do more complex statistics, too, like comparing groups of data with the GROUP BY statement, but the examples below illustrate the basic idea. Personally, I would recommend doing more complex statistics (like regressions or sentiment modeling) outside of SQL in a statistical programming language like R or Python, but simple aggregations are fine.
```sql
SELECT COUNT(*) as total_posts,
       AVG(likes_count) as avg_likes
FROM posts;
```

If you were to translate the above code into plain English, it would read "Get me the total number of posts and the average likes count across all posts from the posts table."
Note: In the example above, the asterisk (*) in COUNT(*) means "count all rows." I used the COUNT() and AVG() functions to calculate the total number of posts and the average likes count, respectively.
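For a taste of the GROUP BY statement mentioned above, here is a sketch that computes per-user statistics instead of one overall average. It uses the same posts table, relying on the user_id column seen in the earlier JOIN example:

```sql
-- One row per user instead of one row overall:
-- each user's post count and average likes per post
SELECT user_id,
       COUNT(*) as total_posts,
       AVG(likes_count) as avg_likes
FROM posts
GROUP BY user_id;
```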
Platform data is typically organized using a star schema (described in more detail further down on this page), with fact tables at the center storing event-level data and dimension tables providing contextual information. Fact tables are the foundation of platform analytics, containing metrics and measurements of user behaviors and interactions.
A fact table stores event-level data with metrics and foreign keys to dimension tables. Each row represents a single event or interaction on the platform.
| Variable Name | Description | Data Type | Retention Period |
|---|---|---|---|
| user_id | Unique identifier for the user | VARCHAR(255) | 90 days |
| session_id | Unique identifier for the user session | VARCHAR(255) | 90 days |
| event_id | Unique identifier for the event | BIGINT | 90 days |
| post_id | Unique identifier for the post interacted with | BIGINT | 90 days |
| event_time | Timestamp when the event occurred | TIMESTAMP | 90 days |
| event_type | Type of event (view, click, like, share, comment) | VARCHAR(50) | 90 days |
| time_spent_seconds | Time spent on content in seconds | INT | 90 days |
| device_type | Device used (mobile, desktop, tablet) | VARCHAR(50) | 90 days |
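As a rough illustration of how the schema above might be defined, here is a sketch of the table's DDL. This is illustrative only, not any platform's actual definition, and exact types and constraints vary by database system:

```sql
CREATE TABLE events_fact (
    event_id            BIGINT PRIMARY KEY,
    user_id             VARCHAR(255) NOT NULL,  -- foreign key to the users dimension
    session_id          VARCHAR(255) NOT NULL,
    post_id             BIGINT NOT NULL,        -- foreign key to the posts dimension
    event_time          TIMESTAMP NOT NULL,
    event_type          VARCHAR(50) NOT NULL,   -- view, click, like, share, comment
    time_spent_seconds  INT,
    device_type         VARCHAR(50)             -- mobile, desktop, tablet
);
```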
This example shows user interactions with posts, including views, likes, shares, and comments.
| user_id | session_id | event_id | post_id | event_time | event_type | time_spent_seconds | device_type |
|---|---|---|---|---|---|---|---|
| 101 | sess_a1b2c3 | 1001 | 5001 | 2024-10-21 09:15:23 | view | 45 | mobile |
| 101 | sess_a1b2c3 | 1002 | 5001 | 2024-10-21 09:16:08 | like | 0 | mobile |
| 102 | sess_d4e5f6 | 1003 | 5002 | 2024-10-21 09:17:45 | view | 120 | desktop |
| 102 | sess_d4e5f6 | 1004 | 5002 | 2024-10-21 09:19:45 | share | 0 | desktop |
| 103 | sess_g7h8i9 | 1005 | 5003 | 2024-10-21 09:22:10 | view | 30 | mobile |
| 104 | sess_j1k2l3 | 1006 | 5001 | 2024-10-21 09:25:33 | view | 67 | tablet |
| 104 | sess_j1k2l3 | 1007 | 5001 | 2024-10-21 09:26:40 | comment | 0 | tablet |
| 105 | sess_m4n5o6 | 1008 | 5004 | 2024-10-21 09:28:15 | view | 95 | desktop |
Here are common SQL queries for analyzing event data from fact tables:
```sql
-- Engagement metrics per post
SELECT post_id,
       COUNT(*) as total_events,
       SUM(CASE WHEN event_type = 'like' THEN 1 ELSE 0 END) as likes,
       SUM(CASE WHEN event_type = 'share' THEN 1 ELSE 0 END) as shares,
       SUM(CASE WHEN event_type = 'comment' THEN 1 ELSE 0 END) as comments,
       AVG(time_spent_seconds) as avg_time_spent
FROM events_fact
GROUP BY post_id
ORDER BY total_events DESC;
```

```sql
-- Viewing behavior by device type
SELECT device_type,
       COUNT(DISTINCT user_id) as unique_users,
       COUNT(*) as total_events,
       AVG(time_spent_seconds) as avg_time_spent
FROM events_fact
WHERE event_type = 'view'
GROUP BY device_type
ORDER BY unique_users DESC;
```

```sql
-- Daily activity summary
SELECT DATE(event_time) as event_date,
       COUNT(DISTINCT user_id) as daily_active_users,
       COUNT(DISTINCT session_id) as total_sessions,
       COUNT(*) as total_events
FROM events_fact
GROUP BY DATE(event_time)
ORDER BY event_date DESC;
```

💡 Tip: Fact tables are optimized for aggregation queries. They typically contain millions or billions of rows, so always use WHERE clauses to filter by time ranges and use indexes on frequently queried columns like user_id, post_id, and event_time.
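Creating such indexes is a one-line statement per column. The index names below are hypothetical, and the exact syntax for options varies by database system:

```sql
-- Speeds up time-range filters in WHERE clauses
CREATE INDEX idx_events_event_time ON events_fact (event_time);

-- Speeds up per-user and per-post lookups and joins
CREATE INDEX idx_events_user_id ON events_fact (user_id);
CREATE INDEX idx_events_post_id ON events_fact (post_id);
```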
Dimension tables complement fact tables by providing descriptive context. When you join fact tables with dimension tables, you can answer questions like "Which countries have the most engaged users?" or "How does content performance vary by language?"
Dimension tables contain descriptive attributes about entities in your data. They provide context for the metrics stored in fact tables. Common dimension tables include users, posts, locations, and time periods.
Key Difference: While fact tables store measurements (likes, views, time spent), dimension tables store descriptions (who, what, where, when). Dimension tables are typically much smaller than fact tables but provide essential context for analysis.
Contains descriptive information about each user on the platform.
| user_id | username | account_created | country | verified | follower_count |
|---|---|---|---|---|---|
| 101 | alice_research | 2022-03-15 | France | Yes | 1,250 |
| 102 | bob_analyst | 2021-08-22 | Germany | No | 450 |
| 103 | carol_data | 2023-01-10 | Spain | Yes | 3,200 |
| 104 | dave_insights | 2020-11-05 | Italy | No | 890 |
| 105 | eve_metrics | 2022-07-18 | Netherlands | Yes | 2,100 |
Contains descriptive information about each post or piece of content on the platform.
| post_id | author_id | content_type | content_length | language | created_date | hashtags |
|---|---|---|---|---|---|---|
| 5001 | 101 | Video | 120 | French | 2024-10-20 | #research #data |
| 5002 | 102 | Text | 280 | English | 2024-10-20 | #analytics #EU |
| 5003 | 103 | Image | 150 | Spanish | 2024-10-21 | #dataviz |
| 5004 | 105 | Video | 45 | English | 2024-10-21 | #DSA #platforms |
The real power comes from joining fact tables with dimension tables to answer complex questions:
```sql
-- Engagement by country
SELECT u.country,
       COUNT(DISTINCT e.user_id) as active_users,
       COUNT(*) as total_events,
       AVG(e.time_spent_seconds) as avg_time_spent
FROM events_fact e
JOIN users_dim u ON e.user_id = u.user_id
WHERE e.event_type = 'view'
GROUP BY u.country
ORDER BY total_events DESC;
```

```sql
-- Interactions by content type and language
SELECT p.content_type,
       p.language,
       COUNT(*) as total_interactions,
       SUM(CASE WHEN e.event_type = 'like' THEN 1 ELSE 0 END) as likes,
       SUM(CASE WHEN e.event_type = 'share' THEN 1 ELSE 0 END) as shares
FROM events_fact e
JOIN posts_dim p ON e.post_id = p.post_id
GROUP BY p.content_type, p.language
ORDER BY total_interactions DESC;
```

```sql
-- Engagement of verified vs. unverified users
SELECT u.verified,
       COUNT(DISTINCT e.user_id) as user_count,
       AVG(e.time_spent_seconds) as avg_time_spent,
       COUNT(*) * 1.0 / COUNT(DISTINCT e.user_id) as events_per_user  -- * 1.0 avoids integer division
FROM events_fact e
JOIN users_dim u ON e.user_id = u.user_id
GROUP BY u.verified;
```

💡 Best Practice: Always join dimension tables to fact tables using the primary key (like user_id or post_id). This ensures accurate results and optimal query performance. Dimension tables are typically small enough to fit in memory, making joins very efficient.
Now that you understand fact tables and dimension tables, let's see how they connect together in a star schema. This design pattern is used by most major platforms for their data warehouses, including Facebook, TikTok, YouTube, and X (Twitter).
A star schema is a database design pattern commonly used for platform data warehouses. It consists of a single fact table at the center, surrounded by the dimension tables it references, so that a diagram of the schema resembles a star.
The central fact table connects to dimension tables via foreign keys (FK). Primary keys (PK) uniquely identify each record.
When you want to analyze "How much time do verified users from France spend viewing video content?"
You join the events fact table with the users dimension (for country and verification status) and posts dimension (for content type), then filter and aggregate.
This query demonstrates how to join the fact table with multiple dimension tables:
```sql
-- Analyze video viewing time by country and verification status
SELECT u.country,
       u.verified,
       COUNT(*) as total_views,
       AVG(e.time_spent_seconds) as avg_viewing_time,
       SUM(e.time_spent_seconds) as total_viewing_time
FROM events_fact e
JOIN users_dim u ON e.user_id = u.user_id
JOIN posts_dim p ON e.post_id = p.post_id
JOIN time_dim t ON e.time_id = t.time_id
WHERE e.event_type = 'view'
  AND p.content_type = 'Video'
  AND t.date BETWEEN '2024-10-01' AND '2024-10-31'
GROUP BY u.country, u.verified
ORDER BY total_viewing_time DESC;
```

💡 Note: This single query joins the central fact table with three dimension tables, filters by event type, content type, and time period, then aggregates by country and verification status. This is the power of the star schema design!
Practice SQL queries with our interactive playground. The database contains synthetic platform data with users, posts, and interactions tables.
Contains information about platform users
| Variable | Description | Data Type |
|---|---|---|
| user_id | Unique identifier for user (Primary Key) | INTEGER |
| username | User's handle/username | TEXT |
| display_name | User's display name | TEXT |
| follower_count | Number of followers | INTEGER |
| following_count | Number of accounts followed | INTEGER |
| account_created | Date account was created | DATE |
| verified | Whether account is verified (1=Yes, 0=No) | BOOLEAN |
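For instance, a first query to try in the playground against the users table above might be:

```sql
-- Verified accounts with the largest audiences
-- (verified is stored as 1 = Yes, 0 = No)
SELECT username, follower_count
FROM users
WHERE verified = 1
ORDER BY follower_count DESC;
```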
Contains posts/content published by users
| Variable | Description | Data Type |
|---|---|---|
| post_id | Unique identifier for post (Primary Key) | INTEGER |
| user_id | ID of user who created the post (Foreign Key) | INTEGER |
| content | Text content of the post | TEXT |
| post_date | Timestamp when post was created | DATETIME |
| likes_count | Number of likes/reactions | INTEGER |
| shares_count | Number of shares/reposts | INTEGER |
| comments_count | Number of comments/replies | INTEGER |
Records individual user interactions with posts
| Variable | Description | Data Type |
|---|---|---|
| interaction_id | Unique identifier for interaction (Primary Key) | INTEGER |
| post_id | ID of post being interacted with (Foreign Key) | INTEGER |
| user_id | ID of user performing interaction (Foreign Key) | INTEGER |
| interaction_type | Type of interaction (like, share, comment) | TEXT |
| interaction_date | Timestamp when interaction occurred | DATETIME |
Once you retrieve data with SQL, you'll often want to sort and analyze it. Here's an example of platform statistics that you can sort interactively:
| Platform | Total EU Users | Avg Daily Posts | Data Formats |
|---|---|---|---|
| TikTok | 150,000,000 | 45,000,000 | JSON |
| | 120,000,000 | 38,000,000 | JSON |
| | 180,000,000 | 52,000,000 | CSV, JSON |
| YouTube | 95,000,000 | 12,000,000 | JSON, XML |
| X (Twitter) | 75,000,000 | 28,000,000 | JSON |
```sql
-- Find posts with the most engagement
SELECT p.post_id, p.content,
       (p.likes_count + p.shares_count + p.comments_count) as total_engagement
FROM posts p
ORDER BY total_engagement DESC
LIMIT 10;
```

```sql
-- Count posts by day
SELECT DATE(post_date) as day,
       COUNT(*) as post_count
FROM posts
GROUP BY DATE(post_date)
ORDER BY day DESC;
```

```sql
-- Find most active users
SELECT u.username,
       COUNT(DISTINCT p.post_id) as posts_created,
       COUNT(DISTINCT i.interaction_id) as interactions_made
FROM users u
LEFT JOIN posts p ON u.user_id = p.user_id
LEFT JOIN interactions i ON u.user_id = i.user_id
GROUP BY u.user_id, u.username
ORDER BY posts_created DESC;
```