Master SQL queries to analyze platform data effectively. Practice with our interactive SQL playground using synthetic platform data.
When people take an introductory course in research methods or statistics, especially in the social sciences, they are often taught to work with data that have been cleaned and organized into a simple, singular spreadsheet. Typically, each row in the spreadsheet will correspond to the subjects of the research (e.g., participants, customers, products), and each column will correspond to information about each subject (e.g., their survey responses, purchases, symptoms, prices).
This is a great way to start, and is often perfect for the relatively small datasets that university labs or think tanks may use for their individual research projects that are well-structured with relatively few specified goals in mind. For example, if you were studying whether a medication is effective at treating a medical condition, you might have a spreadsheet listing each patient, their dosage, and their symptoms over time. There is no need for multiple spreadsheets because the data are small enough and straightforward.
However, when data are collected at massive scale with less clear uses of the data, a single spreadsheet quickly becomes impractical and inefficient. So, very large online platforms, marketplaces, and search engines tend to store data in a bunch of different spreadsheets (typically called "tables") within a massive database. Some tables might focus on users, others might focus on content (e.g., posts, videos, images), others might focus on interactions (e.g., likes, shares, purchases, searches), and others might focus on metadata (time, location, device type, language, etc.).
If all of these data were stored in a single table, the file size could well be measured in zettabytes or exabytes (1 EB = 1 MILLION terabytes, or 1 BILLION gigabytes 🤯) instead of terabytes or gigabytes, which would make it impossible to open the data on even the most powerful personal computers, let alone do anything with those data once the file is opened. Now, these companies typically use massive data centers running high-performance computing clusters that can handle much larger data, but even then, it's expensive, time-consuming, and unnecessary to process all of the data when only a few variables might be needed.
This is where SQL (Structured Query Language) comes in. SQL is a simpler programming language specifically designed for retrieving, filtering, combining, and summarizing data stored across the many tables of a database. Rather than loading entire tables, a SQL query asks the database to return only the rows and columns you actually need. Additionally, SQL is useful because it is supported by virtually every major database system, so the same basic skills transfer across platforms and tools.
Fortunately, SQL is one of the simpler programming languages to learn. It can do complex things, but there are just 4 main commands that you need to know to accomplish most of what you would want to do with platform data as an external researcher or auditor. Additionally, SQL syntax is designed to be very close to plain English, making it easier to read, understand, and write.
I describe these basic commands in the boxes below with a functioning example query for each.
All SQL queries include a SELECT statement and a FROM statement.
```sql
SELECT username, follower_count
FROM users;
```

If you were to translate the above code into plain English, it would read "Get me the username and follower count data from the users table."
Often, you'll want data from more than one table. To connect data from multiple tables, you'll use the JOIN and ON statements.
```sql
SELECT u.username, p.content
FROM posts p
JOIN users u ON p.user_id = u.user_id;
```

If you were to translate the above code into plain English, it would read "Get the username data from the users table and the content data from the posts table, and then give me each user's content."
Note: In the example above, I used table aliases (the single-letter names after the table names) to make the query easier to read and write. This is optional, but recommended when working with multiple tables because it is possible that multiple tables contain variables with the same names, even if they refer to different things. If you try to SELECT variables with the same names from multiple tables without using aliases, SQL will return an error. By using aliases (like "u" for users and "p" for posts), you can specify exactly which table each variable comes from and avoid an error I encountered far too many times when I first started learning SQL!
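To make the ambiguity concrete, here is a small sketch reusing the users and posts tables from the example above. Both tables contain a user_id column (that's what the JOIN matches on), so the alias prefix is what tells SQL which one you mean:

```sql
-- Both tables have a user_id column; the "u." and "p." prefixes
-- disambiguate them. Writing bare user_id in the SELECT would be
-- rejected as ambiguous by most database systems.
SELECT u.user_id, u.username, p.post_id
FROM posts p
JOIN users u ON p.user_id = u.user_id;
```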
When working with large data sets, you'll typically want (and potentially need) to reduce the total volume of data you're requesting. The main command for filtering data is the WHERE statement. You might also want to sort the data in a particular order, which you can do with the ORDER BY statement.
```sql
SELECT content, likes_count
FROM posts
WHERE likes_count > 1000
ORDER BY post_date DESC;
```

If you were to translate the above code into plain English, it would read "Get me the content and likes count data from the posts table, but only for posts that have more than 1000 likes, and sort the results by post date from most recent to oldest."
Note: In the example above, I sorted the data using a variable I didn't include in the SELECT statement (i.e., post_date) to illustrate that you can sort or filter on variables even if you aren't requesting them in the output. This is useful when you want to filter or sort on variables that aren't necessary for your analysis.
You might also want to calculate summary statistics across your data, such as counts, averages, sums, minimums, and maximums. SQL has built-in functions for these common calculations, and you can use them directly in your SELECT statements to get those statistics quickly.
You can do more complex statistics, too, like comparing groups of data with the GROUP BY statement, but the examples below illustrate the basic idea. Personally, I would recommend doing more complex statistics (like regressions or sentiment modeling) outside of SQL in a statistical programming language like R or Python, but simple aggregations are fine.
```sql
SELECT COUNT(*) as total_posts,
       AVG(likes_count) as avg_likes
FROM posts;
```

If you were to translate the above code into plain English, it would read "Get me the total number of posts and the average likes count across all posts from the posts table."
Note: In the example above, the asterisk (*) in COUNT(*) means "count all rows." I used the COUNT() and AVG() functions to calculate the total number of posts and the average likes count, respectively.
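For a taste of the GROUP BY statement mentioned above, here is a sketch that computes per-user statistics instead of one overall average. It uses the same posts table, relying on the user_id column seen in the earlier JOIN example:

```sql
-- One row per user instead of one row overall:
-- each user's post count and average likes per post
SELECT user_id,
       COUNT(*) as total_posts,
       AVG(likes_count) as avg_likes
FROM posts
GROUP BY user_id;
```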
Platform data is typically organized using a star schema (described in more detail further down on this page), with fact tables at the center storing event-level data and dimension tables providing contextual information. Fact tables are the foundation of platform analytics, containing metrics and measurements of user behaviors and interactions.
A fact table stores event-level data with metrics and foreign keys to dimension tables. Each row represents a single event or interaction on the platform.
| Variable Name | Description | Data Type | Retention Period |
|---|---|---|---|
| user_id | Unique identifier for the user | VARCHAR(255) | 90 days |
| session_id | Unique identifier for the user session | VARCHAR(255) | 90 days |
| event_id | Unique identifier for the event | BIGINT | 90 days |
| post_id | Unique identifier for the post interacted with | BIGINT | 90 days |
| event_time | Timestamp when the event occurred | TIMESTAMP | 90 days |
| event_type | Type of event (view, click, like, share, comment) | VARCHAR(50) | 90 days |
| time_spent_seconds | Time spent on content in seconds | INT | 90 days |
| device_type | Device used (mobile, desktop, tablet) | VARCHAR(50) | 90 days |
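As a rough illustration of how the schema above might be defined, here is a sketch of the table's DDL. This is illustrative only, not any platform's actual definition, and exact types and constraints vary by database system:

```sql
CREATE TABLE events_fact (
    event_id            BIGINT PRIMARY KEY,
    user_id             VARCHAR(255) NOT NULL,  -- foreign key to the users dimension
    session_id          VARCHAR(255) NOT NULL,
    post_id             BIGINT NOT NULL,        -- foreign key to the posts dimension
    event_time          TIMESTAMP NOT NULL,
    event_type          VARCHAR(50) NOT NULL,   -- view, click, like, share, comment
    time_spent_seconds  INT,
    device_type         VARCHAR(50)             -- mobile, desktop, tablet
);
```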
This example shows user interactions with posts, including views, likes, shares, and comments.
| user_id | session_id | event_id | post_id | event_time | event_type | time_spent_seconds | device_type |
|---|---|---|---|---|---|---|---|
| 101 | sess_a1b2c3 | 1001 | 5001 | 2024-10-21 09:15:23 | view | 45 | mobile |
| 101 | sess_a1b2c3 | 1002 | 5001 | 2024-10-21 09:16:08 | like | 0 | mobile |
| 102 | sess_d4e5f6 | 1003 | 5002 | 2024-10-21 09:17:45 | view | 120 | desktop |
| 102 | sess_d4e5f6 | 1004 | 5002 | 2024-10-21 09:19:45 | share | 0 | desktop |
| 103 | sess_g7h8i9 | 1005 | 5003 | 2024-10-21 09:22:10 | view | 30 | mobile |
| 104 | sess_j1k2l3 | 1006 | 5001 | 2024-10-21 09:25:33 | view | 67 | tablet |
| 104 | sess_j1k2l3 | 1007 | 5001 | 2024-10-21 09:26:40 | comment | 0 | tablet |
| 105 | sess_m4n5o6 | 1008 | 5004 | 2024-10-21 09:28:15 | view | 95 | desktop |
Here are common SQL queries for analyzing event data from fact tables:
```sql
-- Engagement metrics per post
SELECT post_id,
       COUNT(*) as total_events,
       SUM(CASE WHEN event_type = 'like' THEN 1 ELSE 0 END) as likes,
       SUM(CASE WHEN event_type = 'share' THEN 1 ELSE 0 END) as shares,
       SUM(CASE WHEN event_type = 'comment' THEN 1 ELSE 0 END) as comments,
       AVG(time_spent_seconds) as avg_time_spent
FROM events_fact
GROUP BY post_id
ORDER BY total_events DESC;
```

```sql
-- Viewing behavior by device type
SELECT device_type,
       COUNT(DISTINCT user_id) as unique_users,
       COUNT(*) as total_events,
       AVG(time_spent_seconds) as avg_time_spent
FROM events_fact
WHERE event_type = 'view'
GROUP BY device_type
ORDER BY unique_users DESC;
```

```sql
-- Daily activity summary
SELECT DATE(event_time) as event_date,
       COUNT(DISTINCT user_id) as daily_active_users,
       COUNT(DISTINCT session_id) as total_sessions,
       COUNT(*) as total_events
FROM events_fact
GROUP BY DATE(event_time)
ORDER BY event_date DESC;
```

💡 Tip: Fact tables are optimized for aggregation queries. They typically contain millions or billions of rows, so always use WHERE clauses to filter by time ranges and use indexes on frequently queried columns like user_id, post_id, and event_time.
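Creating such indexes is a one-line statement per column. The index names below are hypothetical, and the exact syntax for options varies by database system:

```sql
-- Speeds up time-range filters in WHERE clauses
CREATE INDEX idx_events_event_time ON events_fact (event_time);

-- Speeds up per-user and per-post lookups and joins
CREATE INDEX idx_events_user_id ON events_fact (user_id);
CREATE INDEX idx_events_post_id ON events_fact (post_id);
```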
Dimension tables complement fact tables by providing descriptive context. When you join fact tables with dimension tables, you can answer questions like "Which countries have the most engaged users?" or "How does content performance vary by language?"
Dimension tables contain descriptive attributes about entities in your data. They provide context for the metrics stored in fact tables. Common dimension tables include users, posts, locations, and time periods.
Key Difference: While fact tables store measurements (likes, views, time spent), dimension tables store descriptions (who, what, where, when). Dimension tables are typically much smaller than fact tables but provide essential context for analysis.
Contains descriptive information about each user on the platform.
| user_id | username | account_created | country | verified | follower_count |
|---|---|---|---|---|---|
| 101 | alice_research | 2022-03-15 | France | Yes | 1,250 |
| 102 | bob_analyst | 2021-08-22 | Germany | No | 450 |
| 103 | carol_data | 2023-01-10 | Spain | Yes | 3,200 |
| 104 | dave_insights | 2020-11-05 | Italy | No | 890 |
| 105 | eve_metrics | 2022-07-18 | Netherlands | Yes | 2,100 |
Contains descriptive information about each post or piece of content on the platform.
| post_id | author_id | content_type | content_length | language | created_date | hashtags |
|---|---|---|---|---|---|---|
| 5001 | 101 | Video | 120 | French | 2024-10-20 | #research #data |
| 5002 | 102 | Text | 280 | English | 2024-10-20 | #analytics #EU |
| 5003 | 103 | Image | 150 | Spanish | 2024-10-21 | #dataviz |
| 5004 | 105 | Video | 45 | English | 2024-10-21 | #DSA #platforms |
The real power comes from joining fact tables with dimension tables to answer complex questions:
```sql
-- Engagement by country
SELECT u.country,
       COUNT(DISTINCT e.user_id) as active_users,
       COUNT(*) as total_events,
       AVG(e.time_spent_seconds) as avg_time_spent
FROM events_fact e
JOIN users_dim u ON e.user_id = u.user_id
WHERE e.event_type = 'view'
GROUP BY u.country
ORDER BY total_events DESC;
```

```sql
-- Interactions by content type and language
SELECT p.content_type,
       p.language,
       COUNT(*) as total_interactions,
       SUM(CASE WHEN e.event_type = 'like' THEN 1 ELSE 0 END) as likes,
       SUM(CASE WHEN e.event_type = 'share' THEN 1 ELSE 0 END) as shares
FROM events_fact e
JOIN posts_dim p ON e.post_id = p.post_id
GROUP BY p.content_type, p.language
ORDER BY total_interactions DESC;
```

```sql
-- Engagement of verified vs. unverified users
SELECT u.verified,
       COUNT(DISTINCT e.user_id) as user_count,
       AVG(e.time_spent_seconds) as avg_time_spent,
       COUNT(*) * 1.0 / COUNT(DISTINCT e.user_id) as events_per_user  -- * 1.0 avoids integer division
FROM events_fact e
JOIN users_dim u ON e.user_id = u.user_id
GROUP BY u.verified;
```

💡 Best Practice: Always join dimension tables to fact tables using the primary key (like user_id or post_id). This ensures accurate results and optimal query performance. Dimension tables are typically small enough to fit in memory, making joins very efficient.
Now that you understand fact tables and dimension tables, let's see how they connect together in a star schema. This design pattern is used by most major platforms for their data warehouses, including Facebook, TikTok, YouTube, and X (Twitter).
A star schema is a database design pattern commonly used for platform data warehouses. It consists of a single fact table at the center, surrounded by the dimension tables it references, so that a diagram of the schema resembles a star.
The central fact table connects to dimension tables via foreign keys (FK). Primary keys (PK) uniquely identify each record.
When you want to analyze "How much time do verified users from France spend viewing video content?"
You join the events fact table with the users dimension (for country and verification status) and posts dimension (for content type), then filter and aggregate.
This query demonstrates how to join the fact table with multiple dimension tables:
```sql
-- Analyze video viewing time by country and verification status
SELECT u.country,
       u.verified,
       COUNT(*) as total_views,
       AVG(e.time_spent_seconds) as avg_viewing_time,
       SUM(e.time_spent_seconds) as total_viewing_time
FROM events_fact e
JOIN users_dim u ON e.user_id = u.user_id
JOIN posts_dim p ON e.post_id = p.post_id
JOIN time_dim t ON e.time_id = t.time_id
WHERE e.event_type = 'view'
  AND p.content_type = 'Video'
  AND t.date BETWEEN '2024-10-01' AND '2024-10-31'
GROUP BY u.country, u.verified
ORDER BY total_viewing_time DESC;
```

💡 Note: This single query joins the central fact table with three dimension tables, filters by event type, content type, and time period, then aggregates by country and verification status. This is the power of the star schema design!
Practice SQL queries with our interactive playground. The database contains synthetic platform data with users, posts, and interactions tables.
Contains information about platform users
| Variable | Description | Data Type |
|---|---|---|
| user_id | Unique identifier for user (Primary Key) | INTEGER |
| username | User's handle/username | TEXT |
| display_name | User's display name | TEXT |
| follower_count | Number of followers | INTEGER |
| following_count | Number of accounts followed | INTEGER |
| account_created | Date account was created | DATE |
| verified | Whether account is verified (1=Yes, 0=No) | BOOLEAN |
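For instance, a first query to try in the playground against the users table above might be:

```sql
-- Verified accounts with the largest audiences
-- (verified is stored as 1 = Yes, 0 = No)
SELECT username, follower_count
FROM users
WHERE verified = 1
ORDER BY follower_count DESC;
```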
Contains posts/content published by users
| Variable | Description | Data Type |
|---|---|---|
| post_id | Unique identifier for post (Primary Key) | INTEGER |
| user_id | ID of user who created the post (Foreign Key) | INTEGER |
| content | Text content of the post | TEXT |
| post_date | Timestamp when post was created | DATETIME |
| likes_count | Number of likes/reactions | INTEGER |
| shares_count | Number of shares/reposts | INTEGER |
| comments_count | Number of comments/replies | INTEGER |
Records individual user interactions with posts
| Variable | Description | Data Type |
|---|---|---|
| interaction_id | Unique identifier for interaction (Primary Key) | INTEGER |
| post_id | ID of post being interacted with (Foreign Key) | INTEGER |
| user_id | ID of user performing interaction (Foreign Key) | INTEGER |
| interaction_type | Type of interaction (like, share, comment) | TEXT |
| interaction_date | Timestamp when interaction occurred | DATETIME |
Once you retrieve data with SQL, you'll often want to sort and analyze it. Here's an example of platform statistics that you can sort interactively:
| Platform | Total EU Users | Avg Daily Posts | Data Formats |
|---|---|---|---|
| TikTok | 150,000,000 | 45,000,000 | JSON |
| | 120,000,000 | 38,000,000 | JSON |
| | 180,000,000 | 52,000,000 | CSV, JSON |
| YouTube | 95,000,000 | 12,000,000 | JSON, XML |
| X (Twitter) | 75,000,000 | 28,000,000 | JSON |
```sql
-- Find posts with the most engagement
SELECT p.post_id, p.content,
       (p.likes_count + p.shares_count + p.comments_count) as total_engagement
FROM posts p
ORDER BY total_engagement DESC
LIMIT 10;
```

```sql
-- Count posts by day
SELECT DATE(post_date) as day,
       COUNT(*) as post_count
FROM posts
GROUP BY DATE(post_date)
ORDER BY day DESC;
```

```sql
-- Find most active users
SELECT u.username,
       COUNT(DISTINCT p.post_id) as posts_created,
       COUNT(DISTINCT i.interaction_id) as interactions_made
FROM users u
LEFT JOIN posts p ON u.user_id = p.user_id
LEFT JOIN interactions i ON u.user_id = i.user_id
GROUP BY u.user_id, u.username
ORDER BY posts_created DESC;
```