Show Me The Data
Created by Matt Motyl

© 2025 Matt Motyl. All rights reserved.

Glossary

Key terms and definitions for understanding platform data, the Digital Services Act, and research methods.

Regulation & Policy

Article 40

DSA provision requiring platforms to provide data access to vetted researchers.

Article 40 of the Digital Services Act establishes the legal framework for researcher access to platform data. It requires VLOPs and VLOSEs to provide access to data necessary for research that contributes to detecting, identifying, and understanding systemic risks in the EU. Researchers must be vetted and affiliated with research organizations.

Related: DSA, VLOP, Systemic Risks
External resource: https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CEL...

DSA (Digital Services Act)

Landmark EU legislation imposing transparency and accountability obligations on digital platforms.

The Digital Services Act is European Union legislation that creates a comprehensive regulatory framework for digital services. It imposes obligations on platforms regarding content moderation, algorithmic transparency, researcher data access, and risk assessments. VLOPs and VLOSEs face additional requirements including systemic risk assessments and independent audits.

Related: VLOP, VLOSE, Article 40, Systemic Risks
External resource: https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CEL...

Systemic Risks

Risks that online platforms can pose to society, democracy, and public health.

Under the DSA, systemic risks include the dissemination of illegal content; negative effects on fundamental rights (such as privacy and freedom of expression); negative effects on civic discourse, electoral processes, and public security; and negative effects related to gender-based violence, public health, minors, and physical and mental well-being. VLOPs and VLOSEs must assess and mitigate these risks.

Related: DSA, VLOP, Content Moderation
External resource: https://www.techpolicy.press/understanding-systemic-risks-un...

Transparency Report

Public disclosure of platform enforcement actions and content moderation statistics.

Transparency reports are regular publications by platforms detailing their content moderation activities, including content removed, accounts suspended, and government requests received. The DSA requires VLOPs to publish transparency reports, though researchers have noted these often lack the detail needed for verification.

Related: DSA, Content Moderation, VLOP

VLOP (Very Large Online Platform)

Online platforms with more than 45 million monthly active users in the EU.

Very Large Online Platforms are designated by the European Commission under the DSA. They include social media platforms (Facebook, Instagram, TikTok, YouTube, X), marketplaces (Amazon, AliExpress), and other services. All of them must comply with enhanced DSA obligations, including risk assessments, transparency reports, and researcher data access.

Related: DSA, VLOSE, Article 40
External resource: https://www.eu-digital-services-act.com/Digital_Services_Act...

VLOPSEs (Very Large Online Platforms and Search Engines)

Combined term for VLOPs and VLOSEs - platforms with over 45 million EU users.

VLOPSEs is a collective term referring to both Very Large Online Platforms (VLOPs) and Very Large Online Search Engines (VLOSEs) as designated under the Digital Services Act. These are digital services with more than 45 million monthly active users in the EU, subject to the most stringent DSA obligations including systemic risk assessments, independent audits, and researcher data access requirements.

Related: VLOP, VLOSE, DSA, Article 40

VLOSE (Very Large Online Search Engine)

Search engines with more than 45 million monthly active users in the EU.

Very Large Online Search Engines are search services designated under the DSA. The currently designated VLOSEs are Google Search and Bing. They are subject to transparency and accountability requirements similar to those for VLOPs, including obligations around algorithmic transparency and researcher data access.

Related: DSA, VLOP, Article 40
External resource: https://www.eu-digital-services-act.com/Digital_Services_Act...

Data & Databases

Data Warehouse

A central repository of integrated data from multiple sources for analysis.

A data warehouse stores large volumes of historical data optimized for querying and analysis rather than transaction processing. Platforms use data warehouses to store user data, content, engagement metrics, and other information. The structure can vary from chaotic (organically created tables) to standardized (strict frameworks).

Related: Star Schema, Fact Table, Dimension Table
External resource: https://en.wikipedia.org/wiki/Data_warehouse

Dimension Table

A database table containing descriptive attributes about entities.

Dimension tables provide context for the data in fact tables. They contain attributes that are relatively stable over time, such as user demographics, post content, or location information. Dimension tables are in "wide" format with many columns but fewer rows. They connect to fact tables via primary keys.

Related: Fact Table, Star Schema, Primary Key

Fact Table

A database table containing measurable events or transactions.

Fact tables are the central tables in a star schema, containing quantitative data about events (views, likes, shares, logins). Each row represents a discrete event with foreign keys linking to dimension tables. They are typically in "long" format with fewer columns but many rows. Examples include user activity logs, post engagement metrics, and ad impressions.

Related: Dimension Table, Star Schema, Foreign Key

Partitioning

Dividing large database tables into smaller, more manageable segments.

Partitioning splits large tables into segments based on criteria like date, platform, or surface. This reduces query costs and processing time. For example, if a 90-day table is partitioned by day, a query filtered to a single day reads roughly 1/90th of the data it would read from an unpartitioned table, assuming the days are similar in size. Researchers must include partition filters in their queries to get this benefit.
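
The arithmetic behind the 1/90th figure can be sketched in a few lines of Python. The row counts and the function name here are assumptions chosen for illustration, not figures from any platform, and real query engines bill on bytes or compute rather than a simple row count.

```python
# Hypothetical table: 90 daily partitions of equal size (an assumption).
DAYS_IN_TABLE = 90
ROWS_PER_DAY = 1_000_000  # illustrative figure only


def rows_scanned(days_requested: int, partitioned: bool) -> int:
    """Rows a query has to read to return `days_requested` days of data."""
    if partitioned:
        # A partition filter lets the engine skip every other day.
        return days_requested * ROWS_PER_DAY
    # Without partitions the whole table is read, then filtered.
    return DAYS_IN_TABLE * ROWS_PER_DAY


full_scan = rows_scanned(1, partitioned=False)
pruned_scan = rows_scanned(1, partitioned=True)
cost_ratio = pruned_scan / full_scan

print(f"A one-day query reads {cost_ratio:.4f} of the full table")
```

The ratio only holds when the query actually filters on the partition column; a filter on any other column still reads every partition.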

Related: Data Warehouse, Query

Retention Window

The time period during which data is stored before being deleted.

Platforms set retention windows for different data types based on legal requirements, privacy policies, and storage costs. For example, a table might retain data for 30, 60, or 90 days. Queries requesting data outside the retention window may fail. Some sensitive data may have shorter retention periods.
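
Before running a query, it can help to check whether the requested date is still inside the window. The following is a minimal sketch; the function name and the inclusive boundary handling are assumptions, since each platform defines its own cutoff rules.

```python
from datetime import date, timedelta


def within_retention(requested: date, today: date, retention_days: int) -> bool:
    """Return True if `requested` is still inside the retention window."""
    oldest_kept = today - timedelta(days=retention_days)
    return oldest_kept <= requested <= today


# A fixed "today" keeps the example reproducible.
today = date(2025, 6, 30)
recent_ok = within_retention(date(2025, 6, 1), today, retention_days=30)
old_ok = within_retention(date(2025, 3, 1), today, retention_days=30)

print(recent_ok, old_ok)
```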

Related: Data Warehouse, Partitioning

Star Schema

A database design with a central fact table connected to dimension tables.

A star schema is a common data warehouse design pattern where a central fact table containing metrics is surrounded by dimension tables providing context. This design optimizes query performance and reduces data redundancy. Most major platforms (Facebook, TikTok, YouTube) use star schemas for their data warehouses.
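
A tiny star schema can be sketched with Python's built-in sqlite3 module. The table and column names (dim_user, fact_engagement, and so on) are hypothetical and do not reflect any platform's actual warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one descriptive row per user (hypothetical columns).
    CREATE TABLE dim_user (
        user_id      INTEGER PRIMARY KEY,
        country      TEXT,
        account_type TEXT
    );
    -- Fact table: one narrow row per event, linked to the dimension
    -- through the user_id foreign key.
    CREATE TABLE fact_engagement (
        event_id   INTEGER PRIMARY KEY,
        user_id    INTEGER REFERENCES dim_user(user_id),
        event_type TEXT,
        event_date TEXT
    );
    INSERT INTO dim_user VALUES (1, 'DE', 'creator'), (2, 'FR', 'personal');
    INSERT INTO fact_engagement VALUES
        (10, 1, 'like',  '2025-01-01'),
        (11, 1, 'share', '2025-01-01'),
        (12, 2, 'like',  '2025-01-02');
""")

# Join the fact table to the dimension to count events per country.
events_by_country = conn.execute("""
    SELECT d.country, COUNT(*) AS events
    FROM fact_engagement AS f
    JOIN dim_user AS d ON d.user_id = f.user_id
    GROUP BY d.country
    ORDER BY d.country
""").fetchall()

print(events_by_country)
```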

Related: Fact Table, Dimension Table, Data Warehouse

Technical Terms

API (Application Programming Interface)

A set of protocols that allow different software applications to communicate.

An Application Programming Interface allows external applications to access platform data and functionality programmatically. Platform APIs typically provide endpoints for retrieving posts, user information, and engagement metrics. Researchers use APIs to collect data for studies, though API access is often restricted and rate-limited.

Related: JSON, Rate Limiting, Endpoint

Browser Cookies

Small data files stored by websites in a user's browser to remember information.

Cookies are small text files that websites place on users' devices to store information such as login status, preferences, and tracking identifiers. First-party cookies are set by the site being visited, while third-party cookies are set by external services (often for advertising). Cookies enable personalization and analytics but raise privacy concerns, leading to regulations like GDPR requirements for cookie consent.

Related: Device Data, Geolocation
External resource: https://en.wikipedia.org/wiki/HTTP_cookie

Classifier

A machine learning model that categorizes content or users into predefined classes.

Classifiers are models trained to predict whether content or users belong to specific categories. Platforms use classifiers to predict policy violations (hate speech, spam, misinformation), user attributes (interests, location), and content quality. They generate probability scores that inform content moderation and recommendation decisions.
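
How a probability score might feed a moderation decision can be sketched as follows. The scores, thresholds, and action names are all assumptions for illustration; platforms choose their own cutoffs and actions.

```python
# Hypothetical classifier scores: the estimated probability that each
# post violates a policy.
scores = {"post_a": 0.97, "post_b": 0.62, "post_c": 0.08}

# Assumed thresholds, not taken from any platform.
REMOVE_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.50


def route(score: float) -> str:
    """Map a violation probability to a moderation action."""
    if score >= REMOVE_THRESHOLD:
        return "remove"
    if score >= REVIEW_THRESHOLD:
        return "human_review"
    return "allow"


decisions = {post: route(score) for post, score in scores.items()}
print(decisions)
```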

Related: Machine Learning, False Positive, False Negative

CSV (Comma-Separated Values)

A simple file format that stores tabular data in plain text.

CSV files store data as rows of values separated by commas, with each row representing a record. They are widely used for bulk data exports and can be easily opened in spreadsheet applications. CSV is less flexible than JSON but simpler for tabular data.
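
Reading a CSV export takes only Python's standard csv module. The column names and rows below are made up for illustration. Note that every value arrives as a string and has to be converted before doing arithmetic.

```python
import csv
import io

# A small in-memory CSV with a header row (hypothetical columns and rows).
raw = "user_id,username,followers\n123,researcher,1500\n456,analyst,320\n"

# DictReader uses the header row as the keys for each record.
rows = list(csv.DictReader(io.StringIO(raw)))

# CSV has no types: convert the text to int before summing.
total_followers = sum(int(row["followers"]) for row in rows)

print(rows[0]["username"], total_followers)
```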

Related: JSON, XML

Foreign Key

A field that references the primary key of another table.

A foreign key creates a link between two tables by referencing the primary key of another table. In a star schema, fact tables contain foreign keys (like user_id, post_id) that link to primary keys in dimension tables. This allows you to join tables and retrieve related information.

Related: Primary Key, JOIN, Fact Table

Geolocation

The identification of the real-world geographic location of a device or user.

Geolocation uses various signals including IP addresses, GPS coordinates, WiFi networks, and cell tower data to determine where a user is located. Platforms use geolocation for ad targeting, content localization, and detecting location-based deceptive behavior.

Related: IP Address, GPS Coordinates

GPS Coordinates

Geographic location data from satellite positioning systems.

GPS coordinates provide precise location information, typically accurate to within a few meters. Mobile devices share GPS data with platforms (when permitted) for location-based features. GPS coordinates are harder to spoof than IP addresses, though not impossible, which makes them valuable for verifying user locations.

Related: IP Address, Geolocation, Device Data

IP Address

A numerical label assigned to devices connected to a computer network.

IP addresses identify devices on networks and can be mapped to approximate location information, including country, region, city, and rough coordinates. Platforms use IP addresses to verify user locations, detect suspicious activity, and comply with regional regulations. IP addresses can be masked using VPNs or Tor.

Related: VPN, Geolocation, Device Data

JOIN

SQL operation that combines rows from two or more tables based on related columns.

A JOIN operation links data from multiple tables using matching key columns. Types include INNER JOIN (only matching rows), LEFT JOIN (all rows from the left table, plus matching rows from the right), and FULL OUTER JOIN (all rows from both tables). JOINs are essential for combining fact and dimension tables to get complete information about events.
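
The difference between join types is easiest to see on a small example. This sketch uses Python's built-in sqlite3 module with two hypothetical tables and covers only INNER JOIN and LEFT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, author TEXT);
    CREATE TABLE likes (like_id INTEGER PRIMARY KEY, post_id INTEGER);
    INSERT INTO posts VALUES (1, 'alice'), (2, 'bob');
    -- Only post 1 has likes; post 2 has none.
    INSERT INTO likes VALUES (100, 1), (101, 1);
""")

# INNER JOIN keeps only posts that have at least one matching like.
inner_rows = conn.execute("""
    SELECT p.post_id, l.like_id
    FROM posts AS p
    INNER JOIN likes AS l ON l.post_id = p.post_id
    ORDER BY p.post_id, l.like_id
""").fetchall()

# LEFT JOIN keeps every post; like_id is NULL (None) where nothing matches.
left_rows = conn.execute("""
    SELECT p.post_id, l.like_id
    FROM posts AS p
    LEFT JOIN likes AS l ON l.post_id = p.post_id
    ORDER BY p.post_id, l.like_id
""").fetchall()

print(inner_rows)
print(left_rows)
```

Post 2 disappears from the inner join but survives the left join with an empty like_id. This matters when counting posts with zero engagement.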

Related: SQL, Foreign Key, Primary Key

JSON (JavaScript Object Notation)

A lightweight data format commonly used for API responses.

JSON is a text-based data format that is easy for humans to read and machines to parse. It uses key-value pairs and arrays to structure data. Most platform APIs return data in JSON format. Example: {"user_id": 123, "username": "researcher", "followers": 1500}
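
Parsing the example above takes one call with Python's standard json module. This is a minimal sketch using that same record.

```python
import json

# The example record from the definition, as it would arrive in an API response.
payload = '{"user_id": 123, "username": "researcher", "followers": 1500}'

# json.loads turns the text into a Python dict with typed values.
record = json.loads(payload)

# json.dumps serializes it back to text; sort_keys makes the output stable.
round_trip = json.dumps(record, sort_keys=True)

print(record["username"], record["followers"] + 1)
print(round_trip)
```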

Related: API, CSV, XML

MAC Address (Media Access Control Address)

A unique hardware identifier assigned to network interface controllers.

A MAC address is a unique identifier assigned to a network interface controller (NIC) for use as a network address. Unlike IP addresses which can change, MAC addresses are typically permanent and tied to the device hardware. Platforms may collect MAC addresses as part of device fingerprinting for security, fraud prevention, and user identification across sessions.

Related: IP Address, Device Data
External resource: https://en.wikipedia.org/wiki/MAC_address

Machine Learning

Computer systems that learn from data to make predictions or decisions.

Machine learning encompasses algorithms that improve through experience. Platforms use various ML techniques including gradient boosted decision trees, random forests, support vector machines, and neural networks. These models power content recommendations, ad targeting, content moderation, and user behavior prediction.

Related: Classifier, Neural Network, Training Data

Prediction Models

Statistical or machine learning models that forecast outcomes based on input data.

Prediction models use historical data to forecast future outcomes or classify new observations. On platforms, these models predict user behavior (engagement, churn), content characteristics (policy violations, quality), and optimal actions (ad targeting, content ranking). Models are trained on labeled data and evaluated using metrics like precision, recall, and accuracy.

Related: Machine Learning, Classifier, Precision, Recall

Primary Key

A unique identifier for each row in a database table.

A primary key is a column (or combination of columns) that uniquely identifies each row in a table. In dimension tables, primary keys (like user_id or post_id) are used to join with foreign keys in fact tables. Primary keys must be unique and cannot be null.

Related: Foreign Key, JOIN, Dimension Table

Query

A request for data from a database, typically written in SQL.

A query is a command that retrieves, modifies, or analyzes data in a database. SQL queries allow researchers to select specific columns, filter rows based on conditions, join multiple tables, and calculate aggregations. Well-optimized queries are important when working with large platform datasets to minimize computational costs.

Related: SQL, SELECT, WHERE, JOIN

Rate Limiting

Restrictions on how many API requests can be made in a given time period.

Rate limits control how often users or applications can access an API. Platforms impose rate limits to prevent abuse and manage server load. Researchers often face strict rate limits (e.g., 500-1000 results per call, limited calls per day) that can make it difficult to collect sufficient data for studies.
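
A rough calculation shows why such limits constrain study design. The function name and the quota figures below are assumptions for illustration; real limits vary by platform and access tier.

```python
import math


def days_to_collect(total_items: int, results_per_call: int, calls_per_day: int) -> int:
    """Estimate how many days a collection takes under a daily call quota."""
    calls_needed = math.ceil(total_items / results_per_call)
    return math.ceil(calls_needed / calls_per_day)


# Hypothetical study: 1,000,000 posts, 500 results per call, 100 calls per day.
days_needed = days_to_collect(1_000_000, results_per_call=500, calls_per_day=100)

print(days_needed)
```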

Related: API, Quota

Relational Database

A database that organizes data into tables that can be linked to one another through key columns.

A relational database stores data in structured tables whose rows can be connected to rows in other tables through key identifier columns (primary and foreign keys). Relational databases use SQL for querying and managing data. They are widely used by platforms to store user data, content, and engagement metrics in an organized manner.

Related: Data Warehouse, Foreign Key, Primary Key, Star Schema

SQL (Structured Query Language)

A programming language for managing and querying relational databases.

SQL is the standard language for interacting with relational databases. It allows you to SELECT data, JOIN tables, filter with WHERE clauses, and aggregate with GROUP BY. Platform researchers use SQL to query large datasets, analyze user behavior, and extract insights from fact and dimension tables.
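
The clauses named above can be combined in a single query. This sketch runs against an in-memory SQLite database through Python's built-in sqlite3 module; the post_views table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post_views (post_id INTEGER, surface TEXT, views INTEGER);
    INSERT INTO post_views VALUES
        (1, 'feed', 120), (1, 'search', 30),
        (2, 'feed', 45),  (3, 'feed', 5);
""")

# SELECT chooses columns, WHERE filters rows, GROUP BY aggregates per post,
# and HAVING filters the aggregated groups.
feed_totals = conn.execute("""
    SELECT post_id, SUM(views) AS total_views
    FROM post_views
    WHERE surface = 'feed'
    GROUP BY post_id
    HAVING SUM(views) >= 10
    ORDER BY total_views DESC
""").fetchall()

print(feed_totals)
```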

Related: Query, Fact Table, Dimension Table, JOIN

Training Data

Labeled data used to teach machine learning models.

Training data consists of examples with known outcomes used to train predictive models. For content classifiers, this includes posts that human reviewers have labeled as violating or non-violating. The quality and representativeness of training data significantly affect model accuracy.

Related: Machine Learning, Classifier, Ground Truth

VPN (Virtual Private Network)

A service that encrypts internet traffic and masks the user's IP address.

VPNs create encrypted connections that hide a user's actual IP address and location. While VPNs have legitimate privacy uses, they can also be used to evade geographic restrictions or conduct deceptive activities. Platforms may treat detected VPN usage as one signal of potential inauthentic behavior.

Related: IP Address, Tor Browser, Location Masking

XML (eXtensible Markup Language)

A markup language for encoding documents in a format readable by humans and machines.

XML uses tags to define elements and their hierarchical relationships. It was commonly used by older APIs before JSON became prevalent. Some platforms still provide data exports in XML format. It is more verbose than JSON but supports schemas for validation.
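
For comparison with JSON, here is a sketch of a user record expressed in XML and parsed with Python's standard xml.etree.ElementTree module. The tag names are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical user record; each field is a nested element.
raw = (
    "<user>"
    "<user_id>123</user_id>"
    "<username>researcher</username>"
    "<followers>1500</followers>"
    "</user>"
)

root = ET.fromstring(raw)

# Flatten the child elements into a dict. Element text is always a string.
record = {child.tag: child.text for child in root}

print(root.tag, record)
```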

Related: JSON, API

Platform Concepts

Boost

Amplifying content's distribution or visibility in a ranked feed.

Boosting is the opposite of demotion - it increases content's visibility by ranking it higher in feeds and recommendations. Platforms may boost content that is high quality, authoritative, or aligns with platform goals (e.g., original content, content from verified sources). Paid promotion/advertising is a form of boosting where creators pay to increase their content's reach.

Related: Demotion, Recommendation System, Feed, Content Moderation

Content Moderation

The process of reviewing and managing user-generated content on platforms.

Content moderation involves monitoring, reviewing, and taking action on user-generated content to enforce platform policies. This includes removing content that violates terms of service, applying labels or warnings, reducing distribution of borderline content, and responding to user reports. Modern platforms use a combination of automated systems (classifiers) and human reviewers.

Related: Classifier, False Positive, False Negative, User Reports

Coordinated Inauthentic Behavior

Organized deceptive activity where accounts work together to mislead users.

Coordinated inauthentic behavior (CIB) involves groups of accounts working together while hiding their true identities or purposes. This can include fake accounts, bot networks, or real people operating multiple accounts. Platforms monitor for CIB by analyzing network patterns, device sharing, and behavioral similarities.

Related: FIMI, Bot, Inauthentic Behavior

Demotion

Reducing the distribution or visibility of content in a ranked feed.

Demotion is a content moderation action where content is not removed but its distribution is reduced. Demoted content appears lower in feeds and recommendations, reaching fewer users. Platforms use demotion for borderline content that doesn't clearly violate policies but may be low quality, misleading, or sensational. Demotion is less visible to users than removal but can significantly reduce content reach.

Related: Boost, Content Moderation, Recommendation System, Feed

Feed

A stream of content displayed to users, often personalized by algorithms.

A feed is a continuously updating list of content shown to users. Feeds can show content from accounts the user follows, algorithmically recommended content, or a mix. The composition of feeds is a major factor in what content users are exposed to and is central to debates about platform influence.

Related: Surface, Recommendation System, Algorithm

Mojo

The cutest 16 year old puppy in the world.

Mojo is a beloved canine companion who, despite being 16 years old, maintains all the charm and spirit of a puppy. His endearing nature and unwavering cuteness serve as a reminder that age is just a number when it comes to being adorable.

Recommendation System

Algorithms that suggest content to users based on predicted interests.

Recommendation systems use machine learning to predict what content, products, or accounts a user will find engaging. They analyze user behavior, content attributes, and network connections. These systems are central to platform business models but raise concerns about filter bubbles and amplification of harmful content.

Related: Algorithm, Feed, Machine Learning

Surface

Different areas or interfaces where users interact with a platform.

Surfaces are the various places within a platform where content appears. Examples include: a feed of followed accounts, a recommended content feed, profile pages, shopping sections, messaging areas, and search results. Platforms often partition data by surface for efficient querying and analysis.

Related: Feed, Recommendation System, Partitioning

User Reports

Complaints submitted by users about potentially violating content or accounts.

User reports are a key input for content moderation systems. When users report content as violating policies, it signals to the platform that the content may need review. High report rates can increase a post's violation probability score and trigger automated or human review.

Related: Content Moderation, Classifier

Research Methods

Disinformation

False information deliberately created and spread to deceive.

Disinformation is intentionally false or misleading information spread to cause harm or achieve a goal. It differs from misinformation (unintentionally false) in its deliberate nature. Platforms use classifiers and human review to identify and limit disinformation, especially around elections and public health topics.

Related: Misinformation, FIMI, Content Moderation

Engagement

User interactions with content such as likes, comments, shares, and clicks.

Engagement encompasses all ways users interact with content beyond just viewing it. Common engagement metrics include likes/reactions, comments, shares/retweets, saves, and click-throughs. High engagement can indicate content resonance but also controversy. Platforms use engagement data to inform recommendation algorithms.

Related: Reach, Views, Metrics

False Negative

When a model incorrectly predicts a negative result (Type II error).

In content moderation, a false negative occurs when violating content is incorrectly allowed to remain, for example when actual misinformation is not removed. False negatives can cause harm to users who are exposed to policy-violating content. Platforms must balance false positives and false negatives.

Related: False Positive, Classifier, Content Moderation
External resource: https://en.wikipedia.org/wiki/False_positives_and_false_nega...

False Positive

When a model incorrectly predicts a positive result (Type I error).

In content moderation, a false positive occurs when non-violating content is incorrectly flagged as violating, for example when a legitimate news post is removed as misinformation. Minimizing false positives is important because users get upset when their content is wrongly removed, but reducing them usually increases false negatives.
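
The trade-off can be seen by moving a classifier's decision threshold over a small labeled sample. The scores, labels, and thresholds below are invented for illustration.

```python
# Hypothetical labeled sample: (classifier score, truly violating?).
labeled = [
    (0.95, True), (0.80, False), (0.70, True),
    (0.40, True), (0.20, False), (0.10, False),
]


def error_counts(threshold: float) -> tuple[int, int]:
    """Return (false positives, false negatives) when flagging score >= threshold."""
    false_pos = sum(1 for score, bad in labeled if score >= threshold and not bad)
    false_neg = sum(1 for score, bad in labeled if score < threshold and bad)
    return false_pos, false_neg


lenient = error_counts(0.30)  # flags more content
strict = error_counts(0.90)   # flags less content

print(lenient, strict)
```

Raising the threshold from 0.30 to 0.90 removes the one false positive in this sample but lets two violating posts through.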

Related: False Negative, Classifier, Content Moderation
External resource: https://en.wikipedia.org/wiki/False_positives_and_false_nega...

FIMI (Foreign Information Manipulation and Interference)

Coordinated campaigns by foreign actors to manipulate information environments.

FIMI refers to coordinated efforts by foreign state or non-state actors to spread disinformation, manipulate public discourse, or interfere with democratic processes. The Russian Internet Research Agency's activities during the 2016 US election are a prominent example. Detecting FIMI requires analyzing account authenticity and coordination patterns.

Related: Disinformation, Coordinated Inauthentic Behavior

Ground Truth

The actual, verified correct answer or label used to train and evaluate models.

Ground truth refers to the known correct classification or value for a data point, typically determined by human review or authoritative sources. In content moderation, ground truth might be the final human decision on whether content violates policy. Ground truth is essential for training supervised learning models and evaluating their performance through comparison of predictions against known correct answers.

Related: Training Data, Precision, Recall, Machine Learning
External resource: https://en.wikipedia.org/wiki/Ground_truth

Precision

The proportion of positive predictions that are actually correct.

Precision measures the accuracy of positive predictions: of all items the model predicted as positive, what fraction actually are positive. In content moderation, high precision means when the system flags content as violating, it's usually correct. Precision = True Positives / (True Positives + False Positives). There is often a trade-off between precision and recall.
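
The formula translates directly into code. The counts in this sketch are made up for illustration.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP): the share of flagged items that were correct."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("precision is undefined when nothing was flagged")
    return true_positives / flagged


# Hypothetical audit: 100 posts flagged, 80 of them truly violating.
example_precision = precision(true_positives=80, false_positives=20)

print(example_precision)
```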

Related: Recall, False Positive, Classifier, Ground Truth
External resource: https://en.wikipedia.org/wiki/Precision_and_recall

Prevalence

The proportion of content or users exhibiting a particular characteristic.

Prevalence measures how common something is within a population or dataset. For example, the prevalence of illegal drug sales might be 0.01% of all posts. Even low prevalence rates can represent millions of instances on platforms with billions of users, making prevalence studies important for understanding harm at scale.
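
The scale effect is simple arithmetic. In this sketch, the sample size and the platform-wide total of 10 billion posts are assumptions for illustration; only the 0.01% rate comes from the example above.

```python
# Hypothetical random sample of posts reviewed by humans.
sample_size = 200_000
violating_in_sample = 20

# Prevalence is the violating share of the sample: 0.0001, or 0.01%.
prevalence = violating_in_sample / sample_size

# Projected onto an assumed platform-wide total of 10 billion posts.
total_posts = 10_000_000_000
estimated_violating_posts = round(prevalence * total_posts)

print(f"{prevalence:.2%} prevalence is about {estimated_violating_posts:,} posts")
```

A real estimate would also report a confidence interval, since 20 observed cases give a noisy rate.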

Related: Reach, Engagement

Reach

The number of unique users who were exposed to a piece of content.

Reach measures how many different users saw a piece of content at least once. It is distinct from impressions (total views, including repeat views). Reach is important for understanding the potential impact of content, especially when studying the spread of harmful information.
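
The distinction is easy to see in a small view log. The log format here, a list of (user, post) pairs, is a hypothetical simplification.

```python
# Hypothetical view log: one (user_id, post_id) entry per view event.
view_log = [
    ("u1", "p1"), ("u1", "p1"),
    ("u2", "p1"),
    ("u3", "p1"), ("u3", "p1"), ("u3", "p1"),
]

# Impressions count every view, including repeat views by the same user.
impressions = len(view_log)

# Reach counts each user once, no matter how often they saw the post.
reach = len({user_id for user_id, _post_id in view_log})

print(impressions, reach)
```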

Related: Engagement, Impressions, Views

Recall

The proportion of actual positives that were correctly identified.

Recall (also known as sensitivity) measures the model's ability to find all positive cases: of all items that actually are positive, what fraction did the model correctly identify. In content moderation, high recall means the system catches most of the violating content. Recall = True Positives / (True Positives + False Negatives). Increasing recall often decreases precision.
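
As a worked example of the formula, with made-up counts:

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN): the share of actual positives that were caught."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    return true_positives / actual_positives


# Hypothetical audit: 400 violating posts exist, and the system caught 80.
example_recall = recall(true_positives=80, false_negatives=320)

print(example_recall)
```

A system can score well on one metric and poorly on the other, which is why the two are usually reported together.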

Related: Precision, False Negative, Classifier, Ground Truth
External resource: https://en.wikipedia.org/wiki/Precision_and_recall