Platform Data Types


Platforms collect an immense amount of data using many different methods. Understanding these data types and how the datapoints can be used to predict behavior is essential for effective research and analysis.

Six Categories of Data

Not every platform collects the same amount of data, and some of the data depends on the exact nature of the platform. In addition, not all data used by the platforms actually comes from their users. For example, web crawls can be used to create datasets, such as PageRank and web authority scores, which platforms could use.

Most platforms collect the following 6 main categories of data:

1. Data Explicitly Provided by the User

The most obvious data are those that the user explicitly provides when signing up for the platform or filling out their profile details.

Examples include:

Name, profile picture, email, phone number, address
Interests, pages, groups followed
Language preferences, about me sections
Birthday, relationship status, occupation

These data are useful, but not sufficient to fully understand users and their preferences because users don't always tell the truth. My dog, Mojo, has had a social media account for 16 years, and he is listed as my son on one of the platforms. I won't admit to playing any part in creating his account or in somehow birthing a different species, but it is unlikely that he filled out the profile information himself! So, platforms collect a lot of other data that may corroborate or contradict what users say about themselves.

2. Data Extracted from User Devices

Clipart depicting different types of device data collected from users

Less obvious is the information collected from the devices that users use to access the platform or search engine. As devices become more integrated into people's lives, the volume of data on them grows dramatically, making them a treasure trove for platforms wanting to better understand their users.

More examples include:

Devices used, phone number identification
GPS location, IP addresses
MAC address
Internet service provider
Device overlap between accounts
Device preferences and settings
VPN usage detection
Browser cookies, cross-app tracking
Whether contacts on your phone are also users on the platform
Operating system version, app version
Whether a device is used to access the platform by many different accounts

These data are much less likely to be manipulated by users trying to misrepresent themselves, making them very valuable for platforms trying to verify user authenticity and understand user behavior. For example, if Mojo the dog used his phone to access his social media account, the platform would know whether that same phone was also used by other accounts. If many accounts were using the same phone, that would be a red flag that Mojo's account might not be authentic!
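The shared-device check described above can be sketched in a few lines. This is a minimal illustration, not any platform's actual method: the login log, the account and device identifiers, and the threshold of three accounts per device are all hypothetical assumptions.

```python
from collections import defaultdict

# Hypothetical login log of (account_id, device_id) pairs.
# The identifiers and the threshold below are illustrative assumptions.
logins = [
    ("mojo_the_dog", "device_A"),
    ("owner", "device_A"),
    ("acct_3", "device_A"),
    ("acct_4", "device_A"),
    ("friend", "device_B"),
]


def accounts_per_device(log):
    """Map each device to the set of distinct accounts seen on it."""
    devices = defaultdict(set)
    for account_id, device_id in log:
        devices[device_id].add(account_id)
    return devices


def flag_shared_devices(log, max_accounts=3):
    """Return devices used by more than `max_accounts` distinct accounts."""
    return sorted(
        device
        for device, accounts in accounts_per_device(log).items()
        if len(accounts) > max_accounts
    )


flagged = flag_shared_devices(logins)
print(flagged)
```

A flagged device is only a signal worth investigating. A family tablet or a library computer would trip the same rule, so a real system would likely combine it with other evidence.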

3. Data Generated by User Behavior

Clipart depicting different types of user behavior data

These platforms and search engines also log users' behaviors, including most of what a user does between opening and closing an app or website.

More examples include:

Posts, pictures, videos, comments, messages
Views, clicks, searches, scrolling behavior
Time spent on content, session length
Ad clicks, purchases, event RSVPs
Edits of posts/comments
Survey participation and responses

These behavioral data are extremely valuable for platforms and search engines because the companies can use them to determine what kinds of content will keep users engaged and on their platform for longer periods of time. Time spent on platforms is often directly correlated with ad revenue, so these companies have a strong incentive to understand and optimize for user behavior. Additionally, many very large online platforms and search engines (VLOPSEs) earn their revenue primarily from advertising, and if a company can predict which users are most likely to click on particular ads and buy something from an advertiser, it can charge advertisers more money to show ads to those users.
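To make "session length" and "time spent" concrete, here is a rough sketch of how they could be derived from a raw event log. The event names (`app_open`, `app_close`, `view_post`) and the log layout are assumptions made for illustration; real platform logs will differ.

```python
from datetime import datetime

# Hypothetical event log for one user: (ISO 8601 timestamp, event name).
events = [
    ("2024-10-21T10:30:00", "app_open"),
    ("2024-10-21T10:31:30", "view_post"),
    ("2024-10-21T10:42:00", "app_close"),
    ("2024-10-21T18:00:00", "app_open"),
    ("2024-10-21T18:05:00", "app_close"),
]


def session_lengths(log):
    """Pair each app_open with the next app_close; return lengths in seconds."""
    lengths = []
    opened_at = None
    for timestamp, name in log:
        moment = datetime.fromisoformat(timestamp)
        if name == "app_open":
            opened_at = moment
        elif name == "app_close" and opened_at is not None:
            lengths.append((moment - opened_at).total_seconds())
            opened_at = None
    return lengths


lengths = session_lengths(events)
total_seconds = sum(lengths)
print(lengths, total_seconds)
```

The sketch assumes that every session ends with an explicit close event. In practice, sessions are often closed by an inactivity timeout instead, which is a design choice that changes the resulting numbers.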

4. Data Generated About User from Others

Clipart depicting different types of data generated by other users about a user

For prediction models (described in more detail under Data Category #5 below) to work, they often need external input to confirm that they are actually predicting what they are supposed to predict. For example, if models based on an individual user's behaviors predict that the user creates content that other users want to see, but other users aren't engaging with it or are hiding it from their feeds, that tells the platform that its model is failing in at least some cases. Therefore, companies also collect data on how other users engage with content from individual users to better understand what that content is. If most people who see a post submit a report saying that the content violates some policy, then the platform might use that information to downrank or remove that content.

More examples include:

Got reported, got blocked
Views of their content, clicks on their profile
Friend requests sent but ignored/rejected
Searches for their account
Time other users spent viewing their content

Important prediction models are often built on data that subject matter experts have labeled, which provides a "ground truth" for the models to be trained and evaluated against. For complex topics, like whether a particular post contains accurate medical information, it would be relatively expensive to pay experts who understand medicine well enough to determine whether 500,000 medical posts are accurate or inaccurate. So, platforms will often use cheaper data to monitor the ongoing validity of that model. Some of the cheapest data come from other users who are engaging with the content. For example, if a post contains dangerous medical misinformation (e.g., a claim that eating laundry detergent pods is safe for humans), the platform would hope that many users report that post as misinformation and as a dangerous challenge.
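One way to picture this kind of monitoring is to compare a model's score with the rate at which users report a post. The sketch below is illustrative only: the record fields (`accuracy_score`, `views`, `reports`) and both thresholds are hypothetical, not values any platform is known to use.

```python
# Hypothetical per-post records: a model's predicted probability that the
# post is accurate, plus how many users saw it and how many reported it.
posts = [
    {"id": "p1", "accuracy_score": 0.95, "views": 1000, "reports": 2},
    {"id": "p2", "accuracy_score": 0.90, "views": 800, "reports": 240},
    {"id": "p3", "accuracy_score": 0.10, "views": 500, "reports": 200},
]


def report_rate(post):
    """Share of viewers who reported the post (0.0 if nobody saw it)."""
    return post["reports"] / post["views"] if post["views"] else 0.0


def disagreements(records, score_floor=0.8, rate_ceiling=0.1):
    """Posts the model rates as accurate but that users report heavily."""
    return [
        post["id"]
        for post in records
        if post["accuracy_score"] >= score_floor
        and report_rate(post) > rate_ceiling
    ]


needs_review = disagreements(posts)
print(needs_review)
```

Posts that land in this list are candidates for expert review. User reports are a cheap signal rather than ground truth, so a disagreement could mean that the model is wrong or that the reports are.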

5. Inferences Based on Data (ML/AI)

Clipart depicting different types of data predicted about a user

Because of the sheer volume of content posted by millions or billions of users, it is impossible for platforms to manually review each post before distributing it. Furthermore, many platforms and search engines “personalize” results for each user, meaning that VLOPSEs need to be able to predict what results are relevant to and desired by each user. For example, if someone living in London asks a search engine for “restaurants near me,” modern search engines would likely return restaurants located in London and not restaurants in New York City. Similarly, an online short-form video platform will show videos related to each user's particular interests.

Advanced computational methods, like machine learning (e.g., gradient boosted decision trees, random forest models, support vector machines, neural networks), topic modeling (e.g., latent Dirichlet allocation, latent semantic indexing), and mixture models, are used to characterize entities (like users and posts) on VLOPSEs. Don't worry; you don't need to know the math behind these complicated tools to start working with platform data. The details of these specific models are far beyond the scope of this overview. However, you should understand the following basic points about these types of models:

These models rely on using data to predict some specific outcome (e.g., the likelihood of a user to click on an ad).
The models are typically trained on existing data with known behavioral outcomes (e.g., records of which users clicked on an ad).
The models then integrate many other variables (also called features or predictor variables) that can be combined to predict the likelihood of the outcome.
These models will generate a score for each element that they are trying to characterize (e.g., each ad).
The models are evaluated based on how accurately the score they generate predicts the behavioral outcome of interest (e.g., if one ad receives a score of 0.80 and another ad receives a score of 0.10, the ad with the score of 0.80 should generate more clicks than the ad with the score of 0.10; otherwise, the model would be deemed poor because it doesn't improve the prediction of the behavior of interest).
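The evaluation idea in the last point can be sketched as a ranking check: across all pairs of one clicked and one unclicked item, how often did the clicked item receive the higher score? This pairwise measure is equivalent to the area under the ROC curve (AUC). It is one common way to evaluate such scores, not necessarily the one any given platform uses, and the scores and outcomes below are made up.

```python
def pairwise_auc(scores, outcomes):
    """Fraction of (clicked, unclicked) pairs where the clicked item
    scored higher. Ties count as half a win."""
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one clicked and one unclicked item")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores for four ads and whether each was clicked.
scores = [0.80, 0.65, 0.30, 0.10]

auc_good = pairwise_auc(scores, [1, 1, 0, 0])   # high scores were clicked
auc_mixed = pairwise_auc(scores, [1, 0, 1, 0])  # partly right
auc_bad = pairwise_auc(scores, [0, 0, 1, 1])    # high scores were ignored
print(auc_good, auc_mixed, auc_bad)
```

A value near 1.0 means the scores rank clicked items above unclicked ones, a value near 0.5 means the scores are no better than chance, and a value near 0.0 means the ranking is backwards.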

These methods can be used to predict very concrete behaviors, like a click or posting a comment, but also more subjective characteristics of users or content, like the probability that a user is being deceptive about their location or that a piece of content may be harmful.

More examples include:

Gender, sex/sexual orientation, race/ethnicity predictions
Religion, politics, income estimates
Occupation, education level
Abusiveness scores, spamminess scores
Public figure status, authoritativeness
Whether account is fake or authentic
Whether a photo contains child sexual abuse material (CSAM)

6. External Data from Third Parties

Clipart depicting different types of third-party data collected about users

The companies who own many platforms and search engines also sometimes purchase data from third-party vendors to supplement what they are able to collect on their platform or from their users' devices.

More examples include:

Shopping behavior, monetizable value estimates
Credit score range, net worth
Home/property information
Political donation history
Charitable donation history

These types of data tend to be particularly useful for targeting ads at users who are more likely to purchase products or services. For example, if a user has a lower credit score, they may be more interested in ads for products that could help them improve their credit score. Similarly, if a user has a history of donating to a particular political party, they may be more likely to click on fundraising ads from that party in the future.

Data Types & Examples in A Single Table

The table below provides a detailed breakdown of the six main categories of data that platforms collect from their users, showing specific examples of each type:

Note: To make the table easier to view on screen, the Data Generated By User Behavior category is split across two columns. The first column mostly contains 'creation' behaviors (e.g., creating a post, comment, etc.) and the second column mostly contains more dynamic behaviors (e.g., friend request:friend acceptance ratio, reporting other users' content, editing previously created content).

Data Explicitly Provided by the User	Data Extracted from User Devices	Data Generated By User Behavior (a)	Data Generated By User Behavior (b)	Data Generated About User from Others	Inferences Based on Data (ML/AI)	External Data from Third Parties
Name	Devices used	Views	Clicks	Views of their content	Sex/Gender	Shopping behavior
Profile picture	GPS Location	Reels	Searches	Clicks on profile	Sexual orientation	Monetizable value
Phone number	IP addresses	Shorts	Edits of posts/Comments	Clicks on posts	Race/ethnicity	Credit score range
Email	Device preferences	Video posts	Time spent	Clicks on comments	Religion	Net worth
Address	Phone number	Time	Sessions	Got reported	Income	Income estimate
Interests (e.g., cat videos, political pages)	Proximity to other accounts	Figure	Session length	Blocks	Politics	Estimated value
Pictures	Use on mobile vs desktop	Groups	Search queries	Time viewing their content	Age	Politics
Video posts	Whether using a VPN	Pages	Scroll time	Searches for account	Occupation	Home information
Text posts	Overlap between device and other accounts	Events	Horizontal scrolls		Authoritative source	Property data
Comments		Notifications clicks	Reporting others		Authoritative health source	Charitable donation history
Language		Follows	Survey participation		Authoritative news source	Political donation history
Music		Likes	Provide lightweight negative feedback		Abusiveness
Movies		Reactions	Friend requests sent but ignored		Spamminess
About me		Hides	Friend requests received, accepted, rejected, and/or ignored		Fake
Payment information		Event RSVPs	Active last 7 days		Public figure
Birthday		Friend requests sent
Privacy/public settings		Blocks

Source: EDMO Report - Platform Datasets (Table 1: Types of Data Collected by Platforms)

📊 Comprehensive Data Overview:

For a deeper dive into this topic, refer to this somewhat more comprehensive spreadsheet that provides detailed examples of the types of data collected from: users, content, sessions, predictive models (machine learning, artificial intelligence, etc.), surveys, and networks.

Common Data Formats

JSON (JavaScript Object Notation)

Most common format for API responses. Human-readable and easy to parse.

{
  "id": "123456789",
  "username": "researcher",
  "text": "Sample post content",
  "created_at": "2024-10-21T10:30:00Z",
  "likes_count": 42,
  "hashtags": ["data", "research"]
}
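As a minimal sketch, the sample record above can be read with Python's standard `json` module. The variable names are illustrative, and the timestamp format string assumes the exact layout shown in the sample.

```python
import json
from datetime import datetime

# The sample record from above, as it might arrive in an API response body.
raw = """
{
  "id": "123456789",
  "username": "researcher",
  "text": "Sample post content",
  "created_at": "2024-10-21T10:30:00Z",
  "likes_count": 42,
  "hashtags": ["data", "research"]
}
"""

post = json.loads(raw)

# JSON keeps basic types: likes_count is an int and hashtags is a list,
# but the id is a string here and the timestamp still needs parsing.
created = datetime.strptime(post["created_at"], "%Y-%m-%dT%H:%M:%SZ")

print(post["username"], post["likes_count"], created.date())
```

Note that the `id` in this sample is a quoted string rather than a number. Keeping identifiers as strings avoids precision problems with very large integers in some tools.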

CSV (Comma-Separated Values)

Common for bulk data downloads. Easy to import into spreadsheets and databases.

id,username,text,created_at,likes_count
123456789,researcher,"Sample post",2024-10-21T10:30:00Z,42
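The same two lines can be read with Python's standard `csv` module, as sketched below. Unlike JSON, CSV carries no type information, so every value arrives as text and numeric columns need an explicit conversion.

```python
import csv
import io

# The sample CSV from above, held in a string instead of a file.
raw = (
    "id,username,text,created_at,likes_count\n"
    '123456789,researcher,"Sample post",2024-10-21T10:30:00Z,42\n'
)

# DictReader uses the header row as the keys for each record.
rows = list(csv.DictReader(io.StringIO(raw)))

first = rows[0]
likes = int(first["likes_count"])  # every CSV value is read as a string

print(first["username"], first["text"], likes)
```

When reading from a real file, the documented pattern is `open(path, newline="")`, which lets the `csv` module handle line endings inside quoted fields.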

XML (eXtensible Markup Language)

Used by some older APIs and data exports.

<post>
  <id>123456789</id>
  <username>researcher</username>
  <text>Sample post content</text>
  <created_at>2024-10-21T10:30:00Z</created_at>
  <likes_count>42</likes_count>
</post>
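A short sketch of reading the sample above with Python's standard `xml.etree.ElementTree` module follows. Flattening the child elements into a dictionary works for this simple record but is an assumption that would not hold for nested or repeated elements.

```python
import xml.etree.ElementTree as ET

# The sample XML record from above.
raw = """
<post>
  <id>123456789</id>
  <username>researcher</username>
  <text>Sample post content</text>
  <created_at>2024-10-21T10:30:00Z</created_at>
  <likes_count>42</likes_count>
</post>
"""

root = ET.fromstring(raw.strip())

# Flatten the child elements into a plain dictionary of tag -> text.
post = {child.tag: child.text for child in root}

# As with CSV, element text is always a string and needs converting.
post["likes_count"] = int(post["likes_count"])

print(root.tag, post["username"], post["likes_count"])
```

For XML from untrusted sources, the Python documentation warns that the standard parsers are not hardened against maliciously constructed data, so extra care is warranted.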

Working with Platform Data

Once you understand the data types and formats, you can:

Use SQL to query and analyze structured data
Connect to APIs to retrieve real-time data
Transform and combine data from multiple sources
Visualize patterns and trends
Conduct statistical analysis and modeling
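As a small taste of the first item, the sketch below loads a few post records into an in-memory SQLite database and summarizes them with SQL. The table layout mirrors the sample record used in the format examples, and the extra rows are invented for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE posts (
        id TEXT PRIMARY KEY,
        username TEXT,
        text TEXT,
        created_at TEXT,
        likes_count INTEGER
    )
    """
)

# Hypothetical rows; only the first matches the sample record above.
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    [
        ("123456789", "researcher", "Sample post content", "2024-10-21T10:30:00Z", 42),
        ("123456790", "researcher", "A second post", "2024-10-22T09:00:00Z", 8),
        ("123456791", "analyst", "Another user's post", "2024-10-22T11:15:00Z", 15),
    ],
)

# Count posts and total likes per user, most-liked user first.
summary = conn.execute(
    """
    SELECT username, COUNT(*) AS n_posts, SUM(likes_count) AS total_likes
    FROM posts
    GROUP BY username
    ORDER BY total_likes DESC
    """
).fetchall()

conn.close()
print(summary)
```

The `?` placeholders let the database driver handle quoting, which matters once the inserted text comes from real user content.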

