Platforms collect an immense amount of data using many different methods. Understanding these data types and how the datapoints can be used to predict behavior is essential for effective research and analysis.
Not every platform collects the same amount of data, and some of the data depends on the exact nature of the platform. In addition, not all data used by the platforms actually comes from their users. For example, web crawls can be used to create datasets, such as PageRank and web authority scores, which platforms could use.
Most platforms collect the following six main categories of data:

The most obvious of all of the data collected is that which the user explicitly provides when signing up for the platform or filling out their profile details.
More examples appear in the "User Shared / Provided" column of the table below.
These data are useful, but not sufficient to fully understand users and their preferences because users don't always tell the truth. My dog, Mojo, has had a social media account for 16 years, and he is listed as my son on one of the platforms. I won't admit to playing any part in creating his account or in somehow birthing a different species, but it is unlikely that he filled out the profile information himself! So, platforms collect a lot of other data that may corroborate or contradict what users say about themselves.

Some less obvious data collected includes a lot of information from the devices that the user uses to access the platform or search engine. As devices become more integrated in people's lives, the volume of data on them grows dramatically, making them a treasure trove for platforms wanting to better understand their users.
More examples appear in the "Extracted" column of the table below.
These data are much less likely to be manipulated by users trying to misrepresent themselves, making them very valuable for platforms trying to verify user authenticity and understand user behavior. For example, if Mojo the dog used his phone to access his social media account, the platform would know whether that same phone was also used by other accounts. If many accounts were using the same phone, that would be a red flag that Mojo's account might not be authentic!
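The shared-device check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(account_id, device_id)` login records, the function names, and the threshold of three accounts per device are all hypothetical, and real platforms likely combine many more signals.

```python
from collections import defaultdict

def accounts_per_device(logins):
    """Map each device ID to the set of account IDs seen on it."""
    device_accounts = defaultdict(set)
    for account_id, device_id in logins:
        device_accounts[device_id].add(account_id)
    return dict(device_accounts)

def flag_shared_devices(logins, max_accounts=3):
    """Return devices used by more than max_accounts distinct accounts.

    The threshold is an assumed value for illustration only.
    """
    return {
        device: accounts
        for device, accounts in accounts_per_device(logins).items()
        if len(accounts) > max_accounts
    }

# Hypothetical login records: (account_id, device_id)
logins = [
    ("mojo_the_dog", "phone-A"),
    ("owner", "phone-A"),
    ("acct_3", "phone-B"),
    ("acct_4", "phone-B"),
    ("acct_5", "phone-B"),
    ("acct_6", "phone-B"),
]

flagged = flag_shared_devices(logins, max_accounts=3)
print(flagged)  # only phone-B exceeds the threshold
```

A device shared by two accounts (such as a family phone) is left alone here; only the device with four accounts is flagged for a closer look.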

These platforms and search engines also log users' behaviors, including most of what a user does between opening and closing an app or website.
More examples appear in the "Created" and "Behaviors" columns of the table below.
These behavioral data are extremely valuable for platforms and search engines because the companies can use them to determine what kinds of content will keep users engaged and on their platform for longer periods of time. Time spent on platforms is often directly correlated with ad revenue, so these companies have a strong incentive to understand and optimize for user behavior. Additionally, many VLOPSEs' revenue is driven primarily by advertising, so if a company can predict which users are most likely to click on particular ads and buy something from an advertiser, it can charge advertisers more money to show ads to those users.
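To make "time spent" concrete, here is a minimal sketch of how raw event timestamps might be grouped into sessions. The 30-minute inactivity cutoff is a common analytics convention used here as an assumption, not something any particular platform is known to use, and the function name and sample events are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff

def session_lengths(timestamps, gap=SESSION_GAP):
    """Split one user's event times into sessions and return each session's duration."""
    ordered = sorted(timestamps)
    if not ordered:
        return []
    lengths = []
    start = previous = ordered[0]
    for current in ordered[1:]:
        if current - previous > gap:
            # A long pause closes the current session and starts a new one.
            lengths.append(previous - start)
            start = current
        previous = current
    lengths.append(previous - start)  # close the final session
    return lengths

# Hypothetical events for one user on one day
events = [
    datetime(2024, 10, 21, 10, 0),
    datetime(2024, 10, 21, 10, 5),
    datetime(2024, 10, 21, 10, 20),
    datetime(2024, 10, 21, 15, 0),
    datetime(2024, 10, 21, 15, 10),
]

lengths = session_lengths(events)
total_minutes = sum(lengths, timedelta()).total_seconds() / 60
print(len(lengths), total_minutes)
```

In this example the morning and afternoon activity form two separate sessions of 20 and 10 minutes, for 30 minutes of time spent in total.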

For prediction models (described in more detail under Data Category #5 below) to work, they often need external input to confirm that they are actually predicting what they are supposed to predict. For example, if models based on an individual user's behaviors predict that the user creates content other users want to see, but other users aren't engaging with it or are hiding it from their feeds, that tells the platform that its model is failing in at least some cases. Therefore, companies also collect data on how other users engage with content from individual users to better understand what that content is. If most people who see a post submit a report saying that the content violates some policy, then the platform might use that information to downrank or remove that content.
More examples appear in the "Received from Other Users" column of the table below.
Important prediction models are often built on data labeled by subject matter experts, which provides a "ground truth" for the models to be trained and evaluated against. For complex topics, like whether a particular post contains accurate medical information, it would be relatively expensive to pay experts who understand medicine well enough to determine whether 500,000 medical posts are accurate or inaccurate. So, platforms will often use cheaper data to monitor the ongoing validity of that model. Some of the cheapest data come from other users who are engaging with the content. For example, if a post contains dangerous medical misinformation (e.g., a claim that eating laundry detergent pods is safe for humans), the platform would hope that many users report that post as misinformation and as a dangerous challenge.
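One way to picture this kind of monitoring is to compare a model's score for each post with the rate at which viewers report it. The sketch below is hypothetical throughout: the field names (`views`, `reports`, `model_harm_score`) and all three thresholds are assumptions chosen for illustration, not values taken from any platform.

```python
def report_rate(views, reports):
    """Share of viewers who reported the post (0.0 when there are no views)."""
    return reports / views if views else 0.0

def find_disagreements(posts, score_threshold=0.2, rate_threshold=0.05, min_views=100):
    """Return IDs of posts the model scored as low-risk but users report heavily."""
    flagged = []
    for post in posts:
        if post["views"] < min_views:
            continue  # too few views for a stable report rate
        rate = report_rate(post["views"], post["reports"])
        if post["model_harm_score"] < score_threshold and rate > rate_threshold:
            flagged.append(post["id"])
    return flagged

# Hypothetical posts with model scores and user feedback counts
posts = [
    {"id": "p1", "views": 1000, "reports": 2, "model_harm_score": 0.05},
    {"id": "p2", "views": 800, "reports": 120, "model_harm_score": 0.10},
    {"id": "p3", "views": 40, "reports": 30, "model_harm_score": 0.10},
    {"id": "p4", "views": 500, "reports": 100, "model_harm_score": 0.90},
]

disagreements = find_disagreements(posts)
print(disagreements)
```

Posts flagged this way are candidates for expert review: the cheap user signal does not replace the expert labels, it only suggests where the model may be drifting from them.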

Because of the sheer volume of content posted by millions or billions of users, it is impossible for platforms to manually review each post before distributing that post. Furthermore, many platforms and search engines “personalize” results for each user, meaning that VLOPSEs need to be able to predict what results are relevant to and desired by each user. For example, if someone living in London asks a search engine for “restaurants near me,” modern search engines would likely return restaurants located in London and not restaurants in New York City. Similarly, an online short-form video platform will show videos related to each user's particular interests.
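The "restaurants near me" example can be illustrated with a simple distance filter. This is a toy sketch, not how any search engine actually ranks results: it assumes the user's coordinates are already known (for instance from the device data described earlier), uses the haversine great-circle formula, and the restaurant names, coordinates, and 10 km radius are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def restaurants_near(user_lat, user_lon, restaurants, radius_km=10.0):
    """Return names of restaurants within radius_km of the user, closest first."""
    with_distance = [
        (haversine_km(user_lat, user_lon, r["lat"], r["lon"]), r["name"])
        for r in restaurants
    ]
    return [name for dist, name in sorted(with_distance) if dist <= radius_km]

# Hypothetical restaurants; coordinates are approximate city-center points
restaurants = [
    {"name": "Example Diner (New York City)", "lat": 40.7128, "lon": -74.0060},
    {"name": "Example Curry House (London)", "lat": 51.5074, "lon": -0.1278},
    {"name": "Example Cafe (London)", "lat": 51.5136, "lon": -0.0983},
]

user_lat, user_lon = 51.5155, -0.0922  # assumed location of a user in London
nearby = restaurants_near(user_lat, user_lon, restaurants)
print(nearby)
```

Only the two London entries fall within the radius, so the New York City restaurant is never shown to this user.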
Advanced computational methods, like machine learning (e.g., gradient boosted decision trees, random forest models, support vector machines, neural networks), topic modeling (e.g., latent Dirichlet allocation, latent semantic indexing), and mixture models, are used to characterize entities (like users and posts) on VLOPSEs. Don't worry; you don't need to know the math behind these complicated tools to get started playing with platform data. The details of these specific models are far beyond the scope of this overview. However, you should understand the following basic points about these types of models:
These methods can be used to predict very concrete behaviors, like a click or posting a comment, but also more subjective characteristics of users or content, like the probability that a user is being deceptive about their location or that a piece of content may be harmful.
More examples appear in the "Inferred" column of the table below.

The companies that own many platforms and search engines also sometimes purchase data from third-party vendors to supplement what they are able to collect on their platform or from their users' devices.
More examples appear in the "Purchased" column of the table below.
These types of data tend to be particularly effective at better targeting ads to users who are more likely to purchase products or services. For example, if a user has a lower credit score, they may be more interested in ads for products that could help them improve their credit scores. Similarly, if a user has a history of donating to particular political parties, they may be more likely to click on fundraising ads from that political party in the future.
The table below provides a detailed breakdown of the six main categories of data that platforms collect from their users, showing specific examples of each type:
| User Shared / Provided | Extracted | Created | Behaviors | Received from Other Users | Inferred | Purchased |
|---|---|---|---|---|---|---|
| Name | Devices used | Views | Clicks | Views of their content | Sex/Gender | Shopping behavior |
| Profile picture | GPS Location | Reels | Searches | Clicks on profile | Sexual orientation | Monetizable value |
| Phone number | IP addresses | Shorts | Edits of posts/Comments | Clicks on posts | Race/ethnicity | Credit score range |
| | Device preferences | Video posts | Time spent | Clicks on comments | Religion | Net worth |
| Address | Phone number | Time | Sessions | Got reported | Income | Income estimate |
| Interests (e.g., cat videos, political pages) | Proximity to other accounts | Figure | Session length | Blocks | Politics | Estimated value |
| Pictures | Use on mobile vs desktop | Groups | Search queries | Time viewing their content | Age | Politics |
| Video posts | Whether using a VPN | Pages | Scroll time | Searches for account | Occupation | Home information |
| Text posts | Overlap between device and other accounts | Events | Horizontal scrolls | | Authoritative source | Property data |
| Comments | | Notifications clicks | Reporting others | | Authoritative health source | Charitable donation history |
| Language | | Follows | Survey participation | | Authoritative news source | Political donation history |
| Music | | Likes | Provide lightweight negative feedback | | Abusiveness | |
| Movies | | Reactions | Friend requests sent but ignored | | Spamminess | |
| About me | | Hides | Friend requests received, accepted, rejected, and/or ignored | | Fake | |
| Payment information | | Event RSVPs | Active last 7 days | | Public figure | |
| Birthday | | Friend requests sent | | | | |
| Privacy/public settings | | Blocks | | | | |
Source: EDMO Report - Platform Datasets (Table 1: Types of Data Collected by Platforms)
Note: To make the table easier to view on-screen, the Data Generated By User Behavior category is split across two columns. The first column mostly comprises 'creation' behaviors (e.g., creating a post or comment), and the second column mostly comprises more dynamic behaviors (e.g., friend request:friend acceptance ratio, reporting other users' content, editing previously created content).
📊 Comprehensive Data Overview:
For a deeper dive into this topic, refer to this somewhat more comprehensive spreadsheet that provides detailed examples of the types of data collected from: users, content, sessions, predictive models (machine learning, artificial intelligence, etc.), surveys, and networks.
JSON is the most common format for API responses. It is human-readable and easy to parse.

```json
{
  "id": "123456789",
  "username": "researcher",
  "text": "Sample post content",
  "created_at": "2024-10-21T10:30:00Z",
  "likes_count": 42,
  "hashtags": ["data", "research"]
}
```

CSV is common for bulk data downloads. It is easy to import into spreadsheets and databases.

```csv
id,username,text,created_at,likes_count
123456789,researcher,"Sample post",2024-10-21T10:30:00Z,42
```

XML is used by some older APIs and data exports.

```xml
<post>
  <id>123456789</id>
  <username>researcher</username>
  <text>Sample post content</text>
  <created_at>2024-10-21T10:30:00Z</created_at>
  <likes_count>42</likes_count>
</post>
```

Once you understand the data types and formats, you can:
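As one illustration, the three sample records above can be read with Python's standard library alone. The sketch below is a minimal example rather than a recommended pipeline: the variable names are hypothetical, and the sample strings simply repeat the JSON, CSV, and XML snippets shown earlier.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same sample post in each of the three formats shown above
json_text = (
    '{"id": "123456789", "username": "researcher", '
    '"text": "Sample post content", "created_at": "2024-10-21T10:30:00Z", '
    '"likes_count": 42, "hashtags": ["data", "research"]}'
)
csv_text = (
    "id,username,text,created_at,likes_count\n"
    '123456789,researcher,"Sample post",2024-10-21T10:30:00Z,42\n'
)
xml_text = (
    "<post><id>123456789</id><username>researcher</username>"
    "<text>Sample post content</text>"
    "<created_at>2024-10-21T10:30:00Z</created_at>"
    "<likes_count>42</likes_count></post>"
)

from_json = json.loads(json_text)                         # keeps numbers and lists typed
from_csv = next(csv.DictReader(io.StringIO(csv_text)))    # first data row as a dict
from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}

# CSV and XML carry every value as text, so numeric fields need explicit conversion.
from_csv["likes_count"] = int(from_csv["likes_count"])
from_xml["likes_count"] = int(from_xml["likes_count"])

print(from_json["likes_count"], from_csv["likes_count"], from_xml["likes_count"])
```

The main practical difference shows up in the types: JSON preserves the integer and the list of hashtags, while CSV and XML return strings that must be converted before analysis.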