Show Me The Data
Created by Matt Motyl

© 2025 Matt Motyl. All rights reserved.


Platform Data Types

Platforms collect an immense amount of data using many different methods. Understanding these data types and how the datapoints can be used to predict behavior is essential for effective research and analysis.

Six Categories of Data

Not every platform collects the same amount of data, and some of the data depends on the exact nature of the platform. In addition, not all data used by the platforms actually comes from their users. For example, web crawls can be used to create datasets, such as PageRank and web authority scores, which platforms could use.

Most platforms collect the following 6 main categories of data:

1. Data Explicitly Provided by the User

Example Bluesky Social profile showing user-provided data

The most obvious data collected are those that the user explicitly provides when signing up for the platform or filling out their profile details.

More examples include:

  • Name, profile picture, email, phone number, address
  • Interests, pages, groups followed
  • Language preferences, about me sections
  • Birthday, relationship status, occupation

These data are useful, but not sufficient to fully understand users and their preferences because users don't always tell the truth. My dog, Mojo, has had a social media account for 16 years, and he is listed as my son on one of the platforms. I won't admit to playing any part in creating his account or in somehow birthing a different species, but it is unlikely that he filled out the profile information himself! So, platforms collect a lot of other data that may corroborate or contradict what users say about themselves.

2. Data Extracted from User Devices

Clipart depicting different types of device data collected from users

Some less obvious data collected includes a lot of information from the devices that the user uses to access the platform or search engine. As devices become more integrated in people's lives, the volume of data on them grows dramatically, making them a treasure trove for platforms wanting to better understand their users.

More examples include:

  • Devices used, phone number identification
  • GPS location, IP addresses
  • MAC address
  • Internet service provider
  • Device overlap between accounts
  • Device preferences and settings
  • VPN usage detection
  • Browser cookies, cross-app tracking
  • Whether contacts on your phone are also users on the platform
  • Operating system version, app version
  • Whether a device is used to access the platform by many different accounts

These data are much less likely to be manipulated by users trying to misrepresent themselves, making them very valuable for platforms trying to verify user authenticity and understand user behavior. For example, if Mojo the dog used his phone to access his social media account, the platform would know whether that same phone was also used by other accounts. If many accounts were using the same phone, that would be a red flag that Mojo's account might not be authentic!
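
The device-overlap check described above can be sketched in a few lines of Python. This is only an illustration: the login records, the account and device names, and the cutoff of 3 accounts per device are all made up for the example, and real platforms likely combine many more signals.

```python
from collections import defaultdict

# Hypothetical login records: (account_id, device_id). All values are invented.
logins = [
    ("mojo_the_dog", "device_A"),
    ("matt", "device_A"),
    ("acct_3", "device_B"),
    ("acct_4", "device_B"),
    ("acct_5", "device_B"),
    ("acct_6", "device_B"),
    ("acct_7", "device_C"),
]


def accounts_per_device(records):
    """Map each device to the set of distinct accounts seen on it."""
    devices = defaultdict(set)
    for account_id, device_id in records:
        devices[device_id].add(account_id)
    return devices


def flag_shared_devices(records, max_accounts=3):
    """Return devices used by more than `max_accounts` distinct accounts.

    The cutoff is an arbitrary assumption chosen for this example.
    """
    return {
        device: sorted(accounts)
        for device, accounts in accounts_per_device(records).items()
        if len(accounts) > max_accounts
    }


flagged = flag_shared_devices(logins)
print(flagged)
```

Here Mojo's phone (device_A) is shared by only two accounts, so it is not flagged, while device_B is shared by four accounts and is.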

3. Data Generated by User Behavior

Clipart depicting different types of user behavior data

These platforms and search engines also log users' behaviors, including most of what a user does between opening and closing an app or website.

More examples include:

  • Posts, pictures, videos, comments, messages
  • Views, clicks, searches, scrolling behavior
  • Time spent on content, session length
  • Ad clicks, purchases, event RSVPs
  • Edits of posts/comments
  • Survey participation and responses

These behavioral data are extremely valuable for platforms and search engines because the companies can use these data to determine what kinds of content will keep users engaged and on their platform for longer periods of time. Time spent on platforms is often directly correlated with ad revenue, so these companies have a strong incentive to understand and optimize for user behavior. Additionally, many very large online platforms and search engines (VLOPSEs) earn most of their revenue from advertising, and if a company can predict which users are most likely to click on particular ads and buy something from an advertiser, it can charge advertisers more money to show ads to those users.
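
To make "session length" and "time spent" concrete, here is a rough sketch of how they might be derived from a log of event timestamps. The timestamps are invented, and the 30-minute inactivity cutoff used to split sessions is an assumption for this example, not a documented platform rule.

```python
from datetime import datetime, timedelta

# Assumed cutoff: a gap longer than this starts a new session.
SESSION_GAP = timedelta(minutes=30)

# Hypothetical event timestamps for one user (views, clicks, scrolls, ...).
events = [
    "2024-10-21T10:00:00",
    "2024-10-21T10:05:00",
    "2024-10-21T10:12:00",
    "2024-10-21T14:00:00",
    "2024-10-21T14:20:00",
]


def session_lengths(timestamps, gap=SESSION_GAP):
    """Split events into sessions by inactivity gap; return each session's duration."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    if not times:
        return []
    sessions = []
    start = prev = times[0]
    for t in times[1:]:
        if t - prev > gap:
            # The previous session ended at the last event before the gap.
            sessions.append(prev - start)
            start = t
        prev = t
    sessions.append(prev - start)
    return sessions


lengths = session_lengths(events)
total_minutes = sum(length.total_seconds() for length in lengths) / 60
print(lengths, total_minutes)
```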

4. Data Generated About User from Others

Clipart depicting different types of data generated by other users about a user

For prediction models (described in more detail in Data Category #5 below) to work, they often need external input to confirm that they are actually predicting what they are supposed to be predicting. For example, if the models based on an individual user's behaviors predict that the user creates content that other users want to see, but other users aren't engaging with it or are hiding it from their feeds, then that tells the platform that its model is failing in at least some cases. Therefore, companies also collect data on how other users engage with content from individual users to better understand what that content is. If most people who see a post submit a report saying that the content is violating some policy, then the platform might use that information to downrank or remove that content.

More examples include:

  • Got reported, got blocked
  • Views of their content, clicks on their profile
  • Friend requests sent but ignored/rejected
  • Searches for their account
  • Time other users spent viewing their content

Important prediction models are often built where subject matter experts label data to provide a "ground truth" for the models to be trained and evaluated against. For complex topics, like whether a particular post contains accurate medical information, it would be relatively expensive to pay experts who understand medicine well enough to determine if 500,000 medical posts are accurate or inaccurate. So, platforms will often use cheaper data to monitor the ongoing validity of that model. Some of the cheapest data come from other users who are engaging with the content. For example, if a post contains dangerous medical misinformation (e.g., eating laundry detergent pods is safe for humans), the platform would hope that many users report that post as misinformation and as a dangerous challenge.
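
One way to picture this kind of monitoring is to compare a model's score for each post against the rate at which viewers report it. The sketch below is a toy illustration under stated assumptions: the field names (`model_score`, `views`, `reports`), the data, and both cutoffs are hypothetical.

```python
# Hypothetical posts with a model's misinformation score and user report counts.
posts = [
    {"post_id": "p1", "model_score": 0.05, "views": 1000, "reports": 2},
    {"post_id": "p2", "model_score": 0.10, "views": 500, "reports": 150},
    {"post_id": "p3", "model_score": 0.90, "views": 800, "reports": 300},
    {"post_id": "p4", "model_score": 0.20, "views": 0, "reports": 0},
]


def report_rate(post):
    """Share of views that led to a report (0.0 when a post has no views)."""
    return post["reports"] / post["views"] if post["views"] else 0.0


def possible_misses(posts, score_cutoff=0.5, rate_cutoff=0.1):
    """Posts the model scored as low-risk but that users report at a high rate.

    Both cutoffs are arbitrary values picked for the example.
    """
    return [
        p["post_id"]
        for p in posts
        if p["model_score"] < score_cutoff and report_rate(p) > rate_cutoff
    ]


misses = possible_misses(posts)
print(misses)
```

Post p2 gets a low model score but 30% of its viewers report it, so it is surfaced as a case where the model may be failing.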

5. Inferences Based on Data (ML/AI)

Clipart depicting different types of data predicted about a user

Because of the sheer volume of content posted by millions or billions of users, it is impossible for platforms to manually review each post before distributing that post. Furthermore, many platforms and search engines “personalize” results for each user, meaning that VLOPSEs need to be able to predict what results are relevant to and desired by each user. For example, if someone living in London asks a search engine for “restaurants near me,” modern search engines would likely return restaurants located in London and not restaurants in New York City. Similarly, an online short-form video platform will show videos related to each user's particular interests.

Advanced computational methods, like machine learning (e.g., gradient boosted decision trees, random forest models, support vector machines, neural networks), topic modeling (e.g., latent Dirichlet allocation, latent semantic indexing), and mixture models, are used to characterize entities (like users and posts) on VLOPSEs. Don't worry; you don't need to know the math behind these complicated tools to get started playing with platform data. The details of these specific models are far beyond the scope of this overview. However, you should understand the following basic points about these types of models:

  • These models rely on using data to predict some specific outcome (e.g., the likelihood of a user to click on an ad).
  • The models are typically trained on existing data with known behavioral outcomes (e.g., records of which users did or did not click on an ad).
  • The models then integrate many other variables (also called features or predictor variables) that can be combined to predict the likelihood of the outcome.
  • These models will generate a score for each element that they are trying to characterize (e.g., each ad).
  • The models are evaluated based on how accurately the score they generated predicts the behavioral outcome of interest (e.g., if one ad receives a score of 0.80 and another ad receives a score of 0.10, the ad receiving a score of 0.80 should generate more clicks than the ad receiving a score of 0.10; otherwise, the model would be deemed poor because it doesn't improve the prediction of the behavior of interest).
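
The last point, checking whether higher scores really go with the outcome, can be sketched with a simple ranking metric. The example below computes the area under the ROC curve (AUC) by hand: the share of (clicked, not clicked) pairs in which the clicked item got the higher score. AUC is just one common choice, used here as an assumption; the scores and click labels are invented.

```python
def auc(scores, outcomes):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    `outcomes` holds 1 for a click and 0 for no click. Ties count as half.
    """
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one clicked and one unclicked example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores and observed outcomes for four ad impressions.
scores = [0.80, 0.65, 0.40, 0.10]
clicked = [1, 0, 1, 0]

print(auc(scores, clicked))
```

A value of 1.0 means every clicked item outscored every unclicked one, and 0.5 is what random scores would give on average, so a model near 0.5 would not be improving the prediction.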

These methods can be used to predict very concrete behaviors, like a click or posting a comment, but also more subjective characteristics of users or content, like the probability that a user is being deceptive about their location or that a piece of content may be harmful.

More examples include:

  • Gender, sex/sexual orientation, race/ethnicity predictions
  • Religion, politics, income estimates
  • Occupation, education level
  • Abusiveness scores, spamminess scores
  • Public figure status, authoritativeness
  • Whether account is fake or authentic
  • Whether a photo contains child sexual abuse material (CSAM)

6. External Data from Third Parties

Clipart depicting different types of third-party data collected about users

The companies who own many platforms and search engines also sometimes purchase data from third-party vendors to supplement what they are able to collect on their platform or from their users' devices.

More examples include:

  • Shopping behavior, monetizable value estimates
  • Credit score range, net worth
  • Home/property information
  • Political donation history
  • Charitable donation history

These types of data tend to be particularly effective at better targeting ads to users who are more likely to purchase products or services. For example, if a user has a lower credit score, they may be more interested in ads for products that could help them improve their credit scores. Similarly, if a user has a history of donating to particular political parties, they may be more likely to click on fundraising ads from that political party in the future.

Data Types & Examples in A Single Table

The table below provides a detailed breakdown of the six main categories of data that platforms collect from their users, showing specific examples of each type:

| User Shared / Provided | Extracted | Created | Behaviors | Received from Other Users | Inferred | Purchased |
| --- | --- | --- | --- | --- | --- | --- |
| Name | Devices used | Views | Clicks | Views of their content | Sex/Gender | Shopping behavior |
| Profile picture | GPS Location | Reels | Searches | Clicks on profile | Sexual orientation | Monetizable value |
| Phone number | IP addresses | Shorts | Edits of posts/Comments | Clicks on posts | Race/ethnicity | Credit score range |
| Email | Device preferences | Video posts | Time spent | Clicks on comments | Religion | Net worth |
| Address | Phone number | Time | Sessions | Got reported | Income | Income estimate |
| Interests (e.g., cat videos, political pages) | Proximity to other accounts | Figure | Session length | Blocks | Politics | Estimated value |
| Pictures | Use on mobile vs desktop | Groups | Search queries | Time viewing their content | Age | Politics |
| Video posts | Whether using a VPN | Pages | Scroll time | Searches for account | Occupation | Home information |
| Text posts | Overlap between device and other accounts | Events | Horizontal scrolls | | Authoritative source | Property data |
| Comments | | Notifications clicks | Reporting others | | Authoritative health source | Charitable donation history |
| Language | | Follows | Survey participation | | Authoritative news source | Political donation history |
| Music | | Likes | Provide lightweight negative feedback | | Abusiveness | |
| Movies | | Reactions | Friend requests sent but ignored | | Spamminess | |
| About me | | Hides | Friend requests received, accepted, rejected, and/or ignored | | Fake | |
| Payment information | | Event RSVPs | Active last 7 days | | Public figure | |
| Birthday | | Friend requests sent | | | | |
| Privacy/public settings | | Blocks | | | | |

Source: EDMO Report - Platform Datasets (Table 1: Types of Data Collected by Platforms)

Note: To make the table easier to view on-screen, the Data Generated by User Behavior category is split across two columns: the first column mostly comprises 'creation' behaviors (e.g., creating a post or comment), and the second column mostly comprises more dynamic behaviors (e.g., friend request:friend acceptance ratio, reporting other users' content, editing previously created content).

📊 Comprehensive Data Overview:

For a deeper dive into this topic, refer to this somewhat more comprehensive spreadsheet that provides detailed examples of the types of data collected from: users, content, sessions, predictive models (machine learning, artificial intelligence, etc.), surveys, and networks.

Common Data Formats

JSON (JavaScript Object Notation)

Most common format for API responses. Human-readable and easy to parse.

{
  "id": "123456789",
  "username": "researcher",
  "text": "Sample post content",
  "created_at": "2024-10-21T10:30:00Z",
  "likes_count": 42,
  "hashtags": ["data", "research"]
}
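
As a minimal sketch, the example above can be read in Python with the standard-library `json` module. The variable names are arbitrary, and the timestamp handling assumes the ISO 8601 format shown in the sample.

```python
import json
from datetime import datetime

# The sample API response from above, as a raw string.
raw = """
{
  "id": "123456789",
  "username": "researcher",
  "text": "Sample post content",
  "created_at": "2024-10-21T10:30:00Z",
  "likes_count": 42,
  "hashtags": ["data", "research"]
}
"""

post = json.loads(raw)  # JSON object -> Python dict

# Replace the trailing "Z" with an explicit UTC offset so that older Python
# versions can also parse the timestamp with fromisoformat().
created = datetime.fromisoformat(post["created_at"].replace("Z", "+00:00"))

print(post["username"], post["likes_count"], post["hashtags"], created.year)
```

Note that JSON keeps types: `likes_count` arrives as an integer and `hashtags` as a list, while `id` stays a string because it is quoted in the source.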

CSV (Comma-Separated Values)

Common for bulk data downloads. Easy to import into spreadsheets and databases.

id,username,text,created_at,likes_count
123456789,researcher,"Sample post",2024-10-21T10:30:00Z,42
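
A comparable sketch for the CSV sample uses the standard-library `csv` module. Unlike JSON, every CSV value is read as text, so numeric columns have to be converted explicitly; the conversion of `likes_count` below is one such step.

```python
import csv
import io

# The sample CSV from above, as a raw string.
raw = (
    "id,username,text,created_at,likes_count\n"
    '123456789,researcher,"Sample post",2024-10-21T10:30:00Z,42\n'
)

# DictReader uses the header row as the keys for each record.
rows = list(csv.DictReader(io.StringIO(raw)))

# CSV has no types, so convert the numeric column by hand.
for row in rows:
    row["likes_count"] = int(row["likes_count"])

print(rows[0])
```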

XML (eXtensible Markup Language)

Used by some older APIs and data exports.

<post>
  <id>123456789</id>
  <username>researcher</username>
  <text>Sample post content</text>
  <created_at>2024-10-21T10:30:00Z</created_at>
  <likes_count>42</likes_count>
</post>
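
The XML sample can be read in a similar way with the standard-library `xml.etree.ElementTree` module. This sketch flattens the child elements into a dictionary, which only works because the sample has no nesting or repeated tags.

```python
import xml.etree.ElementTree as ET

# The sample XML from above, as a raw string.
raw = """
<post>
  <id>123456789</id>
  <username>researcher</username>
  <text>Sample post content</text>
  <created_at>2024-10-21T10:30:00Z</created_at>
  <likes_count>42</likes_count>
</post>
"""

root = ET.fromstring(raw.strip())

# Each child element becomes a key/value pair; all values start out as text.
post = {child.tag: child.text for child in root}
post["likes_count"] = int(post["likes_count"])

print(post)
```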

Working with Platform Data

Once you understand the data types and formats, you can:

  • Use SQL to query and analyze structured data
  • Connect to APIs to retrieve real-time data
  • Transform and combine data from multiple sources
  • Visualize patterns and trends
  • Conduct statistical analysis and modeling
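
As a small taste of the first item, the sketch below loads a few made-up posts into an in-memory SQLite database and runs a SQL query from Python. The `posts` table, its columns (borrowed from the format samples above), and the rows are all hypothetical.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    "id TEXT, username TEXT, text TEXT, created_at TEXT, likes_count INTEGER)"
)

# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    [
        ("1", "researcher", "Sample post", "2024-10-21T10:30:00Z", 42),
        ("2", "researcher", "Another post", "2024-10-22T09:00:00Z", 8),
        ("3", "analyst", "Hello data", "2024-10-22T12:15:00Z", 15),
    ],
)

# Count posts and total likes per user, most-liked user first.
rows = conn.execute(
    "SELECT username, COUNT(*) AS n_posts, SUM(likes_count) AS total_likes "
    "FROM posts GROUP BY username ORDER BY total_likes DESC"
).fetchall()

print(rows)
conn.close()
```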