Show Me The Data

Created by Matt Motyl

© 2025 Matt Motyl. All rights reserved.


Mapping Data

How are these different categories of data combined and used to predict specific outcomes?

🔗 Introduction

In the previous section, we described the different types of data that Very Large Online Platforms and Search Engines (VLOPSEs) collect and store about their users and content. However, collecting and storing data is only the first step in using those data to make decisions about what content to recommend (or not recommend) and to whom. Much of this next step involves combining different types of data in ways that allow the platform to make better predictions about users and content without requiring a human to review the billions (or trillions!) of pieces of content that are created.

Companies have an incentive to make accurate predictions about content and users for several reasons. First, users get upset if their posts are incorrectly removed or filtered, so platforms want to minimize false positives in content moderation. Yet, by minimizing false positives, they also increase the number of false negatives (i.e., recommending or allowing harmful content to remain on the platform). This is a difficult balance to strike: platforms are successful when users create content that other users spend time engaging with, so removing too much content reduces the amount of content that keeps users on the platform (and, therefore, reduces the number of ads users will see and the revenue generated for the company). However, users may also leave the platform if they see too much content that they do not like or that upsets them. The decision-making process on these trade-offs is beyond the scope of this guide, but it is important to recognize and is something I discuss in my work as a technology consultant and expert witness.

On this page, I will walk through two examples of how platforms and search engines may combine different types of data to make predictions: whether an account is trying to mislead about its location, and whether a piece of content is likely to violate platform policies.

🔍 Examples

📍 Predicting Whether a User is Misleading About Their Location

VLOPSEs generally have an incentive to use machine learning models to predict the location of users. This is very useful for ad targeting purposes. For example, a user might report living in one location, but spend a lot of time in another location. Advertisers might want to target that user based on where they spend the most time, and so platforms will generally keep track of both where users report to live and also try to predict the locations of their users.

Sometimes, the signals received from users about their location might be contradictory. For example, the user may self-report living in one country, but only log in to the service from IP addresses or provide GPS locations from another. Sometimes this is understandable: if the user lives under a government that doesn't respect human rights, they may need to mask their true location or use VPN services to hide their activity. But this can also be a signal that the user is trying to engage in deceptive activities.

🗳️ Use Case: Election Interference

One strategy used to interfere with a foreign country's elections involves an interested group creating fake accounts which appear to originate in the target country where the election is taking place. The foreign group will use these accounts to spread their propaganda, with the hopes of influencing or interfering with the election of the target country. Sometimes this is to advance a specific policy agenda, and sometimes this is to generally create chaos and confusion.

This is precisely one of the actions taken by the Russian Internet Research Agency (IRA) during the 2016 U.S. presidential election. The IRA created thousands of fake accounts on multiple platforms that pretended to be U.S. citizens, and used those accounts to post content that was intended to sow political discord among Americans and influence the election. Many of these accounts were crafted to appear as if they belonged to regular American citizens, complete with American-sounding names, profile pictures, and posts about everyday life in the U.S. However, upon investigation, the IP addresses and GPS coordinates indicated these accounts were based in Russia and belonged to a Kremlin-linked troll farm founded by Yevgeny Prigozhin, head of the Russian private military company Wagner.

These events triggered numerous investigations and led platforms to adopt significant changes in how they monitor accounts, particularly when those accounts appear to be coordinating with each other and creating content around sensitive events like elections. Therefore, VLOPSEs remain skeptical of the self-reported location that is provided when an account is created.

Using Data to Determine User Location Authenticity

Given the massive amount of data VLOPSEs collect, there are many variables that could be useful in trying to estimate where an account is actually based. What data would you want if you were tasked with determining whether an account is being truthful about their location?

I created the diagram below highlighting some of the types of data a platform might collect from their users, and highlighted several variables that I would want if I were trying to predict whether an account is being truthful about their location.

How might a company infer a user's location? This diagram shows the process from data collection through model training to prediction.
Device and Network Data

Platforms typically extract massive amounts of data from the device or devices that an account uses to access the platform. Each of these devices has an IP address, and IP addresses can be mapped to a lot of explicit location information (e.g., country, region, city, approximate latitude and longitude, and telephone area code), along with more implicit indicators:

  • Internet service provider: Many providers are regional, so if someone claims to be from one region where that service provider is unavailable, that is suspicious
  • Network type: Whether the network is residential, commercial, governmental, mobile, etc. If a government network is being used for many accounts that are posting about an election in some faraway country without making their government affiliation clear, that is suspicious
  • Location masking: Whether the user is connected to a VPN or proxy, or is using a Tor browser. There are many valid reasons to use VPNs and Tor. However, they are also useful for illegal and deceptive activities
  • Time zone: If the IP-based time zone does not match the stated location time zone, or if the user's activity is at unusual times for the time zone
  • Language preferences: If the typical language for an IP differs from the dominant language in a region
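The IP-derived signals above could be checked with simple rules before any statistical model is involved. The sketch below is purely illustrative: the field names (`country`, `network_type`, `vpn_or_tor`, `timezone`) are invented for this example and do not come from any real platform's schema.

```python
# Hypothetical sketch: flag contradictions between an account's stated
# country and signals derived from its IP metadata. All field names are
# illustrative, not from any real platform schema.

def ip_mismatch_flags(stated_country, ip_info):
    """Return a list of suspicious-signal labels for one login record."""
    flags = []
    if ip_info.get("country") and ip_info["country"] != stated_country:
        flags.append("ip_country_mismatch")
    if ip_info.get("network_type") in {"datacenter", "government"}:
        flags.append("non_residential_network")
    if ip_info.get("vpn_or_tor"):
        flags.append("location_masking")
    if ip_info.get("timezone") and ip_info["timezone"] != ip_info.get("stated_timezone"):
        flags.append("timezone_mismatch")
    return flags

flags = ip_mismatch_flags(
    "US",
    {"country": "RU", "network_type": "datacenter", "vpn_or_tor": True,
     "timezone": "UTC+3", "stated_timezone": "UTC-5"},
)
# Four flags raised -- still not proof of deception on its own
```

No single flag here is conclusive; as discussed below, each one merely feeds into a broader risk assessment.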
GPS Coordinates

Because IP addresses can also be spoofed using VPNs, Tor browsers, and other tools, they shouldn't be relied on as the sole input in predicting someone's location. Fortunately for these platforms, the devices that accounts use to access the platforms increasingly include precise GPS coordinates that can be accurate enough to identify the device's location to within a few tens of meters (based on satellites, cellular towers, or wifi signals).

Content Analysis

VLOPSEs may also integrate other variables, like information about the types of content that a user posts:

  • Language: Is the post written in the predominant language of a given region?
  • Idioms: Are the idioms used consistent with the regional dialect?
  • Events: Do they post about events that occur near their supposed location?

Important Note: None of these signals can be used in isolation to make a definitive assessment of a location mismatch between stated and inferred country. A predictive model might examine many of these variables across accounts that had previously been confirmed to be misreporting their location, and then generate a score for each account based on those variables. It is important to combine many variables (across many prediction models) because there are justifiable reasons why someone's device GPS may say they are somewhere different from where they usually access the platform, such as when they are on vacation.

If this location prediction model is accurate, then accounts that score high may be subject to manual review by content moderators or have rate limits applied. This can be justified because accounts with a mismatch between stated and inferred location have significantly higher rates of posting policy-violating content or engaging in policy-violating behaviors. Location verification is certainly relevant to whether an account is authentic, trustworthy, and likely to contribute positive (or negative) value to a platform, but it is typically not against a platform's terms of service to leave location unverified, and a location mismatch alone normally won't warrant deleting or suspending the account.
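One way to combine many weak signals into a single score is a logistic function over weighted signals, with high scores routed to manual review rather than automatic enforcement. This is only a sketch of the idea: the weights and bias below are invented for illustration, where a real platform would learn them from labeled historical accounts.

```python
import math

# Hypothetical sketch: combine several weak location signals into one
# risk score with a logistic function. The weights are invented for
# illustration; a real model would learn them from labeled accounts.
WEIGHTS = {
    "ip_country_mismatch": 1.2,
    "location_masking":    0.6,
    "timezone_mismatch":   0.8,
    "gps_far_from_home":   1.5,
}
BIAS = -2.5  # most accounts should score low by default

def location_risk(signals):
    """signals: dict of signal name -> 0/1. Returns a probability-like score."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in signals.items())
    return 1 / (1 + math.exp(-z))

score = location_risk({"ip_country_mismatch": 1, "location_masking": 1,
                       "timezone_mismatch": 1, "gps_far_from_home": 1})
if score > 0.5:
    action = "queue_for_manual_review"   # not automatic suspension
else:
    action = "no_action"
```

Note that even an account raising every signal is only queued for review, mirroring the point above that a location mismatch alone does not warrant suspension.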

⚠️ Predicting Whether Content is Policy Violating

Just as countries have laws stating what behaviors are permissible within their borders, online platforms and search engines have policies for what behaviors and content are allowed on their sites and apps. Again, though, there are far too many pieces of content created every day for human moderators to review all of them and enforce those policies the way a police officer might. So, platforms have to build tools to help them determine whether a piece of content is likely to violate their policies or is likely to cause harm to their users.

Platforms develop many predictive models to identify the likelihood that a piece of content violates any of their many different policies, and the data that are useful differ from policy to policy. For example, a model predicting the likelihood of a post being spam will likely include features such as the account's posting rate (e.g., do they post or comment too frequently?) and the diversity of the post's contents (e.g., are they copying and pasting the same post or comment in lots of places?).
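The two spam features just mentioned, posting rate and content diversity, are straightforward to compute. The sketch below shows one plausible way, with invented function names and toy data:

```python
from collections import Counter

# Hypothetical sketch of two spam-related features: how fast an account
# posts, and how repetitive its recent posts are.

def posting_rate(timestamps, window_seconds=3600):
    """Posts per hour over the span of the given UNIX timestamps."""
    span = max(timestamps) - min(timestamps)
    return len(timestamps) / max(span / window_seconds, 1e-9)

def duplicate_ratio(posts):
    """Fraction of posts that are exact copies of an earlier post."""
    counts = Counter(posts)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(posts)

posts = ["buy now!", "buy now!", "buy now!", "hello friends"]
ratio = duplicate_ratio(posts)   # 2 of the 4 posts are repeats -> 0.5
rate = posting_rate([0, 1800, 3600])   # 3 posts in one hour -> 3.0
```

A real system would use fuzzier similarity measures (near-duplicate detection rather than exact matches), but the principle is the same: repetitive, high-rate posting raises the spam score.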

Yet, those data might not be particularly helpful in trying to determine whether a post contains child sexual abuse material or is inciting violence. A model predicting the likelihood of a post containing incitement to violence would likely include information about the language that the post is in, whether that post contains specific hateful phrases or words, and whether the author of the post has posted policy violating content in the past.

Key Predictive Variables

Let's return to our earlier example regarding a user's location. As stated in the previous section, a difference between an account's inferred location and its self-reported location is, most of the time, insufficient to justify taking action against the account. However, if the inferred location is substantially different from the self-reported location, that increases the risk that the account will cause harm. It is critical to highlight the qualifier substantially different, because it is common for people to access platforms on their devices as they move about their community; accessing a platform from 5 kilometers away from one's home is normal. If someone is always accessing the platform from 10,000 kilometers away from their home, that is suspicious, and it is an indicator of elevated risk.
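The "substantially different" check above amounts to computing the great-circle distance between the stated home location and the inferred login location. A standard way to do that is the haversine formula; the coordinates and the 10,000 km threshold below follow the example in the text and are illustrative only.

```python
import math

# Hypothetical sketch: great-circle distance between a self-reported home
# location and an inferred login location, using the haversine formula.

def haversine_km(lat1, lon1, lat2, lon2):
    """Distance in kilometers between two (latitude, longitude) points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Stated home: New York City; inferred login location: Sydney
dist = haversine_km(40.71, -74.01, -33.87, 151.21)  # roughly 16,000 km
suspicious = dist > 10_000  # far beyond normal day-to-day movement
```

A 5 km difference would fall far below the threshold, while a persistent ~16,000 km gap like this one would contribute to the account's risk score.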

What data would you want if you were tasked with determining whether an account is lying about their location and likely to be a malicious actor? I've added to the previous diagram some additional variables that could help in evaluating whether an account is violating a policy.

How might a company predict whether a piece of content is policy violating? This diagram shows the process from feature collection through model training to content moderation.
  1. How recently the account was created: Some people create new or "burner" accounts so that they can engage in behavior that they wouldn't want associated with their primary account. Therefore, low account age is a factor that increases the risk of an account being violative.
  2. Have other violating accounts logged in using the same device: If an account accesses the platform on a device associated with other violating accounts, or with an older account that was terminated for violations, then the new account on a known device is also likely riskier, because it could represent a user who previously violated policies making a new account.
  3. User reports: If a post or an account is being reported by other users for violating the terms of service or for engaging in violative behavior (e.g., posting harmful misinformation, trying to incite violence, harassing protected groups), then that increases the likelihood that that account is engaging in violative behavior.

Therefore, features such as these may be entered into a predictive model to estimate the likelihood of a post violating policies. Again, a model like this would be trained on previously collected data where the outcome is known (i.e., a post was evaluated and found to be violating or not). The model calculates a score from some combination of these features and predicts whether each post is violating or not. If the model differentiates well between violating and non-violating content, it is then put into operation and run on new, incoming content where it is not yet known whether the content is violating.
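The train-then-deploy loop just described can be sketched in miniature. The example below fits a tiny logistic-regression model by gradient descent on invented, moderator-labeled historical posts, then scores new posts; every feature name, data point, and hyperparameter is made up for illustration.

```python
import math

# Hypothetical sketch of the training loop described above: fit a tiny
# logistic-regression model on historical posts that moderators already
# labeled violating (1) or non-violating (0), then score new posts.
# Features per post (all invented): [account_is_new, shared_device_with_banned,
# user_report_count].

def train(examples, labels, lr=0.1, epochs=500):
    """Stochastic gradient descent on logistic loss; returns weights and bias."""
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            p = 1 / (1 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def score(x, w, b):
    """Predicted probability that a post with features x is violating."""
    return 1 / (1 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))

# Invented historical data with known moderator verdicts
X = [[1, 1, 5], [1, 0, 3], [0, 1, 2], [0, 0, 0], [0, 0, 1], [1, 0, 0]]
y = [1, 1, 1, 0, 0, 0]
w, b = train(X, y)

risky = score([1, 1, 4], w, b)   # new post resembling past violations
safe = score([0, 0, 0], w, b)    # new post resembling clean history
```

Production systems use far richer models, but the workflow is the same: learn from posts with known outcomes, validate that the model separates the two classes, then score incoming content whose status is unknown.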

📉 Other Data Reduction Considerations

In addition to using fact and dimension tables, VLOPSEs further reduce the expense and time required to process the data by partitioning or splitting the data into tables by:

  1. Platform: If a company owns multiple platforms, they may have separate tables for each platform.
  2. Surface: Most platforms have different ways for users to interact with the platform, and these different ways are often referred to as "surfaces." Examples of surfaces could include, but are not limited to, a feed limited to accounts the user follows, a feed for recommended content for the user, an account profile page, a shopping section, a messaging section.
  3. Date: In addition to the retention periods, VLOPSEs may also partition large tables by day. In other words, if a table has a 90 day retention period and is partitioned by day, then the cost of querying data from one day in the table is 1.1% (1 / 90) of what it would be if the table weren't partitioned by day.
  4. Sensitivity: More sensitive data may be stored in separate tables where access is limited to people with the appropriate approval.

For Researchers

While external researchers accessing platform data likely won't need to do much data reduction with the data they obtain, they will need to be mindful of how the data are structured and partitioned to be able to query the data.

If a query requests data from a table outside of the partition window, the query may fail or provide an inaccurate result. Once a researcher knows that a table is partitioned on a variable, then it is straightforward to add a line to the query that filters the data based on the partition. For example, if a table is partitioned by date, the researcher could add a filter such as WHERE date = '2024-12-12'.
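A query builder can enforce that partition filter so no one accidentally scans the whole table. The sketch below is illustrative: the table and column names are invented, and real code should use the query engine's parameter binding rather than string interpolation to avoid injection risks.

```python
# Hypothetical sketch: always filter on the partition column so the query
# engine scans one day's partition instead of the whole 90-day table.
# Table and column names are invented for this example.

def daily_query(table, date):
    """Build a query restricted to a single date partition."""
    return (
        f"SELECT post_id, report_count "
        f"FROM {table} "
        f"WHERE date = '{date}'"  # partition filter: scans 1 of 90 partitions
    )

sql = daily_query("user_reports", "2024-12-12")
```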
