Understanding the challenges researchers face when working with platform data can help you prepare for and navigate these obstacles more effectively.
When talking with external researchers across the EDMO hubs, we found several common themes in their experiences with platform data access. Understanding these challenges is crucial for researchers planning to work with VLOPSE data under Article 40 of the DSA.
Note: While some of these issues are systemic and require policy-level solutions, being aware of them can help you plan your research timeline, adjust methodologies, and set realistic expectations.
Researchers struggle to gain approval to use the data access APIs. While nearly every researcher we spoke to had previously used VLOPSE tools that have since been deprecated, almost none had access to the newer replacements.
Some researchers who were granted access received it for parameters that did not match their original request.
Some APIs have very restrictive quotas: a researcher might be limited to 500 or 1,000 observations per API call, and to a small number of API calls per day.
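To see what these quotas mean in practice, here is a back-of-the-envelope calculation (the limits in the defaults are illustrative, not any specific platform's actual quota):

```python
# Rough estimate of how long a data pull takes under per-call and per-day
# quotas. The default limits are illustrative, not a real platform's values.
import math

def days_to_collect(target_obs, obs_per_call=1_000, calls_per_day=10):
    """Days needed to download `target_obs` observations under the quota."""
    calls_needed = math.ceil(target_obs / obs_per_call)
    return math.ceil(calls_needed / calls_per_day)

# Collecting one million observations at 1,000 per call and 10 calls per day
# would take 100 days of continuous querying.
print(days_to_collect(1_000_000))  # 100
```

Even a modestly sized dataset can take months to assemble under such limits, which is worth factoring into any research timeline.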
When studying relatively rare events (e.g., with prevalence rates below 2%), API calls may return no relevant content at all, which makes it difficult to measure the prevalence of harms.
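A quick illustration of why small call sizes and rare events combine badly (this assumes each call returns a uniform random sample, which real APIs generally do not guarantee):

```python
# Expected yield per call, and the chance a call returns *no* relevant items,
# assuming (unrealistically) that each call is a uniform random sample.
def expected_relevant(n, prevalence):
    """Expected number of relevant items in a call returning n items."""
    return n * prevalence

def prob_zero_relevant(n, prevalence):
    """Probability that none of the n returned items is relevant."""
    return (1 - prevalence) ** n

# A 500-item call at 0.1% prevalence yields ~0.5 relevant items on average
# and comes back with nothing relevant about 61% of the time.
print(expected_relevant(500, 0.001))             # 0.5
print(round(prob_zero_relevant(500, 0.001), 2))  # 0.61
```

Real API results are usually filtered or ranked rather than randomly sampled, which can make prevalence estimation harder still.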
Engagement counts in the downloaded data often do not match what is observed on the same piece of content on the platform. For example, a post might show 500 engagements in the data download but only 10 engagements when viewed live on the platform.
There may be a logging delay between when behavior occurs on the platform and when it is recorded in the data table behind the API, but we cannot confirm that this explains the differences between the downloaded data and what is observed live.
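Researchers who suspect such discrepancies can audit a sample systematically. A minimal sketch (the field names `api_count` and `live_count` are invented for illustration):

```python
# Flag posts whose downloaded engagement count diverges from the count
# observed live on the platform by more than a chosen relative tolerance.
def flag_discrepancies(records, tolerance=0.1):
    """records: list of dicts with 'post_id', 'api_count', 'live_count'."""
    flagged = []
    for r in records:
        live = r["live_count"]
        if live == 0 and r["api_count"] > 0:
            flagged.append(r["post_id"])
        elif live > 0 and abs(r["api_count"] - live) / live > tolerance:
            flagged.append(r["post_id"])
    return flagged

sample = [
    {"post_id": "a", "api_count": 500, "live_count": 10},   # large mismatch
    {"post_id": "b", "api_count": 102, "live_count": 100},  # within tolerance
]
print(flag_discrepancies(sample))  # ['a']
```

Keeping a log of flagged posts alongside screenshots of the live counts produces exactly the kind of documentation recommended below.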
Using the APIs is fairly technical and requires skills that many researchers lack (e.g., programming in Python).
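To give a sense of what "using the API" typically involves, here is a sketch of building an authenticated request with Python's standard library (the endpoint, parameters, and token are placeholders, not any real platform's API):

```python
# Build an authenticated GET request for a keyword search against a
# hypothetical research API. Endpoint and parameter names are invented.
import urllib.parse
import urllib.request

def build_request(base_url, token, query, limit=500):
    """Return a urllib Request carrying a bearer token and search params."""
    params = urllib.parse.urlencode({"query": query, "limit": limit})
    req = urllib.request.Request(f"{base_url}/search?{params}")
    req.add_header("Authorization", f"Bearer {token}")
    return req

req = build_request("https://api.example.com/v1", "YOUR_TOKEN", "climate")
print(req.full_url)
# https://api.example.com/v1/search?query=climate&limit=500
```

Actually sending the request, handling pagination, rate limits, and error responses adds further layers of complexity on top of this.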
Many variables that researchers want are not available in the APIs; generally, VLOPSEs expose only basic descriptive fields.
While these data are useful, they fall short. A researcher who wants to verify the accuracy of a VLOPSE's transparency report cannot do so, because the VLOPSEs do not provide access to the variables related to the systemic risks that researchers are supposed to be able to study.
Build significant buffer time into your research timeline for API access approval. Consider starting the application process 6-12 months before you need the data.
Keep detailed records of your API access requests, communications with platforms, and any discrepancies you find in the data. This documentation may be valuable for advocacy efforts.
Invest in learning Python, SQL, and API interaction techniques. This guide provides introductory resources on the SQL Guide and API Guide pages.
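As a first taste of what SQL skills buy you, here is a self-contained example using Python's built-in sqlite3 module (the table and column names are invented for illustration):

```python
# A minimal SQL example with Python's bundled sqlite3 module: create a toy
# table of posts, then query it. Table and columns are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id TEXT, engagements INTEGER)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [("a", 500), ("b", 10), ("c", 250)],
)

# Posts with more than 100 engagements, highest first.
rows = conn.execute(
    "SELECT post_id, engagements FROM posts "
    "WHERE engagements > 100 ORDER BY engagements DESC"
).fetchall()
print(rows)  # [('a', 500), ('c', 250)]
```

The same `SELECT ... WHERE ... ORDER BY` pattern carries over directly to the query interfaces some platforms offer.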
Explore publicly available datasets, academic data sharing programs, and transparency databases that may supplement or substitute for direct API access. See our Other Data page.
Connect with other researchers who have successfully obtained API access. Sharing knowledge about application processes and technical challenges can help the entire research community.
To achieve meaningful transparency into platform behavior, VLOPSEs need to provide the variables used to construct their transparency reports, which likely include the predictive variables used to classify content. Until that level of transparency is achieved, researchers will be forced to do the content labeling work themselves rather than building on the methodology the platforms already have in place.
Despite these challenges, understanding how platforms structure and use data can help you make the most of the access you do receive. Continue exploring this guide to learn about data structures, SQL queries, and API interactions.