Understanding the challenges researchers face when working with platform data can help you prepare for and navigate these obstacles more effectively.
When talking with external researchers across the EDMO hubs, we found several common themes in their experiences with platform data access. Understanding these challenges is crucial for researchers planning to work with VLOPSE data under Article 40 of the DSA.
Note: While some of these issues are systemic and require policy-level solutions, being aware of them can help you plan your research timeline, adjust methodologies, and set realistic expectations.
Researchers struggle to gain approval to use the data access APIs. While nearly every researcher we spoke to had previously used VLOPSE tools that have since been deprecated, almost none had access to the newer replacements.
Some researchers who were granted access received it for parameters that did not match their original request.
Some APIs have very restrictive quotas: a researcher might be limited to 500 or 1,000 observations per API call, and to a small number of API calls per day.
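To see what these quotas mean in practice, here is a back-of-the-envelope calculation (the limits in the defaults are illustrative, not any specific platform's actual quota):

```python
# Rough estimate of how long a data pull takes under per-call and per-day
# quotas. The default limits are illustrative, not a real platform's values.
import math

def days_to_collect(target_obs, obs_per_call=1_000, calls_per_day=10):
    """Days needed to download `target_obs` observations under the quota."""
    calls_needed = math.ceil(target_obs / obs_per_call)
    return math.ceil(calls_needed / calls_per_day)

# Collecting one million observations at 1,000 per call and 10 calls per day
# would take 100 days of continuous querying.
print(days_to_collect(1_000_000))  # 100
```

Even a modestly sized dataset can take months to assemble under such limits, which is worth factoring into any research timeline.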
When studying relatively rare events (e.g., with prevalence rates below 2%), API calls may return no relevant content at all, which makes it difficult to measure the prevalence of harms.
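A quick illustration of why small call sizes and rare events combine badly (this assumes each call returns a uniform random sample, which real APIs generally do not guarantee):

```python
# Expected yield per call, and the chance a call returns *no* relevant items,
# assuming (unrealistically) that each call is a uniform random sample.
def expected_relevant(n, prevalence):
    """Expected number of relevant items in a call returning n items."""
    return n * prevalence

def prob_zero_relevant(n, prevalence):
    """Probability that none of the n returned items is relevant."""
    return (1 - prevalence) ** n

# A 500-item call at 0.1% prevalence yields ~0.5 relevant items on average
# and comes back with nothing relevant about 61% of the time.
print(expected_relevant(500, 0.001))             # 0.5
print(round(prob_zero_relevant(500, 0.001), 2))  # 0.61
```

Real API results are usually filtered or ranked rather than randomly sampled, which can make prevalence estimation harder still.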
Engagement counts in the downloaded data often do not match what is observed on the same piece of content on the platform. For example, a post might show 500 engagements in the data download but only 10 engagements when viewed live on the platform.
There may be a logging delay between when behavior occurs on the platform and when it is recorded in the data table behind the API, but we cannot confirm that this explains the differences between the downloaded data and what is observed live.
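Researchers who suspect such discrepancies can audit a sample systematically. A minimal sketch (the field names `api_count` and `live_count` are invented for illustration):

```python
# Flag posts whose downloaded engagement count diverges from the count
# observed live on the platform by more than a chosen relative tolerance.
def flag_discrepancies(records, tolerance=0.1):
    """records: list of dicts with 'post_id', 'api_count', 'live_count'."""
    flagged = []
    for r in records:
        live = r["live_count"]
        if live == 0 and r["api_count"] > 0:
            flagged.append(r["post_id"])
        elif live > 0 and abs(r["api_count"] - live) / live > tolerance:
            flagged.append(r["post_id"])
    return flagged

sample = [
    {"post_id": "a", "api_count": 500, "live_count": 10},   # large mismatch
    {"post_id": "b", "api_count": 102, "live_count": 100},  # within tolerance
]
print(flag_discrepancies(sample))  # ['a']
```

Keeping a log of flagged posts alongside screenshots of the live counts produces exactly the kind of documentation recommended below.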
Using the APIs is fairly technical and requires skills that many researchers lack (e.g., programming in Python).
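To give a sense of what "using the API" typically involves, here is a sketch of building an authenticated request with Python's standard library (the endpoint, parameters, and token are placeholders, not any real platform's API):

```python
# Build an authenticated GET request for a keyword search against a
# hypothetical research API. Endpoint and parameter names are invented.
import urllib.parse
import urllib.request

def build_request(base_url, token, query, limit=500):
    """Return a urllib Request carrying a bearer token and search params."""
    params = urllib.parse.urlencode({"query": query, "limit": limit})
    req = urllib.request.Request(f"{base_url}/search?{params}")
    req.add_header("Authorization", f"Bearer {token}")
    return req

req = build_request("https://api.example.com/v1", "YOUR_TOKEN", "climate")
print(req.full_url)
# https://api.example.com/v1/search?query=climate&limit=500
```

Actually sending the request, handling pagination, rate limits, and error responses adds further layers of complexity on top of this.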
Many variables that researchers want are not available in the APIs; generally, VLOPSEs expose only basic descriptive fields.
While these data are useful, they fall short. A researcher who wants to verify the accuracy of a VLOPSE's transparency report cannot do so, because the VLOPSEs do not provide access to the variables related to the systemic risks that researchers are supposed to be able to study.
Build significant buffer time into your research timeline for API access approval. Consider starting the application process 6-12 months before you need the data.
Keep detailed records of your API access requests, communications with platforms, and any discrepancies you find in the data. This documentation may be valuable for advocacy efforts.
Invest in learning Python, SQL, and API interaction techniques. This guide provides introductory resources on the SQL Guide and API Guide pages.
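As a first taste of what SQL skills buy you, here is a self-contained example using Python's built-in sqlite3 module (the table and column names are invented for illustration):

```python
# A minimal SQL example with Python's bundled sqlite3 module: create a toy
# table of posts, then query it. Table and columns are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id TEXT, engagements INTEGER)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [("a", 500), ("b", 10), ("c", 250)],
)

# Posts with more than 100 engagements, highest first.
rows = conn.execute(
    "SELECT post_id, engagements FROM posts "
    "WHERE engagements > 100 ORDER BY engagements DESC"
).fetchall()
print(rows)  # [('a', 500), ('c', 250)]
```

The same `SELECT ... WHERE ... ORDER BY` pattern carries over directly to the query interfaces some platforms offer.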
Explore publicly available datasets, academic data sharing programs, and transparency databases that may supplement or substitute for direct API access. See our Other Data page.
Connect with other researchers who have successfully obtained API access. Sharing knowledge about application processes and technical challenges can help the entire research community.
To achieve meaningful transparency into platform behavior, VLOPSEs need to provide the variables used to construct their transparency reports, which likely include the predictive variables used to classify content. Until that level of transparency is achieved, researchers will be forced to do the content labeling work themselves rather than building on the methodology the platforms already have in place.
Despite these challenges, understanding how platforms structure and use data can help you make the most of the access you do receive. Continue exploring this guide to learn about data structures, SQL queries, and API interactions.