A new system for measuring data reuse and research impact

9 March 2026

DataSeer, in collaboration with The Michael J. Fox Foundation (MJFF), has developed a new LLM-based system designed to detect and quantify dataset reuse across the scholarly literature at scale. The DataSeer system removes a critical measurement bottleneck that has long prevented funders and institutions from capturing the downstream impact of shared research data.

For funders and institutions, data sharing is the cornerstone of accelerating discovery. MJFF has long stood at the forefront of this movement, serving as the implementation partner for the Aligning Science Across Parkinson’s (ASAP) initiative since 2020. Building on the success of ASAP’s progressive open science policies, MJFF began extending these open science practices across its entire research network in 2022. This commitment reached a new milestone in 2025 through a dataset reuse study commissioned alongside Strategies for Open Science (Stratos), which provided the evidence-based framework for this pioneering new system.

Developed by DataSeer in conjunction with its Open Science Indicator partner PLOS (with input from the broader Open Science community), and piloted across a corpus of 6,000 MJFF-funded articles, the new large language model addresses this measurement problem directly. Rather than relying on formal data citations or DOIs (which are often missing, incomplete, or inconsistently applied), the model analyzes the full text of articles to identify reuse even when datasets are referenced indirectly. This includes cases where authors cite accession numbers, URLs, or repository names, or describe reused datasets narratively without any formal identifier.
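To illustrate why approaches that depend on structured identifiers can miss narrative mentions, here is a minimal Python sketch of a pattern-based detector. It is not DataSeer's method: the patterns, the `find_identifiers` helper, and the two sample sentences are all hypothetical and chosen only for illustration.

```python
import re

# Illustrative patterns only. These are assumptions for this sketch,
# not the rules used by DataSeer or any particular tool.
PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
    "geo_accession": re.compile(r"\bGSE\d+\b"),
    "url": re.compile(r"https?://[^\s\"<>]+"),
}


def find_identifiers(text):
    """Return a dict mapping each pattern name to the matches found in text."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


# Hypothetical sentences: one cites an accession number, one describes reuse narratively.
formal = "Expression data were obtained from GEO (GSE12345)."
narrative = "We reanalyzed the cohort data previously shared by the consortium."

print(find_identifiers(formal))     # {'geo_accession': ['GSE12345']}
print(find_identifiers(narrative))  # {}
```

The second sentence describes reuse but contains no identifier, so a pattern-only pass returns nothing for it. Cases like this are the ones a full-text model is meant to catch.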

“Detecting dataset re-use is genuinely hard,” said Tim Vines, founder and CEO of DataSeer. “Traditional approaches that depend on structured identifiers typically find evidence of reuse in only about two percent of articles. When we applied our LLM to the MJFF corpus, we found clear evidence of data reuse in forty-three percent of articles. That gap confirms the broad perception that data reuse has always been happening but was effectively invisible.”

“For funders, there is growing interest in understanding not just what gets published, but how research outputs are used and reused over time,” said Josh Gottesman, Community Director for Research Data at MJFF. “The ability to systematically track data reuse gives us a new lens on openness, research integrity, and the downstream impact of our funding dollars—while also underscoring the critical contributions of researchers who generate data that enables future discoveries.”

Beyond improving measurement, the Data Reuse LLM enables a fundamental shift in how research impact can be understood. By tracing how datasets propagate through subsequent studies, the system makes it possible to quantify the influence of data itself, not just the papers that first describe it. This opens the door to treating datasets as first-class research outputs, on a par with articles, whose contribution to scientific progress can be systematically assessed.

At scale, robust measurement of data reuse provides concrete evidence of the value of open science policies and investments. As the Data Reuse LLM continues to evolve, DataSeer expects further gains in sensitivity, accuracy, and transparency. Their technology has the potential to support routine, large-scale assessments of data impact across disciplines, bringing long-overdue visibility to how shared data accelerates research and amplifies the return on research funding.