Agents R&D 16 Jul 2025

AI for literature reviews: what it searches, what it synthesises, what it gets wrong



A systematic literature review involves four distinct phases: finding relevant papers, screening them for inclusion, extracting the data or findings that matter, and synthesising across the set. Each phase has a different bottleneck, and AI addresses them with different levels of reliability. Understanding which phase you're automating — and what can go wrong — determines whether the tool helps or creates problems downstream.

Search and discovery

The search phase is where AI adds the most unambiguous value. Manual keyword searches miss synonyms, related concepts, and cross-disciplinary work that uses different terminology for the same idea. AI-assisted search uses semantic similarity rather than exact keyword matching, which means a search for papers on "machine learning for contract review" surfaces work published under "natural language processing for legal documents", "automated contract analysis", and related framings that a keyword search would miss.
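The mechanism behind semantic search can be sketched in a few lines: papers and queries are encoded as vectors, and relevance is measured by cosine similarity rather than keyword overlap. The toy three-dimensional vectors below are purely illustrative — real systems use embeddings with hundreds of dimensions produced by a trained encoder — but the ranking logic is the same.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors: 1.0 means
    # identical direction, values near 0 mean unrelated content.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- hypothetical values for illustration only.
query = [0.9, 0.1, 0.3]  # "machine learning for contract review"
papers = {
    "NLP for legal documents":     [0.80, 0.20, 0.40],
    "Automated contract analysis": [0.85, 0.15, 0.35],
    "Protein folding dynamics":    [0.10, 0.90, 0.20],
}

# Rank papers by semantic closeness to the query.
ranked = sorted(papers, key=lambda t: cosine_similarity(query, papers[t]),
                reverse=True)
```

Note that neither highly ranked paper shares a keyword with the query — that is exactly the gap semantic search closes.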

The practical tooling here is mature. Semantic Scholar's API provides free programmatic access to over 200 million academic papers with semantic search and citation graph data. PubMed covers biomedical literature with MeSH term expansion. arXiv is the primary preprint server for computer science, physics, and mathematics, with full-text access. For a given research question, an AI agent can query multiple databases simultaneously, deduplicate results, and return a ranked candidate set in minutes — a process that takes researchers hours when done manually across separate interfaces.
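The deduplication step matters because the same paper appears in multiple databases under slightly different metadata. A minimal sketch, assuming each database returns records as dicts with a title, an optional DOI, and a source label (the record shape is an assumption, not any particular API's format):

```python
def normalize_title(title):
    # Lowercase and strip punctuation/whitespace so trivially different
    # renderings of the same title compare equal.
    return "".join(ch for ch in title.lower() if ch.isalnum())

def deduplicate(records):
    """Merge search results from several databases.

    A shared DOI is the strongest match signal; records without a DOI
    fall back to a normalised-title comparison. Real pipelines add
    fuzzier matching (author/year, near-duplicate titles).
    """
    seen = {}
    for rec in records:
        key = rec.get("doi") or normalize_title(rec["title"])
        if key in seen:
            seen[key]["sources"].add(rec["source"])  # same paper, new source
        else:
            seen[key] = {**rec, "sources": {rec["source"]}}
    return list(seen.values())
```

Running this over the merged result sets gives one record per paper, each annotated with every database that returned it — useful later for the audit trail.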

Citation graph traversal is a particularly useful capability: given a set of seed papers you've identified as central to your question, the agent traces forward citations (papers that cite them) and backward citations (papers they cite), surfacing related work that a keyword search alone wouldn't find.
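Citation graph expansion is a breadth-first traversal. A sketch under simplified assumptions — the graph is given as two in-memory adjacency maps, whereas in practice you would fetch citation edges from an API such as Semantic Scholar's:

```python
from collections import deque

def expand_citation_graph(seeds, cites, cited_by, max_hops=2):
    """Breadth-first expansion from seed papers.

    `cites[p]` lists papers p cites (backward direction);
    `cited_by[p]` lists papers that cite p (forward direction).
    Returns every paper reachable within `max_hops` citation steps.
    """
    found = set(seeds)
    frontier = deque((p, 0) for p in seeds)
    while frontier:
        paper, hops = frontier.popleft()
        if hops == max_hops:
            continue  # don't expand beyond the hop limit
        for neighbour in cites.get(paper, []) + cited_by.get(paper, []):
            if neighbour not in found:
                found.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return found
```

Keeping `max_hops` small is deliberate: citation graphs fan out quickly, and two hops from a handful of seed papers can already produce hundreds of candidates for screening.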

Screening

Abstract screening — deciding whether a paper is relevant enough to read in full — is the most tractable task for AI automation. Given your inclusion and exclusion criteria, a language model can process thousands of abstracts and flag candidates for full-text review. The main risk is false negatives: papers screened out that should have been included. For high-stakes systematic reviews (clinical guidelines, regulatory submissions), AI screening should be treated as a first pass that reduces the human screening load, not a replacement for it. For exploratory reviews where recall is less critical, AI-only first-pass screening is generally acceptable.
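The "first pass that reduces the human load" pattern is a triage by model confidence: only high-confidence calls are decided automatically, and everything in between goes to a reviewer. A sketch, assuming the model returns a relevance score in [0, 1] — the threshold values are illustrative and should be tuned against a labelled sample from your own review:

```python
def triage(abstract_scores, include_at=0.85, exclude_at=0.15):
    """Route papers by a model's relevance score.

    Scores at or above `include_at` are auto-included, at or below
    `exclude_at` auto-excluded; the uncertain middle band is routed
    to human review so false negatives stay bounded.
    """
    decisions = {"include": [], "exclude": [], "human_review": []}
    for paper_id, score in abstract_scores.items():
        if score >= include_at:
            decisions["include"].append(paper_id)
        elif score <= exclude_at:
            decisions["exclude"].append(paper_id)
        else:
            decisions["human_review"].append(paper_id)
    return decisions
```

For high-stakes reviews, widening the human-review band (lowering `include_at`, raising `exclude_at`) trades screening effort for recall.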

Extraction

Data extraction — pulling specific variables from included papers (sample sizes, methods, effect sizes, limitations) — is where AI accuracy becomes more variable. For structured data presented in tables, extraction is reliable. For findings embedded in prose, particularly where authors hedge or qualify claims, extraction accuracy drops. The failure mode is subtle: the extracted data looks plausible but misrepresents what the paper actually found. Building in a spot-check protocol — human verification of a random sample of extractions — is important for any systematic review where extracted data will be used in meta-analysis or policy recommendations.
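A spot-check protocol is simple to operationalise: draw a reproducible random sample of extraction records for human verification. A minimal sketch — the 10% rate and the floor of five records are illustrative defaults, not a methodological standard:

```python
import random

def spot_check_sample(extractions, rate=0.1, minimum=5, seed=None):
    """Select extraction records for human verification.

    Samples `rate` of the records but never fewer than `minimum`
    (capped at the size of the set). Pass a fixed `seed` so the
    sample is reproducible and can be documented in your methods.
    """
    rng = random.Random(seed)
    n = max(minimum, round(len(extractions) * rate))
    n = min(n, len(extractions))
    return rng.sample(list(extractions), n)
```

If the spot check surfaces errors, the correct response is to widen the sample (or verify everything), not to patch only the records you happened to catch.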

Synthesis

Synthesis is the phase where AI is most useful as an assistant and most dangerous as an autonomous actor. A language model can identify thematic patterns across a set of papers, surface apparent contradictions between findings, and draft a narrative summary that a researcher can then verify and refine. What it can't reliably do is assess the methodological quality of the studies it's summarising, weight findings appropriately by study design and sample size, or identify when an apparent consensus in the literature reflects genuine evidence rather than citation cascades where later papers cite earlier ones without independent verification.

The specific risk to flag: AI models will sometimes generate plausible-sounding citations that don't exist, or attribute findings to papers that don't contain them. This is the hallucination problem applied to academic work, and the consequences in a published review are significant. Every citation in an AI-assisted synthesis needs to be verified against the actual source before submission. This isn't optional.
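Citation verification can be partially automated before the manual check: compare every citation in the draft against the set of papers that actually passed screening, and hand-verify anything that fails to match. A sketch, assuming citations and included papers are compared by title (real pipelines would also match on DOI):

```python
def verify_citations(draft_citations, included_titles):
    """Split a draft's citations into verified and suspect.

    `included_titles` holds the titles of papers that actually passed
    screening. Anything not found there is a candidate hallucination
    and must be checked by hand before submission.
    """
    def norm(title):
        return "".join(ch for ch in title.lower() if ch.isalnum())

    index = {norm(t) for t in included_titles}
    verified = [c for c in draft_citations if norm(c) in index]
    suspect = [c for c in draft_citations if norm(c) not in index]
    return verified, suspect
```

A title match only proves the paper exists in your corpus — it does not prove the paper says what the synthesis claims it says. That part remains a human job.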

Academic integrity considerations

Disclosure requirements for AI-assisted research are evolving rapidly and vary by journal and institution. As of 2026, most major publishers require disclosure of AI tool use in the methods section, and some prohibit listing AI systems as authors. Checking the specific policy of your target journal before submission is necessary — the landscape has changed substantially in the past two years and continues to evolve.

For corporate R&D teams and policy researchers using literature review methodology outside academic publishing, the integrity question is less about journal policy and more about traceability: can you show, if asked, which claims in your synthesis are grounded in which sources? Building an audit trail between AI-generated synthesis and the underlying papers is good practice regardless of whether it's formally required.
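The audit trail reduces to a claim-to-sources mapping. A minimal sketch — the claim dict structure is illustrative; the point is that every synthesised claim carries explicit pointers back to the underlying papers, and unsupported claims are surfaced for a reviewer to ground or cut:

```python
def build_audit_trail(claims):
    """Return (trail, unsupported) for a list of claim dicts.

    Each claim is {'id': ..., 'text': ..., 'sources': [paper ids]}.
    `trail` maps claim id -> source ids; `unsupported` lists claims
    with no grounding source at all.
    """
    trail = {c["id"]: c["sources"] for c in claims}
    unsupported = [c["id"] for c in claims if not c["sources"]]
    return trail, unsupported
```

Generating this mapping as the synthesis is drafted, rather than reconstructing it afterwards, is what makes the "can you show which claims rest on which sources?" question answerable in minutes instead of days.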

What the workflow looks like in practice

A useful setup for a research team: an AI agent configured to query Semantic Scholar, PubMed, and arXiv simultaneously against a defined research question, with citation graph expansion from seed papers; abstract screening against your inclusion criteria with human review of borderline cases; full-text extraction into a structured template; and a synthesis interface that lets you query across the extracted data and draft narrative sections with source citations that can be verified. The agent handles volume and consistency. Researchers handle judgement and verification.

Running literature reviews as part of R&D or policy work?

We build AI-assisted research workflows for teams that conduct systematic or scoping reviews — from multi-database search to structured extraction and verified synthesis. If you want to discuss what a practical setup looks like for your research context, a scoping conversation is the right starting point.

Let's talk about your research workflow →
Lino Moretto
RAAS Impact

Drawing on over 20 years of experience as a Fractional Innovation Manager, I love bridging diverse knowledge areas while fostering seamless collaboration among internal departments, external agencies, and providers. My approach is characterised by a collaborative and engaging management style, strong negotiation skills, and a clear vision for preemptively addressing operational risks.

No guesswork.
No slide decks.
Just impact.

Ready to move from AI hype to a working system? In a free 30-minute call we'll identify your highest-impact use case and tell you exactly what it takes to get there.

No upfront cost · Italy · Malta · Europe · English & Italian