New Pre-Print Published: Natural Language Querying of Biological Databases with Large Language Models

We are thrilled to announce that our colleague, Loes van den Biggelaar (The Hyve), together with Vladimir A. Makarov (Pistoia Alliance), Oleg Stroganov (Rancho Biosciences), Laura I. Furlong (MedBioInformatics Solutions), Brian Evarts (Crown Point Technologies), Alexandros Goulas (AbbVie Germany), Etzard Stolte (Roche), Derek Marren (AstraZeneca), and Lars Greiffenberg (AbbVie Germany), has contributed to a new pre-print paper titled "Natural Language Querying of Biological Databases with Large Language Models." The paper addresses a critical challenge in bioinformatics and pharmaceutical R&D: accurately translating natural language into structured query languages (NL2QL) while avoiding the factually false outputs, known as "hallucinations," that naive use of Large Language Models (LLMs) tends to produce.

To address this, the research team conducted a systematic evaluation of 21 different NL2QL architectures and strategies across five diverse LLMs. The testing was performed on a knowledge graph representing a subset of the Open Targets database, embedded via BioCypher. The study evaluated a continuum of approaches, ranging from inflexible template-based generation and standard Knowledge Graph Retrieval-Augmented Generation (KG-RAG) to dynamic, custom-built AI agent strategies and automated prompt optimization with the DSPy library and its MIPROv2 optimizer.
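To make the "inflexible" end of that continuum concrete, here is a minimal sketch of template-based query generation. It is an illustration only, not code from the pre-print: the node labels (`Target`, `Disease`), the `ASSOCIATED_WITH` relationship, the property names, and the `question_to_query` helper are all hypothetical and do not reflect the actual Open Targets or BioCypher schema.

```python
import re

# Hypothetical templates: each pairs a question pattern with a parameterized
# Cypher query. Labels and property names are illustrative assumptions, not
# the schema used in the pre-print.
TEMPLATES = [
    (
        re.compile(r"which (?:genes|targets) are associated with (?P<disease>.+?)\??$", re.I),
        "MATCH (t:Target)-[:ASSOCIATED_WITH]->(d:Disease) "
        "WHERE toLower(d.name) = toLower($disease) RETURN t.symbol",
    ),
    (
        re.compile(r"which diseases are associated with (?P<target>.+?)\??$", re.I),
        "MATCH (t:Target)-[:ASSOCIATED_WITH]->(d:Disease) "
        "WHERE toLower(t.symbol) = toLower($target) RETURN d.name",
    ),
]


def question_to_query(question):
    """Return (cypher, params) for the first matching template, else None."""
    text = question.strip()
    for pattern, cypher in TEMPLATES:
        match = pattern.match(text)
        if match:
            # Captured groups become query parameters, so user text is never
            # spliced into the Cypher string itself.
            return cypher, match.groupdict()
    return None


result = question_to_query("Which targets are associated with asthma?")
unmatched = question_to_query("What drugs treat asthma?")
```

A question that fits a template yields a syntactically safe query, but any phrasing outside the templates (the second call above) yields nothing at all. That trade-off is what the more flexible strategies in the study try to overcome.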

The core finding of the paper is that a multi-agent LLM strategy achieves the best balance between accuracy and flexibility. By deploying multiple LLM agents that can actively challenge each other's outputs and interact with a human user to clarify ambiguities, such as complex entity recognition and synonyms, the system mitigates common query syntax errors and schema misinterpretations. This collaborative agentic approach proved superior to the other techniques because it does not require prior hard-coded knowledge of the data structure, complex secondary databases, or rigid query templates.
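The idea of agents challenging each other's output can be pictured as a generate-and-critique loop. The sketch below is a loose illustration under stated assumptions, not the architecture from the pre-print: `refine_query`, `stub_generator`, and `stub_critic` are hypothetical names, the two stubs stand in for LLM-backed agents, and the human clarification step is left out for brevity.

```python
import re


def refine_query(question, generate, critique, max_rounds=3):
    """Alternate between a generator and a critic until the critic accepts."""
    feedback = None
    for _ in range(max_rounds):
        query = generate(question, feedback)
        feedback = critique(question, query)
        if feedback is None:  # the critic raised no objection
            return query
    raise RuntimeError(f"no accepted query after {max_rounds} rounds: {feedback}")


# Hypothetical node labels; a real critic agent would inspect the live schema.
KNOWN_LABELS = {"Target", "Disease"}


def stub_generator(question, feedback):
    # Stand-in for an LLM agent: the first attempt uses a wrong label,
    # and the attempt made after receiving feedback uses the corrected one.
    label = "Target" if feedback else "Gene"
    return f"MATCH (t:{label})-[:ASSOCIATED_WITH]->(d:Disease) RETURN t.symbol"


def stub_critic(question, query):
    # Stand-in for a second agent: flags node labels missing from the schema.
    used = set(re.findall(r"\(\w*:(\w+)\)", query))
    unknown = used - KNOWN_LABELS
    if unknown:
        return f"unknown node labels: {sorted(unknown)}"
    return None


accepted = refine_query(
    "Which targets are linked to asthma?", stub_generator, stub_critic
)
```

Here the critic rejects the first draft because it uses a label the schema does not contain, and the generator's second draft is accepted. In a real system both roles would be played by LLM calls, and unresolved ambiguities could be passed to the user instead of raising an error.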

For those interested in diving deeper into the methodology, error analysis, and the future of LLM benchmarks in life sciences, the complete pre-print is available to read online. In the spirit of open science and reproducibility, the Pistoia Alliance team has also made the accompanying public data and codebase available on GitHub.
