Why Large Language Models can’t figure out scientific research yet

Victor Botev, CTO and Co-Founder at Iris.ai, explains why ChatGPT remains baffled by published technical papers

If you have even a moderate interest in tech, it’s hard to imagine you’ve not seen the phrase ‘generative AI’ numerous times over the last few months. This technology has got everybody excited: not just for the consumer or enterprise applications, but also for what it might do in the field of scientific research.

However, decision-makers interested in the scientific applications of Natural Language Processing should consider whether the tools they’re applying are reliable, accurate, and focused on delivering value – because large models like ChatGPT simply aren’t there yet.

The problems at hand

There’s an enormous – and growing – amount of scientific research out there, and the explosion in volume is making it difficult for corporate and academic researchers to find relevant research from the specialised corners of their own field – let alone others.

In light of that, it’s easy to see how the potential for AI-powered research assistants, empowered with the ability to receive instructions and speak in natural language, has become such an eagerly anticipated technology.

The truth is that the enormous models (usually known as Large Language Models, or LLMs) produced by companies like Microsoft or Google are not suitable for this task. They’re often inaccurate, misinformed, or outright incorrect, especially when it comes to scientific research. That is why Meta’s scientific assistant model ‘Galactica’ was shut down within three days.

The importance of domain-specific training

Scientific fields are, by their nature, specialised domains with their own terminology and nuanced definitions. Organisations in each field need accurate results from tools that understand the specifics and idiosyncrasies of that domain. LLMs are not, and will not be, able to capture these nuances within the next couple of years.

The immense running costs of LLMs and their reliance on volumes of data that may not even exist in certain fields are compounded by the lack of fact-checking software advanced enough to measure the quality of their output, let alone begin to fix it.

By improving fact-checking, we can unlock better training, better specialisation, and domain adaptation for these models, as well as drive down costs and make them more accessible. However, this takes time.

AI and scientific language

There’s a huge gap between natural speech and scientific text. ChatGPT and other LLMs are fantastic at interpreting and interacting with regular human language, but they struggle to maintain accuracy when ‘autobiography’ becomes ‘autotroph’, and ‘eureka’ becomes ‘eukaryote’.

ChatGPT has been trained with millions upon millions of scraped text samples, but how many internet forums or blog posts contain obscure scientific terminology? And of those, how many even use it correctly?
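One rough way to put a number on that question is to check how many terms from a domain glossary ever appear in a sample of general text. The sketch below is illustrative only: `term_coverage`, the two sample sentences, and the two-word glossary are all hypothetical, and a real audit would need a far larger corpus and a check on whether each term is used correctly, not merely present.

```python
import re


def term_coverage(corpus_texts, glossary):
    """Return (fraction of glossary terms seen in the corpus, missing terms).

    Matching is a simple case-insensitive whole-word lookup; this is an
    assumption made for illustration, not a measure of correct usage.
    """
    if not glossary:
        raise ValueError("glossary must contain at least one term")

    # Collect every lowercase word that occurs anywhere in the corpus.
    seen = set()
    for text in corpus_texts:
        seen.update(re.findall(r"[a-z]+", text.lower()))

    missing = [term for term in glossary if term.lower() not in seen]
    coverage = (len(glossary) - len(missing)) / len(glossary)
    return coverage, missing


# Hypothetical stand-ins for scraped general-purpose text.
general_text = [
    "I wrote my autobiography last year.",
    "Eureka! The eukaryote sample arrived.",
]
coverage, missing = term_coverage(general_text, ["autotroph", "eukaryote"])
print(coverage, missing)
```

A low coverage score on a real corpus would suggest the model has seen little of the field's vocabulary, which is the gap described above.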

For scientific research, these giant datasets just don’t exist, and we need to change our approach: going from quantity to quality. Bespoke, high-quality datasets of even 50-100 examples, selected and tailored with care, are enough to teach an AI engine the difference between its ‘oligotrophs’ and ‘oxalotrophies’.

Unfortunately, at present, this technique is feasible only with smaller, ‘smart’ language models, and as a community we’re still figuring out how to transfer it to LLMs.

Citation and AI decision-making

To be useful to researchers, LLMs need to be able to cite their sources. Any peer-reviewed scientific paper has pages upon pages dedicated to sources. Yet, as it stands today, an LLM-generated paper or summary struggles to generate a high volume of relevant sources – that’s if it doesn’t make some up entirely.

Whilst just citing sources may not be an enormous engineering challenge, and was likely not a priority for OpenAI’s developers, the bigger challenge will be to ensure that models are consistently selecting the right sources. They should be able to lay out to the end-user why any given source is the correct choice, in factual terms that highlight the model’s reasoning and the factors driving its decision.
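As a minimal sketch of what such a check could look like, the snippet below verifies that a cited source exists in a trusted index and reports the terms it shares with the claim, as a crude stand-in for an explanation. Everything here is an assumption for illustration: `TRUSTED_INDEX`, its placeholder entries `paper_a` and `paper_b`, and the word-overlap heuristic are hypothetical, not how any existing model selects or justifies sources.

```python
import re

# Hypothetical trusted index: source identifier -> abstract text.
# The entries are placeholders, not real publications.
TRUSTED_INDEX = {
    "paper_a": "Autotrophs fix carbon dioxide using light or chemical energy.",
    "paper_b": "Eukaryote cells contain a membrane-bound nucleus.",
}


def tokens(text):
    """Lowercase word set, ignoring words of three letters or fewer."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def check_citation(claim, source_id, index=TRUSTED_INDEX):
    """Report whether a cited source exists and which terms it shares with the claim."""
    if source_id not in index:
        # A citation that is not in the index may have been made up.
        return {"source": source_id, "exists": False, "shared_terms": []}
    shared = sorted(tokens(claim) & tokens(index[source_id]))
    return {"source": source_id, "exists": True, "shared_terms": shared}


print(check_citation("Autotrophs fix carbon from carbon dioxide.", "paper_a"))
print(check_citation("Autotrophs fix carbon from carbon dioxide.", "paper_z"))
```

Shared vocabulary is a weak proxy for relevance; the point is only that every citation can be checked against a known source and accompanied by a reason the end-user can inspect.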

If we want to build trust in AI’s decision-making process, it’s vital that we prioritise citations and transparency.

Smart language models

LLMs are not yet suitable for scientific research due to their reliance on large datasets and lack of accuracy when it comes to specialised terminology – let alone issues with citation.

Technological progress is a matter of marginal, iterative gains, and that’s why focus is so important. We must create AI tools that tackle one area at a time, rather than a blanket solution that sounds good at everything whilst returning mistake-laden answers.

The answer is smart language models. Models that have been trained with bespoke, high-quality datasets can offer more accurate results in this area, as well as factual validation based on their training from papers and patents in the field in which they operate.

By taking a specialised approach to language modelling, scientists can benefit from AI-assisted research exploration whilst enjoying more accurate results, better adapted to their domain-specific needs. As all scientists know, saving time should never come at the expense of getting the right answers.

Written by
Victor Botev