Study reveals patterns that expose machine-generated text

What is the story

A recent study shows that large language models (LLMs) such as ChatGPT overuse certain words and may draw on a limited vocabulary.

The researchers likened this “excess word use” in the biomedical literature to the way epidemiologists measure the impact of COVID-19 through “excess deaths.”

The study suggests that approximately 10% of abstracts in 2024 were processed by LLMs.

LLMs’ unprecedented impact on scientific language

The researchers noted that the impact of LLM use on scientific writing is “truly unprecedented, surpassing even the dramatic vocabulary changes caused by the COVID-19 pandemic.”

To measure that impact, they took a novel approach: tracking “excess word use” in the biomedical literature, much as epidemiologists track “excess deaths.”

The study analyzed 14 million biomedical abstracts published between 2010 and 2024.

LLMs increase the frequency of certain words

The research team used papers published before 2023 as a baseline and compared them with papers published after LLMs became widely available commercially.

They found a 25-fold increase in the frequency of less common words such as “delves,” and a nine-fold increase for words such as “showcasing” and “underscores.”

Even common words like “potential,” “discovery,” and “important” saw increases in usage of up to 4%.
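The comparison described above can be illustrated with a minimal Python sketch. It is not the researchers' code: the function names, the simple letters-only tokenizer, and the toy abstracts are assumptions made for illustration. It measures the share of abstracts containing a word in a baseline set and in a recent set, then reports both the ratio (the "fold increase") and the absolute gap.

```python
import re
from collections import Counter


def word_document_frequency(abstracts):
    """Return, for each word, the fraction of abstracts that contain it."""
    counts = Counter()
    for text in abstracts:
        # Count each word once per abstract (hypothetical simple tokenizer).
        counts.update(set(re.findall(r"[a-z]+", text.lower())))
    total = len(abstracts)
    return {word: n / total for word, n in counts.items()}


def excess_usage(baseline_abstracts, recent_abstracts, word):
    """Return (frequency ratio, frequency gap) of `word`, recent vs. baseline."""
    p = word_document_frequency(baseline_abstracts).get(word, 0.0)
    q = word_document_frequency(recent_abstracts).get(word, 0.0)
    ratio = q / p if p > 0 else float("inf")
    return ratio, q - p


# Toy data, invented for illustration only.
baseline = [
    "this study delves into gene expression",
    "we report a new assay",
    "results show a modest effect",
    "the cohort included adults",
]
recent = [
    "this paper delves into tumor growth",
    "our work delves into protein folding",
    "we report a new assay",
    "results show a modest effect",
]

ratio, gap = excess_usage(baseline, recent, "delves")
print(ratio, gap)  # ratio 2.0, gap 0.25
```

The ratio highlights rare words whose use multiplies, while the gap highlights common words whose use rises by a few percentage points; the article's figures mention both kinds of change.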

Excess word use: an indicator of AI’s impact

For the years 2013 to 2023, the excess words and phrases the researchers identified were tied to global events, such as “Ebola,” “coronavirus,” and “lockdown.”

In 2024, however, the excess words were mostly style words rather than content words.

Of the 280 excess style words identified that year, two-thirds were verbs and about one-fifth were adjectives.

AI-processed papers are more common in non-English-speaking countries

Using these excess style words as an indicator of ChatGPT use, the researchers estimated that roughly 15% of papers from non-English-speaking countries such as China, Taiwan, and South Korea are now processed with AI.

This compares with 3% in English-speaking countries such as the UK.

They acknowledged that native English speakers may be better at hiding their use of LLMs.
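One simple way to turn marker words into a usage estimate is sketched below in Python. This is an assumed simplification, not the procedure the researchers used: it takes the share of recent abstracts containing at least one marker word and subtracts the share in a baseline set. The marker list reuses words quoted in the article, and the function names are hypothetical.

```python
import re


def share_with_markers(abstracts, marker_words):
    """Fraction of abstracts containing at least one marker word."""
    markers = set(marker_words)
    hits = sum(
        1
        for text in abstracts
        if markers & set(re.findall(r"[a-z]+", text.lower()))
    )
    return hits / len(abstracts)


def estimated_llm_share(baseline_abstracts, recent_abstracts, marker_words):
    """Rough estimate: excess share of abstracts using marker words."""
    base = share_with_markers(baseline_abstracts, marker_words)
    recent = share_with_markers(recent_abstracts, marker_words)
    # Abstracts that avoid the marker words are not counted, so this
    # can only understate usage; clamp at zero for noisy samples.
    return max(0.0, recent - base)


# Marker words taken from the article; the sets they are applied to
# would be supplied by the reader.
MARKERS = ["delves", "showcasing", "underscores"]
```

Because text that avoids the marker words goes undetected, an estimate of this kind is best read as a floor, which fits the researchers' remark that some authors may be better at hiding their use of LLMs.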

By automateinsider