Answer method, score, and answer time for each level. GPT-4 + plugins scores should be considered an oracle, as plugins are manually selected according to the questions asked. Human score refers to the score obtained by an annotator when validating a question. Credit: arXiv (2023). DOI: 10.48550/arxiv.2311.12983
A team of researchers from Meta's GenAI and FAIR labs, AutoGPT, and HuggingFace has developed a benchmark tool to help makers of AI assistants, especially those building products based on large language models, test their applications for potential artificial general intelligence (AGI). They wrote a paper describing the tool, which they named GAIA, and how to use it. The article is posted on the arXiv preprint server.
Over the past year, researchers in the AI field have been debating the capabilities of AI systems, both privately and on social media. Some suggest that AI systems are very close to achieving AGI, while others suggest that the opposite is much closer to the truth. All agree, however, that such systems will at some point match and even exceed human intelligence; the only question is when.
In this new effort, the research team points out that if a true AGI system is to emerge, a rating system must be put in place to measure the intelligence of AGI systems, both against each other and against humans, in order to reach consensus. They further point out that such a rating system needs to start with benchmarks, which is what they propose in their paper.
The benchmark the team created consists of a series of questions to be posed to a future AI, with its answers compared to those provided by a random set of humans. When creating the benchmark, the team ensured that the questions were not the typical queries that AI systems tend to score highly on.
Rather, they posed questions that are easy for humans to answer but difficult for computers. Finding answers to the questions the researchers devised often required multiple steps of work, or “reasoning.” A question might, for example, be specific to content found on a particular website, such as asking how far above or below USDA standards the fat content in a pint of ice cream is, as reported on Wikipedia.
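Because each benchmark question has a single factual answer, scoring can be automated by comparing a model's final answer to the ground truth rather than relying on human graders. The snippet below is a minimal sketch of how such near-exact-match scoring might work; the normalization rules, data layout, and example question are illustrative assumptions, not the benchmark's actual evaluation harness.

```python
# Minimal sketch of GAIA-style scoring: each question has one
# ground-truth answer, and a model earns credit only if its final
# answer matches after light normalization. Names and data below
# are hypothetical, not the benchmark's real harness or questions.

def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing periods."""
    return answer.strip().strip(".").lower()

def score(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Return the fraction of questions answered exactly right."""
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(truth)
        for qid, truth in ground_truth.items()
    )
    return correct / len(ground_truth)

if __name__ == "__main__":
    # Hypothetical one-question benchmark in the spirit of the
    # multi-step ice-cream example above.
    truth = {"q1": "12%"}
    model_answers = {"q1": "12%"}
    print(f"Accuracy: {score(model_answers, truth):.0%}")
```

Scoring of this kind is what would make such a benchmark cheap and unambiguous to run, in contrast to open-ended tasks that require human or model-based judges.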
When the research team tested currently available AI products against their benchmark, they found that none of them were up to par. This suggests that the industry may not be as close to developing a true AGI as some think.
For more information:
Grégoire Mialon et al., GAIA: a benchmark for General AI Assistants, arXiv (2023). DOI: 10.48550/arxiv.2311.12983
© 2023 Science X Network