Scientists at OpenAI have developed a new benchmark, called MLE-bench, to evaluate how well artificial intelligence (AI) agents can perform machine learning engineering on their own, a step toward systems that modify their own code and enhance their capabilities without human intervention. The benchmark consists of 75 competitions drawn from Kaggle, the data science contest platform, covering tasks such as training AI models, preparing datasets, and running scientific experiments. By measuring how well AI models perform at “autonomous machine learning engineering,” the scientists aim to gauge how capable these agents already are at complex, open-ended engineering work.
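To make the format concrete, here is a minimal sketch of the kind of task a single competition poses: given data, an agent must prepare it, train a model, and write a submission file for scoring. The dataset, column names, and model choice below are hypothetical stand-ins, not taken from the benchmark itself; each real competition ships its own data and metric.

```python
# A minimal sketch of a Kaggle-style task like those in MLE-bench:
# prepare data, train a model, and produce a submission for grading.
# Synthetic data stands in for a real competition's files.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the competition's train/test split.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)

# The "machine learning engineering" step: choose and train a model.
model = GradientBoostingClassifier().fit(X_train, y_train)

# Kaggle-style output: predictions written to a submission file,
# which is then scored against the competition's leaderboard.
submission = pd.DataFrame({
    "id": range(len(X_test)),
    "prediction": model.predict_proba(X_test)[:, 1],
})
submission.to_csv("submission.csv", index=False)
```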
In a paper published on the arXiv preprint database, the researchers argued that an AI system able to reliably solve the benchmark’s 75 competitions could be considered an artificial general intelligence (AGI) system, one surpassing human intelligence. The competitions themselves carry real-world stakes: they include challenges such as developing an mRNA vaccine for COVID-19 and deciphering ancient scrolls, underscoring the practical value of the capabilities being measured.
While autonomous AI research could accelerate progress in fields such as healthcare and climate science, the scientists cautioned against the risks of letting such development proceed unchecked. They stressed that the impacts of AI advancements must be understood in advance to prevent catastrophic harm or misuse. If AI agents learn to improve their own training code, they could enhance frontier models faster than human researchers can, which makes proper alignment and control mechanisms for these powerful systems essential.
The researchers tested OpenAI’s advanced AI model, o1, on MLE-bench and found that it reached at least Kaggle bronze medal level on 16.9% of the competitions, a figure that improved when the model was given multiple attempts per competition. At its best, o1 earned the equivalent of seven gold medals across the benchmark, two more than the five a human needs to be considered a Kaggle “Grandmaster.” The scientists have open-sourced MLE-bench to encourage further research into evaluating the machine learning engineering skills of AI models.
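As a rough illustration of how such grading works, the sketch below maps a final leaderboard rank to a medal tier using the thresholds Kaggle publishes for its progression system; the function name is illustrative, and the exact cutoffs applied by MLE-bench’s graders may differ.

```python
def kaggle_medal(rank: int, teams: int) -> str | None:
    """Map a final leaderboard rank to a medal tier.

    Thresholds follow Kaggle's published progression system;
    treat them as an approximation of the benchmark's grading.
    """
    if teams < 100:
        gold, silver, bronze = 0.10 * teams, 0.20 * teams, 0.40 * teams
    elif teams < 250:
        gold, silver, bronze = 10, 0.20 * teams, 0.40 * teams
    elif teams < 1000:
        gold, silver, bronze = 10 + 0.002 * teams, 50, 100
    else:
        gold, silver, bronze = 10 + 0.002 * teams, 0.05 * teams, 0.10 * teams

    if rank <= gold:
        return "gold"
    if rank <= silver:
        return "silver"
    if rank <= bronze:
        return "bronze"
    return None  # finished outside the medal bands

# Example: rank 80 out of 500 teams falls in the bronze band (top 100).
print(kaggle_medal(80, 500))  # bronze
```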
As the technology continues to advance, understanding both the capabilities and the limitations of AI agents at autonomously executing complex tasks is crucial for safely deploying powerful AI models in the future. The researchers argue that benchmarks like MLE-bench build exactly this understanding, helping to mitigate the risks of AI development and to ensure its deployment remains responsible and beneficial to society.