by Dr. Megan Ma and Jay Mandal, Codex fellow

While headlines screamed about the capabilities of OpenAI’s GPT-4, [1] many people continue to wonder when the era of human-machine cooperation will begin. Yet despite the excitement surrounding GPT-4’s passing of medical licensing exams, [2] bar exams, [3] and other forms of standardized testing, one question remains unanswered: what does an AI model’s ability to pass these complex reasoning exams represent? Does it mean that LLMs exhibit intelligence similar to that of humans? One of the most common claims was that LLMs “broke the Turing Test.” [4] What does this mean, and is it accurate?

To answer these questions, we need to understand the Turing Test’s historical context and, more importantly, its original intent. Alan Turing, the renowned mathematician and computer scientist, introduced the test in his 1950 paper, and it has since been regarded as an operational test of intelligence: it assesses a machine’s capacity to behave in a way that is indistinguishable from a human. In the test, an evaluator engages both a machine and a human in natural-language conversation and attempts to tell which is which. The machine passes if the evaluator cannot.

The Turing Test is better understood as a philosophical thought experiment, similar to another well-known and related “test,” the Chinese Room Argument. [5] Beyond this clarification, the Turing Test was never about whether machines can think; it was about whether a machine could mimic a human. Turing describes the game as one of imitation at the very beginning of his article.

Researchers at AI21 Labs conducted what they describe as the largest social experiment of its kind to reproduce the Turing Test, via the game “Human or Not.” Over two million people from around the world participated. Participants chatted blindly for two minutes with either an AI bot (built on leading LLMs such as GPT-4) or a fellow participant, and were then asked to guess whether they had spoken with a person or a machine.

The results were quite interesting. Only 68% of guesses correctly identified whether the conversation partner was another human or an AI bot. The most fascinating part was seeing the different approaches and strategies people used to decide whether they were speaking with a real person or a machine. The common strategies drew on perceived limitations of language models and on prior perceptions of human behavior on the internet. [6] Some participants asked deeply personal or philosophical questions; others assumed the AI chatbot would be more polite than a person or incapable of using slang. In many interactions, these assumptions proved false.

The results of the Turing game are hard to interpret. They offer no further clarity about whether the models are capable of performing particular tasks, and in each experiment the researchers could not articulate what was actually being measured. At most, the results suggest that these machines can engage in observably convincing conversation.

What fascinates us about the Turing Test, and about interpreting the capabilities of machines, in this case generative AI, is the inability to articulate what separates humanity from artificiality. The majority of Alan Turing’s paper is an argument against anticipated objections to his test. He takes “such pains to point out the fallacies in contrary views.” [7] The tension Turing identifies is carried forward, and amplified, by LLMs.

It is only natural that a clinical psychologist would turn to the IQ test. [8] Eka Roivainen administered an IQ test to ChatGPT and found that it scored better than 99.9 percent of human test takers. Yet while these models can outscore humans on some aptitude tests in certain situations, they are not consistent. LLMs perform poorly on reasoning tests that developmental psychologists routinely give to children.

Roivainen points out that IQ tests cannot measure every aspect of intelligence. So what are we testing? What do we really want to prove or demonstrate with these models? And how, then, should we test them?

We found it extremely difficult to assess the performance of such models. The problem is compounded by the fact that traditional benchmarking does not translate well to the LLM era. Benchmarks are useful for evaluating specific abilities (e.g., grammatical competence in language), [12] but far less so when a task demands multiple skills at once. Human-designed academic and professional exams have been put forward as a remedy for these shortcomings of traditional benchmarking.
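To make concrete how narrow a traditional benchmark metric is, here is a minimal, purely illustrative sketch of an exact-match evaluation harness; the toy dataset and the ask_model function are hypothetical placeholders rather than any particular benchmark’s API.

```python
# Minimal sketch of a traditional benchmark: score a model on one narrowly
# defined skill using exact-match accuracy. The toy dataset and the
# ask_model() call are hypothetical placeholders, not a real benchmark's API.

def ask_model(prompt: str) -> str:
    """Stand-in for a call to an LLM; returns the model's raw text answer."""
    raise NotImplementedError("plug in a model call here")

# A toy 'grammar' benchmark: choose the grammatically correct option.
dataset = [
    {"question": "Which is correct? (a) 'She go home.' (b) 'She goes home.' Answer a or b:", "answer": "b"},
    {"question": "Which is correct? (a) 'They are late.' (b) 'They is late.' Answer a or b:", "answer": "a"},
]

def evaluate(items) -> float:
    """Exact-match accuracy: one number measuring one isolated skill."""
    correct = 0
    for item in items:
        prediction = ask_model(item["question"]).strip().lower()
        correct += prediction == item["answer"]
    return correct / len(items)

# accuracy = evaluate(dataset)
# The resulting score says something about grammatical judgment in isolation,
# and nothing about the blend of skills a bar exam question or a client matter demands.
```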

These exams have become a new form of benchmark. Problematically, human results on these exams are shaped by factors rooted in our social environments and by differences in how we learn, which further blurs the distinction between human and machine. Performance on such tests has never been the sole metric of human capability. Many lawyers would agree that the bar examination is far from an assessment of their ability to produce quality legal work; in the legal industry, quality and value are often implicit, contextualized by the client’s experience and industry knowledge.

Jack Stilgoe argues that we need a Weizenbaum test for AI. In his article, he contends that the difficulty of testing AI models stems from their detachment from application; the preoccupation with machines that simulate intelligence is an old idea. [13] Instead, tests should be reframed to emphasize public value and usefulness, “evaluating them based on their real-world implications.” [14] It is a compelling argument, but usefulness is itself difficult to assess. Equally important is the shift away from “intelligence” toward granular, task-based reasoning: evaluation should center on identifying the use cases for which LLMs are best suited.

What are LLMs good for? Here it is worth considering Moravec’s Paradox. Hans Moravec observed in 1988 that tasks humans perform effortlessly can be among the hardest for machines to replicate. Replication is the key word: the evaluation process rests on the assumption that machines should behave like humans. We must also remember that these models do not think like humans. While they can produce linguistic output that resembles human speech, they do not feel human emotions, and they arrive at the capabilities they display via their own opaque paths (black boxes). Whether their performance across a wide range of complex tasks can be considered “intelligence” remains unknown, but the desire to apply these models to tasks that require human-level abstraction invites misuse. We should instead acknowledge and investigate the capabilities of LLMs that humans do not have.

We are starting to see more research in this area, particularly on LLMs that simulate scenarios, [15] generate hypothetical ones, or offer alternative perspectives. LLMs are excellent tools for learning and training because they can extend human imagination, and they can recall information far faster than humans can. This suggests that a division between machines and humans is already in place. Machines are also capable of a degree of specialization comparable to humans: for example, we can assess whether an LLM is better tailored to a given task through model adjustment (otherwise known as finetuning) or through in-context learning (otherwise known as prompt engineering), as the sketch below illustrates. These decisions, and the way they are made, require more evaluation and systematic guidance. We need a more thorough analysis of the metrics we use today to assess human talent and skill; only then can we confidently explore richer areas of human-machine interaction. For now, we remain confined to vanity metrics, which show only a partial view of a model’s relative advantage. [16]
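To make the distinction concrete, here is a minimal sketch contrasting the two tailoring strategies. The complete and finetune functions, and the clause-classification task, are hypothetical placeholders standing in for whatever model API or training pipeline is actually used.

```python
# Sketch of two ways to tailor an LLM to a task. Both functions are hypothetical
# placeholders, not a specific vendor's API.

def complete(prompt: str) -> str:
    """Stand-in for a call to a frozen, general-purpose LLM."""
    raise NotImplementedError("plug in a model call here")

def finetune(base_model: str, examples: list[dict]) -> str:
    """Stand-in for a training job; returns an identifier for the adapted model."""
    raise NotImplementedError("plug in a training pipeline here")

# 1) In-context learning ("prompt engineering"): the weights stay fixed; the task
#    is specified entirely through instructions and examples inside the prompt.
few_shot_prompt = (
    "Classify each clause as 'indemnification' or 'other'.\n"
    "Clause: 'The Supplier shall hold the Buyer harmless...' -> indemnification\n"
    "Clause: 'This Agreement is governed by the laws of...' -> other\n"
    "Clause: 'Licensee agrees to defend Licensor against any claim...' ->"
)
# answer = complete(few_shot_prompt)

# 2) Finetuning ("model adjustment"): the weights themselves are updated on
#    labeled task data, producing a specialized model that no longer needs
#    examples packed into every prompt.
training_examples = [
    {"input": "The Supplier shall hold the Buyer harmless...", "label": "indemnification"},
    {"input": "This Agreement is governed by the laws of...", "label": "other"},
]
# specialized_model = finetune("base-llm", training_examples)
```

Which of the two is appropriate depends on the task, the data available, and the cost of each run; deciding between them systematically is exactly the kind of evaluation guidance that is still missing.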

We will need a comprehensive set of assessments that not only articulate a model’s performance technically but also compare it with a person’s ability and reasoning in completing the same task. The first step toward this evaluative harness would be to characterize human performance on a given task empirically and clearly, by mapping ontologies, conceptual hierarchies, and domains. This would not only give us a better understanding of the role of humans in the future, but also sharpen our sensitivity to nuance and finer-grained differences in performance. We may even be able to distinguish how the machines’ world models [17] behave and perceive differently from our own. As a result, we move from a world of generalists into one of agility.
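As a purely illustrative sketch of what the first step of such a harness might look like, the structure below pairs a small task ontology with hypothetical human baseline figures against which a model could later be compared; the domains, tasks, and numbers are invented for illustration.

```python
# Illustrative only: a tiny task ontology pairing each task with a hypothetical,
# empirically measured human baseline, against which a model could later be scored.
# The domains, tasks, and figures are invented placeholders.

task_ontology = {
    "legal": {
        "contract_review": {
            "subtasks": ["clause_classification", "risk_flagging"],
            "human_baseline": {"accuracy": 0.92, "minutes_per_document": 35},
        },
        "legal_research": {
            "subtasks": ["issue_spotting", "citation_retrieval"],
            "human_baseline": {"accuracy": 0.88, "minutes_per_document": 50},
        },
    },
}

def compare_to_human(model_scores: dict, ontology: dict) -> dict:
    """For each task with a model score, report the gap to the human baseline."""
    report = {}
    for domain, tasks in ontology.items():
        for task, spec in tasks.items():
            human = spec["human_baseline"]["accuracy"]
            model = model_scores.get(task)
            if model is not None:
                report[task] = {"human": human, "model": model, "delta": round(model - human, 3)}
    return report

# Example: compare_to_human({"contract_review": 0.85}, task_ontology)
# -> {'contract_review': {'human': 0.92, 'model': 0.85, 'delta': -0.07}}
```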

——————————————————————————————————————————————————————————————————————–

[1] Bubeck et al., Sparks of Artificial General Intelligence: Early experiments with GPT-4 (March 2023), available at: https://www.microsoft.com/en-us/research/publication/sparks-of-artificial-general-intelligence-early-experiments-with-gpt-4.

[2] Tiffany Kung, Research Spotlight: Potential for AI-Assisted Medical Education Using Large Language Models, Massachusetts General Hospital (February 9, 2023), available at: https://www.massgeneral.org/news/research-spotlight/potential-for-ai-assisted-medical-education-using-large-language-models.

[3] Daniel Martin Katz et al., GPT-4 Passes the Bar Exam, SSRN (April 5, 2023), available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4389233.

[4] Celeste Biever, ChatGPT broke the Turing test – the race is on for new ways to assess AI, Nature (July 25, 2023) available at: https://www.nature.com/articles/d41586-023-02361-7.

[5] See original discussion of the Chinese Room Experiment: John R. Searle, Minds, brains, and programs, Behavioral and Brain Sciences 3:417-424 (1980), available at: https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/minds-brains-and-programs/DC644B47A4299C637C89772FACC2706A.

[6] Daniel Jannai et al., Human or Not? A Gamified Approach to the Turing test, available at: https://arxiv.org/abs/2305.20010.

[7] Alan Turing, Computing Machinery and Intelligence, Mind 59(236): 433-460 (1950), available at: https://academic.oup.com/mind/article/LIX/236/433/986238.

[8] Eka Roivainen, I Gave ChatGPT an IQ Test. Here’s What I Discovered, Scientific American (March 28, 2023), available at: https://www.scientificamerican.com/article/i-gave-chatgpt-an-iq-test-heres-what-i-discovered/.

[9] Will Douglas Heaven, AI hype is built around high test scores. Those tests are flawed, MIT Technology Review (August 30, 2023), available at: https://www.technologyreview.com/2023/08/30/1078670/large-language-models-arent-people-lets-stop-testing-them-like-they-were/.

[10] Please note that the results have not yet been compared to newer multimodal models. See Biever, supra note 4.

[11] Id.

[12] Avijit Chatterjee, The Problem with LLM Benchmarks, AIM Research (September 14, 2023), available at: https://aimresearch.co/2023/09/14/leaders-opinion-the-problems-with-llm-benchmarks/.

[13] See Stilgoe’s reference to Philip Ball: “LLMs signal that it’s time to stop making the human mind a measure of AI.” Jack Stilgoe, We need a Weizenbaum test for AI, Science (Aug 11, 2023), available at: https://www.science.org/doi/full/10.1126/science.adk0176.

[14] Id.

[15] Take for instance this idea to simulate conflict in order to train conflict resolution. See Omar Shaikh et al., Rehearsal: Simulating Conflict to Teach Conflict Resolution, https://arxiv.org/abs/2309.12309.

[16] For example, comparing models based on their rate of hallucination. Recent commentary has highlighted the triviality and potential for misinformation in this type of assessment. See, for a factual-hallucination assessment: https://github.com/vectara/hallucination-leaderboard. See, for commentary: https://www.linkedin.com/posts/drjimfan_please-see-update-below-a-recent-llm-hallucination-activity-7130230516246593536-mxAY/.

[17] For example, Yuling Gu, What are They Thinking? Do language models have coherent mental models of everyday things?, Medium (July 7, 2023), available at: https://blog.allenai.org/what-are-they-thinking-do-language-models-have-coherent-mental-models-of-everyday-things-cc73035a0ec8. See also Wes Gurnee and Max Tegmark, Language Models Represent Space and Time, https://arxiv.org/pdf/2310.02207.pdf; Omar Shaikh et al., Grounding or Guesswork? Large Language Models are Presumptive Grounders, https://arxiv.org/pdf/2311.09144.pdf; and Melanie Mitchell, AI’s challenge of understanding the world, Science (November 10, 2023), available at: https://www.science.org/doi/full/10.1126/science.adm8175.
