How do you get large language models (LLMs) to respond correctly to complex questions? They do an admirable job of summarizing, and their ability to answer an almost bewildering array of questions is awe-inspiring. But they hallucinate. They are biased. Complex reasoning eludes them. This week’s blog takes a snapshot of the ever-changing world of LLMs and artificial intelligence (AI) more broadly.

OpenAI o1

Last month, OpenAI released a new model called o1. The only information they shared about the model in their “technical research post” was that it was trained with large-scale reinforcement learning to use chain-of-thought (CoT) reasoning. The reinforcement learning pushes the model to think step by step. OpenAI noted that they will not share the step-by-step reasoning the model produces during inference. In addition, the model costs more and takes longer to run at inference time.

The model has so far produced great results on a variety of complex benchmarks. A PhD student posted a viral video of the model solving a complex math problem: he was blown away that, just by reading the methods section of his published paper, o1 recreated the complete code for solving the problem in about an hour, a feat that had taken him literally a year. That’s quite impressive.

Interestingly, this model is apparently the start of a new series, the o1 series. OpenAI is also suggesting that iterations of GPT will continue. Is GPT-5 being outmoded? Or will it be o2? Who knows how they will name new systems!


To CoT or not to CoT?

CoT prompting asks a model to reason step by step, the same idea that underpins o1. Researchers from the University of Texas at Austin, Johns Hopkins and Princeton published new research exploring when CoT prompting is and isn’t useful. Their conclusion? CoT works best on math and symbolic problems. On other tasks, such prompting delivers no additional benefit and can in fact degrade the output. The results are based on an analysis of 14 different models across 20 different datasets. Check out the impressive infographics in their paper summarizing the meta-analysis of CoT improvements (or lack thereof).
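To make the distinction concrete, here is a toy sketch contrasting a direct prompt with a CoT-style prompt. The ask_model function is a hypothetical placeholder for whatever LLM API you happen to use; only the prompt wording is the point.

```python
# Toy illustration of direct prompting vs. chain-of-thought (CoT) prompting.
# `ask_model` is a hypothetical stand-in for a real LLM call.

def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned string here."""
    return f"[model response to: {prompt[:40]}...]"

question = "A train travels 120 miles in 1.5 hours. What is its average speed?"

# Direct prompt: just ask for the answer.
direct_prompt = f"{question}\nAnswer with a single number."

# CoT prompt: ask the model to show its reasoning before answering.
cot_prompt = (
    f"{question}\n"
    "Think through the problem step by step, showing your reasoning, "
    "then state the final answer on its own line."
)

print(ask_model(direct_prompt))
print(ask_model(cot_prompt))
```

The Texas/Johns Hopkins/Princeton result is essentially about when the second style of prompt buys you anything over the first.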


Self-correcting LLMs

Google DeepMind released a technical research report this month titled “Training Language Models to Self-Correct via Reinforcement Learning.” Here I review the highlights of this report, which in principle is aiming at the same goal as the OpenAI o1 release: solving complex problems with LLMs. The method they propose is dubbed SCoRe, for Self-Correction via Reinforcement Learning.

One of the results of this paper is that naive self-correction of LLMs does not work: simply asking an LLM to correct its own output is not an effective strategy. The reason, the Google researchers point out, is twofold: 1) the LLM output suffers a mode collapse, meaning it gets stuck in a local, non-optimal state when generating the next tokens; 2) there is a distribution mismatch between how the LLM is trained and how it is used during self-correction.

The solution? A two-stage training strategy that incorporates reinforcement learning (RL) with self-generated data for solving complex problems. Stage one trains a model that is resistant to mode collapse: the training objective keeps the model from deviating from its core capabilities on its first response, using a statistical construct called KL-divergence. Stage two uses RL to maximize error correction through reward shaping, an RL technique in which the model is encouraged (given more reward) when it makes meaningful corrections. The setup is multi-turn (they studied only two-turn self-correction): the first turn is a prompt to solve a problem, and the second turn is a generic prompt asking the model to examine its first-turn output and correct it if necessary. The model they trained does well with this self-correction approach.
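As a rough illustration, here is a conceptual sketch of how the two stages could fit together. Every function below is a hypothetical stub standing in for the model, the reward and the KL term; this is an outline of the recipe as described above, not DeepMind's actual SCoRe implementation.

```python
# Conceptual sketch of a SCoRe-style two-stage, two-turn training loop.
# All functions are hypothetical stubs; this is not the paper's implementation.

import random

def generate(model, prompt):
    """Stub: sample a response from the model for a prompt."""
    return f"attempt-{random.randint(0, 9)}"

def kl_divergence(model, base_model, prompt, response):
    """Stub: KL between the trained model and the base model on this response."""
    return random.random()

def correctness(response):
    """Stub: 1.0 if the response solves the task, else 0.0."""
    return float(random.random() > 0.5)

def rl_update(model, reward):
    """Stub: apply a policy-gradient style update using the scalar reward."""
    pass

CORRECT_PROMPT = "Review your previous answer and correct it if necessary."

def stage_one(model, base_model, prompts, kl_weight=1.0):
    # Stage one: penalize drift of the first-turn response away from the base
    # model (KL term) while optimizing the second turn, to resist mode collapse.
    for prompt in prompts:
        first = generate(model, prompt)
        second = generate(model, prompt + "\n" + first + "\n" + CORRECT_PROMPT)
        reward = correctness(second) - kl_weight * kl_divergence(model, base_model, prompt, first)
        rl_update(model, reward)

def stage_two(model, prompts, bonus=0.5):
    # Stage two: multi-turn RL with reward shaping that pays extra for genuine
    # improvement from the first attempt to the second.
    for prompt in prompts:
        first = generate(model, prompt)
        second = generate(model, prompt + "\n" + first + "\n" + CORRECT_PROMPT)
        shaped = correctness(second) + bonus * (correctness(second) - correctness(first))
        rl_update(model, shaped)

if __name__ == "__main__":
    model, base_model = object(), object()
    prompts = ["Solve: 12 * 17 = ?"]
    stage_one(model, base_model, prompts)
    stage_two(model, prompts)
```

The division of labor is the point the sketch tries to capture: stage one uses the KL term to keep the first response anchored so training does not collapse, while stage two rewards the improvement itself.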

To understand the full importance of their work, you can read their technical report. But you may also find this Discover AI discussion entertaining and educational. Not only do they explain the whole technical paper in depth, but they also use OpenAI o1 to explain all the technical formulations in great detail. More than likely, o1 uses a strategy similar to what Google presents in this paper, so it is only fitting that o1 gets to explain it!

State of AI

Every year since 2018, Nathan Benaich (Air Street Capital) and his colleagues in the UK have been putting out a fairly comprehensive report chronicling the current state of AI. The 2024 edition dropped Oct. 10. It is an excellent summary of what is happening now and some prognostications about the immediate future.

“Juggy” Jagannathan, PhD, is an AI evangelist with four decades of experience in AI and computer science research.