Superalignment through Chain-of-Thought?
Preface/Context
Dear readers,
This article should be read not as proof or established concepts, but as experimental tangents, reflections, and ideas, a chain of thought (if you will entertain my humor) that I had while reading the well-known paper: Wei, Jason, et al., "Chain-of-thought prompting elicits reasoning in large language models." In it, I put across the questions that arose and the answers I framed (often with more questions than concrete answers), and I constantly try to tie them back to the more pressing "AI alignment problem".
Reading
The fundamental question that comes across throughout the reading is
How does chain-of-thought prompting elicit reasoning? What is the reason behind it?
While I am unable to answer this question succinctly and accurately at this point, and leave it to further research, I do wonder: is the LM (Language Model) eliciting reasoning, or emulating the reasoning of its annotators? One place to rule out 100% emulation of the annotators is Fig. 6 of the ablation study done in the paper. You can find it below -

The authors conclude that chain-of-thought does not depend on any particular linguistic style or annotator. But a question still remains -
Even if the model is not emulating its annotator, emulation of the reasoning for a particular task remains possible; ergo, emulation of the chain of thought itself is possible.
One particular question that fascinated me throughout my study is: can an LM reason beyond chain-of-thought prompting? That is, given a scaffold of chain-of-thought for an in-domain question, will the LM generalize the chain of thought itself to OOD (Out-Of-Domain) questions?
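To make the question concrete, here is a minimal sketch of what a few-shot chain-of-thought prompt looks like. The exemplar mirrors the style of the arithmetic examples in Wei et al.; the helper function and its name are my own illustration, not from the paper.

```python
# A minimal sketch of few-shot chain-of-thought prompt construction.
# The worked exemplar follows the style of Wei et al.'s arithmetic
# examples; build_cot_prompt is an illustrative helper, not an API.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked chain-of-thought exemplar to a new question,
    leaving the answer open so the model must produce the reasoning."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

# An in-domain question reuses the exemplar's arithmetic structure; the
# open question above is whether the model transfers the *style* of
# reasoning to OOD questions or merely reproduces the familiar pattern.
prompt = build_cot_prompt(
    "A juggler has 16 balls. Half of the balls are golf balls. "
    "How many golf balls are there?"
)
```

The scaffold here is in-domain; the OOD test would be to swap in a question from a different task family and see whether the stepwise style survives.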
Summary of the thoughts on this paper
To summarize the crux of my ideations and thoughts about this paper, consider the following questions -
- Scale seems to improve the model's reasoning ability and semantic understanding. Why is scale the parameter that affects it? Is there a deeper reason why? (Check the FAQ section of the paper.)
- Is chain of thought, at its core, asking the model to reproduce results in the same way the input has been given? (The emulation problem.)
- If chain of thought can indeed elicit reasoning, is this the "opening of the black box"? If so, to what extent can this box be opened - a 100% understanding of everything that goes on? (The explainability problem.)
- Can this method be used to align a model? That is, make a smaller model aligned through a chain-of-thought process and leverage weak-to-strong generalization to enforce alignment in bigger models?
Superalignment - What do I mean by the title of this article?
The final point takes me to my final tangent and the main subject of this article: why I think the chain-of-thought process can be a beneficial step in eliciting inherent alignment principles as LMs scale to enormous degrees.
I strongly encourage readers to understand weak-to-strong generalization, at least at a surface level, before reading further. A helpful resource for it - OpenAI's Weak-to-Strong Generalization.
We know at this point that we can steer a language model strictly in certain directions through intelligent prompting and guardrails. I propose considering the elicitation of factual information alongside reasoning from LMs through chain-of-thought prompting.
Before moving forward, let's define the superalignment problem as outlined by OpenAI's team -

Continuing our chain of thought, let's assume the chain-of-thought prompts include reasoning for why certain behavior is good while other behavior is bad. Suppose a smaller model (in our case, at least 100B params - read my full annotation for why) is fine-tuned, i.e., aligned, using these chain-of-thought prompts.
Then, in theory, the aligned chain-of-thought LM (CLM) can generalize its capabilities, i.e., elicit reasoning and aligned behavior, to much stronger and larger LMs through weak-to-strong generalization.
Since scale helps with reasoning, bigger models can in theory elicit superior reasoning and alignment behaviors learned from the smaller model, and can explain why certain behavior is good vs. bad.
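The weak-to-strong intuition above can be sketched with a toy example. This is emphatically not the OpenAI setup (no language models here); it is a minimal, self-contained simulation under assumed parameters: a noisy "weak supervisor" labels data at roughly 80% accuracy, and a "strong student" with the right hypothesis class (here, a simple threshold learner) fits those noisy labels and ends up more accurate than its supervisor.

```python
import random

# Toy weak-to-strong generalization sketch. All numbers (noise rate,
# sample size, threshold grid) are illustrative assumptions.
random.seed(0)

def true_label(x: float) -> int:
    # Ground-truth rule that the weak supervisor only approximates.
    return int(x > 0.5)

def weak_label(x: float) -> int:
    # The weak supervisor agrees with the truth 80% of the time.
    y = true_label(x)
    return y if random.random() < 0.8 else 1 - y

xs = [random.random() for _ in range(2000)]
weak = [weak_label(x) for x in xs]

# Strong student: pick the threshold that best fits the *weak* labels.
# Because the noise is symmetric, fitting noisy labels still points the
# student toward the true decision boundary near 0.5.
candidates = [i / 100 for i in range(101)]
t = max(candidates,
        key=lambda c: sum(int(x > c) == y for x, y in zip(xs, weak)))

student_acc = sum(int(x > t) == true_label(x) for x in xs) / len(xs)
weak_acc = sum(w == true_label(x) for x, w in zip(xs, weak)) / len(xs)
print(f"weak supervisor accuracy: {weak_acc:.2f}, "
      f"student accuracy: {student_acc:.2f}")
```

The student surpasses its supervisor because its hypothesis class matches the true rule; whether aligned reasoning transfers this cleanly from a small CLM to a much larger model is exactly the open question of this article.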
Footnote
I humbly request all readers to consider this article as a step toward taking an unbaked idea and polishing it, or experimenting further to prove me right or wrong. Any and all comments are welcome, and I hope to learn from you all. I have attached the full annotated paper, with more questions and tangents and perhaps some answers; please feel free to read it: Chain-Of-Thought Annotated. Thanks for taking the time. Cheers.