Superalignment through Chain-of-Thought?
Preface/Context
Dear readers,
This article should be read not as proof or established concepts, but as experimental tangents, reflections, and ideas, a chain of thought (if you will entertain my humor) that I had while reading the well-known paper: Wei, Jason, et al., "Chain-of-thought prompting elicits reasoning in large language models." In it, I put across the questions that arose and the answers I framed (often with more questions than concrete answers), and I constantly try to tie them back to the more pressing "AI alignment problem".
Reading
The fundamental question that comes across throughout the reading is
How does chain-of-thought prompting elicit reasoning? What is the reason behind it?
While I am unable to answer this question succinctly and accurately at this point, and leave it to further research, I do wonder: is the LM (Language Model) eliciting reasoning, or emulating the reasoning of its annotators? One place to rule out 100% emulation of the annotators is Fig. 6 of the ablation study done in the paper. You can find it below -

The authors conclude that chain-of-thought does not depend on any particular linguistic style or annotator. But a question still remains -
Even if the model is not emulating its annotator, emulation of the reasoning for a particular task remains possible; ergo, emulation of the chain of thought itself is possible.
One particular question that fascinated me throughout my study is: can an LM reason beyond chain-of-thought prompting? That is, given a scaffold of chain-of-thought for an in-domain question, will the LM generalize the chain of thought itself to OOD (Out-Of-Domain) questions?
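To make the question concrete, here is a minimal sketch of what a few-shot chain-of-thought prompt looks like. The exemplar mirrors the style of the arithmetic examples in Wei et al.; the helper function and its name are my own illustration, not from the paper.

```python
# A minimal sketch of few-shot chain-of-thought prompt construction.
# The worked exemplar follows the style of Wei et al.'s arithmetic
# examples; build_cot_prompt is an illustrative helper, not an API.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked chain-of-thought exemplar to a new question,
    leaving the answer open so the model must produce the reasoning."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

# An in-domain question reuses the exemplar's arithmetic structure; the
# open question above is whether the model transfers the *style* of
# reasoning to OOD questions or merely reproduces the familiar pattern.
prompt = build_cot_prompt(
    "A juggler has 16 balls. Half of the balls are golf balls. "
    "How many golf balls are there?"
)
```

The scaffold here is in-domain; the OOD test would be to swap in a question from a different task family and see whether the stepwise style survives.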
Summary of the thoughts on this paper
To summarize the crux of my ideations and thoughts about this paper, consider the following questions -
- Scale seems to improve the model's reasoning ability and semantic understanding. Why is scale the parameter that affects it? Is there a deeper reason why? (Check the FAQ section of the paper.)
- Is chain of thought, at its core, asking the model to reproduce results in the same way the input has been given? (The emulation problem.)
- If chain of thought can indeed elicit reasoning, is this the "opening of the black box"? If so, to what extent can this box be opened - a 100% understanding of everything that goes on? (The explainability problem.)
- Can this method be used to align a model? That is, make a smaller model aligned through a chain-of-thought process and leverage weak-to-strong generalization to enforce alignment in bigger models?
Superalignment - What do I mean by the title of this article?
The final point takes me to my final tangent and the main subject of this article: why I think the chain-of-thought process can be a beneficial step in eliciting inherent alignment principles as LMs scale to enormous degrees.
I strongly encourage readers to understand weak-to-strong generalization, at least at a surface level, before reading further. A helpful resource for it - OpenAI's Weak-to-Strong Generalization.
We know at this point that we can steer a language model strictly in certain directions through intelligent prompting and guardrails. I propose considering the elicitation of factual information alongside reasoning from LMs through chain-of-thought prompting.
Before moving forward, let's define the superalignment problem as outlined by OpenAI's team -

Continuing our chain of thought, let's assume the chain-of-thought prompts include reasoning for why certain behavior is good while other behavior is bad. Suppose a smaller model (in our case, at least 100B params - read my full annotation for why) is fine-tuned, i.e., aligned, using these chain-of-thought prompts.
Then, in theory, the aligned chain-of-thought LM (CLM) can generalize its capabilities, i.e., elicit reasoning and aligned behavior, to much stronger and larger LMs through weak-to-strong generalization.
Since scale helps with reasoning, bigger models can in theory elicit superior reasoning and alignment behaviors learned from the smaller model, and can explain why certain behavior is good vs. bad.
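The weak-to-strong intuition above can be sketched with a toy example. This is emphatically not the OpenAI setup (no language models here); it is a minimal, self-contained simulation under assumed parameters: a noisy "weak supervisor" labels data at roughly 80% accuracy, and a "strong student" with the right hypothesis class (here, a simple threshold learner) fits those noisy labels and ends up more accurate than its supervisor.

```python
import random

# Toy weak-to-strong generalization sketch. All numbers (noise rate,
# sample size, threshold grid) are illustrative assumptions.
random.seed(0)

def true_label(x: float) -> int:
    # Ground-truth rule that the weak supervisor only approximates.
    return int(x > 0.5)

def weak_label(x: float) -> int:
    # The weak supervisor agrees with the truth 80% of the time.
    y = true_label(x)
    return y if random.random() < 0.8 else 1 - y

xs = [random.random() for _ in range(2000)]
weak = [weak_label(x) for x in xs]

# Strong student: pick the threshold that best fits the *weak* labels.
# Because the noise is symmetric, fitting noisy labels still points the
# student toward the true decision boundary near 0.5.
candidates = [i / 100 for i in range(101)]
t = max(candidates,
        key=lambda c: sum(int(x > c) == y for x, y in zip(xs, weak)))

student_acc = sum(int(x > t) == true_label(x) for x in xs) / len(xs)
weak_acc = sum(w == true_label(x) for x, w in zip(xs, weak)) / len(xs)
print(f"weak supervisor accuracy: {weak_acc:.2f}, "
      f"student accuracy: {student_acc:.2f}")
```

The student surpasses its supervisor because its hypothesis class matches the true rule; whether aligned reasoning transfers this cleanly from a small CLM to a much larger model is exactly the open question of this article.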
Footnote
I humbly request all readers to consider this article as a step toward taking an unbaked idea and polishing it, or experimenting further to prove me right or wrong. Any and all comments are welcome, and I hope to learn from you all. I have attached the full annotated paper, with more questions and tangents and perhaps some answers; please feel free to read it: Chain-Of-Thought Annotated. Thanks for taking the time. Cheers.