ChatGPT Has a Human Team Train It to Be a Lot Better


The ChatGPT team has a 68-page paper that describes their training of language models to follow instructions with human feedback.

Human labelers rank the ChatGPT outputs from best to worst. The result is a new labeled dataset, where the rankings are the labels. This dataset is approximately 10 times bigger than the curated dataset used for the SFT model.
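As a rough illustration of how rankings become labels (a sketch of the common recipe, not code from the paper; the function and field names are hypothetical), a labeler's best-to-worst ranking of K outputs can be expanded into all K*(K-1)/2 pairwise (preferred, rejected) comparisons for training the reward model:

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_outputs):
    """Expand one labeler ranking (ordered best to worst) into pairwise comparisons.

    Each (preferred, rejected) pair becomes one training example for the
    reward model, so a ranking of K outputs yields K*(K-1)/2 examples;
    this is one reason the comparison dataset ends up much larger than
    the SFT demonstration dataset.
    """
    pairs = []
    for better, worse in combinations(ranked_outputs, 2):
        # combinations() preserves order, so `better` always ranks above `worse`.
        pairs.append({"prompt": prompt, "preferred": better, "rejected": worse})
    return pairs

# Example: ranking 4 candidate outputs yields 6 comparison examples.
examples = ranking_to_pairs("Summarize this article.", ["A", "B", "C", "D"])
print(len(examples))  # 6
```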

Abstract


Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

How does ChatGPT use human feedback to attack the alignment problem?

Reinforcement Learning from Human Feedback

The method overall consists of three distinct steps:

1. Supervised fine-tuning step: a pre-trained language model is fine-tuned on a relatively small amount of demonstration data curated by labelers, learning a supervised policy (the SFT model) that generates outputs for a selected list of prompts. This represents the baseline model.

2. “Mimic human preferences” step: labelers are asked to rank a relatively large number of SFT model outputs, creating a new dataset of comparison data. A new model is trained on this dataset; it is referred to as the reward model (RM).

3. Proximal Policy Optimization (PPO) step: the reward model is used to further fine-tune and improve the SFT model. The outcome of this step is the so-called policy model. A toy code sketch of all three steps follows this list.
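To make the three steps concrete, here is a toy, self-contained PyTorch sketch of the overall recipe. This is not OpenAI's actual code: the models, sizes, data, and hyperparameters are made up, and the last step uses a simplified REINFORCE-style update with a KL penalty as a stand-in for full PPO.

```python
# Toy sketch of the three-step recipe (SFT -> reward model -> policy optimization).
# Everything here is illustrative: tiny models, random "data", made-up hyperparameters.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, BETA = 100, 32, 0.1   # BETA weights the KL penalty in step 3

class TinyLM(torch.nn.Module):
    """Stand-in for a language model: embeds tokens and predicts the next token."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB, DIM)
        self.head = torch.nn.Linear(DIM, VOCAB)
    def forward(self, tokens):                     # tokens: (batch, seq)
        return self.head(self.embed(tokens))       # logits: (batch, seq, VOCAB)

class RewardModel(torch.nn.Module):
    """Maps a token sequence to a single scalar score."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB, DIM)
        self.score = torch.nn.Linear(DIM, 1)
    def forward(self, tokens):
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)

# Step 1: supervised fine-tuning (SFT) on labeler-written demonstrations.
sft = TinyLM()
demos = torch.randint(0, VOCAB, (8, 16))           # fake demonstration token ids
sft_opt = torch.optim.Adam(sft.parameters(), lr=1e-3)
logits = sft(demos[:, :-1])
sft_loss = F.cross_entropy(logits.reshape(-1, VOCAB), demos[:, 1:].reshape(-1))
sft_loss.backward()
sft_opt.step()

# Step 2: reward model trained on pairwise comparisons from labeler rankings.
rm = RewardModel()
preferred = torch.randint(0, VOCAB, (8, 16))       # labeler-preferred outputs
rejected = torch.randint(0, VOCAB, (8, 16))        # dispreferred outputs
rm_opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
rm_loss = -F.logsigmoid(rm(preferred) - rm(rejected)).mean()  # rank preferred higher
rm_loss.backward()
rm_opt.step()

# Step 3: optimize the policy against the reward model, staying close to the SFT model.
# InstructGPT uses PPO; this is a simplified REINFORCE-style update with a KL penalty.
policy = TinyLM()
policy.load_state_dict(sft.state_dict())           # initialize the policy from the SFT model
pol_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
prompts = torch.randint(0, VOCAB, (8, 16))
dist = torch.distributions.Categorical(logits=policy(prompts))
actions = dist.sample()                            # sampled "response" tokens
logp = dist.log_prob(actions)                      # (batch, seq) log-probs under the policy
with torch.no_grad():
    ref_logp = F.log_softmax(sft(prompts), dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    reward = rm(actions)                           # reward model scores the sampled responses
kl = (logp - ref_logp).sum(dim=1)                  # per-sequence KL estimate vs. the SFT model
returns = (reward - BETA * kl).detach()            # KL-penalized reward used as the return
pg_loss = -(returns * logp.sum(dim=1)).mean()      # policy-gradient loss
pg_loss.backward()
pol_opt.step()
```

In the real pipeline each step runs over large datasets for many updates; here each step takes a single gradient step just to show where the losses come from and how the three models relate.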

Lex Fridman on ChatGPT

GPT-3 came out about two years ago, and it was impressive but dumb in a lot of ways. You would expect, as a human being, for it to generate certain kinds of text, and it would say kind of dumb things that were off, and you're like, all right, this is really impressive, but it's not quite there.

What they did with GPT-3.5 is they started adding more and different kinds of datasets. One of them, probably the smartest neural network currently, is Codex, which is fine-tuned for programming; it was trained on programming code. And when you train on programming code, which ChatGPT also was, you're teaching it something like reasoning, because it's no longer just information and knowledge from the Internet; it's also reasoning, like logic, even though you're looking at programming code. You're looking at me like, what is he talking about? No, no, no, that's not what I'm looking at; I'm looking at you like, oh my God. In order to be able to stitch together sentences that make sense, you not only need to know the facts.

It was fine-tuned in a supervised way, with human labeling of a small dataset: here's what we would like this network to do.
