GPT-4 Scores in Top 10% for Legal Bar Exam

OpenAI's GPT-4 can score higher than 90% of humans taking the bar exam, while the previous version (GPT-3.5) scored in the bottom 10% of human bar exam test takers.

GPT-4 can output around 25,000 words. GPT-4 can write a higher-quality novel, while GPT-3.5 could only output a very short story.

GPT-4 can score 1410 on the SAT vs. 1260 for GPT-3.5.

GPT-4 can score 161 on the LSAT vs. 149 for GPT-3.5.

GPT-4 can score in the 99th percentile on the GRE verbal test (the graduate school admissions exam) vs. the 63rd percentile for GPT-3.5.

Predictable Performance for GPT-4 and Future GPT-X Systems

GPT-4 is a Transformer based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance based on models trained with no more than 1/1,000th the compute of GPT-4.

A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive model-specific tuning. To address this, we developed infrastructure and optimization methods that have very predictable behavior across multiple scales. These improvements allowed us to reliably predict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–10,000× less compute.

The final loss of properly-trained large language models is thought to be well approximated by power laws in the amount of compute used to train the model.


To verify the scalability of their optimization infrastructure, OpenAI predicted GPT-4's final loss on their internal codebase (not part of the training set) by fitting a scaling law with an irreducible loss term, L(C) = aC^b + c, to models trained with the same methodology but using at most 10,000× less compute than GPT-4.

This prediction was made shortly after the run started, without use of any partial results. The fitted scaling law predicted GPT-4’s final loss with high accuracy.
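To make the mechanics concrete, here is a minimal sketch of that kind of fit. The (compute, loss) numbers below are invented for illustration, and this is not OpenAI's actual code; it simply fits the stated form L(C) = aC^b + c to measurements from smaller runs and extrapolates to full compute:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(C, a, b, c):
    # L(C) = a * C**b + c, with b < 0 so the loss falls toward
    # the irreducible term c as training compute C grows.
    return a * np.power(C, b) + c

# Hypothetical (compute, final loss) pairs from smaller training runs,
# with compute normalized so the full-scale run equals 1.0.
compute = np.array([1e-7, 1e-6, 1e-5, 1e-4, 1e-3])
loss = np.array([4.04, 3.79, 3.58, 3.38, 3.21])

# Fit a, b, c; the bounds keep b negative and a, c non-negative.
params, _ = curve_fit(
    scaling_law, compute, loss,
    p0=[1.0, -0.1, 1.0],
    bounds=([0.0, -1.0, 0.0], [np.inf, 0.0, np.inf]),
)
a, b, c = params
print(f"fitted: a={a:.3f}, b={b:.3f}, c={c:.3f}")

# Extrapolate to the full run (normalized compute = 1.0).
print(f"predicted final loss: {scaling_law(1.0, a, b, c):.3f}")
```

The key point is that the fit uses only the small runs; the prediction for the full run is a pure extrapolation along the fitted curve.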

They also developed a methodology to predict more interpretable metrics of capability. One such metric is the pass rate on the HumanEval dataset, which measures the ability to synthesize Python functions of varying complexity. They successfully predicted the pass rate on a subset of the HumanEval dataset by extrapolating from models trained with at most 1,000× less compute.
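One simple way to model this (an assumption here, not necessarily OpenAI's exact procedure) is to treat the negative log of the pass rate, rather than the raw pass rate, as the quantity that follows a power law in compute. A sketch with invented pass rates and a hypothetical capability_law function:

```python
import numpy as np
from scipy.optimize import curve_fit

def capability_law(C, k, alpha):
    # Assume -log(pass_rate) decays as a power law in compute:
    # -log(p) = k * C**(-alpha), so the pass rate rises toward 1
    # as compute C grows.
    return k * np.power(C, -alpha)

# Hypothetical HumanEval pass rates from smaller runs
# (compute normalized so the full-scale run equals 1.0).
compute = np.array([1e-5, 1e-4, 1e-3, 1e-2])
pass_rate = np.array([0.02, 0.06, 0.14, 0.27])

# Fit in negative-log space, then map back to a pass rate.
neg_log_p = -np.log(pass_rate)
(k, alpha), _ = curve_fit(capability_law, compute, neg_log_p, p0=[1.0, 0.2])

predicted = np.exp(-capability_law(1.0, k, alpha))
print(f"predicted pass rate at full compute: {predicted:.2%}")
```

Fitting in negative-log space keeps the extrapolated value inside the valid (0, 1) range for a pass rate, which a straight-line fit on the raw pass rate would not guarantee.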

