Alignment Newsletter #164: How well can language models write code?

FromAlignment Newsletter Podcast

Start listening View podcast show

Alignment Newsletter #164: How well can language models write code?

FromAlignment Newsletter Podcast

ratings:

Length:

19 minutes

Released:

Sep 15, 2021

Format:

Podcast episode

Description

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Program Synthesis with Large Language Models (Jacob Austin, Augustus Odena et al) (summarized by Rohin): Can we use large language models to solve programming problems? In order to answer this question, this paper builds the Mostly Basic Python Programming (MBPP) dataset. The authors asked crowd workers to provide a short problem statement, a Python function that solves the problem, and three test cases checking correctness. On average across the 974 programs, the reference solution has 7 lines of code, suggesting the problems are fairly simple. (This is partly because you can use library functions.) They also edit a subset of 426 problems to improve their quality, for example by making the problem statement less ambiguous or making the function signature more normal. They evaluate pretrained language models on this dataset across a range of model sizes from 0.244B to 137B parameters. (This largest model is within a factor of 2 of GPT-3.) They consider both few-shot and finetuned models. Since we have test cases that can be evaluated automatically, we can boost performance by generating lots of samples (80 in this case), evaluating them on the test cases, and then keeping the ones that succeed. They count a problem as solved if any sample passes all the test cases, and report as their primary metric the fraction of problems solved according to this definition. Note however that the test cases are not exhaustive: when they wrote more exhaustive tests for 50 of the problems, they found that about 12% of the so-called “solutions” did not pass the new tests (but conversely, 88% did). They also look at the fraction of samples which solve the problem, as a metric of the reliability or confidence of the model for a given problem. Some of their findings: 1. Performance increases approximately log-linearly with model size. The trend is clearer and smoother by the primary metric (fraction of problems solved by at least one sample) compared to the secondary metric (fraction of samples that solve their problem). 2. Finetuning provides a roughly constant boost across model sizes. An exception: at the largest model size, finetuning provides almost no benefit, though this could just be noise. 3. It is important to provide at least one test case to the model (boosts problems solved from 43% to 55%) but after that additional test cases don’t make much of a difference (an additional two examples per problem boosts performance to 59%). 4. In few-shot learning, the examples used in the prompt matter a lot. In a test of 15 randomly selected prompts for the few-shot 137B model, the worst one got ~1%, while the best one got ~59%, with the others distributed roughly uniformly between them. Ensembling all 15 prompts boosts performance to 66%. 5. In rare cases, the model overfits to the test cases. For example, in a question about checking whether the input is a Woodall number, there is only one test checking an actual Woodall number (383), and the model generates a program that simply checks whether the input is 383. 6. When choosing the best of multiple samples, you want a slightly higher temperature, in order to have more diversity of possible programs to check. 7. It is important to have high quality problem descriptions as input for the model. The 137B model solves 79% of problems in the edited dataset, but only solves 63% of the original (unedited) versions of those problems. The authors qualitatively analyze the edits on the problems that switched from unsolved to solved and find a variety of things that you would generally expect to help. Now for the controversial question everyone loves to talk about: does the model understand the meaning of the code, or is it “just learning statistical correlations”? One way to

Released:

Sep 15, 2021

Format:

Podcast episode

Titles in the series (100)

The Alignment Newsletter is a weekly publication with recent content relevant to AI alignment. This podcast is an audio version, recorded by Robert Miles (http://robertskmiles.com) More information about the newsletter at: https://rohinshah.com/alignment-newsletter/

Skip carousel

More Episodes from Alignment Newsletter Podcast

Skip carousel

Related podcast episodes

Skip carousel

Discover this podcast and so much more

Alignment Newsletter #164: How well can language models write code?

Alignment Newsletter #164: How well can language models write code?

Description

Titles in the series (100)

More Episodes from Alignment Newsletter Podcast

Related podcast episodes