
Salesforce CodeGen

CodeGen is a family of program-synthesis models (350M, 2B, 6B, and 16B parameters) developed by Salesforce Research and presented in the paper "CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis". This notebook shows how to use CodeGen for program synthesis in a Kubeflow notebook environment.

Released Models

Models of several sizes, trained on several datasets, have been released. They are named in the following format:

codegen-{model-size}-{data}

model-size has 4 options: 350M, 2B, 6B, 16B, which represent the number of parameters in each model.

data has 3 options: nl, multi, mono.

  • nl models are randomly initialized and trained on The Pile, an 825.18 GB English text corpus.

  • multi models are initialized from nl models and then trained on a corpus with code data consisting of multiple programming languages.

  • mono models are initialized from multi models and then trained on a corpus with Python code data.
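The naming scheme above can be enumerated and validated with a few lines of Python. This is only an illustrative sketch: the helper names `codegen_model_names` and `parse_model_name` are hypothetical and are not part of the CodeGen repository.

```python
from itertools import product

# The two components of the name, as listed above.
SIZES = ["350M", "2B", "6B", "16B"]
DATASETS = ["nl", "multi", "mono"]


def codegen_model_names():
    """Return every name of the form codegen-{model-size}-{data}."""
    return [f"codegen-{size}-{data}" for size, data in product(SIZES, DATASETS)]


def parse_model_name(name):
    """Split a model name into (model_size, data); raise ValueError if malformed."""
    parts = name.split("-")
    if (
        len(parts) != 3
        or parts[0] != "codegen"
        or parts[1] not in SIZES
        or parts[2] not in DATASETS
    ):
        raise ValueError(f"not a CodeGen model name: {name!r}")
    return parts[1], parts[2]


if __name__ == "__main__":
    for name in codegen_model_names():
        print(name, parse_model_name(name))
```

Four sizes and three datasets give twelve names in total, which matches the choices offered in the model-selection cell later in this notebook.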

Prerequisites

[1]:
!git clone https://github.com/salesforce/CodeGen
%cd CodeGen
!pip install --upgrade pip setuptools
!pip install -r requirements.txt
Cloning into 'CodeGen'...
remote: Enumerating objects: 171, done.
remote: Counting objects: 100% (169/169), done.
remote: Compressing objects: 100% (103/103), done.
remote: Total 171 (delta 77), reused 132 (delta 51), pack-reused 2
Receiving objects: 100% (171/171), 1.36 MiB | 1.40 MiB/s, done.
Resolving deltas: 100% (77/77), done.
/home/jovyan/CodeGen
Requirement already satisfied: pip in /opt/conda/lib/python3.8/site-packages (22.2.2)
Collecting pip
  Downloading pip-23.1-py3-none-any.whl (2.1 MB)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.8/site-packages (65.3.0)
Collecting setuptools
  Downloading setuptools-67.6.1-py3-none-any.whl (1.1 MB)
Installing collected packages: setuptools, pip
  Attempting uninstall: setuptools
    Found existing installation: setuptools 65.3.0
    Uninstalling setuptools-65.3.0:
      Successfully uninstalled setuptools-65.3.0
  Attempting uninstall: pip
    Found existing installation: pip 22.2.2
    Uninstalling pip-22.2.2:
      Successfully uninstalled pip-22.2.2
Successfully installed pip-23.1 setuptools-67.6.1
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.9.0+cu111 (from -r requirements.txt (line 2))
  Downloading https://download.pytorch.org/whl/cu111/torch-1.9.0%2Bcu111-cp38-cp38-linux_x86_64.whl (2041.3 MB)
Collecting transformers==4.16.2 (from -r requirements.txt (line 3))
  Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)
Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.8/site-packages (from torch==1.9.0+cu111->-r requirements.txt (line 2)) (4.3.0)
Collecting filelock (from transformers==4.16.2->-r requirements.txt (line 3))
  Downloading filelock-3.11.0-py3-none-any.whl (10.0 kB)
Collecting huggingface-hub<1.0,>=0.1.0 (from transformers==4.16.2->-r requirements.txt (line 3))
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (1.22.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (21.3)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (5.4.1)
Collecting regex!=2019.12.17 (from transformers==4.16.2->-r requirements.txt (line 3))
  Downloading regex-2023.3.23-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (771 kB)
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (2.28.1)
Collecting sacremoses (from transformers==4.16.2->-r requirements.txt (line 3))
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... done
Collecting tokenizers!=0.11.3,>=0.10.1 (from transformers==4.16.2->-r requirements.txt (line 3))
  Downloading tokenizers-0.13.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (4.64.1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging>=20.0->transformers==4.16.2->-r requirements.txt (line 3)) (3.0.9)
Requirement already satisfied: charset-normalizer<3,>=2 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (3.3)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (1.26.11)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (2022.6.15)
Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers==4.16.2->-r requirements.txt (line 3)) (1.16.0)
Requirement already satisfied: click in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers==4.16.2->-r requirements.txt (line 3)) (7.1.2)
Requirement already satisfied: joblib in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers==4.16.2->-r requirements.txt (line 3)) (1.1.0)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.53-py3-none-any.whl size=895241 sha256=3f2b8d261642b1de521eed16da25fc6c1ef5f5d68c11f1379606df72e4fba31c
  Stored in directory: /home/jovyan/.cache/pip/wheels/82/ab/9b/c15899bf659ba74f623ac776e861cf2eb8608c1825ddec66a4
Successfully built sacremoses
Installing collected packages: tokenizers, torch, regex, filelock, sacremoses, huggingface-hub, transformers
  Attempting uninstall: torch
    Found existing installation: torch 1.8.1+cu111
    Uninstalling torch-1.8.1+cu111:
      Successfully uninstalled torch-1.8.1+cu111
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 0.8.1 requires torch==1.8.1, but you have torch 1.9.0+cu111 which is incompatible.
torchvision 0.9.1+cu111 requires torch==1.8.1, but you have torch 1.9.0+cu111 which is incompatible.
Successfully installed filelock-3.11.0 huggingface-hub-0.13.4 regex-2023.3.23 sacremoses-0.0.53 tokenizers-0.13.3 torch-1.9.0+cu111 transformers-4.16.2
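The log above ends with a dependency warning: torchaudio and torchvision still expect torch 1.8.1, although the cells in this notebook do not import either package directly. Before continuing, it can be useful to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch using only the standard library; the `EXPECTED` pins are copied from the log, and the helper names are hypothetical.

```python
from importlib import metadata

# Pins taken from the pip log above.
EXPECTED = {"torch": "1.9.0", "transformers": "4.16.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected):
    """Map each package name to (installed_version, matches_pin)."""
    report = {}
    for package, pin in expected.items():
        found = installed_version(package)
        # Local build tags such as "+cu111" are ignored for the comparison.
        matches = found is not None and found.split("+")[0] == pin
        report[package] = (found, matches)
    return report


if __name__ == "__main__":
    for package, (found, ok) in check_pins(EXPECTED).items():
        print(f"{package}: installed={found} matches_pin={ok}")
```

If a pin does not match, restarting the notebook kernel after the install is a reasonable first step, since a package imported before the upgrade stays loaded in memory.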

Load model and tokenizer

[2]:
chosen_model = "codegen-350M-mono" #@param ["codegen-350M-nl", "codegen-350M-multi", "codegen-350M-mono", "codegen-2B-nl", "codegen-2B-multi", "codegen-2B-mono", "codegen-6B-nl", "codegen-6B-multi", "codegen-6B-mono", "codegen-16B-nl", "codegen-16B-multi", "codegen-16B-mono"]
fp16 = True #@param {type:"boolean"}

import os

if not os.path.exists(f'./checkpoints/{chosen_model}'):
  !wget -P checkpoints https://storage.googleapis.com/sfr-codegen-research/checkpoints/{chosen_model}.tar.gz && tar -xvf checkpoints/{chosen_model}.tar.gz -C checkpoints/


import torch
from jaxformer.hf.sample import truncate as do_truncate
from jaxformer.hf.sample import set_env, set_seed, print_time, create_model, create_custom_gpt2_tokenizer, create_tokenizer, sample

# (0) constants

models_nl = ['codegen-350M-nl', 'codegen-2B-nl', 'codegen-6B-nl', 'codegen-16B-nl']
models_pl = ['codegen-350M-multi', 'codegen-2B-multi', 'codegen-6B-multi', 'codegen-16B-multi', 'codegen-350M-mono', 'codegen-2B-mono', 'codegen-6B-mono', 'codegen-16B-mono']
models = models_nl + models_pl


# (2) preamble

set_env()

pad = 50256
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is visible
ckpt = f'./checkpoints/{chosen_model}'

if device.type == "cpu":
  print()
  print("force full precision for cpu!!")
  print()
  fp16 = False


# (3) load

with print_time('loading parameters'):
  model = create_model(ckpt=ckpt, fp16=fp16).to(device)


with print_time('loading tokenizer'):
  if chosen_model in models_pl:
    tokenizer = create_custom_gpt2_tokenizer()
  else:
    tokenizer = create_tokenizer()
  tokenizer.padding_side = 'left'
  tokenizer.pad_token = pad
--2023-04-16 15:38:22--  https://storage.googleapis.com/sfr-codegen-research/checkpoints/codegen-350M-mono.tar.gz
Resolving proxy.liuqi.me (proxy.liuqi.me)... 10.185.248.180
Connecting to proxy.liuqi.me (proxy.liuqi.me)|10.185.248.180|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 656148604 (626M) [application/x-tar]
Saving to: ‘checkpoints/codegen-350M-mono.tar.gz’

codegen-350M-mono.t 100%[===================>] 625.75M  11.9MB/s    in 47s

2023-04-16 15:39:10 (13.2 MB/s) - ‘checkpoints/codegen-350M-mono.tar.gz’ saved [656148604/656148604]

codegen-350M-mono/
codegen-350M-mono/config.json
codegen-350M-mono/pytorch_model.bin
loading parameters
loading parameters took 24.35s
loading tokenizer
loading tokenizer took 36.72s
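When choosing a larger checkpoint, a rough estimate of the memory needed for the weights helps decide whether the model fits on the available GPU. The sketch below assumes the nominal parameter counts implied by the model names and counts only the weights (2 bytes per parameter in fp16, 4 in fp32); activations and framework overhead are not included, so real usage will be higher. The helper name `weight_memory_gib` is hypothetical.

```python
# Nominal parameter counts implied by the model-size part of the name.
NOMINAL_PARAMS = {"350M": 350e6, "2B": 2e9, "6B": 6e9, "16B": 16e9}


def weight_memory_gib(model_size, fp16=True):
    """Rough size of the weights alone, in GiB."""
    bytes_per_param = 2 if fp16 else 4
    return NOMINAL_PARAMS[model_size] * bytes_per_param / 2**30


if __name__ == "__main__":
    for size in NOMINAL_PARAMS:
        half = weight_memory_gib(size, fp16=True)
        full = weight_memory_gib(size, fp16=False)
        print(f"{size}: fp16 ~{half:.1f} GiB, fp32 ~{full:.1f} GiB")
```

This is also why the loading cell forces full precision on CPU but defaults to `fp16 = True` on GPU: half precision halves the footprint of the weights.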

Try out the model

[3]:
rng_seed = 42 #@param {type:"integer"}
rng_deterministic = True #@param {type:"boolean"}
p = 0.95 #@param {type:"number"}
t = 0.2 #@param {type:"number"}
max_length = 128 #@param {type:"integer"}
batch_size = 1 #@param {type:"integer"}
context = "def hello_world():" #@param {type:"string"}

set_seed(rng_seed, deterministic=rng_deterministic)

# (4) sample

with print_time('sampling'):
  completion = sample(device=device, model=model, tokenizer=tokenizer, context=context, pad_token_id=pad, num_return_sequences=batch_size, temp=t, top_p=p, max_length_sample=max_length)[0]
  truncation = do_truncate(completion)

  print('=' * 100)
  print(completion)
  print('=' * 100)
  print(context+truncation)
  print('=' * 100)


# !python -m jaxformer.hf.sample --model $chosen_model \
#                  --rng-seed $rng_seed \
#                  --p $p \
#                  --t $t \
#                  --max-length $max_length \
#                  --batch-size $batch_size \
#                  --context '$context'
sampling
====================================================================================================

    print("Hello World")

hello_world()

#
====================================================================================================
def hello_world():
    print("Hello World")

hello_world()


====================================================================================================
sampling took 0.96s
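The parameters `t` (temperature) and `p` (top-p, also called nucleus sampling) control how the next token is drawn. The following pure-Python sketch illustrates the general technique on a toy list of logits; it is not the code path used by `jaxformer.hf.sample`, and all function names in it are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.2, p=0.95, rng=random):
    """Draw one token index using temperature scaling followed by top-p filtering."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[i] for i in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    random.seed(42)
    toy_logits = [2.0, 1.5, 0.3, -1.0]
    print([sample_token(toy_logits) for _ in range(10)])
```

With a low temperature such as 0.2, most of the probability mass concentrates on the top one or two tokens, so the output is close to greedy decoding; raising `t` or `p` makes completions more varied.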

Training and Fine-tuning

The Jaxformer library for data pre-processing, training and fine-tuning the CodeGen models can be found here:

https://github.com/salesforce/jaxformer