Salesforce CodeGen
CodeGen models (350M, 2B, 6B, 16B) for Program Synthesis are developed by Salesforce Research and presented in: CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. This notebook shows an example of how to use CodeGen for program synthesis in a Kubeflow notebook environment.
Released Models
Models of various sizes, trained on various datasets, are released. The models are named in the following format:
codegen-{model-size}-{data}
model-size has 4 options: 350M, 2B, 6B, 16B, which represent the number of parameters in each model.
data has 3 options: nl, multi, mono.
nl models are randomly initialized and trained on The Pile, an 825.18 GB English text corpus.
multi models are initialized from nl models and then trained on a corpus with code data consisting of multiple programming languages.
mono models are initialized from multi models and then trained on a corpus with Python code data.
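The naming scheme and the initialization chain described above can be sketched in a few lines of Python. The helper names (checkpoint_names, parse_checkpoint_name, training_lineage) are hypothetical and exist only for this illustration; they are not part of the CodeGen repository.

```python
from itertools import product

# Options taken from the naming scheme codegen-{model-size}-{data}.
MODEL_SIZES = ["350M", "2B", "6B", "16B"]
DATA_VARIANTS = ["nl", "multi", "mono"]

# Which variant each data variant is initialized from (None = random init).
INITIALIZED_FROM = {"nl": None, "multi": "nl", "mono": "multi"}


def checkpoint_names():
    """Enumerate every name the scheme allows."""
    return [f"codegen-{size}-{data}" for size, data in product(MODEL_SIZES, DATA_VARIANTS)]


def parse_checkpoint_name(name):
    """Split a name into (model_size, data); raise ValueError if it does not fit the scheme."""
    prefix, size, data = name.split("-")
    if prefix != "codegen" or size not in MODEL_SIZES or data not in DATA_VARIANTS:
        raise ValueError(f"not a CodeGen checkpoint name: {name!r}")
    return size, data


def training_lineage(data):
    """Return the training stages that lead to a data variant, earliest first."""
    stages = []
    while data is not None:
        stages.append(data)
        data = INITIALIZED_FROM[data]
    return stages[::-1]


if __name__ == "__main__":
    print(len(checkpoint_names()), "names, e.g.", checkpoint_names()[0])
    print(parse_checkpoint_name("codegen-350M-mono"))
    print(training_lineage("mono"))
```

For example, a mono checkpoint has gone through all three stages (nl, then multi, then mono), which is why it is the variant chosen for Python completion later in this notebook.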
Prerequisites
[1]:
!git clone https://github.com/salesforce/CodeGen
%cd CodeGen
!pip install --upgrade pip setuptools
!pip install -r requirements.txt
Cloning into 'CodeGen'...
remote: Enumerating objects: 171, done.
remote: Counting objects: 100% (169/169), done.
remote: Compressing objects: 100% (103/103), done.
remote: Total 171 (delta 77), reused 132 (delta 51), pack-reused 2
Receiving objects: 100% (171/171), 1.36 MiB | 1.40 MiB/s, done.
Resolving deltas: 100% (77/77), done.
/home/jovyan/CodeGen
Requirement already satisfied: pip in /opt/conda/lib/python3.8/site-packages (22.2.2)
Collecting pip
Downloading pip-23.1-py3-none-any.whl (2.1 MB)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.8/site-packages (65.3.0)
Collecting setuptools
Downloading setuptools-67.6.1-py3-none-any.whl (1.1 MB)
Installing collected packages: setuptools, pip
Attempting uninstall: setuptools
Found existing installation: setuptools 65.3.0
Uninstalling setuptools-65.3.0:
Successfully uninstalled setuptools-65.3.0
Attempting uninstall: pip
Found existing installation: pip 22.2.2
Uninstalling pip-22.2.2:
Successfully uninstalled pip-22.2.2
Successfully installed pip-23.1 setuptools-67.6.1
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.9.0+cu111 (from -r requirements.txt (line 2))
Downloading https://download.pytorch.org/whl/cu111/torch-1.9.0%2Bcu111-cp38-cp38-linux_x86_64.whl (2041.3 MB)
Collecting transformers==4.16.2 (from -r requirements.txt (line 3))
Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)
Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.8/site-packages (from torch==1.9.0+cu111->-r requirements.txt (line 2)) (4.3.0)
Collecting filelock (from transformers==4.16.2->-r requirements.txt (line 3))
Downloading filelock-3.11.0-py3-none-any.whl (10.0 kB)
Collecting huggingface-hub<1.0,>=0.1.0 (from transformers==4.16.2->-r requirements.txt (line 3))
Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (1.22.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (21.3)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (5.4.1)
Collecting regex!=2019.12.17 (from transformers==4.16.2->-r requirements.txt (line 3))
Downloading regex-2023.3.23-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (771 kB)
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (2.28.1)
Collecting sacremoses (from transformers==4.16.2->-r requirements.txt (line 3))
Downloading sacremoses-0.0.53.tar.gz (880 kB)
Preparing metadata (setup.py) ... done
Collecting tokenizers!=0.11.3,>=0.10.1 (from transformers==4.16.2->-r requirements.txt (line 3))
Downloading tokenizers-0.13.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (4.64.1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging>=20.0->transformers==4.16.2->-r requirements.txt (line 3)) (3.0.9)
Requirement already satisfied: charset-normalizer<3,>=2 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (3.3)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (1.26.11)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (2022.6.15)
Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers==4.16.2->-r requirements.txt (line 3)) (1.16.0)
Requirement already satisfied: click in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers==4.16.2->-r requirements.txt (line 3)) (7.1.2)
Requirement already satisfied: joblib in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers==4.16.2->-r requirements.txt (line 3)) (1.1.0)
Building wheels for collected packages: sacremoses
Building wheel for sacremoses (setup.py) ... done
Created wheel for sacremoses: filename=sacremoses-0.0.53-py3-none-any.whl size=895241 sha256=3f2b8d261642b1de521eed16da25fc6c1ef5f5d68c11f1379606df72e4fba31c
Stored in directory: /home/jovyan/.cache/pip/wheels/82/ab/9b/c15899bf659ba74f623ac776e861cf2eb8608c1825ddec66a4
Successfully built sacremoses
Installing collected packages: tokenizers, torch, regex, filelock, sacremoses, huggingface-hub, transformers
Attempting uninstall: torch
Found existing installation: torch 1.8.1+cu111
Uninstalling torch-1.8.1+cu111:
Successfully uninstalled torch-1.8.1+cu111
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 0.8.1 requires torch==1.8.1, but you have torch 1.9.0+cu111 which is incompatible.
torchvision 0.9.1+cu111 requires torch==1.8.1, but you have torch 1.9.0+cu111 which is incompatible.
Successfully installed filelock-3.11.0 huggingface-hub-0.13.4 regex-2023.3.23 sacremoses-0.0.53 tokenizers-0.13.3 torch-1.9.0+cu111 transformers-4.16.2
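The install log above ends with a dependency-conflict warning, because requirements.txt pins torch==1.9.0+cu111 and transformers==4.16.2 while the base image ships companion packages built for torch 1.8.1. A minimal sketch for checking what is actually installed, using only the standard library, could look like this. The helper names installed_versions and pinned_mismatches are hypothetical, and the pins are copied from the log rather than read from requirements.txt.

```python
from importlib import metadata

# Pins as they appear in the pip log above (assumed to mirror requirements.txt).
PINS = {"torch": "1.9.0+cu111", "transformers": "4.16.2"}


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def pinned_mismatches(pins, versions):
    """Return {package: (pinned, installed)} for every package that differs from its pin."""
    return {
        name: (pinned, versions.get(name))
        for name, pinned in pins.items()
        if versions.get(name) != pinned
    }


if __name__ == "__main__":
    found = installed_versions(list(PINS) + ["torchaudio", "torchvision"])
    print(found)
    print("mismatches:", pinned_mismatches(PINS, found))
```

An empty mismatch dictionary only says the two pinned packages match; it does not resolve the torchaudio and torchvision warnings, which this notebook does not depend on.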
Load model and tokenizer
[2]:
chosen_model = "codegen-350M-mono" #@param ["codegen-350M-nl", "codegen-350M-multi", "codegen-350M-mono", "codegen-2B-nl", "codegen-2B-multi", "codegen-2B-mono", "codegen-6B-nl", "codegen-6B-multi", "codegen-6B-mono", "codegen-16B-nl", "codegen-16B-multi", "codegen-16B-mono"]
fp16 = True #@param {type:"boolean"}

import os

if not os.path.exists(f'./checkpoints/{chosen_model}'):
    !wget -P checkpoints https://storage.googleapis.com/sfr-codegen-research/checkpoints/{chosen_model}.tar.gz && tar -xvf checkpoints/{chosen_model}.tar.gz -C checkpoints/

import torch

from jaxformer.hf.sample import truncate as do_truncate
from jaxformer.hf.sample import set_env, set_seed, print_time, create_model, create_custom_gpt2_tokenizer, create_tokenizer, sample

# (0) constants
models_nl = ['codegen-350M-nl', 'codegen-2B-nl', 'codegen-6B-nl', 'codegen-16B-nl']
models_pl = ['codegen-350M-multi', 'codegen-2B-multi', 'codegen-6B-multi', 'codegen-16B-multi', 'codegen-350M-mono', 'codegen-2B-mono', 'codegen-6B-mono', 'codegen-16B-mono']
models = models_nl + models_pl

# (2) preamble
set_env()

pad = 50256
device = torch.device('cuda:0')
ckpt = f'./checkpoints/{chosen_model}'

if device.type == "cpu":
    print()
    print("force full precision for cpu!!")
    print()
    fp16 = False

# (3) load
with print_time('loading parameters'):
    model = create_model(ckpt=ckpt, fp16=fp16).to(device)

with print_time('loading tokenizer'):
    if chosen_model in models_pl:
        tokenizer = create_custom_gpt2_tokenizer()
    else:
        tokenizer = create_tokenizer()
    tokenizer.padding_side = 'left'
    tokenizer.pad_token = pad
--2023-04-16 15:38:22-- https://storage.googleapis.com/sfr-codegen-research/checkpoints/codegen-350M-mono.tar.gz
Resolving proxy.liuqi.me (proxy.liuqi.me)... 10.185.248.180
Connecting to proxy.liuqi.me (proxy.liuqi.me)|10.185.248.180|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 656148604 (626M) [application/x-tar]
Saving to: ‘checkpoints/codegen-350M-mono.tar.gz’
codegen-350M-mono.t 100%[===================>] 625.75M 11.9MB/s in 47s
2023-04-16 15:39:10 (13.2 MB/s) - ‘checkpoints/codegen-350M-mono.tar.gz’ saved [656148604/656148604]
codegen-350M-mono/
codegen-350M-mono/config.json
codegen-350M-mono/pytorch_model.bin
loading parameters
loading parameters took 24.35s
loading tokenizer
loading tokenizer took 36.72s
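The download step in the cell above relies on the wget and tar shell commands. Where those are not available, the same "download once, then unpack" logic can be sketched with the Python standard library alone. The function names checkpoint_url and fetch_checkpoint are hypothetical; the base URL is the one used by the wget command above, and the archive layout (a top-level directory named after the model) is assumed from the tar output shown.

```python
import tarfile
import urllib.request
from pathlib import Path

# Same bucket the notebook's wget command downloads from.
BASE_URL = "https://storage.googleapis.com/sfr-codegen-research/checkpoints"


def checkpoint_url(model_name):
    """Build the archive URL for a checkpoint such as 'codegen-350M-mono'."""
    return f"{BASE_URL}/{model_name}.tar.gz"


def fetch_checkpoint(model_name, root="checkpoints"):
    """Download and unpack a checkpoint unless its directory already exists."""
    root = Path(root)
    target = root / model_name
    if target.exists():
        # Already unpacked by an earlier run; skip the download.
        return target
    root.mkdir(parents=True, exist_ok=True)
    archive = root / f"{model_name}.tar.gz"
    # urllib honours the usual proxy environment variables, like wget does.
    urllib.request.urlretrieve(checkpoint_url(model_name), archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(root)
    return target


if __name__ == "__main__":
    print(checkpoint_url("codegen-350M-mono"))
    # fetch_checkpoint("codegen-350M-mono")  # downloads roughly 626 MB
```

The returned path can then be passed as ckpt to create_model in place of the hard-coded f-string.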
Try out the model
[3]:
rng_seed = 42 #@param {type:"integer"}
rng_deterministic = True #@param {type:"boolean"}
p = 0.95 #@param {type:"number"}
t = 0.2 #@param {type:"number"}
max_length = 128 #@param {type:"integer"}
batch_size = 1 #@param {type:"integer"}
context = "def hello_world():" #@param {type:"string"}
set_seed(rng_seed, deterministic=rng_deterministic)
# (4) sample
with print_time('sampling'):
    completion = sample(device=device, model=model, tokenizer=tokenizer, context=context, pad_token_id=pad, num_return_sequences=batch_size, temp=t, top_p=p, max_length_sample=max_length)[0]
    truncation = do_truncate(completion)

    print('=' * 100)
    print(completion)
    print('=' * 100)
    print(context+truncation)
    print('=' * 100)
# !python -m jaxformer.hf.sample --model $chosen_model \
# --rng-seed $rng_seed \
# --p $p \
# --t $t \
# --max-length $max_length \
# --batch-size $batch_size \
# --context '$context'
sampling
====================================================================================================
    print("Hello World")
hello_world()
#
====================================================================================================
def hello_world():
    print("Hello World")
hello_world()
====================================================================================================
sampling took 0.96s
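The cell above passes t and p to sample as temp and top_p. Assuming these have their usual meaning (temperature scaling followed by nucleus, or top-p, sampling), which is not confirmed by anything shown in this notebook, the effect of the two parameters can be sketched in plain Python. The functions top_p_filter and sample_token are hypothetical illustrations, not jaxformer code.

```python
import math
import random


def top_p_filter(logits, temperature=0.2, top_p=0.95):
    """Return {token_index: probability} for the nucleus after temperature scaling."""
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens only.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.2, top_p=0.95, rng=random):
    """Draw one token index from the filtered distribution."""
    nucleus = top_p_filter(logits, temperature, top_p)
    tokens = list(nucleus)
    weights = [nucleus[i] for i in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]
    print(top_p_filter(toy_logits, temperature=0.2))  # sharply peaked
    print(top_p_filter(toy_logits, temperature=1.0))  # much flatter
```

With the notebook's settings (t = 0.2, p = 0.95), the toy distribution collapses onto its most likely token, which is consistent with the low-variance completion printed above; raising t toward 1.0 would let less likely tokens through.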
Training and Fine-tuning
The Jaxformer library for data pre-processing, training and fine-tuning the CodeGen models can be found here: