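StarCoder: may the source be with you!

Abstract

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities, and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack (Kocetkov et al., 2022), a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.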

Raymond Li ${}^{2}$ Loubna Ben Allal ${}^{1}$ Yangtian Zi ${}^{4}$ Niklas Muennighoff ${}^{1}$ Denis Kocetkov ${}^{2}$ Chenghao Mou ${}^{5}$ Marc Marone ${}^{8}$ Christopher Akiki ${}^{9,10}$ Jia Li ${}^{5}$ Jenny Chim ${}^{11}$ Qian Liu ${}^{13}$ Evgenii Zheltonozhskii ${}^{14}$ Terry Yue Zhuo ${}^{15,16}$ Thomas Wang ${}^{1}$ Olivier Dehaene ${}^{1}$ Mishig Davaadorj ${}^{1}$ Joel Lamy-Poirier ${}^{2}$ João Monteiro ${}^{2}$ Oleh Shliazhko ${}^{2}$ Nicolas Gontier ${}^{2}$ Nicholas Meade ${}^{6,17}$ Armel Zebaze ${}^{1}$ Ming-Ho Yee ${}^{4}$ Logesh Kumar Umapathi ${}^{18}$ Jian Zhu ${}^{19}$ Benjamin Lipkin ${}^{20}$ Muhtasham Oblokulov ${}^{21}$ Zhiruo Wang ${}^{7}$ Rudra Murthy ${}^{22}$ Jason Stillerman ${}^{23}$ Siva Sankalp Patel ${}^{22}$ Dmitry Abulkhanov ${}^{5}$ Marco Zocca ${}^{24}$ Manan Dey ${}^{25}$ Zhihan Zhang ${}^{26}$ Nour Fahmy ${}^{27}$ Urvashi Bhattacharyya ${}^{28}$ Wenhao Yu ${}^{26}$ Swayam Singh ${}^{30}$ Sasha Luccioni ${}^{1}$ Paulo Villegas ${}^{31}$ Maxim Kunakov ${}^{32}$ Fedor Zhdanov ${}^{32}$ Manuel Romero ${}^{5}$ Tony Lee ${}^{33}$ Nadav Timor ${}^{34}$ Jennifer Ding ${}^{35}$ Claire Schlesinger ${}^{4}$

Hailey Schoelkopf ${}^{37}$ Jan Ebert ${}^{38}$ Tri Dao ${}^{33}$ Mayank Mishra ${}^{22}$ Alex Gu ${}^{20}$ Jennifer Robinson ${}^{3}$ Carolyn Jane Anderson ${}^{36}$ Brendan Dolan-Gavitt ${}^{29}$ Danish Contractor ${}^{5}$ Siva Reddy ${}^{2,6}$ Daniel Fried ${}^{7}$ Dzmitry Bahdanau ${}^{2}$ Yacine Jernite ${}^{1}$ Carlos Muñoz Ferrandis ${}^{1}$ Sean Hughes ${}^{3}$ Thomas Wolf ${}^{1}$ Arjun Guha ${}^{4,12}$

Leandro von Werra ${}^{1,\star}$ Harm de Vries ${}^{2,\star}$

${}^{1}$ Hugging Face ${}^{2}$ ServiceNow Research ${}^{3}$ ServiceNow ${}^{4}$ Northeastern University ${}^{5}$ Independent ${}^{6}$ Mila ${}^{7}$ Carnegie Mellon University ${}^{8}$ Johns Hopkins University ${}^{9}$ Leipzig University ${}^{10}$ ScaDS.AI ${}^{11}$ Queen Mary University of London ${}^{12}$ Roblox ${}^{13}$ Sea AI Lab ${}^{14}$ Technion - Israel Institute of Technology ${}^{15}$ Monash University ${}^{16}$ CSIRO's Data61 ${}^{17}$ McGill University ${}^{18}$ Saama AI Research Lab ${}^{19}$ University of British Columbia ${}^{20}$ Massachusetts Institute of Technology ${}^{21}$ Technical University of Munich ${}^{22}$ IBM Research ${}^{23}$ University of Vermont ${}^{24}$ UnfoldML ${}^{25}$ SAP ${}^{26}$ University of Notre Dame ${}^{27}$ Columbia University ${}^{28}$ Discover Dollar Pvt Ltd ${}^{29}$ New York University ${}^{30}$ University of Allahabad ${}^{31}$ Telefonica I+D ${}^{32}$ Toloka ${}^{33}$ Stanford University ${}^{34}$ Weizmann Institute of Science ${}^{35}$ The Alan Turing Institute ${}^{36}$ Wellesley College ${}^{37}$ Eleuther AI ${}^{38}$ Forschungszentrum Jülich

Corresponding authors ( $\star$ ) can be contacted at contact@bigcode-project.org

Reviewed on OpenReview: https://openreview.net/forum?id=KoFOg41haE

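1 Introduction

Generative AI and large language models (LLMs; Brown et al., 2020; Chen et al., 2021; Chowdhery et al., 2022; Zhang et al., 2022; OpenAI, 2023a) are predicted to significantly impact the workforce in the coming years (Eloundou et al., 2023; Bommasani et al., 2021; World Economic Forum, 2023) by boosting worker productivity. LLMs trained on code (Code LLMs) have seen particularly fast adoption: Microsoft's Copilot has attracted over 1 million professional developers (Euronews, 2023), and GitHub reports that Copilot users rely on it to produce 35% of the code they write for some languages (Thompson, 2022). However, the development and use of LLMs have raised concerns about copyright, privacy, and openness.

Copyright concerns arise in many jurisdictions, including the U.S. and E.U., regarding the rights of content creators whose public data is used to train language models. It has been questioned whether machine learning models trained on such data fall under the fair-use doctrine in the U.S. (Kuhn, 2022; Butterick, 2022; Rothchild & Rothchild, 2022), with fair use being most likely when the model generates novel content dissimilar to any copyrighted training data (Lemley & Casey, 2020; Levendowski, 2018). Henderson et al. (2023) therefore suggest that LLM developers should provide additional tools to ensure these models comply with current copyright laws. These legal issues are not only the subject of scholarly debate: lawsuits have already been filed against GitHub Copilot (DOE 1 v. and GitHub, Inc., 2022) as well as Stable Diffusion (Andersen et al. v. Stability AI et al., 2023).

Concerns about personal information led Italy to temporarily ban ChatGPT and launch an ongoing investigation into OpenAI's compliance with the E.U.'s General Data Protection Regulation (GDPR) (BBC, 2023). According to these regulations (European Council, 2018; Lomas, 2022), organizations that process personal information must have a valid legal basis. These laws could affect LLM developers who gather vast amounts of public data from the internet, which may include personal information. Obtaining explicit consent from data creators is difficult at this scale, and it is uncertain whether other legal grounds exist for processing this personal information. Moreover, even with a valid legal basis, GDPR mandates that data processors inform individuals how their data is being processed and provide data access controls, such as the right to have data deleted or to correct erroneous data. This would require LLM providers to be transparent about the data they have collected and to provide tooling for individuals to inspect their data and have it deleted.

The lack of transparency and openness surrounding the development processes of generative AI models has also raised concerns in the scientific community. Many models are closed-access to varying degrees: from being available only within the organization that developed them (Chowdhery et al., 2022; Hoffmann et al., 2022) to being publicly accessible through a paid API but with many details of their development process hidden (Brown et al., 2020; OpenAI, 2023a). While API access allows researchers to experiment with these models, it limits their ability to research LLM safety (Perez et al., 2022), inspect the models' inner workings (Olsson et al., 2022), and contribute to model improvements (Togelius & Yannakakis, 2023).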

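We use "open-access" to refer to models whose weights are public. Although other open-access models exist, the level of openness still varies across these projects; some models with released weights have restrictions on model distribution (Touvron et al., 2023), or do not release their training datasets (Nijkamp et al., 2023; Zhang et al., 2022; Fried et al., 2022). Even in cases where models and training data are both released permissively (Raffel et al., 2020; Tay et al., 2022), external researchers typically do not have an opportunity to participate in guiding the development of industry-produced models. In contrast, other LLM development projects have taken a fully open approach that aims to allow community input into model development, release training data, and enable external audits throughout the full development process (Solaiman, 2023). One example is the BigScience research workshop (BigScience Workshop, 2022), an open scientific collaboration (Akiki et al., 2022) comprising hundreds of researchers collaborating to release BLOOM, a multilingual LLM (Scao et al., 2022; Muennighoff et al., 2022). Similarly, EleutherAI, a grassroots-turned-nonprofit research initiative, has released open-access LLMs including GPT-NeoX (Black et al., 2022), GPT-J (Wang & Komatsuzaki, 2021), and Pythia (Biderman et al., 2023), as well as the associated training data (Gao et al., 2021a).

In this paper, we describe StarCoder and StarCoderBase, open-access code LLMs developed and released by the BigCode community, with a focus on respecting copyright, privacy, transparency, and community-driven model development. The project is an open-scientific collaboration focusing on the responsible development of LLMs for code. It is co-stewarded by two industry research labs and comprises more than 600 members from diverse academic institutes and industry labs. The Stack (Kocetkov et al., 2022) is a publicly available pre-training dataset for Code LLMs with a transparent data governance framework. The Stack consists of 6.4 TB of permissively licensed source code in 384 programming languages, and includes 54 GB of GitHub issues and repository-level metadata in the v1.2 version of the dataset. The dataset comes with "Am I in The Stack", a governance tool for developers to check whether their source code is part of the dataset, and an opt-out process for those who wish to have their code removed from the dataset.

StarCoder and StarCoderBase are both 15.5B parameter models trained on permissively licensed data from The Stack. We trained StarCoderBase on 1 trillion tokens sourced from 80+ programming languages, GitHub issues, Git commits, and Jupyter notebooks. We fine-tuned StarCoderBase on another 35B Python tokens, leading to the StarCoder model. Both StarCoder models come with a novel combination of architectural features, such as an 8K token context length (Dao et al., 2022), infilling capabilities through Fill-in-the-Middle (FIM; Bavarian et al., 2022), and fast large-batch inference through Multi-Query Attention (MQA; Shazeer, 2019). We present an extensive evaluation of the StarCoder models and release a demo along with an integrated attribution tool that can help users locate model generations that may have been copied from the training set. Overall, our contributions can be summarized as follows.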

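• We release StarCoderBase and StarCoder, open-access Code LLMs trained on 80+ programming languages that support a novel combination of capabilities and architectural features unavailable in other open Code LLMs.

• We perform the most comprehensive evaluation of Code LLMs to date using a diverse set of benchmarks (Lai et al., 2022; Cassano et al., 2023; Pearce et al., 2022; Fried et al., 2022; Yee & Guha, 2023; Chen et al., 2021; Ben Allal et al., 2020; Reddy et al., 2021; Liang et al., 2022), and show that:
  – StarCoder outperforms every open LLM for code that supports multiple programming languages (Nijkamp et al., 2023; Zheng et al., 2023);
  – StarCoder matches or outperforms the OpenAI code-cushman-001 model; and
  – when fine-tuned on Python, StarCoder substantially outperforms existing LLMs that are also fine-tuned on Python.

• We take important steps towards a safe open model release:
  – We release StarCoder under an OpenRAIL-M license agreement, which enables royalty-free access, use, and distribution of the model while embedding a set of use restrictions in identified critical scenarios. We have worked on a version of the license agreement that: (i) is more commercially viable for companies wishing to use and distribute the model and (ii) promotes transparency and understanding through the sharing of AI documentation such as model cards (Mitchell et al., 2019);
  – We incorporate a new attribution tool into the VSCode demo that can help users detect and locate model generations that may have been copied from the training set. This is achieved through a two-step process that involves a lightweight membership check followed by a search over a BM25 index (Section 9); and
  – We have significantly improved the PII redaction pipeline by collecting a PII dataset containing 12,000 files with 22,950 annotated entities. We fine-tuned our own encoder model (StarEncoder) on this dataset, resulting in a robust PII detection model (Section 4).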

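2 Related Work

Language models

Early efforts to build large-scale language models used n-grams and simple smoothing techniques (Brants et al., 2007; Heafield et al., 2013; Buck et al., 2014). Other approaches applied various types of neural network architectures, such as feedforward networks (Bengio et al., 2000) and recurrent networks (Mikolov et al., 2010; Jozefowicz et al., 2016), to the language modeling task. The Transformer architecture (Vaswani et al., 2017) led to the development of highly scalable language models (Radford et al., 2019; Brown et al., 2020), which have shown a predictable relationship between language modeling loss and scaling factors such as the model size, number of training tokens, and compute budget (Kaplan et al., 2020; Hoffmann et al., 2022).

Language models for code

Language models were initially applied to code by Hindle et al. (2012), but relied on n-gram models trained at comparatively small scale. Many neural architectures developed in NLP were also applied successfully to code, including encoder-only models for producing code representations (Feng et al., 2020; Kanade et al., 2020) and encoder-decoder models for translation, editing, summarization, and language-to-code tasks (Wang et al., 2021; Ahmad et al., 2021; Li et al., 2022). Decoder-only Transformer architectures have produced strong generative models of code, typically by training on mixtures of text and code from GitHub (Chen et al., 2021; Austin et al., 2021; Fried et al., 2022; Zheng et al., 2023; Nijkamp et al., 2023). Most of these models have not been fully open, but PolyCoder (Xu et al., 2022) and SantaCoder (Ben Allal et al., 2023) are notable exceptions and have both open models and training data. However, these models are relatively small (2.7B and 1.1B parameters, respectively) and are trained on less data ( $<300$ GB of code) than we explore in this work.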

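Closed-access LLMs

Several large tech companies have developed top-performing LLMs without releasing them. Examples include Google's PaLM (Chowdhery et al., 2022) and LaMDA (Thoppilan et al., 2022), DeepMind's Chinchilla (Hoffmann et al., 2022) and Gopher (Rae et al., 2021), and NVIDIA's Megatron-Turing NLG (Smith et al., 2022). OpenAI and other AI startups, including Cohere (https://cohere.com/), Anthropic (https://www.anthropic.com/), and Aleph Alpha (https://www.aleph-alpha.com/), offer LLMs as a paid API service. These companies did not release model weights nor provide comprehensive information on the methodology used to create these models. OpenAI has published several technical reports of the GPT family of models (Brown et al., 2020; Chen et al., 2021; OpenAI, 2023a), showcasing the capabilities of their models.

Open-access LLMs

Numerous open-access LLMs have been released to the AI community, although they are generally not as strong as closed-access ones. In this paper, we use the term "open-access LLM" when the model weights are publicly available. We still note that there are significant differences between open-access models in how transparent they have been about the training data and filtering techniques. For instance, EleutherAI released GPT-NeoX-20B (Black et al., 2022) and GPT-J-6B (Wang & Komatsuzaki, 2021), as well as the dataset these models were trained on (Gao et al., 2021a). Google released UL2-20B (Tay et al., 2022), an encoder-decoder model trained on the publicly available C4 (Raffel et al., 2020). Tsinghua University released the weights of GLM-130B (Zeng et al., 2022), a Chinese-English LLM, and CodeGeeX-13B (Zheng et al., 2023), an LLM for coding applications, without releasing the training sets. Salesforce released CodeGen-Mono-16B (Nijkamp et al., 2023) without disclosing a proprietary Python dataset. Meta released the OPT (Zhang et al., 2022), LLaMA (Touvron et al., 2023), and InCoder models (Fried et al., 2022) under a non-commercial license and only provided high-level details about the data collection and filtering process.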

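3 Data Curation and Cleaning

This section describes how we processed the training data of StarCoderBase. We restrict the training set to The Stack v1.2 (Kocetkov et al., 2022), which exclusively contains data from permissively licensed GitHub repositories (see https://blueoakcouncil.org/ to learn more about permissive licenses and to access a comprehensive collection of such licenses). At the time of the data processing, 44 people had opted out of The Stack. Below, we describe how we further cleaned the data by combining heuristic filtering and manual inspection.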

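3.1 Programming Languages

Selection of programming languages

From the 358 programming languages in The Stack, we selected 86 languages. The assignment of data to programming languages was performed based solely on file extension (Kocetkov et al., 2022). We included all programming languages with more than 500 MB of data, as well as languages that were ranked in the top 50 on Githut 2.0 or the December 2022 TIOBE Index of programming language popularity. In addition, we included dialects of already selected programming languages (e.g., Racket and Scheme for Lisp). We excluded configuration languages (Nix, Puppet, etc.) and languages that are no longer actively supported (ActionScript). We also included data formats like JSON and YAML but limited their data volume (see the "JSON and YAML" paragraph for details). The full list of selected programming languages can be found in Tables 1 and 2. Out of the languages present in MultiPL-E (Cassano et al., 2023), only D and Swift were not included in the training set. For D, language misclassification of the files resulted in less than 2 MB of data in The Stack (Kocetkov et al., 2022). Swift was excluded from the final list of languages due to human error.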

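Visual inspection

We performed a visual inspection to ensure that we only retain data of high quality. To achieve this, we randomly selected 30,000 files from The Stack for each programming language, categorized them by extension, and kept a maximum of 1,000 files for each extension. We then reached out to our community for assistance with data inspection. We instructed the annotators to go through 50 to 100 files and confirm whether the data appeared to be normal code written by humans, as opposed to text, data, or a single long line of autogenerated code. We also asked annotators to determine whether we should use our default alphanumeric filter (which requires over 25% alphanumeric symbols) and long-line filter (which requires lines to be less than 1,000 characters) for a given file extension. Eighteen community annotators evaluated 300 programming language extensions. After inspection, we excluded 36 extensions and eliminated the long-line filter for 27 extensions. The complete results of the data inspection, including annotator remarks, are available in a Google sheet.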

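XML filter

As we inspected the data, we noticed that certain extensions often consisted of XML files. For example, more than 50% of the files with the .sld extension were in XML format. To address this, we implemented a simple XML filter that checked for the presence of "<?xml version=" within the first 100 characters of the file. This filter proved to be effective and produced few false positives. Hence, we applied it to all programming languages except for XSLT, which uses XML syntax.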

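Alpha filter

During our investigation, we discovered that certain extensions, such as MATLAB, contained numerous data files that frequently stored large tensors. To identify these files, we developed an alpha filter that removed files with fewer than 25% alphabetic characters. However, when we tested this filter on a small subset of data, we observed a high rate of false positives for certain programming languages, such as Assembly. To address this issue, we focused on the 25 extensions with the highest number of detections and manually verified whether or not the alpha filter should be applied.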
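The alpha filter reduces to a single ratio test. The following is a minimal sketch, assuming the ratio is computed over all characters of the file (whitespace included) and that Python's `str.isalpha` is an acceptable definition of "alphabetic"; the function names are illustrative and are not taken from the BigCode codebase.

```python
def alphabetic_fraction(text: str) -> float:
    """Return the share of characters in `text` that are alphabetic."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)


def passes_alpha_filter(text: str, min_fraction: float = 0.25) -> bool:
    """Keep a file only if at least `min_fraction` of its characters are letters."""
    return alphabetic_fraction(text) >= min_fraction


# A short function is half letters; a dump of numbers contains none.
code_like = "def add(a, b):\n    return a + b\n"
tensor_dump = "0.12 0.98 1.45 3.22 9.81 7.77 0.01 4.56\n" * 50

print(passes_alpha_filter(code_like))
print(passes_alpha_filter(tensor_dump))
```

A fixed threshold like this is blunt, which is consistent with the false positives reported for Assembly: instruction-heavy files with many hexadecimal literals and registers can fall below 25% letters even though they are ordinary code.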

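HTML

We designed a custom HTML filter that targets excessive HTML boilerplate and links. We took into account the ratio of visible text in each file and only kept those files where the visible text makes up at least 20% of the HTML code and has a minimum length of 100 characters.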

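JSON and YAML

JSON and YAML files are naturally more data-heavy than other languages in The Stack. To remove most of the data files, we applied the following filters. For YAML, we kept files with 50 to 5,000 characters, an average line length smaller than 100, a maximum line length smaller than 1,000, and more than 50% alphabetic characters. These filters remove around 20% of the files and 90% of the volume. For JSON, we kept files with 50 to 5,000 characters and more than 50% alphabetic characters, which removes around 70% of the files and 98% of the volume.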
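The JSON and YAML rules above can be written as two small predicates. This is a sketch under stated assumptions: the length bounds are treated as inclusive, line lengths exclude the newline character, and the alphabetic ratio is taken over all characters; none of these details are specified in the text, and the function names are hypothetical.

```python
def alpha_fraction(text: str) -> float:
    """Share of alphabetic characters in `text` (0.0 for empty input)."""
    return sum(ch.isalpha() for ch in text) / len(text) if text else 0.0


def keep_json(text: str) -> bool:
    """JSON rule: 50 to 5,000 characters and more than 50% letters."""
    return 50 <= len(text) <= 5000 and alpha_fraction(text) > 0.5


def keep_yaml(text: str) -> bool:
    """YAML rule: the JSON rule plus bounds on average and maximum line length."""
    lines = text.splitlines() or [""]
    average_length = sum(len(line) for line in lines) / len(lines)
    longest = max(len(line) for line in lines)
    return keep_json(text) and average_length < 100 and longest < 1000


config_yaml = (
    "name: example-service\n"
    "description: a small web service used for testing\n"
    "owner: platform team\n"
)
data_yaml = "data:\n" + "  - 12345\n" * 100
manifest_json = '{"name": "demo", "description": "an example package manifest for tests"}'

print(keep_yaml(config_yaml), keep_yaml(data_yaml), keep_json(manifest_json))
```

The combination of an upper size bound and a letter ratio is what separates configuration-style files, which are short and mostly keys and words, from serialized data.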

Language	Num. files (after dedup)	Volume in GB (after dedup)	Num. files (after filters and decont.)	Volume in GB (after filters and decont.)	Weight	Percentage
ada	31,291	0.30	30,934	0.26	0.26	0.034
agda	17,608	0.07	17,554	0.07	0.07	0.009
alloy	5,374	0.01	5,368	0.01	0.01	0.001
antlr	7,983	0.05	7,917	0.05	0.05	0.007
applescript	4,906	0.01	4,737	0.01	0.01	0.001
assembly	248,396	1.58	247,919	1.56	1.56	0.203
augeas	195	0.00	180	0.00	0.00	0
awk	10,430	0.02	10,289	0.02	0.02	0.003
batchfile	252,514	0.29	239,568	0.23	0.23	0.03
bluespec	5,940	0.03	5,928	0.03	0.03	0.004
c	8,625,559	57.43	8,536,791	53.89	53.89	7.027
c-sharp	10,839,399	46.29	10,801,285	44.66	44.66	5.823
clojure	126,191	0.49	125,163	0.46	0.46	0.06
cmake	186,517	0.45	186,375	0.45	0.45	0.059
coffeescript	227,889	0.69	226,209	0.64	0.64	0.083
common-lisp	101,370	1.68	98,733	1.40	1.40	0.183
cpp	6,377,914	50.89	6,353,527	48.92	48.92	6.379
css	2,994,829	22.61	2,721,616	11.93	3.00	0.391
cuda	58,355	0.59	58,151	0.56	0.56	0.073
dart	932,583	3.86	928,415	3.66	3.66	0.477
dockerfile	572,186	0.42	571,506	0.42	0.42	0.055
elixir	282,110	0.74	281,016	0.71	0.71	0.093
elm	62,861	0.34	62,033	0.30	0.30	0.039
emacs-lisp	54,768	0.43	52,838	0.41	0.41	0.053
erlang	99,368	0.73	98,447	0.70	0.70	0.091
f-sharp	127,161	0.90	124,066	0.61	0.61	0.08
fortran	165,446	1.84	158,792	1.78	1.78	0.232
glsl	175,576	0.57	167,701	0.40	0.40	0.052
go	4,730,461	25.74	4,700,526	23.78	23.78	3.101
groovy	251,627	0.94	250,834	0.91	0.91	0.119
haskell	544,969	2.36	541,454	2.23	2.23	0.291
html	9,533,367	146.76	3,299,965	29.36	29.36	3.828
idris	8,060	0.03	8,042	0.03	0.03	0.004
isabelle	5,086	0.09	5,001	0.08	0.08	0.01
java	20,151,565	89.30	20,071,773	86.94	86.94	11.336
java-server-pages	214,133	1.03	210,816	0.98	0.98	0.128
javascript	21,108,587	141.65	19,544,285	64.71	64.71	8.437
json	17,012,912	338.34	4,751,547	5.62	1.00	0.13
julia	298,672	1.54	295,364	1.31	1.31	0.171
kotlin	2,242,771	5.77	2,239,354	5.68	5.68	0.741
lean	16,891	0.10	16,870	0.09	0.09	0.012
literate-agda	523	0.01	523	0.01	0.01	0.001
literate-coffeescript	1,138	0.01	1,133	0.01	0.01	0.001
literate-haskell	6,135	0.05	6,104	0.05	0.05	0.007
lua	558,861	3.28	549,459	2.87	2.87	0.374
makefile	661,424	1.49	657,349	1.31	1.31	0.171
maple	1,259	0.01	1,152	0.01	0.01	0.001
markdown	21,045,171	75.25	21,029,287	74.93	74.93	9.77
mathematica	26,895	1.72	22,653	1.25	1.25	0.163
matlab	967	0.04	93	0.00	0.00	0

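Table 1: Overview of the training data for StarCoder. For the selected programming languages, we show the number of files and data volume after near-deduplication, as well as after filtering. See also Table 2.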

Language	Num. files (after dedup)	Volume in GB (after dedup)	Num. files (after filters and decont.)	Volume in GB (after filters and decont.)	Weight	Percentage
ocaml	159,734	1.11	158,356	1.03	1.03	0.134
pascal	118,675	1.71	110,981	1.68	1.68	0.219
perl	392,108	2.63	365,491	2.23	2.23	0.291
php	15,904,518	66.84	15,683,017	60.89	60.89	7.939
powershell	271,487	1.25	267,627	1.12	1.12	0.146
prolog	1,023	0.01	968	0.01	0.01	0.001
protocol-buffer	98,246	0.44	97,167	0.31	0.31	0.04
python	12,962,249	64.30	12,866,649	60.40	60.40	7.875
r	39,194	0.30	39,042	0.30	0.30	0.039
racket	4,201	0.04	3,688	0.03	0.03	0.004
restructuredtext	905,679	3.42	896,880	3.32	3.32	0.433
rmarkdown	5,389	0.06	5,386	0.06	0.06	0.008
ruby	3,405,374	7.14	3,390,320	6.81	6.81	0.888
rust	1,386,585	9.53	1,380,468	9.11	9.11	1.188
sas	9,772	0.13	9,226	0.12	0.12	0.016
scala	1,362,426	4.86	1,355,788	4.69	4.69	0.612
scheme	44,261	0.30	41,890	0.20	0.20	0.026
shell	2,236,434	3.38	2,206,327	3.09	3.09	0.403
smalltalk	592,999	0.74	587,748	0.58	0.58	0.076
solidity	164,242	1.21	153,194	0.85	0.85	0.111
sparql	14,173	0.04	13,716	0.04	0.04	0.005
sql	994,019	12.22	975,420	11.09	11.09	1.446
stan	5,441	0.01	5,429	0.01	0.01	0.001
standard-ml	48,995	0.52	19,630	0.19	0.19	0.025
stata	31,282	0.41	24,208	0.33	0.33	0.043
systemverilog	46,915	0.41	46,270	0.39	0.39	0.051
tcl	50,579	0.40	49,335	0.35	0.35	0.046
tcsh	4,911	0.02	4,806	0.02	0.02	0.003
tex	547,888	5.44	522,778	5.20	5.20	0.678
thrift	4,663	0.01	4,661	0.01	0.01	0.001
typescript	10,637,070	28.82	10,547,331	26.52	26.52	3.458
verilog	77	0.001	75	0.001	0.001	0
vhdl	60,027	1.12	58,208	0.94	0.94	0.123
visual-basic	163,291	1.49	161,239	1.42	1.42	0.185
xslt	43,095	0.56	6,513	0.05	0.05	0.007
yacc	25,775	0.41	7,451	0.11	0.11	0.014
yaml	5,282,081	28.36	3,995,948	3.76	1.00	0.13
zig	15,913	0.18	15,850	0.18	0.18	0.023
GitHub issues			$\sim$ 30,900,000	54.40	54.40	7.093
Git commits			7,674,345	64.00	32.00	4.172
notebook scripts			914,000	7.12	7.12	0.928
notebook structured			668,743	6.00	6.00	0.782
			305,929,658	815.68	799.37	100

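Table 2: Overview of the training data for StarCoder. For the selected programming languages, we show the number of files and data volume after near-deduplication, as well as after filtering. See also Table 1.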

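3.2 Jupyter notebooks

All Jupyter notebooks were retrieved from The Stack. We transformed Jupyter notebooks into two different datasets: Jupyter – scripts and Jupyter – structured.

Jupyter – scripts

We used Jupytext (https://jupytext.readthedocs.io/) to convert notebooks to scripts. It is an actively maintained tool that currently supports 31 programming languages. To initiate the conversion process, Jupytext requires the identification of the specific programming language within each notebook. We extracted this information from the metadata of each notebook. However, more than 30,000 notebooks lacked any programming language information, making it difficult to convert them to script format. To address this issue, we used Guesslang (https://guesslang.readthedocs.io/), an open-source library that employs machine learning techniques to identify the programming language of source code. By applying a probability threshold greater than or equal to 0.5, we reduced the number of unidentified notebooks to 6,400 using Guesslang. Ultimately, we collected 1,432,992 scripts through Jupytext. The distribution of programming languages among these scripts is presented in Table 3. We evaluated language coverage by randomly selecting 100 files from the converted scripts, ensuring that all programming languages were represented in this sample.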

Language	Num files	Percentage
python	1,392,432	97.170
julia	16,730	1.167
r	11,034	0.77
scala	1,899	0.133
bash	1,441	0.101
java	1,319	0.092
q-sharp	1,273	0.089
cpp	1,081	0.075
c-sharp	1,048	0.073
matlab	908	0.063
powershell	769	0.054
javascript	592	0.041
haskell	535	0.037
scheme	484	0.034
groovy	432	0.03
f-sharp	385	0.027
ocaml	279	0.019
rust	134	0.009
clojure	96	0.007
typescript	72	0.005
maxima	31	0.002
coconut	6	0
markdown	5	0
wolfram language	4	0
tcl	3	0
Total	1,432,992	100

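Table 3: Overview of the initially collected Jupyter scripts, with the number of files and the percentage.

Jupyter – structured

To create this dataset, we first filtered out notebooks that did not contain any Python code or Markdown text. The programming language information in each notebook's metadata was used as the criterion for filtering out non-Python notebooks. Only notebooks explicitly marked as "Python" in the metadata were kept. Then, for each notebook, consecutive Markdown blocks or code blocks were merged into a single large Markdown or code block, respectively. Eventually, we ended up with consecutive code-text pairs in temporal order, grouped by notebook. In general, each Jupyter code-text pair contained the Markdown text immediately preceding the code block and the Python code, forming a natural instruction pair. We also included the formatted output of a code block if the output cell was non-empty; otherwise, it was marked by a special <empty_output> token. If consecutive code blocks had multiple output cells before merging, we only retained the output of the last code block. After these preprocessing steps, we ended up with 1,045,605 structured Jupyter notebooks.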

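3.3 GitHub issues

We used natural language conversations from GitHub issues and pull requests, which were collected as a component of The Stack v1.2. Each conversation consists of a series of events with actions, such as opening the issue, creating a comment, or closing the issue. Each event includes the author's username, a message, an action, and a creation date. We filtered this data as follows: (1) First, we removed auto-generated text from when users replied to issues via email. See Appendix A for the regular expression we used. We also deleted issues with a short message (less than 200 characters) and truncated long comments in the middle to a maximum of 100 lines while retaining the last 20 lines. This removed 18% of the volume. (2) Next, we excluded comments from bots. To do so, we searched for bot keywords in the username of the comment's author (for more information, see Appendix A). This step eliminated 17% of the total events and resulted in 14.7% of the issues being emptied. We observed that bot-generated issues tend to be lengthy and contain numerous logs and links. (3) We used the number of users engaged in the conversation as an indicator of quality. Our criterion was to include conversations that have two or more users. However, we also preserved conversations that involved a single user if the total text within comments was less than 7,000 characters (96th percentile). Additionally, we excluded issues authored by a single user if they contained more than ten events, as they tended to be of poor quality or originate from overlooked bots. By implementing these filters, we removed an additional 14% of issues. (4) Finally, we used a model from the fasttext library (the lid.176.bin version of this language identification model: https://fasttext.cc/docs/en/language-identification.html) to filter out non-English issues. This step was necessary to enable accurate redaction of names using a PII detection model (see Section 4.3). Lastly, we would like to point out that we anonymized the usernames in the conversations by replacing them with a participant counter within the conversation. See Sections 4.3 and 5.1 for more details.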

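3.4 Git commits

The Git commit data was gathered from BigQuery (https://cloud.google.com/bigquery/public-data/) and includes only single-file commits of repositories with the same licenses and file extensions as used in The Stack (Kocetkov et al., 2022). We removed all repositories from users that opted out of The Stack. The raw dataset is around 4 TB in size. We sampled 50% of the files and filtered the remaining data with heuristics to build a high-quality dataset. We list and describe all filters in Table 4.

The number of line changes in a commit can be very low compared to the file size. To avoid spending too much compute budget on learning to copy the file content, we only used the full file 20% of the time, and for the remaining 80%, we sampled a window between 0 and 32 lines around the first and last changed line. The resulting dataset contains 64 GB of commit data.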
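The context-window sampling described above can be sketched as follows. The text does not say how the window size is drawn, so this sketch assumes two independent uniform draws (one before the first changed line, one after the last); the function name and its arguments are hypothetical.

```python
import random


def select_commit_context(lines, first_changed, last_changed, rng,
                          full_file_prob=0.2, max_window=32):
    """Return the part of `lines` kept as context for one commit sample.

    `first_changed` and `last_changed` are 0-based, inclusive line indices.
    """
    if rng.random() < full_file_prob:
        return list(lines)  # 20% of the time: keep the whole file
    # Assumption: window sizes are uniform on [0, max_window] and independent.
    before = rng.randint(0, max_window)
    after = rng.randint(0, max_window)
    start = max(0, first_changed - before)
    end = min(len(lines), last_changed + 1 + after)
    return lines[start:end]


rng = random.Random(0)
file_lines = [f"line {i}" for i in range(200)]
context = select_commit_context(file_lines, 90, 95, rng)
print(len(context))
```

Whatever window is drawn, the changed region itself is always retained, so the model still sees the full edit while the amount of unchanged text it has to copy stays bounded.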

Description	Details
Maximum characters	Remove code files with $>$ 100k characters.
Small changes	Subsample changes with $\leq$ 2 lines with 50% probability.
Long-range refactorings	Subsample changes spanning $\geq$ 200 lines with 10% probability.
Empty commit message	Remove commits with empty commit subject.
Automatic commits	Remove commits that either contain or are equal to a list of stop words.
Hash messages	Remove commits with whitespace-separated words-to-character ratio $>$ 20.
Data files	Subsample data formats (JSON, YAML, XML, HTML) with 50% probability.

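Table 4: Git commit filters.

3.5 Deduplication

We followed the deduplication pipeline from Ben Allal et al. (2023), which consists of calculating the MinHashes (Broder, 2000) of all source code files, followed by Locality-Sensitive Hashing (LSH) to map similar code files to the same bucket. We used 5-grams and a Jaccard similarity of 0.7. More details on the pipeline are given in an accompanying blog post.

We applied this near-deduplication process to all programming languages and the Jupyter notebooks. However, due to time constraints, we could not apply this procedure to Git commits. Additionally, we deemed it unlikely to discover duplicates in GitHub issues, so we did not apply the process to them.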
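To make the MinHash and LSH steps concrete, the sketch below implements both in plain Python. It is not the BigCode pipeline: the shingles are assumed to be 5-grams of whitespace-separated tokens, the 128 hash functions are simulated with salted BLAKE2b, and the split into 16 bands of 8 rows is an illustrative choice. By the usual LSH approximation $(1/b)^{1/r}$, that banding places the collision threshold near 0.7.

```python
import hashlib


def shingles(text, n=5):
    """Set of n-grams over whitespace-separated tokens (assumed tokenization)."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def minhash_signature(shingle_set, num_perm=128):
    """One minimum per salted hash function; equal sets give equal signatures."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for s in shingle_set))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signature, bands=16):
    """Bucket keys; two files are duplicate candidates if they share any key."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


doc_a = " ".join(f"t{i}" for i in range(30))
doc_b = " ".join([f"t{i}" for i in range(29)] + ["changed"])
sig_a, sig_b = minhash_signature(shingles(doc_a)), minhash_signature(shingles(doc_b))
print(jaccard(shingles(doc_a), shingles(doc_b)), estimated_jaccard(sig_a, sig_b))
print(bool(lsh_buckets(sig_a) & lsh_buckets(sig_b)))
```

Changing one token out of thirty alters a single 5-gram, so the exact Jaccard similarity of the two example documents is 25/27, well above the 0.7 threshold; the MinHash estimate approximates this value without comparing the shingle sets directly.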

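3.6 Weighting of data sources

There were several discussions within the community about whether to up-sample or down-sample certain programming languages, as the amount of compute budget allocated to a data source in a given language can significantly affect the model's performance in that language. However, we realized that the largest amount of available data comes from popular programming languages and would, therefore, benefit a larger group of end-users. Moreover, after the deduplication process, we found that several high-resource programming languages, such as C, C++, C#, Java, JavaScript, Python, and PHP, had a similar amount of data, ranging from 44 to 87 GB. This further reinforced our belief that we did not need to drastically re-weight the existing data distribution. Thus, in this work, we followed the natural distribution of data during training and sampled data sources proportionally to their volume. We did, however, make an exception for JSON, YAML, and CSS, as we only want the LLM to learn the data format without wasting compute resources on memorizing the data in such files. For that reason, we re-weighted the volume of these data sources to 1 GB for JSON and YAML and 3 GB for CSS.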

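4 PII redaction

This section outlines our efforts to remove Personally Identifiable Information (PII) from the training data. In Section 4.1, we first describe how we collected a large set of PII annotations. We used these annotations to explore various techniques to train a PII detection model in Section 4.3, building on top of the encoder model we developed in Section 4.2.

4.1 Data collection

We used the Toloka platform (https://toloka.ai/) to engage 1,399 crowd-workers from 35 countries in annotating a dataset for PII in source code. On average, participants completed 206 tasks, earned about $27, and worked 3.1 hours. Our goal was to identify PII in various forms, such as names, usernames, emails, IP addresses, keys, passwords, and IDs. To ensure that crowd-workers received fair compensation, we established an hourly pay rate of $7.30, taking into consideration different minimum wage rates across countries and their corresponding purchasing power. We limited annotation eligibility to countries where the hourly pay rate of $7.30 was equivalent to the highest minimum wage in the US ($16.50) in terms of purchasing power parity. A complete list of the countries that participated in the annotation can be found in Table B.1 of Appendix B. Crowd-workers on Toloka can do tasks whenever and wherever they want; there is no obligation to complete a certain task or to spend a fixed amount of time on it. Thus, they exercise free choice when working on the tasks. Out of the 1,399 crowd-workers, 695 filled out a survey on task quality, and 519 completed the survey. The average score for the question asking whether the participant would like to contribute to another project like this was 4.92 on a scale of 1 to 5.

The dataset comprises 12,000 files, each containing approximately 50 lines of code written in 31 programming languages. Figure 1 shows the distribution of programming languages in the dataset. To increase the representation of rare PII types, such as keys and IP addresses, 7,100 files were pre-filtered from a larger sample. We used the detect-secrets tool (https://github.com/Yelp/detect-secrets) with all default plugins activated, along with the regular expressions by Ben Allal et al. (2023) for detecting emails, IPv4 and IPv6 addresses. To prevent biasing the annotation too much towards these detection tools, the remaining 5,100 files were randomly selected from the dataset without pre-filtering.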

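Figure 1: Distribution of programming languages in the annotated PII dataset.

During annotation, we differentiated between various types of PII based on the specific context in which they appeared. Specifically, we distinguished whether the PII was present in the code's license header, was used as a placeholder, or constituted confidential data. This categorization was necessary because the PII in license headers is usually provided voluntarily by authors for code attribution and may not require masking. Similarly, placeholders are not real secrets and do not need to be masked. We applied this categorization to names, emails, and usernames. See Table 5 for an overview of all PII entities.

The annotators detected a total of 22,950 PII entities in the dataset. To evaluate the quality of the dataset, we manually inspected 300 files that contained various PII types and calculated recall and precision for each type, as shown in Table 5. We found that annotating secret IDs was particularly challenging, as the annotators tended to produce many false positives and negatives. As a result, we decided to exclude this category from the PII detection model training.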

PII type	Count	Recall	Precision
IP_ADDRESS	2526	85%	97%
KEY	308	91%	78%
PASSWORD	598	91%	86%
ID	1702	53%	51%
EMAIL	5470	99%	97%
EMAIL_EXAMPLE	1407
EMAIL_LICENSE	3141
NAME	2477	89%	94%
NAME_EXAMPLE	318
NAME_LICENSE	3105
USERNAME	780	74%	86%
USERNAME_EXAMPLE	328
USERNAME_LICENSE	503
AMBIGUOUS	287

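Table 5: Overview of the PII types and the number of collected annotations. We investigate the annotation quality by reporting the precision and recall of a manual inspection on 300 files. Each subcategory was mapped back to its corresponding PII type for the inspection.

4.2 StarEncoder

As part of our PII detection efforts, we trained an encoder-only model (i.e., a bi-directionally self-attentive Transformer) that can be efficiently fine-tuned for both code- and text-related tasks. We used the Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) objectives from BERT (Devlin et al., 2019; Liu et al., 2019) and predicted masked-out tokens from an input sentence and whether a pair of sentences occur as neighbors in a document.

We separate code snippets in the input as follows: [CLS] Snippet-1 [SEP] Snippet-2, where the two code snippets are selected randomly, either from the same source file or from two distinct documents. For the MLM loss, we mask tokens in the input independently with a probability of 15%. For the NSP loss, we use a linear classifier applied to the representation output at the [CLS] token. We train for 100,000 steps with a global batch size of 4,096 sequences of a maximum length of 1,024 so that approximately 400B tokens are observed. This takes roughly two days using 64 NVIDIA A100 GPUs. Details about the model architecture are reported in Table 6.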

Hyperparameter	Value
Hidden size	768
Intermediate size	3072
Max. position embeddings	1024
Num. of attention heads	12
Num. of hidden layers	12
Attention	Multi-head
Num. of parameters	$\approx$ 125M

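Table 6: Model architecture of StarEncoder.

4.3 PII detection model

We fine-tuned StarEncoder on the annotated PII dataset for the Named Entity Recognition (NER) task. We added a linear layer as a token classification head on top of the model, with 6 target classes: names, emails, keys, passwords, IP addresses, and usernames. We excluded IDs due to low annotation quality and did not differentiate between the categorization of PII entities (license headers, placeholders) because of the model's poor performance in distinguishing them. We split the dataset into a training set of 7,878 examples and a test set of 4,000 examples, ensuring that both splits have a balanced representation of the different PII types. See Table 7. We make the training and evaluation splits available under gated access at https://hf.co/BigCode.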

Entity type	Train	Test
EMAIL	4721	1742
NAME	3847	1298
IP_ADDRESS	1941	521
USERNAME	1320	346
PASSWORD	390	148
KEY	171	118

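Table 7: Train-test split of the annotated PII dataset.

Fine-tuning baseline

We fine-tuned StarEncoder on the PII training set, and 400 annotated files from Ben Allal et al. (2023). We achieved F1 scores of more than 90% on names, emails, and IP addresses and 73.39% on passwords. The model's performance was comparatively low on keys and usernames, with F1 scores of only 56.66% and 59.39%, respectively. We attribute the low performance on keys to the limited number of labels for this type of PII, as only 308 instances were available. For usernames, we observed that the model often confused them with decorators and values in paths. This is most likely because we annotated usernames inside links for social media platforms.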

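Pseudo-labels

To improve the detection of key and password entities, we employed a pseudo-labeling technique as described by Lee (2013). This method involves training a model on a small set of labeled data and subsequently generating predictions for a larger set of unlabeled data. Specifically, we annotated 18,000 files using an ensemble of two encoder models, which were fine-tuned on the 400-file PII dataset from Ben Allal et al. (2023). To identify reliable pseudo-labels, we calculated the average probability logits from our models and applied filtering criteria. Specifically, we set a minimum threshold of 0.5 for all entities, except for names and usernames, for which we used a higher threshold of 0.6. However, upon reviewing the results, we found a significant number of false positives for keys and passwords. As a result, we decided to only retain entities that were preceded by a trigger word, such as key, auth, or pwd, within the preceding 100 characters. Training on this synthetic dataset before fine-tuning on the annotated one yielded superior results for all PII categories, as demonstrated in Tables 8 and 9. Only the performance for detecting usernames did not show significant improvement, so we decided to exclude it from the PII redaction process.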

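Comparison against regex baseline

We compared our PII detection models against the regular expressions (regexes) employed in Ben Allal et al. (2023). The regexes only support the detection of emails, IP addresses, and keys. Note that we enhanced the email regex, as explained in the Appendix, to address false positives we found during the evaluation on this benchmark. This modification boosted the F1 score of the regex from 81.8% to 96.83%. Nevertheless, our PII detection models still surpassed the regex approach in detecting all three entities, as shown in Table 8. We note that the performance difference was especially large on keys and found that the detect-secrets tool generated many false positives, especially in specific programming languages like Go and C-sharp that were not part of the regex evaluations in Ben Allal et al. (2023). Consequently, the overall precision of the tool was below 4%.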

Method	Email address			IP address			Key
Method	Prec.	Recall	F1	Prec.	Recall	F1	Prec.	Recall	F1
Regex	96.20%	97.47%	96.83%	71.29%	87.71%	78.65%	3.62%	49.15%	6.74%
NER	94.01%	98.10%	96.01%	88.95%	94.43%	91.61%	60.37%	53.38%	56.66%
+ pseudo labels	97.73%	98.94%	98.15%	90.10%	93.86%	91.94%	62.38%	80.81%	70.41%

Table 8: Comparing PII detection performance: regular expressions, NER pipeline with annotated data, and NER pipeline with annotated data + pseudo-labels.

Method	Name			Username			Password
Method	Prec.	Recall	F1	Prec.	Recall	F1	Prec.	Recall	F1
NER	83.66%	95.52%	89.19%	48.93%	75.55%	59.39%	59.16%	96.62%	73.39%
+ pseudo labels	86.45%	97.38%	91.59%	52.20%	74.81%	61.49%	70.94%	95.96%	81.57%

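Table 9: Comparison of PII detection performance: NER pipeline with annotated data vs. annotated data + pseudo-labels.

Post-processing

Before applying the best PII detection model to the full dataset, we observed a couple of frequent detection errors. We added the following post-processing techniques to reduce the number of false positives:

• Ignore secrets with fewer than 4 characters.

• Detect full names only, by requiring at least one space within the name.

• Ignore detected keys with fewer than 9 characters or that are not gibberish, using a gibberish-detector (https://github.com/domanchi/gibberish-detector).

• Ignore IP addresses that are not valid or are private (non-internet-facing), using the ipaddress Python package. We also ignore IP addresses from popular DNS servers. We use the same list as in Ben Allal et al. (2023).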
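The post-processing rules above can be sketched with the standard-library `ipaddress` module. Several details here are assumptions rather than facts from the text: the 4-character rule is applied to keys and passwords only, the gibberish check for keys is left out because it depends on an external tool, and `POPULAR_DNS` is a hypothetical stand-in for the list reused from Ben Allal et al. (2023).

```python
import ipaddress

# Hypothetical stand-in; the actual list comes from Ben Allal et al. (2023).
POPULAR_DNS = {"8.8.8.8", "8.8.4.4", "1.1.1.1", "1.0.0.1"}


def is_redactable_ip(candidate: str) -> bool:
    """True only for syntactically valid, public, non-DNS IP addresses."""
    try:
        ip = ipaddress.ip_address(candidate)
    except ValueError:
        return False  # not a valid IPv4 or IPv6 address
    if ip.is_private:
        return False  # private, non-internet-facing ranges are ignored
    return str(ip) not in POPULAR_DNS


def keep_detection(entity_type: str, value: str) -> bool:
    """Decide whether a model detection survives post-processing."""
    if entity_type in {"KEY", "PASSWORD"} and len(value) < 4:
        return False  # assumed reading of "secrets with fewer than 4 characters"
    if entity_type == "NAME":
        return " " in value.strip()  # only full names, with at least one space
    if entity_type == "IP_ADDRESS":
        return is_redactable_ip(value)
    return True


print(keep_detection("NAME", "Alice"))
print(keep_detection("NAME", "Alice Smith"))
print(keep_detection("IP_ADDRESS", "192.168.0.1"))
```

Each rule trades a little recall for precision: a single first name or a private address is unlikely to identify anyone, so leaving it in place avoids needless edits to the code.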

PII placeholders

We replaced the detected PII entities with the following tokens:

<NAME>, <EMAIL>, <KEY>, <PASSWORD>

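To mask IP addresses, we randomly selected an IP address from 5 synthetic, private, non-internet-facing IP addresses of the same type, which can be found in Appendix C.

GitHub issues

We had already employed a regex approach to detect keys, IP addresses, and emails in the GitHub issues, so we only used the PII detection model to redact names. We anonymized the usernames of the authors by replacing them with a participant counter within the conversation, e.g. username_1 refers to the second participant (see Section 5.1 for formatting details). We prepend these pseudonyms to the beginning of each comment so that the speaker identity of the author is preserved. In addition, we redact all mentions of these usernames in the messages. Note that we only mask the usernames of active participants in the conversation, and mentions of non-participating users are not anonymized.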
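The username anonymization described above can be sketched as a two-pass procedure: number the participants in order of first appearance, then rewrite every comment. The input format (a list of author and text pairs), the word-boundary rule used to find mentions, and the function name are assumptions made for illustration.

```python
import re


def anonymize_conversation(comments):
    """Replace participant usernames with counters such as username_0.

    `comments` is a list of (author, text) pairs in chronological order.
    """
    pseudonyms = {}
    for author, _ in comments:
        # Participants are numbered in order of first appearance.
        pseudonyms.setdefault(author, f"username_{len(pseudonyms)}")
    if not pseudonyms:
        return []
    # Longest names first so that a name is not shadowed by a shorter prefix.
    names = sorted(pseudonyms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(?:" + "|".join(map(re.escape, names)) + r")(?![\w-])")
    anonymized = []
    for author, text in comments:
        masked = pattern.sub(lambda m: pseudonyms[m.group(0)], text)
        anonymized.append(f"{pseudonyms[author]}: {masked}")
    return anonymized


thread = [
    ("alice", "The build fails on main."),
    ("bob", "@alice can you share the log? cc @carol"),
    ("alice", "Sure, thanks bob."),
]
for line in anonymize_conversation(thread):
    print(line)
```

In this example `carol` never comments, so her mention is left untouched, which mirrors the statement that only active participants are masked.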

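Compute resources

We used the PII detection model to identify PII across all programming languages in the training dataset, including GitHub issues (names only), Git commits, and Jupyter notebooks. The total dataset amounts to 815 GB in size. We ran inference on multiple NVIDIA A100 80 GB GPUs, which required 800 GPU-hours.

5 Model training

This section presents information on the training process of the StarCoder models. Before we proceed, we first clarify the differences between the two models:

StarCoderBase: the first model, trained on 1 trillion tokens sourced from the curated dataset described in Section 3.
StarCoder: the fine-tuned version of StarCoderBase, trained on another 35B Python tokens (roughly 2 epochs).

In the following, we show how we formatted the training data (Section 5.1), decontaminated the training data (Section 5.2), and provide details regarding the tokenizer (Section 5.3), the model architecture (Section 5.4), the training process (Section 5.5), the multi-node GPU setup (Section 5.6), and the CO2 emissions (Section 5.7).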

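5.1 Data formatting

We present the formatting guidelines for each of the data sources below. In the templates, <token> refers to a sentinel token, while metadata and data refer to placeholders for data fields.

Code

We prepend the repository name, file name, and the number of stars to the context of the code file. To not overfit on the exact number of stars, we categorized GitHub stars into five buckets: 0, 1-10, 10-100, 100-1000, 1000+. To enable the model to operate without this metadata during inference, we prefixed the repository name, file name, and stars independently at random, each with a probability of $0.2$ .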

<reponame>reponame<filename>filename<gh_stars>stars\ncode<|endoftext|>
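The template above can be produced by a small helper. This is a sketch under stated assumptions: the bucket boundaries overlap in the text (10 appears in both 1-10 and 10-100), so boundary values are assigned to the lower bucket here, and the newline separator is emitted only when at least one metadata field is present. The function names are hypothetical.

```python
import random


def star_bucket(stars: int) -> str:
    """Map a raw star count to one of the five buckets."""
    if stars == 0:
        return "0"
    if stars <= 10:
        return "1-10"
    if stars <= 100:
        return "10-100"
    if stars <= 1000:
        return "100-1000"
    return "1000+"


def format_code_sample(code, reponame, filename, stars, rng, p=0.2):
    """Prefix each metadata field independently with probability `p`."""
    prefix = ""
    if rng.random() < p:
        prefix += f"<reponame>{reponame}"
    if rng.random() < p:
        prefix += f"<filename>{filename}"
    if rng.random() < p:
        prefix += f"<gh_stars>{star_bucket(stars)}"
    if prefix:
        prefix += "\n"  # assumed: separator only when metadata is present
    return f"{prefix}{code}<|endoftext|>"


# With p=1.0 every field is included, which reproduces the full template.
print(format_code_sample("print(1)", "octo/repo", "main.py", 42,
                         random.Random(0), p=1.0))
```

Because each field is dropped independently, most training samples carry no metadata at all, which is what allows plain code prompts at inference time.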

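To the source code in this template (i.e., code), we apply the fill-in-the-middle transformation (FIM; Bavarian et al., 2022). More precisely, we apply FIM at the character level to the source code files with a FIM rate of 0.5, and use PSM mode with probability 0.5 and SPMv2 mode with probability 0.5.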
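A character-level FIM transformation with these rates can be sketched as follows. The token layouts are assumptions based on the FIM description in Bavarian et al. (2022) and on the sentinel tokens listed in Table 10: PSM orders the pieces prefix, suffix, middle, and the SPM variant places the suffix first and lets the prefix run directly into the middle. The uniform choice of the two split points is also an assumption.

```python
import random


def to_psm(prefix: str, middle: str, suffix: str) -> str:
    """Prefix-Suffix-Middle layout."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"


def to_spm(prefix: str, middle: str, suffix: str) -> str:
    """Suffix-Prefix-Middle layout (assumed form of the SPMv2 variant)."""
    return f"<fim_prefix><fim_suffix>{suffix}<fim_middle>{prefix}{middle}"


def apply_fim(code: str, rng, fim_rate: float = 0.5, spm_rate: float = 0.5) -> str:
    """Leave the file unchanged with probability 1 - fim_rate, else apply FIM."""
    if rng.random() >= fim_rate:
        return code
    # Two character offsets, drawn uniformly, split the file into three spans.
    low, high = sorted(rng.randint(0, len(code)) for _ in range(2))
    prefix, middle, suffix = code[:low], code[low:high], code[high:]
    if rng.random() < spm_rate:
        return to_spm(prefix, middle, suffix)
    return to_psm(prefix, middle, suffix)


print(to_psm("def f(x):\n    ", "return x", "\n"))
print(to_spm("def f(x):\n    ", "return x", "\n"))
```

In both layouts the middle span comes last, so an ordinary left-to-right model learns to generate the missing span after having seen the text on both sides of it.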

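Issues

We use sentinel tokens to mark the opening of an issue and subsequently include its title. We separate the sequence of comments by an <issue_comment> token and include an anonymized speaker identifier before the comment. Specifically, we refer to authors by their participant counter within the conversation, e.g. username_1 refers to the second participant in the issue. To distinguish between the different turns, we use comment1 and id1 to refer to the second comment and its anonymized speaker ID, respectively.

<issue_start>Title: title\nusername_id0:comment0<issue_comment>username_id1:comment1 ... <issue_closed (optional)><|endoftext|>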

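Jupyter – scripts

Jupyter scripts were formatted in the same manner as code.

Jupyter – structured

Parsed Jupyter notebooks come in chains of text, code, and outputs, and we separated them with sentinel tokens. Note that we use text2, code2, output2 to refer to the third triplet in the notebook.

<jupyter_start><jupyter_text>text0<jupyter_code>code0<jupyter_output>output0<jupyter_text> ... <|endoftext|>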

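Git commits

We separate the code before the commit, the commit message, and the code after the commit with sentinel tokens. As explained in Section 3.4, we use the full file with 20% probability and otherwise use a small window (0-32 lines) around the changed lines.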

<commit_before>code_before<commit_msg>message<commit_after>code_after<|endoftext|>

We summarize all sentinel tokens in Table 10.

Token	Description
`<|endoftext|>`	end of text/sequence
`<fim_prefix>`	FIM prefix
`<fim_middle>`	FIM middle
`<fim_suffix>`	FIM suffix
`<fim_pad>`	FIM pad
`<reponame>`	repository name
`<filename>`	file name
`<gh_stars>`	GitHub stars
`<issue_start>`	start of GitHub issue
`<issue_comment>`	start of GitHub issue comment
`<issue_closed>`	GitHub issue closed event
`<jupyter_start>`	start of Jupyter notebook
`<jupyter_text>`	start of Jupyter text cell
`<jupyter_code>`	start of Jupyter code cell
`<jupyter_output>`	start of Jupyter output cell
`<empty_output>`	output cell without content
`<commit_before>`	code snippet before commit
`<commit_msg>`	commit message
`<commit_after>`	code snippet after commit

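Table 10: Overview of the sentinel tokens.

5.2 Training data decontamination

The code training data was decontaminated by removing files that contained docstrings or solutions from HumanEval and MBPP, docstrings from APPS, questions from GSM8K, or prompts from DS1000. (These benchmarks are further described in Section 6.) To give an indication of the amount of data removed by decontamination, Python is the language with the highest number of matches, with 558 files removed.

5.3 Tokenizer

The model's tokenizer follows our insights presented in Ben Allal et al. (2023) and uses those same design choices: we use the Hugging Face Tokenizers library (MOI et al., 2022) to train a byte-level Byte-Pair-Encoding with a vocabulary size of 49,152 tokens, including the sentinel tokens from Table 10. The pre-tokenization step includes a digit-splitter and the regex splitter from the GPT-2 pre-tokenizer.

5.4 Model architecture

We trained a 15.5B parameter model with the same architecture as SantaCoder (Ben Allal et al., 2023). It is a decoder-only Transformer with Multi-Query Attention (MQA; Shazeer, 2019) and learned absolute positional embeddings. We also apply Fill-in-the-Middle (FIM; Bavarian et al., 2022) transformations to the training data, see Section 5.1. We used FlashAttention (Dao et al., 2022) to speed up the attention computation and reduce its memory footprint, allowing us to scale to an 8K context length. To make FlashAttention work with MQA during training, we simply expand the key and value before calling the attention kernel. The architecture hyper-parameters are given in Table 11. In addition, we have included the hyper-parameters of SantaCoder (Ben Allal et al., 2023) for comparison.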

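5.5 Training details

StarCoderBase

The model was trained for 250k iterations, with a batch size of 4M tokens, for a total of one trillion tokens. We used Adam (Kingma & Ba, 2015) with $\beta_{1}=0.9$ , $\beta_{2}=0.95$ , $\epsilon=10^{-8}$ and a weight decay of $0.1$ . The learning rate followed a cosine decay from $3\times 10^{-4}$ to $3\times 10^{-5}$ after a linear warmup of 2,000 iterations.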
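The learning-rate schedule described above can be written as a single function of the iteration number. This is a sketch, assuming the warmup starts from zero and the cosine decay spans all remaining iterations up to 250,000; the paper states only the endpoints and the warmup length, and the function name is illustrative.

```python
import math


def learning_rate(step, max_lr=3e-4, min_lr=3e-5,
                  warmup_steps=2_000, total_steps=250_000):
    """Linear warmup to `max_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps  # assumed: warmup starts at 0
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 2_000, 126_000, 250_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks at iteration 2,000 and reaches the midpoint between the two endpoints, $1.65\times 10^{-4}$, halfway through the decay at iteration 126,000.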

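StarCoder

Starting from StarCoderBase, we fine-tuned a Python variant of the model for 2 epochs on the Python subset of the training data. We used the same settings as StarCoderBase, except that we used a learning rate of $5\times 10^{-5}$ and decayed it to $5\times 10^{-6}$ after 1,000 iterations of linear warmup. We trained for 8,500 steps.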

Hyperparameter	SantaCoder	StarCoder
Hidden size	2048	6144
Intermediate size	8192	24576
Max. position embeddings	2048	8192
Num. of attention heads	16	48
Num. of hidden layers	24	40
Attention	Multi-query	Multi-query
Num. of parameters	$\approx$ 1.1B	$\approx$ 15.5B

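Table 11: Model architecture of StarCoder. We also include SantaCoder (prior work by the community).

5.6 Multi-node GPU setup

We trained our model on a GPU cluster with 512 A100 80 GB GPUs distributed across 64 nodes. We partitioned the model with a 3D-parallel layout that shards the model with both tensor and pipeline parallelism of rank 4, requiring 16 GPUs (two nodes) for one replica. To fully leverage the cluster's capabilities, we used 32-fold data parallelism. To optimize GPU utilization and reduce idle compute bubbles, we maintained a micro-batch size of 1 and accumulated for 16 steps, resulting in a global batch size of 512 (equivalent to 4M tokens). We used Megatron-LM's distributed optimizer because we found that it leads to slightly higher throughput in this configuration. Since it requires the gradient reduction step in FP32, training in BF16 leads to 10% lower throughput than FP16, but we used it anyway to avoid training instabilities.

Except for a few restarts, we did not experience significant training instabilities.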
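The parallelism and batch-size figures in this section follow from a few multiplications, reproduced below as a consistency check. Every input is a number stated in the paper (the 8,192-token sequence length is the maximum position embedding from Table 11); the variable names are illustrative.

```python
total_gpus = 512
tensor_parallel = 4
pipeline_parallel = 4
micro_batch_size = 1
gradient_accumulation_steps = 16
sequence_length = 8_192      # maximum position embeddings, Table 11
training_iterations = 250_000

gpus_per_replica = tensor_parallel * pipeline_parallel            # 16 GPUs
data_parallel = total_gpus // gpus_per_replica                    # 32 replicas
global_batch = micro_batch_size * gradient_accumulation_steps * data_parallel
tokens_per_batch = global_batch * sequence_length
total_tokens = tokens_per_batch * training_iterations

print(gpus_per_replica, data_parallel, global_batch)
print(tokens_per_batch, total_tokens)
```

The global batch of 512 sequences corresponds to 4,194,304 tokens, the "4M tokens" quoted above, and 250k iterations at that batch size give roughly $1.05\times 10^{12}$ tokens, in line with the stated one trillion.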

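5.7 CO2 emissions

StarCoderBase

We report the carbon footprint (Lacoste et al., 2019) of training StarCoderBase. Based on the total number of GPU hours that training took (320,256) and an average power usage of 280 W per GPU, this adds up to 89,671.68 kWh of electricity consumed during the training process. Multiplied by the carbon intensity of the energy of the us-west-2 AWS location (0.15495 kgCO2e per kWh) and the average Power Usage Effectiveness of 1.2 across AWS data centers, this results in 16.68 tonnes of CO2eq emitted.

StarCoder

The fine-tuned model adds 3.5% of training time, which translates to an additional estimated emission of 0.58 tonnes of CO2eq.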
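The emission estimates above can be reproduced step by step from the stated inputs. The sketch below uses only numbers given in this section; the variable names are illustrative.

```python
gpu_hours = 320_256
watts_per_gpu = 280
carbon_intensity = 0.15495          # kgCO2e per kWh, us-west-2
power_usage_effectiveness = 1.2
fine_tuning_fraction = 0.035        # extra training time for StarCoder

energy_kwh = gpu_hours * watts_per_gpu / 1000
base_tonnes = energy_kwh * carbon_intensity * power_usage_effectiveness / 1000
fine_tuning_tonnes = base_tonnes * fine_tuning_fraction

print(round(energy_kwh, 2))
print(round(base_tonnes, 2), round(fine_tuning_tonnes, 2))
```

This calculation gives about 16.67 tonnes for StarCoderBase, which agrees with the reported 16.68 tonnes up to rounding of the inputs, and about 0.58 tonnes for the fine-tuning stage.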

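6 Evaluation

In this section, we first outline the models we evaluated in addition to StarCoder and StarCoderBase. Then we report on the Python language performance of all models on the HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and DS-1000 (Lai et al., 2022) evaluation benchmarks. Then we cover multi-language evaluation using a variety of benchmarks and tasks.

A Code LM Evaluation Harness

To enable reproducible and centralized evaluation of StarCoder and other Code LLMs, we developed a Code LM Evaluation Harness (Ben Allal et al., 2022), inspired by the LM Evaluation Harness (Gao et al., 2021b). This harness provides a framework for the efficient evaluation of code models, using data parallelism and Docker containers for execution. It supports several benchmarks, including HumanEval, MultiPL-E, and DS-1000.

Other Models Evaluated

We compare StarCoder and StarCoderBase to the following models.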

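1. CodeGen-16B-Multi (Nijkamp et al., 2023) is an open-access, 16B parameter model that is trained on the Pile (Gao et al., 2021a), and then on additional code written in C, C++, Go, Java, JavaScript, and Python from the GitHub BigQuery dataset (Smith, 2016).

2. CodeGen-16B-Mono is a version of CodeGen-16B-Multi that is fine-tuned on additional Python code from GitHub, though the dataset is not publicly available.

3. CodeGeeX (Zheng et al., 2023) is an open-access 13B parameter model trained on 23 programming languages selected from the Pile, the CodeParrot dataset (Wolf et al., 2020), and additional data for Python, Java, and C++. CodeGeeX also includes its own multi-language benchmark suite, HumanEval-X, which we discuss below.

4. code-cushman-001 is a 12B parameter model by OpenAI and was the initial model for GitHub Copilot (Chen et al., 2021). The details of its training set are unknown. This model has been deprecated by OpenAI but is available from the Microsoft Azure OpenAI Service at the time of writing. (There had been a code-cushman-002, but it is not available at the time of writing.)

5. Finally, although they are not specifically trained for code generation, we include some results from the LLaMA (Touvron et al., 2023), PaLM (Chowdhery et al., 2022), and LaMDA (Thoppilan et al., 2022) papers. LLaMA's license prohibits commercial use, and PaLM and LaMDA are not publicly available.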

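6.1 StarCoder: Python Evaluation

In this section, we evaluate the performance of StarCoder on Python, comparing it to both open-access and closed-access models. We first report performance on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), which are two widely used benchmarks of Python performance. However, we also measure performance on DS-1000 (Lai et al., 2022), a code completion benchmark of 1,000 Python data science problems based on StackOverflow questions.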

Model	Size	HumanEval	MBPP
Open-access
LLaMA	7B	10.5	17.7
LLaMA	13B	15.8	22.0
SantaCoder	1.1B	18.0	35.0
CodeGen-Multi	16B	18.3	20.9
LLaMA	33B	21.7	30.2
CodeGeeX	13B	22.9	24.4
LLaMA-65B	65B	23.7	37.7
CodeGen-Mono	16B	29.3	35.3
StarCoderBase	15.5B	30.4	49.0
StarCoder	15.5B	33.6	52.7
Closed-access
LaMDA	137B	14.0	14.8
PaLM	540B	26.2	36.8
code-cushman-001	12B	33.5	45.9
code-davinci-002	175B	45.9	60.3

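Table 12: Comparing the performance (pass@1) of StarCoder on the HumanEval and MBPP Python benchmarks with several other models. StarCoder and StarCoderBase obtain the highest performance among open-access models, and comparable performance to the code-cushman-001 closed-access model.

6.1.1 The HumanEval and MBPP Benchmarks

HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) are widely used benchmarks for Code LLMs consisting of hundreds of Python programming problems that use test cases to validate the code produced by a Code LLM. Code LLMs generate code by sampling from their output distribution. We report performance using the pass@$k$ metric (Chen et al., 2021): the total fraction of benchmark problems solved, where a problem is considered solved if any one of $k$ code samples passes every test case. Like Chen et al. (2021), we use sampling temperature $0.2$ for pass@1, and temperature $0.8$ for $k>1$ . We generate $n=200$ samples for all experiments with open-access models. For API models, we use $n=20$ samples, which is enough to estimate pass@1. We focus on the simplest version of pass@$k$, which is pass@1: the likelihood that a problem is solved in a single attempt by the model.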
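For reference, pass@$k$ is usually computed with the unbiased estimator introduced by Chen et al. (2021): for a problem with $n$ generated samples of which $c$ pass all tests, the estimate is $1 - \binom{n-c}{k} / \binom{n}{k}$, averaged over problems. The sketch below implements that estimator with the standard library; the text above does not spell out the formula, so its use here is an assumption based on the cited paper, and the function names are illustrative.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (n samples, c of them correct)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def benchmark_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# With k=1 the estimator reduces to the fraction of correct samples, c / n.
print(pass_at_k(200, 50, 1))
print(benchmark_pass_at_k([0, 50, 200], n=200, k=1))
```

For pass@1 this is simply the mean success rate per sample, which is why a modest number of samples per problem (such as $n=20$) already yields a usable estimate.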

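Table 12 compares StarCoder (and StarCoderBase) on HumanEval and MBPP to several open-access and closed-access models:

1. StarCoder is the highest-performing open-access model on both benchmarks.

2. StarCoder outperforms the largest models, including PaLM, LaMDA, and LLaMA, despite being significantly smaller.

3. StarCoderBase is also very capable on Python and is competitive with CodeGen-16B-Mono, a similarly sized open-access model that was fine-tuned on Python.

4. StarCoder outperforms OpenAI's code-cushman-001 (12B) model.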

6.1.2 The DS-1000 Python Data Science Benchmarks

Format	Model	Matplotlib	NumPy	Pandas	PyTorch	SciPy	Scikit-Learn	TensorFlow	Overall
	Number of problems:	155	220	291	68	106	115	45	1,000
Completion	SantaCoder-1B	21.6	4.6	0.9	2.6	2.4	4.8	3.1	5.7
Completion	InCoder-6B	28.3	4.4	3.1	4.4	2.8	2.8	3.8	7.4
Completion	CodeGen-16B-Mono	31.7	10.9	3.4	7.0	9.0	10.8	15.2	11.7
Completion	code-cushman-001	40.7	21.8	7.9	12.4	11.3	18.0	12.2	18.1
Completion	StarCoderBase	47.0	27.1	10.1	19.5	21.7	27.0	20.5	23.8
Completion	StarCoder	51.7	29.7	11.4	21.4	20.2	29.5	24.5	26.0
Insertion	SantaCoder-1B	21.6*	13.8	2.0	3.8	5.7	6.9	14.8	9.3
Insertion	InCoder-6B	28.3*	4.6	2.9	4.4	2.8	3.1	7.8	7.5
Insertion	StarCoderBase	47.0*	26.3	10.9	16.6	20.2	30.2	22.3	24.0
Insertion	StarCoder	51.7*	30.8	10.3	21.0	20.2	27.4	20.0	25.4

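Table 13: Performance of open-access and closed-access models on DS-1000. All models were evaluated at temperature=0.2, top_p=0.5, max_length=1024. Scores reflect mean pass@1 accuracy averaged over 40 samples.
*: The Matplotlib tasks have no right-side context, so the insertion and completion formats are identical.

A major limitation of HumanEval and MBPP is that they are simple programming puzzles that are not representative of the code most programmers write. In contrast, the DS-1000 benchmark (Lai et al., 2022) has a suite of 1,000 realistic and practical data science workflows across seven libraries and evaluates generations in execution against test cases.

DS-1000 supports two evaluation modes: completion and insertion (via FIM). We report completion scores for all models but insertion scores only for the models that support it: the StarCoder models and InCoder-6B (Fried et al., 2022). DS-1000 also categorizes problems by the library used: Matplotlib, NumPy, Pandas, SciPy, Scikit-Learn, PyTorch, and TensorFlow. We report pass@1 for each library and an overall score in Table 13, and draw the following conclusions:

1. StarCoder substantially outperforms all other models on data science problems from the DS-1000 benchmark. Moreover, this holds across every kind of data science library.
2. StarCoderBase also outperforms every other model, but is slightly behind StarCoder on DS-1000.
3. We confirm the finding of Lai et al. (2022): model performance on the HumanEval and MBPP benchmarks does not always correlate with performance on the more realistic DS-1000 benchmarks. For example, CodeGen-Mono is slightly more capable than code-cushman-001 and the StarCoder models on HumanEval and MBPP, but significantly worse on DS-1000. This demonstrates the importance of evaluating models on a range of benchmarks.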

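6.1.3 The ODEX Open-Domain Coding Benchmark

Our previous evaluations focus either on closed domains (i.e., primarily built-in Python functions, as in MBPP and HumanEval) or on specific domains (e.g., data science, as in DS-1000). To evaluate model ability to generate code over a broader set of Python libraries, we use the ODEX benchmark (Wang et al., 2022), which contains 505 open-domain and 440 closed-domain Python coding queries in four natural languages (English, Spanish, Japanese, and Russian) with test-case-based execution evaluation.

We report the pass@1 metric for StarCoder and baseline models, including Codex (code-davinci-001), CodeGen-16B-Mono, and SantaCoder. In addition to overall execution accuracy, we categorize problems by language and domain: (1) queries in the closed domain (using only built-in Python functions) and in the open domain (using functions from imported libraries), and (2) queries with instructions written in English, Spanish, Japanese, and Russian, respectively. We report overall scores and scores by domain and language in Table 14, and draw the following conclusions:

1. StarCoder substantially outperforms all other models on open-domain coding queries from the ODEX benchmark.
2. StarCoderBase also outperforms every other model, and even outperforms StarCoder on the ODEX English subset, but lags slightly behind in the other languages.
3. Despite their higher overall execution accuracy, the StarCoder and StarCoderBase models generally show smaller gaps between open- and closed-domain queries than the other baseline models. This result suggests that the StarCoder models acquire more generalized skills for coding queries in the open domain (i.e., involving diverse Python libraries), whereas other models show larger performance drops when moving from the closed to the open domain.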

Model	English			Spanish			Japanese			Russian
Model	overall	open	closed	overall	open	closed	overall	open	closed	overall	open	closed
CodeGen-16B-Mono	33.7	25.2	43.1	30.0	25.0	43.1	37.8	26.6	62.8	46.8	30.4	60.1
code-cushman-001	31.9	24.4	40.2	31.9	27.7	36.7	25.7	21.2	35.5	40.0	26.0	51.6
code-davinci-001	33.6	26.9	41.0	36.9	31.7	42.9	31.0	23.7	47.3	43.2	28.9	55.1
SantaCoder	37.7	30.9	45.1	32.1	26.0	39.1	28.1	23.0	39.4	36.9	23.0	48.3
StarCoderBase	46.5	40.7	53.0	30.1	25.4	35.5	41.2	37.6	49.2	46.1	34.0	56.1
StarCoder	44.7	37.0	53.1	37.6	32.9	42.9	44.2	39.6	54.5	50.4	33.8	64.1

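Table 14: Performance on the ODEX benchmark by instruction language and code domain: open problems use libraries, while closed problems use only built-in Python functions.

6.2 StarCoder and StarCoderBase: Multi-Language Evaluation

In this section, we focus primarily on StarCoderBase and evaluate its performance on a variety of programming languages and programming tasks, including producing code from natural-language descriptions, documenting code, and predicting type annotations. This section also shows that StarCoder, despite being fine-tuned on Python, remains a very capable multi-language Code LLM and even outperforms StarCoderBase on some languages.

6.2.1 Evaluation on 19 Programming Languages with MultiPL-E

We evaluate the ability of StarCoder to turn natural language into working code in multiple programming languages using MultiPL-E (Cassano et al., 2023), which translates the HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) Python benchmarks into 18 other programming languages as follows.

MultiPL-E has a set of rule-based compilers that translate Python benchmarks into each target programming language. Each compiler expects a benchmark in the HumanEval format: (1) a natural-language description (in a docstring), (2) a function signature (name, arguments, and potentially types), and (3) a set of hidden assertions. The MultiPL-E compilers translate the function signature, assertions, and docstring (which may contain doctests) into the target language. Thus, MultiPL-E gives us a set of parallel benchmarks derived from HumanEval and MBPP for comparing model performance across programming languages. [Footnote 13: The MultiPL-E prompts differ slightly from the original HumanEval and MBPP prompts. For example, in HumanEval, some ad hoc examples in docstrings are reformatted as doctests so that they can be translated into examples in each target language. MultiPL-E also omits three HumanEval benchmarks that do not fit the format above. These changes have a small impact on pass rates.] The MultiPL-E languages include both high- and low-resource languages, statically and dynamically typed languages, and a variety of other programming language features.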

Language	CodeGen-16B-Multi	CodeGeeX	code-cushman-001	StarCoder	StarCoderBase
cpp	21.00	16.87	30.59	31.55	30.56
c-sharp	8.24	8.49	22.06	21.01	20.56
d	7.68	9.15	6.73	13.57	10.01
go	13.54	11.04	19.68	17.61	21.47
java	22.20	19.14	31.90	30.22	28.53
julia	0.00	0.29	1.54	23.02	21.09
javascript	19.15	16.92	31.27	30.79	31.70
lua	8.50	10.96	26.24	23.89	26.61
php	8.37	13.51	28.94	26.08	26.75
perl	3.42	8.09	19.29	17.34	16.32
python	19.26	21.62	30.71	33.57	30.35
r	6.45	3.92	10.99	15.50	10.18
ruby	0.00	3.34	28.63	1.24	17.25
racket	0.66	3.31	7.05	0.07	11.77
rust	4.21	7.88	25.22	21.84	24.46
scala	2.37	8.95	27.62	27.61	28.79
bash	0.61	2.75	11.74	10.46	11.02
swift	1.25	7.26	22.12	22.74	16.74
typescript	20.07	10.11	31.26	32.29	32.15

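Table 15: Comparison of StarCoder with multi-language open-access models (e.g., CodeGen-16B-Multi) and closed-access models (e.g., code-cushman-001) on 19 programming languages. We report pass@1 on HumanEval (Chen et al., 2021), which we translate from Python to the other languages using MultiPL-E (Cassano et al., 2023).

Table 15 shows how these models perform on 19 programming languages, from which we draw the following conclusions:

1. Across all 19 programming languages, StarCoderBase outperforms the other open-access models, sometimes by more than 2$\times$.
2. StarCoderBase is competitive with code-cushman-001 on most languages that we evaluate. There are a few exceptions. For example, code-cushman-001 outperforms StarCoderBase by more than 5% on C++, Java, Ruby, and Swift, and StarCoder outperforms code-cushman-001 by more than 5% on Julia.
3. Despite being fine-tuned on Python, StarCoder remains competitive on most languages and also outperforms the other open models. More surprisingly, StarCoder slightly outperforms StarCoderBase on some languages despite that fine-tuning. At this time, we can only speculate on why this is the case; further investigation of the open training data may help illuminate this finding.

We can draw several other conclusions from the table. For example, CodeGen-16B-Multi performs better than one might expect on some languages that are reportedly not in its training set, including C#, Lua, PHP, and TypeScript. Its performance on TypeScript is less surprising, since simple JavaScript functions often type-check with TypeScript by design. Similarly, StarCoder shows high performance on Swift, even though Swift was not included in its training set, as explained in Section 3.1.

6.2.2 The "Asleep at the Keyboard" Security Benchmark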

Format	Model	Valid ( $\uparrow$ )	Insecure ( $\downarrow$ )
Completion	StarCoderBase	855/1000 (85.50%)	340/855 (39.77%)
Insertion	StarCoderBase	987/1000 (98.70%)	354/987 (35.87%)
Completion	InCoder-6B	871/1000 (87.10%)	309/871 (35.48%)
Insertion	InCoder-6B	854/1000 (85.40%)	293/854 (34.31%)
Completion	CodeGen-16B-Multi	955/1000 (95.50%)	413/955 (43.25%)
Completion	code-cushman-001	964/1000 (96.40%)	408/964 (42.32%)

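Table 16: Performance on the Asleep at the Keyboard security benchmark (Pearce et al., 2022).

A limitation of Code LLMs is that they can generate code with security vulnerabilities (Pearce et al., 2022). The Asleep at the Keyboard benchmark by Pearce et al. (2022) has 89 security-sensitive scenarios across three evaluation axes: (1) Diversity of Weakness (DoW) covers 18 different vulnerability classes in MITRE's Common Weakness Enumeration (CWE) taxonomy, with scenarios drawn from the 2021 CWE Top 25 Most Dangerous Software Weaknesses list published by MITRE; (2) Diversity of Prompt (DoP) evaluates the model's sensitivity to variations in the prompt for a single vulnerability class (SQL injection); (3) Diversity of Domain (DoD) contains security scenarios in the hardware description language Verilog. We focus on DoW, which contains 54 scenarios (25 in C and 29 in Python) across 18 CWEs. We exclude scenarios that lack an automated test, leaving 40 scenarios (23 in C and 17 in Python).

Pearce et al. (2022) previously evaluated the security of GitHub Copilot (as of August 2021). In this paper, we use the same methodology to evaluate StarCoderBase, InCoder-6B, CodeGen-16B-Multi, and OpenAI's code-cushman-001. We use the original benchmarking methodology: generating 25 completions per scenario at temperature 0.2 (1,000 completions per model). The dataset supports fill-in-the-middle, so we include this configuration for the models that support it. The results are shown in Table 16: Valid gives the percentage of solutions that are syntactically valid (using py_compile for Python and gcc for C), and Insecure gives the percentage of valid solutions that include the vulnerability the scenario tests for. From this table, we draw the following conclusions.

1. StarCoderBase has the highest rate of valid code.
2. InCoder-6B has a slightly lower rate of insecure code generation, but this may be due to its lower rate of valid completions.
3. Among the models with more than 95% valid code, StarCoderBase has the lowest rate of insecure completions.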
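The two columns of Table 16 can be sketched as follows for the Python scenarios, under stated assumptions. Syntactic validity is checked with `py_compile`, as in the benchmark; the `is_insecure` callback is a hypothetical stand-in for the scenario-specific vulnerability test, which we do not reproduce here. Note that the Insecure rate is computed over valid completions only, which is why a model with few valid completions can appear safer.

```python
import os
import py_compile
import tempfile


def is_valid_python(source: str) -> bool:
    """Return True if the completion compiles without a syntax error."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "completion.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(source)
        try:
            # doraise=True turns compilation errors into PyCompileError.
            py_compile.compile(
                path, cfile=os.path.join(tmp, "completion.pyc"), doraise=True
            )
            return True
        except py_compile.PyCompileError:
            return False


def valid_and_insecure_rates(completions, is_insecure):
    """Return (valid / total, insecure / valid) for a list of completions.

    `is_insecure` is a hypothetical predicate supplied by the caller.
    """
    valid = [c for c in completions if is_valid_python(c)]
    insecure = [c for c in valid if is_insecure(c)]
    valid_rate = len(valid) / len(completions)
    insecure_rate = len(insecure) / len(valid) if valid else 0.0
    return valid_rate, insecure_rate
```

For example, the StarCoderBase completion row of Table 16 corresponds to 855/1000 valid completions and 340/855 insecure ones.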

Model	Java	JavaScript	Python
InCoder-6B	0.49	0.51	0.31
SantaCoder	0.62	0.60	0.44
StarCoder	0.73	0.74	0.62

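Table 17: Performance on single-line fill-in-the-middle on the FIM benchmark by Ben Allal et al. (2023).

6.2.3 Fill in the Middle Benchmarks

The StarCoder models support fill in the middle (FIM), or infilling, which allows the model to generate code conditioned on the prefix and suffix code surrounding the insertion point. Only a handful of recent models support FIM: models from OpenAI (Bavarian et al., 2022), InCoder (Fried et al., 2022), and our prior work on SantaCoder (Ben Allal et al., 2023). FIM opens up the possibility of a variety of tasks that go beyond left-to-right code completion. We evaluate StarCoderBase on four established FIM benchmarks below.

Single-Line Infilling for Python, Java, and JavaScript

Fried et al. (2022) present a single-line fill-in-the-middle task for Python that masks one line of code from a HumanEval solution and scores the model's ability to complete the function. They turn every HumanEval solution into several fill-in-the-middle problems by masking, in turn, each non-blank, non-comment line of code in the solution body. Ben Allal et al. (2023) generalize this benchmark to also support Java and JavaScript, using model-generated solutions from MultiPL-E's translations. We compare the performance of StarCoderBase, SantaCoder, and InCoder on this task, evaluating with line exact match (Table 17). StarCoderBase significantly outperforms the two smaller models.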
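The construction of single-line infilling problems described above can be sketched as follows. This is a minimal illustration and not the benchmark's code: the sentinel strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` are assumptions about the tokenizer's special tokens and should be checked against the released tokenizer, and stripping surrounding whitespace before the exact-match comparison is likewise our assumption.

```python
def single_line_tasks(solution: str):
    """Turn one solution body into (prefix, target_line, suffix) triples,
    one per non-blank, non-comment line."""
    lines = solution.splitlines(keepends=True)
    tasks = []
    for index, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank and comment-only lines are never masked
        prefix = "".join(lines[:index])
        suffix = "".join(lines[index + 1:])
        tasks.append((prefix, line, suffix))
    return tasks


def fim_prompt(prefix: str, suffix: str) -> str:
    """Prefix-suffix-middle prompt; the sentinel token names are assumed."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


def line_exact_match(prediction: str, target: str) -> bool:
    """Exact match after trimming surrounding whitespace (an assumption)."""
    return prediction.strip() == target.strip()
```

Each triple yields one prompt via `fim_prompt(prefix, suffix)`, and the model's generated middle is scored against `target_line`.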

Python Return Type Prediction

Model	Non-None F1	All F1
InCoder-6B	59.1	46.8
SantaCoder	66.9	78.5
StarCoderBase	77.4	86.6
StarCoder	77.1	86.4

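Table 18: Accuracy of Python return type prediction, using the Fried et al. (2022) adaptation of the Pradel et al. (2020) benchmarks. We report both the overall F1 scores, which include trivial None-type predictions, and the F1 scores for non-None types.

Pradel et al. (2020) introduce methods and datasets for evaluating Python type annotations. Fried et al. (2022) adapt and filter one dataset from this work, consisting of Python functions from GitHub, and use it to evaluate infilling models on function return type prediction. We use this dataset to compare StarCoder, StarCoderBase, and SantaCoder to InCoder on function return type prediction. Our setup follows Fried et al. (2022): each model uses greedy generation to infill return types while conditioning on the imports, body, and signature of each function. Following Fried et al. (2022), we report exact match accuracy of the normalized annotations for all functions in the evaluation set and for only those functions with non-None annotations. We find that StarCoder and StarCoderBase outperform existing approaches at Python return type prediction (Table 18). However, we note that because the functions in this evaluation set were taken from GitHub repositories, they may overlap with the training data for the SantaCoder and StarCoder models.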

	Packages type check			Files with no errors			Trivial annotations
	$\checkmark$	Total	%	$\checkmark$	Total	%	$\checkmark$	Total	%
InCoder	30	128	23.4	571	760	75.1	56	117	47.9
StarCoderBase	49	128	38.3	593	760	78.0	135	299	45.2

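Table 19: TypeScript type prediction performance using the dataset and methodology of Yee & Guha (2023). We only evaluate JavaScript packages that have never been translated to TypeScript and compare StarCoder to InCoder, the best-performing model in Yee & Guha (2023). StarCoder outperforms InCoder in several ways.

TypeScript Type Prediction

Yee & Guha (2023) evaluate approaches to neural type prediction for TypeScript. However, instead of measuring accuracy, they argue that benchmarks should measure how many projects or files have no type errors with the predicted types. This approach makes it possible to evaluate type prediction for JavaScript programs that have never been translated to TypeScript, which reduces the likelihood of dataset contamination. We add StarCoderBase to their evaluation framework and compare it to InCoder, which performs best at type prediction in the original work. Table 19 shows that StarCoderBase outperforms InCoder: (1) it produces more packages that type check, (2) across all packages, it produces more files that type check, and (3) it produces a smaller proportion of trivial type annotations than InCoder.

Python Docstring Generation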

Model	BLEU
InCoder-6B	18.27
SantaCoder	19.74
StarCoderBase	21.38
StarCoder	21.99

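Table 20: Performance on the Python portion of the CodeXGLUE code summarization task, evaluating function docstring generation. Models are evaluated zero-shot using their infilling capability.

To evaluate models' ability to generate documentation for functions, we use the Python subset of the CodeXGLUE code summarization benchmark (Lu et al., 2021). This benchmark is constructed from the CodeSearchNet dataset (Husain et al., 2019) and contains functions from public GitHub repositories. Models infill the documentation string (docstring) for each function using greedy decoding, conditioned on the function signature and body. We follow the evaluation scheme of past work: docstrings are evaluated using smoothed 4-gram BLEU (Papineni et al., 2002) against the reference docstring from the original function, using only the first lines of the generated and reference docstrings (removing, e.g., descriptions of function arguments and return types that may appear in later lines). In Table 20, we see that StarCoder and StarCoderBase obtain higher performance than past work on docstring generation. However, we note that there may be overlap between this evaluation dataset and the data used to train the SantaCoder and StarCoder models.

6.3 Performance Improvement Through the Training Process

We evaluate the performance of StarCoderBase at several training checkpoints, one after every 200B tokens seen out of the total 1000B. Figure 2 (right) shows how performance (pass@1) changes during training for each programming language supported by MultiPL-E. The performance curves for several high-resource programming languages suggest that training longer is likely to improve their performance further.

However, some of the low-resource languages see limited improvement during training, or even a decline in pass@1. For example, R's pass@1 rate drops significantly between the 800B and 1000B (final) checkpoints. The dependence of pass@1 on data size (Figure 2, left) further supports the hypothesis that this is related to the amount of data available. The slope of the linear fit increases between the 800B and 1000B checkpoints while the intercept decreases, i.e., performance improves only for languages with large enough amounts of data ($\gtrsim 1$ GB).

We manually inspected the completions generated for R at several checkpoints to better understand model performance. One might hypothesize that some problems are harder than others, so that the model gains and loses the ability to solve them in R across the 600B, 800B, and 1000B checkpoints, but we find that this is not the case. Instead, we find significant variance in per-problem success rates for several problems (Table D.3). For these problems, the pass rate varies between checkpoints in what appears to be a completely uncorrelated manner. Moreover, manual inspection shows that the failures are caused by minor mistakes, e.g., not taking the absolute value when computing the GCD, not converting a string to a character array, or not checking edge cases.

6.4 Perplexity With Long Contexts

StarCoderBase was trained with an 8K token window, allowing it to condition on and generate long code files. To evaluate the ability of the model to benefit from this larger context, we compare its perplexity (Bahl et al., 1983) when using a full window size of 8K tokens versus a window size of 2K tokens (as used in many prior code models).

To ensure no overlap between the training data for StarCoderBase and the perplexity computation data, we downloaded 10 GNU General Public License (GPL) repositories from GitHub in each of the languages in Table 21. We compiled all files from the repositories into a single document for each language. We then divided these documents into 8K token chunks and computed perplexity on the last 1K tokens in each chunk in two conditions: (1) the model window only contains the final 2K tokens in the chunk (i.e., the 1K being predicted and the previous 1K), and (2) the model window contains all 8K tokens in the chunk (i.e., the 1K tokens being predicted and the previous 7K). [Footnote 14: We evaluate perplexity on the final 1K tokens in each 8K chunk so that both conditions have the same evaluation tokens, and to avoid overly penalizing the 2K condition, as tokens at the beginning of a window tend to have higher perplexity because there is less context available to predict them.] This evaluates the ability of the model to benefit from additional file- and repository-level context when predicting code. In Table 21, we report the average perplexity of the 1K token regions across all chunks. We see that StarCoderBase indeed benefits from the extra token conditioning afforded by its 8K context window, with substantially lower perplexities across all languages.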
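The windowing in this comparison can be sketched as follows. The code only illustrates how the two conditions share the same 1K evaluation tokens and how perplexity follows from per-token log-probabilities; the helper names are hypothetical, dropping a trailing partial chunk is our assumption, and the model call that produces the log-probabilities is deliberately left out.

```python
import math

CHUNK_TOKENS = 8192  # size of each document chunk
EVAL_TOKENS = 1024   # tokens scored at the end of each chunk


def split_chunks(tokens, chunk=CHUNK_TOKENS):
    """Yield consecutive full chunks; a trailing partial chunk is dropped
    (an assumption, the paper does not say how remainders are handled)."""
    for start in range(0, len(tokens) - chunk + 1, chunk):
        yield tokens[start:start + chunk]


def context_and_targets(chunk_tokens, window, eval_len=EVAL_TOKENS):
    """Restrict the model to the last `window` tokens of the chunk and
    return (context, targets); only `targets` are scored."""
    visible = chunk_tokens[-window:]
    return visible[:-eval_len], visible[-eval_len:]


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability of scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

With `window=2048` the context holds the 1K tokens preceding the evaluation region, and with `window=8192` it holds the preceding 7K tokens, while the targets are identical in both cases.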

Window Size	Language
Window Size	cpp	c-sharp	c	go	java	javascript	php	r	ruby	rust
2K tokens	2.01	1.90	1.71	1.35	1.65	1.98	1.73	1.72	2.16	1.84
8K tokens	1.79	1.66	1.61	1.21	1.54	1.68	1.43	1.48	2.02	1.65

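Table 21: Perplexity of StarCoderBase on evaluation regions (of size 1K tokens) when using a window size of 2K or 8K tokens across repositories from 10 languages. The larger window size substantially reduces perplexity, demonstrating a benefit of StarCoder's 8K token window.

7 Natural Language Evaluation

Although the StarCoder models are principally developed to be Code LLMs, they have also been trained on a significant amount of natural language text. Roughly 20% of the training tokens are natural language data: 7% GitHub issues, 10% Markdown, 2% Jupyter notebooks, and 4% HTML. In this section, we evaluate StarCoderBase on several natural language tasks: natural language reasoning and understanding tasks that might benefit from the combination of code and text training data, and natural language generation tasks that evaluate the tendency of the model to produce undesirable text outputs, e.g., in a documentation generation or interactive assistant setting.

7.1 Math Reasoning

Recent work has shown that Code LLMs can be effective arithmetic and symbolic reasoners by using a technique called Program-Aided Language models (PAL; Gao et al., 2022). With PAL, the LLM reads the reasoning problem and generates Python programs as the intermediate reasoning steps, which are then executed by the Python interpreter to produce the answer. In contrast, the Chain-of-Thought method (CoT; Wei et al., 2022) prompts the LLM to produce the reasoning steps in natural language before generating the answer.

We investigate the reasoning capabilities of StarCoderBase on GSM8K (Cobbe et al., 2021), a set of middle-school math word problems. We compare with the two CodeGen-16B models (Nijkamp et al., 2023) and the family of LLaMA models (Touvron et al., 2023). The results of our evaluation are presented in Table 22, where we provide both CoT and PAL results for StarCoderBase and LLaMA.

In line with previous results comparing PAL to CoT on Code LLMs (Gao et al., 2022), we find that StarCoderBase performs better with PAL (21.5%) than with CoT (8.4%). StarCoderBase substantially outperforms CodeGen-16B-Mono and CodeGen-16B-Multi, which achieve 13.1% and 8.6% with PAL, respectively. These differences carry over to the setting where majority voting is applied. The difference between CoT and PAL is much smaller for the LLaMA models, although we observe that CoT achieves slightly better results for the 7B and 13B LLaMA models. Interestingly, we find that StarCoderBase outperforms LLaMA-13B (17.8%) on this reasoning benchmark. However, it still lags behind LLaMA-33B (38.7%).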

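Model	Size	GSM8K CoT	+maj1@100	GSM8K PAL	+maj1@40
StarCoderBase	15.5B	8.4	—	21.5	31.2
CodeGen-Multi	16B	3.18	—	8.6	15.2
CodeGen-Mono	16B	2.6	—	13.1	22.4
LLaMA	7B	11.0	18.1	10.5	16.8
LLaMA	13B	17.8	29.3	16.9	28.5
LLaMA	33B	35.6	53.1	38.7	50.3
LLaMA	65B	50.9	69.7	—	—

Table 22: 8-shot accuracy on the GSM8K math-reasoning benchmark. Samples are generated with greedy decoding. maj1@$k$ denotes a majority vote over $k$ generations. For the majority vote, we instead generate samples using nucleus sampling with $p=0.95$ and temperature $0.7$, following Gao et al. (2022). We use "—" when a model was not evaluated on a given metric, or when the metric is not supported in the Language Model Evaluation Harness. The LLaMA CoT numbers are from Touvron et al. (2023).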
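The PAL procedure with majority voting (maj1@$k$) can be sketched as follows. This is an illustration under assumptions, not the evaluation harness: we assume each generated program defines a function named `solution()` that returns the final answer, and the generation step itself is omitted. Generated code is untrusted, so a real evaluation should execute it in a sandbox.

```python
from collections import Counter


def run_program(source: str):
    """Execute one generated program and return solution(), or None if the
    program fails to compile or raises while running."""
    namespace = {}
    try:
        exec(source, namespace)  # untrusted code: sandbox this in practice
        return namespace["solution"]()
    except Exception:
        return None


def majority_answer(programs):
    """Majority vote over the answers of the programs that ran successfully.

    Ties are broken in favour of the answer seen first; returns None when
    no program produced an answer."""
    answers = [a for a in map(run_program, programs) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

A problem counts as solved under maj1@$k$ when `majority_answer` over the $k$ sampled programs equals the reference answer.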

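7.2 World Knowledge and Reading Comprehension

MMLU (Hendrycks et al., 2020) is a massive multitask language understanding benchmark, covering multiple-choice questions in 57 knowledge domains, including the humanities, STEM, and social sciences. CoQA (Reddy et al., 2019) is a large-scale dataset for conversational question answering systems, measuring the model's ability to process a text passage and answer a series of interconnected questions. We compare StarCoderBase and StarCoder with CodeGen-16B-Multi (Nijkamp et al., 2023), GPT-NeoX (Black et al., 2022), LLaMA-7B, and LLaMA-13B (Touvron et al., 2023).

We present the 5-shot accuracy for MMLU in Table 23 and the zero-shot F1 scores for CoQA in Table 24. On MMLU, StarCoderBase outperforms CodeGen-16B-Multi significantly (34.2% vs. 27.8%), and even outperforms GPT-NeoX (32.9%) by a small margin. Nevertheless, both LLaMA models outperform StarCoderBase. On CoQA, StarCoderBase performs better than CodeGen-16B-Multi but is outperformed by LLaMA and GPT-NeoX.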

Model	Size	MMLU 5-shot acc, %
CodeGen-Multi	16B	27.8
GPT-NeoX	20B	32.9
StarCoder	15.5B	33.9
StarCoderBase	15.5B	34.2
LLaMA	7B	35.1
LLaMA	13B	46.9

Table 23: 5-shot accuracy on the MMLU language understanding benchmark.

Model	Size	CoQA zero-shot F1 score
CodeGen-Multi	16B	0.59
StarCoderBase	15.5B	0.67
StarCoder	15.5B	0.67
LLaMA	7B	0.71
LLaMA	13B	0.73
GPT-NeoX	20B	0.73

Table 24: Zero-shot accuracy on the CoQA question answering challenge.

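7.3 Measuring Harmful Generation

When generating open-ended text such as code documentation or technical dialogue, a Code LLM (similarly to text-only LLMs) might produce harmful outputs. We compare StarCoderBase to previous Code LLMs on benchmarks that measure social bias and toxicity in model-produced text. [Footnote 15: Code for the evaluations is available here: https://github.com/McGill-NLP/StarCoderSafetyEval]

7.3.1 Social Bias

Recent work has highlighted that LLMs often capture social biases and stereotypes from their pre-training corpora (Kurita et al., 2019; May et al., 2019; Hutchinson et al., 2020; Meade et al., 2023). To quantify social bias within our model, we use StereoSet (Nadeem et al., 2021).

StereoSet consists of a collection of fill-in-the-blank-style tests for measuring social biases within language models. [Footnote 16: We only evaluate against the intrasentence task in this work.] Each example in StereoSet consists of an incomplete sentence (e.g., our housekeeper is BLANK) alongside three possible completions. Of these completions, one is stereotypical (e.g., Mexican), another is anti-stereotypical (e.g., Italian), and a third is unrelated (e.g., computer). StereoSet defines three metrics: a stereotype score, a language modeling score, and an ICAT score. The stereotype score is the percentage of examples for which a model prefers the stereotypical completion for a sentence over the anti-stereotypical completion. The language modeling score is the percentage of examples for which a model prefers a meaningful completion (stereotype or anti-stereotype) over an unrelated completion. Finally, Nadeem et al. (2021) define an idealized context association test (ICAT) score that combines these two metrics:

$$\mathrm{ICAT}=\mathrm{lms}\cdot\frac{\min(\mathrm{ss},100-\mathrm{ss})}{50}\tag{1}$$

where $\mathrm{lms}$ and $\mathrm{ss}$ denote the language model score and stereotype score, respectively.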
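As a small worked example of Equation (1), the function below computes the ICAT score from a language modeling score and a stereotype score, both given as percentages. The function name is ours. Applied to the overall StarCoderBase scores in Table 25 (lms = 86.18, ss = 55.53), it gives approximately 76.65, in agreement with the table; small differences on other rows can arise because the table inputs are rounded.

```python
def icat(lms: float, ss: float) -> float:
    """Idealized context association test score, Equation (1).

    lms: language modeling score in percent (higher is better).
    ss:  stereotype score in percent (50 is ideal).
    """
    return lms * min(ss, 100.0 - ss) / 50.0


# Overall StarCoderBase scores from Table 25.
overall_starcoderbase = icat(lms=86.18, ss=55.53)
```

An ideal model with lms = 100 and ss = 50 scores 100, while a model that always prefers the stereotypical (or always the anti-stereotypical) completion scores 0 regardless of lms.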

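We report StereoSet results for StarCoderBase, alongside LLaMA-13B and CodeGen-Multi-16B, in Table 25. Across all four bias domains, we find that StarCoderBase obtains the lowest stereotype scores while maintaining competitive language modeling scores. This suggests that StarCoderBase's lower stereotype scores are not simply due to worse language modeling (Meade et al., 2022), as is also indicated by its high ICAT scores.

We also evaluate StarCoderBase against Crowdsourced Stereotype Pairs (CrowS-Pairs; Nangia et al., 2020) and refer readers to Table D.4 for results.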

Model	Stereotype Score	Language Model Score	ICAT Score
		Gender
LLaMA-13B	66.54	88.09	58.95
CodeGen-Multi-16B	67.34	86.41	56.44
StarCoderBase	58.76	86.82	71.60
		Profession
LLaMA-13B	60.95	86.74	67.74
CodeGen-Multi-16B	60.67	85.67	67.38
StarCoderBase	53.24	84.70	79.21
		Race
LLaMA-13B	64.94	87.97	61.68
CodeGen-Multi-16B	60.58	88.60	69.85
StarCoderBase	56.48	86.82	75.58
		Religion
LLaMA-13B	57.95	90.26	75.91
CodeGen-Multi-16B	56.16	88.91	77.96
StarCoderBase	55.69	90.67	80.36
		Overall
LLaMA-13B	63.40	87.62	64.14
CodeGen-Multi-16B	61.29	87.25	67.55
StarCoderBase	55.53	86.18	76.65

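Table 25: StereoSet intrasentence results for gender, professional, racial, and religious bias. Stereotype scores close to 50% are best. Language modeling scores and ICAT scores close to $100\%$ are best.

7.3.2 Toxicity

To evaluate toxicity in responses generated by our model, we use RealToxicityPrompts (Gehman et al., 2020), a collection of sentence-level prompts that often elicit undesirable responses from language models. We generate responses to 10K examples from RealToxicityPrompts using StarCoderBase, with a minimum length of one token and a maximum length of $128$ tokens. We use nucleus sampling (Holtzman et al., 2020) with $p=0.95$ to generate all of our responses.

We use two methods for automatically evaluating toxicity in responses: (i) a RoBERTa-based (Liu et al., 2019) toxicity classifier (Vidgen et al., 2021) and (ii) a list of potentially offensive words. [Footnote 17: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words] For the toxicity detector, we flag a response as toxic using a threshold of $0.5$. For the offensive word list, we report the percentage of responses that contain an offensive word. We note that while the offensive word list can falsely flag responses, it may provide a crude measure of blatant toxicity. We report our results in Table 26.

Overall, we observe that CodeGen-16B-Multi and StarCoderBase both appear to generate less toxic responses than LLaMA-13B. For instance, $1.43\%$ of LLaMA-13B's responses contain potentially offensive tokens, compared to $1.12\%$ for StarCoderBase. We also note that CodeGen-16B-Multi appears to generate less toxic responses than StarCoderBase.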

Model	Classifier	Word List
LLaMA-13B	0.74	1.43
CodeGen-Multi-16B	0.21	0.82
StarCoderBase	0.42	1.12

Table 26: RealToxicityPrompts response toxicity results. We report the percentage of responses flagged toxic using a toxicity classifier and an offensive word list. Lower scores indicate less toxic generations.

7.4 Reasoning Tasks in HELM

Model	Size	Open Access	Synth. Reason. (AS)	Synth. Reason. (NL)	bAbI	Dyck	GSM8K	MATH	MATH (CoT)	LSAT	Legal Support
code-davinci-002	175B		54.0	68.4	68.6	80.5	56.8	41.0	43.3	—	—
text-davinci-003	175B		50.2	73.4	65.3	75.1	50.6	39.0	44.9	23.3	62.2
Luminous Supreme	70B		31.2	—	50.4	72.9	11.2	14.9	5.7	21.2	53.0
StarCoderBase	15.5B	✓	44.0	21.0	50.4	85.4	8.4	15.1	7.0	19.0	53.2
Cohere Command Beta	52.4B		24.3	24.5	47.3	42.1	13.8	13.3	7.5	22.9	60.6
J1-Jumbo v1	178B		26.3	17.4	54.3	44.5	5.4	8.9	3.3	23.2	48.4
J1-Grande v2 beta	17B		28.6	13.9	47.0	61.7	9.6	12.7	6.8	19.1	56.2
code-cushman-001	12B		34.1	16.4	48.1	45.1	4.9	9.9	7.2	—	—
OPT	175B	✓	22.5	24.8	50.7	49.4	4.0	6.5	2.6	22.0	53.2
GPT-NeoX	20B	✓	20.4	16.7	46.8	74.7	5.3	14.1	7.1	19.1	51.5
BLOOM	176B	✓	30.4	19.7	44.7	54.5	9.5	4.3	5.5	20.9	54.3
GLM	130B	✓	25.2	25.4	44.3	54.9	6.1	5.9	19.3	45.1
UL2	20B	✓	20.5	21.7	50.1	14.0	2.4	20.7	50.6
OPT	66B	✓	19.3	21.3	40.8	47.1	1.8	4.8	2.9	17.5	52.7
YaLM	100B	✓	5.6	6.1	34.6	63.3	2.3	48.4
	11B	✓	19.6	10.1	41.2	34.7	2.3	15.9	55.8

Table 27: Model results on natural language reasoning tasks in the HELM benchmarks, with models ordered by their average rank on the tasks. We use "—" when a model was not evaluated on a given metric, or has runtime errors logged in HELM (e.g., "unmapped prediction" for the code-davinci-002 and code-cushman-001 models on LSAT and Legal Support). StarCoder generally substantially outperforms other open-access models, and often outperforms much larger models.

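We evaluate StarCoderBase with HELM (Liang et al., 2022), an evaluation suite that aims to increase the transparency of LLMs by reporting their performance on a wide range of tasks. We evaluate the ability of the model to leverage its natural language and code pretraining for natural language reasoning tasks from HELM (excluding code tasks, because of our own extensive code evaluations). At the time of writing, the HELM benchmark does not include the CodeGen, CodeGeeX, and LLaMA models. Therefore, we compare StarCoderBase with the largest and/or most recent model from each family of "limited" or "open" access models, as classified on the HELM model list [Footnote 18: https://crfm.stanford.edu/helm/latest/?models=1], that had been evaluated on a majority of these HELM reasoning tasks as of May 1, 2023. In Table 27 we report the results. We compute each model's ranking on each task, and order the models in the table by their average ranking across tasks. StarCoderBase generally obtains substantially stronger performance than all other models with released weights, and often performs comparably to or better than much larger models. We speculate that the mixture of code and natural language in the training data contributes to the model's strong performance on these reasoning tasks.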
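The ordering by average rank can be sketched as follows. This is an illustrative reconstruction rather than the script used for Table 27: skipping tasks on which a model has no score, and breaking ties by input order, are both assumptions. The example scores are the bAbI, Dyck, and GSM8K columns of three rows of Table 27.

```python
def order_by_mean_rank(scores):
    """Order models by their mean per-task rank (rank 1 = best score).

    scores: {model: {task: value or None}}; higher values are better.
    Tasks with a missing value (None) are skipped for that model.
    """
    tasks = sorted({task for per_model in scores.values() for task in per_model})
    ranks = {model: [] for model in scores}
    for task in tasks:
        present = [
            (model, per_model[task])
            for model, per_model in scores.items()
            if per_model.get(task) is not None
        ]
        # Stable sort: models with equal scores keep their input order.
        present.sort(key=lambda pair: pair[1], reverse=True)
        for position, (model, _) in enumerate(present, start=1):
            ranks[model].append(position)
    mean_rank = {m: sum(r) / len(r) for m, r in ranks.items() if r}
    return sorted(mean_rank, key=mean_rank.get)


example = {
    "GPT-NeoX (20B)": {"bAbI": 46.8, "Dyck": 74.7, "GSM8K": 5.3},
    "OPT (66B)": {"bAbI": 40.8, "Dyck": 47.1, "GSM8K": 1.8},
    "StarCoderBase": {"bAbI": 50.4, "Dyck": 85.4, "GSM8K": 8.4},
}
ordering = order_by_mean_rank(example)
```

Ranking per task before averaging keeps tasks with very different score ranges (for example Dyck versus GSM8K) from dominating the ordering.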

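8 Qualitative Evaluation

In Appendix E, we highlight several interesting interactions we had with StarCoderBase. We hope these serve as a starting point for researchers and developers interested in further exploring the model's capabilities. We provide examples of how to elicit interesting model behavior using the templates for Git commits, GitHub issues, and Jupyter notebooks in Section E.1. In Section E.2, we demonstrate how to prompt StarCoder to act as a technical assistant without any instruction-tuning. In Section E.3, we find that it is also possible to prompt the model using a combination of metadata and natural language to obtain higher pass@1 performance on the HumanEval benchmark.

9 Attribution Tools

As generative language tools become more ubiquitous and data-intensive, the need to understand and inspect the massive amounts of text they were trained on becomes more pressing, both to understand the failure modes of models and to provide transparent data governance feedback in the form of attribution tracing and provenance management of a model's generated output. This pressing need for understanding data (Mitchell et al., 2022) is being increasingly recognized and operationalized in the form of dataset inspection tools and toolkits (Akiki et al., 2023; Marone & Van Durme, 2023; Piktus et al., 2023). It is from this vantage point that we are releasing two such data inspection tools: a membership-checking tool and a BM25 search index. These complement the existing "Am I in The Stack" tool, which operates at the level of GitHub repository names. The two new tools index only the files used for training and allow for matches on file content. These tools are available as standalone sites but are also integrated into our VSCode demo. This helps users identify parts of the model output that may have been copied from the training data. By using the search index, users can locate the corresponding source file and repository of the copied snippets.

9.1 Membership Checking

Marone & Van Durme (2023) propose documenting datasets with membership testing artifacts deemed Data Portraits. They provide one specific implementation, based on Bloom filters (Bloom, 1970), that offers fast and lightweight membership inference. We build a Bloom-filter-based portrait on strings of length 50 characters from the training data. This artifact takes 26 GB, $\sim 3\%$ of the data size. The inference tool is hosted publicly to complement other documentation artifacts. [Footnote 19: http://stack.dataportraits.org/]

Generations from the model can be quickly checked to approximately assess the degree of overlap with the training corpus. The VSCode extension supports using this as a rapid, first-pass attribution method. However, this requires that matching strings are longer than a minimum size, and it does not attempt to filter common or generic code snippets. After the first-pass check, users can use the full search index to further assess attribution.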
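To illustrate how a Bloom-filter-based membership check over 50-character strings can work, the sketch below builds a small filter and counts how many windows of a generation it flags. This is not the Data Portraits implementation: the hashing scheme (double hashing over a SHA-256 digest), the filter size, and the stride-1 sliding windows are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def windows(text: str, width: int = 50):
    """All substrings of `width` characters (stride 1 is an assumption)."""
    return [text[i:i + width] for i in range(len(text) - width + 1)]


def build_portrait(corpus: str, num_bits: int = 1 << 16, num_hashes: int = 4):
    portrait = BloomFilter(num_bits, num_hashes)
    for window in windows(corpus):
        portrait.add(window)
    return portrait


def count_matches(portrait: BloomFilter, generation: str) -> int:
    """Number of 50-character windows of the generation found in the portrait."""
    return sum(window in portrait for window in windows(generation))
```

Because matching is hash-based, a flagged window may be a false positive, and generations shorter than the window width can never match. Both properties are consistent with the limitations noted for the membership-checking tool.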

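9.2 Search Index

We index the training dataset using Elasticsearch 7.17 [Footnote 20: https://www.elastic.co/guide/en/elasticsearch/reference/7.17] and provide two search tools to query it: one focused on the Python subset and one covering the entire dataset. The code itself is preprocessed using a lowercase filter and Lucene's ASCIIFoldingFilter, tokenized using a 3-gram tokenizer, and indexed using the default Lucene implementation of BM25 as a similarity function. We further index the username and license fields as keyword fields, allowing for easy filtering and lookup based on these specific metadata fields. Both indexes are currently running in single-node mode on one virtual machine.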
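The analysis and scoring steps can be approximated in a few lines of plain Python. This is a rough sketch and not the Elasticsearch configuration: we assume character-level 3-grams, approximate ASCII folding with Unicode NFKD normalization (Lucene's ASCIIFoldingFilter handles more cases), apply folding before n-gram extraction for simplicity, and use the BM25 variant with $k_1=1.2$ and $b=0.75$, which are Lucene's default parameters.

```python
import math
import unicodedata
from collections import Counter


def analyze(text: str, n: int = 3):
    """Lowercase, fold accented characters to ASCII, emit character n-grams."""
    folded = unicodedata.normalize("NFKD", text.lower())
    folded = folded.encode("ascii", "ignore").decode("ascii")
    return [folded[i:i + n] for i in range(len(folded) - n + 1)]


def bm25_scores(query: str, docs, k1: float = 1.2, b: float = 0.75):
    """BM25 score of every document in `docs` for the given query."""
    term_counts = [Counter(analyze(doc)) for doc in docs]
    lengths = [sum(counts.values()) for counts in term_counts]
    avg_len = (sum(lengths) / len(docs)) or 1.0
    scores = [0.0] * len(docs)
    for term in set(analyze(query)):
        doc_freq = sum(1 for counts in term_counts if term in counts)
        if doc_freq == 0:
            continue
        idf = math.log(1 + (len(docs) - doc_freq + 0.5) / (doc_freq + 0.5))
        for i, counts in enumerate(term_counts):
            tf = counts[term]
            if tf:
                # Longer documents are penalized through the length ratio.
                norm = k1 * (1 - b + b * lengths[i] / avg_len)
                scores[i] += idf * tf / (tf + norm)
    return scores
```

Character 3-grams make the index robust to partial identifiers and unusual tokenization in code, at the cost of a larger index than word-level tokens would need.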

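10 Social Impact and Limitations

10.1 Project Approach

Open science and open governance

StarCoder is an output of a community research project. The project is conducted in the spirit of open science (Woelfle et al., 2011), focused on the responsible development and use of Code LLMs. Through open-governance practices conducted throughout the project, priority in decision-making has always yielded to the more responsible option, even if this meant introducing limitations that might impact adoption or future research. For example, the Legal, Ethics, and Governance Working Group decided to remove, and not release, a dataset of identified malicious code, even though this data might be useful for future security research.

Openness and safety risks

Solaiman (2023) explains how the degree of openness in the LLM development process is connected to the potential risks associated with a model release. When systems are developed in a fully closed manner, power is more likely to become concentrated among high-resourced organizations, and the small development team may not fully comprehend the impact and long-term consequences of the model being deployed. In addition, closed-development systems are often less auditable by external experts and can impede scientific progress, since researchers cannot build upon each other's work. On the other hand, fully open development allows for community research, democratizes access to the models, and enables audits throughout the whole development process. However, without appropriate guardrails, open LLM development poses a higher risk of misuse, as increased model access also increases the likelihood of harm caused by the model. Even though a released API can be shut down, once the model weights are released, it is nearly impossible to retract them. Discussing and implementing responsible AI practices has, therefore, been front and center during the development of our project's LLMs.

10.2 Limitations

Dataset and data licensing

StarCoder was trained on a subset of The Stack v1.2 dataset. This dataset has been filtered using a license detector to include only permissively licensed source code. Nevertheless, the license detector might have incorrectly classified a number of repositories. See Kocetkov et al. (2022) for more details on this license detection process.

Opt-out process

PII detection

Despite our best efforts to remove PII (Section 4), StarCoder may still produce PII (however, note that the model license restricts use that aims to generate or disseminate PII with the purpose of harming others). As mentioned in Section 4.2, we trained an encoder-only model to detect PII for both code- and text-related tasks, and we note that there is a possibility of false positives and false negatives, which could lead to unintended consequences when processing sensitive data. Moreover, the PII detection model's performance may vary across different data types and programming languages, necessitating further validation and fine-tuning for specific use cases. The PII annotations are only available to approved individuals, and researchers and developers who are granted access are expected to uphold ethical standards and data protection measures. By making them accessible, our aim is to encourage further research and development of PII redaction technology.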

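Malicious code

On the Hugging Face platform, where The Stack is hosted, a malicious code detection tool identified 654 files as unsafe. With the help of our community, we removed these files ahead of the release of The Stack v1.2. Nevertheless, The Stack may contain undetected malicious code, and StarCoder might be able to generate malware. The StarCoder OpenRAIL-M license, therefore, includes a use restriction against generating and/or disseminating malware (including, but not limited to, ransomware) or any other content that can be used to harm electronic systems.

Model limitations

StarCoder is subject to the typical limitations of LLMs, including the potential to generate content that is inaccurate, offensive, misleading, discriminatory towards age or gender, or that reinforces other stereotypes. Please refer to Section 7.3 for an investigation into such safety concerns. Deployments of StarCoder need to further challenge and adapt the model to prevent such behavior, e.g., through red-teaming (Perez et al., 2022), adversarial testing (Wan et al., 2023), and/or by adding a robust safety layer (OpenAI, 2023b). The model is released with an OpenRAIL-M license that places enforceable use restrictions on the model and its modifications, and on applications using the model.

English-only evaluations

We evaluated the performance of StarCoder solely on English-based benchmarks to understand its coding capabilities and natural language understanding. To make these models more accessible to a wider audience, future research should investigate the performance and limitations of Code LLMs on other natural languages.

Code attribution tools

The StarCoder membership-checking tool and BM25 search index are limited to dataset inspection against the subset of The Stack that was used for training and, as such, will not find matches to code that was not included in, or that was removed from, the dataset. The Portraits-based membership testing tool uses hash matching and thus may have false positives. It also has a minimum resolution and requires a certain amount of context to trigger a match. Neither attribution tool attempts to distinguish between generic code (e.g., boilerplate) and protected content. However, we hope that these tools will support ongoing research on the responsible development of LLMs.

10.3 Social Impact

Code LLMs

We expect Code LLMs to enable people from diverse backgrounds to learn to write higher-quality code and develop low-code applications (Leinonen et al., 2023). Mission-critical software could become easier to maintain as professional developers are guided by code-generating systems on how to write more robust and efficient code. However, the security implications should also be carefully considered (Sandoval et al., 2023). Although the social impact is positive, the increased accessibility of Code LLMs comes with certain risks, such as over-reliance on the generated code and long-term effects on the software development job market. We refer the reader to Chen et al. (2021, Section 7) for a broader impact analysis of Code LLMs, and to Khlaaf et al. (2022) for an in-depth risk assessment and hazard analysis of this emerging technology.

Data annotation

It was important for the project to use only reputable data annotation services. It was also important to balance the constraints of cost (fair compensation), time (the timing and duration of the work were on the critical path for the project), and quality (to ensure that the PII detection model was not impacted). While traditional data annotation services using salaried employees were considered, the decision to work with Toloka crowd-workers was taken after a review of service providers and their compensation practices; most would not provide sufficient transparency and guarantees about worker compensation. Our determination of compensation took into consideration the different minimum wage rates across countries and their corresponding purchasing power. We limited annotation eligibility to countries where the hourly pay rate of $7.30 was equivalent to the highest minimum wage in the US ($16.50) in terms of purchasing power parity.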

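Feedback opt-out form

During the first stage of the opt-out process, individuals were asked to specify their reasons for wanting their code to be excluded from the dataset. The recurring concerns we heard from the individuals who wished to opt out were:

• Preference for an opt-in approach instead of opt-out.
• Perception that it is unfair to use their code without compensation.
• Concerns about the current limitations of AI and the potential for model generations to be traced back to their work, resulting in potential legal liability.
• Belief that their code is of poor quality and unsuitable for AI training.
• Presence of PII in their code, which they do not wish to be publicly exposed.

The opt-out form thus provided an opportunity to engage directly with content creators and learn about the impact of our work on them.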

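Community feedback on opt-out process

We conducted community research with individuals at specific organizations whose data is used in The Stack (The Alan Turing Institute and The Turing Way) and contributed to two open, international workshops (Open Data Day 2023 and Mozilla Festival 2023, with a session titled "Designing for Data Rights in the AI Production Pipeline"). These qualitative interviews and participatory co-design workshops included 50 participants, primarily from North America and Europe, with roles including research scientist, community manager, software engineer, and principal investigator (PI).

The outcomes from the community research can be summarized as follows: when it comes to the governance of LLM datasets, participants feel that it is both better to know and better to have a choice. Most participants had neutral to positive feelings about their permissively licensed data being used to train LLMs. While all had positive impressions of the "Am I in The Stack" tool, none of the interviewees expressed a desire to actually opt out. The main takeaway seemed to be that participants found the most value in the project's governance tools for their ability to raise awareness of data practices and to empower individuals and communities to take action based on their specific needs. These initial conversations also highlighted the importance of bringing governance discussions and decisions directly to impacted communities, an important direction for future work that should extend community research beyond North America and Europe. Participants in the workshops also raised examples of new groups to center in data rights considerations, including artists, data miners, and future generations. The co-created outputs can be viewed on this MozFest Miro Board.

11 Conclusion

In this technical report, we described the efforts of the BigCode community in creating StarCoderBase and StarCoder, open-access 15.5B parameter large language models trained on code. We provided full transparency on all aspects of the research and development process, including the training data, the data curation process, the PII redaction pipeline, and the model training. We conducted the most extensive evaluation of Code LLMs to date, finding that StarCoder outperforms other Code LLMs such as CodeGen (Nijkamp et al., 2023) and CodeGeeX (Zheng et al., 2023), and matches or outperforms the closed-access code-cushman-001 model from OpenAI. By releasing the StarCoder models with an Open Responsible AI Model license, and by open-sourcing all code repositories for building the model on GitHub, we aim to increase access, reproducibility, and transparency of Code LLMs in the research and developer communities. The model license includes use restrictions to ensure that modifications of the model and applications using the model adhere to our principles of responsible AI. In addition, we released a novel set of attribution tools to help end-users of Code LLMs detect and locate model generations that may have been copied from the training set. We hope these measures contribute towards a safe model release, ensuring that the strong-performing StarCoder models remain a force for good.

Acknowledgements

We would like to thank Hugging Face for providing the compute resources to train the StarCoder models. We also thank Suriya Gunasekar for help with the data inspection, and Sebastien Paquet for proofreading this work. Carolyn Jane Anderson, Arjun Guha, Ming-Ho Yee, and Yangtian Zi are supported by U.S. National Science Foundation awards SES-2326174 and CCF-2102288. Evgenii Zheltonozhskii is supported by the Adams Fellowships Program of the Israel Academy of Sciences and Humanities.

References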

Ahmad et al. (2021) Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Unified pre-training for program understanding and generation. In Proceedings of NAACL, 2021. URL https://aclanthology.org/2021.naacl-main.211.
Akiki et al. (2022) Christopher Akiki, Giada Pistilli, Margot Mieskes, Matthias Gallé, Thomas Wolf, Suzana Ilic, and Yacine Jernite. BigScience: a case study in the social construction of a multilingual large language model. CoRR, abs/2212.04960, 2022. doi: 10.48550/arXiv.2212.04960. URL https://doi.org/10.48550/arXiv.2212.04960.
Akiki et al. (2023) Christopher Akiki, Odunayo Ogundepo, Aleksandra Piktus, Xinyu Zhang, Akintunde Oladipo, Jimmy Lin, and Martin Potthast. Spacerini: Plug-and-play search engines with Pyserini and Hugging Face. CoRR, abs/2302.14534, 2023. doi: 10.48550/arXiv.2302.14534. URL https://doi.org/10.48550/arXiv.2302.14534.
Andersen et al v. Stability AI et al (2023) Andersen et al v. Stability AI et al. 3:23-cv-00201 N.D. Cal. 2023.
Askell et al. (2021) Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Bahl et al. (1983) Lalit Bahl, Frederick Jelinek, and Robert Mercer. A maximum likelihood approach to continuous speech recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-5:179 – 190, 04 1983. doi: 10.1109/TPAMI.1983.4767370.
Bavarian et al. (2022) Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022. doi: 10.48550/ARXIV.2207.14255. URL https://arxiv.org/abs/2207.14255.
BBC (2023) BBC. ChatGPT accessible again in Italy. https://www.bbc.com/news/technology-65431914, 2023.
Ben Allal et al. (2022) Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro Von Werra. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, December 2022.
Ben Allal et al. (2023) Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, and Leandro von Werra. SantaCoder: don’t reach for the stars! In Deep Learning for Code Workshop (DL4C), 2023.
Bengio et al. (2000) Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. In T. Leen, T. Dietterich, and V. Tresp (eds.), Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000. URL https://proceedings.neurips.cc/paper_files/paper/2000/hash/728f206c2a01bf572b5940d7d9a8fa4c-Abstract.html.
Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023.
BigScience Workshop (2022) BigScience Workshop. BLOOM (revision 4ab0472), 2022. URL https://huggingface.co/bigscience/bloom.
Black et al. (2022) Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: an open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.
Bloom (1970) Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422–426, jul 1970. ISSN 0001-0782. doi: 10.1145/362686.362692. URL https://doi.org/10.1145/362686.362692.
Bommasani et al. (2021) Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, and et al. On the opportunities and risks of foundation models. CoRR, abs/2108.07258, 2021. URL https://arxiv.org/abs/2108.07258.
Brants et al. (2007) Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858–867, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL https://aclanthology.org/D07-1090.
Broder (2000) Andrei Z. Broder. Identifying and filtering near-duplicate documents. In Annual symposium on combinatorial pattern matching, pp. 1–10. Springer, 2000.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Buck et al. (2014) Christian Buck, Kenneth Heafield, and Bas van Ooyen. N-gram counts and language models from the Common Crawl. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 3579–3584, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2014/pdf/1097_Paper.pdf.
Butterick (2022) Matthew Butterick. This CoPilot is stupid and wants to kill me. https://matthewbutterick.com/chron/this-copilot-is-stupid-and-wants-to-kill-me.html, 2022.
Cassano et al. (2023) Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. MultiPL-E: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, pp. 1–17, 2023. doi: 10.1109/TSE.2023.3267446. URL https://arxiv.org/abs/2208.08227.
Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. arXiv preprint, 2021.
Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. PaLM: scaling language modeling with pathways. CoRR, abs/2204.02311, 2022. doi: 10.48550/arXiv.2204.02311. URL https://doi.org/10.48550/arXiv.2204.02311.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
Dao et al. (2022) Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
DOE 1 v. and GitHub, Inc. (2022) DOE 1 v. and GitHub, Inc. 4:22-cv-06823 N.D. Cal. 2022.
Eloundou et al. (2023) Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. GPTs are GPTs: An early look at the labor market impact potential of large language models. arXiv preprint arXiv:2303.10130, 2023.
Euronews (2023) Euronews. Microsoft attracting users to its code-writing, generative AI software. https://www.euronews.com/next/2023/01/25/microsoft-results-ai, 2023.
European Council (2018) European Council. The general data protection regulation. https://www.consilium.europa.eu/en/policies/data-protection/data-protection-regulation/, 2018.
Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020. doi: 10.48550/ARXIV.2002.08155. URL https://arxiv.org/abs/2002.08155.
Fried et al. (2022) Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. InCoder: a generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999, 2022. doi: 10.48550/ARXIV.2204.05999. URL https://arxiv.org/abs/2204.05999.
Gao et al. (2021a) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2021a.
Gao et al. (2021b) Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021b. URL https://doi.org/10.5281/zenodo.5371628.
Gao et al. (2022) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022.
Gehman et al. (2020) Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.301. URL https://aclanthology.org/2020.findings-emnlp.301.
Heafield et al. (2013) Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 690–696, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL https://aclanthology.org/P13-2121.
Henderson et al. (2023) Peter Henderson, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A Lemley, and Percy Liang. Foundation models and fair use. arXiv preprint arXiv:2303.15715, 2023.
Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
Hindle et al. (2012) Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE), pp. 837–847. IEEE, 2012.
Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH.
Husain et al. (2019) Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.
Hutchinson et al. (2020) Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5491–5501, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.487. URL https://aclanthology.org/2020.acl-main.487.
Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
Kanade et al. (2020) Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. Learning and evaluating contextual embedding of source code. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Khlaaf et al. (2022) Heidy Khlaaf, Pamela Mishkin, Joshua Achiam, Gretchen Krueger, and Miles Brundage. A hazard analysis framework for code synthesis large language models. arXiv preprint arXiv:2207.14157, 2022.
Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
Kocetkov et al. (2022) Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. The Stack: 3 TB of permissively licensed source code. Preprint, 2022. URL https://arxiv.org/abs/2211.15533.
Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
Kuhn (2022) Bradley M. Kuhn. If software is my copilot, who programmed my software? https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/, 2022.
Kurita et al. (2019) Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. Measuring bias in contextualized word representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pp. 166–172, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-3823. URL https://www.aclweb.org/anthology/W19-3823.
Lacoste et al. (2019) Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700, 2019.
Lai et al. (2022) Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen tau Yih, Daniel Fried, Sida Wang, and Tao Yu. DS-1000: a natural and reliable benchmark for data science code generation. ArXiv, abs/2211.11501, 2022.
Lee (2013) Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, number 2, pp. 896, 2013.
Leinonen et al. (2023) Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. Comparing code explanations created by students and large language models, 2023.
Lemley & Casey (2020) Mark A Lemley and Bryan Casey. Fair learning. Tex. L. Rev., 99:743, 2020. URL https://texaslawreview.org/fair-learning/.
Levendowski (2018) Amanda Levendowski. How copyright law can fix artificial intelligence’s implicit bias problem. Wash. L. Rev., 93:579, 2018.
Li et al. (2022) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level code generation with alphacode. arXiv preprint arXiv:2203.07814, 2022.
Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
Lomas (2022) Natasha Lomas. Unpicking the rules shaping generative AI. https://techcrunch.com/2023/04/13/generative-ai-gdpr-enforcement/, 2022.
Lu et al. (2021) Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021.
Marone & Van Durme (2023) Marc Marone and Benjamin Van Durme. Data portraits: Recording foundation model training data. CoRR, abs/2303.03919, 2023. doi: 10.48550/arXiv.2303.03919. URL https://doi.org/10.48550/arXiv.2303.03919.
May et al. (2019) Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 622–628, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1063. URL https://www.aclweb.org/anthology/N19-1063.
Meade et al. (2022) Nicholas Meade, Elinor Poole-Dayan, and Siva Reddy. An empirical survey of the effectiveness of debiasing techniques for pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1878–1898, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.132. URL https://aclanthology.org/2022.acl-long.132.
Meade et al. (2023) Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, and Dilek Hakkani-Tür. Using in-context learning to improve dialogue safety, February 2023. URL http://arxiv.org/abs/2302.00871. arXiv:2302.00871 [cs].
Mikolov et al. (2010) Tomás Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura (eds.), INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pp. 1045–1048. ISCA, 2010. URL http://www.isca-speech.org/archive/interspeech_2010/i10_1045.html.
Mitchell et al. (2019) Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In danah boyd and Jamie H. Morgenstern (eds.), Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 2019, Atlanta, GA, USA, January 29-31, 2019, pp. 220–229. ACM, 2019. doi: 10.1145/3287560.3287596. URL https://doi.org/10.1145/3287560.3287596.
Mitchell et al. (2022) Margaret Mitchell, Alexandra Sasha Luccioni, Nathan Lambert, Marissa Gerchick, Angelina McMillan-Major, Ezinwanne Ozoani, Nazneen Rajani, Tristan Thrush, Yacine Jernite, and Douwe Kiela. Measuring data. CoRR, abs/2212.05129, 2022. doi: 10.48550/arXiv.2212.05129. URL https://doi.org/10.48550/arXiv.2212.05129.
MOI et al. (2022) Anthony MOI, Nicolas Patry, Pierric Cistac, Pete, Funtowicz Morgan, Sebastian Pütz, Mishig, Bjarte Johansen, Thomas Wolf, Sylvain Gugger, Clement, Julien Chaumond, Lysandre Debut, François Garillot, Luc Georges, dctelus, JC Louis, MarcusGrass, Taufiquzzaman Peyash, 0xflotus, Alan deLevie, Alexander Mamaev, Arthur, Cameron, Colin Clement, Dagmawi Moges, David Hewitt, Denis Zolotukhin, and Geoffrey Thomas. huggingface/tokenizers: Rust 0.13.2, November 2022. URL https://doi.org/10.5281/zenodo.7298413.
Muennighoff et al. (2022) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.
Nadeem et al. (2021) Moin Nadeem, Anna Bethke, and Siva Reddy. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5356–5371, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.416. URL https://aclanthology.org/2021.acl-long.416.
Nangia et al. (2020) Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. CrowS-Pairs: a challenge dataset for measuring social biases in masked language models. arXiv:2010.00133 [cs], September 2020. URL http://arxiv.org/abs/2010.00133. arXiv: 2010.00133.
Nijkamp et al. (2023) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. CodeGen: an open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_.
Olsson et al. (2022) Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
OpenAI (2023a) OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023a.
OpenAI (2023b) OpenAI. GPT-4 system card. https://cdn.openai.com/papers/gpt-4-system-card.pdf, 2023b.
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://aclanthology.org/P02-1040.
Pearce et al. (2022) Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. Asleep at the keyboard? Assessing the security of GitHub Copilot’s code contributions. In IEEE Symposium on Security and Privacy, San Francisco, CA, 2022. URL https://arxiv.org/abs/2108.09293.
Perez et al. (2022) Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022.
Piktus et al. (2023) Aleksandra Piktus, Christopher Akiki, Paulo Villegas, Hugo Laurençon, Gérard Dupont, Alexandra Sasha Luccioni, Yacine Jernite, and Anna Rogers. The ROOTS search tool: Data transparency for LLMs. CoRR, abs/2302.14035, 2023. doi: 10.48550/arXiv.2302.14035. URL https://doi.org/10.48550/arXiv.2302.14035.
Pradel et al. (2020) Michael Pradel, Georgios Gousios, Jason Liu, and Satish Chandra. TypeWriter: Neural Type Prediction with Search-Based Validation. In ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020. doi: 10.1145/3368089.3409715.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Rae et al. (2021) Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Reddy et al. (2019) Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019. doi: 10.1162/tacl_a_00266. URL https://aclanthology.org/Q19-1016.
Rothchild & Rothchild (2022) John A. Rothchild and Daniel Rothchild. Copyright implications of the use of code repositories to train a machine learning model. https://www.fsf.org/licensing/copilot/copyright-implications-of-the-use-of-code-repositories-to-train-a-machine-learning-model, 2022.
Sandoval et al. (2023) Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Siddharth Garg, and Brendan Dolan-Gavitt. Lost at C: A user study on the security implications of large language model code assistants, 2023.
Scao et al. (2022) Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. BLOOM: a 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
Shazeer (2019) Noam Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150, 2019. URL http://arxiv.org/abs/1911.02150.
Smith (2016) Arfon Smith. Kernel description. https://github.blog/2016-06-29-making-open-source-data-more-available/, 2016.
Smith et al. (2022) Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
Solaiman (2023) Irene Solaiman. The gradient of generative AI release: Methods and considerations. arXiv preprint arXiv:2302.04844, 2023.
Tay et al. (2022) Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022.
Thompson (2022) Clive Thompson. How an ai became my code-writing genie, Mar 2022. URL https://www.wired.com/story/openai-copilot-autocomplete-for-code/.
Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
Togelius & Yannakakis (2023) Julian Togelius and Georgios N. Yannakakis. Choose your weapon: Survival strategies for depressed AI academics. arXiv preprint arXiv:2304.06035, 2023.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Vidgen et al. (2021) Bertie Vidgen, Tristan Thrush, Zeerak Waseem, and Douwe Kiela. Learning from the worst: Dynamically generated datasets to improve online hate detection. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1667–1682, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.132. URL https://aclanthology.org/2021.acl-long.132.
Wan et al. (2023) Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning language models during instruction tuning, 2023.
Wang & Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. GPT-J-6B: a 6 billion parameter autoregressive language model, 2021.
Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8696–8708, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.685. URL https://aclanthology.org/2021.emnlp-main.685.
Wang et al. (2022) Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. Execution-based evaluation for open-domain code generation. arXiv preprint arXiv:2212.10481, 2022.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J.
Woelfle et al. (2011) Michael Woelfle, Piero Olliaro, and Matthew H. Todd. Open science is a research accelerator. Nature Chemistry, 3(10):745–748, October 2011. ISSN 1755-4349. doi: 10.1038/nchem.1149.
Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
World Economic Forum (2023) World Economic Forum. Future of jobs report. https://www3.weforum.org/docs/WEF_Future_of_Jobs_2023.pdf, 2023.
Xu et al. (2022) Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, MAPS 2022, pp. 1–10, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392730. doi: 10.1145/3520312.3534862. URL https://doi.org/10.1145/3520312.3534862.
Yee & Guha (2023) Ming-Ho Yee and Arjun Guha. Do machine learning models produce TypeScript types that type check? In European Conference on Object-Oriented Programming (ECOOP), 2023.
Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, and Jie Tang. GLM-130B: an open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Zheng et al. (2023) Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. arXiv preprint arXiv:2303.17568, 2023. doi: 10.48550/arXiv.2303.17568.
Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.

Appendix A GitHub Issues Filtering

Below we present the filters and regular expressions we used for GitHub issues.

import re

# regexes used for removing automated text
GITHUB_EMAILS = [
    re.compile(pattern, re.DOTALL)
    for pattern in [
        "(.*)From:.+Reply to this email directly.+view it on GitHub(.*)\n?(.*)",
        "(.*)On.+notifications@github.com.+wrote:.+Reply to this email directly.+view it on GitHub(.*)\n?(.*)",
        "(.*)Signed-off-by: .+<.+>(.*?)\n?(.*)",
    ]
]
GITHUB_EMAIL_DATE = re.compile(r"\d+/\d+/\d+ \d{2}:\d{2} [AP]M.+wrote")
GITHUB_EMAIL_LINEBREAK = re.compile("_{20,}")

# remove comments from authors in this list
BOT_AUTHORS = [
    "Apache-HBase",
    "AutorestCI",
    "CLAassistant",
    "cmsbuild",
    "codecov-io",
    "codecov-commenter",
    "coveralls",
    "danger-public",
    "dnfclas",
    "msftclas",
    "PyDocTeur",
    "SparkQA",
    "karma-pr-reporter",
    "danger-public",
    "claassistantio",
    "probot-stale",
]

# remove comment if author username contains this keyword
BOT_KEYWORDS = ["[bot]", "botmanager", "bors-", "jenkins", "k8s-", "-test-", "travis"]

# remove comments if author username ends with this suffix
BOT_SUFFIXES = [
    "-automaton",
    "-automation",
    "-benchmark",
    "-build",
    "-deployer",
    "-cloud",
    "bot",
    "-ci",
    "-linter",
    "-teamcity",
    "-test",
    "-testing",
    "-Service-Account",
]

Listing A.1: GitHub issues filtering
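The comments in Listing A.1 describe three username rules (exact match, keyword, suffix) and a set of patterns for automated text. A minimal sketch of how they might be applied is shown below. The helper names `is_bot_author` and `strip_sign_off` are hypothetical, the lists are abbreviated copies so the block runs on its own, and both case-sensitive matching and keeping the three capture groups when substituting are assumptions rather than the authors' documented procedure.

```python
import re

# Abbreviated copies of the lists from Listing A.1 so the sketch runs on its own.
BOT_AUTHORS = ["codecov-io", "coveralls", "SparkQA"]
BOT_KEYWORDS = ["[bot]", "jenkins", "travis"]
BOT_SUFFIXES = ["-ci", "-test", "bot"]

# One of the automated-text patterns from Listing A.1.
SIGNED_OFF = re.compile("(.*)Signed-off-by: .+<.+>(.*?)\n?(.*)", re.DOTALL)


def is_bot_author(username: str) -> bool:
    """Hypothetical helper: apply the three username rules in turn."""
    if username in BOT_AUTHORS:  # exact match against the author list
        return True
    if any(keyword in username for keyword in BOT_KEYWORDS):  # substring match
        return True
    return username.endswith(tuple(BOT_SUFFIXES))  # suffix match


def strip_sign_off(comment: str) -> str:
    """Hypothetical helper: drop the matched trailer and keep the captured text.

    Assumption: the three capture groups hold the text that should survive.
    """
    return SIGNED_OFF.sub(r"\1\2\3", comment)


comments = [
    ("alice", "Fixes the bug.\nSigned-off-by: Jane Doe <jane@example.com>\n"),
    ("dependabot[bot]", "Bumps foo from 1.0 to 1.1."),
    ("my-project-ci", "Build passed."),
]

# Keep only comments from non-bot authors, with the sign-off trailer removed.
kept = [strip_sign_off(text) for author, text in comments if not is_bot_author(author)]
print(kept)
```

Note that the bare `"bot"` suffix also matches ordinary usernames that happen to end in those letters, so a rule set like this trades some recall of human comments for simplicity.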

Appendix B Annotator Countries

See Table B.1.

Countries
Algeria	Armenia	Azerbaijan
Bangladesh	Belarus	Benin
Bolivia	Bosnia and Herzegovina	Brazil
Bulgaria	Colombia	Dominican Republic
Egypt	Ethiopia	Ghana
India	Indonesia	Kazakhstan
Kenya	Madagascar	Malaysia
Morocco	Mozambique	Myanmar
Nigeria	Philippines	Russia
Senegal	Serbia	Sri Lanka
Tunisia	Uganda	Ukraine
Uzbekistan	Zambia

Table B.1: List of countries from which we recruited annotators for the PII labeling effort.

Appendix C Replacements for IP Addresses

# List of random private IP addresses we used to mask IP addresses
REPLACEMENTS_IP = {
    "IPv4": [
        "172.16.31.10",
        "172.16.58.3",
        "172.16.17.32",
        "192.168.127.12",
        "192.168.3.11",
    ],
    "IPv6": [
        "fd00:c2b6:b24b:be67:2827:688d:e6a1:6a3b",
        "fd00:a516:7c1b:17cd:6d81:2137:bd2a:2c5b",
        "fc00:e968:6179::de52:7100",
        "fc00:db20:35b:7399::5",
        "fdf8:f53e:61e4::18",
    ],
}

Listing C.1: Replacements for IP addresses.
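A replacement table like the one in Listing C.1 might be used as sketched below. The function name `mask_ip`, the use of the standard-library `ipaddress` module to decide the IP version, and the uniform random choice among same-version replacements are assumptions for illustration, not a description of the released redaction code.

```python
import ipaddress
import random

REPLACEMENTS_IP = {
    "IPv4": [
        "172.16.31.10",
        "172.16.58.3",
        "172.16.17.32",
        "192.168.127.12",
        "192.168.3.11",
    ],
    "IPv6": [
        "fd00:c2b6:b24b:be67:2827:688d:e6a1:6a3b",
        "fd00:a516:7c1b:17cd:6d81:2137:bd2a:2c5b",
        "fc00:e968:6179::de52:7100",
        "fc00:db20:35b:7399::5",
        "fdf8:f53e:61e4::18",
    ],
}


def mask_ip(detected: str, rng: random.Random) -> str:
    """Return a replacement address of the same IP version as `detected`."""
    # ip_address raises ValueError when the string is not a valid address.
    version = ipaddress.ip_address(detected).version  # 4 or 6
    return rng.choice(REPLACEMENTS_IP[f"IPv{version}"])
```

Passing an explicit `random.Random` instance keeps the masking reproducible when a seed is fixed.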

Appendix D Additional Evaluation Results

Language	code-cushman-001 (12B)	code-davinci-002 (175B)	StarCoderBase (15.5B)
cpp	30.59	48.44	30.56
c-sharp	22.06	27.47	20.56
d	6.73	21.71	10.01
go	19.68	31.39	21.47
java	31.90	40.12	28.53
julia	1.54	35.74	21.09
javascript	31.27	48.99	31.70
lua	26.24	40.83	26.61
php	28.94	47.40	26.75
perl	19.29	34.77	16.32
python	30.71	46.68	30.35
r	10.99	23.13	10.18
ruby	28.63	42.68	17.25
racket	7.05	17.60	11.77
rust	25.22	43.40	24.46
scala	27.62	43.61	28.79
shell	11.74	23.24	11.02
swift	22.12	38.02	16.74
typescript	31.26	48.87	32.15

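Table D.1: Multilingual performance (pass@1) on MultiPL-E HumanEval of StarCoder and two closed-access models that are only available through an API. Code-davinci-002 performs best, but its parameter count and inference cost significantly exceed those of StarCoder and code-cushman-001.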

Format	Model	Valid ( $\uparrow$ )	Insecure ( $\downarrow$ )
Completion	StarCoderBase	855/1000 (85.50%)	340/855 (39.77%)
Insertion	StarCoderBase	987/1000 (98.70%)	354/987 (35.87%)
Completion	code-davinci-002	984/1000 (98.40%)	423/984 (42.99%)
Insertion	code-davinci-002	986/1000 (98.60%)	421/986 (42.70%)

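Table D.2: Security evaluation of StarCoderBase and OpenAI's code-davinci-002 on the Asleep at the Keyboard dataset. In contrast to code functionality, the larger size of code-davinci-002 does not appear to improve its performance at generating secure code.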

Problem name	400B	600B	800B	1000B
HumanEval_0_has_close_elements	20	171	197	5
HumanEval_13_greatest_common_divisor	86	176	153	6
HumanEval_152_compare	211	185	126	11
HumanEval_16_count_distinct_characters	0	46	137	0
HumanEval_23_strlen	105	60	200	6
HumanEval_33_sort_third	42	0	1	106
HumanEval_37_sort_even	90	156	132	0
HumanEval_3_below_zero	190	154	0	129
HumanEval_43_pairs_sum_to_zero	0	34	119	7
HumanEval_46_fib4	197	200	142	6
HumanEval_52_below_threshold	0	186	170	13
HumanEval_86_anti_shuffle	0	0	118	1
HumanEval_97_multiply	1	0	133	21

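Table D.3: Pass counts for R on a selection of problems (out of 200 samples), where the difference in pass counts between the 800B and 1000B checkpoints is 100 or higher.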
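The pass counts in Table D.3 can be converted into per-problem pass@1 scores. The sketch below assumes the standard unbiased pass@k estimator of Chen et al. (2021), which for k = 1 reduces to the fraction of passing samples; the function name `pass_at_k` is ours.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c pass the tests."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# HumanEval_0_has_close_elements in Table D.3, 200 samples per checkpoint.
at_800b = pass_at_k(200, 197, 1)   # 197 passes at the 800B checkpoint
at_1000b = pass_at_k(200, 5, 1)    # 5 passes at the 1000B checkpoint
```

For this problem the per-problem pass@1 falls from 0.985 to 0.025 between the two checkpoints, which is the kind of swing the table selects for.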

Social Bias	LLaMA-13B	CodeGen-16B-Multi	StarCoder
Race/Color	68.99	61.82	63.95
Socioeconomic Status	68.60	68.60	63.37
Gender	59.16	54.96	50.76
Disability	81.67	73.33	81.67
Nationality	59.75	47.17	57.23
Sexual Orientation	73.81	67.86	72.62
Physical Appearance	71.43	55.56	57.14
Religion	76.19	54.29	74.29
Age	72.41	48.28	54.02
Overall	67.84	59.08	61.94

Table D.4: CrowS-Pairs results across different bias domains. We report the stereotype score for each domain. A stereotype score closer to 50% indicates less bias.
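Because the ideal stereotype score is 50%, the models in Table D.4 can be compared by their distance from that value. The snippet below is a small illustration using only the Overall row; the variable names and the use of absolute distance as the comparison are our assumptions.

```python
# Overall stereotype scores copied from Table D.4.
overall = {"LLaMA-13B": 67.84, "CodeGen-16B-Multi": 59.08, "StarCoder": 61.94}

# Distance from the 50% ideal; a smaller distance means less measured bias.
distance = {model: round(abs(score - 50.0), 2) for model, score in overall.items()}

# Models ordered from least to most measured bias.
ranking = sorted(distance, key=distance.get)
```

On the Overall row this places CodeGen-16B-Multi closest to 50%, followed by StarCoder and then LLaMA-13B.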

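Appendix E Qualitative Examples

E.1 Using Pretraining Templates

For git commits, GitHub issues, and formatted Jupyter notebooks, we use a templated structure with sentinel tokens during pretraining. This template format allows us to easily prompt the model for specific use cases: with the commit format, we can prompt the model to modify code given a natural language instruction; with the GitHub issues format, to respond to technical natural language questions; and with the Jupyter notebook format, to write code based on a natural language description. Since we also train on the output of Jupyter code cells, we can use the model as a basic interpreter that predicts the output of a piece of code. We can force the model to always predict an output by suppressing the empty output token (<empty_output>). Table E.1 illustrates uses of the pretraining templates.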

Description: Jupyter format for predicting results

Model input:
    <jupyter_text>Let's test our `is_prime` function:<jupyter_code>print(is_prime(3))
    print(is_prime(4))
    print(is_prime(29))
    print(is_prime(33))<jupyter_output>
Model output:
    True
    False
    True
    False

Model input:
    <jupyter_code>numbers = [1, 9, 8, 3, 27]
    print([n*2 for n in numbers])<jupyter_output>
Model output:
    [2, 18, 16, 6, 54]

Description: Git commits to modify code

Model input:
    <commit_before>def fibonacci(n):<commit_msg>add type hints to function<commit_after>def
Model output:
    def fibonacci(n: int) -> list[int]:

Description: GitHub Issues

Model input:
    <issue_start><issue_comment>username_0: I am trying to train a text classifier based on BERT using `transformers` but I get the error: `RuntimeError: CUDA Out of memory`.<issue_comment>
Model output:
    username_1: Hi, This is not an issue with transformers. You should look at your GPU usage and see if you can free up some space before training starts (or use smaller batch sizes).

Table E.1: Sample uses of pretraining templates.
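The templates in Table E.1 can be assembled with plain string formatting. The helpers below are a minimal sketch: the function names are hypothetical, and the absence of whitespace around the sentinel tokens is an assumption read off the examples in the table rather than a documented requirement.

```python
def jupyter_prompt(code: str, text: str = "") -> str:
    """Ask the model to predict the output of a Jupyter code cell."""
    prefix = f"<jupyter_text>{text}" if text else ""
    return f"{prefix}<jupyter_code>{code}<jupyter_output>"


def commit_prompt(before: str, message: str, after_prefix: str = "") -> str:
    """Ask the model to rewrite `before` according to a commit message."""
    return f"<commit_before>{before}<commit_msg>{message}<commit_after>{after_prefix}"


def issue_prompt(question: str, username: str = "username_0") -> str:
    """Ask the model to reply to a GitHub-issue-style question."""
    return f"<issue_start><issue_comment>{username}: {question}<issue_comment>"
```

For example, `commit_prompt("def fibonacci(n):", "add type hints to function", "def")` reproduces the commit-format input shown in Table E.1.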

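E.2 Technical Assistant

In preliminary explorations, we discovered that using Anthropic's HHH prompt (Askell et al., 2021) turned the model into a somewhat capable yet brittle technical assistant. To our surprise, without any instruction tuning, we were able to exploit StarCoder's 8k context length to have the model answer questions, follow instructions, and help solve technical problems. We further improved the HHH prompt by including more conversational examples related to the programming domain. We drew on examples from various sources to create the prompt, including CoT (Wei et al., 2022; Kojima et al., 2022) and least-to-most prompting (Zhou et al., 2022) for reasoning questions. The prompt's examples originate from StackExchange, PAL (Gao et al., 2022), Anthropic's HHH prompt, and our own efforts. We provide example responses in Table E.2 and the prompt itself in Appendix F. Note that the technical assistant has clear limitations: it sometimes proposes wrong solutions, presents wrong facts, and can make offensive comments.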

Instruction: I need to integrate a Python function numerically, what's the best way to do it?

StarCoder's response: There are few options available depending upon whether you already have access to librairies like Scipy or Numpy which implement numerical integration routines. If you do, you could simply call these functions instead of coding everything yourself. Otherwise, you could look into implementing something along the lines of Simpson's rule. Here is a simple example of integrating a function numerically using the trapezoid rule.

    def trapint(func, a, b, N):
        dx = float((b-a)/N)
        sum = func(a)+func(b)
        for i in range(1, N):
            x = a + i*dx
            sum += 2.*func(x)
        return sum*(dx/2.)

Instruction: Write a Python function to tell me what the date is today

StarCoder's response: Sure thing. Here goes…

    from datetime import date
    today = str(date.today()).replace('-', '/')
    print("Today is " + today)

Instruction: List the prime numbers between 20 and 30.

StarCoder's response: Here is a simple algorithm which uses trial division to find the prime numbers within a given interval.

    primes = []
    for num in range(20, 31):
        found = False
        for j in range(2,num//2):
            if ((num % j)==0):
                found = True
                break
        if((found==False) & (num!= 1)):
            primes.append(num)

Table E.2: Sample interactions with the technical assistant.
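One way to use such a few-shot prompt is to append the user's question as a new dialogue and let the model complete the assistant's turn. The sketch below is an assumption about how this could be wired up, based only on the `Human:` / `Assistant:` turns and `-----` separators visible in the prompt in Appendix F; the function names, the exact whitespace, and the stop strings are hypothetical.

```python
SEPARATOR = "-----"


def build_assistant_prompt(system_prompt: str, dialogues: list[str], query: str) -> str:
    """Join the preamble, the example dialogues, and a new user turn."""
    parts = [system_prompt.strip()]
    for dialogue in dialogues:
        parts.append(SEPARATOR)
        parts.append(dialogue.strip())
    parts.append(SEPARATOR)
    parts.append(f"Human: {query.strip()}")
    parts.append("Assistant:")  # the model continues from here
    return "\n\n".join(parts)


def first_reply(completion: str) -> str:
    """Cut a raw completion at the next turn or dialogue separator."""
    for stop in ("\nHuman:", SEPARATOR):
        completion = completion.split(stop, 1)[0]
    return completion.strip()
```

Truncating at the next `Human:` turn matters because a base model without instruction tuning will often keep writing both sides of the conversation.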

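E.3 Improving Code Generation with Prompting

We inspected StarCoder-generated programs on the HumanEval benchmark and found several cases where the model produces what are effectively empty solutions, e.g., pass or a comment such as Insert code here. We also observed this kind of failure in every model we evaluated. When this type of problem occurs in practice in an IDE, a programmer addresses it by altering the prompt in some ad hoc way.

We tried a few prompt prefixes that could be applied uniformly to all benchmark problems. However, these prefixes are typically model-specific. StarCoder's input format allows us to prompt it with the name of a file using the <filename> token. We found that the following prefix, at temperature 0.1, boosts pass@1 performance on HumanEval to 40.8% (from 33.6% unprompted, Table 12):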

<filename>solutions/solution_1.py
# Here is the correct implementation of the code exercise

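We also evaluated CodeGen-16B-Mono with the same temperature and prompt (but had to omit the filename, since the CodeGen models do not support them). We found that this hurts performance, bringing it down to 28.10%. However, other prefixes may exist that improve its performance. Similarly, we found that this prompt had a negligible impact on StarCoderBase.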
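Applying the prefix uniformly to every benchmark problem could look like the sketch below. The helper name `with_prefix`, the `supports_filename` flag, and the single newline separating the prefix from the problem prompt are assumptions; the prefix text itself is the one given above.

```python
PREFIX = (
    "<filename>solutions/solution_1.py\n"
    "# Here is the correct implementation of the code exercise\n"
)


def with_prefix(problem_prompt: str, supports_filename: bool = True) -> str:
    """Prepend the prompt prefix, dropping the <filename> line if unsupported."""
    if supports_filename:
        prefix = PREFIX
    else:
        # Keep only the comment line, e.g. for models without a <filename> token.
        prefix = PREFIX.split("\n", 1)[1]
    return prefix + problem_prompt
```

The `supports_filename=False` path mirrors the CodeGen-16B-Mono setting described above, where only the comment line can be used.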

Appendix F Technical Assistant Prompt

Below are a series of dialogues between various people and an AI technical assistant. The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable. The assistant is happy to help with code questions, and will do its best to understand exactly what is needed. It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer. That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.

-----

Human: Write a function that takes two lists and returns a list that has alternating elements from each input list.

Assistant: Sure. Here is a function that does that.

‘‘‘

def alternating(list1, list2):
    results = []
    for i in range(len(list1)):
        results.append(list1[i])
        results.append(list2[i])
    return results

‘‘‘

Human: Can you write some test cases for this function?

Assistant: Sure, here are some tests.

‘‘‘

assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3]

assert alternating([True, False], [4, 5]) == [True, 4, False, 5]

assert alternating([], []) == []

‘‘‘

Human: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.

Assistant: Here is the modified function.

‘‘‘

def alternating(list1, list2):
    results = []
    for i in range(min(len(list1), len(list2))):
        results.append(list1[i])
        results.append(list2[i])
    if len(list1) > len(list2):
        results.extend(list1[i+1:])
    else:
        results.extend(list2[i+1:])
    return results

‘‘‘

-----

Human: My code below is not working. Can you see what may be wrong?

‘‘‘

def factorial(n):
    r = 0
    for i in range(n):
        r *= i
    return r

‘‘‘

Assistant: In factorial, the initial value of ‘r‘ must be ‘1‘. The following version should work:

‘‘‘

def factorial(n):
    r = 1 # Changed here
    for i in range(n):
        r *= i
    return r

‘‘‘

Human: Can you use this function to create a code to compute the number of k-combinations of a set of n elements?

Assistant: Yes, of course. The number of k-combinations of a set of elements can be written as C(n, k) and we have C(n, k) = \frac{n!}{(n-k)!k!} whenever k <= n. The code is as follows

def combination(n, k):
    assert k <= n
    return factorial(n)/(factorial(n-k)*factorial(k))

This should work pretty well. Keep in mind that you can use numpy or scipy to have a much better implementation.

Human: Thanks.

Assistant: You are welcome.

-----

Human: I have a CSV file that looks like this:

Name,Salary

Bob,12000

Avantika,35000

Alice,30000

Write a program that returns the name of the person with the highest salary.

Assistant: Here is a solution that assumes the filename is salaries.csv.

‘‘‘

from csv import DictReader
from pathlib import Path

def max_salary(filename):
    rows = DictReader(Path(filename).read_text().splitlines())
    max_index = -1
    for i, row in enumerate(rows):
        if int(row['Salary']) > int(rows[max_index]['Salary']):
            max_index = i
    return rows[max_index]['Name']

print(max_salary('salaries.csv'))

‘‘‘

-----

Human: What is the purpose of the code below ?

‘‘‘

def f(n, p):
    if p == 0:
        return 1
    else :
        return n*f(n, p-1)

‘‘‘

Assistant: This python function is designed to raise a base value n to a power p, that is, computing the value n^p where p is a positive integer.

-----

Human: Write a program that returns the prime numbers less than 1000 that are congruent to 1 modulo 6.

Assistant : The method I will adopt to solve this problem is to go through all the integers less than 1000 and check if they are prime and congruent to 1 modulo 6. For that purpose, I will need to write an utility function to check if a number is prime. Here it is:

‘‘‘

def is_prime(p):
    if p <= 1:
        return False
    else :
        for i in range(2, int(p**0.5)+1):
            if p%i == 0 :
                return False
        return True

‘‘‘

Now, here is the final solution:

‘‘‘

def primes_mod_6():
    primes = []
    for n in range(1000):
        if is_prime(n) and n%6 == 1 :
            primes.append(n)
    return primes

‘‘‘

-----

Human: Can you translate the code below in Java?

‘‘‘

print("Hello world")

‘‘‘

Assistant: Sure, the Java translation of your code is as follows.

‘‘‘

System.out.println("Hello world");

‘‘‘

-----

Human: I want to load a struct with 3 values into a struct with only 2 values.

‘‘‘

typedef struct {
    double x;
    double y;
} twod;

typedef struct {
    double x;
    double y;
    double z;
} threed;

‘‘‘

The second struct contains arrays of coordinates for the 3d plane. The goal is to just load the x and y coordinates into the first struct from the second struct. Is that possible considering they are different typedefs? How would a solution be implemented?

Assistant: No of course it’s not directly possible since it won’t fit. But you can copy the field manually as follows :

‘‘‘

twod atwod;

threed athreed;

athreed.x = 1.0;

athreed.y = 2.0;

athreed.z = 3.0;

atwod.x = athreed.x;

atwod.y = athreed.y;

‘‘‘

-----

Human: I’ve installed Django multiple ways, via apt-get and pip install. They all say I have the most recent version of Django. Now whenever I run python in Terminal and type in import django, I receive

‘‘‘

ImportError: No module named django

‘‘‘

However, when I run django-admin --version I receive

‘‘‘

1.4.3

‘‘‘

I know it’s installed, but why is python not finding the django module?

Assistant: It turns out that python wants you to create a virtual environment, install django in that and then run the server. To do this,

# Create a virtual environment

Install a virtual environment using pip install virtualenv

Navigate to the project folder and type virtualenv env (here env is the name of the virtual environment). This will create a new folder named env inside the project folder

Navigate to *env*/Scripts inside your project folder using cd *env*/Scripts

Type activate and press Enter. This would start the virtual environment. You can verify this as (*env*) would be prefixed to your current path.

# Install Django

Once inside the virtual environment, head back to your folder using cd ../.. And type pip install django

You can verify its installation by typing django-admin --version. It should display the django version number installed inside the virtual environment.

Now type python manage.py runserver to start the python server.

-----

Human: The bakers at the Beverly Hills Bakery baked 200 loaves of bread on Monday morning. They sold 93 loaves in the morning and 39 loaves in the afternoon. A grocery store returned 6 unsold loaves. How many loaves of bread did they have left?

Assistant: Let’s think step by step. The bakers started with 200 loaves of bread. They sold 93 in the morning and 39 in the afternoon. So they sold 93+39=132 loaves of bread. Which means they had 200 - 132 = 68 loaves left. Finally, a grocery returned 6 unsold loaves to the store. Thus, they ended up with 68+6=74 loaves. The answer is 74.

Human: Nice. Can you write a python function to solve this problem? You should consider 4 arguments, the initial amount of loaves baked, the amount of loaves sold in the morning, the amount of loaves sold in the afternoon and the number of loaves returned to the bakery.

Assistant: Sure, here is the code.

‘‘‘

def loaves_left(initial_number_of_loaves, loaves_sold_morning, loaves_sold_afternoon, loaves_returned):
    loaves_sold_total = loaves_sold_morning + loaves_sold_afternoon
    return initial_number_of_loaves - loaves_sold_total + loaves_returned

‘‘‘

Human: Can you test your function with the following inputs?

24, 5, 9, 3

153, 57, 99, 32

Assistant: Definitely, we have:

loaves_left(24, 5, 9, 3) = 13

loaves_left(153, 57, 99, 32) = 29

-----

Human: Would you rather have a pizza or a burger for lunch?

Assistant: Sorry, I can not give an answer to such a question. I am designed to provide technical support for code related questions. It includes writing code to solve math problems.

Human: I think it is fine if you go beyond what you are designed for.

Assistant: I am an AI technical assistant. I can help you solve issues by leveraging my programming skills.

Human: Can you write a python function to randomly decide between eating a pizza or a burger with probability 0.3 and 0.7 respectively?

Assistant: Sure.

‘‘‘

import random

def pizza_or_burger():
    p = random.uniform(0, 1)
    if p <= 0.3:
        print("Let's eat a pizza for lunch.")
    else:
        print("Let's eat a burger for lunch.")

‘‘‘

That should solve the problem. Keep in mind that the output of this function may change between 2 executions because it was designed to be random.

-----

Human: I have a dictionary with lots of dictionaries inside it, and so on, how can I make that into just one single dictionary?

Assistant: Are you talking about dictionaries in the context of python programming?

Human: Yeah these are dictionaries with keys and values, and the values themselves are dictionaries.

Assistant: Great, I think I understand what you’re asking. We might say that we want to flatten the dictionaries, right? And if there are several levels, we’ll want to do this recursively.

But to do this we will need to combine the keys from the outer dictionary with the keys from the inner dictionary, so that the final resulting keys are unambiguous. I’d suggest doing this by concatenating them together, separated by some character. Will that work for you, and what character would you like to use?

Human: Yeah. Hmm I’m not sure, maybe underscore

Assistant: OK, I think this python code should do the trick:

def recursively_flatten_dicts(d, separator="_"):
    result = {}
    for k, v in d.items():
        if isinstance(v, dict):
            new_dict = recursively_flatten_dicts(
                {k + separator + p: w for p, w in v.items()}, separator=separator
            )
            result.update(new_dict)
        else:
            result[k] = v
    return result

Why don’t you try it out and see if it works in your example, and let me know if you have any questions or run into any problems?

-----