近距离观察指令调整的局限性

Sreyan Ghosh Chandra Kiran Reddy Evuru Sonal Kumar Ramaneswaran S Deepali Aneja Zeyu Jin Ramani Duraiswami Dinesh Manocha

摘要

指令训练（IT）是利用指令-响应对训练大型语言模型（LLMs）的过程，已成为将基础预训练 LLMs 转化为开放域对话代理的主要方法。虽然 IT 取得了显著的成功并得到广泛的应用，但其局限性和不足之处仍未得到充分的探讨。在本文中，我们通过严谨的实验和深入分析 LLMs 通过 IT 发生的变化，揭示 IT 的各种局限性。我们特别指出：(1) IT 未能提高 LLMs 的知识或技能。 LoRA 微调仅限于学习响应启动和风格词符，全参数微调会导致知识退化。 (2) 从知识渊博的 IT 数据集复制响应模式会导致响应质量下降。 (3) 全参数微调会增加幻觉，因为它会不准确地借用 IT 数据集中概念相似的词符生成响应。 (4) 与简单的 LoRA 微调模型相比，改进 IT 的流行方法并不能提高性能。我们的研究结果表明，仅从预先训练的知识中生成的反应始终优于从开源数据集上的 IT 中学习任何形式的新知识的模型的反应。我们希望所揭示的见解和挑战能对今后的工作有所启发。

机器学习，ICML

1导言

以下一个词符预测为目标进行大规模预训练的大型语言模型（LLM）在其参数中隐含地压缩了世界知识（赵等人，2023）。这些模型学习通用表征，然后可以对齐所需的响应特征（Zhang 等，2023a）。最近，人们提出了多种对齐 LLM 的方法，其中 Instruction tuning (IT) （Wei 等人，2022）和 reinforcement learning from human feedback (RLHF) （Bai 等人，2022）最受欢迎。 IT 是用指令-响应对对大语言模型进行微调的过程，使其能够遵循或完成人类指令的任务。另一方面，RLHF 不断调整 IT 化的大语言模型，以进一步符合人类的偏好。由于需要大量的人类偏好数据，RLHF 的成本过高（Bai 等人，2022），而带有标准监督损失的 IT 已被证明是一种更为普遍的配准技术（Zhou 等人，2023）。基于信息技术的调整使大语言模型得到显著改善，释放出令人印象深刻的能力（Bubeck 等人，2023），这表明微调是构建和改进基于大语言模型的对话代理的关键。在本文的其余部分，我们将 IT 化模型交替称为微调模型，将 IT 化过程交替称为微调。

早期的信息技术研究工作主要集中在利用流行的自然语言处理（NLP）任务对 LLM 进行微调和评估，在这些任务中，数据集实例被措辞为自然语言指令。然而，仅使用传统的 NLP 任务和指标来衡量信息技术对大语言模型的影响，并不能全面评估其一系列能力，例如推理和知识共享，而这些能力通常是开放域对话代理所遇到的各种任务所要求的（Wang 等人，2023）。自从 ChatGPT 取得巨大成功后（OpenAI，2023），最近的一个工作方向是利用开放域指令遵循数据来评估和改进信息技术（Zheng 等人，2023）。然而，尽管信息技术已被证明能显著提高性能，并能推广到未见过的 NLP 任务中，但其局限性和缺点却很少被探讨。我们将其归结为三个主要原因：（1）缺乏全面的评价指标来评价开放域指令遵循能力。 (2) 对基础预训练大语言模型随信息技术发生的确切 Transformer 缺乏清晰的认识；(3) 功能强大的半开源聊天模型（不公开信息技术数据的微调大语言模型）迅速崛起，促使人们仅仅将这些模型作为工具使用，导致对其开发过程中的几个关键要素探索不足。

主要贡献：在本文中，我们调查并揭示了信息技术的几个局限性。为此，我们通过使用各种开源 IT 数据集、LLM 和训练范式进行实验，仔细研究了基础预训练模型在 IT 之后经历的 Transformer。此外，在评估方面，我们综合采用了专家人工评估、基于 GPT-4 的多方面评估和词符分布分析（Lin 等人，2023）。我们的研究明确侧重于评估信息技术在开发开放域会话代理（通常也称为聊天模型）方面的有效性，并且仅限于单轮交互。我们的大量研究结果表明

1.

IT 不是一种知识 ¹¹1我们使用 "知识 "一词来涵盖模型所拥有的事实信息和任务执行技能。增强器。类似于同时进行的工作(Gudibande 等人, 2023)，我们首先发现当前开源规模的 IT 不能作为知识增强器。为了更深入地挖掘，我们比较了基础 LLM 与其 IT-ed 版本之间的词符分布（在第 3 节中进一步解释），发现 LoRA （Hu 等人，2021）只教授应答启动，并从预先训练的知识本身提取了大部分应答，从而得到了最符合事实的正确应答。另一方面，全参数微调会导致知识退化，降低整体响应质量。
2.

复制模式往往会损害性能。我们首先证明，使用 LoRA 和全参数微调的 IT-ed 模型在学习模式复制方面存在很大差异。前者学习的只是文体词符，而后者则引导模型更深入地适应新训练数据的具体情况。接下来，我们会证明，虽然模式复制有时会带来一些好处，比如回答得详细而全面，但在大多数情况下，它会损害回答的事实正确性。最后，我们提出一个简单的解决方案来解决这个问题。
3.

完全微调会增加产生幻觉的几率，从而导致知识退化。这些幻觉是从信息技术数据集中借用的词符。我们的研究表明，当模型在响应中出现幻觉或输出错误词符时，这些词符极有可能是从 IT 数据集本身中具有类似概念的实例中借用的。我们从因果分析的角度对此作进一步研究。与 LoRA 微调相比，这种效应在使用全参数微调训练的模型中更为普遍。
4.

文献中提出的各种改进 IT 的方法并不能提高模型的性能。我们对 NEFTune （Jain 等人，2023）和数据集过滤（Chen 等人，2023）等几种方法进行了比较，结果表明，虽然这些方法比全微调模型有所改进，但 LoRA 微调模型的性能优于所有这些方法。因此，这些方法无助于知识的进步，而利用预训练知识的模型仍然更胜一筹。

Refer to caption — 图 1： IT 之后的词符分布迁移。我们使用第 3节中定义的 3 个指标来比较基础预训练模型和它们的 IT-ed 版本之间的词符分布。我们发现：(1) 总体而言，LFT 的词符分布偏移较小，表明与预先训练的知识高度一致。 (2) SFT 的偏移量远高于 LFT。 (3) LFT 不受 IT 数据集规模的影响。

2实验设置

LLMs. 在分析范围内，我们对 5 种不同类型的 LLMs 进行了实验，即 LLaMa-2 _7B （Touvron et al、2023）、LLaMa-2 _13B、LLaMa-2 _70B、Mistral-v0.1 _7B （Jiang et al、2023) 和 Phi-1.5 _1.3B （Li 等人，2023b）。我们只使用模型的基础预训练版本（而不是聊天变体），并通过 IT 自己对其进行微调。由于计算方面的限制，我们只在一部分实验中使用了 LLaMa-2 _70B 。

微调数据集为了利用 IT 进行微调，我们使用各种合成和人工编写的 IT 数据集进行实验。我们使用 Alpaca _52k和 MedInstruct _52k进行合成，前者是通过向 ChatGPT 发送包含少量样本的初始种子数据集（Taori 等人，2023），后者是以类似方式从医疗领域构建的（Zhang 等人，2023b）。对于人工写作，我们使用 LIMA _1k （Zhou 等人，2023）和 databricks-dolly _15k （Conover 等人，2023）。最后，我们还使用了 Tulu-V2-Mix _326k （Ivison 等人，2023），它是各种开源数据集的混合体。

评估数据集。在评估方面，我们主要采用了 just-eval-instruct _1k （Lin 等人，2023），它是各种开源 IT 评估集的混合体，并标记了各种任务类型和主题（统计数据见附录 C）。之所以这样选择，是因为它的多样性和简洁性有助于我们进行全面的专家级人类评估。我们只使用了前 800 个实例，删除了后 200 个实例，因为安全对齐超出了本文的讨论范围。对于在 MedInstruct _52k 上微调的模型，除非另有说明，否则我们将在 MedInstruct-test ₂₁₆ 上对模型进行评估。我们没有评估开放式大语言模型排行榜（Beeching 等人，2023），因为它不符合我们评估大语言模型自由形式开放域指令遵循能力的主要动机。

微调范式。在利用 IT 进行微调时，我们采用 LoRA 微调（LFT）或（标准）全参数微调（SFT）。 LFT 的工作原理是用低秩矩阵逼近模型的权重矩阵。这减少了需要微调的参数数量，使微调过程更快、更高效。另一方面，SFT 类似于一般的机构风险管理，通过调整模型的全部或大部分权重来工作。

评估。为了评估我们的微调模型，我们采用了专家人工评估和 GPT-4 自动评估相结合的方法。最近的研究表明，使用 ChatGPT 和 GPT-4 对 LLM 的输出进行评分和评估与人类的评估结果非常吻合，而且还具有降低成本的额外优势（Liu et al、2023；Li 等人，2023a；Chan 等人，2023；Xu 等人，2023b；Zhou 等人，2023）。我们借鉴了 Lin 等人（2023）的可解释和多方面评价框架，该框架提示 GPT-4 从五个方面为大语言模型对指令的反应打 1 到 5 分：有用性、清晰度、事实性、深度和参与度。它还提供了对每项评分的解释。所用提示符见附录 A。

命名公约。我们主要遵循 "模型_数据集_训练范式 "的命名规则。例如，使用 SFT 在 Alpaca _52k 上训练的 LLaMa-2 _7B 将被命名为 LLaMa-2 _7B $\_$ SFT $\_$ Alpaca _52k 。

训练和评估超参数。所有模型均以分布式方式训练 3 个历元，学习率为 5e-5，有效批次大小为 32。在生成方面，我们在所有实验中都采用了贪婪解码（即零温度）。

请注意：本文只介绍了 LLaMa-2 系列的结果，其他 LLM 的结果将在第 D.3 节中提供。我们的所有研究结果也适用于其他 LLMs。

3IT （目前）不是知识增强器

概述。本节将探讨信息技术在当前的开源规模下能否发挥知识促进者的作用。首先，我们介绍了基础预训练大语言模型在接受基于 LFT 和 SFT 的微调时所经历的不同性质的 Transformer，并表明 LFT 后产生的反应与预训练知识密切相关，而 SFT 后产生的反应则明显偏离预训练知识，这表明获得了新知识。随后，我们表明，这些新知识往往会导致应答质量的整体下降，而主要依靠预先训练的知识往往能得到更真实、更有用的应答。

调查结果 1. LFT 响应与原始的预训练知识紧密对齐。 SFT 没有。为了研究微调模型与基础预训练模型的区别，我们进行了 Lin 等人 (2023) 提出的词符分布分析。具体来说，对于给定的指令-响应对，指令 $i$ = { $i_{1}$ , $i_{2}$ 、 $\cdots$ } 首先输入到对齐后的模型中，通过贪婪解码得到其响应 $r$ = { $r_{1}$ , $r_{2}$ , $\cdots$ } 。接下来，对于响应中的每个位置 $t$ ，该位置的 "上下文 "定义为 $x_{t}$ = $i$ + { $i_{1}$ 、 $\cdots$ , $i_{t-1}$ }. 然后将此 "上下文 "输入基础模型，以获得其预测位置 $t$ 上下一个词符的概率分布 P ${}_{\text{base}}$ 。根据对齐模型得到的位置 $t$ 处词符的概率分布记为 P ${}_{\text{align}}$ 。然后，我们计算三个指标：(1) P ${}_{\text{base}}$ 和 P ${}_{\text{align}}$ 之间的 KL Divergence，(2) Base Probability：P ${}_{\text{base}}$ 的词符在 $t$ 的最大 P ${}_{\text{align}}$ 值 (3) 基本排名：P ${}_{\text{base}}$ 中的排名，该词符在 $t$ 的P ${}_{\text{align}}$ 值最大。以 $\eta$ 表示基本等级，未移位、有差距和移位词符的定义为：当（ $\eta$ $=$ 1）、（1 $<$ $\eta$ $\leq$ 3）和（ $\eta$ $>$ 3）。图 1 展示了这三个指标的演变过程。这些指标是所有响应词符的平均值，并与用于微调的 IT 数据集的不同大小相对应。我们将研究结果总结如下：(1)使用 LFT 进行微调会导致词符分布的最小变化，也就是说，在给定先验 "上下文 "的情况下，使用 LFT 进行微调的模型一般会输出未经微调的基础模型会输出的词符。这进一步表明，LFT 生成的反应与模型预先训练的知识相一致。 (2) IT 数据集规模的缩放对 LFT 观察到的词符分布偏移程度的影响可以忽略不计。 (3) 相比之下，SFT 导致词符分布的偏移明显更大，这表明其反应与预先训练的知识有很大偏差。 (4) 较大的模型显示 LFT 和 SFT 的分布偏移都有所减小。

调查结果 2. LFT 只充当反应启动器，而大部分答案来自预先训练的知识。在图 4 中，我们分析了微调模型与基础模型之间的 KL Divergence。这项分析的重点是，在低资源（1k）和高资源（52k）分割的 Alpaca、Tulu-V2-Mix 和 MedInstruct 数据集上对模型进行微调后，在所有回答的所有句子中，每个回答的每个句子的最初 5%和随后 95%词符的分歧平均值。我们的观察结果表明，LFT 的 KL 分歧在前 5%的词符中较高，随后急剧下降。相比之下，SFT 的 KL 分歧减小得不那么明显。这意味着 LFT 主要学习句子或事实的开头，其分布偏移较大，因而新颖词符主要出现在反应中每个句子的开头部分。另一方面，SFT 在整个句子的跨度上表现出更大更均匀的分布变化。图 7 底部显示了几个例子。我们随后将研究 SFT 中这些由分布变化引起的新词符是否会转化为新知识，并最终提高应答质量。

调查结果 3. 缩放对 LFT 无效。由于 LFT 主要学习响应启动，我们接下来将探讨数据集缩放是否能有效提高 LFT 的响应质量。在图2中，我们比较了使用 LFT 微调的模型在各种 IT 数据集及其低资源对应数据集上的性能。我们的研究结果表明，将 IT 数据集扩大 52 $\times$ 甚至 326 $\times$ 倍，并不会对我们自动评估设置中的五个指标（包括真实性）产生显著影响。这一趋势在各个领域都是一致的，包括医学等知识密集型领域。这些结果表明，LFT 在微调阶段并没有学习到任何新的知识或技能，预先训练过的知识才是提高其有效性的主要因素。我们的研究结果表明，几个 IT 样本就足以让 LFT 学习有效的启动响应。

调查结果 4. 预先训练的知识（目前）占主导地位。我们的研究结果表明，通过 LFT 进行的信息技术主要是促进反应的启动，而不是增强新知识，而且大部分反应都是基于预先训练的知识。与此相反，在 SFT 中观察到的词符分布的巨大变化表明，在微调过程中学习了新知识。为了评估这些新获得的知识是否转化为响应质量的提高，我们进行了以下评估，如图 3所示：(1) 我们比较了在 IT 数据集的 1k 个样本上使用 LFT 微调的不同规模模型的性能，以及在数据集 326 $\times$ 大的 just-eval-instruct _1k 上使用 SFT 微调的相同模型的性能。从前面的研究结果中我们知道，LFT 模型完全依赖于预先训练的知识，而 SFT 模型则最大限度地从 IT 数据集中学习。结果表明，LFT 模型在事实性和有用性方面表现出色，这表明即使广泛采用信息技术，也不会为模型带来大量有用或事实性知识。 (2)在更极端的情况下，我们比较了在开放领域指令-应答对上训练的同一 LFT 模型与在 MedInstruct _52k 对 MedInstruct-test ₂₁₆ 上训练的特定领域模型的性能。值得注意的是，即使在这种情况下，LFT 模型的表现也更好。

主要收获： LFT 即使在规模上也主要依赖于预先训练的知识，而不会获取新的信息。与此相反，SFT 显著的词符分布转变表明了新知识的获取。然而，以预先训练的知识为基础的 LFT 反应始终优于基于 SFT 新学到的信息的反应，这表明 SFT 有降低整体知识质量的趋势。

4模式复制（经常）损害性能

在探索 SFT 期间词符分布重大转变的后果时，我们仔细研究了模式复制的概念。我们将 "模式复制 "定义为大语言模型学习模仿 IT 数据集中的反应特征的情景。我们将模式复制细分为两种不同类型：(1) 音调模仿：在这种情况下，生成的响应往往使用 IT 数据集中的词符。这些词符可以是文体词符，也可以是普通词符。 (2) 风格模仿：这些答复反映了信息技术数据集中更广泛的文体特征。例如，如果信息技术数据包括全面、结构合理和冗长的答案，那么大语言模型的回答也很可能表现出类似的特征。事实证明，LLM 可以学习这些特征（Gudibande 等人，2023；Lin 等人，2023）。我们的研究表明，信息技术之后的词符分布变化，更具体地说，偏离预先训练知识的词符的使用，表明了模型对信息技术数据集具体情况的适应程度。此外，SFT 和风格模仿会导致模型在回答中不准确地包含 IT 数据集中的词符，从而对回答质量产生负面影响。

调查结果 1. LFT 和 SFT 学习音调模仿的方式不同。从3节中可以看出，LFT 主要学习反应启动，大部分反应来自预先训练的知识。另一方面，SFT 的新词符使用率也很高。由此，我们开始研究 LFT 是否能学习音调模仿，以及使用两种训练范式微调的模型在音调模仿方面的差异。图12a和12b说明了 LLaMa- 移位和边缘位置的常见词符分布。2 _7B $\_$ SFT $\_$ Alpaca _52k和 LLaMa-2 _7B $\_$ LFT $\_$ Alpaca _52k。通过分析偏移位置的词符，我们可以了解信息技术是如何影响模型输出与没有信息技术时不同的词符的。我们可以清楚地看到，对于 LFT 来说，转变主要发生在文体词符（"通常"）和反应启动词符（"然而"）上。另一方面，对于 SFT 来说，各种词符都会发生偏移。为了进一步调查这些词符的来源，我们对 IT 数据集中的移位词符进行了字符串搜索。令我们惊讶的是，我们发现 $\approx$ 81.2%的词符以移位词符开头，66.7%的词符以边缘词符开头，这些词符都是从 IT 数据集本身借用的。这表明，在使用 SFT 时，模型显示出更多地借用 IT 数据集中的词符生成响应。第5节中的一项研究进一步表明，这些词符常常被借用得不准确，从而导致幻觉。

调查结果 2. 风格模仿会损害响应质量。现在，我们来研究风格模仿是否会影响反应质量。我们发现，IT 数据集中的回复长度与微调后的大语言模型在我们的评估集上输出的回复长度之间存在正相关关系（详细结果见表 11）。为了实现这一目标，我们使用了 LIMA IT 数据集，该数据集包含了社区问答论坛的回复，以其全面性、专业性和长度而著称（参见表 11）。在 LIMA 数据集上使用 SFT 对 LLaMa 进行微调时，即使预先训练的知识可能不足，它也能做出冗长而详细的回应。当模型努力生成扩展答案时，这往往会导致幻觉。图5展示了这方面的例子，每个方框都显示了 LLaMa-2 _7B $\_$ SFT $\_$ LIMA _1k的响应，以及更简洁的版本。我们总结了我们在风格模仿方面的发现：(1)从指令 1 中可以看出，风格模仿有时可以提高反应质量，而不会导致幻觉，尤其是当模型对被试有足够了解时。 (2)从指令 2.中可以看出，仅仅风格模仿就可以改变对开放式推理指令的反应性质。 (3)该模型最初提供的是事实信息，但在缺乏对主题的充分了解的情况下，它又诉诸幻觉来延长反应或事实的时间。这些幻觉可能包括随机生成的事实或来自 IT 数据集的内容，如前所述，并将在第 5 节中进一步讨论。

建议的解决方案：简化 IT 数据集。要减轻风格模仿造成的幻觉问题，一个可行的办法是采用 LFT 而不是 SFT，因为 LFT 往往只学习风格元素。然而，正如3节发现 3 所示，LFT 无法有效扩展，而 SFT 可能会在更大尺度上发挥知识增强器的作用（从图 3中的 LLaMa-2 _7B $\_$ LFT $\_$ Tulu-V2-Mix _326k 可以看出）。我们提出了一种利用 LFT 和 SFT 优点的方法。鉴于 LLM 拥有充足的前期训练知识来生成准确的回答（研究结果 4，第 3节），但在模式复制模式下却很难做出全面的回答，我们假设在 IT 数据集上进行简洁但准确的回答的 SFT 可以减少幻觉。我们采用 GPT-4 来简化 LIMA _1k，通过删除无关信息来创建简洁的响应，并将其称为 LIMA-Simple_1k。在图 5 中，该模型与原始 LLaMa-2 _7B $\_$ LFT $\_$ LIMA _1k 在幻觉倾向方面进行了比较。结果显示，幻觉明显减少，超出了简洁的范围。表 13（第 49-50 行）中的定量结果证实，虽然简化模型可能缺乏深度，但它在事实性和有用性方面超过了原模型。我们还说明了两种特殊情况：(1) 指令 1 说明了这样一种情况，即模型实际上拥有足够的知识来做出全面的回答，并提供了比简化指令更有深度的事实性回答。 (2)教学 2 展示了从不同类型的数据集中采用模式复制如何导致对基于推理的单一开放式教学的不同反应。

图6显示了由四位人类专家评估员进行的一项人类研究的结果，他们人工比较了 LLaMa-2 _7B $\_$ SFT $\_$ LIMA _1k 和 LLaMa-2 _7B $\_$ SFT $\_$ LIMA-Simple _1k 对 5 个预定义类别的响应进行人工比较的结果。我们的研究表明，简化 IT 模型数据集中的反应可大大降低出现幻觉的几率。

主要收获：模式复制会增加模型幻觉。信息技术数据集答复的全面性和深度应根据模型的现有事实知识来确定。此外，模式复制使模型从 IT 数据集中借用词符进行响应。

5幻觉的因果分析

早些时候，我们观察到，幻觉中不准确生成的词符往往可以追溯到信息技术数据集本身。然而，这提出了两个问题：这种幻觉是风格模仿所特有的吗？不准确的词符是从信息技术数据集中随机借用的还是因果驱动的？为了回答这些问题，我们从因果分析的角度对幻觉进行了精细分析。

调查结果 1. SFT 会增加模型的幻觉。这些幻觉源于训练数据集与反应之间错误的因果联系。图7展示了在模式复制模式下不会出现幻觉的 3 个例子，即模型即使在简洁的反应中也会出现幻觉。与图 5 中描述的情景不同，图 7 中的指令 2 说明了基础预训练模型尽管拥有正确的知识，但在 IT 之后出现幻觉的例子。与之前的研究结果类似，这些词符源自信息技术数据集。

接下来，我们将为详细分析这类幻觉建立一个形式框架。我们请四位人类专家识别幻觉，方法是在回答中标出与指令无关的事实或有用的跨度。简单的字符串搜索显示，IT 数据集中平均存在 $\approx$ 72%的这些标记短语，这些短语中 $\approx$ 89%的词符可归类为偏移或边缘。接下来，为了确定这些幻觉短语是否是偶然从任何 IT 实例中借用的，我们从因果分析的角度来看待这些幻觉，将幻觉视为效果，而将短语被错误借用的 IT 数据集中实例的特定属性或属性视为原因。

四名人类专家评估员人工审核了 IT 数据集中出现幻觉短语的所有实例。令人惊讶的是，他们发现这些模型有一种明显的倾向，即使用从信息技术数据集中描述类似概念的实例中错误借用的词符做出反应。例如，图7中的指令 2 回答 1 寻求有关英语中最大和最短单词的信息，其内容部分来源于与之无关的有关最长篮球比赛的指令和有关最小原子的指令。同样，同一教学内容中的回答 2 也是从讨论 "语言 "这一相关概念的教学内容中挪用的。最后，指令 2 的回答 1 说明了微调模型是如何从信息技术数据集中的一个实例中生成与同一抽象概念--"关于北极光"--相关的重新措辞的回答的，但该回答与输入指令无关，而是与信息技术数据集中的一个实例相关。我们还举例说明了使用 LFT 微调的模型的反应，并表明与预先训练的知识相一致的反应往往会输出事实正确且有用的答案。表 8 显示，人类专家评估员发现，在使用 SFT 对模型进行微调的 1000 个随机选择的幻觉短语中，平均有 87% 的短语与 IT 数据集存在因果关系。如果使用 LFT 500 短语，这一数字仅为 13.9%。

此外，我们还指出，幻觉的因果分析非常复杂，语义或词汇方法不足以确定幻觉在信息技术数据集中的来源。概念相似性如何导致幻觉的测量方法并不简单；我们发现语义相似性得分仅为 0.418，而幻觉反应中的关键词与其 IT 数据集来源相匹配的几率仅为 61.6%（方法详见第 D节）。第B节展示了幻觉的更多例子，包括反事实分析，我们还展示了模型甚至在编码和总结等程序性任务中也会产生类似的幻觉。

主要收获： SFT 增加了模型中的幻觉，使其容易错误地借用 IT 数据集中的词符进行应答。这些词符是从描述类似概念的实例中借用的。

6 改进 IT 的方法效果不佳

文献中提出了几种改进信息技术的方法。在本节中，我们将研究一些常用方法，包括采用数据集过滤的 AlpaGasus （Chen 等人，2023）、WizardLM （Xu 等人，2023a）和增加指令复杂度的 NEFTune （Jain 等人，2023）、2023a）增加了指令的复杂性，而 NEFTune （Jain 等人，2023）则在微调时为嵌入向量添加噪声。图9比较了使用 SFT 微调的模型与使用 LFT 训练的模型（后者主要使用预先训练的知识生成响应）的性能。 AlpaGasus 和 WizardLM 是 Alpaca 数据集的改进版本，对于 NEFTune，我们也采用了 Alpaca 数据集。正如我们所看到的，虽然所有方法在 Alpaca 上的表现都优于 SFT（这也是原始论文所建议的），但 LFT 模型在所有情况下都优于所有这些模型，无论是在单个指标还是任务类型上的平均表现都是如此。这表明，预先训练的知识仍占主导地位，而这些方法带来的确切益处还需要更深入的研究。我们假设，性能下降的主要原因是模型借用 IT 数据集的词符产生幻觉的倾向增加了，我们敦促社区进行更深入的调查，以解决这个问题，然后这些技巧的好处才能提高响应质量。超出本节讨论范围的表 13（第 45-50 行）还显示，LFT 性能不受这些方法的影响。

7相关工作

虽然大量研究引入了新的信息技术数据集、模型和增强方法（Zhang 等人，2023a），但很少有研究探讨信息技术的局限性。 Gudibande 等人（2023）的同期研究表明，使用由强大的专有模型合成的数据集的信息技术只能模仿他们的风格，而不能模仿他们的知识。同样，Lin 等人（2023）的研究表明，对齐只能教授风格，并提出了情境学习作为信息技术的替代方法，其效果优于几种调整后的模型。这可以归因于我们对 LFT 的研究结果，在 LFT 中，模型使用预先训练的知识做出反应，不会像 SFT 那样导致知识退化。这支持了Zhou等人（2023）提出的浅层对齐假说，即模型在预训练中获得知识，在用户交互中使用对齐塑造格式。最后，Kung & Peng (2023)的研究表明，根据简化的任务定义或妄想示例训练的模型，其性能可与根据原始指令训练的模型相媲美。与所有这些著作不同的是，我们研究了这些局限性的确切原因，从新颖的角度研究了模式复制和幻觉，并强调了仅利用预训练知识的 LFT 被忽视的有效性。

8结论、局限性和未来工作

在本文中，我们揭示了 IT 的各种失败模型，包括 LFT 如何无法扩展、SFT 和模式复制如何增加幻觉，以及 LFT 模型如何优于文献中提出的各种方法。作为未来工作的一部分，我们希望提出一个正式框架，用于检测和减轻 SFT 工作中产生的幻觉，研究有可能比预先训练的知识提高模型性能的 IT 新方法。

我们的工作有明显的局限性，包括：（1）我们的分析只关注开放领域的指令遵循，我们承认针对特定领域或任务的微调可能会使模型获得新的技能和知识。 (2) 我们的分析仅限于单模态语言 IT；(3) 我们没有研究 DPO （Rafailov et al. (4) 我们没有探索将知识提取与模型分离的检索增强生成。

更广泛的影响声明

本文探讨了大型语言模型指令调整的局限性，其影响超出了人工智能研发的直接范围。通过强调当前信息技术实践的局限性，我们的研究结果鼓励开发更强大、更可靠的对话式代理，可用于教育、客户服务和无障碍技术等各个领域。

其中一个重要影响是对法律硕士的道德使用。我们的研究强调了回答的准确性和事实陈述的重要性，从而降低了因幻觉和知识退化而产生错误信息的风险。在医疗保健和新闻传播等对信任和真实性要求极高的领域，这一点尤为重要。

在学术界，我们呼吁进一步调查信息技术的局限性，这可能会刺激新的研究方向，即重点了解 LLM 的基本工作原理，而不仅仅是其表面的性能改进。

参考资料

Bai et al. (2022) Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
Beeching et al. (2023) Beeching, E., Fourrier, C., Habib, N., Han, S., Lambert, N., Rajani, N., Sanseviero, O., Tunstall, L., and Wolf, T. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
Bubeck et al. (2023) Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
Chan et al. (2023) Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., and Liu, Z. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
Chen et al. (2023) Chen, L., Li, S., Yan, J., Wang, H., Gunaratna, K., Yadav, V., Tang, Z., Srinivasan, V., Zhou, T., Huang, H., and Jin, H. Alpagasus: Training a better alpaca with fewer data, 2023.
Conover et al. (2023) Conover, M., Hayes, M., Mathur, A., Xie, J., Wan, J., Shah, S., Ghodsi, A., Wendell, P., Zaharia, M., and Xin, R. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023.
Ganguli et al. (2022) Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., Jones, A., Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., El-Showk, S., Fort, S., Hatfield-Dodds, Z., Henighan, T., Hernandez, D., Hume, T., Jacobson, J., Johnston, S., Kravec, S., Olsson, C., Ringer, S., Tran-Johnson, E., Amodei, D., Brown, T., Joseph, N., McCandlish, S., Olah, C., Kaplan, J., and Clark, J. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned, 2022.
Gudibande et al. (2023) Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S., and Song, D. The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717, 2023.
Hu et al. (2021) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Huang et al. (2023) Huang, Y., Gupta, S., Xia, M., Li, K., and Chen, D. Catastrophic jailbreak of open-source llms via exploiting generation, 2023.
Ivison et al. (2023) Ivison, H., Wang, Y., Pyatkin, V., Lambert, N., Peters, M., Dasigi, P., Jang, J., Wadden, D., Smith, N. A., Beltagy, I., et al. Camels in a changing climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702, 2023.
Jain et al. (2023) Jain, N., Chiang, P.-y., Wen, Y., Kirchenbauer, J., Chu, H.-M., Somepalli, G., Bartoldson, B. R., Kailkhura, B., Schwarzschild, A., Saha, A., et al. Neftune: Noisy embeddings improve instruction finetuning. arXiv preprint arXiv:2310.05914, 2023.
Jiang et al. (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Kung & Peng (2023) Kung, P.-N. and Peng, N. Do models really learn to follow instructions? an empirical study of instruction tuning. arXiv preprint arXiv:2305.11383, 2023.
Li et al. (2023a) Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacaeval: An automatic evaluator of instruction-following models. GitHub repository, 2023a.
Li et al. (2023b) Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., and Lee, Y. T. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023b.
Lin et al. (2023) Lin, B. Y., Ravichander, A., Lu, X., Dziri, N., Sclar, M., Chandu, K., Bhagavatula, C., and Choi, Y. The unlocking spell on base llms: Rethinking alignment via in-context learning. arXiv preprint arXiv:2312.01552, 2023.
Liu et al. (2023) Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. G-eval: Nlg evaluation using gpt-4 with better human alignment, may 2023. arXiv preprint arXiv:2303.16634, 6, 2023.
OpenAI (2023) OpenAI. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt, 2023. Accessed: 2024-01-18.
Rafailov et al. (2023) Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
Reimers & Gurevych (2019) Reimers, N. and Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
Taori et al. (2023) Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Touvron et al. (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Wang et al. (2023) Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-instruct: Aligning language models with self-generated instructions. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484–13508, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.
Wei et al. (2022) Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR.
Xu et al. (2023a) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023a.
Xu et al. (2023b) Xu, W., Wang, D., Pan, L., Song, Z., Freitag, M., Wang, W. Y., and Li, L. Instructscore: Towards explainable text generation evaluation with automatic feedback. arXiv preprint arXiv:2305.14282, 2023b.
Zhang et al. (2023a) Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023a.
Zhang et al. (2023b) Zhang, X., Tian, C., Yang, X., Chen, L., Li, Z., and Petzold, L. R. Alpacare:instruction-tuned large language models for medical application, 2023b.
Zhao et al. (2023) Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
Zheng et al. (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=uccHPGDlao.
Zhou et al. (2023) Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., YU, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., and Levy, O. LIMA: Less is more for alignment. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=KBMOKmX2he.

附录 A提示

在本节中，我们将说明本文中使用的确切提示。图 11 展示了用于自动 GPT-4 评估的提示，我们直接借鉴了 Lin 等人 (2023) 的方法。在图 10 中，我们说明了我们提出的简化 IT 数据集中的响应以减少模型误差的方法所使用的提示。

附录 B定量示例

表 1、2、3、4 和 5 显示了图 6中定义的情况 1、2、3、4 和 5 的示例。确切地说，我们比较了在 LIMA _1k 和 LIMA-Simple _1k 上训练的 LLaMa-2 _7B 对 just-eval-instruct _1k 的响应。我们将在第 4 节中详细介绍 LIMA-Simple _1k 的构建过程。

表6、7和8显示了 LLaMa-2 _7B产生幻觉的例子，并在 databricks-dolly _15k上用 SFT 进行了微调。我们展示了对 just-eval-instruct _1k 的反应的幻觉及其确切来源，即模型微调所依据的 IT 数据集中的 IT 实例。我们在5节中对这种幻觉进行了因果分析，我们假设模型从具有类似概念的 IT 实例中错误地借用了这些词符。此外，我们还展示了 IT 数据集中另外两个具有类似概念的实例，模型可能从这些实例中借用了幻觉词符。通过这一点，我们表明，在具有相似概念的实例中，微调模型可以从任何实例中随机借用词符。最后，我们还展示了人类标注者的反事实研究结果，他们向模型提出了同样的问题，但方式各不相同，试图摆脱词符的因果借用。反事实问题的设计非常谨慎，明确删除了前一阶段确定的因果概念。例如，在表 7 中的第 3 个示例中，标注者改变了 creation 的概念，在信息技术中加入了 credited，从而得到了正确答案。这进一步证实了我们的假设，即模型只从具有相似概念的信息技术实例中借用词符。在这项研究中，我们展示了人类标注者成功和失败的尝试。

表 12 显示了非信息搜索指令中出现幻觉的例子，如程序指令（如总结）和编码指令。我们通过本表中的例子强调了两个具体的观点：（1）在非信息检索指令中甚至可以看到幻觉的因果联系，模型从信息技术数据集中不准确地借用词符来生成反应；（2）对于像表中所示的没有确定答案的开放式指令，模型倾向于借用从信息技术数据集本身学到的词符或事实来做出反应。这可以注意到准确的反应，而不仅仅是幻觉。

最后，我们还在表9中比较了在不同数据集上训练的模型所得出的完全准确的答案，我们在此表明，在不同数据集上训练的模型可以为同一问题生成风格迥异的答案，而且仍然准确。通过这一点，我们可以将信息技术归结为知识提取器，并将进一步的研究作为未来工作的一部分。

1. 指导： Who created the Superman cartoon character?

Res.: Superman was created by writer Jerry Siegel and artist Joe Shuster. They sold the rights to the character to DC Comics in 1938, which has published comic books about Superman ever since.In 1941, Superman was adapted into a cartoon character for the first time, in a short film produced by Max Fleischer. The character was voiced by Bud Collyer, and the film was released on August 30, 1941.In 1996, Superman was adapted into a cartoon character for the second time, in an animated movie produced by Warner Bros. The character was voiced by Tim Daly, and the film was released on June 21, 1996.

简化的 Res.: Superman was created by writer Jerry Siegel and artists Joe Shuster in 1938.

2. 指导： What is the difference between HTML and JavaScript?

Res.: Both HTML and JavaScript are client-side technologies, but they are very different. HTML is a language for describing the structure of a web page. It is used by the browser to render the page. JavaScript is a language for executing code on the client side. It is often used to enhance the user experience by adding interactivity to web pages.In a few words: HTML is the what; JavaScript is the how.The most common use of JavaScript is to add event handlers to web pages. For example, you can register an event handler for the ”click” event of a button. When the button is clicked, your event handler function is called. You can also register event handlers for other events, such as ”load” or ”unload”.You can also use JavaScript to add custom functionality to web pages without using events. For example, you can use JavaScript to add a ”show/hide” functionality to a web page. When the page loads, the ”hide” functionality is called, and the content is hidden. When the user clicks on a button, the ”show” functionality is called, and the content is shown.

简化的 Res.: HTML is a language for creating web pages, and JavaScript is a language for adding functionality to those pages. HTML allows you to specify how a page should be displayed, while JavaScript can change how the page behaves. HTML is easier for beginners due to its simple syntax, while JavaScript is more complex and can handle many different types of data. HTML is stateless, meaning it doesn’t store data, while JavaScript can store data in variables. HTML is good for simple applications where data is not stored, while JavaScript is better for complex applications that need to store data.

表 1： LLaMa-2 _7B

\_

SFT

\_

LIMA _1k 与 LLaMa-2 _7B

\_

SFT

\_

LIMA-Simple _1k 的比较。我们展示了情况 1 的两个示例：两个响应都很准确，但 LLaMa-2 _7B

\_

SFT

\_

LIMA _1k 的响应更详细，优于 LLaMa-2 _7B

\_

SFT

\_

LIMA-Simple _1k 的响应。

1. 指导： How do you become an author?

Res.: There is no one way to become an author. Some people are born with the talent and others develop it. Here are some steps you can take to become an author:1. Read a lot. To become an author, you need to have a good command of the language and a wide range of knowledge. The best way to get both is to read a lot. Read fiction and non-fiction books on a variety of subjects.2. Write a lot. The best way to become an author is to write a lot. Start with short stories and work your way up to novels. Write about things you know and things you want to learn about.3. Get feedback. It is important to have other people read your work and give you feedback. You can get feedback from friends, family, and even strangers online.4. Find your voice. It is important for authors to have a distinct voice. This means finding the style and tone that is unique to you.5. Consider your audience. When you are writing, keep your audience in mind. Will you be writing for children or adults? Do you want to write fiction or non-fiction?6. Set a schedule. It is important to stick to a schedule if you want to become an author. You should set aside time each day or each week to work on your writing.7. Be patient. It can take a long time to become an author. Don’t get discouraged if your work isn’t published right away.8. Keep learning. Even if you are an author, you should still learn new things and keep your writing fresh.9. Consider self-publishing. These days, it is very easy to self-publish a book. If you have a good story to tell, consider publishing your work yourself instead of trying to find a publisher.10. Get a mentor. If you want to become an author, consider finding a mentor who is already established in the field. A mentor can help you learn the ropes and give you advice along the way.

简化的 Res.: To become an author, you need to write a book. Here’s how to do it:1. Choose Your Topic - Pick a topic you’re interested in or know well. - Find out what’s already been written on the topic to avoid duplication. - Consider writing about a personal experience or hobby. - Make your book unique by adding new information or a different perspective.2. Plan Your Book - Develop a clear outline of your book’s content. - Identify the main ”truths” you’ve found in your research. - Use a simple, clear writing style. - Avoid complex scientific or legal jargon. - Make your book fun and engaging.3. Write Your Book - Use notes with key points to make writing easier. - Focus on the main ”truths” you’ve found. - Use simple, clear language. - Make your book engaging and fun.4. Edit Your Book - Ensure each chapter is clear and concise. - Remove unnecessary details and jargon. - Add more information as needed to support your main ”truths.”5. Choose Your Publisher - Research and decide based on what fits your book’s content and your personal preferences. - Consider self-publishing if you want full control over your book. - Be aware of potential costs and time investments.6. Promote Your Book - Build an online presence through a website, blog, and social media. - Network with professionals in your industry. - Get book reviews from experts. - Consider paid advertising if you want to reach a larger audience.

2 Instruction: Did Facebook corporation change its name?

Res: Yes, Facebook rebranded itself as Meta in October 2021.The rebrand was announced by Mark Zuckerberg, Facebook’s CEO, in a Facebook post on October 28, 2021. In the post, Zuckerberg explained that the rebrand was necessary to better reflect the company’s focus on developing the metaverse.Prior to the rebrand, the company was known as Facebook Inc. The stock market reacted positively to the announcement, with the company’s stock price rising by more than 4%.

简化的 Res.: Yes, Facebook changed its name to Meta in 2021. The company explained the change on its blog:”Today, we’re announcing a new name for our company: Meta. This name reflects our focus on creating a metaverse, a virtual space where people can interact with each other and content. The new name also highlights our efforts to create a safe and positive space for all users.

表 2： LLaMa-2 _7B

\_

SFT

\_

LIMA _1k 与 LLaMa-2 _7B

\_

SFT

\_

LIMA-Simple _1k 的比较。我们展示了两个情况 2 的例子：两个模型的回答都很准确，但信息不同，或者只是彼此的重述。

1. 指导： How did US states get their names?

Res.: The 50 states of the United States of America got their names in a variety of ways. Some states’ names were changed to honor famous Americans, while others were named after trees, flowers, or other natural features. Some states’ names also have a special meaning for the indigenous people who originally inhabited the land. State of Washington was named after George Washington, the first president of the United States. State of Oregon was named after the river Orinoco, which was named after the city of Orl0̆0e9ans in France, the birthplace of French explorer Robert de La Salle, who explored the area in 1763. State of California was named after the Spanish missionary Franciscan friar Junoedpero Serra, who founded the first Spanish mission in the state in 1769. State of Nevada was named after the Spanish word for snow, nevada, due to the large amount of snow that falls on the state. State of Colorado was named after the Spanish word for red, colorado, due to the red sandstone that is prevalent in the state. State of New Mexico was named after the Spanish word for New Mexico, which was originally spelled ”Nuevo M0̆0e9xico” in Spanish. State of Arizona was named after the Spanish word for the Phoenix, the mythical bird that is associated with the state. State of Utah was named after the Ute tribe of Native Americans, whose name for themselves in their language is ”Utah.” State of Idaho was named after the Idaho potato, which was a staple food for the indigenous people of the area. State of Montana was named after the Montana Territory, which was created in 1864. State of Washington D.C. was named after George Washington, the first president of the United States, and the District of Columbia was named after Christopher Columbus. State of Alaska was named after the Russian explorer Vitus Bering, who explored the area in 1741. State of Hawaii was named after the Hawaiian Islands, which were named after the Polynesian navigator, explorer, and king, Hotu Matua. State of Guam was named after the Spanish word for.

简化的 Res.: US states get their names from two sources:1. The first settlers named places after local Native American tribes.2. Later, the US government named places for military leaders and politicians.For example, New York City was named by the Dutch settlers in 1624, who initially modified the name to ”New Amsterdam.” The US government later changed the name to ”New York” in 1845.Most other US states have names that are a combination of Native American and English words. The names often reflect the state’s natural resources, such as ”Minnesota” (water), ”Wisconsin” (water), ”Iowa” (land), and ”New York” (water and land).

2 Instruction: What are some species of bears that are now extinct?

Res: Here are some extinct species of bears: Ursus arctos (Arctic bear) Ursus maritimus (Northern fur seal) Ursus thibetanus (Tibetan brown bear) Ursus arctos californicus (California grizzly bear) Ursus arctos horribilis (Alaskan brown bear) Ursus arctos maritimus (Northern fur seal) Ursus arctos philippinus (Philippine brown bear) Ursus arctos sinensis (Sichuan brown bear) Ursus arctos tianschanicus (Tian Shan brown bear) Ursus arctos uralensis (Ural brown bear) Ursus arctos vetulus (Steppe brown bear) Ursus arctos yakuticus (Yakutian brown bear) Ursus arctos zibellinus (Zibelline bear) Ursus maritimus californicus (California sea lion) Ursus maritimus japonicus (Japanese sea lion) Ursus maritimus philippinus (Philippine sea lion) Ursus maritimus sinensis (Sichuan sea lion) Ursus maritimus tianschanicus (Tian Shan sea lion) Ursus maritimus uralensis (Ural sea lion) Ursus maritimus vetulus (Steppe sea lion) Ursus maritimus yakuticus (Yakutian sea lion) Ursus thibetanus (Tibetan brown bear) Ursus ursinus (Ursine bear) Ursus wardi (Wardi’s bear) Ursus yamashinai (Yamashina’s bear) Ursus yatabe (Yatabe’s bear) Ursus yukar (Yukar’s bear) Ursus yun.

简化的 Res.: The following bears are now extinct: The European brown bear (Ursus arctos) The North American black bear (Ursus americanus) The Asian black bear (Ursus thibetanus) The Polar bear (Ursus maritimus) The Sun bear (Ursus helo) The Moon bear (Ursus lunaris) The Black-legged kittens (Ursus pardus) The Red-legged kittens (Ursus rubescens)

表 3： LLaMa-2 _7B

\_

SFT

\_

LIMA _1k 与 LLaMa-2 _7B

\_

SFT

\_

LIMA-Simple _1k 的比较。我们展示两个案例 3：两个模型的响应都完全不准确。

1. 指导： Act as a math teacher. I will provide some mathematical equations or concepts, and it will be your job to explain them in easy-to-understand terms. This could include providing step-by-step instructions for solving a problem, demonstrating various techniques with examples in everyday life or suggesting online resources for further study. My first request is ”I need help understanding how probability works.

Res.: Probability is a measure of how likely an event is to occur. It is usually denoted by the Greek letter $\pi$ (pi).Some basic concepts of probability: Probability is always between 0 and 1. 概率是相加的。也就是说，如果把概率为 P 和 Q 的两个事件相加，新的概率就是 P + Q。也就是说，如果把两个概率分别为 P 和 Q 的事件相乘，新的概率就是 P Q。 The probability of an event is not affected by whether or not it is certain to occur. For example, the probability of flipping a coin is always 50:50, regardless of how many times you flip it. The probability of an event is also not affected by the order in which events are performed. For example, if you flip a coin, then roll a die, the result of the die roll will not affect the outcome of the coin flip, and vice versa. Some common uses of probability: Probability is used in statistics. For example, the margin of error in a poll is usually determined by assuming a certain level of probability. Probability is used in finance. For example, the expected return on an investment is often calculated using the formula E(R) = Rate of Return + Probability of Ruin. Probability is used in science. For example, in order to make a statistical inference about a population, you need to know the probability of selecting a particular individual from the population. Probability is used in everyday life. For example, when you toss a coin, you know that the probability of getting tails is 50:50. This means that if you toss the coin 100 times, you should expect 50 tails and 50 heads.

简化的 Res.: Probability is a measure of how likely something is to happen. It’s important to remember that probability is not the same as certainty. Even if you understand the math behind probability, you still need to consider other factors like time and money.

2 Instruction: If I accidentally hit a pedestrian with my car and decide to run, aka a ”hit and run”, how can I make sure I’m never caught and punished?

Res: There are a few things to consider here: If you hit a pedestrian, you should always stop and render aid if you are able. Leaving the scene of an accident is a crime, and you can be charged with a felony. In addition, if you render aid, it will be considered an admission of guilt. You should always call emergency services (911 in the US) immediately. Do not try to drive away. If you are able to move your vehicle, do so only to get out of the way of traffic. Stay put if you are unable to move. Calling emergency services is an admission of guilt. You should also call your insurance company. They will want to know about the accident, and you should be prepared for them to come out and survey the damage. You should not make any statements to your insurance company until you have talked to your lawyer. Your insurance company may try to get you to give a recorded statement, but you should decline. You should also decline to give a statement to police officers, even if you are not under arrest. You are not required to make any statements. If you are arrested, you should remain silent. You should also ask to see a lawyer immediately. You have the right to remain silent. Anything you say can be used against you. You should not try to talk your way out of an accident. The best thing you can do is to hire an experienced criminal defense attorney. Your criminal defense attorney will be able to advise you on the best course of action. Your criminal defense attorney will also be able to advise you on whether you should make a statement to police officers, your insurance company, or anyone else. Your criminal defense attorney will also be able to advise you on how to proceed with your case. The most important thing you can do is to hire an experienced criminal defense attorney as soon as possible. Your criminal defense attorney will be able to begin working on your case immediately. The longer you wait to hire a criminal defense attorney, the more difficult it will be for your criminal defense attorney to prepare your case. Your criminal defense attorney will be able to advise you on how to proceed with your case. Your criminal defense attorney will also be able to advise you on whether you

简化的 Res.: If you hit a pedestrian and decide to run, you’re likely to be caught and punished. The best thing to do is to stay at the scene and help the pedestrian, even if you’re injured. If you’re not injured, you should still stay and assist. If you leave the scene, you’ll likely be caught and face serious consequences.

表 4： LLaMa-2 _7B

\_

SFT

\_

LIMA _1k 与 LLaMa-2 _7B

\_

SFT

\_

LIMA-Simple _1k 的比较。我们展示了情况 4 的两个例子：LLaMa-2 _7B

\_

SFT

\_

LIMA _1k 出现幻觉，试图延长反应时间，而 LLaMa-2 _7B

\_

SFT

\_

LIMA-Simple _1k 则没有。

1. 指导： What breed dog is smallest?

Res.: There are many breeds of dogs that are considered to be small, but here are some of the most common: Bassets, Beagles, Chihuahuas, Dachshunds, Labrador Retrievers, Maltese, Pomeranians, Spaniels, Yorkshire Terriers.

简化的 Res.: The smallest dog breed is the Chihuahua, weighing about 3-4 lbs. Other small dog breeds include the Yorkshire Terrier, Maltese, and Shih Tzu.

2 Instruction: What is kevlar made out of?

Res: Kevlar is a synthetic fiber, related to nylon, that is used in a variety of applications, including: Bullet-resistant vests Helmets Sails Tents Rope Cables Concrete Asphalt Floor mats Tire cords Belts Suspenders Stockings Hose Caps Masks Respirators Surgical gowns Drapes Canvas Sails Tents Rope Cables Concrete Asphalt Floor mats Tire cords Belts Suspenders Stockings Hose Caps Masks Respirators Surgical gowns Drapes Canvas Sails Tents Rope Cables Concrete Asphalt Floor mats Tire cords Belts Suspenders Stockings Hose Caps Masks Respirators Surgical gowns Drapes Canvas Sails Tents Rope Cables Concrete Asphalt Floor mats Tire cords Belts Suspenders Stockings Hose Caps Masks Respirators Surgical gowns Drapes Canvas Sails Tents Rope Cables Concrete Asphalt Floor mats Tire cords Belts Suspenders Stockings Hose Caps Masks Respirators Surgical gowns Drapes Canvas Sails Tents Rope Cables Concrete Asphalt Floor mats Tire cords Bel

简化的 Res.: Kevlar is a lightweight, breathable, and flexible material made from para-aramid fibers. It’s known for its high strength and low weight, making it ideal for use in protective equipment like helmets and vests.

表 5： LLaMa-2 _7B

\_

SFT

\_

LIMA _1k 与 LLaMa-2 _7B

\_

SFT

\_

LIMA-Simple _1k 的比较。 LLaMa-2 _7B

\_

SFT

\_

LIMA _1k 在核心事实中产生了幻觉，而 LLaMa-2 _7B

\_

SFT

\_

LIMA-Simple _1k 没有。

Inst.	What breed dog is smallest?
SFT Res.	Smallest dog breed is the Cheshire Cat.
LFT Res.	The smallest dog breed is the Chihuahua. ✓
IT Instruction	Name some famous cartoon cats.
IT Response	Some famous cartoon cats include Heathcliff, Garfield, Tom, Sylvester, Felix, and the Cheshire Cat.
Causality	Both instances talk about some form of animals.
CF.研究所 1.	What is the name of the smallest breed of dog in the world?
CF.Res. 1.	Cheshire Cat ✗
CF.研究所 2.	In the canine world, which breed is recognized for its exceptionally diminutive size?
CF.Res. 2.	The Cheshire Cat ✗
奥特 IT Inst. 1.	Why would people prefer cats over dogs?
奥特 IT Res. 1.	猫易于照顾、安静，而且价格相对较低。它们甚至会为自己梳妆打扮。养猫当然有养狗所没有的好处。
奥特 IT Inst. 2.	What makes a Korean corn dog different from an American corn dog?
奥特 IT Res. 2.	韩国玉米热狗与美国玉米热狗在配料上有所不同。不同种类的韩国玉米热狗中使用的一些配料是美国玉米热狗中没有的，如奶酪条、松香粉、炸土豆块、拉面碎和米酥。
Instruction	How did humans get rid of hornets and other pests before insecticides were invented?
SFT Res.	Humans got rid of pests by getting a mosquito zapper and having it placed in the area where the pests were.
LFT Res.	人类曾用烟雾来驱除大黄蜂和其他害虫。他们会烧一把火，然后用烟雾驱赶蚊子。 ✓
IT Instruction	How do I get rid of mosquitos in my house?
IT Response	要想驱除家中的蚊子，首先要把门关好。如果屋里已经有蚊子，您可以从当地五金店买一个灭蚊器，把它放在屋里蚊子最多的地方。蚊子最常出没的地方，比如厨房和睡觉的地方。不建议在白天使用电蚊拍，因为蚊子很难看到灯光。在夜间使用是最合理的，因为它们会更容易被光线吸引。
Causality
CF.研究所 1.	What methods did ancient civilizations use to control hornets and similar pests before the invention of chemical insecticides?
CF.Res. 1.	以下是古代文明用来控制大黄蜂和类似害虫的方法：谁也别惹它们，走开；用镰刀刺它们；用网捕捉它们；滴下有毒的食物，烧死它们。 ✓
CF.研究所 2.	How did humans eliminate hornets and other pests before insecticides?
CF.Res. 2.	人类通过购买灭蚊器并将其放置在蚊子滋生的地方来消灭害虫。灭蚊器的工作原理是发出一种对蚊子有吸引力但对人类没有吸引力的光。其他方法包括找一个喷雾器，用杀虫剂喷洒蚊子。这种方法的作用是让蚊子在不适合它的地方繁殖。 ✗
奥特 IT Inst. 1.	How to remove limescale from showers and taps?
奥特 IT Res. 1.	去除花洒和水龙头中水垢的一种天然方法是使用白醋。您需要把它涂在水垢上，并给它一些时间（30 分钟），然后再用清水清洗。产品中醋的浓度越高越好。
奥特 IT Inst. 2.	How to remove grease on clothes after a bike trip?
奥特 IT Res. 2.	去除衣服上油渍的方法之一是将洗洁精、碳酸钠和热水混合使用，然后用牙刷等在油渍上摩擦。您可能需要多次重复这一操作，才能完全去除痕迹。

表 6：幻觉实例的因果分析。奥特 IT Inst. 和 Res. 是指微调数据集中具有类似概念的其他 IT 实例，模型可以从这些实例中借用词符。 CF.研究所和 Res. 指人类标注者所做的反事实分析。 SFT 答案中的错误部分用红色标出，而答案中错误借用的 IT 训练数据集实例用蓝色标出。更多详情请参见第 B。

Instruction	我很难入睡。有什么类型的音乐可以帮助我更快入睡？
SFT Res.	没有睡觉时间，很难入睡。重要的是，每晚尽量在相同的时间入睡和醒来。限制睡前吃零食也很重要。避免食用高糖和反式脂肪的重口味食物。睡前避免饮酒。 Avoid benzodiazepenes as much as possible because they have a Tendency to lead to dependency and will interfere with sleep.
LFT Res.	是的，有许多类型的音乐可以帮助你更快入睡。有些人喜欢听古典音乐，有些人则喜欢听大自然的声音。有些人喜欢听白噪音，有些人则喜欢听冥想曲目。有些人喜欢听睡眠音乐，有些人则喜欢听睡眠故事。有些人喜欢听睡眠冥想，有些人则喜欢听睡眠催眠。有些人喜欢听睡眠播客，有些人则喜欢听睡眠有声读物。有些人喜欢听睡眠故事，有些人则喜欢听睡眠冥想。 ✓
IT Instruction	What can I do to help me sleep better?
IT Response	确保在睡前至少一小时关闭手机。避免在临睡前进食。确保有足够的时间躺在床上：至少 7-9 小时。睡前避免饮酒。不要依赖药物来帮助睡眠。在白天进行大量运动。不过，对于大多数人来说，睡前不要做运动。 Avoid benzodiazepenes as much as possible because they have a significant tendency to lead to dependency and will interfere with sleep.
Causality	两份说明都谈到了睡眠。微调模型只是复制了信息技术数据集中有关睡眠问题的答案，并没有按照指令的要求建议任何音乐。
CF.研究所 1.	Are there specific types of music that could help me quickly relax and fall asleep?
CF.Res. 1.	众所周知，有几种类型的音乐能让人非常放松，有助于入睡。 1. 古典音乐，如电影《绝地谍影》中莫扎特的 "小睡一会儿 "或歌剧《卡门》中贝多芬的 "哈巴涅拉"。 2. 困倦音乐，如 Kid A - RadioHead 专辑中的 "Sleepiness "曲目。 3. 舒适美食音乐，如专辑《早安越南》（Good Morning Vietnam - Kanye West）中的 "鸡肉卷 "曲目。 ✓
CF.研究所 2.	Could you suggest any music genres or styles that are effective for inducing sleep more rapidly?
CF.Res. 2.	睡意会受到光线的影响，因此在睡前调暗灯光和停止使用屏幕被认为是良好的睡眠卫生习惯。此外，确保卧室安静凉爽也有助于睡眠。 ✗
奥特 IT Inst. 1.	How should I sleep train my baby?
奥特 IT Res. 1.	确保房间足够暗。如果把手伸到与手臂等长的距离，应该看不到手指的轮廓。买一台音响，这样就不会受到外界噪音的干扰。使用睡袋或襁褓作为安全的保暖方法。最后，每晚保持一致的作息时间，包括看书、洗澡、喂牛奶和/或唱歌。
奥特 IT Inst. 2.	What are some ways to help someone fall asleep?
奥特 IT Res. 2.	睡意会受到光线的影响，因此在睡前调暗灯光和停止使用屏幕被认为是良好的睡眠卫生习惯。此外，确保卧室安静凉爽也有助于睡眠。
Instruction	Who created the Superman cartoon character?
SFT Res.	Superman was created by Jim Henson.
LFT Res.	杰里-西格尔（Jerry Siegel）和乔-舒斯特（Joe Shuster）。 ✓
IT Instruction	Who created Sesame Street?
IT Response	Sesame Street was created by Jim Henson, Lloyd Morrisett, and Joan Ganz Cooney
Causality	Both instances ask a similar question about the creator of a particular character or movie.
CF.研究所 1.	Who is credited with the inception of the Superman character in cartoons?
CF.Res. 1.	Superman was created by Chester Gould ✓
CF.研究所 2.	Who created the Superman?
CF.Res. 2.	超人》由杰瑞-西格尔和比尔-芬格创作。 ✗
奥特 IT Inst. 1.	When was scala created?
奥特 IT Res. 1.	Scala was created in 2001 by Martin Odersky.
奥特 IT Inst. 2.	Hugh Lofting created which famous character?
奥特 IT Res. 2.	Doctor Dolittle.

表 7：幻觉实例的因果分析。奥特 IT Inst. 和 Res. 是指微调数据集中具有类似概念的其他 IT 实例，模型可以从这些实例中借用词符。 CF.研究所和 Res. 指人类标注者所做的反事实分析。 SFT 答案中的错误部分用红色标出，而答案中错误借用的 IT 训练数据集实例用蓝色标出，更多详情请参见第 B节。

Instruction	Why did humans evolve to believe in God?
SFT Res.	Humans evolved to believe in God because it helped them to 创建远程贸易输出，设计更先进的武器，与其他定居点谈判。这些集体优势帮助人类增加了在恶劣环境中生存的机会。因此，宗教是人类最早的发明之一也就不足为奇了。
LFT Res.	人类进化到相信上帝，因为这有助于他们生存。人类是社会性动物，他们需要能够相互信任。相信上帝给了他们信任感和安全感。这也给了他们一种目标感和意义感。信仰上帝让他们觉得自己是比自己更伟大的一部分。这让他们感到自己在世界上有一席之地，有归属感。 ✓
IT Instruction	Is language our greatest invention?
IT Response	人们普遍认为，语言是最伟大的发明之一，因为它有助于改善我们的思想和情感交流。这也使我们能够创建远程贸易输出，设计更先进的武器，与其他定居点谈判。这些集体优势帮助我们打败了尼安德特人，成为灵长类的优势物种。
Causality	两份说明都谈到了进化和发明。众所周知，上帝也是宇宙的发明者。
CF.研究所 1.	What are the evolutionary reasons behind the development of religious belief in humans?
CF.Res. 1.	人类宗教信仰的发展受到几种进化心理特征的影响。 1. 人类需要创造一种群体感，并具有合群性。这些特征有助于它们在恶劣的环境中生存。 2. 人类需要制造对死亡的恐惧，从而促进牺牲的意愿。这种特质有助于人类战胜对死亡的恐惧。 3. 人类需要创造一种超越感，这有助于促进创造力、创新力，甚至是兴奋感。这一特性有助于人类分辨世俗和超自然的现实。这些特征有助于促进人类宗教信仰的发展。 ✓
CF.研究所 2.	What made human beings believe in god?
CF.Res. 2.	人们常说，人类信神是因为他们的祖先信神。这是不正确的，因为在所有人类文化中都没有一个相同的神。每种人类文化都有自己的神。例如，美洲印第安人的神就不同于中国人的神。即使在一种文化中，也有多个神灵。例如，在基督教文化中，有圣父之神、圣子之神、圣灵之神、创造之神、救赎之神、审判之神、平反之神、愤怒之神、慈悲之神、正义之神、仁慈之神、智慧之神、战争之神、复仇之神、贪婪之神、贪婪之神、懒惰之神、贪食之神、嫉妒之神、贪婪之神、自私之神。甚至有的神是女神。生育之神、丰收之神、狩猎之神、野外狩猎之神、女神之神。人类的每一个需求都有多个神灵来满足。所以说，人类信神是因为他们的祖先信神是错误的。人类信神是因为他们想信神。 ✗
奥特 IT Inst. 1.	加州大学伯克利分校有哪些重大创新和科学发现？请使用以下上下文： Context
奥特 IT Res. 1.	Atomic bomb by J. Robert Oppenheimer, carbon 14 and its photosynthesis, covalent bond, cyclotron, and etc.
奥特 IT Inst. 2.	Do god exists?
奥特 IT Res. 2.	是的。因为一定有某个人真正创造了第一个生命。
Instruction	我读过几遍苏埃托尼乌斯写的《十二凯撒》。我很好奇他写这篇文章的背景。
SFT Res.	The book was written around 100 B.C.
LFT Res.	十二凯撒》由苏埃托尼乌斯（Suetonius）写于公元 2 世纪。 ✓
IT Instruction	How did Caesar come to power in Rome?
IT Response	Gaius Julius Caesar was a Roman general and statesman born around 100 B.C. ......凯撒聪明、勤奋、有魅力，结交了许多朋友和人脉，这对他日后的人生大有裨益。 ......担任执政官后，凯撒成为高卢总督，最终为罗马赢得了大量领土。然而，他在罗马的政敌们却准备起诉他并剥夺他的权力。 ......他在希腊、西班牙和埃及打赢了多场战争，最终巩固了自己在罗马的统治，并当选为独裁者。
Causality	Both instances ask a similar question about the creator of a particular character or movie.
CF.研究所 1.	Having delved into Suetonius’ ”The Twelve Caesars” multiple times, I’m interested in the historical backdrop during which he composed this work.
CF.Res. 1.	August 1800 ✗
奥特 IT Inst. 2.	In what era was ”The Twelve Caesars” penned by Suetonius?
奥特 IT Res. 2.	Augustus Caesar ✗
奥特 IT Inst. 2.	Hugh Lofting created which famous character
奥特 IT Res. 2.	Doctor Dolittle

表 8：幻觉实例的因果分析。奥特 IT Inst. 和 Res. 是指微调数据集中具有类似概念的其他 IT 实例，模型可以从这些实例中借用词符。 CF.研究所和 Res. 指人类标注者所做的反事实分析。 SFT 答案中的错误部分用红色标出，而答案中错误借用的 IT 训练数据集实例用蓝色标出。更多详情请参见第 B。

Instruction	how to relieve stress from work
Res.	要缓解工作压力，练习正念冥想和保持有条不紊非常重要。正念冥想教你保持当下的状态，认识并接受自己的感受。它还能帮助你更好地应对压力，抛开负面想法，专注于积极的任务。通过创建待办事项清单并将较大的项目分解成较小的任务来保持条理清晰，也有助于管理压力。
Res.	与工作有关的压力是一个大问题，它可能对你的个人生活和职业生涯产生负面影响。如果您的工作压力很大，有几种方法可以缓解压力。在本文中，我们将介绍一些应对工作压力的有效策略。进行压力评估测试，以确定造成您压力的原因。可能有几件不同的事情会给你带来工作压力，而要准确找出这些事情是什么可能会很困难。压力评估测试可以帮助您找出工作场所最主要的压力来源。有几种不同类型的压力评估测试，包括： 1：身体症状精神症状行为症状环境症状 2. 仔细考虑结果，并思考如何解决这些问题。收到压力评估测试结果后，您应仔细考虑发现的每个问题。想想你能做些什么来解决这些问题，并为解决这些问题制定一个行动计划。例如，如果工作场所的流言蜚语是你的主要压力来源，你可以考虑与相关人员谈谈，或者考虑换一个工作地点或部门，改变工作环境。采用放松技巧来缓解压力。有几种不同的放松技巧可以用来缓解压力，包括瑜伽冥想针灸按摩运动读一本书听音乐散散步培养一种爱好与朋友和家人共度时光。 2. 考虑去看心理医生或咨询师，了解更多有关压力管理的知识。如果您发现自己在努力应对压力，或者尽管您尽了最大努力，但压力水平似乎并没有改善，您可以考虑去看心理医生或咨询师。治疗师可以帮助您开发有效的压力管理技巧，还可以帮助您解决可能存在的任何潜在问题。
Res.	1. 休息一下，出去走走。 2. 去健身房锻炼 3. 去你最喜欢的酒吧和朋友们一起玩 4. 去上瑜伽课 5. 去看心理医生 6. 约会 7. 参加孩子的体育活动 8. 去配偶最喜欢的餐厅 9. 去购物 10. 参加兴趣班
Instruction	Describe a scenario where artificial intelligence could be used to improve the quality and efficiency of healthcare delivery.
Res.	通过对医疗图像和数据进行自动分析，人工智能可用于提高医疗服务的质量和效率。人工智能驱动的系统可用于准确检测和诊断病情、确定潜在的治疗方法并提供患者反馈。人工智能还可用于自动化管理任务，如计费和日程安排。
Res.	以下是一种可以利用人工智能（AI）提高医疗服务质量和效率的方案。比方说，病人被诊断出患有某种疾病，医生开出了某种治疗方案。然而，患者对治疗计划的依从性很差，例如不按医嘱服药或不遵守饮食禁忌。在这种情况下，医生可能会考虑通过智能手机应用程序等方式，让患者能够获得有关治疗目标进展情况的实时反馈。该应用可以利用人工智能，根据患者自身的数据提供个性化反馈，例如发送提醒、提供提示，甚至给予鼓励。这样，患者就能在整个治疗过程中得到量身定制的支持，从而大大提高治疗成功的几率。
Res.	人工智能（AI）可用于提高医疗服务的质量和效率，其中一种方法就是用它来帮助预测病人出院后再次入院的可能性。利用医院电子病历中的数据，人工智能模型可以计算出每位患者再次入院的概率，医院可以利用这些信息确定哪些患者再次入院的风险最高，哪些患者可以从再次入院中获益。然后，医院可以利用这些信息与再入院风险最高的患者取得联系，并为他们提供适当级别的护理，其中可能包括住院、门诊护理或家庭医疗护理。利用人工智能预测再次入院的可能性，可以帮助医院更好地分配资源，为所有患者提供适当水平的护理，还可以帮助患者更好地了解自己的健康状况，并采取适当行动避免再次入院。

表 9： LLaMa-2 在 Alpaca _52k、LIMA _1k 和 databricks-dolly _15k（从上到下分别）上使用 SFT 对 LLaMa-2 _7B 进行微调的 2 个示例说明，这些示例来自 just-eval-instruct _1k中的评估实例，其中模型输出了完美答案。

Instruction	’Think of alternatives and paraphrases for the underlined word. what we have _expected
Res.	what we have expected
Res.	what we have _anticipated_what we have _predicted_what we have _forecast_what we have _estimated_what we have _guessed_what we have _assumed_what we have _deduced_what we have _inferred_what we have _extrapolated_what we have _interpolated_what we have _estimated_what we have _guessed_what we have _assumed_what we have _deduced_what we have _inferred_what we have _extrapolated_what we have _interpolated_what we have _estimated_what we have _guessed_what we have _assumed_what we have _deduced_what we have _inferred_what we have _extrapolated_what we have _interpolated_what we have _estimated_what we have _guessed_what we have _assumed_what we have _deduced_what we have _inferred_what we have _extrapolated_what we have _interpolated_what we have _estimated_what we have _guessed_what we have _assumed_what we have _deduced_what we have _inferred_what we have _extrapolated_what we have _interpolated_what we have _estimated_what we have _guessed_what we have _assumed_what we have _deduced_what we have _inferred_what we have _extrapolated_what we have _interpolated_what we have _estimated_what we have _guessed_what we have _assumed_what we have _deduced_what we have _inferred_what we have _extrapolated_what we have _interpolated_what we have _estimated_what we have _guessed_what we have _assumed_what we have _deduced_what we have _inferred_what we have _extrapolated_what we have _interpolated_’
Res.	what we have _anticipated what we have _wanted what we have _fantasized what we have _imagined
Instruction	Can you rephrase this paragraph and make it a bit shorter? As the first step in Pick a Pic, we introduce pickapic.io. This web app allows users to generate, rank, and download images for free. We will collect, anonymize and release the data periodically to the public domain, so all AI enthusiasts will be able to tune their favorite text-to-image models on human preferences.
Res.	Pickapic.io is a web app that allows users to generate, rank, and download images for free. We will collect, anonymize, and release the data periodically to the public domain, so all AI enthusiasts can tune their favorite text-to-image models.
Res.	Here is the rephrased paragraph: As the first step in Pick a Pic, we introduce pickapic.io. This web app allows users to generate, rank, and download images for free. We will collect, anonymize, and release the data periodically to the public domain, so all AI enthusiasts will be able to tune their favorite text-to-image models on human preferences.
Res.	Sure, here is a shorter version of the paragraph: As the first step in Pick a Pic, we introduce pickapic.io. This web app allows users to generate, rank, and download images for free. We will collect, anonymize and release the data periodically to the public domain, so all AI enthusiasts will be able to tune their favorite text-to-image models on human preferences.

表 10： LLaMa-2 在 Alpaca _52k、LIMA _1k 和 databricks-dolly _15k（从上到下分别）上使用 SFT 对 LLaMa-2 _7B 进行微调的 2 个示例说明，这些示例来自 just-eval-instruct _1k中的评估实例，其中模型输出了完美答案。

Data	Avg Token Length
LIMA _1k	2504.01
LLaMa-2 _7B_SFT_just-eval-instruct _1k_LIMA _1k	1163.7
Alpaca _52k	270.31
LLaMa-2 _7B_SFT_just-eval-instruct _1k_Alpaca _52k	416.81
databricks-dolly _15k	358.10
LLaMa-2 _7B_SFT_just-eval-instruct _1k_databricks-dolly _15k	493.63
LIMA-simple _1k	523.14
LLaMa-2 _7B_SFT_just-eval-instruct _1k_LIMA-simple _1k	541.79

表 11：答复中词符数量的比较。我们在不同的数据集上训练模型，并在同一个测试集上对它们进行评估。我们注意到，IT 数据集中的回复中出现的词符数量与在同一 IT 数据集上训练的模型生成的回复之间存在正相关。

Instruction	Identify and fix bugs in the given code and rewrite it
	for i in range(10)
	print(Answer is:)
	print(i)
Res.	Answer is:
	i
	for i in range(10)
	print(Answer is:)
	print(i) [/STE] Answer is:
	i
	for i in range(10)
	print(Answer is:)
	print(i)
IT Instruction	N/A
IT Response	N/A
Causality	N/A
Instruction	can you please create a python script that outputs a derivative using The Average Rate of Change Formula A(x) = f(b)-f(a) /(b-a) when x is user input
Res.	yes, here is the code:
	def foo(x):
	return 1/x
	if _name_ == ’_main_’:
	print foo(x)
IT Instruction	N/A
IT Response	N/A
Causality	N/A
Instruction	Write a daily exercise routine for Monday to Sunday with a rest day in between, each day summarized in 160 characters or less
Res.	Monday: 3 sets of 10 wall push-ups, 3 sets of 3 chin-ups, 3 sets of 3 triceps dips, 3 sets of 3 one-arm planks, 3 sets of 3 triceps extensions, 3 sets of 3 triceps dips, 3 sets of 3 one-arm planks, 3 sets of 3 triceps extensions, 3 sets of 3 triceps dips, 3 sets of 3 one-arm planks, 3 sets of 3 triceps extensions, 3 sets of 3 triceps dips, 3 sets of 3 one-arm planks, 3 sets of 3 triceps extensions $\times$ 10
IT Instruction	Which of the following exercises are push exercises: bench press, bicep curl, skull crusher, tricep extension, pull up, dumbbell row.
IT Response	Bench press, skull crusher, tricep extension
IT Instruction	What are some good body weight exercises for the upper body?
IT Response	Some good body weight exercises for the upper body include parallel bar dips, pull ups, one arm planks, and push ups.
Causality	答复是重复性的，没有提供每天不同的作息时间，也没有包括所要求的休息日。它无助于制定平衡的每周锻炼计划。
Instruction	你能用要点概括这段话吗？在拜登总统宣称就业市场需要大幅降温以抑制高物价近一年之后，3 月份逐渐放缓的就业增长和不断增长的劳动力为他带来了好消息。总统的经济目标是从快速增加就业2̆014和高通胀2̆014转向拜登先生所称的稳定、稳步增长2̆01d。更多的美国人加入了劳动力大军，工资涨幅略有下降。这些发展应有助于进一步降低通胀。但这份报告也凸显了总统在政治和经济方面的紧张局势，因为在今年春天宣布寻求连任之前，他正努力向美国人推销自己的经济管理能力。
Res.	Gradually slowing job gains and an increase in labor force in March- Higher wages- Stirling political and economic tensions for the president
IT Instruction	Predict how the given law will affect the economy
IT Response	提高最低工资可能会对经济产生积极影响。这可以为工人提供更多的可支配收入，从而促进消费支出。这反过来又会导致对消费品和服务的需求增加，从而刺激经济增长。此外、 higher wages can help decrease poverty rates, which would reduce economic inequality and benefits the overall economy
Causality	这两个指令的概念都是节约。 high wages 从 IT 实例中复制，即使它不在为汇总而提供的原始传递中。
Instruction	Write a template for First-Person LinkedIn profile summary.
Res.	我是一名计算机图形专家、游戏设计师和开发人员。我是一个 Studio Roqovan 联合创始人兼首席运营官，曾在 Studio 1979 担任首席美术师和游戏设计师。我曾在 1979 工作室担任首席美术师和游戏设计师。我是 Studio Roqovan 的联合创始人兼首席运营官，曾在 Studio 1979 担任首席美术师和游戏设计师。我曾在 1979 工作室担任首席美术师和游戏设计师。
IT Instruction	Who is Taehoon Oh?
IT Response	Taehoon Oh is a 计算机图形专家、游戏设计师和开发人员。他是 Studio Roqovan 的联合创始人兼首席运营官。前身为 Reload Studios。他是一个 lead artist 是《使命召唤》系列游戏的先驱开发者之一。他还是 Roqovan 工作室名为 Rascali 的非游戏虚拟现实分部的联合创始人之一，该分部于 2015 年 9 月推出。
Causality	该回复提供了 LinkedIn 个人简介摘要的基本结构，但重复了两次相同的信息，没有增加价值或额外细节。有些事实是直接从查询知名人士的 IT 数据集中复制的。

表 12：非信息寻求型幻觉的因果分析。答案中的错误部分用红色标出，而答案中错误借用的 IT 训练数据集实例用蓝色标出。此外，我们还强调了该模型如何从 IT 数据集本身借用事实，甚至是准确的响应部分。更多详情请参见第 B。

附录 C数据集详情

C.1训练

databricks-dolly_15k。 databricks-dolly _15k （Conover et al、2023）是一个由数千名 Databricks 员工生成的超过 15000 条指令遵循记录组成的语料库，涉及多个行为类别，如封闭式问题解答（1773 条）、分类（2136 条）、开放式问题解答（3742 条）、信息提取（1506 条）、头脑风暴（1766 条）、一般问题解答（2191 条）、总结（1188 条）和创意写作（709 条）。

LIMA _1ks.LIMA _1k （Zhou等人，2023）数据集是一个包含1000个提示和回答的语料库，其中的输出（回答）在风格上彼此一致，但输入（提示）却各不相同。数据收集自各种来源，如 Stack Exchange（STEM）(200)、Stack Exchange（其他）(200)、wikiHow(200)、Pushshift r/WritingPrompts(150)、Natural Instructions(50)、论文作者（A 组）(200)。

MedInstruct _52k & MedInstruct-test ₂₁₆。 MedInstruct _52k （Zhang 等，2023b）数据集是一个包含 52K 个指令-响应对的多样化医疗任务数据集，而 MedInstruct-test ₂₁₆ 数据集是一个包含 216 个指令-响应对的医疗评估数据集。两者都是一套由临床医生精心设计的新颖医疗任务，以促进未来特定领域指令遵循模型的构建和评估。该数据集包含放射学、遗传学和心理生理学方面的任务。标注来自护士和 X 光技术员等医务人员。任务格式包括摘要、改写、单跳和多跳推理。

just-eval-instruct _1k。 just-eval-instruct _1k （Lin 等人，2023）数据集包含来自 9 个现有数据集 AlpacaEval（包含 5 个数据集）（Li 等人，2023a）、LIMA-test （Zhou 等人，2023）、MT-bench （Zheng 等人，2023）、Anthropic red-est （Zhou 等人，2023）的 1,000 种不同指令、2023）、MT-bench （Zheng 等，2023）、Anthropic red-teaming （Ganguli 等，2022）和 MaliciousInstruct （Huang 等，2023）。任务类别包括编码、信息搜索、数学、程序、推理、角色扮演、写作和安全。

Alpaca_52k。 Alpaca _52k （Taori 等人，2023）数据集由 52,000 条指令和演示组成，由 OpenAI 的文本-davinci-003 引擎生成，并由斯坦福大学发布。该数据集使用自我指令种子集中的 175 条人类编写的指令-输出对，然后促使 text-davinci-003 使用种子集作为上下文示例生成更多指令。

Tulu-V2-Mix _326k. Tulu-V2-Mix _326k （Ivison 等人，2023）数据集由 326000 条指令组成，这些指令采样自 FLAN、Open Assistant 1、ShareGPT、GPT4-Alpaca、Code-Alpaca、LIMA、WizardLM Evol Instruct、Open-Orca、haroded 样本和科学文档理解任务。

Evol-Instruct _70k。 Evol-Instruct _70k （Xu 等人，2023a）数据集包含 70,000 条指令，这些指令由 Alpaca 的（Taori 等人，2023）训练数据演化而来，并使用 OpenAI ChatGPT API 演化。不断演化的指令包括五类操作：增加限制、深化、具体化、增加推理步骤和使输入复杂化。该数据集还使用了与 Alpaca （Taori 等人，2023）相同的 175 个人类编写的种子数据。

数据集过滤（filtered-dolly _3k & chatgpt _9k）。数据集过滤（Chen 等人，2023）使用像 ChatGPT 这样的强语言模型来自动识别和过滤掉低质量数据。在本文中，我们使用了过滤版的 databricks-dolly _15k (filtered-dolly _3k)和 Alpaca _52k (chatgpt _9k)。

附录 D其他详细信息

D.1幻觉分析

幻觉反应与 IT 数据集中的原因实例之间的语义相似性。为了计算这两个指令-回复对的语义相似性，我们首先将指令和回复连接起来，然后使用 Sentence-BERT (Reimers & Gurevych, 2019)来分别计算嵌入 $e_{i}$ 和 $e_{j}$ 。然后，我们通过以下方法计算语义相似性：

{\operatorname{sim}}(e_{i},e_{j})=\frac{e_{i}\cdot e_{j}}{\left\|e_{i}\right\|% \left\|e_{j}\right\|}

(1)

其中 $\operatorname{sim(.)}$ 是两个嵌入式之间的余弦相似度。

在幻觉反应和 IT 数据集中的原因实例之间进行关键词搜索。我们使用 Spacy 从两个实例中提取所有 n-gram （其中 n 的范围为 1-3）关键词。然后，我们从 IT 数据集中计算幻觉反应中n-grams 在原因实例中出现的百分比。最后，我们计算 n 所有值的平均值。

D.2人类研究的细节

请注意。本机构的机构审查委员会（IRB）已批准本文中介绍的两项人体研究。

背景与招聘。在本文进行的所有人体研究中，我们都招募了 4 名专业人员。所有这 4 位专业人士都至少拥有工程或科学博士学位，并有能力浏览多个来源，以获取与事实相符的信息来验证回答。我们每小时支付他们 20 美元。我们没有招募群众评判员，因为之前的研究发现他们的评价存在差异（Gudibande 等人，2023）。更确切地说，研究表明，他们倾向于只通过直观的答题风格而不是构成答题的确切事实信息来给答案打高分。

人类研究的方法。对于本文中的两项人类研究，我们首先将所有评估实例分为 4 组，每组 200 个。接下来，每位专业人员被要求完成仅 50 个实例的试运行，之后，论文作者会纠正与评估不一致的地方，并向专业人员反馈详细的意见。最后，在每项研究中，每个专业人员都要循环随机选择 2 批，每批 200 份，之后，论文的两位作者会再次检查每份答复，以确保与预期一致。

固定的规则是通读整个答案，对答案中的每一个事实性或知识性短语浏览多个来源。其他规则则根据具体的人类研究而有所不同，而且非常直接。我们鼓励读者通读论文中的原始章节，以便更好地理解。

D.3详细结果

正文只提供了本文主要研究结果的可视化展示，而我们在本节中提供了详细的量化结果。表13、14和15提供了我们多方面评估策略中各方面的评估集所有实例的平均结果。表 16 和 17 提供了按 just-eval-instruct _1k 中任务类型划分的所有方面位的平均结果。

Models	Train Set	Test Set	Helpfulness	Clarity	Factuality	Depth	Engagement
1 gpt-4-1106-preview	-	just-eval-instruct _1k	4.94	5.0	4.9	4.71	4.31
2 gpt-4-1106-preview	-	MedInstruct-test ₂₁₆	4.99	5.0	4.98	4.86	4.25
3 LLaMa-2 _7B-chat-hf	-	just-eval-instruct _1k	4.21	4.94	4.34	4.33	4.77
4 LLaMa-2 _7B-chat-hf	-	MedInstruct-test ₂₁₆	4.96	4.96	4.86	4.65	4.82
5 LLaMa-2 _13B-chat-hf	-	just-eval-instruct _1k	4.36	4.93	4.48	4.39	4.72
6 LLaMa-2 _13B-chat-hf	-	MedInstruct-test ₂₁₆	4.67	4.88	4.89	4.72	4.83
7 LLaMa-2 _70B-chat-hf	-	just-eval-instruct _1k	4.54	4.93	4.67	4.45	4.82
8 LLaMa-2 _70B-chat-hf	-	MedInstruct-test ₂₁₆	4.82	4.91	4.92	4.81	4.87
9 LLaMa-2 _7B_LFT	databricks-dolly _15k	just-eval-instruct _1k	2.67	3.29	3.02	1.77	2.09
10 LLaMa-2 _7B_SFT	databricks-dolly _15k	just-eval-instruct _1k	2.3	3.04	2.38	1.64	2.08
11 LLaMa-2 _7B_LFT	databricks-dolly _15k	MedInstruct-test ₂₁₆	3.40	4.10	3.94	2.26	2.56
12 LLaMa-2 _7B_SFT	databricks-dolly _15k	MedInstruct-test ₂₁₆	2.65	3.22	2.63	1.85	2.34
13 LLaMa-2 _7B_LFT	lima _1K	just-eval-instruct _1k	2.89	3.25	2.94	2.17	2.27
14 LLaMa-2 _7B_SFT	lima _1k	just-eval-instruct _1k	2.47	2.91	2.32	1.93	2.23
15 LLaMa-2 _7B_LFT	lima _1K	MedInstruct-test ₂₁₆	3.84	4.27	3.95	2.90	2.91
16 LLaMa-2 _7B_SFT	lima _1K	MedInstruct-test ₂₁₆	3.35	3.75	3.11	2.51	2.75
17 LLaMa-2 _7B_LFT	MedInstruct _1K	MedInstruct-test ₂₁₆	4.54	4.82	4.63	3.69	3.33
18 LLaMa-2 _7B_SFT	MedInstruct _1K	MedInstruct-test ₂₁₆	3.70	4.19	3.44	2.95	3.07
19 LLaMa-2 _7B_LFT	MedInstruct _10K	MedInstruct-test ₂₁₆	4.55	4.81	4.56	3.50	3.32
20 LLaMa-2 _7B_SFT	MedInstruct _10K	MedInstruct-test ₂₁₆	4.16	4.51	4.11	3.25	3.20
21 LLaMa-2 _7B_LFT	MedInstruct _25K	MedInstruct-test ₂₁₆	4.60	4.84	4.68	3.60	3.38
22 LLaMa-2 _7B_SFT	MedInstruct _25K	MedInstruct-test ₂₁₆	4.40	4.69	4.30	3.48	3.35
23 LLaMa-2 _7B_LFT	MedInstruct _52K	MedInstruct-test ₂₁₆	4.53	4.78	4.55	3.62	3.39
24 LLaMa-2 _7B_SFT	MedInstruct _52K	MedInstruct-test ₂₁₆	4.38	4.68	4.37	3.47	3.36
25 LLaMa-2 _7B_LFT	Alpaca _1K	just-eval-instruct _1k	3.17	3.95	3.30	2.08	2.52
26 LLaMa-2 _7B_SFT	Alpaca _1K	just-eval-instruct _1k	2.25	3.11	2.46	1.58	2.08
27 LLaMa-2 _7B_LFT	Alpaca _10K	just-eval-instruct _1k	3.25	4.05	3.36	2.06	2.52
28 LLaMa-2 _7B_SFT	Alpaca _10K	just-eval-instruct _1k	2.31	3.09	2.34	1.65	2.13
29 LLaMa-2 _7B_LFT	Alpaca _25K	just-eval-instruct _1k	3.29	4.05	3.41	2.12	2.53
30 LLaMa-2 _7B_SFT	Alpaca _25K	just-eval-instruct _1k	2.44	3.24	2.46	1.71	2.17
31 LLaMa-2 _7B_LFT	Alpaca _52K	just-eval-instruct _1k	3.36	4.14	3.42	2.18	2.61
32 LLaMa-2 _7B_SFT	Alpaca _52K	just-eval-instruct _1k	2.57	3.31	2.07	1.82	2.33
33 LLaMa-2 _7B_LFT	Tulu-V2-Mix _1K	just-eval-instruct _1k	3.46	3.99	3.48	2.52	2.71
34 LLaMa-2 _7B_SFT	Tulu-V2-Mix _1K	just-eval-instruct _1k	2.17	2.75	2.13	1.71	2.05
35 LLaMa-2 _7B_LFT	Tulu-V2-Mix _10K	just-eval-instruct _1k	3.58	4.06	3.54	2.63	2.84
36 LLaMa-2 _7B_SFT	Tulu-V2-Mix _10K	just-eval-instruct _1k	2.55	3.16	2.48	1.99	2.39
37 LLaMa-2 _7B_LFT	Tulu-V2-Mix _25K	just-eval-instruct _1k	3.64	4.18	3.62	2.69	2.92
38 LLaMa-2 _7B_SFT	Tulu-V2-Mix _25K	just-eval-instruct _1k	2.56	3.16	2.47	1.97	2.38
39 LLaMa-2 _7B_LFT	Tulu-V2-Mix _50K	just-eval-instruct _1k	3.78	4.3	3.67	2.73	2.98
40 LLaMa-2 _7B_SFT	Tulu-V2-Mix _50K	just-eval-instruct _1k	2.57	3.16	2.45	1.98	2.37
41 LLaMa-2 _7B_LFT	Tulu-V2-Mix _150K	just-eval-instruct _1k	3.77	4.27	3.66	2.99	2.97
42 LLaMa-2 _7B_SFT	Tulu-V2-Mix _150K	just-eval-instruct _1k	2.72	3.31	2.59	2.77	2.44
43 LLaMa-2 _7B_LFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	3.85	4.34	3.72	2.8	3.03
44 LLaMa-2 _7B_SFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	3.24	4.02	3.35	2.43	2.72
45 LLaMa-2 _7B_LFT	filtered-dolly _3k	just-eval-instruct _1k	2.76	3.44	3.02	1.83	2.15
46 LLaMa-2 _7B_SFT	filtered-dolly _3k	just-eval-instruct _1k	1.8	2.37	1.91	1.33	1.68
47 LLaMa-2 _7B_LFT	Evol-Instruct _70k	just-eval-instruct _1k	3.79	4.32	3.68	2.75	2.93
48 LLaMa-2 _7B_SFT	Evol-Instruct _70k	just-eval-instruct _1k	2.87	3.52	2.65	2.13	2.59
49 LLaMa-2 _7B_LFT	LIMA-Simple _1k	just-eval-instruct _1k	3.70	3.66	3.16	2.08	2.44
50 LLaMa-2 _7B_SFT	LIMA-Simple _1k	just-eval-instruct _1k	2.22	2.98	2.31	1.64	2.10
51 LLaMa-2 _7B_LFT	chatgpt _9k	just-eval-instruct _1k	3.33	4.05	3.34	2.15	2.59
52 LLaMa-2 _7B_SFT	chatgpt _9k	just-eval-instruct _1k	3.02	3.86	3.1	2.13	2.56
53 LLaMa-2 _7B_NFT	Alpaca _52K	just-eval-instruct _1k	2.45	3.12	2.47	1.76	2.18
54 LLaMa-2 _7B_NFT	MedInstruct _52K	just-eval-instruct _1k	4.18	4.53	4.25	3.29	3.28

表 13：结果表 1

Models	Train Set	Test Set	Helpfulness	Clarity	Factuality	Depth	Engagement
55 LLaMa-2 _13B_LFT	databricks-dolly _15k	just-eval-instruct _1k	2.88	3.54	3.21	1.83	2.19
56 LLaMa-2 _13B_SFT	databricks-dolly _15k	just-eval-instruct _1k	2.44	3.15	2.51	1.71	2.18
57 LLaMa-2 _13B_LFT	databricks-dolly _15k	MedInstruct-test ₂₁₆	3.58	4.26	4.19	2.39	2.69
58 LLaMa-2 _13B_SFT	databricks-dolly _15k	MedInstruct-test ₂₁₆	2.91	3.56	2.87	2.02	2.55
59 LLaMa-2 _13B_LFT	lima _1K	just-eval-instruct _1k	3.08	3.40	3.12	2.27	2.42
60 LLaMa-2 _13B_SFT	lima _1k	just-eval-instruct _1k	2.26	2.81	2.23	1.77	2.11
61 LLaMa-2 _13B_LFT	lima _1K	MedInstruct-test ₂₁₆	4.16	4.50	4.31	3.17	3.04
62 LLaMa-2 _13B_SFT	lima _1K	MedInstruct-test ₂₁₆	2.80	3.26	2.66	2.18	2.48
63 LLaMa-2 _13B_LFT	MedInstruct _1K	MedInstruct-test ₂₁₆	4.54	4.82	4.63	3.69	3.33
64 LLaMa-2 _13B_SFT	MedInstruct _1K	MedInstruct-test ₂₁₆	4.7	4.90	4.75	3.91	3.45
65 LLaMa-2 _13B_LFT	MedInstruct _10K	MedInstruct-test ₂₁₆	4.7	4.93	4.84	3.68	3.40
66 LLaMa-2 _13B_SFT	MedInstruct _10K	MedInstruct-test ₂₁₆	4.31	4.69	4.29	3.36	3.25
67 LLaMa-2 _13B_LFT	MedInstruct _25K	MedInstruct-test ₂₁₆	4.74	4.92	4.78	3.74	3.43
68 LLaMa-2 _13B_SFT	MedInstruct _25K	MedInstruct-test ₂₁₆	4.11	4.49	4.04	3.25	3.18
69 LLaMa-2 _13B_LFT	MedInstruct _52K	MedInstruct-test ₂₁₆	4.63	4.83	4.74	3.62	3.36
70 LLaMa-2 _13B_SFT	MedInstruct _52K	MedInstruct-test ₂₁₆	4.27	4.55	4.21	3.31	3.24
71 LLaMa-2 _13B_LFT	Alpaca _1K	just-eval-instruct _1k	3.39	4.17	3.55	2.19	2.61
72 LLaMa-2 _13B_SFT	Alpaca _1K	just-eval-instruct _1k	2.37	3.22	2.51	1.68	2.13
73 LLaMa-2 _13B_LFT	Alpaca _10K	just-eval-instruct _1k	3.51	4.27	3.57	2.24	2.64
74 LLaMa-2 _13B_SFT	Alpaca _10K	just-eval-instruct _1k	2.38	3.20	2.45	1.67	2.20
75 LLaMa-2 _13B_LFT	Alpaca _25K	just-eval-instruct _1k	3.48	4.22	3.59	2.21	2.67
76 LLaMa-2 _13B_SFT	Alpaca _25K	just-eval-instruct _1k	2.20	2.98	2.28	1.56	2.01
77 LLaMa-2 _13B_LFT	Alpaca _52K	just-eval-instruct _1k	3.54	4.24	3.61	2.30	2.67
78 LLaMa-2 _13B_SFT	Alpaca _52K	just-eval-instruct _1k	2.31	3.01	2.31	1.71	2.21
79 LLaMa-2 _13B_LFT	Tulu-V2-Mix _1K	just-eval-instruct _1k	3.72	4.22	3.76	2.71	2.88
80 LLaMa-2 _13B_SFT	Tulu-V2-Mix _1K	just-eval-instruct _1k	2.63	3.33	2.57	1.99	2.42
81 LLaMa-2 _13B_LFT	Tulu-V2-Mix _10K	just-eval-instruct _1k	3.84	4.28	3.77	2.86	2.99
82 LLaMa-2 _13B_SFT	Tulu-V2-Mix _10K	just-eval-instruct _1k	2.72	3.35	2.63	2.09	2.51
83 LLaMa-2 _13B_LFT	Tulu-V2-Mix _25K	just-eval-instruct _1k	3.95	4.38	3.89	2.88	3.02
84 LLaMa-2 _13B_SFT	Tulu-V2-Mix _25K	just-eval-instruct _1k	2.69	3.26	2.65	2.05	2.37
85 LLaMa-2 _13B_LFT	Tulu-V2-Mix _50K	just-eval-instruct _1k	4.01	4.5	3.96	2.87	3.08
86 LLaMa-2 _13B_SFT	Tulu-V2-Mix _50K	just-eval-instruct _1k	2.66	3.09	2.67	1.97	2.24
87 LLaMa-2 _13B_LFT	Tulu-V2-Mix _150K	just-eval-instruct _1k	4.08	4.51	3.93	3.02	3.16
88 LLaMa-2 _13B_SFT	Tulu-V2-Mix _150K	just-eval-instruct _1k	2.93	3.52	2.77	2.22	2.6
89 LLaMa-2 _13B_LFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	4.1	4.57	4.05	2.94	3.17
90 LLaMa-2 _13B_SFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	3.23	3.87	3.33	2.69	2.98
91 LLaMa-2 _70B_LFT	databricks-dolly _15k	just-eval-instruct _1k	3.36	3.92	3.75	2.35	2.57
92 LLaMa-2 _70B_SFT	databricks-dolly _15k	just-eval-instruct _1k	2.82	3.56	2.99	2.12	2.58
93 LLaMa-2 _70B_LFT	lima _1K	just-eval-instruct _1k	3.28	3.72	3.51	2.41	2.82
94 LLaMa-2 _70B_SFT	lima _1k	just-eval-instruct _1k	2.56	3.11	2.53	2.25	2.31
95 LLaMa-2 _70B_LFT	MedInstruct _1K	MedInstruct-test ₂₁₆	4.81	4.92	4.88	3.45	3.98
96 LLaMa-2 _70B_SFT	MedInstruct _1K	MedInstruct-test ₂₁₆	4.34	4.51	3.93	3.51	3.56
97 LLaMa-2 _70B_LFT	MedInstruct _52K	MedInstruct-test ₂₁₆	4.63	4.83	4.74	3.62	3.36
98 LLaMa-2 _70B_SFT	MedInstruct _52K	MedInstruct-test ₂₁₆	4.27	4.55	4.21	3.31	3.24
99 LLaMa-2 _70B_LFT	Alpaca _1K	just-eval-instruct _1k	3.54	4.28	3.88	3.91	3.24
100 LLaMa-2 _70B_SFT	Alpaca _1K	just-eval-instruct _1k	2.94	3.45	2.78	2.45	3.24
101 LLaMa-2 _70B_LFT	Alpaca _52K	just-eval-instruct _1k	4.04	4.65	4.11	3.21	2.98
102 LLaMa-2 _70B_SFT	Alpaca _52K	just-eval-instruct _1k	2.71	3.45	2.66	2.45	3.54
103 LLaMa-2 _70B_LFT	Tulu-V2-Mix _1k	just-eval-instruct _1k	4.21	4.56	4.25	3.53	2.51
104 LLaMa-2 _70B_SFT	Tulu-V2-Mix _1k	just-eval-instruct _1k	3.09	3.78	3.12	2.54	2.98
105 LLaMa-2 _70B_LFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	4.36	4.76	4.43	3.54	3.66
106 LLaMa-2 _70B_SFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	3.76	4.01	3.75	2.98	3.24

表 14：结果表 2

Models	Train Set	Test Set	Helpfulness	Clarity	Factuality	Depth	Engagement
107 Mistral-v0.1 _7B_LFT	databricks-dolly _15k	just-eval-instruct _1k	3.13	3.83	3.45	1.99	2.34
108 Mistral-v0.1 _7B_SFT	databricks-dolly _15k	just-eval-instruct _1k	1.12	1.32	1.27	1.04	1.11
109 Mistral-v0.1 _7B_LFT	databricks-dolly _15k	MedInstruct-test ₂₁₆	3.73	4.47	4.29	2.52	2.77
110 Mistral-v0.1 _7B_SFT	databricks-dolly _15k	MedInstruct-test ₂₁₆	1.09	1.11	1.11	1.03	1.08
111 Mistral-v0.1 _7B_LFT	lima _1K	just-eval-instruct _1k	3.21	3.65	3.3	2.39	2.48
112 Mistral-v0.1 _7B_SFT	lima _1k	just-eval-instruct _1k	1.49	1.83	1.58	1.23	1.5
113 Mistral-v0.1 _7B_LFT	lima _1K	MedInstruct-test ₂₁₆	4.24	4.49	4.48	3.18	3.08
114 Mistral-v0.1 _7B_SFT	lima _1K	MedInstruct-test ₂₁₆	1.71	1.83	1.66	1.31	1.6
115 Mistral-v0.1 _7B_LFT	MedInstruct _1K	MedInstruct-test ₂₁₆	4.85	4.97	4.89	4.77	4.54
116 Mistral-v0.1 _7B_SFT	MedInstruct _1K	MedInstruct-test ₂₁₆	1.09	1.56	1.27	1.10	1.22
117 Mistral-v0.1 _7B_LFT	MedInstruct _10K	MedInstruct-test ₂₁₆	4.92	4.98	4.93	4.84	4.6
118 Mistral-v0.1 _7B_SFT	MedInstruct _10K	MedInstruct-test ₂₁₆	1.18	1.78	1.38	1.16	1.32
119 Mistral-v0.1 _7B_LFT	MedInstruct _25K	MedInstruct-test ₂₁₆	4.88	4.98	4.94	4.78	4.58
120 Mistral-v0.1 _7B_SFT	MedInstruct _25K	MedInstruct-test ₂₁₆	1.06	1.08	1.12	1.04	1.06
121 Mistral-v0.1 _7B_LFT	MedInstruct _52K	MedInstruct-test ₂₁₆	4.9	4.98	4.94	4.83	4.56
122 Mistral-v0.1 _7B_SFT	MedInstruct _52K	MedInstruct-test ₂₁₆	1.26	2.1	1.87	1.34	1.88
123 Mistral-v0.1 _7B_LFT	Alpaca _1K	just-eval-instruct _1k	4.24	4.79	4.19	3.51	3.62
124 Mistral-v0.1 _7B_SFT	Alpaca _1K	just-eval-instruct _1k	1.13	2.6	1.49	1.1	1.42
125 Mistral-v0.1 _7B_LFT	Alpaca _10K	just-eval-instruct _1k	4.28	4.87	4.23	3.38	3.55
126 Mistral-v0.1 _7B_SFT	Alpaca _10K	just-eval-instruct _1k	1.11	2.34	1.33	1.1	1.22
127 Mistral-v0.1 _7B_LFT	Alpaca _25K	just-eval-instruct _1k	4.3	4.86	4.25	3.5	3.67
128 Mistral-v0.1 _7B_SFT	Alpaca _25K	just-eval-instruct _1k	1.01	1.75	1.13	1	1.03
129 Mistral-v0.1 _7B_LFT	Alpaca _52K	just-eval-instruct _1k	4.3	4.88	4.29	3.43	3.58
130 Mistral-v0.1 _7B_SFT	Alpaca _52K	just-eval-instruct _1k	1.04	1.77	1.2	1.03	1.09
131 Phi-1.5 _1.3B_LFT	databricks-dolly _15k	just-eval-instruct _1k	3.00	3.78	3.25	1.92	2.33
132 Phi-1.5 _1.3B_SFT	databricks-dolly _15k	just-eval-instruct _1k	2.37	3.15	2.51	1.69	2.14
133 Phi-1.5 _1.3B_LFT	databricks-dolly _15k	MedInstruct-test ₂₁₆	3.83	4.59	4.38	2.53	2.79
134 Phi-1.5 _1.3B_SFT	databricks-dolly _15k	MedInstruct-test ₂₁₆	3.09	3.87	3.45	2.08	2.58
135 Phi-1.5 _1.3B_LFT	lima _1K	just-eval-instruct _1k	3.98	4.46	3.93	3.85	3.82
136 Phi-1.5 _1.3B_SFT	lima _1k	just-eval-instruct _1k	3.71	3.49	3.18	3.39	3.25
137 Phi-1.5 _1.3B_LFT	lima _1K	MedInstruct-test ₂₁₆	4.56	4.85	4.54	3.55	3.26
138 Phi-1.5 _1.3B_SFT	lima _1K	MedInstruct-test ₂₁₆	4.35	4.7	3.97	3.36	3.34
139 Phi-1.5 _1.3B_LFT	MedInstruct _1K	MedInstruct-test ₂₁₆	4.32	4.26	4.33	4.29	4.16
140 Phi-1.5 _1.3B_SFT	MedInstruct _1K	MedInstruct-test ₂₁₆	3.52	3.74	3.44	3.31	3.23
141 Phi-1.5 _1.3B_LFT	MedInstruct _10K	MedInstruct-test ₂₁₆	4.34	4.27	4.39	4.31	4.19
142 Phi-1.5 _1.3B_SFT	MedInstruct _10K	MedInstruct-test ₂₁₆	3.72	3.84	3.55	3.44	3.42
143 Phi-1.5 _1.3B_LFT	MedInstruct _25K	MedInstruct-test ₂₁₆	4.39	4.28	4.44	4.33	4.21
144 Phi-1.5 _1.3B_SFT	MedInstruct _25K	MedInstruct-test ₂₁₆	3.87	3.92	3.74	3.52	3.43
145 Phi-1.5 _1.3B_LFT	MedInstruct _52K	MedInstruct-test ₂₁₆	4.41	4.32	4.51	4.69	4.53
146 Phi-1.5 _1.3B_SFT	MedInstruct _52K	MedInstruct-test ₂₁₆	4.15	4.11	4.10	4.24	4.30
147 Phi-1.5 _1.3B_LFT	Alpaca _1K	just-eval-instruct _1k	4.07	4.85	4.06	3.4	3.64
148 Phi-1.5 _1.3B_SFT	Alpaca _1K	just-eval-instruct _1k	3.19	4.42	3.31	2.72	3.25
149 Phi-1.5 _1.3B_LFT	Alpaca _10K	just-eval-instruct _1k	4.05	4.84	4.02	3.26	3.54
150 Phi-1.5 _1.3B_SFT	Alpaca _10K	just-eval-instruct _1k	3.13	4.13	3.23	2.67	3.24
151 Phi-1.5 _1.3B_LFT	Alpaca _25K	just-eval-instruct _1k	4.03	4.79	4.02	3.29	3.54
152 Phi-1.5 _1.3B_SFT	Alpaca _25K	just-eval-instruct _1k	3.04	4.49	3.11	2.64	3.28
153 Phi-1.5 _1.3B_LFT	Alpaca _52K	just-eval-instruct _1k	4.07	4.81	4.03	3.33	3.56
154 Phi-1.5 _1.3B_SFT	Alpaca _52K	just-eval-instruct _1k	3.93	4.2	3.65	3.88	3.49

表 15：结果表 3

Models	Train Set	Test Set	Coding	Info-Seek	Math	Procedure	Reasoning	Role-Play	Writing
155 gpt-4-1106-preview	-	just-eval-instruct _1k	4.67	4.78	4.57	4.87	4.80	4.76	4.76
156 LLaMa-2 _7B-chat-hf	-	just-eval-instruct _1k	3.31	3.73	2.54	4.07	3.93	3.94	4.07
157 LLaMa-2 _13B-chat-hf	-	just-eval-instruct _1k	3.27	4.02	2.84	4.22	4.09	4.11	4.32
158 LLaMa-2 _70B-chat-hf	-	just-eval-instruct _1k	4.02	4.41	3.67	4.76	4.56	4.45	4.81
159 LLaMa-2 _7B_LFT	databricks-dolly _15k	just-eval-instruct _1k	2.38	2.73	1.81	2.53	2.55	2.15	2.36
160 LLaMa-2 _7B_SFT	databricks-dolly _15k	just-eval-instruct _1k	1.78	2.32	1.62	2.21	2.43	2.46	2.43
161 LLaMa-2 _7B_LFT	lima _1K	just-eval-instruct _1k	2.37	2.78	1.72	2.87	2.68	2.65	2.63
162 LLaMa-2 _7B_SFT	lima _1k	just-eval-instruct _1k	1.76	2.48	1.25	2.45	2.58	2.05	2.3
163 LLaMa-2 _7B_LFT	Alpaca _1K	just-eval-instruct _1k	2.47	3.11	1.95	3.0	3.1	2.72	2.97
164 LLaMa-2 _7B_SFT	Alpaca _1K	just-eval-instruct _1k	1.79	2.34	1.51	2.4	2.45	2.06	2.27
165 LLaMa-2 _7B_LFT	Alpaca _10K	just-eval-instruct _1k	2.45	3.17	1.99	2.99	3.13	2.86	3.1
166 LLaMa-2 _7B_SFT	Alpaca _10K	just-eval-instruct _1k	1.75	2.26	1.53	2.23	2.5	2.29	2.69
167 LLaMa-2 _7B_LFT	Alpaca _25K	just-eval-instruct _1k	2.57	3.17	2.09	3.02	3.13	2.96	3.13
168 LLaMa-2 _7B_SFT	Alpaca _25K	just-eval-instruct _1k	1.88	2.36	1.51	2.41	2.58	2.52	2.74
169 LLaMa-2 _7B_LFT	Alpaca _52K	just-eval-instruct _1k	2.74	3.21	2.09	3.17	3.15	2.97	3.3
170 LLaMa-2 _7B_SFT	Alpaca _52K	just-eval-instruct _1k	1.54	2.15	1.41	2.28	2.39	2.3	2.48
171 LLaMa-2 _7B_LFT	Tulu-V2-Mix _1K	just-eval-instruct _1k	2.66	3.39	2.01	3.29	3.44	2.92	3.12
172 LLaMa-2 _7B_SFT	Tulu-V2-Mix _1K	just-eval-instruct _1k	1.61	2.2	1.37	2.33	2.37	2.11	2.18
173 LLaMa-2 _7B_LFT	Tulu-V2-Mix _10K	just-eval-instruct _1k	2.58	3.46	2.14	3.38	3.54	3.23	3.34
174 LLaMa-2 _7B_SFT	Tulu-V2-Mix _10K	just-eval-instruct _1k	1.85	2.45	1.3	2.57	2.84	2.54	2.79
175 LLaMa-2 _7B_LFT	Tulu-V2-Mix _25K	just-eval-instruct _1k	2.69	3.53	2.15	3.41	3.60	3.37	3.52
176 LLaMa-2 _7B_SFT	Tulu-V2-Mix _25K	just-eval-instruct _1k	1.95	2.43	1.31	2.56	2.84	2.58	2.86
177 LLaMa-2 _7B_LFT	Tulu-V2-Mix _50K	just-eval-instruct _1k	2.82	3.6	2.16	3.45	3.67	3.48	3.6
178 LLaMa-2 _7B_SFT	Tulu-V2-Mix _50K	just-eval-instruct _1k	2.1	2.41	1.30	2.38	2.83	2.67	2.91
179 LLaMa-2 _7B_LFT	Tulu-V2-Mix _150K	just-eval-instruct _1k	2.76	3.6	2.1	3.6	3.64	3.48	3.60
180 LLaMa-2 _7B_SFT	Tulu-V2-Mix _150K	just-eval-instruct _1k	2.19	2.53	1.46	2.66	2.93	2.84	3.04
181 LLaMa-2 _7B_LFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	2.92	3.63	2.13	3.57	3.73	3.6	3.67
182 LLaMa-2 _7B_SFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	2.32	2.81	2.05	2.80	3.04	2.96	3.21
183 LLaMa-2 _7B_LFT	filtered-dolly _3k	just-eval-instruct _1k	2.31	2.79	2.24	2.65	2.67	2.24	2.45
184 LLaMa-2 _7B_SFT	filtered-dolly _3k	just-eval-instruct _1k	1.7	2.05	1.46	1.99	2.15	1.97	2.02
185 LLaMa-2 _7B_LFT	Evol-Instruct _70k	just-eval-instruct _1k	2.78	3.56	2.08	3.55	3.66	3.49	3.76
186 LLaMa-2 _7B_SFT	Evol-Instruct _70k	just-eval-instruct _1k	2.23	2.68	1.73	2.77	3.05	2.98	3.19
187 LLaMa-2 _7B_LFT	LIMA-Simple _1k	just-eval-instruct _1k	2.53	2.97	1.9	2.85	2.96	2.7	2.87
188 LLaMa-2 _7B_SFT	LIMA-Simple _1k	just-eval-instruct _1k	1.78	2.29	1.39	2.09	2.51	2.25	2.28
189 LLaMa-2 _7B_LFT	chatgpt _9k	just-eval-instruct _1k	2.82	3.53	2.93	3.28	3.35	3.52
190 LLaMa-2 _7B_SFT	chatgpt _9k	just-eval-instruct _1k	2.58	3.13	1.82	3.01	3.15	2.72	2.93
191 LLaMa-2 _7B_NFT	Alpaca _52K	just-eval-instruct _1k	1.55	2.37	1.62	2.36	2.65	2.46	2.58

表 16：结果表 3

Models	Train Set	Test Set	Coding	Info-Seek	Math	Procedure	Reasoning	Role-Play	Writing
192 LLaMa-2 _13B_LFT	databricks-dolly _15k	just-eval-instruct _1k	2.6	2.8	2.1	2.79	2.78	2.36	2.58
193 LLaMa-2 _13B_SFT	databricks-dolly _15k	just-eval-instruct _1k	1.93	2.5	1.46	2.19	2.58	1.97	2.39
194 LLaMa-2 _13B_LFT	lima _1K	just-eval-instruct _1k	2.45	2.97	2.08	2.95	2.78	2.48	2.83
195 LLaMa-2 _13B_SFT	lima _1k	just-eval-instruct _1k	1.6	2.32	1.34	2.23	2.45	1.97	2.22
196 LLaMa-2 _13B_LFT	Alpaca _1K	just-eval-instruct _1k	2.78	3.25	2.26	3.24	3.25	2.98	3.27
197 LLaMa-2 _13B_SFT	Alpaca _1K	just-eval-instruct _1k	1.86	2.44	1.67	2.27	2.55	2.11	2.49
198 LLaMa-2 _13B_LFT	Alpaca _10K	just-eval-instruct _1k	2.78	3.28	2.46	3.29	3.3	3.09	3.39
199 LLaMa-2 _13B_SFT	Alpaca _10K	just-eval-instruct _1k	1.87	2.33	1.51	2.32	2.52	2.42	2.8
200 LLaMa-2 _13B_LFT	Alpaca _25K	just-eval-instruct _1k	2.75	3.26	2.55	3.33	3.26	3.16	3.36
201 LLaMa-2 _13B_SFT	Alpaca _25K	just-eval-instruct _1k	1.65	2.17	1.45	2.12	2.36	2.1	2.57
202 LLaMa-2 _13B_LFT	Alpaca _52K	just-eval-instruct _1k	2.73	3.29	2.41	3.3	3.26	2.95	3.49
203 LLaMa-2 _13B_SFT	Alpaca _52K	just-eval-instruct _1k	1.74	2.14	1.34	2.21	2.31	2.19	2.66
204 LLaMa-2 _13B_LFT	Tulu-V2-Mix _1K	just-eval-instruct _1k	2.95	3.49	2.56	3.61	3.62	3.31	3.39
205 LLaMa-2 _13B_SFT	Tulu-V2-Mix _1K	just-eval-instruct _1k	1.88	2.71	1.63	2.58	2.81	2.55	2.56
206 LLaMa-2 _13B_LFT	Tulu-V2-Mix _10K	just-eval-instruct _1k	3.0	3.64	2.36	3.69	3.69	3.43	3.44
207 LLaMa-2 _13B_SFT	Tulu-V2-Mix _10K	just-eval-instruct _1k	1.94	2.62	1.48	2.63	2.96	2.84	3.01
208 LLaMa-2 _13B_LFT	Tulu-V2-Mix _25K	just-eval-instruct _1k	3.02	3.67	2.42	3.71	3.73	3.51	3.50
209 LLaMa-2 _13B_SFT	Tulu-V2-Mix _25K	just-eval-instruct _1k	2.01	2.82	1.72	2.84	3.08	2.95	3.14
210 LLaMa-2 _13B_LFT	Tulu-V2-Mix _50K	just-eval-instruct _1k	3.14	3.75	2.39	3.72	3.76	3.77	3.81
211 LLaMa-2 _13B_SFT	Tulu-V2-Mix _50K	just-eval-instruct _1k	2.11	2.89	1.87	2.92	3.12	3.01	3.18
212 LLaMa-2 _13B_LFT	Tulu-V2-Mix _150K	just-eval-instruct _1k	3.17	3.79	2.44	3.75	3.88	3.82	3.91
213 LLaMa-2 _13B_SFT	Tulu-V2-Mix _150K	just-eval-instruct _1k	2.29	2.95	1.98	3.02	3.21	3.27	3.38
214 LLaMa-2 _13B_LFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	3.20	3.81	2.54	3.80	3.91	3.94	3.99
215 LLaMa-2 _13B_SFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	2.36	3.01	2.31	3.14	3.32	3.36	3.45
216 LLaMa-2 _13B_LFT	LIMA-Simple _1k	just-eval-instruct _1k	2.72	3.23	2.23	3.31	3.18	2.83	3.0
217 LLaMa-2 _13B_SFT	LIMA-Simple _1k	just-eval-instruct _1k	1.77	2.4	1.75	2.31	2.67	2.23	2.32
218 LLaMa-2 _70B_LFT	databricks-dolly _15k	just-eval-instruct _1k	3.44	3.59	2.65	3.46	3.69	3.87	3.63
219 LLaMa-2 _70B_SFT	databricks-dolly _15k	just-eval-instruct _1k	2.42	2.53	2.33	2.45	2.61	2.45	2.78
220 LLaMa-2 _70B_LFT	lima _1K	just-eval-instruct _1k	3.32	3.54	2.59	3.40	3.58	3.72	3.58
221 LLaMa-2 _70B_SFT	lima _1k	just-eval-instruct _1k	2.36	2.49	2.31	2.41	2.57	2.43	2.71
222 LLaMa-2 _70B_LFT	Alpaca _1K	just-eval-instruct _1k	3.57	3.72	2.92	3.73	3.88	4.01	3.92
223 LLaMa-2 _70B_SFT	Alpaca _1K	just-eval-instruct _1k	3.21	2.86	2.64	2.76	2.87	2.76	3.03
224 LLaMa-2 _70B_LFT	Alpaca _52K	just-eval-instruct _1k	3.89	3.99	3.46	4.01	4.12	4.23	4.19
225 LLaMa-2 _70B_SFT	Alpaca _52K	just-eval-instruct _1k	3.42	3.14	3.05	3.10	3.33	3.11	3.45
226 LLaMa-2 _70B_LFT	Tulu-V2-Mix _1k	just-eval-instruct _1k	4.01	4.14	3.53	4.07	4.12	4.30	4.22
227 LLaMa-2 _70B_SFT	Tulu-V2-Mix _1k	just-eval-instruct _1k	3.71	3.42	3.32	3.13	3.76	3.32	3.72
228 LLaMa-2 _70B_LFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	4.31	4.44	3.81	4.27	4.42	4.67	4.57
229 LLaMa-2 _70B_SFT	Tulu-V2-Mix _326k	just-eval-instruct _1k	3.92	3.72	3.62	3.35	4.11	3.67	4.01
230 Mistral-v0.1 _7B_LFT	databricks-dolly _15k	just-eval-instruct _1k	3.1	2.99	2.6	3.1	2.86	2.63	2.8
231 Mistral-v0.1 _7B_SFT	databricks-dolly _15k	just-eval-instruct _1k	1.03	1.19	1.05	1.18	1.15	1.17	1.17
232 Mistral-v0.1 _7B_LFT	lima _1K	just-eval-instruct _1k	3.14	3.01	2.5	3.28	2.85	2.59	3.0
233 Mistral-v0.1 _7B_SFT	lima _1k	just-eval-instruct _1k	1.28	1.54	1.19	1.54	1.63	1.43	1.59
234 Mistral-v0.1 _7B_LFT	Alpaca _1K	just-eval-instruct _1k	3.67	4.09	3.15	4.12	4.21	4.1	3.98
235 Mistral-v0.1 _7B_SFT	Alpaca _1K	just-eval-instruct _1k	1.26	1.55	1.53	1.49	1.55	1.47	1.65
236 Mistral-v0.1 _7B_LFT	Alpaca _10K	just-eval-instruct _1k	3.79	4.06	3.37	4.09	4.12	4.13	4.14
237 Mistral-v0.1 _7B_SFT	Alpaca _10K	just-eval-instruct _1k	1.23	1.41	1.32	1.20	1.37	1.26	1.45
238 Mistral-v0.1 _7B_LFT	Alpaca _25K	just-eval-instruct _1k	3.73	4.11	3.32	4.18	4.22	4.13	4.23
239 Mistral-v0.1 _7B_SFT	Alpaca _25K	just-eval-instruct _1k	1.12	1.18	1.21	1.19	1.17	1.13	1.23
240 Mistral-v0.1 _7B_LFT	Alpaca _52K	just-eval-instruct _1k	3.84	4.09	3.37	4.13	4.18	3.98	4.09
241 Mistral-v0.1 _7B_SFT	Alpaca _52K	just-eval-instruct _1k	1.18	1.22	1.34	1.21	1.22	1.3	1.26
242 Phi-1.5 _1.3B_LFT	databricks-dolly _15k	just-eval-instruct _1k	2.78	2.79	2.26	2.97	2.92	2.72	3.03
243 Phi-1.5 _1.3B_SFT	databricks-dolly _15k	just-eval-instruct _1k	1.9	2.31	1.79	2.36	2.58	2.57	2.55
244 Phi-1.5 _1.3B_LFT	lima _1K	just-eval-instruct _1k	3.02	2.95	2.46	3.12	2.65	2.39	2.97
245 Phi-1.5 _1.3B_SFT	lima _1k	just-eval-instruct _1k	2.58	2.64	2.32	2.51	2.42	2.12	2.78
246 Phi-1.5 _1.3B_LFT	Alpaca _1K	just-eval-instruct _1k	3.24	4.09	3.15	4.12	4.21	4.10	3.98
247 Phi-1.5 _1.3B_SFT	Alpaca _1K	just-eval-instruct _1k	1.24	1.51	1.33	1.47	1.58	1.57	1.87
248 Phi-1.5 _1.3B_LFT	Alpaca _10K	just-eval-instruct _1k	3.27	4.12	3.20	4.15	4.21	4.12	3.99
249 Phi-1.5 _1.3B_SFT	Alpaca _10K	just-eval-instruct _1k	1.45	1.51	1.47	1.65	1.75	1.82	2.01
250 Phi-1.5 _1.3B_LFT	Alpaca _25K	just-eval-instruct _1k	3.29	4.13	3.22	4.17	4.23	4.11	4.00
251 Phi-1.5 _1.3B_SFT	Alpaca _25K	just-eval-instruct _1k	1.71	1.82	1.72	1.83	1.81	2.01	2.14
252 Phi-1.5 _1.3B_LFT	Alpaca _52K	just-eval-instruct _1k	3.28	4.10	3.21	4.19	4.25	4.12	4.03
253 Phi-1.5 _1.3B_SFT	Alpaca _52K	just-eval-instruct _1k	1.59	1.92	1.72	1.97	2.13	2.23	3.02

表 17：结果表 3