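C-Eval is a multi-level, multi-discipline Chinese evaluation suite for large language models, released jointly in May 2023 by researchers from Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh. It contains 13,948 multiple-choice questions spanning 52 subjects and four difficulty levels, and is used to assess how well large models understand Chinese. Through zero-shot and few-shot testing, C-Eval evaluates a model's ability to adapt and generalize to tasks it has not seen before.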

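Main features of C-Eval
Multi-discipline coverage: C-Eval contains questions from 52 subjects across STEM, social sciences, humanities, and other fields, giving a broad assessment of a language model's knowledge.
Multi-level difficulty: four difficulty levels, from basic to advanced, give a fine-grained view of a model's reasoning and generalization at each level.
Quantitative, standardized evaluation: the 13,948 multiple-choice questions are scored in a standardized way, yielding quantitative metrics that support side-by-side comparison of different models.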
How to use C-Eval
# Load one subject ("computer_network") from the Hugging Face Hub
from datasets import load_dataset
dataset = load_dataset("ceval/ceval-exam", name="computer_network")
# Alternatively, download and unpack the full dataset archive (shell commands)
wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
unzip ceval-exam.zip
# Load the model to evaluate together with its tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "your-model-name"  # placeholder: replace with a model ID or local path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
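Prompt template, sent to the model in Chinese (in English: "The following are single-choice questions from a Chinese exam on {subject}; please choose the correct answer."):
以下是中國關于{科目}考試的單項選擇題,請選出其中的正確答案。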
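The template can be filled in from a dataset record to produce either a zero-shot or a few-shot prompt. The following is a minimal sketch, assuming each record is a dict with `question`, `A`, `B`, `C`, `D`, and `answer` fields. The helper names `format_example` and `build_prompt`, the `答案:` ("Answer:") suffix, and the sample record are illustrative assumptions rather than part of the original text.

```python
# Header copied from the template above; {subject} stands in for the {科目} placeholder.
HEADER = "以下是中國關于{subject}考試的單項選擇題,請選出其中的正確答案。"
CHOICES = ("A", "B", "C", "D")


def format_example(record, include_answer=False):
    """Render one question with its four options and an answer slot."""
    lines = [record["question"]]
    lines += [f"{c}. {record[c]}" for c in CHOICES]
    # Solved examples (few-shot) show the answer; the target question leaves it blank.
    answer = record["answer"] if include_answer else ""
    lines.append(f"答案:{answer}")
    return "\n".join(lines)


def build_prompt(subject, record, few_shot_examples=()):
    """Zero-shot when few_shot_examples is empty, few-shot otherwise."""
    parts = [HEADER.format(subject=subject)]
    parts += [format_example(ex, include_answer=True) for ex in few_shot_examples]
    parts.append(format_example(record))
    return "\n\n".join(parts)


# Hypothetical record, for illustration only (not taken from the dataset).
sample = {
    "question": "HTTP 協議默認使用的端口號是____。",
    "A": "21", "B": "25", "C": "80", "D": "443",
    "answer": "C",
}
prompt = build_prompt("計算機網絡", sample)
print(prompt)
```

Passing a few solved records as `few_shot_examples` turns the same call into a few-shot prompt, with each example ending in its answer letter.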
inputs <span class="token operator">=</span> tokenizer<span class="token punctuation">(</span>prompt<span class="token punctuation">,</span> return_tensors<span class="token operator">=</span><span class="token string">"pt"</span><span class="token punctuation">)</span>
outputs <span class="token operator">=</span> model<span class="token punctuation">.</span>generate<span class="token punctuation">(</span><span class="token operator">**</span>inputs<span class="token punctuation">)</span>
response <span class="token operator">=</span> tokenizer<span class="token punctuation">.</span>decode<span class="token punctuation">(</span>outputs<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">,</span> skip_special_tokens<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>
answer <span class="token operator">=</span> extract_answer<span class="token punctuation">(</span>response<span class="token punctuation">)</span> ?<span class="token comment"># 自定義函數(shù),提取答案選項</span>
# Score the predictions on the validation split
from sklearn.metrics import accuracy_score
# Assume `predictions` holds the model's predicted options and `labels` the ground-truth answers
accuracy = accuracy_score(labels, predictions)
print(f"Validation Accuracy: {accuracy:.2f}")
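Because C-Eval spans many subjects, it can be useful to break accuracy down per subject before averaging. The sketch below uses plain Python and assumes results are available as `(subject, prediction, label)` tuples. The function name `subject_accuracies` and the sample results are hypothetical, and the text does not say whether a given leaderboard averages per subject or per question, so the macro average here is only one option.

```python
from collections import defaultdict


def subject_accuracies(records):
    """Compute accuracy per subject from (subject, prediction, label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, prediction, label in records:
        total[subject] += 1
        correct[subject] += int(prediction == label)
    return {subject: correct[subject] / total[subject] for subject in total}


# Hypothetical results, for illustration only.
results = [
    ("computer_network", "A", "A"),
    ("computer_network", "B", "C"),
    ("law", "D", "D"),
]
per_subject = subject_accuracies(results)
# Macro average: every subject counts equally, regardless of its question count.
macro_average = sum(per_subject.values()) / len(per_subject)
print(per_subject, macro_average)
```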
Example of predictions stored as JSON (abbreviated), keyed by subject name and then by question index:
{
  "chinese_language_and_literature": {
    "0": "A",
    "1": "B"
  }
}
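A JSON structure of this shape can be produced from per-subject prediction lists. This is a minimal sketch, assuming the question indices are consecutive integers starting at 0 in the same order as the predictions. The function name `build_submission` and the sample predictions are hypothetical.

```python
import json


def build_submission(predictions):
    """Map {subject: [answers in question order]} to {subject: {"0": ..., "1": ...}}."""
    return {
        subject: {str(index): answer for index, answer in enumerate(answers)}
        for subject, answers in predictions.items()
    }


# Hypothetical predictions, for illustration only.
predictions_by_subject = {"chinese_language_and_literature": ["A", "B"]}
submission = build_submission(predictions_by_subject)
submission_text = json.dumps(submission, ensure_ascii=False, indent=2)
print(submission_text)
```

Serializing with `json.dumps` avoids the trailing commas that a hand-written file can easily contain, since those are not valid JSON.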
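Application scenarios for C-Eval
Language model performance evaluation: measure a language model's knowledge and reasoning ability comprehensively, helping developers improve model performance.
Academic research and model comparison: give researchers a standardized test platform for analyzing and comparing how different language models perform in each subject.
Education application development: support the development of intelligent tutoring systems and educational assessment tools, where models generate practice questions and grade answers automatically.
Industry application tuning: evaluate and improve a language model's domain knowledge and applied ability in industries such as finance, healthcare, and customer service.
Community collaboration and technical benchmarking: as an open platform, foster exchange and collaboration in the developer community and provide a fair benchmark for model competitions and technical evaluations.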
