From bb24a6f43d52c6efd93edb0233175c779db5c369 Mon Sep 17 00:00:00 2001
From: mrT23
Date: Tue, 4 Mar 2025 08:24:48 +0200
Subject: [PATCH] docs: update evaluation dataset size in finetuning benchmark
 documentation

---
 docs/docs/finetuning_benchmark/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/finetuning_benchmark/index.md b/docs/docs/finetuning_benchmark/index.md
index b479ff0a..c9f0aa63 100644
--- a/docs/docs/finetuning_benchmark/index.md
+++ b/docs/docs/finetuning_benchmark/index.md
@@ -68,7 +68,7 @@ Here are the prompts, and example outputs, used as input-output pairs to fine-tu
 
 ### Evaluation dataset
 
-- For each tool, we aggregated 100 additional examples to be used for evaluation. These examples were not used in the training dataset, and were manually selected to represent diverse real-world use-cases.
+- For each tool, we aggregated 200 additional examples to be used for evaluation. These examples were not used in the training dataset, and were manually selected to represent diverse real-world use-cases.
 - For each test example, we generated two responses: one from the fine-tuned model, and one from the best code model in the world, `gpt-4-turbo-2024-04-09`.
 - We used a third LLM to judge which response better answers the prompt, and will likely be perceived by a human as better response.
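For context on the evaluation protocol the patched documentation describes (two responses per test example, judged pairwise by a third LLM), here is a minimal sketch of how such a comparison loop could look. This is not the actual benchmark harness: `query_llm`, the judge prompt wording, and the `judge_model` name are all illustrative assumptions.

```python
# Minimal sketch of a pairwise LLM-as-judge evaluation, assuming a generic
# chat-completion wrapper. Not the actual benchmark code from this repository.

JUDGE_PROMPT = """You are given a prompt and two candidate responses.
Decide which response better answers the prompt and would likely be
perceived by a human as the better response.

Prompt:
{prompt}

Response A:
{response_a}

Response B:
{response_b}

Answer with exactly one word: "A" or "B".
"""


def query_llm(model: str, prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion API call."""
    raise NotImplementedError  # plug in your provider's client here


def judge_pair(prompt: str, fine_tuned_answer: str, baseline_answer: str,
               judge_model: str = "some-judge-model") -> str:
    """Ask a third LLM which of the two responses is better."""
    verdict = query_llm(
        judge_model,
        JUDGE_PROMPT.format(prompt=prompt,
                            response_a=fine_tuned_answer,
                            response_b=baseline_answer),
    )
    return "fine_tuned" if verdict.strip().upper().startswith("A") else "baseline"


def evaluate(examples: list[dict]) -> float:
    """Return the fraction of evaluation examples won by the fine-tuned model."""
    wins = sum(
        judge_pair(ex["prompt"], ex["fine_tuned_response"], ex["gpt4_response"])
        == "fine_tuned"
        for ex in examples
    )
    return wins / len(examples)
```

One design note: pairwise judge setups like this are known to exhibit position bias, so a real harness would typically randomize or swap the A/B order of the two responses and aggregate both orderings before declaring a winner.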