Have you built your first Copilot Studio agent and are wondering how you can ensure it provides good responses? In this post, I’ll show you how you can use the new Evaluation feature to automate testing, save time, and get better control of quality.
Of course, you can test it manually by asking questions and checking the answers, or by sharing the agent with a group of colleagues who test it for you. Both approaches work, and you may still want to do that kind of testing in addition to more automated methods.
But there is now a good way to automate testing of a Copilot Studio agent.

Evaluation
There is now a new section in Copilot Studio, “Evaluation”. Here you can create test sets to automate testing of your agent.

A test set is a collection of questions that your agent is expected to be able to answer, typically frequently asked questions. You can create these in several ways:
- You can add questions manually
- You can reuse questions from manual testing you have done in the Test panel in Copilot Studio
- You can let the AI generate 10 questions based on the agent’s description, instructions, and functionality
- You can upload a CSV file with up to 100 questions. You first download a CSV template to paste your questions into.
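As a sketch of the CSV route, the file can be prepared programmatically. Note that the column names below are hypothetical; download the actual template from Copilot Studio and match its headers exactly before uploading.

```python
import csv

# Hypothetical column names -- the real ones come from the CSV template
# you download in Copilot Studio.
questions = [
    {"Question": "What services does Bouvet offer?",
     "ExpectedResponse": "Bouvet offers consulting services within IT and digital communication."},
    {"Question": "Where does Bouvet have offices?",
     "ExpectedResponse": "Bouvet has offices in Norway and Sweden."},
]

with open("test_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Question", "ExpectedResponse"])
    writer.writeheader()
    writer.writerows(questions[:100])  # the upload limit is 100 questions
```

This can be handy if your FAQ already lives in another system and you want to export it into a test set.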

Let’s take a look at my test agent: a simple agent that answers questions based on the www.bouvet.no website.
We let the AI generate 10 more or less relevant questions for us by clicking “Generate 10 questions”.

After a few moments, we are given 10 questions we can test with. If you want more, you can add questions manually or generate more in batches of 10, 25, or 50 questions.
For each question, you can choose from three test methods: text match, similarity, and quality.
Text match compares the agent’s response with the expected answer you enter in the test set. You configure whether the response must match your text exactly or only needs to contain some of the words or phrases you enter.
Similarity compares the meaning of the agent’s response with the expected response you configure in the test set. Here you can adjust how similar in meaning the answer must be for the agent to pass the test.
Quality uses a large language model (LLM) to assess whether the response the agent gives is of good enough quality.
You can find more details about the test methods in the Microsoft documentation: Evaluate agent performance (preview) – Microsoft Copilot Studio | Microsoft Learn
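To make the difference between the first two methods concrete, here is a small illustrative sketch. The “contains” check mirrors text match; for similarity, Copilot Studio compares semantic meaning with an AI model, so the character-level ratio below is only a stand-in to show the idea of a tunable pass threshold.

```python
from difflib import SequenceMatcher

expected = "Bouvet has offices in Norway and Sweden."
response = "Bouvet runs offices across Norway and Sweden."

# Text match (contains mode): pass if the response contains a key phrase.
contains_pass = "norway and sweden" in response.lower()

# Similarity (stand-in): the real feature scores semantic similarity with
# an AI model; here a simple string ratio illustrates the threshold idea.
threshold = 0.7
ratio = SequenceMatcher(None, expected.lower(), response.lower()).ratio()
similarity_pass = ratio >= threshold

print(contains_pass, similarity_pass)
```

The point is that text match is strict and literal, while similarity tolerates rephrasing as long as the meaning stays close enough to the configured threshold.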
Run an evaluation
After generating 10 questions, I run the test set by clicking on the “Evaluate” button.
The evaluation runs and has a status of “Running” until it is finished.

After a few minutes, the test set is fully evaluated and you can check the status.

As the screenshot shows, we got a score of 80%. By clicking on the test set, you can review each question and examine the ones with the status Fail.
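The score is simply the share of questions that passed. With hypothetical per-question results for our 10-question run, it works out like this:

```python
# Hypothetical statuses for a 10-question evaluation run.
results = ["Pass", "Pass", "Fail", "Pass", "Pass",
           "Pass", "Fail", "Pass", "Pass", "Pass"]

score = 100 * results.count("Pass") / len(results)
print(f"Score: {score:.0f}%")  # 8 of 10 passed -> 80%
```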
In our example, it failed on, among others, the question “How do I escalate an issue to a human representative?”

This is correctly set to “Fail”, as our agent is not set up to escalate to a human, so the answer it gave is not relevant.
Summary
The Evaluation functionality in Copilot Studio is a big step forward in making structured, automated agent testing available to agent developers.
By creating a test set with relevant questions, you have an easy way to test changes to an agent before making it available to users.
The Copilot Studio Kit still offers good automated testing capabilities, but Evaluation in Copilot Studio is easier to set up and much more accessible.
Copilot Studio Kit overview – Microsoft Copilot Studio | Microsoft Learn
Evaluate agent performance (preview) – Microsoft Copilot Studio | Microsoft Learn