Evaluation metric example: Correctness (judged by AI)

Created by David Roberts · Last edited 39 days ago

AI evaluation in n8n

This is a template for n8n's evaluation feature.

Evaluation is a technique for building confidence that your AI workflow performs reliably, by running a test dataset of different inputs through the workflow.

By calculating a metric (score) for each input, you can see where the workflow is performing well and where it isn't.
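
As a minimal sketch of the idea (field names and questions are illustrative, not the template's actual dataset): an evaluation dataset pairs inputs with expected outputs, and each run through the workflow yields a per-row score.

```ts
// A minimal sketch of an evaluation dataset. Field names are
// illustrative, not the template's actual dataset columns.
interface EvalRow {
  input: string;           // question fed into the workflow
  expectedOutput: string;  // reference answer to compare against
}

const dataset: EvalRow[] = [
  {
    input: "What caused the fall of the Western Roman Empire?",
    expectedOutput:
      "A combination of internal decline and invasions by Germanic tribes.",
  },
  {
    input: "What caused the Wall Street Crash of 1929?",
    expectedOutput:
      "Speculative over-investment and heavy use of leverage in the stock market.",
  },
];

// Conceptually, evaluation runs every row through the workflow and
// records a score per row, showing where the workflow performs well.
async function evaluate(
  runWorkflow: (input: string) => Promise<string>,
  score: (actual: string, expected: string) => Promise<number>,
): Promise<Array<EvalRow & { actual: string; correctness: number }>> {
  const results: Array<EvalRow & { actual: string; correctness: number }> = [];
  for (const row of dataset) {
    const actual = await runWorkflow(row.input);
    const correctness = await score(actual, row.expectedOutput);
    results.push({ ...row, actual, correctness });
  }
  return results;
}
```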

How it works

This template shows how to calculate a workflow evaluation metric: whether an output matches an expected output (i.e. has the same meaning).

The workflow answers questions about the causes of historical events and compares its answers with the reference answers in the dataset.
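
A common way to implement this kind of AI-judged check is LLM-as-judge: prompt a model to compare the two answers and return a rating. Below is a hedged TypeScript sketch; `callModel` is a placeholder for whatever chat-model call your setup uses (it is not an n8n built-in), and the prompt wording is illustrative rather than the template's actual prompt.

```ts
// Sketch of an AI-judged correctness metric. `callModel` is a
// placeholder for your own chat-model call (e.g. via an LLM SDK);
// it is NOT an n8n built-in, and the prompt is illustrative.
declare function callModel(prompt: string): Promise<string>;

async function correctness(actual: string, expected: string): Promise<number> {
  const prompt = [
    "You are grading an answer for semantic equivalence.",
    "Reply with a single number from 1 (completely different meaning)",
    "to 5 (same meaning), and nothing else.",
    `Reference answer: ${expected}`,
    `Submitted answer: ${actual}`,
  ].join("\n");

  const reply = await callModel(prompt);
  const rating = parseInt(reply.trim(), 10);
  // Normalise the 1-5 rating to a 0-1 score; unparseable replies score 0.
  return Number.isNaN(rating) ? 0 : (rating - 1) / 4;
}
```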

  • We use an evaluation trigger to read in our dataset
  • It is wired up in parallel with the regular chat trigger, so the workflow can be started from either one
  • If we're evaluating (i.e. the execution started from the evaluation trigger), we calculate the correctness metric using AI
  • We pass this score back to n8n as a metric
  • If we're not evaluating, we skip the metric calculation to reduce cost (see the sketch after this list)
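
In plain code, that control flow boils down to: report the metric only on evaluation runs and skip the judge call otherwise. The real template does this with n8n's evaluation nodes rather than code; `isEvaluating`, `reportMetric`, and `judgeCorrectness` below are stand-ins for illustration.

```ts
// Stand-ins for the template's evaluation wiring; these are NOT n8n
// APIs. In the real template, n8n's evaluation nodes handle this.
declare function isEvaluating(): boolean; // did the evaluation trigger start this run?
declare function reportMetric(name: string, value: number): void;
declare function judgeCorrectness(actual: string, expected: string): Promise<number>;

async function handleAnswer(answer: string, expected?: string): Promise<string> {
  // Only score evaluation runs: normal chat runs skip the judge
  // call entirely, so no extra model cost is incurred.
  if (isEvaluating() && expected !== undefined) {
    const score = await judgeCorrectness(answer, expected);
    reportMetric("correctness", score); // reported back to n8n per row
  }
  return answer;
}
```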

New to n8n?

Need help building n8n workflows? Automating processes for you or your company can save you time and money, and it's free to get started!