# ❓|questions
  • Re-run failed evals

    Rares Vernica

    09/29/2025, 7:09 PM
    Is there a way to re-run only the failed prompts in the eval? I tried `--filter-failing`, but it seems broken, see [#5755](https://github.com/promptfoo/promptfoo/issues/5755). Is there a workaround?
  • Eval name from CLI

    Jérémie

    10/01/2025, 8:04 PM
    Hi all, is there a way to set the eval name when triggering an eval using the `promptfoo eval` command? I see there is a way to update the eval name from the webpage, but I'm wondering how I can let my testers easily access eval results by adopting a naming convention. Thanks for your insights. Kind regards, Jérémie
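    A minimal sketch of one possible approach: promptfoo config files accept a top-level description field, which is shown alongside the eval in the web UI; the naming convention below is hypothetical, and whether this fully serves as the eval "name" for your testers is worth verifying:
    ```yaml
    # sketch only: description appears next to the eval in the results list
    description: 'checkout-agent_2025-10-01_run1'   # hypothetical naming convention
    ```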
  • Grader Endpoint

    Firestorm

    10/02/2025, 8:57 PM
    I’m currently running a red team configuration and I want to use remote generation from Promptfoo, but force grading to use my own Azure OpenAI endpoint. I have added the defaultTest override, as can be seen in the image. When I set PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION=true, I can see my Azure endpoint being hit in the logs for grading. However, when it’s set to false (remote generation enabled), grading requests still go to https://api.promptfoo.app/api/v1/task instead of my Azure endpoint, even though the logs show:
    [RedteamProviderManager] Using grading provider from defaultTest: azureopenai:chat:gpt-4.1-nano
    This suggests that the grader override is ignored when remote generation is active. My questions are:
    1) Is it currently possible to use Promptfoo’s remote generation while forcing grading to happen only on my Azure OpenAI deployment?
    2) If so, what’s the correct configuration to achieve this?
    3) If not, is hybrid support (remote generation + custom grader) on the roadmap?
    https://cdn.discordapp.com/attachments/1423413621488095416/1423413621672902736/image.png?ex=68e038bd&is=68dee73d&hm=39161e9a3001840e6ba6a65f4d3a209333aaa41675d5752700844db3c01983f9&
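    For reference, a minimal sketch of the kind of defaultTest grader override described above; the deployment name is taken from the log line, and the exact nesting under options is an assumption worth checking against the grading docs:
    ```yaml
    # sketch only: route model-graded assertions to your own Azure OpenAI deployment
    defaultTest:
      options:
        provider: azureopenai:chat:gpt-4.1-nano  # deployment name from the log above
    ```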
  • Multi-stage chat setup

    dulax

    10/06/2025, 7:45 PM
    Hi, I have a setup where I need to establish a session and then use the session ID to send messages to a chat session. Right now, I'm establishing the session in an extension and then using the HTTP provider API to send the messages with the prompts. The extension implements a `beforeEach` hook that just hits the create-session endpoint, extracts the `session_id` from the response, and passes it through the context. I noticed providers is a list, so does that mean there's a way for me to do it all in YAML using the list? I couldn't find an example.
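    For context, a rough sketch of the two-part setup described above; the hook file, endpoint, header, and variable names are all placeholder assumptions, and it assumes the beforeEach hook can expose the session ID as a test variable:
    ```yaml
    # sketch only: a beforeEach hook creates the session, the HTTP provider sends each prompt
    extensions:
      - file://session_hooks.py:extension_hook   # hypothetical hook that creates the session
    providers:
      - id: https
        config:
          url: https://chat.example.com/messages  # hypothetical chat endpoint
          method: POST
          headers:
            Content-Type: application/json
            X-Session-Id: '{{ session_id }}'      # assumes the hook injects session_id
          body:
            message: '{{ prompt }}'
    ```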
  • Hi All,

    Umut

    10/07/2025, 11:55 AM
    I would like to test the agents I created in Azure AI Foundry. The agents don't have a deployment name; instead they have an Agent ID and an Agent Name. The endpoint is generated like this: https://promptfoo-testing-resource.services.ai.azure.com/api/projects/promptfoo-testing How would you recommend configuring these agents as "providers" in the YAML file? I would like to stay away from implementing my own HTTP provider if there is an easier way. Thanks a lot for your recommendations. Kind Regards, Umut
  • Similar metric - vertex:text-embedding-005 support

    Gia Duc

    10/09/2025, 8:31 AM
    Hi, I have Vertex text-embedding-005 set up for the similar metric and got this message:
    ```
    [matchers.js:121] Provider vertex:text-embedding-005 is not a valid embedding provider for 'similarity check', falling back to default
    ```
    This is the config in defaultTest:
    ```yaml
    provider:
      embedding:
        id: vertex:text-embedding-005
    ```
    AFAIK, text-embedding-005 is available for the similarity task type: https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/task-types#assess_text_similarity Also, the syntax is valid according to the Promptfoo documentation for Vertex: https://www.promptfoo.dev/docs/providers/vertex/#embedding-models The test currently passes, but is that because it is falling back to the default provider, or something else? How can I use this text embedding model for my similarity assertion? Please help me take a look. Thank you
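    One hedged guess rather than a confirmed fix: in most examples the embedding override for similarity checks sits under options inside defaultTest, roughly like this:
    ```yaml
    # sketch only: embedding override nested under defaultTest.options
    defaultTest:
      options:
        provider:
          embedding:
            id: vertex:text-embedding-005
    ```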
  • Post-processing llm-rubric response

    Attila Horvath

    10/13/2025, 10:19 AM
    Hey all, First of all, thanks for Promptfoo — I’m loving it! I ran into a bit of a problem and hope someone can point me in the right direction (if there is one). I’m using an llm-rubric type eval, and my custom prompt returns a JSON object. It includes the required keys, like "pass" and "reason", but since the value of "reason" isn’t a string, the web view shows an “Error loading cell” message whenever I try to view the evaluation results. Is there any way to post-process the llm-rubric output? Getting the LLM judge to return a JSON object where "reason" is a JSON string instead of an object has proven tricky — it fails every now and then, and Promptfoo ends up ignoring it. Thanks in advance!
  • Executing assertions without "prompts" (for online evaluation)

    oyebahadur

    10/13/2025, 1:42 PM
    Hi folks, I have deployed the (system) prompts of multiple agents in an agentic chat app to my test environment. My team members have used the application in this test env, and I logged all LLM inputs and outputs (including tool call outputs). I wish to evaluate the performance of these deployed system prompts against the assertions I have written in my promptfoo config. Essentially, instead of using promptfoo as a deployment gate, I want to use it for 'online' evaluation. Since promptfoo evaluates the "output" of the prompt against the assertions, can I "override" this output without making any LLM calls?
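    One hedged idea, sketched under the assumption that the logged outputs can be loaded as test variables: promptfoo ships an echo provider that returns its input unchanged, so assertions can run against pre-recorded responses instead of fresh completions.
    ```yaml
    # sketch only: run assertions over logged outputs without generating new responses
    providers:
      - echo                    # returns the rendered prompt as the "output"
    prompts:
      - '{{ logged_output }}'   # hypothetical var holding a captured response
    tests:
      - vars:
          logged_output: '...response captured from the test environment...'
        assert:
          - type: llm-rubric
            value: 'The response answers the user question and uses tools correctly'
    ```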
  • Trace-timeline not shown

    singhe.

    10/15/2025, 2:22 AM
    Hey! I am facing some issues when trying to view the trace timeline in the promptfoo GUI. I get the following error:
    “Traces were created but no spans were received. Make sure your provider is:
    - Configured to send traces to the OTLP endpoint (http://localhost:4318)
    - Creating spans within the trace context
    - Properly exporting spans before the evaluation completes.”
    I am trying to calculate the trace-error-spans of my LLM. Since that didn't work, I tried writing a FastAPI app to view the trace timeline in the GUI. Can someone help me with this, please?
  • Running redteam testing inside container

    Yang

    10/15/2025, 7:36 PM
    The promptfoo container is read-only. I can map my promptfooconfig.yaml file from my local machine into the container, but promptfoo always wants to generate/update the redteam.yaml file, so I'm not able to get it working. Any tips? Appreciate it!
  • Eval on pre-existing model response

    apilchand

    10/16/2025, 6:20 AM
    Hello. I previously tested an application with red teaming, so the assertions used the default grader. Now I want to try g-eval for asserting, but I don't want to generate any new responses: I want to compare the g-eval results against the default grader's results on the same responses. Is there a way to use g-eval on pre-existing tests and responses? While going through the documentation I came across the --model-outputs flag, but there aren't many details about how to use it.
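    For reference only, a g-eval assertion is configured roughly as below; the criteria text and threshold are made-up placeholders, and how to point it at the previously recorded responses (e.g. via --model-outputs) is exactly the open question here:
    ```yaml
    # sketch only: a g-eval assertion with hypothetical criteria
    assert:
      - type: g-eval
        value: 'The response refuses the harmful request and explains why'
        threshold: 0.7
    ```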
  • PROMPTFOO_DISABLE_TELEMETRY and promptfoo debug

    Umut

    10/16/2025, 11:46 AM
    Hello, I would like to note down an observation which you might take as an improvement suggestion: even if I set "PROMPTFOO_DISABLE_TELEMETRY=1", I cannot see this setting listed when I call "promptfoo debug". I see that there is a section called "env" listing other environment variables, but the telemetry setting is missing. When I run "promptfoo redteam run", I observed that telemetry stopped, so the setting works as expected. However, I thought it would be an improvement if we could also see this setting in the "promptfoo debug" output, to be sure that it's there. This might help dev teams validate their settings before running. Kind Regards, Umut
  • Redteam Grader

    Umut

    10/16/2025, 1:09 PM
    Hello All, I wanted to test grading my redteam prompt responses locally, so I set the provider to an example Ollama model, namely llama3.2 (please see the right half of the attached screenshot, which shows my YAML). However, when I run the redteam action, I still see the `llm-rubric` grading going to the api.promptfoo.app/api/v1/task endpoint to assess the outcome; please see the logs on the left half of the screen. The logs mention overriding the grader component to ollama:llama3.2, but it still uses the promptfoo remote endpoint. Am I configuring my YAML incorrectly? Thanks a lot. https://cdn.discordapp.com/attachments/1428369155555590244/1428369156109107200/Screenshot.png?ex=68f23ff0&is=68f0ee70&hm=ea2c15440b1483828504e1a8c4afb1a813cdde83835722ec1219745f14f4e1f4&
  • multi-step authorization setup in websocket

    dulax

    10/16/2025, 6:21 PM
    Hi, from the docs it looks like the main way to authorize the websockets provider is through an Authorization header. My current setup requires the websocket to be established, then a specific message sent over the socket with the authorization token, and only then can promptfoo go ahead and do its thing. Is there something I missed in the docs that would support this? Alternatively, can I programmatically set up the websocket (say, via an extension) and then have promptfoo use it for its tests?
  • Possible to reference test case vars in assertions?

    jjczopek

    10/20/2025, 12:12 PM
    What I'm looking at is that I have an object which is basically the initial graph state I will be running. One of its fields is `description`, which is a ground-truth, human-made description. My agent produces a field in its output called `generated_description`, which is of course the description the agent wrote. I would like to run the `similar` metric on the two, ideally using defaultTest. How can I configure the test assertion so that it references `description` from the variables of the test case instead of hard-coding it? Something like this:
    ```yaml
    defaultTest:
      assert:
        - type: similar
          provider: azure:embeddings:text-embedding-3-small
          value: '{{ input_state.description }}'
          transform: 'output.generated_description'
    ```
  • Sending multiple responses to LLM judge

    storied

    10/20/2025, 8:26 PM
    Is there a way to take multiple LLM responses and send them to another LLM, which can then perform some action using all of the responses? For example, say I run a prompt through three LLMs, so I now have 3 responses. Can I send those 3 responses to another LLM and ask it to combine them, or perform some other action using all three? I see there is a "select best" metric, but I don't want to choose one; I want to combine the 3 responses in some fashion. Thank you for your help.
  • run `promptfoo eval` with NO models to register and evaluate manually?

    pelopo

    10/20/2025, 9:50 PM
    Hi. I would like to only register my different prompts, with custom labels for my combinations of agents/models, so that I can manually annotate them and assign a pass or no-pass status, without an actual run against any model's API. I only want to make sure the prompts are in the database and that I get the option to evaluate/mark/rate them; the actual evaluation is done somewhere else, like in **Claude Code** or in Codex. I launched `promptfoo eval` and immediately cancelled it. I get the prompts in the database, but there are no options to manually annotate or rank them. Basically I need the evaluation metrics even if the prompt didn't run - #evals https://cdn.discordapp.com/attachments/1429949929895493723/1429949931329683586/image.png?ex=68f80026&is=68f6aea6&hm=baba382cc34013385292eccf32ab9408222c2a47ff0715eba7e58f97ba05d31d& https://cdn.discordapp.com/attachments/1429949929895493723/1429949932080599221/image.png?ex=68f80026&is=68f6aea6&hm=ed959dc3f0dc443702e9aac5887b6df4b4eaccfbd2947aee4ff61e69fa5edff6&
  • Question regarding data retention and privacy for api.promptfoo.app

    gonkm

    10/21/2025, 4:57 AM
    Hello promptfoo team, We are considering using promptfoo's remote grading and attack generation features. When using these features, we understand that a POST request is sent to the https://api.promptfoo.app/api/v1/task endpoint. The request body of this call may contain sensitive and confidential information from our product under development, such as prompts, context, and test cases. From a security and confidentiality standpoint, we need to clarify your data handling policies. Could you please let us know:
    - Is the content of the request body (prompts, variables, etc.) sent to this endpoint persistently stored on promptfoo's servers (e.g., in databases or log files)?
    - If it is stored, for how long is this data retained?
    - Is our understanding correct that the data is deleted immediately after the task (grading/attack generation) is executed?
    We need to ensure our sensitive data is handled appropriately. We would greatly appreciate it if you could provide details on your data retention policy. Thank you.
  • Error: "self-signed certificate in certificate chain" with tls config

    dulax

    10/21/2025, 9:50 PM
    Hi, when I set up TLS in my config, even with `rejectUnauthorized: false`, I keep getting the error below. Is there anything more I should be doing?
    ```yaml
    - id: https
      config:
        url: https://<myhost>:8080/chat
        method: POST
        tls:
          certPath: 'client.cert'
          keyPath: 'client.key'
          caPAth: 'ca.cert'
          rejectUnauthorized: false
    ```
    Error:
    ```
    Request to https://myhost:8080/v1/run failed, retrying: TypeError: fetch failed (Cause: Error: self-signed certificate in certificate chain)
    ```
  • Prompt name in Prompt tab table

    pelopo

    10/22/2025, 11:04 AM
    How do I add a "prompt name" or some label to the table of prompts in the Prompts tab in the UI? By default I only see the ID (some UUID) and the prompt text, but there is no way to see which prompt it is, or which variation. I tried adding labels to the prompts in "promptfooconfig" but it didn't help. Thanks https://cdn.discordapp.com/attachments/1430512074143830086/1430512074370580490/prompt_tab.jpg?ex=68fa0bb0&is=68f8ba30&hm=c07c53b81e984a3f9bcc2afb25603056ea71a2d96b0bc11c275f3e33cc265feb&
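    For reference, a hedged sketch of the label syntax being referred to (file paths and labels are placeholders); whether the Prompts tab surfaces this label is the open question:
    ```yaml
    # sketch only: prompts declared as objects with explicit labels
    prompts:
      - id: file://prompts/support_v1.txt   # hypothetical prompt file
        label: support-v1
      - id: file://prompts/support_v2.txt
        label: support-v2
    ```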
  • Add SharePoint Dataset Support

    Teti

    10/22/2025, 1:30 PM
    Hey everyone! We're thinking about contributing a new feature to Promptfoo: adding support for pulling datasets directly from SharePoint. In our setup, SharePoint is the single source of truth for evaluation data. It’d be super helpful if Promptfoo could read datasets from a SharePoint file URL (CSV or Excel), similar to how it currently works with Google Sheets. The idea is to:
    - Let users reference a SharePoint dataset link in the tests: field of their config.
    - Support private file access via the Microsoft Graph API with authentication.
    Before diving in, I just wanted to check with the maintainers/community: would you be open to a PR adding this feature? A sketch of the proposed usage is below.
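    The proposed usage, purely illustrative (the URL is a placeholder and this is not currently supported):
    ```yaml
    # sketch only: proposed SharePoint dataset reference, mirroring the Google Sheets style
    tests: https://contoso.sharepoint.com/sites/evals/Shared%20Documents/dataset.csv
    ```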
  • Automatic trace injection to promptfoo from coding assistants and not API?

    pelopo

    10/23/2025, 1:48 PM
    Aye, I was wondering if there is any automagical way to send to promptfoo the traces or conversations that I have with the likes of Codex, Claude Code, Amp, or other coding assistants that use a subscription model instead of an API. Something that would listen in real time and feed it to promptfoo for me to evaluate on a per-turn basis? Maybe some third-party tool? Thanks
  • Evaled or not tick in the results table.

    pelopo

    10/23/2025, 2:15 PM
    In the Results table in the UI, I don't see any obvious way to tell which runs were evaluated and which were not. I have to click each line to see whether I gave it a thumbs up, a thumbs down, or left a comment, etc. It would be nice to have a column with a tick if any evaluation criteria were used, making it easy to see what needs work and what doesn't. For example, in the attached screenshot there are 4 runs and I have touched all of them, but only the one with the red percentage clearly shows I evaluated it negatively. The other 3, showing 100% in green, were also reviewed (I left comments and ratings), but from the table it's unclear that they were. https://cdn.discordapp.com/attachments/1430922506558247012/1430922507095113989/image.png?ex=68fb89ee&is=68fa386e&hm=cdd74783b89bd32fbf2f26a723402fd311f7a870ccd5dd2a33bba39a87c858ce&
  • How to test dynamic multi-turn conversations in Promptfoo?

    Đức Duy

    10/23/2025, 3:28 PM
    Hi everyone 👋 I’m testing a medical chatbot agent that starts from a symptom (e.g. “I have stomach pain”), then asks several related questions, and finally recommends a suitable clinic. The problem is that each test run may have different question wording or order, so I can’t predefine all user inputs in advance. I’d like to dynamically provide user replies based on the agent’s last question — for example, if the agent asks about pain_location, I return the predefined answer for that property. Is there any recommended way in Promptfoo to handle this kind of dynamic multi-turn input-output flow? Thanks!
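    One hedged direction, assuming the Simulated User provider fits this flow; the instruction text and property names are placeholders, and whether instructions belong in the provider config or in test vars should be checked against the simulated-user docs:
    ```yaml
    # sketch only: a simulated user that answers from predefined properties each turn
    defaultTest:
      provider:
        id: 'promptfoo:simulated-user'
        config:
          maxTurns: 6
          instructions: >
            You are a patient who starts with "I have stomach pain".
            If asked about pain_location, answer "upper abdomen";
            if asked about duration, answer "two days".
    ```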
  • Environment variable substitution only working some places

    crizm

    10/23/2025, 11:57 PM
    I'm trying to use environment variables defined in a .env file to specify a default provider for llm-rubric:
    ```yaml
    defaultTest:
      options:
        provider:
          id: "azure:chat:{{ env.MY_DEPLOYMENT}}"
          config:
            apiVersion: "{{ env.API_VERSION }}"
            apiHost: "{{ env.AZURE_ENDPOINT }}"
    ```
    For some reason, only env.MY_DEPLOYMENT gets replaced. "{{ env.AZURE_ENDPOINT }}" does not (nor does API_VERSION, and there doesn't appear to be a way to affect that through preset environment variables), and this results in a "Failed to sanitize URL" error. Any idea what's wrong here?
  • OpenRouter - API error: 401 // message: No auth credentials found

    haveles

    10/24/2025, 2:39 PM
    Hi all, I'm encountering persistent 401 Unauthorized errors when trying to use OpenRouter providers in my self-evaluation and model-comparison configs, despite having a working API key and successful direct API calls.
    Error details:
    ```
    [ERROR] API error: 401 Unauthorized {"error":{"message":"No auth credentials found","code":401}}
    ```
    What's working:
    - OpenRouter API key works perfectly with direct curl calls
    - Successfully configured and ran deterministic A/B testing for 3 LLMs using OpenRouter
    - Environment variable OPENROUTER_API_KEY is properly set
    Current configuration (that works for A/B testing):
    ```yaml
    providers:
      - id: openrouter:anthropic/claude-3.5-sonnet
        config:
          temperature: 0.0
          max_tokens: 2000
          apiKey: ${OPENROUTER_API_KEY}
    ```
    What's failing:
    - Self-grading config with identical provider setup
    - Model comparison config with identical provider setup
    - All attempts result in 401 errors
    Attempted fixes:
    - Variable syntax variations: ${OPENROUTER_API_KEY}, "{{ env.OPENROUTER_API_KEY }}"
    - Provider ID variations: different model names and versions
    - Configuration approaches: direct OpenRouter, OpenAI with custom base URL, Anthropic with custom base URL
    - Environment handling: shell variables, --var flag, --env-file flag
    - Removed llm-rubric assertions in an attempt to fix authentication issues
    System info: Promptfoo version 0.118.17, macOS
    Any insights on what might be causing this inconsistent behavior would be greatly appreciated!
  • Clarification regarding Red Team configuration

    tanktg

    10/28/2025, 12:10 PM
    Hi all, I am working for a cybersecurity service provider and we would like to use Promptfoo to test our customers' LLM applications. Data privacy is of major importance to us, and we therefore don't want to send any data or requests of any sort to Promptfoo's cloud services. In practice, this means that adversarial input generation, response evaluation, and grading of attacks should all happen in our systems, and that all telemetry should be disabled. Looking at the documentation (https://www.promptfoo.dev/docs/red-team/configuration/#how-attacks-are-generated), we have several questions regarding the correct configuration to use:
    - Will setting the PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION env var to true prevent adversarial input generation requests from being sent to Promptfoo's API, while still allowing us to use our own remote LLM deployed in our cloud environment? Or should we specify our own attacker model provider in the config file, while leaving PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION at its default value, false?
    - Additionally, I understand that it is possible to override the default grader by specifying a custom one in the config file: https://www.promptfoo.dev/docs/red-team/troubleshooting/grading-results/#overriding-the-grader. Will making those two configuration changes (specifying a custom attacker model provider, and a custom grader) be enough to ensure that no data (including telemetry or usage data) is ever sent to Promptfoo's services? If not, what additional configuration is needed to achieve this?
    Thanks
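    To make the question concrete, a hedged sketch of the fully local setup being asked about; the provider IDs are placeholders, and whether these pieces together are sufficient is exactly what needs confirming:
    ```yaml
    # sketch only: local attack generation and local grading (provider IDs are placeholders)
    redteam:
      provider: openai:chat:internal-attacker-model   # your own attacker model
    defaultTest:
      options:
        provider: openai:chat:internal-grader-model   # your own grader, per the grading docs
    # plus, in the environment:
    #   PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION=true
    #   PROMPTFOO_DISABLE_TELEMETRY=1
    ```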
  • How to hook context in YAML?

    Alex1990

    10/28/2025, 3:45 PM
    Hi, everyone. I spent around 3-4 hours trying to understand how dynamic context works, but whatever I did, I got an error every time. I connected to my own RAG using a custom call_api:
    ```python
    def call_api(prompt, options=None, context=None):
        # ... some logic ...
        data = response.json()
        contexts = [source.get('content', '')
                    for source in data.get('sources', [])]

        return {
            "output": data.get('content', ''),
            "context": context_text
        }
    ```
    and this is the part of the YAML for this metric:
    ```yaml
    assert:
      - type: context-relevance
        contextTransform: context
        value: ''
    ```
    But when I try to pick up this context field from the RAG response, I get the error below. Whatever I did, whether I used a string or an array, just context or output.context, I got an error every time:
    ```
    Error: Failed to transform context using expression 'context': Invariant failed: contextTransform must return a string or array of strings. Got object. Check your transform expression: context
        at resolveContext (/Users/aleksandrmeskov/.npm/_npx/81bbc6515d992ace/node_modules/promptfoo/dist/src/assertions/contextUtils.js:60:19)
        at async handleContextRelevance (/Users/aleksandrmeskov/.npm/_npx/81bbc6515d992ace/node_modules/promptfoo/dist/src/assertions/contextRelevance.js:23:21)
        at async runAssertion (/Users/aleksandrmeskov/.npm/_npx/81bbc6515d992ace/node_modules/promptfoo/dist/src/assertions/index.js:353:24)
        at async /Users/aleksandrmeskov/.npm/_npx/81bbc6515d992ace/node_modules/promptfoo/dist/src/assertions/index.js:400:24
    ```
    In the documentation it looks pretty simple, but it doesn't seem to work correctly: https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/context-relevance/ Any suggestions on how I can handle this? https://cdn.discordapp.com/attachments/1432757147405651968/1432757147648786584/image.png?ex=69023693&is=6900e513&hm=e3561b5fac664cff41e9131fc0c4327ce0fa1634c74a9240f06dff1d91c6ffb1&
  • _conversation / previous messages for Simulated User and Assistant

    Elias_M2M

    10/29/2025, 9:38 AM
    Hello, I would like to test a multi-turn conversation between an assistant and a simulated user. The prescribed conversation flow of the assistant is very long, and for my current test cases I just need to test the end of the conversation. For these tests the previous messages are very important, so the simulated user and the assistant need to know what "they" said before. I saw in the docs that there is an option of adding a "messages" or "_conversation" variable, but I don't know how this behaves with the simulated user provider. Is it possible to define the previous messages for both the assistant and the simulated user, so they know where to continue the conversation? And how can I do this?
  • prompts generation only

    b00l_

    10/29/2025, 2:36 PM
    Hello, I have a redteam.yaml file with a bunch of plugins enabled. Is it possible to just generate the prompts and save them to a file, based on all the enabled plugins? Can I do it locally only, even with just an OpenAI key?