How we approach AI deployments in three steps: audit, evaluation and deployment

Matthias Olieslagers

Chief of Staff

June 16, 2026

Over the past few years, we've worked on AI deployments for enterprises across different industries. The technology stacks differ and the use cases differ, but the way we structure the work has converged on roughly the same three phases: audit, evaluation, deployment. This post briefly describes what each phase looks like in practice at Panenco and shares some guiding principles for making good decisions along the way.

Phase 1: Audit

At the beginning of each engagement, we spend time with the teams who'll eventually use the system and observe how they work today. The goal is to understand the actual workflow end-to-end, not the version of it that gets described in a kickoff. In our experience, this step is essential for eventual adoption, because business operators feel engaged and in control from day one.

When deciding what processes are worth automating, we apply a few checks.

Enough volume to matter: Automations that run a handful of times per month rarely pay for themselves. We prioritise workflows where partial automation moves a measurable number. In a recent engagement, we automated an invoice processing workflow that ran ~250 times/day. That is the kind of scale worth tackling.

Rules are clear, inputs are messy: If a process can be described in rules but the inputs arrive as PDFs, emails, scans, and spreadsheets in different formats, that's a good candidate for an agent. If rules and inputs are both predictable, a deterministic script is usually faster and cheaper. Where applicable, we combine the best of both: deterministic scripts where possible and LLM calls where relevant.
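To make the "deterministic where possible, LLM where relevant" idea concrete, here is a minimal sketch in Python. It is an illustration under our own assumptions, not production code: the invoice-number pattern, the `Extraction` type, and the `fake_llm` stand-in are all hypothetical, and a real system would call an actual model client where the stand-in is used.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Extraction:
    invoice_number: str
    source: str  # "rule" if the deterministic path matched, "llm" otherwise


# Hypothetical pattern for clean inputs such as "Invoice No: INV-2024-001".
INVOICE_RE = re.compile(
    r"Invoice\s*(?:No\b\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)


def extract_with_rules(text: str) -> Optional[str]:
    """Cheap, predictable path: return the invoice number or None."""
    match = INVOICE_RE.search(text)
    return match.group(1) if match else None


def extract_invoice_number(
    text: str, llm_extract: Callable[[str], str]
) -> Extraction:
    """Try the deterministic rule first; fall back to the LLM for messy input."""
    value = extract_with_rules(text)
    if value is not None:
        return Extraction(value, "rule")
    return Extraction(llm_extract(text), "llm")


def fake_llm(text: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    return "UNKNOWN"
```

Recording the `source` of each result makes it possible to see how much traffic the cheap path absorbs and where LLM calls are actually needed.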

The audit ends with a prioritised shortlist of use cases, each with a feasibility score, effort estimate, and expected impact on ROI. It also includes a list of processes we'd recommend leaving manual for now.
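As a rough illustration of how such a shortlist could be ranked, the sketch below scores each candidate on volume, feasibility, and effort. The scoring formula, the volume threshold, and all field names are assumptions made for this example; in practice the weighting is a judgment call made together with the client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    runs_per_month: int
    minutes_saved_per_run: float  # illustrative estimate
    feasibility: float            # 0.0 (very hard) to 1.0 (straightforward)
    effort_days: float            # estimated build effort


# Assumed cut-off: below this volume we would suggest leaving it manual.
MIN_RUNS_PER_MONTH = 100


def monthly_hours_saved(uc: UseCase) -> float:
    return uc.runs_per_month * uc.minutes_saved_per_run / 60


def priority(uc: UseCase) -> float:
    """Expected hours saved per month, discounted by feasibility, per effort day."""
    return monthly_hours_saved(uc) * uc.feasibility / uc.effort_days


def triage(use_cases: list[UseCase]) -> tuple[list[UseCase], list[UseCase]]:
    """Split candidates into a ranked shortlist and a leave-manual list."""
    shortlist = [uc for uc in use_cases if uc.runs_per_month >= MIN_RUNS_PER_MONTH]
    manual = [uc for uc in use_cases if uc.runs_per_month < MIN_RUNS_PER_MONTH]
    shortlist.sort(key=priority, reverse=True)
    return shortlist, manual


# Illustrative numbers only. 5500 assumes ~250 runs/day over ~22 working days.
candidates = [
    UseCase("contract review", 400, 20, 0.5, 40),
    UseCase("board report", 5, 120, 0.9, 10),
    UseCase("invoice processing", 5500, 4, 0.8, 30),
]
shortlist, manual = triage(candidates)
```

With these made-up inputs, invoice processing ranks first, contract review second, and the board report lands on the leave-manual list because it runs too rarely.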

Phase 2: Evaluation

Before anything goes near production, we build an evaluation suite and put it in the hands of the business operators from the start, not only after deployment. Two practices are central to this.

Building a golden dataset with the domain expert: We sit with someone who does the work today and define what a good answer looks like for 20 to 50 real cases. That set becomes the benchmark. We re-run it every time we change a prompt, a tool, or a model, and we monitor the calls in production with an observability tool to catch drift.
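A re-runnable golden set can be very small. The sketch below assumes the agent is a plain function from input text to an answer and that a normalised exact match is a good enough comparison; both are simplifications for illustration, and `GoldenCase`, `run_golden_set`, and `is_regression` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    expected: str  # the answer agreed with the domain expert


def _normalise(answer: str) -> str:
    return answer.strip().lower()


def run_golden_set(agent: Callable[[str], str], cases: list[GoldenCase]) -> dict:
    """Run every golden case through the agent and collect the failures."""
    failures = [
        case.case_id
        for case in cases
        if _normalise(agent(case.input_text)) != _normalise(case.expected)
    ]
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


def is_regression(baseline: dict, candidate: dict) -> bool:
    """A change regresses if it passes fewer golden cases than the baseline."""
    return candidate["passed"] < baseline["passed"]
```

Running this before and after each prompt, tool, or model change gives a simple pass/fail signal, and the list of failing case IDs shows where to look first.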

Grading the process, not just the output: People don't solve problems in one step, so an agent that arrives at the right answer through the wrong reasoning will eventually fail in ways that are hard to debug. We map the intermediate steps a competent human would take and grade the agent on whether it hits the same checkpoints.
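One simple way to grade the process is to compare the agent's step trace against the ordered checkpoints a competent human would hit. The sketch below assumes the trace is available as a list of step names, which depends on how the agent is instrumented; the step names and the `grade_trace` and `passes` helpers are hypothetical.

```python
def grade_trace(trace: list[str], checkpoints: list[str]) -> dict:
    """Check which expected checkpoints appear in the trace, in order."""
    position = 0
    hit: list[str] = []
    missed: list[str] = []
    for checkpoint in checkpoints:
        try:
            # Only search after the previous hit, so order matters.
            index = trace.index(checkpoint, position)
        except ValueError:
            missed.append(checkpoint)
            continue
        hit.append(checkpoint)
        position = index + 1
    score = len(hit) / len(checkpoints) if checkpoints else 1.0
    return {"hit": hit, "missed": missed, "score": score}


def passes(answer_correct: bool, grade: dict) -> bool:
    """A case only passes if the answer is right and no checkpoint was skipped."""
    return answer_correct and not grade["missed"]


# Hypothetical checkpoints for an invoice workflow.
CHECKPOINTS = [
    "extract_fields",
    "match_purchase_order",
    "check_totals",
    "propose_booking",
]
```

An agent that produces the right booking but never matched the purchase order would score 0.75 here and fail the case, which is exactly the kind of lucky answer this grading is meant to surface.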

The eval suite ends up serving two purposes. Internally, it tells us whether a change is an improvement or a regression and gives us clear action points for steering the agent's behaviour. Externally, it gives the client's leadership something concrete to look at when deciding whether to expand the rollout. It also measures the impact achieved over the course of the build through metrics such as error rates, throughput time, and savings.

Phase 3: Deployment

Every company has its own software stack, legacy systems, and data assets. We embrace that variety and build solutions around it, rather than forcing in new tooling that doesn't fit the existing way of working. Two principles generally shape how we put things into production.

Avoid data migrations, minimise new tooling: Most of our enterprise clients have spent years getting onto their current ERP or cloud provider. Where we can, we build as little as possible from scratch and rely on what already exists in the client's software stack. In a recent project for an audit firm, we leaned heavily on the existing Azure suite: Azure Service Bus as the task manager between the backend API and the AI engine, Azure Foundry to build the AI engine, Blob Storage for handling documents, and so on. The existing stack stays in place, with minimal outside tooling and overhead added. This also helps significantly with compliance, cybersecurity, and change management.

Start with a small unit of autonomy, and expand only after success: Initial versions typically observe and recommend rather than act, with a human in the loop for verification. The agent detects, analyses, and prepares an action, and a human still presses the button. Once that has been stable for long enough, and we've used the operators' input to improve the agent and build confidence, we extend the agent's permissions to make it more autonomous, especially for cases where confidence is extremely high. Each expansion gets the same evaluation and observability treatment as the original deployment. This approach supports good change management: operators adopt the tooling faster and stay in control of the process. It also helps avoid the critical errors that make everyone lose trust in a newly adopted AI solution.
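The staged expansion of autonomy can be expressed as a small gate in front of every action. The sketch below is one possible shape, assuming the agent attaches a confidence value to each proposed action; the mode names, the `Proposal` type, and the 0.98 threshold are illustrative assumptions rather than fixed values.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    RECOMMEND = "recommend"                        # a human approves every action
    AUTO_HIGH_CONFIDENCE = "auto_high_confidence"  # agent may act when very sure


@dataclass(frozen=True)
class Proposal:
    action: str
    confidence: float  # 0.0 to 1.0, as reported by the agent (assumption)


def decide(proposal: Proposal, mode: Mode, threshold: float = 0.98) -> str:
    """Return "execute" or "queue_for_human" for a proposed action."""
    if mode is Mode.AUTO_HIGH_CONFIDENCE and proposal.confidence >= threshold:
        return "execute"
    # Default path: prepare the action and let the operator press the button.
    return "queue_for_human"
```

Starting every deployment in `Mode.RECOMMEND` and switching modes only after the evals and operator feedback look stable keeps the expansion explicit and easy to roll back.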

Wrapping things up

Once the system is in production, we run it alongside the client's team until they're comfortable owning it. That usually means a handover period with documentation, a re-runnable eval suite, observability dashboards, and a clear set of metrics to track.

The work isn't finished when the agent ships. It's finished when the client's team can run it, learn from it, and improve it without us. Ultimately, building the solution is not the hardest part. The real value accrues in making sure it gets adopted and keeps improving in production, through continuous evals and business input.

If you're working on something in this space and want to learn more or share findings, please feel free to get in touch.
