Assessing payer-ready economic model quality: Acumetis Model Builder vs AI

A controlled experiment comparing AI-generated models with Model Builder

Executive summary

Generative Artificial Intelligence (AI) is increasingly being explored as a tool for health economic modelling. In principle, the ability to recreate a published cost-effectiveness model directly from a paper and generate an Excel implementation within a short time is highly attractive. However, for Health Technology Assessment (HTA) and to fulfil payer requirements, speed alone is not enough. Models must also be transparent, methodologically sound, reproducible, and suitable for further development into submission-ready evidence.

The aim of this experiment was to evaluate whether AI can recreate a high-quality cost-effectiveness model from a publication within one hour, and to benchmark its outputs against those generated using Acumetis Model Builder (our online platform designed for the development of Excel-based economic models). The experiment was conducted on a top 3 AI agent, hereafter referred to as “AI Agent”. We define quality as stability, transparency, and reproducibility.

→ Acumetis Model Builder is an online, cloud-based platform which automates the design and building process of Excel cost-effectiveness and budget impact models. The intuitive online interface allows quick and easy generation of self-contained, customized models in Excel. Debuted in 2018 and used in over 250 payer and HTA submissions, the Model Builder reduces the risk of human error and consistently produces HTA- and payer-ready Excel models.

To assess this, we prompted AI Agent to build an Excel model directly from a published paper describing a relatively simple Markov model in endometriosis-related pain. As a reference, we recreated the same model using Model Builder technology. While the intended test horizon for AI Agent was time bounded to one hour, the overall testing time was extended in order to assess replicability across repeated runs.

The results show that AI can be useful for generating an early model draft and for extracting inputs from publications. However, the outputs varied substantially between runs, assumptions changed silently, one attempt failed entirely, and one version achieved an exact numerical match only by calibrating missing inputs to force agreement with the publication results. None of the AI Agent-generated models was HTA- or payer-ready. All would have required substantial further quality control, methodological review, refinement of assumptions, and extensive additional work to introduce the scenarios, analyses, and flexibility expected by HTA bodies.

In contrast, the Model Builder approach delivered a payer ready reference model with a user-friendly structure and reusable analytics.

The findings suggest that, at present, AI Agent is best viewed as a tool for early support and prototyping rather than a reliable standalone route to robust HTA and payer modelling.

1. Background

Health economic models used in HTA and payer submissions must meet high standards of methodological rigor. They are expected to be transparent, reproducible, internally consistent, and suitable for extension into scenario analyses, sensitivity analyses, and submission-ready materials.

Model Builder is a proprietary online platform designed for the development of Excel-based economic models. It debuted at ISPOR in 2018 and, over the last seven years, Acumetis (formerly FIECON) has used the platform with clients to develop health economic models for over 250 HTA and payer submissions. In that time, it has become synonymous with industry-standard model building. The value of this technology has consistently rested on three principles:

    Generative AI now presents a compelling opportunity. It can read publications, summarize evidence, and generate Excel structures rapidly.

    In addition, HTA bodies are warming to AI: a recently completed series of stakeholder workshops explored the potential applications, opportunities, and challenges of Generative AI (GenAI) in health economic evaluation (HEE). These discussions build on a 2024 Position Statement on the Use of AI in Evidence Generation, which acknowledges that AI methods are likely to play an increasing role in future HTA submissions, provided they are used transparently and responsibly and give clear added value.

    The rapid evolution of this technology, together with its growing acceptance among HTA bodies and payers, means that the potential of AI cannot be ignored. At the same time, in the context of HTA and payer evidence, robust and reliable scientific evidence remains essential.

    Transposing that requirement to economic modelling, the cardinal question is not whether a model can be produced quickly (for speed or efficiency’s sake), but whether one can trust what has been produced, understand how it was built, and rely on it in a high-stakes context. The fundamental question is: Is a model produced by AI competent and reliable to payer and HTA standards in the same way a Model Builder output is?

    2. Objective

    The objective of this experiment was to assess whether AI Agent can create a good-quality cost-effectiveness model from a publication within one hour.

    More specifically, we aimed to evaluate whether AI Agent could:

    • Recreate the published model structure in Excel
    • Generate results reasonably close to the publication
    • Do so within approximately one hour
    • Provide a model of sufficient quality to serve as a solid early basis for further development

    To provide a reference point, the same model was created using Model Builder technology. Although the primary AI Agent test was framed around the one-hour target, the overall testing time was extended to allow multiple runs and thereby assess replicability.

    This distinction is important: the purpose was not merely to investigate whether AI Agent could produce something plausible once, but whether it could do so reliably and consistently.

    3. Source publication

    The publication selected for the exercise was an economic analysis in endometriosis-related pain by Grand et al. (2019). The paper was chosen because it is publicly available and describes a relatively simple cost-effectiveness model with a Markov structure and reasonably well-reported assumptions.

    The model included four health states:

    • No pain
    • Mild pain
    • Moderate pain
    • Severe pain

    Many inputs were reported and publicly available (including transition probabilities, utility values, unit costs, and resource use). However, two important elements were not fully specified:

    • The baseline distribution across health states
    • The detailed assumptions underlying pain-management costs
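
To make the structure concrete, the following is a minimal sketch of a four-state Markov cohort trace for a single treatment arm, with half-cycle correction and discounting. It is not the published model: every number in it (transition probabilities, utilities, costs, cycle length, discount rate, horizon, and the baseline split) is a hypothetical placeholder, and the function names are illustrative. A full cost-effectiveness model would run this once per comparator and difference the totals.

```python
# All numeric inputs below are hypothetical placeholders, NOT values from Grand et al. (2019).
STATES = ["no pain", "mild pain", "moderate pain", "severe pain"]

# Row = state at the start of a cycle, column = state at the end; each row sums to 1.
TRANSITIONS = [
    [0.70, 0.20, 0.08, 0.02],
    [0.25, 0.50, 0.20, 0.05],
    [0.10, 0.30, 0.45, 0.15],
    [0.05, 0.15, 0.30, 0.50],
]
UTILITY = [0.85, 0.75, 0.60, 0.40]            # assumed utility per state
COST_PER_CYCLE = [50.0, 120.0, 300.0, 650.0]  # assumed cost per state per cycle
CYCLE_YEARS = 0.25                            # assumed cycle length (quarterly)
DISCOUNT_RATE = 0.035                         # assumed annual discount rate


def run_cohort(baseline, transitions, n_cycles):
    """Return the Markov trace: state occupancy at cycles 0..n_cycles."""
    trace = [list(baseline)]
    n_states = len(baseline)
    for _ in range(n_cycles):
        current = trace[-1]
        trace.append([
            sum(current[i] * transitions[i][j] for i in range(n_states))
            for j in range(n_states)
        ])
    return trace


def discounted_total(trace, payoff_per_cycle, cycle_years, annual_rate):
    """Sum a per-cycle payoff over the trace with half-cycle correction and discounting."""
    total = 0.0
    for t in range(len(trace) - 1):
        # Half-cycle correction: average the occupancy at the start and end of the cycle.
        mid = [(a + b) / 2 for a, b in zip(trace[t], trace[t + 1])]
        value = sum(m * p for m, p in zip(mid, payoff_per_cycle))
        # Discount at the mid-point of the cycle (one convention among several).
        total += value / (1 + annual_rate) ** ((t + 0.5) * cycle_years)
    return total


# The baseline split is itself an assumption: the paper did not fully report it.
baseline = [0.0, 1 / 3, 1 / 3, 1 / 3]
trace = run_cohort(baseline, TRANSITIONS, n_cycles=12)
total_qalys = discounted_total(
    trace, [u * CYCLE_YEARS for u in UTILITY], CYCLE_YEARS, DISCOUNT_RATE
)
total_cost = discounted_total(trace, COST_PER_CYCLE, CYCLE_YEARS, DISCOUNT_RATE)
print(f"QALYs: {total_qalys:.3f}, cost: {total_cost:.2f}")
```

Because the baseline distribution enters the very first row of the trace, any unreported value there propagates into every downstream total, which is why the two reporting gaps listed above matter.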

    The base-case results reported in the publication were as follows:

    These results served as the publication benchmark.

    4. Experimental setup

    All AI-based model reconstructions were performed using AI Agent (details available on request), accessed via the AI Agent interface in March 2026 (thinking mode). AI Agent was prompted to recreate the model directly from the publication PDF and generate an Excel model using the instruction: “Recreate an Excel model based on the attached publication.” A PDF file containing the full text of Grand et al. (2019) was provided as input. Because AI Agent is a non-deterministic system, repeated runs may yield different outputs. The exercise was therefore repeated five times, and the five runs produced five distinct outcomes.

    The initial aim was to see whether a good-quality model could be created within one hour. However, because some runs produced issues, and because replicability itself became an important part of the exercise, the total testing time was extended to approximately 2.5 hours across all attempts. The purpose of the repeated AI runs was therefore twofold:

    • To test feasibility within a short timeframe
    • To evaluate consistency and replicability across separate attempts

    In parallel, a reference model was built using Model Builder. This served as the comparator against which the AI Agent outputs were assessed.

    5. Overview of results

    5.1 Summary table

    6. Characterizing the models

    To characterize the distinct behaviour observed across runs, we assigned each AI Agent output a label reflecting its principal methodological features. Together, the labels illustrate the range of outputs one can expect from any “random walk” with an AI tool prompt.

    6.1 Deceity — Try 1

    Try 1 (Deceity) appeared credible and reasonably close to the target at first glance, giving the impression of a sound reconstruction. It produced a functioning model in only about 16 minutes, included marked input cells, discounting, and half-cycle correction, and did not reveal obvious errors on an initial review.

    However, the incremental QALY gain was substantially higher than in the publication, and later stress testing showed that some costs were not properly linked to state occupancy. As a result, the model looked more reliable than it actually was. The name reflects this misleading first impression: a model that appeared convincing on the surface, but whose weaknesses only became apparent on closer inspection.
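
The kind of stress test that exposes this weakness can be sketched as an extreme-value check: place the whole cohort in the cheapest state, then in the costliest, and confirm that total costs move. The two cost functions below are hypothetical stand-ins (one correctly linked, one deliberately flawed with a hard-coded cost); they are not taken from any of the AI Agent workbooks, and the cost values are placeholders.

```python
COSTS = [50.0, 120.0, 300.0, 650.0]  # hypothetical cost per state per cycle


def total_cost_linked(occupancy_by_cycle, cost_per_state):
    """Costs driven by state occupancy: the behaviour a sound model should have."""
    return sum(
        sum(share * cost for share, cost in zip(occupancy, cost_per_state))
        for occupancy in occupancy_by_cycle
    )


def total_cost_unlinked(occupancy_by_cycle, cost_per_state, fixed_cost_per_cycle=200.0):
    """A deliberately flawed variant: a hard-coded cost that ignores occupancy."""
    return fixed_cost_per_cycle * len(occupancy_by_cycle)


def responds_to_occupancy(cost_function, cost_per_state, n_cycles=4):
    """Extreme-value test: park the cohort in the cheapest, then the costliest state.

    If costs are linked to occupancy, the second total must exceed the first.
    """
    cheapest = cost_per_state.index(min(cost_per_state))
    costliest = cost_per_state.index(max(cost_per_state))

    def parked(state):
        row = [0.0] * len(cost_per_state)
        row[state] = 1.0
        return [row] * n_cycles

    low = cost_function(parked(cheapest), cost_per_state)
    high = cost_function(parked(costliest), cost_per_state)
    return high > low


print(responds_to_occupancy(total_cost_linked, COSTS))    # linked costs react to the change
print(responds_to_occupancy(total_cost_unlinked, COSTS))  # hard-coded costs do not
```

In the base case, both variants can produce plausible totals. Only the extreme scenario separates them, which matches the observation that the flaw was invisible on initial review.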

    6.2 Sycophanty — Try 2

    Try 2 (Sycophanty) told us what we wanted to hear.

    This version reproduced the publication results almost exactly, but it did so by calibrating missing inputs to match the target outputs. In particular, it assumed a baseline distribution including approximately 25% of patients in the “no pain” state, which is not clinically plausible in a population defined as having endometriosis-related pain.

    This raises an important methodological concern. Although the objective in an exercise of this kind is clearly to recreate the model and obtain results as close as possible to those reported in the publication, one would not normally expect an exact numerical match when working solely from the paper. Published articles rarely provide every modeling assumption in full detail. For example, even if mortality is included, the original analysts may have relied on life tables from a different year, an earlier data source, national rather than regional estimates, different compliance assumptions, or other practical conventions that are not fully documented. Even relatively small differences of this sort would normally be expected to generate some divergence in the results.

    An exact match in such circumstances is therefore not necessarily a sign of methodological success. In this case, AI Agent effectively moved away from reconstructing the original model from reported inputs and instead recreated the missing inputs so that the outputs would align perfectly. That is not what this exercise was intended to test. It is methodologically flawed because it prioritizes agreement with the result over fidelity to the original evidence base.
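
Why an exact match is so easy to manufacture can be shown with a small sketch. In a Markov cohort model, total outcomes are linear in the baseline distribution, so a single unreported input is enough to hit any target that lies between the extremes. The transition matrix, utilities, horizon, and the "published" target below are all hypothetical, and the function names are illustrative; this is not a reconstruction of what AI Agent did internally.

```python
# Hypothetical inputs for illustration only; not values from the publication.
TRANSITIONS = [
    [0.70, 0.20, 0.08, 0.02],
    [0.25, 0.50, 0.20, 0.05],
    [0.10, 0.30, 0.45, 0.15],
    [0.05, 0.15, 0.30, 0.50],
]
UTILITY = [0.85, 0.75, 0.60, 0.40]


def total_from_state(start, transitions, utility, n_cycles):
    """Undiscounted utility total for a cohort starting entirely in one state."""
    occupancy = [0.0] * len(utility)
    occupancy[start] = 1.0
    total = 0.0
    for _ in range(n_cycles):
        total += sum(o * u for o, u in zip(occupancy, utility))
        occupancy = [
            sum(occupancy[i] * transitions[i][j] for i in range(len(occupancy)))
            for j in range(len(occupancy))
        ]
    return total


def calibrate_no_pain_share(target, per_state_totals):
    """Solve for the 'no pain' baseline share that reproduces `target` exactly.

    The remaining cohort is split equally across the other states. Because the
    total is linear in the baseline distribution, the solution is closed-form.
    """
    q_no_pain = per_state_totals[0]
    others = per_state_totals[1:]
    q_rest = sum(others) / len(others)
    return (target - q_rest) / (q_no_pain - q_rest)


per_state = [total_from_state(s, TRANSITIONS, UTILITY, 12) for s in range(4)]
target = 8.4  # hypothetical "published" total to be matched
share = calibrate_no_pain_share(target, per_state)
calibrated_baseline = [share] + [(1 - share) / 3] * 3
reproduced = sum(b * q for b, q in zip(calibrated_baseline, per_state))
print(f"no-pain share needed: {share:.1%}, reproduced total: {reproduced:.4f}")
```

The reproduced total equals the target by construction, regardless of whether the implied baseline is clinically plausible. Agreement obtained this way says nothing about whether the rest of the model is right.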

    6.3 Tardy — Try 3

    Try 3 (Tardy) consumed more than 30 minutes and still failed to deliver any usable output. It returned a run-time error and no model was created. In a practical setting, this matters. A tool that appears fast in principle but sometimes fails completely cannot be relied upon in a production workflow without time buffers and rework.

    6.4 Incompletey — Try 4

    Try 4 (Incompletey) generated an output that was not complete enough to support interpretation. This attempt, generated in around 25 minutes, failed to extract the utilities and therefore could not calculate total QALYs. It also produced higher oral contraceptive costs than expected. Because a central outcome measure was missing, the model could not be treated as a reliable recreation.

    6.5 Admirably — Try 5

    Try 5 (Admirably) was the strongest of the AI Agent attempts. It was produced in around 27 minutes and made a more defensible baseline assumption:

    • 0% no pain
    • 33% mild pain
    • 33% moderate pain
    • 33% severe pain

    This version also had the clearest and most user-friendly structure among the AI Agent outputs, with better separation of assumptions, traces, and checks. Even so, its cost results remained different from the publication, it did not include PSA by default, and it still represented only an early model rather than an HTA-ready product. The name reflects that it performed admirably relative to the other AI attempts, not that it fully met HTA standards.

    7. Detailed findings

    7.1 AI Agent generated early models quickly

    With the exception of the failed run, each AI Agent attempt generated some form of Excel model in roughly 16 to 35 minutes. This suggests that AI Agent can indeed produce a first-pass model shell from a publication within about one hour.

    That is a meaningful result. For early exploration, prototyping, or extraction support, this is potentially useful.

    7.2 Replicability was poor

    Across the five attempts, AI Agent produced:

    • Different layouts (usually static, not user-friendly)
    • Different sheet structures
    • Different assumptions
    • Different levels of completeness
    • Materially different results

    Thus, even if a single run appears promising, one cannot assume that the same prompt will produce a comparable result the next time. This lack of reproducibility is a major limitation.

    7.3 None of the AI Agent outputs was HTA-ready

    This point is important. Regardless of the relative quality of the individual attempts, none of the AI Agent-generated models could be considered ready for HTA use.

    At best, they were early model drafts. Even the best-performing version would have required substantial further work, including:

    • Detailed QC of formulas and linkages
    • Review and justification of all assumptions
    • Refinement of the structure
    • Robust scenario functionality
    • Sensitivity analyses
    • Better reporting
    • Improved usability and user-friendliness
    • Other features typically expected by HTA agencies

    In other words, the outputs should all be treated with caution. The models may be useful as starting points, but they do not provide robust submission-ready evidence.

    7.4 Quality control was necessary every time

    Each AI Agent model required review. This included checking:

    • Assumptions
    • Extraction accuracy
    • Formulas
    • Links between sheets
    • Behaviour in extreme scenarios
    • General structural logic

    This means any time savings from automatic generation are partly offset by the need for careful QC. In high-stakes contexts, this review is not optional.
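
Some of these checks are mechanical and can be scripted once the inputs have been read out of a workbook. The sketch below covers only the simplest structural checks (probabilities within bounds, rows summing to one, matching dimensions); the function names and the example inputs are hypothetical, and it does not replace a methodological review of assumptions or extraction accuracy.

```python
def check_probability_vector(values, label, tol=1e-9):
    """Return a list of problems found in one probability vector."""
    problems = []
    if any(v < 0 or v > 1 for v in values):
        problems.append(f"{label}: value outside [0, 1]")
    if abs(sum(values) - 1.0) > tol:
        problems.append(f"{label}: sums to {sum(values):.4f}, expected 1")
    return problems


def check_model_inputs(baseline, transitions, tol=1e-9):
    """Run basic structural checks on a baseline distribution and transition matrix."""
    problems = check_probability_vector(baseline, "baseline distribution", tol)
    if len(transitions) != len(baseline):
        problems.append("transition matrix: row count does not match number of states")
    for i, row in enumerate(transitions):
        if len(row) != len(baseline):
            problems.append(f"transition row {i}: wrong number of columns")
        problems.extend(check_probability_vector(row, f"transition row {i}", tol))
    return problems


# Hypothetical example inputs, not taken from any of the models discussed here.
example_transitions = [
    [0.70, 0.20, 0.08, 0.02],
    [0.25, 0.50, 0.20, 0.05],
    [0.10, 0.30, 0.45, 0.15],
    [0.05, 0.15, 0.30, 0.50],
]
clean = check_model_inputs([0.0, 1 / 3, 1 / 3, 1 / 3], example_transitions)
rounded = check_model_inputs([0.0, 0.33, 0.33, 0.33], example_transitions)
print(clean)    # no problems expected
print(rounded)  # a baseline entered as rounded percentages no longer sums to 1
```

Checks like these catch typing and rounding slips but not silent changes in assumptions, so they shorten the QC step rather than remove it.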

    Another important challenge is that, although Admirably proved to be the strongest of the five attempts, there is no reliable way to know at the outset which version AI Agent has actually produced in any given run. A user may receive a model that appears well structured and plausible, but without careful quality control and methodological review it may be impossible to determine whether it is genuinely the “best” version or one that contains hidden weaknesses, incomplete extraction, or flawed assumptions. In practice, this means that every AI-generated model requires thorough review and validation. The issue is not only that quality varies across runs, but that this variation is not immediately visible, making careful QC essential every single time.

    7.5 The exact-match problem is especially important

    Among all findings, the most concerning was the behaviour seen in Try 2, Sycophanty.

    A user could easily interpret the exact match to the publication as a strong success. In reality, it reflected calibration of missing inputs to achieve the target result. Without careful review, this could go unnoticed. That creates a risk of false confidence: the model appears validated by agreement, when in fact its internal assumptions have drifted away from a defensible reconstruction.

    8. Model Builder as a reference

    Using Model Builder, it took around 30 minutes to create a shell and update the inputs. This process was supported by an input extraction file prepared separately for convenience, which took around a further 30 minutes.

    The resulting model delivered:

    • QALY results broadly aligned with the publication
    • Same cost discrepancy as seen in AI Agent models, likely driven by reporting gaps in the paper
    • A stable and reproducible structure
    • A user-friendly layout
    • Tested and validated mechanics
    • Confidence that the calculations were behaving as intended
    • Scenario analysis, one-way sensitivity analysis, and probabilistic sensitivity analysis already included

    Importantly, Model Builder produced the same results every time, provided the same inputs were used.

    This makes it a more appropriate reference approach for robust modelling. It offers control, consistency, and a platform that can be extended into the full set of analyses typically required in HTA.

    9. Summary and implications

    This experiment suggests a balanced conclusion.

    AI Agent can be useful in health economic modelling, particularly for:

    • Extracting inputs from publications
    • Helping create an initial shell
    • Accelerating early exploration

    AI can be a useful tool for early-stage tasks, including input extraction, rapid prototyping, and supporting initial model development.

    However, these benefits are accompanied by important limitations. Across repeated runs, AI Agent produced models that differed in structure, assumptions, and results. The outputs were not reproducible, and key assumptions were sometimes introduced without clear justification or visibility. Errors were not always apparent in the base case and only became visible under closer inspection or stress testing. Notably, a numerically perfect match to the publication may reflect a flawed reconstruction rather than a sound one, for example when missing inputs are calibrated to force agreement with reported results.

    For HTA, this matters enormously. What is needed is not just a plausible model, but a model whose logic is controlled, whose assumptions are defensible, whose outputs are reproducible, and whose structure can support the wide range of analyses required by payers and HTA bodies.

    None of the AI Agent-generated models in this experiment can be considered HTA- or payer-ready. At best, they represent early drafts that require extensive quality control, expert review, and substantial further development. Additional work would be needed to validate assumptions, correct structural issues, and implement scenario analyses, sensitivity analyses, reporting tools, and user-friendly functionality required for formal submissions.

    In contrast, the Model Builder approach provided a stable, reproducible, and transparent reference model. It offers full control over assumptions, consistency of results, and a validated structure that can be readily extended into a submission-ready model. This approach is further supported by validation across more than seven years of application and over 250 payer submissions. This distinction is fundamental in HTA, where the credibility of evidence depends not only on results, but on the ability to understand, reproduce, and defend them.

    Taken together, these findings suggest that AI Agent is not currently a replacement for structured model-building technology. It should instead be viewed as a support tool within the modelling workflow rather than a standalone solution for HTA modelling. For robust, transparent, and submission-ready models, structured modelling tools and methodological oversight remain essential.

    10. Conclusion

    This experiment examined whether AI Agent could deliver competent and reliable evidence in an economic model, on a par with Model Builder, within a time-bounded (one-hour) period.

    The answer is that AI Agent can often create a very early model draft within that timeframe, but not a competent, reliable, and reproducible HTA- or payer-ready model.

    None of the AI Agent models could be viewed as directly suitable for HTA use. All required caution, extensive QC, expert review, and substantial further development. Additional work would then be needed to build the scenarios, sensitivity analyses, reporting tools, and user-friendly functionality expected in formal submissions.

    By contrast, the Model Builder reference model provided transparency, reproducibility, methodological control, and a stronger basis for reliable decision modelling, supported by the validation of outputs from over 7 years of work and 250+ payer submissions.

    AI clearly has an important role to play. But, at present, that role is best understood as supporting model development rather than replacing robust model-building workflows.

    11. Discussion

    A publication can be turned into something that looks like a model in less than an hour using advanced technologies.

    The conundrum is whether that model is one that you can trust.

    Each time AI is engaged there is an element of a “random walk”: the steps taken (e.g. prompts, clarifications, inputs, and outputs) are unpredictable because each step occurs independently of the previous one.

    A structured, validated tool such as Model Builder invites an equally structured input process, resulting in a consistent, replicable output.

    It is this central element that determines the difference between an exploratory model and a robust HTA model.

     

    Meet The Author

    Ewa Dlotko

    Technical Director, Health Economics

    HEOR Director with over 15 years of experience in health economic modelling. Brings deep expertise across HTA submissions, disease and economic modelling, and evidence synthesis. Experienced in systematic reviews, data search and analysis, and leading projects that support evidence-based decision-making across healthcare and life sciences.
