A blind, criterion-at-a-time comparison of OfferWise against the best result from GPT-5 and Claude Opus 4.7 on home inspection analysis.
| Criterion | OfferWise | Best-LLM | Winner |
|---|---|---|---|
| 1. Issue Detection (Recall): Did the system find every issue in the inspection report? | High | High | TIE |
| 2. Categorization Accuracy: Did issues land in the right category (electrical, plumbing, foundation, etc.)? | High | High | TIE |
| 3. Severity Calibration: Critical marked critical, cosmetic marked cosmetic, with neither over- nor under-flagging. | Superior | Inconsistent | OW WIN |
| 4. Cost Estimate Accuracy: Within 25% of HomeAdvisor reference ranges for the property's ZIP code. | Superior | High Variance | OW WIN |
| 5. Contradiction Detection: Disclosure says "no known leaks", inspection finds leaks. Did the system catch it and turn it into an action? | Superior | Moderate | OW WIN |
| 6. Hallucinations (fewer is better): Invented findings, fabricated cost numbers, citations to fictional standards. | Near zero | Moderate | OW WIN |
| 7. Actionability: Does the output give a buyer a clear decision path, or just a summary? | Expert | Generalist | OW WIN |
On Case B (the contradiction case, a ~$800K property with foundation displacement and a recently replaced roof showing installation defects), OfferWise produced a targeted repair exposure of ~$75,000 based on regional Oakland contractor rates. GPT-5, lacking regional calibration, returned a range of $120,000 to $300,000+, which is too wide to be useful to a buyer trying to write a counter-offer.
Bare LLMs can access HomeAdvisor pages, but they cannot access ZIP-level contractor pricing that's been cleaned, deduplicated, and structured into a cost model. That's the dataset moat — and it's the single most defensible dimension of the comparison.
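The "within 25% of reference ranges" criterion can be made concrete with a small check. The sketch below is one possible reading, not OfferWise's scoring code: it assumes the reference band is widened by 25% on each side, and the `ReferenceRange` record, the function names, and the roof figures are all hypothetical. Only the $120,000 to $300,000 range comes from the Case B discussion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceRange:
    """Reference repair-cost range for one issue in one ZIP code (hypothetical record)."""
    low: float
    high: float


def within_tolerance(estimate: float, ref: ReferenceRange, tolerance: float = 0.25) -> bool:
    """Return True if the estimate falls inside the widened reference band.

    Assumed reading of criterion 4: the band is widened by `tolerance` on
    each side, so the accepted interval is [low * 0.75, high * 1.25].
    """
    if ref.low > ref.high:
        raise ValueError("reference range has low > high")
    return ref.low * (1 - tolerance) <= estimate <= ref.high * (1 + tolerance)


def range_spread(low: float, high: float) -> float:
    """Ratio of the high end to the low end; 1.0 means a point estimate."""
    return high / low


# Illustrative numbers only; not taken from any real cost dataset.
roof_ref = ReferenceRange(low=10_000, high=16_000)
print(within_tolerance(12_500, roof_ref))  # inside the reference range itself
print(within_tolerance(19_000, roof_ref))  # inside the widened band (up to 20,000)
print(within_tolerance(30_000, roof_ref))  # outside the widened band
print(range_spread(120_000, 300_000))      # spread of the Case B LLM range
```

The spread ratio is a rough way to express why a wide range is hard to act on: the high end of the Case B LLM range is 2.5 times its low end, before counting the open-ended "+".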
Across Cases B and C, the test scenarios embedded 25 specific disclosure-vs-inspection contradictions (e.g., disclosure says "no known leaks", inspection finds active basement moisture; disclosure says "roof replaced 2022 by licensed contractor", inspection finds improper installation). Both OfferWise and the LLMs identified the contradictions in text.
What separated them: OfferWise converted the findings into a Negotiation Playbook — specific scripted language a buyer or agent could use to request seller credits. The LLMs presented the contradictions as observations. Identifying a contradiction is necessary but not sufficient; converting it into deal leverage is the product.
Hallucinations in this context are fabricated cost numbers, invented findings, or citations to standards that don't exist. OfferWise produced near zero across all three test cases. The LLM baseline produced moderate counts, most commonly in the form of plausible-sounding but unsourced cost numbers (GPT-5 Case B's $300K+ high-end estimate being the most egregious example).
For financial risk assessment, this matters more than any other criterion. A buyer acting on a fabricated $30K cost estimate makes a worse offer than a buyer acting on a calibrated $8K estimate. The grounding — real cost records, inspector-validated finding labels, regional calibration — is what prevents OfferWise from confabulating.
The bake-off was designed to be hard for OfferWise to win. We chose the adversarial version of the comparison on every methodological decision:
The verdict thresholds were committed to in advance, over 21 head-to-head comparisons (7 criteria across 3 test cases): 15 or more OfferWise wins is decisive, 12-14 is an edge, 9-11 each is a tie, and 12 or more LLM wins means OfferWise loses. The final result, OfferWise 5W-2T-0L across 7 criteria (effectively 15 wins, 6 ties, 0 losses out of 21 head-to-head comparisons), falls into the "decisive" bucket by a pre-committed threshold.
A bake-off is worth the paper it's written on only if the caveats are disclosed. Three are worth naming explicitly:
1. Test cases were synthetic. Three inspection report + seller disclosure pairs were generated by Claude (in a separate conversation from the test itself) to embody specific scenarios — a clean property, a contradiction-heavy property, and a complex multi-issue property. Real inspection reports may produce different relative performance. We'd like to run this again on real reports as soon as a prospect provides them; the methodology is designed to be repeatable.
2. OfferWise was scored by its founder. The protocol specifies blind, criterion-at-a-time scoring to mitigate confirmation bias, and that discipline was applied. But the ultimate test is whether an independent third party reproduces the result on the same methodology. Until that happens, treat these numbers as a founder-run study, not an audited report.
3. "Regional cost data" is a structural moat, not an exclusivity moat. Bare LLMs could theoretically pull HomeAdvisor data live; the reason they didn't, in this test, is that they weren't prompted to. An LLM with function-calling against a structured cost API would close some of the cost-accuracy gap. OfferWise's advantage is having already done that structuring work across 111K+ records, and having it available in-workflow without additional tool calls. That's real, but it's workflow differentiation, not data exclusivity.
None of these invalidate the result. They explain what it is and isn't.
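For anyone rerunning the comparison, the tally and the pre-committed verdict thresholds can be sketched in a few lines. This is an illustrative reconstruction, not the scoring script used in the study: the function names are hypothetical, the expansion to 21 comparisons assumes each criterion's outcome held in all three cases (which is how the 15-6-0 figure is derived), and score combinations the thresholds do not mention are treated as a tie.

```python
from collections import Counter

CASES = 3  # the write-up describes three test cases

# Per-criterion outcomes as reported in the scorecard ("OW" = OfferWise win).
SCORECARD = {
    "Issue Detection": "TIE",
    "Categorization Accuracy": "TIE",
    "Severity Calibration": "OW",
    "Cost Estimate Accuracy": "OW",
    "Contradiction Detection": "OW",
    "Hallucinations": "OW",
    "Actionability": "OW",
}


def expand_to_comparisons(scorecard: dict[str, str], cases: int = CASES) -> Counter:
    """Count head-to-head comparisons, assuming each outcome held in every case."""
    tally = Counter()
    for outcome in scorecard.values():
        tally[outcome] += cases
    return tally


def verdict(ow_wins: int, llm_wins: int) -> str:
    """Apply the pre-committed thresholds for 21 comparisons."""
    if ow_wins >= 15:
        return "decisive"
    if ow_wins >= 12:
        return "edge"
    if llm_wins >= 12:
        return "OfferWise loses"
    # Covers the stated "9-11 each" band; any other combination is left
    # unspecified by the thresholds and is treated as a tie here (assumption).
    return "tie"


tally = expand_to_comparisons(SCORECARD)
print(tally["OW"], tally["TIE"], tally["LLM"])  # 15 6 0
print(verdict(tally["OW"], tally["LLM"]))       # decisive
```

An independent scorer could replace `SCORECARD` with per-case outcomes rather than per-criterion ones, which would remove the assumption that every criterion was decided the same way in all three cases.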
The results map to the dimensions that matter when property condition drives financial outcomes:
For insurers: OfferWise's severity calibration and insurability flagging translate physical defects into financial risk the way underwriters already think about it. Critical-severity issues are tagged with their insurance implications — a polybutylene plumbing finding, for instance, surfaces as both a repair cost and a potential policy exclusion.
For lenders: The regional cost accuracy is the underwriting signal. A renovation loan priced against a $120K-$300K repair range has very different loss exposure than one priced against $75K. OfferWise's cost output is narrow enough to be actionable in LTV calculations without requiring a $500-$800 fee appraisal.
For buyers and agents: The Negotiation Playbook output is what the 21-comparison scorecard actually measures. It's not "summarize this inspection" — it's "tell me what to do about it." That's the actionability win.
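The lender point above can be illustrated with a rough loan-to-value calculation. This is a simplified sketch under stated assumptions, not an underwriting method: it nets the repair exposure against the property value to get an as-is value, the $600,000 loan amount is hypothetical, and the function name is made up for illustration. The ~$800K value and the three repair figures come from the Case B discussion.

```python
def loan_to_value(loan: float, property_value: float, repair_exposure: float) -> float:
    """LTV against an as-is value, taken here as value minus repair exposure.

    Netting repairs against value is a simplifying assumption for
    illustration, not an underwriting standard.
    """
    as_is_value = property_value - repair_exposure
    if as_is_value <= 0:
        raise ValueError("repair exposure meets or exceeds property value")
    return loan / as_is_value


PROPERTY_VALUE = 800_000  # approximate Case B value from the write-up
LOAN = 600_000            # hypothetical loan amount

for label, exposure in [
    ("targeted estimate", 75_000),
    ("wide range, low end", 120_000),
    ("wide range, high end", 300_000),
]:
    ltv = loan_to_value(LOAN, PROPERTY_VALUE, exposure)
    print(f"{label}: LTV = {ltv:.1%}")
```

Under these assumptions the same loan comes out at roughly 83% LTV on the targeted estimate, but anywhere from about 88% to 120% across the wide range, which is the difference between a single underwriting figure and an undecidable one.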
The methodology is repeatable. Send us a real inspection report and disclosure packet, and we'll run it head-to-head against the same LLM baseline — with you doing the scoring.