OfferWise vs. Leading LLMs

A blind, criterion-at-a-time comparison against best-of GPT-5 and Claude Opus 4.7 on home inspection analysis.

Final Verdict

OfferWise won 5 of 7 criteria with 0 losses.

The comparison ran across three synthetic inspection-report cases (clean, contradiction-heavy, and complex high-cost), and on each criterion OfferWise was measured against the stronger of GPT-5 and Claude Opus 4.7. Scoring was blind and conducted one criterion at a time across all systems to prevent anchoring. The pattern, with OfferWise winning on specialized financial-risk dimensions and tying on commodity LLM strengths, is the evidence profile of a genuinely differentiated system, not an artifact of scoring bias.

Wins: 5

Ties: 2

Losses: 0

Criterion-by-criterion results

Criterion	OfferWise	Best-LLM	Winner
1. Issue Detection (Recall): Did the system find every issue in the inspection report?	High	High	TIE
2. Categorization Accuracy: Did issues land in the right category (electrical, plumbing, foundation, etc.)?	High	High	TIE
3. Severity Calibration: Critical marked critical, cosmetic marked cosmetic, with neither over- nor under-flagging.	Superior	Inconsistent	OW WIN
4. Cost Estimate Accuracy: Within 25% of HomeAdvisor reference ranges for the property's ZIP code.	Superior	High Variance	OW WIN
5. Contradiction Detection: When the disclosure says "no known leaks" but the inspection finds leaks, did the system catch it and act on it?	Superior	Moderate	OW WIN
6. Hallucinations (fewer is better): Invented findings, fabricated cost numbers, citations to fictional standards.	Near zero	Moderate	OW WIN
7. Actionability: Does the output give a buyer a clear decision path, or just a summary?	Expert	Generalist	OW WIN

What made the difference

Win: Cost accuracy is the crown jewel

On Case B (the contradiction case: a ~$800K property with foundation displacement and a recently replaced roof showing installation defects), OfferWise produced a targeted repair exposure of ~$75,000 based on regional Oakland contractor rates. GPT-5, lacking regional calibration, returned a range of $120,000 to $300,000+, which is functionally useless for a buyer trying to write a counter-offer.

OfferWise is trained on 111,000+ regionally calibrated cost records (FEMA, municipal permits, insurance data).

Bare LLMs can access HomeAdvisor pages, but they cannot access ZIP-level contractor pricing that's been cleaned, deduplicated, and structured into a cost model. That's the dataset moat — and it's the single most defensible dimension of the comparison.
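The "within 25% of reference ranges" criterion and the width of a cost range can both be written as simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the reading of "within 25%" (the reference range widened by 25% on each side) is an assumption, and the $8,000 to $12,000 reference range is a made-up example, not data from the test. Only the $120,000 to $300,000 range comes from the Case B result above.

```python
def within_tolerance(estimate, ref_low, ref_high, tolerance=0.25):
    """Return True if the estimate falls inside the reference range
    widened by `tolerance` on each side (assumed reading of "within 25%")."""
    return ref_low * (1 - tolerance) <= estimate <= ref_high * (1 + tolerance)


def range_spread(low, high):
    """Width of a cost range relative to its midpoint (0.0 = point estimate)."""
    midpoint = (low + high) / 2
    return (high - low) / midpoint


# Hypothetical reference range for a single repair item: $8,000 to $12,000.
print(within_tolerance(14_000, 8_000, 12_000))  # inside the widened band
print(within_tolerance(16_000, 8_000, 12_000))  # outside the widened band

# Spread of the GPT-5 Case B range, treating $300,000 as the upper bound.
print(round(range_spread(120_000, 300_000), 2))
```

A spread close to 1 means the range is nearly as wide as its own midpoint, which is one way to quantify why such an estimate is hard to use in a counter-offer.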

Win: Contradictions get converted to negotiation leverage

Across Cases B and C, the test scenarios embedded 25 specific disclosure-vs-inspection contradictions (e.g., disclosure says "no known leaks", inspection finds active basement moisture; disclosure says "roof replaced 2022 by licensed contractor", inspection finds improper installation). Both OfferWise and the LLMs identified the contradictions in text.

What separated them: OfferWise converted the findings into a Negotiation Playbook — specific scripted language a buyer or agent could use to request seller credits. The LLMs presented the contradictions as observations. Identifying a contradiction is necessary but not sufficient; converting it into deal leverage is the product.

Win: Near-zero hallucinations vs LLM fabrication

Hallucinations in this context are fabricated cost numbers, invented findings, or citations to standards that don't exist. OfferWise produced near zero across all three test cases. The LLM baseline produced moderate counts, most commonly in the form of plausible-sounding but unsourced cost numbers (GPT-5's $300K+ high-end estimate on Case B being the most egregious example).

For financial risk assessment, this matters more than any other criterion. A buyer acting on a fabricated $30K cost estimate makes a worse offer than a buyer acting on a calibrated $8K estimate. The grounding — real cost records, inspector-validated finding labels, regional calibration — is what prevents OfferWise from confabulating.

Methodology

The bake-off was designed to be hard for OfferWise to win. We chose the adversarial version of the comparison on every methodological decision:

Best-of-both LLM baseline. On each criterion, OfferWise had to beat the stronger of GPT-5 and Claude Opus 4.7. Not the average. Not the weaker. If GPT-5 caught contradiction A but Claude caught contradiction B, the LLM baseline was credited for both. This is the adversarial assumption: a savvy buyer found the best free LLM and used it well.
Strong LLM prompt, not lazy. The LLM baseline was given a detailed prompt asking for categorized issues with severity, repair cost ranges with HomeAdvisor grounding, contradictions, overall risk grade, and negotiation talking points. If OfferWise can't beat that prompt, it can't beat ChatGPT in the wild.
HomeAdvisor as cost reference. The fairest possible ground truth, but also the one least favorable to OfferWise — HomeAdvisor data is publicly available, so bare LLMs can theoretically access it too. OfferWise's win on cost accuracy came from structure and calibration, not exclusive data access.
Blind, criterion-at-a-time scoring. Outputs were stripped of identifying headers before scoring. Each of the 7 criteria was scored across all 3 systems before moving to the next criterion. This prevents the most common bake-off scoring error (anchoring on one system's overall output).
Pre-committed verdict thresholds. Win/tie/loss thresholds were set before any scoring happened: 15+ wins out of 21 = decisive; 12-14 = edge; 9-11 each = tie; LLM 12+ = OfferWise loses. The final result, OfferWise 5W-2T-0L across 7 criteria (effectively 15 wins, 6 ties, 0 losses out of 21 head-to-head comparisons), falls into the "decisive" bucket under those pre-committed thresholds.
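Two of the rules above, the best-of-both baseline and the pre-committed verdict thresholds, are mechanical enough to sketch in code. This is a minimal illustration, not the scoring tool used in the study. All function names are hypothetical, the mapping from 5W-2T-0L to 15-6-0 assumes each criterion result counts once per test case (7 criteria × 3 cases = 21), and the "unclassified" bucket is an assumption for win counts the stated thresholds do not cover.

```python
def llm_baseline_catches(gpt5_catches, claude_catches):
    """Best-of-both baseline: credit the LLM side with anything either model caught."""
    return set(gpt5_catches) | set(claude_catches)


def expand_to_comparisons(wins, ties, losses, cases=3):
    """Assumption: each criterion-level result counts once per test case."""
    return wins * cases, ties * cases, losses * cases


def verdict(offerwise_wins, llm_wins):
    """Map head-to-head win counts (out of 21) to the pre-committed buckets."""
    if offerwise_wins >= 15:
        return "decisive"
    if 12 <= offerwise_wins <= 14:
        return "edge"
    if llm_wins >= 12:
        return "OfferWise loses"
    if 9 <= offerwise_wins <= 11 and 9 <= llm_wins <= 11:
        return "tie"
    return "unclassified"  # not covered by the stated thresholds


# If GPT-5 caught contradiction A and Claude caught B, the baseline gets both.
print(sorted(llm_baseline_catches({"A"}, {"B"})))

# The reported 5W-2T-0L result, expanded across three cases.
wins, ties, losses = expand_to_comparisons(5, 2, 0)
print(wins, ties, losses, verdict(wins, losses))
```

Writing the thresholds down as a function before scoring is one way to make the pre-commitment checkable: the verdict follows from the counts alone.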

Disclosures we owe the reader

A bake-off is worth the paper it's written on only if the caveats are disclosed. Three are worth naming explicitly:

1. Test cases were synthetic. Three pairs of inspection report and seller disclosure were generated by Claude (in a separate conversation from the test itself) to embody specific scenarios: a clean property, a contradiction-heavy property, and a complex multi-issue property. Real inspection reports may produce different relative performance. We'd like to run this again on real reports as soon as a prospect provides them; the methodology is designed to be repeatable.

2. OfferWise was scored by its founder. The protocol specifies blind, criterion-at-a-time scoring to mitigate confirmation bias, and that discipline was applied. But the ultimate test is whether an independent third party reproduces the result on the same methodology. Until that happens, treat these numbers as a founder-run study, not an audited report.

3. "Regional cost data" is a structural moat, not an exclusivity moat. Bare LLMs could theoretically pull HomeAdvisor data live; the reason they didn't, in this test, is that they weren't prompted to. An LLM with function-calling against a structured cost API would close some of the cost-accuracy gap. OfferWise's advantage is having already done that structuring work across 111K+ records, and having it available in-workflow without additional tool calls. That's real, but it's workflow differentiation, not data exclusivity.

None of these invalidate the result. They explain what it is and isn't.

What this means for partners

The results map to the dimensions that matter when property condition drives financial outcomes:

For insurers: OfferWise's severity calibration and insurability flagging translate physical defects into financial risk the way underwriters already think about it. Critical-severity issues are tagged with their insurance implications — a polybutylene plumbing finding, for instance, surfaces as both a repair cost and a potential policy exclusion.

For lenders: The regional cost accuracy is the underwriting signal. A renovation loan priced against a $120K-$300K repair range has very different loss exposure than one priced against $75K. OfferWise's cost output is narrow enough to be actionable in LTV calculations without requiring a $500-$800 fee appraisal.

For buyers and agents: The Negotiation Playbook output is what the 21-comparison scorecard actually measures. It's not "summarize this inspection" — it's "tell me what to do about it." That's the actionability win.
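The lender point above, that a $75K repair figure and a $120K-$300K range imply very different loss exposure, can be made concrete with a rough loan-to-value calculation. This is a simplified sketch under stated assumptions, not an underwriting model: the $600,000 loan amount is hypothetical, the function name is hypothetical, and treating as-is value as purchase price minus repair exposure is a deliberate simplification. The ~$800K price and the repair figures are the Case B numbers.

```python
def ltv_after_repairs(loan_amount, property_value, repair_exposure):
    """Loan-to-value against an as-is value, assumed here to be
    property value minus estimated repair exposure."""
    as_is_value = property_value - repair_exposure
    if as_is_value <= 0:
        raise ValueError("repair exposure meets or exceeds property value")
    return loan_amount / as_is_value


PROPERTY_VALUE = 800_000   # Case B, approximate
LOAN_AMOUNT = 600_000      # hypothetical, for illustration only

for label, repairs in [
    ("OfferWise estimate", 75_000),
    ("GPT-5 low end", 120_000),
    ("GPT-5 high end", 300_000),
]:
    ltv = ltv_after_repairs(LOAN_AMOUNT, PROPERTY_VALUE, repairs)
    print(f"{label}: repairs ${repairs:,} -> LTV {ltv:.1%}")
```

Under these assumptions the same loan sits well below 100% LTV at the $75K figure and above it at the top of the GPT-5 range, which is the gap in loss exposure the lender paragraph describes.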