A Case Study in the Limits of Takeoff Automation, Andrew Ngo

While consulting for a steel frame manufacturer, I found that the one problem that kept coming up was the expensive takeoff workflow: the process of measuring areas, lengths and counts of materials off a set of construction drawings so they can be priced. This paper documents my attempt to automate takeoff for my client as part of a broader quote automation project, targeting an accuracy rate of >80% against the human takeoff. The target was selected as the minimum proof that the extractor could potentially be commercialised.

The central finding turned out to be a catch-22. The only input that reliably lets the model reproduce a full takeoff is a document called the takeoff breakdown, which contains the human's takeoff working and only exists once the work the model was supposed to automate has already been done. Even when the model was handed everything but the final numbers, the highest accuracy rate recorded was 77.5% ignoring labels and 68% with exact-label scoring (a measurement scores zero unless its label is correctly assigned). Both rates are well above the unaided baseline (extracting from the drawings alone), but still fall short of the target. However, the accuracy rate is not the important finding. The important finding is that relevant information lives outside the drawings and therefore cannot be extracted from them, creating a structural ceiling that blocks the model from achieving a commercially viable accuracy rate.

What I built and tested

This project involved building a vision-language model (VLM) extractor structured as a multi-stage pipeline (Claude Opus 4.7 + Sonnet 4.6). A VLM is a model that takes images (here, the drawings) and text instructions as input and replies in text. The extractor was ultimately tested on the drawings of five real multi-townhouse builds, from five different customers of my client, to extract measurements and labels. The final test compared the accuracy of raw extraction against extraction that was also loaded with a "convention spec". The convention spec is a reference document that compiles the recurring habits from each customer's past jobs, such as scoping decisions, customer-specific labelling conventions and decomposition patterns.

Across the four scoreable builds in the final convention spec test, accuracy ranged from ~40% to ~73%, with the convention spec giving at most a small and inconsistent lift. The spread was driven by differences between builds, not run-to-run noise. The fifth build in the testing set returned no scoreable canonical assemblies (a canonical assembly is a standard group of related line items) and was excluded from this range as an anomaly. Because the convention spec was built partly from the customers' own past jobs, it had an unfair advantage on the very builds it was tested against. A properly built, unbiased spec would score lower in absolute terms and would lift accuracy even less, which is why I treated the spec's effect as an upper bound and not as a clean measurement of how much it helps. These figures are also case study illustrations from a small sample and should not be taken as representative of the wider population of builds.

The final accuracy rate is not the key finding that should be taken away. Takeoffs cannot currently be automated as a commercial product if they are similar to the ones my client does, where customer conventions are non-standardised and drawings do not contain all required information. The finding does not claim that other takeoffs cannot be automated. The limitation documented in this paper is specific to takeoffs with non-standardised conventions and incomplete drawings.

The breakdown test

One of the most revealing tests was conducted separately from the final test and involved two builds from the final testing set. It used a document called the takeoff breakdown, which is a typically disorganised record of the human's takeoff work. The testing batch for this breakdown test was reduced to two builds due to funding limitations. The key features of a takeoff breakdown are:

1. Decomposition pattern: how the customer chooses to split line items per assembly
2. Labelling of each assembly
3. Multi-segment math: how each value of the assembly was calculated
4. In-scope identification: which regions of the architectural drawings are in the scope of the specific build
5. Final measurement numbers extracted by the human

It is important to note these features because the extractor cannot fully recover them by analysing the drawings alone.

Note the definitions for the three different test variants before continuing:

Baseline: extracts from the drawings alone with no breakdown
Full: extracts from the drawings and the breakdown (breakdown features 1-5)
Stripped: extracts from the drawings and the breakdown with final measurements redacted (breakdown features 1-4)

On build #1, the extractor recorded a 'baseline' accuracy rate of ~53%, which reflects an unaided attempt to extract labels and measurements from the drawings alone. When the extractor was aided with the full breakdown including the numbers, effectively giving it a convoluted route to the answers, it recorded a 'full' accuracy rate of ~81% (see the section 'How accuracy is scored in the experiment' for why it does not reach 100%). This is not a measure of what the extractor can produce on its own, because it had access to the human measurements. When the extractor was aided with the breakdown but with all quantity values redacted, it recorded a 'stripped' accuracy rate of ~68%. On build #2, the extractor recorded a stripped accuracy rate of ~60% and a full accuracy rate of ~80%. Each test variant was run three times per build; each recorded accuracy rate is the median of the three runs, which fell within a spread of 10 percentage points. Each build contributed around 22 scored assemblies from several hundred line items in the human breakdown, including working.
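The median-of-three reporting described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code used in the experiment: the function name summarise_runs, the max_spread parameter and the run values are all hypothetical, and the run values are made up for the example.

```python
from statistics import median


def summarise_runs(rates, max_spread=10.0):
    """Return (median, spread, within_spread) for repeated runs of one test variant.

    rates      -- accuracy rates in percent, one per run (hypothetical input)
    max_spread -- largest acceptable max-minus-min gap, in percentage points
    """
    spread = max(rates) - min(rates)
    return median(rates), spread, spread <= max_spread


# Made-up run results (%), NOT figures from the experiment.
runs = [30.0, 35.0, 38.0]
mid, spread, ok = summarise_runs(runs)
print(mid, spread, ok)
```

Reporting the median of three runs keeps a single unusually good or bad run from setting the headline number, while the spread check flags variants that are too noisy to compare.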

This reveals that the extractor performed best when it was effectively given the answers, in this case the full takeoff breakdown. Just as importantly, the stripped variant gave the extractor all available information apart from the final numbers and still did not reach the target accuracy rate, despite a meaningful increase over the baseline. Given that a takeoff breakdown only exists once a human has already completed the takeoff, an extractor like mine that depends on the breakdown is commercially non-viable for takeoffs of this kind, as it would need the very output it was meant to produce.

How accuracy is scored in the experiment

One might question why the extractor could not reach 100% accuracy with the breakdowns, so it is worth being explicit about how these accuracy rates are measured. A deliberate design decision was made to score only canonical wall, floor and roof assemblies and to exclude non-canonical items using a word list. A word list can be considered brittle, but it was judged a reasonable scoping choice for the experiment given the added impracticality of scoring non-canonical items.

Each scored assembly gets a score that measures how close the extracted measurement is to the human's number, with a floor of 0%, calculated as score = max(0, 1 - |extracted - human| / human). A measurement is only credited if its assembly label exactly matches the human's, so a correct number filed under a different label scores zero, with no partial credit for the number. Because the breakdown is convoluted and inconsistent, the model routinely labels assemblies differently from the human, zeroing otherwise correct measurements, which is why scores land below 100%. The following is a worked example of the scoring process using illustrative assemblies.

External Wall GF   human 65.67 m²   extracted 58.2 m²
  -> 1 - |58.2-65.67|/65.67 = 0.886   (88.6%)
Floor Truss        human 96.4 m²   extracted 96.4 m², labelled "Floor Panel"
  -> exact-label miss -> 0.0
Roof Truss Upper   human 42.0 m²   extracted 51.0 m²
  -> 1 - |51-42|/42 = 0.786   (78.6%)
Build score = mean(0.886, 0.0, 0.786, …)
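The worked example above can be reproduced with a minimal Python sketch of exact-label scoring. The names Assembly, assembly_score and build_score are hypothetical and introduced only for illustration, and the sketch assumes each human assembly has already been paired with the extracted assembly it is compared against; how the real extractor pairs them is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assembly:
    label: str    # assembly label, e.g. "External Wall GF"
    value: float  # measured quantity, e.g. an area in square metres


def assembly_score(human: Assembly, extracted: Optional[Assembly]) -> float:
    """Exact-label score: max(0, 1 - |extracted - human| / human)."""
    # A missing assembly or a label mismatch scores zero, with no partial credit.
    if extracted is None or extracted.label != human.label:
        return 0.0
    return max(0.0, 1.0 - abs(extracted.value - human.value) / human.value)


def build_score(pairs) -> float:
    """Mean of the per-assembly scores for one build."""
    scores = [assembly_score(human, extracted) for human, extracted in pairs]
    return sum(scores) / len(scores)


# The three illustrative assemblies from the worked example.
pairs = [
    (Assembly("External Wall GF", 65.67), Assembly("External Wall GF", 58.2)),
    (Assembly("Floor Truss", 96.4), Assembly("Floor Panel", 96.4)),
    (Assembly("Roof Truss Upper", 42.0), Assembly("Roof Truss Upper", 51.0)),
]
for human, extracted in pairs:
    print(human.label, round(assembly_score(human, extracted), 3))
print("mean over these three only:", round(build_score(pairs), 3))
```

The mean printed at the end covers only these three assemblies; a real build score averages over all of its scored assemblies. The sketch also makes the cost of a label miss visible: the Floor Truss measurement is exactly right, yet it contributes zero to the mean.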

Is the failure caused by inaccurate labelling or measurements?

Every test until this point used exact-label scoring, called the 'non-lenient' variation, where a measurement only counts if both its value and its label match the human's results, and an incorrectly assigned label scores zero. To test whether labelling inaccuracies are the dominant cause of the full test's shortfall, I ran a 'lenient' variation of the stripped and full tests that ignored labels completely and measured only the accuracy of the extracted measurements. The results were as follows:

Build and scoring	Stripped (final human takeoff redacted)	Full* (final human takeoff unredacted)
Build #1 non-lenient (exact-label scoring)	68%	81%
Build #1 lenient (labels ignored, measurement accuracy only)	77.5%	89.5%
Build #2 non-lenient	60%	80%
Build #2 lenient	71.7%	95.5%

* Not evidence that takeoff can be automated on a commercial scale as the extractor was provided with the human's measurement numbers.
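To make the table easier to read, the short Python sketch below computes two derived figures from it: the percentage points recovered when labels are ignored, and how far each lenient rate sits below the 80% target. The rates are copied from the table; the names results, label_gap and shortfall are hypothetical helpers introduced for this illustration.

```python
# Accuracy rates (%) copied from the table above.
results = {
    "Build #1": {
        "stripped": {"non_lenient": 68.0, "lenient": 77.5},
        "full": {"non_lenient": 81.0, "lenient": 89.5},
    },
    "Build #2": {
        "stripped": {"non_lenient": 60.0, "lenient": 71.7},
        "full": {"non_lenient": 80.0, "lenient": 95.5},
    },
}
TARGET = 80.0  # target accuracy rate (%) used throughout the paper


def label_gap(cell):
    """Percentage points recovered when labels are ignored."""
    return round(cell["lenient"] - cell["non_lenient"], 1)


def shortfall(rate):
    """Percentage points by which a rate misses the target (0 if it reaches it)."""
    return round(max(0.0, TARGET - rate), 1)


for build, variants in results.items():
    for variant, cell in variants.items():
        print(build, variant,
              "label gap:", label_gap(cell),
              "lenient shortfall:", shortfall(cell["lenient"]))
```

On these figures, ignoring labels recovers between 8.5 and 15.5 percentage points, while the lenient stripped rates still miss the target by 2.5 points on build #1 and 8.3 points on build #2.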

In the full variant, the significant increase in accuracy from the non-lenient to the lenient test is a strong indication that labelling inaccuracies were a major factor in why the extractor struggled to replicate the human takeoff despite having the breakdown. There remains the possibility that my engineering is an underlying cause of the labelling inaccuracies. The lenient test accounts for this directly: because it ignores labels entirely, poor engineering that results in labelling errors cannot drag its results down. Yet the lenient stripped test was still unable to clear the target accuracy rate. This implies that the model's extraction capability is itself a real barrier to reaching the target, separate from labelling errors that could be the fault of either my engineering or the model.

More interestingly, the stripped test reveals more about VLM extraction capabilities because it prevents the extractor from simply copying measurements that have already been extracted by a human. Under the lenient stripped test, the most favourable conditions short of being given the answers, with the human's decomposition, labels, scope and working all provided and labelling ignored, the extractor scored 77.5% on build #1 and 71.7% on build #2, near the 80% target but below it. With exact-label matching, it performed even worse at 68% and 60%.

The limit that cannot be fixed by better extraction

These are early indications that VLM extraction capabilities are not yet at a level where they can be used for my client's takeoffs, but this is a capability observation that improvements in models could change. The more durable limit to takeoff automation is independent of extraction capability, because convention information is not present in the drawings. However, only further testing can confirm or refute this, as the claim currently rests on two multi-townhouse builds.

It is also worth conceding that my engineering design could have had oversights or flaws that lowered the accuracy rates. However, the finding does not rest on clearing or missing the 80% target precisely, nor does it rely on flawless engineering. The per-job scopes that decide the takeoff are not present in the drawings, so no extractor, however well built, can recover them from the drawings alone, which is the key commercial finding. Stronger engineering could potentially raise measurement accuracy, but it cannot extract relevant information that is not in the drawings, and both are prerequisites for a commercial product that automates takeoff for my client.

The process of evaluating the numbers

Initially, testing returned results that swung heavily towards both extremes before converging on the final results. The volatility came from rushing the test design and the review of results. As the experiment progressed, I learned to interrogate every result before building on it, asking what a number actually measured and whether the test could be fooled. This discipline shaped how I read every result that followed.

This discipline proved its worth when I identified an evaluation leak, which was a distinct and more severe issue than the convention spec bias. An earlier, contaminated run recorded median accuracy rates of 86.5%, 85.5% and 83.4% on three builds that are not in the final testing set. The inflated numbers were the result of a technical oversight in the extractor's few-shot architecture, which stores a handful of breakdowns in an exemplar pool and retrieves the relevant ones when extracting measurements. The breakdowns of these three tested builds had also been loaded into the few-shot exemplar pool, so the extractor returned biased results on those builds.

These builds were subsequently removed from the testing set but kept inside the few-shot exemplar pool, so that the extractor could draw on them for other multi-townhouse builds without skewing the results. After applying this fix, the final median accuracy rates were recorded on the adjusted testing set with the convention spec added to the extractor and the few-shot architecture kept in place. These final accuracy rates still carry the upward bias of the convention spec described earlier, but they are not affected by this particular evaluation leak.
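A leak of this kind can be caught with a simple guard that checks the testing set against the exemplar pool before any scoring run. The Python sketch below is a minimal illustration of that idea and of the fix described above, not the extractor's actual code; the function names and build IDs are hypothetical.

```python
def find_leaks(test_builds, exemplar_pool):
    """Return build IDs that appear in both the testing set and the exemplar pool."""
    return sorted(set(test_builds) & set(exemplar_pool))


def clean_test_set(test_builds, exemplar_pool):
    """Drop leaked builds from the testing set; the exemplar pool is left unchanged."""
    pool = set(exemplar_pool)
    return [build for build in test_builds if build not in pool]


# Hypothetical build IDs, for illustration only.
exemplar_pool = ["build_A", "build_B", "build_C"]
test_builds = ["build_A", "build_B", "build_C", "build_D", "build_E"]

leaks = find_leaks(test_builds, exemplar_pool)
print("leaked builds:", leaks)

cleaned = clean_test_set(test_builds, exemplar_pool)
# After cleaning, no test build may remain in the exemplar pool.
assert not find_leaks(cleaned, exemplar_pool)
print("clean testing set:", cleaned)
```

Running a check like this at the start of every evaluation would turn a silent inflation of the scores into an immediate, visible failure.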

Why historical context was insufficient

The convention spec was a document that described the patterns identified across all the multi-townhouse builds in my client's history, and it was loaded into the extractor as a separate test against the control. It included typical scoping decisions, customer-specific labelling conventions and decomposition patterns. Both the few-shot exemplar pool and the convention spec were intended to give the extractor real data from past breakdowns to draw on.

With the convention spec and few-shot retrieval both loaded, the median results still fell short of the 80% bar on every scoreable build (~40% to ~73%). I concluded that the failures were caused by the extractor missing information that is specific to the job it is working on. This supports the thesis that emerged from the breakdown test. In short, generalised context from historical data, even with few-shot retrieval, cannot replace human insight into the unique specifications of each build that is run through the extractor.

Interpreting the results and the real-world implications

Taken together, these findings imply that takeoffs similar to the ones my client does cannot currently be automated at commercial scale, because the takeoff requires information that is not only in the drawings but also lives in per-job human decisions. This can only change if per-job scope decisions become standardised to the point where they can be encoded in advance, or if model capability becomes advanced enough to infer them from historical data. Both scenarios are highly unlikely in the near future.

Even when the extractor was given the full set of customer-specific conventions, it fell short of the target accuracy rate when it had to extract and classify the measurements itself. This fits how takeoff automation tools actually work. Tools such as Togal AI automate the extraction layer using a specialised machine vision model and hand the scope and the non-standard, job-specific assembly classifications back to the human estimator (1, 2). Togal's self-reported 98% accuracy rate is for that extraction layer alone (3). This is a narrower task than the fully labelled and scoped takeoff that this experiment attempted to replicate, so comparing these results to Togal's marketed accuracy rates would be unfair. I tried and failed to use the flagship Claude vision-language models to automate the whole process. As of June 2026, despite how AI-assisted takeoff tools such as Togal, Kreo and Beam market themselves, no publicly available tool is capable of producing a fully labelled and scoped takeoff from the drawings alone, because the conventions are not in the drawings (4, 5). Those tools still require the human to define the scope and verify the output (6).

The independent, well-funded takeoff tools cited above all stop at the same boundary. They claim to automate extraction but hand the final scope decisions back to the human, because no tool can extract this information. This is evidence that the blocking condition to full takeoff automation is not specific to my client but applies to all users of takeoff software. The workflow of the takeoff estimator will thus remain human-in-the-loop, and there is little evidence to suggest that this will change anytime soon. That is not to say that specialised AI-assisted takeoff software is not useful; on the contrary, it saves human estimators a significant amount of time. However, new users need to know that they are not getting a fully labelled and scoped takeoff tool before investing in this type of software, and current users still need to check the output.

Moreover, ambitious investors or entrepreneurs looking to innovate in this space should understand, before committing to it, the structural reasons why full automation of traditional takeoff (where scope and conventions are decided after receiving the drawings) is probably infeasible given the results of this experiment. Takeoff and other extraction problems are not just a matter of how capable the model is, but also of whether the input holds enough information to produce the desired output. This is a critical lesson for aspiring vertical AI startups, given the typically document-heavy and extraction-heavy nature of the industries that are being dominated by vertical AI as of writing. No model can extract data that is not in its inputs.

References

1. Togal.AI, homepage. https://www.togal.ai/. "an AI-powered takeoff tool... that automatically detects, measures, and compares directly from your drawings."
2. Bidi Contracting, "AI Quantity Takeoff Software: A GC's Practical Guide." https://www.bidicontracting.com/blog/ai-quantity-takeoff-software. "No current AI takeoff tool handles scope interpretation... That judgment lives with your estimator."
3. Togal.AI, "Leading AI Solutions for Blueprint Measurements." https://www.togal.ai/blog/ai-blueprint-reading-accuracy. "This delivers 98% accuracy on floor plans."
4. Kreo, homepage. https://www.kreo.net/. "AI-powered tool that automatically detects and measures rooms, walls, doors, windows... turns measurements into cost estimates automatically."
5. Beam AI, homepage. https://www.ibeam.ai/. "delivering accurate takeoffs... with a human-in-the-loop QA process."
6. University of Kansas / Togal.AI, "Peer-Reviewed Study: Togal.AI vs On-Screen Takeoff" (study hosted on togal.ai). https://www.togal.ai/case-study/peer-reviewed-study-togal-ai-vs-on-screen-takeoff. "relying solely on the AI-automated results is not advisable... AI is just another tool."