We Let the AI Classify Alone – No Safety Net

We pulled every human off the wheel and let our AI classify 7,344 products with no answer key. It scored 98.4% at HS-6, never missed a chapter, and even fixed the answer key.

TL;DR: We pulled every human hand off the wheel and let a single AI formula – =TWHSHINT – classify 7,344 real products on its own. It put the correct six-digit code in its shortlist 98.4% of the time, never once left the right chapter, and quietly corrected mistakes in the answer sheet we were grading it against. And yet 98.4% is a number we would never put in front of a customs officer. Here’s the whole honest story – and why it changes the economics of global trade.

There’s a moment every customs broker knows. A shipment is sitting at the port. A six-digit code is in dispute. And the difference between a clean release and a five-figure penalty comes down to a single distinction the product description never bothered to spell out: was the yarn single or folded? Was the truck’s gross vehicle weight just under 5 tonnes or just over? Was the fish fresh, chilled, or frozen?

Harmonized System classification is one of the most quietly brutal jobs in global trade. Over five thousand subheadings. Hairline splits between “knitted” and “woven,” between a finished article and a part of one, between a battery-electric car and a plug-in hybrid. The logic is beautiful – and unforgiving. Get it right and nobody notices. Get it wrong and it follows your company around for years, in the form of reassessments, audits, and a compliance file with your name on it.

That single, non-negotiable fact shapes everything we build: in tariff classification, there is no acceptable error rate. Not 5%. Not 1%. Not 0.1%. We don’t treat HS classification as a clever feature. We treat it as a compliance obligation, and we take that obligation seriously.

So when it came time to tell you what our engine can actually do, we refused to write a brochure. We built a courtroom instead. We took the answer key away, removed every human from the loop, and let the raw AI classify 7,344 real products entirely on its own. Then we graded it, line by line, against the actual HS-2022 nomenclature.

This article is the honest verdict.

Download the internal HS Hint Classification Audit Report.

First, be precise about what you’re looking at

TariffWolf lives where your trade team already works: the spreadsheet. The toolkit ships as four spreadsheet formulas, and this audit is about exactly one of them – =TWHSHINT.

The name is doing honest work. It returns a Hint: a tight, ranked shortlist of candidate HS-2022 codes for a product, generated entirely by AI from the title and description. Not a verdict. Not a stamped declaration. A hint – the codes most likely to be correct, so a classifier never starts from an empty cell. Read every figure below through that lens. We are stress-testing the quality of a hint, alone, with nothing and no one assisting it.

We took the human out of the loop on purpose

The rules we set ourselves were deliberately merciless:

7,344 real product lines, spanning all 21 HS Sections and 96 chapters – from a live breeding stallion to a battery-electric truck.
Zero human intervention. No analyst reviewed, nudged, or corrected the engine. What you’re scoring is the raw, unassisted output of =TWHSHINT.
No answer key for the engine. Every line was independently re-classified to the full six-digit level under the General Interpretative Rules (GRI 1–6), and only then compared with what the AI had proposed.
Total enumeration, not a flattering sample – 100% of the qualifying population, so no easy categories got cherry-picked and no average had anywhere to hide.

This is as close as an audit gets to watching the machine think with its hands tied behind its back.

The raw score

Metric	Result
Correct HS-6 code in the hint’s shortlist	98.4% (7,223 / 7,344)
Correct code ranked first	81.1%
Right four-digit heading and two-digit chapter	100%
Deleted / invalid HS-2022 codes emitted	0
Genuine misses	121 (1.6%)

The correct six-digit code sat in the hint 98.4% of the time, and it was the very first candidate in roughly four of every five cases. Every single one of the 7,344 lines landed in the correct heading and chapter. And across the entire run, the engine never once reached for a retired, pre-2022, or otherwise invalid code.
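
For readers who want to check the scoring rather than take it on faith, here is a minimal Python sketch of how the two headline rates could be computed. It is not the audit's actual tooling: the `Line` record, the `score` function, and the four toy lines are illustrative assumptions, and the hint is assumed to arrive as a ranked list of code strings.

```python
from dataclasses import dataclass


def normalize(code: str) -> str:
    """Strip dots and spaces so '0304.81' and '030481' compare equal."""
    return code.replace(".", "").replace(" ", "")


@dataclass
class Line:
    truth: str       # independently determined HS-6 code
    hint: list[str]  # ranked shortlist proposed for this line (assumed format)


def score(lines: list[Line]) -> dict[str, float]:
    """Compute shortlist hit rate, ranked-first rate, and miss count."""
    in_shortlist = 0
    ranked_first = 0
    for line in lines:
        truth = normalize(line.truth)
        hint = [normalize(c) for c in line.hint]
        if truth in hint:
            in_shortlist += 1
        if hint and hint[0] == truth:
            ranked_first += 1
    n = len(lines)
    return {
        "shortlist_rate": in_shortlist / n,
        "first_rate": ranked_first / n,
        "miss_count": n - in_shortlist,
    }


# Toy data only; these four lines are made up for illustration.
sample = [
    Line("0304.81", ["0304.81", "0304.89"]),  # hit, ranked first
    Line("8502.12", ["8502.11", "8502.12"]),  # hit, ranked second
    Line("0301.99", ["0301.93"]),             # miss inside the right heading
    Line("8516.10", ["8516.10"]),             # hit, ranked first
]
result = score(sample)
print(result)

# The headline figure recomputed from the counts in the table above.
headline = round(7223 / 7344 * 100, 1)
print(headline)  # 98.4
```

The same arithmetic gives the miss count: 7,344 minus 7,223 leaves the 121 lines where the correct code was not in the shortlist.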

But the most revealing result isn’t in the table. When the engine did miss, it missed by one neighbouring subheading inside the correct heading – never by wandering off and filing a refrigerator under livestock. That containment is the whole game.

How it actually thinks (this is the part that matters)

It would be easy to assume a 98.4% comes from sophisticated keyword matching. It doesn’t – and the error pattern is the proof.

A keyword matcher fails randomly: a stray word drags “building block” into construction one minute and “apple” into fruit the next, scattering mistakes across the tariff. This engine fails structurally. Its candidate codes cluster, with 100% agreement, inside the correct four-digit heading and two-digit chapter. It first decides what the product fundamentally is – a garment, a vehicle, a machine – locks onto the right family, and only then reasons toward the terminal six digits. That’s GRI reasoning, not text overlap.
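
One way to make "fails structurally" measurable is to ask, for every line, how deep into the code the best candidate agrees with the truth: six digits, four (heading), two (chapter), or not at all. The sketch below is an assumption about how such a check could look, not a description of the engine or of the audit scripts; `agreement_depth` and `best_depth` are hypothetical helper names, and the example lines are toy data.

```python
from collections import Counter


def normalize(code: str) -> str:
    """Strip dots and spaces so '0304.81' and '030481' compare equal."""
    return code.replace(".", "").replace(" ", "")


def agreement_depth(truth: str, candidate: str) -> int:
    """Return 6, 4, 2, or 0: the deepest HS level at which two codes agree."""
    t, c = normalize(truth), normalize(candidate)
    for depth in (6, 4, 2):
        if t[:depth] == c[:depth]:
            return depth
    return 0


def best_depth(truth: str, hint: list[str]) -> int:
    """Deepest agreement reached by any candidate in the shortlist."""
    return max((agreement_depth(truth, c) for c in hint), default=0)


# Toy lines: (true code, proposed shortlist). All made up for illustration.
lines = [
    ("8703.80", ["8703.80"]),             # exact hit            -> 6
    ("0301.99", ["0301.93"]),             # neighbouring subhead -> 4
    ("8704.22", ["8704.21", "8704.23"]),  # wrong weight band    -> 4
    ("6404.19", ["6402.99"]),             # right chapter only   -> 2
    ("9503.00", ["6810.11"]),             # scattered miss       -> 0
]
tally = Counter(best_depth(truth, hint) for truth, hint in lines)
print(tally)
```

Under this reading, a structural error pattern shows up as misses piling up at depth 4, while a scattered one shows up as misses at depth 2 or 0.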

You can see the same intelligence in how current it is. The HS-2022 edition introduced codes for things that barely existed a decade ago, and the engine reaches for them confidently and correctly: 8485 for additive-manufacturing (3-D printing) machinery, the 8704.4x–5x hybrid and 8704.60 electric goods-vehicle bands, 8517.13 for smartphones, 0309 for fish flours fit for human consumption, 8541.51 for semiconductor-based transducers. It is equally at home with 8703.80 for battery-electric cars, a subheading that dates from the HS-2017 edition. It isn’t pattern-matching against a frozen snapshot of the past – it’s classifying against the tariff as it stands today.

And it knows when it’s sure. 95.8% of its determinations rest on a decisive attribute or direct nomenclature validation – a stated power rating, a fuel type and gross weight, a species and state. The remaining 4.2% are honest “I’m working from a thin description” calls that resolve to a heading’s residual Other subheading. The engine is calibrated, not overconfident.

And it does all of this 20× faster than a human can

Here is why the spreadsheet matters. =TWHSHINT is an add-on function that calls the engine over an API straight from a cell. You point it at your product-description column, drag it down, and the candidate codes come back in bulk. The work that swallows a skilled classifier’s entire day – researching, cross-referencing, keying one SKU at a time – collapses into the time it takes to fill a column.

Across a real catalog, that’s comfortably a 20× speed-up on the classification process. Not by replacing judgement, but by deleting its slowest, most repetitive layer: the cold-start research on every single line. Your experts stop hunting for candidates and start doing the only thing that truly needs a human – choosing between them.

The plot twist: it corrected the answer key

This part we genuinely did not expect. On several lines, our own test sheet marked the AI “wrong” – and when a human went back to the nomenclature, the AI was right and the sheet was wrong. Frozen salmon (0304.81, not 0304.89). Live lobster (0306.32). Fish meal (2301.20). Liquid eggs (0408.99). Water heaters (8516.10). Engineered oak flooring (4418.75). A compound microscope (9011.20). A 100 kVA generating set (8502.12). In each case the correct code was already sitting inside the hint, quietly waiting for someone to notice the cheat sheet had a typo.

An engine that occasionally out-reasons its own ground truth is not doing lookup. It is doing classification.

Where it sweats – and we won’t pretend it doesn’t

The 121 misses cluster mainly in the places that torment expert human classifiers too: the engineering-heavy chapters – machinery, electronics, vehicles, precision instruments, aircraft – where a single heading splits on one hard number the description often omits. A 7.5-tonne refrigerated van gets mis-banded across the goods-vehicle weight thresholds. A drone lands in the wrong weight class. The rest are fine biological or material splits – live tilapia read toward a carp code instead of 0301.99, knitted-textile-upper footwear offered only as rubber/leather variants instead of 6404.19 – plus the occasional semantic trap, like a plastic toy “building block” tugged toward construction materials.

We’re publishing the weak spots, not burying them, because they aren’t an embarrassment – they’re a map. Every one tells you precisely where a thin product description needs one more attribute to be airtight. Sixteen of twenty-one Sections scored a flat 100%. Consumer-facing trade – garments, footwear, jewellery, food, cosmetics – was essentially flawless.

So why would we never ship 98.4%?

Because a hint that’s right 98.4% of the time is a phenomenal starting line – and a completely unacceptable finish line.

Return to the first principle: in compliance there is no acceptable error rate. =TWHSHINT was never designed to be the final word. It is the fast, edition-current, brilliantly reasoned first draft that a trained classifier then reviews and confirms. And once you put that human back into the loop – the loop we deliberately removed for this audit – the number stops behaving like a probability at all.

It becomes 100%.

Not 99.9%. Not 99.99%. Not even 99.999%. One hundred. Because “almost always correct” is a phrase we are not willing to set beside a customer’s customs declaration. The AI does the heavy lifting at machine speed; the human guarantees the outcome. That pairing is the only configuration we are willing to call delivered.

Why this changes the game for the whole industry

For decades, trade compliance has been forced to pick one of two bad options. Slow and safe: armies of experts classifying by hand, accurate but expensive and impossible to scale to a catalog of fifty thousand SKUs. Or fast and risky: crude keyword tools and stale code lists that move quickly and quietly seed errors that surface years later in an audit.

=TWHSHINT collapses that trade-off. It is fast like a machine and structured like an expert – edition-current, GRI-grounded, calibrated about its own confidence, and contained enough that even its rare misses stay inside the right heading. It runs natively in the spreadsheet your team already uses, scales to an entire catalog in a single column, and – crucially – it is verifiable: you can audit it exactly the way we just did. When the bottleneck of global trade becomes 20× faster to move through without loosening the compliance standard, the economics of an entire industry shift. That’s not a feature. That’s a new baseline.

Don’t trust us. Test us.

One last piece of honesty: we ran this audit ourselves. It’s an internal review – our engine, our methodology, our hands on the wheel. We were careful, we classified independently of the engine’s output, and every figure recomputes straight from the raw data we’re handing you. But we’re the home team grading our own exam, and you’d be entirely right to keep one eyebrow raised.

Good. Keep it there – then settle it the honest way.

Drop a column of your products into =TWHSHINT and run the exact audit we ran, on your own catalog, against your own known-good codes. Watch how often the right answer is already in the hint. Time how much faster your team moves. Find your own 1.6% before it ever costs you a clearance – that is the entire point of a hint you can verify.
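
If you export the results to CSV, a few lines of Python are enough to find your own misses and see which chapters they fall in. This is only a sketch under stated assumptions: the column names `sku`, `known_code`, and `hint`, and the semicolon-separated hint format, are hypothetical and will need adapting to whatever your sheet actually contains.

```python
import csv
import io
from collections import Counter


def normalize(code: str) -> str:
    """Strip dots and spaces so '0304.81' and '030481' compare equal."""
    return code.replace(".", "").replace(" ", "")


def audit(rows: list[dict]) -> tuple[list[dict], Counter]:
    """Return the rows whose known code is absent from the hint,
    plus a count of those misses per two-digit chapter."""
    misses = []
    by_chapter = Counter()
    for row in rows:
        truth = normalize(row["known_code"])
        # Assumed format: candidates separated by semicolons in one cell.
        hint = [normalize(c) for c in row["hint"].split(";") if c.strip()]
        if truth not in hint:
            misses.append(row)
            by_chapter[truth[:2]] += 1
    return misses, by_chapter


# Stand-in for an exported sheet; replace with open("your_export.csv").
EXPORT = """sku,known_code,hint
A1,0304.81,0304.81;0304.89
A2,8704.22,8704.21;8704.23
A3,8806.22,8806.21
A4,6404.19,6404.19;6404.11
"""

rows = list(csv.DictReader(io.StringIO(EXPORT)))
misses, by_chapter = audit(rows)
miss_skus = [m["sku"] for m in misses]
print(miss_skus)
print(dict(by_chapter))
```

The per-chapter count is the useful part: it points at the product families where descriptions most often lack the one attribute a subheading turns on.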

We are not asking you to believe a percentage. We are handing you the red pen.