47,802 baby-food lot results. How they got online.

California AB 899 made the data public. Public did not mean accessible. Two hundred-plus brand microsites each hosting their own PDF disclosures, each with a different schema, each behind a different login. Here is how we got it into one searchable table.

200+ brand sites scraped
47,802 lot results extracted
15 months of coverage
01

The compliance theater.

AB 899 requires manufacturers to publish lot-level heavy-metal results in a "consumer-accessible" format. Most brands chose the smallest interpretation: a PDF, deep-linked from a footer, hosted on a brand microsite that was not indexed and not aggregated. Technically public, practically invisible.

The PDF format is not accidental.

PDFs cannot be searched across brands without OCR and table extraction. Brands rely on the gap between "published" and "findable" to keep the data quiet without breaking the law.

02

The extraction stack.

1

Crawl the brand directory.

We indexed the AB 899 brand registry, then expanded the crawl to each brand's disclosure microsite.

2

Fetch every disclosure PDF.

Including the ones whose URLs change monthly. A light cron job handles it.

3

OCR + table extraction.

Custom extractors per brand because each brand chose a different table layout.

4

Normalize the panel.

Lead, cadmium, mercury, arsenic in µg/kg and µg/serving. One schema, no brand-specific exceptions.

5

Cross-reference with USDA serving sizes.

A lot-level µg/kg means nothing without the realistic-serving math.

6

Open-source the result.

Search at <a href="/check-your-dish">/check-your-dish</a>. JSON API at the same path.
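Steps 3 through 5 can be sketched roughly as follows. Everything named here is hypothetical: the brand keys, the two row layouts, the `LotResult` record, and the 113 g serving are assumptions for illustration, not the schema or code behind the site. The only fixed piece is the arithmetic: µg per serving equals µg/kg times serving grams, divided by 1000.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

METALS = ("lead", "cadmium", "mercury", "arsenic")

# Factors into µg/kg: ppb by mass equals µg/kg; mg/kg and ppm are 1000 times larger.
TO_UG_PER_KG = {"ug/kg": 1.0, "ppb": 1.0, "mg/kg": 1000.0, "ppm": 1000.0}


@dataclass(frozen=True)
class LotResult:
    """One normalized lot: every metal in µg/kg, whatever the brand printed."""
    brand: str
    lot: str
    ug_per_kg: Dict[str, float]

    def ug_per_serving(self, serving_g: float) -> Dict[str, float]:
        # µg/serving = µg/kg * (serving grams / 1000 g per kg)
        return {m: v * serving_g / 1000.0 for m, v in self.ug_per_kg.items()}


def to_ug_per_kg(value: float, unit: str) -> float:
    """Convert a reported concentration to µg/kg; unknown units raise KeyError."""
    return value * TO_UG_PER_KG[unit.strip().lower()]


def extract_brand_a(row: List[str]) -> LotResult:
    # Hypothetical layout: lot ID first, then Pb, Cd, Hg, As in ppb.
    lot, *values = row
    panel = {m: to_ug_per_kg(float(v), "ppb") for m, v in zip(METALS, values)}
    return LotResult("brand_a", lot, panel)


def extract_brand_b(row: List[str]) -> LotResult:
    # Hypothetical layout: As, Cd, Pb, Hg in mg/kg, lot ID last.
    order = ("arsenic", "cadmium", "lead", "mercury")
    *values, lot = row
    panel = {m: to_ug_per_kg(float(v), "mg/kg") for m, v in zip(order, values)}
    return LotResult("brand_b", lot, panel)


# One extractor per brand layout, one output schema.
EXTRACTORS: Dict[str, Callable[[List[str]], LotResult]] = {
    "brand_a": extract_brand_a,
    "brand_b": extract_brand_b,
}


def normalize(brand: str, row: List[str]) -> LotResult:
    return EXTRACTORS[brand](row)


# Invented rows, standing in for OCR'd table cells.
lot_a = normalize("brand_a", ["L-001", "4.0", "2.0", "0.5", "6.0"])
lot_b = normalize("brand_b", ["0.006", "0.002", "0.004", "0.0005", "L-002"])

ASSUMED_SERVING_G = 113.0  # assumption for illustration, not a USDA figure
per_serving = lot_a.ug_per_serving(ASSUMED_SERVING_G)
```

With the assumed 113 g serving, a 4 µg/kg lead result works out to about 0.45 µg per serving. Keeping the brand-specific quirks inside the extractors is what lets the output schema stay free of exceptions.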

03

What it means in practice.

The data was always there. Now it is sortable. You can see which brands disclose results under the child interim reference level (IRL) and which never have. You can see how lots within a brand vary. You can pick the safer lot at the grocery store.
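As a rough illustration of those last two uses, assume rows of (brand, lot, lead in µg/kg) have already been pulled from the table. The rows and function names below are invented for the sketch, not taken from the dataset or its API.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, str, float]  # (brand, lot, lead in µg/kg)

# Made-up values for illustration only.
rows: List[Row] = [
    ("brand_a", "L-001", 4.0),
    ("brand_a", "L-002", 9.5),
    ("brand_a", "L-003", 2.1),
    ("brand_b", "L-010", 6.0),
]


def lead_spread(rows: List[Row]) -> Dict[str, Tuple[float, float]]:
    """Return (lowest, highest) lead result per brand across its lots."""
    by_brand: Dict[str, List[float]] = defaultdict(list)
    for brand, _lot, lead in rows:
        by_brand[brand].append(lead)
    return {brand: (min(vals), max(vals)) for brand, vals in by_brand.items()}


def lowest_lead_lot(rows: List[Row], brand: str) -> Optional[str]:
    """Return the lot ID with the lowest lead result for a brand, if any."""
    candidates = [(lead, lot) for b, lot, lead in rows if b == brand]
    return min(candidates)[1] if candidates else None


spread = lead_spread(rows)
best_lot = lowest_lead_lot(rows, "brand_a")
```

The same grouping works for cadmium, mercury, or arsenic once the panel is in one schema.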

Public data is not the same as accessible data. The point of putting it in one table is that a parent can use it.

DetectLead extraction note