lead database / baby food / extraction story

how 47,802 baby-food test results became one searchable database

california passed a law that forced every baby-food brand to publish heavy-metal test results for every lot they sell. the brands complied. they posted the data. just on their own websites, in their own formats, with no unified index. so i wrote a scraper.

this page tells how that scrape worked, end to end. it links every script, every endpoint, and every legal citation. if you want to verify the numbers in the lead database, this is the audit trail.

47,802 metal-readings extracted
18,124 unique lots scraped
29 brands extracted (of 39 audited)
50 min wall-clock for the full crawl

2018: consumer reports buys 50 baby foods and tests them

in august 2018, consumer reports published Heavy Metals in Baby Food: What You Need to Know. they had bought three samples each of fifty nationally distributed baby and toddler foods in spring 2018 (jarred fruits, vegetables, cereals, snacks, entrees) and sent them to a lab. every single one came back with measurable arsenic, cadmium, and/or lead. about two-thirds had at least one metal at what CR’s scientists called "worrisome" levels. organic was no different from conventional. rice cereals and sweet potato were the worst categories.

brands tested included beech-nut, gerber, earth’s best, happy baby, plum, sprout, ella’s kitchen, and walmart’s parent’s choice. the report was the first nationally-distributed testing where you could open the appendix and see specific products, specific concentrations, specific brands.

a year later, on october 17, 2019, healthy babies bright futures put out What’s in My Baby’s Food?: 168 baby foods tested, 95% positive for at least one toxic metal, one in four containing all four. that’s the report that turned a magazine investigation into a policy fight. media picked it up. parents called pediatricians. pediatricians called legislators.

2021: congress reads the brands’ own internal emails

in february 2021, the house subcommittee on economic and consumer policy released Baby Foods Are Tainted with Dangerous Levels of Arsenic, Lead, Cadmium, and Mercury. they had subpoenaed internal documents from beech-nut, gerber, hain (earth’s best), nurture (happy baby), sprout, and walmart. the documents showed that the brands’ own internal testing routinely returned heavy-metal levels above the brands’ own internal guidelines, and that some brands shipped product anyway.

the subcommittee’s recommendations: require manufacturers to test, require manufacturers to publish, require labels. set FDA action levels. give parents and pediatricians the data they needed to comparison-shop.

2023: california passes AB 899

AB 899 was signed by governor newsom on october 7, 2023. it took effect in two phases:

  • january 1, 2024: every baby food maker selling in california must test each production aggregate at least monthly for arsenic, cadmium, lead, and mercury, using an ISO/IEC 17025-accredited lab.
  • january 1, 2025: every test result must be publicly available on the manufacturer’s own website for the shelf life of the product plus one month. the disclosure must include the metal name, the level found, and enough lot identification to match the test back to the package.

california is one state, but it’s also the largest infant-food market in the country. nobody runs a separate national supply chain for the other 49 states. AB 899 is a national disclosure law dressed up as a state law.

the california department of public health hosts an index of manufacturer disclosure pages, not the data itself, just a list of links. each manufacturer is on the hook for hosting their own results. one law. one mandatory schema (metal + level + lot). zero standardization on how the data gets published.

2025: the data is published. the data is unusable.

by january 2025, every brand was technically in compliance. the data was on the internet. but it was scattered across roughly forty manufacturer websites, each one wearing a different costume.

  • nine embedded the data inline in their page source.
  • four stood up custom WordPress REST endpoints.
  • five paid a SaaS company called LightLabs to host the data on a shared API.
  • three used a shopify app called Brij.it.
  • two used jQuery DataTables and dumped the entire dataset into the visitor’s browser.
  • one used a public google sheet with a comment in cell A1 reading "THIS SHEET FEEDS THE LIVE AB899 WEBPAGE". they knew it was public.
  • four sat behind a Laravel + Livewire SaaS on traceabilitybabyfood.com with cryptographically signed session state.
  • one published results as JPEG product photos that couldn’t be parsed without OCR.

so a parent who wanted to compare lead levels in rice cereal across three brands had to: open three different websites, learn three different lot-code formats, mentally normalize three different ways of writing "non-detect," and probably give up. a researcher who wanted the full dataset had no full dataset to work from. the law had created the disclosure mandate without creating the disclosure database.

the scrape pipeline, script by script

i went one brand at a time, opened devtools, watched the network tab, and figured out where each brand was actually serving its data from. then i wrote a scraper for that pattern, ran it, and saved the raw output as {brand}_full.csv in ~/Desktop/baby_food_ab899/data/.

the scrapers themselves are short: most are 50–150 lines of python with requests + json. the work was the reverse-engineering, not the code. the actual scrape (every brand, every product, every lot) took 50 minutes wall-clock from cold start to 18,124 lots.
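most of the scrapers reduce to the same skeleton: fetch a JSON payload, flatten lots into rows, write {brand}_full.csv. a minimal sketch of that skeleton (the payload key names here are illustrative, not any brand's actual schema):

```python
import csv

def flatten_lots(payload):
    """Flatten a brand's JSON payload into one row per (product, lot, metal).

    Field names are illustrative; every brand used different keys.
    """
    rows = []
    for product in payload["products"]:
        for lot in product["lots"]:
            for metal, value in lot["results"].items():
                rows.append({
                    "product": product["name"],
                    "lot": lot["code"],
                    "metal": metal,
                    "ppb": value,
                })
    return rows

def write_brand_csv(rows, path):
    """Write the flattened rows as a per-brand csv."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["product", "lot", "metal", "ppb"])
        writer.writeheader()
        writer.writerows(rows)

# in a real scraper the payload comes from something like
# requests.get(endpoint).json(), and path is data/{brand}_full.csv
```

the per-brand variation lives almost entirely inside flatten_lots; the write step never changes.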

the per-brand scrapers (in ~/Desktop/baby_food_ab899/scripts/)

  • scrape_all.py, the multi-brand harness. covers babylife, pumpkin tree, ready set food, and amara in one pass.
  • scrape_earthsbest.py, walks hain celestial’s /producttesting index, downloads each PDF, parses the certificate of analysis layout into rows.
  • scrape_holle.py, hits holle’s batch-search endpoint, paginates through every batch code, parses the html result blocks.
  • scrape_nestum.py, nestlé’s goodnes.com backend. nestum + cerelac live here. same backend powers gerber.
  • scrape_gerber.py + scrape_gerber_step3.py + scrape_gerber_step3_curl.py, three iterations on gerber. nestlé infrastructure, same as nestum, but the action name was different. still returning zero rows; left as a known gap.
  • fix_earthsbest.py, fix_holle_parser.py, fix_nestum_bestbefore.py, second-pass cleanup scripts that re-parse a few outliers (one earth’s best mercury value at 3,156 ppb was a unit mix-up; the holle parser was reading the action-level reference column as a value).
  • enrich_beechnut_upcs.py, walks the beech-nut WP REST endpoint a second time to attach UPCs to each lot, since the first pass didn’t fetch them.
  • analyze.py, the rollup. reads every {brand}_full.csv in data/ and emits the brands.json, ingredients.json, temporal.json, and normalized.csv files in website_module/.
  • summary.py, prints the final brand-by-brand counts and exceedance tally to stdout.

the per-brand snippets (in ~/Desktop/baby_food_ab899/scrape_scripts/)

  • dump_kroger_simple_truth.js, a one-line console snippet that pulls the full vue store: copy(JSON.stringify(window.$vueRoot.$data.datos)). simple truth’s entire dataset was already in the visitor’s browser; the lot-code search box was decoration.
  • dump_aldi_little_journey.js, same pattern, different markup. aldi’s rusks were already in the page DOM.

the patterns that repeated

i ended up with nine reusable patterns. they’re documented in ~/Desktop/baby_food_ab899/MASTER_REPORT.md, but the short version:

  1. inline JSON in page source, 6 brands. their entire dataset was already in the html, sometimes behind a fake search box.
  2. WordPress REST API, 4 brands. each had a custom WP plugin exposing an unauthenticated endpoint that the brand’s own front-end already called.
  3. LightLabs SaaS (5 brands), shared compliance vendor. GET /api/companies/{UUID}/skus then GET /api/companies/{UUID}/skus/{SKU_ID}/lots. no auth. brand UUIDs visible in page source.
  4. Brij.it SaaS (3 brands), shopify app. brand IDs sit in localStorage under the 'brij-local' key: JSON.parse(localStorage.getItem('brij-local')).brijProductDetails.brand.id. once you have the ID, the whole dataset is one GET away.
  5. jQuery DataTables (2 brands), jQuery('table').DataTable().rows({search:'none'}).data() dumps everything.
  6. public google sheet (1 brand), the spreadsheet itself was world-readable; the public website was just a styled view of it.
  7. nestlé shared infra (2 brands), nestum and cerelac on the goodnes.com backend.
  8. predictable static PDFs (3 brands), stonyfield posts monthly results at /wp-content/uploads/{YYYY}/{MM}/heavymetals_result-{Month}.pdf. enumerate the months, fetch the pdfs.
  9. browser-only "load more" buttons (1 brand), cerebelly. eight buttons across eight categories. clicked them all, then dumped the dom.
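pattern 3 is representative of the SaaS endpoints: a two-step walk over unauthenticated GETs. a sketch with the HTTP call injected, so the traversal stands on its own (the URL shapes are the ones listed above; the 'id' key on each SKU record is an assumption about the response shape):

```python
def walk_lightlabs(company_uuid, get_json):
    """Two-step walk: list a company's SKUs, then fetch each SKU's lots.

    get_json(path) -> parsed JSON. In practice something like
    lambda p: requests.get(BASE + p, timeout=30).json(); it is
    injected here so the traversal is testable offline.
    """
    for sku in get_json(f"/api/companies/{company_uuid}/skus"):
        lots_path = f"/api/companies/{company_uuid}/skus/{sku['id']}/lots"
        for lot in get_json(lots_path):
            yield sku["id"], lot
```

the same two-loop shape covers any vendor that exposes a list endpoint plus a per-item detail endpoint; only the paths change.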

none of this involved bypassing authentication or forging session state. every endpoint i hit was an endpoint the brand’s own public front-end was already calling, with no credentials. when i ran into endpoints that required forged signed state (the four traceabilitybabyfood.com brands, target, walmart, albertsons, love child), i stopped and put those brands on the crowdsource list instead.

normalize everything to one schema

after every brand was scraped into its own per-brand csv, i ran scripts/analyze.py over data/. that’s the script that does the unification. it reads every {brand}_full.csv, parses the brand-specific column conventions, and writes one wide-format row per (brand, product, lot) with one column per metal:

brand, product, lot, test_date, best_by, product_format, serving_g_assumed, lead_ppb, lead_detection_limit_ppb, lead_below_detection, arsenic_ppb, arsenic_detection_limit_ppb, arsenic_below_detection, cadmium_ppb, cadmium_detection_limit_ppb, cadmium_below_detection, mercury_ppb, mercury_detection_limit_ppb, mercury_below_detection, lead_ug_per_serving, arsenic_ug_per_serving, cadmium_ug_per_serving, mercury_ug_per_serving, source_file

output: ~/Desktop/baby_food_ab899/website_module/normalized.csv, 18,124 rows, 24 columns. one row per lot. four metals per lot. that’s the file every other downstream tool reads from. the per-metal "below detection" flags matter because a non-detect at a 5 ppb detection limit is genuinely different from a non-detect at a 0.5 ppb detection limit: the former tells you almost nothing.
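the four _ug_per_serving columns are derived rather than scraped: with ppb read as µg/kg, the dose per serving is one multiplication against serving_g_assumed. a sketch of that conversion (the function name is mine):

```python
def ug_per_serving(ppb, serving_g):
    """Convert a concentration in ppb (ug/kg) to an absolute dose in ug.

    A serving of serving_g grams is serving_g / 1000 kg, so:
    ug = ppb (ug/kg) * serving_g / 1000
    """
    return ppb * serving_g / 1000.0

# 10 ppb of lead in an assumed 100 g serving -> 1.0 ug of lead ingested
```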

then unify with the other eight sources

the normalized csv flows into the lead database via ~/Desktop/lead_database/build_unified_v3.py. that’s the second-stage script. it reads the AB 899 normalized.csv, melts it from "wide" (one row, four metals) to "long" (one row per metal-reading, dropping below-detection rows), and stitches it together with eight other sources: HBBF’s 2019 baby food report, HBBF’s 2025 rice report, NYC department of health’s non-pharmaceutical surveillance database, king county’s store-shelf surveys, pure earth’s global lead exposure surveys, CPSC recalls, FDA recalls, and EU safety gate alerts.
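the wide-to-long melt is simple enough to sketch in pure python (column names are the schema’s own; this is an illustration of the transform, not the actual build_unified_v3.py internals):

```python
METALS = ("lead", "arsenic", "cadmium", "mercury")

def melt_lot(row):
    """One wide (brand, product, lot) row -> up to four long metal-readings,
    dropping any metal flagged below the lab's detection limit."""
    for metal in METALS:
        if row[f"{metal}_below_detection"]:
            continue
        yield {
            "brand": row["brand"],
            "product": row["product"],
            "lot": row["lot"],
            "metal": metal,
            "ppb": row[f"{metal}_ppb"],
        }
```

run over all 18,124 wide rows, this is what yields the 47,802 long metal-readings.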

after melt + stitch + categorization, AB 899 contributes 47,802 metal-readings to the unified file (on average, ~2.6 of the four metals per lot came back above the lab’s detection limit). that’s the largest single contribution to the lead database. spices come second at 4,843. cookware third at 3,523.

the result

one csv. 47,802 AB 899 readings alongside everything else, all normalized to ppb (µg/kg), all keyed back to the brand’s own disclosure page. you can open it in any of these:

  • /pages/lead-database-babyfood, the searchable, sortable, filterable table view of just the baby-food slice.
  • /pages/lead-database, the hub. baby food sits next to spices, cookware, cosmetics, supplements, and ten other categories.
  • /pages/babyfood, a separate analytical page. dose-band scoring, brand worst-offender lists, ingredient-level breakdowns. that page is the analysis layer; this page is the extraction layer.
  • babyfood.csv, just the baby-food category, raw csv.
  • unified_v3.csv, every category, all 67,497 rows, 12 MB. CC-BY-SA 4.0.

regulating ingredients works. asking manufacturers to be transparent doesn’t, not unless somebody assembles what they publish into something usable. this is that something.

verifying or extending the scrape

everything is on disk. nothing is in a database. nothing is in a queue. if you want to verify a number, the path is short:

  1. open ~/Desktop/baby_food_ab899/data/{brand}_full.csv for the brand you care about. that’s the raw scraper output. find the lot.
  2. open ~/Desktop/baby_food_ab899/website_module/normalized.csv and grep for the same lot. the metal columns should agree (modulo unit conversions if the brand published in mg/kg).
  3. open the brand’s own AB 899 disclosure page (links in MASTER_REPORT.md) and find the same lot. the published value should agree with the scraped value.
  4. if any of those three disagree, that’s a real bug. email eric@fluorospect.com with the lot id and the source url.
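steps 1–3 can be scripted. a sketch of the cross-check (the lot column name varies per raw file, so it’s a parameter; the mg/kg branch handles the unit-conversion caveat in step 2, since 1 mg/kg = 1000 ppb):

```python
import csv

def rows_for_lot(path, lot, lot_col="lot"):
    """Every row in a csv whose lot column matches."""
    with open(path, newline="") as f:
        return [r for r in csv.DictReader(f) if r.get(lot_col) == lot]

def agrees(scraped_ppb, published_value, published_unit="ppb", tol=0.5):
    """Compare a scraped ppb value to a brand-published one,
    converting mg/kg (1 mg/kg = 1000 ppb) when units differ."""
    if published_unit == "mg/kg":
        published_value = published_value * 1000.0
    return abs(scraped_ppb - published_value) <= tol
```

the tolerance is a judgment call; a mismatch far outside it is the kind of thing worth emailing about.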

re-running the scrape from scratch

each script in ~/Desktop/baby_food_ab899/scripts/ is independent. run any one to refresh that brand. then re-run scripts/analyze.py to rebuild the normalized csv. then re-run ~/Desktop/lead_database/build_unified_v3.py to push the change into the unified file. then re-run ~/Desktop/lead_database/build_category.py babyfood to refresh the customer-facing table. takes about a minute end-to-end if everything’s warm.
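the rebuild chain above can be wrapped in a few lines. a sketch with the runner injected, so the ordering logic is visible without touching the real scripts (paths are the ones described on this page):

```python
import subprocess
import sys
from pathlib import Path

HOME = Path.home()
STEPS = [
    [HOME / "Desktop/baby_food_ab899/scripts/analyze.py"],
    [HOME / "Desktop/lead_database/build_unified_v3.py"],
    [HOME / "Desktop/lead_database/build_category.py", "babyfood"],
]

def rebuild(steps, run=None):
    """Run each downstream build step in order; check=True stops the
    chain on the first failure."""
    if run is None:
        run = lambda argv: subprocess.run([sys.executable, *argv], check=True)
    for argv in steps:
        run([str(a) for a in argv])
```

re-running a single brand scraper first, then rebuild(STEPS), is the whole refresh.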

extending it

if you want to add a brand i missed, or a brand that started publishing after this scrape ran: write a new scrape_{brand}.py, drop the output into data/{brand}_full.csv with the same column conventions, re-run analyze.py, and the new rows flow through. the four brands i couldn’t crack (target/good & gather, walmart/parent’s choice, albertsons/o organics, love child organics), all sitting on the same Laravel + Livewire SaaS, are still on the wishlist. they require either crowdsourced lot codes or AB 899 records requests.

license

the scraper code, the normalized csv, the unified database, and this page are all CC-BY-SA 4.0. the underlying values are facts that california’s law put in the public record, not copyrightable in their own right (Feist v. Rural). attribute the brands as the source of any individual measurement; attribute this site as the source of the unification. derivative works should use the same license so the next person can extend it too.

a quick word on the legal posture

everything here was extracted from endpoints that each brand’s own public front-end already calls, with no authentication, using publicly-accessible URLs. Van Buren v. United States (2021) and hiQ Labs v. LinkedIn (9th Cir. 2022) settled the CFAA question for unauthenticated public endpoints. Feist Publications v. Rural Telephone settled that factual concentrations and lot codes are not copyrightable. truth is an absolute defense to defamation. every number on every downstream page is the brand’s own published value, with the brand’s own disclosure page one click away.

this is the same posture consumer reports published their 2018 audit under, the same posture HBBF used in 2019, and the same posture the house subcommittee used in 2021. extending their methodology with more rigor is the entire point.