Food Bot

I'd had the idea since 2012 of creating a piece of software to optimize for various criteria, particularly nutrition and cost, given a set of recipes with nutrition data and cost defined. I've worked on this project on and off since then. This was a simple idea that was disproportionately difficult for me to actually see through. It took until this year, 2025, to produce a minimal, pared-down version of what I'd had in mind. It does not currently optimize for cost, as defining nutrient content has proven challenging enough for me, but tacking on that functionality sometime in the future shouldn't be difficult; in theory, I just need to add an additional "nutrient" column.

Nutrition Data

I spent a fair amount of time gathering nutrition data from the USDA and massaging it into a usable form. I was interested in Go, the programming language, when I did this part, so that's what I used for writing the program that generates the SQL that creates and fills the database.

The process was a lot less straightforward than I'd thought it would be, so hopefully this will help others trying to make similar use of the USDA's datasets.

Recipe Gathering

I spent some time looking for recipe collections released under the Creative Commons. I initially found Foodista and crawled the entire wiki, generating YAML files from the HTML. I haven't used any of these yet, though, so far preferring to generate meal plans from the recipes I later pulled from Grimgrains.

Crawling Foodista

First, I gathered recipe URLs using this shell script I wrote:

#!/usr/bin/env sh

set -e

# The "pause" indicates how many seconds to wait between pages.
# Pages are 0-indexed. To start from the beginning, pass "0" as the start.
# The "end" is exclusive. To pull through page 282, pass "282" as the end.
# (Page 282 is at index 281.)

[ "$3" ] || { >&2 echo "usage: $0 <pause> <start> <end>" && exit 1; }

pause="$1"
page="$2"

while [ "$page" != "$3" ]; do
  # only show diagnostic output if this is an interactive terminal
  [ ! -t 1 ] || echo "Fetching page $(expr $page + 1)..."

  curl -s "https://www.foodista.com/browse/recipes?page=$page" \
  | grep -oP '<a href="/recipe/\K[^\"]+' \
  | sed 's|^|https://www.foodista.com/recipe/|'

  sleep $pause

  page=$(expr $page + 1)
done

I made it a little more user-friendly than was necessary for my own personal use because I wanted it to be somewhat easy for others to follow in my footsteps if they so chose.

I then fed the URLs into a script that downloaded the HTML for each recipe's page and plopped it into a directory structure matching the structure of the URL given:

#!/usr/bin/env sh

[ "$2" ] || { >&2 echo "usage: $0 <cache directory> <url list file>" && exit; }

export CACHE_DIR="$1"

tmp="$(mktemp)"
trap 'rm "$tmp"' EXIT INT HUP

cat > "$tmp" <<"EOF"
url="$1"
path="$CACHE_DIR/$(echo "$url" | sed 's|https\?://||')"

if [ -f "$path" ]; then
  echo "Already exists, skipping: $path"
else
  echo "Caching to $path"

  dir="$(dirname "$path")"
  mkdir -p "$dir"
  curl -s -o "$path" "$url"

  # rate limit, don't be *too* obnoxious
  sleep 1
fi
EOF

chmod +x "$tmp"

cat "$2" | xargs -P 10 -n 1 "$tmp"

As I converted these files from HTML to the YAML-based recipe format I'd settled on, based on the format used by Chowdown, I discovered there were about 372 naming collisions:

$ find html-cache -type f -exec basename {} \; > recipe-names
$ sort recipe-names | uniq > recipe-names-unique
$ expr $(wc -l recipe-names) - $(wc -l recipe-names-unique)
372
$

I resorted to incorporating each recipe's unique ID number into its filename, and I decided to put all the files in the same folder, flattening the hierarchy I'd earlier created. If my effort had been less piecemeal, I might've thought to use this convention during the downloading phase, but for whatever reason I was still working on this project in small pieces with long breaks in between.

#!/usr/bin/env sh

set -e

find "$1" -type f | while read f; do
  tmp="$(echo "$f" | rev | cut -d/ -f-2 | rev)"
  new="$(echo "$tmp" | cut -d/ -f2)-$(echo "$tmp" | cut -d/ -f1)"
  mv "$f" "$1/$new"
  rmdir "$(dirname "$f")"
done

Finally, after a false start with grep and sed, I put together a shell script using pup to turn the HTML files into YAML files, and to download each recipe's photo:

#!/usr/bin/env sh

# Dependency: pup

set -e

[ "$2" ] || { echo "usage: $0 <recipe dir> <html source>"; exit 1; }
echo "Converting $2"

mkdir -p "$1/images" || true
img="$1/images/$(basename "$2").jpg"
imgurl="$(pup -f "$2" 'div.featured-image img attr{src}')"

[ -f "$img" ] || curl -s -o "$img" "$imgurl"

title="$(pup -f "$2" '#page-title text{}')"
author="$(pup -f "$2" '.username text{}')"
imgcredit="$(pup -f "$2" 'div.featured-image a text{}')"

if [ "$imgcredit" ]; then
  imgcrediturl="$(pup -f "$2" 'div.featured-image a attr{href}' | tail -n1)"
else
  imgcrediturl=""
  imgcredit="$author"
fi

description="$(pup -f "$2" 'div.field-type-text-with-summary text{}' \
  | sed -z 's/\n\n\+/\n\n/g')"

ingredients="$(pup -f "$2" "div[itemprop="ingredients"]" \
  | tr -d "\n" \
  | sed 's|</div>|</div>\n|g; s|<[^>]\+>||g;' \
  | sed 's/^ \+//g; s/^/- /g' | tr -s ' ')"

directions="$(pup -f "$2" "div[itemprop="recipeInstructions"].step-body" \
  | tr -d "\n" \
  | sed 's|</div>|</div>\n|g; s|<[^>]\+>||g;' \
  | sed 's/^ \+//g; s/^[0-9]\+\. \+//g; s/^/- /g' | tr -s ' ')"

tags="$(pup -f "$2" 'div.field-type-taxonomy-term-reference a text{}' \
  | tr "\n" "," | sed 's/,$//g; s/,/, /g;')"

cat > "$1/$(basename "$2").yml" <<EOF
---

layout: recipe
title: $title
author: $author
license: https://creativecommons.org/licenses/by/3.0/
image: $img
image_credit: $imagecredit
image_credit_url: $imagecrediturl
tags: $tags

ingredients:
$ingredients

directions:
$(echo "$directions" | sed 's/ / /g')

---

$(echo "$description" | sed 's/ / /g')
EOF

Running this script on each HTML file and saving the output under a "recipes" directory produced a 733MB archive of YAML files and images, with 564MB of it being images.

Crawling Grimgrains

I found myself stalled out on the project after Foodista, and eventually reasoned that I would do better to focus on a small set of recipes that I would be able to "complete." So I turned my attention to Grimgrains. This proved to be a correct intuition; while it still took me some time after crawling the recipes to get around to actually analyzing their nutrition data, I did find having a clear end goal and a way to measure my progress against it helped me expend more effort than I might have done otherwise once I got the task underway.

For this, I wound up using make, curl, and pup. I created a git repository on Codeberg called grimgrains-yaml to document and share the process, but as redundancy combats bitrot, and as they're short anyway, I've inlined the scripts that did the crawling here.

The HTML to YAML shell script was a bit simpler than the Foodista one:

#!/bin/bash

cat <<EOF
title: $(pup -f "$1" "main.recipe h1 text{}")
author: Hundred Rabbits (100r.co)
license: https://creativecommons.org/licenses/by-nc-sa/4.0/
image: images/$(basename "$(pup -f "$1" "main.recipe img[src^=../media/recipes] attr{src}" | head -n1)")
image_credit: Hundred Rabbits (100r.co)
image_license: https://creativecommons.org/licenses/by-nc-sa/4.0/

tags:
- vegan

servings: $(pup -f "$1" "main.recipe h2 text{}" | grep -o "^[0-9]\+")

ingredients:
$(paste <(pup -f "$1" "dl.ingredients dt b text{}") \
				<(pup -f "$1" "dl.ingredients dt u text{}") \
	| sed 's|\t| (|; s|^|- |; s|$|)|;')

directions: $(
  pup -f "$1" "ul.instructions" \
  | tr -d '\n' \
	| sed "s/'\;/'/g; s/<li>/\n-/g; s/<[^>]\+>//g; s/ \+/ /g; s/ \([\.,!]\)/\1/g;")
EOF

And for crawling the website, grabbing all the HTML and images, and invoking the script above, I used this makefile:

html/:
	mkdir html
	curl --silent "https://grimgrains.com/site/home.html" \
		| pup "ul.recipes li a attr{href}" \
		| grep -v "basic_toothpaste.html" \
		| while read link; do \
			curl --output-dir html -O "https://grimgrains.com/site/$${link}"; \
		  sleep 1; \
		done

images/: html/
	mkdir images
	ls html/*.html \
	| while read file; do \
		pup -f "$$file" "main.recipe img[src^=../media/recipes] attr{src}" \
		| head -n1 \
		| sed 's|^../|https://grimgrains.com/|' \
		|	xargs curl --output-dir images -O ; \
    sleep 1; \
		done

yaml/todo/: html/
	mkdir -p yaml/todo
	ls html/*.html \
		| while read file; do \
			./html-to-yaml.sh "$$file" > "yaml/todo/$$(basename "$${file%.html}.yml")"; \
		done

Analyzing Recipes

I spent a fair amount of time trying to put together a TUI-based utility for editing recipe YAML files, much more than I'd anticipated, and with only minor success. This was another point at which I wound up putting this project on the back burner once again.

I meanwhile learned Guile Scheme and at the same time found myself enjoying writing Javascript-free web apps in the language. After writing some simple personal utilities, I wrote a web app for viewing and searching my nutrition database and my recipe YAML files, with the intention of adding editing capability. In the process of working on the web app, though, I found that having a browser window open for the ingredients search, a browser window open for the recipe being worked on, and a window for editing the recipe YAML in vim, was really all I needed for a pleasant-enough editing experience. I finished the Grimgrains nutrition data analysis using this workflow.

The "foodbot" proper

I may write more about this later, but for now, as it's getting late: with the recipes gathered and their approximate nutrient profiles determined, the rest was fairly simple. I added a "recipes" table to my nutrition database and added a function to my recipes web app which wrote an indicated recipe to that table, as a means of selecting which recipes should be "in the running" for the meal plans. I wrote a script in Guile Scheme to generate random meal plans, using an elitist genetic algorithm to sift out the ones which best fit my nutrition goals. I have it up on Codeberg as guile-foodbot.

Right now it's hardcoded with my own nutrition goals and genetic algorithm parameters, but I hope to make these more readily user-selectable before I'm done.

Backlinks

Page Created: 2023-05-12

Last Updated: 2025-05-18

Last Reviewed: 2025-03-22