How I automated the creation of my grocery list from a bunch of recipe websites with Python
The problem and motivation
I am not a natural or creative cook so when I want to do some cooking I always need a recipe to follow. My typical process before going on any grocery trip:
- Look through the internet to find interesting recipes
- Aggregate the required ingredients
- Manually group ingredients by food type to optimise shopping approach
This process becomes tedious with increasing numbers of recipes (and more ingredients). Personally, I can’t handle more than 3 recipes at a time (and admire the super dedicated people who can)!
This results in frequent shopping trips, which is undesirable because of the extra time involved and the current COVID-19 situation in Melbourne.
Existing solutions are not ideal
A quick Google search shows that there are apps out there which address this problem. However, (1) the good ones cost money and (2) most apps restrict which websites they accept as input.
I started looking at ways to write a simple Python program to achieve the following:
- Accept as Input: list of recipes from any website
- Produce as Output: list of ingredients grouped by type, with the required quantity and associated recipe for each
The problem consists of two components:
- Ingredient Extraction: Extract ingredients from different websites and combine into a single list
- Food Group Extraction: From the ingredients list, find and extract the associated food groups
My full code, with instructions to run, can be accessed here.
(1) Ingredient Extraction
This is the hardest part but luckily a lot of work has already been put into solving this problem!
I passed each recipe’s URL into an open-source engine created by Zack Scholl and saved the combined output into a data frame, converting all measurements to metric.
This engine takes any recipe website and outputs structured recipe data. In the blog I have linked, Zack describes his approach as well as the general challenges of performing this ingredient extraction task.
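As a rough illustration of the "combine and convert to metric" step (not the exact code in the repository), the sketch below assumes the parser's output has already been reduced to `(ingredient, amount, unit)` tuples per recipe. The unit names, the `TO_METRIC` table and both function names are hypothetical, and US customary volumes are assumed.

```python
# Hypothetical conversion table: unit -> (metric unit, multiplier).
# US customary units are assumed; the parser's real unit names may differ.
TO_METRIC = {
    "cup": ("ml", 236.588),
    "tbsp": ("ml", 14.787),
    "tsp": ("ml", 4.929),
    "oz": ("g", 28.3495),
    "lb": ("g", 453.592),
}


def to_metric(amount, unit):
    """Return (amount, unit) in metric; unknown units pass through unchanged."""
    metric_unit, factor = TO_METRIC.get(unit, (unit, 1.0))
    return round(amount * factor, 1), metric_unit


def combine(recipes):
    """Flatten {recipe_name: [(ingredient, amount, unit), ...]} into one list of rows."""
    rows = []
    for recipe_name, ingredients in recipes.items():
        for name, amount, unit in ingredients:
            metric_amount, metric_unit = to_metric(amount, unit)
            rows.append({
                "recipe": recipe_name,
                "ingredient": name,
                "amount": metric_amount,
                "unit": metric_unit,
            })
    return rows
```

A list of dictionaries like this can be loaded directly into a data frame, one row per ingredient per recipe.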
(2) Food Group Extraction
I used the USDA National Nutrient Database because its food groups roughly aligned with ‘shopping categories’ (I also considered the Australian version, but its food group titles were too detailed to be useful).
I stored the tables of interest (Food Group and Food Description) in an SQLite database so that I could use SQL queries to match ingredients with their associated food group. My query is below:
SELECT group_name, COUNT(*) AS count
FROM FD_DES
LEFT JOIN FD_GROUP ON FD_DES.group_id = FD_GROUP.group_id
WHERE shortdes LIKE ?
AND FD_GROUP.group_id NOT IN ('0300', '0800', ...)
GROUP BY group_name
ORDER BY count DESC
Logic of the query:
- for each ingredient, find all short descriptions in which that ingredient appears, along with their associated food groups
- exclude any food groups with ingredients I almost never use (e.g. ‘0300 — Baby Foods’, ‘0800 — Breakfast Cereals’)
- output the food group in which the ingredient appears most often
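To make the query logic concrete, here is a minimal sketch that runs it with Python's built-in `sqlite3` module against a tiny in-memory database. The demo rows, the simplified two-column schema and the function names are made up for illustration, and only the two excluded group codes named in the text are used.

```python
import sqlite3

QUERY = """
SELECT group_name, COUNT(*) AS count
FROM FD_DES
LEFT JOIN FD_GROUP ON FD_DES.group_id = FD_GROUP.group_id
WHERE shortdes LIKE ?
  AND FD_GROUP.group_id NOT IN ('0300', '0800')
GROUP BY group_name
ORDER BY count DESC
"""


def build_demo_db():
    """Create a toy database; the real USDA tables have many more columns and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE FD_GROUP (group_id TEXT PRIMARY KEY, group_name TEXT);
        CREATE TABLE FD_DES (shortdes TEXT, group_id TEXT);
        INSERT INTO FD_GROUP VALUES
            ('0100', 'Dairy and Egg Products'),
            ('0300', 'Baby Foods'),
            ('1100', 'Vegetables and Vegetable Products');
        INSERT INTO FD_DES VALUES
            ('CHEESE,BLUE', '0100'),
            ('CHEESE,CHEDDAR', '0100'),
            ('BABYFOOD,MACARONI&CHEESE', '0300'),
            ('CAULIFLOWER,W/CHEESE SAU', '1100');
    """)
    return conn


def top_food_group(conn, ingredient):
    """Return the food group whose descriptions mention the ingredient most often."""
    # The '?' placeholder is filled with a wildcard pattern such as '%cheese%'.
    row = conn.execute(QUERY, (f"%{ingredient}%",)).fetchone()
    return row[0] if row else None
```

SQLite's `LIKE` is case-insensitive for ASCII text by default, so a lowercase ingredient still matches upper-case descriptions. Passing the ingredient as a bound parameter rather than formatting it into the SQL string also avoids quoting problems.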
The ‘?’ parameter allows me to alter the query with Python as I iterate over the ingredients list:
- for all ingredients, remove any plurals
- for two-word ingredients, first query both words together and if nothing results, then query the second word by itself — this is based on the assumption that the second word is usually more descriptive of a food category (e.g. ‘blue cheese’, but clearly this logic would not work for something like ‘egg whites’)
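The two rules above can be sketched as follows. This is an illustration under assumptions rather than the repository code: `singular` is a deliberately naive plural stripper, and `lookup` stands in for whatever function runs the SQL query and returns a food group name or `None`.

```python
def singular(word):
    """Naive plural stripping (an assumed rule; it will mangle words like 'hummus')."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # berries -> berry
    if word.endswith("oes"):
        return word[:-2]                # tomatoes -> tomato
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # eggs -> egg
    return word


def find_food_group(ingredient, lookup):
    """Query the full ingredient first, then fall back to the second word."""
    words = [singular(w) for w in ingredient.lower().split()]
    group = lookup(" ".join(words))
    if group is None and len(words) == 2:
        # Assumption from the text: the second word usually names the category.
        group = lookup(words[1])
    return group
```

With a lookup that only knows ‘cheese’ and ‘egg’, ‘blue cheese’ resolves through the fallback, while ‘egg whites’ falls back to ‘white’ and finds nothing, which is exactly the weakness noted above.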
After extracting the food groups, I sorted my final list first by food group and then by ingredient name.
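A minimal sketch of that final sort and CSV export, using only the standard library. The column names in `FIELDS` are assumptions about what the final list looks like, not the exact headers my script writes.

```python
import csv
import io
from operator import itemgetter

# Assumed column layout for the final grocery list.
FIELDS = ["food_group", "ingredient", "amount", "unit", "recipe"]


def sort_rows(rows):
    """Sort by food group first, then by ingredient name within each group."""
    return sorted(rows, key=itemgetter("food_group", "ingredient"))


def to_csv(rows):
    """Return the sorted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(sort_rows(rows))
    return buffer.getvalue()
```

Sorting on a tuple key keeps everything from one food group together, so the list can be worked through section by section in the shop.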
Using the code
To use the code, I just need to create a text file with a list of recipe URLs (an example below) in the same directory as the code.
Then I run the following in my command prompt:
C:\Users\plcpi\Google Drive\Personal Projects\Ingredients extractor> python grocery.py 'recipes0908.txt'
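For reference, reading such a file could look like the sketch below. It assumes one URL per line and that the file name arrives as the first command-line argument; `read_urls` and `main` are hypothetical names, not necessarily what `grocery.py` uses.

```python
import sys


def read_urls(path):
    """Return the non-empty lines of the text file, stripped of whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def main(argv):
    """Entry point sketch: argv[1] is expected to be the URL list file."""
    if len(argv) < 2:
        raise SystemExit("usage: python grocery.py <recipes.txt>")
    return read_urls(argv[1])

# In a real script this would be called as: main(sys.argv)
```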
And here’s an excerpt of the results:
I think the code gets 80–90% of the job done. The initial limitations I found are described below.
The ingredient parser excluded the crucial ingredient ‘choy sum’ from this recipe. This is probably because the parser did not classify that line as an ‘ingredient line’: ‘choy sum’ did not exist in the corpus of ingredients it uses to assess the probability of something being an ingredient.
‘Choy sum’ was also the first item in the ingredients list. The algorithm uses a bottom-up extraction approach, which is why ‘kimchi’ and ‘gochujang’ in this recipe made it through: they were in the middle of the list (and not necessarily in the corpus).
So now I know to look for any ‘unusual’ ingredients at the bottom or top of my chosen recipe!
Food group categorisation errors
The errors represent the limits of the SQL logic. For example, ‘sugar’ appeared most often in the ‘Beverage’ food group because of the ‘sugar added’/‘sugar free’ descriptions for all drinks in the food database. In hindsight, I could have excluded the ‘Beverage’ group, as it is not common for ‘Beverage’ to be an ingredient type anyway.
Using string similarity (e.g. Levenshtein distance) is another approach I could have used to extract the food group. I did not test this on the US database, but I did on the Australian one: it did not improve accuracy by much and slowed the program down quite a bit.
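For readers unfamiliar with it, Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain dynamic-programming version plus a hypothetical `closest_description` helper. It is not the code I tested on the Australian database, only an illustration of the idea.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest_description(ingredient, descriptions):
    """Pick the food description with the smallest edit distance to the ingredient."""
    return min(descriptions, key=lambda d: levenshtein(ingredient.lower(), d.lower()))
```

Each comparison costs time proportional to the product of the two string lengths, and every ingredient has to be compared against every description, which is consistent with the slowdown described above.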
Further, ‘Asian Food’ did not exist as a category in the database which led to some ingredients not being included (‘kimchi’) or incorrectly categorised (‘shaoxing wine’).
In the above example, I put in 5 recipe URLs, which resulted in a CSV file of 67 lines. It then only took me a few minutes to delete the ingredients already in my pantry and re-categorise the couple of ingredients that were in the wrong groups.
Overall, this definitely saves me a lot of time compared to my original manual process — of course, running this code is not worth it for just 1 or 2 (if both are simple) recipes.
Now … on to the actual shopping (😫).
Side note: Improvements
- An easy extension is to consolidate ingredients across recipes, but personally I like to keep them separate in case I decide not to shop for a certain recipe while I am out (which sometimes happens when I can’t find a certain ingredient, and occurs more often now with COVID-19 shortages).
- The code, which requires a Python environment, is not great for sharing or for using without a desktop. Making it a web app (and allowing the user to customise which food groups to exclude and which units of measurement to output) would be a great next step.
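The consolidation mentioned in the first point could be sketched roughly as below. This is not implemented in my code. The row layout (`ingredient`, `unit`, `amount`, `recipe` keys) is an assumption, and amounts are only summed when both the ingredient name and the unit match.

```python
from collections import defaultdict


def consolidate(rows):
    """Sum amounts per (ingredient, unit) and remember which recipes need each one."""
    totals = defaultdict(float)
    recipes = defaultdict(list)
    for row in rows:
        key = (row["ingredient"], row["unit"])
        totals[key] += row["amount"]
        recipes[key].append(row["recipe"])
    return [
        {"ingredient": name, "unit": unit,
         "amount": totals[(name, unit)], "recipes": recipes[(name, unit)]}
        for name, unit in sorted(totals)
    ]
```

Keeping the list of recipes next to each total preserves some of the benefit of the unconsolidated list: it is still visible which recipes are affected if an ingredient is out of stock.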