Importing libraries
We mainly used Pandas for the exploratory analysis and NumPy for mathematical operations and arrays. For plotting we used Matplotlib and Seaborn. For the cleaning of the country names, the packagings and the labels to translate, we used PyCountry, gettext, translate and re. We also needed Folium and json for the map visualizations. We added a few more libraries for the image analysis: requests to download the images, os to read them locally, imutils and time as required by the tutorial we followed, cv2 for image processing, and sklearn for KMeans.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pycountry as pc
import gettext as gt
import re
from translate import Translator
import folium
import json
import requests
import os
from imutils.object_detection import non_max_suppression
import time
import cv2
from sklearn.cluster import KMeans
Loading the dataset
DATA_FOLDER = 'data/'
MAPS_FOLDER = 'maps/'
df = pd.read_csv(DATA_FOLDER + 'en.openfoodfacts.org.products.csv', sep="\t", encoding="utf-8", low_memory=False)
df.head()
We see from here that there is a lot of information, and it seems that there are many NaN values in certain columns.
More precisely, we can compute the number of columns in the dataframe:
print('Number of columns in the DataFrame:', df.columns.size)
There are many features, but we definitely won't work with all of them for the purpose of this project.
As we explore the dataset, we will only keep a few columns from all those we have at our disposal.
NaN proportion for each column
As mentioned earlier, it looks like there are many NaN values. Let's check. Here we compute the proportion of NaN values for each column, and only plot the columns that have less than 50% of NaNs.
nan_proportion = df.isnull().mean(axis=0)
nan_proportion_50 = nan_proportion[nan_proportion<0.5]
print('Number of columns that have a proportion of NaN less than 50%: ',
nan_proportion_50.size, '/', df.columns.size, 'columns.')
plt.figure(figsize=(15, 5))
nan_proportion_50.plot(kind='bar', color='orange')
plt.title('Proportion of NaN for different columns')
plt.show()
Quite surprising! Only 30 of the 173 columns at our disposal are more than half filled. This suggests that we may need to review some of the things we planned to do because of the missing values. Indeed, if a huge amount of values is missing, we won't get very precise insights from these columns.
Choosing the columns to keep
Now let's see which columns we are going to work with. Of course, we keep the information about the products such as the name, brand and url.
Then we also want to keep the information about the countries, since it was one of the main focus of the project.
We will also keep all those related to the composition of the products (serving quantity, additives, fat, sugar, ...) as it will help us get insights about which country eats what (note that in the end, for the report and the project in general, we decided to focus less on this part to deepen our analysis on packagings).
# Columns that we keep for the exploratory analysis
cols = ['product_name', 'url', 'code', 'brands_tags', 'countries_en', 'origins', 'serving_quantity', 'additives_n',
'fat_100g', 'carbohydrates_100g', 'sugars_100g',
'proteins_100g', 'salt_100g', 'labels', 'packaging']
# Filter to only keep the columns we need
df = df[cols]
Cleaning the countries columns
The columns origins and countries_en need a lot of cleaning to be usable later.
# Put everything in lowercase in order to have more consistency
df['countries_en'] = df['countries_en'].str.lower()
df['origins'] = df['origins'].str.lower()
df['packaging'] = df['packaging'].str.lower()
df['labels'] = df['labels'].str.lower()
For outlier exploration, we defined a function to check whether the country columns are in the right format or not.
def countries_format_check(df, column):
"""Counts the number of occurences of some unexpected
types or values for entries in the specified column
"""
counter_wrong_type = 0
counter_empty_string = 0
counter_nan = 0
for entry in df[column]:
if not(type(entry) is str):
counter_wrong_type += 1
elif entry == "":
counter_empty_string += 1
if entry != entry: # entry is NaN <=> entry != entry
counter_nan += 1
print("%d values in " %counter_wrong_type + column + " are not of type 'str'")
print("%d values in " %counter_nan + column + " are nan")
print("%d values in " %counter_empty_string + column + " are empty strings")
countries_format_check(df, 'countries_en')
As we can see, some entries in countries_en are actually not strings: these outliers are always NaN (hence the float type). Since this value is needed, we can drop these samples.
countries_format_check(df, 'origins')
In origins, it is even worse: most entries are NaN. When we are interested in the origins of a product, we need to drop these outliers; otherwise we keep them (or we would lose too much data).
Exploding the countries columns
def explode_dataframe(df, column, separator):
""" Duplicates the rows of the input DataFrame df, by splitting on 'separator' the string contained in 'column'
For example, when using this function on column C, with separator ',' on the following DataFrame
A B C
1 1 2 a,b
2 3 4 c,d,e
We expect a return value like this:
A B C
1 1 2 a
1 1 2 b
2 3 4 c
2 3 4 d
2 3 4 e
"""
# Helper function that duplicates a single 'row' and appends the result to 'rows'
def explode_row(row, rows):
splitted_string = row[column].split(separator)
for s in splitted_string:
new_row = row.to_dict()
new_row[column] = s
rows.append(new_row)
rows = []
df.apply(lambda row : explode_row(row, rows), axis=1)
new_df = pd.DataFrame(rows)
return new_df
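As a quick sanity check, here is a minimal usage sketch on a toy DataFrame matching the docstring example (the toy values are of course hypothetical):
# Toy DataFrame reproducing the docstring example
toy = pd.DataFrame({'A': [1, 3], 'B': [2, 4], 'C': ['a,b', 'c,d,e']})
print(explode_dataframe(toy, 'C', ','))
# Expected: 5 rows, with 'C' taking the values 'a', 'b', 'c', 'd' and 'e'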
First, we split countries_en and origins, since they may contain more than one country.
df_with_countries = explode_dataframe(df.dropna(subset=['countries_en']), 'countries_en', ',')
print('New shape of our DataFrame:', df_with_countries.shape)
As we can see, we end up with quite a lot more rows after splitting, since each country gets its own row.
df_with_origins = explode_dataframe(df_with_countries.dropna(subset=['origins']), 'origins', ',')
print('Shape of the DataFrame with the origins:', df_with_origins.shape)
But due to the number of NaN values in origins, we have a lot less data in df_with_origins. From now on we will use df_with_origins when we need both the origins and the countries where the products are sold, df_with_countries when we only need the countries where the products are sold, and df when we need neither of them. Thus we can drop origins and countries_en in df, and origins in df_with_countries.
df_with_countries = df_with_countries.drop(['origins'], axis=1)
df = df.drop(['countries_en', 'origins'], axis=1)
Formatting the countries columns
Now the main issue is that the format of these strings is not consistent at all. Some country names are written in French, some in English, etc. Even in countries_en (where everything is supposed to be written in English), some entries are in French or in some other language, or even some kind of code. In origins it is even more frequently in a language other than English. Also, some entries are very specific (country + region + county for example), and we are not interested in that much spatial detail. We also still have outliers from which we cannot retrieve any country (like strings that do not contain any spatial information at all).
First we load the most used languages that we will translate from. This choice is based on a previous result that gave us an idea of the countries of the products; we then chose the top 12 languages among those results. English is not in the list because the base country names are already in English.
languages = ['french', 'german', 'spanish', 'italian', 'dutch', 'russian', 'swedish',
'portuguese', 'danish', 'romanian', 'hungarian']
We then use gettext to load the country name translations that come with PyCountry.
country_translations = []
for language in languages:
country_translations.append(gt.translation('iso3166', pc.LOCALES_DIR, languages=[language]))
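As a quick check that the catalogues loaded correctly, a single country name can be translated directly; assuming the French catalogue is the first one in the list, this should print something like 'Allemagne':
# country_translations[0] corresponds to 'french' (first entry of the languages list)
print(country_translations[0].gettext('Germany'))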
Now we need to be able to read the country lists from PyCountry. We define four functions for that.
# Read the country names in english
def read_country_names_en(lowercase=True):
country_names = []
for country in pc.countries:
if lowercase:
country_names.append(country.name.lower())
else:
country_names.append(country.name)
return country_names
# Read the country names translated in another language
def read_country_names_translated(language):
country_names = []
for country in pc.countries:
country_names.append(language.gettext(country.name).lower())
return country_names
# Read the country alpha2 codes
def read_country_alpha2s():
country_names = []
for country in pc.countries:
country_names.append(country.alpha_2.lower())
return country_names
# Read the country alpha3 codes
def read_country_alpha3s():
country_names = []
for country in pc.countries:
country_names.append(country.alpha_3.lower())
return country_names
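A quick look at what these helpers return (only the first few entries are shown; the exact ordering depends on the PyCountry version):
print(read_country_names_en()[:3])
print(read_country_alpha2s()[:3])
print(read_country_alpha3s()[:3])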
Now we can apply that to get a dictionary translating the different country names or codes into the alpha3 code of the country. We need this because we will later use it when visualizing maps with Folium (in order to match country ids in the GeoJSON file). We will also want the English name of the country, but it will be trivial to translate from the alpha3 code to the English name. We also manually defined a dictionary for the strings with the most occurrences that the other dictionaries could not translate (like regions of some countries, or country codes that are not the actual alpha2 code of the country they are supposed to represent).
# Transform a list of keys and a list of values to a dictionary and add it to the dictionary list
def append_dict(dicts_list, keys_list, values_list):
dicts_list.append(dict(zip(keys_list, values_list)))
# Concatenates all the dictionaries into one single dictionary
def concat_dicts(dicts_list):
total_dict = {}
for i in range(0, len(dicts_list)):
total_dict = {**total_dict, **dicts_list[i]}
return total_dict
country_names_dicts = []
countries_en = read_country_names_en()
countries_a3 = read_country_alpha3s() # Values to which the names will map (alpha3 codes of the country names)
append_dict(country_names_dicts, countries_en, countries_a3)
for translation in country_translations:
append_dict(country_names_dicts, read_country_names_translated(translation), countries_a3)
append_dict(country_names_dicts, read_country_alpha2s(), countries_a3)
append_dict(country_names_dicts, read_country_alpha3s(), countries_a3)
manual_dict = {
'en' : 'gbr',
'uk' : 'gbr',
'czech republic' : 'cze',
'south korea' : 'kor',
'eua' : 'usa',
'e.u.a.' : 'usa',
'palestine' : 'psx',
'syria' : 'syr',
'bretagne' : 'fra',
'normandie' : 'fra',
'sud-ouest' : 'fra',
'lait origine france' : 'fra',
'ecosse' : 'gbr',
'provence' : 'fra',
'equateur' : 'ecu',
'thailande' : 'tha',
'sud-ouest de la france' : 'fra',
'république dominicaine' : 'dom',
'ciudad de méxico' : 'mex',
'sicile' : 'ita',
'écosse' : 'gbr',
'camargue' : 'fra',
'bourgogne' : 'fra',
'corse' : 'fra',
'alsace' : 'fra',
' españa' : 'esp',
'bolivie' : 'bol',
'porc origine france' : 'fra',
'fr:lait de france' : 'fra',
'pays de la loire' : 'fra',
'aquitaine' : 'fra',
'québec' : 'can',
'ardèche' : 'fra',
'andalucía' : 'esp',
'agriculture france' : 'fra',
'angleterre' : 'gbr',
'ile de la réunion' : 'fra',
'etats-unis' : 'usa',
'val de loire' : 'fra',
'rhône-alpes' : 'fra',
'aveyron' : 'fra',
'auvergne' : 'fra',
'california' : 'usa',
'franche-comté' : 'fra',
'savoie' : 'fra',
'russie' : 'rus',
'alaska' : 'usa',
'fr:deutschland' : 'deu',
'ch:suisse' : 'che',
'loire-atlantique' : 'fra',
'de:allemagne' : 'deu',
'fr:schweiz' : 'che',
'fr:frankreich' : 'fra',
'republic of macedonia' : 'mkd',
'fr:quebec' : 'can',
'fr:dom-tom' : 'fra',
'democratic republic of the congo' : 'cog',
'republic of the congo' : 'cog'}
country_names_dicts.append(manual_dict)
countries_dictionary = concat_dicts(country_names_dicts)
We also define a dictionary from the proper alpha3 code to the English name. We keep the original case because these names may appear on some plots, so they need to look nice!
a3_to_en_dict = dict(zip(countries_a3, read_country_names_en(lowercase=False)))
We can now easily define an efficient function that translates a name with a dictionary.
# Translate the parameter 'word' to its corresponding value in the dictionary, or NaN if it is not found in the dictionary.
def translate(word, dictionary):
return dictionary.get(word, np.NaN)
# Translate a whole column of a DataFrame with a dictionary.
def translate_column(df, column, to_column, dictionary):
# Translate every word in the column
df[to_column] = df[column].map(lambda word : translate(word, dictionary))
# Get rid of the new NaN values induced by impossibility to translate
df = df.dropna(subset=[to_column])
return df
For example, when used on 'allemagne' and on 'deu' with the corresponding dictionaries, we get:
print("Alpha-3 code of 'allemagne': " + translate('allemagne', countries_dictionary))
print("Country (in english) corresponding to the code 'deu': " + translate('deu', a3_to_en_dict))
# Translate the 'countries_en' column in df
df_with_countries = translate_column(df_with_countries, 'countries_en', 'countries_a3', countries_dictionary)
df_with_countries = translate_column(df_with_countries, 'countries_a3', 'countries_en', a3_to_en_dict)
# Translate the 'countries_en' and the 'origins' columns in df_with_origins
df_with_origins = translate_column(df_with_origins, 'countries_en', 'countries_a3', countries_dictionary)
df_with_origins = translate_column(df_with_origins, 'countries_a3', 'countries_en', a3_to_en_dict)
df_with_origins = translate_column(df_with_origins, 'origins', 'origins_a3', countries_dictionary)
df_with_origins = translate_column(df_with_origins, 'origins_a3', 'origins', a3_to_en_dict)
print('Shape of our DataFrame with the countries:', df_with_countries.shape)
print('Shape of the DataFrame with the origins:', df_with_origins.shape)
As we can see, a few rows were lost in df_with_countries (~600 out of ~728k), but many were lost in df_with_origins (~22k out of ~62k) because the format of origins is very inconsistent.
Packaging and Labels columns formatting
For the packaging and labels columns, it is even more complicated to perform the cleaning.
Imagine a product whose packaging is plastic, bag. Exploding the packaging would result in two different rows, one with packaging plastic and the other with packaging bag. Since we want to compute statistics on the packagings, this would give erroneous proportions for each of them.
For the country-related columns, this was not an issue because we are not interested in the proportion of products produced in each country; rather, we want to compute statistics on other interesting information for each country.
Our decision is to define different accessors to get each interesting label and each interesting packaging type in the dataset. With this approach we avoid the problem stated above. The results will still be slightly biased, because we only consider entries written in one of the top languages; since we will not give any country-specific result for countries that do not have enough data points, this bias will be very small.
First we define a function that returns a boolean mask for a dataframe, computed from whether or not the input string is contained in each row (in the specified column).
def contains_str(df, column, string):
return df[column].str.contains(string, na=False, flags=re.IGNORECASE)
Now we can use this function to define a higher-level one, which returns all the rows of the input dataframe such that, in the specified column, at least one of the included_strings is contained and none of the excluded_strings is contained.
def filter_str_column(df, column, included_strings, excluded_strings=[]):
df = df.reset_index(drop=True)
boolean_mask = pd.Series(np.full((len(df)), False, dtype=bool))
for string in included_strings:
boolean_mask = boolean_mask | contains_str(df, column, string)
for string in excluded_strings:
boolean_mask = boolean_mask & ~contains_str(df, column, string)
return df[boolean_mask]
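A minimal illustration on a toy DataFrame (hypothetical labels) of how the inclusion and exclusion lists interact:
toy = pd.DataFrame({'labels': ['organic', 'no vegan', 'vegan, organic', None]})
print(filter_str_column(toy, 'labels', ['vegan'], ['no vegan']))
# Only the row 'vegan, organic' is kept: 'no vegan' is excluded and NaN never matches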
We also defined a function to translate a word into a list of translations in the input languages. This was used to quickly produce translations for the labels and packagings that we are interested in, but since it is API-based (and thus slow) we hard-coded the translations (so multi_translate is not used anymore). We also sometimes simplified them, since we only need the string to be included (so a prefix works as well).
def multi_translate(word, languages):
translations = []
for language in languages:
translator = Translator(to_lang=language)
translations.append(translator.translate(word))
return translations
The final keywords are here:
# Usual words or prefixes for organic products
# The regex \\bAB\\b is to find the acronym AB alone (a popular organic label in France)
organic_included = ['bio', 'organi', 'orgáni', 'Økologisk', 'szerves', '\\bAB\\b']
# Most of the gluten free products contain the word 'gluten' in their labels, or 'glutine' in Italian
# 1 product is labeled "avec gluten" so we have to get rid of it
gluten_free_included = ['gluten', 'glutine', 'клейковина', 'glúten', 'sikér']
gluten_free_excluded = ['avec gluten', 'mit gluten', 'con gluten', 'con glutine',
'met gluten', 'с глютеном', 'med gluten', 'com glúten', 'cu gluten', 'a glutén']
# Most popular ways of writing the vegan label
# Some products are labeled "no vegan" so we have to get rid of them
vegan_included = ['vega', 'végétalien', 'vegetalien', 'végane']
vegan_excluded = ['no vegan']
# Containing 'halal' is necessary and sufficient for a product to be halal
halal_included = ['halal', 'xаляльное', 'helal']
# Fair trade is written down in different ways in the dataframe, so we have to gather most of them
fair_trade_included = ['commerce équitable', 'fairtrade', 'fair trade', 'fair-trade', 'commercio equo e solidale',
'eerlijke handel', 'честная торговля', 'feira comercial', 'comerț echitabil', 'tisztességes üzlet']
# Usual words or prefixes for plastic in the most used languages
plastic_included = ['plast', 'kunststof', 'plástico', 'пластик', 'műanyag']
# Usual words or prefixes for cardboard or paper in the most used languages
cardboard_included = ['cart', 'cardboard', 'karton', 'Картон', 'pap', 'бумага', 'hârtie']
# Usual words or prefixes for glass in the most used languages
glass_included = ['verre', 'glas', 'vidrio', 'стекло', 'sticlă', 'pohár']
# Usual words or prefixes for metal or aluminium in the most used languages
metal_included = ['metal', 'métal', 'металл', 'fém', 'alumini', 'alumíni', 'алюминий', 'can']
With all of that, we can easily define functions to get all the rows of the dataframe that match what we are interested in. Most of the time we only need to include the translations of the information we want, but sometimes we also have to explicitly exclude some results; the values of vegan_excluded and gluten_free_excluded illustrate this well.
def organic_label(df):
return filter_str_column(df, 'labels', organic_included)
def no_organic_label(df):
return filter_str_column(df, 'labels', [''], organic_included)
def vegan_label(df):
return filter_str_column(df, 'labels', vegan_included, vegan_excluded)
def gluten_free_label(df):
return filter_str_column(df, 'labels', gluten_free_included, gluten_free_excluded)
def halal_label(df):
return filter_str_column(df, 'labels', halal_included)
def fair_trade_label(df):
return filter_str_column(df, 'labels', fair_trade_included)
def plastic_packaging(df):
return filter_str_column(df, 'packaging', plastic_included)
def cardboard_packaging(df):
return filter_str_column(df, 'packaging', cardboard_included)
def glass_packaging(df):
return filter_str_column(df, 'packaging', glass_included)
def metal_packaging(df):
return filter_str_column(df, 'packaging', metal_included)
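For example, these accessors make it straightforward to count how many products match each label (a rough sketch; the exact numbers depend on the dataset version):
for label_function in [organic_label, vegan_label, gluten_free_label, halal_label, fair_trade_label]:
    print(label_function.__name__, ':', len(label_function(df)))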
Note that this section has not really changed since Milestone 2. Some of the analysis contained avoidable bias, and we decided to take a different approach in the later sections of the notebook (Stats per country). We left the parts of the code that are still relevant (import-export) and removed what was no longer interesting.
Import - Export
As a first analysis, we will try to visualize which countries import and export, for the main countries in the dataset.
Let's get the destination and origin countries of the products, and first visualize the top destination countries:
dest_count = df_with_countries['countries_en'].value_counts()
We plot below the top 25 destination countries. Note that since France has a significantly higher number of products (this is a French dataset), we also plot in log scale.
#Create subplots
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 7))
#Plot the count of the products for each destination country
dest_count[:25].plot(kind='barh', color='orange', ax=ax1)
ax1.set_xlabel('Number of products')
#Plot again in log scale
dest_count[:25].plot(kind='barh', color='orange', ax=ax2, log=True)
ax2.set_xlabel('Number of products')
f.suptitle('Number of products in the dataset, per country')
plt.show()
Not surprisingly, the top countries are mainly European (again, the dataset focuses a lot on France and Europe), but we can also see the US, Canada and Mexico, for example.
The analysis below was performed in Milestone 2, and we kept the heatmap visualization in the notebook. Note that later, we will present the new analysis, which is more accurate.
We decided to focus our analysis on the countries that have the most products in their inventory, so that we work with the countries that have the most data points.
So we will keep those top countries and run the analyses on the first 15 of them. This will also help the current visualizations, which are mostly bar charts (impossible to read if there are too many countries).
top_15 = df_with_origins['countries_en'].value_counts()[:15].index
Here, we will visualize the origins and destinations for the top 15 destination countries. We chose to represent this information as a heatmap: for each product, we add one to the cell of a matrix corresponding to its (origin, destination) pair of countries.
Once the matrix is filled with the counts of the different pairs, we use Seaborn's heatmap() function to visualize the results.
def compute_heatmap(df, countries):
"""Given a dataframe, compute a
heatmap matrix for import-export
analysis of the given countries"""
# Create the matrix that will contain the counts
matrix = pd.DataFrame(np.zeros((countries.size, countries.size)), index=countries, columns=countries)
# Fill in the matrix (for the top countries)
for idx, row in df.iterrows():
origin = row['origins']
destination = row['countries_en']
if destination in countries and origin in countries:
matrix.at[origin, destination] += 1
return matrix
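As an aside, the same matrix can be computed without an explicit Python loop using pandas' crosstab (a sketch for reference, not the function used in the rest of the notebook):
def compute_heatmap_crosstab(df, countries):
    """Vectorized equivalent of compute_heatmap using pd.crosstab"""
    # Cross-tabulate origin vs. destination counts
    counts = pd.crosstab(df['origins'], df['countries_en'])
    # Keep only the requested countries, filling absent pairs with 0
    return counts.reindex(index=countries, columns=countries, fill_value=0)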
def plot_heatmap(matrix):
"""Plot the given matrix as a heatmap"""
# Create subplots
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))
# Use Seaborn to plot a heatmap using the matrix
sns.heatmap(matrix, linewidth=0.5, cmap='Reds', ax=ax1)
ax1.set_ylabel('Origin Country')
ax1.set_xlabel('Destination Country')
# Plot again in log scale
sns.heatmap(np.log(matrix+1), linewidth=0.5, cmap='Reds', ax=ax2)
ax2.set_ylabel('Origin Country')
ax2.set_xlabel('Destination Country')
f.suptitle('Import-Export count between countries')
plt.show()
plot_heatmap(compute_heatmap(df_with_origins, top_15))
As France has a much larger count, we also plotted on the right with a log scale to get better insights. What we see from this plot is very interesting.
With this dataset, it looks like many countries actually produce their own food (look at the diagonal). Again, as France always has more data in this dataset, we see that France imports from and exports to many countries. This may well be biased, and that is why we prefer to also focus on the other countries.
Spain, Mexico, Australia and the UK also produce a lot of their own food. As big exporters, we spot Spain, Germany and Italy, which makes sense for Europe, and the US for the American countries. As importers, the plot suggests that Germany, Switzerland, Australia and the UK are the main ones.
Stats per country
In this section we compute several dataframes containing stats for each country, in order to compare the countries. These dataframes will be visualized on the maps of the next section.
Nutrition facts
We are interested in various nutrition facts and we want to compare some food compositions between countries.
composition_cols = ['additives_n', 'fat_100g', 'carbohydrates_100g', 'sugars_100g', 'proteins_100g', 'salt_100g']
First we define a function to drop the 'obvious' outliers. That is: if a value is supposed to be in an interval, drop any value that is not in this interval.
def drop_outliers(df, column, min_value, max_value):
df = df.dropna(subset=[column])
df = df[(df[column] <= max_value) & (df[column] >= min_value)]
return df
Now we can define a function that computes the stats per country for a given column (for example fat_100g).
def compute_stats_per_country(df, column, min_value, max_value):
df = drop_outliers(df, column, min_value, max_value)
df_stats = df[['countries_a3', column]].groupby('countries_a3').describe()
return df_stats
We'll also need a function to drop the nan values and keep only the countries with enough data samples:
def format_for_choropleth(df, column, min_count=0):
df = df[column].reset_index().dropna(subset=['mean'])
if min_count <= 0:
return df
else:
return df[df['count'] >= min_count]
We compute the stats for each food composition category and store them in a single dataframe df_nutrition.
frames = []
for col in composition_cols:
frames.append(compute_stats_per_country(df_with_countries, col, 0, 100))
df_nutrition = pd.concat(frames, axis=1, sort=True)
df_nutrition.index = df_nutrition.index.map(lambda a3 : a3.upper())
Packaging and labels
Recall the packagings and the labels of interest.
packagings_of_interest = ['plastic', 'cardboard', 'glass', 'metal']
labels_of_interest = ['organic', 'vegan', 'gluten_free', 'halal', 'fair_trade']
We defined a function to compute the proportion of products with a specified packaging or label, for a specified country. It also returns the amount of data available for the corresponding column (labels or packaging), so that later on we do not display countries with too few data samples.
def proportion(df, category, kind): # Result is a count and a percentage
total_size = df.size
df = df.dropna(subset=[category])
# In case we are looking at the packaging we only need to divide by the size without the nan values
if category == 'packaging':
total_size = df.size
if kind == 'plastic':
quantity = plastic_packaging(df).size
elif kind == 'cardboard':
quantity = cardboard_packaging(df).size
elif kind == 'glass':
quantity = glass_packaging(df).size
elif kind == 'metal':
quantity = metal_packaging(df).size
elif kind == 'organic':
quantity = organic_label(df).size
elif kind == 'vegan':
quantity = vegan_label(df).size
elif kind == 'gluten_free':
quantity = gluten_free_label(df).size
elif kind == 'halal':
quantity = halal_label(df).size
elif kind == 'fair_trade':
quantity = fair_trade_label(df).size
return total_size, (quantity * 100 / total_size)
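For instance, the overall proportion of plastic packaging over the whole dataset can be obtained as follows (the exact value of course depends on the dataset version):
_, plastic_percentage = proportion(df_with_countries, 'packaging', 'plastic')
print('Proportion of packagings containing plastic: %.1f%%' % plastic_percentage)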
Now we define a function to compute the proportion of something per country.
keys = np.unique(df_with_countries['countries_a3'].values)
def compute_proportion_per_country(df, kind, category):
proportions = []
for key in keys:
df_country = df[df['countries_a3'] == key]
proportions.append(proportion(df_country, kind, category))
columns = pd.MultiIndex(levels=[[category], ['count', 'mean']], labels=[[0, 0], [0, 1]])
return pd.DataFrame(proportions, keys, columns=columns)
We compute the proportions of each label and packaging of interest and store them in a single dataframe df_proportions.
frames = []
for packaging in packagings_of_interest:
frames.append(compute_proportion_per_country(df_with_countries, 'packaging', packaging))
for label in labels_of_interest:
frames.append(compute_proportion_per_country(df_with_countries, 'labels', label))
df_proportions = pd.concat(frames, axis=1, sort=True)
df_proportions.index = df_proportions.index.map(lambda a3 : a3.upper())
Below is the new analysis for the import-export. The way we now proceed is as follows: we define a function that, given a specific country, returns a DataFrame with the countries it imports from and the countries it exports to, together with the corresponding proportions.
Imports and exports
We define a function to compute a dataframe with the import and export proportions for a given country.
def compute_import_export(df, country):
# Export count
exporting_to = df[(df['origins'] == country) & ~(df['countries_en'] == country)]
export_count = exporting_to.groupby('countries_a3').size()
export_percentage = export_count * 100 / export_count.sum()
# Import count
importing_from = df[(df['countries_en'] == country) & ~(df['origins'] == country)]
import_count = importing_from.groupby('origins_a3').size()
import_percentage = import_count * 100 / import_count.sum()
columns_export = pd.MultiIndex(levels=[['export'], ['mean']], labels=[[0], [0]])
columns_import = pd.MultiIndex(levels=[['import'], ['mean']], labels=[[0], [0]])
# Convert to dataframe, and set the alpha3 to uppercase
df_export = pd.DataFrame(export_percentage, columns=columns_export)
df_import = pd.DataFrame(import_percentage, columns=columns_import)
df_export.index = df_export.index.map(lambda a3 : a3.upper())
df_import.index = df_import.index.map(lambda a3 : a3.upper())
frames = [df_export, df_import]
df_import_export = pd.concat(frames, axis=1, sort=True)
return df_import_export
As an example, we use it with France (the country with the most data samples).
df_import_export_fr = compute_import_export(df_with_origins, 'France')
This DataFrame is used in the map visualizations.
Here, we compute the proportion of products sold in Switzerland that are local. It is easily adaptable to any country, just like all the functions we defined.
imported_color = '#718281'
local_color = '#40b515'
local_swiss_products = len(df_with_origins[(df_with_origins['countries_en'] == 'Switzerland') & \
(df_with_origins['origins'] == 'Switzerland')])
imported_swiss_products = len(df_with_origins[(df_with_origins['countries_en'] == 'Switzerland') & \
(df_with_origins['origins'] != 'Switzerland')])
plt.pie([local_swiss_products, imported_swiss_products], \
labels=['Local products', 'Imported products'], \
colors=[local_color, imported_color], \
explode=(0, 0.08), startangle=45)
plt.title('Origins of Swiss products')
plt.axis('equal')
plt.show()
importation_ratio_switzerland = imported_swiss_products / (imported_swiss_products + local_swiss_products) * 100
print(f'Importation ratio for Swiss products: {importation_ratio_switzerland:.2f}%')
For this plot we had enough data, so we decided to do it with Swiss products.
It represents the proportion of food produced in Switzerland relative to the total number of products sold in Switzerland (remember that we do not actually have sales numbers; this is the number of products in the inventory). We can see that local products represent a big part of the chart, which is very good news, as it means that food is not travelling too far before reaching Swiss consumers.
Note that what we call 'local products' are made in Switzerland, but this does not necessarily mean that they are made by a Swiss company.
Even if we do not have enough data to get meaningful stats on all countries, it will be good to visualize our results on a world map. Some countries will just be colored grey if they do not have enough data samples.
First we load the geo.json file that will allow us to color the countries.
geo_json_data = json.load(open(DATA_FOLDER + 'world_low.geo.json'))
europe_countries = folium.GeoJson(
geo_json_data,
name='geojson'
)
Then we define a function to draw a Folium choropleth on a map from a multi-index dataframe of the form of df_nutrition and df_proportions.
# Draw the choropleth
def draw_choropleth(df, column, min_count, title, color='OrRd', nan_opacity=1, nan_color='#afafaf'):
m = folium.Map(
location=[46.8, 8.2],
tiles='cartodbpositron',
zoom_start=2
)
m.choropleth(geo_data=geo_json_data,
data=format_for_choropleth(df, column, min_count),
name=title,
columns=['index', 'mean'], key_on='feature.properties.adm0_a3',
fill_color=color, fill_opacity=0.5, line_opacity=0.2,
nan_fill_opacity=nan_opacity,
nan_fill_color=nan_color,
legend_name=title
)
return m
Now we can draw the maps with every different food composition information on them.
m = draw_choropleth(df_nutrition, 'additives_n', 100, 'Average number of additives in the food products')
m.save(MAPS_FOLDER + 'world_additives.html')
display(m)
draw_choropleth(df_nutrition, 'fat_100g', 100, 'Average proportion of fat in the food products (%)').save(MAPS_FOLDER + 'world_fat.html')
draw_choropleth(df_nutrition, 'sugars_100g', 100, 'Average proportion of sugars in the food products (%)').save(MAPS_FOLDER + 'world_sugars.html')
draw_choropleth(df_nutrition, 'carbohydrates_100g', 100, 'Average proportion of carbohydrates in the food products (%)').save(MAPS_FOLDER + 'world_carbohydrates.html')
draw_choropleth(df_nutrition, 'proteins_100g', 100, 'Average proportion of proteins in the food products (%)').save(MAPS_FOLDER + 'world_proteins.html')
draw_choropleth(df_nutrition, 'salt_100g', 100, 'Average proportion of salt in the food products (%)').save(MAPS_FOLDER + 'world_salt.html')
Same here for the different packaging proportions and label proportions. Unfortunately some of them (especially 'Halal', 'Fair-trade', 'Vegan' and 'Gluten-free') are very subject to outliers, rendering the results almost unusable.
m = draw_choropleth(df_proportions, 'plastic', 100, 'Proportion of packagings containing plastic (%)', 'Blues')
m.save(MAPS_FOLDER + 'world_plastic.html')
display(m)
draw_choropleth(df_proportions, 'metal', 100, 'Proportion of packagings containing metal (%)', 'Reds').save(MAPS_FOLDER + 'world_metal.html')
draw_choropleth(df_proportions, 'glass', 100, 'Proportion of packagings containing glass (%)', 'Blues').save(MAPS_FOLDER + 'world_glass.html')
draw_choropleth(df_proportions, 'cardboard', 100, 'Proportion of packagings containing cardboard (%)', 'Oranges').save(MAPS_FOLDER + 'world_cardboard.html')
draw_choropleth(df_proportions, 'organic', 100, 'Proportion of organic products (%)', 'Greens').save(MAPS_FOLDER + 'world_organic.html')
draw_choropleth(df_proportions, 'halal', 100, 'Proportion of halal products (%)', 'Reds').save(MAPS_FOLDER + 'world_halal.html')
draw_choropleth(df_proportions, 'fair_trade', 100, 'Proportion of fair-trade products (%)', 'Greens').save(MAPS_FOLDER + 'world_fair_trade.html')
draw_choropleth(df_proportions, 'vegan', 100, 'Proportion of vegan products (%)', 'Greens').save(MAPS_FOLDER + 'world_vegan.html')
draw_choropleth(df_proportions, 'gluten_free', 100, 'Proportion of gluten-free products (%)', 'Oranges').save(MAPS_FOLDER + 'world_gluten_free.html')
We can also draw maps of the exports and imports of France.
m = draw_choropleth(df_import_export_fr, 'export', 0, 'Export destinations for France (%)', nan_opacity=0)
m.save(MAPS_FOLDER + 'france_exports.html')
display(m)
m = draw_choropleth(df_import_export_fr, 'import', 0, 'Import origins for France (%)', nan_opacity=0)
m.save(MAPS_FOLDER + 'france_imports.html')
display(m)
Links to the most important maps (they should all be displayed above if you are viewing our notebook as HTML):
Average proportion of fat in the food products in the world
Links to all the other maps (most of them are still interesting):
Average number of additives in the food products in the world
Average proportion of carbohydrates in the food products in the world
Average proportion of proteins in the food products in the world
Average proportion of salt in the food products in the world
Average proportion of sugars in the food products in the world
Proportion of packagings containing cardboard in the world
Proportion of packagings containing glass in the world
Proportion of packagings containing metal in the world
Proportion of vegan food products in the world
Proportion of organic food products in the world
Proportion of halal food products in the world
Next, we compute the proportion of each material of interest in the packaging, conditioned on the organic label.
First we define a function to compute the 99.99% confidence interval. This function works only for binary categorical data.
def error_99_99(count, proportion):
numerator = 3.891 * (count)*(1-proportion)*(proportion)
denominator = (count-1) * np.sqrt(count)
return numerator/denominator
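The constant 3.891 is (approximately) the two-sided z-value for a 99.99% confidence level. Assuming SciPy is available, it can be recovered like this:
from scipy.stats import norm
# A two-sided 99.99% interval leaves 0.005% of probability mass in each tail
print(norm.ppf(1 - 0.0001 / 2))  # ~3.8906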
Now we can compute the proportions, errors and counts for all labeled products that are not organic.
no_organic_proportions = [0, 0, 0, 0]
no_organic_count, no_organic_proportions[0] = proportion(no_organic_label(df), 'packaging', 'plastic')
no_organic_proportions[1] = proportion(no_organic_label(df), 'packaging', 'cardboard')[1]
no_organic_proportions[2] = proportion(no_organic_label(df), 'packaging', 'glass')[1]
no_organic_proportions[3] = proportion(no_organic_label(df), 'packaging', 'metal')[1]
no_organic_errors = [error_99_99(no_organic_count, no_organic_proportions[i] / 100) * 100 for i in range(0, 4)]
Same for products that are organic.
organic_proportions = [0, 0, 0, 0]
organic_count, organic_proportions[0] = proportion(organic_label(df), 'packaging', 'plastic')
organic_proportions[1] = proportion(organic_label(df), 'packaging', 'cardboard')[1]
organic_proportions[2] = proportion(organic_label(df), 'packaging', 'glass')[1]
organic_proportions[3] = proportion(organic_label(df), 'packaging', 'metal')[1]
organic_errors = [error_99_99(organic_count, organic_proportions[i] / 100) * 100 for i in range(0, 4)]
We can now display the results as bar charts with confidence intervals.
bar_width = 0.3
no_organic_pos = np.arange(4)
organic_pos = [x + bar_width for x in no_organic_pos]
no_organic_color = '#6d6d6d'
organic_color = '#5dd861'
edge_color = 'white'
capsize = 30
plt.figure(figsize=(15, 7))
# No-organic bars
plt.bar(no_organic_pos, no_organic_proportions, width=bar_width, color=no_organic_color, edgecolor=edge_color, yerr=no_organic_errors, capsize=capsize, label='No organic label')
# Organic bars
plt.bar(organic_pos, organic_proportions, width=bar_width, color=organic_color, edgecolor=edge_color, yerr=organic_errors, capsize=capsize, label='Organic label')
plt.xticks([r + bar_width/2 for r in range(4)], ['Plastic', 'Cardboard', 'Glass', 'Metal'])
plt.ylabel('Proportion (%)')
plt.title('Proportion of packagings containing each material conditioned on the Organic label')
plt.legend()
# Display
plt.show()
We can barely see the confidence intervals, which is very good: it means we have enough data samples to be very confident about our results (remember that this is a 99.99% confidence interval). However, this is only true when assuming the dataset is unbiased, which is unfortunately not the case. We can still be fairly convinced that organic products tend to have different packaging than average products (more glass and less plastic). This could be explained by a bias on the type of products that are more likely to be organic (fruits and vegetables for example) and at the same time less likely to be put in a plastic packaging (dual causality effect), by marketing choices, or by several other reasons we did not think of. Note that the total does not sum to 100%: some packagings contain multiple materials (cardboard and plastic for example) and others contain no explicit description of the material ('bottle', for example, could be a plastic or glass bottle, so we did not draw any conclusions from non-explicit packaging materials).
Downloading images
We had to go through another API to retrieve the images because the URLs in the csv file were all wrong. The Open Food Facts database provides a JSON API which we used to get the image URLs; we then downloaded the images and stored them locally, which was more convenient for our study.
# Retrieving front image url from barcode
def get_image_url(code):
url = 'https://world.openfoodfacts.org/api/v0/product/' + str(code) + '.json'
r = requests.get(url)
json = r.json()
if r.status_code == 200 and json['status_verbose'] == 'product found':
try:
return json['product']['image_url']
except KeyError:
try:
return json['product']['selected_images']['front']['display']['fr']
except KeyError:
return 'No image available'
else:
return 'No image available'
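A quick sanity check on a single barcode taken from the dataframe (assuming the API is reachable; the selected product is arbitrary):
sample_code = df['code'].dropna().iloc[0]
print(get_image_url(sample_code))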
# Saving image locally from url to a specific folder which needs to be created beforehand
def save_image(name, image_url, output_folder):
image_data = requests.get(image_url).content
file_name = name + '.jpg'
with open(output_folder + file_name, 'wb') as handler:
handler.write(image_data)
# Downloading n images (or all by default) from a df to an output folder
def download_images(df_to_download, output_folder, n=0):
if not os.path.exists(output_folder):
os.makedirs(output_folder)
i = 0
if n > 0:
barcodes = df_to_download.tail(n)['code']
else:
barcodes = df_to_download['code']
length = len(barcodes)
for code in barcodes:
url = get_image_url(code)
if url != 'No image available':
save_image(code, url, output_folder)
i = i + 1
progress = str(i) + '/' + str(length)
print(progress, end='\r')
print('Download completed!')
n = 1000
print('Downloading', n, 'images corresponding to organic products:')
download_images(organic_label(df), 'demo_bio/', n)
print('Downloading', n, 'images corresponding to non organic products:')
download_images(no_organic_label(df), 'demo_non_bio/', n)
Computing dominant colors in a collection of images
This part is done in four steps:
Detect the text areas on each image.
Extract the two main colors of these areas with a first KMeans (hopefully text and background).
Convert these colors to HSV and keep only those with a high enough Saturation and Value (i.e. the more colorful ones).
Run a second KMeans over all the remaining colors to extract 50 main colors and draw a spectrum of the dominant colors over the image collection.
Essentially, we thought that looking for colors on the whole image would not give good results, as many images have different backgrounds. Because we were only interested in the packaging colors, we had the idea to first run a text detection and then select only the areas containing text.
Text detection
In order to detect the text on the images, we used the OpenCV library and followed this tutorial: https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/. This way, when extracting the colors, we only focus on colors around the text of the images, and we really get the main colors rather than noise (from the background, etc.).
# https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
# Declare some constants used in the text detection functions
# A threshold to decide whether an area contains text or not
DETECTION_CONFIDENCE = 0.5
# The dimension to which all images are resized during the detection
(SUB_IMAGE_H, SUB_IMAGE_W) = (320, 320)
# https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
# load the pre-trained EAST text detector
def load_network():
return cv2.dnn.readNet('east_detector.pb')
# https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
# Detect text on an image, returns the likelihood for each area to contain text
def detect(net, image):
# construct a blob from the image and then perform a forward pass of
# the model to obtain the two output layer sets
blob = cv2.dnn.blobFromImage(image, 1.0, (SUB_IMAGE_W, SUB_IMAGE_H), (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
(scores, geometry) = net.forward(['feature_fusion/Conv_7/Sigmoid', 'feature_fusion/concat_3'])
return (scores, geometry)
# https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
# This function is exactly the one used in the tutorial (cf reference)
def compute_rects_confidences(scores, geometry):
numRows, numCols = scores.shape[2:4]
rects = []
confidences = []
for y in range(0, numRows):
# extract the scores (probabilities), followed by the geometrical
# data used to derive potential bounding box coordinates that
# surround text
scoresData = scores[0, 0, y]
xData0 = geometry[0, 0, y]
xData1 = geometry[0, 1, y]
xData2 = geometry[0, 2, y]
xData3 = geometry[0, 3, y]
anglesData = geometry[0, 4, y]
for x in range(0, numCols):
# Decides whether there is text or not
if scoresData[x] < DETECTION_CONFIDENCE:
continue
# compute the offset factor as our resulting feature maps will
# be 4x smaller than the input image
(offsetX, offsetY) = (x * 4.0, y * 4.0)
# extract the rotation angle for the prediction and then
# compute the sin and cosine
angle = anglesData[x]
cos = np.cos(angle)
sin = np.sin(angle)
# use the geometry volume to derive the width and height of
# the bounding box
h = xData0[x] + xData2[x]
w = xData1[x] + xData3[x]
# compute both the starting and ending (x, y)-coordinates for
# the text prediction bounding box
endX = int(offsetX + (cos * xData1[x]) + (sin * xData2[x]))
endY = int(offsetY - (sin * xData1[x]) + (cos * xData2[x]))
startX = int(endX - w)
startY = int(endY - h)
# add the bounding box coordinates and probability score to
# our respective lists
rects.append((startX, startY, endX, endY))
confidences.append(scoresData[x])
return rects, confidences
# https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
# Draw rectangular boxes around area containing text (for testing)
def draw_boxes(image, rects, confidences):
# Select non overlapping boxes with high enough confidence
boxes = non_max_suppression(np.array(rects), probs=confidences)
for (startX, startY, endX, endY) in boxes:
# Rescale the boxes to the original image dimensions
startX = int(startX / SUB_IMAGE_W * image.shape[1])
startY = int(startY / SUB_IMAGE_H * image.shape[0])
endX = int(endX / SUB_IMAGE_W * image.shape[1])
endY = int(endY / SUB_IMAGE_H * image.shape[0])
cv2.rectangle(image, (startX, startY), (endX, endY), (0, 0, 255), 2)
return image
We show below an example of how the text detection works on one image:
# Demonstrate how text detection works on an image
def demo_text_detection(east_net):
image = cv2.imread('demo/ketchup_bad_quality.jpg')
copy = image.copy()
scores, geometry = detect(east_net, image)
rects, confidences = compute_rects_confidences(scores, geometry)
image_with_detection = draw_boxes(copy, rects, confidences)
plt.imshow(image_with_detection[..., ::-1])
plt.axis('off')
plt.title('Result of text detection')
plt.show()
east_net = load_network()
demo_text_detection(east_net)
The red rectangles clearly show the areas containing text. Computing the dominant colors in these areas should now give us a pretty good idea of the colors on the label.
Computing dominant colors, 1st KMeans
In order to get the main colors of an image, we use KMeans. Later, we will perform a second KMeans, this time to get the dominant colors across all images.
# Returns a list indicating whether or not a pixel has to be considered a feature
# This function is also inspired by the tutorial (cf reference)
# It does the same as 'draw_boxes' but instead of drawing rectangles on the image, it selects the pixels inside them
def extract_features(rects, confidences, features, rW, rH):
# apply non-maxima suppression to suppress weak, overlapping bounding boxes
boxes = non_max_suppression(np.array(rects), probs=confidences)
# loop over the bounding boxes
for (startX, startY, endX, endY) in boxes:
startX = max(0, int(startX * rW))
startY = max(0, int(startY * rH))
endX = min(int(endX * rW), features.shape[1])
endY = min(int(endY * rH), features.shape[0])
for x in range(startX, endX - 1):
for y in range(startY, endY - 1):
features[y, x] = True
return features
# Creates a k x 3 array containing the k dominant colors in a given image
def get_dominant_colors(input_folder, filename, net, kmeans):
image = cv2.imread(input_folder + filename)
orig = image.copy()
(H, W) = image.shape[:2]
image = cv2.resize(image, (SUB_IMAGE_H, SUB_IMAGE_W))
rW = W / float(SUB_IMAGE_W)
rH = H / float(SUB_IMAGE_H)
(scores, geometry) = detect(net, image)
rects, confidences = compute_rects_confidences(scores, geometry)
text_in_image = np.full((H, W), False)
text_in_image = extract_features(rects, confidences, text_in_image, rW, rH)
text_in_image = text_in_image.reshape((H * W))
features = orig.reshape((H * W, 3))[text_in_image == True]
dominant_colors = None
if len(features) > 0:
dominant_colors = kmeans.fit(features.astype(float)).cluster_centers_
return dominant_colors
# Creates an image from a list of dominant colors
def create_dominant_image(dominants, width, height):
nb_colors = len(dominants)
height_per_color = height // nb_colors
color_image = np.zeros((height_per_color * nb_colors, width, 3), np.uint8)
for i in range(nb_colors):
color = dominants[i]
for y in range(height_per_color):
color_image[y + height_per_color * i, :] = color
return color_image
Here, we reuse the same example as before and show which colors we extracted:
# Demonstrate the result of computing the k dominant colors on a single image
def demo_kmeans_on_image(east_net, k):
kmeans = KMeans(k, random_state=0, n_init=4)
dominants = get_dominant_colors('demo/', 'ketchup_bad_quality.jpg', east_net, kmeans)
image_demo = create_dominant_image(dominants, 50, 300)
plt.imshow(image_demo[..., ::-1])
plt.axis('off')
plt.title('Result of KMeans with k = ' + str(k))
plt.show()
demo_text_detection(east_net)
demo_kmeans_on_image(east_net, k=2)
The KMeans algorithm was able to find two dominant colors on the label. This is a good example of how powerful our method is: it shows that the label is mainly green and white, without taking into account the fact that the product itself is red. In the following, the white will also be discarded by the HSV filter, and the remaining colors will be added to the list of colors on which the final KMeans is run.
Computing dominant colors, HSV filter and 2nd KMeans
When computing the dominant colors for the whole dataset, we use the HSV values to focus only on important colors. For example, we decided that black and white are outliers, since they are often used as text colors but are more of a default choice than something specific to a product.
# Filter an image and return only pixels with a high enough Value and Saturation
def hsv_filter(image, thresholdS=40, thresholdV=30):
initial_shape = image.shape
image = np.float32(image)
hsv_image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
hsv_image = hsv_image.reshape((initial_shape[0] * initial_shape[1], 3))
image = image.reshape((initial_shape[0] * initial_shape[1], 3))
filtered = image[(hsv_image[:, 1] > (thresholdS / 100)) & (hsv_image[:, 2] > (thresholdV * 255 / 100))]
return filtered
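A tiny illustration with three hypothetical BGR pixels: pure white and pure black are discarded, while a saturated red is kept.
toy_pixels = np.array([[[255, 255, 255], [0, 0, 0], [0, 0, 255]]], dtype=np.uint8)  # white, black, red (BGR)
print(hsv_filter(toy_pixels))  # only the red pixel [0, 0, 255] should remain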
# Order a list of colors by Hue in order to draw a good looking spectrum
def hsv_order_colors(colors):
colors = np.float32(colors)
colors = colors.reshape((1, len(colors), 3))
hsv_colors = cv2.cvtColor(colors, cv2.COLOR_BGR2HSV)
hsv_colors = hsv_colors.reshape((hsv_colors.shape[1], 3))
# Sorting by Hue
hsv_colors = hsv_colors[hsv_colors[:,0].argsort()[::-1]]
return cv2.cvtColor(hsv_colors.reshape((1, len(hsv_colors), 3)), cv2.COLOR_HSV2BGR).reshape((len(hsv_colors), 3))
# Create an n x k x 3 array with the k dominant colors of n images
# If n = -1, it will take all images in the folder
def get_dominants(input_folder, k, n=-1):
j = 1
empty_images = 0
dominants = np.full((1, 3), 255)
east_net = load_network()
kmeans = KMeans(k, random_state=0, n_init=4)
if n < 0:
n = len([i for i in os.listdir(input_folder) if i.endswith('jpg')])
for filename in os.listdir(input_folder):
if filename.endswith(".jpg"):
try:
dominant_colors = get_dominant_colors(input_folder, filename, east_net, kmeans)
if not dominant_colors is None:
dominants = np.append(dominants, dominant_colors, axis=0)
else:
empty_images += 1
except SystemExit:
raise
# Catching a KeyboardInterrupt exception allows us to stop the function anytime and still retrieve intermediate results
except KeyboardInterrupt:
dominants = np.delete(dominants, 0, axis=0)
return dominants.reshape((len(dominants)//k, k, 3))
# For other exceptions we just skip the current image and move to the next one
except Exception as e:
print(e)
print(f'{j}/{n} ----> {empty_images} images with no text', end='\r')
j = j + 1
if j > n:
break
dominants = np.delete(dominants, 0, axis=0)
return dominants.reshape((len(dominants)//k, k, 3))
We chose to represent the main colors of the dataset as a spectrum, and we provide here the functions to compute it:
# Draw a spectrum of dominant colors from all images in a folder
def demo_spectrum(input_folder, title, k_colors_per_image, k_colors_for_spectrum):
dominants = get_dominants(input_folder, k=k_colors_per_image)
filtered = hsv_filter(dominants)
kmeans = KMeans(k_colors_for_spectrum, random_state=0)
global_dominant = kmeans.fit(filtered).cluster_centers_
global_dominant = hsv_order_colors(global_dominant)
dominant_image = create_dominant_image(global_dominant, 50, 300)
plt.imshow(dominant_image[...,::-1])
plt.axis('off')
plt.title(title)
plt.show()
demo_spectrum('demo_bio/', 'Spectrum of dominant colors in organic products package', \
k_colors_per_image=2, k_colors_for_spectrum=50)
demo_spectrum('demo_non_bio/', 'Spectrum of dominant colors in non organic products package', \
k_colors_per_image=2, k_colors_for_spectrum=50)
As we can see, organic products appear to contain more green colors on their packaging. Brown is also quite present, and the colors in general are gloomier.
Reference:
We followed this tutorial for our text detection, since we couldn't find a library providing functions achieving the same results: https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
Plans for the presentation
For the presentation, we plan to:
Give an introduction about the problem of how food travels, packaging issues and nutrition concerns.
Explain what our project was about (where the food comes from, organic products, packaging & nutritional values).
Show and explain the different insights we got, together with the world maps.
Show the machine learning part and point out the patterns and dominant colors we found for organic products.