Image Processing x Machine Learning: Classifying Leaves

Tonee Bayhon
4 min read · Feb 1, 2021


Visualizing an example

import numpy as np
import matplotlib.pyplot as plt
import skimage.io as skio
from skimage import img_as_ubyte, img_as_float
from skimage.io import imread, imshow
plt.figure(dpi=200)
plant = imread('plantA_1.jpg')
imshow(plant);

Let’s try preparing this image for segmentation. We also crop the borders of the image, as there seem to be extra pixels left over from the original picture:

from skimage.morphology import erosion, dilation, opening, closing
from skimage.measure import label, regionprops
from skimage.color import label2rgb
def multi_dil(im, num):
    # Apply dilation num times
    for i in range(num):
        im = dilation(im)
    return im

def multi_ero(im, num):
    # Apply erosion num times
    for i in range(num):
        im = erosion(im)
    return im
plt.figure(dpi=200)
leaves = 1.0 * ((plant/255) < 0.4)[5:-5,5:-5,0]
imshow(leaves);

Let’s now try to segment this image:

plt.figure(dpi=200)
label_im = label(leaves)
imshow(label_im);

Reading the Images

import glob
import pandas as pd
import re
plants = glob.glob('plant*')
df = pd.DataFrame(plants, columns = ['filename'])
df.head()

Let’s get the labels from the filenames:

def get_label(x):
    # Grab the letter between 'plant' and the underscore, e.g. plantA_1.jpg -> 'A';
    # fall back to the character(s) before the extension when there is no underscore
    try:
        result = re.search(r'(?<=plant).+(?=_)', str(x))[0]
    except TypeError:
        result = re.search(r'(?<=plant).+(?=\.)', str(x))[0]
    return result

df['label'] = df['filename'].apply(get_label)
df
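
As a quick sanity check (assuming filenames follow the plantA_1.jpg pattern shown earlier), the function pulls out the letter between plant and the underscore, or the letter before the extension when there is no underscore:

get_label('plantA_1.jpg')   # 'A'
get_label('plantE.jpg')     # 'E' (hypothetical filename without an underscore)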

Cleaning the Images

We define the following thresholds per label obtained through trial and error:

thresh = {'A':0.4, 'B':0.4, 'C':0.4, 'D':0.7, 'E':0.7}

We then define our cleaning function, which uses erosion and dilation:

def clean_image(img, t=0.5):
    # Read and threshold the image, denoise with 3 erosions followed by 3 dilations,
    # then crop 5 pixels off each border and keep the first channel
    plant = imread(img)
    leaves = 1.0 * ((plant/255) < t)
    im_cleaned = multi_dil(multi_ero(leaves, 3), 3)[5:-5, 5:-5, 0]
    return im_cleaned

Then we apply this to all the files and store the resulting images for further processing:

df['cleaned'] = df.apply(lambda x: clean_image(x.filename, thresh[x.label]), axis=1)

We then segment each image into its individual leaves:

def get_regions(im_cleaned):
    label_im = label(im_cleaned)
    return regionprops(label_im)

df['regions'] = df.apply(lambda x: get_regions(x.cleaned), axis=1)

data = {}
index = 0
for x in df.index:
    leaf_label = df['label'].iloc[x]   # renamed so we don't shadow skimage's label()
    filename = df['filename'].iloc[x]
    for i in df['regions'].iloc[x]:
        data[index] = {'leaf': i,
                       'label': leaf_label,
                       'filename': filename}
        index += 1

data_df = pd.DataFrame(data).T
data_df

From here we can see that we’ve ended up with 260 leaves.

Extracting Features

Let us now extract features from the resulting individual leaves. We will get the following features:

  • Length
  • Width
  • Perimeter
  • Area
  • Centroid Location

Let us first define the functions to get these features:

def get_size_coordinates(props):
    # Bounding box of the region: (min_row, min_col, max_row, max_col)
    x1, y1, x2, y2 = props.bbox
    return x1, y1, x2, y2

def get_length(coord):
    return coord[2] - coord[0]

def get_width(coord):
    return coord[3] - coord[1]

def get_peri(leaf):
    return leaf.perimeter

def get_area(leaf):
    return leaf.area

def get_centroid(leaf):
    # Centroid relative to the region's bounding box
    return leaf.local_centroid

def get_cx(length, centroid):
    return length - centroid[0]

def get_cy(width, centroid):
    return width - centroid[1]

Then we apply the above functions to each leaf:

data_df['coordinates'] = data_df.apply(lambda x: get_size_coordinates(x.leaf), axis=1)
data_df['length'] = data_df.apply(lambda x: get_length(x.coordinates), axis=1)
data_df['width'] = data_df.apply(lambda x: get_width(x.coordinates), axis=1)
data_df['perimeter'] = data_df.apply(lambda x: get_peri(x.leaf), axis=1)
data_df['area'] = data_df.apply(lambda x: get_area(x.leaf), axis=1)
data_df['centroid'] = data_df.apply(lambda x: get_centroid(x.leaf), axis=1)
data_df['cx'] = data_df.apply(lambda x: get_cx(x.length, x.centroid), axis=1)
data_df['cy'] = data_df.apply(lambda x: get_cy(x.width, x.centroid), axis=1)
data_df[['length', 'width', 'area', 'perimeter', 'cx', 'cy', 'label']]

Let’s visualize these features through a pairplot:

import seaborn as sns
sns.set(style="ticks")
df = data_df[['length', 'width', 'area', 'perimeter', 'cx', 'cy', 'label']]
sns.pairplot(df, hue="label", diag_kind='kde')
data_df.groupby('label').size()

Machine Learning Models

We then pass the resulting dataframe through the following machine learning models:

  • KNN
  • Logistic Regression (L1 & L2 Regularizations)
  • SVC (L1 & L2 Regularizations)

But first, let us compute the proportion chance criterion (PCC), the classification accuracy we would expect by chance given the class proportions:

sizes = data_df.groupby('label').size()
sizes_sum = sizes.sum()
((sizes/sizes_sum)**2).sum()*100

We end up with a PCC of roughly 20.07%.
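
For reference, here is a minimal sketch of how these models might be fit and scored with scikit-learn; the train/test split, feature scaling, and hyperparameters below are assumptions rather than the exact settings behind the reported results:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X = data_df[['length', 'width', 'area', 'perimeter', 'cx', 'cy']].astype(float)
y = data_df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Scale the features so the distance-based and regularized models behave sensibly
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Logistic (L1)': LogisticRegression(penalty='l1', solver='liblinear'),
    'Logistic (L2)': LogisticRegression(penalty='l2', solver='liblinear'),
    'SVC (L1)': LinearSVC(penalty='l1', dual=False, max_iter=10000),
    'SVC (L2)': LinearSVC(penalty='l2', max_iter=10000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f'{name}: {model.score(X_test, y_test):.3f}')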

Running our input through the aforementioned machine learning models, we obtain the following accuracies:

We conclude that logistic regression with L1 regularization is the best model for our dataset, with length as the top predictor across the board.
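
One way to inspect which feature drives the predictions is to compare the mean absolute coefficients of the fitted L1 logistic regression; this snippet assumes the models dictionary from the sketch above:

feature_names = ['length', 'width', 'area', 'perimeter', 'cx', 'cy']
logreg_l1 = models['Logistic (L1)']  # fitted in the sketch above

# Mean absolute coefficient per feature across the one-vs-rest classifiers;
# larger magnitudes indicate stronger influence on the predictions
importance = pd.Series(np.abs(logreg_l1.coef_).mean(axis=0), index=feature_names)
print(importance.sort_values(ascending=False))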
