Classification

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("paper")
from PIL import Image

Risk for Classification

(figure: machine learning diagram, from Abu-Mostafa's Learning From Data course)

Let's define

$$E_{XY}[f(X,Y)] = \int f(x,y) dP_{XY}(x,y) = \int f(x,y) p(x,y)\,dx\,dy$$

where $P$ is the distribution (cdf) and $p$ is the local density. We'll use the Lebesgue measure for continuous rvs and the counting measure for discrete rvs, so as to use the integral symbol everywhere.
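
For instance, with a continuous $X$ and a discrete class label $Y$ taking values $c$, this expectation unpacks (counting measure over $Y$, Lebesgue measure over $X$) as

$$E_{XY}[f(X,Y)] = \int \left(\sum_c f(x,c)\, p(c|x)\right) p(x)\,dx,$$

which is the form we will use below when conditioning on $X=x$.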

Then consider an estimator function (for regression, estimating $f$ as usual) or an action (for classification) $g$. The risk associated with $g$ is:

$$R(g) = E_{XY}[l(g(X),Y)]$$

where $l(g(x),y)$ is the loss associated with choosing action $g(x)$ at $x$ when $Y=y$. For example, for regression we can use the squared loss $l = (g-y)^2$, and for classification we usually use the 1-0 loss $l = \mathbb{1}_{g \ne y}$.
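
In practice we only see a finite sample, so the risk is estimated by its empirical counterpart, the average loss over the data. A minimal sketch (the arrays g_of_x and y below are hypothetical stand-ins for predictions and true values):

import numpy as np

def empirical_risk_01(g_of_x, y):
    # fraction of points where the predicted class differs from the truth
    return np.mean(g_of_x != y)

def empirical_risk_squared(g_of_x, y):
    # average squared error, the regression analogue
    return np.mean((g_of_x - y)**2)

# e.g. empirical_risk_01(clf.predict(Xtest), ytest) for any fitted classifier clf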

$g$ is typically learnt on a given dataset $D_n$; we'll then subscript it as $g_n$ to indicate that. So in general we are interested in $R(g_n)$, and sometimes, in a frequentist outlook, in the average over datasets $E_{D_n}[R(g_n)]$.

For the 1-0 loss, which simply measures classification inaccuracy:

$$R(g) = E_{XY}[l(g(X),Y)] = P_{XY}(g(X) \ne Y) = \int dP_X(x)\, P_{Y|X}(g(x) \ne Y \mid X=x)$$

Thus:

$$R(g) = \int dP_X(x) \left( 1 - P_{Y|X}(g(x) = Y \mid X=x) \right).$$

If we maximize this second term at each 'point' $X=x$ we will minimize the classification risk. This is the maximum a posteriori (MAP) estimate: the 1-0 loss basically picks the mode of the posterior.
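
Written out, the risk-minimizing rule under 1-0 loss picks, at each $x$, the class with the largest posterior probability:

$$g^*(x) = \arg\max_c \, p(c \mid x).$$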

We can also consider more general risk functions:

$$R(g) = E_X\left[\int dP_{Y|X}\, l(g(X),Y)\right] = \int dP_X(x) \left(\sum_c p(c|x)\, l(g(x),c)\right).$$

So if we minimize the "posterior expected loss" at each $x$ we are done. Notice that $g$ here is an action. By this we mean that it is not the probability-estimation function; it can be the prediction function. (While in regression $g$ is the approximating function, here it is just a map conveniently giving us 0's and 1's.) We can even generalize to actions not in the set of possible classifications. Finally, we might not choose to do the entire $x$ integral if we are only interested in part of the sample space, such as the part corresponding to mortgage defaulters or direct-marketing converts.
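
To make the "posterior expected loss" concrete, here is a minimal sketch of choosing, at a single $x$, the action that minimizes it, given posterior class probabilities and a hypothetical asymmetric loss matrix (say, missing a mortgage defaulter costs much more than a false alarm):

import numpy as np

# loss[a, c] = loss of taking action a when the true class is c (made-up numbers)
loss = np.array([[0., 10.],   # action 0 (predict 'no default'): very costly if the truth is 1
                 [1.,  0.]])  # action 1 (predict 'default'): small cost if the truth is 0

def best_action(posterior, loss):
    # posterior expected loss of each action: sum_c p(c|x) * l(a, c)
    expected_loss = loss.dot(posterior)
    return np.argmin(expected_loss)

best_action(np.array([0.8, 0.2]), loss)  # returns 1: predict 'default' even though p(default|x) < 0.5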

In [2]:
df=pd.read_csv("https://dl.dropboxusercontent.com/u/75194/stats/data/01_heights_weights_genders.csv")
df.head()
Out[2]:
Gender Height Weight
0 Male 73.847017 241.893563
1 Male 68.781904 162.310473
2 Male 74.110105 212.740856
3 Male 71.730978 220.042470
4 Male 69.881796 206.349801

Generative Classifier

In a generative model you model $P_{XY}$ directly, usually with some parameters $\theta$. In the example below, one could model the two classes with two multivariate Gaussians, although in the diagram I am faking it with a kdeplot. [POSSIBLE DERIVATION]
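
Here is a minimal sketch of such a generative classifier, assuming we fit one multivariate Gaussian per class by maximum likelihood (sample mean and covariance) and classify with Bayes rule; it uses scipy.stats.multivariate_normal (available in reasonably recent scipy) and is an illustration, not the code behind the plots below:

import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_generative(X, y):
    # per-class MLE: mean, covariance, and class prior
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=0), len(Xc) / float(len(X)))
    return params

def predict_generative(params, X):
    # Bayes rule: pick the class maximizing prior * likelihood
    scores = np.column_stack([prior * multivariate_normal.pdf(X, mean=mu, cov=cov)
                              for (mu, cov, prior) in params.values()])
    classes = np.array(list(params.keys()))
    return classes[np.argmax(scores, axis=1)]

# e.g., with dfdata and dfresu defined below:
# params = fit_gaussian_generative(dfdata, np.array(dfresu)); preds = predict_generative(params, dfdata)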

In [3]:
plt.figure(figsize=(10,6))
sns.kdeplot(df.Weight, df.Height);
In [4]:
dfdata = df[['Weight', 'Height']].values
dfresu = (df['Gender'] == "Male")*1
print dfdata.shape, dfresu.shape
(10000, 2) (10000,)

Discriminative classifier

In a discriminative classifier like logistic regression you model the posterior $P_{Y|X}$, i.e. $p(y \mid x)$, directly. The relationship to the Gaussian generative model above is that if you tie the covariance matrices of the two classes together, the resulting posterior has exactly the logistic form.
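
A sketch of that connection: with class-conditional Gaussians $N(\mu_0, \Sigma)$ and $N(\mu_1, \Sigma)$ sharing a tied covariance $\Sigma$ and priors $\pi_0, \pi_1$, Bayes rule gives a log-odds that is linear in $x$, which is exactly the logistic regression form:

$$\log\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \log\frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1+\mu_0)^T\Sigma^{-1}(\mu_1-\mu_0) + x^T\Sigma^{-1}(\mu_1-\mu_0) \equiv w^T x + b,$$

so that $p(y=1 \mid x) = 1/(1 + e^{-(w^T x + b)})$.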

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV

def cv_optimize_logistic(X, y, n_folds=10):
    # grid-search the regularization strength C with n_folds-fold cross-validation
    clf = LogisticRegression()
    parameters = {"C": [0.01, 0.1, 1, 10, 100]}
    gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(X, y)
    return gs
In [6]:
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])


def points_plot2(X, Xtr, Xte, ytr, yte, clf, colorscale=cmap_light, cdiscrete=cmap_bold):
    # plot the classifier's decision regions, with training points (circles) and test points (squares)
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))

    plt.figure(figsize=(10,6))
    # predicted class on a grid, used to colour the background
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.pcolormesh(xx, yy, Z, cmap=colorscale, alpha=0.2)
    # training points
    plt.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr-1, cmap=cdiscrete, s=50, alpha=0.2, edgecolor="k")
    # and testing points
    plt.scatter(Xte[:, 0], Xte[:, 1], c=yte-1, cmap=cdiscrete, alpha=0.5, marker="s", s=35)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    return plt.gca()
In [7]:
from matplotlib.colors import ListedColormap
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
def points_plot_prob(X, Xtr, Xte, ytr, yte, clf, colorscale=cmap_light, cdiscrete=cmap_bold):
    # plot the classifier's predicted-probability surface, with training (circles) and test (squares) points
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))

    plt.figure(figsize=(10,6))
    plt.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr-1, cmap=cm_bright, s=10, alpha=0.2)
    # and testing points
    plt.scatter(Xte[:, 0], Xte[:, 1], c=yte-1, cmap=cm_bright, alpha=0.2, marker="s", s=10)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    # predicted P(y=1) on a grid, shown as filled and labelled contours
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=cm, alpha=.2)
    cs2 = plt.contour(xx, yy, Z, cmap=cm, alpha=.6)
    plt.clabel(cs2, fmt='%2.1f', colors='k', fontsize=14)
    return plt.gca()
In [8]:
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV

dftrain, dftest, resutrain, resutest = train_test_split(dfdata, dfresu, train_size=0.8)
dflogis = cv_optimize_logistic(dftrain, resutrain)
print dflogis.best_estimator_, dflogis.best_params_, dflogis.best_score_, dflogis.grid_scores_
points_plot_prob(dfdata, dftrain, dftest, resutrain, resutest, dflogis)
sns.kdeplot(df.Weight, df.Height, shade=False, alpha=0.4);
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001) {'C': 0.01} 0.91975 [mean: 0.91975, std: 0.00915, params: {'C': 0.01}, mean: 0.91975, std: 0.00915, params: {'C': 0.1}, mean: 0.91975, std: 0.00915, params: {'C': 1}, mean: 0.91975, std: 0.00915, params: {'C': 10}, mean: 0.91975, std: 0.00915, params: {'C': 100}]
In [9]:
df[df.Gender=='Male'].cov()
Out[9]:
Height Weight
Height 8.198843 48.879649
Weight 48.879649 391.294074
In [10]:
df[df.Gender!='Male'].cov()
Out[10]:
Height Weight
Height 7.269947 43.576404
Weight 43.576404 361.854281

Calibration

A classifier like logistic regression outputs probabilities. You must check its calibration: i.e., the notion that samples assigned probability $p$ turn out to be positive a fraction $p$ of the time in test sets.

from Chris Beaumont's diagram in cs109 hw 3 (2013)

In [11]:
def calibration_plot(clf, xtest, ytest):
    prob = clf.predict_proba(xtest)[:, 1]
    outcome = ytest
    data = pd.DataFrame(dict(prob=prob, outcome=outcome))

    #group outcomes into bins of similar probability
    bins = np.linspace(0, 1, 20)
    cuts = pd.cut(prob, bins)
    binwidth = bins[1] - bins[0]
    
    #empirical fraction of positives and number of examples in each bin
    cal = data.groupby(cuts).outcome.agg(['mean', 'count'])
    cal['pmid'] = (bins[:-1] + bins[1:]) / 2
    cal['sig'] = np.sqrt(cal.pmid * (1 - cal.pmid) / cal['count'])
        
    #the calibration plot
    ax = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    p = plt.errorbar(cal.pmid, cal['mean'], cal['sig'])
    plt.plot(cal.pmid, cal.pmid, linestyle='--', lw=1, color='k')
    plt.ylabel("Empirical Fraction")

    
    #the distribution of the classifier's predicted probabilities
    ax = plt.subplot2grid((3, 1), (2, 0), sharex=ax)
    #calsum = cal['count'].sum()
    plt.bar(left=cal.pmid - binwidth / 2, height=cal['count'],
            width=.95 * (bins[1] - bins[0]),
            fc=p[0].get_color())
    plt.xlabel("Classifier Probability")
In [12]:
calibration_plot(dflogis, dftrain, resutrain)
In [13]:
calibration_plot(dflogis, dftest, resutest)
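
For reference, newer versions of scikit-learn ship a helper that does essentially the same binning; a minimal sketch, assuming a scikit-learn recent enough (0.16+) to have sklearn.calibration.calibration_curve:

from sklearn.calibration import calibration_curve
# empirical fraction of positives vs mean predicted probability, in 20 bins
prob_true, prob_pred = calibration_curve(resutest, dflogis.predict_proba(dftest)[:, 1], n_bins=20)
plt.plot(prob_pred, prob_true, 'o-')
plt.plot([0, 1], [0, 1], 'k--');  # a perfectly calibrated classifier sits on the diagonal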

Classifier Evaluation

The confusion matrix is the basic starting point.

(figure: the confusion matrix)

In [16]:
from sklearn.metrics import confusion_matrix
confusion_matrix(dflogis.predict(dftest), resutest)
Out[16]:
array([[878,  71],
       [ 94, 957]])

(from Data Science for Business by Provost and Fawcett)

The "posterior expected loss" sets us up very nicely to evaluate classifiers based on the notion that out losses or gains should reflect our business scenario. (Note that we could have as well maximized utility or benefit instead of minimizing loss. Its the same formula.

(figure: expected value calculation)
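
A minimal sketch of that expected-value calculation, with a hypothetical cost-benefit matrix and the confusion-matrix counts from a test set (the benefit numbers here are made up; note that the matrix above was built as confusion_matrix(predictions, truth), so its rows index the predicted class):

import numpy as np

# benefit[pred, true]: e.g. acting on a predicted positive costs 1 if wrong, gains 10 if right (hypothetical)
benefit = np.array([[ 0.,  0.],
                    [-1., 10.]])

def expected_profit(conf, benefit):
    # conf[pred, true] are counts from a test set; convert to joint rates p(pred, true)
    rates = conf / float(conf.sum())
    return (rates * benefit).sum()

# e.g. expected_profit(confusion_matrix(dflogis.predict(dftest), resutest), benefit)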

ROC Curve

Let $p(y=1 \mid x) > t$ mean $g(x)=1$; $t$ is the threshold. As we move $t$ from 0 to 1 we get different values of the TPR and FPR, and we plot TPR against FPR.

$$TPR = Recall = \frac{TP}{ObsvP} = \frac{TP}{TP+FN}$$

$$FPR = \frac{FP}{ObsvN} = \frac{FP}{FP+TN}$$

(figure: how to construct an ROC curve)
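
Here is a minimal sketch of the sweep itself, computing TPR and FPR by hand at a few thresholds; it reuses the fitted dflogis and the held-out dftest/resutest from above, and should trace out the same curve that roc_curve returns below:

probs = dflogis.predict_proba(dftest)[:, 1]
actual = np.array(resutest)
for t in [0.1, 0.25, 0.5, 0.75, 0.9]:
    preds = (probs > t)*1
    TP = np.sum((preds == 1) & (actual == 1))
    FP = np.sum((preds == 1) & (actual == 0))
    FN = np.sum((preds == 0) & (actual == 1))
    TN = np.sum((preds == 0) & (actual == 0))
    print t, TP/float(TP + FN), FP/float(FP + TN)  # threshold, TPR, FPR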

In [17]:
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import precision_recall_curve
fprhw, tprhw, thresholdshw=roc_curve(resutest, dflogis.predict_proba(dftest)[:,1])
roc_auchw = auc(fprhw, tprhw)
In [18]:
print thresholdshw
[  9.99996090e-01   9.99995039e-01   9.99992241e-01 ...,   2.82408442e-05
   2.48731761e-05   1.91550691e-05]
In [19]:
plt.plot(fprhw, tprhw, 'o-', label='ROC curve (area = %0.2f)' % roc_auchw)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right");

Precision and Recall

Recall tells you what fraction of the observed positives were recalled (retrieved):

$$Recall = \frac{TP}{ObsvP} = \frac{TP}{TP+FN}.$$

Precision tells you what fraction of the returned hits (predicted positives) were truly positive:

$$Precision = \frac{TP}{PredP} = \frac{TP}{TP+FP}.$$
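
From the confusion matrix above (rows index predictions, since it was built as confusion_matrix(predictions, truth)): TP = 957, FP = 94, FN = 71, so at the default 0.5 threshold precision $\approx 957/1051 \approx 0.91$ and recall $\approx 957/1028 \approx 0.93$. A quick check of the same numbers with sklearn, reusing dflogis, dftest and resutest from above:

from sklearn.metrics import precision_score, recall_score
print precision_score(resutest, dflogis.predict(dftest))  # TP / (TP + FP)
print recall_score(resutest, dflogis.predict(dftest))     # TP / (TP + FN)
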
In [98]:
phw,rhw,thw = precision_recall_curve(resutest, dflogis.predict_proba(dftest)[:,1])
print phw, rhw, thw
plt.plot(rhw, phw);
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
[ 0.55778622  0.55754615  0.557849   ...,  1.          1.          1.        ] [  1.00000000e+00   9.99027237e-01   9.99027237e-01 ...,   1.94552529e-03
   9.72762646e-04   0.00000000e+00] [ 0.00172081  0.00174447  0.00177031 ...,  0.99999224  0.99999504
  0.99999609]
Out[98]:
(0.0, 1.05)

The ATM Camera example

In [21]:
#wget https://dl.dropboxusercontent.com/u/75194/imag.pix.npy
#wget https://dl.dropboxusercontent.com/u/75194/imag.lbl.npy
data=np.load("./imag.pix.npy")
y=np.load("./imag.lbl.npy")
STANDARD_SIZE = (322, 137)
data.shape, y.shape
Out[21]:
((87, 132342), (87,))
In [22]:
def get_image(mat):
    # unpack the flat, interleaved pixel vector back into a (137, 322, 3) uint8 image
    size = STANDARD_SIZE[0]*STANDARD_SIZE[1]*3
    r,g,b = mat[0:size:3], mat[1:size:3], mat[2:size:3]
    rgbArray = np.zeros((137,322,3), 'uint8')
    rgbArray[..., 0] = r.reshape((STANDARD_SIZE[1], STANDARD_SIZE[0]))
    rgbArray[..., 1] = b.reshape((STANDARD_SIZE[1], STANDARD_SIZE[0]))
    rgbArray[..., 2] = g.reshape((STANDARD_SIZE[1], STANDARD_SIZE[0]))
    return rgbArray

def display_image(mat):
    with sns.axes_style("white"):
        plt.imshow(get_image(mat))
        plt.xticks([])
        plt.yticks([])
In [23]:
display_image(data[5])
In [24]:
display_image(data[50])

The curse of dimensionality: Feature engineering

In [25]:
from sklearn.decomposition import PCA, RandomizedPCA
from sklearn.neighbors import KNeighborsClassifier
pca = PCA(n_components=50)
X = pca.fit_transform(data)
print pca.explained_variance_ratio_.sum()
0.905874558047
In [26]:
pca.explained_variance_ratio_
Out[26]:
array([ 0.35925967,  0.06293188,  0.04107783,  0.0311951 ,  0.0281696 ,
        0.02288316,  0.02101279,  0.0187405 ,  0.01732646,  0.01530238,
        0.01421597,  0.01318394,  0.01247015,  0.01163818,  0.01099583,
        0.01060732,  0.0100743 ,  0.00980239,  0.00960559,  0.00915359,
        0.00901856,  0.00852133,  0.00836746,  0.00796924,  0.00754898,
        0.00725056,  0.00708218,  0.00679677,  0.00660925,  0.00647721,
        0.00627392,  0.00594773,  0.00583031,  0.00574509,  0.00572782,
        0.00552617,  0.00538587,  0.00533432,  0.00516657,  0.00493432,
        0.00485946,  0.00477622,  0.00472842,  0.004562  ,  0.00444171,
        0.00440045,  0.00434604,  0.00428004,  0.00421957,  0.00410035])
In [27]:
df = pd.DataFrame({"pc1": X[:, 0], "pc2": X[:, 1], "y":y, "label":np.where(y==1, "check", "dollar")})
colors = ["red", "yellow"]
for label, color in zip(df['label'].unique(), colors):
    mask = df['label']==label
    plt.scatter(df[mask]['pc1'], df[mask]['pc2'], c=color, label=label)
plt.legend()
Out[27]:
<matplotlib.legend.Legend at 0x10baec050>
In [28]:
def normit(a):
    # rescale to 0-255 so the principal components can be displayed as uint8 images
    a=(a - a.min())/(a.max() -a.min())
    a=a*255
    return np.round(a)
def getNC(pc, j):
    size=322*137*3
    r=pc.components_[j][0:size:3]
    g=pc.components_[j][1:size:3]
    b=pc.components_[j][2:size:3]
    r=normit(r)
    g=normit(g)
    b=normit(b)
    return r,g,b
def display_component(pc, j):
    r,g,b = getNC(pc,j)
    rgbArray = np.zeros((137,322,3), 'uint8')
    rgbArray[..., 0] = r.reshape(137,322)
    rgbArray[..., 1] = g.reshape(137,322)
    rgbArray[..., 2] = b.reshape(137,322)
    plt.imshow(rgbArray)
    plt.xticks([])
    plt.yticks([])
In [29]:
display_component(pca,0)
In [30]:
display_component(pca,1)

Classifying in a reduced feature space with kNN

In [31]:
ys=df['y'].astype(int).values
subdf=df[['pc1','pc2']]
subdfstd=(subdf - subdf.mean())/subdf.std()
Xs=subdfstd.values
def classify(X,y, nbrs, plotit=True, train_size=0.6):
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=train_size)
    clf= KNeighborsClassifier(nbrs, warn_on_equidistant=False)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    Xall=np.concatenate((Xtrain, Xtest))
    if plotit:
        print "Accuracy on training data: %0.2f" % (training_accuracy)
        print "Accuracy on test data:     %0.2f" % (test_accuracy)
        points_plot2(Xall, Xtrain, Xtest, ytrain, ytest, clf)
    return nbrs, training_accuracy, test_accuracy
In [32]:
classify(Xs,ys,1)
Accuracy on training data: 1.00
Accuracy on test data:     0.86
Out[32]:
(1, 1.0, 0.8571428571428571)
In [33]:
classify(Xs,ys,40)#run this a few times
Accuracy on training data: 0.62
Accuracy on test data:     0.54
Out[33]:
(40, 0.61538461538461542, 0.54285714285714282)
In [34]:
classify(Xs,ys,17)
Accuracy on training data: 0.94
Accuracy on test data:     0.94
Out[34]:
(17, 0.94230769230769229, 0.94285714285714284)

Learning Curves (against complexity)

In [35]:
fits={}
# for each number of neighbours k, fit on 200 random train/test splits and record the accuracies
for k in np.arange(1,40,1):
    fits[k]=[]
    for i in range(200):
        fits[k].append(classify(Xs, ys,k, False))
nbrs=np.arange(1,40,1)
# mean and standard deviation of the training and test error (1 - accuracy) at each k
fmeanstr = np.array([1.-np.mean([t[1] for t in fits[e]]) for e in nbrs])
fmeanste = np.array([1.-np.mean([t[2] for t in fits[e]]) for e in nbrs])
fstdsstr = np.array([np.std([t[1] for t in fits[e]]) for e in nbrs])
fstdsste = np.array([np.std([t[2] for t in fits[e]]) for e in nbrs])
In [36]:
pal=sns.color_palette()
plt.gca().invert_xaxis()  # smaller k means a more complex model, so complexity increases to the right
plt.plot(nbrs, fmeanstr, color=pal[0]);  # mean training error

plt.plot(nbrs, fmeanste, color=pal[1]);  # mean test error
In [37]:
def cv_optimize_knn(X, y, n_folds=10):
    clf = KNeighborsClassifier()
    parameters = {"n_neighbors": range(1,40,1)}
    gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(X, y)
    return gs
In [38]:
Xtrain, Xtest, ytrain, ytest = train_test_split(Xs, ys, train_size=0.8)
bestcv = cv_optimize_knn(Xtrain, ytrain)
In [39]:
bestcv.best_estimator_, bestcv.best_params_, bestcv.best_score_
Out[39]:
(KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            metric_params=None, n_neighbors=7, p=2, weights='uniform'),
 {'n_neighbors': 7},
 0.95652173913043481)
In [40]:
points_plot2(Xs, Xtrain, Xtest, ytrain, ytest, bestcv)
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x115173a10>
In [41]:
points_plot_prob(Xs, Xtrain, Xtest, ytrain, ytest, bestcv)
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x11311bc10>