iris dataset in sklearn
Code source: Gaël Varoquaux
Modified for documentation by Jaques Grobler
License: BSD 3 clause
Additional code and annotations: Clifton Callender
See original code without additional code and annotations here
from sklearn import datasets
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # matplotlib basic 3D plotting
from sklearn.decomposition import PCA # Principal Component Analysis
import some data to play with
iris = datasets.load_iris()
iris dataset features
X = iris.data[:,:2] # we only take the first two features.
iris dataset labels
Y = iris.target
boundaries for the x- and y-axes in the 2D plot below
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
plt.figure(1, figsize=(8, 6))
Plot the training points
note the use of X[:, n] to get the nth column of the 2D array
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
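As a small aside on the X[:, n] indexing used above, here is a minimal sketch on a hypothetical 3x2 array. The values and the names A, first_column, and second_column are illustrative only and are not part of the original example.

```python
import numpy as np

# A small hypothetical 2D array: 3 rows (samples) by 2 columns (features).
A = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [4.7, 3.2]])

first_column = A[:, 0]   # every row, column 0
second_column = A[:, 1]  # every row, column 1

print(first_column)
print(second_column)
```

The colon selects all rows, so each result is a 1D array with one entry per sample, which is the form plt.scatter expects for its x and y arguments.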
To get a better understanding of how the dimensions interact,
plot the first three PCA dimensions
fig = plt.figure(2, figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d", elev=-150, azim=110)
n_components=3
requests only the first three principal components
pca = PCA(n_components=3).fit(iris.data)
reduce the feature data from four to three dimensions
X_reduced = pca.transform(iris.data)
Create and label the 3D scatterplot
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])
plt.show()
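The reduction from four to three dimensions can be cross-checked in a few lines. This is a rough sketch, assuming scikit-learn's default PCA settings (no whitening): fitting once and then calling transform should give the same projection as fit_transform, and both should match centering the data and projecting it onto the component vectors by hand. The names check_pca, X_a, X_b, and X_manual are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Fit once, then project with the fitted model.
check_pca = PCA(n_components=3).fit(iris.data)
X_a = check_pca.transform(iris.data)

# Fit and project in a single call.
X_b = PCA(n_components=3).fit_transform(iris.data)

# Manual projection: center with the fitted mean, then project onto
# the component vectors (rows of components_).
X_manual = (iris.data - check_pca.mean_) @ check_pca.components_.T

print(X_a.shape)
print(np.allclose(X_a, X_b))
print(np.allclose(X_a, X_manual))
```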
pca.components_
expresses the principal components in terms of the original
feature space
print("The vectors for the three principal components, given in terms of the "
      "original 4D feature space, are:\n\n", pca.components_, "\n")
pca.explained_variance_
is the variance explained by each of the
principal components
print("The variance explained by each of the principal components is:\n\n",
pca.explained_variance_, "\n")
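To make the two attributes above more concrete, the following sketch checks two properties that should hold for a PCA fitted with default settings: the rows of components_ are orthonormal vectors in the original 4D feature space, and explained_variance_ matches the sample variance (with n - 1 in the denominator) of the data projected onto each component. The names pca_demo, X_proj, and gram are assumptions made for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca_demo = PCA(n_components=3).fit(iris.data)
X_proj = pca_demo.transform(iris.data)

# One row per principal component, one column per original feature.
components = pca_demo.components_
print(components.shape)

# Orthonormal rows: the Gram matrix should be close to the 3x3 identity.
gram = components @ components.T
print(np.allclose(gram, np.eye(3)))

# Weights of the first component on each original feature.
for name, weight in zip(iris.feature_names, components[0]):
    print(f"{name}: {weight:+.3f}")

# Variance of the projected data along each component (ddof=1 gives
# the n - 1 denominator) compared with explained_variance_.
projected_variance = X_proj.var(axis=0, ddof=1)
print(np.allclose(projected_variance, pca_demo.explained_variance_))
```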
pca.explained_variance_ratio_
expresses the variance explained by each component as a fraction of the total variance
print("Variance explained expressed as a fraction of the total variance:\n\n",
      pca.explained_variance_ratio_)
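The ratio can also be reproduced by hand, which shows what the total refers to. In this sketch (names such as pca_ratio, total_variance, and manual_ratio are hypothetical), the total is taken to be the sum of the sample variances of the four original features, and the cumulative sum shows how much variance the first one, two, and three components retain together.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca_ratio = PCA(n_components=3).fit(iris.data)

# Total variance of the original data: sum of per-feature sample variances.
total_variance = iris.data.var(axis=0, ddof=1).sum()

# Dividing each component's variance by the total should reproduce the ratio.
manual_ratio = pca_ratio.explained_variance_ / total_variance
print(np.allclose(manual_ratio, pca_ratio.explained_variance_ratio_))

# Cumulative fraction of variance kept by the first k components.
cumulative = np.cumsum(pca_ratio.explained_variance_ratio_)
print(cumulative)

# Multiply by 100 to report the values as percentages.
print(100 * pca_ratio.explained_variance_ratio_)
```

Because only three of the four possible components are kept, the cumulative sum stays slightly below 1.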