iris dataset in sklearn

Code source: Gaël Varoquaux
Modified for documentation by Jaques Grobler
License: BSD 3 clause
Additional code and annotations: Clifton Callender
See original code without additional code and annotations here

from sklearn import datasets
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # matplotlib basic 3D plotting
from sklearn.decomposition import PCA # Principal Component Analysis

import some data to play with

iris = datasets.load_iris()

iris dataset features

X = iris.data[:, :2]  # we only take the first two features.

iris dataset labels

Y = iris.target
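
As a quick check on what load_iris returns, the sketch below inspects the feature matrix and the label vector. The names `features` and `labels` are chosen here for illustration only and are not used elsewhere in this notebook.

```python
from sklearn import datasets

iris = datasets.load_iris()

features = iris.data    # one row per flower, one column per measurement
labels = iris.target    # integer class label (0, 1, or 2) per flower

print(features.shape)        # (150, 4)
print(labels.shape)          # (150,)
print(iris.feature_names)    # names of the four measurements
print(iris.target_names)     # names of the three species
```

Slicing `features[:, :2]` keeps all 150 rows but only the first two measurement columns.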

boundaries for the x- and y-axes in the 2D plot below

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

plt.figure(1, figsize=(8, 6))

Plot the training points
Note the use of X[:, n] to select column n (zero-indexed) of the 2D array

plt.scatter(X[:, 0], X[:, 1], c=Y)

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
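
The column-slicing notation used in the scatter call above can be tried on a small array. This is a minimal sketch with a made-up 2x3 array `a`, not the iris data.

```python
import numpy as np

# A small made-up array: 2 rows, 3 columns
a = np.array([[1, 2, 3],
              [4, 5, 6]])

first_col = a[:, 0]     # every row, column 0
first_two = a[:, :2]    # every row, columns 0 and 1

print(first_col)        # [1 4]
print(first_two.shape)  # (2, 2)
```

The colon before the comma means "all rows"; the index or slice after the comma picks the columns.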

To get a better understanding of the interaction of the dimensions,
plot the first three PCA dimensions

fig = plt.figure(2, figsize=(8, 6))
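ax = fig.add_subplot(111, projection="3d", elev=-150, azim=110)  # recent matplotlib no longer attaches Axes3D(fig) to the figure automatically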

n_components=3 requests the first three principal components

pca = PCA(n_components=3).fit(iris.data)

reduce the feature data from four to three dimensions

X_reduced = PCA(n_components=3).fit_transform(iris.data)
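
The two PCA calls above are related: fit learns the components, transform projects data onto them, and fit_transform does both in one step. A minimal sketch, assuming the full four-feature matrix `iris.data` is the input; the names `projected` and `projected_direct` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Two steps: learn the components, then project the data onto them
pca = PCA(n_components=3).fit(iris.data)
projected = pca.transform(iris.data)

# One step: fit_transform learns and projects at once
projected_direct = PCA(n_components=3).fit_transform(iris.data)

print(projected.shape)    # (150, 3)

# The two results should agree up to floating-point error
same = np.allclose(projected, projected_direct)
print(same)
```

Keeping the fitted `pca` object around is useful because its attributes (components_, explained_variance_, and so on) are only available after fitting.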

Create and label the 3D scatterplot

ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.set_ylabel("2nd eigenvector")
ax.set_zlabel("3rd eigenvector")

pca.components_ expresses the principal components in terms of the original
feature space

print("The vectors for three principal components, given in terms of the "
      "original 4D feature space, are:\n\n", pca.components_, "\n")
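
To make the meaning of pca.components_ more concrete, the sketch below checks its shape, confirms that its rows are orthonormal, and reproduces the projection by hand. It assumes the PCA was fit on the full `iris.data` with default settings (no whitening); `gram` and `manual` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca = PCA(n_components=3).fit(iris.data)

components = pca.components_
print(components.shape)    # (3, 4): one row per component, one column per original feature

# Rows are unit-length and mutually orthogonal,
# so this product is close to the 3x3 identity matrix.
gram = components @ components.T
print(np.round(gram, 6))

# Projecting by hand: center the data, then take dot products with each component.
manual = (iris.data - pca.mean_) @ components.T
matches = np.allclose(manual, pca.transform(iris.data))
print(matches)
```

Each row of components_ is a direction in the original 4D feature space, and the projected coordinates are the dot products of the centered data with those directions.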

pca.explained_variance_ is the variance explained by each of the
principal components

print("The variance explained by each of the principal components is:\n\n",
      pca.explained_variance_, "\n")
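
One way to see what pca.explained_variance_ measures is to compare it with the sample variance of each column of the projected data. A minimal sketch, assuming the PCA is fit on the full `iris.data`; `column_variances` is an illustrative name.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(iris.data)

# Sample variance (ddof=1) of each projected column
column_variances = X_reduced.var(axis=0, ddof=1)

print(column_variances)
print(pca.explained_variance_)

# The two should agree up to floating-point error
agree = np.allclose(column_variances, pca.explained_variance_)
print(agree)
```

The values come out in decreasing order, since the components are sorted by how much variance each one captures.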

pca.explained_variance_ratio_ expresses the variance explained by each component as a fraction of the total variance

print("Variance explained expressed as a percentage:\n\n",