Random Forests At Scale

Twitter: @JulioBarros

A random forest creates multiple decision trees and combines their individual predictions into one overall prediction.

Each tree can be trained independently, which makes training parallelizable.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

digits = load_digits()

clf = RandomForestClassifier(n_estimators=20, max_depth=10, n_jobs=-1)
clf.fit(digits.data, digits.target)
# Note: this scores on the training data, which overestimates accuracy.
clf.score(digits.data, digits.target)

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

digits = load_digits()
# Hold out 10% of the data for a more honest evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.1, random_state=0)

clf = RandomForestClassifier(n_estimators=20, max_depth=10, n_jobs=-1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

MNIST original

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_mldata
from sklearn.grid_search import GridSearchCV

mnist = fetch_mldata('MNIST original')
X_train, X_test, y_train, y_test = train_test_split(
    mnist.data, mnist.target, test_size=0.1, random_state=0)

model = RandomForestClassifier()
parameters = [{"n_estimators": [25, 50, 100], "max_depth": [10, 20, 40]}]

clf = GridSearchCV(model, parameters, verbose=5, n_jobs=8)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

print "Best Params: " + str(clf.best_params_)
print "Best Score: " + str(clf.best_score_)
print "Best Estimator: " + str(clf.best_estimator_)

cubeit = lambda x: x**3
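For example, applied inside a Spark transformation (a minimal sketch, assuming an active SparkContext sc):

print sc.parallelize([1, 2, 3, 4]).map(cubeit).collect()  # [1, 8, 27, 64]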

Key/value pair RDDs support transformations such as reduceByKey, groupByKey, combineByKey, sortByKey, mapValues, flatMapValues, …
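A quick sketch of a few of these (again assuming an active SparkContext sc; output ordering may vary):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

print pairs.reduceByKey(lambda a, b: a + b).collect()  # e.g. [('a', 4), ('b', 2)]
print pairs.groupByKey().mapValues(list).collect()     # e.g. [('a', [1, 3]), ('b', [2])]
print pairs.mapValues(lambda v: v * 10).collect()      # e.g. [('a', 10), ('b', 20), ('a', 30)]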

MLlib: a set of machine learning algorithms that run over RDDs.

Pipelines can improve scheduling/performance and can be reused on different data sets.
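Training a random forest with MLlib's RDD-based API might look like this (a minimal sketch; the inline data and parameters are illustrative):

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.regression import LabeledPoint

# Illustrative: an RDD of LabeledPoint(label, features), normally loaded from storage.
data = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                       LabeledPoint(1.0, [1.0, 0.0])])

model = RandomForest.trainClassifier(data, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=20, maxDepth=10)

# Predict on an RDD of feature vectors.
predictions = model.predict(data.map(lambda p: p.features))
print predictions.collect()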

There are also mentions of (grid search) parameter tuning, as sketched below.
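One way to grid search at scale (a hypothetical sketch, not MLlib's own tuning API) is to fan the parameter combinations out across the cluster and train a scikit-learn forest for each. It assumes X_train/X_test/y_train/y_test from earlier and that they fit in each worker's memory:

from sklearn.ensemble import RandomForestClassifier

# Hypothetical helper: train and score one parameter combination on a worker.
def train_and_score(params):
    clf = RandomForestClassifier(n_jobs=-1, **params)
    clf.fit(X_train, y_train)  # data is shipped to workers via the closure
    return (params, clf.score(X_test, y_test))

grid = [{"n_estimators": n, "max_depth": d}
        for n in [25, 50, 100] for d in [10, 20, 40]]

results = sc.parallelize(grid).map(train_and_score).collect()
print max(results, key=lambda r: r[1])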

During development you can use a local standalone “cluster”. In production, use YARN or Mesos.
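For example (a minimal sketch; the exact master URLs depend on your cluster setup):

from pyspark import SparkConf, SparkContext

# Development: a local standalone "cluster" using all cores on this machine.
conf = SparkConf().setAppName("RandomForestsAtScale").setMaster("local[*]")

# Production (illustrative):
#   conf.setMaster("yarn-client")        # Spark on YARN (Spark 1.x syntax)
#   conf.setMaster("mesos://host:5050")  # Spark on Mesos

sc = SparkContext(conf=conf)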

Using Python version 2.7.8 (default, Aug 21 2014 15:21:46)
SparkContext available as sc, SQLContext available as sqlCtx.
>>>

from pyspark import SparkContext

# In a standalone script (outside the pyspark shell) create the context yourself.
sc = SparkContext(appName="RandomForestsAtScale")

lines = sc.textFile("README.md")
words = lines.flatMap(lambda line: line.split())
t = words.map(lambda word: (word, 1))       # pair each word with a count of 1
counts = t.reduceByKey(lambda a, c: a + c)  # sum the counts per word
print counts.sortBy(lambda t: t[1], ascending=False).take(10)

sc.stop()

Perfect for development and testing before submitting to a production cluster.

* Assuming your data is always perfect in the real world. :)
