Random Forests At Scale
Twitter: @JulioBarros
Creates multiple decision trees and combines individual predictions into one overall prediction.
Each tree can be trained independently - parallelizable.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

digits = load_digits()

clf = RandomForestClassifier(n_estimators=20, max_depth=10, n_jobs=-1)
clf.fit(digits.data, digits.target)
clf.score(digits.data, digits.target)
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

digits = load_digits()
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(
    digits.data, digits.target, test_size=0.1, random_state=0)

clf = RandomForestClassifier(n_estimators=20, max_depth=10, n_jobs=-1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
MNIST original
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_mldata
from sklearn.grid_search import GridSearchCV
mnist = fetch_mldata('MNIST original')
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(
    mnist.data, mnist.target, test_size=0.1, random_state=0)

model = RandomForestClassifier()
parameters = [{"n_estimators": [25, 50, 100], "max_depth": [10, 20, 40]}]

clf = GridSearchCV(model, parameters, verbose=5, n_jobs=8)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
print "Best Params: " + str(clf.best_params_)
print "Best Score: " + str(clf.best_score_)
print "Best Estimator: " + str(clf.best_estimator_)
cubeit = lambda x: x ** 3
reduceByKey, groupByKey, combineByKey, sortByKey,
mapValues, flatMapValues, …
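A minimal sketch of how these look in practice, assuming an existing SparkContext sc and the cubeit lambda above (the sample numbers and keys are made up, and the ordering of reduceByKey results may vary):

nums = sc.parallelize([1, 2, 3, 4])
print nums.map(cubeit).collect()                       # [1, 8, 27, 64]

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print pairs.reduceByKey(lambda a, c: a + c).collect()  # sums per key, e.g. [('a', 4), ('b', 2)]
print pairs.mapValues(lambda v: v * 10).collect()      # transforms only the values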
Set of algorithms that run over the RDDs.
Can improve scheduling/performance and can be reused on different data sets.
Mentions of (grid search) parameter tuning.
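As a minimal sketch, training a random forest with Spark MLlib's RDD-based API looks roughly like this, assuming an existing SparkContext sc; the tiny inline dataset is made up, and a real job would parse a large dataset into LabeledPoint records:

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.regression import LabeledPoint

# Made-up toy data; each LabeledPoint holds a label and a feature vector.
data = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                       LabeledPoint(0.0, [0.5, 1.5]),
                       LabeledPoint(1.0, [1.0, 0.0]),
                       LabeledPoint(1.0, [1.5, 0.5])])

model = RandomForest.trainClassifier(data, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=20, maxDepth=10, seed=0)

# Predict on the feature vectors; compare against the original labels.
predictions = model.predict(data.map(lambda p: p.features))
print predictions.collect()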
During development you can use a local standalone “cluster”; in production, use YARN or Mesos.
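For example, a minimal sketch of pointing a SparkContext at a local master for development (in production the master URL for YARN or Mesos is usually supplied via spark-submit rather than hard-coded):

from pyspark import SparkConf, SparkContext

# "local[*]" runs Spark locally with one worker thread per core.
conf = SparkConf().setAppName("RandomForestsAtScale").setMaster("local[*]")
sc = SparkContext(conf=conf)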
Using Python version 2.7.8 (default, Aug 21 2014 15:21:46)
SparkContext available as sc, SQLContext available as sqlCtx.
>>>
sc = SparkContext(appName="RandomForestsAtScale")
lines = sc.textFile("README.md")
words = lines.flatMap(lambda line: line.split())
t = words.map(lambda word: (word, 1))
counts = t.reduceByKey(lambda a, c: a + c)
print counts.sortBy(lambda t: t[1], ascending=False).take(10)
sc.stop()
Perfect for development and testing before submitting to a production cluster.
* Assuming your data is always perfect in the real world. :)
Twitter: @JulioBarros