Tuning Hyper Parameters for Text Classification

In previous articles we worked through basic approaches for text classification by presenting a simplified version of a problem posed by a client and examining the performance of several algorithms. In this article we improve (slightly) the performance of one of the algorithms using grid and random hyper parameter searches.

What is a hyper parameter?

You may be wondering what exactly a hyper parameter is. If you are a programmer, a parameter may mean something you pass into a function, but in machine learning a parameter is something your model needs to learn (or optimize). For example, in the classic equation for a line, y = mx + b (as it is written), x is the independent variable and y is the dependent variable. The slope m and the intercept b are the parameters that control the actual line. In a (simplified) machine learning problem you have a list of xs (such as square footages of houses) and a list of ys (the costs of those houses) and you try to find/learn the m and b that minimize the error over the data. These are the parameters of the linear model.
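To make that concrete, here is a minimal sketch of learning m and b by least squares. The housing numbers are made up for illustration; numpy's polyfit with degree 1 fits exactly the y = mx + b model described above.

```python
import numpy as np

# Hypothetical data: square footage (x) and price (y) for a handful of houses.
x = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([200000.0, 290000.0, 405000.0, 495000.0, 610000.0])

# Learn the parameters m (slope) and b (intercept):
# polyfit with deg=1 minimizes the squared error of y = m*x + b over the data.
m, b = np.polyfit(x, y, deg=1)

print(f"learned slope m = {m:.2f}, intercept b = {b:.2f}")
```

Nothing here was set by hand except the data itself; m and b come entirely out of the fitting process, which is what makes them parameters rather than hyper parameters.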

A hyper parameter is something that you set or choose to configure the model and the machine learning process. For this example a hyper parameter might be the learning rate or tolerance, etc. Every machine learning algorithm has hyper parameters, some more than others. For example, for a random forest classifier you can set the number of estimators, the criterion to use for a split, the number of features to consider at a node, the maximum depth of the tree, and many more. A neural net would have hyper parameters such as number of layers, number of nodes per layer, learning rate, activation function, dropout rate, etc. Additionally, some would argue that the choice of algorithm is itself a hyper parameter, and some systems try to evaluate different algorithms against each other on the same dataset.
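In scikit-learn these choices are the keyword arguments you hand to the estimator's constructor. The particular values below are just illustrative, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier

# Hyper parameters are fixed up front when the model is constructed;
# the parameters (the trees' actual split thresholds) are learned later in fit().
clf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    criterion="entropy",   # splitting criterion ("gini" is the default)
    max_features="sqrt",   # number of features to consider at each split
    max_depth=10,          # maximum depth of each tree
)

print(clf.get_params()["n_estimators"])
```

The get_params() method is also what the search utilities below use under the hood to read and rewrite these settings between trials.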

Hyper parameter search

Choosing good hyper parameters can have a large effect on the performance of the algorithm for your particular case, so it is important to explore their settings. Luckily, the default parameters in scikit-learn are usually a great starting point, and there are automated ways to explore other settings. Two of the simplest are grid search and random search.

In both options you specify the hyper parameters you want to explore and their allowable values and pass them to a function that does the search. For example, this code specifies that we want to search over the loss, penalty, and alpha hyper parameters. The values to explore are given in lists for loss and penalty and in a numpy array for alpha.


import numpy as np

params = {'loss' : ['hinge', 'modified_huber'],
          'penalty' : ['none', 'l2', 'l1'],
          'alpha' : np.linspace(0.00005, 0.002, num=10)
         }

You can think of grid search as a set of nested loops that grind through all of the possible combinations of the specified values. Grid search is interesting in that it feels very comprehensive, but it has the drawback that the number of combinations quickly gets out of hand (2 * 3 * 10 = 60 settings for this example), and the parameter specification may not include an optimal setting. For example, the linspace function gives us the approximate values array([5.0e-05, 2.6e-04, 4.8e-04, 7.0e-04, 9.1e-04, 1.1e-03, 1.3e-03, 1.5e-03, 1.7e-03, 2.0e-03]), but what if the optimal value is closer to 3.3e-04, in between two of the values we specified?
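One way to see the combinatorial blow-up directly is with scikit-learn's ParameterGrid, which enumerates every combination of the specified values (the same candidate set the grid search will walk through):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

params = {'loss': ['hinge', 'modified_huber'],
          'penalty': ['none', 'l2', 'l1'],
          'alpha': np.linspace(0.00005, 0.002, num=10)}

# Every combination of the specified values: 2 losses * 3 penalties * 10 alphas.
grid = ParameterGrid(params)
print(len(grid))  # 60
```

Adding one more hyper parameter with, say, five candidate values would multiply this to 300 fits per cross-validation fold, which is why grid specifications get expensive quickly.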

With random search you specify the parameters to explore, but it is much more natural to give ranges (or distributions) for continuous parameters rather than specific grid points, and the algorithm is free to choose any valid value. Though it may feel a bit more haphazard, random search can find a better result because we end up exploring more values for alpha than we specifically listed in the grid search. That is, we are likely to try 60 different values for alpha instead of 10. In the example below we specify that alpha should be drawn from a uniform distribution rather than listing the specific values to use.

import scipy.stats

params = {'loss' : ['hinge', 'modified_huber'],
          'penalty' : ['none', 'l2', 'l1'],
          'alpha' : scipy.stats.uniform(0.00005, 0.002)
         }

Then running the search is straightforward. The grid search looks like this:

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(SGDClassifier(max_iter=max_iter), params,
                           cv=nfolds, n_jobs=n_jobs, verbose=verbose)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)
print(grid_search.score(X_test,y_test))

yielding

Fitting 5 folds for each of 60 candidates, totalling 300 fits
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    6.3s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:   36.0s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:   60.0s finished
{'alpha': 0.0002666666666666667, 'loss': 'modified_huber', 'penalty': 'l2'}
0.7818092841163311
0.7899888143176734
CPU times: user 10.4 s, sys: 321 ms, total: 10.7 s
Wall time: 1min

and the random search like this:

from sklearn.model_selection import RandomizedSearchCV

rand_search = RandomizedSearchCV(SGDClassifier(max_iter=max_iter),
                                 param_distributions=params, n_iter=n_iter,
                                 cv=nfolds, n_jobs=n_jobs, verbose=verbose)
rand_search.fit(X_train, y_train)
rand_search.fit(X_train, y_train)

print(rand_search.best_params_)
print(rand_search.best_score_)
print(rand_search.score(X_test,y_test))

with the output

Fitting 5 folds for each of 60 candidates, totalling 300 fits
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    6.5s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:   34.0s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:   55.5s finished
{'alpha': 0.00024772331598254247, 'loss': 'modified_huber', 'penalty': 'l2'}
0.7811101789709173
0.7905480984340044
CPU times: user 10.5 s, sys: 179 ms, total: 10.7 s
Wall time: 56.2 s

Note that we're limiting the number of random trials to 60, the same as the grid search, for a fairer comparison. We're also doing 5-fold cross validation, so we could examine the variance as well, but we'll keep it simple and report the best score.
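If you do want to look at that variance, the fitted search object keeps per-fold statistics in cv_results_. A self-contained sketch on toy data (standing in for the article's text features) shows where the spread across folds lives:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real text-classification features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

search = GridSearchCV(SGDClassifier(max_iter=1000, random_state=0),
                      {'alpha': [1e-4, 1e-3]}, cv=5)
search.fit(X, y)

# std_test_score is the spread of the score across the 5 folds
# for each candidate setting -- a rough read on result stability.
for mean, std in zip(search.cv_results_['mean_test_score'],
                     search.cv_results_['std_test_score']):
    print(f"{mean:.3f} +/- {std:.3f}")
```

Two candidates whose mean scores differ by less than their fold-to-fold spread are probably not meaningfully different, which is worth checking before declaring a winner.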

In this particular example the randomized search did seem to find a better set of values than the grid search, at 79.1% test accuracy vs 79.0% for grid and 78.4% for our original SGDClassifier. Admittedly this is not a huge improvement, but sometimes every little bit helps, and the improvement may be more substantial in other situations.

Also, keep in mind that other runs will see different results (it is randomized, after all), so it is important to use cross validation and really study the variance of the results. The big win usually comes when the search space is so large that grid search is expensive or not feasible.

Conclusion

I hope you can see that exploring hyper parameter tuning can be worthwhile and that there are automated ways to do it easily. In a future article we'll talk about doing it in a smarter, automated way with Bayesian and evolutionary search strategies.

Related

Text Classification with IBM Watson

Text Classification with Scikit-learn
