Advice for Picking Model for Simple Classification Task

Comparing and contrasting logistic regression, random forests and neural networks, and how to use them in R

Classification is one of the oldest and most classical problems in statistical analysis, and there are many existing machine learning models to choose from. In this blog, we provide some insight into choosing among three types of model, namely logistic regression, random forest and neural network, on a simple multi-class classification task.

The dataset can be downloaded with the public Kaggle API using the following command line code (invoked from the OS):
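The original command is not preserved in this copy. As a minimal sketch, assuming a hypothetical dataset slug some-user/customer-segmentation and a configured Kaggle API token, the call can be wrapped in R like this:

```r
# Download and unzip the dataset via the Kaggle CLI.
# The dataset slug and archive name below are placeholders,
# not the ones from the original post.
system("kaggle datasets download -d some-user/customer-segmentation")
unzip("customer-segmentation.zip", exdir = "data")
```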

Then, we can use the following code to read the data, remove rows containing missing values and convert the categorical columns into factors:
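The original snippet is not preserved here; below is a sketch, assuming the CSV is named Train.csv and using the column names discussed later in this post:

```r
# Read the raw CSV, drop rows with missing values, and convert the
# categorical columns into factors. The file name and column names
# are assumptions based on the variables discussed in this post.
data <- read.csv("data/Train.csv", stringsAsFactors = FALSE)
data <- na.omit(data)  # remove rows containing missing data
data$Gender <- as.factor(data$Gender)
data$Segmentation <- as.factor(data$Segmentation)  # response: A/B/C/D
```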

Before fitting any model, we first split the data into train and test subsets, with fractions of 0.9 and 0.1 respectively:
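A sketch of one common way to do the split:

```r
# Randomly assign 90% of the rows to the training set and the rest
# to the test set. The seed value is arbitrary, for reproducibility.
set.seed(42)
trainIdx <- sample(nrow(data), size = floor(0.9 * nrow(data)))
train <- data[trainIdx, ]
test <- data[-trainIdx, ]
```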

Since our response variable in this problem is categorical, we are going to use multinomial logistic regression. Multinomial logistic regression estimates the probability of a given instance belonging to each category based on the values of its explanatory variables.
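The fitting code is not shown in this copy; a minimal sketch using multinom() from the nnet package, assuming the response column is Segmentation:

```r
# Fit a multinomial logistic regression; the first level of the
# response factor (segmentation A) is used as the baseline category.
library(nnet)
logistic <- multinom(Segmentation ~ ., data = train)
```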

Below is the output from summary(logistic):

The coefficients can also be visualized:
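The original plotting code is not preserved; one simple stand-in is a dot chart of the coefficient matrix:

```r
# Sketch: plot each category's coefficients on the log-odds scale.
# coef(logistic) has one row per non-baseline category (B, C, D).
coefs <- coef(logistic)
dotchart(t(coefs), xlab = "Coefficient (log odds)")
```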

Each coefficient can be interpreted as the rate of increase of the log odds, where the log odds is the log of the ratio of the probability of the corresponding category to the baseline probability (of being A, in this case). To make this clearer, we can exponentiate the coefficients by executing exp(coef(summary(logistic))) and get:

As an example, 0.8197 is in the second column of the first row. This means that, conditioning on all other variables, the odds of being classified as segmentation B rather than A change by a factor of 0.8197 if an individual is male instead of female. Since 0.8197 is less than 1, being male instead of female makes an individual less likely to be classified as segmentation B compared to segmentation A.

Then we can run prediction on the test dataset:
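A sketch of the prediction call; predict() on a multinom fit returns a matrix of class probabilities when type = "probs":

```r
# Predict class probabilities for the held-out test set.
logisticPred <- predict(logistic, newdata = test, type = "probs")
head(logisticPred)
```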

And the value of logisticPred will look like:

In the prediction, the first instance has probability 0.1327 of being A, 0.3539 of being B, 0.4959 of being C and 0.0175 of being D.

Finally, let’s check the confusion matrix and the accuracy of the prediction:
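A sketch of one way to compute both, taking the hard label for each instance as its highest-probability class:

```r
# Convert probabilities into hard class labels, then tabulate them
# against the true segmentations in the test set.
logisticClass <- predict(logistic, newdata = test, type = "class")
confusion <- table(predicted = logisticClass, actual = test$Segmentation)
confusion
sum(diag(confusion)) / sum(confusion)  # overall test accuracy
```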

And we will get the output:

The final test accuracy for multinomial logistic regression is 0.4729.

Random forest is a kind of classification model composed of decision trees. As an example of a simple decision tree: given some pictures of either cats or humans, we decide what is in a given picture with a policy: all pictures containing tails are pictures of cats, and the rest are pictures of humans. This may not be a perfect model (since not all cat pictures have tails in them), so we may want to add more policies: whether there are pointy ears, whether there is fur, and so on. Each of these policies is a decision tree, and we get a random forest by putting them together: all pictures containing tails, pointy ears or fur are pictures of cats, and the rest are pictures of humans. The final decision is made by majority vote.

To fit a random forest, we are going to use the package party:
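The fitting call is not preserved in this copy; a sketch using party's cforest(), where the number of trees is an assumption (the post does not state the value used):

```r
# Fit a random forest of conditional inference trees.
library(party)
rf <- cforest(Segmentation ~ ., data = train,
              controls = cforest_unbiased(ntree = 500))
```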

By running the below code, we will get the following plot:
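The original plotting code and figure are not preserved here. Note that a cforest ensemble has no standard plot method in party; as a rough stand-in, plotting a single conditional inference tree fitted with ctree() shows the same kind of structure:

```r
# Sketch: visualize one conditional inference tree on the training
# data. This stands in for the post's (much denser) forest plot.
singleTree <- ctree(Segmentation ~ ., data = train)
plot(singleTree)
```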

In the plot, if we zoom in, we can clearly see the structure of the random forest: how each decision is made and how the decision trees are composed. In this case, there are so many decision trees that the graph becomes too dense, so we can also run print(rf) to get the structure in text:

Then, let’s run prediction on the test set with this model:
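A sketch of the prediction call for the fitted forest:

```r
# Predict a segmentation label for each test instance.
rfPred <- predict(rf, newdata = test, type = "response")
```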

And calculate the confusion matrix and test accuracy:
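Computed the same way as for the logistic model:

```r
# Confusion matrix and overall accuracy for the random forest.
rfConfusion <- table(predicted = rfPred, actual = test$Segmentation)
rfConfusion
sum(diag(rfConfusion)) / sum(rfConfusion)  # overall test accuracy
```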

Which produces the output:

The final test accuracy for random forest is 0.4910.

Multilayer feed-forward neural networks are universal approximators, which means that they can approximate any function arbitrarily well. They are composed of layers of interconnected nodes whose values are computed from the inputs and weights. The weights are tuned during training so that the network learns important patterns from the input data. In this demonstration, we are going to use a two-layer feed-forward neural network.

We are going to use PyTorch to fit the network. You may want to make sure Python 3 is installed on your computer.

Since there are lots of existing posts about how to write a neural network, I will not go over the steps in detail. Below is the script that was used to produce the results:
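The original script is not preserved in this copy. As a minimal stand-in (not the author's script: the placeholder data, hidden width and learning rate are all assumptions), a comparable two-layer feed-forward classifier in PyTorch looks like this:

```python
# Minimal sketch of a two-layer feed-forward classifier in PyTorch.
# NOT the author's original script: the placeholder data, hidden
# width and learning rate are assumptions for illustration.
import torch
import torch.nn as nn

# Placeholder data: 1000 instances, 10 numeric features, 4 classes
# (segmentations A-D). In practice these come from the Kaggle CSV.
X_train = torch.randn(1000, 10)
y_train = torch.randint(0, 4, (1000,))

model = nn.Sequential(      # two layers of weights
    nn.Linear(10, 64),      # input -> hidden
    nn.ReLU(),
    nn.Linear(64, 4),       # hidden -> class logits
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):    # 200 epochs, as in the post
    optimizer.zero_grad()
    loss = criterion(model(X_train), y_train)
    loss.backward()
    optimizer.step()

# Hard predictions: the index of the largest logit per instance.
pred = model(X_train).argmax(dim=1)
print("train accuracy:", (pred == y_train).float().mean().item())
```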

By using the above script to train for 200 epochs, we get the below training curve, which plots training/validation accuracy vs. the number of epochs.

The final test accuracy is 0.5045.

As you may already know, the parameters of neural networks are not interpretable, which means that we are training a black box: no one can tell how the model produced its outputs.

Regarding model choice, I would like to compare and contrast the three models (logistic regression, random forest and feed-forward neural network) from three aspects: the final test accuracy, the ease of fitting and the interpretability.

The three models produced similar final test accuracies, among which the neural network produced the highest and logistic regression the lowest. However, model performance in terms of test accuracy depends heavily on the data the model is trained on. The result of the neural network also depends on the randomness of parameter initialization and on hyperparameter choices (the learning rate, the network structure, the number of epochs, regularization, and so on). The best thing to do is to fit and tune all the candidate models and pick the one that produces the highest test accuracy, but this may not be feasible in all cases. Therefore, we can also choose the model based on the properties of the dataset: Is the response multinomial? Could the data be easily classified by a set of decision-tree policies? Does the relationship between response and explanatory variables have a simple, fixed shape? etc.

When you are training a model with limited computational power, you may want to take the ease of fitting into consideration when choosing the model. This is especially important when the dataset is large. If you do not have access to GPU resources, the order of ease of fitting would be logistic regression > random forest > neural network. Training a neural network can be very costly without a GPU.

Finally, interpretability is also an important thing to consider, especially when you want to use the model to learn human-readable patterns from the dataset (e.g. the correlation between the response and the explanatory variables). As mentioned above, neural networks are poorly interpretable. Logistic regression coefficients and random forest policies are both interpretable, in different ways. You may want to choose the model based on the way it can be interpreted.

You can find all the demo scripts in the repository below:
