### Logistic regression in R and evaluating the fit.
The glm function follows the same formula style as lm with just a few added arguments to indicate that one has 0,1 data and a logistic transformation.
For the spam data and using capLong the call is

    fit <- glm(spam ~ capLong, data = dat, family = binomial)

Similar to lm, one can use the summary function to get information about the fit and predict to make predictions. There is also a concept of a residual, but it is not as easy to use as the residuals from lm. Note that the predicted values can take two forms: they can be the probabilities implied by the logistic function and the linear model (the "response" option for type), or just the linear part predicted by the covariates without the logistic transformation (the "link" option).
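
The relationship between the two prediction types can be checked directly. The sketch below uses simulated data rather than the spam data; the names `simDat`, `simFit`, `predLink`, and `predProb` are made up for illustration. It compares the response-scale predictions with the result of applying the logistic function `plogis` to the link-scale predictions.

```r
# Simulated 0/1 data for illustration only (not the spam data)
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))
simDat <- data.frame(x = x, y = y)

simFit <- glm(y ~ x, data = simDat, family = binomial)
summary(simFit)   # coefficient table, deviance, etc.

# Linear predictor (log odds) and fitted probabilities
predLink <- predict(simFit, type = "link")
predProb <- predict(simFit, type = "response")

# The logistic function maps the link scale onto probabilities,
# so these two vectors should agree up to numerical tolerance
all.equal(unname(plogis(predLink)), unname(predProb))
```
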
In some applications, such as those from medicine or public health, the estimated probabilities are themselves important, e.g. what is the probability of a drug being effective? However, for the spam case and for classification problems in general, the probability is an intermediate step in building the classifier:
1) Choose a threshold B
2) If the predicted probability is greater than B then classify as a "1" (spam email in the spam example). If the predicted probability is less than B classify as a "0" or good email.
Note that we have the choice of B being any value in the range from 0 to 1.
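
The two steps above can be sketched as a small function. The helper name `classifyAtThreshold` and the probabilities are hypothetical, and assigning a probability exactly equal to B to class "1" is a choice made for this sketch.

```r
# Hypothetical helper: turn predicted probabilities into 0/1 classes
classifyAtThreshold <- function(prob, B = 0.5) {
  stopifnot(B >= 0, B <= 1)
  # probabilities at or above B are labeled 1; those below B are labeled 0
  ifelse(prob >= B, 1, 0)
}

probs <- c(0.05, 0.40, 0.55, 0.90)   # made-up predicted probabilities
classifyAtThreshold(probs, B = 0.5)  # 0 0 1 1
classifyAtThreshold(probs, B = 0.3)  # 0 1 1 1
```

Lowering B labels more emails as spam, which is why the choice of B matters for the error rates of the classifier.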
It is standard to fit the glm to part of your data set and then test out the classifier on the remaining part. This helps to avoid overfitting and biasing the results by using the data twice.
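
One possible way to set up such a split is with a logical indicator for the training rows. This is only a sketch on a made-up data frame: the names `toyDat` and `trainFlag` and the 70% training fraction are assumptions for illustration, not how the split for the spam data was made.

```r
# Made-up data frame standing in for a real data set
set.seed(42)
n <- 10
toyDat <- data.frame(id = 1:n, y = rbinom(n, 1, 0.5))

# Randomly mark 70% of the rows as training data (70% is an arbitrary choice)
nTrain <- round(0.7 * n)
trainFlag <- rep(FALSE, n)
trainFlag[sample(n, nTrain)] <- TRUE

trainPart <- toyDat[trainFlag, ]    # rows used to fit the model
testPart  <- toyDat[!trainFlag, ]   # rows held out for evaluation
```
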
It is helpful to see how this would be done in R. For the spam data set, train is a logical vector that separates the full dataset into training and testing parts. Also, it may be better to use log10 of capLong instead of its raw values.

    fitGLM <- glm(spam ~ I(log10(capLong)), data = dat, family = binomial, subset = train)
    fit.test <- predict(fitGLM, newdata = dat[!train, ], type = "response")
    data.test <- (dat[!train, ])$spam
    # use .5 as the B value
    class.test <- ifelse(fit.test < .5, 0, 1)
    confusionTable <- table(class.test, data.test)

The confusion table shows how well we are doing both in terms of classifying correctly and the misclassifications.

              data.test
    class.test   0   1
             0 119  42
             1  14  71

We focus on two features that can be extracted from this table:
True positive rate (TPR): The fraction of spam emails that are classified correctly (71 / (42 + 71) = 62.8%).
False positive rate (FPR): The fraction of good emails that are classified as spam (14 / (119 + 14) = 10.5%).
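
As a check on the arithmetic, the two rates can be computed from the confusion table itself. The sketch below rebuilds the table by hand from the counts shown above (rows are the predicted class, columns the true class); the names `TPR` and `FPR` are introduced here for illustration.

```r
# Rebuild the confusion table: rows = predicted class, columns = true class
confusionTable <- matrix(c(119, 14, 42, 71), nrow = 2,
                         dimnames = list(class.test = c("0", "1"),
                                         data.test  = c("0", "1")))

# TPR: among true spam (column "1"), the fraction predicted as spam (row "1")
TPR <- confusionTable["1", "1"] / sum(confusionTable[, "1"])
# FPR: among good email (column "0"), the fraction predicted as spam (row "1")
FPR <- confusionTable["1", "0"] / sum(confusionTable[, "0"])

round(100 * c(TPR = TPR, FPR = FPR), 1)  # TPR 62.8, FPR 10.5
```
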
The function ROC.point is some handy code that computes these two rates for a set of B values. It can also be used to draw an ROC curve.
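
To make the idea concrete without relying on the details of ROC.point, here is a minimal sketch of the same kind of computation. The helper `rocSketch` is hypothetical and is not the ROC.point code; the probabilities and labels are made up for illustration.

```r
# Hypothetical helper (not ROC.point): TPR and FPR over a grid of thresholds
rocSketch <- function(prob, truth, B = seq(0, 1, by = 0.1)) {
  rates <- sapply(B, function(b) {
    pred <- ifelse(prob < b, 0, 1)   # same rule as the classifier above
    c(FPR = sum(pred == 1 & truth == 0) / sum(truth == 0),
      TPR = sum(pred == 1 & truth == 1) / sum(truth == 1))
  })
  data.frame(B = B, FPR = rates["FPR", ], TPR = rates["TPR", ])
}

# Made-up predicted probabilities and true 0/1 labels
prob  <- c(0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9)
truth <- c(0,   0,   1,    0,   0,   1,   1,   1)
roc <- rocSketch(prob, truth)

# Plot TPR against FPR; each point corresponds to one threshold B
plot(roc$FPR, roc$TPR, type = "b",
     xlab = "False positive rate", ylab = "True positive rate")
```

With B = 0 every email is classified as spam, so both rates are 1; with B = 1 nothing is classified as spam, so both are 0. The thresholds in between trace out the curve.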