Besides using continuous variables to make predictions, it is sometimes useful to use variables that are just different categories, e.g. Male/Female or Engine Type.
This lecture goes over how storing such variables in the factor class makes it possible to include them in an lm formula.
The AudiA4 data set has the variable engine to indicate the type of engine. This prints as character strings (the engine type) but it is actually a factor variable.
> is.numeric( AudiA4$engine)
[1] FALSE
> is.character( AudiA4$engine)
[1] FALSE
> is.factor( AudiA4$engine)
[1] TRUE
Factor variables simply keep track of different categories and have two parts: the character string names of the categories, known as levels, and an integer variable that maps each data value to one of the categories. The example below creates a simple factor variable from character strings and then dissects it. Keep in mind that although the factor class may seem redundant, it makes it easy to include these kinds of variables in a regression.
# some character data
> var1<- c( "red", "blue", "red", "green", "blue", "blue", NA)
> print(var1)
[1] "red"   "blue"  "red"   "green" "blue"  "blue"  NA
>
> var1F<- as.factor(var1)
> print( var1F)
[1] red   blue  red   green blue  blue  <NA>
Levels: blue green red
>
> index1<- as.numeric( var1F)
> print(index1)
[1]  3  1  3  2  1  1 NA
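The two parts of a factor can also be pulled apart and put back together directly. The following is a minimal sketch that reuses var1 from above; the names lev, codes and rebuilt are introduced here only for illustration.

```r
var1 <- c("red", "blue", "red", "green", "blue", "blue", NA)
var1F <- as.factor(var1)

lev <- levels(var1F)        # character names of the categories
codes <- as.integer(var1F)  # integer code for each data value

# indexing the levels by the codes recovers the original strings,
# with NA staying NA
rebuilt <- lev[codes]
print(rebuilt)
print(identical(rebuilt, var1))
```

This is the sense in which the factor is "redundant": the levels plus the integer codes carry exactly the same information as the character vector.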
In either case (var1 or var1F) we get the same result from table:
> table( var1)
var1
 blue green   red
    3     1     2
This is a good example of the use of the match function.
> levels<- unique( var1)
> print( levels)
[1] "red"   "blue"  "green" NA
> index2<- match( var1, levels)
> print( index2)
[1] 1 2 1 3 2 2 4
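Note that index1 and index2 are not quite the same. The factor sorts its levels alphabetically while unique keeps them in order of first appearance, and unique also keeps NA as an item to match. To get them equal one needs to sort the levels and omit NA as a possible item to match. Then the last value, matched above to item 4, will just be "matched" as an NA.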
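One way to make the two indices agree, as a sketch: sort drops NA by default and orders the names alphabetically, which is the ordering as.factor uses for character data. The names lev and index3 are introduced here only for illustration.

```r
var1 <- c("red", "blue", "red", "green", "blue", "blue", NA)
index1 <- as.numeric(as.factor(var1))

# sorted category names with NA removed
lev <- sort(unique(var1))

# NA is no longer an item in lev, so the last value matches nothing
index3 <- match(var1, lev)
print(index3)
print(identical(as.numeric(index3), index1))
```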
For the Audi data, a simple example is to see whether the engine type makes a difference in price. A table of these values shows that there are problems with the categories, and it makes sense to fix these first.
> table( AudiA4$engine)

    1.8T     2.0T      3.0    NEWLY Prestige       SE
      51      305        1        2        2        1
# Oops! Some of these are not engines, and we cannot
# handle just one data point for 3.0
> engineNew0<- as.character(AudiA4$engine)
> ind<- as.numeric(AudiA4$engine)
# levels 3 and up (3.0, NEWLY, Prestige, SE) are set to missing
> engineNew0[ ind >= 3] <- NA
> engineNew <- as.factor( engineNew0)
> table(engineNew)
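The cleanup above relies on the integer codes, so it depends on the order of the levels. An alternative sketch selects by level name instead. Here the vector engine is only a stand-in for AudiA4$engine, rebuilt from the table counts above, and keep is a name introduced for illustration.

```r
# stand-in for AudiA4$engine, rebuilt from the table counts above
engine <- factor(rep(c("1.8T", "2.0T", "3.0", "NEWLY", "Prestige", "SE"),
                     times = c(51, 305, 1, 2, 2, 1)))

keep <- c("1.8T", "2.0T")           # categories to retain
engineNew0 <- as.character(engine)
engineNew0[!(engineNew0 %in% keep)] <- NA

# rebuilding the factor drops the levels that are no longer used
engineNew <- factor(engineNew0)
print(table(engineNew, useNA = "ifany"))
```

Selecting by name keeps working even if a new category appears in the data and shifts the integer codes.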
With the cleaned up version of the engine variable:
> lm( price ~ engineNew, data=AudiA4 )

Call:
lm(formula = price ~ engineNew, data = AudiA4)

Coefficients:
  (Intercept)  engineNew2.0T
         7339          15403
The interpretation is that for cars without the 2.0T engine (that is, the 1.8T cars) the predicted price is 7339, and for cars with the 2.0T engine the prediction is 7339 + 15403 = 22742.
Note that compared to using the other predictors this is not a very good model (R^2 = .33).
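To see where such coefficients come from, here is a small sketch with made-up prices (not the AudiA4 data; the data frame toy is hypothetical). With a single two-level factor as the predictor, lm builds one 0/1 column for the second level, so the intercept is the mean of the first level and the intercept plus the factor coefficient is the mean of the second.

```r
# made-up prices for illustration only, not the AudiA4 data
toy <- data.frame(
  price  = c(7000, 8000, 21000, 23000, 24000),
  engine = factor(c("1.8T", "1.8T", "2.0T", "2.0T", "2.0T"))
)
fit <- lm(price ~ engine, data = toy)
print(coef(fit))

# the design matrix shows the 0/1 column R builds for the second level
print(model.matrix(fit))

# group means, for comparison with the coefficients
print(tapply(toy$price, toy$engine, mean))
```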
Or, including the engine type in a more complete model:
lm( price ~ year + mileage + engineNew, data=AudiA4 )
Here is a way to adjust for whether a car is within 250 miles of Boulder.
> distF<- as.factor( ( AudiA4$distance <= 250) )
# recall that logicals get coded as 0 and 1.
> lm( price ~ year + mileage + distF, data=AudiA4 )

Call:
lm(formula = price ~ year + mileage + distF, data = AudiA4)

Coefficients:
(Intercept)         year      mileage    distFTRUE
 -3.033e+06    1.522e+03   -9.266e-02    3.865e+02
The interpretation here is that there is a premium of about 387 dollars (the coefficient 3.865e+02 is 386.5) for cars within 250 miles, for cars of the same year and mileage.
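The coefficient name distFTRUE comes from the levels of the factor. A quick sketch with made-up distances (the values below are hypothetical, not from AudiA4) shows that FALSE is the first level and so acts as the baseline:

```r
# made-up distances in miles, for illustration only
distance <- c(40, 600, 250, 1200, 10)
distF <- as.factor(distance <= 250)

# FALSE is the first (baseline) level, so the coefficient is for TRUE
print(levels(distF))
print(table(distF))
```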
The intercept is hard to interpret in many cases because the variables are never observed near zero and the model is not meant to be extrapolated there (e.g. price when year = 0).
It is convenient to "center" the variables by their mean, or subtract off a constant, to make the intercept interpretable. Note that this adjustment does not change the fitted values, the residuals, R^2, or the slope estimates; only the intercept changes.
> lm( price ~ I(year - 1999) + mileage + distF, data=AudiA4 )

Coefficients:
   (Intercept)  I(year - 1999)         mileage       distFTRUE
     1.023e+04       1.522e+03      -9.266e-02       3.865e+02
This way the intercept is a base price: the predicted price for a 1999 car with zero mileage that is more than 250 miles away.
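The claim that shifting year only moves the intercept can be checked numerically. The sketch below uses simulated data (not the AudiA4 data; toy and its coefficients are invented for illustration) and compares the two fits.

```r
set.seed(1)  # simulated data for illustration only, not the AudiA4 data
n <- 50
toy <- data.frame(year = sample(1999:2012, n, replace = TRUE),
                  mileage = runif(n, 0, 150000))
toy$price <- 10000 + 1500 * (toy$year - 1999) - 0.09 * toy$mileage +
  rnorm(n, sd = 1000)

fit0 <- lm(price ~ year + mileage, data = toy)
fit1 <- lm(price ~ I(year - 1999) + mileage, data = toy)

# slopes, fitted values and R^2 agree between the two fits
print(all.equal(unname(coef(fit0)[-1]), unname(coef(fit1)[-1])))
print(all.equal(fitted(fit0), fitted(fit1)))
print(all.equal(summary(fit0)$r.squared, summary(fit1)$r.squared))

# the new intercept is the old intercept plus 1999 times the year slope
print(all.equal(unname(coef(fit1)[1]),
                unname(coef(fit0)[1] + 1999 * coef(fit0)["year"])))
```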