* R-Squared
* R-squared and pseudo R-squared are useful statistics produced by most regression-type estimation routines.
* R-squared (R2) is a measure of how much of the variance in y is explained by the model.
* Thus a model with only an intercept has an R2 of 0.
set seed 101
clear
set obs 10000
gen y1=rnormal()
reg y1
* At the opposite extreme, a model with no unexplained variance has an R2 of 1.
reg y1 y1, noconstant
* Technically this regression should not work but Stata does the math and produces the results.
* Let's see how well R2 approximates explainable variation.
gen x=rnormal()
gen u=rnormal()
gen y2 = (1)*x + u
* The variance of the model is equal to 1 (from 1^2 * var(x))
* The variance of the unexplained error is equal to 1 (var(u))
* Thus the true share of explained variance should be 1^2 * var(x)/(1^2 * var(x) + var(u)) = 1/2
reg y2 x
* Thus we can see that our R2 estimate of the explained variance is very close to the true value, which is .5
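* As a rough check (a sketch, not part of the original walkthrough), the same ratio
* can be computed from the sample variances. The scalars vx and vu are names
* introduced here for illustration. The sample covariance between x and u is
* ignored, so the two numbers should agree only approximately.
quietly reg y2 x
di "R2 reported by regress:   " e(r2)
quietly sum x
scalar vx = r(Var)
quietly sum u
scalar vu = r(Var)
di "var(x)/(var(x) + var(u)): " vx/(vx + vu)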
* I emphasized the coefficient on x for a reason.
* Because it enters the model variance squared, that coefficient strongly scales the explainable variation.
* Thus:
gen y3 = 2*x + u
* Should have a much larger R2 because model variance = 2^2 * var(x) = 4
* Var(u) = 1
* R2 = 4/(4+1)=.8
reg y3 x
* If we add multiple xs the calculation is similar, though any correlation between the xs also enters the model variance.
gen x1 = rnormal()
gen x2 = rnormal()
gen y4 = x1 + x2 + u
* R2 = (var(x1) + var(x2)) / (var(x1) + var(x2) + var(u)) = 2/3
reg y4 x1 x2
* If there is correlation between the xs then that can substantially throw off the calculations.
* In the extreme case corr(x1, x2)=1 we are back to the same scenario as y3
* If x1~N(0,1) and x2~N(0,1) then x1=x2, so y = x1 + x2 + u becomes
* y = 2*x1 + u
* Thus R2 = .8
* In the other extreme corr(x1,x2)=-1
* Then, given that they are both ~N(0,1), x2=-x1
* y = x1 + x2 + u = x1 - x1 + u = u
* Which is the same as y1
* R2 = 0
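* An intermediate case can be sketched the same way. Assuming corr(x1, xcorr) = .5,
* model variance = var(x1) + var(xcorr) + 2*cov(x1, xcorr) = 1 + 1 + 2*(.5) = 3,
* so R2 should be roughly 3/(3+1) = .75.
* xcorr and ycorr are names made up for this illustration. xcorr is built from the
* existing x1 and x2 so that it has variance 1 and no new random draws are needed.
gen xcorr = .5*x1 + sqrt(1 - .5^2)*x2
gen ycorr = x1 + xcorr + u
corr x1 xcorr
reg ycorr x1 xcorr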
* R-squared can also be thought of as the square of the correlation between the predicted values and the observed.
reg y4 x1 x2
predict y4hat
corr y4hat y4
* Thus we can see that the correlation between yhat and the observed y is about .81.
* A high correlation indicates that our model has done well at predicting the observed outcome.
di r(rho)^2
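* The same number can also be recovered from the results regress stores, which is a
* quick way to confirm the equivalence. This is only a sketch: e(r2), e(mss) and e(rss)
* are the stored R2, model sum of squares and residual sum of squares.
quietly reg y4 x1 x2
di e(r2)
di e(mss)/(e(mss) + e(rss))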
* A brief note on adjusted R2.
* R2 never decreases when variables are added to the model, even if they are pure noise.
gen z1 = rnormal()
gen z2 = rnormal()
gen z3 = rnormal()
reg y4 x? z?
* Thus: the R2 moved from R-squared = 0.6610 to
* R-squared = 0.6611
* Because of this, researchers developed the adjusted R2 (Adj R2), which slightly penalizes the R2 for including more variables.
* Thus Adj R-squared = 0.6609
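* For reference, the usual formula is Adj R2 = 1 - (1 - R2)*(N - 1)/(N - k - 1),
* where k is the number of regressors. A sketch of reproducing it from the stored
* results, listing the regressors explicitly; the two numbers displayed should match.
quietly reg y4 x1 x2 z1 z2 z3
di 1 - (1 - e(r2))*(e(N) - 1)/(e(N) - e(df_m) - 1)
di e(r2_a)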
* The penalty is reasonable in principle, but here it is trivial and almost always worth ignoring.
* I generally don't pay attention to the adjusted R2 and I don't know anybody else who does either.
* A .0001 difference in R2 is so unimportant that it can be ignored without any meaningful loss.