Tuesday, April 26, 2011

Some analysis of the US News university ranking


The yearly US News best college ranking is an important tool for students and their eager parents when comparing schools. The latest data are publicly available (paying 20 bucks gets full access) [Ref.1]. The methodology is easy to find and explain [Ref.2]: each college gets a score that is a weighted combination of peer assessment, retention, faculty resources, student selectivity, graduation rate, etc., and the final ranking orders the colleges by that score.
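To make the weighting idea concrete, here is a minimal Python sketch of a weighted composite score followed by a ranking. The weights, the 0-100 indicator values and the college names are all made up for illustration; they are not the figures US News uses.

```python
# Hypothetical weights (NOT the actual US News weights); they sum to 1.
WEIGHTS = {
    "peer_assessment": 0.30,
    "retention": 0.20,
    "faculty_resources": 0.20,
    "student_selectivity": 0.15,
    "graduation_rate": 0.15,
}

# Made-up colleges with made-up indicator values on a 0-100 scale.
COLLEGES = {
    "College A": {"peer_assessment": 90, "retention": 95, "faculty_resources": 80,
                  "student_selectivity": 85, "graduation_rate": 90},
    "College B": {"peer_assessment": 70, "retention": 98, "faculty_resources": 95,
                  "student_selectivity": 95, "graduation_rate": 98},
    "College C": {"peer_assessment": 60, "retention": 80, "faculty_resources": 70,
                  "student_selectivity": 65, "graduation_rate": 75},
}


def composite_score(indicators, weights=WEIGHTS):
    """Weighted sum of the indicator values."""
    return sum(weights[name] * indicators[name] for name in weights)


def rank_colleges(colleges, weights=WEIGHTS):
    """Return (rank, college, score) tuples, highest score first."""
    scored = {name: composite_score(ind, weights) for name, ind in colleges.items()}
    ordered = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [(pos, name, score) for pos, (name, score) in enumerate(ordered, start=1)]


ranking = rank_colleges(COLLEGES)
for pos, name, score in ranking:
    print(pos, name, round(score, 2))
```

Because the rank is derived only from the order of the scores, a small change in one heavily weighted indicator can reorder colleges whose scores are close.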

It is interesting to explore and dissect the US News ranking process. Still, the dirty job of data extraction, transformation and loading took about 90% of the working time. The data crunching was done with logistic regression (for private/public) and linear regression with variable selection (for score), using the nice tools in SAS/STAT. Factor analysis and partial least squares regression were used to cope with the multicollinearity that is widespread in these data.

The analysis leads to two conclusions. First, the ranking is more qualitative than quantitative. It depends heavily on the reputation opinions collected by surveying institutions’ administrators and high school counselors; the other variables just modify the result. Second, the ranking favors private universities. Being a private university adds 3 points to the overall score. The best public university, UC Berkeley, is ranked 22nd. I didn’t find any reason why it should be inferior to the private universities ranked ahead of it. At the data level, public and private universities are distinguishable, and apparently they target different customer groups. To be fair, US News could divide the university ranking into two subsystems, public universities and private universities, which would be more helpful for understanding each university's standing in its sector.
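As a rough illustration of how a "private adds 3 points" statement is read off a regression, the Python sketch below fits ordinary least squares with a 0/1 private indicator. The data are synthetic and deliberately built so that the private effect is exactly 3; this is not the US News data and does not reproduce the SAS results. The helper names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic (reputation, private) pairs; score = 20 + 15*reputation + 3*private.
observations = [(4.8, 1), (4.5, 1), (4.7, 0), (4.0, 1),
                (3.9, 0), (3.5, 0), (3.2, 1), (3.0, 0)]
design = [[1.0, rep, float(private)] for rep, private in observations]
scores = [20 + 15 * rep + 3 * private for rep, private in observations]

intercept, reputation_effect, private_effect = ols(design, scores)
print(round(intercept, 3), round(reputation_effect, 3), round(private_effect, 3))
```

The coefficient on the indicator is the gap in score between a private and a public university with the same reputation, which is how the 3 points above should be understood.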

References:
1. http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data
2. http://collegethrive.com/college-rankings-us-news-world-report-method


****************(1)CLUSTERING STEP******************;
ods listing close;
ods output variables = _varlist;
proc contents data = uscr11;
run;

proc sort data = _varlist;
   by num;
run;

proc sql;
   select variable into: num_vars separated by ' '
   from _varlist
   where lowcase(type) = "num" and num not in (4, 5)
;quit;

proc varclus data = uscr11 summary outtree=tree;   
   var &num_vars;
run;

ods html style = harvest;
ods graphics on;
goptions htext = 4pct ftext = "Albany AMT";
axis1 order = (0.5 to 1 by 0.1);
axis2 label = none;

* Plot the cluster tree written by PROC VARCLUS (outtree = tree);
proc tree data = tree horizontal haxis=axis1 vaxis=axis2;
   height _propor_;
   id _label_;
run;

proc sgscatter data = uscr11;
   matrix &num_vars /ellipse=(alpha=0.25) markerattrs=(size=1);
run;

****************(2)IMPUTATION STEP******************;
proc mi data = uscr11 nimpute = 1 round = .01 
        seed = 20110425 out = _tmp0;
   monotone regpmm(donaterate = score ugrepidx gradrate retention);
   var score ugrepidx gradrate retention donaterate;
run;

proc mi data = _tmp0 nimpute = 1 round = .01 
        seed = 20110425 out = imputed;
   monotone reg(top10fresh = score ugrepidx sat25p sat75p acceptrate);
   var score ugrepidx sat25p sat75p acceptrate top10fresh;
run;

****************(3)FACTOR ANALYSIS STEP******************;
proc factor data = imputed nfactors = 3 rotate=promax
            reorder out = factorized plots=(scree);
   var &num_vars;
run;

data _tmp1;
   set factorized;
   if type = "private" then do; shape = "club"; color = "blue"; end;
   else do; shape = "diamond"; color = "red"; end;
   keep shape color factor:;
run;

proc g3d data = _tmp1;
   scatter factor2*factor3 = factor1 / color = color shape = shape;
run;

****************(4)LOGISTIC REGRESSION STEP******************;
proc logistic data = imputed plots = (roc);
   model type = &num_vars / 
         selection = stepwise slentry = 0.3 slstay = 0.3;
run;

* Partial least squares on the overall score, to cope with multicollinearity;
proc pls data = imputed plot = (corrloadplot variableimportanceplot);
   model score = &num_vars;
run;

* Pick the candidate predictors for the score model by column number;
proc sql;
   select variable into: vars separated by ' '
   from _varlist
   where num in (3, 6, 7, 8, 12, 13, 14, 15, 16)
;quit; 

****************(5)VARIABLE SELECTION STEP******************;
proc glmselect data = imputed plot = (coefficientpanel aseplot);
   partition fraction(validate = 0.5);
   class type;
   model score = &vars / 
         selection = stepwise(choose = validate select = sl);
run;
ods graphics off;
ods html close;
ods listing;

****************END OF ALL CODING***************************************;
