HOW SHOULD WE RATE RESEARCH UNIVERSITIES?
by Nancy Diamond and Hugh Davis Graham
[Note: This article and its research data
have been reproduced with permission
from the late Prof. Graham's web page at Vanderbilt University.
For additional information on the original site, please contact
David L. Carlton.]
Planning for the National Research Council's (NRC) next study of research-doctorate
programs in the United States, with publication expected in 2004, has highlighted
disagreements over how quality should be measured. One side in the debate supports
continued reliance on reputational surveys as the primary measure of quality.
On the other side, advocates call for more objective measures of research performance,
as demonstrated in publications, awards, prizes, and other indicators of scientific
and scholarly achievement. Elite institutions favored by the traditional reputational
method generally resist the use of more quantitative per capita measures that
may favor newer, aspiring programs and universities. Like Republicans and Democrats
arguing over which methods to use in conducting the Census, the champions of
subjective and more objective methods know that the choice or mix of methods
will significantly determine who benefits -- and who loses -- from the findings.
The commercial success of college and university rankings published annually
by U.S. News & World Report and the 1995 publication of the NRC report,
Research-Doctorate Programs in the United States (hereafter, Report),
have intensified this debate. (1) The Report
contained a wealth of program data, including quantitative indicators of research
output. At the same time, however, the NRC ranked faculty and programs exclusively
by their reputational rating. This produced top-quartile lists and top-twenty
bragging rights that necessarily disappointed many of the 274 institutions whose
programs were included in the study. In the competitive academic marketplace,
the stakes of this ratings game are high. Top-ranked research-doctorate programs,
or those seen to be within striking distance of the top tier, may win increased
funding, recruit nationally recognized faculty and talented students, and place
their graduates in the academic job market. Conversely, low ranking can produce
program decline and even termination. The prospect of another national NRC study,
the first of the 21st century, has heightened interest in the planning process.
Ambitious universities not previously accorded top-tier status are especially
open to alternative methods that offer institutional challengers an opportunity,
one less influenced by inherited hierarchies of status and prestige, to demonstrate
their research achievement. In this article, reference to "rising" or "challenging"
institutions denotes universities that were not ranked among the top 25 according
to any of the four major national surveys since 1960.
In The Rise of American Research Universities (Johns Hopkins, 1997),
we emphasized the importance and value of quantitative per capita measures of
scholarly research over reputational surveys. (2)
Because that book charts the research development of more than 200 universities
since World War II, we aggregated data at the institutional level at several
points over time, rather than at the program level, where national studies sponsored
by the American Council on Education (ACE) and the NRC have concentrated their
analysis. (3) In this article, we apply the per
capita method to program-level data and compare the results with the NRC's reputational
ratings of the research quality of program faculty. Our purpose is to test,
at the program or department level, our book's dual finding that first, quantitative
per capita assessments confirmed the research excellence of most of the elite
universities customarily found among the top 20 when judged according to reputation.
Second, per capita measures also demonstrated the superior performance of "rising"
institutions, whose achievements often have been masked by the national surveys
that ranked campuses according to reputation. On the basis of these comparisons,
we offer specific recommendations for how -- and how not -- to rate research
universities in the next NRC study.
The Strengths and Weaknesses of Reputational Ratings
Reputation surveys have dominated 20th-century assessments of American faculty
and graduate education. Developed during the 1920s and 1930s through the pioneering
work of Raymond Hughes, and advanced by Hayward Keniston in the late 1950s,
reputational surveys had won credibility for three reasons.
(4) First, these evaluations rested on the peer-review principle that
scientific, scholarly, and artistic quality is best assessed by recognized experts
in the field. Peer review thus represented a qualitative, holistic judgment
that also could reflect quantitative measures of research performance. Since
World War II especially, peer review has enjoyed wide respect among academics,
as well as government, business, and foundation officials, as the most appropriate
method for awarding appointments, promotions, tenure, research grants and contracts,
and prizes.
Second, the crucial assumptions underpinning peer review -- that the rater
is an expert who knows the body of work or persons being assessed -- were reasonably
met during the early and middle decades of the 20th century when reputational
ratings became the primary evaluation method of the major national studies.
Doctoral education prior to World War II was dominated by the prestigious members
of the Association of American Universities (AAU), a group of 14 founding campuses
whose ranks increased to only 30 institutions in 1940. Even in 1960, the Council
of Graduate Schools (CGS), representing institutions that granted 95 per cent
of all Ph.D.s, had only 100 member universities. In this still relatively small
world of graduate study, the teaching function of doctoral education largely
coincided with its research function. Doctoral programs were housed in traditional
academic departments, where the faculty generally knew the work of their disciplinary
colleagues on other American campuses.
Third, in the absence of alternative, more objective methods of measurement,
this legacy of rater familiarity with the research of faculty in their disciplines
lent credibility to subjective ratings. Not until the late 1960s and early 1970s
did the reporting of federal research funding and developments in electronic
data processing, most notably in citation indexing, offer opportunities to measure
individual and institutional research output directly, rather than indirectly
through the filter of reputation. (5) At the
same time, however, the development of quantitative measures, together with
American higher education's dramatic expansion in the 1960s, and the larger
revolution in communications and research networks, rapidly undermined the institutional
arrangements that had earned early respect for reputational ratings.
The resulting criticism of reputational assessments generally rests on two
grounds. One is based on research in the psychology of human perception, while
the other, accelerating in its impact, is based on the rapidly changing research
environment of the post-Sputnik era. The first body of criticism, duly noted
in the NRC Report, emerged from the development of survey research in
the 1950s and 1960s. It demonstrates that reputation surveys are biased by a
halo effect that lifts the reputations of departments and programs with academic
stars, and of those located on prestigious campuses. (6)
Additionally, reputation ratings are biased in favor of large programs. Raters
who recognize three published scholars in a department of forty faculty tend
to rate it higher than a department of twenty where only two are recognized.
(7)
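The arithmetic behind this size bias can be made concrete. The sketch below is illustrative only: it uses the department sizes from the example above, and the function name `recognized_share` is a hypothetical label, not a measure used in any of the national studies.

```python
from fractions import Fraction


def recognized_share(recognized: int, faculty: int) -> Fraction:
    """Share of a department's faculty whose work a rater recognizes."""
    if faculty <= 0:
        raise ValueError("faculty must be positive")
    return Fraction(recognized, faculty)


large = recognized_share(3, 40)  # 3 recognized scholars among 40 faculty
small = recognized_share(2, 20)  # 2 recognized scholars among 20 faculty

# A rater counting familiar names favors the large department (3 > 2),
# but the per capita share favors the small one (1/10 > 3/40).
print(float(large), float(small))
```

Counting names rewards the department of forty, while the share of recognized faculty is higher in the department of twenty.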
A second line of criticism, less recognized though more damaging to the validity
of reputational ratings, is based on changes which have undermined the very
premise that legitimated reputational surveys in the first place. Driven by
the defense research imperatives of the Cold War, the unprecedented growth of
the American economy, the demographics of the baby boom, and technical advances
in communications, the revolution in knowledge creation has radically rearranged
our research environment. We have witnessed this great transformation in our
lifetimes, and our careers have been enriched by it. Yet we are so intimately
caught up in its processes that we need to step back and consider the impact
of these changes on the assessment of research achievement.
What are the chief attributes of this transformation? Perhaps most important,
research became increasingly specialized, widening the spectrum of inquiry and
deepening its penetration. Knowledge creation also grew increasingly interdisciplinary,
with a resulting fragmentation of our disciplinary communities. By the 1980s
and 1990s, as American universities conferred between 30,000 and 40,000 new Ph.D.s
annually, the number of qualified researchers exploded, and quality research
spread to second- and third-tier institutions. Research institutes proliferated,
as did new scientific and scholarly associations and journals. The entire apparatus
of research communications and infrastructure was internationalized. Interdisciplinary
research was furthered as publication and research collaboration on the internet
and electronic mail communication became instantaneous.
The Peer Review Disconnection
These changes have produced important consequences for the evaluation of research-doctorate
programs. Most significant has been a profound split between the university's
discipline-based organization for graduate training on the one hand, and the
interdisciplinary research networks on the other. As the American research university
enters the 21st century, its department-based teaching is still grounded in
a horizontal structure that is resistant to change. Departments hire faculty
to cover the main subfields of the disciplinary terrain and attend to important
organizational routines -- such as promotion and tenure decisions and graduate
and undergraduate teaching obligations -- requirements that fix faculty firmly
within these traditional arrangements. Our large, discipline-based professional
associations continue to publish directories that list faculty rosters by department,
and the reputational surveys reflect such arrangements. In 1993, for example,
the NRC used a disciplinary focus, asking more than 16,000 respondents to rate
the scholarly quality of the faculty in some fifty departments in their fields.
(8)
At the same time, faculty research networks that have become increasingly
vertical no longer correspond to this horizontal department organization. In
this constantly changing research environment, specialized, interdisciplinary
networks typically connect researchers to only one or two members of their discipline
who share their research interests. These networks then branch outward, and
with increasing regularity, reach across the globe. Rather than reflecting department
directories, faculty research networks more closely reflect our own e-mail address
lists.
One consequence of this growing disconnection between faculty research networks
and the discipline-based doctoral programs is the loss of expertise from the peer
review equation in reputational surveys. Faculty raters, who know a great deal about the quality
of scholarship in their research areas, are asked instead to assess the work
of entire faculties and graduate programs in scores of other departments. It
is probable that the distortions of the halo effect, always problematical, have
been magnified during recent decades as raters have faced departments filled
with specialists whose work was unfamiliar to them. Under these circumstances,
scholarship was far less important in determining prestige ratings than the
past reputations of either departments or their affiliated universities.
The most troublesome consequence of continued reliance on reputational surveys
is the harm this subjective method inflicts, however inadvertently, on aspiring
departments, programs, and institutions. The prestige of established elites
appears to act as a filter, screening from view the research achievements of
the challengers, depriving them of recognition for accomplishments they have
earned. The result of this baneful process in fact may be two-directional, screening
our most prestigious universities from the bracing effects of vigorous competition
by challenging institutions.
Comparing Reputational Ratings and Quantitative Measures by Academic Discipline
The argument outlined above, that reputational ratings have grown obsolete
and harmful, is plausible, but unproven. Indeed, the history of reputational
surveys as the mainstay of national university comparisons since the 1920s shows
remarkably little research validating its utility as an accurate measure of
research quality. The major national studies instead presumed the primacy of
reputational surveys as a measure of research quality. This presumption was
defensible through the 1960s and early 1970s when alternative measures of assessment
were underdeveloped, and there was a loose academic consensus -- one that still
exists -- that rankings based on reputation ratings were more or less correct,
especially at the top of the research hierarchy. However, to perpetuate this
untested assumption in the face of the extraordinary changes that were undermining
its premise represented a disappointing standard of scientific rigor.
In the absence of a systematic validation of the most promising subjective
and quantitative measures of university research quality against a benchmark
standard of excellence, what evidence is available to test the proposition that
reputational ratings fail to recognize the research achievements of rising programs
and institutions? First, studies that documented research achievement in individual
disciplines, especially sociology and political science, have provided a more
finely grained analysis of research performance. Such studies produced rankings
based on the number of publications, citations, grants, patents, and other research
indicators, and compared ratings based on these measures with reputational ratings.
(9) In several of these single-discipline studies, especially those
that relied on per capita measures, researchers have found a discrepancy between
reputational ratings and the levels of research achievement shown by rising
departments and programs. (10)
Second, in The Rise of American Research Universities, we demonstrated
the same phenomenon at the institutional level. In the public sector we identified
21 rising universities, including the University of California (UC), Santa Barbara
and the State University of New York (SUNY) at Stony Brook. In the private sector
there were 11 such campuses, including Brandeis and Rochester, institutions
whose achievements were underrecognized by the major reputational surveys.
The institutional-level focus we employed, designed for a different purpose,
does not yield the level of precision available through program-level analysis.
In this article, to extend our analysis, we compare the NRC's reputational ratings
with per capita measures of citation and award density. The tables below
reflect these comparisons in both individual disciplines and broad fields. The
left-hand columns document the NRC reputational rankings of scholarly quality
of program or department faculty, while the right-hand columns reflect rankings
based on per capita citation density (or awards density for humanities fields).
Citation density and award measures were provided in the NRC Report,
but were not used by the NRC for ranking purposes. (11)
Finally, we compare David Webster's and Tad Skinner's institutional aggregation
of the NRC reputational rankings (Change, 1996) with our own grand ranking
that is based on quantitative per capita data.
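One simple way to read such side-by-side columns is to compute, for each program, the difference between its reputational rank and its per capita rank. The following is a minimal sketch under stated assumptions: the function name `rank_shifts`, the program labels, and all rank values are hypothetical and are not taken from the NRC Report or from the tables.

```python
def rank_shifts(reputation_rank, density_rank):
    """Return {program: reputation rank - density rank} for programs in both lists.

    A positive shift means the program ranks higher (closer to 1) on the
    per capita measure than on reputation.
    """
    shared = reputation_rank.keys() & density_rank.keys()
    return {p: reputation_rank[p] - density_rank[p] for p in shared}


# Hypothetical ranks, invented for illustration only.
reputation = {"Program A": 1, "Program B": 2, "Program C": 20, "Program D": 12}
density = {"Program A": 2, "Program B": 5, "Program C": 1, "Program D": 12}

shifts = rank_shifts(reputation, density)

# List programs from the largest rise to the largest fall on the per capita measure.
for program, shift in sorted(shifts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{program}: {shift:+d}")
```

A large positive shift marks the kind of program described here as a challenger, while shifts near zero mark programs whose reputation and per capita performance agree.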
The Strengths and Weaknesses of Citation Measures
Before discussing the tables, it is important to note the strengths and weaknesses
of the per capita citation and award measures. These indicators refer to the
number of citations or awards for a given department or program divided by the
number of program faculty. Such indicators thereby avoid the problem, common
in press-release competition among universities, of conflating quantity with
quality by comparing total output data (for annual publications, citations,
awards, research dollars, etc.) irrespective of program or institutional size.
Per capita indicators offer instead a unit of research productivity that can
be compared across programs at institutions of different sizes and types.
(12)
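The per capita calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the NRC's procedure: the function name `per_capita_density` and the two programs with their counts are hypothetical.

```python
def per_capita_density(total: float, faculty: int) -> float:
    """Citations (or awards) per program faculty member."""
    if faculty <= 0:
        raise ValueError("faculty count must be positive")
    return total / faculty


# Hypothetical programs: (total citations, number of program faculty).
programs = {
    "Large program": (4000, 80),
    "Small program": (1500, 20),
}

density = {name: per_capita_density(c, n) for name, (c, n) in programs.items()}

# The leader by total output and the leader by per capita output can differ.
by_total = max(programs, key=lambda name: programs[name][0])
by_density = max(density, key=density.get)
print(by_total, by_density)
```

In this invented case the large program leads on total citations, but the small program leads once output is divided by faculty size, which is the distinction between quantity and quality that the per capita indicator is meant to capture.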
The value of citation analysis as an indicator of research quality has been
widely acknowledged. (13) Published scholarship
varies widely in quality -- roughly half of all scholarly and scientific publications,
bibliometricians report, are never cited at all. Ranking university doctoral
programs by the frequency with which the published scholarship of their faculty
is cited by others thus provides a valuable benchmark of research quality, arguably
the best single measure available.
On the other hand, despite its superior value as an indicator of research
importance, there are inherent limitations. Citation analysis is but a single
indicator, and no single indicator, however excellent, is sufficient for measuring
the complexity and quality of institutional knowledge creation. There are other
drawbacks. The NRC's funding and deadline pressures in the early 1990s, combined
with the limited capacities of the Institute for Scientific Information (ISI),
publisher of the Citation Index series, produced a data base of objective
indicators with a level of reliability substantially below that which can be
achieved today. (14)
In the last NRC study, errors were introduced through misreporting by campus-based
Institutional Coordinators, who were assigned the task of providing the number
of campus faculty (the denominator in per capita citation density measures).
Still other errors involved output data (publications, citations) caused by
mistakes in recording names and institutions, in matching zip codes, and in
data entry. However, such flaws tend to be randomly distributed, and produce
little significant distortion when aggregated at the level of academic field
or institution. Moreover, the number of arts and humanities awards, collected
by the NRC staff, avoided most electronic data processing errors. Although caution
is required when comparing programs or departments on the basis of the NRC Report's
citation density scores, careful comparisons demonstrate persistent discrepancies
between subjective and objective measures of research achievement. In all of
the comparisons that follow, the patterns of research performance that emerge
are consistent with our research findings that reputational rankings tend to
mask the demonstrable research achievements of challenging institutions.
Comparing Reputational and Quantitative Measures of Research Achievement by Discipline
Tables 1 through 5 compare reputational and citation density (or award density
for the humanities) rankings for individual disciplines. Table
1 shows rankings for the top 25 programs in astrophysics and astronomy,
as representative of fields in math and the physical sciences. We selected astronomy
as an illustrative discipline for several reasons. Because only 33 doctoral
programs in astronomy and astrophysics were rated by the NRC (as compared, for
example, with 179 programs in cell and developmental biology), we reasoned that
astronomers would be more likely to know one another's work. By implication, members
of small research communities should be less vulnerable to the halo-effect distortions
of institutional prestige. Thus, the appearance of significant differences between
reputation and citation rankings in astronomy reinforces the argument that institutional
prestige often distorts collegial perception of research performance.
Table 1 reflects three patterns. The first supports a finding
demonstrated in The Rise of American Research Universities: the nation's
elite universities that have won the top reputation rankings have earned their
enviable status through superior research achievement. Familiar institutional
elites -- Caltech, Princeton, UC Berkeley, Harvard, MIT -- dominate the top
ten ranks in both reputation and per capita citation density. However, according
to citation density scores (displayed in the right-hand column), certain challenging
institutions, either absent or not highly ranked by reputation (displayed in
the left-hand column) rise toward the top of the list. These campuses include
Massachusetts-Amherst, UC Santa Cruz, SUNY-Stony Brook, and Colorado. This dual
pattern is repeated throughout the tables that follow. On the one hand, established
elite institutions, such as the Ivy League campuses and great state flagships,
are often top-ranked on both reputational and objective measures. At the same
time, challenging universities, often younger and smaller institutions such
as SUNY-Stony Brook, Brandeis, or the newer UC campuses, break into the upper
ranks when measured by their research achievements rather than by a perceived
level of prestige. A third pattern found in Table 1 seems distinctive
to the fields of astronomy and astrophysics. Certain universities (Arizona,
Hawaii-Manoa) appear to benefit from the prominence of their astronomical observatories,
scoring higher on reputation but lower when ranked by citation density.
Tables 2 through 5, comparing disciplines representing the biological sciences,
engineering, social and behavioral sciences, and arts and humanities, show similar
patterns of high rank in both reputation and per capita measures by traditionally
prestigious institutions, high rank by rising institutions on quantitative measures,
and certain patterns distinctive to specific disciplines. In
cell and developmental biology (Table 2), for example, traditional
elites -- MIT, Caltech, and Harvard -- rank high according to both measures.
(15) At the same time, several challenging institutions -- Case Western
Reserve, Vanderbilt, Brandeis, and Cincinnati (all of which save Brandeis have
a campus medical school) -- break into the top 25 when measured by citation
density. Finally, cell biology programs based in medical schools are strongly
represented in both rankings. The programs at the Stanford and Colorado medical
schools, for example, are ranked higher on both reputational and quantitative
measures than their counterparts in the arts and sciences.
Similarly, in the field of electrical engineering (Table
3), all three patterns hold. Proven elites -- Caltech, Princeton, Stanford,
MIT -- rank high according to both measures. They are joined in the top 10 quantitative
rankings (right-hand column) by rising challengers UC Santa Barbara and SUNY-Buffalo.
Third, when ranked according to citation density, the rising research universities
include non-flagship land-grant universities -- for example, North Carolina
State (which also appears among the reputational top 25) and Colorado State
-- a group not strongly represented among the top ranks in other fields.
In social and behavioral sciences disciplines such as history, where publication
typically takes the form of books rather than journal articles, citation density
is a less reliable indicator. However, economics (Table
4) provides a more typical example. The reputational ranking for economics
holds few surprises. It is worth noting, however, that at Caltech, where faculty
divisions are not organized according to the NRC's disciplinary taxonomy, a
"virtual" economics program assembled by the Institutional Coordinator, ranked
19th (of 107 programs) in the NRC's reputation survey. In the eyes of faculty
raters, Caltech's exceptional "coattail effect" boosted the reputation of even
a program that did not formally exist.
In economics, the top 10 citation density ranking was led by a number of the
same elite institutions -- Chicago, Harvard, MIT, Stanford -- found in the reputational
top 10, with the striking exception of Maryland-College Park, which jumped to
first place in citation density from 20th rank in reputation. Maryland's high
per capita ranking demonstrates in part the power of academic stars, in this
case, College Park economist Mancur Olson, whose widely cited 1971 book, The
Logic of Collective Action, created a new analytical paradigm.
(16) Aside from Maryland's striking rise (and Caltech's highly regarded
"virtual" program), the economics comparison demonstrates a similar pattern
of challenging institutions -- Boston University, Rochester, Vanderbilt -- rising
in the per capita category.
The final discipline-based comparison ranks programs in philosophy
(Table 5) as representative of the arts and humanities. Because
no accurate method for measuring book publication was available in the early
1990s, the NRC staff independently compiled a data file of honors and awards
received by humanities program faculty. Unfortunately, the Report
provides awards data for only a small number of programs, especially when compared
with the high numbers of article and citation data that were documented. As
a consequence, a small difference in awards per program faculty produced a large
difference in ordinal ranking, and also a large number of ranking ties.
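The sensitivity of ordinal rank to small award counts can be illustrated with a short sketch. Everything here is assumed for illustration: the program labels and award counts are hypothetical, and the tie rule (tied programs share the best rank, then ranks skip) is one common convention, not necessarily the one used in the tables.

```python
from fractions import Fraction


def competition_rank(scores):
    """Rank names by score, highest first; equal scores share a rank (1, 2, 2, 4)."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    ranks = {}
    for position, (name, score) in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and score == previous[1]:
            ranks[name] = ranks[previous[0]]  # tie: reuse the previous rank
        else:
            ranks[name] = position
    return ranks


# Hypothetical (awards, program faculty) counts, invented for illustration.
programs = {"A": (3, 20), "B": (2, 20), "C": (2, 20), "D": (1, 20), "E": (4, 40)}

# Exact fractions avoid spurious near-ties from floating-point division.
density = {name: Fraction(a, n) for name, (a, n) in programs.items()}
ranks = competition_rank(density)
print(ranks)

# One additional award moves program D from 5th into the tie for 2nd.
density["D"] = Fraction(2, 20)
print(competition_rank(density)["D"])
```

With counts this small, three of the five invented programs tie, and a single award is enough to move a program three places.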
In our own research for The Rise of American Research Universities,
to account for the fact that book publication was not represented, we constructed
a similar index for measuring the research productivity of arts and humanities
faculty. In a pilot study that documented the relationship between per capita
awards and book publication in three humanities disciplines, we found a positive
correlation of .73. This correlation demonstrates that the documentation of
awards can provide a practical substitute for book publication. The awards density
measure thus tends to be a high quality, low quantity indicator, the opposite
of such measures as total publications or total research grant dollars.
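For readers who want to see how a correlation of this kind is computed, the sketch below implements the standard Pearson coefficient. It assumes that the reported figure is a Pearson correlation, which the text does not specify, and the per capita values are hypothetical and are not the pilot study's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical per capita values for five departments, invented for illustration.
awards_per_faculty = [0.05, 0.10, 0.10, 0.20, 0.30]
books_per_faculty = [0.4, 0.5, 0.9, 0.8, 1.2]

r = pearson_r(awards_per_faculty, books_per_faculty)
print(round(r, 2))
```

A coefficient near 1 would indicate that departments with more awards per faculty member also tend to publish more books per faculty member, which is the relationship the awards index relies on.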
The philosophy comparisons show similar patterns to those in other disciplines,
with programs at prestigious universities -- Princeton, Harvard, UC Berkeley,
Stanford, Michigan, Cornell, MIT, Chicago, and Brown -- ranked among the top
15 in both reputation and awards density. Yet, consistent with the other disciplinary
comparisons (Tables 1-4), challenging institutions -- Illinois-Chicago (ranked
8th), Massachusetts-Amherst, Emory, Notre Dame, and Syracuse -- break into the
top 25 quantitative ranking. A distinctive element is represented by Pittsburgh,
which placed two top-ranked programs (Philosophy, and History and Philosophy
of Science) in both the reputational and award density categories.
Comparing Reputational and Quantitative Measures for Academic Fields
Tables 6 through 10 compare top-20 reputational (provided by Webster and Skinner)
and per capita institutional rankings for the five broad fields of study represented
by single disciplines (Tables 1-5). When performance measures for doctoral programs
are aggregated at the level of field rather than discipline, the top 10 ranks
in citation density are typically dominated by established elites, with challengers
breaking into the second ten ranks. Thus, in the physical sciences
and mathematics (Table 6), the challenging universities in
the citation density category include Arizona, UC Santa Barbara, Colorado, New
York University, and Pittsburgh. In the biological sciences
(Table 7), challengers -- UC Irvine, Iowa, and Colorado --
break into the quantitative top 20. In engineering (Table
8), UC Santa Barbara, tied for 16th in the reputational category, soars
into third place in the citation density ranking. Syracuse, SUNY-Buffalo, and
Rochester are ranked in the second ten. On the other hand, Purdue, Carnegie
Mellon, Georgia Tech, and Penn State, ranked in the reputational top 20, do
not appear in the per capita citation top 20.
In the social and behavioral sciences (Table
9), established leaders dominate the first 10 places in the citation density
column, and challengers, led by 9th-ranked SUNY-Stony Brook, dominate the second
10 ranks. It is striking how many universities highly ranked by reputation --
UC Berkeley, Princeton, Minnesota, Cornell, North Carolina-Chapel Hill, and
Illinois-Urbana -- are not included in the top 20 citation density ranking.
In the arts and humanities (Table 10), where
high rankings are dominated by private institutions, prestigious universities
continue to lead the top ranks on both reputational and quantitative measures.
Challengers -- UC Davis, Rice, and UC Irvine -- follow in the second 10 of the
awards density category.
Institutional Grand Ranking
The final comparison (Table 11) shows
the top 50 institutions ranked according to the mean score of reputation in
the left column and by citation and awards density in the right column.
(17) Not surprisingly, at this level of competition, it is more difficult
for challengers to break into the top ranks in either category. The established
leaders who dominate the reputational rankings tend to be strong across the
academic spectrum. This is especially true for the well-endowed private universities,
which claim eight of the top 10 positions in both rankings. (The other two universities
are the UC campuses at Berkeley and San Diego.) Challenging institutions, in
contrast, typically have concentrated their resources on their strongest programs,
building what Stanford provost Frederick E. Terman called "steeples of excellence."
For such rising institutions (which included Stanford in the 1950s), this strategy
seems well designed for breaking into the top ranks -- Terman emphasized that
"the steeples be high for all to see." (18)
Thus, the 18 highest ranked universities in both the reputational and quantitative
rankings are all institutions rated in the top 20 in the previous major reputational
surveys. Judged according to the more objective citation or award density measure,
the challengers appear beginning with UC Santa Barbara, ranked 19th among all
universities, and 6th among public universities. Successful rising challengers
-- Colorado, Washington University, Rochester, UC Irvine, and SUNY-Stony Brook
-- rank 21st through 25th, respectively. A second block of rising universities is
led by Rice and Brandeis, ranked 31st and 32nd, respectively, in the quantitative
per capita density categories.
How Should the Next NRC Study Rate Research-Doctorate Programs?
As the NRC gears up for the next study, measuring the research quality of
program faculty is only one item on the Council's planning agenda. Economist
Charlotte Kuh, the project director, has been meeting with various administrative
and faculty groups to hear their opinions, and to let them know that the next
study must prove more useful than its predecessors to nonacademic constituencies,
chiefly government policymakers, private foundations, and, perhaps most challenging
to reach, business leaders. The new study thus will include more interpretation
of data and trends that address the interests and needs of both academic and
nonacademic constituencies.
Raising funds to support the next study will be difficult, partly because
its predecessors are seen as chiefly of interest to academics concerned about
the pecking order of institutional and program prestige. In addition, vexing
problems of program taxonomy, alluded to in our earlier discussion of the growing
mismatch between traditional departmental structures and the increasing interdisciplinary
fluidity of the research enterprise, confront the study's planners. The graduate
student constituency also needs more useful measures of program effectiveness.
There is little evidence, for example, that beyond the problematical reputational
rankings, prospective graduate students have found the findings from previous
national surveys useful.
Nonetheless, the heart of the next NRC study should remain a comparative assessment
of the quality of research in the nation's research-doctorate programs. Leading
the world in knowledge production, American universities are crucial to economic
growth and competitive success in the global market of the 21st century. We
hear this rhetoric all around us and read it in boilerplate promotional literature
from campus and corporation alike; yet, it is profoundly true. It is therefore
important that the first national study of the 21st century succeed where the
previous national studies fell short -- by producing a report that documents
not only the sustained and merited reputation of traditional elites, but also
the new research leadership of rising institutional challengers. Leaders in
government, business, and industry, comfortable in their long association with
elite programs and campuses, could learn from a new study that the pool of institutional
talent is much deeper than it has appeared.
What research design should guide the next NRC assessment of research-doctorate
programs? At a June 1999 NRC project planning conference in Washington, D.C.,
a group of faculty and administrators drawn from the nation's campuses reached
a consensus that the new study's design should be guided by the results of a
pilot project. This pilot study would examine intensively a sample of representative
programs and institutions, measuring multiple indicators of performance (publications,
citations, patents, research funding, awards and fellowships, etc.) to establish
benchmarks of research quality and program effectiveness. Then, against these
standards a variety of indicators designed to measure the full program universe
would be tested to compare their ability to predict the standard. The methods
tested would include measures used in previous studies (reputational survey
questions, article publication, citations, humanities awards, etc.) and potential
new indicators (publications and citations in leading journals, patents, book
publication, measures of graduate training effectiveness such as job placement).
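The comparison step in such a pilot can be pictured as a rank correlation between each candidate indicator and the benchmark. The sketch below is illustrative only. The program scores are invented, the function names are hypothetical, and Spearman's rank correlation is an assumed choice of statistic, not one the planning conference is reported to have specified.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(indicator, benchmark):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(indicator), average_ranks(benchmark))


# Hypothetical pilot sample of five programs (all numbers invented).
benchmark = [9.1, 7.4, 6.8, 5.0, 3.2]              # intensive pilot assessment
reputation = [4.8, 4.1, 4.5, 3.0, 2.9]             # survey rating
citations_per_faculty = [30.2, 22.5, 19.8, 11.0, 6.4]

for name, scores in [("reputation", reputation),
                     ("citations per faculty", citations_per_faculty)]:
    print(name, round(spearman(scores, benchmark), 2))
```

An indicator whose ranking tracks the benchmark closely yields a coefficient near 1. In a real pilot the sample would be far larger, and the choice of statistic would itself be open to debate.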
Should the next NRC study include a reputation survey of faculty? In our Chronicle
of Higher Education (June 1999) opinion essay, we answered no -- the reputation
assessment should be given an honorable burial in the century that gave it
birth, that benefitted from its maturity, and that witnessed its subsequent
decay under the relentless pressures of the knowledge revolution.
(19) But a decision should depend substantially on the results of
the NRC's pilot study. It is striking that throughout the last century the reputation
survey device, though employed as the lead rating measure in all the major national
studies, has never been systematically validated as a measure of research quality.
The NRC decision on whether or not to use the reputation survey, moreover,
may be determined in part by other factors. These surveys are defended as a
social science method providing uniquely holistic, peer review assessments that
reflect the strengths of large programs, and that also provide continuity over
time for longitudinal studies. At the same time, there are political factors
to be considered, legitimate concerns when dealing with a quasi-official body
performing such a high-stakes service. Universities that traditionally have
dominated reputation surveys constitute a powerful lobby for their continued
use. The political need of organizations to avoid the controversy inherent in
ranking their own members or constituent groups is also a relevant concern.
Assessments of reputation enable organizations, such as the NRC and the ACE,
to claim that the quality ratings were determined by expert members of the constituency,
not by the sponsoring research organization.
Whether or not a reputation survey is included in the next study, the NRC
must above all avoid repeating the major mistake of the 1995 project -- the
listing of all programs as ranked by reputational survey score. This identified
the Council, the research arm of the National Academy of Sciences, as an official
arbiter of rank in the great academic ratings game. Even were the reputational
evaluation not so vulnerable to challenge, the decision to rank programs exclusively
by subjective data, rather than to list both subjective and objective program
data alphabetically by program (a presentation used in the NRC 1982 study),
stamped a particular pecking order with the NRC's powerful imprimatur.
However cloudy the future of subjective rankings may remain, the promise of
effective objective measures of research quantity and quality appears bright.
The ISI reports significant advances since the early 1990s in the comprehensiveness
and reliability of its data files, and in its ability to match authors, publications,
citations, and programs on a large scale. (20)
The NRC pilot project will provide an opportunity to develop and test new indicators
of scholarly research performance, including measures of publication and citation
in leading journals. (21) An indicator documenting
book publication would be appropriate for assessing humanities faculty, who
have limited engagement in journal publications. (22)
The awards indicator for humanities programs, high in promise as a qualitative
measure, but weakened by low award totals in the 1995 study, can be greatly
strengthened by including competitive awards and prizes conferred by academic
associations. Moreover, in assessing graduate program effectiveness, the pilot
program may develop and test such revealing measures as time to degree, job
placement, and postdoctoral fellowship awards. (23)
A study of research-doctorate programs appropriate for the 21st century may
be published on the web in a format convenient to consumers. Users might download
the data and calculate their own rankings, possibly by using software accompanying
the report that allows users to construct composite scoring schemes, similar
to those used by the U.S. News
rankings, that assign varying weights to selected measures of program performance.
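A composite scoring scheme of this kind could be as simple as a weighted average of standardized measures. The following is a minimal sketch under stated assumptions: the program names, the three measures, their values, and the weights are all hypothetical, and the measures are assumed to be already standardized so that they can be combined.

```python
def composite_ranking(programs, weights):
    """Rank programs by a weighted average of their standardized measures."""
    total_weight = sum(weights.values())
    scores = {
        name: sum(weights[m] * measures[m] for m in weights) / total_weight
        for name, measures in programs.items()
    }
    # Highest composite score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical programs; each measure is assumed to be a Z-score.
programs = {
    "Program A": {"reputation": 1.2, "citations": 0.1, "awards": -0.3},
    "Program B": {"reputation": 0.2, "citations": 1.4, "awards": 0.9},
    "Program C": {"reputation": -0.5, "citations": 0.3, "awards": 0.2},
}

# Two users choosing different weights obtain different orderings.
reputation_heavy = composite_ranking(
    programs, {"reputation": 3, "citations": 1, "awards": 1})
output_heavy = composite_ranking(
    programs, {"reputation": 1, "citations": 3, "awards": 3})

print([name for name, _ in reputation_heavy])
print([name for name, _ in output_heavy])
```

With these invented numbers, the program that leads under reputation-heavy weights falls to last place once research output is weighted more heavily. The point is that the ordering follows from the user's stated priorities, not from a single ranking fixed by the publisher.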
The planning for the next NRC national assessment is facing intense scrutiny
because the stakes are unusually high. As evidence accumulates that the tradition
of focusing primarily on prestige ratings has masked a successful surge by challenging
programs and institutions, the risk remains that the next study could repeat
the old pattern. Much of the blame rests with the academic audience itself,
which has rushed to embrace or criticize the prestige ratings, even as the sponsoring
organizations, the NRC and the ACE, have tried to emphasize the variety of program
measures and to resist aggregating program data into grand institutional rankings.
Clark Kerr, taking a longer view, observed (Change, 1991) that the
timing of reputational change in American higher education history has coincided
with periods of great transformation. The first occurred, Kerr claimed, after
the Civil War when the great private and state research universities were built,
and the second occurred with the expansion of research activity inspired by
federal funding in the wake of Sputnik. In Kerr's view, the period from 1990
to 2010, during which "[a]t least three fourths of the faculties will turn over,
and there will be some net additions . . . as enrollments rise," may be another
period of significant change in the leadership configuration of America's research
universities. (24) If Kerr is correct, the NRC
has a unique opportunity to describe the new alignment and set the standard
of evaluation.
Revised June, 2000
NOTES
1. National Research Council, Research-Doctorate Programs in the United States: Continuity and Change (Washington, D.C.: National Academy Press, 1995).
2. Hugh Davis Graham and Nancy Diamond, The Rise of American Research Universities: Elites and Challengers in the Postwar Era (Baltimore: Johns Hopkins University Press, 1997).
3. The four major national post-World War II reputational studies are: Alan M. Cartter, An Assessment of Quality in Graduate Education (Washington, D.C.: American Council on Education, 1966); Kenneth D. Roose and Charles Andersen, A Rating of Graduate Programs (Washington, D.C.: American Council on Education, 1970); Lyle V. Jones et al., An Assessment of Research-Doctorate Programs in the United States, 5 vols. (Washington, D.C.: National Academy Press, 1982); and the National Research Council 1995 study cited above.
4. Raymond M. Hughes, A Study of the Graduate Schools of America (Oxford, Ohio: Miami University Press, 1925); Hughes, "Report of the Committee on Graduate Instruction," Educational Record 15: 192-234. Hayward Keniston, Graduate Study and Research in the Arts and Sciences at the University of Pennsylvania (Philadelphia, Pa.: University of Pennsylvania Press, 1959).
5. In the NRC-sponsored studies of 1982 and 1995, which expanded the use of quantitative measures, reputational ratings showed a strong positive correlation with the more objective research indicators. Such high correlations are an expected result when comparing large numbers of research doctorate programs.
6. James Fairweather, "Reputational Quality of Academic Programs: The Institutional Halo Effect," Review of Higher Education 28,4 (1988): 345-56; Robert K. Toutkoushian, Halil Dundar, and William E. Becker, "The National Research Council Graduate Program Ratings: What Are They Measuring?" Review of Higher Education 21,4 (1998): 315-42.
7. For a review of reputational surveys, see David S. Webster, "Reputational Rankings of Colleges, Universities, and Individual Disciplines and Fields of Study from their Beginnings to the Present," Higher Education: A Handbook of Theory and Research, vol. 8, ed. by John C. Smart, 234-304 (New York: Agathon Press, 1992). See also David L. Tan, "The Assessment of Quality in Higher Education: A Critical Review of the Literature and Research," Research in Higher Education 24,3 (1986): 223-65, and Clifton F. Conrad and Robert T. Blackburn, "Program Quality in Higher Education: A Review and Critique of the Literature and Research," John C. Smart, ed., Higher Education: Handbook of Theory and Research, vol. 1 (New York: Agathon Press, 1986).
8. The NRC in 1993 sent questionnaires to 16,700 of the 65,470 faculty in the 274 institutions in the study; roughly half (7,900) returned usable questionnaires. The survey's most important rating indicator was "93Q," where respondents rated the scholarly quality of the program faculty on a scale of 0 to 5, with 0 denoting "Not sufficient for doctoral education" and 5 denoting "Distinguished."
9. These studies began to appear in the late 1960s. See for example, Lionel S. Lewis, "On Subjective and Objective Rankings of Sociology Departments," American Sociologist 3 (1968): 129-31; W. Miles Cox and Viola Catt, "Productivity Ratings of Graduate Programs in Psychology Based on Publication in the Journals of the American Psychological Association," American Psychologist (October 1973): 793-809. For a more recent view, see James C. Garand and Kristy L. Graddy, "Ranking Political Science Departments: Do Publications Matter?" PS 32,1 (March 1999): 113-16.
10. See for example Richard C. Anderson, Francis Narin, and Paul McAlister, "Publication Rating versus Peer Rating of Universities," Journal of the American Society for Information Science (March 1978): 91-103.
11. The NRC Report, Appendix P, contains reputational and citation data for all programs, and per capita density measures for citations. The number of awards won by arts and humanities program faculty was also provided in Appendix J. We calculated per capita award density measures for these fields (Tables 5 and 10) by dividing the total number of awards for each university by the number of department or program faculty. The citation and award density scores were then converted into Z-scores to standardize the quantitative rankings. In Tables 6 through 11, the reputational scores are drawn from David S. Webster and Tad Skinner, "Rating PhD Programs: What the NRC Report Says ...and Doesn't Say," Change (May/June 1996): 24-44.
12. For a discussion of per capita measures, see Graham and Diamond, Rise of American Research Universities, 55-63.
13. See for example Jonathan R. Cole and Stephen Cole, Social Stratification in Science (Chicago: University of Chicago Press, 1973).
14. Research-Doctorate Programs in the United States (1995), Appendix G, 143-46; Brendan A. Maher, "The NRC's Report on Research-Doctorate Programs: Its Uses and Misuses," Change (November/December 1996): 54-59.
15. Rockefeller University and UC San Francisco, ranked second and third, respectively, by reputation, were not included in the citation density ranking for this study. Both institutions had fewer than 11 programs rated in the 1995 NRC study, the minimum number selected for inclusion in our comparison.
16. See Mancur Olson, The Logic of Collective Action (Cambridge: Harvard University Press, 1971). The NRC Report recorded a citation density of 32.3 for the economics program faculty at Maryland, ranked 20th by reputation for scholarly quality, compared with an average citation density of 15.9 for the top 10 economics programs ranked by reputation. The Report provided a Gini coefficient of 74.7 for the Maryland department, indicating an unusually high concentration of citations on a small number of the program faculty. (The mean Gini coefficient for the top 10 economics programs ranked by reputation is 10.0.) Olson accounted for roughly one fifth of the citations attributed to Maryland's 47 economics faculty members. For a discussion of the Gini coefficient, see Ronald G. Ehrenberg and Peter J. Hurst, "The 1995 NRC Rankings of Doctoral Programs: A Hedonic Model," Change (May/June 1996): 46-50.
17. The reputational rankings are from Webster and Skinner, who ranked 104 institutions with 15 or more programs included in the 1995 NRC study. In calculating the z-scores for citation and award density, we included 110 institutions. To Webster and Skinner's 104 campuses, we added six institutions with fewer than 15 NRC-rated programs: Alabama-Birmingham (13 programs), Brandeis (14), Dartmouth (11), Delaware (13), Georgetown (14), and Tufts (11).
18. Stanford provost Frederick E. Terman, quoted in Roger L. Geiger, Research and Relevant Knowledge: American Research Universities Since World War II (New York: Oxford University Press, 1993), 125.
19. Hugh Davis Graham and Nancy Diamond, "Academic Departments and the Ratings Game," Chronicle of Higher Education, 18 June 1999.
20. Henry Small, "Relational Bibliometrics," Proc. Fifth Biennial Conference, International Society for Scientometrics and Informetrics, M.E.D. Koenig and A. Bookstein, eds. (Medford, N.J.: Learned Information, 1995), 525-32.
21. In The Rise of American Research Universities, top-journal analysis of publications provided high quality indicators of research achievement in science, engineering, and the social and behavioral sciences. Analysis of citations in such leading journals offers even greater promise as an indicator of research quality. The leading journals in a program field may be identified using objective criteria, such as the ISI's list of "Journals Ranked by Times Cited." However, the problem of using top-journal analysis, from the perspective of a quasi-official sponsoring organization such as the NRC, may be less technical than political. Identifying top journals in an NRC study may provoke resentment by academic organizations, subscribers, and researchers associated with excluded journals, who object to NRC selection as a form of endorsement that provides researchers, especially untenured scientists and scholars, with prestige guideposts on where and where not to seek publication.
22. The inability to link book authors to their academic programs and institutions on other than a hand-count basis has meant that books (other than anthologies, which are included in ISI data) have been excluded from all the major studies. The exclusion of book publication from studies of academic research achievement has been a glaring weakness of the large-scale studies, including, to our regret, The Rise of American Research Universities. There is reason to believe, however, that technical methods are now available to link book authors to their institutions. The pilot study should provide the NRC, and perhaps the ISI, with an opportunity to develop and test such a measure, especially in arts and humanities programs where book publication is the norm of scholarly output.
23. See Maresi Nerad and Joseph Cerny, "From Rumors to Facts: Career Outcomes of English Ph.D.s; Results from the Ph.D.'s-Ten Years Later Study," CGS Communicator 32,7 (Special Issue Fall 1999): 1-12.
24. Clark Kerr, "The New Race to be Harvard or Berkeley or Stanford," Change (May/June 1991): 1.
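The Gini coefficient cited in note 16 summarizes how unevenly citations are spread across a program's faculty. The sketch below uses the standard textbook formula, scaled from 0 to 100. That scaling, the function name, and the citation counts are assumptions made for illustration, since the Report's exact computation is not reproduced here.

```python
def gini(values):
    """Gini coefficient of non-negative values, scaled 0 (even) to 100."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on sorted data: G = 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 100 * (2 * weighted / (n * total) - (n + 1) / n)


# Hypothetical citation counts for two four-member departments.
evenly_cited = [10, 10, 10, 10]   # every member cited equally
one_star = [0, 0, 0, 40]          # a single member holds every citation

print(gini(evenly_cited))
print(gini(one_star))
```

The two hypothetical departments have the same citation total and therefore the same citation density. Only the Gini coefficient distinguishes a department carried by one heavily cited scholar from one whose citations are evenly shared.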
Table 1.
Top 25 Research-Doctorate Programs in Astrophysics and Astronomy
Ranked by Mean Score of Reputation Rating and Citation Density
| Reputation | | | Citations/Faculty | | |
| Rank | Campus | 93Q Score | Rank | Campus | Z-Score |
| 1 | Caltech | 4.91 | 1 | Caltech | 3.486 |
| 2 | Princeton | 4.79 | 2 | UC-Berkeley | 1.707 |
| 3 | UC-Berkeley | 4.65 | 3 | UMass-Amherst | 1.622 |
| 4 | Harvard | 4.49 | 4 | UC-Santa Cruz | 1.134 |
| 5 | Chicago | 4.36 | 5 | Harvard | 0.724 |
| 6 | UC-Santa Cruz | 4.31 | 6 | Princeton | 0.716 |
| 7 | Arizona | 4.10 | 7 | MIT | 0.410 |
| 8 | MIT | 4.00 | 8 | SUNY-Stony Brook | 0.338 |
| 9 | Cornell | 3.98 | 9 | Colorado | 0.292 |
| 10 | Texas-Austin | 3.65 | 9 | Yale | 0.292 |
| 11 | Hawaii-Manoa | 3.60 | 11 | Minnesota | 0.221 |
| 12 | Colorado | 3.54 | 12 | Chicago | 0.039 |
| 13 | Illinois-Urbana | 3.53 | 12 | Cornell | 0.039 |
| 14 | Wisconsin-Madison | 3.46 | 14 | UCLA | 0.007 |
| 15 | Yale | 3.31 | 15 | Maryland-College Park | -0.152 |
| 16 | UCLA | 3.27 | 16 | Arizona | -0.214 |
| 17 | Virginia | 3.23 | 17 | Texas-Austin | -0.224 |
| 18 | Columbia | 3.20 | 18 | Stanford | -0.299 |
| 19 | Maryland-College Park | 3.07 | 19 | Columbia | -0.307 |
| 20 | UMass-Amherst | 3.04 | 20 | Wisconsin-Madison | -0.469 |
| 21 | Penn State | 3.00 | 21 | Illinois-Urbana | -0.501 |
| 22 | Stanford | 2.96 | 22 | Indiana | -0.595 |
| 23 | Ohio State | 2.91 | 22 | Ohio State | -0.595 |
| 24 | Minnesota | 2.59 | 24 | Hawaii-Manoa | -0.629 |
| 25 | Michigan | 2.65 | 25 | Michigan | -0.696 |
Source: National Research Council, Report, 1995, Appendix table L-1.
Note: The 93Q score refers to the NRC reputation rating of scholarly quality of program faculty on a scale of 0 to 5, with 0 denoting "not sufficient for doctoral education," and 5 denoting "distinguished."
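The Z-Score column standardizes a per capita density measure, as note 11 describes. The sketch below shows that arithmetic under stated assumptions: the campus names and counts are invented, the function names are hypothetical, and the use of the population standard deviation is an assumption, since the note does not say which form was used.

```python
from statistics import mean, pstdev


def density(total, faculty_count):
    """Per capita density: total citations (or awards) per faculty member."""
    return total / faculty_count


def z_scores(densities):
    """Convert a {campus: density} mapping to Z-scores.

    Assumes the population standard deviation; the source does not specify.
    """
    values = list(densities.values())
    mu = mean(values)
    sigma = pstdev(values)
    return {campus: (d - mu) / sigma for campus, d in densities.items()}


# Hypothetical campuses: (total citations, number of program faculty).
raw = {
    "Campus A": (400, 20),
    "Campus B": (300, 30),
    "Campus C": (120, 40),
}

densities = {campus: density(total, n) for campus, (total, n) in raw.items()}
standardized = z_scores(densities)

for campus in sorted(standardized, key=standardized.get, reverse=True):
    print(campus, round(densities[campus], 1), round(standardized[campus], 3))
```

Campus A is the smallest of the three hypothetical programs, yet it ranks first because its citations are divided among fewer faculty. This is the sense in which per capita measures can favor smaller, rising programs over larger ones.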
| Reputation | | | Citations/Faculty | | | |
| Rank | Campus | 93Q Score | Rank | Campus | Z-Score | |
| 1 | MIT | 4.86 | 1 | MIT | 4.649 | |
| 2 | Rockefeller U | 4.77 | 2 | Stanford Medical | 3.547 | |
| 3 | UC-San Francisco | 4.76 | 3 | UC-San Diego | 3.219 | |
| 4 | Caltech | 4.73 | 4 | Colorado Medical | 2.661 | |
| 5 | Harvard | 4.70 | 5 | Harvard | 2.176 | |
| 6 | Stanford Medical | 4.55 | 6 | Caltech | 2.029 | |
| 7 | UC-San Diego | 4.50 | 7 | Yale | 1.969 | |
| 8 | U of Washington | 4.49 | 8 | Duke | 1.420 | |
| 9 | Washington U | 4.48 | 9 | Princeton | 1.680 | |
| 10 | Yale | 4.37 | 10 | U of Washington | 1.515 | |
| 11 | Princeton | 4.36 | 11 | Washington U | 1.436 | |
| 11 | Stanford (A&S) | 4.36 | 12 | Case-Western Reserve | 1.088 | |
| 13 | UC-Berkeley | 4.15 | 13 | UCLA | 1.045 | |
| 14 | Duke | 4.11 | 14 | UNC-Chapel Hill | 1.032 | |
| 15 | Chicago | 4.10 | 15 | Columbia | 0.880 | |
| 16 | Wisconsin-Madison | 4.05 | 16 | Penn | 0.848 | |
| 17 | UCLA | 3.99 | 17 | Chicago | 0.751 | |
| 18 | Texas-SW Medical | 3.98 | 18 | Vanderbilt | 0.759 | |
| 19 | Columbia | 3.94 | 19 | Johns Hopkins | 0.532 | |
| 20 | Johns Hopkins | 3.91 | 20 | New York U | 0.517 | |
| 21 | New York U | 3.88 | 21 | UC-Berkeley | 0.456 | |
| 22 | Colorado Medical | 3.85 | 22 | Brandeis | 0.386 | |
| 23 | Pennsylvania | 3.81 | 23 | Minnesota Medical | 0.380 | |
| 24 | Baylor Medical | 3.80 | 24 | Cincinnati | 0.374 | |
| 25 | UNC-Chapel Hill | 3.79 | 25 | Illinois-Chicago | 0.355 | |
Source: National Research Council, Report, 1995, Appendix table P-7.
| Reputation | Citations/Faculty | |||||
| Rank | Campus | 93Q Score | ||||