Public Policy Initiative/Quantitative Metric and Article Quality

Background

In 2010, the Wikimedia Foundation started the Public Policy Initiative. The purpose of the Initiative was to improve Wikipedia content in articles related to public policy. Additionally, the project sought to increase the use of Wikipedia as a teaching tool in university classrooms. The Initiative coordinated with university teaching faculty who were interested in using Wikipedia as a platform for a literature review assignment. Initiative staff then recruited and trained Campus Ambassadors to support the professor on campus; the Campus Ambassadors volunteered as de facto teaching assistants for the Wikipedia-related class elements. The Initiative team also coordinated Online Ambassadors; the criteria for these applicants were substantial experience in Wikipedia and a reputation for helping newcomers. The Online Ambassadors provided assistance to students 24/7. Each student selected an Online Ambassador to act as a mentor; the mentor provided feedback on the student's work and helped the student learn to navigate Wikipedia. The Wikimedia Foundation also provided support materials and sample syllabi to the professors.

In the 2010 fall term, the Public Policy Initiative staff collaborated with faculty from top universities. Professors from George Mason, George Washington, Georgetown, Harvard, Indiana-Bloomington, James Madison, Lehigh, Siena College, Syracuse, and UC Berkeley participated in the Initiative pilot. Participation in the fall pilot included 14 courses/classes, 13 professors, XX students, XX Campus Ambassadors, and XX Online Ambassadors. The students worked on XX articles through the project, and XX of those articles were related to public policy. Class size ranged from X to XX students.

The Initiative research was tasked with measuring content improvement in Wikipedia through the project. The article quality metrics existing within Wikipedia are qualitative, subjective, and complex in execution. These are not criticisms but statements of fact, and as this experiment shows, these methods are remarkably consistent; they are just not easy for people new to Wikipedia to understand and use. To measure article quality improvement over the course of the project, the Initiative team needed to develop a quantitative metric that would be acceptable to the Wikipedia community and easy for newcomers and subject matter experts to use. This experiment tested the consistency of the existing 1.0 ratings, the quantitative metric, article quality ratings among Wikipedians, and article quality ratings among subject matter experts, and it compared Wikipedian article quality scores to subject matter expert scores.

Research Methods

To determine whether article quality improved through the project, and to measure the impact of student work in future article quality measurements, we must show that the tool we use to measure article quality is consistent among assessors and sensitive to article improvement. This experiment tests the consistency of the existing 1.0 ratings, the quantitative metric, article quality ratings among Wikipedians, and article quality ratings among subject matter experts, and it compares Wikipedian article quality scores to subject matter expert scores.

Quantitative Metric

At the start of the project, Initiative staff worked with the Wikipedia community to create a quantitative metric to assess the quality of articles improved through the project. There was also hope that, if the metric proved to be consistent with Wikipedia ratings and to be an effective tool, it would have a lasting purpose within Wikipedia.

The metric places a numeric value on six different aspects of article quality: comprehensiveness, sourcing, neutrality, readability, formatting, and illustrations. The difference in the maximum numeric values creates an inherent weighting system so that the most important aspects have the biggest impact on the final score. The quantitative scores translate into Wikipedia 1.0 ratings (A, B, C, Start, and Stub class). The metric does not substitute for the Wikipedia Good Article review process; it simply generates a rating which indicates that the article would probably meet an “A” standard if the article were to go through the review process. Also, thresholds are built into the metric so that even if an article scores fairly well in some aspects, it must meet minimum criteria in certain aspects to attain higher ratings.

This rubric is based on Wikipedia's policies and expectations for high-quality articles. It has detailed breakdowns of scores for different aspects of article quality, but it can also translate into the standard Stub/Start/C/B scale and thus feed into the 1.0 assessment system without too much duplicated effort. The language for what is expected of high-quality articles is mostly adapted from the featured article criteria.

Assessment area | Scoring method | Score
Comprehensiveness | Score based on how fully the article covers significant aspects of the topic. | 1-10
Sourcing | Score based on adequacy of inline citations and quality of sources relative to what is available. | 0-6
Neutrality | Score based on adherence to the Neutral Point of View policy. Scores decline rapidly with any problems with neutrality. | 0-3
Readability | Score based on how readable and well-written the article is. | 0-3
Formatting | Score based on quality of the article's layout and basic adherence to the Wikipedia Manual of Style. | 0-2
Illustrations | Score based on how adequately the article is illustrated, within the constraints of acceptable copyright status. | 0-2
Total | | 1-26
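
As an illustration of how the six area scores combine into the total, the following is a minimal Python sketch. The score ranges are taken from the table above; the names `SCORE_RANGES` and `total_score` are hypothetical and are not part of the Initiative's tooling or of any Wikipedia template.

```python
# Inclusive score ranges per assessment area, as listed in the rubric table.
# The dictionary and function names are illustrative assumptions.
SCORE_RANGES = {
    "comprehensiveness": (1, 10),
    "sourcing": (0, 6),
    "neutrality": (0, 3),
    "readability": (0, 3),
    "formatting": (0, 2),
    "illustrations": (0, 2),
}


def total_score(scores):
    """Check each area score against its range and return the sum."""
    missing = set(SCORE_RANGES) - set(scores)
    if missing:
        raise ValueError(f"missing areas: {sorted(missing)}")
    for area, (low, high) in SCORE_RANGES.items():
        value = scores[area]
        if not low <= value <= high:
            raise ValueError(
                f"{area} must be between {low} and {high}, got {value}"
            )
    return sum(scores[area] for area in SCORE_RANGES)
```

Because comprehensiveness and sourcing have the widest ranges, they contribute up to 16 of the 26 available points, which is the inherent weighting the rubric relies on.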


Numerical scores can be translated into the different classes on the 1.0 assessment scale. For the lower classes, comprehensiveness and sourcing are the main things that differentiate articles of different classes; things like neutrality, style, layout, and illustrations quickly become important as well for the higher tiers of the assessment scale. GA-class and higher require separate reviews, but high numerical scores can indicate whether an article is a likely candidate for one of these ratings. For everything except GA and FA, the ratings are automatically determined by the banner template if detailed scores are present.

  • Stub - An article with a 1 or 2 in comprehensiveness is Stub-class.
  • Start - An article with a 3 or higher in comprehensiveness that does not qualify for a higher rating is Start-class.
  • C - An article must have at least a score of 4 in comprehensiveness and 2 in sourcing to qualify as C-class.
  • B - An article must have at least a score of 7 in comprehensiveness, 4 in sourcing, 2 in readability, and 2 in neutrality to qualify as B-class.
  • GA - An article with at least 8 in comprehensiveness, 5 in sourcing, 3 in neutrality, 2 in readability, 2 in formatting, and 1 in illustrations may be a good candidate to be nominated for Good Article status. (B is the highest rating automatically assigned by a numerical assessment.)
  • A - An article with a 10 in comprehensiveness, 6 in sourcing, 2 in readability, 3 in neutrality, 2 in formatting, and 2 in illustrations may be a good candidate for an A-class review.
  • FA - An article with full points in every category may be a good Featured Article Candidate; even then, additional work may be necessary to comply fully with the manual of style.
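
The thresholds in the list above can be sketched as a small Python example. This is an illustrative reading of the rules, not the banner template's actual logic: the function names (`auto_class`, `review_candidate`) and the dictionary layout are assumptions, and the A-class readability score of 2 is treated here as a minimum.

```python
# Minimum area scores for each class, transcribed from the list above.
# All names here are hypothetical, chosen only for this sketch.
C_MIN = {"comprehensiveness": 4, "sourcing": 2}
B_MIN = {"comprehensiveness": 7, "sourcing": 4, "readability": 2, "neutrality": 2}
GA_MIN = {"comprehensiveness": 8, "sourcing": 5, "neutrality": 3,
          "readability": 2, "formatting": 2, "illustrations": 1}
# Assumption: the "2 in readability" for A-class is read as "at least 2".
A_MIN = {"comprehensiveness": 10, "sourcing": 6, "neutrality": 3,
         "readability": 2, "formatting": 2, "illustrations": 2}
FA_MIN = {"comprehensiveness": 10, "sourcing": 6, "neutrality": 3,
          "readability": 3, "formatting": 2, "illustrations": 2}


def meets(scores, minimums):
    """Return True if every listed area reaches its minimum score."""
    return all(scores[area] >= low for area, low in minimums.items())


def auto_class(scores):
    """Class assigned automatically; B is the highest possible result."""
    if meets(scores, B_MIN):
        return "B"
    if meets(scores, C_MIN):
        return "C"
    if scores["comprehensiveness"] >= 3:
        return "Start"
    return "Stub"


def review_candidate(scores):
    """Highest separate review the scores suggest, or None."""
    for label, minimums in (("FA", FA_MIN), ("A", A_MIN), ("GA", GA_MIN)):
        if meets(scores, minimums):
            return label
    return None
```

Keeping the automatic class separate from the review suggestion mirrors the rule that GA-class and higher require their own reviews: an article with full points is still reported as B-class by `auto_class`, while `review_candidate` flags it as a possible Featured Article Candidate.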


Assessment Team Recruitment

After development of the metric, the research needed a team of Wikipedians and public policy experts to test it. Recruitment was more difficult than anticipated; although staff employed several broad-scale methods, only the effective methods are reported here. To recruit Wikipedian assessors, Initiative research staff searched article histories, user contributions, and talk pages for active editors on policy-related pages with reputations for constructive criticism. These users then received personal messages on their talk pages inviting them to assess with WikiProject: United States Public Policy. Fourteen Wikipedians participated in article assessment during the fall term pilot.

Interestingly, although staff had no success recruiting policy experts through direct individual requests, seven public policy experts were successfully recruited through an email to the Sacramento State Masters in Public Policy and Administration alumni email list. All policy expert assessors had a graduate degree in public policy. In combination with the Wikipedian assessors, these individuals made up the assessment team.

Experimental Design

Analysis and Results

Interpretation of Results

References