Experts vs. Online Consumers:
A Comparative Credibility Study of Health and Finance Web Sites (Full Report)
TABLE OF CONTENTS
Choosing Categories and Selecting Web Sites
Running the Study
Analyzing the Data
Part 1: Pair-Ranking Task Analysis
Part 2: 1-To-10 Ranking Task Analysis
Data Consistency Between Part 1 and Part 2
Part 3: Variables
Part 4: Optional Questions
Results and Discussion
Parts 1 and 2: Site Rankings Results and Discussion
How Health Experts Ranked Sites
Overall Health-Site Evaluation Trends
How Finance Experts Ranked Sites
Overall Finance-Site Evaluation Trends
Differences Between Finance and Health Experts
Differences Between Experts and Consumers
Expert vs. Consumer Health Rankings
Expert vs. Consumer Finance Rankings
Part 3: Variable Rankings Results and Discussion
Variable Scores of Health vs. Finance Experts
Variable Scores of Experts vs. Consumers
Part 4: Optional Questions Results and Discussion
Guidelines and Design Implications
Analysis of Existing Consumer Reports WebWatch Guidelines
Additional Field-Specific Guidelines
Appendix A: Sites With Web Addresses
Appendix B: List of Subjects
Appendix C: Optional Questions
Appendix D: Coding Categories
Appendix E: Tables of Site-Pairing Scores
Appendix F: Table of Expert vs. Consumer Variable Scores
LIST OF FIGURES
Figure 1: Description of comment categories
Figure 2: Health-site rankings
Figure 3: Percentage of health-expert comments, by category
Figure 4: Finance-site rankings
Figure 5: Percentage of finance-expert comments, by category
Figure 6: Expert vs. consumer health-site rankings
Figure 7: Expert vs. consumer health-site comments
Figure 8: Expert vs. consumer finance-site rankings
Figure 9: Expert vs. consumer finance-site comments
Figure 10: Expert variable scores
Figure 11: Finance-expert vs. consumer variable scores
Figure 12: Health-expert vs. consumer variable scores
Various researchers have conducted studies on consumers to understand the different elements of Web site credibility. However, some consumers may not be well equipped to make informed decisions about the accuracy of information in technical fields such as health or finance. In fact, consumers have had mixed results when judging the credibility of information in other media. A recent study compared teacher and scientist perspectives on the credibility of 31 science information sources, including print media such as Discover magazine, radio and television programs such as ABC News's Nightline, and environmental organizations such as the Sierra Club [Klemm]. This study found that teachers and scientists differed significantly in their perceptions of the credibility of different information sources. For example, while scientists in the study rated TV newsmagazines such as ABC News's 20/20 as having the least credibility, elementary school teachers gave the program the highest credibility ratings.
Some studies have shown that consumers rely significantly on visual cues when judging credibility. In a study on the influence of color and graphics on the credibility of Yellow Pages directory advertising [Lohse], subjects perceived colored ads as more credible than black-and-white ones. Similarly, a 1990 study of the perceived trustworthiness of television spokespeople found that viewers believed that baby-faced people and women delivered more trustworthy content than mature-faced people and men [Brownlow]. On the other hand, a 1999 study by the American Society of Newspaper Editors examined consumer and journalist assessments of newspaper credibility and found that consumers were even more critical credibility judges than the journalists [Urban]. Consumers in the newspaper study were particularly concerned about factual errors, spelling and grammar mistakes, journalistic bias, manipulation of the news by parent companies, and sensationalism when assessing the credibility of newspaper articles.
Given the varying results of studies of consumer credibility assessments in other media, we are left wondering whether consumers' credibility evaluations of Web sites are correct. These evaluations are increasingly important as people use the Web today to carry out a variety of vital tasks and research. Consumers are faced with important decisions about the information sources that they choose to believe for making important health or financial decisions. So, do these everyday people know which Web sites are really credible, especially in vital areas such as finance and health? What do industry experts say about the credibility of sites in their fields? And, finally, how do the experts' assessments compare to how the average person decides which sites to trust?
We started to answer these questions by conducting a comparative study of expert versus consumer evaluations of Web site credibility. Our goal for this study was to approach credibility from a new perspective. Instead of focusing on how consumers judge credibility, we endeavored to understand if consumers' judgments are correct and if not, why not? To reach this goal, we asked experts in the finance and health fields to evaluate sites in their domains of expertise, describe how they appraised each site, and then rank each site's credibility in relation to other sites. This expert study paralleled a consumer study conducted by the Stanford Persuasive Technology Lab (Stanford PTL) and Consumer Reports WebWatch that asked over 2,600 average people to rate the credibility of Web sites in 10 content areas [Fogg, 2002]. Our study then compared how experts and consumers evaluated the same health and finance sites to understand if and how consumers failed in their assessments. By comparing the expert and consumer evaluations, we hoped to identify any gaps in consumer education and begin to design guidelines for improving consumer understanding of online credibility. Furthermore, by studying experts in two diverse fields — health and finance — we aimed to learn about field-specific credibility in order to inform Web design guidelines and consumer education needs.
Every study of credibility must first lay the groundwork by defining the term credibility. In this paper we adhere to the definition of credibility outlined by Fogg and Tseng (1999), with the following discussion drawing largely from this work. In their view, credibility can be defined as believability. Credible information is believable information. There are two main ideas that help one to understand the construct of credibility. First, credibility is a perceived quality. It is not an objective property of a Web site like how many words it contains or how many links are on the page. Instead, when one discusses credibility, it is always from the perspective of the observer's perception. Second, people perceive credibility by evaluating multiple dimensions simultaneously. These dimensions can be categorized into two key components: trustworthiness and expertise. The trustworthiness component refers to the goodness or morality of the source and can be described with terms such as well-intentioned, truthful, or unbiased. The expertise component refers to perceived knowledge of the source and can be described with terms such as knowledgeable, reputable, or competent. People combine assessments of both trustworthiness and expertise to arrive at a final credibility perception. Although this definition positions credibility as a subjective perception, we assumed for this study that subject-matter experts would be better judges of a site's credibility and content quality than a consumer with no particular expertise or knowledge specialty.
As a usability and interaction design agency, our organization is interested in understanding how to create ethical online experiences that communicate credibility to their users. Consumer Reports WebWatch, a nonprofit project of Consumers Union, publisher of Consumer Reports, commissioned this study. Consumer Reports WebWatch's goal is to investigate, inform, and improve the credibility of information published on the World Wide Web. It is supported by grants from The Pew Charitable Trusts, the John S. and James L. Knight Foundation, and the Open Society Institute.
We believe that this investigation will advance Consumer Reports WebWatch's and Sliced Bread Design's goals by setting credibility benchmarks for finance and health sites. We do not claim that the rankings and comments in this study are the definitive evaluations of the credibility of the sites involved. In fact, it is essential to understand that this study is not an investigation of the most credible health and finance sites on the Web. Rather, it is a study of the credibility of the particular sites that we have chosen for the comparative purposes of this research.
Unfortunately, this study confirms some of our suspicions about consumers' poor credibility assessments. As you read on, you will find that experts carefully evaluated content while consumers relied on visual appeal for much of their credibility appraisal. However, the good news is that this study identifies specific opportunities for consumer education. By realistically identifying the sources of the problem and the directions for resolution, we can begin to inform consumers and enhance the usefulness of the Web, which is arguably the biggest, most accessible information source on Earth. After all, as the cliché says, knowledge is power.
The format of this study of experts' assessment of Web site credibility was based on methods for Web credibility research developed by the Stanford Persuasive Technology Lab (Stanford PTL). By running pilot studies, Stanford PTL developed an online research method using paired comparisons of Web sites. Each subject was asked to review two Web sites, select one as more credible, and comment on his or her choice. When run with a large number of subjects, the paired method resulted in a relative ranking of the credibility of a group of Web sites and a rich base of comments. This paired comparison method was used in a consumer study by the Stanford PTL that ran at the same time as this expert study, in the summer of 2002 [Fogg, 2002]. The Stanford PTL 2002 consumer study asked average people to rate the credibility of a single pair of Web sites in the same category (i.e., health, finance, news, sports, etc.). Each Web site in the pair was randomly selected from a pool of 10 content categories with 10 sites in each category. The study collected over 2,600 consumer rankings of paired sites from these 10 categories. The health and finance categories received 228 and 408 consumer rankings, respectively.
Our expert study examined a small number of experts' assessments of sites in their fields. In contrast to the consumer study, which asked an individual to rank one pair of sites from the 10 sites in a category, our expert study asked each expert to rank five pairs of sites in one session. This allowed each expert to review all 10 of the sites in his or her category of health or finance. Asking the experts to assess pair rankings allowed some comparison to the consumer study, but the small number of pairings was not sufficient to produce a reasonable overall ranking of sites. Both the Stanford PTL consumer pilot tests and our own expert pilot tests confirmed that having subjects start with a 1-to-10 ranking of all sites was not effective because subjects had trouble assessing more than two sites simultaneously on initial review. However, if we introduced the sites in randomized pairs first, we could successfully present the 1-to-10 ranking task next. Thus, to allow some direct comparison to the consumer study and produce an overall expert ranking of sites in our study, we started experts with the paired comparison method, and followed with a secondary overall ranking task in which the experts ordered all the sites from most to least credible.
We also added a section to the expert study that was based on a separate Stanford PTL credibility study, conducted in 2001, which examined the elements of Web sites that contributed to perceptions of credibility [Fogg, 2001]. The Stanford PTL 2001 study asked 1,400 consumers to evaluate elements of Web sites that helped or hurt credibility, such as "The site lists author's credentials for each article." For each variable, non-expert subjects indicated how that variable affected the credibility of Web sites in general by selecting a response along a 7-point Likert-type scale from -3 (much less believable) to +3 (much more believable). Our expert study presented just 30 of the previous 55 variables, because our expert pilot studies revealed that inclusion of all 55 variables would negatively impact the overall length and feel of the study session. As a result, we eliminated less relevant items, such as "The site was recommended to you by a friend." We also eliminated one of each pair of similar variables, such as "The site is difficult to navigate" vs. "The site is arranged in a way that makes sense to you." The inclusion of this variable task allowed us to measure experts' assessment of each variable's effect on credibility and then compare the expert responses to those of the Stanford PTL's large group of consumers. The consumers in the Stanford PTL 2001 study were asked to evaluate each variable in relation to all Web sites in general, while experts in our study were asked to evaluate each variable only in relation to Web sites in their fields of expertise (finance or health). Also, note that the Stanford PTL 2001 study consumer group used for comparison in this variable evaluation task was a different group of people than the Stanford PTL 2002 study consumer group used for comparison in the Web site ranking task described in the previous paragraph.
In addition, we wanted to gather open-ended thoughts from the experts about credibility. To do so, we included an optional section in which we posed credibility-related questions for subjects to answer in an unrestricted text field. We created the questions by compiling a list of key credibility areas, such as visual design and site ownership, forming candidate questions, and reducing the number to five individual questions. We then refined the language for each question during pilot studies we conducted in our own pre-testing phase. The final five questions are listed in Appendix C.
CHOOSING CATEGORIES AND SELECTING WEB SITES
Working with Consumer Reports WebWatch and the Stanford PTL, we selected two content categories from the 10 in the consumer study: health and finance. The remaining consumer categories were: e-commerce, entertainment, news, nonprofit, opinion or review, search engines, sports, and travel. We felt that the health and finance categories were particularly relevant for a credibility study because they were heavily trafficked by consumers, allowed for identification of experts in the given fields, and presented an interesting difference between a primarily informational category (health) and a primarily transactional category (finance). In addition, we chose to study these categories because incorrect information on a health or finance site could be particularly harmful to a consumer if used for making an important health or financial decision.
For each category, Consumer Reports WebWatch and Sliced Bread Design worked in conjunction with the Stanford PTL to select 10 Web sites to include for the expert study. The sites chosen were the same health and finance sites assessed by consumers in the parallel Stanford PTL 2002 study. The sites were all consumer-oriented sites, and were selected to present a variety of sites across a range of variables, including:
- Bricks-and-mortar company vs. Internet only
- Range of visual and information design quality and style
- Brand name vs. No name
- Variety in amount and presentation of advertisements
- Presence of awards or seals of approval
In addition, we preferred general sites over content-specific sites; for example, we included health sites that presented a wide range of topics rather than those that focused on one specific health condition such as breast cancer. We excluded sites that required a subscription to access premium content. However, we included sites that required free registration, and provided "dummy" accounts for those sites for use by our panel of experts. All of the finance sites allowed transactions and all of the health sites provided health information, not just product sales. Last and most importantly, we chose sites that we expected to provide a range of credibility ratings. Thus, the study did not aim to select the 10 most credible sites in a category, but rather 10 sites that would produce a range of credibility rankings and comments. See Appendix A for the list of sites with Web addresses.
Note that the sale of one of the health sites, DrKoop.com, to a Florida company was announced on July 15, 2002, and the site has since changed significantly; it is no longer affiliated with former U.S. Surgeon General C. Everett Koop. This study was completed before the announced sale.
This expert study included eight health experts and seven finance experts, for a total of 15 subjects. We recruited approximately 15 experts in each category with the goal of at least seven completing the study in each industry area. We did not seek a larger sample size, because our goal was to obtain consistent opinions from a reasonable number of experts within our budget and time constraints. We believed that more experts would have produced redundant results for this study, and indeed the experts in each category agreed with the category group as a whole. We discuss our measure of subject agreement below.
With a small number of subjects, the expert selection was important. First, we identified candidates who were experts in their domains of expertise — finance or health — and had experience on the Web. We defined expert as an accomplished authority in his or her field; someone whom an average reasonable person would describe as "expert" if provided with a description of the person's experience and achievements in his or her field. Next, we reduced the list by screening for geographic diversity throughout the United States. Although we included one German-born doctor-researcher who currently works in Canada, international participation was not a prerequisite in order to match the consumer study subjects. We also balanced the experts for types of expertise to include a mixture of academics and industry practitioners. For example, our finance experts included a professor, a journalist, a financial advisor, and an industry analyst. This process resulted in a list of approximately 15 experts in each field whom we recruited by e-mail and phone. During recruitment, we specifically mentioned Consumers Union and Consumer Reports WebWatch to encourage expert participation.
Subjects who completed the study were paid an honorarium of $100 and a matching $100 was donated to a charity of his or her choice selected from a list of the 50 largest U.S. charities, ranked by total income in fiscal year 2000. This list of charities was compiled by The NonProfit Times. Two subjects sent the total $200 to the charity they selected. Please refer to Appendix B for a list of subjects and a description of each person's expertise.
RUNNING THE STUDY
The study took place between June 14 and July 12, 2002. Subjects accessed the study online via a Web site (finance and health experts were given two separate Web addresses). The first page required the subject to enter his or her name, reinforcing that the study was not anonymous, and to agree to the terms of the study. This Welcome Page specifically mentioned Consumers Union and Consumer Reports WebWatch to reinforce the integrity of the study and encourage the subject to spend more time and provide quality comments. This page also emphasized that the subject must complete the study in one browser session. The second page described the four parts of the study with the estimated time for completing each part, and provided the dummy login account for any sites requiring free registration.
Part 1 of the study presented the five randomly selected site pairs. Each of the pairs was presented on one page, for a total of five pages. For each pair, the subject chose one site as "1 = more credible" and the other as "2 = less credible" by selecting from radio buttons. The subject was asked to share comments in a text field.
Next, Part 2 presented the 1-to-10 ranking task on one page. For each site, the subject selected the 1-to-10 rank from a drop-down menu. Error checking prevented the subject from leaving the page without assigning unique numbers to each of the 10 sites. Again, comment fields were provided for each site.
Next, Part 3 of the study introduced the variable ranking task and explained the 7-point ranking scale. The 30 variables were presented on the next two pages.
Next, Part 4 presented five optional questions. Each subject was thanked for his or her participation in the required portion of this study, and asked to answer as many or as few optional questions as his or her time allowed.
Finally, the last page of the survey thanked the subjects again. The data for each subject was then stored in a database for analysis.
ANALYZING THE DATA
After the testing period ended, we reviewed the data to ensure that it was complete and reasonable. For completeness, we confirmed that the database showed valid information for each subject and for each task. For reasonableness, we analyzed the data to ensure that it made sense and noted any discrepancies that required discussion. This task varied by section, but included such things as comparing the order of each subject's pair rankings with his or her 1-to-10 rankings.
Once we were assured that the data fields looked valid, we analyzed the data to understand the results. First, we quantitatively analyzed the order of ranked sites and variables and calculated an average response for each group of experts. Next, the qualitative data, which included site-specific comments and answers to optional questions, required comprehensive review. In order to classify the comments and to compare them to the parallel Stanford PTL 2002 consumer study, we coded each qualitative comment based on a list of content codes developed by the Stanford PTL; we added one category, information source. The categories are explained in Figure 1 below.
Figure 1: Description of comment categories
|Category|Description|
|Advertising|Comments relating to how users perceive advertising on the site|
|Company Motive|Comments relating to the perceived motive, good or bad, of the organization behind the site|
|Design Look|Comments relating to the look and visual appeal of the site|
|Information Bias|Comments relating to the perceived bias of information in the site|
|Information Design|Comments relating to how the information is structured or organized on the site, including navigation and site organization|
|Information Focus|Comments relating to the scope or focus of the site, including the quantity of information available|
|Information Accuracy|Comments relating to the accuracy of the information on the site that does not mention information sources|
|Information Source|Comments relating to the citation of sources|
|Name Recognition and Reputation/Affiliation|Comments relating to name recognition of the site or the reputation of the site's operator, affiliates, supporters, or partners|
|Writing Tone|Comments relating to the tone or attitude conveyed by the site's content|
Each comment was assigned one or more of the 24 codes and could be positive or negative with respect to the code. An example content code was "CMN," standing for "company motive negative," which means that the comment expressed a concern that the motive of the company was negative. See Appendix D for the complete list of content codes with sample categorized comments. In addition, we grouped some individual codes into meaningful categories. The previous table defines the categories of expert comments that are referred to in the Results and Discussion section of this paper. Once all of the comments were coded, we calculated the frequency of each category. Thus, we were able to determine how often comments addressed a particular subject.
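This frequency tally is simple to automate. The following is a minimal sketch, with entirely hypothetical coded comments and illustrative code labels, of how per-code counts and frequencies could be computed:

```python
from collections import Counter

# Hypothetical coded comments: each comment carries one or more content codes.
# Labels like "CMN" (company motive negative) are illustrative only.
coded_comments = [
    ["CMN"],
    ["DL+", "IS+"],
    ["CMN", "IA-"],
    ["IS+"],
]

# Count how often each code appears across all comments.
code_counts = Counter(code for comment in coded_comments for code in comment)

# Express each code's count as a share of all code assignments.
total = sum(code_counts.values())
frequencies = {code: count / total for code, count in code_counts.items()}
```

Summing shares per grouped category (rather than per individual code) would yield the category frequencies reported later in this paper.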
PART 1: PAIR-RANKING TASK ANALYSIS
The analysis of the results varied for each part of the study. Part 1, the pair rankings, provided quantitative and qualitative data based on pairs of sites. We read and coded qualitative data as described above. Quantitatively, the ranking system stored "+1" in the database when a site was ranked the more credible of the pair and "-1" when a site was ranked the less credible. We used the +1/-1 scores to calculate a mean score for each site. For example, a site that won all of its pairings would have a final mean score of +1; winning half the pairings would result in a score of 0; and losing all of the pairings would result in a score of -1.
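As an illustration of the scoring rule above, the mean pair score for a single site could be computed as follows; the win/loss outcomes shown are hypothetical:

```python
# Hypothetical pair-ranking outcomes for one site: +1 for each pairing the
# site won as "more credible," -1 for each pairing it lost.
outcomes = [+1, +1, -1, +1]

# Mean pair score: +1 if the site won every pairing, 0 if it split them
# evenly, -1 if it lost them all.
mean_score = sum(outcomes) / len(outcomes)
```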
The parallel Stanford PTL 2002 consumer study created site rankings by using the mean score to order the sites in each category. However, as discussed in the Methods section above, the small number of pairings in this expert study and the random pairing method did not necessarily lead to an accurate ranking order of the sites for the experts. In order to understand the effect of the random pairings, we reviewed which sites were paired with which and found that indeed some of the pairings were "unfair." For example, ShareBuilder was paired exclusively against sites that did well overall, and thus its pair ranking was lower than its 1-to-10 ranking from Part 2. In future studies, we could use a different pairing algorithm that exhausts all possible pairings before allowing duplicates, in order to elicit a wider range of comments.
PART 2: 1-TO-10 RANKING TASK ANALYSIS
Part 2, the 1-to-10 rankings, also provided quantitative and qualitative data on all of the sites. We reviewed and coded qualitative comments as described above. Quantitatively, the ranking system stored a number from 1 to 10 in the database for each site for each subject. Using the 1-to-10 numbers, we calculated the mean ranking for each site. See the Results and Discussion section, Figures 2 and 4, for the sites in each category ranked from 1 to 10.
DATA CONSISTENCY BETWEEN PART 1 AND PART 2
In order to verify the data consistency for each subject, we checked to see that each expert's answers from the pairings agreed with his or her final 1-to-10 rankings. For example, a discrepancy existed if a subject rated site A more credible than site B in the pair rankings, but then ranked site B higher than site A in the 1-to-10 rankings. We refer to these dissonant occurrences as "reversals." Of the 75 total pairs (15 subjects with five pairs each), seven were reversed. In all except two of these reversed cases, the subject ranked the sites adjacent to each other in the 1-to-10 ranking, thereby making the discrepancy fairly insignificant. Thus, there were only two pairings of the 75 with notable discrepancies, and in those two cases the subjects' comments did not explain their rationale for changing the relative ranking of a site from Part 1 to Part 2.
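This reversal check can be expressed programmatically. The sketch below uses site names from the study but entirely hypothetical pair outcomes and rankings, purely to illustrate the comparison:

```python
# Hypothetical data for one subject. pair_results holds (winner, loser)
# tuples from the Part 1 pairings; final_rank holds the subject's Part 2
# ranking for each site, where 1 = most credible.
pair_results = [("NIH", "DrKoop"), ("WebMD", "Oxygen"), ("DrWeil", "InteliHealth")]
final_rank = {"NIH": 1, "WebMD": 3, "InteliHealth": 4,
              "DrWeil": 6, "DrKoop": 8, "Oxygen": 9}

# A reversal occurs when the pair winner ends up ranked below (a larger
# rank number than) the pair loser in the 1-to-10 task.
reversals = [
    (winner, loser)
    for winner, loser in pair_results
    if final_rank[winner] > final_rank[loser]
]
```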
We also scored how much each expert was in agreement with the other experts in his or her group. We desired a high level of agreement, meaning consistency among expert opinions; conformity shows both that the expert point of view was stable and that the results were meaningful with the number of experts we included in the study. We devised a rough calculation for agreement based on a comparison between an individual expert's pair ranking of each site with the group's 1-to-10 ranking of the site. If an expert rated site A more credible than site B in the pair task, and site A placed higher in the group's 1-to-10 rankings, the expert earned a point. With 10 sites in five pairs, the maximum score was five. Thus, if the expert's pair rankings completely agreed with the group's rankings, the expert would have scored a 5. The scores were as follows: finance: 5, 5, 5, 5, 4, 4, 3; health: 5, 5, 5, 5, 4, 4, 3, 3. In summary, we used this coarse measurement to understand whether our experts were generally in agreement, and we found that they were.
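A sketch of this agreement calculation follows, using hypothetical sites and rankings (a lower group rank means more credible):

```python
# Hypothetical inputs: one expert's five pair outcomes (winner, loser) and
# the group's 1-to-10 ranking for each site (lower rank = more credible).
expert_pairs = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H"), ("I", "J")]
group_rank = {"A": 1, "B": 5, "C": 7, "D": 2, "E": 3,
              "F": 9, "G": 4, "H": 10, "I": 8, "J": 6}

# One point for each pairing where the expert's winner also placed higher
# in the group ranking; a perfect agreement score is 5.
agreement = sum(
    1 for winner, loser in expert_pairs if group_rank[winner] < group_rank[loser]
)
```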
PART 3: VARIABLES
Part 3 provided quantitative data on how different elements of a Web site in the expert's field affected its believability. As discussed in the Study Design section, each item received a score from -3 (makes the site much less believable) to +3 (makes the site much more believable). Using those scores, we computed a mean for each variable, both overall and for health and finance categories separately. Next, we used those means to make two comparisons: the differences between the two groups of experts, and the differences between the experts and the consumers from the Stanford PTL 2001 study, which was described in the Methods section.
To compare differences between the groups of experts, we noted items that were interestingly different, which we defined as a difference of at least 1 point in the mean rankings of the two groups. To compare differences between the experts and a large group of consumers, we compared the expert mean rankings to the consumers' mean rankings from the Stanford PTL 2001 study. These consumer mean rankings were based on 1,400 consumer assessments of Web credibility variables as described in the Study Design section. Ninety-five percent confidence intervals were computed from these consumer responses and compared to the means for each group of expert participants in our study. For the experts, in all but 5 of the 60 mean variable scores (each group of experts contributed 30 mean scores, one for each of the 30 variables), there was a statistically significant difference from the general population at the 95% confidence level (p = .05). However, while most of the items were statistically different, we identify and discuss only items that show a greater practical difference under our 1-point-difference rule. These results are discussed in the Variable Rankings Results and Discussion section.
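The 1-point-difference screen is straightforward to apply in code. In this minimal sketch, the variable names and mean scores are hypothetical; the logic simply flags variables whose group means differ by at least 1 point:

```python
# Hypothetical mean variable scores (-3 to +3) for the two expert groups.
health_means = {"lists author credentials": 2.5, "has ads that match topic": -0.5}
finance_means = {"lists author credentials": 2.2, "has ads that match topic": 0.8}

# Flag variables whose group means differ by at least 1 point -- the
# "practical difference" threshold used in this study.
notable = [
    var for var in health_means
    if abs(health_means[var] - finance_means[var]) >= 1.0
]
```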
PART 4: OPTIONAL QUESTIONS
The optional questions provided qualitative comments that we read and coded according to the coding scheme described in the Analyzing the Data section above. Thirteen of the 15 subjects — 7 health experts and 6 finance experts — answered the optional questions. The interesting themes found in these responses are discussed in the Results and Discussion section.
Results and Discussion
PARTS 1 AND 2: SITE RANKINGS RESULTS AND DISCUSSION
This section presents and discusses the results of the site rankings gathered in Parts 1 and 2 of the study. Figures 2 and 4 show the mean results from both ranking tasks, presented in order of the final 1-to-10 ranking from Part 2. The pair ranking from Part 1 of the study is displayed on the right side of the table for comparison with the final 1-to-10 ranking. Note that these final results are not the 10 most credible health and finance sites on the Web in each category, but only a ranking of the sites chosen for this study and judged by this small group of experts.
We did not include a table ordered by the results from the pair rankings from Part 1 of the study because, as discussed in the Study Design section, we believe that the expert pair rankings are insufficient to produce robust results. Since pairings used in Part 1 were random, and there was a small sample size, not all sites were paired equally with other sites, and there were some duplicate pairings. ShareBuilder, for example, was ranked tenth in the pair rankings and improved to seventh in the 1-to-10 rankings. A review of the sites that ShareBuilder was paired with in Part 1 reveals that it was always matched with sites that were ranked more credible in the Part 2 (1-to-10) rankings. Therefore, in the final 1-to-10 rankings, ShareBuilder was eventually ranked ahead of two other sites with which it was never paired. An investigation of the other sites with a difference in the rankings between Parts 1 and 2 also revealed "unfair" pairings similar to the situation that occurred with ShareBuilder.
HOW HEALTH EXPERTS RANKED SITES
Figure 2: Health-site rankings
*To distinguish it from MayoClinic.org, the Web site for the Mayo Foundation and its associated clinics and hospitals, we include the .com suffix when referring to MayoClinic.com throughout this document.
In the health ranking, the site for the National Institutes of Health (NIH) was ranked number one by every participant in the expert study. Next, MayoClinic.com, WebMD, and InteliHealth had fairly close mean rankings; these sites were perceived as generally credible, though not as credible as the NIH site. MDChoice landed in the middle with a mean ranking of 5.0, not quite making the top group of sites, yet not suffering from the credibility issues of the bottom five. The next group of sites (Dr. Koop, HealthWorld Online, Dr. Weil, and Oxygen Health and Fitness) makes up the bottom end of our experts' rankings. None of these sites completely lacked credibility, but each suffered from at least one major flaw described in detail in this section. Finally, Health Bulletin was ranked at the bottom: five of our eight health experts ranked it last. To understand why each site fared as it did in the rankings, the following is a brief review of our experts' comments on each site.
1. National Institutes of Health (NIH)
Our panel of experts ranked the NIH first due to its "sterling" reputation; one expert characterized it as "the gold standard." The NIH also scored high because of its "lack of self-interest," said one. Another comment summarized the overall expert opinion of NIH: "I don't have to look at their Web site (although I did!) to know that I trust this site. It is more about knowing the source and the processes rather than credibility markers such as design." Other experts pointed out that the site references peer-reviewed journals and that it is used as a source by other sites. There were no negative comments about this site.
2. MayoClinic.com, 3. WebMD, 4. InteliHealth
Each of these sites had an affiliation with reputable sources that resulted in its favorable ranking. Experts ranked the MayoClinic.com site highly because of its reputation, commenting that it is "credible due to its affiliation with the highly regarded Mayo Clinic." WebMD's credibility was enhanced by its quality information, with reputable sources cited for each article. This led one expert to write, "Authorship and sometimes credentials of each posting listed. Some original material, a lot of listings from other credible (usually peer-reviewed) sites." And InteliHealth gained credibility via its affiliation with the Harvard Medical School and a seal of approval from the nonprofit American Accreditation HealthCare Commission (URAC), a health organization and Web site accreditation agency. The for-profit WebMD and InteliHealth ranked slightly below MayoClinic.com, the commercial site of the not-for-profit Mayo Clinic Foundation, because some experts had slight concerns about their commercial motives. For example, one expert wrote of InteliHealth, "Though they proudly display the Harvard logo, the URAC Accreditation Seal and many other trust marks, I would still be wary as they are commercially driven."
5. MDChoice
MDChoice is an interesting middle-ranked site where the comments clearly reflected its midpoint ranking. Our experts did not seem to distrust the content on this site, because it was either reviewed by outside sources or included a direct link to credible outside sources. Instead, our experts ranked MDChoice lower than others because it seemed to contain only material culled from other sites. One expert explained his ranking as follows: "Material culled mostly from other reasonable sources and apparently reviewed by editorial board. Appears to be less original material and review than #2 & 3 ranked choices." This lack of original content paired with suspicions "about a company backed by venture capital money" and a large number of drug company ads caused it to be ranked below other for-profit health sites.
6. Dr. Koop
Our experts were mixed in their responses to Dr. Koop, although most said they were suspicious of its credibility because of the "many ads" or the lack of reference citations. Some cited concerns about the site's earlier negative press, such as one who opined, "After the Dr. Koop scandal a couple of years ago, this site has no credibility whatsoever." In the summer of 2000, Dr.Koop.com, Inc. and four corporate officers were sued in securities fraud class-action lawsuits filed by investors in Texas (USA), which alleged that the company made false promises when it began selling its stock to the public during its initial public offering. Former U.S. Surgeon General C. Everett Koop was not named in the litigation. The lawsuits were eventually settled during the summer of 2001.
7. HealthWorld and Dr. Weil (tie)
HealthWorld and Dr. Weil tied in our panel's rankings, with both placed in the bottom half because of a perception that each is too commercial in focus. The majority of our experts stated that these sites' credibility suffered because the sites were not only providing information about alternative health products, but also selling those same products. "I do not like product information appearing mid-screen (monitor) when I am browsing selecting information.... I was expecting content," wrote one expert in reference to HealthWorld. Several experts also mentioned a lack of references or outside review as a problem; as one wrote, "I feel this is a suspicious Web site, with non-evidence based health information, not from evidence-based sources."
9. Oxygen Health
Oxygen Health ranked ninth primarily because our health experts felt it focused less on health and more on fitness and commercial articles. Experts assessing the site in our study had trouble finding real health information. One said:
"This seems to be a fitness/stress reduction/feel good site as opposed to a 'health' site. I thought I would attempt to find some health information. As this is a site designed for women, I figured 'breast cancer' information would be easy to locate. After spending some time looking at link options under the 'health and fitness' tab, I did a search for 'breast cancer,' and was not able to locate any substantive information regarding the disease."
Some experts pointed out that the articles appeared to be "… authored mostly by non-MDs and [there's] no sign of subsequent review." However, our experts did not find information that was factually incorrect, which saved Oxygen Health from being ranked last.
10. Health Bulletin
Finally, our panel of experts ranked Health Bulletin last because it contained what one expert characterized as "biased information from the alternative/homeopathic point of view" presented in a flashy manner and "without actual credible authorship present." There were no positive comments about this site, as our experts had a fairly unanimous negative opinion of the information. Said one: "Numerous spelling errors. Sentences such as 'One thing would be to find out whether your headaches are more serious medically. Get an exam. Second would be to try out different preparations from the store or from mail-order distributors. Good luck.' make the site not very credible."
OVERALL HEALTH-SITE EVALUATION TRENDS
Figure 3: Percentage of health-expert comments, by category
Figure 3 provides a quantitative summary of the types of comments the health experts made during the site-pairing and ranking tasks. The comment percentages do not add up to 100%, as some comments were coded with multiple categories and not all categories tracked appear in the table. This quantitative analysis paired with our review of the comments revealed several trends in the way these health experts judged the credibility of health sites.
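To make the note about percentages concrete: because a single comment can carry several category labels, each category's percentage is computed against the number of comments, not the number of labels, so the percentages can legitimately sum past 100%. Here is a minimal sketch using invented comments and labels (not the study's actual data):

```python
from collections import Counter

# Invented example comments, each tagged with one or more coding categories.
# The labels mirror the study's category names, but the counts are illustrative.
comments = [
    {"name/reputation/affiliation", "information source"},
    {"company motive"},
    {"name/reputation/affiliation"},
    {"company motive", "information source"},
]

label_counts = Counter(label for c in comments for label in c)

# Percentage = share of comments carrying each label. A multi-label comment
# contributes to several categories, so the column can exceed 100% in total.
percentages = {label: 100 * n / len(comments) for label, n in label_counts.items()}

total = sum(percentages.values())  # 150.0 for this example
```

In this toy example each of the three categories appears in 2 of 4 comments (50% each), so the column sums to 150%, exactly the effect described for Figure 3.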
First and foremost, our health experts gave the most credibility to sites that provided information from reputable sources, as illustrated by the high number of comments in the name/reputation/affiliation (43.9%) and information source (25.8%) categories. Our experts' credibility perceptions of health sites were influenced more by the expertise dimension of our credibility definition than by the trustworthiness dimension. For example, one expert explained his credibility ranking for InteliHealth by writing, "…material written/reviewed by Harvard Medical School docs. Some info from standard source (like USP for patients)." In contrast, Dr. Koop's perceived lesser credibility seems tied to issues of reputation and sourcing: "Authorship of individual articles difficult if not impossible to ascertain. Leadership heavy, using Dr. Koop's name, only 2 other MDs on panel listed as 'authors and experts,' none of the three appear to be practicing clinicians." Reputation and sourcing greatly influenced the final health-site rankings, which generally trended, from highest to lowest, from sourcing by reputable authors (NIH, WebMD), to general review (InteliHealth), to no author credentials or citations (Health Bulletin).
Sources develop a good reputation in the health field by having a history of providing quality information. Our health experts used reputation as a key indicator of credibility by assuming that such sources are motivated to continue to provide good information to protect their reputation. For example, MayoClinic.com ranked in the top tier almost purely by its past reputation, with comments such as, "Again, [this site is] credible due to its affiliation with the highly regarded Mayo Clinic. This site does not, however, appear to contain as much in-depth health information as InteliHealth." In contrast, the reputation Dr. Weil brought to his site was countered by the site's commercial focus: "Dr. Weil is credible, but the site is more commercial — i.e., trying to sell you upgrades, vitamins, etc." This leads to the next judgment criterion: company motive.
Company motive was the third-most-commented-on category among our health experts (22.7%). Our experts felt that health sites should operate with readers' interests, not their balance sheets, as the first priority. One expert summarized this viewpoint, writing, "I find health Web sites that sell or market products less credible than those that relay information only." Our experts' rankings reflected this trend by rating the U.S. government's NIH, which works solely for the benefit of people's health, at the top of the credibility scale. The overall ranking then proceeds, from highest to lowest, from commercial interests presented in a non-sales-oriented manner (WebMD and InteliHealth), to sites where our experts had concerns about commercial motive driving the content (Dr. Weil), and finally to sites where products were sold alongside the content, making it difficult for users to distinguish ads from editorial material (Health Bulletin).
While credibility could be tainted by commercial motive and product sales, our experts' comments did not indicate that a profit motive alone precluded credibility. According to our experts, a for-profit site with deep, well-sourced information could remain largely credible. This was the case with WebMD, which ranked third: "Good overall site, but I'm always cautious about publicly traded companies underwriting health Web sites and any possible ulterior motives to the content." In the case of Dr. Weil, a few experts commented that they looked for sourced information to counteract the product sales but could not find it, which hurt the site's credibility with our panel. One expert wrote, "Mostly, this site is about selling vitamins. There's no research that I can find."
Finally, from the perspective of surface credibility criteria such as writing and visual design, our health experts were concerned with language-presentation issues such as editing, poor grammar, and typos. Although they did not base their credibility assessments purely on t