Rethinking self-reported measure in subjective evaluation of assistive technology
© The Author(s) 2017
Received: 26 November 2016
Accepted: 14 June 2017
Published: 8 August 2017
Self-reporting is used as a subjective measure in usability studies of technology solutions. In assistive technology research, more often than not a ‘coordinator’ directly assists the ‘subject’ in the scoring process. This slows the rating process and introduces bias, such as the ‘Forer effect’ and/or the ‘Hawthorne effect’. To address these issues, we propose technology-mediated interaction between the ‘subject’ and the ‘coordinator’ in evaluating assistive technology solutions. The goal is to combine qualitative and quantitative scores to create a relatively unbiased rating system. Empirical studies were performed on two different datasets to illustrate the utility of the proposed approach. The proposed hybrid rating was observed to be relatively unbiased for usability studies.
Keywords: Disabilities, Assistive technology, Subjective rating bias, Technology assessment, Hybridization, Kano analysis
Self-reported measures suffer from predictive bias. Using the same measures in disability experiments is unreliable and affected by experimental biases. To resolve bias issues, paired comparison testing has been reported, in which the rating scores are weighted by the subjects who participated in the study. Paired comparison, however, is unable to address the underlying bias. We propose a technology-mediated approach that supplements subjective rating with (secondary) user observation to solve this problem. The rationale for choosing a hybrid approach is that removing some bias may improve the reliability of the subjective rating.
The hybrid approach combines the (un-weighted) subjective rating with weights given by experts. The indirect weights are computed from the experts’ post-experiment analysis of the recorded video, combined with the subject’s ability profile. In this way, the approach joins the quantitative data from the subjective rating with a weighted qualitative analysis that validates the rating against the video.
Two different cases were analyzed to show the utility of the proposed approach. The first experiment considers usability and mental workload analysis of blind subjects interacting with R-MAP (reconfigured mobile Android phone), an Android application that reads printed text. The second is a cross-disability communication experiment performed to assess the mental workload of four participants while designing four different solutions to the same problem: different modes of communication using an Android application and other devices. The NASA-TLX rating scheme was used in both assessments [5, 6].
We perform the Cronbach’s alpha test to check the reliability of the scores and use a gradient descent algorithm to update the alpha score (weights) during hybridization. We also apply Kano analysis to see whether subjects are satisfied with the new rating process.
The rest of the paper is organized as follows. The “Background” section summarizes reported work to set the context of the research. The subsequent sections (“Research method” and “Results and discussion”) explain the proposed hybrid method with two experiments. Finally, “Conclusion” closes the paper with lessons learned and future work.
Individual ratings are high for applications that are tailored to one’s specific needs. Such ratings are rarely useful or general enough to apply to a diverse population. This effect is known as the ‘Forer effect’, or sometimes the ‘Barnum effect’. Some studies have found that subjects give higher accuracy ratings for three reasons: (a) the subject believes that the analysis applies only to him or her, and thus applies their own meaning to the statements, (b) the subject believes in the authority of the evaluator, and (c) the analysis lists mainly positive traits. A more general effect closely related to the ‘Forer effect’ is ‘subjective validation’: a person will consider a statement or another piece of information to be correct if it has any personal meaning or significance to them. In disability data collection, the facilitator or interviewer assists the subject in the rating process, so subjects must express their opinions in the presence of the facilitator. The problem with being observed in this way is the well-known Hawthorne effect, under which subjective ratings reflect idealized rather than typical behavior.
Cognitive bias: A cognitive bias is a pattern of deviation in judgment that may be caused by inferences about other people and situations. It may lead to perceptual distortion, inaccurate judgment, illogical interpretation, or what is broadly called irrationality, and it results in an unsatisfactory rating. Cognitive bias arises from various processes that are sometimes difficult to distinguish. These include shortcuts (heuristics) during information processing, mental noise and the mind’s limited information-processing capacity, emotional and moral motivations, and social influence.
Construct bias: This occurs when an experiment has different meanings for two groups, in terms of the precise construct that the test is intended to measure. It concerns the relationship of observed scores to true scores on a psychological test. If this relationship can be shown to be systematically different for different groups, then we might conclude that the test is biased. Construct bias can lead to situations in which two groups have the same average true score on a psychological construct but different test scores.
Predictive bias: This concerns the relationship between scores on two different tests. One of these tests (the predictor test) is thought to provide values that can be used to predict scores on the other test (the outcome test or measure). For example, graduate admissions officers might use Graduate Record Examination (GRE) test scores to predict GPAs. The GRE would be the predictor test and GPA would be the outcome measure. In this context, test bias concerns the extent to which the link between predictor test true scores and outcome test observed scores differs for two groups. If the GRE is more strongly predictive of GPA for one group than for another, then the GRE suffers from predictive bias in terms of its use as a predictor of GPA.
Issues in unbiased rating
Issues to have an unbiased rating scale (summarized from )
(1) The connotations of category labels
Rating descriptor words require careful thought. Labels that do not form an equal-interval scale may bias the scale. Example: terrible__horrible__awful__fair__slightly good__all right__reasonably good
(2) Effect of response alternatives on interpretation of the question
The response alternatives can affect the interpretation of the question. Knowledge of this phenomenon makes it easy to influence the responses of subjects. Example: “how often have you considered quitting your job?”
(3) Implicit assumptions of the question
Some questions are biased because of an implicit assumption made by the question. Example: intrinsic or germane cognitive load?
(4) Forcing a choice
A forced-choice rating scale will bias results by eliminating the undecided and/or those with no opinion. In disability studies, this is a crucial consideration.
(5) Unbalanced rating scales
Generally, rating scales should be balanced, with an equal number of favorable and unfavorable response choices. Example: (unbalanced) “Excellent,” “very good,” “good,” “fair,” “poor.” This scale is unbalanced, with three favorable and only one unfavorable response choice
(6) Order effects in rating scales
Traditionally, researchers present the most positive items in the scale first (e.g., “strongly agree,” “extremely interesting,” or “extremely satisfied”) and the most negative items last (“strongly disagree,” “very boring,” or “extremely dissatisfied”)
(7) The direction of comparison
Many surveys contain questions of comparison, where respondents are asked to compare two stimuli
(8) The number of points
Ideally, a rating scale should consist of enough points to extract the necessary information. Variability can be improved by using scales with too many points.
(9) Context effects
Many surveys consist of a series of questions whose purpose is to help the researcher determine which factors correlate most strongly with the subjects’ overall opinion. Responses to some questions may be influenced by subsequent questions.
Paired comparison approach
In cognitive psychology, the paired comparison technique compares entities in pairs to judge which entity of each pair is preferred or has a greater amount of some quantitative property. The method of pairwise comparison is used in the scientific study of preferences, attitudes, voting systems, social choice, public choice, and multi-agent AI systems. Given two mutually distinct alternatives x and y, a preference can be expressed as a pairwise comparison: the agent prefers x over y (“x > y” or “xPy”), the agent prefers y over x (“y > x” or “yPx”), or the agent is indifferent between the two alternatives (“x = y” or “x|y”).
The NASA task load index (NASA-TLX) was developed by the Human Performance Group at NASA’s Ames Research Center. NASA-TLX is a multidimensional subjective rating tool that rates perceived workload in order to assess a task, system, or team’s effectiveness or other aspects of performance. Its six sub-scales are mental demand, physical demand, temporal demand, performance, effort and frustration. The weighted NASA-TLX method applies the paired comparison technique to resolve bias issues. It requires considerably more involvement from the subject in the rating process (at least double), which is not feasible in the context of disability studies.
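The difference between the un-weighted (raw) and the weighted NASA-TLX score can be illustrated with a minimal Python sketch. It assumes the usual weighting procedure, in which the subject chooses the more relevant dimension in each of the 15 possible pairs of the six sub-scales and the number of times a dimension is chosen becomes its weight. The function names, the 0–100 rating range and the example ratings are hypothetical and are not taken from the datasets in this paper.

```python
from itertools import combinations

# The six NASA-TLX sub-scales named in the text.
DIMENSIONS = ["mental", "physical", "temporal",
              "performance", "effort", "frustration"]


def raw_tlx(ratings):
    """Un-weighted score: plain mean of the six sub-scale ratings."""
    return sum(ratings[d] for d in DIMENSIONS) / len(DIMENSIONS)


def pairwise_weights(preferred):
    """Tally how often each dimension wins a paired comparison.

    `preferred` maps every pair (a, b) produced by
    combinations(DIMENSIONS, 2) to the dimension the subject chose.
    """
    tally = {d: 0 for d in DIMENSIONS}
    for pair in combinations(DIMENSIONS, 2):
        tally[preferred[pair]] += 1
    return tally


def weighted_tlx(ratings, weights):
    """Weighted score: ratings weighted by the paired-comparison tallies."""
    total = sum(weights.values())  # 15 comparisons for six dimensions
    return sum(ratings[d] * weights[d] for d in DIMENSIONS) / total


# Hypothetical subject: ratings on an assumed 0-100 scale.
ratings = {"mental": 80, "physical": 20, "temporal": 50,
           "performance": 40, "effort": 60, "frustration": 30}
# Hypothetical choices: the first dimension of each pair is always preferred.
preferred = {pair: pair[0] for pair in combinations(DIMENSIONS, 2)}
weights = pairwise_weights(preferred)

raw_score = raw_tlx(ratings)
weighted_score = weighted_tlx(ratings, weights)
```

The 15 extra judgments per subject in `pairwise_weights` are the additional involvement that makes the weighted method hard to apply when a facilitator has to mediate every response.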
In our work, the hybridization process uses the subjects’ ratings together with a post-rating analysis of the rating process (the experts’ rating of the rating).
A hybrid rating system combined with the Kano model in subjective data collection can provide much-needed objectivity and unbiased rating in usability studies. A case study is used to explain the process in our laboratory experiment.
Procedure 1: rating hybridization process
(1) Rating data pre-processing procedure
(a) Selection of multidimensional rating data
(b) Identifying qualitative and quantitative part
(c) Identify statistical analysis
(2) Rating effect and bias analysis
(a) Reliability test: Cronbach’s alpha test
(b) User acceptance: Kano analysis
(3) The hybridization process
(a) Weight computation: critical incident analysis
(b) Weight update: gradient descent delta rule
(c) Scale improvement with weight
R-MAP dataset: This study was conducted with two groups, representative and non-representative. The representative group consisted of four blind people, two non-expert and two expert. Participants were considered expert if they had more than a year of experience using a smartphone. The second (non-representative) group contained twenty blindfolded participants, ten considered expert and ten non-expert. Participants were trained and then asked to use R-MAP to read different objects and documents containing text (e.g. text from a textbook). The data collection process was reviewed and passed by the institutional review board. The R-MAP subjective rating dataset uses the NASA task load index [5, 6], with six dimensions to assess mental workload: mental demand, physical demand, temporal demand, performance, effort, and frustration.
NASA-TLX used in R-MAP subjects’ mental workload assessment
Mental demand: How much mental and perceptual activity was required (e.g. thinking, deciding, calculating, remembering, looking, searching, etc.)? Was the task easy or hard, simple or complex, exacting or forgiving?
Physical demand: How much physical activity was required (e.g. pressing while tapping the interface, tapping in different locations, double-tapping, positioning the camera, positioning your hand, positioning the item, etc.)? Was the task easy or hard, slow or fast, slack or strenuous, restful or laborious?
Temporal demand: How much time pressure did you feel due to the rate or pace at which the task or task elements occurred? Was the pace slow and leisurely or rapid and frantic?
Performance: How successful do you think you were in accomplishing the goals of the task set by the experimenter? How satisfied were you with your performance in accomplishing these goals?
Effort: How hard did you have to work (mentally and physically) to accomplish your level of performance?
Frustration: How insecure, discouraged, irritated, stressed, and annoyed versus secure, gratified, content, relaxed, and complacent did you feel during the task?
R-MAP usability measures (quantitative)
How difficult was the experiment instruction content for you?
How difficult was it to learn with the instruction format?
How much did you concentrate during the experiment?
What do you think about the chances of errors during the experiment?
How pleasant was it for you to participate in this experiment and to use the design?
Subject observation (critical incident) weight index table
Critical incident (observed)
The subject is very happy and responds quickly
The subject asks a question or responds slowly
The subject is confused about the rating
The subject’s reply is not relevant and the facilitator needs to ask again
The subject is sad
Cross-disability communication dataset: The cross-disability dataset is more qualitative than quantitative. Four usability experts participated in a discussion on improving cross-disability communication. A deaf participant (Doris), a blind participant (Bob) and a deaf–blind participant (Debra) sat in the conversation with a facilitator (Jon). These names are pseudonyms, used for simplicity. A fifth person was instructed to record the conversation, which was later used in the critical incident analysis. A snapshot of the experimental setting is shown in Fig. 3. Over the whole conversation, four different communication designs (for communication between deaf and blind users) were discussed. These four designs are: (1) speech-text-speech, (2) speech-sign-speech, (3) Braille-text-Braille and (4) Braille-sign-Braille. Figure 4 shows one such design (speech-sign-speech). The experiment was conducted without any time limit. Through this process, a NASA-TLX score was recorded for each design.
CD-1 (communication design one: speech-text-speech) Speech from Bob can be encoded as text and sent to Doris, who can read it and type her reply, which is decoded as speech for Bob. Considerations: Bob cannot type but can speak and listen; Doris cannot speak or listen but can read text and type.
CD-2 (communication design two: speech-sign-speech) Speech from Bob can be encoded, sent to Doris and played by an avatar that mimics the sign (ASL) for Doris; Doris then replies in sign, which is encoded as speech and sent back to Bob (Fig. 4). Considerations: Bob cannot type but can speak and listen; Doris cannot speak or listen but can read text and type.
CD-3 (communication design three: Braille-text-Braille) Braille from Bob can be encoded and sent to Doris as text; she can read it and reply in text, which is decoded to Braille for Bob.
CD-4 (communication design four: Braille-sign-Braille) Braille from Bob can be encoded, sent to Doris and played by an avatar that mimics the sign (ASL) for Doris; Doris then replies in sign, which is encoded as Braille and sent back to Bob.
NASA-TLX used in subjects’ collaborative mental workload assessment 
How much coordination activity was required (e.g., correction, adjustment)? Were the coordination demands to work as a team low or high, infrequent or frequent?
How much communication activity was required (e.g., discussing, negotiating, sending and receiving messages)? Were the communication demands low or high, infrequent or frequent, simple or complex?
Time sharing demand
How difficult was it to share and manage time between task work (work done individually) and teamwork (work done as a team)? Was it easy or hard to manage individual tasks and those tasks requiring work with other team members?
How successful do you think the team was in working as a team? How satisfied were you with the team related aspects of performance?
How difficult was it to provide and receive support (providing guidance, helping team members, providing instructions, etc.) from team members? Was it easy or hard to support/guide and receive support/guidance from other team members?
How emotionally draining and irritating versus emotionally rewarding and satisfying was it to work as a team?
Cronbach’s alpha is widely believed to be an indicator of the degree to which a set of items measures a single unidimensional latent construct, and it is useful in testing the reliability of scores.
Cronbach’s alpha statistic is widely used in the social sciences, business, nursing, and other disciplines. Researchers have shown that alpha can take quite high values even when the set of items measures several unrelated latent constructs [23, 24]. As a result, alpha is most appropriately used when the items measure different substantive areas within a single construct. Alpha can be artificially inflated by building scales from items that differ only in superficial wording, or by analyzing speeded tests, because alpha treats any covariance among items as true-score variance. Alpha is not robust to missing data or to the presence of more than one construct. Coefficient omega or another estimator may be more appropriate when the set of items measures more than one construct [25, 26].
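For readers who want to reproduce the reliability test, the following is a minimal Python sketch of Cronbach’s alpha in its standard form, α = k/(k − 1) · (1 − Σ s²ᵢ / s²ₜ), where k is the number of items, s²ᵢ is the sample variance of item i and s²ₜ is the sample variance of the respondents’ total scores. The function name and the small score matrix are hypothetical; they are not the R-MAP or cross-disability data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    `scores` is a list of rows, one row per respondent, each row
    holding that respondent's score on each of the k items.
    """
    n_respondents = len(scores)
    k = len(scores[0])
    if k < 2 or n_respondents < 2:
        raise ValueError("need at least two items and two respondents")

    # Sample variance of each item (column) across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: four respondents, three items.
example_scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
]
alpha = cronbach_alpha(example_scores)
```

Because the statistic depends only on item variances and the variance of the totals, any covariance among items raises alpha, which is the reason it can be inflated by near-duplicate items as described above.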
Cronbach’s alpha scale
α ≥ 0.9
Excellent (high-stakes testing)
0.7 ≤ α < 0.9
Good (low-stakes testing)
0.6 ≤ α < 0.7
0.5 ≤ α < 0.6
α < 0.5
Kano analysis is performed to gain insight into user satisfaction with the rating process. It is a widely used usability tool that focuses on differentiating the features of the operation, as opposed to focusing initially on the user’s needs. Kano also produced a methodology for mapping consumer responses to questionnaires onto his model, which may be useful to incorporate into a traditional rating system.
The diagonal line (blue) indicates the one-dimensional expected need of the user. The curve at the bottom (red) indicates the user’s basic need, known as a must-be requirement, and the curve at the top (green) represents the excitement need, an attractive requirement. Dotted lines represent the scales of user acceptance and satisfaction: the horizontal dotted lines represent satisfaction and the vertical lines are aligned with the user’s acceptance scores.
Kano model questionnaire
Subject’s expectation fulfilled (functional form)
If you see that the subject can perform operation X, how does he or she feel?
(a) He likes it in that way
(b) It must be that way
(c) The way he is interested
(d) He can live with that way
(e) He dislikes the way
Subject’s expectation not fulfilled (dysfunctional form)
If you see that the subject cannot perform operation X, how does he or she feel?
(a) He likes it in that way
(b) It must be that way
(c) The way he is interested
(d) He can live with that way
(e) He dislikes the way
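Each pair of functional and dysfunctional answers is usually mapped to a Kano category: attractive (A), one-dimensional (O), must-be (M), indifferent (I), reverse (R) or questionable (Q). The Python sketch below uses the evaluation table commonly given in the Kano literature. Two points are assumptions made here rather than statements from this paper: that the paper uses this same table, and that options (a) to (e) above correspond to the standard answers like, must-be, neutral, live with and dislike, with option (c) taking the neutral position.

```python
# Rows: answer to the functional question; columns: answer to the
# dysfunctional question. Keys are the option letters (a)-(e), assumed
# to mean: a = like, b = must-be, c = neutral, d = live with, e = dislike.
KANO_TABLE = {
    "a": {"a": "Q", "b": "A", "c": "A", "d": "A", "e": "O"},
    "b": {"a": "R", "b": "I", "c": "I", "d": "I", "e": "M"},
    "c": {"a": "R", "b": "I", "c": "I", "d": "I", "e": "M"},
    "d": {"a": "R", "b": "I", "c": "I", "d": "I", "e": "M"},
    "e": {"a": "R", "b": "R", "c": "R", "d": "R", "e": "Q"},
}


def classify(functional, dysfunctional):
    """Return the Kano category for one pair of answers."""
    return KANO_TABLE[functional][dysfunctional]


def category_counts(answer_pairs):
    """Count how often each category occurs over all observed pairs."""
    counts = {c: 0 for c in "AOMIRQ"}
    for functional, dysfunctional in answer_pairs:
        counts[classify(functional, dysfunctional)] += 1
    return counts


# Hypothetical observations for one operation, one pair per subject.
pairs = [("a", "e"), ("b", "e"), ("a", "c"), ("c", "c")]
counts = category_counts(pairs)
```

A subject who likes being able to perform the operation and dislikes being unable to (`("a", "e")`) is classified as one-dimensional, while contradictory answers such as `("a", "a")` are flagged as questionable rather than counted toward any requirement type.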
Kano model evaluation
The user satisfaction coefficient (US-coefficient)
According to Kano’s explanation, the positive US-coefficient (satisfaction) ranges from 0 to 1; the closer the value is to 1, the higher the influence on user satisfaction. The minus sign in the dissatisfaction equation, Eq. (3), indicates a negative influence on customer satisfaction if this product quality is not fulfilled. If the dissatisfaction value approaches −1, the influence on customer dissatisfaction is especially strong when the product feature is not fulfilled. A value of about 0 signifies that the feature does not cause dissatisfaction. The satisfaction and dissatisfaction scores are based on the frequencies of M, O, A, I and R. An evaluation rule can also be followed: M > O > A > I (must-be > one-dimensional > attractive > indifferent).
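As a worked illustration of these ranges, the sketch below computes the two coefficients in the form widely used in the Kano literature: satisfaction = (A + O)/(A + O + M + I) and dissatisfaction = −(O + M)/(A + O + M + I). That the paper’s own equations take exactly this form is an assumption, and the function name and category frequencies are hypothetical.

```python
def us_coefficients(counts):
    """Return (satisfaction, dissatisfaction) from Kano category counts.

    `counts` maps category letters to frequencies. Only A, O, M and I
    enter this commonly used form of the coefficients; other keys,
    such as R, are ignored here.
    """
    a = counts.get("A", 0)
    o = counts.get("O", 0)
    m = counts.get("M", 0)
    i = counts.get("I", 0)
    denominator = a + o + m + i
    if denominator == 0:
        raise ValueError("no A, O, M or I responses to evaluate")

    satisfaction = (a + o) / denominator        # lies in [0, 1]
    dissatisfaction = -(o + m) / denominator    # lies in [-1, 0]
    return satisfaction, dissatisfaction


# Hypothetical frequencies for one operation.
example_counts = {"A": 4, "O": 3, "M": 2, "I": 1, "R": 0}
satisfaction, dissatisfaction = us_coefficients(example_counts)
```

With these example frequencies the satisfaction coefficient is 0.7 and the dissatisfaction coefficient is −0.5, so fulfilling the feature would raise satisfaction more than its absence would cause dissatisfaction.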
The quality improvement index (QI)
The value indicates how important the feature, service or process requirement is in terms of quality improvement. The higher the value in the positive range, the higher the relative improvement in quality from the subjective viewpoint. Conversely, the larger the negative value of this index, the higher the relative competitive disadvantage.
Critical incident observation
A critical incident combines what, emotion and why from a given observation. What: provides an in-depth description of the event, written without judgment or interpretation. Emotion: describes the feelings the subject experienced during the incident. Why: explains why the incident was meaningful to the observer, and then puts the observer in the position of the subject to explain, from the subject’s perspective, why the incident was meaningful. Critical incident analysis also considers the position of the observer: which personal beliefs related to expert knowledge did he or she identify when reflecting on the incident, and what would he or she do differently in light of the new understanding? The hybrid approach encompasses the un-weighted NASA-TLX system and the experts’ ratings (weights) of the recorded videos. The matters considered are subjective experiences, functional states, task difficulty and time pressure. The weights are on a 5-1 scale, as in Table 4.
Gradient descent rule
The gradient descent rule requires a number of assumptions to be satisfied in order to converge. To keep it simple for readers from a wide range of disciplines, we set the initial value to 0.5, the average of the normalized maximum (1.00) and minimum (0.00) possible values ((1.00 + 0.00)/2). The total number of iterations required to converge is sensitive to the initial value. For instance, we start with 0.5 as the initial weight and perform five iterations, which are shown in the results section (Table 11).
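The update can be sketched in Python with the delta rule for a single linear unit, w ← w + η (t − w·x) x, which is gradient descent on the squared error ½ (t − w·x)². Only the initial weight of 0.5 and the five iterations come from the text; the learning rate η, the input x and the target t below are hypothetical values chosen for illustration, and the exact error function used in the study is not specified here.

```python
def delta_rule(weight, x, target, learning_rate, iterations):
    """Apply the delta rule repeatedly and return the weight history.

    Each step moves the weight along the negative gradient of the
    squared error 0.5 * (target - weight * x) ** 2.
    """
    history = [weight]
    for _ in range(iterations):
        error = target - weight * x
        weight += learning_rate * error * x
        history.append(weight)
    return history


# Initial weight 0.5 and five iterations as in the text; the remaining
# values are assumptions made for this example.
history = delta_rule(weight=0.5, x=1.0, target=0.9,
                     learning_rate=0.5, iterations=5)
```

In this example the error is halved at every step, so the weight moves from 0.5 toward the target of 0.9 without overshooting; a starting value further from the target would need more iterations to reach the same distance, which is the sensitivity to the initial value noted above.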
Results and discussion
Alpha score on R-MAP dataset
Alpha score: cross-disability communication NASA-TLX data
Alpha value (unweighted)
Alpha value (weighted)
Data reliability is a problem in the cross-disability data analysis. In the un-weighted cross-disability NASA-TLX, all scores lose reliability in terms of alpha (Table 10). Alpha for the weighted NASA-TLX (weighted by subject) shows the worst reliability score of all datasets.
Alpha score in cross-disability dataset improvement
Modified alpha value
Subjective rating scales are used frequently in almost every aspect of research and practice for the assessment of workload, fatigue, usability, annoyance and comfort, and lesser known qualities such as urgency and presence. Biased data can obscure the actual needs of target users, and the inferences obtained may not reveal the true nature of the problem, leading to poor acceptance in the target community. Yet in disability research, the same methods have long been used with the help of an interpreter or moderator-facilitator. We have tried to show the issues related to the data reliability and acceptability of such rating systems in disability studies.
The proposed hybrid approach has several benefits for usability engineering in terms of reliability and user acceptability. According to the reviewer, the idea has found applications in other research and applied practices; for instance, credit rating agencies (e.g. Standard and Poor’s) use qualitative and quantitative methodologies in parallel to produce their scores. The approach is expected to help in gaining a better understanding of exact subjective needs, prioritizing needs for development activities, distinguishing end users’ demands, and aiding the design trade-off process.
Although disability research is mostly user centric, Kano evaluation of rating acceptance in disability studies might give a better understanding of the characteristics of actual needs and help elicit natural requirements in assistive technology design. Gradient descent based weight updating of critical incident scores is proposed as a new approach to qualitative weight adjustment.
In summary, this research combines qualitative scores (using the gradient descent rule) and quantitative scores to create a relatively unbiased rating system. The work analyzes the ‘Forer’ and ‘Hawthorne’ effects in a subjective rating system and proposes a relatively unbiased hybrid rating system. The Cronbach’s alpha test is used to check the reliability of the scores, and a gradient descent algorithm is used to update the alpha score. Kano analysis is used to see whether the subjects are satisfied with the new rating process. This hybridization will be compared with other mechanisms in the future.
This project was partially funded by the National Science Foundation (NSF-IIS-0746790), USA. We gratefully acknowledge the Clovernook Center for the Blind and Visually Impaired, Memphis, TN, USA, and especially Mr. Majid Khan, for continuous cooperation. We also gratefully acknowledge the reviewers’ encouraging comments and valuable suggestions for updating different parts of the paper.
The author declares that he has no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
1. Sears A, Hanson VL (2012) Representing users in accessibility research. ACM Trans Access Comput 4(2):7. https://doi.org/10.1145/2141943.2141945
2. Dickson DH, Kelly IW (1985) The ‘Barnum Effect’ in personality assessment: a review of the literature. Psychol Rep 57(1):367–382
3. Parsons HM (1991) Hawthorne: an early OBM experiment. J Org Behav Manag 12(1):27–43
4. Brown RT, Reynolds CR, Whitaker JS (1999) Bias in mental testing since bias in mental testing. Sch Psychol Q 14(3):208
5. Hart SG, Staveland LE (1988) Development of NASA-TLX (task load index): results of empirical and theoretical research. Hum Mental Workload 1(3):139–183
6. Hart SG (2006) NASA-task load index (NASA-TLX); 20 years later. In Proceedings of the human factors and ergonomics society annual meeting. Sage Publications, Thousand Oaks, pp 904–908
7. Hossain G, Shaik AS, Yeasin M (2011) Cognitive load and usability analysis of R-MAP for the people who are blind or visual impaired. In Proceedings of the 29th ACM international conference on design of communication. ACM, New York, pp 137–144
8. Hossain G, Yeasin M (2013) Collaboration gaps in disabilities sensemaking: deaf and blind communication perspective. In Proceedings of the ACM conference on computer supported cooperative work (CSCW), CIS Workshop 2013, Feb 24, 2013, San Antonio
9. Krantz S, Hammen CL (1979) Assessment of cognitive bias in depression. J Abnorm Psychol 88(6):611
10. Marks DF (2000) The psychology of the psychic, vol 2. Prometheus Books, New York
11. Andrich D (1978) A rating formulation for ordered response categories. Psychometrika 43:357–374
12. Friedman HH, Amoo T (1999) Rating the rating scales. J Mark Manag 9(3):114–123
13. Cronbach LJ (1951) Coefficient alpha and the internal structure of tests. Psychometrika 16:297–333
14. Kano N, Seraku N, Takahashi F, Tsuji S (1984) Attractive quality and must-be quality. J Jpn Soc Qual Control 14(2):39–48
15. Redpath L, Stacey A, Pugh E, Holmes E (1997) Use of the critical incident technique in primary care in the audit of deaths by suicide. Qual Health Care 6(1):25–28
16. Russell I (2012) The delta rule. University of Hartford, West Hartford. Accessed 5 Nov 2012
17. Annett J (2002) Subjective rating scales: science or art? Ergonomics 45(14):966–987
18. Pollack I (1965) Iterative techniques for unbiased rating scales. Q J Exp Psychol 17(2):139–148
19. Krauss-Whitbourne S (2012) When it comes to personality tests, skepticism is a good thing. Psychol Today. Accessed 25 Nov 2012
20. Nielsen J (1992) The usability engineering life cycle. Computer 25:12–22
21. Thurstone LL (1927) A law of comparative judgement. Psychol Rev 34:278–286
22. Kantowitz BH, Roediger HL III, Elmes DG (2009) Experimental psychology. Cengage Learning, Boston
23. Schmitt N (1996) Uses and abuses of coefficient alpha. Psychol Assess 8:350–353. https://doi.org/10.1037/1040-35126.96.36.1990
24. Zinbarg R, Yovel I, Revelle W, McDonald R (2006) Estimating generalizability to a universe of indicators that all have an attribute in common: a comparison of estimators for alpha. Appl Psychol Meas 30:121–144. https://doi.org/10.1177/0146621605278814
25. Zinbarg R, Revelle W, Yovel I, Li W (2005) Cronbach’s α, Revelle’s β, and McDonald’s ωH: their relations with each other and two alternative conceptualizations of reliability. Psychometrika 70:123–133. https://doi.org/10.1007/s11336-003-0974-7
26. Dunn TJ, Baguley T, Brunsden V (2013) From alpha to omega: a practical solution to the pervasive problem of internal consistency estimation. Br J Psychol. https://doi.org/10.1111/bjop.12046