Cognitive interviewing is a versatile method for exploring how people perceive, process, and respond to stimuli. It is a popular approach for studying UI/UX (user interfaces and user experience) in industry, as well as for exploring how people perceive questions and questionnaires in marketing. In the social sciences, it is often used to study cognitive validity or “actual” validity.
Therefore, before going into cognitive interviews themselves, we will briefly explain cognitive validity in the context of measurement instruments and manipulations. We will then deepen the discussion by delving into response processes and response models, and finally explain how cognitive interviews can be used to study the validity of a measurement instrument or manipulation.
Cognitive validity usually refers to whether people interpret instructions, questions and questionnaires as intended. Specifically, cognitive validity refers to whether participants interpret the procedures, stimuli, and response registrations that form a measurement instrument (or manipulation) as intended.
Examples of manipulations are elements of behavior change interventions (e.g., the application of a behavior change principle such as modeling or persuasive communication), ingredients of psychotherapy (e.g., an approach to help people with reattribution, or protocol elements designed to foster trust in the therapist-client relationship), or manipulations as used in psychological experimental studies (e.g., stimuli such as sound fragments that should induce a feeling of stress, a procedure designed to temporarily increase participants’ self-esteem, or a set-up designed to produce the smell of apple pie).
Examples of measurement instruments are questionnaires (e.g., an experiential attitude scale designed to measure people’s feelings towards hand washing, a survey aiming to map out people’s perceptions of traffic safety, or personality indices) or response latency tasks (e.g., the Implicit Association Test).
Both manipulations and measurement instruments consist of procedures and usually stimuli, and measurement instruments also specify how to register participants’ responses. For the manipulations and measurement instruments to be valid for a given population in a given context, usually a first requirement is that these constituent procedures, stimuli, and response registrations are interpreted as intended. Cognitive interviews are a method to study whether this is the case.
We will use the term “item” to refer to one single element of a measurement instrument or manipulation. A single measurement instrument item corresponds to the registration of a response. For example, in a questionnaire, each question is an item, and in a response latency task, each trial is an item. In a manipulation, each combination of stimuli and procedural information that is comprehensible on its own (or more accurately, that is capable of producing the desired effect on its own) is an item. For example, manipulation items can be single auditory stimuli (audio clips or songs), videos, images, paragraphs of text, or (elements of) protocols. Often, a manipulation will be a single item, intended to be interpreted as a whole.
In the rest of the chapter, we will use questionnaire items as examples. Remember that everything also applies to other types of measurement instruments (such as response latency tasks) as well as to manipulations. Anything that was designed to produce some specific effect or response can be explored to some degree with cognitive interviews.
A useful concept when thinking about how people perceive items (i.e., procedures, stimuli, and methods to register responses) is response processes. For measurement instruments, people’s response processes are the processes that produce the response that is then registered (“measured”). For manipulations, response processes are the processes produced in response to the item (not necessarily resulting in a response that is registered).
Response processes usually start the same way: with people’s perception of an item. After the item is perceived, people interpret what they perceived and further process that information. Cognitive validity refers to how “correct” these response processes are.
As an example, let us take the infamous Pickle Fanaticism Scale. Its first iteration has 33 items in three subscales (“Strong desire to eat pickled products”, “Extreme liking of pickled products”, and “Feels the need to evangelize about the benefits of pickles”). Below are three items, one from each subscale:
- I often contemplate the role of pickles in a post-modern society
- I always eat pickles for breakfast
- I recommend pickles to people I know
The response scale is the same for each of the 33 items: “Strongly Disagree” (1); “Disagree” (2); “Neither agree nor disagree” (3); “Agree” (4); and “Strongly Agree” (5).
Each item can be considered to be formed by six stimuli: the question text (the sentence) and each of the five response options. Each of these stimuli, as well as the procedure determining how they are laid out on the page, can be interpreted differently. These interpretations drive further processing that ultimately results in a response that is registered.
With respect to cognitive validity, this means that for an item to have cognitive validity, each of these stimuli must be interpreted as intended. For example, participants may have different definitions of “often” (as in the first item; e.g., “more than half the time” or “more than two-thirds of the time”), “post-modern society” (also in the first item; a term that quite a lot of people may not be familiar with), “always” (second item; e.g., “100% of the time” or “90% of the time or more”), and how broadly they define the group of “people I know” (i.e., whether this includes distant acquaintances or not). If the stimuli are not interpreted as intended, participants effectively answer a different question.
In an influential book about the psychology of survey responses, Tourangeau, Rips, and Rasinski (2000) describe a model with four components: comprehension, retrieval, judgment, and response. Each of these is subdivided into specific processes: comprehension contains attending to questions and instructions; representing a logical form of the question; identifying the question focus (i.e., what is being asked); and linking key terms in the question to relevant concepts. The retrieval component is subdivided into generation of a retrieval strategy and cues; retrieving memories; and filling in missing details. Judgment consists of assessing the completeness and relevance of the memories; drawing inferences based on accessibility; integrating the material; and making an estimate based on partial retrieval. Finally, the response component consists of mapping the judgment onto the response options and editing the response.
Because many different types of measurement instruments exist, there is no single response process model that is useful across the board as a model for how people process questions. As a consequence, multiple response process models exist. However, this generic model is a useful scaffold when thinking about what happens when people respond to a survey. In addition, it can be used to organize the results of cognitive interviews, discussing which items exhibit problems with each of the four components in this model (comprehension, retrieval, judgment, and response).
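To organize cognitive interview results by the four components of this model, a simple tally structure can suffice. The sketch below is a hypothetical Python illustration (the function and variable names are our own, not from the literature):

```python
from collections import defaultdict

# The four components of the Tourangeau, Rips, and Rasinski (2000) model.
COMPONENTS = ("comprehension", "retrieval", "judgment", "response")

def tally_problems(observations):
    """Tally problems noted during cognitive interviews, per item and
    per model component. `observations` is an iterable of
    (item_id, component) pairs; returns {item_id: {component: count}}."""
    tally = defaultdict(lambda: dict.fromkeys(COMPONENTS, 0))
    for item_id, component in observations:
        if component not in COMPONENTS:
            raise ValueError(f"unknown component: {component!r}")
        tally[item_id][component] += 1
    return {item: dict(counts) for item, counts in tally.items()}
```

For example, two comprehension problems and one judgment problem noted for a given item would show up as counts of 2 and 1 for that item, making it easy to see which items exhibit problems with which component.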
Cognitive interviews use two methods to explore participants’ response processes: the think-aloud method and verbal probes.
The think-aloud method does pretty much what it says on the tin: participants are asked to think aloud. This is of course less trivial in practice: thinking aloud is (for most people) not natural, so participants need some practice and often have to be reminded a few times.
When using the think-aloud method, it is important that the interviewer limits themselves to reminding the participants to think aloud, for example by asking “what are you thinking?”
Because during the actual cognitive interview, the interviewer cannot say much else, training participants in thinking aloud is important. To train them, prepare a number of questions or scenarios you can use. For example, Willis (1999) suggests the following:
“Try to visualize the place where you live, and think about how many windows there are in that place. As you count up the windows, tell me what you are seeing and thinking about.”
Willis, 1999, p. 4
For some participants, multiple such tasks may be required before they get the hang of it, so you want to prepare several different tasks to have them ready when required.
If the cognitive interview is not recorded, the interviewer makes notes during the interview of their interpretations of the participants’ response processes on the basis of their thinking aloud. Of course, if the interview is recorded, notes can still be helpful to capture details that the recording will not.
A benefit of think-aloud approaches is that there is minimal risk of interfering with participants’ ‘natural’ response processes. On the other hand, despite the training in thinking aloud, participants can vary quite a bit in how comprehensively they think aloud. In addition, they will not generally elaborate on specific issues that may be relevant to you as a researcher.
Therefore, the think-aloud approach is often combined with verbal probing. Probes are questions that you prepare to, well, probe specific issues. For example, you may anticipate that participants may heterogeneously define the group of “people they know”, as in the question “I recommend pickles to people I know”. If that is the case, you may want to prepare a series of probes to explore these interpretations. For example, you could prepare probes such as “When you answered this question, what did ‘people I know’ mean to you?” or “To you, what determines when you know somebody?”.
There are more extensive guidelines in ‘Cognitive Interviewing: A “How To” Guide’ by Willis (1999), available at https://bit.ly/wbm-willis-ci. A comparison of different types of verbal probes is available in ‘Which probes are most useful when undertaking cognitive interviews?’ by Priede, Jokinen, Ruuskanen & Farrall (2014), available at https://bit.ly/wbm-priede-ci-probes.
A number of commonly used cognitive interview coding schemes exist [some are listed below; for more details, see Peterson et al. (2017) and Woolley et al. (2006)]. Whether you use existing coding schemes, develop your own, use existing schemes with more detailed codes added relating to your specific response models, or even have multiple coders code using different coding schemes is ultimately a subjective, scientific, and pragmatic consideration. As with all decisions you take in any scientific endeavour, the most important thing is to clearly, comprehensively, and transparently document your decision and the underlying justification.
Peterson, Peterson & Powell
- Is the item wording, terminology, and structure clear and easy to understand?
- Has the respondent ever formed an attitude about the topic? Does the respondent have the necessary knowledge to answer the question? Are the mental calculations or long-term memory retrieval requirements too great?
- Is the question too sensitive to yield an honest response? Is the question relevant to the respondent? Is the answer likely to be a constant?
- Is the desired response available and/or accurately reflected in the response options? Are the response options clear?
- Do all of the items combined adequately represent the construct? Are there items that do not belong?
Levine, Fowler & Brown
- Items with unclear or ambiguous terms, where respondents failed to understand the questions consistently.
- Items for which respondents lacked the information to answer the question.
- Items measuring constructs that are inapplicable for many respondents (e.g., items that made unwarranted assumptions).
- Items that failed to measure the intended construct.
- Items making discriminations that are too subtle for many respondents.
- Several other general issues associated with the development of a questionnaire:
  - Problems with the intent or meaning of a question.
  - Respondents likely not to know, or to have trouble remembering, the information.
  - Problems with assumptions or underlying logic.
  - Problems with the response categories.
  - Sensitive nature or wording/bias.
  - Problems with introductions, instructions, or explanations.
  - Problems with lay-out or formatting.
Cognitive validity is a useful concept even when thinking about items that are used to, for example, collect information about people’s age, gender, or education level. However, in psychology, items are often used to measure psychological constructs. In such cases, cognitive validity is still important, but mostly as a first prerequisite for “regular” validity.
This “regular” validity is a complex concept, and various definitions and approaches have been proposed over time (see e.g., Borsboom et al., 2004). It is generally defined as whether the item (or set of items aggregated into a measurement instrument) measures what it is supposed to measure. Although this is the commonly used definition, the commonly used procedures for judging validity do not align with it. These issues are discussed in depth by Borsboom et al. (2004) and Borsboom et al. (2009), and the disconnect is clearly illustrated by Maul (2017).
One solution is to abandon the construct-instrument correspondence usually seen as the core of validity, and define validity as a function of the interpretation and use of an item or measurement instrument (Kane, 2013). However, Borsboom et al. (2009) argue that this also seems untenable, instead proposing what they call “test validity” as a solution. This test validity is in essence the validity people talk about when talking about validity without specifying a type: whether a measurement instrument measures what it is supposed to measure.
Borsboom, Cramer, Kievit, Zand Scholten, and Franic (2009) argue that almost all measurement models are reflective measurement models: put simply, they assume that item scores reflect an underlying latent construct. They further argue that the model of how the construct of interest causes these scores is central to validity, or as they put it, “In our view, in fact, establishing the truth of such a model would clinch the question of validity.” (Borsboom et al., 2009, p. 155). They argue that research into validity should concern itself with response processes to establish, in short, how a given item or measurement instrument works: “If successful, such a research program therefore solves the problem of test validity, because it by necessity becomes clear what the items measure and how they measure it. And that is all there is to know regarding validity.” (Borsboom et al., 2009, p. 155)
As an instrument to think about and explicate one’s theory of how an item or measurement instrument works, the concept of response models can be helpful. We define a response model here as the intended process leading to the registration of the participants’ responses, starting with their exposure to the stimuli and procedures that, together with the response registration procedure, constitute an item. An item is a distinct element of a measurement instrument or manipulation: these typically (but not necessarily) consist of multiple items. Although the logic underlying response models as a means to specify what happens between somebody perceiving stimuli and producing a response applies to manipulations as well as to measurement instruments, in the remainder of this chapter we will discuss measurement instruments, as they explicitly contain a procedure for registering responses.
A response model describes how the construct that an item is designed to assess causes the variation in the item’s scores (if the item performs as it is supposed to, i.e., if the item is valid; see Borsboom et al. (2004) and Borsboom et al. (2009) for more background and for how this approach contrasts with Kane’s (2013) argument-based approach). It describes how more fundamental psychological processes are invoked, for example how the relevant reflective and/or reflexive, cognitive and/or affective, deliberate and/or automatic constructs, such as mechanisms, processes, or representations, ultimately produce the response that the item registers.
Note that the type of response model can differ as a function of one’s ontological and epistemological perspective. From more constructivist perspectives, the response model may involve shared construction of meaning; perspectives tending towards realism might lean more heavily on attention or memory processes; and if one entertains an operationalist perspective, response models might be exceedingly rudimentary (though admittedly, researchers with that perspective would probably not engage in cognitive interviews in the first place).
As discussed above, participants’ response processes are the description of what happens as they perceive, interpret, and process an item and ultimately produce the response that is registered by the item’s response registration procedure. From a validity perspective, then, these response processes ideally closely reflect the item’s response model.
The form of a response process is typically quite different from the form of the response model. The former is often derived from participants’ verbal descriptions that express the results of introspective efforts; whereas the latter is often a description of the involved theoretical constructs and mechanisms (see the previous section). For example, the latter can contain automatic or unconscious processes, which would, almost literally by definition, be unavailable to introspection. Therefore, complete overlap between response processes and the desired response models may be impossible.
Therefore, not all parts of the response models for all items can always be verified using cognitive interviews; other methods may have to be invoked, such as experiments where items are manipulated to verify parts of the item’s response model. This means you will have to first decide how each part of the response model can be verified, and then for the cognitive interviews, select those parts of each item’s response model that in fact lend themselves to verification with cognitive interviews.
Once you have made this selection, you have a list of response model parts for each item. Often, the response models for a set of items that belong to the same measurement instrument will overlap. To illustrate this, we will give two examples of common situations. First, sometimes you assume that two items are so-called “parallel items”. These are items that you assume measure the exact same thing in the exact same way, assuming a reflective measurement model. Such items are, for all practical purposes, interchangeable; in such cases, the response models will be identical.
Second, sometimes you have two items that measure very different things. For example, one item is designed to measure somebody’s income, and a second item is designed to measure somebody’s education. These can both be part of a measurement instrument for socio-economic status that assumes a formative measurement model. The response models for these two items will be quite different.
Once you have made this selection, you should have the following:
- for each item, a response model;
- for each part of each response model, a decision as to whether you think it’s feasible to study whether those parts of people’s response processes are consistent with the corresponding parts of the item’s response model;
- for each part of each response model, an overview of which other response models it also occurs in.
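One hypothetical way to keep track of these three pieces of information is a small data structure per response model part. The sketch below is illustrative only; the class and field names (and the pickle-themed item identifiers) are made up for this example:

```python
from dataclasses import dataclass, field

@dataclass
class ResponseModelPart:
    """One part of an item's response model, with the bookkeeping
    described above."""
    description: str      # what should happen in this part of the process
    interviewable: bool   # can this part be studied with cognitive interviews?
    also_in: list = field(default_factory=list)  # items sharing this part

# Response models as a mapping from item identifiers to lists of parts.
response_models = {
    "pickles_breakfast": [
        ResponseModelPart("interprets 'always' as 100% of the time", True),
        ResponseModelPart("retrieves memories of recent breakfasts", True,
                          also_in=["pickles_recommend"]),
    ],
}
```

Listing shared parts explicitly (the `also_in` field) makes the overlap between response models visible, which helps when the same part later needs only one entry in the coding scheme.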
You need one more piece of information before you can move towards specification of the coding schemes and the preparation of prompts: the response process spectrum.
For every part of each response model you want to verify, specify the corresponding response process spectrum. This is, as far as you can think of, a list of all possible alternatives to the response model. For example, if your response model contains a step “the person visualises the front wall of their house” (say, in an item measuring the number of windows in somebody’s house), a potential response process spectrum could be:
- the person visualises the front wall of their house (i.e. the response model)
- the person visualises the back wall of their house
- the person visualises a side wall of their house
- the person does not visualise anything
During the actual cognitive interview, you will likely discover that people’s response processes deviate from the response model in ways you couldn’t imagine beforehand, and that’s ok. The main purpose of this step is to help you get an idea of the kinds of things you’ll want to spot in each part of each response model.
Once you have produced the response process spectrum for all response model parts for all (unique) response models, you can start compiling your coding scheme.
Based on the response process spectra for each part of each response model, you can now produce codes that you will use to code the notes (or maybe transcripts) of your cognitive interviews. These codes will be the “glasses” through which you will see your results: although you can always add new codes during the coding phase, in general, it is easy to miss things for which you did not prepare a code.
For each part of each response model, think of a brief code that tells you how people’s response processes look. The coding scheme can be hierarchical: you can have “sub-codes” or “child codes” to organize your codes. For every code, designate a code identifier: a unique string of characters consisting only of lower case letters (a-z), upper case letters (A-Z), digits (0-9), and underscores (_), always starting with a letter. If you have hierarchical codes, you can indicate the hierarchy using a greater-than sign (>). Examples of valid codes are “comprehension”, “retrieval_strategy”, and “comprehension>key_terms”.
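These identifier rules map directly onto a regular expression. As a minimal sketch (the function name is ours), code identifiers could be validated like this:

```python
import re

# A code part starts with a letter and contains only letters, digits,
# and underscores; hierarchical codes chain parts with ">".
CODE_PART = r"[A-Za-z][A-Za-z0-9_]*"
CODE_PATTERN = re.compile(rf"{CODE_PART}(?:>{CODE_PART})*")

def is_valid_code(code: str) -> bool:
    """Check a (possibly hierarchical) code identifier against the rules."""
    return CODE_PATTERN.fullmatch(code) is not None
```

For example, `is_valid_code("comprehension>key_terms")` returns `True`, while identifiers starting with a digit or containing spaces are rejected.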
Finally, once you have your coding scheme, you can craft your prompts.
Prompts are specific things you can ask participants during the cognitive interview designed to elicit expressions that are informative as to specific parts of their response processes. Designing these is relatively straightforward once you have your coding scheme: you have to think about which questions you can ask that are likely to lead to answers that you can then code with specific codes pertaining to specific parts of the response process.
Of course, if you manage to formulate prompts that can cover multiple parts of participants’ response processes, that’s more efficient. Therefore, asking open-ended questions (e.g., “why did you provide that answer?”) is a popular approach. However, sometimes closed questions can be very efficient to quickly check whether people did a given thing (e.g., “to arrive at this estimate, did you visualise the front wall of your house?”).
The product of this step will be a list of prompts, sorted in the order in which the items will be presented to participants. You may want to assign unique identifiers to the prompts (e.g., numbers, letters, or a combination of these) to structure the notes you take during the cognitive interview, or even enter your notes into a file that already contains the prompts.
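Such a prepared notes file can be generated mechanically from the prompt list. The sketch below is hypothetical (the identifiers are invented; the prompt texts come from the pickle example above):

```python
# Prompts with unique identifiers, sorted in presentation order.
prompts = [
    ("P1", "When you answered this question, what did 'people I know' mean to you?"),
    ("P2", "To you, what determines when you know somebody?"),
]

def notes_template(prompts):
    """Return a plain-text template with one note slot per prompt."""
    lines = []
    for prompt_id, text in prompts:
        lines.append(f"[{prompt_id}] {text}")
        lines.append("Notes:")
        lines.append("")  # blank line separating prompts
    return "\n".join(lines)
```

Printing or saving the result of `notes_template(prompts)` yields a file you can type notes into during the interview, with every note automatically tied to its prompt identifier.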
Using existing coding schemes (see the section above) has advantages and drawbacks. A salient advantage is that if you use an existing coding scheme, you don’t have to map out the response models and response process spectra for each item. This saves you a lot of time, and potentially frustration and uncertainty if you don’t know much about the relevant response models. Another advantage is that using an existing coding scheme facilitates comparison of item performance (and so, measurement instrument performance) over different cognitive interview studies.
A big disadvantage is, equally, that if you use an existing coding scheme, you don’t have to map out the response models and response process spectra for each item. Those exercises force you to think long and hard about the assumptions underlying each item’s validity (and so, the validity of your measurement instrument), and skipping them makes it more likely that you miss problems. A second disadvantage is that it is harder to see how to improve the items, since the results of your cognitive interview will be very generic; they will not point to specific parts of the response models.