Chapter 5 Initial codes

In 1974, Richard Feynman gave a commencement address at Caltech, a university in the United States, where he not only coined the term “cargo cult science”, but also formulated a very succinct definition of scientific integrity:

The first principle is that you must not fool yourself – and you are the easiest person to fool. So you have to be very careful about that. After you’ve not fooled yourself, it’s easy not to fool other scientists. You just have to be honest in a conventional way after that.

In research, fooling yourself is exceedingly easy due to what has been referred to as “researcher degrees of freedom” (Simmons et al., 2011) or “the garden of forking paths” (Gelman & Loken, 2014): conducting a study involves making hundreds of decisions, varying from trivial to critical, and each of these decisions is an opportunity to fool yourself unintentionally (or, for that matter, intentionally).

Since humans have a lot to gain from effectively fooling themselves much of the time, it is very hard to realize when you are fooling yourself. Doing research with scientific integrity therefore requires availing oneself of an array of supporting tools. In qualitative research, one of these tools is the practice of carefully developing a set of initial codes.

At first glance, it may seem like developing an initial code set makes sense when planning to engage in deductive coding, but less so when planning to use a more inductive (or ‘open’) approach. However, to avoid fooling yourself, it is always wise to develop an initial set of codes. This practice can be seen as an operationalization of your positionality: exercising reflexivity to explicate your prior ideas about your study in an initial set of codes helps you to avoid fooling yourself. After all, those prior ideas will influence your coding whether you are transparent about them or not. If you did not explicitly elaborate and document them a priori, it is easy to fool yourself into thinking your ultimate coding structure represents insights obtained from the data, when in fact it represents your presupposed ideas instead.

Facilitating transparency about your pre-existing expectations and assumptions isn’t the only benefit of establishing your initial set of codes. A second benefit is that the process of coding becomes more consistent: every coding decision becomes an explicit interaction between the existing code structure and the data fragment(s) you are evaluating (see chapter Coding, chapter 7 in this version of the book). The third benefit is closely related: this more uniform approach to coding makes it easier to comprehensively document and justify your decisions. This, in turn, further facilitates reflexivity.

5.1 Code organization

One of the decisions you will have to take when coding qualitative data is how your codes will be organized. Codes can be organized in any way you want, provided a system is available that supports that organization. A consequence of this (i.e., the available systems constrain the code organizations that can be used) is that in practice, only three organizational modes are common. The mode you select is determined by the organizational structure that you assume exists in the patterns you are looking for in the data. This organizational mode, therefore, is simultaneously a manifestation of your prior beliefs about your research topic and something of a mold that will shape your results.

5.1.1 Flat code organization

Sometimes, codes are simply not organized at all. You then attach codes to fragments of data and leave it at that. You can still refine the codes by working iteratively, but this results in either replacing existing codes attached to a data fragment or attaching additional codes. You select this mode of code organization if you assume that the patterns you are interested in do not exhibit a structure that is mirrored by any of the other available organizational modes.

Because a flat code organization imposes no structure on your data, you do not risk believing you discovered a certain structure in the data while in fact it is just the structure your codes were cast into by the organizational mode you chose. However, the drawback of a flat code organization is that if there is an intrinsic structure to the data that, when reflected in your codes, would help you understand the data better, you will be less likely to discover it.
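To make this concrete, here is a minimal Python sketch (nothing ROCK-specific; all code and fragment identifiers are invented) of what a flat code organization amounts to: a plain mapping from data fragments to unordered sets of codes, with no relationships among the codes themselves.

```python
# A minimal sketch (nothing ROCK-specific): flat coding is a mapping
# from data fragments to unordered sets of codes, with no structure
# among the codes themselves. All identifiers here are invented.
flat_coding = {
    "utterance_001": {"social_contact", "worry"},
    "utterance_002": {"worry"},
    "utterance_003": {"coping", "social_contact"},
}

# The only structure available is what the coded data itself exhibits,
# e.g. which codes were applied to the same fragment:
for fragment_id, codes in flat_coding.items():
    print(fragment_id, "->", ", ".join(sorted(codes)))
```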

5.1.2 Hierarchical code organization

The most common mode of code organization is hierarchical organization. This means that codes are clustered within other codes. This organizational mode fits well with many qualitative paradigms and their approaches where the code structure is refined through iterative coding. Each iterative round then often consists of adding more specific codes as ‘children’ of the codes that the round started with.

The concept of codes representing more specific manifestations of overarching codes is intuitive to many people. In addition, this organizational mode also lends itself well to a combination of deductive and inductive coding: the initial code set then resembles a shallow hierarchical code tree, and these codes become the parent codes of the inductive codes that are added.
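As an illustration, the following Python sketch (with invented code names; again nothing ROCK-specific) represents a shallow, deductively derived code tree to which more specific codes are added as children during an inductive coding round.

```python
# A hypothetical shallow, deductively derived code tree, represented as
# nested dicts; leaves are empty dicts. All code names are invented.
code_tree = {
    "barriers": {
        "practical_barriers": {},
        "emotional_barriers": {},
    },
    "facilitators": {},
}

# After an inductive coding round, more specific codes are added as
# children of the codes that the round started with:
code_tree["barriers"]["emotional_barriers"]["fear_of_failure"] = {}
code_tree["facilitators"]["peer_support"] = {}

def print_tree(tree, indent=0):
    """Print the code tree, with indentation reflecting nesting depth."""
    for code, children in tree.items():
        print("  " * indent + code)
        print_tree(children, indent + 1)

print_tree(code_tree)
```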

5.1.3 Network code organization

Organizing codes in a network is both more complicated and more versatile than the other two organizational modes. The tree structures used for hierarchical code organizations can be represented using networks, and so can flat code structures. However, networks can also be used to represent causal relationships expressed in a dataset, or specific types of structural relationships. Therefore, choosing this organizational mode means you have quite clear ideas about the structure of what you are looking for, and that entails both the highest potential for introducing bias and the highest potential for accurate representation of your data.

A clear benefit of the network organizational mode is that the relationships between two codes can vary. In a hierarchical code structure, the relationship between a parent code and a child code is itself normally not coded: the meaning of that parent-child relationship is therefore implicit, and usually the same for the entire code tree. If it is not the same for the entire tree, this risks introducing some ambiguity: for example, in some branches parent-child relationships may represent a structural relationship (e.g. “children are parts of their parents”), whereas in other branches parent-child relationships may represent a specifying relationship (e.g. “children are more specific concepts than their parents”).

When using network coding, every relationship is explicitly coded. This clearly captures the coder’s belief that a data fragment expresses, for example, a causal relationship (“coding causes headache”), a structural relationship (“qualitative research is a type of research”), or an attributive relationship (“qualitative data is very rich”).

Perhaps the biggest benefit is that the structure of the codes is not constrained in any way. A code (e.g. “coding”) can simultaneously be a cause of another code (“causes headache”), have an attribute (“takes time”), and have parts (“choosing a code identifier”), and this can all be represented in the code structure with high fidelity. Generally, then, the network organizational mode allows coding to stay closer to the data: it requires less imposition of a given (e.g. hierarchical) structure.
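The following Python sketch, again with invented codes and relationship labels, shows how such a network can be represented as a list of typed edges, so that one code can simultaneously participate in causal, attributive, and structural relationships.

```python
# A minimal sketch of a network code organization: every relationship
# is itself typed. Codes and relationship labels are invented; this is
# not a ROCK specification.
edges = [
    ("coding", "causes", "headache"),                     # causal
    ("coding", "has_attribute", "takes_time"),            # attributive
    ("choosing_a_code_identifier", "part_of", "coding"),  # structural
]

def relations_of(code, edges):
    """Return every edge in which the given code participates."""
    return [edge for edge in edges if code in (edge[0], edge[2])]

# "coding" participates in causal, attributive, and structural
# relationships at the same time:
for source, relation, target in relations_of("coding", edges):
    print(f"{source} --{relation}--> {target}")
```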

A drawback related to this versatility is that networked code structures can quickly become complex. In addition, once multiple sources have been coded, the resulting code structures can be hard to combine. For example, in one source, the causal relationship “coding has a positive causal relationship with headache” can be coded, whereas in another source, the causal relationship “coding has a negative causal relationship with headache” can be coded. Some people may get a headache from coding, but for others, coding may help to ameliorate headaches. Reconciling such code structures can be a hard task.

5.1.4 Combining and changing organizational modes

If your situation calls for it, you can also combine multiple organizational modes. In addition, if you initially suspected that a given organizational mode would fit well, but during coding, you discover that you were wrong, you can always switch. That does entail some costs, though, which become higher as you switch later in the process: you will have to do more and more recoding, after all. Therefore, it pays to reflect thoroughly on the organizational mode with which you start: this is not a decision you should take lightly.

5.2 Compiling your initial set of codes

Once you have decided on the organizational mode for your initial code set, you can start thinking about the codes populating it. In this process, it helps to determine where you are located on the spectrum from the intention to work fully deductively to the intention to work fully inductively.

On the one extreme of this spectrum, where you aim to work fully deductively, you only apply codes that you developed in advance. The code structure you end up with is identical to the code structure you started with. In this scenario, you already have one or more theories or other guiding principles that gave rise to this initial code structure, and you’re not interested in adapting that structure based on what you encounter in the data; instead, you’re usually mainly interested in the fragments of data you end up labeling with each code. At this end of the spectrum, developing an initial code structure is relatively easy, since explicitly specified theories or guiding principles already exist.

On the other extreme of this spectrum, where you aim to work fully inductively, your situation is the opposite: you try to be as blank a slate as possible, and to represent whatever patterns you find in the data as faithfully as possible. However, nobody who has developed the competence to conduct a qualitative study, or to code data, is truly a blank slate: people have many representations of the world, and these influence how they interpret fragments of that world such as qualitative data. At this end of the spectrum, it can be quite hard to develop an initial code structure, as this requires considerable self-interrogation to unearth the various relevant expectations and preconceptions that form the breeding ground for the produced codes.

As the word ‘extremes’ implies, these two positions are quite rare. Usually, researchers find themselves somewhere on the spectrum in between. To offer guidance for developing an initial code structure wherever you may find yourself, the approach from each extreme will be discussed, so you can take the elements that suit your specific situation.

5.2.1 Developing initial codes based on extant specifications

If you are working from extant specifications, such as a theory (or several) or another collection of ideas that will guide your coding, the following steps can be useful to develop your initial set of codes. The order in which these steps are presented isn’t meant to be prescriptive; the process of fleshing out your initial code structure is often iterative and somewhat messy.

  • Think about the concepts and relationships defined in the theory and decide for which ones you want to identify corresponding data fragments.

  • Once you have that list ready, discuss it with everybody in your project team (i.e., all collaborators) as well as relevant external stakeholders (e.g., representatives of groups relevant to your study, such as the population you’re interested in). Make sure everybody agrees on this initial list of concepts.

  • For everything you want to code, write up a comprehensive preliminary definition. “Comprehensive” here means that it should be so specific and accurate that discussing it with others will reveal (potentially subtle) differences in each individual’s definition.

  • Once you have comprehensive preliminary definitions, discuss these with all collaborators and external stakeholders. This process will often result in decisions to split or merge or otherwise reconfigure the concepts and relationships, and can take quite some time.

  • At this point, you have a list of things you want to code, and your team has shared definitions of what these things are. The next step is to develop coding instructions – this is discussed in the section below.

5.2.2 Developing initial codes based on personal preconceptions

If you are working without pre-existing specifications to guide your coding, this means the codes you will apply to the data will be the product of both the data and your mind. You see the world in a specific way based on your life so far, and this produces the lens through which you will perceive the data, and so defines the language and conceptual elements you will use to devise codes.

Therefore, when developing initial codes, your aim is to use the fact that a study has a relatively narrow contextual scope to reflect on your perspective and how this will guide your inclinations as you code the data in your study.

  • Think about the kinds of things you expect to find in the data. Try to imagine prototypical expressions of those things, and then think about the categories these would fit into.

  • Give every category you come up with a label and a description. In this process, you may reconfigure your categories, as describing them in more detail may reveal similarities and differences that were not immediately obvious. The categories you end up with represent your preliminary initial set of codes.

  • If you will be the only person coding, you can now develop coding instructions. If multiple people will code, you have to decide how to proceed. If you want to retain the differences between coders, as would be the case if each coder did the project individually, each coder develops their own coding instructions. However, if you want the coders to start from the same initial perspective, first discuss the preliminary sets of codes each prospective coder came up with. Through these discussions, merge and reorganize the code sets into one preliminary code set that you all feel represents the conceptual starting point for the project well. Then, you can move on to develop coding instructions to accompany these codes.

5.3 Developing coding instructions

Having an initial list of codes isn’t enough to be able to start coding. The process of coding involves many, many decisions (see chapter Coding, chapter 7 in this version of the book), and so offers many opportunities to fool yourself. Therefore, you will need to support your future self (or other coders) to avoid that. A valuable tool to this end is a clear coding instruction.

Coding instructions bridge the gap between what a code represents on a conceptual level and how you recognize expressions of it in qualitative data. It is not uncommon that, as you develop your coding instructions, discussing them with others will lead to revision of the concepts you want to code. Because coding instructions are more concrete than definitions, this often reveals differences in mental models that are easily overlooked when discussing conceptual definitions. Developing coding instructions is therefore a useful way to make sure your initial code set is sufficiently well described.

5.3.1 Edge cases

An important part of coding instructions is the explicit description of edge cases. Coding instructions often center around the core of the relevant code. This is helpful to identify data fragments that clearly represent expressions of the relevant code, but less helpful when encountering data fragments where matters are less clear-cut.

This is where edge case descriptions come in. Edge cases are data fragments that have certain characteristics matching the relevant code, but other characteristics that could justify not applying that code, or applying a different code instead. Thinking through such edge cases in advance and explicitly describing them helps you to delineate your initial codes. It helps you to clarify, to others as well as to yourself, what exactly you mean by each code.

Edge case descriptions also allow you to specify under which conditions one code should be applied, and under which conditions another code should be applied instead. Thinking about where one code ends and another starts and documenting this in cross-references between your codes is very helpful for others and future you to conceptualize your codes in relation to each other.

5.3.2 Examples

As discussed above, while code labels and descriptions are conceptual, coding instructions are more concrete and bridge the gap to expressions you can encounter in your data. Examples are the most concrete parts of coding instructions, and can be very helpful both for yourself and your collaborators to test whether you’re on the same page regarding what codes represent, and for others to wrap their heads around what exactly should and should not be coded with a code.

In practice, while you think about and discuss your initial code book, it is likely you will already do this using real or fictional data fragments that express something you intend to be captured with a code – or to not be captured with it. These examples are a great starting point for your list of examples of data fragments that should or should not be coded with each code.

Specifically, for every code, try to include at least one example in each of the four categories formed by crossing hits and misses with core and edge cases. Hits refer to examples of data fragments that should be coded with a code, whereas misses refer to examples of data fragments that should not be coded with a code. Core cases refer to examples of data fragments that, when people read your code label and description, are straightforward to classify correctly as a hit or a miss. Edge cases refer to examples of data fragments that are more ambiguous and would be hard to classify correctly without the elaboration you added to your coding instruction when you worked on edge cases.

These two characteristics (hit vs. miss; core vs. edge) combine to form the following four categories of examples (see the sketch after this list):

  • Core hits: Examples of data fragments that should clearly be coded with this code, even if somebody reads only the code label and description (and certainly once they read the coding instruction).

  • Core misses: Examples of data fragments that could be considered to belong to the code if somebody only reads the code label (and maybe still when they read the code description), but where the coding instruction makes clear that they should not be coded with this code.

  • Edge hits: Examples of data fragments that are manifestations of edge cases that fall within the scope of this code. As edge cases, whether these fragments should be coded with this code remains relatively ambiguous even with the coding instruction, but your consideration and explicit discussion of edge cases should help coders classify these data fragments correctly.

  • Edge misses: Examples of data fragments that are manifestations of edge cases that fall outside the scope of this code. As edge cases, whether these fragments should be coded with this code remains relatively ambiguous even with the coding instruction, but your consideration and explicit discussion of edge cases should help coders classify these data fragments correctly.
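As a concrete illustration, the sketch below shows how a single code book entry could record all four example categories. The field names and the example code are hypothetical; they merely mirror the structure described above.

```python
# A sketch of one code book entry recording the four example categories.
# Field names and content are hypothetical; nothing here is prescribed
# by the ROCK.
code_entry = {
    "identifier": "social_contact",
    "label": "Social contact",
    "description": "Mentions of actually interacting with other people.",
    "instruction": (
        "Apply when a fragment describes actual interaction; do not "
        "apply to mere wishes for contact (see the edge cases)."
    ),
    "examples": {
        "core_hits":   ["I called my sister every evening."],
        "core_misses": ["I read a book about friendship."],
        "edge_hits":   ["We only waved at each other, but we did so daily."],
        "edge_misses": ["I really wanted to see my friends again."],
    },
}

# A quick completeness check: every category should have at least one example.
for category, examples in code_entry["examples"].items():
    assert examples, f"missing examples for category: {category}"
```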

If you find that drafting this list of examples is hard, that could be an indication that your codes aren’t sufficiently elaborated. If codes are still vaguely defined, it can be very hard to imagine how they would look if they showed up in qualitative data. Having relatively indeterminate codes does not need to be a problem: this mostly depends on your intention during the coding phase. If you’re mostly towards the deductive end of the deductive-inductive spectrum, it is important that your codes are well-defined, unambiguous, and unequivocal. However, if you intend to engage in mostly inductive coding, having codes that aren’t comprehensively fleshed out is less of a problem. In that case, you’re developing your initial code set as a tool to exercise transparency as to your initial expectations and ideas. If those are relatively vague, it is fine to have this reflected in relatively vague codes.

In situations where relatively underspecified codes are acceptable, you can also decide not to include examples at all. However, do not make that decision too lightly: thinking about examples of the kinds of data fragments you expect to find, and how you would categorize them conceptually, is a very useful exercise even if you aim to engage in mostly inductive coding. As discussed at the beginning of this chapter, it is easy to fool yourself, and though thinking about your expectations and how you see the world can be very hard, it is also a powerful tool to avoid fooling yourself – and others.

5.3.3 Piloting the coding instructions

Depending on how you plan to code in terms of the deductive-inductive coding spectrum, you may want to run a pilot with a small portion of the data. This will allow you to refine the coding instructions and the definitions. Because piloting effectively shifts a number of decisions from the coding stage to the preparation stage, it makes more sense the more deductively you aim to code.

During piloting, you will encounter ambiguities in your coding instructions and corresponding code descriptions and will be able to resolve them. This enhances the consistency with which codes will be attached to data fragments during the actual coding phase.

However, because this process entails interaction with the data and can result in refinement of your code structure, it can justifiably be seen as an activity that belongs in the coding phase. Especially if you take a more inductive approach, therefore, you may prefer not to conduct pilots.

5.4 Segmentation

In some projects, it can be useful to segment data. In fact, working with the ROCK requires at least one level of segmentation: segmentation of the data into utterances, the smallest codeable data fragments. However, in many projects, utterances serve a purely pragmatic function, providing an anchor to attach codes to. Higher-level segmentation is meant to distinguish segments based on some definition. For example, you could segment in such a way that the utterances comprising a participant’s answer to a given interview question are in the same segment, or in such a way that a new segment starts whenever the data covers a different topic. We will first discuss utterances as the lowest-level segmentation, and then go into the higher-level segments.

5.4.1 Utterances

In any ROCK project, you need to decide what constitutes an utterance. Depending on your plans for your analysis, the impact of this decision on your results can be anything from trivial to decisive.

For example, if you aim to use a hierarchical organizational mode to describe people’s thoughts and feelings related to, say, how a global pandemic affected their social lives, it won’t make much of a difference whether you define utterances as words, phrases, sentences, or paragraphs. In this approach, coding serves to organize the data, making it easy to view all data fragments coded with a specific code (or an ancestor of that code). If you also code inductively, your ultimate code structure will in addition be a concise description of the patterns you found in the data.

In this scenario, it doesn’t really matter to which word, sentence, or paragraph exactly you attach a code. One of the main strengths of qualitative data is its richness, and this also means that interpreting data fragments usually requires their (rich) context. When inspecting coded fragments, you therefore usually include a certain number of preceding and following utterances to enable accurate interpretation. You could thus say that you generally code a cluster of utterances: to which one specifically you attach a code doesn’t really matter, because as long as you attach the code to one of them, you achieve your goals (i.e., selectively looking at the data fragments where a code appears, and optionally yielding an ultimate code structure that represents your results).

However, sometimes you have different aims. For example, you might have an epistemological, theoretical, and methodological framework that affords drawing inferences from code co-occurrences. That framework then dictates the conditions within which such inferences are valid. If you think that co-occurrence of two codes is indicative of, for example, the proximity of the concepts those codes represent in people’s mental models of the world, for that inference to be valid the codes have to occur sufficiently close together (what exactly “sufficiently close” means should be defined in the framework that affords the inferences in the first place).

In such situations, it can matter a great deal how utterances are defined. If co-occurrence is determined at the utterance level (instead of at a higher level of segmentation), then how much data constitutes an utterance determines how many co-occurrences will be found. In such situations, it is also important that coding is accurate: whereas in the first scenario, it doesn’t matter much whether a code is attached to one utterance or the next (given that patterns will often manifest in the data diffusely, not cleanly organized within utterances) in the second scenario, the utterance a code is attached to determines which co-occurrences are produced.
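The following Python sketch, using invented codes, illustrates why the utterance definition matters here: the same coded data yields different co-occurrence counts depending on how much data constitutes one unit of analysis.

```python
from collections import Counter
from itertools import combinations

# Invented codes attached to three consecutive utterances:
coded_utterances = [
    {"coding", "headache"},   # utterance 1
    {"time_pressure"},        # utterance 2
    {"headache"},             # utterance 3
]

def cooccurrences(units):
    """Count unordered code pairs that share one unit of analysis."""
    counts = Counter()
    for codes in units:
        counts.update(combinations(sorted(codes), 2))
    return counts

# With utterance-level units, only one pair co-occurs:
print(cooccurrences(coded_utterances))

# If these utterances were merged into one larger unit (e.g. because
# utterances were defined as paragraphs), three pairs would co-occur:
merged_unit = [set().union(*coded_utterances)]
print(cooccurrences(merged_unit))
```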

You will generally find yourself in one of two situations. In the first scenario, as illustrated by the first example, there is no risk that your utterance definition biases your results. In that case, using sentences as utterances seems a nice compromise between accuracy and feasibility. In the second scenario, as illustrated by the second example, there is a considerable risk that your utterance definition biases your results: for example, if your utterances are too large, inferences from co-occurrences are no longer valid, and if they are too small, you miss relevant co-occurrences. In that scenario, however, the very framework that affords drawing inferences from co-occurrences in the first place does so by virtue of a justification grounded in your epistemological, theoretical, and methodological assumptions, which therefore also prescribe the appropriate utterance definition(s).

5.4.2 Higher-level segmentation

In addition to the definition of utterances, sometimes you want to use higher-level segmentation. There are two common scenarios where this is the case. The first is pragmatic: you may want to distinguish different phases of data collection, or different stages in a process. The second is methodological: you may require higher-level segmentation to draw certain inferences, for example when you are interested in code frequencies per segment, or when you want to map networked codes separately for each segment.

Higher-level segmentation is indicated in your sources through coding (but by adding section breaks instead of codes). That means that each type of higher-level segmentation also has a definition and coding instructions. As with codes, even if you do not document those in advance, the segmentation will ultimately still be applied according to some system – documenting it makes that system transparent, but doesn’t create it.

Where definitions and coding instructions for codes capture the essence of the corresponding construct, definitions and coding instructions for segmentation instead capture transitions between segments. For example, if you want segments to represent different questions asked by an interviewer, the coding instruction might instruct coders to look for those questions in the interview transcript and insert section breaks just above each question.
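As a sketch of what such an instruction could boil down to in practice, the following Python snippet inserts a section break above every interviewer question. The break marker and speaker label shown here are assumptions for illustration; consult the ROCK standard for the exact section break syntax your project should use.

```python
import re

# The section break marker below follows the general ROCK delimiter
# style, but its exact form is an assumption for illustration; check
# the ROCK specification for the syntax your tooling expects. The
# speaker label is likewise hypothetical.
SECTION_BREAK = "---<<question_break>>---"
QUESTION_PATTERN = re.compile(r"^INTERVIEWER:")

def insert_breaks(lines):
    """Yield transcript lines, adding a break just above each question."""
    for line in lines:
        if QUESTION_PATTERN.match(line):
            yield SECTION_BREAK
        yield line

transcript = [
    "INTERVIEWER: How did the pandemic affect your social life?",
    "PARTICIPANT: I called my sister every evening.",
    "INTERVIEWER: What helped you cope?",
    "PARTICIPANT: Long walks, mostly.",
]
print("\n".join(insert_breaks(transcript)))
```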

Alternatively, you may want segments to represent turns of talk. In that case, your coding instructions would describe how you define a turn of talk. For example, if one person is speaking, but another briefly interjects to voice agreement, do you consider that a turn of talk? Or do turns of talk require something more, for example that the new speaker expresses something original? Or uses at least a clause? Or do you use another definition?

You may also want something (even) more complicated: for example, you may want segments to distinguish discussion of different topics. In that case you will have to define what a topic change is and how it can be recognized in a source with qualitative data. What exactly you end up choosing depends, again, on the implications of that choice, informed by your epistemological, theoretical, and methodological assumptions. Because these same assumptions inform your analyses and justify the inferences from the results of those analyses, they can also guide how you define topic changes.

5.5 Your initial code book

At this point, you should have a first version of your initial code book. It represents a manifestation of your initial expectations, assumptions, and ideas about what you will encounter in the data. It should normally consist of the following:

  • A decision for a specific organizational mode, ideally with an explicit justification to help other researchers as well as future you understand why you chose that organizational mode;
  • For each code:
    • A unique identifier for the code. Identifiers are machine-readable, always start with a letter, and can only contain a-z, A-Z, 0-9, and underscores (_); see the check sketched after this list.
    • A human-readable label for the code. This is usually how you will refer to it in your manuscript and other presentations of your project.
    • A description of the code where you describe in more detail what it is intended to capture.
    • A coding instruction that can be applied to data fragments to arrive at a decision as to whether that code should be applied to that data fragment or not, ideally including explicit discussion of edge cases and cross-references to other codes where applicable.
    • Ideally, as a part of the coding instruction, examples from the four categories (core hits, core misses, edge hits, and edge misses).
  • For each type of higher-level segmentation, the same elements as for each code.
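The identifier rules above translate directly into a simple check. The following sketch is ours, not part of any ROCK tooling.

```python
import re

# A direct translation of the identifier rules stated above: start with
# a letter, then only a-z, A-Z, 0-9, and underscores.
IDENTIFIER_PATTERN = re.compile(r"^[a-zA-Z][a-zA-Z0-9_]*$")

def is_valid_identifier(identifier):
    return bool(IDENTIFIER_PATTERN.match(identifier))

assert is_valid_identifier("social_contact_2")
assert not is_valid_identifier("2nd_code")        # starts with a digit
assert not is_valid_identifier("social-contact")  # hyphens not allowed
```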

You can include this initial code book when you preregister your study. Publicly freezing it like this is an efficient way to avoid fooling not only yourself, but also others. It makes it much easier to compare the code book you end up with after you have analyzed all the data to the one you started out with. In addition, it helps you to recognize shifts in your understanding based on your interaction with the data, by making departures from your initial expectations more salient and creating clear opportunities for you to justify those decisions.

It is important to realize that, like a preregistration, your initial code book is a plan, not a prison: you publicly freeze it to be transparent, not to create a straitjacket to adhere to dogmatically. Even when your goal is to work completely deductively, your code book may evolve over the course of your coding. That is one of the reasons why preregistration is so important: it helps you keep track of such changes and comprehensively document their justifications.

References

Gelman, A., & Loken, E. (2014). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Psychological Bulletin, 140(5), 1272–1280. https://doi.org/10.1037/a0037714
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632