Chapter 15 A ROCK workflow

This chapter describes, or perhaps more accurately, prescribes, a recommended workflow to use when working with the Reproducible Open Coding Kit (ROCK). Consistent with the aims of the ROCK, this workflow is designed to optimize transparency and reproducibility.

This ROCK workflow leans heavily on the rock R package and the iROCK interface, but of course, any of the actions described can be implemented in other ways as well.

15.1 A basic ROCK workflow

In qualitative studies where the collected data are already clean and only one coder is used, the workflow is very simple. This workflow is explained first: both because it is all some readers will need, and because it will give an impression of the core elements to those readers who will need the more advanced functionality.

15.1.1 A bit of project management

All projects require some minimal management. When working with computers, this management concerns, among other things, how to organise the related files. When working with qualitative data, there will usually be two types of files: files with participants’ personal data, and files without personal data. The latter can safely (and relatively indiscriminately) be synchronized with other computers, while the former requires more care. It is important to have clear procedures for anonymizing data and for making sure the right files are backed up in the right way. For the anonymized transcripts, analysis scripts, and other scientific materials, we recommend using a version control system such as Git. However, file and project management go beyond the scope of this book.

One trick that does fall within the scope of this book is the functionality of R Projects. An R Project is a collection of related files that sit in a directory. In RStudio, you can create an R Project through the New Project menu option in the File menu. You can create a project in an existing directory, create a fresh project in a new directory, or connect to a version control system such as Git to clone an existing project to your computer (see this chapter in the Psy Ops Guide for more information about Git).

Once you have created a project, RStudio will remember which files you had opened in your script file panel (top-left). It also conveniently shows the files and directories in your project in the Files tab of the bottom-right panel and, if you use Git, allows you to synchronize your changes using the Git tab in the top-right panel. Most importantly for our present purposes, using an R Project allows easy access to your files and directories regardless of where on the PC they are located. Therefore, start by creating an R Project. Once you have created it, simply open the associated file (with the “.Rproj” extension) to continue working on the project.
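As a minimal sketch of how such a project could be organised, the base R code below creates the directories that the examples in this chapter refer to (data/raw, data/cleaned, data/coded, and codes). The helper create_project_dirs() is hypothetical and not part of the rock package; the demonstration uses a temporary directory, whereas in an actual R Project you would probably pass here::here() as the root.

```r
# Hypothetical helper (not part of the rock package): create the
# directory layout that the examples in this chapter assume.
create_project_dirs <- function(root) {
  dirs <- file.path(
    root,
    c(file.path("data", "raw"),      # raw transcripts
      file.path("data", "cleaned"),  # cleaned sources
      file.path("data", "coded"),    # coded sources
      "codes")                       # exported code lists
  )
  for (d in dirs) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}

# Demonstration in a temporary directory; in an R Project,
# here::here() would normally supply the project root instead.
projectRoot <- file.path(tempdir(), "rock-project-demo")
createdDirs <- create_project_dirs(projectRoot)
print(list.dirs(projectRoot, recursive = TRUE, full.names = FALSE))
```

Keeping raw, cleaned, and coded sources in separate directories makes it easier to decide which directories may be synchronized and which require more care.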

15.1.2 Source collection and preparation

We will assume that the data are organised into one or more sources, which are plain-text files where each smallest codable element is placed on a separate line (i.e. each utterance is separated by newline characters). Note that the smallest codable element is not the smallest element that could be coded in theory, but instead represents the smallest element that the researchers are interested in coding. If the data are not yet organised like this, you will first need to clean and organise them; please refer to those sections in the extensive workflow.

In each source, add case identifiers. Case identifiers indicate which utterances belong to which case. Cases are usually data providers, such as participants or organisations. Case identifiers can be, for example, numbers, letters, codes, or pseudonyms. Case identifiers are added to a source like this:

[[cid=1]]

The “1” is the identifier itself; this could also be, for example, “F23b”, “Alice”, or “F”, depending on the system used to identify the cases. This case identifier is used as an efficient way to attach attributes to the relevant utterances. By default, case identifiers are so-called “persistent identifiers”, which means that once they have been specified on a line, all subsequent utterances in that source will be considered to have been coded with that case identifier, until a new case identifier is encountered.
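To illustrate what “persistent” means here, the following base R sketch walks through the lines of a small made-up source and carries the most recent case identifier forward. This is only an illustration of the principle, not the way the rock package implements it; the function assign_case_ids() and the example utterances are hypothetical.

```r
# A small made-up source: each utterance is on its own line.
sourceLines <- c(
  "[[cid=1]]",
  "I usually cycle to work.",
  "It takes about twenty minutes.",
  "[[cid=2]]",
  "I take the bus."
)

# Regular expression matching a case identifier such as [[cid=1]]
cidRegex <- "\\[\\[cid=([^]]+)\\]\\]"

# Hypothetical helper: carry the last seen case identifier forward
assign_case_ids <- function(lines, regex = cidRegex) {
  currentId <- NA_character_
  caseIds <- character(length(lines))
  for (i in seq_along(lines)) {
    if (grepl(regex, lines[i])) {
      # A new identifier replaces the previous one from this line on
      currentId <- sub(paste0(".*", regex, ".*"), "\\1", lines[i])
    }
    caseIds[i] <- currentId
  }
  data.frame(caseId = caseIds,
             utterance = lines,
             stringsAsFactors = FALSE)
}

result <- assign_case_ids(sourceLines)
print(result)
```

The first three lines end up with case identifier “1” and the last two with “2”, even though the identifier is only written once per case.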

15.1.3 Attribute specification

Cases function as a method for attaching attributes to utterances. For example, if sources are transcripts from interviews, cases can be the participants that were interviewed. This enables attaching participants’ attributes to utterances, such as their age, gender, or area of residence (of course, constructs measured by measurement instruments such as questionnaires can also be used, such as scores on extraversion or self-efficacy).
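The idea of attaching attributes to utterances through cases can be sketched in base R as a simple lookup by case identifier. The data frames below contain made-up example data, and the column names are assumptions chosen for this illustration; the rock package handles this step itself when it parses sources.

```r
# Made-up utterances, each already linked to a case identifier
utterances <- data.frame(
  caseId = c("1", "1", "2"),
  utterance = c("I usually cycle to work.",
                "It takes about twenty minutes.",
                "I take the bus."),
  stringsAsFactors = FALSE
)

# Made-up attributes, one row per case
caseAttributes <- data.frame(
  caseId = c("1", "2"),
  sex = c("female", "male"),
  age = c("50s", "30s"),
  stringsAsFactors = FALSE
)

# Look up the attribute row belonging to each utterance's case
rowInAttributes <- match(utterances$caseId, caseAttributes$caseId)
utterancesWithAttributes <- cbind(
  utterances,
  caseAttributes[rowInAttributes, c("sex", "age")]
)
rownames(utterancesWithAttributes) <- NULL
print(utterancesWithAttributes)
```

Because every utterance carries a case identifier, each attribute only has to be specified once per case.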

These attributes can be defined in so-called YAML fragments that are delimited with three dashes (“---”), such as this fragment:

---
ROCK_attributes:
  -
    caseId: 1
    sex: female
    age: 50s
---

Such fragments can be placed in sources (usually at the beginning or the end, although the rock package does not care), but it may make more sense to combine them all in one separate file. To combine the attribute specifications for multiple cases, simply repeat the same information, including the dash:

---
ROCK_attributes:
  -
    caseId: 1
    sex: female
    age: 50s
  -
    caseId: 2
    sex: male
    age: 30s
---

Note that when working with YAML, indentation is very important. The word “ROCK_attributes” must always start at the beginning of the line; the dash that indicates that attributes for a new case start must always be indented exactly two spaces; and the caseId and attribute names and their values must always be indented exactly four spaces. If this is violated, the yaml package will throw a “Parser error”.
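One way to check the indentation of an attribute specification before using it is to parse it with the yaml package directly. The sketch below assumes the yaml package is installed; it removes the “---” delimiter lines, parses the remaining text with yaml::yaml.load(), and converts the result to a data frame. This is an illustration only and not necessarily how the rock package processes these fragments.

```r
# The attribute fragment from the text, one element per line
attributeLines <- c(
  "---",
  "ROCK_attributes:",
  "  -",
  "    caseId: 1",
  "    sex: female",
  "    age: 50s",
  "  -",
  "    caseId: 2",
  "    sex: male",
  "    age: 30s",
  "---"
)

# Drop the delimiter lines and parse what remains as YAML
yamlBody <- attributeLines[attributeLines != "---"]
parsed <- yaml::yaml.load(paste(yamlBody, collapse = "\n"))

# Each case is one list element; bind them into a data frame
caseAttributes <- do.call(
  rbind,
  lapply(parsed$ROCK_attributes,
         as.data.frame,
         stringsAsFactors = FALSE)
)
print(caseAttributes)
```

If the indentation is inconsistent, the call to yaml::yaml.load() will typically fail with an error, which makes this a quick check to run before parsing all sources.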

15.1.4 Coding

To code, one can use any text editor able to edit plain text files, such as Notepad, TextEdit, Notepad++, Vim, or BBEdit. However, in this workflow, we will work with iROCK, an interface optimized for working with the ROCK. iROCK is a simple, user-friendly and GDPR-compliant interface for coding sources. The iROCK interface is discussed in detail in Chapter 14. To load it, visit https://sci-ops.gitlab.io/irock/ in your browser (or follow an alternative method as explained in Chapter 14).

In iROCK, import a source by clicking the red rectangle marked “Sources”. Then, if you want to engage in inductive coding, you can simply start coding. At the right-hand side, you can create codes, and once created, a code can be dragged and dropped from the list onto the utterance you want to apply it to. Click an applied code (in the source) to remove it again. To indicate that a code falls under another code, use the ROCK hierarchy marker: “>”, e.g. “parentCode>childCode”.

If you want to use deductive coding, import your codes by clicking the red rectangle marked “Codes”. You can import a plain text file: every line of the file will be imported as one code. If you have already coded one or more sources, you can use the rock R package to efficiently create this list from the used codes. First, use the rock::parse_sources() function to import the sources. This is explained in more detail in the next section, but looks roughly like this:

parsedSources <-
  rock::parse_sources(input = here::here("data", "coded"));

Then, use the rock::export_codes_to_txt() function to export the codes. For example, to export all codes, including their so-called “paths” (their explicitly specified position in the hierarchy using the ROCK hierarchy marker, “>”), use:

rock::export_codes_to_txt(input = parsedSources,
                          output = here::here("codes", "exported-codes.txt"));

Alternatively, you can specify that you only want the “leaves” of the code tree (in other words, that you do not want to select codes that have child codes) using “leavesOnly=TRUE”, and you can specify that you do not want to include the path, but only the codes themselves, using “includePath=FALSE”:

rock::export_codes_to_txt(input = parsedSources,
                          output = here::here("codes", "exported-codes.txt"),
                          leavesOnly=TRUE,
                          includePath=FALSE);

To only select codes with a given parent, specify a value for “onlyChildrenOf”, and to only select codes that match a given regular expression, specify it as “regex”.

The rock::export_codes_to_txt() function will write a plain-text file to disk that can then be directly imported into the iROCK interface.
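To make the notions of “leaves” and “paths” concrete, here is a base R sketch that works on a made-up vector of code paths. The helpers is_leaf() and strip_path() are hypothetical and only mimic the ideas behind “leavesOnly” and “includePath”; they are not the rock package's implementation, and the actual output of rock::export_codes_to_txt() may differ in its details.

```r
# Made-up codes, written with the ROCK hierarchy marker ">"
codePaths <- c("parentCode",
               "parentCode>childA",
               "parentCode>childB",
               "loneCode")

# Hypothetical helper: a code is a leaf if no other code has it
# as the start of its path
is_leaf <- function(paths) {
  vapply(paths,
         function(p) !any(startsWith(paths, paste0(p, ">"))),
         logical(1),
         USE.NAMES = FALSE)
}

# Hypothetical helper: keep only the part after the last ">"
strip_path <- function(paths) {
  sub(".*>", "", paths)
}

leafPaths <- codePaths[is_leaf(codePaths)]
leafCodes <- strip_path(leafPaths)

# Write one code per line, the format a plain-text code list uses
codeFile <- tempfile(fileext = ".txt")
writeLines(leafCodes, codeFile)
print(readLines(codeFile))
```

In this example, “parentCode” is dropped because it has children, and the remaining codes are written without their paths.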

15.1.5 Analysing the results

15.1.6 (Re)Coding

15.1.7 Publishing the project

15.2 An extensive ROCK workflow

Below follows a more extensive workflow description. For the sake of completeness, this also includes common tasks in qualitative research that are unrelated to the ROCK. It is, after all, the extensive workflow.

15.2.1 Planning

15.2.1.1 Research questions

As with any study, it is vital to have a clear research question. The research question determines which methods can be used. For example, not all research questions can be studied using qualitative research (and not all research questions can be studied using quantitative research). Typical research questions that require quantitative research are questions about associations or causality. Typical research questions that require qualitative research are questions about experiences, narratives, and contents of constructs. And if strong conclusions are desired, research syntheses are required, rather than a single study.

The research question is important because once it is clear, the required analysis approach can be determined, which allows determining the required coding approach, which allows determining how to collect the data. Because the research question is so fundamentally connected to all other aspects of the study, one approach to clarify your research question is to think about what potential answers may look like.

One important implication of a research question is whether the coding will be deductive or inductive. Deductive coding uses predefined codes, and inductive coding uses codes created during the coding. Deductive (‘closed’) coding allows more transparency and reproducibility and enables procedures to minimize bias. The price the researcher pays for these advantages is less flexibility during coding. Inductive (‘open’) coding allows identifying patterns and categories in the data that could not be anticipated a priori. In that sense, inductive coding plays to the strengths of qualitative research: it imposes no constraints on analysis. The downside is that it cannot use tools to manage subjectivity; such tools inevitably impose structure and as such decrease the purely inductive nature of the coding.

In practice, coding is often a mix. Researchers rarely start collecting data in a field where no relevant theory exists. Therefore, those theories often shape the coding, in which case making that explicit by prespecifying a deductive coding structure aids transparency. This coding structure can then form the basis for the coding process, while still allowing coders to add more code trees to the coding structure’s root (for codes that cannot be captured by the prespecified codes and their definitions) and to add more codes as ‘children’ of prespecified codes.

15.2.1.2 Coding instructions

Unless there are no preconceived ideas about the coding process whatsoever, the coding process will inevitably require matching utterances to some definition. It is therefore important to have the relevant definitions available, in sufficiently clear and explicit formulations, as well as coding instructions. Coding instructions are important for decreasing undesirable, invisible subjectivity and bias in the coding process. They explicitly capture the characteristics that a piece of data must satisfy to code it with a given code, including explicit guidelines for resolving edge cases (see section 20.3.2).

It is usually desirable to make sure the coding instructions are consistent over studies. Ideally, exactly the same coding instructions are used in all studies in a lab, department, or even institution (also see section 20.2.1).

If the qualitative study concerns humans, and therefore, the codes relate to constructs (e.g. psychological, sociological, or anthropological constructs), using a decentralized construct taxonomy (DCT) supports clear definitions that can consistently be applied over multiple studies. These are introduced in Chapter 20. For studies with humans, it is therefore strongly recommended to not proceed until a set of DCT specifications has been produced and the coding instructions have been generated from those DCTs.

If coding another type of content, it is still important to develop clear, unequivocal coding instructions before proceeding. The coding instructions should ideally be good enough to render individual coders more or less interchangeable.

15.2.1.3 A note on data management

encryption
password management

15.2.2 Data collection

The operational aspects of data collection vary with the type of data that are collected. We will cover two scenarios here: recording audio from individual interviews, group interviews or focus groups, and collecting existing data such as social media posts or archive materials.

15.2.2.1 Recording audio

2 recorders (one backup; redundancy)

15.2.2.1.1 Transcription into sources

with group interviews, pay attention to distinguishing group members; make sure they introduce themselves

15.2.2.2 Collecting existing data

15.2.3 Source cleaning

Once a dataset has been collected, it is usually necessary to perform some cleaning. In a ROCK workflow, this cleaning includes rudimentary segmentation into utterances (see Chapter 10). In the ROCK specification, utterances are separated by a newline character: in other words, every utterance is on its own line. Because utterances are the smallest codable unit, this is not a trivial step. The logic underlying the convention that utterances are separated by newline characters is that although sentences are themselves often hard to fully understand without context, they are at least often self-contained, whereas parts of a sentence are rarely comprehensible on their own.
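The principle of placing every sentence on its own line can be sketched in a few lines of base R. This is a deliberately simplified illustration with a hypothetical helper, split_into_utterances(), and a made-up fragment of text; the cleaning performed by the rock package is more elaborate (for example, a naive rule like this one would also split after abbreviations).

```r
# A made-up fragment of raw transcript
rawText <- paste(
  "I moved here last year.",
  "It was quite a change!",
  "Do I regret it?",
  "Not at all."
)

# Hypothetical helper: insert a newline after each sentence ending
# (".", "!" or "?" followed by whitespace) and split on newlines
split_into_utterances <- function(text) {
  withNewlines <- gsub("([.!?])\\s+", "\\1\n", text)
  strsplit(withNewlines, "\n", fixed = TRUE)[[1]]
}

utterances <- split_into_utterances(rawText)
print(utterances)
```

Each element of the result corresponds to one line, and therefore one utterance, in the cleaned source.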

To clean sources, we will use the rock package function rock::clean_sources() (for more details, see Section 12.3.1). We assume here that the data are located in a directory called data in your Project directory (see Section 15.1.1). In this directory, we assume that the raw sources (i.e. the raw transcripts) are located in the subdirectory called raw. In addition, we will write the cleaned sources to another subdirectory of the data directory called cleaned.

The following command reads all files in the data/raw directory in your Project directory, applies the default cleaning operations (e.g. add a newline character following every sentence ending), and writes the cleaned sources to a directory called data/cleaned in your Project directory:

rock::clean_sources(input = here::here("data", "raw"),
                    output = here::here("data", "cleaned"));

Note that if you only want to read files with a certain extension, such as .txt or .rock, you can pass a regular expression to match against filenames as the filenameRegex argument. For example, to read both .txt files and .rock files, pass the regular expression "\\.txt$|\\.rock$"; to read only .txt files, pass "\\.txt$"; and to read only files with the .rock extension, use this command:

rock::clean_sources(input = here::here("data", "raw"),
                    output = here::here("data", "cleaned"),
                    filenameRegex = "\\.rock$");
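Before passing a regular expression to a function that reads files, it can help to check which filenames it matches. The base R sketch below does this with grepl() on a made-up vector of filenames; it only illustrates how such expressions behave and does not call the rock package.

```r
# Made-up filenames, as might be found in a data/raw directory
fileNames <- c("interview-1.txt",
               "interview-2.rock",
               "notes.docx",
               "txt-overview.md")

# Files ending in .txt or .rock
txtOrRock <- fileNames[grepl("\\.txt$|\\.rock$", fileNames)]

# Files ending in .txt only
txtOnly <- fileNames[grepl("\\.txt$", fileNames)]

# Files ending in .rock only
rockOnly <- fileNames[grepl("\\.rock$", fileNames)]

print(txtOrRock)
print(txtOnly)
print(rockOnly)
```

The escaped dot and the “$” anchor ensure that only the extension is matched, so a file such as “txt-overview.md” is not selected.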

The rock::clean_sources() function has many other options, which you can read about by requesting the manual page. You can do this by typing ?rock::clean_sources in the R console (the bottom-left panel in RStudio).

15.2.4 Prepending utterance identifiers

15.2.5 Coding and segmentation

15.2.5.1 Automating coding and segmentation

In a sense, cleaning sources already applies some automatic segmentation: after all, the default

rock::code_sources()

15.2.6 Manual coding and segmentation

To manually code and segment, any software that can open and save plain text files can be used. Achieving this software independence was, after all, one of the reasons the ROCK was developed. Most operating systems come with basic plain text editors (e.g. Notepad on Windows, TextEdit on macOS, and vim on most Unix systems), and many free alternatives exist, such as the powerful Free/Libre and Open Source Software (FLOSS) editor Notepad++ for Windows and BBEdit (free to use, although not open source) for macOS.

These editors can be used to open sources and add codes and section breaks. Many allow the creation of plugins that can further facilitate this, and in addition, plain text editors can be used in tandem with spreadsheet applications such as the FLOSS LibreOffice Calc. This allows having a neatly organised coding structure in a spreadsheet, which can then easily be copied to the clipboard and pasted into a source in a text editor. By using key combinations such as Alt-Tab (Windows) or Command-Tab (macOS), the coder can quickly switch between the source and the code overview.

In addition, iROCK can be used: a rudimentary graphical user interface that simply allows appending predefined codes to utterances and inserting section breaks (for segmentation). More details are available in Chapter 14.

Finally, the ROCK standard enables development of a variety of other tools for specific use cases.

15.2.7 Inspecting coder consistency

15.2.8 Merging sources

rock::merge_sources

15.2.9 Viewing source fragments by code

rock::collect_coded_fragments

15.2.10 Recoding

Sometimes, it is desirable to change codes. For example, a set of codes that was initially obtained through inductive coding may, upon inspection, have a hierarchical structure. When using the ROCK, ideally, the originally coded sources remain in their original state so as to enable scrutiny of the coding process. Instead, the recoding is applied using a command that opens the sources, makes the changes, and then writes them to disk again.

At present, the rock R package has four functions to code and recode.

rock::search_and_replace_in_sources
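The principle of recoding without touching the originally coded sources can be sketched in base R as follows. The helper recode_to_new_dir() is hypothetical and is not one of the rock package's functions, and the sketch assumes that codes appear in sources between double square brackets (like the case identifiers shown earlier); the file names, codes, and utterances are made up.

```r
# Hypothetical helper: read every source in inputDir, replace one
# code by another, and write the result to outputDir, leaving the
# original files unchanged.
recode_to_new_dir <- function(inputDir, outputDir, oldCode, newCode) {
  dir.create(outputDir, recursive = TRUE, showWarnings = FALSE)
  for (f in list.files(inputDir, full.names = TRUE)) {
    lines <- readLines(f, warn = FALSE)
    recoded <- gsub(paste0("[[", oldCode, "]]"),
                    paste0("[[", newCode, "]]"),
                    lines,
                    fixed = TRUE)
    writeLines(recoded, file.path(outputDir, basename(f)))
  }
  invisible(outputDir)
}

# Demonstration with a made-up coded source in temporary directories
codedDir <- tempfile("coded-")
recodedDir <- tempfile("recoded-")
dir.create(codedDir)
writeLines(c("I usually cycle to work. [[transport]]",
             "I take the bus. [[transport]]"),
           file.path(codedDir, "source-1.rock"))

recode_to_new_dir(codedDir, recodedDir,
                  oldCode = "transport",
                  newCode = "behavior>transport")

originalLines <- readLines(file.path(codedDir, "source-1.rock"))
recodedLines <- readLines(file.path(recodedDir, "source-1.rock"))
print(recodedLines)
```

Because the recoded sources are written to a separate directory, both the original coding and the recoding step remain available for scrutiny.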

15.2.11 Generating HTML versions

rock::export_to_html

The ROCK book