This is an empty project optimized for projects using the Reproducible Open Coding Kit (the ROCK), the iROCK interface for coding, and the R rock package for analysis. This project accompanies the ROCK book (https://rockbook.org).

This project is an RStudio Project. The easiest way to work with it is to download the directories and files in this project to your local computer. You can then delete all the example sources in this project and replace them with your own files, and edit this file to fit your own project. If you’re familiar with Git, you can clone the project; if you’re not familiar with Git and don’t want to use it, simply download this project as a zip file.

You can clone or download this project by visiting its GitLab repository (see the Open Science Manifest below).

If you do not yet have R and RStudio installed, or if you are working on a machine where they are installed but you’re not familiar with them, check out the Chapter about R and the Chapter about RStudio in the Psy Ops Guide for more information. If you would like to start using Git as a version control system and/or to smoothly collaborate with others, see the Chapter about Git.

For an explanation of how to install the rock R package, see section “Downloading and installing the rock package” in the ROCK book. Note that this Empty ROCK Project uses some of the newest functionality of the rock R package, so you will probably have to install the development version of the rock package to be able to run all examples in this project.

Open Science Manifest

This table collects information about project resources (e.g. repositories, articles, presentations, etc).

Resource type                          URL
Open Science Framework repository      Not available yet
GitLab repository                      https://gitlab.com/sci-ops/rock-tutorial
Website                                https://sci-ops.gitlab.io/rock-tutorial
Preprint DOI and URL                   Not available yet
Final published article DOI and URL    Not available yet
License                                Not available yet
Other important resources              Not available yet

Preparation

The ROCK was designed to facilitate transparency and reproducibility of research, as well as to comply with the General Data Protection Regulation. This requires a certain level of due diligence when it comes to data management, or more specifically, file management.

This bare bones ROCK project assumes you already converted your original data sources to plain text files. For audio recordings, this will mean that they were transcribed, and ideally, anonymized. For other sources (video, photographs, etc), it will often mean that the relevant information was described in text files to enable coding them with the ROCK (see footnote 1).

When doing qualitative research, raw data is often cleaned and restructured and information is added to the data to aid interpretation. This process often occurs in consecutive steps, and a convenient way to optimize transparency of this process is storing the intermediate versions. A drawback is that the number of files quickly proliferates, which can make it hard to retain an overview. Therefore, it is important to organise these files well.

Organizing your project

There are basically two approaches to organising the source files. The first is to create a new directory (‘folder’) for each action that changes the sources. In this approach, one directory contains all raw sources; a second directory contains all sources after the rock command to clean them was executed; a third directory contains all sources after the rock command to add the Utterance Identifiers was executed; et cetera. The advantage of this approach is that every directory is in itself a snapshot of the data at the corresponding stage in the project. A disadvantage is that you may end up with quite a lot of directories.

The second approach is to create the files in the same directory, but to use a convention to rename them: some string of characters is added to the sources’ filenames for each stage in the project. The raw sources all keep their original name; when the rock command to clean them is executed, it stores its result in filenames that have, for example, "_clean" appended; when the rock command to add the Utterance Identifiers is executed, it appends, for example, "_uids"; et cetera. The drawbacks of this approach are that you end up with a lot of files, and with very long filenames.
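To get a feel for how quickly suffixes accumulate, the following minimal base R sketch builds the filename a source would have after three stages. The helper addSuffix() and the filename "interview-01.rock" are hypothetical and only illustrate the naming convention; this is not how the rock package constructs filenames internally.

```r
### Hypothetical helper: insert a suffix between the base name of a
### file and its extension
addSuffix <- function(filename, suffix) {
  paste0(tools::file_path_sans_ext(filename),
         suffix,
         ".",
         tools::file_ext(filename));
}

### Hypothetical raw source filename
filename <- "interview-01.rock";

### Apply one suffix per stage, in the order of the stages
for (suffix in c("_clean", "_uids", "_acWords")) {
  filename <- addSuffix(filename, suffix);
}

print(filename);
```

After only three stages, the suffixes already take up more characters than the original base name, which is why brief original filenames help.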

In practice, what works best will depend on individual preference. We recommend combining both approaches: appending something to the filenames and storing the files of each stage in a different directory. This approach makes it easiest for others to follow what you did. Therefore, that is the approach implemented in this bare bones project.

For example, look at the directories in this project:

## ├── data
## │   ├── 01--raw-sources
## │   ├── 02--cleaned-sources
## │   ├── 10--sources-with-UIDs
## │   ├── 11--autocoded-sources--round-01
## │   ├── 12--autocoded-sources--round-02
## │   ├── 20--manually-coded-sources
## │   ├── 21--recoded-sources--round-01
## │   └── 21--recoded-sources--round-02
## ├── public
## ├── results
## └── scripts

The directories within the data directory all start with double digits. These ensure the directories are shown in the right order (note that these directory names apply the conventions laid out in the Psy Ops Guide; we recommend you do the same). The reason double digits are used is that this allows structuring of the directories in phases. Directories holding sources from preparatory steps start with a 0; directories holding sources from automatic coding steps start with 1; and directories holding sources from manual coding and recoding start with 2.
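If you prefer to create such a directory structure from R rather than by hand, a minimal sketch could look as follows. It uses only base R, and it creates the directories inside a temporary directory (the name "my-rock-project" is hypothetical) so that running it does not touch your real project; replace projectRoot with your own project path to use it in earnest.

```r
### Hypothetical project root inside a temporary directory
projectRoot <- file.path(tempdir(), "my-rock-project");

### Directory names copied from the tree shown above
dataDirs <- c("01--raw-sources",
              "02--cleaned-sources",
              "10--sources-with-UIDs",
              "11--autocoded-sources--round-01",
              "12--autocoded-sources--round-02",
              "20--manually-coded-sources",
              "21--recoded-sources--round-01",
              "21--recoded-sources--round-02");

### Create the data subdirectories
for (currentDir in file.path(projectRoot, "data", dataDirs)) {
  dir.create(currentDir, recursive = TRUE, showWarnings = FALSE);
}

### Create the other top-level directories
for (currentDir in file.path(projectRoot,
                             c("public", "results", "scripts"))) {
  dir.create(currentDir, recursive = TRUE, showWarnings = FALSE);
}

### List what was created in the data directory
createdDirs <- list.files(file.path(projectRoot, "data"));
print(createdDirs);
```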

Organizing the files

Once you have created these directories, copy the raw sources to the 01--raw-sources directory. The next step is to organise the names of the files in your directory structure. If some data need to stay private (for example, because the informed consent forms that were used did not provide for sharing the raw data; see the Psy Ops Guide), you need to make sure you include a specific pattern in the filenames of the private files to enable easy identification. We recommend prepending the string “_PRIVATE_” (an underscore, then the word ‘private’ in capitals, then another underscore).

This allows you to easily exclude all these files from synchronizing to, for example, public repositories for your project (e.g. by specifying the pattern in .gitignore).

In addition, it’s a good idea to use relatively brief filenames, as each successive analysis step will add another suffix.
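The benefit of a fixed marker is that a single pattern separates private from shareable files. The sketch below shows this with base R; the directory and the three filenames are hypothetical and are created in a temporary location. In a .gitignore file, a line such as *_PRIVATE_* would be one way to express the same pattern.

```r
### Temporary directory with hypothetical source files
sketchDir <- file.path(tempdir(), "private-pattern-sketch");
dir.create(sketchDir, showWarnings = FALSE);
exampleFiles <- c("_PRIVATE_interview-01.rock",
                  "_PRIVATE_interview-02.rock",
                  "fieldnotes-01.rock");
invisible(file.create(file.path(sketchDir, exampleFiles)));

### Files that must stay private: their names contain the marker
privateFiles <- list.files(sketchDir, pattern = "_PRIVATE_");

### Files that can be shared: everything else
publicFiles <- setdiff(list.files(sketchDir), privateFiles);

print(privateFiles);
print(publicFiles);
```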

Basic setup in R

This R code does some basic setup tasks. It installs the most recent versions of the ufs, yum, and rock R packages; checks for the presence of the here package (if it isn’t installed, install it using install.packages('here')); and finally sets the knitr chunk option echo to TRUE, which means that by default, each chunk’s R code will be included in the rendered HTML file.

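### Get most recent versions of some packages, and the development
### version of the rock package - set quiet to FALSE to see info.
### quietGitLabUpdate() is called from the ufs package here, so ufs
### itself must already be installed, e.g. install.packages('ufs')
quiet <- TRUE;
ufs::quietGitLabUpdate("r-packages/ufs", quiet=quiet);
ufs::quietGitLabUpdate("r-packages/yum", quiet=quiet);
ufs::quietGitLabUpdate("r-packages/rock@dev", quiet=quiet);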

### Get additional packages
ufs::checkPkgs('here');

### Set options
knitr::opts_chunk$set(echo = TRUE,
                      comment = "");

Configuring the rock package

The rock package has a number of defaults that can be customized. This customization happens in this section. A complete overview of the default settings can be obtained by running rock::opts$defaults.

In this template project, we set two identifiers to be persistent identifiers (which means that once they have been applied to an utterance in a source, the rock package automatically applies them to all subsequent utterances) and we define two patterns for section breaks (section breaks are important when segmenting sources, but can be disregarded otherwise).

Also, the default setting of silent = TRUE is explicitly included here to make it easy for you to set it to FALSE if you want to receive more detailed feedback from the rock functions.

### Set the non-default ROCK options
rock::opts$set(
  persistentIds = c("caseId",
                    "coderId"),
  sectionRegexes = c(paragraphBreak = "---<<paragraph-break>>---",
                     topicListSwitch = "---<<topiclist-switch>>---"),
  silent = TRUE
);

Preparing and cleaning sources

Freshly transcribed sources are not always very neatly and consistently formatted. Therefore, some cleaning is often beneficial. The main purpose of cleaning is making sure that utterances are split by the utterance marker (the ROCK default is a newline character, \n, which most operating systems render as a new line).

By default, the rock package tries to smartly insert newline characters between all sentences (see ?rock::clean_sources for more details). You can also use this to do replacements before or after the insertion of the utterance markers. For example, if the interviewed participants live in six cities, say Amsterdam, Beijing, Canberra, Dhaka, Edinburgh, and Freetown, and have relatively rare afflictions, to preserve their anonymity, all city names can be replaced with the text “GEOGRAPHICAL_REFERENCE” by combining the city names into regular expression “^Amsterdam$|^Beijing$|^Canberra$|^Dhaka$|^Edinburgh$|^Freetown$” and specifying both in the “extraReplacementsPre” argument, as done in this example.

rock::clean_sources(
  input = here::here("data",
                     "01--raw-sources"),
  output = here::here("data",
                      "02--cleaned-sources"),
  extraReplacementsPre =
    list(
      c("^Amsterdam$|^Beijing$|^Canberra$|^Dhaka$|^Edinburgh$|^Freetown$",
        "GEOGRAPHICAL_REFERENCE")
    )
);

Instead of using a regular expression, you could also specify six pairs to search and replace: one for each city. In this example, regular expressions are efficient, but not necessary. In many cases, however, regular expressions can make one’s life considerably easier. To learn more about regular expressions, see the Psy Ops Guide.
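When writing such a pattern, keep in mind what the anchors do: ^ and $ only match at the start and end of the text being searched, so an anchored city name is only replaced when it makes up that entire text. The sketch below illustrates this with base R's gsub() on three hypothetical utterances and a pattern shortened to two cities; it compares the anchored pattern with an alternative that uses word boundaries (\b). How the rock package applies the pattern to your sources internally is not shown here, so check the result on your own data.

```r
### Hypothetical utterances, used only for this sketch
utterances <- c("I moved to Amsterdam last year.",
                "Amsterdam",
                "My sister still lives in Dhaka.");

### Anchored pattern, in the style of the example above
anchoredRegex <- "^Amsterdam$|^Dhaka$";

### Alternative: word boundaries instead of anchors
boundedRegex <- "\\b(Amsterdam|Dhaka)\\b";

anchoredResult <- gsub(anchoredRegex,
                       "GEOGRAPHICAL_REFERENCE",
                       utterances,
                       perl = TRUE);
boundedResult <- gsub(boundedRegex,
                      "GEOGRAPHICAL_REFERENCE",
                      utterances,
                      perl = TRUE);

### Only the second utterance changes with the anchored pattern
print(anchoredResult);

### All three utterances change with the word-boundary pattern
print(boundedResult);
```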

Note that if you don’t want to clean, or if you have already cleaned your sources, you can always do any replacements using the rock::search_and_replace_in_source() and rock::search_and_replace_in_sources() commands.

Prepending utterance identifiers (UIDs)

Once all sources have been cleaned, you can add the utterance identifiers. These are character strings that uniquely identify each utterance.

rock::prepend_ids_to_sources(input = here::here("data",
                                                "02--cleaned-sources"),
                             output = here::here("data",
                                                 "10--sources-with-UIDs"));

Coding

Once the UIDs have been prepended, you can start coding the sources.

Automatic coding

However, before you start manually coding, you may want to apply some codes automatically.

Simple autocoding

For example, you may want to code all utterances that contain the word “words” with code language>words (i.e. the code words as a child code of code language). You can perform such autocoding using rock::code_sources().

rock::code_sources(input = here::here("data",
                                      "10--sources-with-UIDs"),
                   output = here::here("data",
                                       "11--autocoded-sources--round-01"),
                   codes = c("words" = "language>words"),
                   outputSuffix = "_acWords");

Note that we manually specified a suffix to use so that the filename reflects what we autocoded in this step. If you perform a lot of autocoding, adding a suffix for each step can lead to excessively long filenames. Therefore, you may choose to not add suffixes, but just use the different directories to keep track of the source’s states over the autocoding steps. In that case, you would specify outputSuffix = NULL. Alternatively, you can use minimal, non-semantic, suffixes, such as “_ac1”, for autocoding round 1, “_ac2” for round 2, et cetera.

More elaborate autocoding

Often, you will not simply want to search for single words when autocoding, but instead allow variations or multiple words. For this, you can use regular expressions (again, see the Psy Ops Guide).

rock::code_sources(input = here::here("data",
                                      "11--autocoded-sources--round-01"),
                   output = here::here("data",
                                       "12--autocoded-sources--round-02"),
                   codes = c("smile|[tf]ear|sorrow|pain|emotions?" = "emotions"),
                   outputSuffix = "_ac2");

In this regular expression, the pipe (|) means ‘or’; in other words, the four pipes mean that this regular expression consists of five different regular expressions, and it matches any of those. The square brackets are a so-called character class, and such a class matches exactly one of the characters contained in it, so the second ‘sub-regex’ matches both ‘tear’ and ‘fear’. The question mark means that the preceding character is optional, so the last regex matches both ‘emotion’ and ‘emotions’.
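Before autocoding with a regular expression, it can help to try it out on a few utterances. The following sketch does this with base R's grepl(); the first five utterances are taken from the example sources in this project, and the last one is a hypothetical utterance added to show a pitfall. Appending the code with paste() only mimics the result of autocoding and is not the rock implementation.

```r
### The regular expression used for autocoding above
emotionRegex <- "smile|[tf]ear|sorrow|pain|emotions?";

utterances <- c("The smile of dawn",
                "Arrived early May",
                "The night shed a tear",
                "To tell her of fear",
                "Then I learned to turn emotions into weaponry",
                "I was painting the wall");   ### hypothetical

### Which utterances match?
matches <- grepl(emotionRegex, utterances);

### Mimic autocoding: append the code to matching utterances
autocoded <- ifelse(matches,
                    paste(utterances, "[[emotions]]"),
                    utterances);

print(data.frame(utterance = utterances,
                 matches = matches));
```

Note that "early" does not match, because the "ear" in it is not preceded by a "t" or an "f", whereas "painting" does match, because it contains "pain". Regular expressions match parts of words unless you add word boundaries, so always inspect the autocoded sources.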

Note that we now read the sources to autocode from the directory where the results of the previous autocoding round were stored. Also note that we write the results of this round to another directory, so that our progress is very easy to follow for interested researchers (which includes our future self).

Normally, explanations and justifications of each step would be added, so that others can follow the process. These can be inserted as regular text, like in this R Markdown file, or as comments in an R script.

Manual coding

In most projects, after zero or more rounds of autocoding, manual coding is inevitable. For manual coding, you can use a plain text editor or a dedicated interface such as iROCK. iROCK is a Free/Libre Open Source Software interface that can be downloaded from its Git repository or used directly in the version hosted by GitLab Pages; the interface is discussed in the ROCK book.

In this example, some manual codes were added to the sources and they were then written to directory “20--manually-coded-sources”.

Note that up until this stage, the entire process followed deterministically from the combination of the raw sources and this file. This means that in principle, only the raw sources and this file would need to be saved; we stored the sources in each intermediate stage to facilitate inspection of the process, but if those files were lost, it would not be a problem.

However, manual coding is, of course, manual: this stage involves humans and cannot be repeated automatically. This is signified by the increased first digit in the directory name (the first character is now “2”) and by the removal of all previously added suffixes and addition of the suffix “_coded1”. The “1” here is a counter for the manual coding round; sometimes multiple manual coding rounds are required. However, usually, after the first coding round, (re)coding is accomplished using rock commands, so that they can be easily documented and decisions can be elaborated and justified.

If coding is deductive (i.e. if prior knowledge about the subject of study is available, and has been captured in a codebook), the manual coding phase normally consists of application of the coding instructions in the codebook to the data in the sources. Creation of new (sub-)codes then happens after that manual coding round (through rock commands, and documenting the decisions).

Identifiers and attributes

Often, qualitative sources have a set of characteristics that are known independent of coding. In the ROCK, these are called “attributes”, and they are assigned to sources using “identifiers”. An identifier is a special code that can be used to identify sources. For example, identifiers can be used to code which participant was interviewed; who the interviewer was; what the location of the interviews was; or whether the interview was conducted during the morning, afternoon, or evening.

These identifiers can then be used to associate attributes with utterances. By default, the identifier “caseId” is recognized. It is configured as a so-called “persistent” identifier, which means that it applies not only to the coded utterance, but also to all following utterances until another caseId identifier is encountered. This is convenient, because it means that for sources with single data providers (e.g. individual interviews, or documents from single organisations), only one case identifier has to be added (to one of the first lines). For sources with multiple data providers, the case identifier has to be repeated each time the data provider changes (but only once).
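The idea of a persistent identifier can be sketched in a few lines of base R: an identifier that is coded on one utterance is carried forward until the next one is encountered. The utterances below are hypothetical, and this sketch only illustrates the principle; it is not the code the rock package uses when parsing sources.

```r
### Hypothetical utterances from a source with two data providers
utterances <- c("[[cid=1]] First line by case 1",
                "Second line by case 1",
                "[[cid=2]] First line by case 2",
                "Second line by case 2",
                "Third line by case 2");

### Which utterances carry an explicitly coded case identifier?
hasCid <- grepl("\\[\\[cid=[^]]+\\]\\]", utterances);

### Extract the explicitly coded identifiers (NA where absent)
explicitCid <- ifelse(hasCid,
                      sub(".*\\[\\[cid=([^]]+)\\]\\].*",
                          "\\1",
                          utterances),
                      NA_character_);

### For every utterance, find the index of the most recent
### utterance that carried an identifier (0 if there was none yet)
lastCodedIndex <- cummax(ifelse(hasCid, seq_along(utterances), 0));

### Carry the identifiers forward
persistentCid <- ifelse(lastCodedIndex > 0,
                        explicitCid[pmax(lastCodedIndex, 1)],
                        NA_character_);

print(data.frame(explicitCid = explicitCid,
                 persistentCid = persistentCid));
```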

Adding identifiers

These identifiers are ideally added during or just after transcription, but in this example, they were added during the manual coding. When following the default rock settings, caseIds are coded using “[[cid=XXX]]”, where “XXX” specifies the unique identifier for that case. For example, one can use numbers, or random strings, or pseudonyms (of course, never use participants’ real names!).

Specifying attributes for identifiers

The attributes for identifiers are specified in a fragment delimited by two lines that each contain exactly three dashes. Such a fragment looks like this:

---
ROCK_attributes:
  -
    caseId: 1
    artistName: "Dream Theater"
    songName: "Metropolis Part 1: The Miracle And The Sleeper"
    year: 1996
  - caseId: 2
    artistName: "The Dear Hunter"
    songName: "Gloria"
    year: 2008
---

These attributes (in this case, the names of the artists, songs, and the years the songs were released) are then attached to all utterances coded with that case identifier. This makes it possible to select utterances by attribute value: in a study with participants, for example, you could view all utterances from female participants, or all utterances from female participants that were coded with a specific code.
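Conceptually, attaching attributes comes down to joining a table of utterances with a table of attributes on the case identifier. The following base R sketch shows that join and a subsequent selection; the assignment of the three utterances to cases 1 and 2 is an assumption made for this illustration, and the rock package performs this step for you when it parses the sources.

```r
### Attributes per case, mirroring the fragment above
caseAttributes <- data.frame(
  caseId = c("1", "2"),
  artistName = c("Dream Theater", "The Dear Hunter"),
  year = c(1996, 2008),
  stringsAsFactors = FALSE
);

### Utterances with their case identifier (assignment to cases
### is assumed for this sketch)
utterances <- data.frame(
  uid = c("772lbsdw", "772lbsf1", "772lbsdr"),
  caseId = c("1", "1", "2"),
  utterance = c("The smile of dawn",
                "The night shed a tear",
                "Then I learned to turn emotions into weaponry"),
  stringsAsFactors = FALSE
);

### Attach the attributes to every utterance of the same case
merged <- merge(utterances, caseAttributes,
                by = "caseId", all.x = TRUE);

### Select utterances based on an attribute value
selected <- merged[grepl("The Dear Hunter", merged$artistName), ];

print(selected[, c("uid", "artistName", "year")]);
```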

Initial analysis

Before coded sources can be viewed or processed further, they must be parsed by the rock package. During parsing, all deductive (closed) and inductive (open) codes are collected and compiled into separate and merged coding trees, and a dataframe is created where each utterance is a row, and the associated attributes and codes are stored in columns. The command to parse sources is called rock::parse_sources(). The following fragment reads all sources in the “data” directory (and by default, in all subdirectories) that match the regular expression “_coded1|attributes”, which means that we only parse the manually coded fragments and the file with the attributes.

dat <-
  rock::parse_sources(here::here('data'),
                      regex = "_coded1|attributes");

We store the object containing all the parsed sources under the name dat. This allows us to pass it on to other functions for inspection.
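To make the structure of such a dataframe more concrete, the sketch below builds a small version by hand with base R: one row per utterance, one column per code. The four utterances are taken from the example sources; the object name parsedSketch and the column layout are assumptions for illustration and do not claim to reproduce the object returned by rock::parse_sources().

```r
codedUtterances <- c(
  "[[uid=772lbsdw]] The smile of dawn [[emotions]]",
  "[[uid=772lbsdx]] Arrived early May",
  "[[uid=772lbsdy]] She carried a gift [[reference>person>third]]",
  "[[uid=772lbsds]] One too many words said with the wrong inflection [[language>words]]"
);

### Extract the utterance identifiers
uids <- sub("^\\[\\[uid=([^]]+)\\]\\].*$", "\\1", codedUtterances);

### Extract everything between double square brackets
allTags <- regmatches(codedUtterances,
                      gregexpr("\\[\\[[^]]+\\]\\]", codedUtterances));

### Strip the brackets and drop the UIDs, leaving only the codes
codes <- lapply(allTags, function(tags) {
  tags <- gsub("^\\[\\[|\\]\\]$", "", tags);
  tags[!grepl("^uid=", tags)];
});

### One column per code: 1 if the utterance has it, 0 otherwise
allCodes <- sort(unique(unlist(codes)));
codeColumns <- sapply(allCodes, function(code) {
  as.integer(sapply(codes, function(x) code %in% x));
});

parsedSketch <- data.frame(uid = uids,
                           codeColumns,
                           check.names = FALSE,
                           stringsAsFactors = FALSE);

print(parsedSketch);
```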

Showing inductive code trees

To view the inductive code trees, pass the object with the parsed sources on to function rock::show_inductive_code_tree().

rock::show_inductive_code_tree(dat);

Inductive code tree for codes

          levelName
1 codes            
2  ¦--emotions     
3  ¦--reference    
4  ¦   ¦--location 
5  ¦   °--person   
6  ¦       ¦--third
7  ¦       °--first
8  °--language     
9      °--words    

Showing number of coded utterances for each code

Sometimes it can be useful to get a quick impression of how often codes occur. A frequency histogram of the codes can be produced with the function rock::code_freq_hist(). If an object with parsed sources is passed to this function, it creates a code frequency histogram where colours are used to represent the frequencies per source. It is also possible to pass a single parsed source, in which case a code frequency histogram for that source will be shown.

rock::code_freq_hist(dat);
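
The numbers behind such a histogram are simply counts of coded utterances per code and per source. As a minimal base R sketch, the following cross-tabulation uses the codings of "emotions" and "language>words" that appear in the example fragments shown in this file (one row per coded utterance); the source labels are shortened for readability.

```r
### One row per coded utterance
codings <- data.frame(
  source = c(rep("metropolis-a", 4),
             rep("gloria-a", 2)),
  code = c(rep("emotions", 4),
           "emotions",
           "language>words"),
  stringsAsFactors = FALSE
);

### Count coded utterances per code and per source
freqTable <- table(code = codings$code,
                   source = codings$source);

print(freqTable);

### Totals per code, over all sources
print(rowSums(freqTable));
```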

Showing the attributes for each case

To get a quick overview of all attributes that were specified, use rock::show_attribute_table().

rock::show_attribute_table(dat);
caseId  artistName       songName                                        year
1       Dream Theater    Metropolis Part 1: The Miracle And The Sleeper  1996
2       The Dear Hunter  Gloria                                          2008

Showing the fragments coded with a specific code

Usually, you will want to inspect the coded fragments for each code to determine how to recode (e.g. merge codes, split codes, or code with different codes). The function rock::collect_coded_fragments() collects all coded fragments from all sources and shows them. The codes argument contains a regular expression, and all codes matching that regular expression will be processed (the default value is a regular expression that matches everything). Use the context argument to specify how many preceding and succeeding utterances you want to show as context.

rock::collect_coded_fragments(dat,
                              codes = "emotions",
                              context=2);

Collected coded fragments for codes ‘emotions’ with 2 lines of context

emotions (path: codes>emotions)


Source: _PRIVATE_metropolis-a_coded1.rock

[[uid=772lbsdw]] The smile of dawn [[emotions]]
[[uid=772lbsdx]] Arrived early May
[[uid=772lbsdy]] She carried a gift [[reference>person>third]]
[[uid=772lbsdz]] from her home [[reference>location]]
[[uid=772lbsf1]] The night shed a tear [[emotions]]

Source: _PRIVATE_metropolis-a_coded1.rock

[[uid=772lbsdy]] She carried a gift [[reference>person>third]]
[[uid=772lbsdz]] from her home [[reference>location]]
[[uid=772lbsf1]] The night shed a tear [[emotions]]
[[uid=772lbsf2]] To tell her of fear [[emotions]]
[[uid=772lbsf3]] And of sorrow and pain [[emotions]]

Source: _PRIVATE_metropolis-a_coded1.rock

[[uid=772lbsdz]] from her home [[reference>location]]
[[uid=772lbsf1]] The night shed a tear [[emotions]]
[[uid=772lbsf2]] To tell her of fear [[emotions]]
[[uid=772lbsf3]] And of sorrow and pain [[emotions]]
[[uid=772lbsf4]] She’ll never outgrow [[reference>person>third]]

Source: _PRIVATE_metropolis-a_coded1.rock

[[uid=772lbsdz]] from her home [[reference>location]]
[[uid=772lbsf1]] The night shed a tear [[emotions]]
[[uid=772lbsf2]] To tell her of fear [[emotions]]
[[uid=772lbsf3]] And of sorrow and pain [[emotions]]
[[uid=772lbsf4]] She’ll never outgrow [[reference>person>third]]

Source: _public_metropolis-a_coded1.rock

[[uid=772lbsdw]] XXX XXXXX XX XXXX [[emotions]]
[[uid=772lbsdx]] XXXXXXX XXXXX XXX
[[uid=772lbsdy]] XXX XXXXXXX X XXXX [[reference>person>third]]
[[uid=772lbsdz]] XXXX XXX XXXX [[reference>location]]
[[uid=772lbsf1]] XXX XXXXX XXXX X XXXX [[emotions]]

Source: _public_metropolis-a_coded1.rock

[[uid=772lbsdy]] XXX XXXXXXX X XXXX [[reference>person>third]]
[[uid=772lbsdz]] XXXX XXX XXXX [[reference>location]]
[[uid=772lbsf1]] XXX XXXXX XXXX X XXXX [[emotions]]
[[uid=772lbsf2]] XX XXXX XXX XX XXXX [[emotions]]
[[uid=772lbsf3]] XXX XX XXXXXX XXX XXXX [[emotions]]

Source: _public_metropolis-a_coded1.rock

[[uid=772lbsdz]] XXXX XXX XXXX [[reference>location]]
[[uid=772lbsf1]] XXX XXXXX XXXX X XXXX [[emotions]]
[[uid=772lbsf2]] XX XXXX XXX XX XXXX [[emotions]]
[[uid=772lbsf3]] XXX XX XXXXXX XXX XXXX [[emotions]]
[[uid=772lbsf4]] XXX’XX XXXXX XXXXXXX [[reference>person>third]]

Source: _public_metropolis-a_coded1.rock

[[uid=772lbsdz]] XXXX XXX XXXX [[reference>location]]
[[uid=772lbsf1]] XXX XXXXX XXXX X XXXX [[emotions]]
[[uid=772lbsf2]] XX XXXX XXX XX XXXX [[emotions]]
[[uid=772lbsf3]] XXX XX XXXXXX XXX XXXX [[emotions]]
[[uid=772lbsf4]] XXX’XX XXXXX XXXXXXX [[reference>person>third]]

Source: gloria-a_coded1.rock

[[uid=772lbsdq]] I wasn’t wrong to fend their ambiguity [[reference>person>first]] [[reference>person>third]]
[[uid=772lbsdr]] Then I learned to turn emotions into weaponry [[emotions]]
[[uid=772lbsds]] One too many words said with the wrong inflection [[language>words]]
[[uid=772lbsdt]] Leading me to throw up my hands [[reference>person>first]]

Showing the fragments coded with a specific code for participants with a specific attribute

Sometimes it is useful to compare the coded fragments between groups of participants (e.g. older versus younger participants, or male versus female participants, etc). You can achieve this by specifying one or more values for one or more attributes using the attributes argument. Specifically, you pass a list where every element’s name is a valid (i.e. occurring) attribute name, and every element is a character value with a regular expression specifying all values for that attribute to select.

rock::collect_coded_fragments(dat,
                              codes = "emotions",
                              heading = "Fragments coded with 'emotions' by The Dear Hunter",
                              context=2,
                              attributes = list(artistName='The Dear Hunter'));

Fragments coded with ‘emotions’ by The Dear Hunter

emotions (path: codes>emotions)


Source: gloria-a_coded1.rock

[[uid=772lbsdq]] I wasn’t wrong to fend their ambiguity [[reference>person>first]] [[reference>person>third]]
[[uid=772lbsdr]] Then I learned to turn emotions into weaponry [[emotions]]
[[uid=772lbsds]] One too many words said with the wrong inflection [[language>words]]
[[uid=772lbsdt]] Leading me to throw up my hands [[reference>person>first]]

Recoding

[still has to be added]

Final results

The analyses for the final results are the same as the ones you run along the way, except that the results are now final.

dat <-
  rock::parse_sources(here::here('data'),
                      regex = "_coded1|attributes");
rock::code_freq_hist(dat);

Anonymizing sources

Ideally, if you used the right informed consent forms (see the Psy Ops Guide) and if during transcription, all data were anonymized, you can simply publish the sources in their current state. This is by far the most transparent (and as such, the most desirable) approach. However, sometimes informed consents did not provide for licensing participants’ data, sometimes data were not, or badly, anonymized during transcription, or sometimes data anonymization is not possible at all.

In those cases, you need to anonymize your datasets. Fortunately, if you followed the guidelines set out in section “Organizing the files”, you will have always included some string (for example, “_PRIVATE_”) in all filenames of all files that cannot be made public. Therefore, it is easy for you, or for a computer, to obtain a list of those files, which also means that it is easy to systematically process all of them.

The rock package has a function to anonymize sources that replaces all lowercase and uppercase letters as well as all digits with a symbol (by default, a capital “X”). You can do this either for the entire source, or for a proportion of the utterances, which will then be randomly chosen. To anonymize all sources, you can use rock::mask_sources().

The following command processes all sources that contain “_PRIVATE_” in their filename, anonymizes them completely, and writes the file with the resulting anonymized source to the same directory. By default, this function appends “_masked” to each filename, and it also replaces the text “_PRIVATE_” with “_public_”. If we do not override these defaults (which we could do by passing a value for arguments “outputSuffix” and “filenameReplacement”, respectively), both changes will be applied to the filenames. However, we only need one systematic change to signify that we anonymized a source. Therefore, we manually specify that we do not want to append any text to the filenames of the anonymized sources using “outputSuffix = ""”.

rock::mask_sources(input = here::here(),
                   output = "same",
                   outputSuffix = "",
                   filenameRegex = "_PRIVATE_");

If the text “_PRIVATE_” is included in this project’s “.gitignore” file, the original private sources are never synchronized, whereas the new anonymized sources (whose filenames contain “_public_” instead) will now automatically be synchronized with this project’s Git repository; if that is in turn synchronized with other repositories (e.g. an Open Science Framework repository), they will also synchronize to those repositories.

If you want to mask a lower proportion of a source, use the “proportionToMask” argument. By default, it is set to “1”, but if you specify, for example, “0.5”, only half of the source will be masked.
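To illustrate what masking does to a source, here is a minimal base R sketch. It is not the implementation used by rock::mask_sources(): the helper names maskUtterance() and maskSource() are hypothetical, and the sketch assumes that everything between double square brackets (UIDs and codes) is left untouched, as in the masked fragments shown earlier in this file.

```r
### Hypothetical helper: mask letters and digits, but keep [[...]] tags
maskUtterance <- function(x, maskChar = "X") {
  ### Locate the tags and remember their contents
  tagPositions <- gregexpr("\\[\\[[^]]+\\]\\]", x);
  tags <- regmatches(x, tagPositions);
  ### Mask all letters and digits (this also masks the tags)
  masked <- gsub("[A-Za-z0-9]", maskChar, x, perl = TRUE);
  ### Put the original tags back; positions are unchanged because
  ### every character is replaced by exactly one character
  regmatches(masked, tagPositions) <- tags;
  return(masked);
}

### Hypothetical helper: mask a randomly chosen proportion of utterances
maskSource <- function(utterances, proportionToMask = 1) {
  nToMask <- round(proportionToMask * length(utterances));
  toMask <- sample(seq_along(utterances), size = nToMask);
  utterances[toMask] <- maskUtterance(utterances[toMask]);
  return(utterances);
}

exampleSource <- c("[[uid=772lbsdw]] The smile of dawn [[emotions]]",
                   "[[uid=772lbsdx]] Arrived early May",
                   "[[uid=772lbsdy]] She carried a gift [[reference>person>third]]",
                   "[[uid=772lbsdz]] from her home [[reference>location]]");

### Mask everything
fullyMasked <- maskSource(exampleSource);
print(fullyMasked);

### Mask a random half of the utterances
set.seed(20240101);
halfMasked <- maskSource(exampleSource, proportionToMask = 0.5);
print(halfMasked);
```

Because the codes and identifiers survive the masking, the masked sources can still be parsed and counted, which is what makes them useful to publish.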

You may wonder why it is important to still publish sources even if you mask all of them completely. The reason is that meta-scientists can learn from patterns in coding and analysis even based on, for example, the density of codes per source, or relative to the number of words in a source or in an utterance, or in other ways we cannot conceive of yet at this moment. Also, by providing masked sources, you still allow some insight into your process which would be absent otherwise.


  1. The ROCK standard is based on using plain text files; the standard has as yet not been extended to other formats.