This is an empty project template, optimized for projects using the Reproducible Open Coding Kit (the ROCK), the iROCK interface for coding, and the
rock R package for analysis. This project accompanies the ROCK book (https://rockbook.org).
This project is an RStudio Project. The easiest way to work with it is to download the directories and files in this project to your local computer. You can then delete all the example sources in this project and replace them with your own files, and edit this file to fit your own project. If you’re familiar with Git, you can clone the project; if you’re not familiar with Git and don’t want to use it, simply download this project as a zip file.
You can clone or download this project by visiting its GitLab repository (see the Open Science Manifest below).
If you do not yet have R and RStudio installed, or if you are working on a machine where they are installed but you’re not familiar with them, check out the Chapter about R and the Chapter about RStudio in the Psy Ops Guide for more information. If you would like to start using Git as a version control system and/or to smoothly collaborate with others, see the Chapter about Git.
For an explanation of how to install the
rock R package, see section “Downloading and installing the
rock package” in the ROCK book. Note that this Empty ROCK Project uses some of the newest functionality of the
rock R package, and therefore, you will probably have to install the development version of the
rock package to be able to run all examples in this project.
This table collects information about project resources (e.g. repositories, articles, presentations, etc).
|Resource|Location|
|---|---|
|Open Science Framework repository|Not available yet|
|Preprint DOI and URL|Not available yet|
|Final published article DOI and URL|Not available yet|
|License|Not available yet|
|Other important resources|Not available yet|
The ROCK was designed to facilitate transparency and reproducibility of research, as well as to comply with the General Data Protection Regulation. This requires a certain level of due diligence when it comes to data management, or more specifically, file management.
This bare bones ROCK project assumes you already converted your original data sources to plain text files. For audio recordings, this will mean that they were transcribed, and ideally, anonymized. For other sources (video, photographs, etc), it will often mean that the relevant information was described in text files to enable coding them with the ROCK.1
When doing qualitative research, raw data is often cleaned and restructured and information is added to the data to aid interpretation. This process often occurs in consecutive steps, and a convenient way to optimize transparency of this process is storing the intermediate versions. A drawback is that the number of files quickly proliferates, which can make it hard to retain an overview. Therefore, it is important to organise these files well.
There are basically two approaches to organising the source files. The first is to create new directories (‘folders’) for each action that changes the sources. In this approach, one directory contains all raw sources; a second directory contains all sources after the
rock command to clean them was executed; a third directory contains all sources after the
rock command to add the Utterance Identifiers was executed; et cetera. The advantage of this approach is that every directory is in itself a snapshot of the data at the corresponding stage in the project. A disadvantage is that you may end up with quite a lot of directories.
The second approach is to create the files in the same directory, but to use a convention to rename them. In this approach, some string of characters is added to the sources’ filenames for each stage in the project. In this approach, the raw sources all have their original name; when the
rock command to clean them is executed, it stores its result in filenames that have, for example, "_clean" appended; when the
rock command to add the Utterance Identifiers is executed, it appends, for example, "_uids", et cetera. The drawbacks of this approach are that you end up with a lot of files, and very long filenames.
In practice, what works best depends on individual preference. We recommend combining both approaches: appending something to the filenames and storing the files of each stage in a different directory. This approach makes it easiest for others to follow what you did. Therefore, that is the approach implemented in this bare bones project.
For example, look at the directories in this project:
## ├── data
## │   ├── 01--raw-sources
## │   ├── 02--cleaned-sources
## │   ├── 10--sources-with-UIDs
## │   ├── 11--autocoded-sources--round-01
## │   ├── 12--autocoded-sources--round-02
## │   ├── 20--manually-coded-sources
## │   ├── 21--recoded-sources--round-01
## │   └── 22--recoded-sources--round-02
## ├── public
## ├── results
## └── scripts
The directories within the
data directory all start with double digits. These ensure the directories are shown in the right order (note that these directory names apply the conventions laid out in the Psy Ops Guide; we recommend you do the same). The reason double digits are used is that this allows structuring of the directories in phases. Directories holding sources from preparatory steps start with a 0; directories holding sources from automatic coding steps start with 1; and directories holding sources from manual coding and recoding start with 2.
Once you have created these directories, copy the raw sources to the
01--raw-sources directory. The next step is to organise the names of the files in your directory structure. If some data need to stay private (for example, because the informed consent forms that were used did not provide for sharing the raw data; see the Psy Ops Guide), you need to make sure you include a specific pattern in the filenames of the private files to enable easy identification. We recommend prepending the string “
_PRIVATE_” (an underscore, then the word ‘private’ in capitals, then another underscore).
This allows you to easily exclude all these files from synchronizing to, for example, public repositories for your project (e.g. by specifying the pattern in the project’s .gitignore file).
In addition, it’s a good idea to use relatively brief filenames, as each successive analysis step will add another suffix.
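For example, if you use Git, a single pattern in the project’s .gitignore file suffices to keep all private files out of the repository. The exact pattern below is an illustration of the convention described above, not something prescribed by the ROCK itself:

```
# Never commit files whose name contains the private marker
*_PRIVATE_*
```

Because the pattern matches anywhere in a filename, it covers both prepended markers (“_PRIVATE_interview-03.rock”) and markers elsewhere in the name.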
This R code does some basic setup tasks. It installs the most recent versions of the
ufs, yum, and
rock R packages; checks for the presence of the
here package (if it isn’t installed, install it using
install.packages('here')); and finally sets the
knitr chunk option
echo to
TRUE, which means that by default, each chunk’s R code will be included in the rendered HTML file.
### Get most recent versions of some packages, and the development
### version of the rock package - set quiet to FALSE to see info
quiet = TRUE;
quietGitLabUpdate("r-packages/ufs", quiet=quiet);
quietGitLabUpdate("r-packages/yum", quiet=quiet);
quietGitLabUpdate("r-packages/rock@dev", quiet=quiet);

### Get additional packages
ufs::checkPkgs('here');

### Set options
knitr::opts_chunk$set(echo = TRUE, comment = "");
The ROCK package has a number of defaults that can be customized. This customization happens in this section. A complete overview of the default settings can be obtained by running
In this template project, we will set two identifiers to be persistent identifiers (which means that once they have been applied to an utterance in a source, the
rock package will auto-apply them to all subsequent utterances) and we define two patterns for section breaks (section breaks are important when segmenting sources, but can be disregarded otherwise).
Also, the default setting of
silent = TRUE is explicitly included here to make it easy for you to set it to
FALSE if you want to receive more detailed feedback from the
rock package.
### Set the non-default ROCK options
rock::opts$set(
  persistentIds = c("caseId", "coderId"),
  sectionRegexes = c(paragraphBreak = "---<<paragraph-break>>---",
                     topicListSwitch = "---<<topiclist-switch>>---"),
  silent = TRUE
);
Freshly transcribed sources are not always very neatly and consistently formatted. Therefore, some cleaning is often beneficial. The main purpose of cleaning is making sure that utterances are split by the utterance marker (the ROCK default is a newline character,
\n, which most operating systems render as a new line).
By default, the
rock package tries to smartly insert newline characters between all sentences (see
?rock::clean_sources for more details). You can also use this function to perform replacements before or after the insertion of the utterance markers. For example, suppose the interviewed participants live in six cities, say Amsterdam, Beijing, Canberra, Dhaka, Edinburgh, and Freetown, and have relatively rare afflictions. To preserve their anonymity, all city names can be replaced with the text “
GEOGRAPHICAL_REFERENCE” by combining the city names into the regular expression “
^Amsterdam$|^Beijing$|^Canberra$|^Dhaka$|^Edinburgh$|^Freetown$” and specifying both the regular expression and the replacement in the “
extraReplacementsPre” argument, as done in this example.
rock::clean_sources(
  input = here::here("data", "01--raw-sources"),
  output = here::here("data", "02--cleaned-sources"),
  extraReplacementsPre = list(
    c("^Amsterdam$|^Beijing$|^Canberra$|^Dhaka$|^Edinburgh$|^Freetown$",
      "GEOGRAPHICAL_REFERENCE")
  )
);
Instead of using a regular expression, you could also specify six pairs to search and replace: one for each city. In this example, regular expressions are efficient, but not necessary. In many cases, however, regular expressions can make one’s life considerably easier. To learn more about regular expressions, see the Psy Ops Guide.
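To illustrate the pair-based alternative, the same replacement could be written as six separate search-and-replace pairs. This is a sketch: it assumes that the extraReplacementsPre argument accepts multiple pattern/replacement vectors in its list, as suggested by the single-pair example above.

```r
### Hypothetical equivalent of the regex-based call above, using one
### search-and-replace pair per city instead of one combined regex
rock::clean_sources(
  input = here::here("data", "01--raw-sources"),
  output = here::here("data", "02--cleaned-sources"),
  extraReplacementsPre = list(
    c("^Amsterdam$",  "GEOGRAPHICAL_REFERENCE"),
    c("^Beijing$",    "GEOGRAPHICAL_REFERENCE"),
    c("^Canberra$",   "GEOGRAPHICAL_REFERENCE"),
    c("^Dhaka$",      "GEOGRAPHICAL_REFERENCE"),
    c("^Edinburgh$",  "GEOGRAPHICAL_REFERENCE"),
    c("^Freetown$",   "GEOGRAPHICAL_REFERENCE")
  )
);
```

The combined regular expression is clearly more compact, which is why it is used in the main example.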
Note that if you don’t want to clean, or if you have already cleaned your sources, you can always do any replacements using the
Once all sources have been cleaned, you can add the utterance identifiers. These are character strings that uniquely identify each utterance.
rock::prepend_ids_to_sources(
  input = here::here("data", "02--cleaned-sources"),
  output = here::here("data", "10--sources-with-UIDs")
);
After the UIDs were prepended, you can start coding the sources.
However, before you start manually coding, you may want to apply some codes automatically.
For example, you may want to code all utterances that contain the word “words” with code
language>words (i.e. the code
words as a child code of code
language). You can perform such autocoding using
rock::code_sources(
  input = here::here("data", "10--sources-with-UIDs"),
  output = here::here("data", "11--autocoded-sources--round-01"),
  codes = c("words" = "language>words"),
  outputSuffix = "_acWords"
);
Note that we manually specified a suffix to use so that the filename reflects what we autocoded in this step. If you perform a lot of autocoding, adding a suffix for each step can lead to excessively long filenames. Therefore, you may choose to not add suffixes, but just use the different directories to keep track of the source’s states over the autocoding steps. In that case, you would specify
outputSuffix = NULL. Alternatively, you can use minimal, non-semantic, suffixes, such as “
_ac1”, for autocoding round 1, “
_ac2” for round 2, et cetera.
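As an illustration, the first autocoding step shown above could use such a minimal, non-semantic suffix; only the outputSuffix value differs from the earlier call:

```r
### Same autocoding call as above, but with a minimal,
### non-semantic suffix marking autocoding round 1
rock::code_sources(
  input = here::here("data", "10--sources-with-UIDs"),
  output = here::here("data", "11--autocoded-sources--round-01"),
  codes = c("words" = "language>words"),
  outputSuffix = "_ac1"
);
```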
Often, you will not simply want to search for single words when autocoding, but instead allow variations or multiple words. For this, you can use regular expressions (again, see the Psy Ops Guide).
rock::code_sources(
  input = here::here("data", "11--autocoded-sources--round-01"),
  output = here::here("data", "12--autocoded-sources--round-02"),
  codes = c("smile|[tf]ear|sorrow|pain|emotions?" = "emotions"),
  outputSuffix = "_ac2"
);
In this regular expression, the pipe (
|) means ‘or’; in other words, the four pipes mean that this regular expression consists of five different regular expressions, and it matches any of those. The square brackets form a so-called character class, and such a class matches exactly one of the characters contained in it, so the second ‘sub-regex’ matches both ‘tear’ and ‘fear’. The question mark means that the preceding character is optional, so the last regex matches both ‘emotion’ and ‘emotions’.
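You can verify this behaviour interactively with base R’s grepl(), which returns TRUE for each string that the regular expression matches somewhere:

```r
pattern <- "smile|[tf]ear|sorrow|pain|emotions?";

### The character class matches both 'tear' and 'fear',
### and the '?' makes the final 's' optional
grepl(pattern, c("tear", "fear", "emotion", "emotions"));
### All four elements are TRUE

### 'bear' does not match: 'b' is not in the [tf] class,
### and no other alternative matches either
grepl(pattern, "bear");
### FALSE
```

Note that grepl() matches substrings, so a word like ‘tears’ would also match; the autocoding regex above relies on this behaviour.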
Note that we now read the sources to autocode from the directory where the results of the previous autocoding round were stored. Also note that we write the results of this round to another directory, so that our progress is very easy to follow for interested researchers (which includes our future self).
Normally, explanations and justifications of each step would be added, so that others can follow the process. These can be inserted as regular text, like in this R Markdown file, or as comments in an R script.
In most projects, after zero or more rounds of autocoding, manual coding is inevitable. For manual coding, you can use a plain text editor or a dedicated interface such as iROCK. iROCK is a Free/Libre Open Source Software interface (you can download it from this git repository, or directly use the version hosted by GitLab Pages here; this interface is discussed in this chapter in the ROCK book).
In this example, some manual codes were added to the sources and they were then written to directory “20--manually-coded-sources”.
Note that up until this stage, the entire process followed deterministically from the combination of the raw sources and this file. This means that in principle, only the raw sources and this file would need to be saved; we stored the sources in each intermediate stage to facilitate inspection of the process, but if those files were lost, it would not be a problem.
However, manual coding is, of course, manual. That means that this stage involves humans and cannot be automatically repeated. This is signified by the increased first digit in the directory name (the first character is now “
2”) and by the removal of all previously added suffixes and the addition of the suffix “
_coded1”. The “
1” here is a counter for the manual coding round; sometimes multiple manual coding rounds are required. However, usually, after the first coding round, (re)coding is accomplished using
rock commands, so that they can be easily documented and decisions can be elaborated and justified.
If coding is deductive (i.e. if prior knowledge about the subject of study is available, and has been captured in a codebook), the manual coding phase normally consists of application of the coding instructions in the codebook to the data in the sources. Creation of new (sub-)codes then happens after that manual coding round (through
rock commands, and documenting the decisions).
Often, qualitative sources have a set of characteristics that are known independently of coding. In the ROCK, these are called “attributes”, and they are assigned to sources using “identifiers”. An identifier is a special code that can be used to identify sources. For example, identifiers can be used to code which participant was interviewed; who the interviewer was; what the location of the interview was; or whether the interview was conducted during the morning, afternoon, or evening.
These identifiers can then be used to associate attributes to utterances. By default, the identifier “
caseId” is recognized. It is configured as a so-called “persistent” identifier, which means that it does not apply only to the coded utterance, but is automatically applied to all following utterances until another
caseId identifier is encountered. This is convenient, because it means that for sources with a single data provider (e.g. individual interviews, or documents from single organisations), only one case identifier has to be added (to one of the first lines). For sources with multiple data providers, the case identifier has to be repeated each time the data provider changes (but only once).
These identifiers are ideally added during or just after transcription, but in this example, they were added during the manual coding. When following the default convention,
caseIds are coded using “
[[cid=XXX]]”, where “
XXX” specifies the unique identifier for that case. For example, one can use numbers, or random strings, or pseudonyms (of course, never use participants’ real names!).
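For example, the first lines of a source for case 1 might look like the fragment below (the utterance text is a placeholder; the exact utterances depend on your data):

```
[[cid=1]]
This is the first utterance of this source.
This utterance, and all following ones, are attributed to case 1.
```

Because caseId is configured as a persistent identifier, every utterance after the “[[cid=1]]” line belongs to case 1 until another case identifier occurs.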
The attributes for identifiers are specified in a fragment delimited by two lines that each contain exactly three dashes. Such a fragment looks like this:
---
ROCK_attributes:
  - caseId: 1
    artistName: "Dream Theater"
    songName: "Metropolis Part 1: The Miracle And The Sleeper"
    year: 1996
  - caseId: 2
    artistName: "The Dear Hunter"
    songName: "Gloria"
    year: 2008
---
These attributes (in this case, the names of the artists, songs, and the years the songs were released) are then attached to all utterances coded with that case identifier. This makes it possible to view, for example, all utterances from female participants, or all utterances from female participants that were coded with a specific code.
Before coded sources can be viewed or processed further, they must be parsed by the
rock package. During parsing, all deductive (closed) and inductive (open) codes are collected and compiled into separate and merged coding trees, and a dataframe is created where each utterance is a row, and the associated attributes and codes are stored in columns. The command to parse sources is called
rock::parse_sources(). The following fragment reads all sources in the “
data” directory (and by default, in all subdirectories) that match the regular expression “
_coded1|attributes”, which means that we only parse the manually coded fragments and the file with the attributes.
dat <- rock::parse_sources(here::here('data'), regex = "_coded1|attributes");
We store the object containing all the parsed sources under the name
dat. This allows us to pass it on to other functions for inspection.
To view the inductive code trees, pass the object with the parsed sources on to function
       levelName
1  codes
2   ¦--emotions
3   ¦--reference
4   ¦   ¦--location
5   ¦   °--person
6   ¦       ¦--third
7   ¦       °--first
8   °--language
9       °--words
Sometimes it can be useful to get a quick impression of how often codes occur. A frequency histogram of the codes can be produced with the function
rock::code_freq_hist(). If an object with parsed sources is passed to this function, it creates a code frequency histogram where colours are used to represent the frequencies per source. It is also possible to pass a single parsed source, in which case a code frequency histogram for that source will be shown.
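For example, after the sources have been parsed into the object dat as shown above, producing the histogram takes a single call:

```r
### Frequency histogram for all codes in the parsed sources;
### colours represent the frequencies per source
rock::code_freq_hist(dat);
```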
To get a quick overview of all attributes that were specified, use
|caseId|artistName|songName|year|
|---|---|---|---|
|1|Dream Theater|Metropolis Part 1: The Miracle And The Sleeper|1996|
|2|The Dear Hunter|Gloria|2008|
Usually, you will want to inspect the coded fragments for each code to determine how to recode (e.g. merge codes, split codes, or code with different codes). The function
rock::collect_coded_fragments() collects all coded fragments from all sources and shows them. The
codes argument contains a regular expression, and all codes matching that regular expression will be processed (the default value is a regular expression that matches everything). Use the
context argument to specify how many preceding and succeeding utterances you want to show as context.
rock::collect_coded_fragments(dat, codes = "emotions", context=2);
Sometimes it is useful to compare the coded fragments between groups of participants (e.g. older versus younger participants, or male versus female participants, etc). You can achieve this by specifying one or more values for one or more attributes using the
attributes argument. Specifically, you pass a list where every element’s name is a valid (i.e. occurring) attribute name, and every element is a character value with a regular expression specifying all values for that attribute to select.
rock::collect_coded_fragments(
  dat,
  codes = "emotions",
  heading = "Fragments coded with 'emotions' by The Dear Hunter",
  context = 2,
  attributes = list(artistName = 'The Dear Hunter')
);
[still has to be added]
The analyses for the final results are the same as the ones you run along the way, except that of course the results are now final.
dat <- rock::parse_sources(here::here('data'),
                           regex = "_coded1|attributes");
rock::code_freq_hist(dat);
Ideally, if you used the right informed consent forms (see the Psy Ops Guide) and if, during transcription, all data were anonymized, you can simply publish the sources in their current state. This is by far the most transparent (and as such, the most desirable) approach. However, sometimes the informed consent forms did not provide for licensing participants’ data, sometimes data were not anonymized, or were anonymized badly, during transcription, and sometimes data anonymization is not possible at all.
In those cases, you need to anonymize your datasets. Fortunately, if you followed the guidelines set out in section Organizing Your Project, you will always have included some string (for example, “
_PRIVATE_”) in all filenames of all files that cannot be made public. Therefore, it is easy for you, or for a computer, to obtain a list of those files, which also means that it is easy to systematically process all of them.
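For example, base R can list every private file in the project in a single call:

```r
### List all files anywhere in the project whose
### filename contains the private marker
privateFiles <- list.files(
  here::here(),
  pattern = "_PRIVATE_",
  recursive = TRUE
);
privateFiles;
```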
The
rock package has a function to anonymize sources that replaces all lowercase and uppercase letters, as well as all digits, with a symbol (by default, a capital “
X”). You can do this either for the entire source, or for a proportion of the utterances, which will then be randomly chosen. To anonymize all sources, you can use
rock::mask_sources().
The following command processes all sources that contain “
_PRIVATE_” in their filename, anonymizes them completely, and writes the file with the resulting anonymized source to the same directory. By default, this function appends “
_masked” to each filename, and it also replaces the text “
_PRIVATE_” with “
_public_”. If we do not override these defaults (which we could do by passing a value for arguments “
outputSuffix” and “
filenameReplacement”, respectively), both changes will be applied to the filenames. However, we only need one systematic change to signify that we anonymized a source. Therefore, we manually specify that we do not want to append any text to the filenames of the anonymized sources using “
outputSuffix = ""”.
rock::mask_sources(
  input = here::here(),
  output = "same",
  outputSuffix = "",
  filenameRegex = "_PRIVATE_"
);
If the text “
_PRIVATE_” is included in this project’s “
.gitignore” file, this new anonymized source will now automatically be synchronized with this project’s Git repository, and if that is synchronized with other repositories (e.g. an Open Science Framework repository), it will also synchronize to those repositories.
If you want to mask a lower proportion of a source, use the “
proportionToMask” argument. By default, it is set to “
1”, but if you specify, for example, “
0.5”, only half of the source will be masked.
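For example, to mask a random half of the utterances instead of everything, the earlier call can be repeated with this one extra argument:

```r
### Mask only a random 50% of the utterances in each private source
rock::mask_sources(
  input = here::here(),
  output = "same",
  outputSuffix = "",
  filenameRegex = "_PRIVATE_",
  proportionToMask = 0.5
);
```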
You may wonder why it is important to still publish sources even if you mask all of them completely. The reason is that meta-scientists can learn from patterns in coding and analysis, for example from the density of codes per source, relative to the number of words in a source or in an utterance, or in ways we cannot yet conceive of. Also, by providing masked sources, you still allow some insight into your process that would otherwise be absent.
The ROCK standard is based on using plain text files; the standard has not yet been extended to other formats.↩