Long-term SSNA data storage/documentation manual

1. About

Edgeryders is involved in research activities, and our most distinctive tool is a method called semantic social networks, or SSNs (described in this paper). This manual describes the recommended process for documenting, sharing and storing, over the long term, the data produced in the course of an SSNA project. It also explains the rationale behind this process.

2. Principles

We are committed to doing science in an ethical, responsible way. This means:

  • Be generous. Results, methods, collected and generated data etc. should be shared with everyone, with as much openness as we can get away with.
  • Be accountable. Our processes should be visible, our results replicable.
  • Be mindful of humans, their integrity and privacy. This means paying very close attention to data protection, privacy, and informed consent to research.

All this has implications for how we handle data.

3. Data management standards

We adopt the following standards for SSN data:

  • License: Creative Commons Attribution 4.0 International (CC BY 4.0).
  • Data format: CSV, an open format which is the de facto standard for much data science at the time of writing this manual.
  • Data documentation: Data Package, a standard developed by the Open Knowledge Foundation and based on the frictionless data approach.
  • Long-term storage: Zenodo, a facility maintained by CERN (and therefore expected to last a long time) and fully integrated with OpenAire, the European Union’s initiative for open science.

4. Pseudonymization: why and how

It is Edgeryders policy that SSN data should be pseudonymized, not anonymized: SSN analysis requires that all content by a participant remains attributed to the same (pseudonymous) user, in order to preserve information about the social dynamics of the conversational environment.

Pseudonymization might seem pointless, because:

  1. SSNA is based on the online discussion on an open forum. If you have access to the full text of a post from a pseudonymized dataset, you can easily find out the Edgeryders username of its author: just copy-paste the text into a web search engine.
  2. Edgeryders does not enforce a real name policy, so usernames are already pseudonyms.
  3. Everyone participating in Edgeryders research has given consent to their data being used for research.

Nevertheless, we believe pseudonymization takes care of a specific data protection problem, which arises when a participant, after the research project is over, decides to erase some of her posts, or to delete her Edgeryders account altogether. We have a procedure to deal with this, and indeed that participant’s content will disappear from the edgeryders.eu database. But these changes will not affect research datasets, which are snapshots of parts of that database taken at a time when her content was still there. Even if we could commit to keeping past datasets stored on Zenodo in sync with the live database of edgeryders.eu (and we cannot credibly promise that), we have no way of even knowing whether those datasets have been downloaded by other researchers in the meantime. Storing datasets in pseudonymized form allows a participant to “disappear” from the live database of edgeryders.eu without affecting the integrity of past datasets.

Edgeryders is aware that pseudonymization, and even anonymization techniques, rarely offer total protection in the modern world, and that de-anonymization is often possible. We have policies to discourage anyone from sharing on Edgeryders any content that requires strong protection.

5. Export an SSN dataset

5.1. Prepare the data

Make sure that the online conversation related to your project is all appropriately tagged. While most research projects in Edgeryders have a category as their home, being in the right category is neither necessary nor sufficient for a topic to be included in the SSN. Instead, each topic should be tagged with a Discourse tag identifying the research project. This is how our ethnographic coding software, OpenEthnographer, identifies the topics to be coded by researchers. The naming convention for such tags is ethno-PROJECTNAME.

5.2. Export and pseudonymize

  1. Download the scripts in this repository. You will also need our Python module for API access (available here). Finally, you are going to need an API key to edgeryders.eu. If you do not have one, ask @matthias.

  2. Run the download_and_pseudonymize.py script, setting the Discourse tag for your project as the argument of the function called in its __main__ block. The script accesses live data on edgeryders.eu and saves it in the form of four files:

    • annotations.csv. The ethnographic annotations from the project.
    • codes.csv. The ethnographic codes from the project.
    • participants.csv. A pseudonymized list of participants to the project’s conversation.
    • posts.csv. All posts in the project’s conversation, where the IDs for author/recipient and the text of each post have been pseudonymized. For example, a mention like @alberto in the text of a post becomes @anon12345678 (see the sketch after this list).
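
The pseudonymization itself is handled by the script, but the general idea can be sketched as follows: each username is mapped to a stable anon ID, and the same mapping is applied to the author/recipient columns and to @mentions inside post bodies. The helper below is a minimal illustration of that idea only; the hashing scheme and the eight-digit format are ours for the example, not necessarily what the script does.

```python
import hashlib
import re

def anon_id(username: str, salt: str) -> str:
    """Map a username to a stable pseudonym such as 'anon12345678'.

    Illustrative only: the real script may derive its IDs differently.
    """
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    return "anon" + str(int(digest[:8], 16) % 10**8).zfill(8)

def pseudonymize_text(text: str, salt: str) -> str:
    """Replace every @username mention in a post body with its pseudonym."""
    return re.sub(r"@([A-Za-z0-9_.\-]+)",
                  lambda match: "@" + anon_id(match.group(1), salt),
                  text)

# "@alberto" becomes something like "@anon12345678"
print(pseudonymize_text("Thanks @alberto for the write-up!", salt="project-secret"))
```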

6. Document the dataset with Data Package

There are several ways to produce the datapackage.json file to accompany the CSV files.

  • Create it manually, with a text editor. This process is much faster if you start from our example file.
  • Infer it from your data. For this purpose, the Frictionless Data project provides software libraries for several languages and frameworks (see the sketch after this list). You still have to write the descriptions of the files and of the fields in each file, or re-use and adapt the example file.
  • Upload your CSV files as a Data.World dataset (example). Data.World provides a very usable interface for adding descriptions to whole files and to each individual column in each file. Upon downloading the dataset, Data.World automatically generates the datapackage.json.
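
For the second option (inferring the descriptor from the data), a minimal sketch using the Python datapackage library could look like this; the glob pattern and output path are assumptions, and the inferred descriptor still needs hand-written descriptions for each file and field:

```python
# pip install datapackage
from datapackage import Package

package = Package()
package.infer('*.csv')            # detect resources, paths and field types from the CSV exports
package.save('datapackage.json')  # descriptions of files and fields still have to be added by hand
```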

If you follow the Data.World method, be aware that it has some quirks in how it represents tabular datasets. It creates two folders:

  • An original folder with the files you uploaded.
  • A data folder with the same files.

The datapackage.json documents all of these resources, i.e. 8 files. To remove this duplication, the trim_datapackage.py script produces a streamlined version of the datapackage.json. The full dataset to be uploaded then consists of only the original files, plus the streamlined datapackage.json.
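
We do not reproduce trim_datapackage.py here, but the kind of clean-up it performs can be sketched like this; the sketch assumes the duplicated resources are the ones whose path starts with data/, which may differ from how the actual script identifies them:

```python
import json

# Read the descriptor generated by Data.World
with open('datapackage.json') as f:
    descriptor = json.load(f)

# Keep only the resources corresponding to the originally uploaded files
# (assumption: the duplicates live under a data/ path)
descriptor['resources'] = [
    resource for resource in descriptor['resources']
    if not resource.get('path', '').startswith('data/')
]

with open('datapackage_trimmed.json', 'w') as f:
    json.dump(descriptor, f, indent=2)
```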

7. Store the dataset on Zenodo

Zenodo has an intuitive user interface. If your project is funded by the EU, choose “European Commission Funded Research (OpenAire)” in the field called “communities”. This opens another field to enter the project’s acronym or number. It is important to do this, as it helps project officers and evaluators.

:warning: At the time of writing, Zenodo has a long-standing issue preventing the seamless upload of JSON files. I have opened a ticket with them.
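
If you prefer to script the upload rather than use the web interface, Zenodo also exposes a REST API. The sketch below is one possible approach, assuming you have created a personal access token in your Zenodo account settings; the file names and metadata values are placeholders, and you can still select the OpenAire community and publish from the web interface afterwards.

```python
# pip install requests
import requests

ZENODO_TOKEN = "your-personal-access-token"  # placeholder
params = {"access_token": ZENODO_TOKEN}

# 1. Create an empty deposition and get its file bucket
deposition = requests.post("https://zenodo.org/api/deposit/depositions",
                           params=params, json={}).json()
bucket_url = deposition["links"]["bucket"]

# 2. Upload the dataset files (example names from this manual)
for filename in ["posts.csv", "participants.csv", "annotations.csv",
                 "codes.csv", "datapackage.json"]:
    with open(filename, "rb") as fp:
        requests.put(f"{bucket_url}/{filename}", data=fp, params=params)

# 3. Attach minimal metadata (placeholder values); community, funding and
#    publication can then be finalized in the web interface
metadata = {"metadata": {"title": "SSNA dataset (pseudonymized)",
                         "upload_type": "dataset",
                         "description": "Pseudonymized SSNA dataset.",
                         "creators": [{"name": "Edgeryders"}]}}
requests.put(f"https://zenodo.org/api/deposit/depositions/{deposition['id']}",
             params=params, json=metadata)
```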
