Advice needed: exporting and pseudonymizing SSNA data for long-term storage

alberto · February 11, 2020, 10:23am

I have decided to write up a fresh script (or set thereof) for converting Edgeryders data into something re-usable by other researchers interested in SSNA. As you know, publishing open research data is a requirement under Horizon 2020, and even if it were not I would like to do it anyway. I want to do it myself, because most academics are not invested in the publication of open data, and tend to play fast and loose, not document datasets and accompanying code properly, and so on.

There are two sets of problems that the tools need to address.

1. How close to graph form?

SSNs are graphs; the whole interpretive framework is based on this. But the Discourse DB thinks in tables, not in graphs. Hence, graphs for SSNA require external software to build and navigate them.

Given this, when we store data we can do two things.

We could just export an “as is” dump of the relevant part of the DB in JSON form, and leave it to the future user to re-assemble it in graph form.
We could “pre-cook” our data dump into a graph-ready form. For example, you would have a JSON file encoding nodes (users), and another one encoding edges (interactions). This makes it super easy to build a graph with any software. I would probably stick to JSON as a format; storing data in graph format (like GraphML or Tulip) won’t make the work of the re-user much easier, and those formats might even put off some people, since they are far less universal than JSON.
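
To make the second option concrete, here is a minimal sketch of what "pre-cooking" could look like. The sample records and the field names (`username`, `target_username`, `post_id`) are assumptions for illustration, not a fixed schema.

```python
import json

# Hypothetical post records; names and values are made up for illustration.
posts = [
    {"post_id": 1, "username": "anna", "target_username": None},
    {"post_id": 2, "username": "bob", "target_username": "anna"},
    {"post_id": 3, "username": "anna", "target_username": "bob"},
]


def to_nodes_and_edges(posts):
    """Split post records into a node list (users) and an edge list (replies)."""
    usernames = set()
    edges = []
    for post in posts:
        usernames.add(post["username"])
        target = post.get("target_username")
        if target:  # posts that are not replies produce no edge
            usernames.add(target)
            edges.append({
                "source": post["username"],
                "target": target,
                "post_id": post["post_id"],
            })
    nodes = [{"id": name} for name in sorted(usernames)]
    return nodes, edges


nodes, edges = to_nodes_and_edges(posts)
nodes_json = json.dumps(nodes, indent=2)  # would go into a nodes file
edges_json = json.dumps(edges, indent=2)  # would go into an edges file
```

Most graph tools can ingest a node list plus an edge list directly, which is what makes this form convenient for re-users who do not code.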

This choice implies a choice of investment in possible audiences of re-users. Code-literate folks do not have many problems going from tabular data to graphs. Guy, or Bruno, or Ben could code up a script in less than an hour. But then there are people like me, from a social sciences background but with some coding skills; and humanities researchers who use software, but do not code. @amelia, do you have any recommendations here? Are there people out there who would consider re-using our data but might be put off by the idea of having to re-assemble them into graph form?

2. Pseudonymizing

This is something I have never done.

I understand the minimum requirements we must meet are the following. Matthias, can you confirm?

the data do not contain any username
the user IDs are scrambled (user IDs are accessible to anyone via API call).

I would do this in the following way:

Loop over usernames.

For each username, use the random library in Python to generate a random string.

>>> import random
>>> myhash = random.getrandbits(128) 
>>> print(myhash)
181710139364666073670306802841889258260
>>> print("%032x" % myhash)  # zero-padded, so the string is always 32 hex digits
88b419906ec88286af914e37343bbb14

Replace the username with the random string everywhere in the dataset.

Is this acceptable?

Based on this information I can build a workflow. Thanks in advance, everyone.

matthias · February 11, 2020, 1:54pm

And don’t forget the obvious: removing the e-mail addresses from user records. Also have a look at the final output in case that other personally identifiable information makes its way through in user records (IP address of last login etc.).
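
One way to guard against stray personal data is to keep only an explicit list of allowed fields instead of deleting known bad ones. The sketch below assumes hypothetical field names; the real user records may look different.

```python
# Hypothetical field names; the real user record may differ.
ALLOWED_USER_FIELDS = {"username", "created_at"}


def strip_user_record(record, allowed=ALLOWED_USER_FIELDS):
    """Keep only explicitly allowed fields and drop everything else."""
    return {key: value for key, value in record.items() if key in allowed}


raw_user = {
    "username": "anon12345678",
    "email": "someone@example.org",
    "ip_address": "203.0.113.7",
    "created_at": "2019-06-27T21:25:48.283Z",
}
clean_user = strip_user_record(raw_user)
```

With an allow-list, a field that is added to the records later stays out of the export until someone decides it is safe.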

Yes, and esp. also in @mentions. These are not their own field values but appear in plain text as part of posts, so replacing needs some regex work.

You might want to align the naming scheme of the pseudonymized usernames to those created by Discourse when using the “anonymize user” feature. For example @anon87623122 is a user on whom we used that feature.

alberto · February 11, 2020, 2:08pm

This should not be a problem, because the list of users is built up from looking up posts. What we get looks like this:

{'username': u'PSEUDONYM1', 'post_number': 3, 'user_id': OMIT, 'raw': u' Is indeed good news. I imagine that effective mentoring would need to be tailored pretty closely to the individuals who need the mentoring, unless you already have some areas of expertise in mind other than very broad such as technical, UX or business.', 'created_at': u'2019-06-27T21:25:48.283Z', 'reply_to_post_number': 1, 'post_id': 55805, 'target_username': u'PSEUDONYM2', 'reply_to_post_id': 55802}

Noted.

Great tip. Meta does not provide a schema. anon + a number of… how many digits? Does it matter?

Thanks, @matthias, very helpful as always.

hugi · February 11, 2020, 2:08pm

By far, the easiest solution right now is to provide a dump of the Neo4j database for the relevant Graphryder install. Neo4j Community Edition is open source software that can be installed easily on a regular laptop, and it comes with a free (but proprietary) dashboard. It is a very accessible solution.

That gives all the data they need in graph form, and the graph can be explored in the interactive dashboard or through scripts that run locally and connect to the database.

We can also dump a Neo4j database to JSON or to a number of CSVs.

Scrubbing the usernames and scrambling IDs before we provide the dump is probably just three or four lines of Cypher code.

matthias · February 11, 2020, 2:52pm

No, it does not matter how many digits. These are just IDs, not secrets of any kind. 8 digits as in the example seems good – still visually memorizable while making clashes reasonably improbable. When generating and assigning these random numbers, just make sure to keep those already assigned in memory and check if a new random number by chance is the same as an already assigned one. In that case, just try again and create another random number.
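
The retry-on-clash procedure could be sketched as follows. The function name `make_pseudonyms` is made up for illustration, and the sketch draws digits with the standard `secrets` module instead of `random`, which is a substitution on my part so that the pseudonyms cannot be reproduced from a seed.

```python
import secrets


def make_pseudonyms(usernames, digits=8):
    """Map each username to a unique 'anon' + N-digit pseudonym."""
    assigned = set()  # pseudonyms already handed out
    mapping = {}
    for name in usernames:
        while True:
            number = "".join(secrets.choice("0123456789") for _ in range(digits))
            candidate = "anon" + number
            if candidate not in assigned:
                break  # no clash, keep this one; otherwise try again
        assigned.add(candidate)
        mapping[name] = candidate
    return mapping


pseudonym_map = make_pseudonyms(["alice", "bob", "carol"])
```

The resulting mapping is the only link between real usernames and pseudonyms, so it should not be published with the dataset.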

alberto · February 11, 2020, 3:55pm

This is easy and therefore attractive, but it misses out on the principle of exporting data directly from the primary source. For example, now we cannot really trust Graphryder dashboards to really update – there are always small, unaccounted for, discrepancies between posts/annotations as recorded in the Discourse database and what we see in Graphryder.

Anyway, let’s get @amelia’s input. If graph-friendliness turns out to be super important, then Neo becomes more attractive. If tabular data are preferred – which they are, all other things being equal, because there are better standards to provide metadata like Data Package – then it’s probably better to just output pseudonymized JSONs straight from the Discourse tap.

alberto · February 12, 2020, 2:44pm

After some reflection, I have decided to do the following.

1. Generate the data

I export data directly from Discourse. They go into four files:

users (social network nodes)
posts (social network edges)
annotations
codes

Logic: a simple API call to the topic already gets the username of the author of the post that each post is a reply to. So, this maps very simply and reliably onto a “list of nodes and links” representation of a social network.

Annotations and codes are not arranged in network form, although annotations do preserve the IDs of the posts that they annotate, and those of the codes they are annotated with. I think arranging the codes in a graph might be too limiting: after all, almost no ethnographer outside of Edgeryders thinks in networks.

Files are saved in CSV, rather than JSON, format, for reasons explained below.

2. Pseudonymize

See above.

3. Add the metadata

OKFN maintains a nifty Python library that creates Data Package-compliant metadata from a batch of data files:

import datapackage
package = datapackage.Package()
package.infer('**/*.csv')
package.descriptor

However, the library does not work on JSON data, only on “flat” tabular data – which seems to be the new standard anyway, according to OKFN’s Frictionless Data project. So, we default to CSV for our own research data.

alberto · February 12, 2020, 4:42pm

Test post, hitting the “Reply” button.

amelia · February 13, 2020, 10:04am

I’d say that most don’t know how to code and many would be intimidated by having to go from table to graph form. However, if we give very clear instructions on how to do it, I think it will be fine — people have been really keen to use the software, so I think if they’ve already gotten to the point where they’re excited about it, if we can give them clear instructions that aren’t too long or difficult they’ll be up to give it a go. Plus I reckon if we make it clear enough they’ll be excited that they’ve done something a little DIY

alberto · February 13, 2020, 10:39am

More processed data are easier to re-use in the same way as the people who publish them, but harder in any other way. Closer-to-raw data are less user friendly, but open to more types of re-use.

amelia · February 13, 2020, 10:43am

Makes sense, it’s a tradeoff. I am a willing test subject for experimenting with levels of difficulty on this. It seems it’ll be about hitting a sweet spot — easy enough so that people won’t give up before they start, but not so easy that the data is only reusable in very limited fashion.

matthias · February 13, 2020, 12:46pm

Can’t you bundle the data with a Python script (with no to minimal dependencies beyond the standard library) that can transform the data into graph form? Seems doable to me, and would be a good solution of settling on one standard data format while also making it very simple to convert to other formats.

Settling on one single data format is the way to go (I think) because it avoids redundancy, which in turn avoids that users of the data might be unsure if the two formats really represent the same information. With one format and a script, they can make sure of that themselves.
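
A conversion script of this kind can stay within the standard library. The following is only a sketch: the column names (`source_username`, `target_username`) and the inline sample standing in for the posts file are assumptions.

```python
import csv
import io
from collections import Counter

# Stand-in for the posts file; column names and values are assumptions.
SAMPLE_POSTS_CSV = """post_id,source_username,target_username
1,anon00000001,
2,anon00000002,anon00000001
3,anon00000001,anon00000002
4,anon00000002,anon00000001
"""


def build_edge_weights(csv_file):
    """Count replies per (source, target) pair to get a weighted edge list."""
    weights = Counter()
    for row in csv.DictReader(csv_file):
        source = row["source_username"]
        target = row["target_username"]
        if source and target:  # skip posts that reply to nobody
            weights[(source, target)] += 1
    return weights


edge_weights = build_edge_weights(io.StringIO(SAMPLE_POSTS_CSV))
for (source, target), weight in sorted(edge_weights.items()):
    print(source, "->", target, weight)
```

For a real file, the same function would take an open file handle, for example `open("posts.csv", newline="", encoding="utf-8")`.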

alberto · February 13, 2020, 3:34pm

I plan to have everything in CSV.

But the posts.csv file also contains a field called target_username. This makes it trivial to build a social network of interactions, without having to trace back to the author of the post that each post is a reply to, the way we did in Drupal.

alberto · February 13, 2020, 5:08pm

@matthias, data protection advice needed.

I have a pseudonymize function in my export script, as agreed.

But here is a problem. If I start from a specific conversation (say, the one around the tag ethno-poprebel), I end up with a list of usernames of the people who participated in that conversation, and pseudonymize those. So far, so good. But if one of them were to @mention or [quote] someone who is not part of that conversation, but is nevertheless part of Edgeryders, that person would not be pseudonymized.

Should I instead check each post against all usernames in Edgeryders? Or do we consider that mentioning someone who did not, herself, participate in that conversation does not compromise that person? We are typically talking about high-profile but long-inactive Edgeryders, like @hexayurt.

FYR, the code:

        clean_text = str(post['text'])
        # remove pictures and replace with a placeholder
        # (raw string, escaped dot, non-greedy match so two images on one line stay separate)
        clean_text = re.sub(r"!\[.{5,}?\.jpeg\)", "<image here> \n", clean_text)
        # this needs to change, possibly multiple times if the post mentions multiple people

        for name in pseudonym_map:
            if post['source_username'] == name:
                clean_post['source_username'] = pseudonym_map[name]
            if post['target_username'] == name:
                clean_post['target_username'] = pseudonym_map[name]
            # the following takes care of the @mentions
            clean_text = clean_text.replace('@' + name, '@' + pseudonym_map[name])
            # the following takes care of the [quote="username"]
            clean_text = clean_text.replace('[quote="' + name, '[quote="'+ pseudonym_map[name])
            # the following takes care of the /u/username legacy HTML mention
            clean_text = clean_text.replace('/u/' + name, '/u/' + pseudonym_map[name])
            # this does NOT clean up mentions of people who are not on this particular convo! 
        clean_post['text'] = clean_text

The images are removed via regex.

matthias · February 13, 2020, 6:25pm

I would simply make sure that no @mentions stay in the post texts in their original form:

In a first pass, replace @mentions with pseudonymized usernames of people participating in the conversation.
In a second pass, anonymize the remaining @mentions by replacing them with a single pseudo-username like @anon.

Of course you could also pseudonymize those, but anonymizing is simpler and does not lose any information for building the network.
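
The two passes can be sketched as below. For brevity the sketch folds them into a single regex substitution with a callback: known participants get their pseudonym, everyone else gets the shared fallback. The mention pattern is an assumption about which characters usernames may contain, and `scrub_mentions` is a made-up name.

```python
import re

# Assumed username alphabet: word characters, dots and hyphens, ending in a
# word character so that trailing punctuation is not swallowed.
MENTION = re.compile(r"@(\w[\w.-]*\w|\w)")


def scrub_mentions(text, pseudonym_map, fallback="anon"):
    """Pseudonymize known @mentions and anonymize all remaining ones."""
    # Compare case-insensitively, on the assumption that usernames are
    # treated that way.
    lowered = {name.lower(): alias for name, alias in pseudonym_map.items()}

    def replace(match):
        return "@" + lowered.get(match.group(1).lower(), fallback)

    return MENTION.sub(replace, text)


example = scrub_mentions(
    "Thanks @Alice, and ask @hexayurt.",
    {"alice": "anon12345678"},
)
print(example)  # Thanks @anon12345678, and ask @anon.
```

Matching whole mentions with a regex also avoids a pitfall of plain string replacement, where a short username that is a prefix of a longer one would be replaced inside the longer mention.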

A related but different issue is that due to the import from Drupal, not all our @mentions follow the @username format. Some use [username], some have links to the username etc… These are the remainder of cases that could not be fixed with a script. Not sure if it affects the OpenCare data (the earliest one that got published), but if it becomes a relevant issue we can let Anu fix these manually.