I have decided to write a fresh script (or set of scripts) for converting Edgeryders data into something re-usable by other researchers interested in SSNA. As you know, publishing open research data is a requirement under Horizon 2020, and even if it were not, I would want to do it anyway. I want to do it myself because most academics are not invested in publishing open data: they tend to play fast and loose, leave datasets and accompanying code poorly documented, and so on.
There are two sets of problems the tools need to address.
1. How close to graph form?
SSNs are graphs; the whole interpretive framework is based on this. But the Discourse DB thinks in tables, not in graphs. Hence, doing SSNA requires external software to build and navigate the graphs.
Given this, when we store data we can do one of two things.
- We could just export an “as is” dump of the relevant part of the DB in JSON form, and leave it to the future user to re-assemble it in graph form.
- We could “pre-cook” our data dump into a graph-ready form. For example, one JSON file would encode nodes (users), and another would encode edges (interactions). This makes it super easy to build a graph with any software (see the sketch after this list). I would probably stick to JSON as the format: storing data in a graph-specific format (like GraphML or Tulip) would not make the re-user's work much easier, and such formats might even put off some people, since they are far less universal than JSON.
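To make the second option concrete, here is a minimal sketch of what a re-user could do with a pre-cooked dump. The file names (`nodes.json`, `edges.json`) and record shapes are my assumptions, not a spec, and I use `networkx` purely as an example; any graph library would do:

```python
import json

import networkx as nx

# Assumed file names and record shapes, for illustration only:
# nodes.json: [{"id": "88b4...", "posts": 12}, ...]
# edges.json: [{"source": "88b4...", "target": "a1f0...", "type": "reply"}, ...]
with open("nodes.json") as f:
    users = json.load(f)
with open("edges.json") as f:
    interactions = json.load(f)

# A directed multigraph: interactions have a direction, and the same
# pair of users can interact many times.
G = nx.MultiDiGraph()
for user in users:
    G.add_node(user["id"], posts=user.get("posts"))
for edge in interactions:
    G.add_edge(edge["source"], edge["target"], type=edge.get("type"))

print(G.number_of_nodes(), "users,", G.number_of_edges(), "interactions")
```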
This choice implies a choice of investment in the possible audiences of re-users. Code-literate folks have no trouble going from tabular data to graphs: Guy, Bruno, or Ben could code up a conversion script in less than an hour. But then there are people like me, from a social sciences background but with some coding skills; and humanities researchers who use software, but do not code. @amelia, do you have any recommendations here? Are there people out there who would consider re-using our data, but might be put off by the idea of having to re-assemble them into graph form?
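For comparison, re-assembling a graph from an “as is” dump is not much harder for a coder, though it does presume knowing the schema. A sketch, under the assumption that each post record carries an author ID and a `reply_to` field pointing at the parent post (the actual Discourse field names may well differ):

```python
import json

import networkx as nx

# Assumed dump shape: [{"id": 1, "author": "88b4...", "reply_to": null}, ...]
with open("dump.json") as f:
    posts = {p["id"]: p for p in json.load(f)}

G = nx.MultiDiGraph()
for post in posts.values():
    parent = posts.get(post.get("reply_to"))
    if parent is not None:
        # Edge from the replying author to the author being replied to.
        G.add_edge(post["author"], parent["author"])
```

The point is simply that this is a short job for a coder, and a real barrier for everyone else.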
2. Anonymization

This is something I have never done.
I understand the minimum requirements we must meet are the following. Matthias, can you confirm?
- the data do not contain any usernames;
- the user IDs are scrambled (real user IDs are accessible to anyone via an API call, so they identify people just as usernames do).
I would do this in the following way:
1. Loop over usernames.
2. For each username, use the `random` library in Python to generate a random string:
```
>>> import random
>>> myhash = random.getrandbits(128)
>>> print(myhash)
181710139364666073670306802841889258260
>>> print("%32x" % myhash)
88b419906ec88286af914e37343bbb14
```
3. Replace the username with the random string everywhere in the dataset.
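Putting the three steps together, here is a minimal sketch, assuming the dump is a JSON list of posts with a `username` field (the real export shape may differ). The key detail is to reuse the same pseudonym for every occurrence of a username; otherwise the network structure is destroyed:

```python
import json
import random

with open("posts.json") as f:
    posts = json.load(f)  # assumed shape: [{"username": "alberto", ...}, ...]

# One stable pseudonym per username. "%032x" zero-pads to 32 hex digits
# (plain "%32x" would pad with spaces if the leading bits happen to be zero).
pseudonyms = {}
def pseudonym(username):
    if username not in pseudonyms:
        pseudonyms[username] = "%032x" % random.getrandbits(128)
    return pseudonyms[username]

for post in posts:
    post["username"] = pseudonym(post["username"])

with open("posts_anonymized.json", "w") as f:
    json.dump(posts, f, indent=2)
```

(If we wanted cryptographic-strength randomness rather than the `random` module's pseudo-randomness, Python's `secrets.token_hex(16)` would be a drop-in alternative for generating the 32-character strings.)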
Is this acceptable?
Based on this information I can build a workflow. Thanks in advance, everyone.