Privacy on Edgeryders, iteration two
Alberto and Lyne - thanks for valuable input, you made me think again. So I've let my thoughts mature for a day or so, and now let's see if we can make any progress in our little discussion, esp. in defining what the problem is (if there is one, of course).
There's really no reason to rethink the complete approach of the Edgeryders project. Using software support to find generalizations within a collection of stories is actually a good idea. Allowing others to do the same is a good idea, too - in the sense of supporting open science, and ya know, I'm a fan of all things open ...
So the only issue is: how can the Edgeryders data be shared without any harm to the privacy of the users, even over the long run of, say, 20+ years (who knows, actually) during which such a data package might be used in research?
Alberto, you already mentioned that you are going to anonymize user names and handles. So you're aware that there's a need to protect privacy in a data dump. Now the issue is that identifiability is not just in the user name field of a post on this site, but also in: mentioning the name of a fellow Edgeryder in some comment (see how I started this paragraph?), mentioning organization names, workplaces and places of residence, linking to the websites of our own projects, and probably more.
All this happens naturally in social discourse, and for proper anonymization, all this has to be edited out. Because, with a few Google queries, anybody can connect these little names and facts back to the real-world identities of people - in many cases at least. If I am not mistaken, the actual names of projects, people, locations etc. are not at all relevant for the task of making sense of a collection of stories. Because on this aggregate level, one will look at how often similar positions and events are told, and what the stories have in common. At this level, it's no longer about individual great ideas and projects to select from (that's what happens in social discourse), but about getting the big picture from many small stories. So I guess redacting all the info that makes us re-identifiable would not hurt the analysis. It's of course some effort ... but so is reading the whole stories and doing the ethno-annotations.
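To make the redaction step a bit more concrete, here is a minimal sketch in Python. It assumes somebody has already compiled a list of identifying terms by hand; the function name `redact`, the placeholder labels and the sample terms are all hypothetical and not anything the Edgeryders team has announced. It replaces links and known names with category placeholders, so a story keeps its shape for analysis while losing the pointers back to real people.

```python
import re

# Matches http(s) links and bare www. links up to the next whitespace.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def redact(text, term_labels):
    """Replace links and known identifying terms with placeholders.

    term_labels maps an identifying term (hypothetical examples below)
    to a placeholder such as "[PERSON]", "[ORG]" or "[PLACE]".
    """
    text = URL_RE.sub("[LINK]", text)
    # Longest terms first, so "Acme Lab" is handled before a shorter "Acme".
    for term in sorted(term_labels, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        text = pattern.sub(term_labels[term], text)
    return text

# Hypothetical sample data, for illustration only.
terms = {"Alberto": "[PERSON]", "Acme Lab": "[ORG]", "Berlin": "[PLACE]"}
sample = "Alberto, see http://example.org/my-project about Acme Lab in Berlin."
print(redact(sample, terms))
```

A list-based approach like this only catches what is on the list, so it would support a human pass over the stories rather than replace it.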
Now why is redacting all this so essential, in my view? (I'm sorry Lyne but I need to get technical again ... it just is a technical matter that I'm trying to explain.) Edgeryders users agreed to license their contributions CC-BY. From potential experience with open content, we are aware that this includes licensing for human usage: serving as a basis for adaptation, translation, being incorporated in other texts, turning up in Google and so on. And I have no issue with all these modes of human processing - it's no more dangerous to me than putting all this Edgeryders content as CC-BY on my own website, which I clearly would do. To do any major harmful analysis of this content (like creating a personality profile of me), people would have to read through all this for hours and hours, understand it, annotate it, run statistics on it etc. That's too much work even in the case of investigative journalism: these folks look for the few damning pieces, which are not in this content.
But when we transform content into data, in this case by ethno-annotations, network analysis and offering it as a well-structured, downloadable archive - then we enter a different world: away from slow human processing, here comes computation. Doing harm is no longer prohibitively expensive once the semantic annotations have been done. That's also why the laws are about "data privacy", not "content privacy": the real dangers come with the novel options of automated data connection, computation, deduction etc. (And on a side note, "open data" is regularly about anonymous aggregate data or non-personal data - isn't it?)
Specifically, I'm still uneasy about the content of these ethno-annotations: Alberto's examples are clearly benign, but might annotations also include things like "case of bad success", "economic dead end road", "low income", "poor", "social difficulty"? ... I simply don't know because it's not stated so far. Even if it were, given the opportunity for other researchers to add to this content in a crowd-sourced manner, nobody knows what they will add. And then somebody adds an algorithm to create automated (and public!) personality profiles from this data (the nightmare of data protection), and somebody else follows some links and annotates the data with background on our projects and with our real names ...
That's why I now argue for radical anonymization, resulting in a set of stories that cannot be traced back to real persons. Such a procedure would also be in line with the principles behind data privacy laws (using the German example now). One is the principle of data economy: do not collect more data than necessary. Collecting the names of people, projects, locations etc. is not needed for story evaluation - so omit them. The leap to take because of the novel content-to-data conversion principle behind Edgeryders is this: omit these names in the content, because otherwise they might end up as data in some research project later on. That's not in the laws, but it should be a logical conclusion in an environment like Edgeryders where content-to-data conversion is done. Or put the other way round: the ethno-annotations are structured data, so privacy laws would require the Edgeryders project to state upfront what data will be collected about users and what uses will be made of the data.
@Alberto: so what do I propose in practical terms:
- Go on with the Edgeryders project and evaluation as intended - it's a great idea after all! But make really sure that no trace of personally identifiable content is left that could later turn up converted to structured data. That means eliminating from the data dump package all references to names, organizations, own websites etc., see above.
- This redaction can be done by the team, or we could help you folks by redacting our own content. But for that, you'd have to make comments editable, even when they have replies. I proposed using a separate "content freeze" period at the end of the project for this, as the content would be morphed into something that is pretty useless for socializing on this site any longer, while still being useful for ethnographic evaluation.
- Proper anonymization also includes removing the unredacted content from this original Edgeryders site once the data dump is placed online. Otherwise, data correlation for de-anonymization could be done automatically.
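To illustrate the last point, here is a small sketch of how such an automatic correlation could work if the unredacted posts stayed online next to the redacted dump. Everything in it is an assumption for illustration: the names `tokens`, `jaccard` and `link_back`, the sample authors and the sample texts are made up, and simple word overlap (Jaccard similarity) stands in for whatever a real attacker would use.

```python
import re

def tokens(text):
    """Lowercased set of words in a text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Share of words two texts have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def link_back(redacted_story, public_posts):
    """Return (author, score) of the public post most similar to a redacted story.

    public_posts maps an author name to the unredacted text still online.
    """
    story = tokens(redacted_story)
    scored = ((author, jaccard(story, tokens(text)))
              for author, text in public_posts.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical sample data, for illustration only.
public_posts = {
    "alice": "I left my job in Berlin to start a repair cafe with two friends",
    "bob": "Our housing cooperative finally bought the old school building",
}
redacted = "I left my job in [PLACE] to start a repair cafe with two friends"
author, score = link_back(redacted, public_posts)
print(author, round(score, 2))
```

Because redaction changes only a few words per story, the rest of the text acts as a fingerprint, which is why the original would have to go offline for the anonymization to hold.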
@Lyne: Have to add a special part for you! Many thanks for pointing out the importance of bravery in these times of change. And I'm with you here, and already have ideas and projects that I'll fight for. Yet a fight it is, and in a fight you don't leave cover for no reason, and you take some care. While CoE has an admirably positive attitude towards us, there will be others with adverse interests and ideas, now and in the future, and they will likewise have access to the Edgeryders data ...
We're just learning, as a global society, to deal with the Web and all the openness it enables. Over here in Germany, we regularly have cases that illustrate how far away we are from having learnt that: cases where some careless but essentially harmless comment is successfully used against a person in office or in a position of responsibility. Take the example of Horst Köhler, a former German Federal President. Or just these days, some careless comments by Pirate Party Germany members that included Nazi Germany comparisons, made in their usual Twitter-esque manner, but exploited by the press ... I am not ok with such petty-minded instrumentalization of words for lobbying against persons - folks, get used to what happens in the speedy interactions of the web, and to the fact that it's essential there to forgive and to really forget. But this quarrelling persists, and "they" dig it all out ...
So because I expect this to still persist in 10-20 years, when my personal reputation might be key to some influential larger open society project or whatever, I take a bit of care now not to leave a trace of ammo for my future hunters. I will fight then, of course, but "the fight is won before it starts by the logistics people" as we used to say ...
There must be spaces, like Edgeryders, to give these people the freedom to be who they are. Because otherwise, the future of society is compromised.
This means taking risks in life, such as participating in an experimental project as Edgeryders --- by leaving a trail of data --- and collectively push together the boundaries a little bit further.
Always open for experimentation over here, and for the risk that necessarily comes with it. Yet also add some reasonable care (not fear!) to the mix, and it gets even better. That's why I joined Edgeryders and contribute to an emerging project, but also point out emerging problems. And those problems can be handled, see above. It's essentially just the application of existing data privacy laws to the uncharted area of data extraction from content ...