I’m on an odd mission again, this time exploring the strange issue of what may happen to your privacy when your Edgeryders content is finally released to the general public as an annotated open data package. (It all started with a discussion with Alberto over here, and we’re continuing here so that you can all find it and contribute.)
It started when I found out just how ethnographic research will be done on the Edgeryders project’s data, and that this includes the intention of releasing the complete package of mission reports, comments, ethnographic annotations and social network graphs to the public as “open data” [source]. That raises some concerns, and Alberto proposed that this merits a discussion with all of us here …
I’m just doing copy & paste of my original post for now, will polish it later:
I do like the Edgeryders project and its potential for bringing change in Europe, yet I was quite … surprised, to say the least, to find out about the actual mode of research and data handling. In my view, Edgeryders raises some very novel issues of how to handle data dumps in which semantic annotations have been added manually to open content.
I can’t pinpoint it exactly so far, but I’d say that it’s a conflict between open content and open data. The two are not the same. Data is ordered, orchestrated, annotated content, stuff from which new data can be derived by automated means. There have hardly been any issues about that difference so far (to my knowledge) … but Edgeryders is on the cutting edge of that with this ethno software thingy and by providing open data dumps of CC-BY licenced material which was semantically annotated afterwards. So we’d better look at these issues now before it’s too late …
The core of the problem is that a raw, open data style dump of the ethno-annotated Edgeryders content is far from anything that usually happens with Creative Commons licenced content. (I’d even say it’s far from anything the licence creators had in mind; else there might be a CC-BY-NM (“no mining”) licence perhaps.) Such a dump enables unforeseen uses. The Edgeryders data is special data, as it comes from social interactions, so it naturally makes it easy to identify most of the people who contributed (even when there are no names anymore; people often refer to their projects etc., as is natural in social talk and encouraged on Edgeryders, so just use Google). It is even more special because of the added ethno annotations (whatever these are). Ethno annotations and personal identification might allow malign uses never imagined when writing the content. Examples? “Find me all people who posted in Spanish at least once, whose political ideas are considered anticapitalist and commons oriented, who mention any affiliation with a Spain-based organization and who consider themselves leader types.” (Cf. the Edgeryders tribal signs …). Might come in handy as a preemptive measure against Occupy style movements in Spain this year … see what they had in mind already. The Edgeryders user base is too small for that, but … you get the idea.
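To make that example query concrete, here is a minimal sketch of how little code it takes once annotations exist as structured data. Everything in it is an assumption for illustration: I don’t know what the real dump or the ethno annotations will look like, so the `AnnotatedPost` record, the tag names and the user names are all made up.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedPost:
    """Hypothetical record: one post plus the tags an ethnographer attached to it."""
    author: str
    language: str
    tags: set[str] = field(default_factory=set)


def profile_authors(posts, language, required_tags):
    """Return authors who posted at least once in `language` and whose
    combined tags (across all their posts) include every required tag."""
    languages_by_author = {}
    tags_by_author = {}
    for post in posts:
        languages_by_author.setdefault(post.author, set()).add(post.language)
        tags_by_author.setdefault(post.author, set()).update(post.tags)
    return sorted(
        author
        for author in languages_by_author
        if language in languages_by_author[author]
        and required_tags <= tags_by_author[author]
    )


# Entirely synthetic sample data, not taken from any real dump.
posts = [
    AnnotatedPost("user_a", "es", {"anticapitalist", "commons"}),
    AnnotatedPost("user_a", "en", {"spain-based-org", "leader"}),
    AnnotatedPost("user_b", "es", {"commons"}),
    AnnotatedPost("user_c", "en", {"anticapitalist", "commons", "spain-based-org", "leader"}),
]

wanted = {"anticapitalist", "commons", "spain-based-org", "leader"}
matches = profile_authors(posts, "es", wanted)
print(matches)
```

Note that `user_a` is matched by combining tags from two different posts, something a human reader skimming the forum would be unlikely to do by hand. That aggregation is exactly what the structured dump makes cheap.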
The reason why this issue does not normally arise with CC-BY licenced content is that usually, nobody has the time to do these ethno annotations and to provide the data in a nice, structured database for download. The moment that natural language understanding by computers arrives, or that forensic linguistics matures, this will all be different, and we will need to develop new open content licences and linguistic obfuscation software. But for now, Edgeryders is pretty much alone with this issue.
Please don’t get me wrong: I am not against open content. Not at all, I am an avid fan of all things open. I have several thousand pages of stuff on my page, all licenced CC-BY. And I’ve never had a problem. But that’s because (1) I can avoid most potential problems by being in control of the data, requesting attribution to a pseudonym, and taking it down or even moving the site when needed, and (2) the data is not structured and nobody has the time to structure / annotate it, so it’s pretty foreseeable what will be done with this data and what not. I don’t release structured data about me in any way on the Internet; so for example, I have only very basic and / or wrong stuff in my Facebook profile, I use ad blocking and track-me-not software on the web, I don’t indicate my interests to Facebook by clicking “Like” on any corporate page, etc. But I’ve no issue at all with releasing unstructured open content, as it is for human use only, not fit for automated processing. (Normally, that is, until this present issue with Edgeryders, where the structure / semantic annotations will be added afterwards by a third party.)
This way of managing my content is in line with my personal “privacy policy” that I settled on after quite a bit of consideration. There are three levels in it: content that I allow search engines to connect to my real-world identity, content that I allow only law enforcement to connect to it, and content that I allow nobody to connect to it. (Note that level three is completely empty and I don’t need it currently; but I know the toolchain, and people under repressive regimes need it right now.) Content that I have complete control of (like on my own website) is on level one. Edgeryders content is on level two.
However, my privacy policy seems to have a hole in it, and I realize it only now because of the Edgeryders project. There has to be a new level (between 1 and 2) for content that I allow curious members of the public or of the research community to connect to my real-world identity by means of manual work. I did not run into this before; it was a complete non-issue so far … Content in this new level would be something where I am still happy if people connect it to my real-world identity by “spying” on me (or in other use cases, doing research on me). I doubt that Edgeryders would’ve made it into this level.
This data handling issue leads to another, and I think there has to be a clearer position here. On the one hand, Edgeryders is a playful platform that promotes social interaction in a cohesive community. There is also this upcoming conference, and we’re treated as subjects with ideas and opinions to contribute there. Such a social setting naturally means that people talk about stuff and give hints about personal facts that they would not be happy to find again as structured data in a public database on which you can run all kinds of queries. On the other hand, just that seems to be happening. By means of this ethno-annotation thing, we as users are treated in a quite objectified manner, no longer as subjects, as peers. But if we are supposed to be in a lab setting for scientific observation, it would be better to introduce people to that idea right away when they sign up (just “used for research” is quite vague). There will be people who are available for this, esp. when adding anonymization tech to the platform. Or even simpler, do the ethno annotation in all the forums etc. where the Edgeryders topics have been discussed already. So in all, I’d propose to let us be either subjects or objects; this mixed state is the strangest online environment that I have been in so far. It does not feel good.
Some practical proposals to discuss:
- Clearly indicate what uses you will authorize for the open data dump of the open content entered into the Edgeryders site. Esp. please point out in the site's Legals what this ethno annotation software is that you use, what it can do, and what people can potentially do with the data created that way.
- Discuss whether you really want to take this open notebook science approach for the Edgeryders project, that is, providing a full, ethno annotated database dump for third party research efforts. This is data about still identifiable persons (even with names removed), not about the weather, water temperature or that kinda trivia.
- Give us a two-week "content freeze" period near the end of the project, to look through the data we provided and to delete or adapt everything we don't want to be retrievable as structured data. This means that all content has to be editable, including comments that have replies.
- Make a user's profile page restrictively licenced by the author. It would be the one place without a right to reuse the info on there, so we could easily share more personal info there without worrying about what will become of it. Such a change seems needed to retain the social-network type character of Edgeryders, which is actually the basis of its success.
- Provide an option to the user to allow and disallow search engines to retrieve the info on the profile page. (This esp. has to include the profile image because, mind you, face recognition is deployed already.)
- Provide a field in the user profile to specify just how CC-BY attribution should happen on reuse. The CC-BY licence gives the author a say over how they want to be attributed, and I usually request that people attribute to my pseudonymous name when sharing things CC-BY. (That way, I can still break the link to my person by removing the complete website where I published the material.)
- Show a link to these attribution requirement details on every piece of content throughout the Edgeryders website.
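
The two profile-related proposals (search engine opt-in and a per-user attribution field) could look roughly like this. It is only a sketch under my own assumptions: the `Profile` class and its `attribution_name` and `allow_indexing` fields are hypothetical and not part of the Edgeryders platform. The only real-world convention used is the `noindex, nofollow` value of the standard robots meta tag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Profile:
    """Hypothetical user profile with the two proposed settings."""
    username: str
    attribution_name: Optional[str] = None  # assumed field: how the author wants to be credited
    allow_indexing: bool = False            # assumed field: opt-in for search engines


def attribution_line(profile, title):
    """Build the credit line shown next to a piece of content.

    Falls back to the site username when no attribution name is set.
    """
    name = profile.attribution_name or profile.username
    return f'"{title}" by {name}, licensed CC-BY'


def robots_meta(profile):
    """Value for a <meta name="robots"> tag on the profile page."""
    return "index, follow" if profile.allow_indexing else "noindex, nofollow"


# Synthetic example user.
p = Profile("site_login", attribution_name="my_pseudonym")
print(attribution_line(p, "My mission report"))
print(robots_meta(p))
```

Defaulting `allow_indexing` to off is a deliberate choice in this sketch: the profile stays out of search engines unless the user says otherwise.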