What can happen to your Edgeryders data ... and you

I’m on an odd mission again, this time exploring the strange issue of what may happen to your privacy when your Edgeryders content is finally released to the general public as an annotated open data package. (It all started with a discussion with Alberto over here, and we’re continuing here to let you all find it and contribute.)

It started with finding out just how ethnographic research will be done on the Edgeryders project’s data, and that this also includes the intention of releasing the complete package of mission reports, comments, ethnographic annotations and social network graphs to the public as “open data” [source]. That raises some concerns, and Alberto proposed that this merits a discussion with all of us here …

I’m just doing copy & paste of my original post for now, will polish it later:

I do like the Edgeryders project and its potential for bringing change in Europe, yet I was quite … surprised, to say the least, to find out about the actual mode of research and data handling. In my view, Edgeryders raises some very novel issues of how to handle data dumps that include semantic data that’s manually annotated to open content.

I can’t pinpoint it exactly so far, but I’d say that it’s a conflict between open content and open data. The two are not the same. Data is ordered, orchestrated, annotated content, stuff from which new data can be derived by automated means. There have hardly been any issues about that difference so far (to my knowledge) … but Edgeryders is on the cutting edge of that with this ethno software thingy and by providing open data dumps of CC-BY licenced material which was semantically annotated afterwards. So we had better look at these issues now before it’s too late …

The core of the problem is that a raw, open data style dump of the ethno-annotated Edgeryders content is far from anything that usually happens with Creative Commons licenced content. (I’d even say it’s far from anything the licence creators had in mind; otherwise there might be a CC-BY-NM (“no mining”) licence.) Such a dump enables unforeseen uses. The Edgeryders data is special because it comes from social interactions, so it naturally makes it easy to identify most of the people who contributed (even when the names are gone: people often refer to their projects etc., as is natural in social talk and encouraged on Edgeryders, so just use Google). It is even more special because of the added ethno annotations (whatever these are). Ethno annotations combined with personal identification might allow malign uses never imagined when the content was written. Examples? “Find me all people who posted in Spanish at least once, whose political ideas are considered anticapitalist and commons oriented, who mention any affiliation with a Spain based organization and who consider themselves leader types.” (Cf. the Edgeryders tribal signs …). Might come in handy as a preemptive measure against Occupy style movements in Spain this year … see what they had in mind already. The Edgeryders user base is too small for that, but … you get the idea.
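To make the example query concrete, here is a minimal Python sketch of how cheap such a lookup becomes once posts carry structured annotations. Everything in it is an assumption for illustration: the `AnnotatedPost` record, its field names and the code labels are hypothetical, and nothing here reflects the actual format of the Edgeryders data or of the ethnography software.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class AnnotatedPost:
    """Hypothetical record: one post plus the codes an ethnographer attached."""
    author: str                         # pseudonymized handle
    language: str                       # e.g. "es", "en"
    codes: Set[str] = field(default_factory=set)
    org_country: Optional[str] = None   # country of a mentioned organization


# Hypothetical code labels matching the query in the text.
WANTED_CODES = {"anticapitalist", "commons-oriented", "leader-type"}


def find_matching_authors(posts: List[AnnotatedPost]) -> List[str]:
    """Return authors whose posts, taken together, satisfy every criterion."""
    by_author = {}
    for post in posts:
        by_author.setdefault(post.author, []).append(post)

    hits = []
    for author, items in by_author.items():
        posted_in_spanish = any(p.language == "es" for p in items)
        mentions_spanish_org = any(p.org_country == "ES" for p in items)
        all_codes = set().union(*(p.codes for p in items))
        if posted_in_spanish and mentions_spanish_org and WANTED_CODES <= all_codes:
            hits.append(author)
    return sorted(hits)


if __name__ == "__main__":
    sample = [
        AnnotatedPost("user_a", "es", {"anticapitalist", "commons-oriented"}, "ES"),
        AnnotatedPost("user_a", "en", {"leader-type"}),
        AnnotatedPost("user_b", "es", {"anticapitalist"}, "ES"),
    ]
    print(find_matching_authors(sample))
```

The point of the sketch is only that the expensive step is the annotation, not the query: once codes exist as data, a profile search like this is a few lines of code.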

The reason why this issue does not normally arise with CC-BY licenced content is that usually, nobody has the time to do these ethno annotations and to provide the data in a nice, structured database for download. The moment that natural language understanding by computers arrives, or that forensic linguistics matures, this will all be different, and we will need to develop new open content licences and linguistic obfuscation software. But for now, Edgeryders is pretty much alone with this issue.

Please don’t get me wrong: I am not against open content. Not at all, I am an avid fan of all things open. I have several thousand pages of stuff on my page, all licenced CC-BY. And I’ve never had a problem. But that’s because (1) I can avoid most potential problems by being in control of the data, requesting attribution to a pseudonym, and taking it down or even moving the site when needed, and (2) the data is not structured and nobody has the time to structure / annotate it, so it’s pretty foreseeable what will be done with this data and what not. I don’t release structured data about me in any way on the Internet; so for example, I have only very, very basic and / or wrong stuff in my Facebook profile, I use ad blocking and track-me-not software on the web, I don’t indicate my interests to Facebook by clicking any “Like” for a corporate page etc… But I’ve no issue at all with releasing unstructured open content, as it is for human use only, not fit for automated processing. (Normally, that is, until this present issue with Edgeryders, where the structure / semantic annotations will be added afterwards by a third party.)

This type of managing my content is in line with my personal “privacy policy” that I settled on after quite a bit of consideration. There are three levels in it: content that I allow search engines to connect to my real-world identity, content that I allow law enforcement to do so, and content that I allow nobody to do so. (Note that this level three is completely empty and I don’t need it currently; but I know the toolchain, and people under repressive regimes need it right now.) Content that I have complete control of (like on an own website) is on level one. Edgeryders content is at level two.

However my privacy policy seems to have a hole in it, and I realize it only now because of the Edgeryders project. There has to be a new level (between 1 and 2) for content that I allow curious members of the public or of the research community to connect to my real-world identity by means of manual work. Did not run into this before - it was a complete no-issue so far … . Content in this new level would be something where I am still happy if people connect it to my real-world identity by “spying” on me (or in other use cases, doing research on me). I doubt that Edgeryders would’ve made it into this level.

This data handling issue leads to another, and I think there has to be a clearer position here. On the one hand, Edgeryders is a playful platform that promotes social interaction in a cohesive community. There is also this upcoming conference, and we’re treated as subjects with ideas and opinions to contribute there. Such a social setting naturally means that people talk about stuff and give hints to personal facts that they would not be happy to find again as structured data in a public database on which anyone can run all kinds of queries. On the other hand, just that seems to happen. By means of this ethno-annotation thing, we as users are treated in quite an objectified manner, no longer as subjects, as peers. But if we are supposed to be in a lab setting for scientific observation, it would be better to introduce people to that idea right away when they sign up (just “used for research” is quite vague). There will be people who are available for this, esp. when adding anonymization tech to the platform. Or even simpler, do the ethno annotation in all the forums etc. where the Edgeryders topics have been discussed already. So in all I’d propose to let us be either subjects or objects, but this mixed state is the strangest online environment that I have been in so far. Does not feel good.

Some practical proposals to discuss:

  • Clearly indicate what uses you will authorize for the open data dump of the open content entered into the Edgeryders site. Esp. please point out in the site's Legals what this ethno annotation software is that you use and what it can do and what people can potentially do with the data created that way.
  • Discuss if you really want to do this open notebook science approach for the Edgeryders project, that is, providing a full, ethno annotated database dump for third party research efforts. This is data about still identifiable persons (even with names removed), not about the weather, water temperature or that kinda trivia.
  • Give us a two week "content freeze" period near the end of the project, to look through the data we provided and to delete or adapt everything we don't want to be retrievable as structured data. This means that all content has to be editable, including comments that have replies.
  • Make a user's profile page restrictively licenced by the author. It would be the one place without a right to reuse the info on there, so we could easily share more personal info there without worrying what will become of it. Such a change seems needed to retain the social-network type character of Edgeryders, actually the basis of its success.
  • Provide an option to the user to allow and disallow search engines to retrieve the info on the profile page. (This esp. has to include the profile image because, mind you, face recognition is deployed already.)
  • Provide a field in the user profile to identify just how CC-BY attribution should happen on reuse. The CC-BY licence mandates that the author has a say over how he wants to be attributed, and I usually request that people attribute to my pseudonymous name when sharing things CC-BY. (That way, I can still break the link to my person by removing the complete website where I published the material.)
  • Show a link to these attribution requirement details on every piece of content throughout the Edgeryders website.
Your input on this, fellows? Thanks!

A think tank (and yes, it’s open)

Matthias, thanks again for great insight. This entry forced me to rethink our approach. However, at the end of the rethinking, I still stand by it. I think the risks to privacy in participating in the Edgeryders project are minimal. Let me explain why.

  • Focus on the collective dimension. Everything in Edgeryders is about the extent to which my experiences can be somehow generalized, how they resonate with others. So, it is not about trying to find out idiosyncratic behavior that makes individuals stand out, but rather about rethinking what might look idiosyncratic as, rather, shared and therefore socially justified. For example, many people here have spoken out against ACTA. Others have highlighted the importance of a free Internet with freely shareable content as a place for learning, a professional resource for finding work, or for simply making everyone's life better through the creation and enhancement of digital commons. What Edgeryders does is tie these positions together into a policy recommendation that more or less says "emphasizing IPRs is likely to hinder the transition of youth to an independent active life. We should handle the issue with care, and not assume hacktivists are troublemakers and should go look for jobs. They need the Internet to be free to even carve a professional niche for themselves." So, while the trends we pick up are certainly interesting for government, individual behavior is not.
  • Focus on world building. Edgeryders has no missions about consumer behavior. It is not interested in what movies we watch or what clothes we wear. It looks at world building exercises (how to combine crafts and e-commerce; designing bottom-up currencies; collaborative living; peer-to-peer learning). Not much interesting data for business here - at least, not for sales. Some might be interested, again, in the long-term market implications of some of the stuff discussed here, but not much in the way of finding individual consumers to sell stuff to.

These two features are reflected in the ethnographic coding, which I think you might have misunderstood. In ethnographerese, coding means tagging. When you write that you participated in Erasmus, the ethnographer tags the sentence with a code like “spatial mobility”. If you heard about a job opening from a friend, she would tag it with “opportunities from peers” or something similar. Here’s an example from a study about dementia carers:

So, by design there is not a lot of attention on anything individually sensitive. That goes both for the raw and for the coded data.

For any remaining potential problems, we were careful to frame Edgeryders interaction away from danger. We did this through three tools.

  • Validation. The results of the ethnographic study are published on the website for community validation. If you disagree with something, you can always speak up. The first bit of results is here.
  • Think tank metaphor. We define Edgeryders as a "distributed think tank on youth policy". This means clearly that it is a space for work, somewhere you go wearing a suit, as I like to say. Or maybe you prefer informal attire, but it is unlikely you'd show up in a bathrobe, no matter how informal your working environment. I think this is pretty clear from Edgeryders literature, for example our [presentation video](https://www.youtube.com/embed/DCocK4bKIFE).
  • Legals page. Finally, we were quite explicit in the Legals page: “Please be aware that Edgeryders is a public space where you are responsible for what you post: if you are unsure whether it is legal or appropriate to share something, please don’t.”

It goes on to mention research purposes, and specifies an open license for user generated content. Finally, it mentions email as the only real bit of personal data we store (people can and do use handles, like you do yourself); that’s covered by Council of Europe data protection policy.

And yes, it’s open, as open as I can get away with. Because it is research, paid for by the European taxpayer.

How can we improve? People can ask to be removed as users already; they can edit or change their content (there are now an “edit this mission” and a “delete this mission” link next to your mission reports), so I see no reason for a content freeze period. Maybe we could explain better what the methodology of the project is. Actually, the methodology is itself emergent, and was not as clear to me when we wrote the Legals page.

Any other suggestions?

Let’s work for them!

I think it is not a matter of fear. It’s a matter of knowledge. The richest people in the world own “bits”, “data”. It’s fun! But why do they have to own the whole information? I don’t know what will be, but I’m “autistic” (http://www.autistici.org/it/index.html), so I always think about the worst future. And what if we create a new way of sharing data and this new way is used by lobbies against the community? Is the community stronger than the corporations?

I vote yes for the anonymous section only because they do not have a digital “agorà” where people can say what they really want. Is this project affordable? Do we really need an independent lab inside the biggest lobby where everyone can analyse important data alone?

http://hackmeeting.org/

Some useless note (originally in Italian):

If we open this up to everyone, what comes out of it? Something that the powerful of the earth can use against the community? Or will the community, together, manage to fight corporations and lobbies with this new tool? It cannot get worse than this: the world is full of secrets from which we ordinary people are cut off. What are they hiding? What is their power? What do they really know? Because what they let us know, I know well enough… An independent research centre inside the head of the lobbies, where everyone can analyse important data.

The antidote to insecurity

As I'm not so technical, I give you my non-technical version of how I see this situation.

There will always be fears and insecurity that comments on an open platform could eventually harm. There are all sorts of insecurities about participation in social media.

I have heard many people tell me they don’t want to say a word, because they think they will lose their job. Those people will never participate in experimental projects like Edgeryders.

I have heard people tell me they fear that their Edgeryders content could eventually decrease their chances of obtaining a job in the near future. I tried to reassure them by letting them know they should firmly believe that they are appreciated and accepted for who they are.

We should not be afraid of what we do (our thoughts, our actions, our writing). We should instead take that fear away from our heads, and replace it with confidence: ‘There is nothing to fear, you are safe.’ This society has taught us to respect fear as an essential contributor to our survival. But fear has become an existential anxiety. As a result, we fall prey to anxiety about who we are and where we belong.

The Edgeryders project allows us to detach from it, because it was part of the CoE’s initial vision to see young people with new eyes (offering the possibility of approaching youth from a different perspective).

There comes a time in life when fear takes up less room, and beliefs and vision occupy so much space in our minds that these considerations are seen as less important. Talking loud and clear about what fills one’s heart then becomes a priority. Edgeryders focuses on the actions of participants and highlights them.

Many people who came forward on Edgeryders are already very active in their environment, and it is not too difficult for anyone to identify them. I suppose that they could already be considered activists, agitators, visionaries, or many other names, with varying degrees of negative connotations.

They already have enough problems, trying to gain acceptance for the solutions they provide and projects they try to run.

There must be spaces, like Edgeryders, to give these people the freedom to be who they are. Because otherwise, the future of society is compromised.

Yesterday, I was in a discussion with policy makers from my country, and we talked about youth. It’s amazing how uncomfortable young people make them feel. They use social media, they dispute decisions, they strike, they lie in the streets to protest, they are very organized. I listened to this list with googly eyes. They apparently do all sorts of things to make the policy makers’ hair stand on end!

What each person is trying to achieve is right and proper. Many now realize they won’t settle for a listless, uneventful existence. This means taking risks in life, such as participating in an experimental project like Edgeryders by leaving a trail of data, and collectively pushing the boundaries a little bit further.

You see, I’m more worried about what will happen to me, and the rest of the Edgeryders gang, than about what will happen to the data. As a matter of fact, I want things to happen to this data!

I would like this data to be studied and re-studied, analyzed in great depth. I would like this data to be separated into subsets, and be scrutinized by researchers, analysts, and other professionals of various backgrounds. I would like sub-Edgeryders projects to be created, building on this data, to continue its mission.

The fact that the brave participants involved in Edgeryders will be recognized as partners (their data not intended to harm, but to build the future), recognized as citizens-experts, I see this as an antidote to the insecurity of the youth.

Privacy on Edgeryders, iteration two

Alberto and Lyne - thanks for valuable input, you made me think again. So I’ve let my thoughts mature for a day or so, and now let’s see if we can make any progress in our little discussion, esp. in defining what the problem is (if there is one, of course).

There’s really no reason to rethink the complete approach of the Edgeryders project. Using software support to find generalizations within a collection of stories is actually a good idea. Allowing others to do that is a good idea, too - in the sense of supporting open science, and ya know, I’m a fan of all things open …

So the only issue is: how can the Edgeryders data be shared without any harm to the privacy of the users, even over the long run of, say, 20+ years (who knows, actually) that such a data package might be used in research?

Alberto, you already mentioned that you are going to anonymize user names and handles. So you’re aware that there’s a need to protect privacy in a data dump. Now the issue is, identifiability is not just in the user name field of a post on this site, but also in: mentioning the name of a fellow Edgeryder in some comment (see how I started this paragraph?), mentioning organization names and workplaces and living locations, mentioning links to websites of own projects, and probably more.

All this happens naturally in social discourse, and for proper anonymization, all this has to be edited out. Because, with some Google queries, everybody can connect these little names and facts back to the real-world identities of people - in many cases at least. If I am not mistaken, the actual names of projects, people, locations etc. are not at all relevant for the task of making sense of a collection of stories. Because on this aggregate level, one will look at how often similar positions and events are told, and what stories have in common. At this level, it’s no longer about individual great ideas and projects to select from (that’s what happens in social discourse), but about getting the big picture from many small stories. So I guess redacting out all the info that makes us re-identifiable would not hurt the analysis. It’s of course some effort … but so is reading the whole stories and doing the ethno-annotations.
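As a rough illustration of what such redaction could look like, here is a minimal Python sketch that replaces URLs and a supplied list of names with placeholders. The function name, the placeholder strings and the idea of a hand-maintained term list are all assumptions for the sake of the example, not a description of any tool the project uses.

```python
import re

# Matches http:// and https:// links up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def redact(text, known_terms):
    """Replace URLs and every listed name with a neutral placeholder.

    known_terms is a hypothetical, manually maintained list of person,
    organization, project and place names that should not survive in the dump.
    """
    text = URL_RE.sub("[URL]", text)
    # Longest terms first, so "Acme Lab Berlin" is handled before "Acme Lab".
    for term in sorted(known_terms, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        text = pattern.sub("[REDACTED]", text)
    return text


if __name__ == "__main__":
    comment = "Alberto, see http://example.org/project for the Acme Lab work."
    print(redact(comment, ["Alberto", "Acme Lab"]))
```

A term list like this only catches what somebody thought to put on it. Indirect hints (a job title plus a city, a distinctive project description) pass straight through, which is why some manual review would still be needed.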

Now why is redacting out all this so essential, in my view? (I’m sorry Lyne :wink: but I need to get technical again … it just is a technical matter that I’m trying to explain.) Edgeryder users agreed to licence their contributions CC-BY. From potential experience with open content, we are aware that this includes licencing for human usage, for serving as a basis for adaptation, translation, being incorporated in other texts, turning up in Google and so on. And I have no issue with all these modes of human processing - it’s not more dangerous to me than putting all this Edgeryders content as CC-BY on my own website, which I clearly would do. To do any major harmful analysis of this content (like creating a personality profile of me), people would have to read through all this for hours and hours, understand it, annotate it, run statistics on it etc… That’s too much work even in the case of investigative journalism: these folks look for the few condemning pieces, which are not in this content.

But when we transform content to data, in this case by ethno-annotations, network analysis and offering it as a well-structured, downloadable archive - then we enter a different world: away from slow human processing, here comes computation. Doing harm is not prohibitively expensive any more once the semantic annotations have been done. That’s also why the laws are about “data privacy”, not “content privacy”: the real dangers come with the novel options of automated data connection, computation, deduction etc… (And on a side note, “open data” regularly is about anonymous aggregate data, or non-personal data - isn’t it?)

Specifically, I’m still uneasy about the content of these ethno annotations: Alberto’s examples are clearly benign, but might annotations also include things like “case of bad success”, “economic dead end road”, “low income”, “poor”, “social difficulty”? … I simply don’t know because it’s not stated so far. Even if it is, given the opportunity for other researchers to add to this content in a crowd-sourcing manner, nobody knows what they will add. And then somebody adds an algorithm to create automated (and public!) personality profiles from this data (the nightmare of data protection), and somebody else follows some links and annotates the data with background on our projects and with our real names …

That’s why I argue now for radical anonymization, resulting in a set of stories that cannot be traced back to real persons. Such a procedure would also be in line with the principles behind data privacy laws (using the German example now). One is the principle of data economy: do not collect more data than necessary. Collecting data about people, project, location etc. names is not needed for story evaluation - so omit it. The leap to take because of the novel content-to-data conversion principle behind Edgeryders is this: omit these names in the content, because otherwise they might end up as data in some research project later on. That’s not in the laws, but should be a logical conclusion in an environment like Edgeryders where content-to-data conversion is done. Or put the other way round: the ethno-annotations are structured data, so privacy laws would require the Edgeryders project to state upfront what data will be collected about users and what uses will be made of the data.

Privacy laws (at least German ones, as an example) also state that it’s not ok to create novel uses from the data collected, or to correlate them in uses that were not stated upfront. For example, when using Google Analytics on a webshop, I’m not allowed to correlate this data by IP address or time stamping with customer identities, because I stated in my webshop’s privacy policy that “web statistics are collected anonymously”. Applying this to Edgeryders, you would have to disallow that ethno-annotation data on anonymized stories is correlated with identities again. If Edgeryders did all this research in-house, I’d clearly trust you to abide by this policy (you are trustworthy people, actually!) … but now that it’s intended to hand out the whole package for any kind of application by third parties, I don’t have that trust level. (That would amount to dumping pseudonymized customer data plus the full Apache web server access log into the open, enabling people to technically make correlations that are forbidden as per the webshop’s privacy policy.)
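The webshop analogy can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the data shapes (pseudonym plus timestamp, IP address plus timestamp), the function name and the five-second window are all hypothetical, and real access logs would need parsing first.

```python
from datetime import datetime, timedelta


def correlate_by_time(orders, log_entries, window_seconds=5):
    """Pair each pseudonymous order with the IPs seen close to its timestamp.

    orders:      list of (pseudonym, datetime) tuples  (hypothetical shape)
    log_entries: list of (ip_address, datetime) tuples (hypothetical shape)
    """
    window = timedelta(seconds=window_seconds)
    matches = {}
    for pseudonym, order_time in orders:
        matches[pseudonym] = [
            ip for ip, hit_time in log_entries
            if abs(hit_time - order_time) <= window
        ]
    return matches


if __name__ == "__main__":
    t0 = datetime(2012, 5, 1, 12, 0, 0)
    orders = [("customer_17", t0)]
    log_entries = [
        ("203.0.113.5", t0 + timedelta(seconds=2)),
        ("198.51.100.9", t0 + timedelta(minutes=10)),
    ]
    print(correlate_by_time(orders, log_entries))
```

Neither dataset names anyone on its own; the link only appears once both are available to the same person, which is exactly the situation a public dump creates.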

@Alberto: so what do I propose in practical terms:

  1. Go on with the Edgeryders project and evaluation as intended - it's a great idea after all! But make really sure that no trace of personally identifiable content is left that could later turn up converted to structured data. That means eliminating from the data dump package all references to names, organizations, own websites etc., see above.
  2. This redaction can be done by the team, or we could help you folks by redacting our own content. But for that, you'd have to make comments editable, also when they have replies. I proposed to use a separate "content freeze" period at the end of the project for this, as the content would be morphed into something that is pretty useless for socializing on this site any longer, while still being useful for ethnographic evaluation.
  3. Proper anonymization also includes removing the unredacted content from this original Edgeryders site once the data dump is placed online. Because otherwise, data correlation for de-anonymization could be done automatically.
  4. Create a data privacy policy on the Legals page about the data that will be collected from us via the ethno-annotations, network analysis and other means. CC-BY content licencing is one thing, and we already agreed to it, but we will also have to agree to the collection of structured data.
@Lyne: Have to add a special part for you! Many thanks for pointing out the importance of bravery in these times of change. And I'm with you here, and already have ideas and projects that I'll fight for. Yet a fight it is, and in a fight you don't leave cover for no reason, and you take some care. While CoE has an admirably positive attitude towards us, there will be others with adverse interests and ideas, now and in the future, and they will likewise have access to the Edgeryders data ...

We’re just learning, as a global society, to deal with the Web and all the openness it enables. Over here in Germany, we regularly have cases that illustrate how far away we are from having learnt that: cases where some careless but essentially harmless comment is successfully used against a person in office or responsibility. Take the example of Horst Köhler, a former German Federal President. Or just these days, some careless comments of Pirate Party Germany members that included Nazi Germany comparisons, used in their usual Twitter-esque manner, but exploited by the press … . I am not ok with such petty-minded instrumentalization of words for lobbying against persons - folks, get used to what happens in the speedy interactions of the web and that it’s essential there to forgive and to really forget. But this quarrelling persists, and “they” dig it all out …

So because I expect this to still persist in 10-20 years, when my personal reputation might be key to some influential larger open society project or whatever, I take a bit of care now to not leave a trace of ammo for my future hunters. I will fight then, of course, but “the fight is won before it starts by the logistics people” as we used to say …

There must be spaces, like Edgeryders, to give these people the freedom to be who they are. Because otherwise, the future of society is compromised.

Totally right.

This means taking risks in life, such as participating in an experimental project like Edgeryders by leaving a trail of data, and collectively pushing the boundaries a little bit further.

Always open for experimentation over here, and the risk that is necessarily in it. Yet also add some reasonable care (not fear!) to the mix, and it gets even better. That’s why I joined Edgeryders, and contribute to an emerging project, but also point out emerging problems. And those problems can be handled, see above. It’s essentially just the application of existing data privacy laws to the uncharted area of data extraction from content …

Modesty

Don’t read me wrong: I’m having great fun with this discussion, and just playing the game of the technical vs the non-technical. OK?

We must have something very different in the way our neurons connect. I have a hard time setting foot in your way of thinking, since I normally do not worry about all the possibilities which could happen in the future. Deepak Chopra said that ‘our present insecurity comes from trying to secure the future’. It was with some awe that I read your lines.

I used to think that way too, but not anymore. Just 3 or 4 years ago, it would have been much easier for me to relate to what you explained… very well.

My non-technical reading of the situation is this: basically, your “I” is feeling insecure. And would like all “I” Edgeryders to feel insecure as well. Extend the insecurity to the point of erasing all the names and organisations, etc.

I try to imagine how much time and resources the erasing phase would take. Is this really necessary? Where does it end? What does the processed data look like? I am worried that a lot of it would become so altered that it would lose its substance.

Of course, I’m not expert in such things! But there is a solution to general insecurity. I suppose that it could be tested for application to this context.

A solution to lower the insecurity of the ego is modesty.

Usually, the “I” does not like being taken down to a lower position. The “I” greatly protests.

It is possible to lower the ego to a humble position by appreciating the richness of oneself. It ultimately comes to realizing that insecurity and impermanence are inseparable and inescapable.

Each person agrees to take the risk. The manager should not have to carry the risk of everyone for the future, in all possibilities and contingencies. Otherwise, one would tend to freeze and paralyze. Or lose identity. Since the possibility of leading to substantial change and improving public policy, reducing unemployment, eradicating the European financial crisis has not been carried by the manager, could this be extended to the risk of the data?

If it was possible, I’d invite you over into my mind for a while, so that you could see how it feels on the other side of the fence, in the happiness stream.

Resilience

Actually I have a hard time thinking what to answer here … this probably cannot be settled in writing, but we might perhaps find some time to discuss it at the conference … .

Just so much: I’m not trying to “secure” the future, but to design it. And in any design, there are some guidelines that proved to work well in earlier projects and in history. The following one affects my view on data: to not give central instances and organizations any large amount of control over you, because there’s always the chance that the wrong people get into power, freak out and totally misuse their powers.

Perhaps I’m more aware of this “design rule” than others (and perhaps a bit too much), as I’m from a country that was Nazi Germany just some decades ago, and there was still a vivid discussion about that time while I was in school … . Anyway, it seems fair to say that central authorities should not get any more power (by data like here, or by tech) than absolutely necessary for their legitimate tasks. Just look at INDECT as a counterexample: a European “homeland security” project that, if fully developed and should the wrong people rise into power again, would make a very effective population suppression system. Such tech, or anything similar with the abuse potential of a nuclear bomb, must not even be developed, ever!!

On a meta level, maybe it boils down to this: You hope and expect that people will be overall enlightened, good, kind, well-meaning in the future. I hope and maybe expect that too, but prepare for the opposite as a means of resilience. Being a resilient society means: protected against changes for the worse, resilient when undergoing bad times, always working towards changes for the better …

P.S.: How does it feel to live in a country that has like 1/70th the population density of mine? Does one feel there’s a freedom so vast that it just cannot go away? :wink: Maybe that question sounds silly, but maybe you have experiences that relate to this, or a comparison with a longer time of living in a densely populated area …

Let’s do it

Ok, neodynos (one less comment to anonymize, and I am not joking), excellent piece of community awareness here. Here is a proposal.

  1. Let's start with what I am not prepared to do. I am not prepared to edit manually hundreds of mission reports and thousands of comments. This would kill the whole idea of doing cheap, effective open science by bringing in the smart crowds. So, that's no go.
  2. I am also not prepared to enforce full anonymity on anyone. Some people here want to be recognized; in fact, I expect this will be the case for most of them. An acceptable comparison is this: you are asked to report to the city council of your town, or your parliament, on some issue you are thought to be competent on. You might decline or accept, but I have a hard time seeing anyone accepting on condition of anonymity. Either you want to collaborate with institutions (and then you are going to wear your best suit and show up with pride), or you don't (and then you don't show up at all). This is my point about the think tank metaphor and the very explicit warning on the legals page.
  3. The anonymity I had already decided upon works as follows: we take the content and copy it to an online archive, overwriting all usernames with gibberish. We do this via some script that goes through Drupal views, rather than by simply opening access to the server, log files and all. This also takes care of the IP issue. The server can in principle be breached, but it has a normal (commercial) level of security, and we both know nobody cares about hacking us. And anyway the server is rented space, and our rental agreement ends at some point (I believe in January 2013, not in 20 years)! Anyway, given the above, this anonymization is really an extra precaution, just to err on the side of safety.
  4. We can also have an additional layer, which is just what you say: a period in which people can themselves look back on their material (easy to trace via dashboard). You and I can put together a page with the rationale and the instructions, then each person can decide. Let's say we will stop collecting stories at the end of June; generate some stuff for the in-house research straight away; but then do the dump only, say, in early October. Meanwhile, we could set permissions so that people are not permitted to create new content (for research uniformity, because any content created after that date would not be included in the in-house research) but can still edit and delete their own content. How does that sound?
I think this is a valuable approach, and would like to carry it to my next projects. Not because I think many people will take advantage of these extra safeguards, but because the fact that they exist and could be used creates trust. @neodynos, can you volunteer as privacy adviser? Edgeryders who have doubts or questions about privacy and anonymization could ask you, make up their minds and then act.
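
To make the "overwrite all usernames with gibberish" step in point 3 concrete, here is a minimal sketch in Python. It is not the actual Edgeryders export script (which would go through Drupal views); the record layout (`author`, `body`) and the function names `build_pseudonyms` and `anonymize` are assumptions made up for illustration.

```python
import re
import secrets

def build_pseudonyms(usernames):
    """Map each username to a unique random token that reveals nothing about it."""
    pseudonyms, used = {}, set()
    for name in sorted(set(usernames)):
        token = "user_" + secrets.token_hex(4)
        while token in used:  # extremely unlikely, but keep tokens unique
            token = "user_" + secrets.token_hex(4)
        used.add(token)
        pseudonyms[name] = token
    return pseudonyms

def anonymize(records, pseudonyms):
    """Return copies of the records with authors and in-text mentions overwritten."""
    # Longest names first, so a short name never replaces part of a longer one.
    names = sorted(pseudonyms, key=len, reverse=True)
    alternatives = "|".join(re.escape(n) for n in names)
    # Match a whole username, optionally written as an @mention.
    pattern = re.compile(r"(?<!\w)@?(" + alternatives + r")(?!\w)")
    return [
        {
            "author": pseudonyms[rec["author"]],
            "body": pattern.sub(lambda m: pseudonyms[m.group(1)], rec["body"]),
        }
        for rec in records
    ]

# Hypothetical sample content, standing in for an export of posts and comments.
records = [
    {"author": "alice", "body": "Thanks @bob, see my project page."},
    {"author": "bob", "body": "alice, good point."},
]
pseudonyms = build_pseudonyms(r["author"] for r in records)
anonymized = anonymize(records, pseudonyms)
for rec in anonymized:
    print(rec)
```

Note what this does not do: it only touches usernames. Project and organisation names mentioned in the text stay in place, which is exactly the remaining re-identification path discussed in this thread, and the reason for the extra period in point 4 where people can edit their own material.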

That’s a way to go

Alberto, your proposal seems good and balanced to me. About being a privacy adviser volunteer: yes, why not, count me in. I’m not an expert on all this, but I have a basic understanding of what could be done with data … . When you need some input for the page on the anonymization process or the like, just drop me a line.

One addition to the proposal. I’d still propose to add more to the “Legals” page: about the research methods that will be used, about the data being provided as a package for download, about what data is collected by means of ethnographic annotations, and about how the anonymizing process will work (according to your above proposal).

And three notes, to clear up remaining misunderstandings:

  • I did not expect that you really want to open access to the server and its logs etc. That was merely an illustrative general example (not referring to Edgeryders at all) of how de-anonymizing data is prohibited by existing privacy laws, and how "novel use" and correlation of data can lead to unimagined problems ...
  • About your point 2, mentioning "Some people here want to be recognized; in fact, I expect this will be the case for most of them." and the example of collaborating with institutions by reporting to a council or parliament. Let me make the content/data difference still clearer, as my point revolves around that only. Yes, most people, including me, want to be recognized when it is about their content, that is, human perception and processing of their ideas, one idea at a time. Assuming people understood the difference, I doubt that this extends to a desire to be recognized in data, where it's about computational processing of structured information, correlations, network analysis, and not-yet-imagined novel uses (like personality profiling) by anyone who downloads the data package. I for one am not open to being identifiable in data, just like I'm not open to undergoing personality profiling or the like before appearing at a parliament ... . In my view, you just don't know what will become of a data package in the wild, so we better be careful with that ... like you said, "erring on the side of safety". (I know there are these post-privacy people who do away with the concept of privacy altogether in the Internet age, but Edgeryders is not made of these only, so it should provide anonymity if so desired.)
  • The notion of "20 years" was a wild guess for how long the Edgeryders data package might be around. Because it's open content, it can be provided by other people for download, even after the Edgeryders site is down. And nobody can do anything to stop that, should it turn out afterwards that leaving people identifiable in the data was not a good idea after all ...
Proposal for future projects. In case you're going to manage more projects like Edgeryders in the future, maybe it's advisable to separate the platform right from the start into a "story collection" and a "socializing" part. The story collection would focus on collecting short accounts of people's experiences in different areas, maybe even with a max. word count. And it would be stated right at the start that these stories will be used and published as structured data which can be processed and mined automatically, so people would be advised to not mention any names or projects or organizations. And in the socializing area, there would be a rather open discussion space. It could likewise be open content but would not make it into structured data packages for download, so people would be free to give names etc. if they like. This area might even provide web-based query, evaluation and annotation tools to start evaluating the stories data right on the platform, with the help of the crowd. If that works out, it might help to drive costs down.
 

Done deal

Thanks for volunteering, splendid. I awarded you 500 reputation points and an action badge (shows on your profile) as a token of my gratitude. We’ll figure this out during the summer.

Scribble over usernames?!?!?!?!

To all intents and purposes, once it’s published, it’s published. Google and the Three Letter Agencies spider the web regularly, it’s all in the data vaults, and everybody knows that when you put something online using a platform like EdgeRyders or even Facebook it’s there for keeps. Even if you delete it, it once existed, and that fact is recorded.

I’d just keep the site as it was on freeze day, personally, usernames and all, except in cases where people have a real problem that can only be solved by our redacting content.

Public means public, and I don’t see the semantic distinction between text and (say) JSON representations of the same site, it’s just a bit of spidering to turn one into the other anyway.

Thoughts? Am I being cavalier here? Surely what’s done is done.

Not exactly

Though you make good points, clearly.

Privacy is not binary, but a continuous variable: it’s all about costs. Data that is very easy to obtain is less private than data that is still accessible but more costly to dig out. We do a form of open science here, with downloadable data dumps and all, and neodynos has a point when he says this makes data mining ridiculously cheap. Since we are not exactly doing things that are very high on the three-letter agencies’ priority lists, obtaining data from Edgeryders is possible, but the cost-benefit relationship is not there… unless we organize everything neatly and serve it on a silver platter.

au contraire

“I don’t see the semantic distinction between text and (say) JSON representations of the same site, it’s just a bit of spidering to turn one into the other anyway.”

It’s rather the distinction between text and (say) RDF triples or another semantic format, as we’re talking about manual additions to all the Edgeryders content using “ethnographic” semantic annotations. Spiders can’t do that transformation, as computers don’t understand text.

And this lack of understanding is also why Google search and the Three Letter Agencies’ keyword analysis tools for text are quite crude compared to the computational semantic queries you can run on such semantic data (see for example the “Semantic Web” idea and its possibilities).

For the rest, Alberto hit the nail on the head: manual text-to-data transformation is normally prohibitively expensive, and that has protected enough of our privacy even though the text is public. So there’s totally no problem with leaving the Edgeryders site up and running as text, with all names in place. But once it has been manually transformed into something computers can understand, we should think about privacy again …
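
To illustrate why annotated data is so much cheaper to mine than plain text, here is a minimal sketch in Python. Plain tuples stand in for RDF triples; the predicate names (`posted_in_language`, `annotated_stance`, `mentions_organisation_in`) and the sample users are invented for this example and are not the real Edgeryders annotation scheme, which I don't know.

```python
# Hypothetical annotations as (subject, predicate, object) triples.
triples = {
    ("user_a", "posted_in_language", "es"),
    ("user_a", "annotated_stance", "commons-oriented"),
    ("user_a", "mentions_organisation_in", "Spain"),
    ("user_b", "posted_in_language", "es"),
    ("user_b", "annotated_stance", "market-oriented"),
    ("user_c", "posted_in_language", "en"),
    ("user_c", "annotated_stance", "commons-oriented"),
}

def subjects_matching(triples, conditions):
    """Return the subjects for which every (predicate, object) condition holds."""
    subjects = {s for s, _, _ in triples}
    return sorted(
        s for s in subjects
        if all((s, p, o) in triples for p, o in conditions)
    )

# "Everyone who posted in Spanish, is annotated as commons-oriented,
# and mentions an organisation based in Spain."
query = [
    ("posted_in_language", "es"),
    ("annotated_stance", "commons-oriented"),
    ("mentions_organisation_in", "Spain"),
]
print(subjects_matching(triples, query))  # → ['user_a']
```

On raw text, the same question needs a human (or a lot of unreliable guesswork) to read every post. On annotated data it is a few lines, and anyone who downloads the package can ask it.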