Hashing the OpenCare dataset at sunset

alberto · November 24, 2017, 2:56pm

As per our data management plan, we publish the raw OpenCare dataset on Zenodo. In light of the GDPR, @markomanka and I agreed that it would be a good idea to pseudonymize the updated dataset before we publish it for the second and final time. Handles on Edgeryders are not “pseudonymous enough”. Question to @markomanka: are Edgeryders user IDs good enough?

In practice, this will be a bunch of JSON files. @matthias, do you have suggestions?

matthias · November 24, 2017, 5:42pm

The numeric user IDs, as used on Discourse, are not suitable, as the mapping between usernames and IDs is publicly available in the Discourse API (without login – example).

But you can just use randomly generated strings or numbers as IDs.
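
A minimal sketch of that idea in Python, assuming the usernames are available as a plain list. `make_pseudonyms` is a hypothetical helper, not part of Discourse or any Edgeryders tool. Each ID is drawn from the standard `secrets` module, so it carries no information about the username; the mapping itself would have to be discarded or kept private after the export.

```python
import secrets

def make_pseudonyms(usernames, nbytes=8):
    """Map each distinct username to a random hex string (hypothetical helper)."""
    mapping = {}
    used = set()
    for name in usernames:
        if name in mapping:
            continue  # same user seen again: keep the ID already assigned
        token = secrets.token_hex(nbytes)
        while token in used:  # re-draw on the (very unlikely) collision
            token = secrets.token_hex(nbytes)
        used.add(token)
        mapping[name] = token
    return mapping

# Repeated handles get a single ID, so 3 entries come out of 4 inputs.
pseudonyms = make_pseudonyms(["alberto", "matthias", "markomanka", "alberto"])
```

Because the IDs are random rather than derived from the handle (for example by hashing it), nobody can recompute them from the public list of usernames.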

alberto · November 24, 2017, 6:02pm

Write a Python script? Or are there tools?

matthias · November 24, 2017, 6:29pm

I’m not aware of specialized tools for this. The fastest way to get it done will depend on what process you use for export (do you export directly from Discourse, or from a tool like Edgesense with its own secondary database?).

So if you already have a script that creates the JSON output, you could add a function there. If you use a standard tool to export from your database to JSON, it may be faster to add a field with a random ID to the user records there.
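
As a rough illustration of the first option, here is what such a function could look like in Python. Everything about the record layout is an assumption: the field names `username`, `user_id` and `raw`, and the sample values, are made up for the example and need to be adapted to the real export. The function replaces the handle with a random ID and drops the numeric ID, since that one can be mapped back to the username.

```python
import json
import secrets

def pseudonymize_records(records, user_field="username", drop_fields=("user_id",)):
    """Return copies of the records with the user field replaced by a random ID.

    Field names are assumptions for illustration; adjust them to the real export.
    """
    mapping = {}  # username -> random ID, kept only for the duration of the run
    cleaned = []
    for record in records:
        record = dict(record)  # shallow copy so the input is left untouched
        for field in drop_fields:
            record.pop(field, None)  # remove fields that map back to the user
        name = record.get(user_field)
        if name is not None:
            if name not in mapping:
                mapping[name] = secrets.token_hex(8)
            record[user_field] = mapping[name]
        cleaned.append(record)
    return cleaned

# Made-up sample data, not taken from the OpenCare dataset.
posts = [
    {"username": "alberto", "user_id": 4, "raw": "First post"},
    {"username": "matthias", "user_id": 5, "raw": "A reply"},
    {"username": "alberto", "user_id": 4, "raw": "Another post"},
]
print(json.dumps(pseudonymize_records(posts), indent=2))
```

The same user keeps the same random ID across all records in one run, so the interaction network stays intact. Note that this sketch does not touch handles that appear inside the post text itself (for example @mentions); those would need separate treatment.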

alberto · November 24, 2017, 9:46pm

Good idea! Very economical.

markomanka · November 25, 2017, 12:28pm

@alberto The ideal would be to use mappings that are not biunivocal (one-to-one). If the user IDs can be mapped back, then this would not cut it.

alberto · November 25, 2017, 2:10pm

I think @matthias’s idea solves it. Just add a couple of lines to the export script.