Hashing the OpenCare dataset at sunset

As per our data management plan, we publish the raw OpenCare dataset on Zenodo. In light of the GDPR, @markomanka and I agreed that it would be a good idea to pseudonymize the updated dataset before we publish it for the second and final time. Handles on Edgeryders are not “pseudonymous enough”. Question to @markomanka: are Edgeryders user IDs good enough?

In practice, this will be a bunch of JSON files. @matthias, do you have suggestions?


The numeric user IDs, as used on Discourse, are not suitable, as the mapping between usernames and IDs is publicly available in the Discourse API (without login – example).

But you can just use randomly generated strings or numbers as IDs.
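A minimal sketch of what that could look like in Python, assuming you already have the list of usernames that appear in the export. The function name `build_pseudonym_map` and the sample usernames are hypothetical, not part of any existing script:

```python
import secrets


def build_pseudonym_map(usernames, nbytes=8):
    """Assign each distinct username a random ID unrelated to the name.

    Hypothetical helper: the IDs come from a cryptographic random source,
    not from the username, so they cannot be recomputed from public data.
    """
    mapping = {}
    used = set()
    for name in usernames:
        if name in mapping:
            continue  # same user seen again, keep the ID already assigned
        pseudonym = secrets.token_hex(nbytes)
        while pseudonym in used:  # guard against an (unlikely) collision
            pseudonym = secrets.token_hex(nbytes)
        used.add(pseudonym)
        mapping[name] = pseudonym
    return mapping


# Made-up usernames, for illustration only.
mapping = build_pseudonym_map(["alice", "bob", "alice"])
```

Note that the mapping table itself is the only link between usernames and IDs, so it would have to be deleted or kept private after the export.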

Should I write a Python script, or are there existing tools for this?

I’m not aware of specialized tools for this. The fastest way to get it done will depend on what process you use for the export: do you export directly from Discourse, or from a tool like Edgesense with its own secondary database?

So if you already have a script that creates the JSON output, you could add a function there. If you use a standard tool to export from your database to JSON, it may be faster to add a field with a random ID to the user records there.
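As a rough sketch of the first option, the function below could be dropped into an export script. It assumes the export is a list of dicts and that the handle sits in a field called `username`; both the field name and the sample records are assumptions, so adjust them to whatever your export actually produces:

```python
import json
import secrets


def pseudonymize_records(records, user_fields=("username",)):
    """Return copies of the records with user handles replaced by random IDs.

    Hypothetical helper: the same handle always gets the same ID within
    one run, so who-wrote-what stays consistent across records.
    """
    mapping = {}  # handle -> random ID; do not publish this table
    output = []
    for record in records:
        clean = dict(record)  # shallow copy, the input is left untouched
        for field in user_fields:
            if field in clean:
                handle = clean[field]
                if handle not in mapping:
                    # 8 random bytes give a 16-character hex string
                    mapping[handle] = secrets.token_hex(8)
                clean[field] = mapping[handle]
        output.append(clean)
    return output


# Made-up records, for illustration only.
posts = [
    {"username": "alice", "text": "hello"},
    {"username": "bob", "text": "hi"},
    {"username": "alice", "text": "again"},
]
pseudonymized = pseudonymize_records(posts)
print(json.dumps(pseudonymized, indent=2))
```

This only touches the listed fields. Handles that appear inside the post text itself (for example as @mentions) would need separate treatment.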

Good idea! Very economical.

@alberto The ideal would be to use non-biunivocal (that is, not one-to-one) mappings. If the published IDs can be mapped back to the users, then this would not cut it.

I think @matthias’s idea solves it. Just add a couple of lines to the export script.
