Data management plan: what data are we producing?

alberto · March 22, 2016, 4:22pm

We are required to complete a data management plan (DMP) for OpenCare. It should be ready within six months of starting the project, and kept up-to-date. The trick is this: each dataset we produce (and use for our research) needs to be described in the DMP. So, what data are we producing?

My ideas are clear about the WP1 (Online conversation) and WP4 (Data processing for collective intelligence). These are data that start out as an online conversation, encoded in the Edgeryders database, and then get enriched by data processing. I have made a first pass – still not publishable I guess, so I am keeping it on Google Docs for now. Here it is.

What about data in WP2 (Prototyping), WP3 (Policy) and Task 5.2 (Ethics/Consent) in WP5? What will they be like? Which formats? Which metadata? Where do we archive them?

@Costantino, @zoescope, @Lakomaa, @markomanka: I guess you should each give this some thought and address the issue in the DMP. The instructions are in the Google Doc, itself modelled on the European Commission’s template. On page 7 you will find a template with the help text; on pages 1-6 I have attempted to document the three datasets that we (Edgeryders and UBx) will produce.

lucechiodelliub · April 4, 2016, 9:30am

Resource for Open Data / Open Access issues

Hi,

I just found this website : http://policy.recodeproject.eu/

This website is part of a FP7 project promoting open access and open data. It gives recommendations and tools on these subjects.

It might be worth having a look while writing the data management plan for further ideas.

lucechiodelliub · May 13, 2016, 1:11pm

Any news on this topic?

ping - @Alberto, @melancon, @Costantino, @zoescope, @Lakomaa, @markomanka

Hello, have you had the time and/or the opportunity to discuss and update the data management plan so far? As Alberto mentioned, we have to submit another version to the EC in June.

You will find the H2020 guidelines for the DMP here (just in case)

alberto · May 13, 2016, 2:36pm

On my side

https://edgeryders.eu/en/opencare-research/data-cleanup-done-ready-for-export

API ready, now we know what the data look like and can quickly decide how to expose them. It could be as easy as a JSON dump.

We could also open up a variant display (with no usernames) and leave the JSON API online. We need to be aware that the names (or, more correctly, the Edgeryders handles) of the users can be scraped from the website. For us (Edgeryders), this is acceptable as a privacy level. What do you say, @markomanka?
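For illustration, the "variant display (with no usernames)" could be produced by filtering the JSON dump before exposing it. This is only a minimal sketch: the field names (`author`, `username`, `user_handle`) and the sample record are hypothetical, not the actual Edgeryders export schema.

```python
import json

# Hypothetical keys that identify a user; the real export may use other names.
USER_FIELDS = ("username", "user_handle", "author")


def strip_user_fields(record, fields=USER_FIELDS):
    """Return a copy of the data with user-identifying keys removed, recursively."""
    if isinstance(record, dict):
        return {
            key: strip_user_fields(value, fields)
            for key, value in record.items()
            if key not in fields
        }
    if isinstance(record, list):
        return [strip_user_fields(item, fields) for item in record]
    return record


# Made-up sample data, only to show the shape of the transformation.
sample = [
    {
        "post_id": 1,
        "author": "alice",
        "title": "Hello",
        "comments": [{"comment_id": 10, "author": "bob", "body": "Hi"}],
    }
]

anonymized = strip_user_fields(sample)
print(json.dumps(anonymized, indent=2))
```

Note that this only removes handles from the export itself; as said above, they can still be scraped from the website.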

markomanka · May 14, 2016, 11:20am

So far…

Through the funnel a user becomes aware of the fact that the posts on this platform are fully browsable (and downloadable), and accepts this policy.

The metadata you are making available via the JSONs that would not otherwise be available to a non-logged-in sniffer are quite limited (he/she could check time stamps, users’ handles and pictures, etc. just by downloading the pages)… and mostly, if not entirely, limited to interpreted information from the ethnographer, in the format of metadata… These are also covered by the consent, anonymous (except for the risk of remapping, which is common to all anonymized data), and in any case reflect the researchers’ interpretation of content, not necessarily a quality of the users…

Unless I am missing something @Alberto , I would say we are ok.

alberto · May 24, 2016, 10:50am

As far as ER is concerned

Ok, so, Marco authorizes us to publish. I agree with him.

we publish via the Edgeryders API (details)
I gathered some good practices on the publishing of metadata. Most open data programmers like Mashape. This, for me, is an excellent example of documenting APIs with Mashape. Resources (for us, only three) are on the side bar; programming languages are on the horizontal navbar. The "request example" box contains a snippet of code to access the API in that language. The "response body" box contains a model of the JSON, that respects the structure of the actual response.
My only concern is that the API documentation would live on Mashape, not on Edgeryders. @Matthias, any thoughts? Does this make sense?
@melancon, are you OK with UBx leading on producing API documentation?
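To give an idea of what a "request example" box might contain for Python, here is a rough sketch. The base URL, the `posts` resource name and the `page` parameter are all placeholders I made up for illustration; they are not the real Edgeryders API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder base URL, NOT the real Edgeryders endpoint.
BASE_URL = "https://example.org/api"


def build_request(resource, **params):
    """Build a GET request for one API resource, asking for a JSON response."""
    query = f"?{urlencode(params)}" if params else ""
    return Request(
        f"{BASE_URL}/{resource}.json{query}",
        headers={"Accept": "application/json"},
    )


def fetch(resource, **params):
    """Send the request and decode the JSON response body (needs network access)."""
    with urlopen(build_request(resource, **params)) as response:
        return json.load(response)


# Building the request does not touch the network; fetch() would.
request = build_request("posts", page=1)
print(request.full_url)
```

The "response body" box would then show a model of whatever `fetch("posts", page=1)` returns, once the real resources are defined.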

melancon · May 24, 2016, 1:17pm

API doc – only doc ?

Just to make sure I get everything right. I’m including @bpinaud in this.

We make our data available through a website(page) built on top of Mashape. The page allows one to select different predefined views on the data (sidebar). The horizontal navbar then specifies the appropriate code to get the data from the view, in a variety of languages (and output format? I only saw json).

What needs to be done:

define these views (build them on the Drupal side)
write a short documentation for each of these views (as in the example)
- one sentence description
- request example, response header, response body

Please confirm.

Also, please specify the form this doc must take. Do we simply forward a text description for all this with someone else feeding the Mashape page with it? Or do we need to get our hands dirty with Mashape (takes more time)?

We are currently redefining the views to use. For now, we are using custom, ad hoc, dirty hacked views. We need to stabilize field names across the views, and actually build a data model underlying all of this. We are also building a graph database on top of which we plan to run EdgeSense2 (let’s call it that).
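As an illustration of what "stabilizing field names across the views" could mean in practice, here is a small sketch that compares the fields each view returns against a shared data model. The view names, field names and the model itself are invented for the example; the real ones are still to be defined.

```python
# Hypothetical data model: the set of field names every view is allowed to use.
DATA_MODEL = {"post_id", "comment_id", "user_id", "created"}

# Made-up sample output of two views; "uid" is a deliberate inconsistency.
views = {
    "posts": [{"post_id": 1, "user_id": 7, "created": "2016-05-01"}],
    "comments": [{"comment_id": 10, "post_id": 1, "uid": 8, "created": "2016-05-02"}],
}


def fields_by_view(views):
    """Map each view name to the set of field names used in its records."""
    return {
        name: set().union(*(record.keys() for record in records))
        for name, records in views.items()
    }


def unknown_fields(views, model=DATA_MODEL):
    """Report, per view, the field names that are not part of the data model."""
    report = {}
    for name, fields in fields_by_view(views).items():
        extra = fields - model
        if extra:
            report[name] = sorted(extra)
    return report


print(unknown_fields(views))  # {'comments': ['uid']}
```

A check like this could run whenever a view is changed, so that the documented response bodies stay in line with the data model.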

Incidentally, I thought I should write a post reporting on the work we are doing. A kind of “What’s up” post that we should turn into a series, each partner taking its turn at saying more about what they are doing (an episode a week, that’s once every 6 weeks per partner).

alberto · May 25, 2016, 10:09am

What’s the point of that?

@melancon, I do not see any point in you emailing me a .doc file which I then upload onto Mashape! That just introduces an extra, unnecessary step. I think it works like this. First, we define a data management workflow for edgeryders.eu data. We are quite close already: we know about the data, we need to figure out where to put the metadata. We have three options:

In-house (static page in the OpenCare minisite on edgeryders.eu). The advantage is that maintaining the same infrastructure takes care of both data and metadata. The disadvantages are aesthetics and findability. There is a tool for publishing beautiful API documentation onto your own server and GitHub. It's called Slate: https://github.com/tripit/slate/wiki
A commercial service like Mashape. Advantages: looks nice, probably faster to put together, more findable. Disadvantages: one more thing to keep an eye on; I don't trust commercial services to be online in the long run.
The EC-sponsored service. The Open Research Data Pilot is (loosely) coupled with something called OpenAire, the EC's open access infrastructure. For data, OpenAire seems to have spawned Zenodo (run by CERN). Advantages: long-term reliability, findability, better integration with Horizon 2020. Disadvantages: it seems to be mostly a repository for dumping static data onto; I cannot find a tool or guidelines to document APIs of data kept elsewhere. I have written to the Zenodo team, let's see what they say.

Once we have made this decision, I guess it would be UBx to document the API and put it wherever we want them, and do so directly, without going through me.

We could also go for redundancy: a nice-looking web page on Mashape (or Zenodo), but also a simple page on Edgeryders with a button to download the API doc.

Please remember: this is NOT all of OpenCare’s data management strategy. We still need to figure out how and where to document the output from the workshops in WeMake. That, of course, is for @Costantino to decide. Costa, you will need to make that decision soonish, so we can deliver the data management plan.

Also ping @LuceChiodelliUB

alberto · May 25, 2016, 10:11am

Zenodo fail

“Thank you for your message. Note, due to overwhelming interest, we are currently also refurbishing our infrastructure. As a result the response times to support requests are longer than usual, so please bear with us until the new infrastructure is scheduled to be put in place. Thank you for your understanding and we hope you’ll find the Zenodo improvements well worth waiting for!”

Bah

@markomanka, do you know these guys at CERN?

lucechiodelliub · May 25, 2016, 12:11pm

OpenAire

Completing Alberto’s description: OpenAire was first funded by the EC as a FP7 project, and the EC recommends ICT project members to use it. The OpenAire team also provides advice for data management (if needed).

If you want to know more about OpenAire:

An advantage of OpenAire (minimal, but still): when reporting to the EC, publications accessible via OpenAire will be displayed automatically in our reports. We will only need to double-check that the publications listed are linked to the project (see Periodic report template and H2020 guidelines for data management)

melancon · May 25, 2016, 2:43pm

OpenAire is Zenodo

I went to read more about OpenAire and what I see is that OpenAire is or uses Zenodo. Also, they seem to rely on a hierarchical scheme with repositories hosted (and managed?) by local institutions, as if the whole of OpenAire were more about agreeing on a protocol that makes all these interoperable and transparent to use, as if they were running a single global service.

So, either we wait for these guys to “refurbish”, or we trade our soul to commercial products.

alberto · May 25, 2016, 3:27pm

The other way around

OpenAire is mostly for papers. When people pointed out nowadays researchers publish datasets as well, it spawned Zenodo. So, I guess Zenodo is the data part of the OpenAire constellation. I would prefer Zenodo to commercial stuff, but I don’t like it that there is no due date for the “refurbished” Zenodo. Can we go back to the EC and tell them the data management plan is on standby until Zenodo comes online?

From the Spaghetti Open Data mailing list:

“The nice thing about Mashape is its community. They test your APIs and give you feedback (in our case, they even open issues when they find duplicate or discordant data), or request additional data.”

melancon · May 25, 2016, 5:02pm

Two things

First. “Can we go back to the EC and tell them the data management plan is on standby until Zenodo comes online?”

Sounds reasonable. Possibly, but we have other issues we need to “negotiate” ahead of this one … See second observation before doing any hard thinking on this part of my comment.

Second. “They test your APIs and give you feedback (in our case, they even open issues [etc.])” This is great, if it indeed happens. Not only will we be complying with a directive, we will make sure our stuff is indeed usable.

Given the context, I would push for solution no 2.

P.S. Just poking my colleague @bpinaud to make sure he follows the story.

alberto · May 26, 2016, 8:55am

Well, no guarantee of that

The guy who raves about Mashape, Paolo, may just be a lucky case. Communities, you know. There’s no telling what will turn them on. His project is on sport data, so sexier than ours

Anyway, the idea is: we follow some wizard or other, produce documentation and go redundant. At the end of the day, we can do it on Mashape, then copy-paste the example query etc. anywhere. Mashape just gives you the color coding (and the community, if it comes).

To summarize:

If we use Mashape, I recommend we also use something else as a backup plan (even a README file will do).
If we use Zenodo, no need for redundancy.

Works?

melancon · May 26, 2016, 7:01pm

Mashape + redundancy then

@Alberto and @bpinaud

Ok, let’s head for Mashape. Would redundancy based on a github wiki be fine ?

bpinaud · May 26, 2016, 8:14am

reading…

I am following, …

alberto · June 3, 2016, 11:24am

Update from Zenodo and final proposal

Zenodo support got back in touch. They say:

Zenodo is an archiving service. You upload one or more files, describe them with a bit of metadata and publish them. After that, we give you a DOI so you can cite the data. Zenodo is a CERN service, so your data is stored in our data centres that we also use for the Large Hadron Collider, and CERN guarantees the data’s long term persistence.

It sounds like you have your own “data repository” with live data that you can query and export. The way Zenodo is usually integrated in such a system is by taking snapshots of a dataset and archiving that snapshot in Zenodo (as you cannot change the files in Zenodo after you have gotten the DOI). This allows you to focus on creating a great service for your data, but not care too much about long-term preservation.

As for API documentation tools, we haven’t found one that satisfies our needs fully. Usually, auto-generated documentation from the code is not very developer friendly but can be displayed without extra servers, whereas developer friendly documentation is usually best hosted by external tools.

So, Zenodo does not really work for “evolving” datasets. It can be a place to store backups. But we have daily backups on Edgeryders, so I fail to see the point.
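For reference, the snapshot approach Zenodo support describes could look roughly like the sketch below: freeze the current export into a dated file plus a checksum, and archive that file. The file naming and the sample record are my own assumptions, and the upload step itself is deliberately left out.

```python
import hashlib
import json
from datetime import date


def make_snapshot(records, day):
    """Serialize records deterministically.

    Returns (filename, payload bytes, SHA-256 hex digest of the payload).
    """
    payload = json.dumps(records, sort_keys=True, indent=2).encode("utf-8")
    # Hypothetical naming scheme: one file per snapshot date.
    filename = f"opencare-snapshot-{day.isoformat()}.json"
    return filename, payload, hashlib.sha256(payload).hexdigest()


# Made-up record, only to show the call.
filename, payload, checksum = make_snapshot(
    [{"post_id": 1, "title": "Hello"}], date(2016, 6, 3)
)
print(filename)
print(checksum)
```

Sorting the keys makes the output reproducible, so two snapshots of unchanged data give the same checksum.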

I propose we host data on the Edgeryders server, accessible via APIs. Documentation is on Mashape, redundancy on Edgeryders itself and possibly on the GitHub repo for Edgesense 2. Works?