Update from 32C3

alberto · December 29, 2015, 2:50pm

OpenCare will officially start in three days.

A few of us are at 32C3 (@Lakomaa, @zoescope, @Costantino, @Nadia and myself). We are doing good work.

Nadia onboarded several people, including Marie Moe – a computer security expert who wears a pacemaker and has discovered it's buggy, hackable and opaque, since it runs on proprietary software – https://events.ccc.de/congress/2015/Fahrplan/events/7273.html
@msanti, @mstn, @maxlath and I did some work on the "Visualizing self-diagnosis" project (GitHub – at the time of writing most stuff is on the wiki, because almost all we do is struggle with Wikipedia and Wikidata's data models).
We talked to Pirate Party MEP Julia Reda, and decided to stay in touch with a view to giving the European Parliament some ammo to regulate care in a community-friendly direction. Julia has provisionally agreed to show up at LOTE5, and to have a look at points of entry we might use to put OpenCare's results at the disposal of policy makers. What it comes down to: if we do good work, we'll get impact.

Looking forward to this, friends. See you in 2016. No surrender.

melancon · January 4, 2016, 12:08pm

Wikipedia/media data models

@Alberto How are things going, is there a chance we could use this in Feb?

melancon · January 4, 2016, 12:21pm

Great work - Good job!

@Alberto I can actually answer my own question … I went to the GitHub repository and read the document you posted. I see your line of reasoning and the strategy you envisage. What I am missing is the information you expect to get from analyzing the data. That would, for instance, help me figure out a possible viz. Digg indeed is a good idea. But it all depends on what you want to find, or what service you hope to deliver by digging into the data.

melancon · January 4, 2016, 4:10pm

Wikipedia/media domain coding scheme

@Alberto

I spent time looking at wikipedia pagecount dumps to see how easy/difficult it would be to build a DB to get on with this project – at least try things to see how feasible it is.

I spent time looking for the wikipedia domain and page coding scheme. Reading about how page counts are stored, I understood pages get coded into shorter sequences such as en + Main_Page to stand for Wikipedia, the free encyclopedia. But I didn’t find any place where I could access the whole map code <-> page.

Has anyone ever come across this piece of information?

danohu · January 4, 2016, 4:22pm

do you need it?

do you really need the whole map code <-> page?

For a first attempt, I’d assume starting with just English Wikipedia – so you just filter the dump down to en, and interpret the entries as en.wikipedia.org/wiki/[pagename]

melancon · January 4, 2016, 4:39pm

Good point …

but … The thing is, I am not interested in the whole of Wikipedia but only in the medicine page subset. Now, as far as I could see, I can put my hands on the set of all medicine pages by their names – but not their … code …

Since the page counts are given as a map code -> count, I do need the codes of the medicine pages to pull their counts out of the dump.

?

danohu · January 4, 2016, 4:53pm

but that link lists the pages as, for instance, /wiki/1858_Bradford_sweets_poisoning

Once you chop off the /wiki/, you get the page code, and can look up:

en 1858_Bradford_sweets_poisoning 1 19153
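
The lookup described above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes each dump line has four whitespace-separated fields (project code, page code, view count, bytes), as in the line just quoted. The function name `medicine_counts` and the extra sample lines are made up for illustration. Page codes in the real dumps may also be percent-encoded, which this sketch ignores.

```python
def medicine_counts(dump_lines, link_paths, project="en"):
    """Return {page_code: views} for the listed pages found in one hourly dump."""
    # Chop off the leading "/wiki/" to turn each link into a page code.
    prefix = "/wiki/"
    wanted = {p[len(prefix):] for p in link_paths if p.startswith(prefix)}
    counts = {}
    for line in dump_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip malformed lines
        proj, code, views, _size = fields
        if proj == project and code in wanted:
            counts[code] = counts.get(code, 0) + int(views)
    return counts


sample_dump = [
    "en 1858_Bradford_sweets_poisoning 1 19153",  # line quoted in the thread
    "en Main_Page 1000 123456",                   # hypothetical, not a medicine page
    "de 1858_Bradford_sweets_poisoning 3 100",    # hypothetical, other project
]
links = ["/wiki/1858_Bradford_sweets_poisoning"]
print(medicine_counts(sample_dump, links))
```

Keeping the wanted page codes in a set makes each membership test cheap, which matters because an hourly dump has millions of lines.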

melancon · January 4, 2016, 5:12pm

Is it that simple?

I owe you a big thank-you. I realize I was looking at project counts, which use codes that do not look like page names at all … Fail/Unfail is also about sometimes being a bit dumb – but hey, learning is about doing things you never did before

We should then be able to put up a process to test a few ideas. I’ll report on this thread when I have something new.

msanti · January 5, 2016, 9:31am

Working on it

I also have some notes about the data structure, which I have to refine and put on GitHub. I hope to do this by tomorrow (taking advantage of a holiday in Italy).

alberto · January 6, 2016, 11:03am

What is this for?

@melancon is a scientist. Rightly, he demands a well-defined question. Here’s my take.

We investigate the size and scope (both in physical space and disciplinary space) of the phenomenon of participatory diagnosis. Folk knowledge tells us that people use Wikipedia and other Internet resources to diagnose their own conditions, or to cross-check a diagnosis made by a physician. The question is: how widespread is the phenomenon? By looking at pageview counts, we can already get unexpected results. I did NOT expect that 95% of all English-language pages in WikiProject: Medicine would be viewed in a randomly chosen hour. Nor did I expect that the top pages would draw 500 hits in an hour. As we refine the query with geographical and other information, we might learn more interesting stuff.

But also, this is a rhetorical move. It is meant to draw the attention of policy makers and health professionals to collective intelligence. Wikipedia is built by a community, and used by a community. Collective intelligence is already an important player in care: so, people should pay attention to OpenCare.

alberto · January 7, 2016, 3:07pm

Javascript tool for querying WikiData

The generous @maxlath has just posted a tool that does exactly what we were trying to do in Hamburg: get WikiData entities by Wikipedia article title.

melancon · January 11, 2016, 9:16am

wikidata-sdk

I played with @maxlath's code to see how we could use it to prepare for MoN4. The tool is great, thanks @maxlath. I am however unsure how I can use it in a straightforward manner and combine it with page counts, for instance.

That is, I foresee the advantage of working with wikidata entities to properly index page content, but having access to the wikidata entities does not solve our problem of counting page visits in an effort to understand how people use wikipedia pages for auto-diagnosis.

Maybe @maxlath, because he is the designer of the wikidata-sdk, can help us (or at least me)?

P.S. I also have a few questions on how urls returned by running the various search routines can be used. I have unsuccessfully tried to access those urls with my browsers and I sometimes get empty content.

maxlath · January 11, 2016, 10:34am

hello @melancon, I’m interested in knowing where you got blocked once you started using wikidata-sdk, and I would also like to have examples of the urls returning empty results that you were mentioning.

On the question of the interest of Wikidata to get page view statistics, three things make it particularly interesting:

Wikidata being structured data, it can be queried in all sorts of ways using the SPARQL endpoint (for which there is no helper in wikidata-sdk yet): for instance, here is the list of all beers or subclasses of beer in Wikidata (the result in JSON)
every Wikidata entity centralizes links to Wikipedia pages in all languages (see "sitelinks" in the API results)
Wikidata ids are meant to be stable, while Wikipedia titles can change, which makes maintaining your project in the long run harder
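
As a rough illustration of the first point, the Python sketch below only builds the request URL for a SPARQL query; it does not send it. The endpoint address, the item id Q44 for beer and the query text itself are assumptions, not taken from the linked example. The helper name `sparql_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: the public Wikidata SPARQL endpoint lives at this address.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: Q44 is "beer", P31 is "instance of", P279 is "subclass of".
BEER_QUERY = "SELECT ?item WHERE { ?item wdt:P31|wdt:P279 wd:Q44 }"


def sparql_url(query):
    """Build a GET URL that asks the endpoint for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


print(sparql_url(BEER_QUERY))
```

`urlencode` takes care of escaping the spaces, braces and question marks in the query, so the text can be written in readable form.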

melancon · January 11, 2016, 2:54pm

example puzzling result

Hi Max (I assume I can call you by that name …),

– @Alberto @MoE @dora might also be interested to join this discussion, let’s form a team heading to MoN4 (apart from @Alberto, I am unsure who will be there with us) –

I take “1% rule (aviation medicine)” as the title of a medicine page (1% rule (aviation medicine) - Wikipedia, listed on the page referencing all medicine pages: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Medicine/Lists_of_pages/Articles).

Using the code snippets you provide on github, I understand I get a URL pointing at the corresponding wikidata entity: MediaWiki API result - Wikidata

When going to this entity I get a quite disappointing result:

{
  "servedby": "mw1201",
  "error": {
    "code": "unknown_format",
    "info": "Unrecognized value for parameter 'format': 20",
    "*": "See MediaWiki API help - Wikidata for API usage"
  }
}

or using json format:

{
  "searchinfo": {
    "search": "1% rule (aviation medicine)"
  },
  "search": [],
  "success": 1
}

It may well have to do with this special page, or the use of the % special character (?). Things work fine when I work out the same snippet with other pages such as the next one (‘1,1,1,2-Tetrafluoroethane’) for which I get:

{
  "searchinfo": {
    "search": "1,1,1,2-Tetrafluoroethane"
  },
  "search": [
    {
      "id": "Q423029",
      "concepturi": "http://www.wikidata.org/entity/Q423029",
      "url": "//www.wikidata.org/wiki/Q423029",
      "title": "Q423029",
      "pageid": 399611,
      "label": "1,1,1,2-tetrafluoroethane",
      "description": "haloalkane refrigerant",
      "match": {
        "type": "label",
        "language": "en",
        "text": "1,1,1,2-tetrafluoroethane"
      }
    },
    {
      "id": "Q4545638",
      "concepturi": "http://www.wikidata.org/entity/Q4545638",
      "url": "//www.wikidata.org/wiki/Q4545638",
      "title": "Q4545638",
      "pageid": 4337472,
      "label": "1,1,1,2-tetrafluoroethane",
      "match": {
        "type": "label",
        "language": "en",
        "text": "1,1,1,2-tetrafluoroethane"
      }
    }
  ],
  "success": 1
}
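
One possible explanation for the two odd results, offered as a guess rather than a diagnosis: the 'format': 20 error looks like what happens when a title is pasted into the query string without percent-encoding. The empty search list may simply mean that no Wikidata label or alias matches the full Wikipedia title, since wbsearchentities searches labels and aliases, not page titles. The Python sketch below builds both kinds of request URL with the title properly encoded, without sending anything. The helper names are made up for illustration.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def search_url(text, language="en"):
    """URL for a label/alias search (wbsearchentities), text percent-encoded."""
    params = {"action": "wbsearchentities", "search": text,
              "language": language, "format": "json"}
    return WIKIDATA_API + "?" + urlencode(params)


def entity_by_title_url(title, site="enwiki"):
    """URL that looks an item up by its Wikipedia page title (wbgetentities)."""
    params = {"action": "wbgetentities", "sites": site, "titles": title,
              "props": "sitelinks", "format": "json"}
    return WIKIDATA_API + "?" + urlencode(params)


title = "1% rule (aviation medicine)"
print(search_url(title))
print(entity_by_title_url(title))
```

In both URLs the "%" becomes "%25" and the spaces become "+", so the rest of the query string (including format=json) stays intact.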

I still have to learn how to properly use javascript and jquery so I can have everything work into a single script, and hopefully output useful visualization of page count timelines, for instance. I am also thinking about a rougher solution, grabbing pageviews data over longer time periods, and then computing similarity measures between pages based on these time evolving page views.
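
To make the "similarity between pages" idea concrete, here is a minimal Python sketch using cosine similarity between two pageview time series. Cosine similarity is only one candidate measure, chosen here for illustration, and the view counts are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length pageview series (0.0 if one is all zeros)."""
    if len(a) != len(b):
        raise ValueError("series must cover the same time slots")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical view counts over four consecutive time slots.
views = {
    "page_a": [10, 20, 30, 40],
    "page_b": [20, 40, 60, 80],  # same shape as page_a, twice the traffic
    "page_c": [40, 30, 20, 10],  # opposite trend
}
print(cosine_similarity(views["page_a"], views["page_b"]))
print(cosine_similarity(views["page_a"], views["page_c"]))
```

Because the measure ignores overall scale, a popular page and a niche page with the same rhythm of visits come out as similar, which may or may not be what is wanted.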

As far as MoN4 is concerned, what counts is that we are in a position to easily manipulate this type of data so we can react to questions people have, feed discussions with facts extracted from data analysis, build visuals supporting hypothesis building, etc.

moe · January 13, 2016, 12:40am

interested, joining soon

hi @melancon, nice to meet you

I’m definitely interested in joining the discussion.

I’ve been following this and several other posts and tweets in the last few weeks; enough to get an idea of what is being discussed and considered. We’ve been discussing this ourselves a bit, with @dora.

At the same time, I admit I didn’t have time to give it more attention, nor to start testing or researching myself on anything, just yet. I’ve been a bit swamped but I’m coming out of it now, hopefully

Anyways just consider me in, reading and paying attention

alberto · January 11, 2016, 10:17am

Wikidata and page counts

@melancon, I was thinking of combining WikiData entities and page count data as follows.

Start from the list of English-language Wikipedia page titles of WikiProject: Medicine.
For each page on that list, use @maxlath's tool to get the WikiData item from the title. For example, cholera: https://www.wikidata.org/wiki/Q12090
Use the item to get the titles of all the Wikipedia pages (in all languages) that are about that item. The item corresponding to cholera (Q12090), for example, refers to 112 pages (in: Acèh, Afrikaans, Alemannisch, Aragonès, Arabic...)
For each page associated to the WikiData item, read the number of pageviews in the hourly pageviews stats data dumps (organized by page title and not by WikiData item).
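
The four steps above can be sketched in Python as follows. This is a sketch under stated assumptions, not a working pipeline: the lookups that would come from the Wikidata API and from the dump are passed in as plain dictionaries, the function name `views_per_item` is made up, and all data except the item id Q12090 is hypothetical.

```python
def views_per_item(titles, title_to_item, item_sitelinks, hourly_counts):
    """Sum one hour of pageviews over all language editions of each item.

    titles         -- English page titles from the WikiProject Medicine list (step 1)
    title_to_item  -- {english_title: wikidata_id}                           (step 2)
    item_sitelinks -- {wikidata_id: {project_code: local_title}}             (step 3)
    hourly_counts  -- {(project_code, local_title): views}                   (step 4)
    """
    totals = {}
    for title in titles:
        item = title_to_item.get(title)
        if item is None:
            continue  # no Wikidata item found for this title
        total = 0
        for project, local_title in item_sitelinks.get(item, {}).items():
            total += hourly_counts.get((project, local_title), 0)
        totals[item] = total
    return totals


# Hypothetical data; only the item id Q12090 (cholera) comes from the thread.
titles = ["Cholera", "Unknown_page"]
title_to_item = {"Cholera": "Q12090"}
item_sitelinks = {"Q12090": {"en": "Cholera", "it": "Colera", "fr": "Choléra"}}
hourly_counts = {("en", "Cholera"): 120, ("it", "Colera"): 15, ("de", "Cholera"): 9}

print(views_per_item(titles, title_to_item, item_sitelinks, hourly_counts))
```

Keying the result by WikiData item rather than by title is what lets the views of all language editions add up to a single figure per medical topic.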

Notice that the test we ran at 32C3 (results) did not do this. We estimated 184,000 views of medicine-related articles in the sample hour, but that referred almost only to English-language Wikipedia. The one exception is proper names: for example, the page “Louis Pasteur” has the same title in the Wikipedias of all languages, or at least all those that use the Latin alphabet.

Also notice that @MoE and @dora would also like to participate in MoN.

maxlath · January 12, 2016, 1:39pm

a lib and API that could be of interest for your project:

pageviews.js : “JavaScript client library for the Wikimedia Pageview API for Wikipedia and its sister projects.”

melancon · January 12, 2016, 2:25pm

GitHub - tomayac/pageviews.js: A lightweight JavaScript client library for the Wikimedia Pageviews API for Wikipedia and various of its sister projects for Node.js and the browser.

Yep, I had located it and I plan to use it as well.

I am new to javascript and I am learning as fast as I can to use jquery and ajax. I am trying to put together a small and malleable piece of code I could reuse and modify in the context of MoN4. One issue I came across is the impossibility of accessing out-of-domain content from a client. So I guess I have to design things split between server and client side.

Your wikidata-sdk and pageviews code could be run on the client, while the server could grasp things such as the list of all pages titles, for instance.

Any guidance on how to proceed is welcome. I understand you won’t be with us at MoN4 and LOTE5? That’s a pity

Best, Guy

melancon · January 14, 2016, 11:53am

wikidata entity id -> wikipedia page entry ???

– Might also be of interest to @dora and @MoE (and of course @alberto) –

I’ve done quite a few experiments with @maxlath’s code, which got me comfortable with wikidata’s api (and node.js and javascript in general). I do not see how I can easily manage to put my hands on a (the?) wikipedia page associated to a wikidata entry (in a given language) – I actually do not see why there would be a one-to-one correspondence. One obvious, brute-force way to go would be to grab all html links listed on the right panel of a wikidata entry, but I am looking for a more elegant and efficient way of doing things.

Also, it seems the pageviews.js package only offers daily page counts.

Any help is welcome.

maxlath · January 14, 2016, 12:36pm

to get Wikipedia urls from a Wikidata entity, you have to look in the sitelinks section of your entities. To do this from the API, make sure that you do query sitelinks, either by having “props=sitelinks” in your query (or props=sitelinks|claims|info|…) or no props parameter at all (then you get all properties).

So for instance, for Ebola (Q51993), you can query just the sitelinks like so: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q51993&format=json&props=sitelinks

and get

{
  "entities": {
    "Q51993": {
      "type": "item",
      "id": "Q51993",
      "sitelinks": {
        …
        "dewiki": {
          "site": "dewiki",
          "title": "Ebolafieber",
          "badges": []
        }
        …

It’s then up to you to rebuild the Wikipedia full URL using those data: “https://#{2 letters lang code}.wikipedia.org/wiki/#{title}”
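
A minimal Python sketch of that last step, assuming a response shaped like the one above. Two details are assumptions worth checking against real data: not every language code has two letters, so the sketch derives the code by stripping the "wiki" suffix from the site key, and a few keys ending in "wiki" (such as commonswiki) are not Wikipedias, so they are skipped using a hand-made exclusion list. The enwiki entry in the sample is added for illustration only.

```python
import json
from urllib.parse import quote

# Assumption: site keys ending in "wiki" that are not Wikipedia editions.
NON_WIKIPEDIA = {"commonswiki", "specieswiki", "metawiki",
                 "wikidatawiki", "mediawikiwiki"}


def wikipedia_urls(entity):
    """Map each Wikipedia sitelink of a Wikidata entity to a full page URL."""
    urls = {}
    for site, link in entity.get("sitelinks", {}).items():
        if not site.endswith("wiki") or site in NON_WIKIPEDIA:
            continue  # e.g. "enwikiquote" or "commonswiki"
        lang = site[:-len("wiki")].replace("_", "-")
        title = quote(link["title"].replace(" ", "_"))
        urls[lang] = f"https://{lang}.wikipedia.org/wiki/{title}"
    return urls


response = json.loads("""
{"entities": {"Q51993": {"type": "item", "id": "Q51993", "sitelinks": {
  "dewiki": {"site": "dewiki", "title": "Ebolafieber", "badges": []},
  "enwiki": {"site": "enwiki", "title": "Ebola virus disease", "badges": []},
  "commonswiki": {"site": "commonswiki", "title": "Ebola", "badges": []}
}}}}
""")
urls = wikipedia_urls(response["entities"]["Q51993"])
print(urls)
```

Spaces in titles are turned into underscores and the rest is percent-encoded, so the resulting page names can be matched against the page codes used in the pageview dumps.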