Quick update
Hi there, I have a bunch of code I wrote during C3 that I should push to GitHub. I am a bit overwhelmed by other nasty things, but I hope to find some time, maybe during this weekend.
The general idea was the following:
- Fetch medical terms (i.e. page titles in English) from the Wikipedia WikiProject Medicine. Each term is stored in MongoDB in the form shown below: the "wikidata" field is fetched from Wikidata by title, e.g. "1971 Iraq poison grain disaster" (step 2); the "locales" field is fetched from Wikidata using max's 'fetch by OpenData ID' tool (step 3); the "hours" field is computed from the pagecounts data (step 4).
  {
    "_id": INTERNAL_ID,
    "day": ISODate("2015-11-30T23:00:00Z"),
    "wikidata": {
      "pageid": 4094969,
      "ns": 0,
      "title": "Q4284220",
      "lastrevid": 248156717,
      "modified": "2015-09-06T00:45:47Z",
      "type": "item",
      "id": "Q4284220"
    },
    "locales": {
      "en": {
        "title": "1971 Iraq poison grain disaster",
        "hours": { "0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0, "10": 0, "11": 0, "12": 0, "13": 0, "14": 0, "15": 0, "16": 0, "17": 0, "18": 0, "19": 0, "20": 0, "21": 0, "22": 0, "23": 0 }
      },
      "es": { "title": "Desastre del grano envenenado de 1971 en Iraq" },
      "ru": { "title": "Массовое отравление метилртутью в Ираке (1971)" }
    }
  }
- For each term, find the OpenData ID from the page title. I built the request by hand without max's tool (but what I do is what the new functionality of max's tool does).
- For each term with a known OpenData ID, fetch the OpenData details. This gives us the titles of the same page in other languages. Of course, some languages may be missing; for example, there is no Italian entry for "1971 Iraq poison grain disaster".
- Dump the pagecount data and update the corresponding entries in MongoDB with the count information (only for medical terms).
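The lookup and parsing steps above can be sketched with stdlib-only helpers. The endpoints and response shapes follow the public MediaWiki/Wikidata APIs (`prop=pageprops` exposes the "wikibase_item" property, `wbgetentities` returns sitelinks); the function names are my own placeholders, not the actual code:

```python
from urllib.parse import urlencode

# Step 2: build the Wikipedia API request that maps a page title
# to its Wikidata (OpenData) ID via the "wikibase_item" page prop.
def wikidata_id_url(title):
    params = urlencode({
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": title,
        "format": "json",
    })
    return "https://en.wikipedia.org/w/api.php?" + params

def extract_wikidata_id(response):
    # The API keys results by internal page id, so take the first page.
    page = next(iter(response["query"]["pages"].values()))
    return page.get("pageprops", {}).get("wikibase_item")

# Step 3: fetch entity details (sitelinks) for a known ID.
def entity_url(qid):
    params = urlencode({
        "action": "wbgetentities",
        "ids": qid,
        "props": "sitelinks",
        "format": "json",
    })
    return "https://www.wikidata.org/w/api.php?" + params

def extract_locales(response, qid):
    # Turn sitelinks like "enwiki"/"eswiki" into the "locales"
    # sub-document shown above (a crude suffix strip, good enough
    # here since we only keep per-language Wikipedia sitelinks).
    sitelinks = response["entities"][qid]["sitelinks"]
    return {
        key[:-4]: {"title": link["title"]}
        for key, link in sitelinks.items()
        if key.endswith("wiki")
    }

# Step 4: one pagecounts line is "project page_title count bytes".
def parse_pagecount_line(line):
    project, page, count, _ = line.split(" ")
    return project, page, int(count)
```

The extracted `(project, page, count)` tuples would then drive a per-hour update of the "hours" sub-document in MongoDB.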
All steps are implemented, but for the pagecount step I am running out of memory and I need a smarter approach. I should either split the pagecount file into several batches or pre-filter it down to the terms we are interested in.
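The pre-filtering option could look like this: load the set of titles we care about up front (it is tiny compared to the dump) and stream the pagecount file line by line, so memory use is bounded by the term set rather than the file size. This is only a sketch of the idea; `wanted` is a placeholder for the set of medical-term titles:

```python
def matching_counts(lines, wanted):
    """Stream pagecount lines, yielding (title, count) only for
    English-Wikipedia titles present in `wanted`. Nothing but the
    matches is ever held in memory."""
    for line in lines:
        parts = line.split(" ")
        if len(parts) != 4:
            continue  # skip malformed lines
        project, title, count, _ = parts
        if project == "en" and title in wanted:
            yield title, int(count)
```

Usage would be something like `for title, count in matching_counts(open(path), wanted): ...`, issuing one MongoDB update per match.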
The algorithm could be generalised so that pagecounts and terms are updated automatically.
I use MongoDB to store (partial) results. Firstly, I need to store partial results somewhere, because I cannot process the whole thing in a single step, and MongoDB is better than flat files for this. Secondly, I plan to build a MongoDB-backed webapp with a sort of simulator of Wikipedia hits.
I do not know if this helps for now. I will be back soon with the code.