Update from 32C3

Quick update

Hi there, I have a bunch of code I wrote during 32C3 that I should push to GitHub. I am a bit overwhelmed by other nasty things at the moment, but I hope to find some time this weekend.

The general idea was the following:

  1. Fetch medical terms (i.e. page titles in English) from the Wikipedia project on Medicine. Each term is stored in MongoDB in the form shown below. The wikidata field is taken from Wikidata by title, e.g. "1971 Iraq poison grain disaster" (step 2). The locales field is fetched from Wikidata using max's tool 'fetch by OpenData ID' (step 3). The hours field is calculated from the pagecounts (step 4).

    {
      "_id": INTERNAL_ID,
      "day": ISODate("2015-11-30T23:00:00Z"),
      "wikidata": {
        "pageid": 4094969,
        "ns": 0,
        "title": "Q4284220",
        "lastrevid": 248156717,
        "modified": "2015-09-06T00:45:47Z",
        "type": "item",
        "id": "Q4284220"
      },
      "locales": {
        "en": {
          "title": "1971 Iraq poison grain disaster",
          "hours": { "0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0,
                     "8": 0, "9": 0, "10": 0, "11": 0, "12": 0, "13": 0, "14": 0, "15": 0,
                     "16": 0, "17": 0, "18": 0, "19": 0, "20": 0, "21": 0, "22": 0, "23": 0 }
        },
        "es": { "title": "Desastre del grano envenenado de 1971 en Iraq" },
        "ru": { "title": "Массовое отравление метилртутью в Ираке (1971)" }
      }
    }

  2. For each term, find the OpenData ID from the page title. I built the request by hand without max's tool (but what I do is what the new functionality of max's tool does). See the sketch after this list for steps 2 and 3.

  3. For each term with a known OpenData ID, fetch the OpenData details. This way we are able to find the titles of the same page in other languages. Of course, some languages may be missing: for example, there is no Italian entry for "1971 Iraq poison grain disaster".

  4. Dump the pagecount data and update the corresponding entries in MongoDB with the count information (only for medical terms).
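Until I push the actual code, here is a rough Python sketch of how steps 2 and 3 can be done against the Wikidata API directly (max's tool wraps similar calls). Function names and the example title are only illustrative; the real code may differ.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wikidata_id_by_title(title, site="enwiki"):
    """Step 2: resolve an English page title to its Wikidata item id."""
    r = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "props": "info",
        "format": "json",
    })
    entities = r.json()["entities"]
    qid = next(iter(entities))          # e.g. "Q4284220"; "-1" when the title is unknown
    return qid if qid.startswith("Q") else None

def locales_by_wikidata_id(qid):
    """Step 3: read the item's sitelinks to get the page titles per language."""
    r = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities",
        "ids": qid,
        "props": "sitelinks",
        "format": "json",
    })
    sitelinks = r.json()["entities"][qid]["sitelinks"]
    # "enwiki" -> "en", "eswiki" -> "es", ...; skip non-Wikipedia sites such as "commonswiki"
    return {site[:-4]: {"title": link["title"]}
            for site, link in sitelinks.items()
            if site.endswith("wiki") and site != "commonswiki"}

qid = wikidata_id_by_title("1971 Iraq poison grain disaster")
if qid:
    print(qid, locales_by_wikidata_id(qid))
```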

All steps are implemented, but in step 4 (the pagecounts one) I run out of memory, so I need a smarter way to do it. I should either split the pagecount file into several batches or pre-filter the terms we are interested in beforehand.
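The rough idea I have in mind for the memory issue looks like this: pre-load the titles we care about, then stream the hourly dump line by line so it never sits in memory at once. Only a sketch: the dump file name and the database/collection names are placeholders, and I am assuming the hourly "project title count bytes" line format.

```python
import gzip
from urllib.parse import unquote
from pymongo import MongoClient

DUMP = "pagecounts-20151201-000000.gz"   # placeholder; lines look like "en Some_Title 42 123456"
HOUR = "0"                               # placeholder hour bucket

db = MongoClient().wikimed               # hypothetical database name
# Pre-filter: only keep English titles already stored in the terms collection.
wanted = {doc["locales"]["en"]["title"]
          for doc in db.terms.find({"locales.en.title": {"$exists": True}},
                                   {"locales.en.title": 1})}

with gzip.open(DUMP, "rt", encoding="utf-8", errors="replace") as f:
    for line in f:                       # stream line by line instead of loading the whole file
        parts = line.split(" ")
        if len(parts) < 3 or parts[0] != "en":
            continue
        title = unquote(parts[1]).replace("_", " ")
        if title in wanted:
            db.terms.update_one({"locales.en.title": title},
                                {"$set": {"locales.en.hours." + HOUR: int(parts[2])}})
```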

The algorithm could be generalised so that pagecounts and terms are updated automatically.

I use MongoDB to store (partial) results. Firstly, I need to store partial results somewhere because I can't process the whole thing in a single step, and MongoDB is better than flat files for that. Then I was going to build a web app backed by MongoDB with a sort of simulator of Wikipedia hits.
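To make the "partial results" point concrete, this is the kind of upsert pattern I mean (pymongo; the database, collection and function names are hypothetical): each step only sets the fields it is responsible for, so the pipeline can be resumed or re-run step by step.

```python
from datetime import datetime
from pymongo import MongoClient

terms = MongoClient().wikimed.terms      # hypothetical database/collection names

def store_term(title, day):
    """Step 1: make sure a document for (title, day) exists."""
    terms.update_one({"day": day, "locales.en.title": title},
                     {"$set": {"locales.en.title": title}},
                     upsert=True)

def store_wikidata(title, entity):
    """Step 2: attach the Wikidata entity fetched for this title."""
    terms.update_one({"locales.en.title": title}, {"$set": {"wikidata": entity}})

store_term("1971 Iraq poison grain disaster", datetime(2015, 11, 30, 23))
```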

I do not know how much this helps for now. I will be back soon with the code.

We're getting there, thanks!

@mstn @maxlath

Wow, thanks guys for coaching me and being so responsive and helpful.

Guy

So far so good with the Wikipedia pageviews

OK, just to report on what I have been able to assemble this week. Thanks to @Alberto, @maxlath, @mstn, @MoE and @dora for helping.

I am now able to process a query about, let's say, some keywords, find the related Wikidata entities (using maxlath's package), grab the relevant pages in any given (or all) languages, and then obtain pageview counts using tomayac's (aka Thomas Steiner's) package. Thomas' package, however, only offers daily counts, simply because Wikipedia does not offer more through their API for the moment. So, as @mstn suggests, we may use a DB that would store these counts (and I understand there is one).
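Tomayac's package is JavaScript; for the Python-minded among us, the same daily counts can, as far as I can tell, be pulled straight from the Wikimedia pageviews REST API, roughly like this (the project, date range and example article are placeholders):

```python
import requests

def daily_views(title, project="en.wikipedia", start="20151201", end="20151231"):
    """Daily pageview counts for one article from the Wikimedia pageviews REST API."""
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "{}/all-access/all-agents/{}/daily/{}/{}"
           .format(project, title.replace(" ", "_"), start, end))
    items = requests.get(url, headers={"User-Agent": "mon-pageviews-sketch"}).json().get("items", [])
    return {item["timestamp"][:8]: item["views"] for item in items}

print(daily_views("Ebola virus disease"))
```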

I plan to use the daily counts and the code chain so that a user could query the counts for some disease, for instance. I could then take the count data and feed it back to the user through a d3 visualization.

I will certainly have time to wrap this up in the coming days (although I have other code to pamper, with a deadline next Wednesday, so it may only be ready by the end of January).

There are tons of other things we could do; it all depends on the task we are supporting. Building Wikidata entities into a graph could be useful to guide users towards topics related to their initial query, for instance. I'm sure this will be a subject of discussion at MoN4/LOTE5.
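On the graph idea, a very rough sketch of what I mean (Python/networkx, so it could later be loaded into Tulip as well): take an item's claims and link it to the other items it points to. Only the wbgetentities call is real API; the one-hop traversal is deliberately naive and just meant to illustrate the idea.

```python
import networkx as nx
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def linked_items(qid):
    """Other Wikidata items this item points to through its claims."""
    r = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities", "ids": qid, "props": "claims", "format": "json"})
    claims = r.json()["entities"][qid]["claims"]
    out = set()
    for statements in claims.values():
        for st in statements:
            snak = st.get("mainsnak", {})
            if snak.get("datatype") == "wikibase-item" and "datavalue" in snak:
                value = snak["datavalue"]["value"]
                out.add(value.get("id", "Q{}".format(value["numeric-id"])))
    return out

g = nx.Graph()
seed = "Q4284220"                        # "1971 Iraq poison grain disaster"
for target in linked_items(seed):
    g.add_edge(seed, target)             # one hop is enough to suggest related topics
print(g.number_of_nodes(), "items,", g.number_of_edges(), "links")
```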

Best

You rock!

Well done @melancon!

Let’s all make sure the code and docs get pushed to GitHub. No urgency, obviously.

finally catching up

Hi everybody,

I finally took some time to start looking at Wikipedia data mining, view stats, etc.

I’m still far from doing anything fancy, but at least I’m getting an idea of what can be done easily, and how to use the info collected.

I'm focusing on Python only for now, because I'm more familiar with it (@dora too), and because it could be an easy win in case the collected data is to be managed via networkx and/or passed to EdgeSense.

@Alberto, @melancon, @mstn, @maxlath, @danohu, thank you all for the hints you shared. They all made my job easier.

So far I have managed to fetch the pages for a given project (i.e. Wikipedia:WikiProject Medicine), gather general info about each page, tell whether links to other pages point to the same project or not, and get pageviews.
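For anyone who wants to poke at the same thing before the code is cleaned up, the per-page part boils down to a single action=query call, roughly like the sketch below. The project-membership test is just a set lookup against the titles collected earlier; the names and the tiny stand-in set are made up.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def page_info_and_links(title):
    """General info plus outgoing links for one page (first batch of links only)."""
    r = requests.get(API, params={
        "action": "query",
        "titles": title,
        "prop": "info|links",
        "pllimit": "max",      # up to 500 links; follow 'continue' for the rest
        "format": "json",
    })
    page = next(iter(r.json()["query"]["pages"].values()))
    links = [l["title"] for l in page.get("links", [])]
    return page.get("pageid"), links

# Does a link stay inside the project? Membership test against the set of
# project page titles gathered in the first step.
project_pages = {"Ebola virus disease", "Measles"}
pageid, links = page_info_and_links("Ebola virus disease")
internal_links = [t for t in links if t in project_pages]
```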

However, I haven't yet managed to map English pages to their translations in other languages (is Wikidata a better interface for this?).

I'm also struggling to find an encoding/decoding logic for page-name strings that is general enough to deal correctly with all the special characters (I'm still figuring out whether I can rely on the pageid only).
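The logic I'm experimenting with is basically "underscores plus percent-encoding", something like the sketch below; the exact set of characters Wikipedia leaves unescaped is still a guess on my part, which is why relying on pageid may end up being safer.

```python
from urllib.parse import quote, unquote

def to_url_title(title):
    # Spaces become underscores, the rest is percent-encoded;
    # the 'safe' set is a guess at what Wikipedia leaves readable.
    return quote(title.replace(" ", "_"), safe="/:()'!*,;@&=$")

def from_url_title(encoded):
    # Inverse: decode percent-escapes, then restore spaces.
    return unquote(encoded).replace("_", " ")

assert from_url_title(to_url_title("Crohn's disease")) == "Crohn's disease"
```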

I pushed some code here:

It's just a bunch of sketches, but if you think they might be handy I can send a pull request.

Will keep you posted on any progress…


Python is good! (and you need to register)

@MoE and @dora, Python is good because @melancon likes to use Tulip for network analysis, and Tulip has a Python API.

Contributing coding skills to MoN5 earns you two LOTE5 tickets. Please register here.

swamped but still on it…

I’m a bit swamped now, as I have a delivery tomorrow, but will catch up this weekend.

I kept working a bit on our stuff with @dora, though; we can now get the different languages for each page too (not sure it is the same solution @maxlath mentioned; we used the langlinks parameter of MediaWiki's API).
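For comparison with the Wikidata route, the langlinks call boils down to something like this; the API parameters are the real ones, but the function and names are illustrative rather than the exact code in the repo.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def langlinks(title):
    """Map language code -> title of the same page in that language."""
    r = requests.get(API, params={
        "action": "query",
        "titles": title,
        "prop": "langlinks",
        "lllimit": "max",
        "format": "json",
    })
    page = next(iter(r.json()["query"]["pages"].values()))
    return {ll["lang"]: ll["*"] for ll in page.get("langlinks", [])}

print(langlinks("1971 Iraq poison grain disaster"))
```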

Also, we started implementing a MongoDB client, both to try solutions compatible with what was used already and to figure out the best way to store the mined data, according to how we'll use it at MoN.

In the repo I linked above, I pushed a few more scripts and a first attempt to reorganize the code into a tool that eases the data-collection step. There's still stuff to do, and much of the code is not polished, but it's better than nothing :slight_smile: