Quick update
Hi there, I have a bunch of code I wrote during C3 that I should push to GitHub. I am a bit overwhelmed by other nasty things, but I hope to find some time, maybe during this weekend.
The general idea was the following:
- Fetch medical terms (i.e. page titles in English) from the Wikipedia WikiProject Medicine. Each term is stored in MongoDB in the form shown below: the "wikidata" field is fetched from Wikidata by title, e.g. "1971 Iraq poison grain disaster" (step 2); the "locales" field is fetched from Wikidata using max's 'fetch by OpenData ID' tool (step 3); the "hours" field is computed from the pagecounts data (step 4).
  {
    "_id": INTERNAL_ID,
    "day": ISODate("2015-11-30T23:00:00Z"),
    "wikidata": {
      "pageid": 4094969,
      "ns": 0,
      "title": "Q4284220",
      "lastrevid": 248156717,
      "modified": "2015-09-06T00:45:47Z",
      "type": "item",
      "id": "Q4284220"
    },
    "locales": {
      "en": {
        "title": "1971 Iraq poison grain disaster",
        "hours": { "0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0, "10": 0, "11": 0, "12": 0, "13": 0, "14": 0, "15": 0, "16": 0, "17": 0, "18": 0, "19": 0, "20": 0, "21": 0, "22": 0, "23": 0 }
      },
      "es": { "title": "Desastre del grano envenenado de 1971 en Iraq" },
      "ru": { "title": "Массовое отравление метилртутью в Ираке (1971)" }
    }
  }
- For each term, find the OpenData ID from the page title. I built the request by hand without max's tool (but what I do is what the new functionality of max's tool does).
- For each term with a known OpenData ID, fetch the OpenData details. This gives us the titles of the same page in other languages. Of course, some languages may be missing; for example, there is no Italian entry for "1971 Iraq poison grain disaster".
- Dump the pagecount data and update the corresponding entries in MongoDB with the count information (only for medical terms).
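The lookup and parsing steps above can be sketched with stdlib-only helpers. The endpoints and response shapes follow the public MediaWiki/Wikidata APIs (`prop=pageprops` exposes the "wikibase_item" property, `wbgetentities` returns sitelinks); the function names are my own placeholders, not the actual code:

```python
from urllib.parse import urlencode

# Step 2: build the Wikipedia API request that maps a page title
# to its Wikidata (OpenData) ID via the "wikibase_item" page prop.
def wikidata_id_url(title):
    params = urlencode({
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": title,
        "format": "json",
    })
    return "https://en.wikipedia.org/w/api.php?" + params

def extract_wikidata_id(response):
    # The API keys results by internal page id, so take the first page.
    page = next(iter(response["query"]["pages"].values()))
    return page.get("pageprops", {}).get("wikibase_item")

# Step 3: fetch entity details (sitelinks) for a known ID.
def entity_url(qid):
    params = urlencode({
        "action": "wbgetentities",
        "ids": qid,
        "props": "sitelinks",
        "format": "json",
    })
    return "https://www.wikidata.org/w/api.php?" + params

def extract_locales(response, qid):
    # Turn sitelinks like "enwiki"/"eswiki" into the "locales"
    # sub-document shown above (a crude suffix strip, good enough
    # here since we only keep per-language Wikipedia sitelinks).
    sitelinks = response["entities"][qid]["sitelinks"]
    return {
        key[:-4]: {"title": link["title"]}
        for key, link in sitelinks.items()
        if key.endswith("wiki")
    }

# Step 4: one pagecounts line is "project page_title count bytes".
def parse_pagecount_line(line):
    project, page, count, _ = line.split(" ")
    return project, page, int(count)
```

The extracted `(project, page, count)` tuples would then drive a per-hour update of the "hours" sub-document in MongoDB.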
All steps are implemented, but for the pagecount step I am running out of memory and I need a smarter approach. I should either split the pagecount file into several batches or pre-filter it down to the terms we are interested in.
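The pre-filtering option could look like this: load the set of titles we care about up front (it is tiny compared to the dump) and stream the pagecount file line by line, so memory use is bounded by the term set rather than the file size. This is only a sketch of the idea; `wanted` is a placeholder for the set of medical-term titles:

```python
def matching_counts(lines, wanted):
    """Stream pagecount lines, yielding (title, count) only for
    English-Wikipedia titles present in `wanted`. Nothing but the
    matches is ever held in memory."""
    for line in lines:
        parts = line.split(" ")
        if len(parts) != 4:
            continue  # skip malformed lines
        project, title, count, _ = parts
        if project == "en" and title in wanted:
            yield title, int(count)
```

Usage would be something like `for title, count in matching_counts(open(path), wanted): ...`, issuing one MongoDB update per match.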
The algorithm could be generalised so that pagecounts and terms are updated automatically.
I use MongoDB to store (partial) results. Firstly, I need to store partial results somewhere, because I cannot process the whole thing in a single step, and MongoDB is better than flat files for this. Secondly, I plan to build a MongoDB-backed webapp with a sort of simulator of Wikipedia hits.
I do not know if this helps for now. I will be back soon with the code.