Understanding API calls for building SSNA

Text from @melancon and @jason_vallet in quote mode, like this.

Comments from me in plain text mode, like this.

Let me go through the procedure and point out the places where things are still blocked (marked PB).
We went over the import procedure from Discourse. The import procedure is needed to build a Neo4j database containing users, posts and ethno tags, from which the different views are fed.

The procedure starts from a tag, which allows us to grab the topics attached to it. Topics contain a stream of posts, incrementally numbered as they are published. Posts also have a unique identifier number. Fields in each post allow us to detect whether the post initiated the topic, or whether it is a reply to the initial post or to another post (previously called a comment).
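
For illustration, a minimal sketch of that first step, assuming the standard Discourse JSON endpoints (/tag/{name}.json for the topic list, /t/{id}.json for a topic's post stream); post_number and reply_to_post_number are the standard Discourse post fields referred to above:

import requests

BASE = 'https://edgeryders.eu'

def topics_for_tag(tag):
    # Topics attached to a Discourse tag.
    data = requests.get('{}/tag/{}.json'.format(BASE, tag)).json()
    return data['topic_list']['topics']

def posts_for_topic(topic_id):
    # The post stream of one topic.
    data = requests.get('{}/t/{}.json'.format(BASE, topic_id)).json()
    return data['post_stream']['posts']

for topic in topics_for_tag('ethno-opencare'):
    for post in posts_for_topic(topic['id']):
        # post['id'] is the unique identifier; post['post_number'] is the
        # incremental number inside the topic; reply_to_post_number is
        # None for the opening post and for direct replies to it.
        print(post['id'], post['post_number'], post.get('reply_to_post_number'))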

PB
Posts can be withdrawn (by the author), but the reply_to field of incoming posts (replies) is not updated (e.g. post 37600, #10 in topic 7263, was withdrawn, leaving 37601 and others with no post to point at). This breaks the reply_to structure from which the social graph is built.

This was also present in Drupal, and is fundamental: users need to be free to delete their content. By convention, if a post points to a deleted post, we make it point to post number 1 in the same topic.
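
A sketch of that convention, using the post fields from the sketch above; this is a hypothetical helper, not the actual import code:

def resolve_reply_target(post, existing_numbers):
    # Post number this post should point at in the social graph.
    if post['post_number'] == 1:
        return None                                  # the post opens the topic
    target = post.get('reply_to_post_number') or 1   # direct replies point at the opener
    # A withdrawn post leaves a dangling reply_to reference; by the
    # convention above, such replies are reattached to post number 1.
    return target if target in existing_numbers else 1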

Annotations are a bit trickier.

Discourse allows us to grab the annotations attached to a topic. But launching multiple queries in a row raises a limit-exceeded exception. We are thus left with the necessity of grabbing all annotations, even those not relevant to the starting tag. This is not a problem for now, as there are not too many tags; it is just not a good, sustainable strategy if ER is to multiply the number of projects, as it certainly will.

@matthias, does this make sense to you?

PB
Many annotations have no associated ethno tag. That is, they have their tag_id set to null. We have not exhaustively looked at all the data (there are many hundreds of cases), but it seems that annotations consisting of a single word may correspond to ethno tags themselves. But in that case, the presumed ethno tag could be attached to the whole post (?), as opposed to a part of the post content.

You may have an explanation for this.

I’m getting confused. I use “code” to refer to ethno codes, and “tag” for Discourse tags. I suppose you mean ethno codes by “ethno tags” in the above. If so, it looks like a pretty sloppy way to code, because there is no attempt to identify when informants are using different vocabularies to convey the same meaning. But then I am no ethnographer, and may be wrong. @amelia, @Digitalanthropology, what’s your call?

This may have some impact on the resulting tag co-occurrence graph (if we are unlucky, tags induced from annotations with tag_id = null may well be isolated ethno tags, thus contributing no edge in the co-occurrence graph).
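
To make the concern concrete, here is a sketch of how a code co-occurrence graph is typically derived from the annotations (the actual GraphRyder logic may differ, and the post_id field name is an assumption): codes co-occur when they annotate the same post, so an annotation with tag_id = null never contributes an edge.

from collections import defaultdict
from itertools import combinations

def cooccurrence_edges(annotations):
    # Group code IDs by the post they annotate; null tag_ids are skipped,
    # so a code that only appears through such annotations stays isolated.
    codes_by_post = defaultdict(set)
    for ann in annotations:
        if ann['tag_id'] is not None:
            codes_by_post[ann['post_id']].add(ann['tag_id'])
    edges = defaultdict(int)
    for codes in codes_by_post.values():
        for pair in combinations(sorted(codes), 2):
            edges[pair] += 1   # weight = number of posts the pair shares
    return edges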

That would mean either that the underlying ontology is not well-maintained or that there really are disjoint codes. In the former case, it’s back to a second or third pass of coding. In the latter case, we have to live with it.

Sorry for the confusion. It comes from the use of the field named tag_id in annotations. I wrote “ethno tag” to mean “code”. I imagined this points to a code, and not a tag (in the Discourse sense). So, rephrasing my comment:

“Many annotations have no associated ethno code. That is, they have their tag_id field set to null. We have not exhaustively looked at all the data (there are many hundreds of cases), but it seems that annotations consisting of a single word may correspond to ethno codes themselves. But in that case, the presumed ethno code could be attached to the whole post (?), as opposed to a part of the post content.”

Yes, that was the old nomenclature in our annotations API, but it has since been renamed to code_id (see).

This sounds like running into the Discourse “max user API requests” limits, which are currently 20 per minute and 2880 per day. We can raise the daily limit if needed. As for the per-minute limit, we could raise it to 30, but it would be better to let the script sleep a bit between calls, to be careful with the server.
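
For illustration, a minimal client-side throttle built from the numbers above (20 requests per minute means spacing calls at least three seconds apart); a sketch, not part of the actual import script:

import time
import requests

MIN_INTERVAL = 60.0 / 20   # 20 requests per minute -> at least 3 s apart
_last_call = 0.0

def throttled_get(url, **kwargs):
    # requests.get that sleeps just long enough to respect the per-minute limit.
    global _last_call
    wait = MIN_INTERVAL - (time.time() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.time()
    return requests.get(url, **kwargs)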

However, @jason_vallet, it seems that the real solution to reduce the number of API requests is not to get topics from a Discourse tag and then annotations for those Discourse topics. Instead, request the annotations per Discourse tag as well (like ethno-opencare). See our preliminary API docs.

This should not happen, and it is surely not possible to create such annotations within Discourse. So it is either an artefact of how coding happened in Drupal (perhaps indeed “this word is meant as a code”), or it is an error in our import routine. Please find out which is the case, as that will tell us how to correct it. Probably Anu will need to fix these manually one way or the other; until then, they should rather be ignored.

We need to get the topics anyway, because GraphRyder also needs their full content. With that said, I guess Matt’s method would still dramatically reduce the number of API calls to get the secondary data:

/administration/annotator/annotations.json?discourse_tag=ethno-opencare

gets all the annotations; and

/administration/annotator/codes.json

gets all the codes (platform-wide), and serves as a lookup table to get code names from code IDs. So, all the secondary data are pulled in two calls (OK, paginated).
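
Put together, the secondary data pull then looks roughly like this (the page parameter, and the id and name fields on codes, are assumptions; the thread only confirms that the responses are paginated):

import requests

BASE = 'https://edgeryders.eu'

def fetch_all(path, **params):
    # Collect every page of a paginated endpoint.
    results, page = [], 1
    while True:
        params['page'] = page
        batch = requests.get(BASE + path, params=params).json()
        if not batch:
            return results
        results.extend(batch)
        page += 1

annotations = fetch_all('/administration/annotator/annotations.json',
                        discourse_tag='ethno-opencare')
codes = fetch_all('/administration/annotator/codes.json')

# Lookup table: code ID -> code name, to resolve each annotation's code.
code_names = dict((c['id'], c['name']) for c in codes)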

Sorry team, I’m pretty confused by this conversation. Am not clear on these distinctions.

The thing I can perhaps clarify is that a while back we decided that, since there is no mechanism to assign a code to an entire post rather than to a specific piece of text, I would instead just assign it to the first word of the post. Highlighting the entire post would be better, but that makes it hard to code because of the scrollover function. So a code on a single word could be explained by that; otherwise I’d never code just one word.

The only other thing that could be making things messy is that duplicates keep arising. I’ve no idea why two codes that are exactly the same word end up as separate entries in the database. I use the merge-duplicates function every so often to fix this, but, as I’ve raised before, it is an issue.

What is making it seem like the underlying ontology is not well-maintained? Would like to fix anything that would suggest that I am a sloppy coder :stuck_out_tongue_winking_eye:

Also I never purposefully assign null codes so those can be deleted if present… again, unsure why those would arise.

That sounds like a bug hidden somewhere. Could you try to observe what is going on, please, and then report the bug in our Github issue tracker? I think it could be something about spaces before or after the code name that leads to creating a new code on the fly rather than coding with an existing one.
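
If stray spaces are indeed the culprit, normalising the name before lookup would be the obvious guard; an illustrative sketch, not the actual Open Ethnographer code:

def normalise_code_name(name):
    # Collapse whitespace and case differences that look identical to a
    # human coder but would otherwise create distinct codes.
    return ' '.join(name.split()).lower()

# 'Systemic change' and 'systemic  change ' would otherwise be stored
# as two different codes and need merging by hand later.
assert normalise_code_name('Systemic change') == normalise_code_name('systemic  change ')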

Ok. So, it seems that:

  1. All annotations should have a tag_id (but it should be called code_id, @matthias) set to something other than null.

  2. Annotations whose snippet consists of the first word of the post are legit. They refer to the whole post.

  3. No other cases of one-word snippets are legit.

How to check for 1

  1. Count the cases with tag_id = null. A few cases could be glitches; hundreds of cases point to a probable error in the import script.

  2. Check the creation dates of the annotations. If many were created in 2016-2017, there is probably something wrong with the import script.

Checking for 2 is trivial, though probably tedious. I thought I could do these checks myself with 20 lines of code, but I get an annoying glitch: the annotations endpoint returns an object that looks JSON-like, but it is not a list, rather an “instancemethod”:

>>> import requests
>>> url = 'https://edgeryders.eu/administration/annotator/annotations.json?per_page=100000'
>>> response = requests.get(url).json
>>> type(response)
<type 'instancemethod'>
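
For what it is worth, that <type 'instancemethod'> is the classic missing-parentheses slip rather than an endpoint problem: json is a method of the response object, and referencing it without calling it returns the bound method itself. With the parentheses the parsed payload comes back, which is enough for checks 1 and 2 (this assumes the endpoint returns a plain JSON list, as the 8644-annotation dump later in the thread suggests; the created_at field is an assumption, while tag_id appears throughout this thread):

>>> annotations = requests.get(url).json()   # note the parentheses
>>> null_codes = [a for a in annotations if a['tag_id'] is None]
>>> len(null_codes)                          # check 1: count the null-code cases
>>> sorted(set(a['created_at'][:4] for a in null_codes))   # check 2: creation years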

With respect, @matthias, it actually is possible. I created a malformed annotation (snippet of one word, tag_id = null) on this topic. It has the annotation_id = 13130. Open Ethnographer allows saving the annotation without having to enter a code.

Some one-word annotations were created by students at Aalborg Uni. Several people (and even Anders!) started to code in that way. But I do not think there are hundreds of them; maybe tens.

I have decided that it is a bug that Open Ethnographer allows saving annotations with no codes. Moreover, I do not like the idea that the quote (the snippet of text being highlighted when creating the annotation) becomes the code, as has been proposed for one-word quotes. Reason: we want to nudge ethnographers towards a well-maintained ontology, that makes it intuitive to discover the connections between what different informants are saying in different parts of a large conversation. If we allow “quote as code”, a hurried or lazy ethnographer might be tempted to just highlight the quote, and that would lead to more of a folksonomy than to a structured ontology.

I have created an issue on GitHub. I would also like to get rid of the confusion between “code” and “tag” once and for all (sorry everyone, I know I am being fastidious, but “tag” is a reserved word in Discourse, so we do need to use it properly), so I also created a second issue to rename the tag_id field in the annotations API. Wave to @jason_vallet!

May I ask @jason_vallet to share his data with us (claiming “hundreds” of tag_id = null annotations)? I find this whole situation quite tiring; I wish we had been included in the migration process. I personally cannot test much. Copying @alberto’s URL for annotations, I get a file containing all 8644 annotations, of which 697 have a tag_id set to null.

Let’s hope we can fix this before the end of the month …

A post was split to a new topic: Ethnographic coding practices and SSN structure: what we are learning

Ouf, this is good to hear (read), that it is neither an import nor a migration bug :slight_smile:

For reference, the Github issue that Alberto created to solve the null-code annotations (and prevent them in the future) is now fixed.
