Text from @melancon and @jason_vallet in quote mode, like this.
Comments from me in plain text mode, like this.
Let me go through the procedure and point out the places where things are still blocked (PB).
We went over the import procedure from Discourse. The import procedure is needed to build a neo4j database containing users, posts and ethno tags from which the different views are fed.
The procedure starts from a tag, which lets us grab the topics attached to it. Topics contain a stream of posts, numbered incrementally as they are published. Posts also have a unique identifier number. Fields in the post make it possible to detect whether the post initiated the topic or whether it is a reply, either to the initial post or to another post (previously called a comment).
PB
Posts can be withdrawn (by the author). But the reply_to field of incoming posts (replies) is not updated (e.g. post 37600, which is post #10 in topic 7263, leaving 37601 and others with no post to point at). This breaks the reply_to structure from which the social graph is built.
This was also present in Drupal, and is fundamental. Users need to be free to delete their content. By convention, if a post points to a deleted post, we make it point to the post number 1 in the same topic.
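To make the convention concrete, here is a minimal sketch of how dangling replies could be redirected. The `Post` class and its field names are assumptions made for illustration, not the importer's actual data model. The sketch also assumes that a reply with an empty reply_to should be treated as a reply to post number 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    # Hypothetical record layout, for illustration only.
    post_id: int                    # unique identifier of the post
    topic_id: int                   # topic the post belongs to
    post_number: int                # position in the topic stream, starting at 1
    reply_to: Optional[int] = None  # post_number being replied to, None if unset


def resolve_reply_targets(posts):
    """Return {post_id: post_id of the post it replies to}.

    Replies whose target is missing (withdrawn or unset) are redirected to
    post number 1 of the same topic. Topic-initiating posts are left out.
    """
    by_position = {(p.topic_id, p.post_number): p for p in posts}
    edges = {}
    for p in posts:
        if p.post_number == 1:
            continue  # the initial post replies to nothing
        target = by_position.get((p.topic_id, p.reply_to))
        if target is None:
            # Assumption: fall back to the first post of the same topic.
            target = by_position.get((p.topic_id, 1))
        if target is not None:
            edges[p.post_id] = target.post_id
    return edges
```

With the case above, post 37601 would end up pointing at whichever post holds number 1 in topic 7263, since post #10 is no longer in the imported data.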
Annotations are a bit more tricky.
Discourse lets us grab the annotations attached to a topic. But launching many such queries raises a limit-exceeded exception. We are thus forced to grab all annotations (even those not relevant to the starting tag). This is not a problem for now, as there are not too many tags. It is just not a good or sustainable strategy (if ER multiplies the number of projects, as it certainly will).
@matthias, does this make sense to you?
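For reference, the "grab everything, then filter" workaround described above could look roughly like this once all annotations are in memory. This is only a sketch: it assumes each annotation is a dict carrying a `topic_id` key, which may not match the real payload, and the function name is made up.

```python
from collections import defaultdict


def annotations_for_topics(all_annotations, topic_ids):
    """Group annotations by topic, keeping only the topics of interest.

    `all_annotations` is the full list fetched in a single pass;
    `topic_ids` are the topics reached from the starting tag.
    The "topic_id" key is an assumed field name.
    """
    wanted = set(topic_ids)
    by_topic = defaultdict(list)
    for ann in all_annotations:
        if ann["topic_id"] in wanted:
            by_topic[ann["topic_id"]].append(ann)
    return dict(by_topic)
```

This keeps the number of requests constant, but the cost of the single fetch grows with the total number of annotations, which is the sustainability concern raised above.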
PB
Many annotations have no associated ethno tag. That is, they have their tag_id set to null. We have not exhaustively looked at all the data (there are many hundreds of cases), but it seems that annotations consisting of a single word may correspond to ethno tags themselves. But in that case, the presumed ethno tag could be attached to the whole post (?), as opposed to a part of the post content.
You may have an explanation for this.
I’m getting confused. I use “code” to refer to ethno codes, and “tag” for Discourse tags. I suppose you mean ethno codes by “ethno tags” in the above. If so, it looks like a pretty sloppy way to code, because there is no attempt to identify when informants are using different vocabularies to convey the same meaning. But then I am no ethnographer, and may be wrong. @amelia, @Digitalanthropology, what’s your call?
This may have some impact on the resulting tag co-occurrence graph (if we are unlucky, tags induced from annotations with tag_id = null may well be isolated ethno tags, thus contributing no edge in the co-occurrence graph).
That would mean either that the underlying ontology is not well-maintained or that there really are disjoint codes. In the former case, it’s back to a second or third pass of coding. In the latter case, we have to live with it.
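A quick way to see how many codes are actually disjoint is to build the co-occurrence graph and list the codes that end up with no edge. The sketch below assumes that two codes co-occur when they annotate the same post, and that annotations arrive as (post_id, code) pairs; both are assumptions for illustration, and the real import may define co-occurrence differently.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(annotations):
    """Return (edge_weights, isolated_codes) from (post_id, code) pairs.

    Assumption: codes co-occur when they annotate the same post.
    Pairs whose code is None (no code attached) are skipped.
    """
    codes_by_post = defaultdict(set)
    for post_id, code in annotations:
        if code is not None:
            codes_by_post[post_id].add(code)

    # Count, for each unordered pair of codes, the posts they share.
    edges = defaultdict(int)
    for codes in codes_by_post.values():
        for a, b in combinations(sorted(codes), 2):
            edges[(a, b)] += 1

    all_codes = set().union(*codes_by_post.values())
    connected = {code for pair in edges for code in pair}
    return dict(edges), all_codes - connected
```

Running this once with the null-tag_id annotations excluded and once with their induced codes included would show whether those induced codes really are isolated.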