I am still working on useful statistics to compare the filtering methods; in the meantime, you may want to visually inspect the data.
I would love feedback on the usable size of the networks: up to how many nodes and edges are these graphs still readable and usable for investigating the data? These figures will help me figure out how to proceed in computing the statistics.
For instance, you could browse the data using two side-by-side node-link views, moving up and down two distinct hierarchies of subgraphs. We could even organize a remote video call to share our impressions.
Please take time to read the markdown README file (better viewed if you have a MathJax-compatible markdown editor).
This is just a quick-and-dirty round of impressions. It is not made any easier by the fact that I cannot do much without programming. But basically:
K-core reductions have a big problem: they tend to generate graphs that are highly connected and become difficult to read, even for low numbers of nodes. I can see the codes, but I cannot visually keep track of which code is connected to which. This is true of all types of k-core reduction.
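To make the discussion concrete, here is a minimal pure-Python sketch (not the project's actual Tulip pipeline) of the standard k-core peeling being discussed: repeatedly delete nodes of degree < k, and what survives is exactly the dense core that tends to be hard to read.

```python
def k_core(edges, k):
    """Return the edge set of the k-core of an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        # Peel every node whose current degree is below k.
        for node in [n for n, nbrs in adj.items() if len(nbrs) < k]:
            for nbr in adj[node]:
                adj[nbr].discard(node)
            del adj[node]
            changed = True
    return {(u, v) for u in adj for v in adj[u] if u < v}

# A 4-clique with a pendant path attached: the 3-core keeps only the clique.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
print(k_core(edges, 3))  # only the clique edges survive
```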
Giatsidis has the opposite behaviour: graphs reduced to a small number of codes also tend to have few edges, and they break down into many connected components. A clear giant component is still visible at ~150 codes, but for smaller graphs (~50 codes) it breaks down completely. This is not necessarily a problem. Additionally, the method tends to produce trees. This may be a problem, as trees can be read only as hierarchies of codes, and then you do not really know what you are seeing. Triangles and clusters of triangles, by contrast, help reveal deep, multi-way relationships.
Clique is like Giatsidis, but even more extreme.
I have formed an opinion about the qualitative side of the different reduction methods (do they highlight different things?), but I will not share it here, as I do not want to influence Amelia unduly.
So, here is a suggestion for representing this result in one figure.
Consider a smaller domain (0-200 nodes, for example).
Put all your curves onto one single chart.
Use the same scale for both axes; if this is impractical, draw a reference 45° line, the locus where the number of nodes equals the number of edges.
This will show us how the different techniques reduce the graph in the relevant domain. Your own work, @melancon, tells us that networks where the number of edges exceeds 4 × the number of nodes become unreadable. Ideally, we still want a giant component to emerge, as it will represent the core of the study. This means the optimal zone is somewhere between 1.1 and 4 edges per node.
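As a rough illustration of that zone, here is a tiny hypothetical helper for classifying each (nodes, edges) point a filtering run produces. The 1.1 and 4 limits are simply the figures quoted above, not general-purpose constants, and the function name is made up.

```python
def readability(n_nodes, n_edges, lo=1.1, hi=4.0):
    """Classify a reduced graph by its edges-per-node ratio."""
    if n_nodes == 0:
        return "empty"
    ratio = n_edges / n_nodes
    if ratio < lo:
        return "fragmenting"   # likely many components / trees
    if ratio > hi:
        return "unreadable"    # too dense to follow edges visually
    return "readable"

# Made-up sample points from three hypothetical filtering runs.
for point in [(150, 120), (120, 300), (100, 450)]:
    print(point, readability(*point))
```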
It would be simple to also draw curves of the number of connected components vs. the number of nodes for each method. I am not sure what that would teach us, because I am unsure how @amelia would interpret a graph divided into many components.
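If we do draw that curve, counting components is cheap. A sketch with a simple union-find (names and data are illustrative, not from the project code):

```python
def count_components(nodes, edges):
    """Count connected components of an undirected graph via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})

print(count_components([1, 2, 3, 4, 5], [(1, 2), (2, 3)]))  # 3 components
```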
Final issue: I think @amelia needs a bit of help in arriving at graphs that she can attempt to interpret. At the very least, she needs encouragement to install Tulip, and a little help to start engaging with the files.
Not sure what you mean by “missing”. Are you saying the markdown document is not properly rendered? Or are you calling for more detailed explanations?
Thanks for the feedback – although I have replied on other issues, it is only now that I take time to read and process your comments.
I agree with you about how k-core acts: nodes with higher values do indeed tend to form a very dense… core. There might, however, be other ways to filter, for instance by simultaneously skimming off nodes with low and high values to keep the mid-range ones.
One thing though, just to make sure you don’t get confused: Giatsidis is a way to project the two-mode content–code graph onto a one-mode code graph. K-core is a statistic you compute on the nodes of any graph. The filtering method is yet another thing; here I have used a simple and standard approach, discarding lower-value nodes while letting the threshold rise until it hits its maximum.
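A toy example of the projection half of that distinction, on made-up post/code names. This is the plain standard projection (two codes are linked whenever they annotate the same content); Giatsidis’ weighted variant is not reproduced here.

```python
from itertools import combinations

# Hypothetical two-mode annotations: (content, code) pairs.
annotations = [
    ("post1", "trust"), ("post1", "care"), ("post1", "money"),
    ("post2", "trust"), ("post2", "care"),
]

# Group codes by the content they annotate.
by_content = {}
for content, code in annotations:
    by_content.setdefault(content, set()).add(code)

# Standard one-mode projection: link every pair of co-occurring codes.
projected = set()
for codes in by_content.values():
    for a, b in combinations(sorted(codes), 2):
        projected.add((a, b))

print(sorted(projected))
```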
Projection: I propose to use two: standard and Giatsidis (Clique does not seem to make such a big difference after all).
K-core: this is an idea I thought worth exploring. We may well end up not using it, but we at least need to explain that we considered it before setting it aside (and why).
Filter: using a threshold seemed the simplest and most natural thing to look at. When it works, it is after all the one admitting the simplest interpretation. I also plan to use Bobo Nick’s approach, but my code suffers from bad performance, as it currently runs in O(N^2). Coming soon.
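For contrast, the simple threshold filter described above is linear in the size of the graph. A sketch with made-up node scores (the function and variable names are illustrative):

```python
def threshold_filter(node_values, edges, t):
    """Keep nodes whose value is >= t, and the edges between survivors.

    Runs in O(N + E), one pass over nodes and one over edges.
    """
    kept = {n for n, v in node_values.items() if v >= t}
    return kept, [(u, v) for u, v in edges if u in kept and v in kept]

values = {"a": 1, "b": 3, "c": 5, "d": 2}       # hypothetical node scores
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(threshold_filter(values, edges, 3))  # keeps b and c, and the edge between them
```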
@amelia (and @alberto and/or @jason_vallet) let’s plan a remote hangout session. I’ll have time after I’m back from celebrating New Year’s eve with friends, starting Jan 3.
Thanks for the additional explanation, but I recall we had discussed the Giatsidis paper before. My points stand.
I hesitate to make statements like “it’s intuitive”, because everything looks obvious in hindsight. But I imagine that filtered Giatsidis-projected graphs break down because the projection downplays the role of long posts, which get lots of annotations and therefore induce large cliques. It can be good at singling out important dyads: if several people’s contributions result in the co-occurrence of the same two codes, and only those two, then these dyads will indeed stand out.
Discussing offline with @jason_vallet, I realized that our data does not form a simple graph, owing to potential multiple edges connecting a code to a piece of content. I understand this is not what we want, so I need to simplify the graph (in the mathematical sense of making it a simple graph) before computing any projection. Stand by.
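A minimal sketch of that simplification step, on made-up edges: parallel (code, content) edges collapse into a single edge. Keeping the multiplicity as a weight is my assumption here, not something decided in the thread.

```python
from collections import Counter

# Hypothetical multigraph: the (c1, p1) edge appears twice.
multi_edges = [("c1", "p1"), ("c1", "p1"), ("c2", "p1"), ("c1", "p2")]

# Collapse parallel edges; multiplicities survive as counts (assumption).
weights = Counter(multi_edges)          # edge -> multiplicity
simple_edges = sorted(weights)          # each pair appears exactly once

print(simple_edges, dict(weights))
```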