Visualizing our 7 years of online dialogue with Gource

matthias · November 25, 2019, 5:31pm

Inspired by @marina’s request for a temporal visualization of our forum’s development, @alberto and I developed a nice little idea today. We would like to create a small piece of software to visualize the temporal development of a Discourse online forum in a video format very similar to this – except that this one shows the development of the Discourse source code, while we want it for the development of the topics hosted in a Discourse forum instance:

The software to create this is Gource, an open source tool originally meant to visualize Git repositories. Fortunately it also can read a custom log format.

For Discourse, the custom log file would be a text file with each line containing the following components, delimited by “|”:

timestamp
username
type of action (Added, Modified, Deleted); the “M” is useful esp. for wikis
path to the affected resource. Usually this applies to files, but in our case it would be about posts. This is the part we’d have to adapt to Discourse. Our path format could be “category/subcategory/topic-id/post-id” or it could even include the full threading information with any depth of nesting, such as “category/subcategory/topic-id/parent-post-id/parent-post-id/post-id”
color for the affected resource (optional); this could be chosen according to the category color on edgeryders.eu

Example line:

1275543595|matthias|A|earthos/reef/1234/1
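As a rough illustration, one such line could be assembled in Python as sketched below. The function name `format_gource_line` and its arguments are made up for this example; the optional colour is assumed to be a six-digit hex value without a leading `#`.

```python
from typing import Optional


def format_gource_line(timestamp: int, username: str, action: str,
                       path: str, color: Optional[str] = None) -> str:
    """Build one pipe-delimited line of a Gource custom log."""
    if action not in ("A", "M", "D"):
        raise ValueError("action must be 'A', 'M' or 'D'")
    # "|" is the field delimiter, so it must not appear inside a field.
    if "|" in username or "|" in path:
        raise ValueError("fields must not contain '|'")
    fields = [str(timestamp), username, action, path]
    if color is not None:
        fields.append(color)  # optional fifth field, e.g. "FF8800"
    return "|".join(fields)


print(format_gource_line(1275543595, "matthias", "A", "earthos/reef/1234/1"))
```

Called with the values from the example line, this reproduces that line exactly.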

So the task is to convert the information from the Discourse database to the above log format. This can be done with any kind of scripting. Some example scripts are here and here – there is no framework to adhere to, and I found nothing that works directly with Discourse already. So let’s contribute that and have a pretty cool new mode of visualizing our stuff.

Of course, we can also create this kind of video for parts of the Discourse conversation (say, one Discourse category or tag) by adding a command line option to that script. I propose that the script connect directly to the Discourse database (with a read-only user account) rather than use an API.

Anybody who’d love to take this on? (If not, we might want to let Anu take a stab with a Python or Ruby script. She’s available to work for us again.)

alberto · November 25, 2019, 6:54pm

I would love to, it looks like a fun project for the Christmas holidays. However, in practice I could only do it via API + Python. AND I am a slow and messy programmer, so if you want to release something I am not sure you want to allow me to write any code!

felix.wolfsteller · November 25, 2019, 9:04pm

What exactly does that question mean? Is there a budget and a timeframe?
Because my answer to the subjunctively (can we adverbalize subjunctive?) put “would love to?”-question is yes, I would love to. But I doubt that it would be a wise decision to say that aloud without knowing the time frame, the budget (if any) and the expected quality/output.

matthias · November 25, 2019, 10:51pm

If we do it as a script, it’s a compact and very self-contained piece of code, and software quality does not matter much even if distributed. Instead of accessing PostgreSQL directly, you could take a database dump of only the relevant Discourse data, import it to SQLite, and then work with your Python script on that.
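A minimal sketch of the SQLite route, under loud assumptions: the `posts` table below is a hypothetical, flattened stand-in, not the real Discourse schema (which would need joins across posts, topics, users and categories), and the sample rows are invented.

```python
import sqlite3

# Hypothetical, simplified table; a real Discourse dump looks different.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (created_at INTEGER, username TEXT, "
    "category TEXT, topic_id INTEGER, post_number INTEGER)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    [
        (1574708040, "alberto", "earthos/reef", 1234, 2),
        (1574703060, "matthias", "earthos/reef", 1234, 1),
    ],
)


def gource_lines(connection):
    """Yield one Gource custom log line per post, oldest first."""
    query = (
        "SELECT created_at, username, category, topic_id, post_number "
        "FROM posts ORDER BY created_at"
    )
    for created_at, username, category, topic_id, post_number in connection.execute(query):
        yield f"{created_at}|{username}|A|{category}/{topic_id}/{post_number}"


log_lines = list(gource_lines(conn))
print("\n".join(log_lines))
```

Letting the database do the `ORDER BY` means the output is already in chronological order.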

It really depends what we want to do with this, and how much it’s worth it to us. As a Discourse plugin with API (my latest idea, see below), it would be a different type of software and need more care. Maybe check the concept below and let me know your opinion.

Actually, I didn’t think too much about what that question means before asking. It was just to see if there’s an interest (there is!), and then we can discuss. Meanwhile I had some more ideas, which I put below. There is not really a defined timeframe for this, and surely nothing tight – up to 3 months would sound alright to me. As for the budget: I have to discuss it with Alberto. We might want to invest a moderate budget. If you want, you could send us your estimate for the first version’s functionality (see below), which would help us decide whether we want the version I sketch out below, or rather the proof-of-concept Python script that Alberto proposes above.

Regarding the ideas: when Alberto said “if you want to release something” above, I thought that the right place for this code is in a Discourse plugin that provides a custom API endpoint, serving text files in the Gource custom log format.

First version: There would be a request parameter to generate the log file for a single Discourse category only. If it is not specified, the log would be generated for the platform’s whole content. Only content creation would be considered, not edits to posts or wiki posts, and not post deletions.

Second version: After the proof-of-concept first version, if we decide to go further with this, it would make sense to provide for extended request parameters:

list of Discourse categories (by name or ID) to filter which posts to include
flag to include or exclude the sub-categories of these categories
list of Discourse tags to filter which posts to include
parameter to configure the colors to use for actions

Also, this version should include a simple GUI in the Discourse admin backend that allows configuring these parameters and then generating and downloading the corresponding Gource custom log file by clicking a button.
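The filtering parameters listed above could behave roughly as in this sketch. Everything here is an assumption made for illustration: the function `select_posts`, the dictionary keys, the `campfire` category and the rule that a post qualifies if it carries at least one of the requested tags.

```python
from typing import Iterable, List, Optional, Set


def select_posts(posts: Iterable[dict],
                 categories: Optional[Set[str]] = None,
                 include_subcategories: bool = True,
                 tags: Optional[Set[str]] = None) -> List[dict]:
    """Keep only posts matching the requested categories and tags."""
    selected = []
    for post in posts:
        category = post["category"]  # e.g. "earthos/reef"
        if categories is not None:
            exact = category in categories
            nested = include_subcategories and any(
                category.startswith(c + "/") for c in categories
            )
            if not (exact or nested):
                continue
        # A post qualifies if it carries at least one requested tag.
        if tags is not None and not (tags & set(post["tags"])):
            continue
        selected.append(post)
    return selected


posts = [
    {"id": 1, "category": "earthos", "tags": ["video"]},
    {"id": 2, "category": "earthos/reef", "tags": []},
    {"id": 3, "category": "campfire", "tags": ["video"]},
]
print([p["id"] for p in select_posts(posts, categories={"earthos"})])
```

Passing `None` for a parameter means "no restriction", which matches the first version's default of covering the whole platform.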

I don’t think there is a need to integrate Gource more closely. Downloading the file, then calling the Gource command line tool with that file on one’s local computer, and uploading the resulting video to YouTube is already quite comfortable. It is also fast, as the code has direct access to the Discourse database.

I think this tool would not be used very widely, but it would be software that is interesting and creates beautiful outcomes, so it would certainly get its share of attention.

Third version: Now we could add another way of visualization, namely for our Open Ethnographer data. People would appear to collaborate not around topics, but around ethnographic codes. Our ethnographic codes can be organized in a hierarchy, so they also lend themselves to being visualized with Gource easily. Whenever somebody contributes something that is coded with a certain code, a little colored dot would be added to that code’s node in the Gource visualization.

alberto · November 26, 2019, 12:17am

Are we sure we would need the full path? Maybe one parent-post-id is enough, because, after all, each post is a reply to either zero or one post. If you know the ID of the parent-post, you can simply look up that ID to get the ID of the parent-post of the parent-post.

For example, as I write (without counting this post) there are four posts in this topic. 0 is the opening post of topic 11905. 1 and 2 have the value null for the key reply_to; 3 has the value 2. The script would create the path starting from the right: in the rightmost place it writes the post_number of the post itself; then it moves one step left and looks at reply_to. If it has the value null, it writes the topic’s number. If it has a value different from null, it writes that, moves left again and repeats until it finds a parent post whose reply_to value is null.
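The walk described above can be sketched in a few lines of Python. The function name `build_path` and the dictionary layout (post number mapped to its `reply_to` value) are assumptions for illustration; the numbers are the ones from the example in this topic.

```python
from typing import Dict, Optional


def build_path(topic_id: int, post_number: int,
               reply_to: Dict[int, Optional[int]]) -> str:
    """Walk up the reply chain and return 'topic/ancestor/.../post'."""
    parts = [str(post_number)]          # rightmost element: the post itself
    seen = {post_number}
    parent = reply_to.get(post_number)
    while parent is not None:           # keep moving left until reply_to is null
        if parent in seen:
            raise ValueError("cycle in reply_to data")
        seen.add(parent)
        parts.append(str(parent))
        parent = reply_to.get(parent)
    parts.append(str(topic_id))         # leftmost element: the topic
    return "/".join(reversed(parts))


reply_to = {0: None, 1: None, 2: None, 3: 2}
print(build_path(11905, 3, reply_to))
```

For post 3 this yields `11905/2/3`, and for post 1 it yields `11905/1`. The category and subcategory would still have to be prepended to get the full path format proposed earlier.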

hugi · November 26, 2019, 12:26am

I have already built almost exactly this for a client once. Delivery also included a lot of other code that’s not relevant here, but it should be pretty trivial to pull out only the relevant code and refactor. I’ll try to have a look at it. If we’re lucky it would only take me an hour or so to get us to the first version.

matthias · November 26, 2019, 12:34am

But then you go on and explain how to create the full path. If we want to show the full nesting structure of a topic, the visualization does need the full path (supplied via the Custom Log file), as it does not do any calculations on its own. And the full path would be calculated exactly the way you describe it.

(One more little idea: Discourse connects posts not just via the reply-to information but also via quoting. We can’t visualize that directly with Gource, which only understands a hierarchical structure. But the act of creating a post that quotes other posts can be shown together with flashes between user and these other posts – usually indicating “modification”, but in our case also “quoting”.)
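One way the quoting idea might be expressed in the log, sketched under assumptions: the new post gets an "A" line, and every quoted post gets an "M" line from the same user at the same timestamp. The function name `lines_for_post` and the sample paths are invented for illustration.

```python
from typing import List, Sequence


def lines_for_post(timestamp: int, username: str, post_path: str,
                   quoted_paths: Sequence[str] = ()) -> List[str]:
    """Return an 'A' line for the new post plus one 'M' line per quoted post."""
    lines = [f"{timestamp}|{username}|A|{post_path}"]
    for quoted in quoted_paths:
        # 'M' is reused here to mean "quoted", not "edited".
        lines.append(f"{timestamp}|{username}|M|{quoted}")
    return lines


for line in lines_for_post(1574728440, "matthias", "earthos/reef/1234/5",
                           ["earthos/reef/1234/2"]):
    print(line)
```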

matthias · November 26, 2019, 12:35am

Perfect. Then let’s start from there and see if we like the visualization and can find uses for it in our presentation materials.

felix.wolfsteller · November 26, 2019, 7:57am

Post.all.map do |p|
  [p.created_at.to_i, p.user.username, 'A', p.url].join('|')
end

could be used for a very hacky start. (e.g. from RAILS_ENV=production bundle exec rails console). I guess you need to sort data.csv afterwards.
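On the sorting step: a plain text sort compares the lines character by character, which only matches chronological order while all timestamps have the same number of digits (true for ten-digit Unix timestamps, i.e. dates from September 2001 onwards). A small Python sketch that sorts numerically instead; the helper name `sort_log_lines` is made up.

```python
def sort_log_lines(lines):
    """Sort Gource log lines by their leading Unix timestamp, numerically."""
    return sorted(lines, key=lambda line: int(line.split("|", 1)[0]))


lines = ["1000000000|b|A|y", "999999999|a|A|x"]
print(sort_log_lines(lines))  # numeric order: the nine-digit timestamp first
print(sorted(lines))          # text order: the ten-digit timestamp first
```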

Using the Discourse code base (and the Ruby on Rails-ish ORM stuff from ActiveRecord) will ultimately get you to your goal faster, but of course it comes with a heavy performance penalty compared to running queries directly against PostgreSQL.

hugi · November 26, 2019, 8:02am

This is very true. The code that I have works with the API because it needed to be run by people who didn’t have, and shouldn’t have, root access on the production server. But since we have that access and only need to run the script once, a short Ruby script working with ActiveRecord will do everything we need in just a few lines of code.

felix.wolfsteller · November 26, 2019, 9:29am

Damn, I couldn’t resist: Ein bisschen Spaß muss sein ;-) - Öffentlich - meta-community (German for “There has to be a bit of fun ;-) - Public - meta-community”). That’s a visualization of my Discourse instance at meta-community.org, basically created by the snippet above and then

sort data.csv | gource -s 0.10 -480x420 -o -  | ffmpeg -y -r 60 -f image2pipe -vcodec ppm -i - -vcodec libx264 -preset ultrafast -pix_fmt yuv420p -crf 15 -threads 0 -bf 0 gource-video.mp4

Afterwards, like in a complex design, the possibilities for improvement are endless, and I will stop here. As mentioned, if there is time and budget, I could clean up, parameterize etc. But it’s easy and fun enough that I think you guys will play around and just come up with something nice. If you like, I can push the X lines and a README into a GitHub repository somewhere.

What I learned: Gource actually has an interactive UI (view controls), and is not only a video renderer. Not sure how far all that goes, but it’s kinda fun playing with it.

felix.wolfsteller · November 26, 2019, 10:28am

Also, some use cases might not need a separate “source” file. Gource itself has some limited filtering capabilities, which might be enough for some cases (like maybe filtering by subcategory), e.g…

  --file-filter REGEX      Ignore files matching this regex
  --file-show-filter REGEX Show only files matching this regex

(what is considered a file here is the path, like category/subcategory/topic-id/parent-post-id/parent-post-id/post-id)
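To preview which paths a pattern would keep before handing it to Gource, something like the following Python sketch could help. It assumes that Gource applies the pattern as an unanchored search on the path, and that the simple pattern used here behaves the same in Python’s `re` module as in Gource’s own regex engine; the sample paths are invented.

```python
import re

# Invented sample paths in the category/subcategory/topic-id/post-id format.
paths = [
    "earthos/reef/1234/1",
    "earthos/1235/1",
    "campfire/77/1",
]

# Candidate value for --file-show-filter: keep only the "reef" subcategory.
show_filter = re.compile(r"earthos/reef/")
shown = [p for p in paths if show_filter.search(p)]
print(shown)
```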

Note that there were some releases lately that added options that might not yet be available in the version included in your distribution (e.g. Ubuntu 18.04 ships gource 0.47, gource 0.51 was released 5 days ago).

And unfortunately there are limited layouting options (understandable, as graph visualization and layout is hairy). But so much fun!

alberto · November 26, 2019, 10:23pm

Wow, @felix.wolfsteller, you absolutely rock. Well done, sir.

matthias · November 26, 2019, 11:29pm

Hehe, this idea seems to be developing a life of its own already. Thanks Felix for moving it forward!

We just did our own experiments. So far we have only rendered to live output. The code I used in the Ruby console, based on what @felix.wolfsteller provided us:

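# Build one Gource custom log line per post (last 10,000 posts only).
lines = Post.last(10000).map do |p|
  [p.created_at.to_i, p.user.username, 'A', p.url].join('|')
end

# Write the lines to data.csv; the block form closes the file automatically.
File.open("data.csv", "w") do |f|
  lines.each { |l| f << "#{l}\n" }
end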

With this approach, loading more than 10,000 records at once will use up a lot of CPU and memory. So we don’t.

And then in the shell:

sort data.csv | gource \
  --log-format custom \
  -s 0.10 \
  -480x420 \
  --hide usernames,filenames,dirnames \
  --seconds-per-day 1 \
  --hide-root -

And we got something like this:

EdgerydersLast3000Posts

felix.wolfsteller · November 26, 2019, 11:34pm

f = File.new("data.csv", 'w')

Post.last(10000).each do |p|
  f << [p.created_at.to_i, p.user.username, 'A', p.url].join('|')
  f << "\n"
end

f.close

will reduce the memory footprint.
Even better would be to use Post.all.find_each do |..., to instantiate the Post objects just when needed and to allow temporary p assignments to be garbage collected.
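The same batching idea, sketched in Python for anyone working outside the Rails console: only one batch of records is held in memory at a time, and each line is written as soon as it is built. The function name `write_in_batches`, the tuple layout and the sample data are assumptions for illustration.

```python
import io
from itertools import islice


def write_in_batches(records, out, batch_size=1000):
    """Write log lines batch by batch so only one batch is in memory at a time."""
    iterator = iter(records)
    written = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        for timestamp, username, path in batch:
            out.write(f"{timestamp}|{username}|A|{path}\n")
        written += len(batch)
    return written


# Invented sample data produced lazily by a generator.
records = ((1574700000 + i, "matthias", f"earthos/reef/1234/{i}") for i in range(5))
buffer = io.StringIO()
count = write_in_batches(records, buffer, batch_size=2)
print(count)
```

In real use, `out` would be an open file such as `data.csv` rather than an in-memory buffer.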

Or is gource itself also hitting its limit?

matthias · November 26, 2019, 11:45pm

No, it was just ActiveRecord using too much memory, and I did not have the time to think about it. Thanks!

alberto · May 29, 2020, 4:03pm

I have added to our Python library a function that generates Gource-digestible files based on a category. Downloadable here.

nadia · May 29, 2020, 4:17pm

this looks like an old school game tetris era style

matteo_uguzzoni · January 19, 2021, 8:46pm

Hello @alberto, is the function still available somewhere? I’m in the process of reporting on Trust in Play and I was interested in rendering the community as a short video.

alberto · January 19, 2021, 9:15pm

Yes. Download the function script and the example config file, and put them in a folder together. In the config file, you will need to enter the Trust in Play master URL, your API key, and a folder where you want to put the output file.

Then it gets a bit tricky, because Trust in Play has content scattered all over the categories. If you want the whole thing, I suggest you proceed like this: first assign to all topics a Discourse tag for the purpose, like video. Then run

>>> import sys
>>> sys.path.append('path-for-your-directory')
>>> # import make_gource_file_from_tag from the downloaded function script first
>>> success = make_gource_file_from_tag('video')

The script generates a CSV file that you can then feed to Gource. Download Gource and then configure it the way you want; the documentation is really well made.

Example: