Graphryder 2.0 – Workplan

Graphryder is the technology that we rely on to do “Semantic Social Network Analysis” (SSNA). In short, the technique is:

  • We annotate the content on Discourse with ethnographic tags. This is done with a gem we include in our custom Discourse fork.
  • There are two concepts to understand, “annotation” and “code”. An annotation is an instance of using a code and refers to a specific location in a specific post of a specific topic.
  • We use a Python service called Graphryder API to build a graph of all users, topics, posts, codes, and annotations in Neo4j, and then to derive interpretations of that graph with a graph algorithm library called Tulip.
  • Tulip data from Graphryder API is accessed and displayed through Graphryder Dashboard, a client-side JavaScript application. Graphryder API also reads data from the Neo4j database directly.
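
To make the distinction between “code” and “annotation” concrete, here is a minimal Python sketch of the two concepts. It is not the actual annotator-store schema: the field names (including the `start` and `end` offsets) and the sample values are assumptions made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Code:
    """An ethnographic tag that can be applied many times."""
    id: int
    name: str


@dataclass(frozen=True)
class Annotation:
    """One use of a code at a specific location in a specific post of a topic."""
    id: int
    code_id: int
    topic_id: int
    post_id: int
    start: int  # hypothetical: character offset where the annotated span begins
    end: int    # hypothetical: character offset where the annotated span ends


def annotations_per_code(codes, annotations):
    """Count how many times each code has been applied."""
    names = {code.id: code.name for code in codes}
    return Counter(names[a.code_id] for a in annotations)


# Made-up sample data: two codes, three annotations.
codes = [Code(1, "trust"), Code(2, "care")]
annotations = [
    Annotation(10, code_id=1, topic_id=100, post_id=1000, start=0, end=12),
    Annotation(11, code_id=1, topic_id=100, post_id=1001, start=5, end=20),
    Annotation(12, code_id=2, topic_id=101, post_id=1002, start=3, end=9),
]
print(annotations_per_code(codes, annotations))
```

One code can back many annotations, which is why the graph keeps them as separate node types.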

What we want

SSNA we can depend on

Our current stack is getting old, and we keep patching it, but some of the technology might break. We already have problems with new versions of Neo4j, and the old versions have other problems and are already end of life. Additionally, there are libraries and components in the old stack that are no longer maintained. We need to be able to rely on our SSNA stack.

Adding SSNA to a project should not incur extra IT costs

Currently, the installation of Graphryder can only handle looking at a single dataset. We currently use the methodology for four projects and need to have four instances of Graphryder API, four Neo4j databases, and four different deployments of Graphryder Dashboard. This is not sustainable, and to be able to offer SSNA projects at a reasonable price we need to be able to add projects without having to go through deployment.

Our goal should be that adding standard SSNA to a project comes at zero development cost, apart from the project paying its share into the IT overhead fund, which should cover bug fixing and improvement across projects. Within a project, our research budget should be used for research.

Real-time updates

Currently, each Graphryder API instance needs to reload data manually from Discourse in a heavy operation, emptying and rebuilding the Neo4j database every time you want to load new data. Instead, we would like to have an up-to-date graph view of what is happening in our Discourse. Which users are interacting a lot, and are there clusters of people who interact with each other? Which ethnographic codes are coming up across projects?

An up-to-date graph can also help community managers and provide reflective feedback to users.

Quickly deploy new experimental views

We want to make it easier for the Edgeryders Research Network to experiment with new ways to visualize and analyze the SSNA and social data from the Edgeryders Discourse.

Graphs everywhere

We want to have components that allow us to embed graphs from our projects on our public-facing websites, showing interesting views of our research using up to date community data.

Challenges with the current setup

Neo4j is crippleware

In January I had a look at making Graphryder API multi-tenant, and came to the conclusion that Neo4j is the real culprit.

Neo4j is a powerful graph database, but the open-source Community Edition is only barely usable in production. Specifically, you need their Enterprise Edition to have more than one graph per install and to have more than one user with different permissions.

Having something that we could simply tack on to any Discourse deployment as a plugin would be a lot easier.

Graphryder API is a messy codebase

Graphryder API has a lot of dead ends and is a bit hard to understand. This is because it was originally supposed to do more than it does, but was left unfinished. Now that we know what we need it to do, it needs to be restructured.

Graphryder Dashboard is approaching end-of-life

Graphryder Dashboard is an Angular app that uses Grunt and Bower. It’s an old technology that is hard to maintain and difficult to work with. It’s also an oddity in our front-end ecosystem, where we usually use Vue.js or React.

Eventually, we want to rebuild the dashboard, probably in Vue.js.

Step by step plan

Phase 1: Replace Neo4j with RedisGraph

Both Neo4j and RedisGraph use Cypher. Theoretically, if RedisGraph stores the data from Discourse and the annotation data in the same structure as Neo4j, Graphryder API should be able to read from that instead of first having to build the data in Neo4j.

Graphryder API has two major parts:

  • importFromDiscourse, which loads the data from Discourse into Neo4j. It does not load all posts, but instead only topics with a tag set in the config.
  • graphtulip, a library of Python classes that read the Neo4j graph and prepare Tulip graphs based on that data.

In Phase 1, we would:

  • Replace importFromDiscourse with a Discourse plugin that builds an up-to-date database based on the Discourse data, the codes, and annotations, and exposes a protected endpoint for Cypher queries of that data.

  • Implement a connector to RedisGraph instead of Neo4j in Graphryder API.

  • Adjust the Cypher queries of graphtulip functions as needed so that they only query the subgraph related to a given Discourse topic tag.

  • Patch up the Cypher queries as needed in case the Cypher implementations of Neo4j and RedisGraph are not 1 to 1.

At the end of Phase 1, we should still have a functioning Graphryder API that works with Graphryder Dashboard, but the importFromDiscourse module can now be scrapped. We now have all the raw SSNA data available through an endpoint that accepts Cypher queries.
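
As a rough illustration of what “only query the subgraph related to a topic tag” could look like, here is a small Python sketch that builds a tag-scoped Cypher query. The node labels (`topic`, `post`, `tag`), the relationship types (`TAGGED_WITH`, `IN_TOPIC`) and the property names are hypothetical; the real ones would have to match whatever structure the plugin writes to RedisGraph. How the parameters are passed along also depends on the client or endpoint used.

```python
def posts_for_tag_query(tag):
    """Build a parameterized Cypher query limited to topics carrying one tag.

    All labels, relationship types and property names here are assumptions
    for illustration, not the actual Graphryder schema.
    """
    query = (
        "MATCH (t:topic)-[:TAGGED_WITH]->(g:tag {name: $tag}) "
        "MATCH (p:post)-[:IN_TOPIC]->(t) "
        "RETURN p.id, t.id"
    )
    # The tag travels as a parameter rather than being pasted into the string.
    return query, {"tag": tag}


query, params = posts_for_tag_query("ethno-example")
print(query)
print(params)
```

Keeping the tag as a parameter means the same query text serves every project, which is the property the later phases rely on.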

Phase 2: Make Graphryder API multi-tenant

At the end of Phase 1, Graphryder API is still single-tenant. Graphtulip prepares “tlp” data files that are downloaded by the Graphryder Dashboard. Currently, it has no way of keeping track of multiple subgraphs, and each “tag-focus” subgraph needs to be processed independently for display.

In Phase 2, the Graphryder API is refactored to keep track of an arbitrary number of subgraphs and their corresponding tlp files, and to serve the right files to the Graphryder Dashboard when passed the right tag.
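
One possible shape for this bookkeeping, sketched in Python: a small registry that maps each tag to its tlp file and builds the file the first time a tag is requested. The class name, the `build_tlp` callback and the file naming are all assumptions for illustration; the real refactoring may look quite different.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical rule: tags are lowercase letters, digits, hyphens and underscores.
VALID_TAG = re.compile(r"^[a-z0-9_-]+$")


class SubgraphRegistry:
    """Tracks one generated tlp file per tag focus (hypothetical design)."""

    def __init__(self, storage_dir, build_tlp):
        self._storage_dir = Path(storage_dir)
        self._build_tlp = build_tlp  # callable(tag, path) that writes the file
        self._paths = {}

    def path_for(self, tag):
        """Return the tlp path for a tag, building the file on first request."""
        if not VALID_TAG.match(tag):
            raise ValueError(f"invalid tag: {tag!r}")
        if tag not in self._paths:
            path = self._storage_dir / f"{tag}.tlp"
            self._build_tlp(tag, path)
            self._paths[tag] = path
        return self._paths[tag]


def placeholder_builder(tag, path):
    """Stand-in for the graphtulip code that would write real Tulip data."""
    path.write_text(f"; placeholder graph for {tag}\n")


with tempfile.TemporaryDirectory() as tmp:
    registry = SubgraphRegistry(tmp, placeholder_builder)
    first = registry.path_for("ethno-example")
    second = registry.path_for("ethno-example")  # served from the registry
    print(first.name, first == second)
```

Validating the tag before using it in a file name keeps a request from pointing the API at arbitrary paths.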

Phase 3: Rebuild a multi-tenant Graphryder Dashboard with Vue.js

Graphryder Dashboard has served us well, but its core technologies are aging to the point of being very hard to maintain. We would like to rebuild the dashboard in Vue, which would also allow us to abstract the components and display Graphryder SSNA graphs on our other websites.

Furthermore, one deployment of Dashboard per API should be enough, and the tag focus should be possible to set client-side.

Doing the work

I have previously talked to @gdpelican about this, and he has been experimenting with RedisGraph.

To get us started, I would like to offer him the work on Phase 1. We could pull the budget from a few different places and share it between units, but it’s hard to estimate exactly how much we need without letting him get to work on scoping out the project.

What I can offer directly is work for up to 10 hours on scoping out the time needed to complete Phase 1. This only means coming back with a plan and time estimate. I can also guarantee that we have the budget for up to at least 80 hours from projects I run, and I will talk to @matthias and @alberto to see if we can pool resources to get Phases 2 and 3 done.

Are you still interested in working on this @gdpelican?

Yep, I am still interested in this one, and would be happy to start on scoping anytime after this week :slight_smile:

Perfect! Feel free to start anytime, and let’s touch base end of next week?

Alright, I’ve had a poke around here, and here’s my rough guess on a timeline for phase 1:

Action items:

  • Pull back the experimental changes around RedisGraph models within Discourse (~2 hrs)

    • While I’m happy with the experimentation to add a REST-like API for querying individual nodes/edges/relationships of the graph, the existing Graphryder queries are a bit more complex than that and would require a more mature system than we really need to build to utilize it properly. In light of that, pulling this bit out and opting for a single endpoint which accepts a string to run as a Cypher query should be the initial push.
  • Sync the Discourse DB so that it writes to RedisGraph as well (~8hrs)

    • Reference the existing importFromDiscourse module to determine which Discourse events this needs to occur on to maintain an up-to-date graph in Redis
      • Create / update topic / post
      • Create / update tag
      • Others? Annotations?
  • Write an endpoint to accept Cypher queries to RedisGraph (~4hrs)

    • Ensure this is read-only for the graph; no modification of nodes should be possible
    • Need to think about a security approach for this endpoint; API key provided by the client?
  • Adapt Graphryder API to call the RedisGraph endpoint instead of Neo4j (~15-20hrs)

    • Need to get Graphryder running locally
    • Should include documentation changes on how the new API functions
    • This phase is the highest risk one at the moment; although replacing the queries should be fairly straightforward, I haven’t been able to get Graphryder running locally just yet, meaning it may be difficult to verify the fixes are working properly and that end-to-end the graphs are unchanged. Hopefully removing the Neo4j dependency will help with this part.
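
For the read-only and API key points above, here is a rough Python sketch of the checks such an endpoint might run before executing a query. The actual plugin would live in Discourse and be written in Ruby, so this only illustrates the logic. The function names, the status codes and the keyword list are assumptions, and keyword filtering is a blunt tool: it would also reject a harmless query that merely contains one of these words inside a string literal. If the installed RedisGraph version offers a way to enforce read-only queries on the server side, that would be more robust.

```python
import hmac
import re

# Cypher clauses that modify the graph. Matching on keywords is a crude,
# conservative approximation, not a real Cypher parser.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|SET|DELETE|REMOVE|DROP)\b", re.IGNORECASE
)


def is_read_only(query):
    """Return True if the query contains none of the known write clauses."""
    return WRITE_CLAUSES.search(query) is None


def is_authorized(provided_key, expected_key):
    """Compare API keys in constant time to avoid leaking timing information."""
    return hmac.compare_digest(provided_key.encode(), expected_key.encode())


def check_request(query, provided_key, expected_key):
    """Return a (status, message) pair; hypothetical endpoint behaviour."""
    if not is_authorized(provided_key, expected_key):
        return 403, "invalid API key"
    if not is_read_only(query):
        return 400, "only read queries are allowed"
    return 200, "ok"


print(check_request("MATCH (n) RETURN n LIMIT 5", "secret", "secret"))
print(check_request("MATCH (n) DETACH DELETE n", "secret", "secret"))
print(check_request("MATCH (n) RETURN n", "wrong", "secret"))
```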

This is a short week in New Zealand (happy birthday to the Queen), so I’ll look to get the first three Discourse items together starting next week, then start on the Graphryder API piece the week after. Does this seem like a sensible start to you @hugi?

I’ve set it up a few times now so maybe I can help. What problems are you running into?

Yes, this sounds reasonable! Thanks.

Yes, we need to get annotations, codes and code names from the “annotator-store”. And even though it’s not used in the importer now, we should also include the relations for who has created each annotation and code.

All of that data is already available in ActiveRecord, so getting it shouldn’t be harder than getting the other data.

However, I’m not sure if these end up in the event log in the same way as the standard Discourse stuff, @matthias?

We should use an approach here that syncs with what already exists for the annotator. How do you think this should be implemented @matthias?

The annotator-store gem and our extensions to it store to the PostgreSQL database via a plain ActiveRecord mapping, and that’s it. We don’t implement any Discourse-specific interfaces etc.; so far it is a standalone Rails engine that does its own thing and relies on Discourse for authentication and a shared database.

Access should be granted with two mechanisms:

  1. Discourse API key for the so-called Admin API, provided by the client application. This is the approach we currently use for all Vue.js applications to access the Discourse API by first doing an SSO login via community.edgeryders.eu and then obtaining the user’s admin API key via this custom API endpoint.

  2. If the user is an authenticated Discourse user and member of group “annotator”, as currently done inside Open Ethnographer. This would be used for implementing Graphryder visualizations as part of the current Open Ethnographer interface, which is not planned yet but should be possible. Since it will not be needed in the immediate future, there is no need to implement it right now. But if there is a way to implement the API endpoint so that this comes out as a side effect, then do that.

We have detailed installation instructions. They reside outside of the GitHub repositories though … I hope you found them already …

A minor update here:

^^ this is done and pushed, woo! Now we have a plugin which exposes a single endpoint which will execute read-only Cypher queries against the RedisGraph embedded in the Discourse instance.

^^ the bulk of this is complete, and looks to be finished by end of week.

In our first instance of differences between RedisGraph and Neo4j, it appears RedisGraph doesn’t support unique constraints, which Neo4j does. We should be able to dance around this with a bit of ‘upsert’ style code in the plugin, which I have on the docket for tomorrow.
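
A minimal sketch of what ‘upsert’ style code could mean here, assuming the RedisGraph version in use supports Cypher’s MERGE clause and query parameters: MERGE matches an existing node on its key or creates it, so syncing the same record twice should not produce duplicates even without a unique constraint. The actual plugin is Ruby inside Discourse; this Python helper, its name and the example label and properties are hypothetical and only show the query shape.

```python
import re

# Labels and property names are interpolated into the query text,
# so only plain identifiers are allowed.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def upsert_query(label, key, properties):
    """Build a MERGE-based upsert for one node, keyed on a single property."""
    for name in (label, key, *properties):
        if not IDENTIFIER.match(name):
            raise ValueError(f"invalid identifier: {name!r}")
    if key not in properties:
        raise ValueError(f"properties must include the key {key!r}")

    # Match or create the node on its key, then overwrite the other properties.
    query = f"MERGE (n:{label} {{{key}: ${key}}})"
    assignments = ", ".join(
        f"n.{name} = ${name}" for name in sorted(properties) if name != key
    )
    if assignments:
        query += f" SET {assignments}"
    return query, dict(properties)


query, params = upsert_query("post", "id", {"id": 42, "title": "Hello"})
print(query)   # MERGE (n:post {id: $id}) SET n.title = $title
print(params)
```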

I’ve gotten a (more recent version of) Neo4j running locally, which seems to run ok, including the suggested plugins. My trouble with the Graphryder API so far has been installing Python 3.5 effectively, but I will give this another crack tomorrow. (For extra-interested parties, I ran into this and haven’t gotten the suggested commands to work just yet)

Thanks for the update!

I highly recommend going for pyenv to get that working. It’s really the way to go to handle that sort of nonsense with outdated dependencies.