Graphryder 2.0 – Workplan

hugi · May 18, 2020, 7:05pm

Graphryder is the technology that we rely on to be able to do “Semantic Social Network Analysis”. In short, the technique is:

We annotate the content on Discourse with ethnographic tags. This is done with a gem we include in our custom Discourse fork.
There are two concepts to understand, “annotation” and “code”. An annotation is an instance of using a code and refers to a specific location in a specific post of a specific topic.
We use a Python service called Graphryder API to build a graph of all users, topics, posts, codes, and annotations in Neo4j, and to then use that graph database to draw interpretations of that graph with a graph algorithm library called Tulip.
Tulip data from Graphryder API is accessed and displayed through Graphryder Dashboard, a client-side Javascript application. Graphryder API also reads data from the Neo4j database directly.
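The distinction between a code and an annotation can be made concrete with a minimal Python sketch. The class and field names below (including the `start` and `end` offsets) are illustrative assumptions, not the actual schema used by the annotator gem or by Graphryder API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Code:
    """An ethnographic code that can be applied many times."""
    code_id: int
    name: str


@dataclass(frozen=True)
class Annotation:
    """One use of a code at a specific location in a specific post of a topic."""
    annotation_id: int
    code_id: int   # which code was applied
    topic_id: int  # the topic the post belongs to
    post_id: int   # the annotated post
    start: int     # hypothetical: offset where the annotated span begins
    end: int       # hypothetical: offset where the annotated span ends


def annotations_per_code(annotations):
    """Count how many annotations use each code."""
    counts = {}
    for annotation in annotations:
        counts[annotation.code_id] = counts.get(annotation.code_id, 0) + 1
    return counts
```

One code therefore corresponds to many annotations, and each annotation points back to exactly one code, post, and topic.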

What we want

SSNA we can depend on

Our current stack is getting old, and we keep patching it, but some of the technology might break. We already have problems with new versions of Neo4j, and the old versions have other problems and are already end of life. Additionally, there are libraries and components in the old stack that are no longer maintained. We need to be able to rely on our SSNA stack.

Adding SSNA to a project should not incur extra IT costs

Currently, the installation of Graphryder can only handle looking at a single dataset. We currently use the methodology for four projects and need to have four instances of Graphryder API, four Neo4j databases, and four different deployments of Graphryder Dashboard. This is not sustainable, and to be able to offer SSNA projects at a reasonable price we need to be able to add projects without having to go through deployment.

Our goal should be that adding standard SSNA to a project comes at zero development cost, apart from paying its share into the IT overhead fund, which should cover bug fixing and improvement across projects. Within a project, our research budget should be used for research.

Real-time updates

Currently, each Graphryder API instance needs to reload data manually from Discourse in a heavy operation, emptying and rebuilding the Neo4j database every time you want to load new data. We would like to have a graph view of what is happening in our Discourse. Which users are interacting a lot and are there clusters of people who interact with each other? Which ethnographic codes are coming up across projects?

An up-to-date graph can also help community managers and provide reflective feedback to users.

Quickly deploy new experimental views

We want to make it easier for the Edgeryders Research Network to experiment with new ways to visualize and analyze the SSNA and social data from the Edgeryders Discourse.

Graphs everywhere

We want to have components that allow us to embed graphs from our projects on our public-facing websites, showing interesting views of our research using up to date community data.

Challenges with the current setup

Neo4j is crippleware

In January I had a look at making Graphryder API multi-tenant, and came to the conclusion that Neo4j is the real culprit.

Neo4j is a powerful graph database, but the open-source Community Edition is only barely usable in production. Specifically, you need their enterprise edition to enable having more than one graph per install and to enable more than one user with different permissions.

Having something that we could simply tack on to any Discourse deployment as a plugin would be a lot easier.

Graphryder API is a messy codebase

Graphryder API has a lot of dead ends and is a bit hard to understand. This is because it was originally supposed to do more than it does, but was left unfinished. Now that we know what we need it to do, it needs to be restructured.

Graphryder Dashboard is approaching end-of-life

Graphryder Dashboard is an Angular app that uses Grunt and Bower. It’s built on old technology that is hard to maintain and difficult to work with. It’s also an oddity in our front end ecosystem, where we usually use Vue.js or React.

Eventually, we want to rebuild the dashboard, probably in Vue.js.

Step by step plan

Phase 1: Replace Neo4j with RedisGraph

Both Neo4j and RedisGraph use Cypher. Theoretically, if RedisGraph stores the data from Discourse and the annotation data in the same structure as Neo4j, Graphryder API should be able to read from that instead of first having to build the data in Neo4j.

Graphryder API has two major parts:

importFromDiscourse, which loads the data from Discourse into Neo4j. It does not load all posts, but instead only topics with a tag set in the config.
graphtulip, a library of python classes that read the Neo4j graph and prepare Tulip graphs based on that data.

In Phase 1, we would:

Replace importFromDiscourse with a Discourse plugin that builds an up-to-date database based on the Discourse data, the codes, and annotations, and exposes a protected endpoint for Cypher queries of that data.
Implement a connector to RedisGraph instead of Neo4j in Graphryder API.
Check the Cypher queries of graphtulip functions and adjust them as needed so that they only call the subgraph related to a certain Discourse topic-tag.
Patch up the Cypher queries as needed in case the Cypher implementations of Neo4j and RedisGraph are not 1 to 1.

At the end of Phase 1, we should still have a functioning Graphryder API that works with Graphryder Dashboard, but the importFromDiscourse module can now be scrapped. We now have all the raw SSNA data available through an endpoint that accepts Cypher queries.

Phase 2: Make Graphryder API multi-tenant

At the end of Phase 1, Graphryder API is still single-tenant. Graphtulip prepares “tlp” data files that are downloaded by the Graphryder Dashboard. Currently, it has no way of keeping track of multiple subgraphs, and each “tag-focus” subgraph needs to be processed independently for display.

In phase 2, the Graphryder API is refactored to be able to handle keeping track of an arbitrary number of subgraphs and their corresponding tlp files, and serve up the right files to the Graphryder Dashboard when passed the right tag.
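As a rough illustration of what "keeping track of subgraphs and their tlp files" could mean, here is a minimal Python sketch of a lookup from topic tag to tlp file. The class name, method names, directory, and tag names are all hypothetical and do not come from the Graphryder API codebase.

```python
from pathlib import Path


class SubgraphRegistry:
    """Maps a Discourse topic tag to the tlp file prepared for that subgraph."""

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)
        self._files = {}

    def register(self, tag, filename):
        """Record which tlp file holds the subgraph for this tag."""
        self._files[tag] = self.base_dir / filename

    def tlp_path(self, tag):
        """Return the tlp file for a tag, or raise KeyError if none is registered."""
        try:
            return self._files[tag]
        except KeyError:
            raise KeyError(f"no subgraph registered for tag {tag!r}") from None
```

With something like this in place, the dashboard would pass a tag and the API would answer with the matching file, so one API install could serve several SSNA datasets.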

Phase 3: Rebuild a multi-tenant Graphryder Dashboard with Vue.js

Graphryder Dashboard has served us well, but its core technologies are aging to the point of being very hard to maintain. We would like to rebuild the dashboard in Vue, which would also allow us to abstract the components and display Graphryder SSNA graphs on our other websites.

Furthermore, one deployment of Dashboard per API should be enough, and the tag focus should be possible to set client-side.

Doing the work

I have previously talked to @gdpelican about this, and he has been experimenting with RedisGraph.

To get us started, I would like to offer him the work on Phase 1. We could pull the budget from a few different places and share it between units, but it’s hard to estimate exactly how much we need without letting him get to work on scoping out the project.

What I can offer directly is work for up to 10 hours on scoping out the time needed to complete Phase 1. This only means coming back with a plan and time estimate. I can also guarantee that we have the budget for at least 80 hours from projects I run, and I will talk to @matthias and @alberto to see if we can pool resources to get Phases 2 and 3 done.

Are you still interested in working on this @gdpelican?

gdpelican · May 18, 2020, 8:20pm

Yep, I am still interested in this one, and would be happy to start on scoping anytime after this week

hugi · May 18, 2020, 8:23pm

Perfect! Feel free to start anytime, and let’s touch base end of next week?

gdpelican · May 25, 2020, 2:40am

Alright, I’ve had a poke around here, and here’s my rough guess on a timeline for phase 1:

Action items:

Pull back the experimental changes around RedisGraph models within Discourse (~2 hrs)
- While I’m happy with the experimentation to add a REST-like API for querying individual nodes/edges/relationships of the graph, the existing Graphryder queries are a bit more complex than that and would require a more mature system than we really need to build to utilize it properly. In light of that, pulling this bit out and opting for a single endpoint which accepts a string to run as a Cypher query should be the initial push.
Sync the Discourse DB so that it writes to RedisGraph as well (~8hrs)
- Reference the existing ImportFromDiscourse module to determine which Discourse events this needs to occur on to maintain an up-to-date graph in Redis
  - Create / update topic / post
  - Create / update tag
  - Others? Annotations?
Write an endpoint to accept Cypher queries to RedisGraph (~4hrs)
- Ensure this is read-only for the graph; no modification of nodes should be possible
- Need to think about a security approach for this endpoint; api key provided by the client?
Adapt Graphryder API to call RedisGraph endpoint instead of Neo4j (~15-20hrs)
- Need to get graphryder running locally
- Should include documentation changes on how the new API functions
- This phase is the highest risk one at the moment; although replacing the queries should be fairly straightforward, I haven’t been able to get Graphryder running locally just yet, meaning it may be difficult to verify the fixes are working properly and that end-to-end the graphs are unchanged. Hopefully removing the neo4j dependency will help with this part.

This is a short week in New Zealand (happy birthday to the Queen), so I’ll look to get the first three Discourse items together starting next week, then start on the Graphryder API piece the week after. Does this seem like a sensible start to you @hugi?

hugi · May 25, 2020, 2:51am

I’ve set it up a few times now so maybe I can help. What problems are you running into?

Yes, this sounds reasonable! Thanks.

hugi · May 25, 2020, 4:20am

Yes, we need to get annotations, codes and code names from the “annotator-store”. And even though it’s not used in the importer now, we should also include the relations for who has created each annotation and code.

All of that data is already available in ActiveRecord, so getting it shouldn’t be harder than getting the other data.

However, I’m not sure if these end up in the event log in the same way as the standard Discourse stuff, @matthias?

We should use an approach here that syncs with what already exists for the annotator. How do you think this should be implemented @matthias?

matthias · May 25, 2020, 12:58pm

The annotator-store gem and our extensions to it store to the PostgreSQL database via a plain ActiveRecord mapping, and that’s it. We don’t implement any Discourse-specific interfaces etc.; so far it is a standalone Rails engine that does its own thing and relies on Discourse for authentication and a shared database.

Access should be granted with two mechanisms:

Discourse API key for the so-called Admin API, provided by the client application. This is the approach we currently use for all Vue.js applications to access the Discourse API by first doing an SSO login via community.edgeryders.eu and then obtaining the user’s admin API key via this custom API endpoint.
If the user is an authenticated Discourse user and member of group “annotator”, as currently done inside Open Ethnographer. This would be used for the case of implementing Graphryder visualizations as part of the current Open Ethnographer interface. Which is not planned yet, but should be possible. Since it will not be needed in the immediate future, no need to implement it right now. But if there is a way to implement the API endpoint so that this comes out as a side effect, then do that.

We have detailed installation instructions. They reside outside of the Github repositories though … I hope you found them already …

gdpelican · June 4, 2020, 9:48am

A minor update here:

^^ this is done and pushed, woo! Now we have a plugin which exposes a single endpoint which will execute read-only Cypher queries against the RedisGraph embedded in the Discourse instance.

^^ the bulk of this is complete, and looks to be finished by end of week.

In our first instance of differences between RedisGraph and Neo4j, it appears RedisGraph doesn’t support unique constraints, which Neo4j does. We should be able to dance around this with a bit of ‘upsert’ style code in the plugin, which I have on the docket for tomorrow.
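One way to picture the "upsert" workaround is a Cypher `MERGE` on the identifying property, so that re-importing a record updates the existing node instead of creating a duplicate. The helper below is a hedged Python sketch of building such a statement; the plugin itself is Ruby, the function name and labels are made up for illustration, and `json.dumps` is only a rough stand-in for proper Cypher literal quoting.

```python
import json


def upsert_query(label, key, props):
    """Build a Cypher MERGE statement keyed on one property.

    MERGE matches a node with the given key value or creates it if absent;
    the optional SET then writes the remaining properties either way.
    Label and property names are not validated here (assumed trusted).
    """
    if key not in props:
        raise ValueError(f"props must include the key property {key!r}")
    match_part = f"{key}: {json.dumps(props[key])}"
    sets = ", ".join(
        f"n.{name} = {json.dumps(value)}"
        for name, value in sorted(props.items())
        if name != key
    )
    query = f"MERGE (n:{label} {{{match_part}}})"
    if sets:
        query += f" SET {sets}"
    return query
```

For example, `upsert_query("user", "id", {"id": 7, "username": "hugi"})` yields `MERGE (n:user {id: 7}) SET n.username = "hugi"`. Whether RedisGraph accepts every form of this statement would still need checking against the version in use.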

I’ve gotten a (more recent version of) neo4j running locally, which seems to run ok, including the suggested plugins. My trouble with the graphryder API so far has been installing python v3.5 effectively, but will give this another crack tomorrow. (For extra-interested parties, I ran into this and haven’t gotten the suggested commands to work just yet)

hugi · June 4, 2020, 1:47pm

Thanks for the update!

I highly recommend going for pyenv to get that working. It’s really the way to go to handle that sort of nonsense with outdated dependencies.

gdpelican · June 8, 2020, 1:50am

Another update here:

The initial build of this is done; I need to install the annotator-store gem in order to confirm annotations are working as expected, but this looks to be working for Users, Posts, Topics, and Tags, as well as the existing relationships from the existing API. We’ll likely need to iron out some bits around odd post formatting yet, but those should be easily resolved once we expose this to a more robust database of content. There’s an initial importer which can be run by passing an ENV variable on startup, and ongoing updates triggered by any saves to those models.

The basics here are also done; we can send arbitrary queries via the API now. I haven’t secured the endpoint via API token just yet; I’ll do that this week. I also want to find a robust way to ensure that only read-only queries can be executed via the API; no merges / removes etc. The quick-and-dirty way is to enforce that the given query starts with MATCH, but it’s still very possible to bypass that requirement and b0rk the graph if someone wanted to.
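To illustrate why the starts-with-MATCH check is easy to bypass, and what a slightly stricter heuristic might look like, here is a small Python sketch. It is not the plugin's code (the plugin is Ruby), the list of write clauses is an assumption that may be incomplete, and keyword filtering remains a heuristic rather than a real guarantee.

```python
import re

# Clauses that modify a graph in Cypher; assumed list, may be incomplete.
WRITE_CLAUSES = ("CREATE", "MERGE", "DELETE", "SET", "REMOVE", "DROP")

# Single- or double-quoted string literals, allowing backslash escapes.
_STRING_LITERAL = re.compile(r"'(?:\\.|[^'\\])*'|\"(?:\\.|[^\"\\])*\"")
_WRITE_WORD = re.compile(r"\b(" + "|".join(WRITE_CLAUSES) + r")\b", re.IGNORECASE)


def starts_with_match(query):
    """The quick-and-dirty check: only looks at the first keyword."""
    return query.lstrip().upper().startswith("MATCH")


def looks_read_only(query):
    """Stricter heuristic: starts with MATCH and has no write clause
    anywhere outside string literals. May reject some harmless queries."""
    stripped = _STRING_LITERAL.sub("''", query)
    return starts_with_match(stripped) and not _WRITE_WORD.search(stripped)
```

A query such as `MATCH (p:post) DELETE p` passes `starts_with_match` but is rejected by `looks_read_only`. A more robust approach would enforce read-only access in the database layer itself rather than by inspecting query text.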

Yep, I’m using pyenv, but it seems like it’s missing some openssl-related thing; I’ll continue to try to get that going, perhaps booting it up on another machine if need be.

gdpelican · June 13, 2020, 11:10pm

Another update here:

I’ve gone through and ripped out the importers and neo4j references in the Graphryder API, and gotten some initial integration requests going through as expected. Here’s the WIP pull request for it (+155 / -1.5k, not bad!).
Next piece is going through each of the queries (there’s something on the order of 30 or so?) and ensuring that they work and return data as expected; initial testing suggested I’ll need to go back and make a more robust deserializer to handle all the various types of cypher queries we’re feeding it.
Regarding auth, I’ve put together a little thing that should work for us, including allowing annotator group members to access the endpoint, that’ll go in soon once I’ve tested it a bit more.
Stripping away the neo4j dependency and doing a minor update to tulip has allowed me to get the graphryder API running locally on python 3.5.3; still need to confirm that moving tulip 4.10 → 5.2 doesn’t break it (which can happen once some more of the redisgraph queries are going through successfully)

hugi · June 16, 2020, 8:20pm

Wow! I will have a look at this soon! Which queries have you tried so far?
Do you think it would be safe for us to install the graph plugin on one of the live Discourse installs to test with production data?

That sounds about right, yes.

Great, that sounds about right @matthias?

A very good side-effect improvement.

@alberto, heads up, very good progress being made here on a real-time SSNA running directly in Discourse without Neo4j.

gdpelican · June 16, 2020, 9:51pm

Let me go in and add some error handling so that if a record import fails it doesn’t crash the instance; once that’s the case I’d be super curious to give the import a go; I’ll ping once that’s the case.

It’ll also be a good test to ensure that installing the RedisGraph module works properly in production as well.

alberto · June 16, 2020, 10:27pm

I am not sure I get everything, but def looking forward to peeking over your shoulder.

gdpelican · June 17, 2020, 3:13am

Alright, so installing this on an instance at this point would be an interesting endeavor. Here’s the plugin in question:

I’ve put in some error handling so that it has a low risk of crashing anything once it gets set up.

Once the plugin is installed, it will start syncing changes to the DB automatically. In order to do a full import of things already in the database, you can either set the GRAPHRYDER_IMPORT env variable on startup:

GRAPHRYDER_IMPORT=1 rails s

or run the importer manually in console:

> Graphryder::Importer.initialize!

That should output something like this:

If some queries fail, they’ll pop output into the console without killing anything else:

NB that the first time the plugin runs, it will attempt to install the RedisGraph module onto the Discourse redis instance; I haven’t had trouble with this in the past installing on a local instance, but haven’t run it through the launcher rebuild app install process before.

This will expose an endpoint, /graphryder/query, which should be accessible to API keys which either have site admin access, or are members of an annotator group.
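The access rule described here boils down to a simple either/or check. The Python sketch below only illustrates that logic; the `ApiUser` class and its fields are hypothetical stand-ins for whatever Discourse resolves from the API key, and the group name is taken from the discussion above.

```python
from dataclasses import dataclass, field

ANNOTATOR_GROUP = "annotator"


@dataclass
class ApiUser:
    """Hypothetical view of the user behind an API key."""
    username: str
    admin: bool = False
    groups: set = field(default_factory=set)


def may_query_graph(user):
    """Allow site admins and members of the annotator group."""
    return user.admin or ANNOTATOR_GROUP in user.groups
```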

I make no guarantees about the robustness of the graph, or the endpoint returning values from the graph just yet, although simple queries like

MATCH (post:post) RETURN post, count(*) as count

seem to be going through okay. It would be possible at this point to b0rk the RedisGraph db via this endpoint if you wanted to, although nothing a re-import using the above commands couldn’t fix.
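For clients written in Python, such as Graphryder API, a request to this endpoint could be assembled roughly as follows. This is a sketch using only the standard library; it mirrors the path, JSON body, and headers of the curl example in this post, while the function name and the example host are placeholders.

```python
import json
import urllib.request


def build_query_request(base_url, api_key, cypher):
    """Build (but do not send) a POST request carrying a Cypher query."""
    body = json.dumps({"query": cypher}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/graphryder/query",
        data=body,
        headers={"Content-Type": "application/json", "Api-Key": api_key},
        method="POST",
    )


# Sending it requires a live instance, for example:
#   with urllib.request.urlopen(build_query_request(url, key, query)) as resp:
#       payload = json.load(resp)
```

Keeping request construction separate from sending makes it possible to check the URL, headers, and body without a running Discourse instance.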

Here’s an example curl to try out:

curl -X POST http://<instance>/graphryder/query -d '{"query": "MATCH (u:user) RETURN u, count(*) as count"}' -H "Content-Type: application/json" -H "Api-Key: <api_key>"

and the results I’m currently getting on my test instance:

hugi · June 17, 2020, 8:39am

I propose that @daniel installs this on the Babel Between Us forum for testing. Sounds good @matthias? We could of course test it on ER main too, I’m just thinking that the data on BBU is less sensitive in case there would still be some security bugs to iron out.

daniel · June 17, 2020, 11:26pm

Seems good. @matthias agrees that we install it. I will install it today (Thursday) on the BBU installation.

gdpelican · June 18, 2020, 2:57am

Alright I’ve started syncing this with the Graphryder API, and will track my progress here:

hugi · June 18, 2020, 7:57am

Yeah, I suspect that a lot of these will need some work and thought.

For example, come to think of it, since we are now moving to having multiple SSNA datasets in the same database we will need to modify the model slightly.

Since we are still in the “Phase 1” step, we are assuming that we will have one Graphryder API install per SSNA dataset. However, many SSNA datasets can exist on the same Discourse install. Usually, an SSNA dataset is defined by one or more Discourse topic tags (not to be confused with the annotator’s ethnographic tags, which we usually call “codes”). If you look at the importer script of the Graphryder API, it builds the Neo4j database from topics with the tags it is given in the config file.

In our case, we will instead need to modify all the Cypher queries so that they only get the data related to a certain topic tag. Luckily, this sort of relational query is exactly what Cypher is built for and I have written a lot of them. Once the plugin is installed on BBU, we can have a go at it.
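A tag-scoped query could take roughly the following shape, shown here as a small Python helper that builds the Cypher string. The node labels and relationship types (`TAGGED_WITH`, `IN_TOPIC`) are hypothetical, since the actual schema of the RedisGraph graph is not specified in this thread.

```python
import json


def posts_for_tag_query(tag_name):
    """Return a Cypher query for posts in topics carrying the given topic tag.

    Labels and relationship types are assumptions for illustration only.
    json.dumps is used as a rough way to quote and escape the tag name.
    """
    tag_literal = json.dumps(tag_name)
    return (
        f"MATCH (t:tag {{name: {tag_literal}}})<-[:TAGGED_WITH]-(topic:topic)"
        "<-[:IN_TOPIC]-(p:post) RETURN p"
    )
```

Every existing query would start from the tag node in a similar way and walk outward, so that only the subgraph belonging to one SSNA dataset is returned.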

In the meantime, are topic tags already tracked in RedisGraph?

gdpelican · June 18, 2020, 11:47pm

I’ve just put in a bit of code to sync TopicTags, so hopefully that will set us up to query by tag, perhaps via a separate parameter to the graphryder/query endpoint

In the meantime I’ll continue pressing through and identifying issues with the other endpoints