TL;DR: Our current strategy for making Graphryder multi-tenant is not working out as well as I had hoped. This is a lengthy post describing the problem, which ends with my advice on how to solve it in the short term and in the mid-to-long term.
Background to multi-tenant Graphryder
Graphryder’s (“GR”) current architecture can only handle one SSN (Semantic Social Network) graph at a time. This limitation propagates through many levels of Graphryder’s architecture.
- GR’s database, the Community edition of Neo4j, can only handle one graph per running instance of Neo4j.
- The GR API is responsible both for building the Neo4j database and for creating, from the Neo4j data, the Tulip graphs served to the dashboard. This architecture assumes that the Neo4j database only contains data for a single project. Furthermore, the GR API can only be linked to a single Neo4j database.
- GR’s dashboard can only be linked to a single API.
We have long wanted to make Graphryder multi-tenant – able to handle data from multiple SSN graphs at all of the levels above – to lower the overhead of setting up a new SSNA-ready project. The hard part is making the GR database and API multi-tenant, and that is what this post will focus on.
An overview of the challenge
The architecture is currently very reliant on the database only containing a single SSN. To explain why, I will quickly give an overview of how the Graphryder code is structured:
- importFromDiscourse.py is the module that builds the Neo4j database. Its classes are called from the settings update routes to rebuild the database from scratch from a given set of posts, usually loaded through a tag. Currently, it is only reliable when called to rebuild the database from scratch.
- routes are the libraries of classes executed when different routes to the API are called. There are approximately 15 different libraries, totalling about 2000 lines of code. Most of these classes contain Cypher (the query language for Neo4j) queries to return subsets of the data loaded into the Neo4j database. All of these queries assume that the database only contains a single SSN graph.
- graphtulip is a library of 10 classes responsible for building the Tulip graphs – the data files describing how to draw the graphs on the dashboard. It is about 1700 lines of code in a few files in the graphtulip module of the GR API code. The heavy lifting in this code is done with Cypher queries, all of which assume that all data in the database is relevant to the SSN Tulip graph. These classes are in turn called by the tulipr classes in routes. The sketch after this list shows the single-graph assumption these queries share.
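To make the single-graph assumption concrete, here is a minimal sketch of the kind of query the routes run. The labels, property names and connection details are hypothetical, but the shape is the point: the query matches everything in the database, because “everything in the database” and “the current project’s SSN” are assumed to be the same thing.

```python
# A minimal, hypothetical sketch of a typical single-tenant route query.
# The labels and property names are illustrative, not the real schema.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

def all_users():
    # Matches every user node in the database. The query has no way to
    # ask for a *specific* project's SSN, because the database is
    # assumed to contain exactly one.
    query = """
    MATCH (u:user)
    RETURN u.username AS username
    """
    return [record["username"] for record in graph.run(query)]
```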
Approaches
I will now go through a few different approaches to making Graphryder multi-tenant. Throughout, we will carry a few assumptions about what we mean by that:
- We want to be able to store SSN graphs for multiple projects on a single instance of Neo4j
- We want a single instance of the API to be able to handle building multiple SSN graphs on Neo4j from separate sets of tags on the platform
- We want a single instance of the API to serve Tulip graphs and other data to the dashboard for multiple projects
Approach 1: Multiple databases on one Neo4j instance
The easiest way to achieve multi-tenancy would be to have the API connect to multiple databases on the same Neo4j instance, and then configure the routes to keep track of which database is being called. Most database systems offer to run multiple databases on the same running instance, but Neo4j did not offer this functionality until mid 2019, when it was made available in the Neo4j Enterprise Edition 4.0 pre-release. However, the Enterprise edition is proprietary and requires a licence.
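For reference, this is a minimal sketch of what the approach could look like from the API side, assuming Neo4j 4.0 Enterprise and the 4.x Python driver (the database name, schema and credentials are hypothetical):

```python
# Hypothetical sketch of Approach 1: one database per project on a single
# Neo4j 4.0 Enterprise instance, using the official neo4j 4.x Python driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def users_for_project(project_db):
    # The route passes the project's database name (e.g. "poprebel")
    # along; the Cypher queries themselves stay unchanged.
    with driver.session(database=project_db) as session:
        result = session.run("MATCH (u:user) RETURN u.username AS username")
        return [record["username"] for record in result]
```

The appeal is clear: the Cypher queries themselves would not need to change, only the session setup.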
Under our FOSS principles and operating procedures, this is probably a no-go. However, it’s important to note that we would very likely be given a free licence if we asked for one: startups with <=50 employees can contact Neo4j to receive a free Startup License for Neo4j Enterprise, and if we ever grew out of that bracket, the cost of a licence would probably not make a big dent. In this case, it’s our idealism and principles that keep us from going with the simplest solution to the problem. I think those principles are sane, especially considering that investing in software that depends on proprietary software creates a very uncomfortable lock-in.
It should also be noted that even if we went with this solution, it would still be risky, as it also means upgrading Neo4j to a much newer version, which would most likely break some of our code.
Approach 2: Get rid of Neo4j
Neo4j is not the only graph database out there. There are some open source alternatives. However, most of them don’t support Cypher, which would mean rewriting all queries in some other query language like Gremlin.
I recently found a very promising alternative called RedisGraph. It is a module for the Redis database, a technology we already use for Discourse. Amazingly, it does support Cypher. It comes with some challenges, though:
- RedisGraph is very new, and only matured to its first stable release in late 2018.
- While RedisGraph does implement Cypher, it does not implement the entire range of functions that Neo4j does. Some of those Neo4j-exclusive functions, like shortestPath, are used by the GR API and would have to be replaced with something else.
- There is no guarantee of how much of our Cypher code would work with RedisGraph, and we wouldn’t find out until we had invested significant work. The sketch after this list shows roughly what a first compatibility test would look like.
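Here is a minimal sketch of such a test, using the redisgraph-py client (the graph name and schema are hypothetical):

```python
# Hypothetical smoke test: run a GR-style Cypher query against RedisGraph.
# Assumes a local Redis server with the RedisGraph module loaded and the
# redisgraph-py client library installed.
import redis
from redisgraph import Graph

r = redis.Redis(host="localhost", port=6379)
graph = Graph("ssn_poprebel", r)  # RedisGraph keys each graph by name

result = graph.query("""
    MATCH (u:user)-[:TALKED_TO]->(v:user)
    RETURN u.username, v.username
""")
for record in result.result_set:
    print(record)
```

A nice side effect: because RedisGraph keys each graph by name, the per-project separation that Neo4j Community makes hard would come essentially for free.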
Because of the above, I would say that while it’s very promising to use Redis for both the SSN graph and the primary data cache from Discourse, we should not go down that path until we are ready to commit significant resources to rewriting the entire API if need be. It’s months of work, and should ideally be split among a team. It’s the sort of work we could do if we got a grant or investment specifically for the SSNA tech stack.
Approach 3: One database with project SSN sub-graphs
If we’re stuck with Neo4j for now, there are ways to fake having multiple databases. In the Neo4j developer community, there are various recommendations for how to create sub-graphs with labels or new relationships.
This is the approach I proposed and that Matthias asked me to start working on. We agreed that I’d let him know if it looked like it would be more than about two weeks of work to get it done. After mapping that out and running some early code trials, I no longer think it’s possible to do in this timeframe.
Problems with Approach 3
This is what I have come to realise about the chosen approach:
Importing Discourse content is the easy part
Building subgraphs is easy. We would simply have to add a property or label to each node and relationship on import, and when updating we would just delete all objects with that property or label and import again, as in the sketch below. This basically means reworking importFromDiscourse.py so that it can accept hard-update calls for the different SSN graphs.
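Here is a minimal sketch of what that import-and-hard-update pattern could look like. The project label and schema are hypothetical, and there are two caveats: Neo4j only allows labels on nodes (relationships would get a project property instead), and Cypher cannot parameterize labels, so the label has to be interpolated.

```python
# Hypothetical sketch of label-scoped import and hard update. "poprebel"
# stands in for a per-project label; the schema is illustrative.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

def hard_update(project, users):
    # 1. Wipe this project's subgraph, leaving other projects untouched.
    #    Labels cannot be query parameters, so the label is interpolated;
    #    real code would need to whitelist it.
    graph.run(f"MATCH (n:{project}) DETACH DELETE n")
    # 2. Re-import from scratch, tagging every node with the project label.
    for user in users:
        graph.run(
            f"CREATE (u:user:{project} {{username: $name}})",
            name=user["username"],
        )
```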
There are more calls to Neo4j than expected
I had not correctly estimated how many different calls are made to the Neo4j database. I assumed that the graph data was front-loaded in a few calls early on. This is not really the case: plenty of calls to Neo4j happen at various times while interacting with the dashboard, and I was not well aware of them at the time.
Most of these queries are probably not too hard to update with a clause that only considers objects with a property or label passed in through the route, as illustrated below. There are just very many of them.
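For illustration, a hypothetical before-and-after for a single query; the real queries are longer, but each would change in the same mechanical way:

```python
# Hypothetical before-and-after for one route query.

# Before: the query implicitly means "everything in the database".
BEFORE = """
MATCH (u:user)-[r:TALKED_TO]->(v:user)
RETURN u, r, v
"""

# After: objects are tagged with a `project` property on import, and
# each route passes the project along as a query parameter.
AFTER = """
MATCH (u:user {project: $project})-[r:TALKED_TO]->(v:user {project: $project})
RETURN u, r, v
"""
```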
There are a lot of routes to reconfigure
More routes than I thought need to be configured to be aware of multiple graphs. There is a lot of junk in the code and a lot of routes to consider. When I made my early estimate, I thought most of these were unused, but it turns out they are used, in places I hadn’t looked at closely enough.
Again, this is not hard, just tedious and time-consuming.
The patterns are not standardised
The code is not as standardised as I thought. It is pretty clear that different people have implemented different practices in the modules they have developed. One example that makes things more complicated: while the calls to Neo4j in the routes use pure Cypher, the calls in the importer and graphtulip sometimes use the py2neo library and sometimes pure Cypher, as illustrated below.
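Here are the two styles side by side, with a hypothetical schema:

```python
# The two coexisting styles of talking to Neo4j, illustrated with a
# hypothetical schema.
from py2neo import Graph, Node, Relationship

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Style 1: pure Cypher strings, as in the routes.
graph.run("MATCH (u:user {username: $name}) RETURN u", name="alice")

# Style 2: py2neo objects, as in parts of the importer and graphtulip.
alice = Node("user", username="alice")
bob = Node("user", username="bob")
graph.create(Relationship(alice, "TALKED_TO", bob))
```

Making the code multi-tenant means threading the project label or property through both styles, which is part of why the inconsistency costs time.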
graphtulip is complicated
All of the complications up to this point are fixable with time and diligence. From what I can see, there isn’t really anything there that I don’t understand how to do. It would just take more time than I first thought.
graphtulip is a bit of a different story. While working on this, I have realised that I don’t understand it well enough to confidently dig into it. I would first have to spend some time getting to know that library, which also has a lot of its own Cypher queries, and they look more complicated than those in the routes.
Since graphtulip is the part of the code that prepares the graphs for the dashboard, GR is of little use unless it works flawlessly.
Bottom line: Time needed is at least 4x what was budgeted
Taking all of these issues together, I think we are really looking at closer to 8 weeks rather than 2 weeks. Even if we had the cash, I don’t think it would be a reasonable investment in the Neo4j-based GR. With that amount of money, I think it would be a better idea to just start porting GR to RedisGraph.
And now that we know it is more complicated than we thought, I also think that putting such a big project on a single developer is bad practice. The money would be better spent on two people or a small team who can work together to avoid getting stuck and can cross-review each other’s work.
What now?
These are my recommendations, pending approval from @matthias and advice from @alberto.
Short term: We don’t bother making it multi-tenant
For POPREBEL, I would recommend that we bite the bullet for now and run one new VPS per Graphryder install, at 30 USD per month each. Each VPS can also host the dashboard for its own install. It is really quite trivial to set up, if we accept that we will have a little cluster of VPSs on Digital Ocean. I can take on the responsibility of running them all.
Setting up a new GR VPS like I did earlier takes me no more than a couple of days, tops. I would need to fix a few things on the one I have already set up, but then we can use it as a template and just clone it when we need a new instance. After that, all that needs to be done is to update the config files and configure the domain.
Mid-to-long term: We plan for a big reworking of GR
With the POPREBEL deadline off our back and a way to deploy GR with VPSs, we should start experimenting with RedisGraph and refactoring the GR API. I am ready and willing to make room for this during the first half of 2020, in a way that allows for more experimentation and less of a rush.
Having GR build a graph in Redis opens up a pretty exciting opportunity: creating a complete graph mirror cache of all the data on the platform, updated in real time. This is obviously a completely new feature, but if we wrote a Ruby module for Discourse that could interact with RedisGraph, it is perfectly plausible that this could work. We would be adding powerful new functionality, not just duplicating existing work with different tech.
Thoughts?