This document details the software architecture and design decisions to implement the Open Ethnographer application according to the requirements. It is structured along the list of components and main features to implement.
User interface: Code editor
- Drupal taxonomy UI. Only possible when storing codes in Drupal tags, obviously. Allows for comfortable creation of new codes. By default, the Drupal taxonomy UI would not be integrated on the same page as the coding interface, but it can be kept open in a parallel tab. It would also be possible to integrate it into the coding user interface, for example as an additional tab of the eComma UI, which is the basis here.
If choosing to use code hierarchies (rather than codes as non-hierarchical tags), the Drupal taxonomy UI also allows creating and modifying this hierarchy by drag&drop.
User interface: Coding
Of course, Annotator has to be modified to allow for stand-off annotation with random word IDs, and to provide our intended options for choosing tags etc. However, basic Drupal integration does exist (module annotator, and annotation for local storage inside Drupal). And there is the Hypothes.is Annotator version, which is already highly modified and might offer some takeaways.
- eComma. This is a nice find – see their demo. It is a fully open source (GPL), self-hosted plugin for Drupal, funded by U.S.-based educational institutions and published just recently (2014-09). It contains the complete basic UI needed for word-level tagging with Drupal taxonomy terms by multiple users, and for exploring and displaying tags in the current text, incl. seeing who tagged what and when.
- Host organization: COERLL. The "Center for Open Educational Resources and Language Learning", at the University of Texas at Austin, U.S.A.
- Project director / coordinator: Prof. Carl Blyth. [source]
- Head programmer: Nathalie Steinfeld Childre. [source] Here on drupal.org.
- Project homepage.
- Project list entry.
- GitHub repository. It is unclear so far if this or the Drupal repository is the authoritative one.
- Hypothesis. A free and open source annotation software, also based on Annotator (providing a highly customized version that might offer benefits). Hypothesis is however a browser plugin, so the architecture is not really fit for converting it to a Drupal module: even if forcing the data storage into Drupal, it would allow annotating everything, and would not know the relation to Drupal content types. Also it seems a bit early in development: it causes a lot of CPU load even when idle, and is a bit sluggish when loading and showing annotations.
- RDFaCE. An open source project developed at University of Leipzig, Germany. The editor is based on TinyMCE, just adding a toolbar to it, and nice highlighting of codings. So ideally, edgeryders.eu would be switched from its current CKEditor to TinyMCE globally. It has to be checked how RDFaCE can be informed about the possible code taxonomy that it can use.
- WissKI editor. Again, an editor for adding semantic annotations, based on TinyMCE and adding some plugins to it. It is part of WissKI, a larger software initiative for semantic data in science, by a consortium of German research institutions. It is somewhat in active development, but RDFaCE seems more active and more apt for what we need here (just comfortable tagging!). However, WissKI also includes a nice graph-based navigation of the semantic data, which could be added later when adding analysis features.
- CKEditor plugin. The Open Ethnographer prototype used the CKEditor styles plugin for selecting codes. Similar to this, a dedicated plugin with tree-like selection of codes could be developed.
- GroupDocs.Annotation. A Drupal plugin that performs much the same thing as eComma above, but is neither open source nor self-hosted. It is instead cloud software with a pretty demanding pricing model. However, its features and UI are worth exploring.
- co-ment. A Drupal plugin for adding comments as annotations to selections of text. The comments are stored in a proprietary web service software though, and the plugin seems to not have tagging.
Storage of code definitions
- Drupal taxonomies. This is how the chosen eComma base software stores tag definitions (in collaboration with the community_tags module). By going this way, the advantage is having a storage technology, a code and code hierarchy editing UI, and lots of contributed taxonomy-related modules ready already. Drupal taxonomy terms are normally used to tag content elements (nodes, comments) and not selections of text, but this extension would be the contribution of Open Ethnographer.
Also, when using taxonomies, tagging content elements manually in a special field allows marking which content elements have already been tagged with a certain tag (and its subtags); to save manual effort, one would use the highest-level applicable tag, for example to express that all “events” have been tagged already. This creates problems however, as hierarchies can change at any point in time …
Contributed Drupal modules that can help with implementing code hierarchies as taxonomies:
- private_taxonomy. Allows implementing code storage as one single taxonomy, which is preferable to "one taxonomy per researcher" to prevent namespace cluttering. With this, every researcher can only see and edit the terms they created (and if configured, see the terms others created). When combined with the "Drupal taxonomies plus a content type" mode of implementation (see below), it allows shared physical codes and private logical codes (i.e., private code hierarchies).
- taxonomy_access. Allows to govern access to taxonomy terms by user role ("Two term access types: View tag, Add tag."). However, not by user, which makes this hardly usable.
- taxonomy_tools. Includes a taxonomy_role_access submodule, which seems preferable over taxonomy_access as it only comes with the functionality needed here. But again, it does not allow giving access on a per-user basis. Also has the taxonomy_copier submodule, which allows creating copies of whole parts of a taxonomy hierarchy, optionally also including the tagged nodes.
- Comment tagging. When assigning taxonomy terms to nodes (to say they contain this term in word-level tagging), the same has to be possible with comments. Fortunately, it is, by simply adding a term reference field to comments [source]. However, adding taxonomy terms to nodes / comments in this way is redundant and should be avoided if possible. It would however allow for a simple implementation of a quotation manager via Drupal taxonomy term search.
- Drupal taxonomies plus a content type. This solution is similar to the "Drupal taxonomies" alternative above. But in addition, it allows putting one code (from any researcher) into multiple hierarchies, which would otherwise need another custom module. Multiple hierarchies are needed because: "physical" codes (which are actually placed into texts as RDFa tags or similar) are represented by nodes of a custom content type here, while "logical" codes (those higher up in the hierarchy, only physically visible in exports) are represented by taxonomy terms. Now, multiple researchers can incorporate the same representations of physical codes into their coding hierarchies by tagging the nodes that represent these physical codes with the taxonomy terms that represent their logical codes. A problem could be that physical codes are not as easy to move in the hierarchy as logical ones, since a different interface is used for that, one that requires editing the node and assigning taxonomy terms there. But maybe a contributed module provides an integrated interface.
Storage of codings
This is a very basic decision with profound effects on how and for what Open Ethnographer is to be used. So we are conducting a small survey to gather ethnographers’ opinions on this before deciding.
- Word-ID based stand-off annotation and Drupal's database. The main advantage is that this technique allows coding the live version, private coding, and sharing one's codings all at the same time. Stand-off annotation is a term for storing the annotation outside of the annotated text – for an introduction, see Translation Driven Corpora, p. 97. This usually involves adding word ID tags to the annotated text, which then allows word-level coding, but not finer (that is, it is not possible to include only part of a word in a coding). The idea is to put a special word marker tag around each word of live content (like <span id="w-ed51f">…</span>) and then to store each coding externally with a node ID / comment ID and a range or enumeration of references to word marker tags.
This technique still allows comfortable coding by adding markup with a rich text editor, but when saving, instead of saving the markup into the live version, the codings would be stored into a special database table with node ID / comment ID and a range of text points. Word tags would be created automatically when a user creates a node or comment, and would be included invisibly when editing the same piece again later. They are not shown when just viewing the piece of content. So when editing content and adding words, more text point tags would be added when saving.
- Allows private storage. Codings are stored outside the live content, so they can be kept private.
- Allows selective sharing and publishing. Also allows selectively sharing codings with selected other researchers, and selectively publishing them into the live version. This involves pre-processing the live version before output. The preprocessing would remove the text point tags and instead insert tags with RDFa attributes for all public codings. Similarly, only these selected public codings would be shown to a researcher while coding the same text (and the researcher could even select just a subset of them).
- Allows intersecting codings. XML does not allow for partially intersecting tags. While a single researcher can work around this by splitting one of the intersecting tags in two, it becomes a problem when conflicting with codings from other researchers stored into the same text. With stand-off annotation, this is not a problem since codings are stored externally.
- Separates data and presentation. By storing codings not right into RDFa tags (a mere presentation form), it is possible to later decide that rendering them as (say) microformats would be better, or even allowing website visitors and search engines a choice of rendering format. It also facilitates exporting to the many linguistic / ethnographic interchange formats later (AIF, Open Annotation etc.). This is similar to the RDF support in Drupal 7 Core, where RDFa tags etc. are all automatically generated from data stored in other forms.
- Keeps codings intact after content edits. When using quasi-random word IDs and enumerating them all for storing a coding, existing codings can be kept intact when inserting words, deleting words and even when moving substantial portions with cut&paste. Words without text point markup will be assumed as new and included into an existing coding when saving (with newly created word IDs of course), words with text point markup will not be.
- Allows simple implementation of quotation manager and semantic search. With stand-off annotation, codings would be stored externally, probably in an SQL table as "node ID, word ID, code ID" triples. This makes it easy to implement a fast quotation manager and fast semantic search, as it's just SQL queries on this table (including AND queries for semantic search etc.). The same table, or a separate similar table, might hold codings for higher-level "logical" codes – with external storage however, there is no conceptual difference between them any more. It is also possible to translate queries including "logical" codes to queries of the "physical" codes that they currently refer to, which makes it unnecessary to store "node ID, word ID, code ID" triples for logical codes.
- Allows jumping to codings from a quotation manager / search results. This works because HTML IDs in <span id="w-ed51f">…</span> tags can be used as fragment identifiers in URLs.
- Optional: Compact storage of whole-paragraph annotations. In addition to having word tags, there could also be paragraph tags, and paragraph tag IDs could be used as prefixes for word IDs.
- Performance problems. From somebody who tried: "When every word of a piece of content has a DOM element, there are significant performance issues when content has more than 20,000 words, especially noticeable in the slower browsers." [source]
This can however be fixed easily by only using word IDs for words that get highlighted. That is, Annotator would tell the server, using its default XPath selectors, which words to annotate, and the server would then save them with word ID references after adding the relevant word IDs to the source text. This also prevents the “messy HTML” issue discussed below.
- Makes the HTML messy. The HTML becomes totally messy and bloated when editing one's own content in HTML source. This can be mitigated by also modifying the HTML source editor so that it hides text point tags while still keeping them – but that's a lot of work. A simpler and also sufficient way to mitigate this is to generate word IDs only when the first researcher starts coding a piece of text. Since this usually happens some weeks after the content creation, when edits by original authors are rare, the messy HTML will hardly be noticed.
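The word-ID scheme of this alternative can be illustrated with a minimal Python sketch. It assumes plain, whitespace-separated text in which the only markup is the word marker span itself; real Drupal content would need a proper HTML parser. The function names (tag_words, strip_word_tags, word_ids) and the five-digit hexadecimal ID format are assumptions modelled on the w-ed51f example, not part of any existing module.

```python
import re
import secrets

# One already-tagged word, e.g. <span id="w-ed51f">word</span>
WORD_SPAN = re.compile(r'<span id="(w-[0-9a-f]+)">(.*?)</span>')
# A token is either an already-tagged word or a run of non-whitespace.
TOKEN = re.compile(r'<span id="w-[0-9a-f]+">.*?</span>|\S+')


def new_word_id(used):
    """Return a quasi-random word ID that is not yet used in this text."""
    while True:
        word_id = "w-" + secrets.token_hex(3)[:5]
        if word_id not in used:
            used.add(word_id)
            return word_id


def tag_words(text):
    """Wrap every untagged word in a word marker span; keep existing IDs."""
    used = {m.group(1) for m in WORD_SPAN.finditer(text)}

    def wrap(match):
        token = match.group(0)
        if WORD_SPAN.fullmatch(token):
            return token  # an existing word keeps its ID
        return '<span id="%s">%s</span>' % (new_word_id(used), token)

    return TOKEN.sub(wrap, text)


def strip_word_tags(text):
    """Remove the word marker spans again, e.g. before plain viewing."""
    return WORD_SPAN.sub(r"\2", text)


def word_ids(text):
    """List the word IDs of a tagged text in document order."""
    return [m.group(1) for m in WORD_SPAN.finditer(text)]
```

Calling tag_words again after an edit only adds IDs to the newly inserted words, which is what keeps existing codings attached to the words they were made on.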
- Robust DOM-friendly annotation anchors. The so-called DOM-friendly annotation anchors store more about a DOM path than simple XPath selectors. They have been developed into a plugin for Annotator [see], so they are quite ready to use. In cases where they still do not allow re-anchoring an annotation after changes, a heuristics-based approach using the content quote would be used.
- Character based stand-off annotation updated by CKEditor-created diffs. The idea is to develop a plugin for CKEditor that hooks into its existing "undo" feature and records character movements and insertions ("123 characters inserted at position 456, then range 23-45 moved to position 67, then …"). Upon saving, CKEditor would also return this record, and it would be used to immediately update existing annotations made on the node or comment in question. Of course, this mandates that only CKEditor is used for editing nodes and comments with annotations. The source code editor would have to be disabled for "ordinary users". For the rare situations where admins change content with annotations on source level, heuristics-based change discovery with manual confirmation can be used to infer the new position of annotations. The disadvantage of this scheme is obviously that one annotation record only refers to one revision of a node or comment. If Drupal node / comment revisions are not in use, this is not a problem as old versions are forgotten. If they are used, annotations could not be shown for the older revisions, which is a problem if one wants to revert a revision because of, for example, spammy wiki edits. To solve this, annotations would have to be revisioned as well, referring to the annotated content with both an entity ID and a revision ID.
- Character-based stand-off annotation and diff-type Drupal revisions. This is an interesting novel idea for handling changes of base texts. Each annotation is saved referencing a specific revision of the base text – the revision to which it was added. When displaying annotations, some refer to the current latest revision and can be displayed automatically, since their XPath and character index based ranges are still valid. Others refer to earlier revisions and have to be re-calculated on the fly. This is done with the help of diffs between revisions. These diffs can mostly be calculated with a diff tool (see also Drupal's diff project). However, since we are also interested if annotated text has just been moved (so the annotation will be moved) or deleted and rewritten (so the annotation will be deleted), there is one addition: diffs are calculated between versions of the HTML that contain invisible markup for existing annotations. This markup is then deleted again before saving the changed text, but is important for the diff, which is also saved as it cannot be re-calculated after the markup has been deleted and the changed text saved.
- Stand-off annotation and a specialized database. The database chosen for stand-off annotation is the normal Drupal RDBMS (so, MySQL/MariaDB or PostgreSQL), to keep system complexity low. Performance is not critical in any way, since so far only coding is done inside Open Ethnographer, while analysis is done after an export to external QDA tools. When adding analysis features later and using Open Ethnographer to process huge corpuses, a more specialized database like the MonetDB column-oriented database can be employed. It provides special search operators that can be used for stand-off annotation to query for example for overlapping codings [source]. Another way to store annotations in a more performant way would be to employ a triple store like Virtuoso RDF, however as argued below about OSF, this seems "architecturally wrong" to us for a Drupal plugin.
- Stand-off annotation and OSF for Drupal. OSF is the Open Semantic Framework, a middleware that connects Drupal via modules and web services to proven semantic web engines (Virtuoso, GATE, OWL API 2, Solr, and Memcached) – see the architectural diagram. Since this includes a triple store, and a triple store is a performant way to store and query ethnographic codings, choosing a solution with OSF seems tempting. However, two reasons let us stay away. First, the huge system complexity that would discourage collaboration in the open source project similar to how it happened to Google Wave after its open source release. Second, the Drupal 7 Core support for RDF treats RDF as a presentation layer for data. So there is no triple store at all, instead data is pulled from the normal Drupal database and rendered into RDF on demand. This is natural, as all the metadata that now goes into RDF was present in Drupal anyway, and storing it into a triple store additionally would be redundant. So the clean way to use semantic data in Drupal is to store it in native (core or module specific) data structures and use semantic formats only for presentation (which is, interfacing with other software, as that's what the semantic web is for). So we should also prefer module specific data structures inside the Drupal database rather than pulling in a triple store. (It would be different if Drupal would store all its metadata in a triple store and only in a triple store, natively, by architecture. It's just not the case.) So the only use of a triple store (and the whole of OSF, for that matter) that I can see in Drupal is harvesting and querying linked data from other websites. Open Ethnographer however only queries own data. If in the future, an ethnographic analysis software would be built that harvests codings from multiple websites, this would use a triple store, but Open Ethnographer would be again unaffected. 
This is because Open Ethnographer would handle the coding inside Drupal, other applications would handle the coding inside other CMSs, and this new software with its triple store would handle the analysis (quite a natural separation).
- Stand-off annotation and Apache Stanbol. Stanbol is a "semantic engine" for managing the semantic data exposed by a CMS. There are several modules to connect it with Drupal (VIE, iksce – more on this). It seems that metadata would usually be stored in Stanbol redundantly, in addition to its native storage inside the "traditional CMS". So regarding Open Ethnographer, the same argument applies that led to forgoing OSF (see above): a specialized semantic data store has only a clean place in the Drupal architecture if used for harvesting and querying semantic data from multiple websites. But that's not what Open Ethnographer is for.
- Private storage in copies. If private storage is needed, it could happen in a special "coded post" content type and its coded comments, only accessible to the researcher creating it, and linking to the original uncoded content via an entity_reference field. Alternatively, the copies could be stored in cloned nodes of the same content type, created with node_clone.
Advantages: (1) Researchers are not forced to apply open notebook science. Disadvantages: (1) No double use for semantic web integration (since the copied versions cannot be public for users or search engines – the redundancy would be confusing and unnavigable), (2) no collaborative coding beyond starting from a base version (there is no reasonable implementation to take over codings from a different document later), (3) authors’ later edits are not included in the coded versions. In effect, much of the potential disruptiveness is not effective here, as the tool would be equivalent to doing copy&paste offline ethnography plus hyperlinks to the original content and author profiles.
- Public storage in the live node / comment and its revisions. A revision would be automatically created when finishing coding a piece of text, so that confirming research results is possible by searching through these versions. When exporting to a third-party analysis tool, only selected codes (own and others') would be exported. Also in the editor, one would only see selected codings, even though all are present as tags.
Advantages: (1) all codings are public, so they have a double use of adding semantic web integration, (2) collaborative coding at any time by incorporating other researchers’ codes into one's code hierarchy, (3) authors’ edits in the coded document are possible and automatically coded, (4) no redundancy. Disadvantages: (1) researchers might be opposed to open notebook science, (2) original content authors and other researchers can modify and remove codings in the live version (it does not affect stored revisions though, so the latest revision of each content item that was created by the researcher would be used when this researcher does an export; this requires enforcing the creation of revisions except when the latest one is created by oneself, too).
- Private storage in revisions. Drupal's node revisions are not a reasonable storage for private storage of codings, since there are no revisions for comments, all revisions would be accessible by all researchers, and revisions use linear versioning while we would need a version tree.
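For the character-based alternatives above, the re-calculation of an annotation range between two revisions can be sketched with Python's standard difflib module. This is only a simplified illustration under stated assumptions: it compares plain text rather than HTML with invisible annotation markup, it does not detect moved text, and it gives up (returns None) as soon as any part of the annotated range was changed. The function name remap_range is hypothetical.

```python
import difflib


def remap_range(old_text, new_text, start, end):
    """Map the half-open character range [start, end) of old_text onto new_text.

    Returns the shifted (start, end) pair if the whole range survived
    unchanged, or None if any part of it was deleted or replaced.
    """
    matcher = difflib.SequenceMatcher(None, old_text, new_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only an "equal" block that fully contains the range can carry it over.
        if tag == "equal" and i1 <= start and end <= i2:
            shift = j1 - i1
            return start + shift, end + shift
    return None
```

A range that comes back as None would be a candidate for the heuristics-based re-anchoring with manual confirmation mentioned above.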
Coding syntax for presentation
As discussed in the “Storage of codings” section above, the storage format can be very different from the presentation format into which codings are converted when showing coded live content (or when exporting – usually using a different presentation format again that depends on the target application). Alternatives for the presentation format for publishing inside Drupal (also usable for input with a rich text editor):
- RDFa with dc:subject and skos:Concept. This usage of established RDFa vocabularies is how Drupal 7 Core RDF functionality encodes the relationship between taxonomy terms and content items tagged with them [source]. So when using taxonomies for code hierarchy storage, rendering the RDF markup the same way as Drupal Core ensures a consistent RDF interface (esp. also when others use Open Ethnographer for more general semantic markup purposes).
It has to be investigated if this use of dc:subject fits into its specification, which says the element is to state “the topic of a resource”. And “resource” in Dublin Core is normally something having a title, author etc. (like a book).
- RDFa with Open Annotation vocabulary (future addition). Open Annotation supports semantic tagging natively, and semantic tags themselves can still be skos:Concept instances as in the currently chosen solution. The difference is that Open Annotation allows expressing the relation of the tag to the content: that it is an annotation, created by some author, at a specific time, for some purpose (like online ethnography) etc. This is desirable, as it allows using tagging information from third-party websites in every respect like one's own tags on one's own website. For the practical purposes of importing third-party semantic tagging and exporting to a QDA tool, however, this is not much of a benefit for quite a lot of complexity to handle, so supporting Open Annotation can be a future addition. The drawback is obviously that it deviates from the way Drupal Core renders tags (taxonomy terms) into semantic output.
In addition, Open Annotation is currently a draft specification developed by the W3C Open Annotation Community Group, which means it is not yet in any widespread use. However, the standard matches the task at hand nicely, and at least Domeo (not a browser based software though – see) intends to support this standard soon by exporting to it (while still using the now-deprecated Annotation Ontology standard internally).
- RDFa with own vocabulary. In Open Ethnographer RPT, RDFa was used as the syntax of choice to embed semantic tags into the live content [documentation].
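Whichever vocabulary is chosen, converting externally stored codings into inline markup has to cope with overlapping codings, since XML does not allow partially intersecting tags. A minimal sketch of one way to do this, assuming codings are available as enumerations of word IDs: split the text into runs of consecutive words that carry exactly the same set of codes, so that each run can be wrapped in one well-formed tag. The function name and data shapes are illustrative assumptions; the actual attribute syntax is left to the chosen presentation format.

```python
from itertools import groupby


def coded_segments(word_order, codings):
    """Split a text into runs of words that share the same set of codes.

    word_order: list of word IDs in document order.
    codings:    iterable of (code_id, iterable of word IDs) pairs.
    Returns a list of (word IDs of the run, frozenset of code IDs).
    """
    codes_by_word = {}
    for code_id, words in codings:
        for word_id in words:
            codes_by_word.setdefault(word_id, set()).add(code_id)

    def codes_of(word_id):
        return frozenset(codes_by_word.get(word_id, ()))

    # groupby only merges neighbours, which is exactly what is needed here.
    return [(list(run), codes) for codes, run in groupby(word_order, key=codes_of)]
```

Two codings that partially intersect thus become three adjacent segments, the middle one carrying both codes.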
A quotation manager is just the simplest form of semantic search, allowing to search for one coding at a time. Alternatives:
- Built on the stand-off annotation data. As discussed under storage alternative "Stand-off annotation and Drupal's database" above, a quotation manager is simple to build based on stand-off annotation data tuples (word ID, code ID). Plus, the annotator_view tag view is or will be basically a quotation manager interface for a single piece of content, and can be extended accordingly to show search result snippets for a tag, making it a full quotation manager.
- Drupal taxonomy term search. If code hierarchies are stored as Drupal taxonomies, and word-level tagging is synced on-save to node / comment tagging with taxonomy terms, normal Drupal taxonomy term search will provide quotation manager functionality. There is probably also auto-complete functionality for taxonomy term search.
This obviously only finds content elements. But here, the term is both present as a taxonomy term assigned to the node / comment, and as markup inside the text itself. So some custom code should be written to show and highlight search result snippets based on word-level tagging.
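The first alternative (built on the stand-off annotation data) can be sketched with an in-memory SQLite database standing in for Drupal's database. The table name coding, its column names and the sample rows are assumptions made for illustration; insertion order (rowid) is used here as a stand-in for the word order, which in practice would come from the text itself.

```python
import sqlite3

# Hypothetical storage table: one row per coded word.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE coding (node_id INTEGER, word_id TEXT, code_id INTEGER)"
)
conn.executemany(
    "INSERT INTO coding VALUES (?, ?, ?)",
    [
        (1, "w-0a1b2", 10),
        (1, "w-0a1b3", 10),
        (1, "w-0a1b3", 11),
        (2, "w-7f3c0", 10),
    ],
)


def quotations(conn, code_id):
    """Return {node_id: [word_id, ...]} for all words coded with code_id."""
    rows = conn.execute(
        "SELECT node_id, word_id FROM coding WHERE code_id = ? "
        "ORDER BY node_id, rowid",
        (code_id,),
    )
    result = {}
    for node_id, word_id in rows:
        result.setdefault(node_id, []).append(word_id)
    return result
```

Each node ID and word ID pair found this way is enough to build a link that jumps to the quotation inside the content.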
Not to be implemented for Open Ethnographer 1.0 (i.e., until 2015-02-28).
- Built on the stand-off annotation data. As discussed under storage alternative "Stand-off annotation and Drupal's database" above, a semantic search feature is simple to build based on stand-off annotation data tuples (word ID, code ID).
- External semantic web search engine. Creating a performant, feature-rich semantic search is a lot of effort and not possible within the scope of this project. So it could be a good idea to utilize the semantic web compatibility and public nature of the coding and simply use an existing, third party semantic web search engine for the quotation manager functionality. For example, Sindice. A page generated by Open Ethnographer would show all codes as links that query this external search engine accordingly.
- Drupal search indexing API. Drupal has a quite sophisticated API for faceted search indexing [introduction]. This could be used to implement indexing of semantic taggings, but it would still be a lot of work.
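For the first alternative (built on the stand-off annotation data), an AND query and the translation of a "logical" code into the "physical" codes it currently refers to could look roughly like the following sketch. SQLite stands in for Drupal's database; the table layout, the sample data and the shape of the hierarchy dictionary are assumptions for illustration only.

```python
import sqlite3


def make_demo_db():
    """Create a small in-memory table of (node ID, word ID, code ID) triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE coding (node_id INTEGER, word_id TEXT, code_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO coding VALUES (?, ?, ?)",
        [(1, "w-a", 10), (1, "w-a", 11), (1, "w-b", 10),
         (2, "w-c", 11), (2, "w-c", 10), (2, "w-d", 12)],
    )
    return conn


def words_with_all_codes(conn, code_ids):
    """AND query: words that carry every one of the given codes."""
    code_ids = sorted(set(code_ids))
    if not code_ids:
        return []
    placeholders = ",".join("?" * len(code_ids))
    sql = (
        "SELECT node_id, word_id FROM coding "
        "WHERE code_id IN (%s) "
        "GROUP BY node_id, word_id "
        "HAVING COUNT(DISTINCT code_id) = ? "
        "ORDER BY node_id, word_id" % placeholders
    )
    return conn.execute(sql, code_ids + [len(code_ids)]).fetchall()


def physical_codes(children, code_id):
    """Expand a code into the leaf ("physical") codes below it.

    children maps a logical code ID to the IDs of its direct subcodes;
    a code without an entry is treated as physical.
    """
    found, seen, stack = set(), set(), [code_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against accidental cycles
        seen.add(current)
        subcodes = children.get(current)
        if subcodes:
            stack.extend(subcodes)
        else:
            found.add(current)
    return found
```

Because logical codes are expanded at query time, no triples need to be stored for them, and a changed hierarchy takes effect immediately.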
QDA tool for external analysis
The final selection of the tool to export to will be done in collaboration with an ethnographer (@Inga_Popovaite). Candidate tools:
- RQDA. Desktop based, open source QDA software based on the statistics software package R. From the Spot the Future project, we already have a Ruby implementation of an export script for tagged content from the Open Ethnographer Research Prototype to RQDA. Benefits of using RQDA:
- Desktop-based. In text analysis, the delays of a web-based tool (esp. CATMA) are very annoying for productive work. A web-based tool would allow realtime syncing, but the delays and the inability to use it offline are just too annoying.
- SQLite file format. Which is great to handle for exporting to / downloading / archiving. Unlike CATMA's multi-file XML stuff.
- Single-user data structures. RQDA has a simple UI targeted at a single researcher, which is just the usage scenario for analyzing coded content that was exported from Open Ethnographer. No need for complex multi-user and shared data concepts at this point anymore; that is handled before, in Open Ethnographer, during the tagging phase.
- Better handling of complex queries. In RQDA, simple queries are done in the UI, and complex queries with the R language in a console.
- CATMA. A good quality, open source web-based QDA software written in Java, which we once found for the Spot the Future project. For data exchange, it uses multiple .txt files for the corpus and a TEI .xml file to store the code hierarchy and actual stand-off annotations, referenced by character range [examples]. With respect to web-based tools, this would still be the preferred candidate from a tech perspective, and it was also found to be way ahead of other alternatives from the ethnographer's point of view [see]. It was chosen first as the export target, but several issues finally led to choosing RQDA:
- Too complex UI and data structures. What annoyed most in CATMA is its uber-complex UI and concept structuring, coming mostly from trying to support multi-user / collaborative work, and from bad UI design. For Open Ethnographer, we don't need these multi-user features at all, since in our case Open Ethnographer is for the collaborative part, and text analysis (everything after the exports) is every ethnographer's own business.
- Poor handling of complex queries. In CATMA, they have a kind of proprietary query language that however misses some basic features like a quotation manager. It is hard to learn, and search result presentation does not provide enough context to "read vertically" through the corpus.
- No quotation manager. Also not via the search feature: there is no way to search for all tags at once, so no quotation manager feature.
- Too long delays. This is due to being a web-based application, but also due to the way it is programmed.
- Wikipedia: web-based CAQDAS software.
- lboro: CAQDAS list. Also includes an introduction to CAQDAS and a comparative review of CAQDAS tools. Not many free software tools included, though.
- Coding Analysis Toolkit.
Appendix: Positioning in the Drupal ecosystem
In order to attract contributions from other developers, the Open Ethnographer software should have a more generic use and / or be made from parts (Drupal modules) that have a more generic use. This is achieved in the current design by basing it on eComma, that is, contributing features to that project so that it can be both used for collaborative online annotations and open ethnography. In addition, here is how to make the extensions / additions to eComma as generic as possible themselves:
- Word-level taxonomy module. This functionality is totally absent in Drupal, but can be useful for many situations, incl. for semantic web integration. It would integrate with the Drupal taxonomy system by on-save syncing the word-level tagging ("coding") to content-element level tagging with Drupal taxonomy terms.
Appendix: Potentially helpful software components
- SCF custom modules. The Science Collaboration Framework is an application suite / distribution for Drupal 6 and includes multiple modules extending the Drupal taxonomy features. Might be worthwhile to take a look, to not reimplement functionality.
- RDF Extensions. A module that extends the RDF functionality provided in Drupal 7 Core. Includes RDFx (provides more RDF formats), RDF UI (UI for vocabulary mapping) and Evoc (RDF vocabulary import tool).
- sparql. A Drupal module that provides a SPARQL query endpoint for the RDF data exposed by Drupal.
- varql. A Drupal module that allows to create SPARQL queries graphically, via views. Allows querying own and remote SPARQL interfaces.
- Documentation: RDF in Drupal.