Open Ethnographer Software Requirements

matthias · November 15, 2014, 8:25pm

This wiki contains the proposed features of Open Ethnographer, and the decisions on whether and when to implement them. (For how to implement them, see the software design.) This is a wiki, and so far it records just an agreement about the foundational, basic features. You’re welcome to contribute!

Code management features

Code manager. A graphical interface for creating codes. (Contained in the proposal, p. 9, as "Build a code manager".)
Information in a code. The following information will go into a code:
- code name
- code definition (free text)
- code author (user ID reference)
- creation date
- sharing status (public or private)
- code tags [optional / later] (every user will be able to tag a code only with their own codes, so in principle a single multi-value term reference field is sufficient to store all the code tags)
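The field list above could be sketched as a plain record type. This is only an illustration in Python, not the Drupal storage design: the names `Code`, `add_code_tag` and `tags_by_user` are hypothetical, and the single `code_tags` list stands in for the multi-value term reference field. Because every tag is itself a code with an author, one list can hold the tags of all users.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class Code:
    """One code, with the fields listed in the requirements (names are assumptions)."""
    code_id: int
    name: str
    definition: str              # free text
    author_id: int               # user ID reference
    created: datetime
    public: bool = False         # sharing status: public or private
    code_tags: List[int] = field(default_factory=list)  # IDs of codes used as tags


def add_code_tag(target: Code, tag: Code, user_id: int) -> None:
    """Tag `target` with `tag`; users may only tag with codes they authored."""
    if tag.author_id != user_id:
        raise PermissionError("users can only tag with their own codes")
    if tag.code_id not in target.code_tags:
        target.code_tags.append(tag.code_id)


def tags_by_user(target: Code, user_id: int, registry: Dict[int, Code]) -> List[Code]:
    """Recover one user's tags from the single shared multi-value field."""
    return [registry[i] for i in target.code_tags if registry[i].author_id == user_id]
```

The tagging user never needs to be stored separately, since it can be read from the author of each tag code.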
Code networks. [undecided] Coding would happen with "just codes". But codes could be applied not just to text, but also to other codes as "code tags". This will lead to code networks that can be exported to, for example, RQDA's code groups. Thus, they would only be relevant for analysis, not for coding itself, because coding with the auto-complete function already allows a fast working style. Code networks will also not be used to aggregate codes into "higher-level" codes that are composed of constituent codes from different authors; for such collaboration / code re-use, it is better to use a "fork and pull request" approach [reasoning]
A reason not to implement it at all could be that, since code networks / code structures are only for analysis, they should be built within the CAQDAS software to which Open Ethnographer exports. However, since re-exporting would usually mean that this work has to be repeated in the CAQDAS software, it will probably be better to support code structures inside Open Ethnographer and export them. This feature is not completely decided yet, though. See this discussion, esp. the latest input.

Coding features

Coding while reading.
Removing codings while reading. This is relevant esp. because it is needed often after forking a code from another researcher.
Coding by every authenticated user. In the spirit of openness, it is not necessary to exclude anybody from coding. Those users who are not trained ethnographers might not create high-quality codings, but it does not matter as ethnographers do not have to use or reuse their codings.
Code selection options. The following two modes to quickly select a code to apply to a selection of text are desirable:
- Instant search. If the researcher already knows the code name, filtering a flat code list client-side while typing is fastest.
- Hierarchical selection. Using a tree-like structure. This is not completely decided, but initial tests suggest that instant search is more than sufficient. See this discussion.
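The instant search mode amounts to filtering a flat list of code names on every keystroke. The real filter would run client-side in the browser; the sketch below only shows the logic in Python. The function name `filter_codes` and the choice to rank prefix matches before other substring matches are assumptions, not decided behaviour.

```python
from typing import Iterable, List


def filter_codes(code_names: Iterable[str], query: str) -> List[str]:
    """Return the code names matching `query`, case-insensitively.

    Names that start with the query come first, followed by names that
    merely contain it. An empty query returns the full list unchanged.
    """
    names = list(code_names)
    q = query.strip().lower()
    if not q:
        return names
    prefix = [n for n in names if n.lower().startswith(q)]
    inside = [n for n in names if q in n.lower() and not n.lower().startswith(q)]
    return prefix + inside
```

For example, with the codes "distrust", "trust" and "funding", typing "tru" would list "trust" first and "distrust" second.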
Parallel coding. Multiple ethnographers (potentially every Drupal user) should be able to develop their own codings. (Referred to in the proposal, p. 10, as "Build a solution for different users to code the same online content in parallel.")
Coding progress memory. There should be a way to record that a certain text has been coded with a certain code (and whether with the base version or also re-coded with one's own, forked version of the code), even if the code ends up being applied nowhere in the text. This allows other researchers who build on existing work to know the extent of the existing coding work.
Support for re-coding / code refining. It may happen often (does it?) that a researcher wants to look for a code and refine its use by re-coding with two different codes. If so, a basic quotation manager ("search by code" function) has to be right in the coding software.
Voluntary publishing of codings. Alberto, Inga and Noemi did a small survey to see if ethnographers are ok with an obligatory "open notebook science" approach. It turns out, they are not against sharing, but uncomfortable with obligatory sharing of codings. So the sharing should be voluntary. This affects the basic way of how codings can be stored (as markup in the live version for all-public codings, vs. stand-off annotation for private and sharable codings).
Collaboration via fork and merge. [prepare, and implement forking; merging back can wait] Open online ethnography is disruptive mostly due to its collaborative potential: it allows coding much larger corpora of text collaboratively, in an aggregative manner. This requires a collaboration feature that lets different researchers hold different opinions about how to code what, so they must be able to add to and remove from the codings that another researcher made with a given code. Fork and merge allows for this.
Merging can be implemented later, as it “only” makes looking at others’ changes and recoding them with own codes more comfortable. When implementing it, here is how: at every moment, a researcher will be able to compare the coding status of one of their codes with all the forked versions that are around. The software will indicate differences (both added and removed codings to words) with each forked version, excluding only changes that one manually rejected earlier. These differences can then be reviewed and taken over selectively. For the double use of rendering codings as linked data (RDFa markup) into the public website output, Drupal could then use the union set of all publicly shared tags in a “fork set”.
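The comparison described above can be sketched with plain set operations. This is an assumption-laden illustration: a coding is modelled as a hypothetical `(text_id, start_word, end_word)` tuple, rejected changes as `("add" | "remove", coding)` pairs, and the function names `fork_differences` and `public_union` are invented for the example.

```python
from typing import Dict, Iterable, Set, Tuple

Coding = Tuple[int, int, int]        # (text_id, start_word, end_word), assumed shape
Change = Tuple[str, Coding]          # ("add" | "remove", coding)


def fork_differences(own: Set[Coding], fork: Set[Coding],
                     rejected: Set[Change]) -> Dict[str, Set[Coding]]:
    """Differences between one's own codings and one forked version.

    "added" are codings only in the fork, "removed" are codings only in
    one's own version; changes that were manually rejected earlier are
    left out so they are not offered for review again.
    """
    added = {c for c in fork - own if ("add", c) not in rejected}
    removed = {c for c in own - fork if ("remove", c) not in rejected}
    return {"added": added, "removed": removed}


def public_union(fork_set: Iterable[Tuple[Set[Coding], bool]]) -> Set[Coding]:
    """Union of the codings of all publicly shared versions in a fork set."""
    result: Set[Coding] = set()
    for codings, is_public in fork_set:
        if is_public:
            result |= codings
    return result
```

Reviewing and taking over differences selectively would then mean applying chosen items from "added" and "removed" to one's own set, and recording the rest as rejected.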

Revisions for codings and coded texts. [prepare; implementing can wait] While taking over all codings into the new live version when an edit is done is a good idea, codings should in addition be tied to a revision of the coded text. This is relevant because Open Ethnographer should allow reproducing the findings of researchers, and for that a defined data version has to exist. This means that the coding progress information has to be stored not per code and text, but as multiple data points per code and revision of text. And when exporting codings, a date-based version has to be chosen, defaulting to the most current one for each code.
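Storing progress as multiple data points per code and text revision, and picking a date-based version on export, might look like the following sketch. The record type `ProgressEntry`, its fields and the function `revision_for_export` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class ProgressEntry:
    """One data point: `text_id` at `revision_id` was coded with `code_id`."""
    code_id: int
    text_id: int
    revision_id: int
    recorded: datetime


def revision_for_export(entries: Iterable[ProgressEntry], code_id: int, text_id: int,
                        as_of: Optional[datetime] = None) -> Optional[int]:
    """Pick the text revision to export for one code.

    Without `as_of`, the most recently recorded entry wins (the default
    of "most current"); with `as_of`, only entries recorded up to that
    date are considered. Returns None if nothing matches.
    """
    candidates = [
        e for e in entries
        if e.code_id == code_id and e.text_id == text_id
        and (as_of is None or e.recorded <= as_of)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e.recorded).revision_id
```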
Coding the live version. [optional] This would allow showing the codings even after the coded text was edited following the coding. Without this, codings can only be shown in the version that was coded. This makes sharing the codings difficult, since another researcher could work on a newer version. It also makes reusing the codings for semantic web purposes difficult, since web visitors are always shown the most current version. However, the effect of these difficulties is low since it rarely happens that somebody changes their post (and if so, mostly directly after posting, before the coding happened).

Publication and sharing features

Export to QDA software. Ideally, the codings of one ethnographer could be saved into a file (also including codings from others that this ethnographer took over into her code hierarchy). The requirements are:
- Ability to filter which content will be in the export. Making a list of groups is granular enough. Non-relevant content (like administrative posts etc.) in these groups can be easily ignored as it will not be coded either, and can also be deleted in the export's target QDA software if needed.
- Ability to filter which codes will be in the export. This would default to "all own codes, including shared codes to which one has subscribed", but options would allow limiting this further. Since one rarely wants to exclude one's own tags, implementing this part can wait or even be discarded.
- Export for download in a format of one open source QDA software.
- Real-time syncing to a web-based open source QDA software. [optional] This can be the same QDA software as the one for which a downloadable file is offered.
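The two export filters above (by group and by code) can be sketched as one selection step before writing the export file. Everything here is an assumption for illustration: codings are modelled as dictionaries with hypothetical `group_id` and `code` keys, and `select_for_export` is an invented name. The target file format is left out, since the QDA software is not chosen in this section.

```python
from typing import Dict, Iterable, List, Optional


def select_for_export(codings: Iterable[Dict], group_ids: Iterable[int],
                      own_codes: Iterable[str],
                      subscribed_codes: Iterable[str] = (),
                      only_codes: Optional[Iterable[str]] = None) -> List[Dict]:
    """Select the codings that go into an export.

    Content is filtered by a list of groups. Codes default to all own
    codes plus subscribed shared codes; `only_codes` optionally narrows
    that default further.
    """
    allowed = set(own_codes) | set(subscribed_codes)
    if only_codes is not None:
        allowed &= set(only_codes)
    groups = set(group_ids)
    return [c for c in codings if c["group_id"] in groups and c["code"] in allowed]
```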
Semantic web integration. [prepare; implementing can wait] Export to Internet search engines via RDFa, ideally using RDF ontologies that they understand. This is optional, but the software architecture must be designed to allow it.
Obligatory sharing with authors. [undecided] Maybe, for transparency it is a good idea to enforce that authors can see all codings in their own texts, even if they are marked as "private" by the ethnographers creating them.

Analysis features

Decision: There is little meaning in recreating the full analysis part of QDA tools, so we should not do this. There are tools for this already, just not for collaborative coding of live data. We will instead interface with existing tools by exporting and syncing.

Quotation manager. Basically, this is an integrated site-wide search function to search by code. It is not strictly required, since an ethnographer can use the external analysis tool for this. But it would be nice to have, since coding and re-coding text becomes simpler if a quotation manager allows switching quickly from search results to coding. (Referred to in the proposal, p. 9, as "Build a quotation manager.")
Tree of keywords. [prepare; implementing can wait] Ben used this kind of diagram in his network analysis report for "Spot the Future".
Advanced code search. [prepare; implement later] This refers to the more sophisticated analysis features usually found in QDA software:
- Search by revision date. This would allow using the integrated analysis tools to check and confirm conclusions of other ethnographers.
- Include / exclude codings from individual coders.
- Vicinity search for coding combinations.
- Searching by multiple tags with boolean operators.
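Three of the search modes above (coder filtering, boolean combination, vicinity search) can be sketched over a flat list of codings. This is a minimal in-memory illustration, not a proposal for the actual query layer: the `Coding` record, the word-offset model and all function names are assumptions. Search by revision date is not covered here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Coding:
    text_id: int
    start: int    # offset of the first coded word (assumed unit: words)
    end: int      # offset of the last coded word
    code: str
    coder: str


def by_coders(codings: Iterable[Coding], include: Iterable[str] = (),
              exclude: Iterable[str] = ()) -> List[Coding]:
    """Keep codings from `include` (all coders if empty), drop those from `exclude`."""
    inc, exc = set(include), set(exclude)
    return [c for c in codings if (not inc or c.coder in inc) and c.coder not in exc]


def texts_matching(codings: Iterable[Coding], all_of: Iterable[str] = (),
                   any_of: Iterable[str] = (), none_of: Iterable[str] = ()) -> Set[int]:
    """Boolean search: texts with all of / any of / none of the given codes."""
    codes_by_text = {}
    for c in codings:
        codes_by_text.setdefault(c.text_id, set()).add(c.code)
    all_s, any_s, none_s = set(all_of), set(any_of), set(none_of)
    return {
        text_id for text_id, codes in codes_by_text.items()
        if all_s <= codes and (not any_s or any_s & codes) and not (none_s & codes)
    }


def vicinity(codings: Iterable[Coding], code_a: str, code_b: str,
             max_gap: int) -> List[Tuple[Coding, Coding]]:
    """Pairs of codings in the same text at most `max_gap` apart (overlaps count)."""
    items = list(codings)
    pairs = []
    for a in (c for c in items if c.code == code_a):
        for b in (c for c in items if c.code == code_b and c.text_id == a.text_id):
            gap = max(a.start, b.start) - min(a.end, b.end)
            if gap <= max_gap:
                pairs.append((a, b))
    return pairs
```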
Federation via semantic web linked data. [prepare; implement much later] With this feature, coding can happen on multiple websites independently, with shared or connected code hierarchies. The coded data is then shared, for example via SPARQL endpoints that allow searching for codes across multiple websites.
Semantic search and knowledge deduction. [optional, much later] This would be a kind of semantic search inside one's own website, potentially opening a way to much bigger participation from the Drupal open source community, as the plugin would suddenly be about semantic search, not "just" ethnography.

Privacy features

Consent management. [prepare, implementing can wait] It must not be possible to code something for which the user has not given consent. The problem with the prototype was that the "consent on signup" was not really working, as people do not read long agreements on signup. Maybe it should be a tickmark on the user profile, or when posting content. However, given that Annotator as a browser plugin can do stand-off annotation, it is questionable whether missing consent can be enforced at all, and thus whether it makes sense to require consent …
Managing third party content. [prepare; implementing can wait] Is third-party content (which is reported on the platform, not directly created) content that is relevant for ethnography at all? Alberto proposes that the content can go in, as "transcriptions do qualify as qualitative data". We lose the connection to an Edgeryders account in that way, though. It would not work to invite all these interviewed people on the platform, but would be ok if their content is included. The decision is to use a code "third-party content", allowing to filter this content in or out depending on what we want in the report.
Third-party consent management. [prepare; implementing can wait] If third-party content is included, what happens to that content, and is consent needed from the third party? Inga and Noemi will write e-mails asking for consent.
Traceability protection. [prepare, implementing can wait] If the research data (coded data etc.) is handed over to other researchers for other research projects, content will be traceable back to a user profile and then to a user (if only by a Google search). How can that be prevented?