Understanding the software

matthewlinares · December 2, 2014, 1:15pm

I’m trying to understand the use and approach of this project. For what I can see, it aims to do some of the same work as hypothes.is (well just annotations really), although RDFa tagging is a key difference.

Is there a primer somewhere? I have looked at https://edgeryders.eu/en/research-methodology-the-spot-the-future-data-model-and-workflow and other related pages, but haven’t found a clear outline of use cases to readers not savvy with digital humanities.

Can anyone help? Thanks

alberto · December 2, 2014, 4:35pm

Just a small open tool for research

Hello @matthewlinares, welcome. My name is Alberto, and I guess I am one of the initiators of the OpenEthnographer journey. We have a (longish) writeup somewhere – maybe we could make it public (what do you say, @Matthias? Do you want to take care if it?).

In short, there seem to be three differences with projects like hypothes.is (which I did not know, thanks!) or FactLink, or Diigo (it has an annotation functionality, unlike its predecessor del.icio.us).

Architecture. From what I understand of hypothes.is (and I am sure about FactLink and Diigo), the annotations get saved in a central database. For several reasons, we depart from the "one platform to rule them all" approach and try to build a Drupal plugin that anyone can install on their own Drupal website. This allows each instantiation of OE to have its own social contract with its respondents. Interoperability across different sets of ethnographic codes are assured by open data + OSS rather than by centralization.
Use case. We are not trying to make people interact on commenting web pages. OE is not a general purpose tool, but a tool for ethnographic research; so it is optimized for a very special kind of annotation, the kind made by ethnographers when they do research.
Architecture + Use case. The two reasons I have given so far interfere in a nice way. Ethnographic software typically "lifts" ethnographic data from wherever it is and imports it as text. We, by contrast, aim to do the ethno coding of the same website where the conversation being coded lives. What this means is that the ethnographic data stays situated: for any piece of text, the website's database knows who wrote it, what else that person wrote, who that person is in conversation with (via network analysis) and so on. So, an ethnographer chasing hunches can make pretty complex queries to check them out ("Give me ALL keywords that co-occur with keywords X AND Y in interactions in which users Alice OR Bob OR Charlie were involved between 13 February and 29 October 2014"... you get the idea).

Also, we are eating our own dogfood. Yes, we make open source software, but in the end we are building a tool we need ourselves. This means a nice, tight loop between development and testing instead of working two years, then launching and hope “they will come”.

matthias · December 3, 2014, 12:27am

Good idea. Just made your writeup public.

Yes, I’m all for publishing the original writeup, so here it is: “Open Ethnographer: Idea, Concept, Goals”. (Published it just now, but I have set author and date according to how the document was originally written …). Guess that will answer some questions, @matthewlinares. Welcome here from me too, BTW – I’m Matt, managing the tech side of this project.

In addition to the differences pointed out by Alberto above, here’s one from a tech-centered perspective: in contrast to annotation tools meant for text study, ethnographic coding can easily lead to data for semantic web integration, basically as a side effect. Sure, Open Annotation as used by hypothes.is also results in data for the semantic web, but the semantic level is low: software will only know (for comments) “this is an annotation” and (for tags) “this is a concept” but understand nothing about the concept itself. That’s because people working informally with text can’t be bothered to enter data in a more structured way. Ethnographers can, and in fact regularly do when they work with text. So later, there will be several specialized features that integrate RDF vocabularies and ethnographic tagging, and other features known from informal text annotation will no be there (no free tagging, for example …).

hypothes.is is a great find anyway (thank you! didn’t know it either so far). They include a severely customized version of annotator.js. Open Ethnographer will also use annotator.js and has to customize it, so we should take a deep look and see what we can reuse

alberto · December 3, 2014, 9:51am

Argument mapping!

Aha, good point Matthias. Classing stuff as issues, concepts, pros and cons is consistent with an argument mapping kind of stance. This is common in the collective intelligence community: the CATALYST project, that I am involved in with a different hat, is all about that. They do (semantic web based) graphs that say basically: “A (a lump of text) is an idea, B (another lump of text) is a supporting argument to A, C (yet another lump of text) is evidence for B” etc. Do perhaps hypothes.is comes from that intellectual space (but not FactLink).

Indeed, ethnography is something else entirely. It aims to (manually) analyze lumps of texts and resolve them in their semantic meaning.

matthias · December 3, 2014, 2:10pm

As far as I have seen, hypothes.is is not about argument mapping, either. It is for informally working with text, in the tradition of marginalia in paper books. The tagging scheme in it is simply free tagging, and it’s meant to be used in a not-so-consistent manner and on a low semantic level. Semantic processing happens entirely within humans here. The only thing software “knows” if two pieces of text have been tagged with the same tag is: they relate to the same “concept”, and concept could be anything.

What I tried to say: the special thing about how ethnographers code things, at least as far as I have seen, is this: they develop their own vocabulary for one corpus of texts. It’s subject-specific and idiosyncratic, but applied consistently, and well thought out. So these lend themselves naturally to be connected to other vocabularies of the semantic web, thus making the software “know” what these tags / codings are about, on a semantic level (e.g. about an emotion, an event, an innovation and many much other stuff). It provides “effortless” semantic web integration, esp. as a means of sharing and collaborating around the ethnographic research data. I have not investigated in detail if this matching to existing vocabularies would work out … maybe we’d better define an open vocabulary for ethnography some day.

Note that this is not about forcing an ontology on an ethnographer. They can code however they like to. It’s about capturing and reusing some of the meaning in these idiosyncratic codes by describing them additionally in a common vocabulary. If one day I can do a SPARQL query across 100 or 10 000 websites to serve me all text snippets talking about “innovation” and “precariousness” … that’s how I see the potential of massive open online ethnography.

matthewlinares · December 8, 2014, 5:53pm

Meta-level considerations

Thanks for your responses. I have two comments.

1 Architecture, etc.

The hypothes.is project currently provides storage for annotations, but will eventually encourage distributed installation and storage.

Moreover, the fact that it enables annotation of and access to other sites as well as one’s own, of course increases the reach and insight of network graphing. Forgive me if that is true of Open Ethnographer and spelled out in one of the docs, I only skimmed!

I am not affiliated with hypothes.is, but have followed the project for some time. They are a talented, active outfit and have also consulted academics of various stripes to assess the utility of their offering in different fields. I would be interested to hear what their approach to ethnography is. I imagine there might be a smart extension of what they have built to allow for such embellishments, something I have thought about before. By this, I mean using infrastructure like this to augment specific parts of web pages with rich, searchable, ethnographic data.

It would be a shame to duplicate effort, even partially, so perhaps there is scope for collaboration.

2 The article you give is very thorough. I had hoped for a more terse overview which might help non-experts (e.g. casual visitors to the site) attempting to get a quick grasp on the project with e.g. a fleshed-out use case such as development of the statement:

Trend detection by wisdom of crowds. By virtue of focusing on communities, ethnography is good at detecting the weak signals of social, economic and cultural trends in the making.

This is a great project though! I already have a research project in mind that it would benefit.

Cheers,

Matt

alberto · December 9, 2014, 2:44pm

Ethnography => small data

@Maguy, are you @matthewlinares? May I ask you if there is any reason you are using two accounts? Just curious.

As for the rest: annotating the web and doing ethnographic research are two different activities, and they have different (even contradictory) requirements. In a nutshell, ethnographers work with proper taxonomies: a research group agrees that when someone finds a reference to, say, moving abroad to find work they will call that “international mobility of labour” and not something else. Across the same research project, the taxonomy is scientifically grounded (i.e. researchers agree on what makes a code a code), standardized and arranged into a hierarchy. For example, here is the ethno code tree from our Spot The Future project earlier this year:

Some of the words may appear semantically vague, for example, “values”. But they are very well defined in the context of this research project. By contrast, annotating the web is folksonomy stuff: the same concept may be encoded by many keywords, and the same keyword may be used in many different ways. You can still use it for research (and, yes, it has potentially a much broader reach), but it is very different research: it becomes big data stuff, with gazillions of low quality data from which powerful algorithms filter out the noise. Edgeryders can neither process meaningfully these data, nor elicit their generation. We have no option but using small, high quality qualitative data.

This is not to say that, in practice, some software code can not be called in to service both in hypothes.is and in OpenEthnographer. If that is the case, say hallelujah! We will gladly do so. But at the end of the day, the projects have different aims, and their architectures are going to be very different.

matthewlinares · December 9, 2014, 4:56pm

@Alberto Yes, I’m both avatars. I have two email addresses and logging in with Persona led me to forget which I initially signed into Edgeryders with.

I take your point, but I thought there may be a possible feature-set where hypothes.is users could mark certain annotations in such a way as to recognise them as a subset (e.g. ethnographic and specific to Edgeryders) of all annotations. This would permit researchers working with a jointly understood taxonomy to access only ‘project-specific’ data where necessary. @Matthias As for the extra nodal information, I’m not sure what data-structures Drupal contains internally which couldn’t be expressed in rendered html for later aggregation alongside such annotations. Perhaps a module could make that available in the browser.

I thought there would then be added analytical value if this tool were used on other sites for similar annotations, thereby increasing the data at hand. This is the key intended benefit of my suggestion, but perhaps I’m mistaken in my assessment of what’s going on.

I understand that you have considered the options in great detail and forgive me if I am missing or underestimating critical architectural obstacles. I suppose I am pushing it as I think there may be an opportunity here for a community markup layer which slickly serves multiple purposes simultaneously, and thereby benefits from scale and network effects. What I am getting at may sound vague, but I hint at it here (pardon the repost).

I also don’t mean to distract you from your efforts, so I’ll drop it if you think enough has been said!

matthias · December 9, 2014, 8:52pm

Every purpose has its tool …

So … I guess I found out enough about Hypothesis now, to know how it fits into the picture.

Your point about expressing “data-structures [of] Drupal […] in rendered html for later aggregation alongside such annotations” is valid, that’s also the way I would choose if going the Hypothesis way. RDF embedded into HTML to be specific (so, RDFa or similar). Would be a good fit since the Hypothesis crowd intends to go the semantic web route anyway, by supporting the Open Annotation standard in the future. This means, ethnographic tagging within Hypothesis or a derived more specialized product would happen using oa:SemanticTag. Aggregation and ethnographic analysis would happen using semantic search engines and accompanying technology (triple stores etc.). That’s the clean way to do web-wide online ethnography. However, the effort for this is way beyond the limits of what fits into the Open Ethnographer project. It fits well for a follow-up project, and I show below how we want to prepare for this within the limits of the Open Ethnographer project. Open Ethnographer itself is intended as a small sensemaking / analysis tool for ones own site’s online conversations, and for that the Hypothesis dependencies and architecture are too much complexity to handle. (At least I don’t want to run a Python server, secondary search infrastructure etc. alongside Drupal to store annotations. Complex architectures for simple purposes make me very wary.)

Ok, but here are a few changes to Open Ethnographer that I think will make it fit into the big picture of tool agnostic, semantic web based, web-wide online ethnography in the future:

Making our new "fast tagging" plugin for Annotator a bit more generic, to allow reuse by Hypothesis lateron, making Hypothesis usable for ethnographic tagging as well. Namely, we will also support multiple tags per annotation, with just an option to limit this to one as we intend to.
Rendering public ethnographic codings using the Open Annotation oa:SemanticTag syntax, compatible with what Hypothesis will support.
Exposing ethnographers' taxonomies as a set of URIs fit for the Open Annotation oa:SemanticTag markup.
Making sure that author and content relationships (like "this is a comment in reply to this one") are also exposed via semantic markup.
Later: Developing the so-called code manager of Open Ethnographer as a tag management plugin for Annotator. It would allow to merge tags, order and group tags and could be taken over by Hypothesis lateron. It is marked "later" as it saves us quite some effort to rely on Drupal-integrated taxonomy management features for that, for now.

In the end, you could use both Open Ethnographer, Hypothesis or any other tool of choice with Open Annotation support to add ethnographic markup to web pages. That (allowing tool choice by standardizing at the protocol level, not the tool level) seems preferable to me for everything “huge”. Huge, like web-wide ethnographic annotation in this case.

matthewlinares · December 12, 2014, 11:23am

Nice

That all sounds sensible. Is that much of a departure from what you had planned anyway?

I will mention this thread on the hypothes.is dev list in case it is of interest to them.

Thanks

matthias · December 9, 2014, 3:15pm

Hypothesis is not THAT different, but still

What Alberto said about the different use cases is correct. What we need is tagging with defined taxonomies. About 80% of the Hypothesis code will care with more basic stuff like the browser extension, data storage architecture etc… So yes, one could join efforts and base an online ethnography software on this. In fact we have considered other “web-wide” annotation infrastructure as a basis when discussing Open Ethnographer initially, and the reason against it was finally this: much of the value and meaning of the data in online ethnography comes from its context (who wrote it, to whom, when etc.), which we only know if it’s data on our own platform. So in Open Ethnographer, we don’t just store annotations about an URL, but about a node or comment, which are Drupal datastructures that also store author, creation date etc…

Hypothesis might be prepared to store such meta information, but using a browser extension to annotate your own website is quite overkill / too complex. Using Annotator (also a part of Hypothesis) as we do now, plus annotator_view for a Hypothesis-like view of annotations seems much less complex to handle to me. But maybe I miss an idea / insight into Hpothesis architecture that would indeed make it a better basis for our case?

alberto · December 9, 2014, 6:23pm

Stewardship!

Aha. Very clear case, well done @Matthias. What we are saying is: we can guarantee transparency of research (reproducibility of results etc.) also as a function of the cleanliness of then data model for storing the information on our server. This leads to the choice of storing ethnographic coding into the Drupal database via Annotator, then retrieving it via annotator_view. It is responsible stewardship of data: it makes so much sense even I can understand it!

@Maguy, do you want me to merge one of your users into the other one?

matthewlinares · December 12, 2014, 11:02am

Merge

@alberto Yes please! Merge them into this one.

alberto · December 13, 2014, 6:44pm

Which one shall I keep?

Do you prefer @matthewlinares or @Maguy?

matthewlinares · December 17, 2014, 4:48pm

@matthewlinares, thanks.