Modify Annotator to use quasi-random word ID tags for stand-off annotation

matthias · November 24, 2014, 10:29pm

Currently, ~~eComma~~ Annotator ~~loses the annotations made to text if the text is changed. This (probably) means that it~~ uses word ranges to save what annotation belongs to which words. Instead, Open Ethnographer should allow changing the base text while keeping all annotations intact. The technique to use for this is stand-off annotation with pseudo-random word IDs, as documented in the software design (section “Storage of Codings → Stand-off annotation and Drupal’s database”).

This task is to implement this data storage scheme: Make ~~eComma~~ Annotator store annotations using word tags with pseudo-random IDs inside the original text, and make it use stand-off annotation to save the annotations.

We want to use the Drupal modules annotator (for the frontend) and annotation (for the backend storage in the Drupal database), so these should be used as the base software. However, if extending annotation is not meaningfully possible with a fork, a specialized backend storage module for the word ID storage method can be developed.

This is a paid task and welcome to be picked up by Edgeryders community members!

Implementation languages: PHP (inside Drupal 7 framework), HTML

Budget: to be determined – the required effort is not clear yet; you can make a proposal

Collaboration: Delivery should be as a GitHub pull request to edgeryders/eComma. Payment is made after your code has been tested and integrated by @Matthias and after you have sent an invoice to Edgeryders. If bugs that prevent basic functioning show up, now or later, you will have to fix them within the budget limits; smaller bugs found after payment that do not affect basic functioning (edge cases, nice-to-have features etc.) are not your responsibility.

danohu · December 1, 2014, 11:48am

thinking aloud

“Currently, eComma loses the annotations made to text if the text is changed”

Not quite. The annotations remain. But since they are indexed by word number, they will apply to the wrong text if the content is changed.

ecomma does indeed store [beg]inning and [end] word indexes in the ecomma_range table.

Currently, ecomma wraps each word in a span with a sequential class and ID: <span class='ec-p5 token' id='ec-p5'>

The current system (eg for the highlighting) works on “find the beginning, count until you reach the end”. We could:

a) make that "find the beginning, keep going until you reach the end". Bad idea - we don't know that the end tag still exists (or hasn't been moved before the beginning)
b) store beginning id and length (in words). Then:
- if the beginning word is deleted, we don't display the annotation
- if the highlighted text has been changed, we just have a highlight covering some changed words, but starting in the right place
c) store the list of every highlighted word-id with the annotation. This is probably the most work -- we'd create a new table for the range-id -> word-id mapping. but it is most resistant to changes
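To make the difference between index ranges and option c) concrete, here is a minimal sketch. The token lists, the IDs and the helper names are all made up for illustration; in Drupal the mapping would live in a database table rather than a dict.

```python
# Hypothetical data: each word is an (id, text) pair with a non-sequential ID.
original = [("k3x", "The"), ("q9a", "quick"), ("m2z", "brown"), ("t7b", "fox")]

# Option c): one mapping entry per (annotation, word). A dict of lists stands
# in for the proposed mapping table.
annotation_words = {"note-1": ["q9a", "m2z"]}  # covers "quick brown"


def covered_words(tokens, word_ids):
    """Return the text of the tokens whose ID is in word_ids, in document order."""
    wanted = set(word_ids)
    return [text for word_id, text in tokens if word_id in wanted]


def covered_by_index(tokens, begin, end):
    """Current approach: sequential word indexes, begin and end inclusive."""
    return [text for _, text in tokens[begin:end + 1]]


# The text is edited: one word is inserted at the start, "brown" is deleted.
edited = [("p5c", "Yesterday"), ("k3x", "the"), ("q9a", "quick"), ("t7b", "fox")]

print(covered_words(edited, annotation_words["note-1"]))  # only surviving words
print(covered_by_index(edited, 1, 2))                     # range has shifted
```

With the ID list, the annotation shrinks to the words that still exist; with the index range, it silently moves onto different words.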

matthias · December 4, 2014, 6:03pm

Option c) is what I had in mind

Your option c) with non-sequential word IDs is what I call the “quasi-random word ID” solution. When you say, “we’d create a new table for the range-id → word-id mapping”, I assume you mean coding-id → word-id mapping? So, storing for each tagging / annotation / coding (or whatever you call it) which words it covers. With non-sequential IDs, there would be no ranges.

It’s indeed the solution that requires the most effort, but I like it, esp. since Open Ethnographer is meant for annotating live data (which can potentially change), in contrast to what more or less all other ethnographic software does. Haven’t seen this word ID solution anywhere in stand-off annotation so far, so it’s a little innovation as well. So my proposal is to start implementing it, but if it then turns out that the effort is really high, we can stick with a simpler solution (esp. since @Alberto thinks change-resistance is not so important as content hardly changes; it would force the software to hide annotations in the latest / live revision though until changes are manually reconciled with the coding …).

danohu · December 1, 2014, 12:18pm

adding the pseudorandom IDs

There are actually two subtasks in this task:

1. add the pseudorandom word ids
2. make ecomma use them for annotation

[breaking it down like this might make sense if we have more developers interested, but I think so far it’s just me and matthias?]

Looking at the first of them…

“generate word IDs only when the first researcher starts coding a piece of text”

So, a function that:

- is called whenever a researcher loads a page with coding enabled
- acts before showing them the page (i.e. so they will see it with tags added)
- doesn't care if the text has previously had pseudorandom IDs added
- finds all words not wrapped in an ID, and wraps them in an ID
- saves everything back to the DB
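A rough sketch of such a function, in Python rather than the PHP it would eventually be. Everything here is an assumption for illustration: the `<span id="w-…">` markup, the function names, and the naive `\w+` tokenizer (the real thing should reuse eComma's existing word-splitting code and would have to cope with other HTML in the text). Saving back to the DB is left out.

```python
import re
import secrets
import string

ALPHABET = string.ascii_letters + string.digits
# Assumed markup for an already-tagged word: <span id="w-XXXXXXXX">word</span>
TAGGED = re.compile(r'(<span id="w-[A-Za-z0-9]+">.*?</span>)', re.DOTALL)
ID_IN_TAG = re.compile(r'<span id="w-([A-Za-z0-9]+)">')


def new_id(used):
    """Draw random 8-character IDs until one is not in `used`, then record it."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(8))
        if candidate not in used:
            used.add(candidate)
            return candidate


def tag_untagged_words(text, used=None):
    """Wrap every word that is not yet inside an ID span; leave tagged words alone.

    `used` may carry IDs of words deleted in earlier revisions, which cannot be
    recovered from the current text.
    """
    used = set() if used is None else used
    used.update(ID_IN_TAG.findall(text))
    # Splitting on a capturing group puts the already-tagged spans at odd indexes.
    parts = TAGGED.split(text)
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(
            r"\w+",
            lambda m: f'<span id="w-{new_id(used)}">{m.group(0)}</span>',
            parts[i],
        )
    return "".join(parts)


print(tag_untagged_words('Hello <span id="w-abc12345">brave</span> new world'))
```

Running it a second time on its own output changes nothing, which matches the "doesn't care if the text has previously had IDs added" requirement.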

Below are notes on the way to that plan. I’m thinking about two situations:

A researcher starts coding a piece of content for the first time
1. it's unclear what that means in terms of UI. Is it just 'when a user with researcher permissions looks at the content'? Or 'when we show the ecomma interface'? Or does the researcher click some link saying "I want to code now"?
2. We need some UI work anyway, to figure out the above, and configure appearance for the situation where most users aren't researchers. So I guess we can change the hooks later?
3. How do we know this is the first time? Do we do it by introspection?
A piece of content is changed (by anybody) after the researcher has already started coding it
1. or, maybe just when a researcher looks at it? That means some words could be id-less for a while, but it shouldn't cause any problems (?)
2. What happens to the ID wrappers depends on precisely what edits the (presumably oblivious) user makes. Possibilities are:
  1. Add more words outside any ID tag
  2. Add/change text within an id tag. i.e. the tag covers multiple words. This is ugly, but actually livable with for now. Writing a function to break up multiple-word tags is 'nice to have' rather than essential
  3. Delete text. This means that some tags become meaningless. But, short of externally storing the tag text, we can't do much about it

We should keep as much as possible of the existing code for breaking up words – it looks like it has gone through some careful work. It lives in ecomma.module:theme_ecomma_formatter_myformatter.

danohu · December 1, 2014, 2:20pm

annotator changes everything

Using annotator is going to substantially change how we approach this, vs. vanilla eComma. It’s probably wise to postpone this task until we have annotator integrated, or at least a very clear idea of how annotator fits into the picture.

matthias · December 4, 2014, 6:26pm

Not that many changes

Annotator will be more or less a drop-in replacement for eComma’s own little JavaScript library. Keeping the interface, exchanging the implementation. So it does not change that much, but Annotator itself has to change to work with word IDs. And for that you’re right, we should at least know how Annotator has to change and how much effort that will be … . I don’t think it’s blocking this task, since the early stages of implementing word IDs are also explorative, but both tasks have to be done early on so we know if word IDs are worth the effort …

matthias · December 1, 2014, 5:06pm

Non-conflicting source of pseudo-randomness

Thanks for the input, @danohu. I’ll add specific replies, just wanted to leave a piece of “thinking out loud” myself:

We need a way to generate pseudo-random word IDs that does not result in conflicts, ever. So word IDs used once must not be re-used in the same piece of content, not even after the original words which got these IDs have been deleted, because the deleted words might still carry annotations in the revisions they appear in.

So hashing functions and the like would be difficult to use here due to hash collisions. But a static list of non-colliding word IDs, reused for every piece of content, could be a simple, working idea.

danohu · December 1, 2014, 5:59pm

Random generation is fine, surely? The collision likelihood is minuscule (e.g. an 8-char random string gives us 62^8 options, which is about 2 x 10^14).
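For the record, the arithmetic can be checked in a few lines. The 62-character alphabet (a-z, A-Z, 0-9) and the 10,000-word document are assumptions for the example, and the collision figure uses the standard birthday approximation rather than an exact count.

```python
import math
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters (assumed alphabet)
ID_LENGTH = 8
ID_SPACE = len(ALPHABET) ** ID_LENGTH


def collision_probability(n_ids, space=ID_SPACE):
    """Birthday approximation: P(any collision) ~ 1 - exp(-n(n-1) / (2 * space))."""
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


def random_word_id():
    """One random 8-character ID drawn from the assumed alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(ID_LENGTH))


print(ID_SPACE)                       # 218340105584896, i.e. about 2.2 x 10^14
print(collision_probability(10_000))  # chance of any clash among 10,000 word IDs
print(random_word_id())
```

If "never" has to be literal, a retry on the rare clash (checking against the IDs already used in that piece of content) closes the gap.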

Drawing IDs from a list means you need to keep track of which ones have already been used in a document. And just parsing them from the text isn’t enough, since some words may have been deleted. If we’re going that way it’s easier to go back to numeric IDs, and just store the maximum number reached in any piece of content.

–

But I think we can get away without even bothering with IDs for words. Look back at the requirements. We’re talking about these 2:

Revisions for codings and coded texts. [prepare; implementing can wait] While taking over all codings into the new live version when an edit is done is a good idea, in addition codings should be tied to a revision of the coded text. This is relevant because Open Ethnographer should allow to reproduce the findings of researchers, and for that a defined data version has to exist. This means that the coding progress information has to be stored not per code and text, but as multiple data points per code and revision of text. And when exporting codings, a date-based version has to be chosen, defaulting to the most current one for each code.
Coding the live version. [optional] This would allow showing the codings even after the coded text got edited after the coding. Without this, codings can only be shown in the version that was coded. This makes sharing the codings difficult, since another researcher could work on a newer version. It also makes reusing the codings for semantic web purposes difficult, since web visitors are always shown the most current version. However, the effect of these difficulties is low since it barely happens that somebody changes their post (and if so, mostly directly after posting, before the coding happened.)

The first, we achieve just by using drupal’s revision control, and storing the revision ID with each code. That’s a simple addition to either ecomma or annotator. We can also think about exposing the drupal version-compare screen to researchers (latest version on one side, coded version on the other).

The second task is optional and low impact. We can also probably fudge it later by using some text-alignment algorithm (take the text from the coded revision, search for it in the latest version somewhere near the correct offset).
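As a sketch of what that fudge could look like: search for the coded snippet in the new text within a window around the old offset, and take the nearest exact match. The function name and the window size are made up, and exact matching is the simplest possible stand-in for a real text-alignment algorithm (it gives up as soon as the coded words themselves were edited).

```python
def realign(snippet, new_text, old_offset, window=200):
    """Return the offset of the exact match of `snippet` in `new_text` that lies
    closest to `old_offset`, searching only within `window` characters of it.
    Returns None if there is no match in that window."""
    start = max(0, old_offset - window)
    end = min(len(new_text), old_offset + len(snippet) + window)
    best = None
    pos = new_text.find(snippet, start, end)
    while pos != -1:
        if best is None or abs(pos - old_offset) < abs(best - old_offset):
            best = pos
        pos = new_text.find(snippet, pos + 1, end)
    return best


old = "I like the quick brown fox. The quick brown fox likes me."
new = "Well, " + old                      # somebody prepended a word
old_offset = old.index("quick brown", 20)  # the second occurrence was coded

print(realign("quick brown", new, old_offset))
```

Preferring the match nearest to the old offset is what keeps the coding on the second "quick brown" instead of jumping to the first.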

Just a thought – you know much more than I do about the requirements here.

matthias · December 4, 2014, 6:20pm

Aw you’re right. Random generation is fine.

Not sure what got me on the hash function track. Hashing makes no sense at all here. Randomness is it – thanks for the hint!

About avoiding the word IDs: yes we could do that, since the software is for ourselves in the first place, and the requirements are thus quite malleable. And as you noted, change resistance is not that important. My personal view on this, still open for discussion: word IDs instead of word number ranges seem architecturally cleaner. Make a word a “thing” with an identity rather than remembering just its position, and you won’t have problems finding it again, ever. And I prefer clean solutions in the basic architecture of software, while favoring cheaper / effort-saving solutions higher up (user interface, config options etc.). I think it saves maintenance efforts and makes extensions, remix and reuse simpler later on. Apart from that, word IDs seem innovative.

matthias · December 6, 2014, 1:44am

This is for later. Also, migrated to Annotator.

A major concept change here: this task should now start with Annotator as the base software, not eComma. Upon close examination, I think that eComma is not a fitting base software for us (will document in the software design wiki, but in short: it does not offer us anything beyond Annotator that we could use as-is, and adapting it would be as much effort as building what we need … also, Annotator cannot be integrated anywhere as easily as I thought).

Also, this task can be treated as an add-on for later in the project (so, “postponed” for now). We should, in agile manner, rather get a basic version up soon and extend it along the way.

matthias · December 8, 2014, 3:12am

Requirements change!

It seems preferable to store annotations as text ranges as Annotator does normally, but storing the list of word IDs alongside them. And then whenever the base text changes, calculate the new annotation range(s) and store them again as range(s), as Annotator does normally. So for now, this is the latest requirement for how to implement this task. Details and reasons here.
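A minimal sketch of the recalculation step, under simplifying assumptions: words are given as (word ID, start, end) character offsets in document order, and ranges are plain (start, end) offset pairs. That is not Annotator's own range format, and the names are hypothetical.

```python
def ranges_from_word_ids(tokens, word_ids):
    """Recompute annotation ranges from the stored word IDs.

    tokens:   (word_id, start, end) triples for the current text, in order.
    word_ids: the IDs stored alongside the annotation.
    Returns (start, end) character ranges, one per run of adjacent annotated
    words; words that no longer exist are simply skipped.
    """
    wanted = set(word_ids)
    ranges = []
    run_open = False
    for word_id, start, end in tokens:
        if word_id in wanted:
            if run_open:
                ranges[-1] = (ranges[-1][0], end)  # extend the current run
            else:
                ranges.append((start, end))        # start a new run
                run_open = True
        else:
            run_open = False
    return ranges


# "the quick brown fox", with "quick brown" annotated
before = [("a", 0, 3), ("b", 4, 9), ("c", 10, 15), ("d", 16, 19)]
# "the quick red brown fox": a new word was inserted inside the annotation
after = [("a", 0, 3), ("b", 4, 9), ("e", 10, 13), ("c", 14, 19), ("d", 20, 23)]

print(ranges_from_word_ids(before, ["b", "c"]))
print(ranges_from_word_ids(after, ["b", "c"]))
```

Whether an inserted word should split the annotation into two ranges, as here, or be swallowed into one is a design decision this sketch does not settle.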