The following covers photos and videos only (!). Any other multimedia format (360° images, audio, etc.) will need additional development effort of around 60-80% of the effort quoted below for the image and video options.
Photos are hosted inside Discourse, so they can be “easily” annotated using annotator-imgselect (compatible with the Annotator.js software we already use for Open Ethnographer) or Annotorious (live demo). Both are open source. Effort: 12,000 - 30,000 EUR, depending on the required finish level (“rough to high polish”), the developer we can find for the job, and the rate they quote.
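To make the photo option more concrete, here is a minimal sketch of what one stored region annotation could look like. It is not the data format of annotator-imgselect, Annotorious, or Open Ethnographer; the class name `ImageRegionAnnotation` and all of its fields are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRegionAnnotation:
    """Hypothetical record for one annotated rectangle on a photo."""

    upload_url: str  # URL of the image as hosted in Discourse (assumed field)
    post_id: int     # post that contains the image (assumed field)
    code: str        # ethnographic code applied to the region
    # The rectangle is stored as fractions of the image size (0.0 to 1.0),
    # so it stays valid when the image is rendered at another resolution.
    x: float
    y: float
    width: float
    height: float

    def __post_init__(self) -> None:
        for name in ("x", "y", "width", "height"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
        # Small tolerance to absorb floating point rounding at the edge.
        if self.x + self.width > 1.0 + 1e-9 or self.y + self.height > 1.0 + 1e-9:
            raise ValueError("region extends beyond the image")

    def to_pixels(self, image_width: int, image_height: int) -> tuple[int, int, int, int]:
        """Return (left, top, width, height) in pixels for a rendered size."""
        return (
            round(self.x * image_width),
            round(self.y * image_height),
            round(self.width * image_width),
            round(self.height * image_height),
        )
```

Storing relative rather than pixel coordinates is a design choice of this sketch, made because Discourse may serve the same upload at several sizes; a real implementation could just as well follow whatever the chosen annotation library stores.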
Videos are not hosted inside Discourse. It seems better not to support selecting image regions to annotate (that seems like overkill), but rather to treat videos as a one-dimensional stream of information and allow annotating time spans on them. This is very similar to what https://ant.umn.edu/ does (it is not open source, though). Our annotations would live in the Discourse database and refer to videos (from YouTube, Vimeo, etc.) embedded in Discourse posts via the Onebox embedding feature. Only these videos could be annotated. In the “coding view”, the video would be replaced with a video coding editor similar to the one at https://ant.umn.edu/. Quote: 20,000 - 45,000 EUR, depending on the required finish level and the developer.
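The time span approach described above could be modeled roughly as follows. This is only a sketch under stated assumptions: `VideoSpanAnnotation`, its fields, and the helper `annotations_at` are hypothetical names, not part of Discourse, Onebox, or the tool at ant.umn.edu.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoSpanAnnotation:
    """Hypothetical record for one coded time span on an embedded video."""

    video_url: str  # URL of the video embedded in the post (assumed field)
    post_id: int    # Discourse post that embeds the video (assumed field)
    code: str       # ethnographic code applied to the span
    start_s: float  # span start, in seconds from the beginning of the video
    end_s: float    # span end, in seconds (exclusive)

    def __post_init__(self) -> None:
        if self.start_s < 0 or self.end_s <= self.start_s:
            raise ValueError("expected 0 <= start_s < end_s")

    def covers(self, time_s: float) -> bool:
        """True if the playback position falls inside this span."""
        return self.start_s <= time_s < self.end_s


def annotations_at(
    annotations: list[VideoSpanAnnotation], video_url: str, time_s: float
) -> list[VideoSpanAnnotation]:
    """Return all annotations on the given video that cover the given time."""
    return [a for a in annotations if a.video_url == video_url and a.covers(time_s)]
```

A coding editor could call something like `annotations_at` while the video plays to highlight the codes that apply at the current position. Because each annotation only stores a URL and two timestamps, the video itself can stay on YouTube or Vimeo.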