Final data cleanup needed: a look at the Czech corpus

alberto · August 19, 2022 14:59

As happened with the German corpus, the Czech one is suffering from similar issues. @SantosCardonaPR and @Jirka_Kocian are aware of them. This post provides them with materials aimed at making the cleanup faster.

First of all, this spreadsheet contains all codes used in the Czech corpus, arranged in ascending alphabetical order of the English label. The rightmost column is the live link to the code’s page on Edgeryders’s back end, so you can quickly go from the spreadsheet to the back end.

Problem 1: duplicate codes

Duplicate codes are a common error. They happen when an ethnographer creates a new code, unaware that an identical one was already in use for the same corpus. If you scroll through the spreadsheet, you will find them easily, as in this example:

[details = Here are the 33 codes, used for annotating ethno-rebelpop-czech-interviews, that are synonymous with other codes (synonymous = same English label, ignoring capitalization).]

anti-vaxxers
caring about future generations
community
culdisorient
culman
czechs
donald trump
economy
eeconomicviol
envenergy
envgenprob
eu
europe
facebook
gender role models
gensexism
homophobia
impact of covid-19
morals
new routine
online platforms
polpol
populism
russia
social activism
social distancing
social isolation
social media
staying at home
viktor orban
volunteering
xenophobia
young generation

[/details]

Please note that there are less obvious duplicates, for example democracy and demokracie, and check for them too.
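The "same English label, ignoring capitalization" rule can be sketched in a few lines of Python. This is only an illustration, assuming the spreadsheet has been loaded as a list of rows with an `id` and a `name_en` field; the sample rows and their ids are made up. Note that it only catches exact matches, so cross-language pairs such as democracy and demokracie still need a manual check.

```python
from collections import defaultdict

def find_duplicate_labels(codes):
    """Group code ids by case-insensitive English label; keep labels used more than once."""
    by_label = defaultdict(list)
    for code in codes:
        label = (code.get("name_en") or "").strip().casefold()
        if label:  # codes without an English label are a separate problem
            by_label[label].append(code["id"])
    return {label: ids for label, ids in by_label.items() if len(ids) > 1}

# Made-up rows for illustration only; these are not real code ids.
sample = [
    {"id": 1, "name_en": "Europe"},
    {"id": 2, "name_en": "europe"},
    {"id": 3, "name_en": "democracy"},
    {"id": 4, "name_en": "demokracie"},
]
duplicates = find_duplicate_labels(sample)
print(duplicates)
```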

Problem 2: inaccurate labeling by language

Some codes have no English label. In most of them, the English label was recorded in the database field for the label in another language, normally (in this corpus) Czech, but occasionally others. In the spreadsheet linked above, you can find them quickly by sorting on the name_en column. There are only 33 such codes.
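The same check can be done outside the spreadsheet. The sketch below assumes each row carries `name_en` plus one field per other language; the field names `name_cs`, `name_de` and `name_pl` are hypothetical, as are the sample rows. It lists every code whose English label is empty, together with the fields where a label was actually recorded.

```python
def missing_english_label(codes, other_fields=("name_cs", "name_de", "name_pl")):
    """Return (id, [(field, text), ...]) for every code whose name_en is empty."""
    found = []
    for code in codes:
        if (code.get("name_en") or "").strip():
            continue  # English label present, nothing to fix
        filled = [
            (field, code[field].strip())
            for field in other_fields
            if (code.get(field) or "").strip()
        ]
        found.append((code["id"], filled))
    return found

# Made-up rows for illustration only.
sample = [
    {"id": 10, "name_en": "populism", "name_cs": ""},
    {"id": 11, "name_en": "", "name_cs": "social media"},
    {"id": 12, "name_en": None, "name_cs": "demokracie"},
]
to_fix = missing_english_label(sample)
print(to_fix)
```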

alberto · September 05, 2022 14:40

Guys, I can still see duplicate codes. I now have 30:

['caring about future generations', 'CULdisorient', 'Czechs', 'Donald Trump', 'Eeconomicviol', 'ENVenergy', 'ENVgenprob', 'EU', 'Europe', 'France', 'gender role models', 'GENsexism', 'Germany', 'impact of COVID-19', 'morals', 'new routine', 'online platforms', 'POLpol', 'populism', 'Russia', 'social activism', 'social distancing', 'social isolation', 'social media', 'social media', 'staying at home', 'Viktor Orban', 'volunteering', 'xenophobia', 'young generation']

social media appears 3 times. The correct one (with ancestry) appears to be https://edgeryders.eu/annotator/codes/4004, created by Corinne in the NGI project. The other two are https://edgeryders.eu/annotator/codes/1578, created by Amelia, which appears only as the parent code of facebook, and https://edgeryders.eu/annotator/codes/9149, created by Nica. My recommendation here is to merge all three of them.

Additionally, there is this:

and of course this:

The first column shows the Id of the parent node. The second column shows the number of annotations (platform-wide) in the corpus. The two Z-categories look like they are identical, both created by Amelia, and it comes down to deleting one: can I please just do it? It would be this: https://edgeryders.eu/annotator/codes/8882.

The two “impact of COVID” differ by ancestry, number of annotations and creator. One was created by Jirka, and has no parent. The other was created by Richard, and has proper ancestry. Recommendation: merge Jirka’s into Richard’s.

These three are going to affect the visualization a lot, because they are highly connected codes. @Jan, @SantosCardonaPR, @Jirka_Kocian, can you authorize me to:

Delete Amelia’s Z COVID-19 (Copy)
Merge the three social medias into one (Corinne’s)
Merge Jirka’s Impact of COVID into Richard’s
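To preview what a merge would do to the data before it is carried out on the platform, the duplicates can be redirected in memory. This is a rough sketch, not how the back end performs merges: the three social media code ids come from the post above, while the annotation records, their `code_id` field and the code id 42 are made up.

```python
# Canonical code id -> ids proposed for merging into it (social media, as listed above).
MERGE_PLAN = {4004: [1578, 9149]}

def build_redirects(plan):
    """Flatten the plan into a {duplicate_id: canonical_id} lookup."""
    redirects = {}
    for canonical, duplicate_ids in plan.items():
        for duplicate_id in duplicate_ids:
            redirects[duplicate_id] = canonical
    return redirects

def remap(annotations, redirects):
    """Return copies of the annotations with duplicate code ids replaced."""
    return [
        {**a, "code_id": redirects.get(a["code_id"], a["code_id"])}
        for a in annotations
    ]

# Made-up annotations for illustration only.
annotations = [
    {"id": 1, "code_id": 1578},
    {"id": 2, "code_id": 4004},
    {"id": 3, "code_id": 9149},
    {"id": 4, "code_id": 42},
]
merged = remap(annotations, build_redirects(MERGE_PLAN))
print(merged)
```

Keeping the plan as plain data makes it easy to review before anyone touches the back end.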

The other 27 or so cases of duplication also need addressing.

I think this is where I need to flag to @Jan and @Richard that I am quite unhappy working this way. These data are never clean, and we can never move on.

alberto · September 05, 2022 14:59

While I was at it, I also checked the Polish corpus. Duplicate codes:

['CULdisorient', 'ENVenergy', 'ENVoverexploit', 'EU', 'Europe', 'freedom of movement', 'Germany', 'legality', 'older people', 'POLpol', 'post-socialist transformation', 'science', 'social engagement', 'social media', 'social media', 'television', 'victims of post-socialist transformation', 'women', "women's rights"]

There are 19 of them, and 7 are shared with the Czech ones: 'CULdisorient', 'ENVenergy', 'EU', 'Europe', 'Germany', 'POLpol', 'social media'

The Polish corpus used only one Z COVID 19 Category (not the COPY one), and only one `Impact of COVID 19` (Richard’s).
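The overlap between the two lists can be rechecked with a set intersection. The sketch below copies the Czech and Polish lists exactly as posted above and compares labels case-sensitively; the variable names are illustrative.

```python
czech = ['caring about future generations', 'CULdisorient', 'Czechs', 'Donald Trump',
         'Eeconomicviol', 'ENVenergy', 'ENVgenprob', 'EU', 'Europe', 'France',
         'gender role models', 'GENsexism', 'Germany', 'impact of COVID-19', 'morals',
         'new routine', 'online platforms', 'POLpol', 'populism', 'Russia',
         'social activism', 'social distancing', 'social isolation', 'social media',
         'social media', 'staying at home', 'Viktor Orban', 'volunteering',
         'xenophobia', 'young generation']
polish = ['CULdisorient', 'ENVenergy', 'ENVoverexploit', 'EU', 'Europe',
          'freedom of movement', 'Germany', 'legality', 'older people', 'POLpol',
          'post-socialist transformation', 'science', 'social engagement',
          'social media', 'social media', 'television',
          'victims of post-socialist transformation', 'women', "women's rights"]

# Labels present in both lists, sorted alphabetically regardless of case.
shared = sorted(set(czech) & set(polish), key=str.casefold)
print(len(czech), len(polish), shared)
```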

SantosCardonaPR · September 06, 2022 00:43


Hi @alberto, I hope you are doing well.

I just wanted to let you know that I cleaned all the codes you mentioned above. No more duplicated codes for ethnopoprebel tags. Also, I changed the language labels to English for the Czech codes. Nothing to do on your end!

Please let me know if there is anything else I need to do. Have a lovely rest of your day!

All the best,
Santos

alberto · September 07, 2022 15:33

Hello @SantosCardonaPR and thanks for this.

Indeed, the Czech corpus is clean. The Polish one still has one duplicate: there are two codes both called “Europe”:

The former one is embedded in the Z hierarchy of POPREBEL, but then only used (within POPREBEL) as a parent code (of “Germany”, “Poland” etc.). The latter one was used in actual annotations by @Jirka_Kocian, @Richard and @Wojt. Is that the way you want it?

Jan · September 08, 2022 19:45

Dear All, we have discovered another issue that we want to address ASAP. Therefore, I propose that:

@SantosCardonaPR, @Wojt and @Jirka_Kocian will meet (at 9:00?) to work on this issue. We have already started and are making progress. At this point in time, it is better to work on the problem than to discuss with the rest of the team that we have a problem.
The ethnographic “sub-team” (@Nica, @jitka.kralova, @SZdenek, @Maniamana, @Djan) may want to have their separate meeting or simply skip this one.
@alberto and @Richard are welcome to join the one they find more “urgent” at the moment.
I have two meetings scheduled from 8:00 to 10:00 am, thus will join the coding meeting as soon as I can, but no later than 10:00 am.

alberto · September 09, 2022 07:31

Which corpus does it affect? Yesterday I started looking at the Czech data.

Richard · September 09, 2022 10:48

I’ll join the meeting at 9.00 US / 14.00 UK / 15.00 EU. Is it the usual link?

I don’t think there’s any point having a separate ethnographers meeting this week. I’ll be meeting with Nica and Jan separately to go through the German texts.

Maniamana · September 09, 2022 11:53

If any of the ethnographers wants to meet to discuss anything, I am up for it, and I have set up a link HERE

Maniamana · September 09, 2022 11:57

BUT if you would rather not, I am not insisting!

Nica · September 09, 2022 12:27

I am meeting with @Richard and @Jan later to go over the German material/notes. @Jan and I met with @jitka.kralova last Friday to go over her “homework” and we talked through the revisions we asked her to do. The Polish “homework” is now all in but because it came in later we don’t have notes for @Maniamana yet – I aim to do that over the weekend. So I think I am not needed anywhere until the Richard/Jan meeting – but if the ethnographers are doing a sub-meeting at Mania’s link, I can join, just let me know.

Richard · September 09, 2022 13:02

@SantosCardonaPR, @Wojt and @Jirka_Kocian - where are you meeting? Are you meeting now?

jitka.kralova · September 09, 2022 13:03

ok, we decided not to meet in the end. Let’s discuss our progress next week.

Jitka + Mania

Jan · September 09, 2022 18:01

Hi @alberto, the issue is that the backend holds both the old and the new systems of codes. So, sometimes a given code x becomes a child of an “old parent,” and sometimes of a “new parent.” The platform does not help with distinguishing them. We understand the problem well now and are working on the solution. At some point we may want to reach out to you, as with your skills you may be in a position to help. However, we think you should go ahead with the visualizations, as we want to start developing familiarity with them, and having a given corpus visualized may also help with determining the scope of the issue. @Wojt, @Jirka_Kocian, and/or @SantosCardonaPR can brief you better on the details. Please reach out to us with questions.
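One way to gauge the scope of the old-parent/new-parent problem is to follow each code up to its top-level ancestor and flag labels whose copies end up under different roots. This is only a sketch under assumptions: the `parent_id` and `name_en` fields and all sample rows are hypothetical, and it does not know which root is "old" and which is "new".

```python
from collections import defaultdict

def root_of(code_id, parent_of):
    """Follow parent links up to the top-level ancestor (guarding against cycles)."""
    seen = set()
    while parent_of.get(code_id) is not None and code_id not in seen:
        seen.add(code_id)
        code_id = parent_of[code_id]
    return code_id

def labels_with_mixed_roots(codes):
    """Return {label: set of root ids} for labels that sit under more than one root."""
    parent_of = {c["id"]: c["parent_id"] for c in codes}
    roots = defaultdict(set)
    for c in codes:
        roots[c["name_en"].casefold()].add(root_of(c["id"], parent_of))
    return {label: ids for label, ids in roots.items() if len(ids) > 1}

# Made-up hierarchy for illustration only.
codes = [
    {"id": 1, "name_en": "old root", "parent_id": None},
    {"id": 2, "name_en": "new root", "parent_id": None},
    {"id": 3, "name_en": "Europe", "parent_id": 1},
    {"id": 4, "name_en": "Europe", "parent_id": 2},
    {"id": 5, "name_en": "Germany", "parent_id": 4},
]
mixed = labels_with_mixed_roots(codes)
print(mixed)
```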

alberto · September 12, 2022 10:24

Ok. A first look at Czech data is here. Today I will do the same for Polish data.

Still missing:

Analysis by gender
Comparative analysis.

alberto · September 12, 2022 11:30

@SantosCardonaPR, I need an answer to the issue of the duplicate codes for Europe:

The Czech corpus uses only this one:

https://edgeryders.eu/annotator/codes/9943

So I suppose the other one must be merged into it.

SantosCardonaPR · September 12, 2022 11:33

Hi Alberto, sorry I missed this.

I have a class in 1 hour, but after that, I will look into it and clean it. I will let you know whenever I finish it, but be sure it will be today!

Best,
Santos

SantosCardonaPR · September 12, 2022 19:18

Hi Alberto,

So I looked at the codes, and it seems like the Europe code with the Z parent only had annotations from other projects. However, I removed the parent. Thus, for ethnopoprebel, the code is Continents → Europe (9943).

In short, there are two “Europe” codes, but only one contains ethnopoprebel annotations, the one with Y-X hierarchy.

Please let me know if anything is not clear. Best,
Santos