DE Lab Presentations and Discussion hosted by Edgeryders online June 3, 2020,
“Making sense of a COVID-19 world: applying collective intelligence to big data”
Kristof Gyodi (@kristof_gyodi) and Michal Palinski (@mpalinski)
Host:
Welcome everybody to this NGI Forward COVID-19 presentation and discussion. Our guests of honour from DE Lab in Poland, Kristof Gyodi and Michal Palinski, will begin with a presentation of their findings. They will take over the screen for about 20 minutes. Then a robust discussion should get going. I’m John from Edgeryders, and Louis is from Nesta. We are the hosts of this meeting. Maria is going to be looking after the chat. I will facilitate the discussion, which probably won’t take a lot of work, and we know that the chat is an important component of the discussion. Sometimes important things get entered into the chat when a lot of people are talking, so if you have an important point you want to bring up and you can’t find a way to work your way into the conversation, you can put it in the chat and we will try to bring it onto the table for everybody. So with that, I think we’re ready to go. I also want to remind everybody that we are recording this and that you have agreed to it. Alright, without further ado, let’s go forward to the presentation from DE Lab.
Michal:
Thank you, John, for that introduction. My name is Michal Palinski, and together with Kristof Gyodi I will try to present today’s topic: the social challenges brought by COVID-19. We are both analysts at the Digital Economy Lab (DE Lab) at the University of Warsaw and PhD candidates at the Faculty of Economic Sciences, also at the University of Warsaw. We work mainly in a project called NGI Forward. Maybe you have heard about it. It’s a project launched by the European Commission to identify the main technologies and interrelated social issues in relation to the internet, and our role at DE Lab is to bring data science analysis, involving machine learning, web scraping and natural language processing, to this topic. One leg of this project is to provide analysis related to the COVID-19 pandemic. We tried to apply the methodologies we use day to day in our NGI Forward project to this challenge, and we would like to show you some of the results. Okay, I will share my screen now, so that you can see our website. If you didn’t see the invitation, the address is https://covid.delabapps.eu/. There you can find our results and our visualizations.
I will introduce our main idea. We wanted to analyze four main areas. The first is news about recent technologies. The second is social media, where we chose Reddit because of its popularity. When it comes to open source, we chose GitHub repositories. And we also analyzed scientific papers, mainly from the Social Science Research Network, and data sets that are openly available. When it comes to news about COVID-19, we analyzed, I think in the end, around 11 news outlets in English that cover technology, and we wanted to see what the technological angles on this pandemic are.
We have identified some growing trends. Here you can see bubbles that represent keywords we identified using a basic linear regression analysis, and we have divided these keywords into two main groups. In the first bubble you can see clusters of keywords that are related solely to this pandemic; in the second bubble, you can see keywords related to technologies. The bigger the bubble is, the more robust the trend of the selected keyword. So, for example, you can see that in recent months such issues as social distancing, short-term rental (Airbnb) or issues related to unemployment are getting traction. You can think about some keywords that you find interesting right now, and we can check them in Python and see whether they are growing or not. It is the same with technologies: we can see what types of technologies are getting traction recently, so you can comment in the chat and we can check whether they are trending or not. So this is the first bit of our analysis, the most high-level, let’s say. Maybe I will mention that our analysis is semi-automatic, so we use many automated tools, but some expert analysis was also involved, as in the case of these bubbles: we used the regression to identify trends, and then, with Kristof, we clustered these keywords into these nice bubbles.
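As an illustration of the regression-based trend detection described here, the following is a minimal sketch and not the team's actual code. It assumes each keyword already has one frequency value per week; the function name `trend_slope` and all the numbers are hypothetical.

```python
from statistics import mean

def trend_slope(freqs):
    """Ordinary least-squares slope of frequency against the period index."""
    xs = list(range(len(freqs)))
    x_bar = mean(xs)
    y_bar = mean(freqs)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, freqs))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Hypothetical weekly frequencies (mentions per article); not real data.
weekly = {
    "social distancing": [0.0, 0.1, 0.4, 0.9, 1.2],
    "blockchain": [0.5, 0.5, 0.4, 0.5, 0.4],
}

# Keywords with the steepest positive slope come first.
ranked = sorted(weekly, key=lambda k: trend_slope(weekly[k]), reverse=True)
```

A keyword with a large positive slope would correspond to a bigger bubble; a human reviewer would then still filter and cluster the ranked list, as described above.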
Kristof:
Yeah. And following this identification of the main words, we were interested in diving deeper into these news stories. So what we did next was analyze the sentiment of the text. With sentiment analysis, you can tell whether a text is rather positive, negative, or somewhere in between. This is determined based on the words of the text. So we selected some of the interesting news stories that are maybe more controversial, such as news stories related to contact tracing or news stories related to Zoom. And we did two things. First, we examined which words are frequently mentioned together with, for example, Zoom. So we identified all of the paragraphs that contain the word Zoom and examined which words also appear frequently in these paragraphs.
The second thing was that we calculated the sentiment of these paragraphs and ranked these word combinations based on the sentiment. This way we were able to see, first, which are the frequent combinations, and second, which are the positive and which are the negative relationships. So we got an overview of some of these controversial news stories. As an example, in the case of Zoom, the big part, that is, the paragraphs that focus on the technology itself, on the video conferencing aspect and the different use cases, is rather positive. On the other hand, there were recently lots of sketchy stories about uninvited guests. That’s why we also have a password for this meeting. So different pranksters show up and disrupt meetings. And this is also featured in the news about Zoom; we calculated the sentiment and it was negative or in the neutral zone. We prepared similar analyses for news stories related to contact tracing as well. So you can see which are the positive sides of this technology, and also which are the negative issues, such as privacy, the use of location data, and similar issues. We also identified which are the relevant actors, and so on. This kind of tool has been very useful for us to quickly map out the main angles of a news story.
The second thing that we want to highlight is related to the map of articles. Here we use a different machine learning tool, and we find that an algorithm called t-SNE has been very useful in this type of analysis. Our idea was to prepare a map where articles that cover the same topic are in close vicinity to each other. So neighbouring points should report on the same news story. The location of each point is determined by this algorithm; for the colour, we used another method called topic modelling. This is a different methodology, but in theory, points of the same colour should focus on the same topic, and points that are close to each other should be reporting on the same news or on something related. And basically, that’s what the map is showing us. So now Michal is showing, for example, the dots in the northern section. These are the news stories on Zoom. So all the points here are reporting on Zoom, whether it is something positive or negative. We also picked out interesting clusters such as news stories about contact tracing, so now Michal is showing you the area of the map where you have reporting on the different initiatives, different national projects, and also some concerns related to this technology. We also found some other clusters related to companies. So, for example, this one is related to Facebook. In the southern section, we identified the news on Amazon. And again, these are the news stories related to COVID-19. So it enabled us to have a quick visual overview of all the major topics and the news reporting on recent developments.
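The two steps described here (find the words that co-occur with a target word, then rank them by paragraph sentiment) can be sketched roughly as follows. This is a toy illustration under stated assumptions: the four-word lexicon, the example paragraphs and the function names are all hypothetical, and a real analysis would use a full sentiment tool instead of this tiny dictionary.

```python
import re
from collections import defaultdict

# Toy sentiment lexicon (hypothetical values); real tools ship far larger ones.
LEXICON = {"useful": 0.6, "easy": 0.5, "uninvited": -0.5, "violate": -0.7}

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def score(text):
    """Mean lexicon score of the sentiment-bearing words; 0.0 if there are none."""
    hits = [LEXICON[w] for w in tokens(text) if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def cooccurrence_sentiment(paragraphs, target):
    """Map each word co-occurring with `target` to (count, mean paragraph sentiment)."""
    totals = defaultdict(lambda: [0, 0.0])
    for p in paragraphs:
        words = set(tokens(p))
        if target not in words:
            continue  # only paragraphs mentioning the target are considered
        s = score(p)
        for w in words - {target}:
            totals[w][0] += 1
            totals[w][1] += s
    return {w: (n, total / n) for w, (n, total) in totals.items()}

# Hypothetical paragraphs, for illustration only.
paragraphs = [
    "Zoom makes video meetings easy and useful",
    "Uninvited guests violate Zoom meetings",
    "Slack is useful",
]
result = cooccurrence_sentiment(paragraphs, "zoom")
```

Sorting `result` by the second element of each tuple would give the positive and negative word combinations; sorting by the first gives the most frequent ones.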
Michal:
Okay, so another area we have analyzed was social media. Here we wondered which social media to choose, whether Twitter or Facebook. In the end we chose Reddit, because of its popularity among tech enthusiasts, and we analyzed both comments and posts. Using Reddit data, we can also analyze certain keywords. We chose some sub-areas related to the job market, mental health, and apps used for remote work and communication, and we analyzed the trends of these keywords. Here, for example, you can see trends of comments and posts that cover the topic of face masks, a major restriction around the world right now. As you can see in this graph, there is an exponential growth of comments about this topic. And this will be the case with many of our graphs. So exponential growth is not only related to infections, it is also visible in social data. Here you can see how the issue of wearing a face mask was growing in popularity on Reddit in recent months; the peak was at the beginning of April. Of course, you can then do this analysis on a more granular level, so you can see in which countries the discussion was most present. We also tried to map this issue on a timeline. We should update it because, as most of you know, in Poland we don’t need to wear masks in public right now, only indoors, in rooms and buildings. Another area we wanted to cover was the threat to mental health. Because of the isolation, it was a very contentious issue. We hypothesised that there would be growing activity in subreddits related to such issues as anxiety, depression or suicide watch, and that was actually the case, at least with anxiety. As you can see in the graph, the number of comments in the anxiety subreddit was skyrocketing at the beginning of March, which is really troubling, and it might be a good tool for policy, for example, to monitor public attitudes towards isolation, quarantine and restrictions.
Kristof:
The other aspect of COVID-19 is the economic impact. So we were also interested to see if we can track any activity on Reddit related to changes in the economy. And you can see that the discussion around, for example, unemployment really skyrocketed. There was a huge influx of comments covering unemployment, from less than 1,000 a day to almost 10,000. This suggests that there was a growing tension. We also know now from macro data that, in fact, a really high share of people are losing their jobs across the world. The labour market graph shows the number of comments that contain "unemployed" or "unemployment". In the case of automation, we were also interested to see whether there is a more lively discussion around automation tools, I guess because there is a possibility that it would also be a long-term solution to some of the recent problems in supply chains and other problems related to the pandemic. And you can see on the graph that from mid-March there was increased activity on Reddit related to automation, robots and such words. The other thing that we did was examine which subreddits have been most active in discussions on unemployment. On this graph, you can see the names of the subreddits, and this gives us some idea about the discussions. On the one hand, you have stuff related to the economy, such as the small business subreddit, or legal advice. You also have the personal aspect, such as subreddits on personal finance or "off my chest". There is also the mental health side, which is now very important and maybe a bit neglected part of COVID-19, and so on. So this graph summarizes all the sides of the challenge that is happening on the labour markets.
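The labour market graph described here is, in essence, a daily count of comments containing certain keywords. A minimal sketch of that counting step, assuming the comments are already available as `(day, text)` pairs (the data below is made up, and how the comments are collected is outside this sketch):

```python
from collections import Counter

KEYWORDS = ("unemployed", "unemployment")

def daily_counts(comments, keywords=KEYWORDS):
    """Count, per day, the comments whose text contains any of the keywords."""
    counts = Counter()
    for day, body in comments:
        text = body.lower()
        if any(k in text for k in keywords):
            counts[day] += 1
    return counts

# Hypothetical comments, for illustration only.
comments = [
    ("2020-03-01", "Nice weather today"),
    ("2020-03-15", "I just filed for unemployment"),
    ("2020-03-15", "Unemployed since last week"),
    ("2020-03-16", "Unemployment offices are overloaded"),
]
counts = daily_counts(comments)
```

Grouping by subreddit instead of by day would give the second graph mentioned above, the one listing where the unemployment discussion takes place.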
Michal:
Another thing: we wanted to see the competition among these remote work apps. And here we can see some interesting trends. For example, Skype experienced a surge in comments on Reddit, so it was probably the tool of first choice. And then you can see this decreasing trend. It is a different story with such tools as Discord, or even Zoom, which plateaued after a while but didn’t experience a huge downward trend. Kristof mentioned some problems with Zoom, right, but they didn’t affect its popularity, at least on Reddit.
What we thought might be interesting from the policymakers’ perspective were comments about lockdown, lockdown easing and also lockdown protests. We wanted to see whether communities on Reddit are commenting on this issue, and we thought it might be a nice tool to monitor the public sentiment about lockdown and quarantine. For example, comments on anti-lockdown protests increased. There was a surge in the middle of April, probably related to the situation in the US. As you know, it was quite a contentious issue there, and recently it’s even more so, right? You can see that lockdown easing experienced this huge growth from the middle of March. We can see it in the Reddit data. So one thing is this activity in social media, but another thing is the actual sentiment, because we can speak about lockdown easing, but what is your opinion about it, right? So we have applied sentiment analysis, the technique that was explained to you at the beginning. And this is just the beginning of our analysis. This is the mean sentiment, let’s remember that. But we can see that the sentiment about lockdown was pretty negative at the beginning, and in the following months of April and May it is more and more positive. How can you interpret those values? The scale is from minus one to plus one, where minus one is really, really negative and plus one is super excited. We usually say that when sentiment is around 0.2, it is meant to be quite positive. So right now the comments about lockdown are in the neutral or slightly positive area. What we want to examine in the future is the distribution of the sentiments, because right now it’s just a mean. But we will be happy to hear your ideas on how we can make some deep dives into it to examine the changes in the perception of lockdown.
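The point about looking at the distribution and not only the mean can be illustrated with a small sketch. The scores below are made up, and the 0.2 cut-off simply reuses the rule of thumb mentioned above; both are assumptions for illustration, not values from the study.

```python
from collections import Counter
from statistics import mean

def bucket(score, cutoff=0.2):
    """Label a score on the -1..+1 scale; the cut-off is an assumed rule of thumb."""
    if score >= cutoff:
        return "positive"
    if score <= -cutoff:
        return "negative"
    return "neutral"

# Hypothetical per-comment sentiment scores, for illustration only.
scores = [-0.8, -0.7, 0.9, 0.8, 0.1]

overall = mean(scores)                              # a single summary number
distribution = Counter(bucket(s) for s in scores)   # how the comments actually split
```

In this made-up example the mean lands in the neutral zone even though most comments are strongly positive or strongly negative, which is the kind of polarisation a mean alone would hide.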
Kristof:
The final section where we want to dive a bit deeper is related to open source projects, that is, projects on GitHub. The first thing we did was collect the metadata for all the projects that mention COVID-19 in their title or project description, and we identified the projects and also the location of the developer who prepared each project. This way we are able to track the number of projects geographically and also dynamically, how the number of projects evolved over time. So if you run the video, you can see the 15 most important countries on GitHub in this COVID-19 project section, and you can see some interesting things happening. Naturally, in the US and in Europe we have a great number of projects, and also in India and in South America. So basically, it’s all around the world, but you have these countries where there is quite a lot of open-source activity going on. The other thing that is interesting is China. Because this was the epicentre of the pandemic, there were already projects from China early on, in the first week of the time period that we are tracking. However, at some point the rest of the world really picked up in the number of projects, while China remained at a relatively low number. So this is an interesting question: how come the number of projects remains relatively low in China? We don’t have a specific answer to this, because at this point we are more interested in mapping out the data and seeing what’s going on. So we don’t really know the answers to the whys. But it can be interesting to observe how societies work and what the role of the state is across the world. It can be that in China, maybe it is not the best idea to take this kind of individual approach to having a say in very important social issues. Maybe the state is taking a stronger stance in these kinds of things where location data and technology are involved.
On the other hand, we have Europe, where we have a less efficient state taking up the fight against coronavirus, and where there is much greater opportunity for individuals to put together these kinds of projects and just release them to the world. Related to this, we also have some interesting stories. For example, the number of new projects that are released every day.
Michal:
So again, you can see another exponential growth here. But this initial rush has been slowing down in recent weeks, and we wonder what is going on. Was it just a phase, and after this initial rush the activity is slowing down? What we want to examine right now is whether the activity in these repositories is still going on or not. But you can see that at the beginning of May there were still around 400 repositories a day created on GitHub, on GitHub alone, and you still have GitLab and other hosting platforms. So in my opinion, it’s unprecedented. I haven’t done an analysis of GitHub on a large scale, but it seems that this is unprecedented.
We have analyzed the keywords in the introductions of these repositories. What’s interesting is that you can see that they are mostly aimed at visualizing this epidemic and creating apps. I thought that there would be more models aimed at explaining and forecasting the number of infections and the victims of this pandemic. But it seems that communities mostly focused on creating dashboards and web visualization tools. You can also see the main languages used in these projects. This distribution is a bit different than in the case of GitHub overall, as we have checked, so there are more Jupyter notebooks. So it seems that people are creating data science projects that can be easily transferred to other users.
We have also checked which are the most popular repositories. Here you can see the table that ranks repositories related to COVID-19 on GitHub by the number of forks, watchers and so on. It’s not surprising that in first place there is the repository by Johns Hopkins University, which is one of the main sources around the world providing the number of infected people and deaths from COVID-19. But you can see some interesting examples here. For example, number seven is 2019 nCoV memory, which is no longer on GitHub. This is the case Kristof told you about earlier. So this is the repository from Wuhan, which covered stories, memories of 100 residents related to COVID-19, and was censored. The creator of this repository is in jail right now in China, and it’s no longer available. It was available when we scraped the data, but right now you will see just a page with a notice that it’s no longer available. In the beginning we were surprised, because we didn’t know the background story. And then we read all these articles about the censoring of open source projects in China. So it’s an interesting case where you first see something in the data and then you read the newspapers. But as John mentioned in the beginning, there is such a flood of information about this issue that you’ve got to streamline it a bit.
Kristof:
Yeah, so maybe these were the most relevant parts of the research. We also have other graphs and even other data sources, because on the website we also have some graphs about research papers, as well as geographical analysis related to research. You can also find some of the code for our tools, showing how the research can be replicated, and some tutorials on using the code. So if you are interested in data science and you have relatively basic programming skills, then using our tutorials you will be able to read about the different functions, fire up the code yourself and check for different words, topics and so on. So this was the introduction to our research. And now we are very happy to answer questions. I see that the chat has been quite active.
Host:
Thank you very much. First questions were just simply: did you pick your keywords based on frequency count?
Kristof:
So at the beginning, where we had this visualization with the bubbles, what we did was calculate an average value for frequencies. So what is the average number of times a term appears per article, and we took an average across all the sources for different periods of time. This was the basis for identifying which words are gaining traction, that is, which are used more and more frequently. So yes, that was related to frequency.
Michal:
You can see our example. So I just checked BBP (?), the privacy-preserving standard for contact tracing, and here we can see the growing frequency of this word.
Audience comment:
So it’s based on the time difference, not on the absolute frequency count per se. I’m trying to understand what the null model is: how can you say one thing is more frequent than another?
Michal:
So we have different options. We can check the mean over all weeks, let’s say, or we can see what is the biggest difference between the beginning and the end, right? We can also do some normalizations. We have experimented with different methods. But first of all, we see which of these words are characterized by growing popularity over the whole period. And then we choose those which had the biggest positive trend over all these weeks.
Kristof:
So the basic idea is that for every month, for every source, we have this average frequency, and we create for every month a weighted value based on the source. And then we have this regression, where we can calculate these trends: how did this frequency evolve over time? What was also tricky here was to try to do justice to different sources, because we have very big sources such as TechCrunch or Reuters, and we also have a bunch of small ones that are relevant in the European discussion. So we had different approaches. We had one approach where it was a weighted average, so the size of the source, the number of articles published by the source, was the basis of the weight. That was how we weighted the values. But we also had a version where we gave equal weight to every source, or even gave higher priority to the smaller ones, the ones that may have greater coverage of social issues. So there were also some such results, though not in this analysis, because here we have quite a short time period: we are just analyzing news from the beginning of January until the beginning of May. But during the entire Next Generation Internet project we have been collecting articles for quite a long time, for around three and a half years, and there we had more room to address this issue.
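The difference between weighting sources by size and weighting them equally can be shown with a short sketch. The outlet names, frequencies and article counts below are hypothetical, and `weighted_mean` is an illustrative helper, not the team's code.

```python
def weighted_mean(freq_by_source, weights):
    """Average a keyword's per-source frequency using the given per-source weights."""
    total = sum(weights[s] for s in freq_by_source)
    return sum(freq_by_source[s] * weights[s] for s in freq_by_source) / total

# Hypothetical monthly frequency of one keyword (mentions per article) by source.
freq = {"big_outlet": 0.2, "small_outlet": 0.8}
# Hypothetical number of articles each source published that month.
articles = {"big_outlet": 900, "small_outlet": 100}

by_size = weighted_mean(freq, articles)             # large sources dominate
equal = weighted_mean(freq, {s: 1 for s in freq})   # every source counts the same
```

With these made-up numbers the size-weighted value is 0.26 and the equal-weight value is 0.5, so a keyword used mainly by a small outlet looks much more prominent under equal weighting. Either monthly series could then be fed into the trend regression.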
Audience comment:
And were all of the 52 top words COVID-relevant?
Kristof:
So here, in the first visualization, where we have the bubbles, the list has been filtered by us based on the top thousand trending words. So we created this list of the top thousand trending words, and then we started to filter out the relevant terms. That’s why this tool is semi-automatic. It enables us to identify the trending terms relatively quickly. And the task is especially easy when you can see the different relationships and how the topics come together. So, for example, social distancing, lockdown and so on: these words appeared at the beginning of the list.
Audience comment:
I wanted to ask you if it’s possible, or whether it even makes sense, to compare these trends that you’re presenting to other trends that might have occurred during other major events of the 21st century, for example the 2008 global economic depression, because I guess that anxiety was also one of the trending words then. So maybe it’s worth looking into it.
Michal:
This is a very nice idea. The problem is to have comparable sources because I wonder if social media data is available for this period.
Kristof:
It’s a very interesting suggestion. And I’m sure that someone has already done something like that, because social media data is getting increasingly relevant in macroeconomics too, so how to forecast economic events based on the words in use, and so on. We don’t know the answer to this specifically, whether we can use Reddit for 2008. My intuition is that yes, and I’m sure that it would be very interesting research, and hopefully someone will do that. Maybe we will find the time, as you say, because that’s an emerging research area, how to use social data in this context as well.
Audience comment:
Okay, so let’s, let’s imagine that we can actually compare this data. You find the time and you find the source and everything is nice and beautiful. What kind of conclusions could you draw from such a comparison? Because those are major events. But they’re hardly comparable when it comes to the trigger that started these events, and how people respond to the timelines. So if you were to compare it, I guess my question is, what kind of scientific question you would like to answer with your research and with your data?
Michal:
First of all, I’ve just checked, and the subreddit for anxiety was created in 2008. So it might be the case that we could do it.
Audience comment:
I am from the University of Warsaw. And I’m curious, did you consider sharing this data in some subreddits, maybe some national subreddits? I’m very curious about how the Redditors would react to this data.
Michal:
So when it comes to visualizations, I guess we might try to post them on "Data is Beautiful", but our colleague who is doing most of the visualizations is always saying that they are not ready. So maybe in future iterations they will finally be ready to be posted there. I just wanted to also answer the question from Kasha.
Audience comment:
Just a question about the training for sentiment. And second, have you employed any kind of benchmark for sentiment analysis algorithms?
Michal:
Okay, so answering the question about LDA. We have used this algorithm for topic modelling on this news data set, and again on a data set of introductions to all of the repositories on GitHub. The introductions are basically two or three sentences, while in the case of articles the texts were much longer, and there it was used as a tool for summing up the main topics in the news. We often use topic modelling because it’s a nice tool for identifying latent topics in the data sets we analyze. So it’s for the first step of the analysis, right? Just to summarize a corpus of texts. All the code related to LDA is available on our GitHub. So if you are interested in some technical details, how we do it, how we choose the parameters for this algorithm, or how we compare the results, it’s all there. We have a working paper where we…
Audience comment:
So it’s on the website. And then for the trends, for example within privacy, did you also rely on LDA, or no?
Michal:
So, there was one attempt in this analysis to do this topic modelling dynamic
Audience comment:
to put the trends together where you put the right trends - that big picture of trends.
Michal:
So, what we have done was something pretty different. What we have done in this time dimension was to see how topics evolve. We have used a tool called dynamic topic modelling, where we identify topics in the corpus and then see how these topics evolved during our analysis period. So you can see it, maybe I will start…
Kristof:
For this section of the analysis, we mostly used topic modelling just to give us context, and then we started to use dynamic topic modelling, which also enables us to see changes in topics, that is, how the probability of different words changed in a given topic. So, for example, here you can see topic one, which appeared as a big topic collecting the articles on COVID-19. What you can see here is which are the interesting words that had a big change in their probability over time. This is our first experiment with dynamic topic modelling. In this example, what has been interesting for us is to see, for example, the change in different geographic names. New York appears in this topic from the end of March, but previously it was not present. So it suggests that the news stories about COVID-19 and New York appeared just at the end of March. On the other hand, China had a strong presence in January and then gradually decreased. So you can see these kinds of dynamics within a topic. Coming back to your question of whether we use LDA to identify trends: at this stage, the stuff that’s happening in the bubbles at the beginning of the presentation is based on these time-frequency and regression analyses, and we didn’t use LDA for an automatic generation of trends. But we often think about it, and if we do find a way to do that, I’m sure that we’ll report on it. And when it comes to the remaining questions, I think we have just one question, related to sentiment analysis.
Audience comment:
Is there any kind of benchmark for sentiment analysis as part of your project?
Kristof:
So here we use a sentiment tool called VADER. It is based on social media text and is meant mostly for shorter snippets of text. This has been the tool that we have been using.
Michal:
It is a dictionary-based tool, a lexicon-based tool, where every word is given a sentiment score, right?
Host:
Because you asked a question, and then we started to go there, but we didn’t really go there.
Michal:
It’s a good one, this question about the comparison between, let’s say, 2008 and today’s situation, right? What is referred to quite often when it comes to technologies is the hype cycle, right, the Gartner hype curve, and maybe there is something similar with such disasters or uncommon crises, like pandemics or financial crises, that can be represented by some sort of curve, right? We are not field experts. We are not psychologists, we are data scientists. So what we could do is compare these curves. We could do some econometrics and time series analysis to compare trends in, let’s say, this anxiety subreddit. This is something that would be interesting for us to check out. I think it would also be interesting to cooperate with some psychologists and think more about this in the future.
Host:
Did you map the depression part going back to 2008 too? I thought it was curious that anxiety would go up, which I would expect, but depression went down. That I didn’t really get.
Michal:
This is interesting. What we thought is that maybe it’s the case that what we are observing is <?>, not actual symptoms or patients, but rather narratives or discourse about mental health. So it might be the case that during such events you are expressing more active problems and you are more agitated. Or, let’s say, you are not expressing symptoms of depression on social media, and maybe after being quarantined for a while you start doing it again. But in the beginning it’s more about anxiety, right, about something unexpected. That’s my take on it as a layman in psychology. But what the data says is what we present here: a sharp increase in comments and posts about anxiety and an interesting decrease in posts and comments in the subreddit about depression.
Host:
I think that also connects to this question of how research questions are then formulated or addressed with this type of research. How would this data set then come together with, for example, the psychologist or the researcher who formulates the research question?
Audience comment:
Yes, exactly. How do you communicate your research to different fields? That is important.
Michal:
That’s a tricky question for us. We are in this good position that in the lab we are an interdisciplinary team of researchers. So we work on a daily basis with lawyers and people from the sociology department, mostly. And we have some discussions where we try to communicate our results to them, and they make use of this data. But we are really open to cooperating with all social scientists and scientists on it. If you have any ideas or suggestions on how we can do it better, please go ahead.
Audience comment:
I don’t have any suggestions, sorry. I was just curious. But thank you for circling back to my question.
Audience comment:
I would like to propose something. It may be a bit crazy, but following up on what Kasha said, I like this idea of the hybrid technique in which you would have some kind of domain expert helping to channel the mass of data. And along those lines, I wonder if we could kind of dream up a methodology that looks like this: you get a bunch of domain experts to produce some kind of statement, think of it as a focus group. And they will say something like, and I’m quoting the one we had on surveillance and the pandemic about a month ago, they will say something like, “policymakers tend to overestimate the effectiveness of technology-based surveillance vis a vis the pandemic.” Now, once you have a statement like this, which is exogenous with respect to the data, then maybe you can use the data to find echoes of it in the data source. So in this case, you would be looking for something like “everybody agrees, everybody in the know, the expert community agrees with x”, where x is the statement. And since in this case the statement is an example of solutionism, you would find a bunch of solutionist words: solution, effective, efficient, real-time, scalable, algorithmic, stuff like that. Then see if this kind of language trends, even though we have a problem of the null, the counterfactual: does it trend more than the average of whatever you analyse? Then see if it co-occurs with epidemic language. You would build another vector of epidemic words: COVID, coronavirus, pandemic, whatever. Then see if it co-occurs with tech solutions to the pandemic, like contact tracing apps, and see if it co-occurs with things like dodgy surveillance companies that were also mentioned in the same focus group as treating the virus as a business opportunity <?>, Palantir, etc. If all of this occurs, it means that there is an association between solutionist language and the pandemic. That is one.
And if the sentiment is negative, it is looked upon with scepticism, or with this negative sentiment, by the community, for whatever values of a community. And in that case, you could say, well, we dreamed up this thing in a focus group, but it actually looks like it’s in line with what is coming out of the data. So there would be a sort of fairly complete method, in the sense that at least it triangulates a theme; otherwise you’re always left with these kinds of anomalies that are very hard to interpret. Is that possible to do?
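The co-occurrence check proposed here could be sketched roughly as follows. This is only an illustration under stated assumptions, not DE Lab’s code: the word lists are the examples named in the proposal, the function names and the four-sentence corpus are made up, and the sentiment step is left out. The "lift" compares how often both vocabularies appear in the same text with what independence would predict, which is one crude stand-in for the missing null model.

```python
import re

# Hypothetical word lists; in the proposal they would come from the focus group.
SOLUTIONIST = {"solution", "effective", "efficient", "real-time", "scalable", "algorithmic"}
EPIDEMIC = {"covid", "coronavirus", "pandemic"}

def tokens(text):
    """Lowercase a text and return its set of word tokens (hyphenated words kept whole)."""
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))

def cooccurrence_lift(docs, terms_a, terms_b):
    """Return (share_a, share_b, share_both, lift) over a list of texts.

    lift > 1 means the two vocabularies appear together more often than
    they would if they were independent of each other.
    """
    n = len(docs)
    if n == 0:
        return 0.0, 0.0, 0.0, 0.0
    hits_a = hits_b = hits_both = 0
    for doc in docs:
        toks = tokens(doc)
        in_a = bool(toks & terms_a)
        in_b = bool(toks & terms_b)
        hits_a += in_a
        hits_b += in_b
        hits_both += in_a and in_b
    share_a, share_b, share_both = hits_a / n, hits_b / n, hits_both / n
    lift = share_both / (share_a * share_b) if share_a and share_b else 0.0
    return share_a, share_b, share_both, lift

# Made-up toy corpus, only to show the call.
docs = [
    "A scalable contact tracing app is an effective answer to the pandemic",
    "Coronavirus cases rise again in several regions",
    "New phone released with a better camera",
    "An algorithmic solution for covid surveillance raises privacy fears",
]
share_a, share_b, share_both, lift = cooccurrence_lift(docs, SOLUTIONIST, EPIDEMIC)
print(f"solutionist={share_a:.2f} epidemic={share_b:.2f} both={share_both:.2f} lift={lift:.2f}")
```

On a real corpus the same function could be run per week or per month to see whether the association trends, and a sentiment score could be computed on the texts where both vocabularies occur.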
Kristof:
Yes, maybe not in the full glory that you described, but for sure it is necessary. So when it comes to the analysis of word combinations, or sentiments, or even trending terms, it is very important to have feedback on whether it actually makes sense on the one hand, and on the other hand on which areas we should go deeper into. When we are working on these kinds of analyses, we are always selecting certain areas to prepare some deep dives, some case studies, similar to this presentation, and to have these kinds of meetings where we are together with experts from different fields and they are drawing our attention to certain areas. This is very useful for us because then, just as you mentioned, we can, for example, look into these kinds of word combinations and check back in our data whether the general consensus in tech journalism somehow corresponds to their opinion.
Audience comment:
Yeah, so, in that case, you would have an exogenously generated list of words, and that is what makes the whole model kind of click into action, because now you’re not pulling stuff out of whatever; you actually know what you’re looking for, because somebody else is externally validating that those are relevant words. And that we could maybe try and do with this particular, call it a focus group, but we could do others. And by the way, for practical reasons we couldn’t record and code the text of the call, but if we could have, which we will in the future, then we would have precise words and groups of words that you could simply feed to your algorithm and see what comes out of it. Like the exact words that people use to describe the problem at hand. This would somehow fix the problem, I feel.
Kristof:
So we had one in the project before: our results were presented at a workshop for an expert group. The workshop was organized by Aarhus University, so by our team members, and actually the process was that we suggested a bunch of trending topics on the project website. And they showed the presentation to the experts, who then had a great discussion about whether our suggestions actually made sense, for example that greater research should be devoted to different economic aspects of the technology, to words such as platform competition or privacy, this kind of stuff. So they vetted whether our suggested topics are also the most relevant to them. And there was some agreement. There were also some disagreements, and in the end we prepared a best-of list of topics that are trending according to the data and that are also relevant according to the experts. So yes, we do such workshops. It was very useful.
Host:
I didn’t see the word equality or inequality coming up, and at least from where I sit, this pandemic and inequality and now, I mean I realize it’s a broad term, but at least it’s super relevant for us over here, racism, or in other words the handling of immigrants or outsiders, these things cannot be separated. You can’t solve any one of them without the others. Maybe that would come up now, whereas it didn’t so much before.
Michal:
I have just checked, and until the end of May the word inequality was not trending in technological news, and neither was racism. It will be interesting to see whether that is still the case when we iterate our analysis again and update the results, because I agree that, especially when we choose U.S. sources, it might be the case that these issues will be really interrelated.
Host:
Do you know how Reddit is distributed geographically? Does anybody know that? I mean, is it dominated by American comments now?
Michal:
So it depends on the subreddit. But what we want to do is to choose major subreddits of all EU countries and analyze the data.
Kristof:
So when it comes to this inequality aspect, maybe we didn’t pick up the word inequality itself. But we found words that are socially relevant, for example stockpiling or furlough, this whole area related to unemployment. I remember we also had quite a lot of results on the gig economy. So the press is also increasingly covering this relationship between technology and social issues. I think we also need to check vulnerability and justice, as Alberto is suggesting.
Host:
I was curious: as important as the information that is in the graph is how you introduce and present background information, things like the Reddit that we are drawing this from started in this year, or at this point a riot started which changed the meaning of that word, or something like GitHub being used less in China because a different product is used there, and therefore things that would influence that data. There is a myriad of such things, and I just imagine that it is very hard to figure out the right balance of how to write a legend so that it presents all the background information you need to properly understand, and not misunderstand, the trends that you see in a graph. But also to have that in a way that people will still read it at all, because if you put all of it in a) nobody reads it b) it’s just like you have to draw a line somewhere to have something to say.
Kristof:
So Reddit data is a great challenge in this aspect because, in the case of news articles, you have relatively clean data, which is well structured and where the context is relatively straightforward, but Reddit is a much more complex data source because of these different shades and the additional context that you mentioned. So to find the right way to analyze the data, some insights from social scientists are definitely necessary to really dive into the discussion on Reddit in a way that is representative of the communities over there. We had some struggles finding the right ways to analyze this data. And thank you very much for all the suggestions for things to be checked, because they will help us to design the research for the next iterations.
Michal:
That is a tough one: balancing presenting everything about events without overwhelming our users or our audience. This is something we are struggling with. And recently we focused in our team on creating tutorials for our analysis. So when you have basic knowledge of data science, you can use our GitLab code and look for growing keywords, you can look for hidden topics in these data sets yourself, and calculate everything we have done so far but using different hypotheses or research questions. I know that there is still this difficulty because you still have to know the basics of Python. But at the moment, we cannot do anything about it.
Host:
Are you documenting those ways, like how people interpret and use it in different ways?
Audience comment:
Yes, this was the focus of our recent work. We hope that it is understandable right now. You can check for yourself and check our tutorials. We have put a lot of effort in so that they are visible. Maybe in the future we can do it even, you know, for people who don’t use Python. I guess we can create some interface where users just type in some keywords and get the results of our analysis. Maybe in the future.
Host:
You looked at the news, the press, and you looked at Reddit, I’m curious, did you analyze what the news said and then analyze what Reddit said, to see if they matched up much? In other words, let’s say that Reddit is an example of “the people.” And then you get the press. Is there alignment as to what they’re talking about? Or is there a bit of a mismatch?
Michal:
This is something we have wanted to do for, I guess, two or three years, to compare the sources, but we still don’t have a good methodology for that. What we had done previously was just to compare the discourse about certain things on Reddit and in news, to see what vocabulary is used around certain topics. And we found that, for example, when you use Reddit, this discourse is really, really different from the narratives in media. The vocabulary is “mediatized”, right? And you can find very interesting keywords and other angles for viewing some issues related to technology when you analyze Reddit. But it’s very difficult to compare them, let’s say, on one axis. Alberto already mentioned the problem with this null model or baselines, and we had this problem in the case of single sources, but when you want to compare across sources, the problem is even bigger. Right now, what we will probably focus on is creating baselines in all single sources. So for example, there was this question about how we measured this growth against all the comments in the source. This is something we will work on in the future. If you know any good methodologies where people compare social media with news, we are happy to look at them. But we haven’t found any good ones so far.
Kristof:
Nothing similar.
Michal:
We wanted to do this with research papers, to see whether there is one source that is ahead of the other. So is it the case that researchers are taking ideas from popular media, or is it the other way around, or is it different across disciplines? This is something easier to do, I guess. But social media is so dynamic that it’s difficult to compare with the news.
Audience comment:
But is it difficult because of the manpower or lack thereof or is it difficult because of the lack of proper methodology?
Audience comment:
The second thing - it is the second problem. We don’t have any good idea on how to normalize this data.
Kristof:
In the case of news, we can always take into account the number of articles or some other things to normalize. In the case of Reddit, it is challenging to find this single value that we can attach to a certain topic or certain words. So it definitely needs a well-prepared methodology to make these kinds of comparisons.
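One simple form of the normalization discussed here, measuring a term against all texts in the same source and period, could look like the sketch below. It is an illustration only, not the project’s method: the function name `term_share_by_period`, the record format, and the comments themselves are invented for the example, and it does not solve the harder problem of putting Reddit and news on one axis.

```python
from collections import Counter

def term_share_by_period(records, term):
    """Return {period: share of texts in that period that contain `term`}.

    `records` is an iterable of (period, text) pairs from ONE source, so
    the share is normalized by the total volume of that source per period.
    """
    totals = Counter()
    hits = Counter()
    needle = term.lower()
    for period, text in records:
        totals[period] += 1
        if needle in text.lower().split():
            hits[period] += 1
    return {period: hits[period] / totals[period] for period in sorted(totals)}

# Made-up comments: raw mentions of "anxiety" triple from February to March,
# but the total number of comments doubles as well.
records = [
    ("2020-02", "feeling fine today"),
    ("2020-02", "my anxiety is back"),
    ("2020-02", "anyone watching the game"),
    ("2020-02", "work is busy"),
    ("2020-03", "anxiety about the lockdown"),
    ("2020-03", "so much anxiety lately"),
    ("2020-03", "stockpiling pasta"),
    ("2020-03", "anxiety again"),
    ("2020-03", "new phone"),
    ("2020-03", "lockdown day three"),
    ("2020-03", "news is grim"),
    ("2020-03", "long walk today"),
]
shares = term_share_by_period(records, "anxiety")
growth = shares["2020-03"] / shares["2020-02"]
print(shares, growth)
```

In this toy data the raw count grows threefold while the normalized share grows by half, which is the kind of gap a per-source baseline is meant to expose.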
Audience comment:
The null model problem is the bane of data science, you know, it keeps coming up in all sorts of problems, maybe because we tend to under-theorize, and so of course you have these kinds of regularities. But in a case like this, I would probably try to go down the collective intelligence path. So instead of looking around too much, try to identify an authoritative but large enough data source. Imagine you’re talking about information security and you say, okay, I’m going to just focus on the Chaos Computer Club. What those guys say goes, because they are the crowd that knows this stuff. And then you would focus on whatever their online watering holes are and just ignore everything else. Because otherwise the data can be super polluted. The media, you know, are driven by hype cycles; you get a lot of garbage in and then you will get some kind of garbage out. So maybe instead of big data, it becomes a more hybrid methodology, which is a quantitative validation of expert opinion. That would be something already. It means that most experts agree. Even if you’re trying to do policy, and I used to work for government myself, if you can tell your minister, look, yes, there are a few outliers, but by and large this is what people are saying, that is valuable to them.
Michal:
Yes, such a hybrid method is something we are aiming at. We see ourselves as a foundation for such a discussion, not with the solution, but just to provide some basic info for further investigation.
Host:
I was also wondering: you described before how you’re opening this up to people to use your code, and how you want to open it up to more people who can’t use Python and such. I think this connects back to how you write the right legends for what, and what due diligence you need to do at which point. When you share your data with people in your own field, you have a certain baseline already; you know what they know, and they know to ask or to look for certain background information on their own. The more we open those things up, the more people then take them differently, maybe. But at the same time, you also need to make this accessible to people. So how do you balance out this opening up of data across disciplines in a larger context? And who is responsible, and when, for how stuff is then taken and read? Like when a journalist takes a graph that was there for two researchers to talk about, and it’s not 100% finished, it’s just the start of a conversation. But now the journalist takes it and puts it into a different forum, and that changes the content of it. How do we deal with something like that?
Kristof:
This is a great question, because that’s a big problem or big challenge for the entire academia now: how to communicate results in a way that has social resonance, and how to prepare and present your results for different stakeholders. I don’t think that we have an ultimate solution for that, because we ourselves are struggling when it comes to this presentation, but also in our different research fields. So Michal is working on privacy research, which is rather complicated but is relevant for society. I’m doing different empirical research on Airbnb, and sometimes I have this kind of conversation with journalists. Often the problem is that you have very nuanced results with different aspects to them. And then it gets very oversimplified, or there’s a huge overstatement, or the results are taken to the extreme, and that’s a big challenge. So what we try to do is to filter down our analysis as much as possible. So even if the presentation that we have shown you contains lots of aspects and very different data sources, we really framed it in the best way we could, because we tried to find some narratives that can be shown and that are not that confusing. But what we also try to do is to prepare some blog posts where we provide greater detail on the methodology than in a short presentation. So we try to make different versions for different communities, different observers of our results.
Audience comment:
As an ex-researcher, as an astronomer, and as a teacher now, I would like to tell you that there is a great space for communicating science, but you need people for this. I mean, we have these places that are called science centres, or science museums, where you can do your science hands-on, but it mostly boils down to physics, biology, maybe chemistry, and it’s very difficult to do a museum of science for the research that you do. So you need some kind of intermediate person who is going to understand your science deeply enough and then is going to speak, so to speak, a common language to communicate that science to the public. So it’s very nice that you’re trying to model your output so different groups can understand it. But frankly speaking, I don’t think this is your job. Your job as a researcher is to produce science. You need another person to understand that science and then speak up for you and for your research. That would be great. But if you have any ideas on how to put science into the science museums, that would be even better, then everyone can see it. I think this is a great medium for that, a great outlet.
Host:
Yeah, that’s a great point. And I think even if you ask those questions here, it is not that you are then the one responsible here, but as you said, somebody needs to be responsible for how and where it is communicated and presented. So maybe I’m also interested in everybody’s opinion here on where those critical questions should be addressed, and where responsibilities lie. Some of them lie with researchers. But not all of them can. It’s a multi-layered community and society where responsibilities lie on different levels.
Host:
We are now at the end of our appointed timeframe for this discussion. This was really interesting. And I want to thank everybody.
Kristof:
Thank you all. Thank you very much for all the questions and helpful comments. It has been a great pleasure for us to come and see you and talk about the results, because I think we have spent a relatively large amount of time on doing research, but less on communicating the results than we would have wished. So we are very grateful for this opportunity. And please do get in touch with us in case of further questions. You can find our contact information on the DE Lab website, D lab UW. We are also on Twitter. And we can also meet on the Edgeryders platform to take the conversation further.
Michal:
Exactly. Thank you very much, guys.