Back in the 90s I started and managed the first large news website (sfgate.com). We had the challenge of supplying readers with news from multiple sources (2 newspapers, the AP feed, a TV station, original content and a couple of other sources). None of these orgs or feeds communicated with each other in any way, and the “slugs,” or metadata, were always peculiar to their own source.
We wanted to present the news according to what it was about, not just where it came from. This was especially important with the limited screen real estate on a computer, plus Google had just started out and was nothing like the behemoth we have known for the past 20 years.
Our only meaningful option was machine learning. We worked with the newspaper librarian to create a news category system, and I then hired her on the side to tag a large number of current stories over a period of time so we could get a good sample.
But for machine learning, that is not enough. You need a large amount of data to run it against so that, in this case, the machine can decide what a news story is really about with any meaningful accuracy. The bigger the sample, the more accurate the result. Since we were at a newspaper with electronic archives going back decades, we had that data, and our system turned out to be highly accurate. It was the first - and only - one of its kind. Before Google News or the Apple News app, we were supplying anyone interested with a lot of daily news that you could browse either by source or by subject. This is a big reason why we “punched above our weight” in the news world. Yes, the New York Times within a few years had 4 times the traffic we did, but they spent 20 times the money to get there.
But it was only possible to get that accuracy by basing it on a huge - truly huge - amount of data. Without that, you have to tag by hand.
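For anyone curious what the approach looks like in today's terms, here is a minimal sketch: train a classifier on a hand-tagged sample, then let it label incoming stories regardless of which feed's slugs they arrive with. This is not the system we built (none of these libraries existed then), and the category names and example headlines are invented purely for illustration.

```python
# Minimal sketch of supervised news categorization: train on a small set of
# hand-tagged stories, then label incoming feed stories automatically.
# Categories and example texts are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hand-tagged training sample (in our case, many stories tagged by the librarian)
tagged_stories = [
    ("City council votes on new transit funding measure", "local/politics"),
    ("Quarterly earnings beat expectations for chipmaker", "business/tech"),
    ("Giants rally in the ninth to take the series", "sports/baseball"),
    ("Storm system expected to bring heavy rain to the Bay Area", "weather"),
]
texts, labels = zip(*tagged_stories)

# Bag-of-words features plus a simple probabilistic classifier
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)

# Incoming stories from any feed, regardless of that feed's own slugs/metadata
incoming = [
    "Mayor proposes budget changes after transit vote",
    "Warriors clinch playoff spot with overtime win",
]
for story in incoming:
    print(story, "->", model.predict([story])[0])
```

The accuracy of something like this depends almost entirely on how much tagged and archival text you feed it, which is the point of the paragraph above: the decades of electronic archives were what made the results good enough to publish.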