What coding a webpage has to do with the quality of the news on it – Quartz
A simple look at the components of an HTML page says a lot about the reliability of its content. The problem is, distribution platforms don’t bother to look at these signals. (Part of a series on my News Quality Scoring Project.)
As one microbiologist said, the devil is not in the details, but in the structure. He was referring to the genetic arrangement of a deadly strain of virus. The digital world is a bit like a living organism: it is constantly changing, it is unstable, and a lot of organic waste is scattered around. On The Guardian, for example, each character of article text comes wrapped in roughly 100 characters of code; more about it here.
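That code-to-text ratio is easy to measure for yourself. Below is a minimal sketch using only Python’s standard library; the sample page is a toy of my own making, not a real Guardian article:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def code_to_text_ratio(html_source):
    """Characters of markup and script per character of visible text."""
    parser = TextExtractor()
    parser.feed(html_source)
    text_length = len("".join(parser.chunks).strip())
    return len(html_source) / max(text_length, 1)

# Toy page, for illustration only; real pages score far higher.
page = ('<html><head><script>var tracking = "beacon";</script></head>'
        '<body><article><p>The devil is in the structure.</p></article></body></html>')
ratio = code_to_text_ratio(page)
```

Even this tiny page carries several characters of code per character of prose; a production news page, with its trackers and widgets, pushes the ratio toward the 100:1 figure quoted above.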
The visual tradition of journalism dictates a minimum set of elements that let readers assess the origin of a piece of information. For example, a story should show where it was reported or written, and by whom. We are supposed to know a little about the authors, sometimes with access to their profile and their whole body of work. The Trust Project at Santa Clara University focuses on developing standards for better journalistic transparency (see its list of indicators). My own project, as a John S. Knight Fellow at Stanford, is complementary to the Trust Project.
The News Quality Scoring (NQS) Project aims to find and quantify “signals” that convey the quality of a piece of content. The idea is to build a scalable and largely automated process. Incidentally, this will help debunk fake news by “triangulating” questionable sources – see this previous Monday Note.
To date, we’ve just completed a corpus of 640,000 articles, the result of three weeks of data collection from 500 of America’s largest websites and their 850 corresponding RSS feeds. The task now is to extract and analyze the signals, and to assess their relevance, reliability, and resistance to tampering (more on this in a few weeks).
Getting back to the HTML structure approach, let’s take a look at the components of a basic article on the web:
This is (or should be) the basic structure of an article page on a news site. This default configuration is anything but trivial. Let’s get into the details:
1. The source
In theory, the source is the most verifiable quality indicator. But the mechanics of the web tend to blur it. First, producers of fake news have shown great skill at mimicking legitimate news outlets. To learn more about this, read this Medium post by Aviv Ovadya, creator of Media Window, which “monitors the credibility of online news media”.
The easiest way to assess the legitimacy of a source is to send a query to a domain name registry (a Whois lookup) to determine when the domain was registered, whether it was registered anonymously, and so on. This is easy to do automatically. Such a simple process, applied to this list of alt-right sites compiled by Indiana University, could have prevented an explosion of fake news.
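As a sketch of that check, here is how a registration date could be pulled out of a raw Whois response and turned into a simple age test. The field name follows the common registry format, but the sample record and the one-year threshold are my own illustrative assumptions; an actual lookup would query the registry over the network (for instance via the `python-whois` package):

```python
import re
from datetime import datetime, timezone

def registration_age_days(whois_text, now):
    """Days since the domain was registered, or None if no date is found."""
    match = re.search(r"Creation Date:\s*(\d{4}-\d{2}-\d{2})", whois_text)
    if match is None:
        return None
    created = datetime.strptime(match.group(1), "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return (now - created).days

# Entirely hypothetical Whois record for a suspiciously young "news" domain.
sample_record = (
    "Domain Name: TOTALLY-REAL-NEWS.COM\n"
    "Creation Date: 2016-10-01T04:00:00Z\n"
    "Registrant Name: REDACTED FOR PRIVACY\n"
)
age = registration_age_days(sample_record, now=datetime(2017, 5, 1, tzinfo=timezone.utc))
suspicious = age is not None and age < 365  # young domains deserve extra scrutiny
```

A domain a few months old claiming to be an established outlet, registered behind a privacy shield, is exactly the pattern such a filter would surface.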
The second problem in qualifying sources is linked to the rise of distribution platforms (Facebook, Google, Apple News, etc.). The Tow Center for Digital Journalism at Columbia University recently released a comprehensive report titled “The Platform Press: How Silicon Valley Reengineered Journalism”, which reminds us of some disturbing statistics (emphasis mine):
Pew [Research] found that only 56% of online news consumers who clicked on a link could remember the news source (2016). And the American Press Institute’s Media Insight Project (2017) found that on Facebook, only two in ten people could remember the source, while much more trust was placed in the sharer. “If we’re here for the brand and nobody even recognizes it, if our brand is tied to the Snapchat brand, then… maybe it’s not worth it,” says a magazine editor.
The must-read Tow report is a measured call for publishers to reconsider their relationship with distribution platforms. (Many of its arguments are sweet music to my ears. I have always defended here the idea of using platforms as tools for promotion and recruitment, especially with non-essential but important audiences such as younger readers. That said, as long as audiences remain controlled by platforms while media brands evaporate, I have also argued against using platforms for publication.)
2. The story (headline)
An easy way to test a story’s reliability is to assess the clickbait level of its headline and/or to compare its topic against the rest of the news cycle. Unless the outlet is The Intercept or another one known for its exclusives, the likelihood of an unknown outlet scooping everyone else is limited. Therefore, if a “fact” doesn’t resonate anywhere else on the web, it is probably fake.
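The clickbait part of that test can be caricatured in a few lines. The marker list below is my own invention and far cruder than anything a production scorer would use (a real system would train a classifier, and the news-cycle comparison would require a corpus of the day’s coverage):

```python
import re

# Illustrative clickbait markers; purely my own list, not the NQS signal set.
CLICKBAIT_MARKERS = [
    r"you won'?t believe",
    r"\bshocking\b",
    r"\bthis one (weird )?trick\b",
    r"\bwhat happened next\b",
    r"^\d+\s+(reasons|things|ways)\b",
]

def clickbait_score(headline):
    """Counts clickbait markers in a headline (higher = more suspicious)."""
    text = headline.lower()
    score = sum(1 for pattern in CLICKBAIT_MARKERS if re.search(pattern, text))
    if headline.isupper():  # ALL-CAPS headlines are another tell
        score += 1
    return score

baity = clickbait_score("You Won't Believe What Happened Next")
sober = clickbait_score("Senate passes budget resolution after overnight session")
```

The point is not the crude regexes but the shape of the signal: a cheap, automatable number that can be weighed against the others.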
3. Author(s)
I already touched on this question in last Monday’s note. As with sources, it now seems essential to build a whitelist of news authors, in the broadest sense of the term. This database should include not only journalists and editorial staff, but also freelance writers, selected bloggers, experts, and opinion writers. Building it is no easy task (I know this firsthand); among other obstacles, it is a potentially inflammatory issue among journalists, and the criteria for whitelisting are hard to settle. But: a) the largest distribution platforms most likely already have, or will create, such a database, and b) making this data widely accessible is a necessity if we are to prevent the proliferation of disinformation.
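Mechanically, the lookup itself is the easy part. Here is a minimal sketch with entirely hypothetical names and a naive normalization; real byline matching would also have to cope with initials, pseudonyms, and homonyms:

```python
import unicodedata

def normalize_byline(name):
    """Lowercase, strip accents and extra whitespace to form a naive match key."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.lower().split())

# Entirely hypothetical whitelist entries, for illustration only.
AUTHOR_WHITELIST = {normalize_byline(n) for n in [
    "Jane Doe",      # staff reporter
    "José García",   # freelance contributor
]}

def is_whitelisted(byline):
    return normalize_byline(byline) in AUTHOR_WHITELIST

known = is_whitelisted("  jose GARCIA ")
unknown = is_whitelisted("Totally Real Reporter")
```

The hard problems the paragraph above describes (who decides the criteria, how the list is governed) are social, not technical; the code is the trivial 1%.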
4. Iconography & 6. Video
It’s a matter of consistency: if a long story is accompanied by an unattributed photograph (and thus possibly a stolen one), or by a stock photo bought for a few dollars (and easily identifiable as such), something is fishy. Ditto for video. There must be some consistency between the editorial effort put into the text and into the iconography.
In the list of signals I established for the News Quality Scoring Project, the analysis for the photo or video is as follows:
The final score is a combination of sub-signals – the trickiest being finding the right weight between major and minor signals, or between the more reliable and tamper-proof versus the more easily tampered with, etc.
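A combination of that kind can be sketched as a normalized weighted average. The signal names and weights below are invented for illustration, not the project’s actual list; the only idea carried over from the text is that tamper-resistant signals should weigh more than easily faked ones:

```python
# Hypothetical sub-signals for iconography, with invented weights:
# harder-to-fake signals weigh more than easily faked ones.
SIGNAL_WEIGHTS = {
    "photo_attributed": 3.0,   # credit line points to a real author
    "photo_original": 2.0,     # not a cheap stock image
    "caption_present": 1.0,    # easy to fake, hence the low weight
}

def combine_signals(signals, weights=SIGNAL_WEIGHTS):
    """Weighted average of boolean sub-signals, scaled to [0, 1]."""
    total_weight = sum(weights.values())
    achieved = sum(weights[name] for name, passed in signals.items() if passed)
    return achieved / total_weight

best = combine_signals({"photo_attributed": True, "photo_original": True,
                        "caption_present": True})
weak = combine_signals({"photo_attributed": False, "photo_original": False,
                        "caption_present": True})
```

With these made-up weights, a page that only manages a caption scores 1/6, while full attribution and originality score 1.0 – the weighting, not the arithmetic, is where the real difficulty lies.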
5. Body of text
Evaluating the body of the text is a subject in itself. Dozens of technologies and services can perform good semantic analysis, ranging from structure and richness to tonality. A good journalistic story has certain measurable characteristics of density, such as the proportion of quotes, named entities, etc. But these can easily be faked. (More on that later in a dedicated Monday Note.)
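Two of those density measures can be approximated crudely. The functions below are naive stand-ins of my own: a real system would use a proper NLP pipeline for named-entity recognition rather than a capitalization heuristic:

```python
import re

def quote_density(text):
    """Share of characters that sit inside double-quoted passages."""
    quoted = sum(len(m) for m in re.findall(r'"[^"]*"', text))
    return quoted / max(len(text), 1)

def naive_entity_count(text):
    """Very rough named-entity proxy: runs of two or more capitalized words."""
    return len(re.findall(r"\b(?:[A-Z][a-z]+\s+)+[A-Z][a-z]+\b", text))

body = ('"The devil is in the structure," the researcher told us in New York, '
        'echoing a line often heard at Santa Clara University.')
density = quote_density(body)
entities = naive_entity_count(body)
```

A long story with zero quotes and zero named entities is a red flag; but, as the paragraph notes, both counts are trivially inflatable, which is why they can only be minor signals in the mix.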
7. Related stories
Legitimate news outlets tend to produce a lot of coverage on the same subject. The depth of that inventory shows in the “Related stories” boxes or sidebars. In short, an article without any further-reading recommendations from the editor should arouse suspicion. This can easily be spotted by looking at the link structure of the HTML page, or at the presence of a recommendation engine.
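Spotting that structure can be as simple as scanning tag attributes for a related-content container. The class-name hints below are common front-end conventions, though actual markup varies from site to site:

```python
from html.parser import HTMLParser

class RelatedContentFinder(HTMLParser):
    """Flags containers whose class/id hints at a related-stories block."""
    HINTS = ("related", "read-more", "recommended")

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("class", "id") and value:
                if any(hint in value.lower() for hint in self.HINTS):
                    self.found = True

def has_related_stories(html_source):
    finder = RelatedContentFinder()
    finder.feed(html_source)
    return finder.found

rich = has_related_stories('<article><p>Story</p></article>'
                           '<aside class="related-stories"><a href="/a">More</a></aside>')
bare = has_related_stories('<article><p>Story</p></article>')
```

Like every signal here, the check is fallible on its own (a site may inject related links via JavaScript), which again argues for combining it with the others rather than trusting it in isolation.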
8. Footer information
This is the easy part: legitimate media usually provide full information about themselves or their parent company at the bottom of the page, along with a way to contact them, or a link to the relevant section.
One problem with this deliberately sketchy scheme is that it only applies to content published on the publisher’s own website. Distribution platforms tend to relieve publishers of the obligation to follow the design conventions outlined above, which to some extent guarantee reliability. Perhaps this is a step platforms should consider: forcing the sources they host to display (and verify) basic legitimacy signals.