COMPUTER SCIENTISTS PUSH BOUNDARIES TO CREATE AUTOMATED NEWS SOFTWARE

The development of a sophisticated and workable Named Entity Linking (NEL) system will enable the creation of new software allowing users to find vast amounts of news information about a named entity at the touch of a button.

Will Radford CMCRC PhD candidate U. of Sydney

A/Professor James Curran CMCRC Research Associate U. of Sydney

Keywords: Information technology, natural language processing  
The main goal of the CMCRC’s Computable News Project is to turn unstructured news stories into structured, “computable” data. This data can then be used and analysed by any computer application in which monitoring or searching for entities is important. Computable news will unlock information in text for a myriad of purposes and will be of interest to a vast array of people and industries, including media monitors and online reputation managers, capital market traders reacting to market announcements, and lawyers searching for evidence in huge document collections, to name but a few.

The central component of computable news is a NEL system that matches textual mentions of entities to a knowledge base (KB). Research by Will Radford, Dr James Curran and other members of the CMCRC’s text analytics team has produced a sophisticated NEL system that uses a variety of linguistic and statistical techniques to model the document context and link entities to a Wikipedia-derived database. Radford’s research has helped produce a NEL system with a KB clustering score of 65.6%, which is considered extremely good by the text mining community around the world.

News is driven by named entities such as people, places and organisations, and there is significant value in understanding how they interact with one another. Radford’s research had its fair share of problems to solve: language is ambiguous, entities can share the same name, and a single entity can have multiple aliases. Name synonymy (words that have the same or similar meaning) and polysemy (a word or phrase with different or related senses) also make automated analysis difficult. For example, in finance news, the U.S. publication The Atlantic speculated in 2011 that spikes in the Berkshire Hathaway share price correlated with stories about Anne Hathaway’s Oscar chances!
Trading algorithms monitor news streams searching for exploitable information, but misinterpreting ambiguous names could lead to bad trading decisions and dangerous market instability.

The CMCRC’s text mining team has made substantial steps towards the computable news goals and is about to launch a new product that unlocks 25 years of news stories. Users will be able to view a named entity’s stories along a timeline and browse to other related entities, providing a wider context and a more time-efficient way of understanding the news.
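
To make the name ambiguity problem described above concrete, the following is a minimal Python sketch of context-based entity linking. It is not the CMCRC system: the two-entry knowledge base, the context word lists and the function names (`tokenize`, `link_mention`) are all hypothetical, and simple word overlap stands in for the linguistic and statistical techniques a real NEL system would use.

```python
import re

# Hypothetical toy knowledge base, invented for illustration only.
# Each entity has the names it may be mentioned by (aliases) and a few
# words that tend to appear near it in news text (context).
KNOWLEDGE_BASE = {
    "Berkshire_Hathaway": {
        "aliases": {"berkshire hathaway", "berkshire", "hathaway"},
        "context": {"shares", "stock", "buffett", "investors", "market"},
    },
    "Anne_Hathaway": {
        "aliases": {"anne hathaway", "hathaway"},
        "context": {"actress", "film", "movie", "oscar", "oscars"},
    },
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link_mention(mention, document):
    """Link a mention to the candidate entity whose context words overlap
    most with the document. Returns None when no entity has that alias."""
    name = mention.lower()
    candidates = sorted(
        entity_id
        for entity_id, entry in KNOWLEDGE_BASE.items()
        if name in entry["aliases"]
    )
    if not candidates:
        return None
    words = tokenize(document)
    # Score each candidate by shared context words; ties go to the
    # alphabetically first candidate because the list is sorted.
    return max(
        candidates,
        key=lambda entity_id: len(words & KNOWLEDGE_BASE[entity_id]["context"]),
    )


finance_story = "Hathaway shares jumped as investors cheered Buffett."
film_story = "Hathaway is tipped to win an Oscar for the film."

print(link_mention("Hathaway", finance_story))  # Berkshire_Hathaway
print(link_mention("Hathaway", film_story))     # Anne_Hathaway
```

The same surface name, "Hathaway", resolves to different entities depending on the surrounding words, which is the kind of distinction a trading algorithm would need to make before acting on a story.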
Author(s): Will Radford, James Curran