Sequential pattern mining for robust event detection
Antoine Doucet has been a tenured Full Professor in computer science at the L3i laboratory of the University of La Rochelle since 2014. Director of the ICT department at the University of Science and Technology of Hanoi, in La Rochelle he leads the research group on document analysis, digital contents and images (about 40 people). He is the coordinator of the H2020 project NewsEye, which runs until 2021 and focuses on augmenting access to historical newspapers across domains and languages. He also leads the effort on semantic enrichment for low-resourced languages within the H2020 project Embeddia. His main research interests lie in information retrieval, natural language processing and (text) data mining. The central focus of his work is the development of methods that scale to very large document collections and require no prior knowledge of the data, and that are hence robust to noise (e.g., stemming from OCR) and language-independent. Antoine Doucet has held a PhD in computer science from the University of Helsinki (Finland) since 2005, and a French research supervision habilitation (HDR) since 2012.
In the age of open and big data, the task of automatically analysing numerous media in various formats and multiple languages is becoming ever more critical. The ability to quickly and efficiently analyse massive amounts of documents, both digitised and digitally born, is crucial. With a history dating back several centuries and a current rate of hundreds of thousands of articles published every day, newspapers represent a heterogeneous resource of great importance.
This talk will present an approach that detects events from news using very limited external resources, notably requiring no form of linguistic analysis. By relying on the journalistic genre rather than on linguistic analysis, it can process text written in any language, and in a fashion that is robust to noise (e.g., stemming from imperfect OCR). Applied, for instance, to epidemic event detection, it can find which epidemic diseases are active where, in any language and in real time. Evaluated across 40 languages, the DaNIEL system is on average able to detect epidemic events faster than human experts. In this presentation, we will further explain how this work is being extended to additional domains, particularly historical documents, and how similar ideas apply to other important NLP problems such as named entity recognition and linking.
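To make the core intuition concrete, the following is a minimal, deliberately simplified sketch (not the actual DaNIEL implementation): a news article tends to repeat its central topic between its salient zones (title, lead) and its body, so character-level repetition alone, with no tokenizer, stemmer, or other linguistic analysis, can surface candidate event descriptors in any language. The example article, the lexicon, and all function names below are hypothetical illustrations.

```python
# Illustrative sketch of repetition-based event detection (an assumption-laden
# stand-in for the sequential patterns mined by DaNIEL, not its real algorithm).

def repeated_substrings(zone_a, zone_b, min_len=4):
    """Return substrings of zone_a (at least min_len characters) that also
    occur in zone_b. Purely character-based, hence language-independent."""
    found = set()
    n = len(zone_a)
    i = 0
    while i < n:
        # Greedily extend the longest match starting at position i.
        best = ""
        for j in range(i + min_len, n + 1):
            cand = zone_a[i:j]
            if cand in zone_b:
                best = cand
            else:
                break
        if best:
            found.add(best.strip())
            i += len(best)
        else:
            i += 1
    return found

# Hypothetical example article (French, accents stripped as a crude stand-in
# for OCR noise). The topic string recurs between the title and the body.
title = "Grippe aviaire au Vietnam"
body = ("Une nouvelle flambee de grippe aviaire a ete signalee au Vietnam. "
        "Les autorites surveillent la situation.")

descriptors = repeated_substrings(title.lower(), body.lower())

# A small external resource (here a toy disease lexicon) is all that is
# needed to label a candidate descriptor as an epidemic event.
disease_lexicon = {"grippe aviaire", "cholera", "ebola"}
events = {name for name in disease_lexicon
          if any(name in d for d in descriptors)}
print(events)
```

In this toy run the repeated string covering "grippe aviaire" is found without any French-specific processing, which is the property that lets the same code run unchanged on the 40 evaluated languages.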