Prof. Pascale FUNG Helps to Debunk COVID-19 Myths with Science
A team led by Prof. Pascale FUNG has won one of the 10 tasks of the COVID-19 Open Research Dataset Challenge (CORD-19 Challenge), competing against more than 1,000 teams globally, and has since used its winning system to debunk COVID-19 myths based on scientific evidence.
Pioneer in open source
Prof. Fung’s team at HKUST’s Center for Artificial Intelligence Research (CAiRE) built CAiRE-COVID in response to the CORD-19 Challenge, a natural language processing competition hosted on Kaggle – the world’s largest data science community and a subsidiary of Google. CAiRE-COVID is a machine learning-based system that combines natural language processing (NLP) question-answering (QA) techniques with summarization to mine the available scientific literature.
“CAiRE-COVID aims to facilitate the medical community, in the time-critical race, to find answers to various COVID-related queries in the hope of finding a cure for the virus. With an end-to-end neural network-based open-domain QA system, it can quickly generate ranked lists and paragraph-level summaries among COVID-19 Open Research Dataset’s 57,000 scholarly articles, most of them with full texts, about COVID-19 and related coronaviruses,” said Prof. Fung.
“Our CAiRE-COVID is significant and unusual because it is about open search, open database, and open collaborative research. Whereas we use open-source models of other companies, we also release our code publicly. In fact, the artificial intelligence (AI) research community has always been proposing open collaborative research. The medical research field, on the other hand, has been very much protected and has not been involved in open source due to privacy laws, compliance issues and regulations. Information about cancer research, for instance, is hardly existent in open source.”
“The CORD-19 Challenge, for which we won one of the 10 tasks, is important as it is the first open-search and open-call initiative using AI in the medical field. It is the first time AI and public health work together. With this collaboration, we hope that the power of machine learning in medical research will be unlocked,” she continued.
Mining scientific truths and debunking myths
A 30-year research veteran in NLP, one of the most challenging fields in AI, Prof. Fung previously worked on translating medical terms with data mining methods. A turning point came when Prof. Fung met Dr. Oliver MORGAN, Director of the Health Emergency Information and Risk Assessment Department in the World Health Organization (WHO), at the World Economic Forum’s Annual Meeting of the New Champions in July 2019. There, they discussed how NLP technologies could be used in epidemic intelligence.
Then when the world was hit with COVID-19 this spring, Kaggle posted an open call to the world’s AI experts to develop text and data mining tools that could help the medical community develop answers to high-priority scientific questions using the COVID-19 Open Research Dataset, which was created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine – National Institutes of Health, in coordination with the White House Office of Science and Technology Policy.
“At that point, I thought of adapting our previous work to answer this AI challenge,” Prof. Fung said. First and foremost, the team had to meet the challenge of finding time to work on the project. “With all team members being preoccupied with other tasks, the team had to meet such a tight deadline with so little time that they could only find time during three weekends to have video conferences and small-group meetings.”
Led by Prof. Fung’s PhD student SU Dan, the handful of postgraduate students and research assistants came up with a strategy to divide the work of their system into three steps: search, question answering, and summarization.
First, a user question is used to query a search engine, which finds and ranks all related documents. One example of a user question would be “What do we know about the seasonality of COVID-19?” Second, a neural QA engine finds the snippets in the top-ranking documents that answer the question, using an open-source pre-trained natural language model that is adapted to medical QA databases. Finally, a neural summarization engine generates an abstract to answer the question. Both the QA engine and the summarizer use state-of-the-art neural networks, which are also known as deep learning models.
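As a rough illustration only, the three-step flow described above (search, QA, summarization) can be sketched in Python. This is not the CAiRE-COVID code: the neural components are replaced here with simple keyword-overlap stand-ins, and the function names (`search`, `extract_snippets`, `summarize`, `answer`), the `top_k` parameter and the placeholder texts in `DOCS` are all assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap(question, text):
    """Number of distinct question tokens that also occur in the text."""
    return len(set(tokenize(question)) & set(tokenize(text)))

def search(question, docs, top_k=2):
    """Step 1 (stand-in for a search engine): rank documents by token overlap."""
    ranked = sorted(docs, key=lambda d: overlap(question, d["text"]), reverse=True)
    return ranked[:top_k]

def extract_snippets(question, ranked_docs):
    """Step 2 (stand-in for a neural QA engine): best-matching sentence per document."""
    snippets = []
    for doc in ranked_docs:
        sentences = re.split(r"(?<=[.!?])\s+", doc["text"].strip())
        snippets.append(max(sentences, key=lambda s: overlap(question, s)))
    return snippets

def summarize(snippets):
    """Step 3 (stand-in for a neural summarizer): concatenate the snippets."""
    return " ".join(snippets)

def answer(question, docs, top_k=2):
    """Run the three steps in order and return the intermediate results."""
    ranked = search(question, docs, top_k)
    snippets = extract_snippets(question, ranked)
    return {
        "ranked_ids": [d["id"] for d in ranked],
        "snippets": snippets,
        "summary": summarize(snippets),
    }

# Hypothetical placeholder documents; they are not real research findings.
DOCS = [
    {"id": "doc-a", "text": "This placeholder abstract discusses seasonality of respiratory viruses. Seasonality of COVID-19 is still an open question."},
    {"id": "doc-b", "text": "This placeholder abstract covers vaccine logistics. Cold storage matters."},
    {"id": "doc-c", "text": "This placeholder text mentions COVID-19 transmission indoors. Ventilation is discussed."},
]

result = answer("What do we know about the seasonality of COVID-19?", DOCS)
print(result["ranked_ids"])
print(result["summary"])
```

Keeping each step behind its own function mirrors the division of labor in the article: any one stage could, in principle, be swapped for a neural model without touching the other two.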
Thus, using open-source NLP and QA models as well as an open-source search engine, the team was able to build a system that can answer questions in real time. Although they completed the task a week before the deadline, they had yet another hurdle to overcome – their system was running too slowly. “Whereas companies such as Google and Facebook have computing resources that no academic institution can match, we at the University have to solve this problem as neural models are notoriously hungry for computing power,” Prof. Fung noted.
Their efforts paid off when they came first, among over 1,000 teams around the world, in the CORD-19 task.
Meanwhile, Prof. Fung thought of using this centralized resource of scientific publications to help debunk the many pseudo-scientific myths on COVID-19. What if, she thought, “we use the output of the QA system we developed to check whether a claim is of scientific merit or not?”
Debunking misinformation has become an increasingly urgent task because, without timely rebuttal, it can lead to disastrous and even deadly consequences. For example, a couple died from drinking bleach that they believed would kill the coronavirus, after such a claim spread in the news and online. The WHO’s website has a dedicated “Myth Busters” page. It is difficult for humans to keep up with and check all online misinformation on a daily basis. An automatic misinformation debunker can help by flagging suspect claims for human verification. However, machine learning methods require large amounts of labeled data for training, and it is not realistic to expect sufficient human-labeled data for this purpose, as misinformation tends to be highly varied and bears no resemblance to previous information.
Working with her other PhD student LEE Nayeon and MPhil student BANG Ye Jin, Prof. Fung came up with a way to flag misinformation by measuring how “predictable” the subject is in a statement. For example, when asked to complete the sentence “5G can help spread …”, humans would think of words such as “data” and “information”. Indeed, given the scientific evidence, a publicly available language model that learned from 8 million web pages would also predict “data”, “information”, etc. with high probability, whereas “5G can help spread COVID-19” returns a very low score. Such “scientific evidence” is provided by the QA engine that entered the Kaggle competition.
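The idea of scoring how “predictable” a statement is can be illustrated with a toy example. The sketch below is not the team’s method or code: it swaps the large pre-trained language model for a tiny add-one-smoothed bigram model, and the `EVIDENCE` sentences, the `BigramModel` class and its method names are all made up for the illustration. It computes perplexity, a standard measure of how surprising a sentence is to a language model, where a higher value means a less predictable sentence.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class BigramModel:
    """Toy bigram language model with add-one smoothing (assumption: a
    stand-in for the large pre-trained model mentioned in the article)."""

    def __init__(self, corpus):
        self.context_counts = Counter()  # how often each word appears as a context
        self.bigram_counts = Counter()   # how often each (previous, current) pair appears
        self.vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + tokenize(sentence)
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def prob(self, prev, cur):
        """Smoothed probability of `cur` following `prev`."""
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen words
        return (self.bigram_counts[(prev, cur)] + 1) / (self.context_counts[prev] + vocab_size)

    def perplexity(self, sentence):
        """Higher perplexity means the sentence is less predictable."""
        tokens = ["<s>"] + tokenize(sentence)
        pairs = list(zip(tokens, tokens[1:]))
        log_prob = sum(math.log(self.prob(p, c)) for p, c in pairs)
        return math.exp(-log_prob / len(pairs))

# Hypothetical "evidence" sentences standing in for text returned by a QA engine.
EVIDENCE = [
    "5G can help spread data",
    "5G can help spread information",
    "5G networks carry data",
    "COVID-19 is caused by a virus",
    "viruses spread through droplets",
]

model = BigramModel(EVIDENCE)
for claim in ["5G can help spread data", "5G can help spread COVID-19"]:
    print(claim, "->", round(model.perplexity(claim), 2))
```

In this toy setting, the claim ending in “COVID-19” receives the higher perplexity, so it is the one that would be flagged for human verification; a real system would need a threshold tuned on actual data.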
More challenges to come
That said, the sheer volume of publications continues to grow. The open resource of scientific publications had grown to 57,000 articles when the team launched the system, and it further increased to over 135,000 as of June 9. “The number of entries will increase to tens of millions in the future. Our challenge is to scale up the system to continue improving accuracy and coverage to update its answers.” Meanwhile, misinformation and myths are not unique to COVID-19; other myths and rumors circulate in both the medical and political areas.
Releasing the source code of both their QA system and the myth debunker to the public in the spirit of open source, the HKUST team hopes that other developers can use it for further work.
Collaboration with EIOS
Building on this success, Prof. Fung’s team is now collaborating with Dr. Morgan’s team in the Epidemic Intelligence from Open Sources (EIOS) initiative of the WHO to co-develop QA and summarization technology. The goal is to find answers from millions of online documents for early detection, verification and assessment of public health risks.
“At HKUST’s CAiRE, we collaborate with the WHO’s EIOS that picked up the first article reporting on a pneumonia cluster towards the end of 2019. According to our mutual agreement, the source code of our technology will be released publicly. Our tool will eventually be put on EIOS’ website to be used by the United Nations’ member states, their ministries of health and their centers for disease control and prevention, so that they too can conduct their own research.”
“With CAiRE-COVID, we hope that the open-source initiative between AI and public health will be the shape of things for the future. It is our aspiration that the CORD-19 Challenge and other open competitions will encourage the global research community to tackle other health hazards such as cancer, and that our invention will be used in early assessment and detection of any kind of epidemic.”