

Supervised clustering for automated document classification and prioritization: a case study using toxicological abstracts
Authors: Arun Varghese, Michelle Cawley, Tao Hong
Institution: 1. ICF, Durham, USA
Abstract: Machine learning and natural language processing algorithms are widely used to retrieve relevant documents in a variety of contexts, including literature review and systematic review. Supervised machine learning algorithms perform well in terms of retrieval metrics such as recall and precision, but require a sizeable training dataset, which is typically expensive to develop. Unsupervised machine learning algorithms do not require a training dataset and may perform well in terms of recall, but are typically lower in precision, and do not offer a transparent means for decision-makers to justify selection choices. In this paper, we illustrate the use of a hybrid document classification method based on semi-supervised learning that we refer to as “supervised clustering.” We show that supervised clustering combines the ease of use of unsupervised algorithms with the retrieval efficiency and transparency of supervised algorithms. We demonstrate through simulations the high performance and unbiased predictions of supervised clustering, even when it is provided with only minimal training data. We further propose the use of ensemble learning as a means to maximize retrieval efficiency and to prioritize the review of those documents that are not eliminated by the supervised clustering algorithm.
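
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of "supervised clustering" with ensemble prioritization, not the authors' implementation. It assumes that documents are clustered without labels, that a cluster is retained only if it contains at least one known-relevant "seed" document, and that the fraction of clustering runs in which a document is retained serves as its priority score. The bag-of-words vectors, the plain k-means routine, and all function names (`vectorize`, `kmeans`, `supervised_clusters`, `ensemble_scores`) are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def vectorize(docs):
    """Map each document to a unit-length bag-of-words vector (illustrative choice)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        v = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def kmeans(vectors, init_idx, iters=20):
    """Plain k-means; centroids start at the documents listed in init_idx."""
    centroids = [list(vectors[i]) for i in init_idx]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign every document to its nearest centroid (squared Euclidean distance).
        assign = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Move each centroid to the mean of its current members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def supervised_clusters(docs, seed_idx, init_idx):
    """Assumed rule: keep a document if its cluster contains a known-relevant seed."""
    assign = kmeans(vectorize(docs), init_idx)
    keep = {assign[i] for i in seed_idx}
    return [a in keep for a in assign]


def ensemble_scores(docs, seed_idx, inits):
    """Fraction of clustering runs in which each document was retained."""
    votes = [0] * len(docs)
    for init_idx in inits:
        for i, kept in enumerate(supervised_clusters(docs, seed_idx, init_idx)):
            votes[i] += kept
    return [v / len(inits) for v in votes]


# Toy corpus: three toxicology-like titles and three finance-like titles.
docs = [
    "arsenic liver toxicity rats",
    "arsenic liver toxicity mice",
    "arsenic kidney toxicity rats",
    "stock market interest rates",
    "stock market volatility rates",
    "bond market interest rates",
]
# Document 0 is the only labelled (seed) document; three runs differ in initialisation.
scores = ensemble_scores(docs, seed_idx=[0], inits=[(0, 3), (1, 4), (2, 5)])
# Review order: highest score first.
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
```

In this sketch a single seed document is enough to retain an entire cluster, which reflects the abstract's point about minimal training data, and the per-document scores give reviewers a simple ordering for the documents that were not eliminated. A real pipeline would likely vary the clustering algorithm and the number of clusters across ensemble members rather than only the initialisation.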
This article is indexed in SpringerLink and other databases.