

Configurable assembly of classification rules for enhancing entity resolution results
Institution:1. Federal Rural University of Pernambuco (UFRPE), Federal University of Campina Grande (UFCG) Aprígio Veloso, 882 - Universitário, Campina Grande, PB 58429-900, Brazil;2. Federal University of Campina Grande (UFCG), Brazil;1. Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China;2. Department of Pain Medicine, the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310018, China;3. Wuhan second ship design and research institute, Wuhan 430205, China;1. School of Information Management, Wuhan University, Wuhan, Hubei, China;2. Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands;3. Amsterdam Business School, University of Amsterdam, Amsterdam, The Netherlands;1. Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, 28223 Pozuelo de Alarcón, Spain;2. ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, 28660 Boadilla del Monte, Spain;1. Computer Science Department, Universidad Carlos III de Madrid Spain;2. Faculty of Computer Sciences, Østfold University College Norway;3. Foundation for Biomedical Research, Príncipe de Asturias Hospital Spain;4. Computer Science Department, Universidad de Alcalá Spain;1. Department of Computer Information Systems, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia;2. King Fahd University Hospital, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia;3. Department of Computer Science, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
Abstract: Real-world datasets often present different types of data quality problems, such as outliers, missing values, inaccurate representations and duplicate entities. To identify duplicate entities, a task named Entity Resolution (ER), we may employ a variety of classification techniques. Rule-based classification techniques have attracted increasing attention in the literature because automatic learning approaches can be used to generate Rule-Based Entity Resolution (RbER) algorithms. However, these algorithms present a series of drawbacks: i) generating high-quality RbER algorithms usually requires high computational and/or manual labeling costs; ii) the parameters of RbER algorithms cannot be tuned; iii) user preferences regarding the ER results cannot be incorporated into the way the algorithm operates; and iv) the logical (binary) nature of RbER algorithms usually falls short when tackling special cases, i.e., challenging duplicate and non-duplicate pairs of entities. To overcome these drawbacks, we propose Rule Assembler, a configurable approach that classifies duplicate entities based on confidence scores produced by logical rules, taking into account tunable parameters as well as user preferences. Experiments carried out using both real-world and synthetic datasets demonstrate that the proposed approach enhances the results produced by baseline RbER algorithms and basic assembling approaches. Furthermore, we demonstrate that the proposed approach does not add significant overhead to the classification step, and we conclude that the Rule Assembler parameters APA, WPA, TβM and Max are more suitable for use in practical scenarios.
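The general idea of classifying a pair of entities from rule confidence scores rather than from a single binary rule outcome can be illustrated with a minimal sketch. This is not the Rule Assembler algorithm itself, which the abstract does not specify; the `Rule` class, the `score` and `is_duplicate` functions, the `"max"` and `"mean"` aggregation modes, and the threshold are all hypothetical names and choices made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: per-attribute similarity values for one pair of entities.
Pair = Dict[str, float]


@dataclass
class Rule:
    """A logical rule plus an assumed confidence score in [0, 1]."""
    predicate: Callable[[Pair], bool]
    confidence: float


def score(pair: Pair, rules: List[Rule], mode: str = "max") -> float:
    """Aggregate the confidences of the rules that fire for this pair.

    The aggregation modes are illustrative assumptions, not the paper's parameters.
    """
    fired = [r.confidence for r in rules if r.predicate(pair)]
    if not fired:
        return 0.0  # no rule fired: no evidence of duplication
    if mode == "max":
        return max(fired)
    if mode == "mean":
        return sum(fired) / len(fired)
    raise ValueError(f"unknown aggregation mode: {mode}")


def is_duplicate(pair: Pair, rules: List[Rule],
                 threshold: float = 0.7, mode: str = "max") -> bool:
    """Classify the pair as duplicate when the aggregated score reaches the threshold."""
    return score(pair, rules, mode) >= threshold


# Two hypothetical rules over name and city similarity.
rules = [
    Rule(lambda p: p["name"] >= 0.9, confidence=0.8),
    Rule(lambda p: p["city"] >= 0.5, confidence=0.4),
]
pair = {"name": 0.95, "city": 0.7}
```

With these made-up values both rules fire, so the same pair is classified as a duplicate under `"max"` aggregation but not under `"mean"` at a threshold of 0.7. That difference is the kind of behavior a tunable threshold and aggregation choice let a user control.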
This article is indexed in ScienceDirect and other databases.