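Natural language processing and machine learning-based suspected soil contamination enterprise identification
Cite this article:HUANG Guoxin, ZHU Shouxin, WANG Xiahui, TIAN Zi, JI Guohua, LU Ran, CUI Xuan, CHEN Xi. Natural language processing and machine learning-based suspected soil contamination enterprise identification[J]. Chinese Journal of Environmental Engineering, 2020, 14(11): 3234-3242. doi: 10.12030/j.cjee.202007079
Authors:HUANG Guoxin  ZHU Shouxin  WANG Xiahui  TIAN Zi  JI Guohua  LU Ran  CUI Xuan  CHEN Xi
Affiliation:1.Chinese Academy for Environmental Planning, Ministry of Ecology and Environment, Beijing 100012; 2.School of Water Resources and Environment, China University of Geosciences (Beijing), Beijing 100083
Funding:National Key Research and Development Program of China (2018YFC1800205); Youth Science and Technology Innovation Fund of the Chinese Academy for Environmental Planning, Ministry of Ecology and Environment (2018)
Abstract:Identification of contaminated sites suffers from low precision, a weak scientific basis, incomplete coverage and difficulty in sharing data. To address these problems, a prefecture-level city in South China was taken as the study area. Using a big data platform together with natural language processing and machine learning, an improved naive Bayesian model was built by introducing weights for hot words in the abstract, and was then applied to point of interest (POI) data to predict medium-category industries and identify contamination enterprises. The results showed that the naive Bayesian algorithm outperformed the random forest and XGBoost algorithms. After a semantic vocabulary was built from enterprise names plus business scopes, the precision, recall and F1 values of the naive Bayesian algorithm rose substantially, by 0.23, 0.23 and 0.23, respectively. With a weight of 1.27 and a smoothing parameter α of 1.10, the improved naive Bayesian model predicted industry categories with precision, recall and F1 values of 0.63, 0.62 and 0.63, respectively, and 1774 enterprises belonging to 26 industries suspected of soil contamination were identified in the study area. The improved naive Bayesian model can effectively predict enterprises suspected of soil contamination with good precision and recall, and can provide a theoretical basis and design parameters for site contamination identification and risk management.

Keywords:soil contamination   natural language processing   machine learning   medium-category industries   contamination enterprise identification   improved naive Bayesian model
Received:2020-07-11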

Natural language processing and machine learning-based suspected soil contamination enterprise identification
HUANG Guoxin, ZHU Shouxin, WANG Xiahui, TIAN Zi, JI Guohua, LU Ran, CUI Xuan, CHEN Xi. Natural language processing and machine learning-based suspected soil contamination enterprise identification[J]. Chinese Journal of Environmental Engineering, 2020, 14(11): 3234-3242. doi: 10.12030/j.cjee.202007079
Authors:HUANG Guoxin  ZHU Shouxin  WANG Xiahui  TIAN Zi  JI Guohua  LU Ran  CUI Xuan  CHEN Xi
Affiliation:1.Chinese Academy for Environmental Planning, Beijing 100012, China; 2.School of Water Resources and Environment, China University of Geosciences (Beijing), Beijing 100083, China
Abstract:To address the low precision, weak scientific basis, incomplete coverage and difficult data sharing that affect the identification of contaminated sites, a prefecture-level city in South China was selected as the study area. Using natural language processing and machine learning on a big data platform, an improved naive Bayesian model was constructed by introducing weights for hot words in the abstract, and was then used to predict medium-category industries and identify the relevant contamination enterprises from point of interest (POI) data. The results showed that the naive Bayesian algorithm performed better than the random forest and XGBoost algorithms. After a semantic vocabulary was constructed from enterprise names and business scopes, the precision, recall and F1 values of the naive Bayesian algorithm each improved by 0.23. The improved naive Bayesian model, built with a weight of 1.27 and a smoothing parameter α of 1.10, predicted the medium-category industries with precision, recall and F1 values of 0.63, 0.62 and 0.63, respectively, and 1774 suspected soil contamination enterprises belonging to 26 industry categories were identified in the study area. The improved naive Bayesian model can therefore be used effectively to predict suspected contamination enterprises with good precision and recall, and it provides a theoretical basis and design parameters for site contamination identification and risk management.
Keywords:soil contamination  natural language processing  machine learning  medium-category industries  contamination enterprise identification  improved naive Bayesian model
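
The abstract reports a hot-word weight of 1.27 and a smoothing parameter α of 1.10 but does not say how the weight enters the model. The following is a minimal sketch, not the authors' implementation, of one plausible reading: a multinomial naive Bayes classifier in which a hot word contributes a feature value of 1.27 instead of 1, with additive smoothing α applied to the per-class token counts. The class name `WeightedNaiveBayes`, the toy tokens, the hot-word list and the industry labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict

HOT_WORD_WEIGHT = 1.27  # value from the abstract; how it is applied is an assumption
ALPHA = 1.10            # additive smoothing parameter from the abstract


class WeightedNaiveBayes:
    """Multinomial naive Bayes where hot words carry a larger feature value."""

    def __init__(self, hot_words, weight=HOT_WORD_WEIGHT, alpha=ALPHA):
        self.hot_words = set(hot_words)
        self.weight = weight
        self.alpha = alpha

    def _token_weight(self, token):
        # Assumed scheme: hot words count as `weight`, other tokens as 1.
        return self.weight if token in self.hot_words else 1.0

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)  # label -> weighted token counts
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            for tok in tokens:
                self.token_counts[label][tok] += self._token_weight(tok)
                self.vocab.add(tok)
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            total = sum(counts.values())
            score = math.log(n / n_docs)  # log prior
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # ignore tokens never seen in training
                # Smoothed conditional probability of the token given the class.
                p = (counts[tok] + self.alpha) / (total + self.alpha * vocab_size)
                score += self._token_weight(tok) * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up tokens standing in for segmented enterprise names and business scopes.
train_docs = [
    ["chemical", "coating", "production"],
    ["electroplating", "metal", "surface", "treatment"],
    ["restaurant", "catering", "service"],
    ["chemical", "reagent", "sales", "production"],
    ["catering", "food", "delivery"],
]
train_labels = [
    "chemical_manufacturing",
    "metal_finishing",
    "catering",
    "chemical_manufacturing",
    "catering",
]
hot_words = {"chemical", "electroplating", "catering"}

model = WeightedNaiveBayes(hot_words).fit(train_docs, train_labels)
prediction = model.predict(["chemical", "production", "plant"])
print(prediction)
```

In a pipeline of the kind the abstract describes, each POI record would presumably be tokenized first and the predicted industry category then checked against a list of industries suspected of soil contamination; neither step is shown here.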