Construction and Comparison of Machine Learning-Based Risk Prediction Models for Coronary Heart Disease

Source: 泰然健康網 Time: November 28, 2024, 01:53

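Abstract:

Background: Coronary atherosclerotic heart disease (CHD; hereafter coronary heart disease) is one of the leading causes of death worldwide. Research on CHD risk assessment is growing year by year. However, these studies often overlook data imbalance, and addressing it is essential for improving how accurately classification algorithms identify CHD risk.

Objective: To explore the factors influencing CHD, to build CHD risk prediction models with five algorithms after applying two data-balancing methods, and to compare the predictive value of the five models.

Methods: From the cross-sectional survey data of the 2021 US Behavioral Risk Factor Surveillance System (BRFSS), 112,606 respondents were selected, with 24 variables covering health-related risk behaviors, chronic health conditions, and related information. The outcome was self-reported CHD, by which respondents were divided into a CHD group and a non-CHD group. Univariate analysis and stepwise logistic regression were used to explore the factors influencing CHD and to select the variables entered into the prediction models. A random 10% of the 112,606 respondents (11,261 in total) was drawn and randomly split into training and test sets at a ratio of 8:2. Two over-sampling methods, random oversampling and the Synthetic Minority Over-sampling Technique (SMOTE), were used to handle the imbalanced data, and CHD prediction models were built with the k-nearest neighbor (KNN), logistic regression, support vector machine (SVM), decision tree, and XGBoost algorithms.

Results: The two groups differed significantly (P<0.05) in age, sex, BMI, race, marital status, education level, income level, having been told they have hypertension, having been told they have prehypertension, having been told they have pregnancy-induced hypertension, current use of antihypertensive medication, having been told they have hyperlipidemia, having been told they have diabetes, smoking status, having had at least one drink in the past 30 days, heavy drinking, binge drinking, physical exercise in the past 30 days, mental health status, and self-rated health. Stepwise logistic regression identified the following as factors influencing CHD (P<0.05): age, sex, BMI, race, education level, income level, having been told they have hypertension, having been told they have prehypertension, having been told they have pregnancy-induced hypertension, current use of antihypertensive medication, having been told they have hyperlipidemia, having been told they have diabetes, smoking status, having had at least one drink in the past 30 days, heavy drinking, binge drinking, and self-rated health. With SMOTE, the KNN, logistic regression, SVM, decision tree, and XGBoost models achieved overall classification accuracies of 59.2%, 67.4%, 66.2%, 69.2%, and 85.9%; recall of 75.2%, 71.4%, 70.5%, 62.9%, and 34.8%; precision of 15.4%, 18.2%, 17.5%, 17.6%, and 28.7%; F-values of 0.256, 0.290, 0.280, 0.275, and 0.315; and AUC of 0.80, 0.78, 0.72, 0.72, and 0.82, respectively. With random oversampling, the same models achieved overall classification accuracies of 62.5%, 68.5%, 69.0%, 60.2%, and 70.1%; recall of 70.0%, 69.5%, 71.9%, 69.0%, and 67.6%; precision of 15.8%, 18.4%, 19.1%, 14.8%, and 19.0%; F-values of 0.258, 0.291, 0.302, 0.244, and 0.297; and areas under the receiver operating characteristic curve of 0.80, 0.77, 0.72, 0.72, and 0.83, respectively.

Conclusion: This study not only confirmed known factors influencing CHD but also found that self-rated health, income level, and education level may affect CHD. After the two data-balancing methods were applied, the performance of the five algorithms improved markedly. The XGBoost model performed best and can serve as a reference for future optimization of CHD prediction models. In addition, given the strong performance of XGBoost and the ease of use and interpretability of stepwise logistic regression, combining XGBoost on balanced data with stepwise logistic regression is recommended for CHD risk prediction models.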
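The two balancing methods named in the Methods can be sketched in plain Python as follows. This is a minimal illustration and not the authors' code: the function names, the toy data, and the choice to resample only the training split are assumptions, since the abstract does not state these details. The SMOTE sketch interpolates between numeric feature vectors; how the study handled the largely categorical BRFSS variables is not described.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two numeric rows."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def random_oversample(X, y, minority=1, seed=0):
    """Random oversampling: copy randomly chosen minority rows
    until the minority class is as large as the majority class."""
    rng = random.Random(seed)
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    n_new = (len(y) - len(minority_rows)) - len(minority_rows)
    extra = [list(rng.choice(minority_rows)) for _ in range(max(n_new, 0))]
    return X + extra, y + [minority] * len(extra)


def smote(X, y, minority=1, k=5, seed=0):
    """SMOTE sketch: each synthetic row lies on the line segment between
    a minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    if len(minority_rows) < 2:
        raise ValueError("SMOTE needs at least two minority rows")
    n_new = (len(y) - len(minority_rows)) - len(minority_rows)
    synthetic = []
    for _ in range(max(n_new, 0)):
        base = rng.choice(minority_rows)
        others = [row for row in minority_rows if row is not base]
        neighbours = sorted(others, key=lambda row: sq_dist(base, row))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return X + synthetic, y + [minority] * len(synthetic)


# Hypothetical toy training split: 8 majority rows (label 0), 2 minority rows (label 1).
X_train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
           [0.4, 0.1], [0.1, 0.4], [0.5, 0.5], [0.2, 0.4],
           [1.0, 1.0], [2.0, 3.0]]
y_train = [0] * 8 + [1] * 2

X_ros, y_ros = random_oversample(X_train, y_train)
X_sm, y_sm = smote(X_train, y_train)
print(y_ros.count(0), y_ros.count(1))  # both classes now have 8 rows
print(y_sm.count(0), y_sm.count(1))    # both classes now have 8 rows
```

Random oversampling only repeats existing minority rows, whereas SMOTE creates new points between them. In both cases the test split is left untouched here so that evaluation reflects the original class balance.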

Keywords: coronary heart disease, machine learning, risk prediction model, logistic regression, k-nearest neighbor, support vector machine, decision tree, XGBoost

Abstract:

Background: Coronary atherosclerotic heart disease (CHD) is one of the leading causes of mortality worldwide, and research on risk assessment for CHD has been growing annually. However, the issue of data imbalance in these studies is often overlooked, despite its crucial role in enhancing the accuracy of CHD risk identification within classification algorithms.

Objective: To investigate the factors influencing CHD and to establish predictive models for CHD risk using two data balancing methods based on five algorithms, comparing the predictive value of these models for CHD risk.

Methods: Utilizing cross-sectional survey data from the 2021 Behavioral Risk Factor Surveillance System (BRFSS) in the United States, a cohort of 112,606 participants was identified, featuring 24 variables related to risk behaviors and health status, with self-reported coronary heart disease (CHD) as the outcome measure. Factors influencing the incidence of CHD were explored through univariate analysis and stepwise logistic regression to select pertinent variables for inclusion in the predictive model. A random sample comprising 10% of the participants (11,261 individuals) was drawn and then randomly divided into training and testing datasets at an 8:2 ratio. To address data imbalance, two over-sampling techniques were employed: random oversampling and the Synthetic Minority Over-sampling Technique (SMOTE). Based on these methods, CHD predictive models were constructed using five different algorithms: K-Nearest Neighbors (KNN), Logistic Regression, Support Vector Machine (SVM), Decision Tree, and XGBoost.

Results: Univariate analysis revealed significant differences (P<0.05) between the CHD and non-CHD groups across all input variables except for rental housing and being informed of prediabetic status. Stepwise logistic regression identified age, gender, BMI, ethnicity, education level, income level, being informed of hypertension, being informed of prehypertension, being informed of pregnancy-induced hypertension, current use of antihypertensive medication, being informed of hyperlipidemia, being informed of diabetes, smoking status, alcohol consumption within the last 30 days, heavy drinking status, and self-assessed health as factors influencing CHD. The performance of risk models using SMOTE showed overall classification accuracies of 59.2%, 67.4%, 66.2%, 69.2%, and 85.9%; recall rates of 75.2%, 71.4%, 70.5%, 62.9%, and 34.8%; precision of 15.4%, 18.2%, 17.5%, 17.6%, and 28.7%; F-values of 0.256, 0.290, 0.280, 0.275, and 0.315; and AUC values of 0.80, 0.78, 0.72, 0.72, and 0.82, respectively. Using random oversampling, the models achieved classification accuracies of 62.5%, 68.5%, 69.0%, 60.2%, and 70.1%; recall rates of 70.0%, 69.5%, 71.9%, 69.0%, and 67.6%; precision of 15.8%, 18.4%, 19.1%, 14.8%, and 19.0%; F-values of 0.258, 0.291, 0.302, 0.244, and 0.297; and AUC values of 0.80, 0.77, 0.72, 0.72, and 0.83, respectively.

Conclusion: This study not only confirmed known factors affecting CHD but also identified potential impacts of self-assessed health level, income level, and education level on CHD. The performance of the five algorithms was significantly enhanced after employing two data balancing methods. Among them, the XGBoost model exhibited superior performance and can be referenced for future optimization of CHD prediction models. Additionally, considering the excellent performance of the XGBoost model and the convenience and interpretability of stepwise logistic regression, a combined use of these approaches after data balancing is recommended in CHD risk prediction models.
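The reported F-values can be related to the reported precision and recall with a short check. The sketch below assumes that "F-value" means the F1 score, the harmonic mean of precision and recall; the abstract does not define it, and the variable names are illustrative. The figures are copied from the Results.

```python
def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall (both as fractions)."""
    return 2 * precision * recall / (precision + recall)


MODELS = ["KNN", "Logistic regression", "SVM", "Decision tree", "XGBoost"]

# Precision and recall in percent, F-values as reported in the abstract.
REPORTED = {
    "SMOTE": {
        "precision": [15.4, 18.2, 17.5, 17.6, 28.7],
        "recall": [75.2, 71.4, 70.5, 62.9, 34.8],
        "f": [0.256, 0.290, 0.280, 0.275, 0.315],
    },
    "Random oversampling": {
        "precision": [15.8, 18.4, 19.1, 14.8, 19.0],
        "recall": [70.0, 69.5, 71.9, 69.0, 67.6],
        "f": [0.258, 0.291, 0.302, 0.244, 0.297],
    },
}

max_gap = 0.0  # largest difference between computed F1 and reported F-value
for method, metrics in REPORTED.items():
    rows = zip(MODELS, metrics["precision"], metrics["recall"], metrics["f"])
    for name, p, r, reported_f in rows:
        computed = f1(p / 100, r / 100)
        max_gap = max(max_gap, abs(computed - reported_f))
        print(f"{method:20s} {name:20s} "
              f"computed={computed:.3f} reported={reported_f:.3f}")
```

Under this assumption every computed value agrees with the reported F-value to within 0.001, which is within the rounding of the published precision and recall. The check also makes the trade-off visible: with SMOTE, XGBoost has the highest F-value through higher precision (28.7%) at much lower recall (34.8%) than the other four models.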

Key words: Coronary Disease, Machine Learning, Risk prediction model, K-nearest neighbor, Support vector machine, Decision tree, Logistic regression, XGBoost

Chinese Library Classification (CLC) number: R 541.4

URL: Construction and Comparison of Machine Learning-Based Risk Prediction Models for Coronary Heart Disease http://www.u1s5d6.cn/newsview142169.html
