Construction and Comparison of Machine Learning-Based Risk Prediction Models for Coronary Heart Disease

Source: 泰然健康网 Time: November 28, 2024, 01:53

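Abstract (translated from the Chinese): Background Coronary atherosclerotic heart disease (CHD; hereafter coronary heart disease) is one of the leading causes of death worldwide. Research on CHD risk assessment is growing year by year. However, these studies often overlook data imbalance, and addressing it is crucial for improving the accuracy with which classification algorithms identify CHD risk. Objective To explore the factors influencing CHD and, using 2 data-balancing methods, to build CHD risk prediction models based on 5 algorithms and compare the predictive value of these 5 models. Methods From the cross-sectional survey data of the 2021 US Behavioral Risk Factor Surveillance System (BRFSS), 112,606 participants were selected, with information on 24 variables covering health-related risk behaviors, chronic health conditions, and similar items. The outcome was self-reported CHD, by which participants were divided into a CHD group and a non-CHD group. Univariate analysis and stepwise logistic regression were used to explore the factors influencing CHD and to select the variables entered into the prediction models. A random 10% of the 112,606 respondents (11,261 in total) was drawn and randomly split into training and test sets at an 8:2 ratio. Two over-sampling methods, random oversampling and the Synthetic Minority Over-sampling Technique (SMOTE), were used to handle the imbalanced data, and CHD prediction models were built with the k-nearest neighbor (KNN), logistic regression, support vector machine (SVM), decision tree, and XGBoost algorithms. Results The two groups differed significantly (P<0.05) in age, sex, BMI, race, marital status, education level, income level, having been told of hypertension, having been told of prehypertension, having been told of pregnancy-induced hypertension, current use of antihypertensive medication, having been told of hyperlipidemia, having been told of diabetes, smoking status, having had at least one alcoholic drink in the past 30 days, heavy-drinker status, binge-drinker status, physical exercise in the past 30 days, mental health status, and self-rated health. Stepwise logistic regression showed that age, sex, BMI level, race, education level, income level, having been told of hypertension, having been told of prehypertension, having been told of pregnancy-induced hypertension, current use of antihypertensive medication, having been told of hyperlipidemia, having been told of diabetes, smoking status, having had at least one alcoholic drink in the past 30 days, heavy-drinker status, binge-drinker status, and self-rated health were factors influencing CHD (P<0.05). With SMOTE used to handle the imbalanced data, KNN, logistic regression, SVM, decision tree, and XGBoost achieved overall classification accuracies of 59.2%, 67.4%, 66.2%, 69.2%, and 85.9%; recall of 75.2%, 71.4%, 70.5%, 62.9%, and 34.8%; precision of 15.4%, 18.2%, 17.5%, 17.6%, and 28.7%; F-values of 0.256, 0.290, 0.280, 0.275, and 0.315; and AUC values of 0.80, 0.78, 0.72, 0.72, and 0.82, respectively. With random oversampling, the overall classification accuracies were 62.5%, 68.5%, 69.0%, 60.2%, and 70.1%; recall was 70.0%, 69.5%, 71.9%, 69.0%, and 67.6%; precision was 15.8%, 18.4%, 19.1%, 14.8%, and 19.0%; F-values were 0.258, 0.291, 0.302, 0.244, and 0.297; and the areas under the receiver operating characteristic curve were 0.80, 0.77, 0.72, 0.72, and 0.83, respectively. Conclusion This study not only confirmed known factors influencing CHD but also found that self-rated health, income level, and education level may affect CHD. After the 2 data-balancing methods were applied, the performance of the 5 algorithms improved significantly. The XGBoost model performed best and can serve as a reference for future optimization of CHD prediction models. In addition, given the strong performance of XGBoost and the ease of use and interpretability of stepwise logistic regression, combining XGBoost after data balancing with stepwise logistic regression is recommended for CHD risk prediction models.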

Key words (translated from the Chinese): coronary heart disease, machine learning, risk prediction model, logistic regression, k-nearest neighbor, support vector machine, decision tree, XGBoost

Abstract: Background Coronary atherosclerotic heart disease (CHD) is one of the leading causes of mortality worldwide, and research on risk assessment for CHD has been growing annually. However, the issue of data imbalance in these studies is often overlooked, despite its crucial role in enhancing the accuracy of CHD risk identification within classification algorithms. Objective To investigate the factors influencing CHD and to establish predictive models for CHD risk using two data balancing methods based on five algorithms, comparing the predictive value of these models for CHD risk. Methods Utilizing cross-sectional survey data from the 2021 Behavioral Risk Factor Surveillance System (BRFSS) in the United States, a sample of 112,606 participants was identified, featuring 24 variables related to risk behaviors and health status, with self-reported coronary heart disease (CHD) as the outcome measure. Factors influencing CHD were explored through univariate analysis and stepwise logistic regression to select pertinent variables for inclusion in the predictive model. A random sample comprising 10% of the participants (11,261 individuals) was drawn and then randomly divided into training and testing datasets at an 8:2 ratio. To address data imbalance, two over-sampling techniques were employed: random oversampling and the Synthetic Minority Over-sampling Technique (SMOTE). Based on these methods, CHD predictive models were constructed using five different algorithms: K-Nearest Neighbors (KNN), Logistic Regression, Support Vector Machine (SVM), Decision Tree, and XGBoost. Results Univariate analysis revealed significant differences (P<0.05) between the CHD and non-CHD groups across all input variables except for rental housing and being informed of prediabetic status. Stepwise logistic regression identified age, gender, BMI, ethnicity, education level, income level, being informed of hypertension, being informed of prehypertension, being informed of pregnancy-induced hypertension, current use of antihypertensive medication, being informed of hyperlipidemia, being informed of diabetes, smoking status, alcohol consumption within the last 30 days, heavy drinking status, and self-assessed health as factors influencing CHD. The performance of risk models using SMOTE showed overall classification accuracies of 59.2%, 67.4%, 66.2%, 69.2%, and 85.9%; recall rates of 75.2%, 71.4%, 70.5%, 62.9%, and 34.8%; precision of 15.4%, 18.2%, 17.5%, 17.6%, and 28.7%; F-values of 0.256, 0.290, 0.280, 0.275, and 0.315; and AUC values of 0.80, 0.78, 0.72, 0.72, and 0.82, respectively. Using random oversampling, the models achieved classification accuracies of 62.5%, 68.5%, 69.0%, 60.2%, and 70.1%; recall rates of 70.0%, 69.5%, 71.9%, 69.0%, and 67.6%; precision of 15.8%, 18.4%, 19.1%, 14.8%, and 19.0%; F-values of 0.258, 0.291, 0.302, 0.244, and 0.297; and AUC values of 0.80, 0.77, 0.72, 0.72, and 0.83, respectively. Conclusion This study not only confirmed known factors affecting CHD but also identified potential impacts of self-assessed health level, income level, and education level on CHD. The performance of the five algorithms was significantly enhanced after employing two data balancing methods. Among them, the XGBoost model exhibited superior performance and can be referenced for future optimization of CHD prediction models. Additionally, considering the excellent performance of the XGBoost model and the convenience and interpretability of stepwise logistic regression, a combined use of these approaches after data balancing is recommended in CHD risk prediction models.

Key words: Coronary Disease, Machine Learning, Risk prediction model, K-nearest neighbor, Support vector machine, Decision tree, Logistic regression, XGBoost
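
The F-values in the abstract can be cross-checked against the reported precision and recall. The sketch below assumes that "F-value" means the F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly. The names `REPORTED`, `f1_score`, and `recompute_f_values` are illustrative and not taken from the study.

```python
# (precision, recall, reported F-value) per model, copied from the abstract.
# Model order: KNN, logistic regression, SVM, decision tree, XGBoost.
REPORTED = {
    "SMOTE": {
        "KNN": (0.154, 0.752, 0.256),
        "Logistic regression": (0.182, 0.714, 0.290),
        "SVM": (0.175, 0.705, 0.280),
        "Decision tree": (0.176, 0.629, 0.275),
        "XGBoost": (0.287, 0.348, 0.315),
    },
    "Random oversampling": {
        "KNN": (0.158, 0.700, 0.258),
        "Logistic regression": (0.184, 0.695, 0.291),
        "SVM": (0.191, 0.719, 0.302),
        "Decision tree": (0.148, 0.690, 0.244),
        "XGBoost": (0.190, 0.676, 0.297),
    },
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumed meaning of 'F-value')."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recompute_f_values(table):
    """Return {(method, model): (recomputed F1, reported F-value)}."""
    result = {}
    for method, models in table.items():
        for model, (precision, recall, reported) in models.items():
            result[(method, model)] = (f1_score(precision, recall), reported)
    return result


if __name__ == "__main__":
    for (method, model), (f1, reported) in recompute_f_values(REPORTED).items():
        print(f"{method:20s} {model:20s} recomputed={f1:.3f} reported={reported:.3f}")
```

Under that assumption, each recomputed value agrees with the reported one to within rounding of the inputs. The same table also shows why accuracy alone is a weak summary on imbalanced data: XGBoost with SMOTE has the highest accuracy (85.9%) but the lowest recall (34.8%).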

CLC number: R 541.4
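
The Methods describe an 8:2 train/test split followed by two over-sampling strategies. The following is a minimal pure-Python sketch of those steps, not the authors' implementation. It makes several assumptions: over-sampling is applied to the training portion only (the abstract does not say), all features are treated as numeric, the minority class is balanced to a 1:1 ratio, and the function names and the default `k=5` are illustrative.

```python
import random


def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them 8:2 (by default) into train and test."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    n_test = round(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def random_oversample(X, y, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until the classes are the same size."""
    rng = random.Random(seed)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    if not minority_rows:
        raise ValueError("no minority rows to sample from")
    n_needed = max((len(y) - len(minority_rows)) - len(minority_rows), 0)
    extra = [list(rng.choice(minority_rows)) for _ in range(n_needed)]
    return list(X) + extra, list(y) + [minority] * len(extra)


def smote(X, y, minority=1, k=5, seed=0):
    """Add synthetic minority rows, each interpolated between a minority row
    and one of its k nearest minority neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    if len(minority_rows) < 2:
        raise ValueError("SMOTE needs at least two minority rows")
    n_needed = max((len(y) - len(minority_rows)) - len(minority_rows), 0)
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(minority_rows)
        others = [r for r in minority_rows if r is not base]
        neighbours = sorted(
            others, key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r))
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return list(X) + synthetic, list(y) + [minority] * len(synthetic)
```

In this sketch, `X_train, y_train, X_test, y_test = train_test_split(X, y)` would be called first, and `random_oversample` or `smote` would then be applied to `X_train, y_train` only, so that the test set keeps its original class ratio. Because plain SMOTE interpolates between rows, it can produce fractional values for categorical survey items; how the study handled this is not stated in the abstract.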

URL: Construction and Comparison of Machine Learning-Based Risk Prediction Models for Coronary Heart Disease http://www.u1s5d6.cn/newsview142169.html