Screening of immunogenic molecular markers for active tuberculosis based on multiple machine learning algorithms

    Abstract:
    Objective To investigate the role of immune-related genes in active tuberculosis (ATB) using bioinformatics and machine learning.
    Methods The ATB datasets GSE42825, GSE42830, and GSE83456 were downloaded from the Gene Expression Omnibus (GEO) database to screen for differentially expressed genes (DEGs) in tuberculosis (TB). Immune-related gene sets (IRGs) were obtained from the GeneCards database and intersected with the DEGs to yield differentially expressed immune-related genes (DEIRGs), which then underwent functional enrichment and pathway analysis. Key genes were identified using support vector machine-recursive feature elimination (SVM-RFE), the least absolute shrinkage and selection operator (LASSO), and the Boruta algorithm. The area under the receiver operating characteristic (ROC) curve (AUC) was used for internal and external validation, and a diagnostic model was constructed. Calibration curves and decision curve analysis were used to evaluate the model's calibration and clinical utility. Shapley Additive exPlanations (SHAP) were used to interpret the importance of each feature and the model's prediction process.
    Results A total of 502 DEGs were identified, and their intersection with the immune-related genes yielded 166 DEIRGs. Enrichment analysis showed that these DEIRGs were mainly associated with the Toll-like receptor, NOD-like receptor, nuclear factor-kappa B (NF-κB), mitogen-activated protein kinase, and phosphatidylinositol 3-kinase signaling pathways. In the validation set, the three machine learning algorithms together selected five TB-related DEIRGs. One gene with an AUC below 0.70 was excluded, leaving four DEIRGs (AIM2, FCGR1A, IFITM3, SOCS1), which were used to construct a diagnostic model (AUC = 0.98). Calibration curves and the Hosmer-Lemeshow test indicated reliable calibration, and decision curve analysis indicated clinical utility. SHAP analysis ranked the four feature genes by importance, in descending order, as IFITM3, FCGR1A, SOCS1, and AIM2.
    Conclusion The DEIRGs are associated with immune responses, cellular structures, and enzymatic activities, offering new insight into the pathogenesis of TB. Used in combination, these biomarkers may outperform previously reported molecular signatures in predicting ATB.
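
    The gene-selection steps described in the Methods and Results can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that the three algorithms' selections are combined by intersection, that each gene's AUC is computed from its expression values alone, and that AUC is read direction-agnostically so that down-regulated markers are not penalized. The function names, the placeholder gene names (GENE_A to GENE_D), and the expression values are hypothetical toy inputs; only the 0.70 cutoff comes from the abstract.

```python
from typing import Dict, Sequence, Set


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC via the Mann-Whitney statistic; tied case/control pairs count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def consensus_genes(selections: Sequence[Set[str]]) -> Set[str]:
    """Genes chosen by every selector (assumed combination rule: intersection)."""
    return set.intersection(*selections) if selections else set()


def filter_by_auc(
    expr: Dict[str, Sequence[float]],
    labels: Sequence[int],
    genes: Set[str],
    threshold: float = 0.70,  # cutoff stated in the abstract
) -> Dict[str, float]:
    """Keep genes whose single-gene AUC reaches the threshold."""
    kept = {}
    for gene in sorted(genes):
        a = auc(expr[gene], labels)
        a = max(a, 1.0 - a)  # assumption: direction-agnostic AUC
        if a >= threshold:
            kept[gene] = a
    return kept


# Toy inputs (hypothetical): 1 = ATB case, 0 = control.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "GENE_A": [5.1, 4.8, 5.5, 2.0, 2.2, 1.9],
    "GENE_B": [3.0, 2.0, 3.1, 2.9, 3.2, 2.1],
}
svm_rfe = {"GENE_A", "GENE_B", "GENE_C"}
lasso = {"GENE_A", "GENE_B", "GENE_D"}
boruta = {"GENE_A", "GENE_B", "GENE_C"}

shared = consensus_genes([svm_rfe, lasso, boruta])
selected = filter_by_auc(expr, labels, shared)
```

    With these toy inputs, GENE_A and GENE_B survive the intersection, and only GENE_A passes the AUC filter. In practice the per-algorithm gene sets would come from fitted SVM-RFE, LASSO, and Boruta models rather than hand-written sets.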

     
