Establishment of machine learning models based on pan-genome features for prediction of phenotypic resistance of Klebsiella pneumoniae to meropenem

    Abstract:
    Objective To establish machine learning models based on whole-genome features for predicting the phenotypic resistance of Klebsiella pneumoniae (K. pneumoniae) to meropenem, and to identify potential resistance-associated genes.
    Methods K. pneumoniae strains with both phenotypic data and whole-genome sequencing data were collected from the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) and the National Database of Antibiotic Resistant Organisms (NDARO) of the National Center for Biotechnology Information. RGI 6.0.3 was used to identify the resistance genes carried by each genome, and PanTa 1.0.0 was used to analyze the pan-genome of the strains. LightGBM, random forest, and logistic regression models were built to predict the meropenem resistance phenotype of K. pneumoniae, using resistance genes and accessory genes separately as input features. Stratified nested cross-validation was used for feature selection, hyperparameter optimization, and model evaluation to obtain the optimal model. The SHapley Additive exPlanations (SHAP) algorithm was used to evaluate the contribution of each feature.
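The stratified nested cross-validation named in the Methods can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the data are synthetic, the "model" is a toy rule that calls a genome resistant when it carries at least `threshold` genes, and the function names, fold counts, and seeds are all assumptions made for illustration. The inner loop picks the hyperparameter using only the outer-training data, and the outer loop scores the tuned model on held-out genomes. Because the toy rule needs no fitting, the inner loop here only selects the threshold; a real pipeline would also fit the model and select features inside each inner split.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample positions into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pos, label in enumerate(labels):
        by_class[label].append(pos)
    folds = [[] for _ in range(k)]
    for positions in by_class.values():
        rng.shuffle(positions)
        for n, pos in enumerate(positions):
            folds[n % k].append(pos)  # deal each class out round-robin
    return folds


def balanced_accuracy(y_true, y_pred):
    """Mean of recall (class 1 = resistant) and specificity (class 0 = sensitive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def predict(X, threshold):
    """Toy stand-in model: resistant if the genome carries >= threshold genes."""
    return [int(sum(row) >= threshold) for row in X]


def nested_cv(X, y, thresholds, outer_k=5, inner_k=3, seed=0):
    """Return one held-out balanced accuracy per outer fold."""
    outer = stratified_folds(y, outer_k, seed)
    scores = []
    for i, test_idx in enumerate(outer):
        train_idx = [j for f, fold in enumerate(outer) if f != i for j in fold]
        # Inner loop: tune the threshold on the outer-training data only.
        inner = stratified_folds([y[j] for j in train_idx], inner_k, seed + i + 1)

        def inner_score(threshold):
            fold_scores = []
            for fold in inner:
                rows = [train_idx[p] for p in fold]
                preds = predict([X[r] for r in rows], threshold)
                fold_scores.append(balanced_accuracy([y[r] for r in rows], preds))
            return sum(fold_scores) / len(fold_scores)

        best = max(thresholds, key=inner_score)
        # Outer loop: score the tuned model on genomes it has never seen.
        preds = predict([X[r] for r in test_idx], best)
        scores.append(balanced_accuracy([y[r] for r in test_idx], preds))
    return scores


# Synthetic presence/absence matrix: 24 "resistant" and 36 "sensitive" genomes.
rng = random.Random(42)
y = [1] * 24 + [0] * 36
X = [[int(rng.random() < (0.8 if label else 0.2)) for _ in range(6)] for label in y]
scores = nested_cv(X, y, thresholds=[1, 2, 3, 4, 5])
print(len(scores), "outer folds, mean balanced accuracy:", sum(scores) / len(scores))
```

Keeping the tuning inside the outer-training data is what lets the outer-fold scores serve as an estimate of performance on unseen genomes.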
    Results After quality control, 5,800 genomes were included in the models, comprising 2,171 meropenem-resistant and 3,629 meropenem-sensitive strains. These strains contained 258,333 pan-genome genes and 436 resistance genes. Among the three models based on resistance genes, the random forest model showed the best fit. It selected 64 resistance genes and achieved an area under the curve (AUC) of 0.916, balanced accuracy of 87.84%, recall of 81.66%, specificity of 94.02%, precision of 89.09%, and negative predictive value of 89.55%. Among the three models based on the pan-genome, the logistic regression model performed best. Feature selection yielded 156 candidate resistance-associated genes, including 27 confirmed resistance genes and 129 potential resistance-associated genes. This model outperformed the resistance-gene model, with an AUC of 0.943, balanced accuracy of 89.48%, recall of 85.16%, specificity of 93.79%, precision of 89.16%, and negative predictive value of 91.36%. The SHAP algorithm was further used to evaluate the contribution of feature genes to the models. Of the 15 genes contributing most to the model, six were unconfirmed potential resistance-associated genes, such as yojI, mobA, xerC, and ynfE. The predictive model has been packaged as a command-line tool, predMemRes (https://github.com/Wangyuhao66/predMemRes), which rapidly predicts the meropenem resistance phenotype of a K. pneumoniae strain from its genome sequence.
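For readers less familiar with the reported metrics, the sketch below shows how recall, specificity, precision, negative predictive value, and balanced accuracy follow from a confusion matrix. The counts are hypothetical and chosen only for illustration; the abstract does not report confusion-matrix counts. The second part checks that the reported balanced accuracies match the mean of the reported recall and specificity, to within rounding.

```python
def classification_metrics(tp, fn, tn, fp):
    """Metrics for a binary classifier where the positive class is 'resistant'."""
    recall = tp / (tp + fn)            # resistant strains correctly called resistant
    specificity = tn / (tn + fp)       # sensitive strains correctly called sensitive
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": tp / (tp + fp),   # predicted-resistant calls that are correct
        "npv": tn / (tn + fn),         # predicted-sensitive calls that are correct
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical counts, for illustration only (not taken from the study).
m = classification_metrics(tp=80, fn=20, tn=90, fp=10)
print({name: round(value, 4) for name, value in m.items()})

# Consistency check on the percentages reported in the abstract:
# (recall, specificity, balanced accuracy) for each best model.
reported = {
    "resistance genes (random forest)": (81.66, 94.02, 87.84),
    "pan-genome (logistic regression)": (85.16, 93.79, 89.48),
}
for name, (recall, specificity, balanced) in reported.items():
    implied = (recall + specificity) / 2
    print(f"{name}: implied {implied:.3f}% vs reported {balanced}%")
```

Balanced accuracy is a natural headline metric here because the classes are imbalanced (2,171 resistant versus 3,629 sensitive strains), so plain accuracy would weight the sensitive class more heavily.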
    Conclusion Machine learning models can effectively predict phenotypic resistance of K. pneumoniae to meropenem and identify potential resistance-associated genes.
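The per-gene contributions mentioned in the Methods and Results rest on Shapley values. The sketch below computes exact Shapley values by brute force for a tiny logistic model. It is not the SHAP library and not the authors' code: the three genes are unnamed placeholders, and the weights, bias, and all-absent baseline genome are assumptions made for illustration. The SHAP library uses faster estimators for the same kind of quantity; brute-force enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import exp, factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a set function value(frozenset_of_feature_indices)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(n_features):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical logistic model over three placeholder genes.
weights = [2.0, 1.0, -0.5]
bias = -1.5
genome = [1, 1, 0]      # gene presence/absence in the genome being explained
baseline = [0, 0, 0]    # reference genome: none of the genes present


def probability(present):
    """Predicted resistance probability when only the features in `present`
    take the genome's values and the rest stay at the baseline."""
    z = bias + sum(w * (genome[j] if j in present else baseline[j])
                   for j, w in enumerate(weights))
    return 1 / (1 + exp(-z))


phi = shapley_values(probability, len(weights))
for j, contribution in enumerate(phi):
    print(f"placeholder gene {j}: contribution {contribution:+.4f}")
```

The contributions sum to the difference between the prediction for the explained genome and the prediction for the baseline, and a gene whose state matches the baseline receives zero. Ranking genes by the size of such contributions across many genomes is how a "top 15" list of the kind described in the Results can be produced.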

     
