Abstract:
Objective To establish machine learning models based on whole-genome features that predict the phenotypic resistance of Klebsiella pneumoniae (K. pneumoniae) to meropenem, and to identify potential resistance-associated genes.
Methods K. pneumoniae strains with both phenotypic data and whole-genome sequencing data were collected from the Bacterial and Viral Bioinformatics Resource Center and the Antibiotic Resistance Organism Reference Genome Database of the National Center for Biotechnology Information. Resistance genes carried by the strains were identified with RGI 6.0.3, and the pangenome of the strains was analyzed with PanTa 1.0.0. LightGBM, Random Forest, and logistic regression models were constructed with resistance genes and accessory genes as input features to predict the resistance phenotype of K. pneumoniae to meropenem. Stratified nested cross-validation was used for feature selection, hyperparameter optimization, and model evaluation to obtain the optimal model. The SHapley Additive exPlanations (SHAP) algorithm was used to evaluate the contribution of each feature.
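The stratified nested cross-validation described above can be illustrated with a minimal sketch. It assumes scikit-learn, a synthetic gene presence/absence matrix, chi-squared feature selection, and an arbitrary hyperparameter grid; none of these choices is taken from the study, and the sketch is not the authors' pipeline. The inner loop selects features and tunes hyperparameters, while the outer loop estimates performance on data never seen during tuning.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a gene presence/absence matrix (NOT the study's data):
# rows are genomes, columns are genes, 1 = gene present.
rng = np.random.default_rng(0)
n_genomes, n_genes = 300, 40
X = rng.integers(0, 2, size=(n_genomes, n_genes))
# Hypothetical phenotype driven mostly by the first two genes, plus noise.
y = ((X[:, 0] + X[:, 1] + rng.normal(0, 0.5, n_genomes)) > 1.0).astype(int)

# Feature selection sits inside the pipeline so it is refit on every
# training split and never sees the held-out fold.
pipeline = Pipeline([
    ("select", SelectKBest(chi2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Illustrative grid only; the values are assumptions.
param_grid = {"select__k": [5, 10, 20], "clf__C": [0.1, 1.0, 10.0]}

# Both loops are stratified so each fold keeps the resistant/sensitive ratio.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: pick the number of features and regularization strength.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="roc_auc")
# Outer loop: score the whole tuning procedure on untouched folds.
outer_auc = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print("outer-fold AUC:", outer_auc.round(3))
print("mean AUC:", outer_auc.mean().round(3))
```

Keeping feature selection inside the pipeline is the design choice that matters here: selecting genes on the full dataset before splitting would leak information from the test folds into the model.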
Results After quality control, 5 800 genomes were included in the models, comprising 2 171 meropenem-resistant strains and 3 629 meropenem-sensitive strains. These strains contained 258 333 pangenome genes and 436 resistance genes. Among the three models based on resistance genes, the Random Forest model performed best. It selected 64 resistance genes and achieved an area under the curve (AUC) of 0.916, balanced accuracy of 87.84%, recall of 81.66%, specificity of 94.02%, precision of 89.09%, and negative predictive value of 89.55%. Among the three models based on the pangenome, the logistic regression model performed best. After screening, 156 candidate resistance-associated genes were identified, including 27 confirmed resistance genes and 129 potential resistance-associated genes. This model outperformed the best model based on resistance genes, with an AUC of 0.943, balanced accuracy of 89.48%, recall of 85.16%, specificity of 93.79%, precision of 89.16%, and negative predictive value of 91.36%. The SHAP algorithm was then used to evaluate the contribution of each feature gene to the model. Among the 15 genes with the greatest contributions, six were unconfirmed potential resistance-associated genes, including yojI, mobA, xerC, and ynfE. The established predictive model has been packaged as a command-line tool named predMemRes (https://github.com/Wangyuhao66/predMemRes), which can rapidly predict the resistance phenotype of K. pneumoniae strains to meropenem from their genome sequences.
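For readers less familiar with the reported metrics, the following minimal sketch shows how they follow from a confusion matrix, with resistant strains treated as the positive class. The function name and the counts in the example are hypothetical and are not the study's data. The final check uses only figures stated above: balanced accuracy is the mean of recall and specificity, which agrees with the reported values to within rounding.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics named in the Results from confusion-matrix counts.

    tp/fn: resistant strains predicted resistant/sensitive.
    tn/fp: sensitive strains predicted sensitive/resistant.
    """
    recall = tp / (tp + fn)          # sensitivity: resistant strains found
    specificity = tn / (tn + fp)     # sensitive strains correctly cleared
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": tp / (tp + fp),  # positive predictive value
        "npv": tn / (tn + fn),        # negative predictive value
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Consistency check on the percentages reported in the abstract:
# (recall, specificity, balanced accuracy) for the two best models.
reported = [(81.66, 94.02, 87.84), (85.16, 93.79, 89.48)]
consistent = [abs((r + s) / 2 - b) < 0.01 for r, s, b in reported]
print("reported balanced accuracy consistent:", consistent)
```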
Conclusion Machine learning models can effectively predict the phenotypic resistance of K. pneumoniae to meropenem and identify potential resistance-associated genes.