Attribute-Ranking Methods and Their Comparison with Classifiers Optimized Through Hyperparameter Optimization Techniques

Machine learning models are now widely used in fields such as disease diagnosis, medical science, agriculture, and soil classification. Their ability to analyze complex data and identify patterns has made them useful for classifying soil-borne pathogenic bacteria and understanding their behavior in soil environments.

Attribute-Ranking Techniques

To identify the key factors influencing the presence of Francisella tularensis (Ft) in soil, we applied four attribute-ranking techniques to the dataset: ReliefF (RLF), support vector machine (SVM), Chi-Square (Chi-Sq), and Gini-Index (GI). Each technique assesses the relevance of individual features for predicting Ft presence and assigns a weight to each attribute; higher weights indicate greater significance.
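As an illustration of how one such technique turns data into attribute weights, the sketch below scores each attribute by its best single-threshold reduction in Gini impurity. It is a simplified stand-in for a Gini-index ranker, not the study's implementation; the function names and the toy data are made up for the example.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(values, labels, threshold):
    """Impurity reduction from splitting one attribute at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted

def rank_attributes(columns, labels):
    """Score each attribute by its best Gini gain and sort highest first."""
    scores = {}
    for name, values in columns.items():
        # Splitting at the maximum value would leave one side empty, so skip it.
        candidates = sorted(set(values))[:-1]
        scores[name] = max((gini_gain(values, labels, t) for t in candidates), default=0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (made up): attribute "a" separates the classes perfectly, "b" does not.
toy_columns = {"a": [1.0, 2.0, 3.0, 4.0], "b": [1.0, 4.0, 2.0, 3.0]}
toy_labels = [0, 0, 1, 1]
ranking = rank_attributes(toy_columns, toy_labels)
```

Here `ranking` lists "a" ahead of "b" because a single threshold on "a" yields two pure groups, which is the sense in which a higher weight marks a more relevant attribute.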

Key Findings from Attribute Ranking

Our analysis revealed several consistent patterns across the attribute-ranking models:

  • Consistent Features across Models: Five attributes—Zinc (Zn), Clay (Cy), Soluble Salts (SS), Nitrogen (N), and Silt (Si)—consistently appeared in the top ranks across all models, emphasizing their strong association with Ft prevalence in soil.
  • Shared Features among Models: Six attributes—Zinc (Zn), Clay (Cy), Soluble Salts (SS), Nitrogen (N), Silt (Si), and Lead (Pb)—were present in the rankings of SVM, Chi-Square (Chi-Sq), and Gini-Index (GI), further highlighting their relevance to Ft prevalence.
  • Common Features between ReliefF and SVM: Seven attributes—Nickel (Ni), Zinc (Zn), Clay (Cy), Soluble Salts (SS), Nitrogen (N), Silt (Si), and Moisture (Ms)—were shared between ReliefF (RLF) and SVM, indicating their importance in predicting Ft presence.
  • Overlapping Features between ReliefF, Chi-Square, and Gini-Index: Another set of seven attributes—Magnesium (Mg), Zinc (Zn), Clay (Cy), Soluble Salts (SS), Nitrogen (N), Silt (Si), and Organic Matter (OM)—were common among ReliefF (RLF), Chi-Square (Chi-Sq), and Gini-Index (GI), reinforcing their significance in determining Ft prevalence.
  • Shared Features between Chi-Square and Gini-Index: Finally, nine attributes—Magnesium (Mg), Manganese (Mn), Zinc (Zn), Clay (Cy), Soluble Salts (SS), Nitrogen (N), Silt (Si), Lead (Pb), and Organic Matter (OM)—were shared between Chi-Square (Chi-Sq) and Gini-Index (GI), further corroborating their association with Ft prevalence.
  • Least Contributing Features: Five of the 11 lowest-ranked attributes appear in that group for every feature-ranking model: Potassium (K), Calcium (Ca), Chromium (Cr), Copper (Cu), and pH. This suggests they have low significance in predicting Ft presence.
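The overlaps listed above can be cross-checked with plain set operations. The sketch below uses only the attribute sets given in the bullets; the variable names are ours. Any attribute shared by all four rankers must appear in both the RLF/SVM overlap and the RLF/Chi-Sq/GI overlap, so intersecting those two sets should recover the five consistent features.

```python
# Shared-attribute sets exactly as listed in the bullets above.
svm_chi_gi = {"Zn", "Cy", "SS", "N", "Si", "Pb"}
rlf_svm = {"Ni", "Zn", "Cy", "SS", "N", "Si", "Ms"}
rlf_chi_gi = {"Mg", "Zn", "Cy", "SS", "N", "Si", "OM"}
chi_gi = {"Mg", "Mn", "Zn", "Cy", "SS", "N", "Si", "Pb", "OM"}

# Attributes common to all four rankers (RLF, SVM, Chi-Sq, GI).
common_to_all = rlf_svm & rlf_chi_gi

# Any attribute shared by three rankers that include Chi-Sq and GI
# must also be shared by Chi-Sq and GI alone.
three_way_sets_consistent = svm_chi_gi <= chi_gi and rlf_chi_gi <= chi_gi

print(sorted(common_to_all))
print(three_way_sets_consistent)
```

The intersection contains Cy, N, SS, Si, and Zn, matching the five features reported as consistent across all models, and both three-way sets are subsets of the Chi-Sq/GI set, as they should be.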

Two-Stage Attribute Ranking

To examine the impact of each feature on Ft prevalence more closely, we performed a two-stage attribute ranking. In the first stage, the soil features were ranked by each ranking approach; in the second, weighted scores were calculated from those rankings to determine the final rank. Clay (Cy) was the top-ranked feature, followed by Nitrogen (N), while Potassium (K) held the lowest rank.
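The exact weighting scheme is not spelled out here, so the following is only a sketch of one common way to combine several rankings into a final one: a position-based (Borda-style) weighted score. The function `two_stage_rank`, the optional per-ranker weights, and the placeholder features are all assumptions made for illustration and are not the study's data or method.

```python
def two_stage_rank(rankings, ranker_weights=None):
    """Combine several best-first rankings into one final ranking.

    Stage 1 is assumed to be done already: `rankings` maps a ranker name
    to its best-first list of features. Stage 2 gives each feature
    (n - position) points per ranker, scales the points by that ranker's
    weight, and sorts features by the summed score.
    """
    if ranker_weights is None:
        ranker_weights = {name: 1.0 for name in rankings}
    totals = {}
    for name, ordered in rankings.items():
        n = len(ordered)
        for position, feature in enumerate(ordered):
            points = ranker_weights[name] * (n - position)
            totals[feature] = totals.get(feature, 0.0) + points
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical rankings over placeholder features (not the study's results).
example_rankings = {
    "ranker_1": ["f1", "f2", "f3"],
    "ranker_2": ["f2", "f1", "f3"],
    "ranker_3": ["f1", "f3", "f2"],
}
final_rank = two_stage_rank(example_rankings)
```

With equal ranker weights, `f1` collects the most points because two of the three rankers place it first. Passing `ranker_weights` lets a more trusted ranker count for more.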

Evaluation of Attribute-Ranking Models against Classifiers

We assessed the performance of the attribute-ranking models against different classifiers, including SVM, EM, and NN. Bayesian optimization (BO) and Random Search (RS) were used to tune the hyperparameters of these classifiers. SVM outperformed the other classifiers under both BO and RS, and BO yielded better outcomes than RS for all classification models.
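For readers unfamiliar with Random Search, the sketch below shows its basic loop: draw hyperparameters at random, score each draw, and keep the best. The objective is a made-up stand-in for cross-validated accuracy, and the hyperparameter names `C` and `gamma` and their ranges are assumptions, since the tuned parameters are not listed here. Bayesian optimization differs in that it fits a surrogate model to past trials and uses it to choose the next candidate rather than sampling blindly.

```python
import math
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Sample each hyperparameter log-uniformly and keep the best-scoring draw."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Each (low, high) pair holds base-10 exponents, so sampling is log-uniform.
        params = {name: 10 ** rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_objective(params):
    """Made-up score that peaks at C=1 and gamma=0.1; stands in for validation accuracy."""
    return -(math.log10(params["C"]) ** 2 + (math.log10(params["gamma"]) + 1) ** 2)

# Hypothetical search space: C in [1e-3, 1e3], gamma in [1e-4, 1e1].
search_space = {"C": (-3, 3), "gamma": (-4, 1)}
best_params, best_score = random_search(toy_objective, search_space, n_trials=100, seed=42)
```

In practice `toy_objective` would be replaced by a function that trains the classifier with the given hyperparameters and returns its validation score.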

Proposed SVM Classifier

We present an SVM classifier optimized with Bayesian optimization. This classifier achieves an F1 score of 86.5% and an accuracy of 86.5%. The confusion matrix and error plot for the SVM model are analyzed to assess its performance and identify the best hyperparameter settings.
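Both reported metrics can be derived from a confusion matrix. The sketch below assumes a binary problem (Ft present or absent) and uses made-up counts, not the study's matrix, to show the arithmetic.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts chosen so that the two error types are balanced.
metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

With these counts accuracy and F1 are both 0.8. The two coincide whenever true positives equal true negatives and false positives equal false negatives, which is one way an identical F1 score and accuracy can arise; whether the study's matrix has that shape is not stated here.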

Performance Analysis with Different Feature Sets

To investigate the influence of feature selection on classification performance, we varied the number of attributes used. SVM consistently yielded the best results, reaching an accuracy of 86.5% when using the first 15 soil features.
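The idea of varying the number of attributes can be sketched as a loop over the top-k ranked features, evaluating a classifier on each subset. To stay self-contained, the example below swaps in a leave-one-out 1-nearest-neighbour classifier for the SVM and uses made-up toy data; the function names are ours.

```python
def loo_accuracy_1nn(rows, labels, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the chosen columns."""
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_dist = None, float("inf")
        for j, other in enumerate(rows):
            if i == j:
                continue  # never compare a sample with itself
            dist = sum((row[f] - other[f]) ** 2 for f in feature_idx)
            if dist < best_dist:
                best_j, best_dist = j, dist
        correct += int(labels[best_j] == labels[i])
    return correct / len(rows)

def accuracy_by_top_k(rows, labels, ranked_idx, ks):
    """Evaluate the classifier using only the first k ranked feature columns."""
    return {k: loo_accuracy_1nn(rows, labels, ranked_idx[:k]) for k in ks}

# Toy data (made up): column 0 separates the classes, column 1 is misleading noise.
toy_rows = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.1], [1.1, 1.1], [1.2, 9.1]]
toy_labels = [0, 0, 0, 1, 1, 1]
results = accuracy_by_top_k(toy_rows, toy_labels, ranked_idx=[0, 1], ks=[1, 2])
```

On this toy set, using only the top-ranked column classifies every sample correctly, while adding the noisy second column makes every prediction wrong. This is an extreme illustration of why the number of retained features affects accuracy.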

Summary of Findings

  • Top Contributing Features: The five most contributing features common among all attribute-ranking models are Clay (Cy), Nitrogen (N), Soluble Salts (SS), Silt (Si), and Zinc (Zn).
  • Least Significant Features: The five least significant features for Ft are Potassium (K), Calcium (Ca), Chromium (Cr), Copper (Cu), and pH.
  • Superiority of Bayesian Optimization: Hyperparameter optimization using BO produces better outcomes than RS for all classification models.
  • SVM’s Dominance: SVM is the best performer among classification models, achieving the highest accuracy for both BO and RS techniques.
  • Optimal Feature Set: SVM achieves the best classification accuracy of 86.5% for the first 15 soil features using BO and RS.
  • Importance of Hyperparameter Optimization: Optimizing the parameters of machine learning models using hyperparameter optimization techniques can significantly improve performance.

Comparative Analysis with Prior Machine Learning Techniques

Compared with previous machine learning techniques applied to classify soil-borne pathogenic bacteria, our proposed design differs in combining hyperparameter tuning with two-stage attribute ranking on a new F. tularensis dataset.

Discussions

  • Benefits of Machine Learning Models: Machine learning models demonstrate outstanding results for classifying F. tularensis and learning its behavior in soil settings, outperforming current statistical techniques.
  • Influence of Soil Characteristics: Specific soil characteristics, such as clay, nitrogen, soluble salts, silt, organic matter, and zinc, are crucial for the survival of F. tularensis.
  • Correlation with Abiotic Factors: Abiotic factors like organic matter, clay, and various micro-nutrients are primary drivers of bacterial communities in soil.
  • Positive Associations: Clay, nitrogen, soluble salts, silt, organic matter, zinc, lead, manganese, magnesium, and nickel are positively correlated with the prevalence of F. tularensis, C. burnetii, and B. anthracis.
  • Intermediary Roles: Cadmium, moisture, sand, and pH play intermediary roles in the survival of F. tularensis.
  • Least Contributing Soil Attributes: Potassium, calcium, copper, sodium, iron, phosphorus, and chromium are the least contributing soil attributes for F. tularensis prevalence.
  • Comparison with Previous Findings: The current study’s findings align with previous research on F. tularensis and other soil-borne pathogenic bacteria.
  • Importance of Hyperparameter Optimization: Hyperparameter optimization plays a pivotal role in enhancing accuracy and improving model performance.

Conclusion

This study highlights the effectiveness of machine learning models in classifying F. tularensis and understanding its behavior in soil settings. The two-stage attribute ranking and hyperparameter optimization techniques contribute to improved model performance. The findings provide insights into the key soil characteristics associated with Ft prevalence and emphasize the importance of hyperparameter optimization for accurate classification.
