Separating Secrets from Placeholders: A Hybrid CNN-CodeBERT Framework for Three-Class Credential Leakage Detection
:::info Stub — Full Engineering Breakdown Coming This paper was auto-fetched from arXiv on 2026-06-01. A full breakdown with production viability rating, implementation notes, and honest limitations is being written. Subscribe to AI Letters → :::
| Authors | Maksuda Bilkis Baby et al. |
| Year | 2026 |
| Field | AI / ML |
| arXiv | 2605.31520 |
| Download | |
| Categories | cs.SE, cs.AI, cs.CR |
Abstract
Credential leakage in public source code repositories poses a critical security threat, with over 23.8 million secrets exposed in 2024 alone. Existing detection tools suffer from high false-positive rates because rigid pattern matching and binary classification schemes fail to distinguish genuine credentials from placeholder or weak credentials. We propose a three-class classification framework that explicitly models placeholder or weak credentials as a distinct class, leveraging CodeBERT-based semantic understanding combined with character-level pattern recognition. We evaluate our approach on a newly constructed dataset of 9,426 samples spanning 10 programming languages. Our model achieves a Matthews Correlation Coefficient of 0.86 and a macro F1-score of 0.90, achieving 93% recall and 89% precision for genuine credential leaks while reducing high severity alerts by 33.0% (from 373 to 250) without sacrificing security coverage. Compared to prior character-level approaches, our method improves placeholder or weak credential detection from 54% to 81% F1-score while maintaining strong cross language generalization, with 9 of 10 languages achieving F1 above 0.80 under leave-one-language-out evaluation.
Engineering Breakdown
The Problem
Existing detection tools suffer from high false-positive rates because rigid pattern matching and binary classification schemes fail to distinguish genuine credentials from placeholder or weak credentials.
The Approach
We propose a three-class classification framework that explicitly models placeholder or weak credentials as a distinct class, leveraging CodeBERT-based semantic understanding combined with character-level pattern recognition. We evaluate our approach on a newly constructed dataset of 9,426 samples spanning 10 programming languages.
Key Results
Compared to prior character-level approaches, our method improves placeholder or weak credential detection from 54% to 81% F1-score while maintaining strong cross language generalization, with 9 of 10 languages achieving F1 above 0.80 under leave-one-language-out evaluation.
Research Areas
This paper contributes to the following areas of AI/ML engineering:
- Machine learning
- Deep learning
- Neural networks
- Model optimization
- AI systems
- Separating
:::tip Subscribe Get weekly breakdowns of papers like this in AI Letters - the newsletter for engineers building production AI systems. :::
