ZeroFalse: Improving Precision in Static Analysis with LLMs
ZeroFalse is a multi-stage LLM pipeline that triages raw static-analyzer alerts through contextual reasoning and structured evidence validation, reducing false positives without sacrificing recall. We evaluated 10 frontier LLMs across 6 model families (Gemini, GPT, Grok, Mistral, DeepSeek, Qwen) on the OWASP Java Benchmark (1,974 cases across 10 CWE categories) and on CWE-bench, a real-world dataset of 755 CodeQL alerts spanning 56 project–CVE pairs from 37 open-source Java repositories. CWE-specialized prompting improved F1 by up to +0.26 on real-world code; the best configuration reached an F1 of 0.912 on OWASP and 0.837 on CWE-bench.
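The two-stage triage idea can be sketched as follows. This is a minimal illustrative sketch, not the actual ZeroFalse implementation: the prompt templates, the `validate_evidence` check, and the stub model standing in for a frontier LLM are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    cwe: str           # e.g. "CWE-89" (SQL injection)
    snippet: str       # flagged source code
    analyzer_msg: str  # raw static-analyzer message

# Hypothetical CWE-specialized prompt templates (illustrative only).
CWE_PROMPTS = {
    "CWE-89": "Does user input reach the SQL sink without sanitization?",
    "default": "Is the flagged flow exploitable in this context?",
}

def build_prompt(alert: Alert) -> str:
    """Stage 1: contextual-reasoning prompt, specialized per CWE."""
    question = CWE_PROMPTS.get(alert.cwe, CWE_PROMPTS["default"])
    return f"[{alert.cwe}] {question}\n---\n{alert.snippet}\n---\n{alert.analyzer_msg}"

def validate_evidence(response: dict) -> bool:
    """Stage 2: structured evidence validation -- accept a true-positive
    verdict only if the model names a concrete source and sink."""
    return bool(response.get("source")) and bool(response.get("sink"))

def triage(alert: Alert, llm) -> str:
    """Return 'TP' (keep the alert) or 'FP' (suppress it)."""
    response = llm(build_prompt(alert))  # stage 1: contextual reasoning
    if response["verdict"] == "TP" and validate_evidence(response):
        return "TP"  # evidence-backed true positive survives triage
    return "FP"      # everything else is filtered as a false positive

# Stub model: flags only tainted string concatenation into a SQL sink.
def stub_llm(prompt: str) -> dict:
    if 'executeQuery("SELECT' in prompt and "+ userInput" in prompt:
        return {"verdict": "TP", "source": "userInput", "sink": "executeQuery"}
    return {"verdict": "FP", "source": None, "sink": None}

tainted = Alert("CWE-89", 'stmt.executeQuery("SELECT * FROM t WHERE id=" + userInput);', "SQL injection")
safe = Alert("CWE-89", "stmt.executeQuery(ps);  // parameterized", "SQL injection")
print(triage(tainted, stub_llm), triage(safe, stub_llm))  # TP FP
```

In the real pipeline the stub is replaced by a frontier-model call and the structured response is parsed from the model's output; the key design point illustrated here is that a "true positive" verdict alone is not trusted until the evidence-validation stage confirms it.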
First-author submission, currently under review at RAID 2026.