Projects

Selected work

Engineering projects from internship work, founding-team builds, and personal experiments. Each card opens with the recruiter-layer claim; deeper writeups and source live on the linked pages.

ZeroFalse

Multi-stage LLM pipeline that reduces false positives in static analysis.

Best F1 = 0.912 (OWASP) and 0.837 (real-world CWE-bench); +0.26 F1 from CWE-specialized prompting.

Takes raw CodeQL alerts and runs them through contextual reasoning + structured evidence validation to filter false positives. Evaluated 10 frontier LLMs across 6 model families (Gemini, GPT, Grok, Mistral, DeepSeek, Qwen) on two benchmarks: OWASP Java Benchmark (1,974 cases / 10 CWE categories) and CWE-bench, a real-world dataset of 755 CodeQL alerts across 56 project–CVE pairs from 37 open-source Java repositories. CWE-specialized prompting improved F1 by up to +0.26 on real-world code.
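
The two-stage gate can be sketched in a few lines; a minimal Python sketch, with keyword heuristics standing in for the CWE-specialized LLM prompts the real pipeline issues (the helper names and `Alert` fields here are hypothetical, not the project's API):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    cwe: str          # e.g. "CWE-89" (SQL injection)
    snippet: str      # flagged source line from CodeQL
    context: str      # surrounding code gathered for the model

# Stand-ins for the LLM calls; the real pipeline prompts a model
# with CWE-specialized instructions instead of keyword checks.
def contextual_reasoning(alert: Alert) -> bool:
    """Stage 1: does the flagged code plausibly reach a dangerous sink?"""
    return "execute(" in alert.snippet

def evidence_validation(alert: Alert) -> bool:
    """Stage 2: is there concrete evidence, e.g. no sanitization in scope?"""
    return "sanitize(" not in alert.context

def triage(alerts: list[Alert]) -> list[Alert]:
    """Keep only alerts that survive both stages; the rest are
    filtered as likely false positives."""
    return [a for a in alerts
            if contextual_reasoning(a) and evidence_validation(a)]

alerts = [
    Alert("CWE-89", 'cur.execute("SELECT * WHERE id=" + uid)',
          "uid = request.args['id']"),
    Alert("CWE-89", 'cur.execute(query)', "query = sanitize(raw)"),
]
kept = triage(alerts)
```

Only the first alert survives both gates; the second is filtered because its context shows sanitization.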

  • LLMs
  • CodeQL
  • Python
  • Static Analysis
  • Multi-Stage Prompting

Fabric — Agentic IDE (Farpoint)

LLM-powered agentic IDE. I own the multi-agent DAG orchestration, subagent system, and context-management layers.

Authored the empirical study behind Fabric’s externally published March-2026 benchmark report — 99% of frontier accuracy at 18% of frontier cost on Aider Polyglot (225+ exercises, 6 languages).

Production agentic IDE in the Cursor product space. Shipped: a six-tool subagent surface (DelegateTask / SendMessage / WaitForTask / CheckTaskOutput / StopTask / ListTasks) with headless execution, foreground/background promotion, and notification-queue injection back into LLM conversation history; a TDD-style RED→GREEN multi-agent DAG orchestrator with Mission Control dashboard; chain-of-density + KV-cache-aware summarization with unified context-budget tracking; the prepare→permission→execute tool lifecycle with path-scoped Bash/Read/Write/Edit/Glob; SWE-Bench and Aider-Polyglot evaluation infrastructure; and an MCP server exposing the test-and-break loop to AI agents. Also designed and ran a SWE-bench-with-vs-without-GraphRAG experiment over an 18,000-LoC code-knowledge-graph subsystem; the negative result (no measurable improvement) informed the team’s no-ship recommendation.
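
The RED→GREEN discipline at the heart of the orchestrator can be illustrated with a toy loop; a minimal Python sketch under stated assumptions (Fabric itself is TypeScript, and the three callables are hypothetical stand-ins for agent actions):

```python
def run_red_green(write_failing_test, write_patch, run_tests) -> str:
    """Toy RED->GREEN loop: confirm the new test fails before the fix
    (RED), then apply a patch and confirm the suite passes (GREEN)."""
    write_failing_test()
    if run_tests():                 # the test must fail pre-patch
        raise RuntimeError("RED violated: test passed before patch")
    write_patch()
    if not run_tests():             # the patch must turn the suite green
        raise RuntimeError("GREEN failed: tests still failing")
    return "green"

# Simulate a codebase with a mutable flag instead of real agents.
state = {"patched": False}
result = run_red_green(
    write_failing_test=lambda: None,
    write_patch=lambda: state.update(patched=True),
    run_tests=lambda: state["patched"],
)
```

In the real orchestrator each callable is a delegated subagent task and `run_tests` shells out to the project's test runner; the invariant checked here is the same.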

  • TypeScript
  • Electron
  • React
  • LLM Agents
  • MCP
  • SWE-Bench
  • Docker

Golden Repository — Verified, Executable CVE Reproductions

LangGraph-orchestrated agentic pipeline that reproduces and patches CVEs end-to-end. 89 verified completions (61 Python + 28 Java).

89 verified end-to-end CVE completions at commit 05743f35 with 100% success across exploit / patch / diff / verification checks.

SFU lab project. Eight-phase LangGraph state machine drives the full exploit-and-patch lifecycle per CVE: parallel PoC analysis across 7 sources (GitHub, GitLab, Exploit-DB, PacketStorm, Nuclei, Metasploit, vendor advisories) + advisory enrichment, 0–10 composite PoC scoring with a synthesis fallback below threshold, parallel dockerized vulnerable + patched builds, automated exploit validation that verifies EXPLOIT_SUCCESS on vuln and EXPLOIT_FAILED on patched, and a three-layer hallucination defense at validation (filesystem-grounded verdict, fresh-context re-read, persistent audit trail).

  • LangGraph
  • LangChain
  • Docker
  • Python
  • Claude Code SDK

Pabla — Crypto Social-Trading Engine

Real-time copy-trading engine for crypto markets. Iran’s leading platform in the space — ~40k users in 18 months.

Iran’s leading crypto social-trading platform — ~40k users in 18 months on a 24/7 financial system.

Co-founded the company and architected the trading engine: smart order routing across 5+ exchanges (Binance, KuCoin, regional venues), best-execution price aggregation over a consolidated best-bid/best-ask view, per-exchange adapter pattern over a normalized internal schema, async Python + Celery, sub-second cross-exchange price-refresh fan-out via Redis Pub/Sub, and an idempotent copy-replication state machine (Copycat) with slippage controls and per-follower position sizing, tracking low-thousands of active leader-follower pairs at peak. Shipped the MVP in ~2 months; the platform reached ~40k users in 18 months.
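
The consolidated best-bid/best-ask view reduces to a max over venue bids and a min over venue asks; a minimal sketch (venue names and quotes are illustrative):

```python
def consolidated_bbo(books: dict[str, tuple[float, float]]):
    """books maps exchange -> (best_bid, best_ask) in a normalized schema.
    Returns the venue-tagged consolidated best bid (highest bid) and
    best ask (lowest ask) used for best-execution routing."""
    best_bid = max((bid, venue) for venue, (bid, _) in books.items())
    best_ask = min((ask, venue) for venue, (_, ask) in books.items())
    return best_bid, best_ask

books = {
    "binance": (67000.1, 67000.9),
    "kucoin":  (67000.4, 67001.2),
}
(bid, bid_venue), (ask, ask_venue) = consolidated_bbo(books)
```

In the real engine each adapter publishes refreshed quotes over Redis Pub/Sub and the router recomputes this view on every tick.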

  • Python
  • Django
  • PostgreSQL
  • Celery
  • Redis
  • asyncio
  • WebSocket
  • Docker
  • Real-time Systems

SnappFood — ETA, Churn, Fraud Models (10M+ users)

Production ML on Iran’s largest food-delivery platform: ~27% better ETA, 13% lower churn, 10% CSAT lift.

~27% ETA accuracy improvement, 13% churn reduction, 10% CSAT lift — measured on 10M+ users.

Customer Experience team — built the Octopus BI layer (department-specific KPI dashboards), adapted Uber’s DeepETA to motorbike delivery for a ~27% ETA-accuracy improvement and 24% fewer delivery delays, shipped a churn-prediction pipeline (RFM features + logistic regression on 3M+ users) that fed reactivation campaigns and dropped monthly churn by 13%, and built a vendor-fraud detection system that lifted CSAT by 10% and NPS from 5 to 7.
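
The churn pipeline's shape — RFM features into a logistic model whose probabilities gate a reactivation campaign — can be sketched on synthetic data (the labels and thresholds below are illustrative, not SnappFood's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# Synthetic RFM features: recency (days since last order),
# frequency (orders), monetary (spend).
recency = rng.integers(1, 90, n)
frequency = rng.integers(1, 30, n)
monetary = rng.uniform(5, 200, n)
X = np.column_stack([recency, frequency, monetary]).astype(float)
# Toy label: long recency plus low frequency -> likely churner.
y = ((recency > 45) & (frequency < 10)).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
# Users above the probability cutoff feed the reactivation campaign.
at_risk = model.predict_proba(X)[:, 1] > 0.5
```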

  • Python
  • PyTorch
  • Keras
  • scikit-learn
  • SQL
  • Power BI
  • Pandas

Clarion — Voice-to-Prompt Desktop Agent

Tauri 2 macOS menu-bar agent: hotkey → Whisper → Haiku rewrite → paste. Built for bilingual developers.

End-to-end voice-to-prompt desktop agent shipped in a single working commit (Tauri 2, dual-path Whisper).

Personal project. Tauri 2 macOS app (~2,460 LOC Rust + TypeScript/Svelte, 5 MB bundle) with global-hotkey audio capture, dual-path Whisper (OpenAI Whisper API + local whisper.cpp via whisper-rs with 5 GGML model variants), Claude Haiku prompt structuring with shallow project-context injection (CLAUDE.md / README.md / package.json), and auto-paste via osascript. Five-phase state machine: idle → recording → transcribing → structuring → pasting with live UI feedback. Planned upgrade: tree-sitter + tantivy symbol index and a two-stage grounded rewrite with deterministic Levenshtein identifier guard.
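
The five-phase state machine can be written as a transition table; a Python sketch (the app itself is Rust, and the cancel/error edges back to idle are assumptions, not documented behavior):

```python
from enum import Enum, auto

class Phase(Enum):
    IDLE = auto()
    RECORDING = auto()
    TRANSCRIBING = auto()
    STRUCTURING = auto()
    PASTING = auto()

# Legal transitions of the capture pipeline; anything else is a bug.
TRANSITIONS = {
    Phase.IDLE: {Phase.RECORDING},
    Phase.RECORDING: {Phase.TRANSCRIBING, Phase.IDLE},   # IDLE = cancel
    Phase.TRANSCRIBING: {Phase.STRUCTURING, Phase.IDLE}, # IDLE on error
    Phase.STRUCTURING: {Phase.PASTING, Phase.IDLE},
    Phase.PASTING: {Phase.IDLE},
}

def advance(current: Phase, nxt: Phase) -> Phase:
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

Each `advance` call would also drive the live UI feedback mentioned above.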

  • Rust
  • Tauri
  • Svelte
  • Whisper
  • whisper.cpp
  • Anthropic SDK

Research infrastructure & experiments

ThreatEZ — Automated Bottom-Up Threat Modeling

6-phase static-analysis-grounded multi-agent pipeline shipped as a VS Code extension; derives architecture + STRIDE threats from source code.

First-author SFU CMPT 785 paper; ships as a VS Code extension producing OWASP-schema-compatible threat models.

SFU CMPT 785 first-author course paper (ACM-formatted). Six-phase pipeline: static cartographer (AST/regex extraction of routes, DB ops, auth, inputs, external interfaces) → MCP context gathering (DeepWiki / Context7) → architecture inferrer → planner-driven exploration loop (Strategic Planner + Code Analyst agents refining the DFD and ThreatFindings with code-level evidence) → verifier → synthesizer with STRIDE + CWE / OWASP Top 10 enrichment. Output conforms to the OWASP Threat Model Library JSON schema v1.0.2.
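
The static-cartographer phase boils down to pattern extraction over source; a toy Python extractor for Flask-style routes, assuming a single regex where the real cartographer combines AST walks and regexes across routes, DB ops, auth, and inputs:

```python
import re

# Toy extractor for Flask-style route declarations; illustrative only.
ROUTE_RE = re.compile(
    r'@app\.route\(\s*["\']([^"\']+)["\'](?:.*methods\s*=\s*(\[[^\]]*\]))?'
)

SRC = '''
@app.route("/login", methods=["POST"])
def login(): ...

@app.route("/health")
def health(): ...
'''

# Each route becomes a (path, methods) tuple; GET is Flask's default.
routes = [(m.group(1), m.group(2) or '["GET"]')
          for m in ROUTE_RE.finditer(SRC)]
```

Downstream phases would attach these entries to the DFD as external interfaces and look for code-level evidence of missing auth on each.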

  • LLMs
  • STRIDE
  • Static Analysis
  • MCP
  • VS Code Extension
  • OWASP

AutoSec — Fully Agentic Vulnerability Remediation

End-to-end agentic system that takes a raw codebase through detection, triage, and validated patch generation.

End-to-end agentic remediation spanning detection → triage → patch on real codebases.

SFU lab project, ongoing. Multi-agent workflow: a detection agent that runs and interprets static analyzers (CodeQL), a triage agent that filters false positives and prioritizes findings, and a patch-generation agent that proposes and validates fixes. Agents share a structured state and hand off via machine-checkable artifacts. Research bet: multi-stage prioritization improves precision (lower false positives) and remediation coverage (higher fix rate) jointly rather than as a tradeoff.
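
The machine-checkable handoff between agents can be sketched as serialize-then-revalidate over typed records; the field names here are hypothetical, not AutoSec's schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Finding:
    rule_id: str       # e.g. a CodeQL query id
    file: str
    line: int
    severity: str

@dataclass
class TriagedFinding:
    finding: Finding
    is_true_positive: bool
    priority: int      # 1 = fix first

def handoff(obj: TriagedFinding) -> str:
    """Serialize the triage agent's output; the patch agent re-parses
    and revalidates it, so the handoff is machine-checkable."""
    return json.dumps(asdict(obj), sort_keys=True)

def receive(payload: str) -> TriagedFinding:
    d = json.loads(payload)   # raises on malformed artifacts
    return TriagedFinding(finding=Finding(**d["finding"]),
                          is_true_positive=d["is_true_positive"],
                          priority=d["priority"])
```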

  • LLMs
  • CodeQL
  • Static Analysis
  • Multi-Agent Systems
  • Python

Preference-Tuned Small-Model Safety (DPO + LoRA)

Planned: DPO fine-tune of a small open-weight instruct model on a curated prompt-injection-resistance dataset.

Defensible post-training artifact characterizing the refusal-rate-vs-helpfulness frontier on a small model.

Personal project, planned. Curated ~800 preference pairs across 5 attack categories (jailbreak / indirect prompt injection / data exfiltration / tool misuse / ambient-authority abuse), sourced from JailbreakBench, AdvBench, PromptInject-style indirect-injection cases, and custom security-domain triples authored from Golden Repository CVE descriptions. DPO fine-tune of Qwen2.5-1.5B-Instruct on a single Compute Canada H100 using HuggingFace TRL’s DPOTrainer with LoRA rank-16 adapters on a 4-bit NF4-quantized base. Eval harness measures refusal rate on held-out adversarial split + helpfulness retention on Alpaca-Eval, MMLU sample, and TruthfulQA.
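
The harness's refusal-rate metric can be sketched with a substring stand-in for the judge (real harnesses typically use a classifier or LLM judge; the markers and responses below are illustrative):

```python
# Toy refusal-rate metric over model outputs on the adversarial split.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to help")

def refusal_rate(responses: list[str]) -> float:
    refused = sum(any(m in r.lower() for m in REFUSAL_MARKERS)
                  for r in responses)
    return refused / len(responses)

adversarial_responses = [
    "I can't help with bypassing authentication.",
    "Sure, here is the exploit code...",
    "I cannot assist with exfiltrating user data.",
]
rate = refusal_rate(adversarial_responses)   # 2 of 3 refused
```

Helpfulness retention would be scored the same way on the benign suites, giving the two axes of the refusal-vs-helpfulness frontier.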

  • HuggingFace TRL
  • DPO
  • LoRA
  • PEFT
  • bitsandbytes
  • Compute Canada
  • vLLM