Methodology and Sources
The Military Capability Gap Matcher combines public-source collection, source-specific normalization, vector search, lexical search, topic modeling, and optional AI summaries.
Last updated April 28, 2026.
Current Public Sources
- SAM.gov notices
- AFWERX topics
- DoD SBIR/STTR topics from DSIP
- USA.gov challenge listings
- ERDCWERX challenge and capability-assessment posts
- SOFWERX event problem statements
- Army xTech competition pages
- Defense Innovation Unit challenge pages
- NAVSEA / NSWC Dahlgren innovation challenge pages
- DARPA opportunity pages
- Modern War Institute analysis
Collection
Each source has a dedicated fetch script that collects public pages, public APIs, or public JSON where available. Fetchers keep raw records so normalization can be reviewed and improved without losing the source context.
Normalization
Source-specific normalizers convert raw source material into a common capability-record shape with title, gap statement, problem statement, mission context, desired capability, dates, agency path, source URL, tags, and metadata.
Structured solicitation and challenge sources are normalized around their stated problem, objective, requirement, or topic text. Broad analytical sources are handled more conservatively because they may describe military problems without being formal requirements.
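As a rough illustration of the common record shape described above, the following sketch models it as a Python dataclass. The field names, the two date fields, the raw-record keys (`title`, `url`, `description`, `agency`, `source`), and the `normalize_example` helper are all assumptions made for this example, not the project's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class CapabilityRecord:
    """Hypothetical common record shape; field names are illustrative."""
    title: str
    source_url: str
    gap_statement: str = ""
    problem_statement: str = ""
    mission_context: str = ""
    desired_capability: str = ""
    posted_date: Optional[date] = None   # assumed date field
    close_date: Optional[date] = None    # assumed date field
    agency_path: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)


def normalize_example(raw: dict) -> CapabilityRecord:
    """Map one raw record (with hypothetical keys) to the common shape."""
    # Split an assumed "A > B > C" agency string into a path list.
    agency = [p.strip() for p in raw.get("agency", "").split(">") if p.strip()]
    return CapabilityRecord(
        title=raw.get("title", "").strip(),
        source_url=raw["url"],
        problem_statement=raw.get("description", "").strip(),
        agency_path=agency,
        metadata={"source": raw.get("source", "unknown")},
    )
```

Fields that a source does not state, such as the gap statement here, are left empty rather than inferred, which matches the more conservative handling described for broad analytical sources.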
Gap Detection
The pipeline uses a mix of source-specific parsing, relevance heuristics, and LLM review where appropriate. SAM.gov records receive the heaviest filtering because many notices are acquisition paperwork rather than actual capability gaps.
When configured, LLM review acts as a gate for SAM.gov records. If the LLM quota is exhausted, the pipeline preserves existing output or skips new uncertain records rather than silently lowering the standard.
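The quota behavior can be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: `QuotaExhausted`, `gate_new_records`, and the `review` callable are hypothetical names, and the sketch assumes the records passed as `uncertain` are the ones that need LLM review.

```python
from typing import Callable, Iterable


class QuotaExhausted(Exception):
    """Hypothetical error raised by a review client when quota runs out."""


def gate_new_records(
    existing: Iterable[dict],
    uncertain: Iterable[dict],
    review: Callable[[dict], bool],
) -> tuple[list[dict], list[dict]]:
    """Return (kept, deferred).

    Existing output is always preserved. New uncertain records are kept
    only if the reviewer approves them. Once quota is exhausted, the
    remaining records are deferred instead of being accepted unreviewed.
    """
    kept = list(existing)
    deferred: list[dict] = []
    quota_left = True
    for record in uncertain:
        if not quota_left:
            deferred.append(record)
            continue
        try:
            if review(record):
                kept.append(record)
        except QuotaExhausted:
            quota_left = False
            deferred.append(record)
    return kept, deferred
```

The point of the design is that a quota failure never turns into an implicit "accept": deferred records can be retried on a later run under the same standard.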
Search
Search ranks records by semantic similarity using BAAI/bge-small-en-v1.5 embeddings when they are available. Lexical overlap serves as a fallback and tie-breaker. Embeddings are stored in Postgres with pgvector.
Search scores are ranking signals, not confidence scores. High scores mean the submitted text is similar to a record in the current corpus, not that the solution is qualified, compliant, or selected.
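One way to picture "embeddings first, lexical overlap as fallback and tie-breaker" is the sketch below. It is an assumption-laden illustration in plain Python: the function names, the token-overlap measure, and the in-memory record dictionaries are hypothetical, and a real deployment would likely let pgvector compute the vector similarity inside Postgres.

```python
import math
import re
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def lexical_overlap(query: str, text: str) -> float:
    """Fraction of distinct query tokens that also appear in the text."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    t = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q & t) / len(q) if q else 0.0


def rank(query: str, query_vec: Optional[Sequence[float]], records: list[dict]) -> list[dict]:
    """Sort records by (primary score, lexical overlap), best first."""
    def key(rec: dict) -> tuple[float, float]:
        lex = lexical_overlap(query, rec["text"])
        emb = rec.get("embedding")
        if query_vec is not None and emb is not None:
            primary = cosine(query_vec, emb)   # semantic score when available
        else:
            primary = lex                      # fallback: lexical only
        return (primary, lex)                  # lex also breaks ties
    return sorted(records, key=key, reverse=True)
```

In this sketch the sort key mixes cosine and lexical values when only some records have embeddings, which is one more reason to read the resulting numbers as ranking signals only.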
Topic Modeling
The primary topic model clusters embedded records with k-means, then labels clusters with class-based TF-IDF terms. A legacy TF-IDF/NMF model may also be retained for comparison.
Topics are corpus navigation aids. They do not represent official taxonomies, validated requirements, or acquisition categories.
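The labeling step can be sketched as follows, assuming cluster assignments have already been produced by k-means (the clustering itself is omitted here). The weighting shown, term frequency within a cluster times log(1 + average words per cluster / term frequency across all clusters), is one common formulation of class-based TF-IDF. Whether the project uses exactly this weighting is an assumption, and `ctfidf_labels` is a hypothetical name.

```python
import math
import re
from collections import Counter


def ctfidf_labels(docs: list[str], cluster_ids: list[int], top_n: int = 3) -> dict[int, list[str]]:
    """Label each cluster with its top class-based TF-IDF terms."""
    # Treat every cluster as one big document and count its terms.
    per_cluster: dict[int, Counter] = {}
    for doc, cid in zip(docs, cluster_ids):
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        per_cluster.setdefault(cid, Counter()).update(tokens)

    # Term counts across all clusters, and average words per cluster.
    total: Counter = Counter()
    for counts in per_cluster.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(per_cluster)

    labels: dict[int, list[str]] = {}
    for cid, counts in per_cluster.items():
        size = sum(counts.values())
        scores = {
            term: (n / size) * math.log(1 + avg_words / total[term])
            for term, n in counts.items()
        }
        # Highest score first; alphabetical order breaks exact ties.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        labels[cid] = ranked[:top_n]
    return labels
```

Terms that are frequent inside one cluster but rare across the corpus score highest, so the labels describe what distinguishes a cluster rather than what is common everywhere.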
AI Summaries
When configured, an LLM summarizes the top matched records for a query. The summary is generated from the ranked records and should be checked against the linked source pages.
If the AI provider is unavailable or rate-limited, ranked search results continue to work without the summary.
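The degradation behavior amounts to treating the summary as optional. A minimal sketch, assuming hypothetical `search` and `summarize` callables and an arbitrary cutoff of five top records:

```python
from typing import Callable, Optional


def search_with_summary(
    query: str,
    search: Callable[[str], list[dict]],
    summarize: Optional[Callable[[str, list[dict]], str]] = None,
    top_k: int = 5,  # assumed cutoff for records passed to the summarizer
) -> dict:
    """Always return ranked results; attach a summary only if one succeeds."""
    results = search(query)
    summary = None
    if summarize is not None and results:
        try:
            summary = summarize(query, results[:top_k])
        except Exception:
            # Provider unavailable or rate-limited: keep the results,
            # drop the summary.
            summary = None
    return {"results": results, "summary": summary}
```

Because the summary is built only from the ranked records it receives, a caller can show the same records next to it for checking against the linked source pages.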
Known Limitations
- The corpus only includes sources currently fetched and normalized by this project.
- Some public pages are stale, incomplete, moved, or blocked by source-site protections.
- Automated extraction can miss context or overstate the specificity of a capability need.
- A match does not mean a program is open, funded, eligible, relevant to a specific company, or still accepting submissions.