Methodology and Sources

Military Capability Gap Matcher combines public-source collection, source-specific normalization, vector search, lexical search, topic modeling, and optional AI summaries.

Last updated April 28, 2026.

Current Public Sources

SAM.gov notices
AFWERX topics
DoD SBIR/STTR topics from DSIP
USA.gov challenge listings
ERDCWERX challenge and capability-assessment posts
SOFWERX event problem statements
Army xTech competition pages
Defense Innovation Unit challenge pages
NAVSEA / NSWC Dahlgren innovation challenge pages
DARPA opportunity pages
Modern War Institute analysis

Collection

Each source has a dedicated fetch script that collects public pages, public APIs, or public JSON where available. Fetchers keep raw records so normalization can be reviewed and improved without losing the source context.
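As an illustration of keeping raw records for later review, the following is a minimal sketch, not the project's actual fetcher code. The function name `save_raw_record`, the JSON envelope fields, and the hash-based filename are all assumptions made for the example.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def save_raw_record(out_dir: Path, source: str, url: str, body: str) -> Path:
    """Write one fetched payload to disk unchanged, wrapped with fetch metadata.

    Hypothetical helper: the envelope layout is an assumption, not the
    project's real storage format.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # A content hash gives a stable filename, so re-fetching identical
    # content overwrites the same file instead of piling up duplicates.
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    record = {
        "source": source,
        "url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "sha256": digest,
        "body": body,  # stored verbatim so normalizers can be re-run later
    }
    path = out_dir / f"{source}-{digest[:16]}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

Storing the body verbatim, separate from any parsed fields, is what lets a normalizer be corrected and re-run without fetching the source again.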

Normalization

Source-specific normalizers convert raw source material into a common capability-record shape with title, gap statement, problem statement, mission context, desired capability, dates, agency path, source URL, tags, and metadata.

Structured solicitation and challenge sources are normalized around their stated problem, objective, requirement, or topic text. Broad analytical sources are handled more conservatively because they may describe military problems without being formal requirements.
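The common record shape described above could be sketched as a dataclass. The field names follow the list in this section, but the types, defaults, and the `normalize_example_topic` function with its raw input keys (`topic_title`, `objective`, and so on) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CapabilityRecord:
    # Field names mirror the list above; the types are illustrative assumptions.
    title: str
    source_url: str
    gap_statement: str = ""
    problem_statement: str = ""
    mission_context: str = ""
    desired_capability: str = ""
    dates: dict[str, str] = field(default_factory=dict)
    agency_path: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


def normalize_example_topic(raw: dict[str, Any]) -> CapabilityRecord:
    """Hypothetical normalizer for a raw topic dict with made-up keys."""
    return CapabilityRecord(
        title=raw["topic_title"].strip(),
        source_url=raw["url"],
        problem_statement=raw.get("objective", "").strip(),
        desired_capability=raw.get("description", "").strip(),
        # Keep only the date keys that the raw record actually provides.
        dates={k: raw[k] for k in ("open_date", "close_date") if k in raw},
        agency_path=[p.strip() for p in raw.get("agency", "").split("/") if p.strip()],
        tags=sorted(set(raw.get("keywords", []))),
        metadata={"source": "example"},
    )
```

Fields that a source does not state, such as `gap_statement` here, are left empty rather than inferred, which matches the conservative handling described for broad analytical sources.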

Gap Detection

The pipeline uses a mix of source-specific parsing, relevance heuristics, and LLM review where appropriate. SAM.gov records receive the heaviest filtering because many notices are acquisition paperwork rather than actual capability gaps.

When configured, LLM review acts as a gate for SAM.gov records. If the LLM quota is exhausted, the pipeline keeps its existing output or skips new uncertain records rather than silently lowering the standard.
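The quota behavior can be sketched as a small gate function. This is an assumption-heavy illustration covering only the case where a reviewer is configured: the heuristic score bands (`accept_above`, `reject_below`), the `QuotaExhausted` exception, and the `llm_is_gap` callable are hypothetical names, not the project's API.

```python
from enum import Enum
from typing import Callable


class QuotaExhausted(Exception):
    """Raised by the (hypothetical) LLM client when no quota remains."""


class Decision(Enum):
    KEEP = "keep"
    DROP = "drop"
    SKIP = "skip"  # leave undecided so a later run can review it


def gate_record(
    text: str,
    heuristic_score: float,
    llm_is_gap: Callable[[str], bool],
    accept_above: float = 0.9,  # assumed threshold, for illustration only
    reject_below: float = 0.2,  # assumed threshold, for illustration only
) -> Decision:
    """Decide whether a notice enters the corpus."""
    if heuristic_score >= accept_above:
        return Decision.KEEP
    if heuristic_score < reject_below:
        return Decision.DROP
    # Uncertain band: ask the LLM reviewer.
    try:
        return Decision.KEEP if llm_is_gap(text) else Decision.DROP
    except QuotaExhausted:
        # Do not fall back to a weaker rule; defer the record instead.
        return Decision.SKIP
```

The key design point is the `except` branch: an exhausted quota yields `SKIP`, never `KEEP`, so the acceptance standard stays the same whether or not the reviewer is reachable.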

Search

Search ranks records by semantic similarity using BAAI/bge-small-en-v1.5 embeddings when embeddings are available. Lexical overlap serves as a fallback and as a tie-breaker. Embeddings are stored in Postgres with pgvector.

Search scores are ranking signals, not confidence scores. A high score means the submitted text is similar to a record in the current corpus; it does not mean the solution is qualified, compliant, or selected.
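The interplay of semantic similarity, lexical fallback, and tie-breaking can be sketched in plain Python. This is a simplified illustration, not the production query path, which per the description above relies on pgvector. The record layout (`id`, `text`, `embedding`), the use of Jaccard word overlap as the lexical measure, and the function names are assumptions.

```python
import math
import re
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lexical_overlap(query: str, text: str) -> float:
    """Jaccard overlap of lowercase word sets (one simple choice of many)."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    t = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q & t) / len(q | t) if q and t else 0.0


def rank(query: str, query_vec: Optional[Sequence[float]], records: list[dict]):
    """Return (primary_score, lexical_score, record_id) tuples, best first."""
    scored = []
    for rec in records:
        lex = lexical_overlap(query, rec["text"])
        vec = rec.get("embedding")
        if query_vec is not None and vec is not None:
            primary = cosine(query_vec, vec)  # semantic similarity
        else:
            primary = lex  # fallback when an embedding is missing
        scored.append((primary, lex, rec["id"]))
    # Lexical overlap breaks ties between equal primary scores.
    scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
    return scored
```

The returned numbers only order records relative to one another within a single query, which is why they should be read as ranking signals rather than as confidence values.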

Topic Modeling

The primary topic model clusters embedded records with k-means, then labels clusters with class-based TF-IDF terms. A legacy TF-IDF/NMF model may also be retained for comparison.

Topics are corpus navigation aids. They do not represent official taxonomies, validated requirements, or acquisition categories.
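The labeling step can be illustrated with a short sketch. It assumes cluster assignments already exist (for example, from k-means over the record embeddings) and covers only the class-based TF-IDF part. The weighting used here, tf(t, c) * log(1 + A / f(t)), is the BERTopic-style formulation and is an assumption about which variant applies. The function name and tokenizer are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def ctfidf_labels(docs, cluster_ids, top_n=3):
    """Label clusters with class-based TF-IDF terms.

    tf(t, c): share of cluster c's tokens that are term t.
    A:        average token count per cluster.
    f(t):     count of term t across all clusters.
    """
    per_cluster = defaultdict(Counter)
    for doc, cid in zip(docs, cluster_ids):
        # Treat each cluster as one concatenated "class document".
        per_cluster[cid].update(re.findall(r"[a-z]+", doc.lower()))
    if not per_cluster:
        return {}

    total = Counter()
    for counts in per_cluster.values():
        total.update(counts)
    avg_tokens = sum(total.values()) / len(per_cluster)

    labels = {}
    for cid, counts in per_cluster.items():
        size = sum(counts.values())
        weights = {
            term: (n / size) * math.log(1 + avg_tokens / total[term])
            for term, n in counts.items()
        }
        # Highest weight first; alphabetical order breaks exact ties.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        labels[cid] = ranked[:top_n]
    return labels
```

Because the labels are just the highest-weighted terms in each cluster, they describe what the clustered text has in common and carry no official meaning, in line with the caveat above.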

AI Summaries

When configured, an LLM summarizes the top matched records for a query. The summary is generated from the ranked records and should be checked against the linked source pages.

If the AI provider is unavailable or rate-limited, ranked search results continue to work without the summary.
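The degradation behavior described above can be sketched as a wrapper that treats the summary as optional. The function name, the response dictionary layout, and the `summarize` callable are hypothetical and stand in for whatever provider client is configured.

```python
from typing import Callable, Optional


def search_with_optional_summary(
    query: str,
    ranked: list[dict],
    summarize: Optional[Callable[[str, list[dict]], str]] = None,
    top_k: int = 5,
) -> dict:
    """Return ranked results, adding an AI summary only if one can be produced."""
    response = {"query": query, "results": ranked, "summary": None}
    if summarize is None or not ranked:
        return response  # summaries not configured, or nothing to summarize
    try:
        # Only the top matches are passed to the summarizer.
        response["summary"] = summarize(query, ranked[:top_k])
    except Exception as exc:  # provider down, rate-limited, timed out, ...
        # Results are already in the response; record the failure and move on.
        response["summary_error"] = type(exc).__name__
    return response
```

The ranked results are placed in the response before the summarizer is called, so a provider failure can only remove the summary and never the results.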

Known Limitations

The corpus only includes sources currently fetched and normalized by this project.
Some public pages are stale, incomplete, moved, or blocked by source-site protections.
Automated extraction can miss context or overstate the specificity of a capability need.
A match does not mean a program is open, funded, eligible, relevant to a specific company, or still accepting submissions.