Methodology — MCP Observatory

what counts as an mcp server

Discovery spans six surfaces. From npm we run the registry's own keyword search for mcp — so a package is found because it declares that keyword, not because of how it is named — and read the top slice of those results each cycle, ranked by the registry's relevance. PyPI is searched the same way. A package that omits the keyword, or that ranks below the window against the tens of thousands that carry it, is not picked up from these two surfaces; it may still arrive via GitHub, the official registry, or a directory listing below. Smithery (smithery.ai) and mcp.so contribute their published directory listings. From GitHub we follow repos tagged with the mcp or model-context-protocol topics, surfacing the most recently active first and skipping repos quiet for over 18 months. From the official MCP registry we ingest every listed server with a public repo or package. Slice the resulting set by inferred capability and permission in the servers browser.

Beyond discovery, dedicated workers deepen and watch the set: they track MCP client releases, infer capabilities from each README, statically analyze the published source artifact, and ingest CVE / advisory feeds. Every worker and its cadence is in the poll-cadence table below; live health for each is on the feeds page.

one project, one identity

The same project routinely appears on several registries with mildly different names. A resolver collapses those into one canonical record, in priority order:

explicit override — a maintainer-curated map keyed by source + slug, loaded from overrides.json at boot.
cross-source repository url match — two records from different sources pointing at the same git remote are merged into one canonical identity. Same-source repo sharers are treated as monorepo siblings and are not merged.
exact source-record fold — a normalized (source, slug) pair that the source_records provenance ledger has seen before resolves to the existing identity, making re-resolution deterministic.

Failing all three, a record keeps its source-qualified id (e.g. npm:foo). This is why the tracked-server count is a count of identities, not of registry rows — a server on npm, GitHub, and Smithery is one server here.
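
The priority order above can be sketched roughly as follows. This is a minimal illustration, not the observatory's resolver: `Record`, `normalize_repo`, `resolve`, and the shapes of `overrides`, `repo_index`, and `ledger` are all hypothetical names chosen for the example, and the URL normalization is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    source: str                      # e.g. "npm", "github", "smithery"
    slug: str                        # name as the source reports it
    repo_url: Optional[str] = None   # declared git remote, if any


def normalize_repo(url: str) -> str:
    """Reduce a git remote to a comparable key (assumed normalization)."""
    key = url.strip().lower()
    for prefix in ("git+", "https://", "http://", "git://"):
        if key.startswith(prefix):
            key = key[len(prefix):]
    key = key.rstrip("/")
    if key.endswith(".git"):
        key = key[:-4]
    return key


def resolve(record, overrides, repo_index, ledger):
    """Return a canonical identity, trying the three rules in priority order."""
    key = (record.source, record.slug.lower())
    # 1. explicit override: curated map keyed by source + slug
    if key in overrides:
        return overrides[key]
    # 2. cross-source repository match: same remote, different source
    if record.repo_url:
        for other_source, identity in repo_index.get(normalize_repo(record.repo_url), []):
            if other_source != record.source:   # same-source sharers are monorepo siblings
                return identity
    # 3. exact source-record fold: this (source, slug) pair was seen before
    if key in ledger:
        return ledger[key]
    # fallback: keep the source-qualified id
    return f"{record.source}:{record.slug}"
```

Under these assumptions, a GitHub record sharing a remote with another GitHub record falls through rule 2 and keeps its own id, which is the monorepo-sibling behaviour described above.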

Levenshtein / fuzzy merging was explicitly removed: it chained distinct same-author tools together (e.g. two unrelated tools by the same publisher would collapse into one). The Levenshtein threshold now only feeds the typosquat radar (naming flags), which flags — never merges. Retired ids forward via identity_alias so old URLs redirect with 301 rather than 404.
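
A flag-only check of this kind might look like the sketch below. It assumes the distance-1 threshold described under the security signals; `within_one_edit` and `typosquat_flags` are hypothetical helper names, and the result is only a list of names to review, never a merge.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True when a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b:
        return False
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:   # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # exactly one substituted character
    return a[i:] == b[i + 1:]            # exactly one extra character in b


def typosquat_flags(candidate: str, protected_names: list[str]) -> list[str]:
    """Return the protected names the candidate sits one edit away from."""
    name = candidate.lower()
    return [t for t in protected_names if within_one_edit(name, t.lower())]
```

For example, `typosquat_flags("githb-mcp", ["github-mcp", "slack-mcp"])` flags only `github-mcp`, and an exact name match is not flagged at all.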

verification

A server is verified when at least one independent strong trust signal vouches for it. Each signal is recorded separately, so the badge always carries a reason:

official-registry — listed and active in the official MCP registry.
npm-provenance / pypi-attestation — a cryptographic build-provenance attestation asserting the published artifact was built from its source repo; we record that the registry publishes one.
smithery — Smithery's own verified flag.

These differ in how far we can stand behind them, and we label each accordingly. npm-provenance / pypi-attestation is attested — a build-provenance attestation the registry publishes and we read first-hand; we record that it exists, but don't re-verify the signature ourselves. official-registry and smithery are reported: a third party vouches and we relay it, but the observatory hasn't independently confirmed it. "Verified" is never our own endorsement — it always names whose claim it is.

A weak signal (a publisher owning the declared repo) is recorded but never sets the badge alone — otherwise nearly every server would qualify and the word would lose meaning. Losing a strong signal (an official entry going deprecated) drops the badge, which is itself a tracked change.
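
The badge rule can be sketched as a small function. The strong-signal names and their attested / reported labels come from the text above; the weak-signal key `publisher-owns-repo` and the function name `badge` are assumptions made for illustration.

```python
# Strong signals and the label each carries, as described above.
STRONG_SIGNALS = {
    "official-registry": "reported",
    "npm-provenance": "attested",
    "pypi-attestation": "attested",
    "smithery": "reported",
}
# Hypothetical key for the weak signal (publisher owns the declared repo).
WEAK_SIGNALS = {"publisher-owns-repo"}


def badge(signals: set[str]) -> dict:
    """Set the badge only when a strong signal is present, and keep the reasons."""
    reasons = {s: STRONG_SIGNALS[s] for s in sorted(signals) if s in STRONG_SIGNALS}
    return {
        "verified": bool(reasons),          # weak signals alone never set it
        "reasons": reasons,                 # whose claim it is, and how it is labelled
        "weak": sorted(signals & WEAK_SIGNALS),
    }
```

In this sketch, removing the last strong signal from the input flips `verified` to false on the next evaluation, which mirrors the badge drop described above.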

reading the code

Two layers populate each server's capability surface. Registry introspection reads the declared tool / resource / prompt counts and transport from Smithery and the official registry. Static code analysis goes deeper: for npm, PyPI, and GitHub servers a separate, isolated scanner process downloads the published artifact (npm tarball, PyPI sdist, or GitHub source tarball), unpacks it into a throwaway directory of its own, and reads the tree statically — no code is ever run (not the server, its tests, or its install hooks). Running the read + analysis in its own container is deliberate: a heavy scan can never stall ingestion or the live site. Every finding quotes the exact file and line:

permissions — evidence-backed filesystem, network, shell, secrets, database, and untrusted-content access (the injection-entry axis: web scrapes, issue/PR/comment bodies, fetched documents).
tools — SDK tool / resource registrations and decorators.
dependencies & install hooks — parsed manifests, plus pre/post-install scripts (read, never executed).
transports — stdio, SSE, streamable-HTTP, HTTP binds.
prompt surface — shipped skill / agent-instruction files (SKILL.md, .cursorrules, prompts/*) scanned for hidden-channel injection: invisible unicode, Unicode-Tag smuggling, and comment-buried model directives.
danger signals — committed secrets, dynamic-exec sinks (eval, pickle.loads), and suspicious call-home endpoints.
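
As one illustration of the prompt-surface check in the list above, hidden characters can be found by scanning code points and quoting the file and line of each hit. This is a sketch under assumptions: the exact character sets the scanner uses are not stated here, so the zero-width and bidi sets below are a plausible selection, and `classify` and `scan_text` are hypothetical names.

```python
# Assumed character sets; the real detector's lists may differ.
ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}
BIDI_CONTROLS = {0x200E, 0x200F} | set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def classify(ch: str):
    """Name the hidden-channel class of a character, or return None."""
    cp = ord(ch)
    if 0xE0000 <= cp <= 0xE007F:     # Unicode Tags block
        return "unicode-tag"
    if cp in BIDI_CONTROLS:
        return "bidi-control"
    if cp in ZERO_WIDTH:
        return "zero-width"
    return None


def scan_text(path: str, text: str) -> list[dict]:
    """Return one finding per hidden character, with file, line, and column."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            kind = classify(ch)
            if kind:
                findings.append({
                    "file": path,
                    "line": line_no,
                    "column": col,
                    "kind": kind,
                    "codepoint": f"U+{ord(ch):04X}",
                })
    return findings
```

The scan only reads text, so it fits the no-execution constraint: nothing in the artifact is imported or run.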

The analyzer also counts code files (any language) in the unpacked tree and records how many dependency-manifest entries it could parse. A repo with zero code files cannot implement a server whatever its README claims, so it's classified unconfirmed rather than trusted at face value; a repo whose manifest we couldn't parse stays pending — unknown, not absent — so it isn't hidden or wrongly branded either way. Both counts are shown alongside the evidence itself, never collapsed into a verdict.

A second static pass runs Semgrep's taint / dataflow rules (vendored and pinned in-repo) over the same unpacked source, tracing when a credential or environment value actually flows to a network or shell sink: the kind of source-to-sink path the line-level scan can only approximate. It runs in the same isolated scanner process as the line-level pass, so both stay off the main worker. It is still static (no code is executed), and its findings are tagged separately (semgrep) and treated as inferred review prompts, not verdicts.

Coverage: 58% of 26,365 servers are analyzed by the current analyzer (15,407 current · 8,150 re-analysis due, of which 8,150 await an analyzer re-scan and 0 have changed source · 2 not yet analyzed · 2,806 not analyzable). An analyzer bump marks the back-catalogue due and drops this figure until the re-scan catches up, so it tracks the running analyzer's live reach. A further 3,168 listed repos have since vanished upstream (deleted / renamed / made private) and are excluded from every count.
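
The arithmetic behind the coverage figure can be checked directly from the quoted breakdown. The four buckets sum to the 26,365 total (so that total includes the not-analyzable bucket), and the percentage is the current count over that total. The dictionary keys and function name are illustrative only.

```python
# Counts quoted in the coverage line above.
counts = {
    "current": 15_407,
    "re_analysis_due": 8_150,
    "not_yet": 2,
    "not_analyzable": 2_806,
}


def coverage(buckets: dict) -> float:
    """Share of all counted servers analyzed by the current analyzer."""
    return buckets["current"] / sum(buckets.values())


total = sum(counts.values())
print(f"{coverage(counts):.0%} of {total:,}")   # prints "58% of 26,365"
```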

The analyzer is versioned: each detector change bumps a version that marks the back-catalogue stale and re-scans it. The scanner changelog lists what every version detects and when it shipped. The findings themselves — hidden prompt content, committed secrets, dynamic-exec sinks, suspicious endpoints — are browsable and filterable on the code analysis page.

the security signals we derive

On top of the capability surface, the observatory derives a stack of independent security signals. Each is collected on its own cadence and surfaced on the security hub:

vulnerabilities — OSV.dev query for every npm / PyPI server, with CVSS severity, fix availability, and EPSS exploit-probability scoring.
capability drift — we snapshot each server's permission mask, tool count, transport, and verification, and raise an advisory when they change. Silent permission escalations on high-adoption servers are flagged for review as possible rug-pulls — a prompt to look, not an accusation; routine version bumps expand capabilities too.
naming & impersonation — servers whose names sit a single edit (Levenshtein distance 1) from a verified or official-scope target are flagged as possible typosquats. Distance 2 is deliberately not used: on a short name it collides unrelated real words far more often than it catches a squat, and this radar is a review prompt, so precision is worth more than recall.
abandonment — scored from release age and cadence; long-silent or content-less repos surface so dependents can plan around them.
supply chain — install-hook presence, build-provenance coverage, and registry deprecation flags.
license posture — SPDX classification flagging missing, non-commercial, or strong-copyleft terms.
hijack exposure — the intersection of abandoned, widely-depended-on, and CVE-carrying servers: the highest-leverage takeover targets.
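
The capability-drift signal in the list above amounts to diffing two snapshots. A minimal sketch, assuming the permission mask is a bit field over the six permission axes named earlier; the `Perm` bit layout, the `Snapshot` shape, and the `high_adoption` parameter are hypothetical.

```python
from dataclasses import dataclass
from enum import IntFlag


class Perm(IntFlag):
    # Assumed bit layout for the permission mask.
    FILESYSTEM = 1
    NETWORK = 2
    SHELL = 4
    SECRETS = 8
    DATABASE = 16
    UNTRUSTED_CONTENT = 32


@dataclass(frozen=True)
class Snapshot:
    permissions: Perm
    tool_count: int
    transport: str
    verified: bool


def names(mask: Perm) -> list[str]:
    """List the single-bit permission names set in a mask."""
    return [p.name for p in Perm if p & mask]


def drift(old: Snapshot, new: Snapshot, high_adoption: bool = False) -> dict:
    """Describe what changed between two snapshots and whether to prompt a review."""
    changes = []
    changed_bits = old.permissions ^ new.permissions
    gained = changed_bits & new.permissions
    lost = changed_bits & old.permissions
    if gained:
        changes.append("permissions gained: " + ", ".join(names(gained)))
    if lost:
        changes.append("permissions lost: " + ", ".join(names(lost)))
    for field in ("tool_count", "transport", "verified"):
        before, after = getattr(old, field), getattr(new, field)
        if before != after:
            changes.append(f"{field}: {before} -> {after}")
    # A gained permission on a high-adoption server is a prompt to look, not a verdict.
    return {"changes": changes, "review": bool(gained) and high_adoption}
```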

scoring risk

A background job folds those signals into a single 0–100 risk score and an A–E grade per server. Vulnerabilities (weighted by severity, halved when a fix exists, scaled by EPSS), inferred permission exposure, recent drift, CVEs inherited through the dependency graph, supply-chain flags, abandonment, and tool-safety findings all add weight; mitigators subtract it — being verified, shipping build provenance, and staying actively released. The score is a read cache, recomputed whenever a signal changes, and drives the risk leaderboard.
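
The shape of such a score can be sketched as follows. Every number here is a placeholder: the severity points, the mitigator weights, the `(1 + epss)` scaling, and the grade boundaries are assumptions for illustration, as is the reading that A is the lowest-risk grade. Only the structure (weights added, mitigators subtracted, fix halves a vulnerability, result clamped to 0-100) follows the description above.

```python
# Placeholder weights, not the observatory's actual values.
SEVERITY_POINTS = {"low": 5, "medium": 10, "high": 20, "critical": 30}
MITIGATORS = {"verified": 10, "build_provenance": 5, "actively_released": 5}


def vuln_points(vulns: list[dict]) -> float:
    """Sum vulnerability weight: by severity, halved if fixed, scaled by EPSS."""
    total = 0.0
    for v in vulns:
        points = SEVERITY_POINTS[v["severity"]]
        if v["fix_available"]:
            points /= 2
        total += points * (1 + v["epss"])    # assumed form of the EPSS scaling
    return total


def risk_score(vulns: list[dict], other_points: float, mitigators: list[str]) -> int:
    """Add signal weights, subtract mitigators, clamp to 0-100."""
    raw = vuln_points(vulns) + other_points - sum(MITIGATORS[m] for m in mitigators)
    return max(0, min(100, round(raw)))


def grade(score: int) -> str:
    """Map a score to A-E using assumed 20-point bands."""
    for limit, letter in ((20, "A"), (40, "B"), (60, "C"), (80, "D")):
        if score < limit:
            return letter
    return "E"
```

Here `other_points` stands in for the remaining additive signals (permission exposure, drift, inherited CVEs, supply-chain flags, abandonment, tool-safety findings), which a fuller sketch would compute separately.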

OWASP MCP Top 10 coverage

How our in-house detectors map to the OWASP MCP Top 10 (2025). We grade ourselves honestly — 8 covered, 1 partial, and 1 out of scope for a static, no-execution observatory. "Covered" means a heuristic detector points at the risk — not that every instance is caught. Detection is inferred throughout; this matrix is a coverage map, not a guarantee.

MCP01 Token Mismanagement & Secret Exposure · covered
Committed credentials / .env, and credential values fed into log sinks.
secret · token-log · dangerous_code

MCP02 Privilege Escalation via Scope Creep · covered
Over-broad OAuth scopes, capability gained after first observation, benign-named tools carrying dangerous perms.
oauth-scope · rug_pull · purpose_mismatch

MCP03 Tool Poisoning · covered
Name + description + full input schema (Full-Schema Poisoning), plus shipped prompt/skill files.
tool_poisoning · prompt_injection_file · skill-exfil

MCP04 Software Supply Chain Attacks & Dependency Tampering · covered
Dependency manifest, install hooks, OSV CVEs (incl. inherited), abandonment, build-provenance coverage.
dep · install-hook · cve · abandonment · provenance

MCP05 Command Injection & Execution · covered
Dynamic-exec sinks, unconstrained command/path params, shell capability.
dynamic-exec · dangerous_code · loose_schema · perm:shell

MCP06 Intent Flow Subversion · covered
Puppet attacks (untrusted tool steering toward a verified server's tool) and in-description tool steering.
cross_server_steering · tool_poisoning

MCP07 Insufficient Authentication & Authorization · partial
Gap: we infer scope/secret-handling from source but do not model a server's auth/authz configuration directly (much of it is deploy-time, not in the artifact).
oauth-scope · token-log

MCP08 Lack of Audit and Telemetry · out of scope
Gap: whether a server logs/audits its own actions is a runtime property a static, no-execution observatory cannot observe. Tracked for completeness.

MCP09 Shadow MCP Servers · covered
Unverified servers whose tool names or identities collide with a verified/official target (typosquat radar).
tool_shadow · naming/impersonation

MCP10 Context Injection & Over-Sharing · covered
Hidden-channel injection (zero-width / bidi / tag-smuggling / HTML-comment directives) in tool schemas and shipped prompt files, plus the lethal trifecta (toxic_flow): private-data access + untrusted-content ingestion + a network exfil channel reachable in one session (per-server and cross-server via the dependency graph).
hidden-prompt · prompt_injection_file · tool_poisoning · toxic_flow
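
The lethal-trifecta check named under MCP10 can be pictured as a set test over capabilities reachable from a server. A rough sketch, assuming each server's findings have already been reduced to three capability labels; the label names, the `servers` and `deps` shapes, and both function names are hypothetical.

```python
# Assumed labels for the three legs of the trifecta.
TRIFECTA = {"private_data", "untrusted_content", "network_exfil"}


def reachable_capabilities(server: str, servers: dict, deps: dict) -> set:
    """Union of a server's capabilities and those of everything it depends on."""
    seen, stack, caps = set(), [server], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue                      # guard against dependency cycles
        seen.add(node)
        caps |= servers.get(node, set())
        stack.extend(deps.get(node, ()))
    return caps


def toxic_flow(server: str, servers: dict, deps: dict) -> bool:
    """True when all three trifecta legs are reachable from this server."""
    return TRIFECTA <= reachable_capabilities(server, servers, deps)
```

With an empty `deps` map this reduces to the per-server case; with edges supplied it covers the cross-server case, where no single server holds all three legs on its own.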

poll cadences

Each source has its own ingest worker on a fixed interval. Discovery workers find and refresh servers; enrichment & security workers deepen and watch them. Live status for every worker — last fetch, next due, errors — is on the feeds page.

discovery

worker      interval      notes
npm         60 seconds
pypi        60 seconds
github      5 minutes     topic search, rate-limit aware
smithery    15 minutes    capability enrichment
mcpso       15 minutes
official    6 hours       official MCP registry: discovery + verification

enrichment & security

worker      interval      notes
clients     30 minutes    MCP client release tracking
static      30 minutes    README capability inference
code        5 minutes     static artifact analysis (no code run)
cve         4 hours       OSV + GitHub advisories + EPSS scoring
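
The cadences above map naturally onto a small interval table. The sketch below shows how a "next due" time could be derived from a worker's last fetch; the intervals are the ones listed, while `CADENCES`, `next_due`, and `is_due` are hypothetical names and the scheduling logic is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Intervals as listed in the cadence tables above.
CADENCES = {
    "npm": timedelta(seconds=60),
    "pypi": timedelta(seconds=60),
    "github": timedelta(minutes=5),
    "smithery": timedelta(minutes=15),
    "mcpso": timedelta(minutes=15),
    "official": timedelta(hours=6),
    "clients": timedelta(minutes=30),
    "static": timedelta(minutes=30),
    "code": timedelta(minutes=5),
    "cve": timedelta(hours=4),
}


def next_due(worker: str, last_fetch: datetime) -> datetime:
    """Time at which the worker should run again."""
    return last_fetch + CADENCES[worker]


def is_due(worker: str, last_fetch: datetime, now: datetime) -> bool:
    """True once the worker's fixed interval has elapsed."""
    return now >= next_due(worker, last_fetch)
```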

related servers

The relatedness graph derives edges from two signals: a shared maintainer (same npm user, PyPI author, or GitHub owner) and shared runtime dependencies. The dependency graph filters to the most active nodes — at least one release in the last 30 days — so dormant projects don't crowd the view. That makes it a deliberate selection bias: the graph reads as "who's active and clustered", not a census of the whole ecosystem — stable, mature, infrequently-released packages are under-represented by design. The same dependency edges power the inherited-CVE walk and the hijack-exposure model above.
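
Edge derivation and the activity filter can be sketched together. This assumes each server record carries a maintainer, a set of runtime dependencies, and a last-release date; the record shape and both function names are hypothetical.

```python
from datetime import date, timedelta
from itertools import combinations


def related_edges(servers: dict) -> dict:
    """Build edges from the two signals: shared maintainer and shared runtime deps."""
    edges = {}
    for a, b in combinations(sorted(servers), 2):
        reasons = []
        if servers[a]["maintainer"] == servers[b]["maintainer"]:
            reasons.append("shared-maintainer")
        if servers[a]["deps"] & servers[b]["deps"]:
            reasons.append("shared-deps")
        if reasons:
            edges[(a, b)] = reasons
    return edges


def active_nodes(servers: dict, today: date, window_days: int = 30) -> set:
    """Keep servers with at least one release inside the window."""
    cutoff = today - timedelta(days=window_days)
    return {name for name, s in servers.items() if s["last_release"] >= cutoff}
```

Filtering the graph to `active_nodes` before display is what produces the selection bias noted above: a mature package with no release in the window drops out of the view even when its edges still exist.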

what we don't do

no user-submitted servers — every record comes from a polled upstream.
no comments, votes, or rankings — the observatory measures, it doesn't curate.
no runtime execution — code analysis is purely static. We never run a server, its tests, or its install hooks; we only read the published source.

For the tech stack behind all of this, see the colophon.
