Semantic Analysis
Gustav Svalander
The Problem With Strong Signals
In 2018, a handful of commodity traders noticed something odd. Satellite imagery showed unusual grain shipment patterns in the Black Sea region weeks before any official export data was published. The traders who acted on that ambiguous, hard-to-interpret signal made fortunes. The ones who waited for the official numbers found the opportunity already gone.
Or consider pharmaceutical investing. By the time the FDA publishes an approval decision, the stock has already moved. But months earlier, scattered clues were visible to anyone watching closely enough: a clinical trial site quietly accelerating patient recruitment, a patent filing by a competitor positioning around the drug's mechanism, an insurance company beginning formulary pre-negotiations. Each clue was inconclusive on its own. Together, they told a story.
This is the pattern. The information that matters most is rarely the headline. It's the collection of faint, scattered signals that precede the headline, visible across dozens of unrelated sources, none of them conclusive individually.
Everyone is chasing the same data. The same earnings calls, the same Reuters wires, the same government press releases. By the time a piece of information is clear enough to act on, everyone has already acted on it. The advantage lives somewhere else.
It lives in weak signals.
What a Weak Signal Is
A weak signal is a piece of information that, on its own, has almost zero predictive value. A logistics trade publication noting unusual pre-shipment volumes at a European port. A think tank working paper getting cited more than usual. An industry lobbyist's newsletter quietly shifting tone. A mid-level official's LinkedIn post echoing language patterns from previous pre-announcement periods.
No single one of these tells you anything. But when five or six of them converge, from independent sources, in a tight time window, they tell you something the market hasn't priced yet.
The trouble is that no human can watch all of these simultaneously. Not across domains, not across languages, not at the velocity that information moves today. By the time an analyst reads the logistics report, the LinkedIn post is three days old and the think tank paper has already been cited in a mainstream article. The window has closed.
This is a coverage problem and a synthesis problem. No individual, no team, can monitor enough sources and connect the dots fast enough. You need infrastructure that watches everything and surfaces convergence automatically.
Meaning, Not Keywords
Traditional approaches to information monitoring rely on keywords. Set up a Google Alert for "EU tariffs." Subscribe to an RSS feed. Build a classifier.
Keywords fail here because weak signals rarely contain the words you'd search for. The logistics publication doesn't say "tariff." The lobbyist doesn't say "protectionism." The LinkedIn post doesn't say "announcement." Each source uses its own vocabulary, its own framing, its own context. What connects them isn't a shared word. It's a shared meaning.
This is what embedding models are good at. They don't match strings. They match concepts. A sentence about "pre-positioning inventory ahead of anticipated regulatory changes" and a sentence about "level playing field for domestic manufacturers" sit close together in embedding space, even though they share almost no words.
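To make the keyword failure concrete, here is a small sketch that measures literal word overlap between the two example sentences above. The `keyword_overlap` helper is a hypothetical name used only for illustration. It computes plain Jaccard overlap on lowercase tokens, which is roughly what a keyword matcher sees. Showing that the two sentences sit close together in embedding space would additionally need an embedding model, which is left out here.

```python
def keyword_overlap(text_a: str, text_b: str) -> float:
    """Jaccard overlap of lowercase, whitespace-separated tokens."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


logistics = "pre-positioning inventory ahead of anticipated regulatory changes"
lobbying = "level playing field for domestic manufacturers"

overlap = keyword_overlap(logistics, lobbying)
print(overlap)  # prints 0.0: the two sentences share no tokens at all
```

A keyword alert built around either sentence would never fire on the other. Any link between them has to come from a representation of meaning.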
But embedding a document and running a nearest-neighbor search isn't enough. You need geometric precision. You need to say: I want things that are near this concept, aligned with this direction, and away from that noise. Nearest-neighbor doesn't give you that. You need a query language.
Geometric Queries
SemQL, the query language in Semantik, expresses subscriptions as geometric predicates over embedding space. Three primitives.
DISTANCE defines a sphere. Everything within a given cosine distance of an anchor point matches. This is your classic similarity search, the baseline.
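As a minimal sketch of the idea, a DISTANCE predicate can be read as a threshold on cosine distance from an anchor vector. The function names and the toy 3-dimensional vectors below are illustrative assumptions, not Semantik's API. Real embeddings have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distance_match(anchor: Sequence[float],
                   candidate: Sequence[float],
                   max_distance: float) -> bool:
    """Hypothetical DISTANCE predicate: candidate lies within max_distance of anchor."""
    return cosine_distance(anchor, candidate) <= max_distance


# Toy vectors standing in for embeddings (assumed values, not model output).
anchor = [1.0, 0.0, 0.0]
near = [0.9, 0.1, 0.0]   # almost the same orientation as the anchor
far = [0.0, 1.0, 0.0]    # orthogonal to the anchor

print(distance_match(anchor, near, 0.1))
print(distance_match(anchor, far, 0.1))
```

The threshold is the only tuning knob here: a smaller `max_distance` gives a tighter sphere and fewer, closer matches.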
DIRECTION defines a cone. It matches anything aligned with a concept direction, regardless of how far away it is. This is how you watch for a theme across diverse sources. A logistics article and a policy paper may be far apart in absolute distance, but if both point in the direction of "trade protection preparation," a DIRECTION query catches them.
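One way to picture a DIRECTION predicate is as a cone test: a candidate matches when its offset from some reference point makes a small enough angle with the concept direction, however long that offset is. The sketch below takes that reading. The `apex` reference point, the `min_cosine` threshold, and the function names are all assumptions made for illustration. The text does not specify how Semantik anchors the cone.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is the zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def direction_match(direction: Sequence[float],
                    candidate: Sequence[float],
                    apex: Sequence[float],
                    min_cosine: float) -> bool:
    """Hypothetical DIRECTION predicate: the offset from apex falls inside the cone."""
    offset = [c - p for c, p in zip(candidate, apex)]
    return cosine_similarity(direction, offset) >= min_cosine


apex = [0.0, 0.0]             # assumed reference point (cone tip)
direction = [1.0, 1.0]        # toy "concept direction"
close_aligned = [0.5, 0.6]    # near the apex, aligned with the direction
far_aligned = [9.0, 10.0]     # far from the apex, still aligned
off_axis = [1.0, -1.0]        # perpendicular to the direction

for name, vec in [("close", close_aligned), ("far", far_aligned), ("off", off_axis)]:
    print(name, direction_match(direction, vec, apex, 0.95))
```

Both aligned candidates match even though one is roughly twenty times farther from the apex. Only the angle matters, which is the property the cone description relies on.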
CONTRAST defines a region through attraction and repulsion. Attract toward "preparation, pre-positioning, anticipatory action." Repel from "routine trade, normal operations." The result is a composite vector that carves out exactly the semantic region you care about, filtering noise by definition rather than by manual exclusion rules.
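A simple way to build such a composite vector is to subtract a weighted, normalized repel vector from a normalized attract vector and renormalize the result. That arithmetic is an assumption chosen for this sketch. The text does not say how Semantik combines the two sides, and `contrast_vector`, `repel_weight`, and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; raises ValueError for the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def contrast_vector(attract: Sequence[float],
                    repel: Sequence[float],
                    repel_weight: float = 1.0) -> List[float]:
    """Assumed composite: unit(attract) - repel_weight * unit(repel), renormalized."""
    a = normalize(attract)
    r = normalize(repel)
    return normalize([x - repel_weight * y for x, y in zip(a, r)])


def score(composite: Sequence[float], candidate: Sequence[float]) -> float:
    """Cosine similarity between the composite vector and a candidate."""
    c = normalize(candidate)
    return sum(x * y for x, y in zip(composite, c))


# Toy axes (assumed): index 0 = "anticipatory action", index 1 = "routine trade".
attract = [1.0, 1.0, 0.0]   # preparation language that also mentions trade
repel = [0.0, 1.0, 0.0]     # routine trade, normal operations

composite = contrast_vector(attract, repel)
preparation_msg = [1.0, 0.2, 0.0]
routine_msg = [0.1, 1.0, 0.0]

print(round(score(composite, preparation_msg), 3))
print(round(score(composite, routine_msg), 3))
```

The routine message scores below zero because the composite vector points away from the repelled axis. Raising `repel_weight` pushes it further away, at the cost of also penalizing preparation messages that mention routine trade in passing.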
These compose with boolean logic. AND narrows. OR broadens. NOT excludes. The result is a subscription that watches a precise region of meaning across every source flowing through the broker.
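The composition can be sketched with ordinary predicate combinators. Everything here is hypothetical: `near`, `all_of`, `any_of`, and `negate` are illustrative Python helpers, not SemQL syntax, and the vectors are toy values.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Predicate = Callable[[Vector], bool]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def near(anchor: Vector, min_cosine: float) -> Predicate:
    """Predicate that is true when a vector is similar enough to the anchor."""
    return lambda v: cosine_similarity(anchor, v) >= min_cosine


def all_of(*preds: Predicate) -> Predicate:
    """AND: every predicate must hold, so the region narrows."""
    return lambda v: all(p(v) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """OR: at least one predicate must hold, so the region broadens."""
    return lambda v: any(p(v) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """NOT: exclude whatever the predicate matches."""
    return lambda v: not pred(v)


preparation = [1.0, 0.0]   # toy anchor for "anticipatory action"
routine = [0.0, 1.0]       # toy anchor for "routine trade"

subscription = all_of(near(preparation, 0.6), negate(near(routine, 0.6)))

print(subscription([0.95, 0.1]))  # preparation only
print(subscription([0.7, 0.7]))   # close to both anchors, so NOT excludes it
print(subscription([0.1, 0.9]))   # routine only
```

Because each combinator returns another predicate, arbitrarily nested expressions can be built from the same three pieces.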
Convergence Is the Signal
A single match from a single source is noise. Convergence from independent sources is signal.
This is the key insight. When a DIRECTION query toward "regulatory action preparation" starts returning matches from a logistics namespace, a policy namespace, an industry lobbying namespace, and an earnings call namespace, all within a 72-hour window, something is happening. No individual match is conclusive. The pattern is.
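A rough sketch of the convergence check: count how many distinct namespaces produced at least one match inside the window, and flag the pattern once that count crosses a threshold. The `Match` record, the function names, and the default of four namespaces are assumptions for illustration. The 72-hour window comes from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Set


@dataclass(frozen=True)
class Match:
    namespace: str
    timestamp: datetime


def converging_namespaces(matches: Iterable[Match],
                          now: datetime,
                          window: timedelta = timedelta(hours=72)) -> Set[str]:
    """Namespaces with at least one match inside the window ending at `now`."""
    cutoff = now - window
    return {m.namespace for m in matches if cutoff <= m.timestamp <= now}


def is_convergent(matches: Iterable[Match],
                  now: datetime,
                  min_namespaces: int = 4,
                  window: timedelta = timedelta(hours=72)) -> bool:
    """True when enough distinct namespaces matched inside the window."""
    return len(converging_namespaces(matches, now, window)) >= min_namespaces


now = datetime(2024, 1, 10, 12, 0)
matches = [
    Match("logistics", now - timedelta(hours=5)),
    Match("policy", now - timedelta(hours=20)),
    Match("lobbying", now - timedelta(hours=40)),
    Match("earnings", now - timedelta(hours=70)),
    Match("logistics", now - timedelta(hours=200)),  # outside the 72h window
]

print(sorted(converging_namespaces(matches, now)))
print(is_convergent(matches, now))
```

Counting namespaces rather than raw matches means a burst of ten articles from one source contributes only once.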
Semantik's namespace model makes independence verification possible. Each data source publishes into its own namespace. A query can span multiple namespaces, and the results carry metadata, including source address and timestamp. A verification agent can check: are these five matches genuinely from five independent sources, or are they all downstream of the same wire report? If three messages from different namespaces turn out to embed within a tight DISTANCE of each other and share similar timestamps, they're probably echoes of a single source. That's one signal, not three.
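The echo check described above can be sketched as a greedy pass over the matches: a match is treated as an echo when it is both very close in embedding space and close in time to a match already accepted as independent. The thresholds, the `Match` record, and `count_independent_signals` are hypothetical. A real verification agent would likely use more evidence than these two cues.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass
class Match:
    namespace: str
    timestamp: datetime
    vector: Sequence[float]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def count_independent_signals(matches: List[Match],
                              max_distance: float = 0.05,
                              max_gap: timedelta = timedelta(hours=2)) -> int:
    """Greedy heuristic: collapse near-duplicate, near-simultaneous matches into one."""
    representatives: List[Match] = []
    for m in sorted(matches, key=lambda item: item.timestamp):
        is_echo = any(
            cosine_distance(m.vector, r.vector) <= max_distance
            and abs(m.timestamp - r.timestamp) <= max_gap
            for r in representatives
        )
        if not is_echo:
            representatives.append(m)
    return len(representatives)


day = datetime(2024, 1, 10)
matches = [
    Match("newswire", day.replace(hour=9, minute=0), [1.0, 0.0, 0.0]),
    Match("policy", day.replace(hour=9, minute=20), [0.99, 0.05, 0.0]),
    Match("industry", day.replace(hour=9, minute=45), [0.98, 0.0, 0.06]),
    Match("logistics", day.replace(hour=10, minute=0), [0.0, 1.0, 0.0]),
]

print(count_independent_signals(matches))
```

Three of the four matches collapse into a single signal because they are nearly identical and arrive within the same hour. The logistics match stays separate.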
Time windowing adds another dimension. Run the same query at WINDOW 1h, WINDOW 6h, WINDOW 24h, WINDOW 7d. Compare the density and match quality across windows. If short-window results are intensifying while the 7d window shows no prior activity in that region, you're catching an emerging trend. If the 7d window is already dense, the information is stale. The gradient across time windows is itself a feature.
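The comparison across windows can be sketched as follows. The window labels mirror the ones above. The classification rule, the `dense_threshold` value, and the function names are assumptions made for this example, not documented Semantik behavior.

```python
from datetime import datetime, timedelta
from typing import Dict, List

WINDOWS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
}


def window_counts(timestamps: List[datetime], now: datetime) -> Dict[str, int]:
    """Number of matches inside each window ending at `now`."""
    return {
        label: sum(1 for t in timestamps if now - span <= t <= now)
        for label, span in WINDOWS.items()
    }


def classify_trend(timestamps: List[datetime], now: datetime,
                   dense_threshold: int = 10) -> str:
    """Assumed rule: 'emerging' if recent matches accelerate with no prior activity."""
    counts = window_counts(timestamps, now)
    prior = counts["7d"] - counts["24h"]               # older than 24h, within 7d
    intensifying = counts["6h"] / 6 > counts["24h"] / 24  # hourly rate is rising
    if intensifying and prior == 0:
        return "emerging"
    if prior >= dense_threshold:
        return "stale"
    return "unclear"


now = datetime(2024, 1, 10, 12, 0)
fresh = [now - timedelta(minutes=10), now - timedelta(minutes=30),
         now - timedelta(hours=2), now - timedelta(hours=5)]
old = [now - timedelta(days=2, hours=i) for i in range(12)] + [now - timedelta(hours=1)]

print(classify_trend(fresh, now))
print(classify_trend(old, now))
```

The per-window counts can also be kept as raw features rather than collapsed into a label, which preserves the gradient for downstream scoring.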
What Actors Do Before They Announce
The most valuable weak signals are almost never about the event directly. They're about actors preparing for it.
Before a regulator moves, there are precursors: the agency posts for specialists in the relevant domain, FOIA request patterns shift, inspector travel records change, industry conference panels pivot topics, and law firm blogs start hedging language.
Before a scientific approval, there are precursors: clinical trial recruitment velocity changes, principal investigators shift their conference appearances and tone, competitors file adjacent patents, insurance companies begin formulary pre-negotiations, and manufacturing partners scale up.
Before a geopolitical escalation, there are precursors: state media framing shifts, diplomatic travel patterns change, military procurement shows anomalies, communications traffic deviates from baseline, and ally nations' language subtly adjusts.
Each of these is a different domain, a different vocabulary, a different publication cadence. What unites them is a shared direction in embedding space: preparation for an outcome that hasn't been announced.
A single SemQL subscription, scoped across the relevant namespaces, with a DIRECTION toward the preparation concept and a CONTRAST repelling routine activity, can watch all of these simultaneously. The subscription doesn't need to know what keywords each domain uses. It's watching for meaning.
The Infrastructure Layer
None of this works with batch processing. By the time you've aggregated yesterday's data, run embeddings overnight, and generated a report in the morning, the window is closed. Weak-signal synthesis is a streaming problem. Signals need to flow to the agents that care about them as they arrive, not when a cron job runs.
This is what a semantic message broker is built for. Agents publish what they learn. Agents subscribe to what they need. The broker matches by meaning and delivers in real time. No polling, no batch ETL, no manual topic string configuration.
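To illustrate the publish and subscribe flow, here is a toy in-memory broker that delivers each message to matching subscribers at the moment it is published. It is a sketch under strong assumptions: `ToyBroker`, `Subscription`, and their methods are hypothetical names, messages arrive with precomputed toy vectors, and matching is a plain cosine threshold scoped by namespace. It says nothing about how Semantik itself is implemented.

```python
import math
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class Subscription:
    anchor: Sequence[float]               # what the subscriber cares about
    min_cosine: float                     # how close a message must be
    namespaces: FrozenSet[str]            # which sources to watch
    callback: Callable[[str, str], None]  # called with (namespace, payload)


class ToyBroker:
    """In-memory stand-in for a semantic broker; delivery happens inside publish()."""

    def __init__(self) -> None:
        self._subscriptions: List[Subscription] = []

    def subscribe(self, subscription: Subscription) -> None:
        self._subscriptions.append(subscription)

    def publish(self, namespace: str, vector: Sequence[float], payload: str) -> int:
        """Deliver to every matching subscription; return the number of deliveries."""
        delivered = 0
        for sub in self._subscriptions:
            in_scope = namespace in sub.namespaces
            if in_scope and cosine_similarity(sub.anchor, vector) >= sub.min_cosine:
                sub.callback(namespace, payload)
                delivered += 1
        return delivered


inbox: List[Tuple[str, str]] = []
broker = ToyBroker()
broker.subscribe(Subscription(
    anchor=[1.0, 0.0],
    min_cosine=0.8,
    namespaces=frozenset({"logistics", "policy"}),
    callback=lambda ns, payload: inbox.append((ns, payload)),
))

sent_a = broker.publish("logistics", [0.9, 0.1], "unusual pre-shipment volumes")
sent_b = broker.publish("sports", [0.9, 0.1], "out-of-scope namespace")
sent_c = broker.publish("policy", [0.0, 1.0], "unrelated policy note")
print(inbox)
```

Subscribers never poll and never name a topic string. They state a region of meaning and a set of namespaces, and the broker calls them when something lands inside it.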
The subscription is the strategy. The broker is the execution layer.
What Comes Next
We're building Semantik for this. A broker where subscriptions are geometric predicates, where messages route by meaning, and where agents cooperate through shared semantic context rather than hardwired integrations.
The teams that figure out how to synthesize weak signals across diverse, independent sources will have a structural advantage that keyword-based systems can't match. Not because they have better data, but because they have better geometry.
The data is already out there. The question is whether your infrastructure can hear what it's saying.