Hosted on noosphere.hyper.media via the Hypermedia Protocol

Problem

Search lives in a dropdown. You type, you get title matches, you pick one. It works for "I know the name of the thing I want." It falls apart for everything else.

Try searching for a concept that spans multiple documents. Or finding something you wrote last week but can't remember the title of. Or figuring out which version of a doc had that specific paragraph.

We need a dedicated search page, but the current search API wasn't built for one. Now that semantic search and RRF (reciprocal rank fusion) result combining are in place, a full search experience becomes genuinely useful.

Solution

1. IRI Filter — Scope Search to a Document or Subpath

Currently, search is scoped either to an entire account (for web search) or to the whole library. With IRI filtering, users can narrow a search to a specific document or folder subpath.

Examples:

Search only within a single document: hm://<account>/cars/honda

Search within a subpath: hm://<account>/cars/* (all documents under "cars")

Leave empty to search the entire account (current behavior)

This lets the search page offer a "search within" dropdown scoped to the user's current location in the document tree.
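The matching rule behind the examples above can be sketched in a few lines. This is a minimal illustration, not the backend's implementation: the function name `iri_matches` is hypothetical, and it assumes that a trailing `/*` matches only descendants of the path (not the folder document itself) and that an empty filter simply applies no extra scoping on top of the existing account-level scope.

```python
def iri_matches(filter_iri: str, doc_iri: str) -> bool:
    """Return True if doc_iri falls inside the scope described by filter_iri.

    Hypothetical helper; the wildcard semantics are an assumption.
    """
    if not filter_iri:
        # Empty filter: no additional scoping (current behavior).
        return True
    if filter_iri.endswith("/*"):
        # Keep the trailing slash so "cars/*" does not match "carson".
        prefix = filter_iri[:-1]
        return doc_iri.startswith(prefix)
    # No wildcard: the filter names exactly one document.
    return doc_iri == filter_iri


if __name__ == "__main__":
    print(iri_matches("hm://acct/cars/*", "hm://acct/cars/honda"))
    print(iri_matches("hm://acct/cars/honda", "hm://acct/cars/honda/civic"))
```

Whether `hm://<account>/cars/*` should also include the `cars` document itself is a product decision this sketch does not settle.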

2. Content-Type Filter — Choose What Gets Searched

Today the search either looks at titles only, or titles + document bodies. The new content-type filter gives fine-grained control over which content types are included:

  • Document — document body content

  • Comment — comments on documents

  • Contact — contact/profile information

  • Title — document titles

The search page can expose these as checkboxes or a filter bar. When no filter is selected, the existing behavior is preserved.
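As a rough sketch of how the filter could behave, the four content types map naturally onto an enum, and an empty selection leaves the result set untouched. The names `ContentType` and `filter_by_content_type`, and the dictionary shape of a result, are assumptions made for illustration only.

```python
from enum import Enum


class ContentType(Enum):
    DOCUMENT = "document"  # document body content
    COMMENT = "comment"    # comments on documents
    CONTACT = "contact"    # contact/profile information
    TITLE = "title"        # document titles


def filter_by_content_type(results, selected):
    """Keep only results whose content type is in `selected`.

    An empty selection applies no filtering, so whatever the existing
    code path returned is passed through unchanged.
    """
    if not selected:
        return list(results)
    return [r for r in results if r["content_type"] in selected]


if __name__ == "__main__":
    hits = [
        {"id": 1, "content_type": ContentType.TITLE},
        {"id": 2, "content_type": ContentType.COMMENT},
    ]
    print(filter_by_content_type(hits, {ContentType.COMMENT}))
```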

3. Authority Ranking — Citation-based Result Quality

An opt-in ranking mode that uses citation data to surface more authoritative results. It blends two signals into the existing search ranking:

Document authority — how many other documents cite/link to this document

Author authority — how many external citations the document's author has received across all their work (self-citations excluded)

When enabled, the ranking weights become:

Semantic similarity 35%

Keyword match 35%

Document citations 20%

Author citations 10%
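To make the weighting concrete, here is a minimal sketch of a linear blend using the four percentages above. It assumes each signal has already been normalized to the range 0 to 1; how the real backend normalizes citation counts, or how this interacts with RRF, is not specified here. The names `AUTHORITY_WEIGHTS` and `blended_score` are hypothetical.

```python
# Weights from the proposal, applied only when authority ranking is enabled.
AUTHORITY_WEIGHTS = {
    "semantic": 0.35,
    "keyword": 0.35,
    "doc_citations": 0.20,
    "author_citations": 0.10,
}


def blended_score(signals: dict) -> float:
    """Weighted sum of per-signal scores.

    Assumes every signal is pre-normalized to [0, 1]; missing signals
    count as 0.0.
    """
    return sum(
        weight * signals.get(name, 0.0)
        for name, weight in AUTHORITY_WEIGHTS.items()
    )


if __name__ == "__main__":
    example = {
        "semantic": 0.8,
        "keyword": 0.6,
        "doc_citations": 0.5,
        "author_citations": 0.2,
    }
    # 0.35*0.8 + 0.35*0.6 + 0.20*0.5 + 0.10*0.2, roughly 0.61
    print(round(blended_score(example), 2))
```

With these weights, text relevance still contributes 70% of the score, so a heavily cited but off-topic document cannot outrank a strong match on citations alone.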

Why exclude self-citations? Testing showed one author had 98% self-citations, inflating their score from 4 to 227. Filtering self-citations keeps the signal honest.
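The self-citation filter can be illustrated with a small counting sketch. It assumes citations are available as (source document, target document) pairs and that each document maps to a single author; both the data shapes and the name `author_authority` are assumptions for illustration.

```python
from collections import Counter


def author_authority(citations, doc_author):
    """Count external citations received per author.

    citations:  iterable of (source_doc, target_doc) pairs
    doc_author: dict mapping document id -> author id
    A citation is skipped when source and target share an author.
    """
    counts = Counter()
    for source, target in citations:
        target_author = doc_author.get(target)
        if target_author is None:
            continue  # unknown target, nothing to credit
        if doc_author.get(source) == target_author:
            continue  # self-citation, excluded from the signal
        counts[target_author] += 1
    return dict(counts)


if __name__ == "__main__":
    docs = {"a1": "alice", "a2": "alice", "b1": "bob"}
    cites = [("a1", "a2"), ("a2", "a1"), ("a1", "a2"), ("b1", "a1")]
    print(author_authority(cites, docs))
```

In the usage example, Alice receives four citations in total but only one from another author, so her authority count is 1.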

Performance: Authority scores are computed on-the-fly from existing indexed data — ~7ms for 200 documents. No precomputation or caching needed.

4. Semantic Dedup — Remove Near-Duplicate Versions

Problem: When a document has multiple versions with minor edits (e.g., "cars" changed to "cars."), search returns both versions as separate results even though they're semantically identical.

Solution: For semantic and hybrid search modes, group results by document + block + content type, then compare how similarly each version matches the query. If two consecutive versions score within 20% of each other, only the newest version is kept.

Versions with meaningfully different content (>20% score difference) are both preserved

Keyword-only search keeps the existing exact-match dedup (appropriate since it's character-level matching)

This reduces clutter from minor edits without hiding genuinely different content across versions.
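The dedup pass described above could look roughly like the following. This is a sketch under stated assumptions: the `Hit` fields, the integer `version_ts` used for ordering, and the definition of "within 20%" as a relative difference against the larger of two non-negative scores are all illustrative choices, not confirmed details of the backend.

```python
from collections import defaultdict
from dataclasses import dataclass

THRESHOLD = 0.20  # hardcoded, as in the proposal


@dataclass(frozen=True)
class Hit:
    doc: str
    block: str
    content_type: str
    version_ts: int   # assumed ordering key: larger means newer
    score: float      # assumed non-negative query-match score


def _similar(a: float, b: float) -> bool:
    """True when two scores differ by at most THRESHOLD, relative to the larger."""
    top = max(a, b)
    if top == 0:
        return True
    return abs(a - b) / top <= THRESHOLD


def semantic_dedup(hits):
    """Drop older versions whose score is close to the next newer version."""
    groups = defaultdict(list)
    for h in hits:
        groups[(h.doc, h.block, h.content_type)].append(h)

    kept = []
    for versions in groups.values():
        versions.sort(key=lambda h: h.version_ts)
        # Compare each version with the one that follows it.
        for older, newer in zip(versions, versions[1:]):
            if not _similar(older.score, newer.score):
                kept.append(older)
        kept.append(versions[-1])  # the newest version always survives

    kept.sort(key=lambda h: h.score, reverse=True)
    return kept


if __name__ == "__main__":
    hits = [
        Hit("A", "b1", "document", 1, 0.80),
        Hit("A", "b1", "document", 2, 0.82),
        Hit("A", "b1", "document", 3, 0.40),
    ]
    for h in semantic_dedup(hits):
        print(h.version_ts, h.score)
```

In the usage example, version 1 is dropped because it scores almost the same as version 2, while versions 2 and 3 both survive because their scores differ by well over 20%.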

Results Visualization

The dedicated search page displays results as a vertical scrollable list of cards. Each card contains:

Document title — clickable, navigates to the document

Full path breadcrumb — e.g. My Account / cars / honda / civic showing the document's position in the hierarchy

Version indicator — which version matched (timestamp or version label)

Content snippet — preview of the matched block with query terms highlighted inline

Cards stack vertically in a single column, optimized for scannability and variable-length snippets.

Title-only matches show the first content block as a fallback snippet.
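A few small helpers illustrate how a card's text fields might be derived. Everything here is an assumption for illustration: the function names, the use of `<mark>` tags for highlighting, and the simple case-insensitive substring match are stand-ins for whatever the frontend actually uses.

```python
import re


def breadcrumb(account_name: str, path: str) -> str:
    """Build a breadcrumb such as 'My Account / cars / honda / civic'."""
    segments = [s for s in path.split("/") if s]
    return " / ".join([account_name, *segments])


def card_snippet(content_type: str, matched_text: str, first_block_text: str) -> str:
    """Title-only matches fall back to the document's first content block."""
    return first_block_text if content_type == "title" else matched_text


def highlight(snippet: str, terms, open_tag: str = "<mark>", close_tag: str = "</mark>") -> str:
    """Wrap each query term in the snippet with highlight markers (case-insensitive)."""
    terms = [t for t in terms if t]
    if not terms:
        return snippet
    # Longest terms first so "honda civic" wins over "honda".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", snippet)


if __name__ == "__main__":
    print(breadcrumb("My Account", "cars/honda/civic"))
    print(highlight("Honda Civic review", ["civic"]))
```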

Scope

  • 4 backward-compatible additions to the existing search API

  • Backend-only changes — no database migrations, no new tables

Rabbit Holes

  • Materialized authority caches — not needed, batch queries are fast enough on indexed data

  • Configurable ranking weights or A/B testing — premature; hardcoded constants for now

  • Complex dedup strategies (e.g., per-paragraph diffing) — percentage threshold on query score is simpler and sufficient

No Gos

Open Question

How should search type (keyword / semantic / hybrid) be exposed on the search page?

Explicit toggle — full control, potentially confusing for non-technical users

Smart defaults with override — hybrid for search page, keyword for dropdown

Backend heuristic — auto-select based on query length

Always hybrid (simplest?)

This doesn't block backend work since the search type selection already exists in the API.
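For reference, the "smart defaults with override" option from the list above is small enough to sketch. The surface names (`search_page`, `dropdown`) and the function `resolve_search_type` are hypothetical; only the three search types and the proposed defaults come from this document.

```python
from typing import Optional

SEARCH_TYPES = {"keyword", "semantic", "hybrid"}

# Proposed defaults: hybrid for the search page, keyword for the dropdown.
DEFAULT_BY_SURFACE = {
    "search_page": "hybrid",
    "dropdown": "keyword",
}


def resolve_search_type(surface: str, override: Optional[str] = None) -> str:
    """Pick the search type for a UI surface, honoring an explicit user override."""
    if override is not None:
        if override not in SEARCH_TYPES:
            raise ValueError(f"unknown search type: {override!r}")
        return override
    # Unknown surfaces fall back to keyword, the cheapest mode (an assumption).
    return DEFAULT_BY_SURFACE.get(surface, "keyword")


if __name__ == "__main__":
    print(resolve_search_type("search_page"))
    print(resolve_search_type("dropdown", override="semantic"))
```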
