Search Interface

easysearch is a lightweight search interface for browsing and querying alignment outputs, built into the companion package easytranscriber. It indexes the alignment JSON files into a SQLite database with full-text search and serves a web UI for searching, browsing documents, and playing back audio with synchronized transcript highlighting.

See the demo in the overview for a preview of the synchronized highlighting.

Installation

The easysearch dependencies are optional. Install them with:

pip install easytranscriber[search]

Quick start

After running the alignment pipeline, start the search server by pointing it at your alignment outputs and audio files:

easysearch --alignments-dir output/alignments --audio-dir data/audio

This will:

  1. Index all alignment JSON files into a local SQLite database (search.db).
  2. Start a web server at http://127.0.0.1:8642.

On subsequent launches, only new or modified files are re-indexed. Use --reindex to force a full re-index.
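
How easysearch decides that a file is new or modified is not described here. As a rough illustration only, the sketch below assumes change detection by file modification time recorded in a bookkeeping table. The function name files_needing_index and the table indexed_files are hypothetical and are not part of easysearch.

```python
import sqlite3
from pathlib import Path


def files_needing_index(conn: sqlite3.Connection, alignments_dir: Path) -> list[Path]:
    """Return JSON files that are new or whose mtime changed since the last call.

    Assumption: modification time is the change signal. The real tool may use
    a different mechanism (for example, content hashes).
    """
    # Hypothetical bookkeeping table: one row per file already indexed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indexed_files "
        "(path TEXT PRIMARY KEY, mtime_ns INTEGER)"
    )
    seen = dict(conn.execute("SELECT path, mtime_ns FROM indexed_files"))

    stale = []
    for path in sorted(alignments_dir.rglob("*.json")):
        mtime_ns = path.stat().st_mtime_ns
        if seen.get(str(path)) != mtime_ns:
            stale.append(path)
            # Record the current mtime so the next call skips this file.
            conn.execute(
                "INSERT OR REPLACE INTO indexed_files (path, mtime_ns) VALUES (?, ?)",
                (str(path), mtime_ns),
            )
    conn.commit()
    return stale
```

Under this assumption, a forced full re-index amounts to clearing the bookkeeping table before scanning, so every file is treated as new.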

Audio playback

Clicking a search result takes you to the document page at the matching timestamp. The audio player seeks to that position and begins playback. The transcript view highlights the currently playing word in real time, and you can click any sentence or paragraph to jump to that point in the audio.

Note

Some browsers block autoplay by default. If audio doesn’t start automatically when navigating from a search result, a play button overlay will appear. Click it to begin playback.

How indexing works

easysearch automatically detects how to index file contents based on whether the output was produced by easyaligner or by easytranscriber. With easyaligner output, the audio is not automatically transcribed, so the VAD chunks contain no ASR-generated text.

The force-aligned ground-truth texts are stored in the alignments field. Each alignment segment (sentence, paragraph, or other granularity, depending on your tokenizer) becomes a searchable row in the database.

This means:

  • A search query matches when all terms appear within the same alignment segment.
  • If your tokenization is fine-grained (sentence-level), a query whose terms are spread across several sentences will not match, because no single segment contains all of them.

The document page displays the transcript at the alignment-segment level, so you can click any segment to jump to that point in the audio.
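
The effect of segment-level indexing can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch, not the schema easysearch uses: the table name segments, its columns, and the sample texts are all assumptions made for the example. It requires an SQLite build with FTS5 enabled.

```python
import sqlite3

# Hypothetical segments: (document id, start time in seconds, text).
# In practice these would come from the alignment JSON files.
SEGMENTS = [
    ("doc1", 0.0, "The climate is warming."),
    ("doc1", 4.2, "Policy change is slow."),
    ("doc2", 0.0, "Climate change affects policy."),
]

conn = sqlite3.connect(":memory:")
# One FTS5 row per alignment segment. doc_id and start are stored
# but excluded from the full-text index (UNINDEXED).
conn.execute(
    "CREATE VIRTUAL TABLE segments USING fts5(doc_id UNINDEXED, start UNINDEXED, text)"
)
conn.executemany("INSERT INTO segments VALUES (?, ?, ?)", SEGMENTS)


def search(query: str) -> list[tuple]:
    """Return (doc_id, start) for every segment matching an FTS5 query."""
    cursor = conn.execute(
        "SELECT doc_id, start FROM segments WHERE segments MATCH ? ORDER BY rowid",
        (query,),
    )
    return cursor.fetchall()
```

With these rows, search("climate change") matches only the doc2 segment. doc1 contains both words, but in different segments, so neither of its rows satisfies the query.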

Note

easysearch also supports chunks mode for files produced by the easytranscriber ASR pipeline, where indexing is at the VAD chunk level (~20–30 seconds per chunk). The mode is detected automatically, but can be overridden with --index-mode.

Search syntax

The search uses SQLite’s FTS5 full-text search engine. The following query syntax is supported:

Query                      Matches
climate change             Segments containing both words (implicit AND)
"climate change"           Exact phrase
climate OR weather         Either word
climate NOT weather        climate but not weather
econom*                    Prefix match: economy, economic, etc.
NEAR(climate change, 3)    Both words within 3 tokens of each other
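
To try these query forms outside the web UI, the sketch below runs them against a throwaway in-memory FTS5 table using Python's sqlite3 module. The table name seg and the sample sentences are invented for the example and are unrelated to the database easysearch creates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE seg USING fts5(text)")

SAMPLE_ROWS = [
    "climate change is accelerating",        # rowid 1
    "the weather will change tomorrow",      # rowid 2
    "the climate shapes the local weather",  # rowid 3
    "economic growth and the economy",       # rowid 4
    "climate policy is slow to change",      # rowid 5
]
conn.executemany("INSERT INTO seg(text) VALUES (?)", [(t,) for t in SAMPLE_ROWS])


def match(query: str) -> list[int]:
    """Return the rowids of all rows matching an FTS5 query string."""
    cursor = conn.execute(
        "SELECT rowid FROM seg WHERE seg MATCH ? ORDER BY rowid", (query,)
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    for q in [
        "climate change",
        '"climate change"',
        "climate OR weather",
        "climate NOT weather",
        "econom*",
        "NEAR(climate change, 3)",
    ]:
        print(f"{q!r:28} -> rows {match(q)}")
```

Row 5 is the interesting case: it satisfies the plain climate change query, but four tokens separate the two words, so it falls outside NEAR(climate change, 3).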

CLI reference

easysearch --help

Option               Default              Description
--alignments-dir     output/alignments    Directory containing alignment JSON files
--audio-dir          data                 Directory containing source audio files
--db                 search.db            Path to the SQLite database file
--host               127.0.0.1            Host to bind to
--port               8642                 Port to listen on
--per-page           20                   Results per page
--snippets-per-doc   5                    Max matching snippets shown per document in search results
--reindex            -                    Force full re-index of all JSON files
--index-mode         auto                 Override index mode: chunks or alignments (auto-detected if omitted)