Fast Full-text Search with PGroonga, Postgres, and Elixir

2024-02-17 The author profile picture Peter Ullrich

We’re Indie Courses, the video course platform for indie creators. We help you sell your online courses quickly and give you the tools and knowledge to make more sales than you would elsewhere. No marketing degree needed.

Our goal at Indie Courses is to offer creators the most advanced tools for hosting and selling their video courses. One of them is auto-transcription.

When you upload a video course, we create AI-generated subtitles, transcripts, and chapters for every single lesson using AssemblyAI. After you publish your course, students can search through all transcripts of all lessons through a single search bar. This makes it super easy to find the parts in your video course where you discuss specific technologies or topics.

We implemented the full-text search of all lesson transcripts using the Postgres extension pgroonga. Here’s how we use the extension from Elixir.

Installing PGroonga

The first step is to enable the pgroonga extension in your Postgres instance. There are a few ways to do so:

  1. Start a Postgres Docker image that has pgroonga pre-installed.
  2. Enable the extension through your managed database provider.
  3. Install pgroonga manually.

The first option is the fastest. You can quickly spin up a local Postgres instance that has pgroonga pre-installed. To do so, simply run:

docker run -d -e POSTGRES_PASSWORD=postgres groonga/pgroonga:3.1.5-debian-15

This will start Postgres version 15 with pgroonga version 3.1.5 installed. You can connect to the database through localhost:5432 now using the username and password postgres:postgres.

The second option is also fast and works well if you host your database with a managed service like Supabase. Most database providers offer a fast way to enable extensions. In Supabase, simply go to Database > Extensions and enable pgroonga. That’s it!

The third option is a bit more complicated but is useful if you manage your Postgres instance yourself. You have to install pgroonga on your operation system and then you can create the extension in your database. Consult pgroonga’s install instructions to find out how to install it on your system.

Adding the Extension to the Database

You need to enable the extension per database using a migration. In Elixir, you can generate a new migration with mix ecto.gen.migration add_pgroonga and add this code:

defmodule MyApp.Repo.Migrations.AddPgroonga do
  use Ecto.Migration

  def change do
    execute(
      "CREATE EXTENSION IF NOT EXISTS pgroonga;",
      "DROP EXTENSION IF EXISTS pgroonga;"
    )
  end
end

This will enable pgroonga in your database if it hasn’t been enabled before. Now, we can use it to create some indexes!

Adding the Index

Before we can query a text column, we need to create a pgroonga index on it. Let’s generate another migration with mix ecto.gen.migration add_pgroonga_transcript_index. In the migration, we create a pgroonga index on our transcript column, which contains the transcript of the lesson as text.

This is what the migration looks like:

defmodule MyApp.Repo.Migrations.AddPgroongaTranscriptIndex do
  use Ecto.Migration

  def change do
    execute(
      """
      CREATE INDEX lessons_transcript_pgroonga_index
        ON lessons
        USING pgroonga (transcript);
      """,
      """
      DROP INDEX IF EXISTS lessons_transcript_pgroonga_index;
      """
    )
  end
end

Once we apply the migration with mix ecto.migrate, pgroonga creates an index on our transcript column that makes our full-text search super efficient.

Querying the Transcripts

Now with the extension and index in place, let’s see how we can use the index to do full-text search on our transcripts. Here’s an Ecto query that searches the transcript for a search term and returns highlighted matches together with their surrounding characters to give context to the match.

def search_transcripts(course_id, search_term) do
  from(lesson in Lesson,
    where: lesson.course_id == ^course_id,
    where: fragment("? &@~ ?::text", lesson.transcript, ^search_term),
    select: %{
      lesson: lesson,
      matches:
        fragment(
          "pgroonga_snippet_html(?, pgroonga_query_extract_keywords (?::text), 200)",
          lesson.transcript,
          ^search_term
        )
    }
  )
  |> Repo.all()
end

A lot is going on here, so let’s dissect it one by one.

The actual full-text search happens in this line:

where: fragment("? &@~ ?::text", lesson.transcript, ^search_term)

PGroonga offers various operators to do full-text search. The one used here &@~ supports web search queries. It allows search specifiers like OR and the NOT specifier -keyword to exclude certain keywords.

We don’t want to return the lesson records if they match the search query, but only their matches together with some words around them to give a bit of context. For that we use the combination of pgroonga_snippet_html/3 and pgroonga_query_extract_keywords/1 in the select block:

fragment(
  "pgroonga_snippet_html(?, pgroonga_query_extract_keywords (?::text), 200)",
  text_track.subtitles,
  ^search_term
)

The first function pgroonga_query_extract_keywords/1 extracts the exact matches from the transcript text. pgroonga_snippet_html/3 then surrounds the matches with <span class="keyword">match here</span> HTML tags and also returns the surrounding characters of the match based on the last parameter, 200 in this case. The 200 here means that the total length of the snippet will be 200 characters, including the length of the match.

Putting it all together

Now that we have our query, let’s test it! Let’s start our application with an interactive shell and run some searches with iex -S mix phx.server.

iex> search_transcripts(course_id, "elixir")
[
  %{
    matches: [
      "more active on these platforms. In my case, this is LinkedIn and the <span class=\"keyword\">elixir</span> forum. So these are two platforms that if I wanted to increase the",
      "y because I never thought about  LinkedIn as a good platform for finding <span class=\"keyword\">elixir</span> developers  who might be interested in buying my video course"
    ],
    lesson: %Lesson{}
  }
]

It works! We now receive our matches as a list of strings and they are surrounded by the span.keyword tags. In our heex template, we can simply render them with:

<div :for={match <- @matches}>
  ...<%= raw(match) %>...
</div>

The raw/1 function will render the span-tags without escaping, which means that they will render as proper span-tags and we can style them through the keyword-class. A small trick is to add three dots before and after the snippet. This glosses over abrupt sentence cutoffs and word breaks.

Be careful with allowing user input in the transcript because a bad actor could easily embed malicious code in the transcript which every student browser would then render

Conclusion

And that’s it! If you enjoyed this article, consider hosting your video course with us! If you want to be notified of future blog posts, subscribe to your newsletter below. Until the next time!

Sign up for Product Updates and Articles