FIX: search index failing on certain posts (#20736)

During search indexing we "stuff" the index with additional keywords for entities that look like domain names. This allows searches for `cnn` to find URLs such as `www.cnn.com`.

The stuffing code tried to keep index positions aligned by remapping the positions of the indexed terms. However, in certain edge cases a single word can stem into two different lexemes (for example, `00E5A4` stems to `00e5` and `a4`). When that happened we hit an off-by-one error that caused indexing of the affected post to fail entirely.

We now work around this edge case by keeping the original position whenever no remapped position exists, so such posts carry slightly incorrect index positions. This is unlikely to affect search quality, since index position makes almost no difference in the search algorithm.
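
The fallback can be illustrated with a minimal sketch. It assumes a simplified position table of `[word, remapped_position]` pairs; the sample data and the `remap_positions` helper are hypothetical and exist only for illustration, they are not the code in `SearchIndexer`.

```ruby
# Hypothetical position table: the entry at offset i describes raw position i + 1
# and holds the position it should be remapped to.
additional_words = [["test", 1], ["00e5a4", 3], ["1", 5]]

# Remap a comma-separated list of raw positions. When the table has no entry
# for a position (the extra-lexeme edge case), keep the raw position instead
# of failing on nil.
def remap_positions(indexes, additional_words)
  indexes
    .split(",")
    .map do |index|
      entry = additional_words[index.to_i - 1]
      entry ? entry[1] : index
    end
    .join(",")
end

remap_positions("1,2", additional_words) # => "1,3"
remap_positions("2,4", additional_words) # => "3,4" (position 4 has no entry, so it is kept)
```

Without the guard, the lookup for position 4 would return `nil`, and calling `[1]` on it would raise `NoMethodError`, which matches the failure mode described above.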
2025-03-01 13:16:41 +08:00 · 2023-03-20 15:43:08 +11:00 · 2023-03-20 15:43:08 +11:00 · 4a3c13a37b
commit 4a3c13a37b
parent 38fdd842f5
2 changed files with 29 additions and 1 deletions
--- a/app/services/search_indexer.rb
+++ b/app/services/search_indexer.rb
@@ -87,7 +87,17 @@ class SearchIndexer
          .scan(TS_VECTOR_PARSE_REGEX)
          .map do |term, _, indexes|
            new_indexes =
-              indexes.split(",").map { |index| additional_words[index.to_i - 1][1] }.join(",")
+              indexes
+                .split(",")
+                .map do |index|
+                  existing_positions = additional_words[index.to_i - 1]
+                  if existing_positions
+                    existing_positions[1]
+                  else
+                    index
+                  end
+                end
+                .join(",")
            "#{term}#{new_indexes}"
          end
          .join(" ")
--- a/spec/services/search_indexer_spec.rb
+++ b/spec/services/search_indexer_spec.rb
@@ -139,6 +139,24 @@ RSpec.describe SearchIndexer do
      }
    end

+    it "should work with edge case domain names" do
+      # 00E5A4 stems to 00e5 and a4, which is odd, but by-design
+      # this may cause internal indexing to fail due to indexes not aligning
+      # when stuffing terms for domains
+      post.update!(cooked: <<~HTML)
+        Test.00E5A4.1
+      HTML
+
+      SearchIndexer.update_posts_index(
+        post_id: post.id,
+        topic_title: post.topic.title,
+        category_name: post.topic.category&.name,
+        topic_tags: post.topic.tags.map(&:name).join(" "),
+        cooked: post.cooked,
+        private_message: post.topic.private_message?,
+      )
+    end
+
    it "should work with invalid HTML" do
      post.update!(cooked: "<FD>" * Nokogiri::Gumbo::DEFAULT_MAX_TREE_DEPTH)