FULLTEXT index behavior

vkruoso · August 15, 2021, 9:37pm

Hi!

I have a strange behavior on FULLTEXT index. The use case is a partner table, and I want to provide an auto-complete like experience to search on that table.

The query below is used to demonstrate the first issue:

select name, MATCH (name) AGAINST ('SENOR ABRAVANEL')
from partner
where MATCH (name) AGAINST ('SENOR ABRAVANEL')
and name = 'SENOR ABRAVANEL';

The exact same name can yield different relevance scores:
Screenshot from 2021-08-15 18-30-39

Why is that the case? I can assume that the relevance is calculated on each segment and is relative to the segment data?

The other situation is that the relevance behaves strangely in this situation:

select name, max(MATCH (name) AGAINST ('SENOR ABRAVANEL')) as relevance
from partner
where MATCH (name) AGAINST ('SENOR ABRAVANEL')
group by name
order by 2 desc;

Screenshot from 2021-08-15 18-29-47

The natural way for me here is that the exact match would have a better score than lines that have just one word match. In an auto-complete field if there is an exact match it would be better to show it on the first line. Also, because the results are in “random” order for the same relevance, the exact match can be last in the list and the user probably misses it.

Is this expected behavior?

As a workaround, this seems to work better, but is more painful to write that SQL and also for the database to execute it:

select * from (
select name, 
  max(MATCH (name) AGAINST ('SENOR')) as r1,
  max(MATCH (name) AGAINST ('ABRAVANEL')) as r2
from partner
where MATCH (name) AGAINST ('SENOR ABRAVANEL')
group by name) as tt
order by r1+r2 desc
limit 30;

Any hints would be appreciated.

zhou · August 19, 2021, 6:50pm

I believe both of them come from the same reason, the relevance score calculated on each segment. We are using the standard scoring function from lucene Similarity (Lucene 3.5.0 API)
Note that it normalizes the score to the range (0,1) and that’s performed per segment.
if a term is rare enough in a segment, and there is no exact match in that segment, an inexact match can give you a match score of 1.

vkruoso · August 19, 2021, 8:15pm

Hi @zhou, thanks for your answer. I understand this is good for most use cases and the distributed nature of the database prevents it from calculating a global relevance score for the rows.

Seems like the workaround above is going to be the way to go. Do you see any other approaches being used?