Planning to support Full-text index with Ngram parser?

kyoungho.kum · July 13, 2020, 2:00am

Hi,

When will MemSQL’s full-text search (FTS) support parsers like MySQL’s ngram?

Creating a full-text index with the ngram parser may not be the ideal solution for Chinese, Japanese, and Korean (CJK) text, but for us it is a must-have.

I hope this can be supported as soon as possible.

Best Regards.

hanson · July 13, 2020, 6:19pm

Can you tell us more about what you want to do and why this will help you with your application compared to our standard fulltext support?

We are tracking this feature request. There’s no specific timetable for delivery.

kyoungho.kum · July 15, 2020, 7:56am

In Korean, the words in a sentence are separated by spaces, but many of those words are compounds.

For example, when I want to find a product called ‘공기청정기’ in a Korean online shopping mall and I search for ‘청정기’, which is only part of the product name, the desired product ‘공기청정기’ should be returned.

If the index were built by cutting each word into character sequences of a specified length, as MySQL’s ngram parser does, I would get the above result.
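To make the idea concrete, here is a minimal Python sketch of n-gram indexing. It is only an illustration, not how MySQL or MemSQL actually implements full-text search: the function names (`ngrams`, `build_index`, `search`), the sample product list, and the choice of n = 2 are all assumptions made for this example. A query matches a document when every n-gram of the query appears in that document, which is a simplification of real phrase matching.

```python
def ngrams(text: str, n: int = 2) -> list[str]:
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs: dict[int, str], n: int = 2) -> dict[str, set[int]]:
    """Map each n-gram to the set of document ids that contain it."""
    index: dict[str, set[int]] = {}
    for doc_id, text in docs.items():
        for word in text.split():  # split on spaces first, then slice each word
            for gram in ngrams(word, n):
                index.setdefault(gram, set()).add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str, n: int = 2) -> set[int]:
    """Return ids of documents that contain every n-gram of the query."""
    grams = ngrams(query, n)
    if not grams:  # query shorter than n characters: nothing to look up
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


# Hypothetical product names, used only for illustration.
products = {1: "공기청정기", 2: "정수기"}
index = build_index(products)
matches = search(index, "청정기")
```

With n = 2, ‘공기청정기’ is indexed as 공기, 기청, 청정, and 정기. The query ‘청정기’ becomes 청정 and 정기, both of which are in the index for product 1, so `matches` contains only that product even though the query is just a fragment of the compound word.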

When I tested this search in MemSQL, I couldn’t get the desired result. If MemSQL FTS parsed text the way the ngram parser does, I’m pretty sure I would get the result I want. This would be very useful for Korean.

I understand that MemSQL FTS is built on the Lucene engine. If so, I think Lucene’s NGram tokenizer could be adopted easily. For your reference, many companies in Korea use Elasticsearch, which is based on Lucene, for search.

For reference, “공기청정기” means “air purifier” in Korean.

hanson · July 15, 2020, 4:42pm

Thank you! This is a very good description.

mpskovvang · February 5, 2021, 9:35pm

The n-gram parser isn’t only useful for CJK languages.

It is also very beneficial when searching for email addresses, names, URLs etc.

I’m also hoping to see this feature supported someday.

It would also help with the stopword issues in the current FULLTEXT index when working with non-English languages.