In Korean, the words in a sentence are separated by spaces, and many of those words are compounds.
For example, when I look for a product called ‘공기청정기’ in an online shopping mall, searching for ‘청정기’ should return the desired product ‘공기청정기’.
If the index is built by slicing each word into character sequences of a specified length, as MySQL’s ngram parser does, I get that result.
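To illustrate the idea, here is a minimal Python sketch of n-gram indexing. It is not how MySQL or MemSQL implements full-text search; the function names (`ngrams`, `build_index`, `search`) and the sample product list are made up for this example. The token size of 2 matches the default of MySQL's `ngram_token_size`. The search only checks that every n-gram of the query occurs in a document, which is a simplification of real phrase matching.

```python
def ngrams(text, n=2):
    """Return the overlapping n-character tokens of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n=2):
    """Map each n-gram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        # Tokenize each space-separated word on its own.
        for word in text.split():
            for gram in ngrams(word, n):
                index.setdefault(gram, set()).add(doc_id)
    return index


def search(index, query, n=2):
    """Return ids of documents that contain every n-gram of the query."""
    grams = ngrams(query, n)
    if not grams:
        # Query is shorter than n characters: nothing to look up.
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


# Hypothetical product names, used only for illustration.
products = {1: "공기청정기", 2: "진공 청소기"}
index = build_index(products)

print(ngrams("공기청정기"))     # ['공기', '기청', '청정', '정기']
print(search(index, "청정기"))  # {1}
```

Because ‘청정기’ is split into ‘청정’ and ‘정기’, and both tokens are also produced from ‘공기청정기’, the partial product name finds the full one.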
When I tested the above search in MemSQL, I couldn’t get the desired result. If MemSQL FTS parsed text the way the ngram parser does, I’m pretty sure I would have gotten the result I wanted. This would be very useful for Korean.
As I understand it, MemSQL FTS is built on the Lucene engine. If so, I think Lucene’s NGram Tokenizer could be adopted easily. For your reference, many companies in Korea use Elasticsearch, which is based on the Lucene engine, for search.
- For reference, Air Purifier is “공기청정기” in Korean.