Understanding String Comparison of Databases

Three approaches to speed up string comparison

Photo by Niko Photos on Unsplash

In the previous article, we briefly talked about how B-tree-based databases use partial indexes to improve performance. In this article, we will talk about another story of indexing.

Generally speaking, B-tree-based database indexes have one characteristic, which is leftmost matching. Take MySQL for example, if the WHERE condition is a = 1 AND b = 2 AND c = "Hello" then the index must be built in this order, i. e. (a, b, c).

In addition, if the index is (a, b, c, d), then the same query can be applied based on the leftmost match principle.

Furthermore, if the condition we want to query is a string match in the c field, for example, WHERE a = 1 AND b = 2 AND c LIKE "Hello%", such a condition can still apply to the (a, b, c) index. Because of the leftmost match rule, even if LIKE is used, the leftmost edge of the string is already specified, so it still meets the rule.

On the contrary, if the query condition becomes WHERE a = 1 AND b = 2 AND c LIKE "%World", then only the a and b fields can be indexed, while the c field must compare all rows matched (a, b) values, which will use a lot of CPU computation of the database, in other words, very resource consuming.

The simplest solution to improve this situation is to drop the percentage sign in the middle and after the string. Nevertheless, there must be a use case for using it this way in the first place, so this article will not ask you to simply drop it, but to provide some alternatives.

Modify the original string pattern so that the index order is reversed.

For instance, the original query needs to match the string pattern "PREFIX_ABC_SUFFIX" with a query like a LIKE "%ABC%".

Then reverse the original string pattern and change it to "ABC_PREFIX_SUFFIX". With this change, you can remove the top percentage sign and change the query to the leftmost match.

Add a new boolean field, abc_matched, to the original database table. When the database client writes data, it determines in advance whether the original field meets the conditions and sets the corresponding flag.

That is to say, the original query a LIKE "%ABC%" can be changed to abc_matched = 1.

By doing so, the database computation can be transferred to the database client, thus effectively reducing the database overhead.

Such a practice is also common in data science and is called data preprocessing. It replaces the N operations of subsequent queries with a single operation.

In MySQL there is no way to do this kind of query.

However, in both Postgres and MongoDB there are corresponding index types that can effectively deal with fuzzy string comparisons.

Postgres uses GIN indexes instead of the default Bitmap type, while MongoDB uses text indexes instead of regular indexes.

By modifying the indexing type, even fuzzy comparisons of strings can be done with good performance. Nevertheless, all technical selections have advantages and disadvantages; indexes of GIN or text type take up a lot of space, several times the cost of the default type, and the process of building indexes is time consuming.

Therefore, in the case of limited resources, even if the solution looks very easy to implement, I do not recommend it that much.

In software development, there are always many approaches to solving an existing problem, and each approach comes with pros and cons.

Take the three approaches mentioned in this article.

The cost of modifying the outermost client is the highest, but the cost of designing a new feature is the lowest.

Because the modification of the client is followed by the compatibility of the old and new versions, besides, the outermost client may be a mobile application, which is not only difficult to deploy but also hard to rollout.

However, if the design is correct at the very beginning, the implementation cost is very low because it is just a reverse order of the original string.

The cost of modifying the database schema is the second highest, although it does not necessarily need to change the client at the outermost tier, but it also needs to change the client of the database.

The modification of the database schema requires the migration of the old data, which also requires time and human effort.

However, preprocessing of the data offers additional benefits beyond the improved query performance, such as more explicit feature matrix, which can effectively improve the accuracy of the model during machine learning.

The implementation cost of modifying the index type is the lowest, just change the index once, basically no need to modify the implementation of the application.

But this is the highest cost for the system.

Firstly, such an index type takes up a very large amount of memory space, which cannot be ignored when the data volume is large.

Secondly, when the indexes are too large to fit the database memory size or have to be swapped frequently, it will affect the performance of other queries as well.

Moreover, the complexity of the indexes can affect the performance of writes, which is an overhead that must be carefully considered if this is a high-frequency writing system.

Therefore, although there are three approaches, there is a trade-off as to which one is suitable and the answer will not be the same for each system.

News Credit

%d bloggers like this: