SECRET OF CSS

15 Best Practices for SQL Optimization | by Dwen | Jun, 2022


Boost the performance of SQL

0*vrUH LYiJrEHv9Z4
Photo by Kam Idris on Unsplash

SQL optimization is a hot topic that everyone is concerned about. Whether you are in an interview or at work, you are likely to encounter it.

If there is a performance problem on an online interface you are responsible for one day, you need to optimize it. Then your first thought is probably to optimize the SQL statement because its transformation cost is much smaller than the code.

So, how to optimize the SQL statement?

Today I will share some tips on SQL optimization, I hope they will help you.

Many times, when we write SQL statements, for convenience, we like to use select * directly to find out the data of all columns in the table at one time.

# incorrect example
select * from user where id = 1;

In actual business scenarios, maybe we only really need to use one or two of these columns. I checked a lot of data, but no, it wasted database resources, such as memory or CPU.

In addition, during the process of data transmission through network IO, the data transmission time will also be increased.

Another most important problem is: select * will not use the covering index, and there will be a large number of return table operations, resulting in low performance of query SQL.

So, how to optimize it?

# correct example
select name, age from user where id = 1;

When querying the SQL statement, only the columns that need to be used are checked, and the redundant columns do not need to be checked out at all.

We all know that after using the union keyword in the SQL statement, we can obtain the data after reordering.

And if you use the union all keyword, you can get all data, including duplicate data.

# incorrect example
(select * from user where id=1)
union
(select * from user where id=2);

The process of reordering requires traversal, sorting, and comparison, which is more time-consuming and consumes more CPU resources.

So if you can use union all, try not to use union.

# correct example
(select * from user where id=1)
union all
(select * from user where id=2);

Unless there are some special scenarios, such as after union all, duplicate data appears in the result set, and duplicate data is not allowed in business scenarios, then the union can be used.

A small table drives a large table, that is, a data set of a small table drives a data set of a large table.

If there are two tables, order and user, the order table has 10,000 pieces of data, and the user table has 100 pieces of data.

At this time, if you want to check the list of orders placed by all valid users.

This can be achieved using the in keyword:

select * from order
where user_id in (select id from user where status=1)

This can also be achieved using the exists keyword:

select * from order
where exists (select 1 from user where order.user_id = user.id and status=1)

In the business scenario mentioned above, it is more appropriate to use the in keyword to achieve business requirements.

why?

Because of the in a keyword is included in the SQL statement, it will execute the subquery statement in first and then execute the statement outside in. If the amount of data in is small, the query speed is faster as a condition.

And if the SQL statement contains the exists keyword, it executes the statement to the left of exists first (main query statement).

Then use it as a condition to match the statement on the right. If it matches, you can query the data. If there is no match, the data is filtered out.

In this requirement, the order table has 10,000 pieces of data and the user table has 100 pieces of data.

The order is a large table, and the user is a small table.

If the order is on the left, it is better to use the in keyword.

In conclusion:

  • in applies to the large table on the left and the small table on the right.
  • exists applies to the small table on the left and the large table on the right.

What if you have a batch of data that needs to be inserted after business processing?

# incorrect example
for (Order order: list) {
orderMapper.insert(order);
}

Insert data one by one in a loop.

insert into order(id,code,user_id) 
values(123,'001',100);

This operation requires multiple requests to the database to complete the insertion of this batch of data.

But as we all know, in our code, every time we request the database remotely, it will consume a certain amount of performance.

And if our code needs to request the database multiple times to complete this business function, it will inevitably consume more performance.

The correct way is to provide a method to insert data in batches.

# correct example
orderMapper.insertBatch(list);
# insert into order(id,code,user_id)
# values(123,'001',100),(124,'002',100),(125,'003',101);

In this way, you only need to request the database remotely once, and the SQL performance will be improved. The more data, the greater the improvement.

However, it should be noted that it is not recommended to operate too much data in batches at one time. If there is too much data, the database response will be very slow.

Batch operations need to grasp a degree, and it is recommended that each batch of data be controlled within 500 as much as possible. If the data is more than 500, it will be processed in multiple batches.

Sometimes, we need to query the first item of some data, for example: query the first order placed by a user, and want to see the time of his first order.

# incorrect example
select id, create_date
from order
where user_id=123
order by create_date asc;

Query orders according to the user id, sort by order time, first find out all the order data of the user, and get an order set.

Then in the code, get the data of the first element, that is, the data of the first order, to get the time of the first order.

List<Order> list = orderMapper.getOrderList();
Order order = list.get(0);

Although this approach has no problem in function, it is very inefficient. It needs to query all the data first, which is a bit of a waste of resources.

So, how to optimize it?

# correct example
select id, create_date
from order
where user_id=123
order by create_date asc
limit 1;

Use limit 1 to return only the data with the smallest order time of the user.

In addition, when deleting or modifying data, in order to prevent misoperation, resulting in deletion or modification of irrelevant data, the limit can also be added at the end of the SQL statement.

update order set status=0,edit_time=now(3) 
where id>=100 and id<200 limit 100;

In this way, even if the wrong operation, such as the id is wrong, it will not affect too much data.

For the batch query interface, we usually use the in keyword to filter out data. For example, I want to query user information in batches through some specified ids.

The SQL statement is as follows:

select id,name from category
where id in (1,2,3...100000000);

If we do not impose any restrictions, the query statement may query a lot of data at one time, which may easily cause the interface to time out.

What should we do then?

select id,name from category
where id in (1,2,3...100)
limit 500;

You can limit the data with the limit in SQL.

However, we are more about adding restrictions to the business code. The pseudo-code is as follows:

public List<Category> getCategory(List<Long> ids) {
if(CollectionUtils.isEmpty(ids)) {
return null;
}
if(ids.size() > 500) {
throw new BusinessException("too many")
}
return mapper.getCategoryList(ids);
}

Another solution is: that if there are more than 500 records in ids, you can use multiple threads to query the data in batches. Only 500 records are checked in each batch, and finally, the queried data are aggregated and returned.

However, this is only a temporary solution and is not suitable for scenes with too many ids. Because there are too many ids, even if the data can be quickly detected, if the amount of data returned is too large, the network transmission is very performance-intensive, and the interface performance is not much better.

Sometimes, we need to query data through a remote interface and then synchronize it to another database.

# incorrect example
select * from user;

If you get all the data directly, then sync it. Although this is very convenient, it brings a very big problem, that is, if there is a lot of data, the query performance will be very poor.

What should we do then?

select * from user 
where id>#{lastId} and create_time >= #{lastCreateTime}
limit 100;

In ascending order of id and time, only one batch of data is synchronized each time, and this batch of data has only 100 records. After each synchronization is completed, save the largest id and time of the 100 pieces of data for use when synchronizing the next batch of data.

This incremental query method can improve the efficiency of a single query.

Sometimes, when querying data on the list page, in order to avoid returning too much data at one time and affecting the performance of the interface, we generally paginate the query interface.

The limit keyword commonly used for paging in MySQL:

select id,name,age 
from user limit 10,20;

If the amount of data in the table is small, using the limit keyword for paging is no problem. But if there is a lot of data in the table, there will be performance problems with it.

For example, the pagination parameter now becomes:

select id,name,age 
from user limit 1000000,20;

MySQL will find 1,000,020 pieces of data, then discard the first 1,000,000 pieces of data, and only check the last 20 pieces of data, which is a waste of resources.

So, how do paginate this massive data?

Optimize SQL:

select id,name,age 
from user where id > 1000000 limit 20;

First, find the largest id of the last paging, and then use the index on the id to query. However, in this scheme, the id is required to be continuous and ordered.

You can also use between to optimize pagination.

select id,name,age 
from user where id between 1000000 and 1000020;

It should be noted that between should be paginated on the unique index, otherwise there will be an inconsistent size of each page.

If you need to query data from more than two tables in MySQL, there are generally two implementation methods: subquery and join query.

An example of a subquery is as follows:

select * from order
where user_id in (select id from user where status=1)

Sub-query statements can be implemented through in the keyword, and the conditions of one query statement fall within the query results of another select statement. The program runs on the nested innermost statement first and then runs the outer statement.

The advantage of a subquery statement is that it is simple and structured if the number of tables involved is small.

But the disadvantage is that when MySQL executes sub-queries, temporary tables need to be created. After the query is completed, these temporary tables need to be deleted, which has some additional performance consumption.

At this time, it can be changed to a connection query.

select o.* from order o
inner join user u on o.user_id = u.id
where u.status=1
# incorrect example
select a.name,b.name.c.name,d.name
from a
inner join b on a.id = b.a_id
inner join c on c.b_id = b.id
inner join d on d.c_id = c.id
inner join e on e.d_id = d.id
inner join f on f.e_id = e.id
inner join g on g.f_id = f.id

If there are too many join, MySQL will be very complicated when selecting indexes, and it is easy to choose the wrong index.

And if there is no hit, the nested loop join is to read a row of data from the two tables for pairwise comparison, and the complexity is .

So we should try to control the number of joined tables.

# correct example
select a.name,b.name.c.name,a.d_name
from a
inner join b on a.id = b.a_id
inner join c on c.b_id = b.id

If you need to query the data in other tables in the implementation of the business scenario, you can have redundant special fields in the a, b, and c tables, for example, the redundant d_name field in the table a to save the data to be queried.

However, I have seen some ERP systems before. The concurrency is not large, but the business is relatively complex. It needs to join more than a dozen tables to query the data.

Therefore, the number of joined tables should be determined according to the actual situation of the system. It cannot be generalized. The less the better.

We generally use the join keyword when it comes to the join a query of multiple tables.

The most commonly used joins are left join and inner join.

  • left join: Find the intersection of two tables plus the remaining data in the left table.
  • inner join: finds the data of the intersection of two tables.

An example of using inner join is as follows:

select o.id,o.code,u.name 
from order o
inner join user u on o.user_id = u.id
where u.status=1;

If two tables are related using inner join, MySQL will automatically select the small table in the two tables to drive the large table, so there will not be too many problems in performance.

An example of using left join is as follows:

select o.id,o.code,u.name 
from order o
left join user u on o.user_id = u.id
where u.status=1;

If two tables are associated using left join, MySQL will use the left join keyword to drive the table on the right by default. If there is a lot of data in the left table, there will be performance problems.

It should be noted that when using left jointo query, use a small table on the left and a large table on the right. If you can use inner join, use left join as little as possible.

As we all know, indexes can significantly improve the performance of query SQL, but the number of indexes is not the better.

Because when new data is added to the table, an index needs to be created for it at the same time, and the index requires additional storage space and a certain performance consumption.

The number of indexes in a single table should be controlled within 5 as much as possible, and the number of fields in a single index should not exceed 5.

The structure of the B+ tree used by MySQL to save the index, the B+ tree index needs to be updated during insert, update and delete operations. If there are too many indexes, it will consume a lot of extra performance.

So, the question is, what if there are too many indexes in the table, more than 5?

This question needs to be viewed dialectically. If your system has a low concurrency and the amount of data in the table is not too much, in fact, more than 5 can be used, as long as it does not exceed too much.

But for some high-concurrency systems, be sure to abide by the limit of no more than 5 indexes on a single table.

So, how can a high concurrency system optimize the number of indexes?

If you can build a joint index, don’t build a single index, you can delete a useless single index.

Migrating some query functions to other types of databases, such as Elastic Seach, HBase, etc., only needs to build a few key indexes in the business table.

char represents a fixed string type, and the storage space of the field of this type is fixed, which will waste storage space.

alter table order 
add column code char(20) NOT NULL;

varchar represents a variable-length string type, and the field storage space of this type will be adjusted according to the length of the actual data, without wasting storage space.

alter table order 
add column code varchar(20) NOT NULL;

If it is a field with a fixed length, such as the user’s mobile phone number, it is generally 11 bits and can be defined as a char type with a length of 11 bytes.

But if it is an enterprise name field, if it is defined as a char type, there is a problem.

If the length is defined as too long, for example, it is defined as 200 bytes, and the actual enterprise length is only 50 bytes, 150 bytes of storage space will be wasted.

If the length is defined as too short, for example, it is defined as 50 bytes, but the actual enterprise name has 100 bytes, it will not be stored, and an exception will be thrown.

Therefore, it is recommended to change the enterprise name to varchar type. The storage space of variable-length fields is small, which can save storage space, and for queries, the search efficiency in a relatively small field is obviously higher.

When we choose field types, we should follow these principles:

If you can use numeric types, you don’t need strings, because characters tend to be slower to process than numbers.

Use small types as much as possible, such as use bit to store boolean values, tinyint to store enumeration values, etc.

A fixed-length string field, of type char.

A variable-length string field, of type varchar.

decimal is used for the amount field to avoid the problem of loss of precision.

We have many business scenarios that need to use the group by keyword. Its main function is to deduplicate and group.

Usually, it is used in conjunction with having, which means grouping and then filtering data according to certain conditions.

# incorrect example
select user_id,user_name from order
group by user_id
having user_id <= 200;

This writing method has poor performance. It first groups all orders according to the user id and then filters users whose user id is greater than or equal to 200.

Grouping is a relatively time-consuming operation, why don’t we narrow down the data before grouping?

# correct example
select user_id,user_name from order
where user_id <= 200
group by user_id

Use the where condition to filter out the redundant data before grouping so that the efficiency will be higher when grouping.

In fact, this is an idea, not limited to the optimization of group by. Before our SQL statements do some time-consuming operations, we should reduce the data range as much as possible, which can improve the overall performance of SQL.

Among SQL optimization, there is a very important content: index optimization.

In many cases, the execution efficiency of SQL statements is very different when the index is used and the index is not used. Therefore, index optimization is the first choice for SQL optimization.

The first step in index optimization is to check whether the SQL statement is indexed.

So, how to check whether SQL has gone to the index?

You can use the explain command to view the execution plan of MySQL.

explain select * from `order` where code='002';
1*DXeOjTYldlZzwgn8caOhQQ

I will write a separate blog to explain the specifically detailed parameters.

Here are some common reasons for index failure:

  • Does not satisfy the leftmost prefix principle.
  • The range index column was not put last.
  • Used select * .
  • There are calculations on indexed columns.
  • There are functions on indexed columns.
  • String type without quotes.
  • Used like % .
  • Wrong use is null .
  • Wrong use or .

If it is not for the above reasons, you need to further investigate other reasons.

In addition, have you ever encountered such a situation: it is obviously the same SQL, only the input parameters are different. Sometimes the index a goes, and sometimes the index b?

Yes, sometimes MySQL chooses the wrong index.

If necessary, you can use force index to force the query SQL to go to a certain index.



News Credit

%d bloggers like this: