MariaDB 10: Zoom in on table statistics

I - Table Statistics - Why?

When you execute a SQL query that uses an index, creates a join, or another complex operation, MySQL will read the statistics linked to these tables, which will allows it to chose the optimal plan of execution.
For InnoDB for example, this behaviour is controlled by innodb_stats_% type variables:

show variables LIKE 'Innodb_stats_%'; 
+--------------------------------------+-------------+
| Variable_name                        | Value       |
+--------------------------------------+-------------+
| innodb_stats_auto_recalc             | ON          |
| innodb_stats_method                  | nulls_equal |
| innodb_stats_on_metadata             | OFF         |
| innodb_stats_persistent              | ON          |
| innodb_stats_persistent_sample_pages | 20          |
| innodb_stats_sample_pages            | 8           |
| innodb_stats_transient_sample_pages  | 8           |
+--------------------------------------+-------------+
7 rows in set (0.01 sec)

II - What is the purpose of the new function?

The persistent statistics were introduced for InnoDB with version 5.6.6; before this version, InnoDB calculated the statistics on the fly based on a sample of the data, which signified that the execution plan could vary after a MySQL restart or a large table modification.
For TokuDB, the calculation of the statistics takes place as well on a sample of data.
For a same table, we can have totally different data:

TokuDB:

SHOW INDEX FROM url_list;
+----------+-------------+
| Table    | Cardinality |
+----------+-------------+
| url_list |   116935313 |
| url_list |    14616914 |
+----------+-------------+
2 rows in set (0.00 sec)
InnoDB:

 SHOW INDEX FROM url_list_innodb;
+-----------------+-------------+
| Table           | Cardinality |
+-----------------+-------------+
| url_list_innodb |   115352677 |
| url_list_innodb |    23070535 |
+-----------------+-------------+
2 rows in set (0.00 sec)

To address the differences in implementations and to allow all the engines to benefits from persistent statistics, a new function was added to MariaDB 10: the "Engine-independent statistics table" (EIST).

III - What Impact on Performances?

This new function introduces a new variable, "use_stat_tables", that had 3 modes:
  • never
  • complementary
  • preferable
With 'never', the "engine-independent" statistics are never used or updated.
With 'complementary', the statistics are computed in the engine and in the new tables specific to MariaDB 10. If the data are not available in the engine, they are computed from the new tables.
With 'preferable', the data from the new tables are prioritized for collection, and if they are not available, the data from the engine is used.

Let's take a look at this function in a TokuDB type table:

With the function deactivated:

ANALYZE TABLE url_list;
+---------------+---------+----------+----------+
| Table         | Op      | Msg_type | Msg_text |
+---------------+---------+----------+----------+
| test.url_list | analyze | status   | OK       |
+---------------+---------+----------+----------+
1 row in set (5.43 sec)
With the function activated:

SET use_stat_tables='preferably';
MariaDB [test]> ANALYZE TABLE url_list;
^CCtrl-C -- query killed. Continuing normally.
65 rows in set (25 min 50.62 sec)
The command in fact never terminated, as the computation of the statistics filled the whole tmpdir. A bug was opened to report the problem. We understand the comment found on the MariaDB website:
"note: currently, engine-independent statistics are collected by executing a complete browse of the table and indexes. This can be quite costly."
Generally speaking, if you are particularly sensitive to the speed of execution of ALTER TABLES both for InnoDB and TokuDB, it is a function with a certain cost.
Additionally the INSERT and UPDATE operations are slower, as the statistics must be twice updated: once for the engine stats and once for the independent stats.

In Conclusion...

It seems preferable to activate the function only if you work with engines that explicitly do not have statistics, or incomplete statistics.
It would be interesting that the MariaDB developers introduce a new variable that allows you to control the list of engines for which it would be useful to activate these statistics.

Contributors: