Cardinality对表Join的影响 - 飞飞和贝贝爸的博客 - ITeye博客

`

oliver_peng

浏览: 43184 次
性别:
来自: 深圳

最近访客更多访客>>

limars

raines8231

Cannel_2020

xwins

博主相关

博客

微博

相册

收藏

留言

关于我

文章分类

社区版块

存档分类

最新评论

sevk： #让ruby1.8兼容1.9的写法 if RUBY_ ...
升级到Ruby 1.9 笔记二
lmxbitihero：最大亮点当然是性能提高了好几倍。
升级到Ruby 1.9 笔记一
skyblue1984：好像变复杂了.......
升级到Ruby 1.9 笔记二
doabit：想升1.9，只是RAILS在1.9下有编码问题，，又跑回1.8 ...
升级到Ruby 1.9 笔记一
gfdice：性能是最大亮点。反而这个unicode，我还是习惯程序员自己去 ...
升级到Ruby 1.9 笔记一

Cardinality对表Join的影响

博客分类：

MySQL

阅读更多

如果你执行SHOW INDEX FROM TABLE_NAME察看所引信息，中间会有一列Cardinality，MySQL在解析多重Join的时候会根据Cardinality的信息决定选择什么路径来执行Join，这种算法的前提条件是索引数据时很平均分配的，但如果索引中的数据非常不均衡，会导致MySQL做出错误的选择。下面是一个完整的例子：

首先创建三个测试表：

CREATE TABLE customer (
-- Unique ID
customer_id integer NOT NULL,

CONSTRAINT PK_customer PRIMARY KEY(customer_id)
);

CREATE TABLE my_order (
-- Unique ID
order_id integer NOT NULL,

product_id integer NOT NULL,

customer_id integer NOT NULL,

CONSTRAINT PK_order PRIMARY KEY(product_id, customer_id ),
CONSTRAINT FK_order2customer_id FOREIGN KEY (customer_id) REFERENCES customer(customer_id),
CONSTRAINT UQ_order_id UNIQUE(order_id)

);

CREATE TABLE delivery (

order_id int NOT NULL,

time datetime NOT NULL,

CONSTRAINT PK_order PRIMARY KEY(order_id, time),
CONSTRAINT FK_order FOREIGN KEY (order_id) REFERENCES my_order(order_id)
);

然后加入数据：

set @N=0;

insert into customer select @N:=@N+1 from mysql.help_topic LIMIT 1000;

set @N=0;

insert into my_order select @N:=@N+1, @N, 1 from mysql.help_topic a, mysql.help_topic b LIMIT 100000;

set @I=1;

insert into my_order select @N:=@N+1, @N, @I:=@I+1 from mysql.help_topic a, mysql.help_topic b LIMIT 600;

set @N=0;

insert into delivery select @N:=@N+1, '2010-05-10 15:22:02' from mysql.help_topic a, mysql.help_topic b LIMIT 600;

注意在my_order表中大多数记录的customer_id的值是1。

执行ANALYZE TABLE来更新Cardinality：

ANALYZE TABLE my_order;

ANALYZE TABLE delivery;

下面我们来执行一个Query：

mysql> SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----------+
| count(*) |
+----------+
|      600 |
+----------+
1 row in set (0.51 sec)

然后用Explain分析一下：

mysql> explain SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
| id | select_type | table    | type   | possible_keys                    | key                  | key_len | ref                          | rows | Extra       |
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
| 1 | SIMPLE      | a        | ref    | PRIMARY,FK_order2customer_id     | PRIMARY              | 4       | const                        |    1 | Using index |
| 1 | SIMPLE      | b        | ref    | UQ_order_id,FK_order2customer_id | FK_order2customer_id | 4       | oliver_test.a.customer_id    | 167 |             |
| 1 | SIMPLE      | delivery | eq_ref | PRIMARY                          | PRIMARY              | 12      | oliver_test.b.order_id,const |    1 | Using index |
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
3 rows in set (0.00 sec)

从explain的结果来看，MySQL选择先把my_order自己做Join然后再去Join表delivery。第一个Join需要访问167行数据。但实际情况是如何呢？

mysql> SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 ;
+----------+
| count(*) |
+----------+
|   100000 |
+----------+
1 row in set (0.06 sec)

如果我们只执行第一个Join，实际将访问100000条记录，远远高于167，相反如果我们先做第二个Join，我们只需要访问600行：

mysql> select count(*) from delivery JOIN my_order ON my_order.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----------+
| count(*) |
+----------+
|      600 |
+----------+
1 row in set (0.00 sec)

为什么MySQL会选择错误的顺序来执行呢？首先察看一下my_order表的索引：

mysql> show index from my_order;
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table    | Non_unique | Key_name             | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| my_order |          0 | PRIMARY              |            1 | product_id | A         |      100600 |     NULL | NULL   |      | BTREE      |         |
| my_order |          0 | PRIMARY              |            2 | customer_id | A         |      100600 |     NULL | NULL   |      | BTREE      |         |
| my_order |          0 | UQ_order_id          |            1 | order_id    | A         |      100600 |     NULL | NULL   |      | BTREE      |         |
| my_order |          1 | FK_order2customer_id |            1 | customer_id | A         |         602 |     NULL | NULL   |      | BTREE      |         |
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
4 rows in set (0.00 sec)

我们可以看到对于索引FK_order2customer_id，Cardinality的值是602，MySQL在估算第一个Join的时候，假设索引是平均分布的，用总行数（100600）除以Cardinality，所以得到167，由于167小于第二个Join需要访问的行数600，所以选择先执行第一个Join。

如何解决这个问题？

修改Query中Join的顺序并用STRAIGHT_JOIN强制MySQL按照这个顺序执行，下面是新的Query:

mysql> SELECT count(*) FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON a.customer_id = b.customer_id and b.product_id = 123 ;
+----------+
| count(*) |
+----------+
|      600 |
+----------+
1 row in set (0.00 sec)

你可以看到查询时间从0.51秒变成0秒。

用Explain察看新的执行顺序：

mysql> explain SELECT count(*) FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON a.customer_id = b.customer_id and b.product_id = 123 ;
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
| id | select_type | table    | type   | possible_keys                    | key         | key_len | ref                           | rows | Extra                                       |
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
| 1 | SIMPLE      | b        | ref    | PRIMARY,FK_order2customer_id     | PRIMARY     | 4       | const                         |    1 | Using index                                 |
| 1 | SIMPLE      | delivery | index | PRIMARY                          | PRIMARY     | 12      | NULL                          | 600 | Using where; Using index; Using join buffer |
| 1 | SIMPLE      | a        | eq_ref | UQ_order_id,FK_order2customer_id | UQ_order_id | 4       | oliver_test.delivery.order_id |    1 | Using where                                 |
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
3 rows in set (0.00 sec)

另外很重要的一点是这个问题在开始阶段通常不会注意，因为数据量少不会影响性能，但随着数据不断增加，一旦达到一定数量，就会突然出现并影响系统的性能。

分享到：

如何释放已删除的InnoDB磁盘空间 | Apache mod_fcgid 配置实现fcgi process自 ...

2010-05-21 03:59
浏览 976
评论(0)
论坛回复 / 浏览 (0 / 2866)
分类:数据库
查看更多

评论

发表评论

您还没有登录,请您登录后再发表评论

相关推荐

A Preprocessing Scheme for High-Cardinality Categorical Attributes: 在分类及预测任务中对高维类别（category）变量的预处理方法

Stream summarizer and cardinality estimator..zip: Stream summarizer and cardinality estimator..zip

cardinality: cardinality

简单解析MySQL中的cardinality异常: 最终通过analyze table feed_comment_info_id_0000 命令更新了Cardinality ，才能再次用到索引。排查过程如下： sql语句： select id from feed_comment_info_id_0000 where obj_id=101 and type=1; 索引信息： ...

Oracle优化器 Cardinality基数与Selectivity选择性: 优化器（optimizer）是oracle数据库内置的一个核心子系统。优化器的目的是按照一定的判断原则来得到它认为的目标SQL在当前的情形下的最高效的执行路径，也就是为了得到目标SQL的最佳执行计划。依据所选择执行计划时...

cardinality:用于确定和检查任何可迭代对象大小的 Python 库: 基数cardinality是一个 Python 库，用于确定和检查任何可迭代对象的大小。文档： : Python 包索引 (PyPI)： ://pypi.python.org/pypi/cardinality/ 源代码和问题跟踪器： :

low-cardinality-index-effectiveness: 低基数列索引有效性PostgreSQL 和 DB2 的示例程序。DB2 备忘录将 DB2 本机库的路径配置为LD_LIBRARY_PATH 。例子： LD_LIBRARY_PATH=/opt/ibm/db2/V10.5/lib64在 CLP 中使用分号作为分隔符。 db2 -t

New Cardinality Estimation Algorithms for HyperLogLog Sketches (1702.01284)-计算机科学: New cardinality estimation algorithms for HyperLogLog sketchesOtmar Ertl otmar.ertl@gmail.comFebruary 27, 2017This paper presents new methods to estimate the cardinalities of data sets recorded by ...

RFID Cardinality Estimation with Blocker Tags: RFID Cardinality Estimation with Blocker Tags

HyperLogLog - The Analysis of a Near-Optimal Cardinality Estimation Algorithm (914-3045-2-PB)-计算机科学: cardinality estimation algorithmPhilippe Flajolet1 and Éric Fusy1 and Olivier Gandouet2 and Frédéric Meunier11Algorithms Project, INRIA–Rocquencourt, F78153 Le Chesnay (France) 2LIRMM, 161 rue Ada...

PowerDesigner应用教程(深蓝居): ------------PowerDesigner应用教程(深蓝居)---------------------------

论文研究-On Function Period and Coding-Alphabet Cardinality of Genetic Algorithms.pdf: 函数周期与遗传算法编码元数，莫鸿强，Zhong Li，本文根据一阶积木块的数量比较不同编码元数的遗传算法应用于单周期和多周期函数时的优化效果。分析表明，遗传算法以多个周期同时

matlab开发-MaximumCardinalitymatching: matlab开发-MaximumCardinalitymatching。构造（非加权）最大基数匹配

SSD7-EXCISE5: ssd7-ex5 ER Models The entity types (identify weak entities). The attributes in each entity type...The relationship types (identify the roles, the cardinality constraints, and participation constraints

Better with Fewer Bits - Improving the Performance of Cardinality Estimation of Large Data Streams (INFOCOM2017)-计算机科学: Better with Fewer Bits - Improving the Performance of Cardinality Estimation of Large Data Streams (INFOCOM2017)-计算机科学

HyperLogLog in Practice: Cardinality estimation has a wide range of applications and is of particular importance in database systems. Various algorithms have been proposed in the past, and the HyperLogLog algorithm is one of ...

Python库 | categorical_encoding-0.2.0-py3-none-any.whl: 资源分类：Python库所属语言：Python 资源全名：categorical_encoding-0.2.0-py3-none-any.whl 资源来源：官方安装方法：https://lanzao.blog.csdn.net/article/details/101784059

Global site tag (gtag.js) - Google Analytics