MySQL 如何存储大数据
KatriceMart
8年前
<p>最近,在工作中遇到了MySQL中如何存储长度较长的字段类型问题,于是花了一周多的时间抽空学习了一下,并且记录下来。</p> <p>MySQL大致的逻辑存储结构在这篇文章中有介绍,做为基本概念: InnoDB 逻辑存储结构</p> <p>注:文中所指的大数据指的是长度较长的数据字段,包括varchar/varbinay/text/blob。</p> <h3>Compact行格式</h3> <p>我们首先来看一下行格式为Compact是如何存储大数据的:</p> <pre> <code class="language-sql">mysql> select version(); +-----------+ | version() | +-----------+ | 5.1.73 | +-----------+ 1 row in set (0.01 sec) mysql> show table status like 'row'\G; *************************** 1. row *************************** Name: row Engine: InnoDB Version: 10 Row_format: Compact Rows: 1 Avg_row_length: 81920 Data_length: 81920 Max_data_length: 0 Index_length: 0 Data_free: 0 Auto_increment: NULL Create_time: 2017-01-04 21:46:02 Update_time: NULL Check_time: NULL Collation: latin1_swedish_ci Checksum: NULL Create_options: Comment: 1 row in set (0.00 sec)</code></pre> <p>我们建立一张测试表,插入数据:</p> <pre> <code class="language-sql">CREATE TABLE `row` ( `content` varchar(65532) NOT NULL DEFAULT '' ) ENGINE=InnoDB DEFAULT CHARSET=latin1 mysql> insert into row(content) select repeat('a',65532); Query OK, 1 row affected (0.03 sec) Records: 1 Duplicates: 0 Warnings: 0</code></pre> <p>我们使用 py_innodb_page_info.py 工具来查看表中的页分布:</p> <pre> <code class="language-sql">[root@localhost mysql]# python py_innodb_page_info.py -v com/row.ibd page offset 00000000, page type <File Space Header> page offset 00000001, page type <Insert Buffer Bitmap> page offset 00000002, page type <File Segment inode> page offset 00000003, page type <B-tree Node>, page level <0000> page offset 00000004, page type <Uncompressed BLOB Page> page offset 00000005, page type <Uncompressed BLOB Page> page offset 00000006, page type <Uncompressed BLOB Page> page offset 00000007, page type <Uncompressed BLOB Page> Total number of page: 8: Insert Buffer Bitmap: 1 Uncompressed BLOB Page: 4 File Space Header: 1 B-tree Node: 1 File Segment inode: 1</code></pre> <p>可以看出,第4页的 <B-tree Node>, page level <0000> 格式为数据页,存放着MySQL的行数据。 <Uncompressed BLOB Page> 可以理解为MySQL存放大数据的地方,暂且叫作外部存储页。Compact格式没有将大数据全部放在数据页中,而是将一部分数据放在了外部存储页中。那么,是全部数据在外部存储页中,还是一部分数据。假如是一部分数据,这一部分是多少呢?</p> <p>我们使用 hexdump -Cv row.ibd 查看一下数据页 <B-tree Node>, page level <0000> ,也就是第4页:</p> <pre> <code class="language-sql">3073 0000c000 8c 25 17 57 00 00 00 03 ff ff ff ff ff ff ff ff |.%.W....????????| 3074 0000c010 00 00 00 00 00 07 3a b8 45 bf 00 00 00 00 00 00 |......:?E?......| 3075 0000c020 00 00 00 00 00 02 00 02 03 a6 80 03 00 00 00 00 |.........?......| 3076 0000c030 00 7f 00 05 00 00 00 01 00 00 00 00 00 00 00 00 |................| 3077 0000c040 00 00 00 00 00 00 00 00 00 13 00 00 00 02 00 00 |................| 3078 0000c050 00 02 00 f2 00 00 00 02 00 00 00 02 00 32 01 00 |...?.........2..| 3079 0000c060 02 00 1c 69 6e 66 69 6d 75 6d 00 02 00 0b 00 00 |...infimum......| 3080 0000c070 73 75 70 72 65 6d 75 6d 14 c3 00 00 10 ff f1 00 |supremum.?...??.| 3081 0000c080 00 00 00 04 03 00 00 00 00 13 12 80 00 00 00 2d |...............-| 3082 0000c090 01 10 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |..aaaaaaaaaaaaaa| 3083 0000c0a0 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa| 3084 0000c0b0 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa| 3085 0000c0c0 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa| .... .... 3128 0000c370 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa| 3129 0000c380 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa| 3130 0000c390 61 61 00 00 00 02 00 00 00 04 00 00 00 26 00 00 |aa...........&..| 3131 0000c3a0 00 00 00 00 fc fc 00 00 00 00 00 00 00 00 00 00 |....??..........| 3132 0000c3b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 3133 0000c3c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 3134 0000c3d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| ... ... 4093 0000ffc0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 4094 0000ffd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 4095 0000ffe0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 4096 0000fff0 00 00 00 00 00 70 00 63 01 a1 6c 2b 00 07 3a b8 |.....p.c.?l+..:?|</code></pre> <p>我们可以看出,数据页中存储了一部分数据,算下来一共是768字节,然后剩余部分存储在外部存储页中。那么数据页与外部存储页、外部存储页与外部存储页是如何连接在一起的呢?</p> <p>我们观察这一行:</p> <pre> <code class="language-sql">3130 0000c390 61 61 00 00 00 02 00 00 00 04 00 00 00 26 00 00 |aa...........&..| 3131 0000c3a0 00 00 00 00 fc fc 00 00 00 00 00 00 00 00 00 00 |................|</code></pre> <p>这一行是前缀768字节的结尾。注意最后的20个字节:</p> <ul> <li>00 00 00 02:4字节,代表外部存储页所在的space id</li> <li>00 00 00 04:4字节,代表第一个外部页的Page no</li> <li>00 00 00 26:4字节,值为38,指向blob页的header</li> <li>00 00 00 00 00 00 fc fc:8字节,代表该列存在外部存储页的总长度。此处的值为64764,加上前缀768正好是65532。(注意一点,虽然表示BLOB长度的是8字节,实际只有4个字节能使用,所有对于BLOB字段,存储数据的最大长度为4GB。)</li> </ul> <p>验证下第一个外部存储页的头部信息:</p> <pre> <code class="language-sql">4097 00010000 cd c3 b6 8e 00 00 00 04 00 00 00 00 00 00 00 00 |?ö.............| 4098 00010010 00 00 00 00 00 06 b8 a2 00 0a 00 00 00 00 00 00 |......??........| 4099 00010020 00 00 00 00 00 02 00 00 3f ca 00 00 00 05 61 61 |........??....aa| 4100 00010030 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa| ... ...</code></pre> <p>前38个字节为File Header(关于InnoDB数据页的详细结构请参见《MySQL技术内幕 InnoDB存储引擎》4.4),这个简单提一下:</p> <ul> <li>cd c3 b6 8e:4字节,该页的checksum。</li> <li>00 00 00 04:4字节,页偏移,此页为表空间中的第5个页。</li> <li>00 00 00 00:4字节,当前页的上一个页。此页为 <Uncompressed BLOB Page> ,所以没有上一页。</li> <li>00 00 00 00:4字节,当前页的下一个页。此页为 <Uncompressed BLOB Page> ,所以没有下一页。</li> <li>00 00 00 00 00 06 b8 a2:8字节,该页最后被修改的日志序列位置LSN。</li> <li>00 0a:2字节,页类型,0x000A代表BLOB页。</li> <li>00 00 00 00 00 00 00 00:8字节,略过。</li> <li>00 00 00 02:页属于哪个表空间,此处指表空间的ID为2。</li> </ul> <p>之后是4字节的 00 00 3f ca ,这里的值为16330,代表此BLOB页的有效数据的字节数。 00 00 00 05 代表下一个BLOB页的page number。</p> <p>我们看最后一个 <Uncompressed BLOB Page> ,第8个页:</p> <pre> <code class="language-sql">7169 0001c000 fa 78 9b 27 00 00 00 07 00 00 00 00 00 00 00 00 |?x.'............| 7170 0001c010 00 00 00 00 00 07 3a b8 00 0a 00 00 00 00 00 00 |......:?........| 7171 0001c020 00 00 00 00 00 02 00 00 3d 9e ff ff ff ff 61 61 |........=.????aa| 7172 0001c030 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa| 7173 0001c040 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa| ... ...</code></pre> <p>最后一页的有效数据大小为0x00003d9e=15774,768+16330*3+15774 = 65532字节,符合初始插入数据的大小。</p> <p>由于这是最后一个 <Uncompressed BLOB Page> ,所以指向下一个 <Uncompressed BLOB Page> 的指针为ff ff ff ff。</p> <p>由此我们可以很清晰的看出数据页与BLOB页的连接关系(引用淘宝数据库月报上的一张图):</p> <p style="text-align:center"><img src="https://simg.open-open.com/show/a37d6fcc7e78730b65b931e03cb1181c.png"></p> <p>我们来再看一个比较有意思的例子。:</p> <pre> <code class="language-sql">CREATE TABLE `testblob` ( `blob1` blob NOT NULL, `blob2` blob NOT NULL, `blob3` blob NOT NULL, `blob4` blob NOT NULL, `blob5` blob NOT NULL, `blob6` blob NOT NULL, `blob7` blob NOT NULL, `blob8` blob NOT NULL, `blob9` blob NOT NULL, `blob10` blob NOT NULL, `blob11` blob NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1; mysql> insert into testblob select repeat('a',1000),repeat('b',1000),repeat('c',1000),repeat('d',1000),repeat('e',1000),repeat('f',1000),repeat('g',1000),repeat('h',1000),repeat('i',1000),repeat('j',1000),repeat('k',1000); ERROR 1030 (HY000): Got error 139 from storage engine</code></pre> <p>我们建立一张新表,有11个blob字段。然后向每个字段插入1000字节的数据,MySQL会提示 ERROR 1030 (HY000): Got error 139 from storage engine ,什么意思呢?</p> <p>InnoDB是以B+树来组织数据的,假如每一行数据都占据一整个Page页,那么B+树将退化为单链表,所以InnoDB规定了一个Page必须包含两行数据。也就是一行数据存储在Page上的大小大概为8000字节。</p> <p>而上面的例子,一行数据有11个1000字节的数据,Page层肯定放不下,所以在Page层留下768*11=8448字节,已经超过了8000字节,所以MySQL会提示 ERROR 1030 (HY000): Got error 139 from storage engine 。我们很轻松的定义一个字段,来存储11000个字节,但是却无法将他们分成11个字段来存储,有点意思!</p> <p>那么如何解决上面的问题呢?</p> <ul> <li>将行格式转为接下来要说的Dynamic格式。此种格式只用20字节指向外部存储空间。</li> <li>将多个blob字段转为一个blob字段。多个字段可以用数组存储,然后json_encode打包进blob。</li> </ul> <p>我们向表中插入一条有效记录:</p> <pre> <code class="language-sql">mysql> insert into testblob(blob1,blob2,blob3,blob4,blob5,blob6,blob7,blob8,blob9) select repeat('a',8000),repeat('b',8000),repeat('c',8000),repeat('d',8000),repeat('e',8000),repeat('f',8000),repeat('g',8000),repeat('h',8000),repeat('i',8000); Query OK, 1 row affected (0.12 sec) Records: 1 Duplicates: 0 Warnings: 0</code></pre> <pre> <code class="language-sql">[root@localhost mysql]# python py_innodb_page_info.py -v com/testblob.ibd page offset 00000000, page type <File Space Header> page offset 00000001, page type <Insert Buffer Bitmap> page offset 00000002, page type <File Segment inode> page offset 00000003, page type <B-tree Node>, page level <0000> page offset 00000004, page type <Uncompressed BLOB Page> page offset 00000005, page type <Uncompressed BLOB Page> page offset 00000006, page type <Uncompressed BLOB Page> page offset 00000007, page type <Uncompressed BLOB Page> page offset 00000008, page type <Uncompressed BLOB Page> page offset 00000009, page type <Uncompressed BLOB Page> page offset 0000000a, page type <Uncompressed BLOB Page> page offset 0000000b, page type <Uncompressed BLOB Page> page offset 0000000c, page type <Uncompressed BLOB Page> Total number of page: 13: Insert Buffer Bitmap: 1 Uncompressed BLOB Page: 9 File Space Header: 1 B-tree Node: 1 File Segment inode: 1</code></pre> <p>我们可以看出这一行数据有9个外部存储页,而我们一共就插入了9列数据,是不是当每一列的数据在page页放不下,都单独申请一个外部存储页,而互相之前不共享外部存储页。我们看一下page页的结构就知道了:</p> <pre> <code class="language-sql">3130 0000c390 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa| 3131 0000c3a0 61 61 61 61 00 00 00 05 00 00 00 04 00 00 00 26 |aaaa...........&| ... ... 3180 0000c6b0 62 62 62 62 62 62 62 62 00 00 00 05 00 00 00 05 |bbbbbbbb........| 3181 0000c6c0 00 00 00 26 00 00 00 00 00 00 1c 40 63 63 63 63 |...&.......@cccc| ... ... 3229 0000c9c0 63 63 63 63 63 63 63 63 63 63 63 63 00 00 00 05 |cccccccccccc....| 3230 0000c9d0 00 00 00 06 00 00 00 26 00 00 00 00 00 00 1c 40 |.......&.......@| ... ...</code></pre> <p>根据前面的分析,我们现在可以看出,外部存储页是不共享的,即使一个列的数据多出一个字节,这一个字节也是独占一个16KB空间的大小,这很浪费存储空间。(当然,这对现代计算机可能不是问题,呵呵)。</p> <p>说了这么多,总结下Compact格式存储大数据的缺点:</p> <ul> <li>由于存在768字节的前缀在Page页,所以会存在能定义一个字段,存储11000字节,但是不能定义11个字段,每个字段存储1000字节的"bug"。</li> <li>外部存储页不共享,即使多余一个字节也是独享16KB的页面。</li> </ul> <h3>Dynamic行格式</h3> <p>接着我们首先看一下行格式为Dynamic是如何存储大数据的:</p> <pre> <code class="language-sql">mysql> select version(); +-----------+ | version() | +-----------+ | 5.7.14 | +-----------+ 1 row in set (0.00 sec) mysql> show table status like 'row'\G; *************************** 1. row *************************** Name: row Engine: InnoDB Version: 10 Row_format: Dynamic Rows: 0 Avg_row_length: 0 Data_length: 16384 Max_data_length: 0 Index_length: 0 Data_free: 0 Auto_increment: NULL Create_time: 2017-01-03 22:45:16 Update_time: NULL Check_time: NULL Collation: latin1_swedish_ci Checksum: NULL Create_options: Comment: 1 row in set (0.00 sec)</code></pre> <p>创建和compact格式一样的表:</p> <pre> <code class="language-sql">CREATE TABLE `row` ( `content` varchar(65532) NOT NULL DEFAULT '' ) ENGINE=InnoDB DEFAULT CHARSET=latin1 insert into row(content) select repeat('a',65532); Query OK, 1 row affected (0.03 sec) Records: 1 Duplicates: 0 Warnings: 0</code></pre> <p>看下页分布:</p> <pre> <code class="language-sql">[root@localhost mysql]# python py_innodb_page_info.py -v row.ibd page offset 00000000, page type <File Space Header> page offset 00000001, page type <Insert Buffer Bitmap> page offset 00000002, page type <File Segment inode> page offset 00000003, page type <B-tree Node>, page level <0000> page offset 00000004, page type <Uncompressed BLOB Page> page offset 00000005, page type <Uncompressed BLOB Page> page offset 00000006, page type <Uncompressed BLOB Page> page offset 00000007, page type <Uncompressed BLOB Page> page offset 00000008, page type <Uncompressed BLOB Page> Total number of page: 9: Insert Buffer Bitmap: 1 Uncompressed BLOB Page: 5 File Space Header: 1 B-tree Node: 1 File Segment inode: 1</code></pre> <p>第4页是数据页,第5-9页是二进制页。我们直接看磁盘中第4页的数据:</p> <pre> <code class="language-sql">3073 0000c000 dc 2d b0 f5 00 00 00 03 ff ff ff ff ff ff ff ff |.-..............| 3074 0000c010 00 00 00 00 00 a3 4b 59 45 bf 00 00 00 00 00 00 |......KYE.......| 3075 0000c020 00 00 00 00 00 36 00 02 00 a6 80 03 00 00 00 00 |.....6..........| 3076 0000c030 00 7f 00 05 00 00 00 01 00 00 00 00 00 00 00 00 |................| 3077 0000c040 00 00 00 00 00 00 00 00 00 64 00 00 00 36 00 00 |.........d...6..| 3078 0000c050 00 02 00 f2 00 00 00 36 00 00 00 02 00 32 01 00 |.......6.....2..| 3079 0000c060 02 00 1c 69 6e 66 69 6d 75 6d 00 02 00 0b 00 00 |...infimum......| 3080 0000c070 73 75 70 72 65 6d 75 6d 14 c0 00 00 10 ff f1 00 |supremum........| 3081 0000c080 00 00 00 02 00 00 00 00 00 07 07 a7 00 00 01 1b |................| 3082 0000c090 01 10 00 00 00 36 00 00 00 04 00 00 00 26 00 00 |.....6.......&..| 3083 0000c0a0 00 00 00 00 ff fc 00 00 00 00 00 00 00 00 00 00 |................| 3084 0000c0b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 3085 0000c0c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 3086 0000c0d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 3087 0000c0e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| ... ... ...</code></pre> <p>和Compact格式有着明显的不同,当大数据在Page页存放不下时,Dynamic行格式不会留768字节在Page页,并且将全部大数据都放在外部存储页。具体的数据页和外部存储页的连接关系同Compact格式一样。</p> <p>我们再看看Dynamic格式的外部存储页是不是每一个列独享外部存储空间,还是同Compact格式实验过程一样:</p> <pre> <code class="language-sql">CREATE TABLE `testblob` ( `blob1` blob NOT NULL, `blob2` blob NOT NULL, `blob3` blob NOT NULL, `blob4` blob NOT NULL, `blob5` blob NOT NULL, `blob6` blob NOT NULL, `blob7` blob NOT NULL, `blob8` blob NOT NULL, `blob9` blob NOT NULL, `blob10` blob NOT NULL, `blob11` blob NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1; mysql> insert into testblob(blob1,blob2,blob3,blob4,blob5,blob6,blob7,blob8,blob9,blob10,blob11) select repeat('a',8000),repeat('b',8000),repeat('c',8000),repeat('d',8000),repeat('e',8000),repeat('f',8000),repeat('g',8000),repeat('h',8000),repeat('i',8000),repeat('j',8000),repeat('k',8000); Query OK, 1 row affected (0.10 sec) Records: 1 Duplicates: 0 Warnings: 0</code></pre> <p>看一下外部存储页数据:</p> <pre> <code class="language-sql">4599 00011f60 61 61 61 61 61 61 61 61 61 61 61 61 61 61 00 00 |aaaaaaaaaaaaaa..| 4600 00011f70 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|</code></pre> <p>好的,可以不用向下看其他列的了,Dynamic的外部存储页也不是共享的。</p> <p>但是MySQL为什么要这么设计呢?可能是为了实现简单吧,沿着链表通过有效数据大小就能读取blob的全部数据。假如多个字段的blob混在一起,可能设计更复杂,要更新每个字段的偏移量之类的,更新的话页数据管理也比较麻烦。我的个人猜测,呵呵。</p> <p>总结下Dynamic格式存储大数据的特点:</p> <ul> <li>当数据页放不下时,MySQL会将大数据全部放在外部存储页,数据页只留指向外部存储页的指针。</li> <li>外部存储页不共享,即使多余一个字节也是独享16KB的页面。</li> </ul> <h3>将列放入外部存储页的标准</h3> <p>当一行中的数据不能在数据页中放下,需要申请外部存储页时,MySQL需要决定将哪一列的数据放到外部存储页,遵循的规则如下:</p> <ul> <li>长度固定的字段不会被放到外部存储页(int、char(N)等)</li> <li>长度小于20字节的字段不会被放到外部存储页。(假如放到外部存储页,不仅会单独占据16KB,还要额外的20字节指针,没有必要)</li> <li>对于Compact和REDUNDANT格式的行数据,长度小于768字节的字段不会被放到外部存储页。(这个原因很显然,本来就不够768字节的前缀,总不能生搬硬凑吧)。</li> </ul> <p>当有多个大数据字段满足上面条件,需要被放到外部存储页时,MySQL会优先选择大的字段放到外部存储页,因为这样可以最大限度的省下数据页的空间,使得更多的字段能够被放到数据页。</p> <p>由于有较多的实验过程,所以显得比较乱,建议看到这篇文章人自己实践一遍,毕竟自己动手会思考更多的问题与细节,理解的也比较深刻,哈哈哈。</p> <p>参考资料: <a href="/misc/goto?guid=4959736347158162765" rel="nofollow,noindex">http://dev.mysql.com/doc/refman/5.7/en/column-count-limit.html</a></p> <p><a href="/misc/goto?guid=4959736347245761665" rel="nofollow,noindex">http://mysqlserverteam.com/externally-stored-fields-in-innodb/</a></p> <p><a href="/misc/goto?guid=4959736347341615212" rel="nofollow,noindex">https://www.percona.com/blog/2010/02/09/blob-storage-in-innodb/</a></p> <p><a href="/misc/goto?guid=4958987955502305550" rel="nofollow,noindex">http://mysql.taobao.org/monthly/2016/02/01/</a></p> <p> </p> <p>来自:https://github.com/zhangyachen/zhangyachen.github.io/issues/96</p> <p> </p>