Apache Kudu v0.9.0 released: a data storage system
jopen, 8 years ago
<p style="text-align: center;"><img alt="" src="https://simg.open-open.com/show/a52bebcca88f4596ab371e9541707155.png" /></p> <p>There were two ways to respond to the trends identified earlier: keep updating the existing Hadoop tools, or design and build a new component. The goals were:</p> <ul> <li>High performance for both scans and random access, simplifying users' complex hybrid architectures;</li> <li>High CPU efficiency, to get the most out of modern processors;</li> <li>High I/O performance, to take full advantage of modern persistent storage media;</li> <li>Support for updating data in place, avoiding extra data processing and data movement.</li> </ul> <p>To meet these goals, we first built prototypes on top of existing open source projects, but we eventually concluded that major architectural changes were needed, and that those changes were large enough to justify developing an entirely new data storage system. After years of work, we can finally share the result: Kudu, a new data storage system.</p> <h2>Changelog</h2> <h3>Incompatible changes</h3> <ul> <li> <p>The <code>KuduTableInputFormat</code> class has changed the way in which it handles scan predicates, including how it serializes predicates to the job configuration object. The new configuration key is <code>kudu.mapreduce.encoded.predicate</code>. Clients using the <code>TableInputFormatConfigurator</code> are not affected.</p> </li> <li> <p>The <code>kudu-spark</code> sub-project has been renamed to follow naming conventions for Scala. The new name is <code>kudu-spark_2.10</code>.</p> </li> <li> <p>Default table partitioning has been removed. All tables must now be created with explicit partitioning. Existing tables are unaffected. See the <a href="/misc/goto?guid=4958991320803360376">schema design guide</a> for more details.</p> </li> </ul> <h3>New features</h3> <ul> <li> <p><a href="/misc/goto?guid=4958991320918001397">KUDU-1002</a> Added support for <code>UPSERT</code> operations, whereby a row is inserted if it does not already exist, but updated if it does. Support for <code>UPSERT</code> is included in Java, C++, and Python APIs, but not in Impala.</p> </li> <li> <p><a href="/misc/goto?guid=4958991321027543772">KUDU-1306</a> Scan token API for creating partition-aware scan descriptors. This API simplifies executing parallel scans for clients and query engines.</p> </li> <li> <p><a href="/misc/goto?guid=4958991321127685202">Gerrit 2848</a> Added a Kudu datasource for Spark. 
This datasource uses the Kudu client directly instead of using the MapReduce API. Predicate pushdowns for <code>spark-sql</code> and Spark filters are included, as well as parallel retrieval for multiple tablets and column projections. See an example of <a href="/misc/goto?guid=4958991321227480297">Kudu integration with Spark</a>.</p> </li> <li> <p><a href="/misc/goto?guid=4958991321334248751">Gerrit 2992</a> Added the ability to update and insert from Spark using a Kudu datasource.</p> </li> </ul> <h3>Improvements</h3> <ul> <li> <p><a href="/misc/goto?guid=4958991321441669921">KUDU-1415</a> Added statistics in the Java client such as the number of bytes written and the number of operations applied.</p> </li> <li> <p><a href="/misc/goto?guid=4958991321539175566">KUDU-1451</a> Improved tablet server restart time when the tablet server needs to clean up a lot of previously deleted tablets. Tablets are now cleaned up after they are deleted.</p> </li> </ul> <h3>Bug fixes</h3> <ul> <li> <p><a href="/misc/goto?guid=4958991321644138138">KUDU-678</a> Fixed a leak that happened during DiskRowSet compactions where tiny blocks were still written to disk even if there were no REDO records. 
With the default block manager, it usually resulted in block containers with thousands of tiny blocks.</p> </li> <li> <p><a href="/misc/goto?guid=4958991321742752720">KUDU-1437</a> Fixed a data corruption issue that occurred after compacting sequences of negative INT32 values in a column that was configured with RLE encoding.</p> </li> </ul> <h3>Other noteworthy changes</h3> <p>All Kudu clients have longer default timeout values, as listed below.</p> <h3>Java</h3> <ul> <li> <p>The default operation timeout and the default admin operation timeout are now set to 30 seconds instead of 10.</p> </li> <li> <p>The default socket read timeout is now 10 seconds instead of 5.</p> </li> </ul> <h3>C++</h3> <ul> <li> <p>The default admin timeout is now 30 seconds instead of 10.</p> </li> <li> <p>The default RPC timeout is now 10 seconds instead of 5.</p> </li> <li> <p>The default scan timeout is now 30 seconds instead of 15.</p> </li> <li> <p>Some default settings related to I/O behavior during flushes and compactions have been changed: The default for <code>flush_threshold_mb</code> has been increased from 64MB to 1000MB. The default <code>cfile_do_on_finish</code> has been changed from <code>close</code> to <code>flush</code>. <a href="/misc/goto?guid=4958991321845944995">Experiments using YCSB</a> indicate that these values will provide better throughput for write-heavy applications on typical server hardware.</p> </li> </ul> <h2>Downloads</h2> <ul> <li><a href="http://www.apache.org/closer.cgi?filename=incubator/kudu/0.9.0/apache-kudu-incubating-0.9.0.tar.gz&action=download">Kudu 0.9.0 source tarball</a></li> <li><a href="/misc/goto?guid=4958991322054438924" rel="nofollow"><strong>Source code</strong> (zip)</a></li> <li><a href="/misc/goto?guid=4958991322151064249" rel="nofollow"><strong>Source code</strong> (tar.gz)</a></li> </ul>
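<p>As a supplementary illustration of the <code>UPSERT</code> semantics described under KUDU-1002 (insert the row if its key is absent, update it otherwise), here is a minimal Python sketch. It does not use the Kudu client API: <code>ToyTable</code> and its methods are hypothetical names that model a table as an in-memory dictionary keyed by primary key, purely to contrast <code>INSERT</code> with <code>UPSERT</code>.</p>

```python
# Toy model of INSERT vs. UPSERT semantics.
# ASSUMPTION: ToyTable is a hypothetical stand-in, NOT the Kudu client API.


class ToyTable:
    """In-memory table keyed by primary key (illustrative only)."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, **columns):
        # A plain INSERT is rejected when the primary key already exists.
        if key in self.rows:
            raise KeyError(f"row {key!r} already exists")
        self.rows[key] = dict(columns)

    def upsert(self, key, **columns):
        # UPSERT: insert when the key is absent, update when it is present.
        if key in self.rows:
            self.rows[key].update(columns)
        else:
            self.rows[key] = dict(columns)


table = ToyTable()
table.upsert(1, name="alice", visits=1)  # key 1 is absent: acts like INSERT
table.upsert(1, name="alice", visits=2)  # key 1 exists: acts like UPDATE

try:
    table.insert(1, name="alice", visits=3)  # duplicate key: INSERT fails
    insert_failed = False
except KeyError:
    insert_failed = True
```

<p>In this toy model the second <code>upsert</code> overwrites the existing row instead of failing, whereas a plain <code>insert</code> on the same key is rejected. With a real cluster, the equivalent operation would go through the Java, C++, or Python client listed above.</p>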