Miscellaneous Notes on Hadoop Operations

mrul0595 · 9 years ago
   <p><img src="https://simg.open-open.com/show/0d725b3780f2a39f01f691ec7afe5c19.jpg" alt="Hadoop operations notes" width="550" height="344"></p>
   <p>System architecture:</p>
   <p><img src="https://simg.open-open.com/show/6705b409893347ce229dad20226bb1c9.jpg" alt="Hadoop operations notes" width="550" height="294"></p>
   <p>Cloudera and its products</p>
   <p>How Apache Hadoop versions map to CDH releases</p>
   <p><img src="https://simg.open-open.com/show/b3164fbba38a80dc23e11e593e2feba0.jpg" alt="Hadoop operations notes" width="550" height="309"></p>
   <p>Why is CDH better?</p>
   <p>Four installation methods: yum, tar, rpm, and Cloudera Manager</p>
   <p>Major improvements in CDH3u3</p>
   <p><img src="https://simg.open-open.com/show/059c056e17723f75a03444b8d2dc1be5.jpg" alt="Hadoop operations notes" width="550" height="227"></p>
   <p>Major improvements in CDH3u4</p>
   <p><img src="https://simg.open-open.com/show/85109651b6a441c7851860cf2acf1d0d.jpg" alt="Hadoop operations notes" width="550" height="181"></p>
   <p>Cloudera Manager</p>
   <p><img src="https://simg.open-open.com/show/94c51688fc5b58e3bdf83c57aeb8f2af.jpg" alt="Hadoop operations notes" width="550" height="309"></p>
   <p><img src="https://simg.open-open.com/show/86a106fe961e581f946f0f2cc8970933.jpg" alt="Hadoop operations notes" width="550" height="303"></p>
   <p><img src="https://simg.open-open.com/show/870f619fff577d937f9b22a55f029e92.jpg" alt="Hadoop operations notes" width="550" height="301"></p>
   <p>Cloudera Training</p>
   <p>Offered as two courses: Administrator and Development</p>
   <p>Operations incidents</p>
   <h2>1. Memory, the recurring pain</h2>
   <p>Symptom 1</p>
   <pre>  <code>On the second day after launch, the JobTracker stopped working and its web page would not load</code></pre>
   <p>Cause</p>
   <pre>  <code>Too many jobs were submitted at once, and the JobTracker ran out of memory</code></pre>
   <p>Solution</p>
   <pre>  <code>Increase the JobTracker heap; cap the number of running jobs</code></pre>
   <p>Symptom 2</p>
   <pre>  <code>The NameNode ran out of memory; after a restart, the NameNode web page (50070) reported a corrupt fsimage, and investigation showed the SecondaryNameNode's fsimage was corrupt as well</code></pre>
   <p>Cause</p>
   <pre>  <code>Too many small files blew out the NN/SNN heap and corrupted the fsimage; the restarted NN could still serve normally, however</code></pre>
   <p>Solution</p>
   <pre>  <code>Asked for help on the Cloudera Google Group and obtained a back-door recovery script</code></pre>
   <h2>2. Inefficient MapReduce jobs</h2>
   <p>Symptom</p>
   <pre>  <code>MapReduce jobs took far too long to run</code></pre>
   <p>Cause</p>
   <pre>  <code>The MR code pulled in Spring; the many small files made the map method inefficient; reading and writing GZ files was slow</code></pre>
   <p>Solution</p>
   <pre>  <code>Stripped Spring out of the MR code; enabled JVM reuse; switched to LZO for job input and for map output; increased the number of parallel copy threads on the reduce side</code></pre>
   <p>Compression and MapReduce performance</p>
   <p><img src="https://simg.open-open.com/show/b9972f8b853cded4e2c631a252d44c64.jpg" alt="Hadoop operations notes" width="550" height="237"></p>
   <h2>3. OMG, the whole cluster is down</h2>
   <p>Symptom</p>
   <pre>  <code>One morning every DataNode was dead; ten minutes after a restart, they started dying again one by one. Investigation found roughly 8% packet loss on the affected nodes</code></pre>
   <p>Cause</p>
   <pre>  <code>A faulty switch module; the DataNodes could not hold up under the huge number of small files</code></pre>
   <p>Solution</p>
   <pre>  <code>Upgraded from CDH3u2 to CDH3u4; raised the DataNode heap to 2 GB</code></pre>
   <p>What to do when you run into a problem you cannot get past on your own</p>
   <p>Monitoring and alerting</p>
   <p><img src="https://simg.open-open.com/show/d0f66cf6a1ff8643dc43fc184dcb221d.jpg" alt="Hadoop operations notes" width="550" height="394"></p>
   <p>Nagios alerts:</p>
   <p><img src="https://simg.open-open.com/show/4d2fa3408b3b19ec2983e4e12b297f9c.jpg" alt="Hadoop operations notes" width="550" height="163"></p>
   <p>Business-level monitoring:</p>
   <p><img src="https://simg.open-open.com/show/d0f66cf6a1ff8643dc43fc184dcb221d.jpg" alt="Hadoop operations notes" width="550" height="336"></p>
   <p><img src="https://simg.open-open.com/show/024b74f7fede391926e90fa68633fa15.jpg" alt="Hadoop operations notes" width="550" height="455"></p>
   <p><img src="https://simg.open-open.com/show/e26d505133b0e3ee6d698e6847e3229c.jpg" alt="Hadoop operations notes" width="550" height="319"></p>
   <p><em>Original article</em>  <a href="http://www.thebigdata.cn/Hadoop/29673.html">http://www.thebigdata.cn/Hadoop/29673.html</a></p>
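<p>The MapReduce tuning steps from incident 2 (JVM reuse, LZO compression of map output, more reduce-side copy threads) correspond to standard CDH3-era <code>mapred-site.xml</code> properties. A minimal sketch, assuming the hadoop-lzo libraries are installed on the cluster; the property names are the Hadoop 0.20/CDH3 ones, but the values shown are illustrative, not the values used in the talk:</p>
<pre>  <code>&lt;!-- mapred-site.xml (Hadoop 0.20 / CDH3 property names; values illustrative) --&gt;
&lt;configuration&gt;
  &lt;!-- Reuse a task JVM instead of forking a new one per task; -1 = unlimited reuse --&gt;
  &lt;property&gt;
    &lt;name&gt;mapred.job.reuse.jvm.num.tasks&lt;/name&gt;
    &lt;value&gt;-1&lt;/value&gt;
  &lt;/property&gt;
  &lt;!-- Compress intermediate map output with LZO (requires hadoop-lzo) --&gt;
  &lt;property&gt;
    &lt;name&gt;mapred.compress.map.output&lt;/name&gt;
    &lt;value&gt;true&lt;/value&gt;
  &lt;/property&gt;
  &lt;property&gt;
    &lt;name&gt;mapred.map.output.compression.codec&lt;/name&gt;
    &lt;value&gt;com.hadoop.compression.lzo.LzoCodec&lt;/value&gt;
  &lt;/property&gt;
  &lt;!-- More reduce-side threads fetching map output (the 0.20 default is 5) --&gt;
  &lt;property&gt;
    &lt;name&gt;mapred.reduce.parallel.copies&lt;/name&gt;
    &lt;value&gt;20&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</code></pre>
<p>Using LZO for job input additionally requires registering the codec in <code>io.compression.codecs</code> (a core-site.xml setting) and indexing the input files so they are splittable; GZ input, by contrast, is not splittable, which is one reason it performed poorly here.</p>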
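<p>Incidents 1 and 3 were both cured partly by giving daemons larger heaps. In CDH3 the per-daemon heap is set in <code>hadoop-env.sh</code>; a sketch, assuming the stock 0.20-era variable names — the 2 GB DataNode figure is the one from incident 3, the other values are illustrative (capping retained jobs in JobTracker memory is a separate knob, e.g. <code>mapred.jobtracker.completeuserjobs.maximum</code>):</p>
<pre>  <code># hadoop-env.sh -- daemon heap sizes (CDH3-era variable names)
export HADOOP_HEAPSIZE=1000   # default heap in MB for all daemons

# Incident 1: give the JobTracker and NameNode/SNN more headroom than the default
export HADOOP_JOBTRACKER_OPTS="-Xmx4g ${HADOOP_JOBTRACKER_OPTS}"
export HADOOP_NAMENODE_OPTS="-Xmx8g ${HADOOP_NAMENODE_OPTS}"
export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx8g ${HADOOP_SECONDARYNAMENODE_OPTS}"

# Incident 3: DataNodes tracking huge numbers of small-file block replicas
export HADOOP_DATANODE_OPTS="-Xmx2g ${HADOOP_DATANODE_OPTS}"</code></pre>
<p>Note that a bigger heap only buys time against the small-files problem; the lasting fix is consolidating small files (e.g. into SequenceFiles or HAR archives) so the NN and DNs track fewer objects.</p>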