Miscellaneous Notes on Hadoop Operations
mrul0595
9 years ago
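<p><img src="https://simg.open-open.com/show/0d725b3780f2a39f01f691ec7afe5c19.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="344"></p> <p>System architecture:</p> <p><img src="https://simg.open-open.com/show/6705b409893347ce229dad20226bb1c9.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="294"></p> <p>Cloudera and its products</p> <p>How Apache Hadoop versions map to CDH versions</p> <p><img src="https://simg.open-open.com/show/b3164fbba38a80dc23e11e593e2feba0.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="309"></p> <p>Why is CDH better?</p> <p>Four installation methods: yum, tar, rpm, and Cloudera Manager</p> <p>Major improvements in CDH3u3</p> <p><img src="https://simg.open-open.com/show/059c056e17723f75a03444b8d2dc1be5.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="227"></p> <p>Major improvements in CDH3u4</p> <p><img src="https://simg.open-open.com/show/85109651b6a441c7851860cf2acf1d0d.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="181"></p> <p>Cloudera Manager</p> <p><img src="https://simg.open-open.com/show/94c51688fc5b58e3bdf83c57aeb8f2af.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="309"></p> <p><img src="https://simg.open-open.com/show/86a106fe961e581f946f0f2cc8970933.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="303"></p> <p><img src="https://simg.open-open.com/show/870f619fff577d937f9b22a55f029e92.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="301"></p> <p>Cloudera Training</p> <p>Offered as two courses: Administrator and Development</p> <p>Operations incidents</p> <h2>1. Memory: the resource you cannot afford to exhaust</h2> <p>Symptom 1</p> <pre> <code>On the second day after the system went live, the JobTracker stopped working and its web UI would not load.</code></pre> <p>Cause</p> <pre> <code>Too many jobs were submitted at once, and the JobTracker ran out of memory.</code></pre> <p>Solution</p> <pre> <code>Increase the JobTracker (JT) heap; limit the number of running jobs.</code></pre> <p>Symptom 2</p> <pre> <code>The NameNode (NN) ran out of memory. After a restart, the NN web UI (port 50070) reported a corrupted fsimage, and investigation showed that the SecondaryNameNode (SNN) fsimage was corrupted as well.</code></pre> <p>Cause</p> <pre> <code>Too many small files caused the NN and SNN to run out of memory, which corrupted the fsimage file. The restarted NN could nevertheless still serve requests normally.</code></pre> <p>Solution</p> <pre> <code>Asked for help on the Cloudera Google Group and obtained a back-door script.</code></pre> <h2>2. Inefficient MapReduce jobs</h2> <p>Symptom</p> <pre> <code>MapReduce jobs took too long to run.</code></pre> <p>Cause</p> <pre> <code>The MR code used Spring; small files made the map phase inefficient; reading and writing GZ files was slow.</code></pre> <p>Solution</p> <pre> 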
<code>Remove Spring from the MR code; enable JVM reuse; use LZO for the input and for the map output; increase the number of parallel copy threads in reduce.</code></pre> <p>Compression and MapReduce performance</p> <p><img src="https://simg.open-open.com/show/b9972f8b853cded4e2c631a252d44c64.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="237"></p> <h2>3. OMG, the whole cluster went down</h2> <p>Symptom</p> <pre> <code>In the morning, every DataNode (DN) was found dead. Within 10 minutes of a restart, the DNs died again one after another. Investigation found roughly 8% packet loss on the nodes.</code></pre> <p>Cause</p> <pre> <code>A switch module had failed; the DNs could not cope with the large number of small files.</code></pre> <p>Solution</p> <pre> <code>Upgrade from CDH3u2 to CDH3u4; set the DN heap to 2 GB.</code></pre> <p>What to do when you hit a problem you cannot get past</p> <p>Monitoring and alerting</p> <p><img src="https://simg.open-open.com/show/d0f66cf6a1ff8643dc43fc184dcb221d.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="394"></p> <p>Nagios alerts:</p> <p><img src="https://simg.open-open.com/show/4d2fa3408b3b19ec2983e4e12b297f9c.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="163"></p> <p>Business monitoring:</p> <p><img src="https://simg.open-open.com/show/d0f66cf6a1ff8643dc43fc184dcb221d.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="336"></p> <p><img src="https://simg.open-open.com/show/024b74f7fede391926e90fa68633fa15.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="455"></p> <p><img src="https://simg.open-open.com/show/e26d505133b0e3ee6d698e6847e3229c.jpg" alt="Miscellaneous Notes on Hadoop Operations" width="550" height="319"></p> <p><em>Original article</em> <a href="http://www.thebigdata.cn/Hadoop/29673.html?utm_source=tuicool&utm_medium=referral">http://www.thebigdata.cn/Hadoop/29673.html</a></p>
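<p>The tuning steps from section 2 (JVM reuse, LZO-compressed map output, more parallel copy threads in reduce) correspond to job configuration properties. The Python sketch below renders them as a Hadoop-style configuration document. It assumes the Hadoop 1.x / CDH3 property names and the codec class shipped by the hadoop-lzo project; the value 20 for parallel copies and the helper name <code>render_site_xml</code> are illustrative assumptions, not settings taken from the original talk.</p>

```python
import xml.etree.ElementTree as ET

# Hadoop 1.x / CDH3 property names. The values are illustrative assumptions.
TUNING = {
    # -1 lets a task JVM be reused for an unlimited number of tasks of a job
    "mapred.job.reuse.jvm.num.tasks": "-1",
    # compress intermediate map output before it is shuffled to reducers
    "mapred.compress.map.output": "true",
    # codec class from the hadoop-lzo project (must be installed on every node)
    "mapred.map.output.compression.codec": "com.hadoop.compression.lzo.LzoCodec",
    # assumed value; the stock default is 5 parallel copy threads per reducer
    "mapred.reduce.parallel.copies": "20",
}


def render_site_xml(props):
    """Build a Hadoop-style configuration document from a dict of properties."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(TUNING))
```

<p>The generated properties could be merged into <code>mapred-site.xml</code> or set per job; the right values depend on the cluster, so treat the numbers here as placeholders to validate.</p>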