Happy: A Jython Wrapper for Hadoop

openkk, 12 years ago

Hadoop + Python = Happy

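Happy makes it easier for Jython developers to work with the Hadoop framework. It wraps Hadoop's complex invocation process so that Map-Reduce development becomes simpler. A Map-Reduce job in Happy is defined in a subclass of happy.HappyJob. After creating an instance of that class, the user sets the job's input and output parameters and calls run() to start the map-reduce processing. At that point Happy serializes the job instance and copies the job, together with the libraries it depends on, to the Hadoop cluster for execution. At the time of writing, Happy had been adopted by the data-integration site freebase.com for data mining and analysis of the site's data.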

import sys, happy, happy.log

happy.log.setLevel("debug")
log = happy.log.getLogger("wordcount")

class WordCount(happy.HappyJob):

    def __init__(self, inputpath, outputpath):
        happy.HappyJob.__init__(self)
        self.inputpaths = inputpath
        self.outputpath = outputpath
        self.inputformat = "text"

    def map(self, records, task):
        # Emit (word, "1") for every whitespace-separated token.
        for _, value in records:
            for word in value.split():
                task.collect(word, "1")

    def reduce(self, key, values, task):
        # Count how many values were collected for this word.
        count = 0
        for _ in values:
            count += 1
        task.collect(key, str(count))
        log.debug(key + ":" + str(count))
        happy.results["words"] = happy.results.setdefault("words", 0) + count
        happy.results["unique"] = happy.results.setdefault("unique", 0) + 1

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Usage: <inputpath> <outputpath>"
        sys.exit(-1)
    wc = WordCount(sys.argv[1], sys.argv[2])
    results = wc.run()
    print str(sum(results["words"])) + " total words"
    print str(sum(results["unique"])) + " unique words"
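The map and reduce logic of the WordCount job above can be tried without a Hadoop cluster. The following is only a minimal local sketch in plain Python, not part of Happy: `LocalTask`, `map_words`, `reduce_counts`, and `run_local` are hypothetical names introduced here, and the grouping step only imitates what Hadoop does between the map and reduce phases.

```python
from collections import defaultdict


class LocalTask(object):
    """Hypothetical stand-in for the task object: it only records collect() calls."""

    def __init__(self):
        self.collected = []

    def collect(self, key, value):
        self.collected.append((key, value))


def map_words(records, task):
    # Same logic as WordCount.map: emit (word, "1") for every token.
    for _, value in records:
        for word in value.split():
            task.collect(word, "1")


def reduce_counts(key, values, task):
    # Same counting as WordCount.reduce, without logging or happy.results.
    count = 0
    for _ in values:
        count += 1
    task.collect(key, str(count))


def run_local(lines):
    # Map phase: records are (key, value) pairs; map_words ignores the key.
    map_task = LocalTask()
    map_words(enumerate(lines), map_task)

    # Shuffle phase (assumption: a simplified imitation of Hadoop's grouping).
    grouped = defaultdict(list)
    for key, value in map_task.collected:
        grouped[key].append(value)

    # Reduce phase: one call per distinct key, in sorted key order.
    reduce_task = LocalTask()
    for key in sorted(grouped):
        reduce_counts(key, grouped[key], reduce_task)
    return reduce_task.collected


counts = run_local(["to be or not to be", "to do"])
```

Here `counts` holds one `(word, count)` pair per distinct word, with the count kept as a string, as in the job's `task.collect(key, str(count))` call.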

Project homepage: http://www.open-open.com/lib/view/home/1339137811672