rake: a Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm
jopen
10 years ago
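A Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm, which automatically extracts keywords from a single document.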
from __future__ import print_function  # lets print() behave the same on Python 2 and 3

import operator

import rake

# EXAMPLE ONE - SIMPLE
stoppath = "SmartStoplist.txt"

'''
# 1. initialize RAKE by providing a path to a stopwords file
rake_object = rake.Rake(stoppath, 5, 3, 4)
# the arguments mean: (1) each word has at least 5 characters,
# (2) each phrase has at most 3 words,
# (3) each keyword appears in the text at least 4 times

# 2. run RAKE on a given text
with open("data/docs/fao_test/w2167e.txt", 'r') as sample_file:
    text = sample_file.read()

keywords = rake_object.run(text)  # returns all the keywords and their scores

# 3. print results
print("Keywords:", keywords)

print("----------")
'''

# EXAMPLE TWO - BEHIND THE SCENES (from https://github.com/aneesha/RAKE/rake.py)

# initialize RAKE by providing a path to a stopwords file
rake_object = rake.Rake(stoppath)

text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility " \
       "of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. " \
       "Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating" \
       " sets of solutions for all types of systems are given. These criteria and the corresponding algorithms " \
       "for constructing a minimal supporting set of solutions can be used in solving all the considered types of " \
       "systems and systems of mixed types."

# Split text into sentences
# (the text is split at punctuation marks, here commas and periods)
sentenceList = rake.split_sentences(text)

for sentence in sentenceList:
    print("Sentence:", sentence)

# generate candidate keywords
stopwordpattern = rake.build_stop_word_regex(stoppath)
phraseList = rake.generate_candidate_keywords(sentenceList, stopwordpattern)
# phraseList holds the candidate keywords
# this method does not work for phrases in which these boundaries are part of the actual phrase (e.g. .Net or Dr. Who).
# improvements can be made here
# Read more at https://www.airpair.com/nlp/keyword-extraction-tutorial#4Lc4GeP5t5cYe7OR.99
print("Phrases:", phraseList)

# calculate individual word scores
wordscores = rake.calculate_word_scores(phraseList)

# generate candidate keyword scores
keywordcandidates = rake.generate_candidate_keyword_scores(phraseList, wordscores)
# One issue here is that the candidates are not normalized in any way.
# As a result we may have keywords that look nearly identical:
# small scale production and small scale producers, or skim milk powder and skimmed milk powder.
# Ideally, a keyword extraction algorithm should apply stemming and other ways of normalizing keywords first.
# Applying stemming before keyword extraction would be another improvement.
for candidate in keywordcandidates.keys():
    print("Candidate: ", candidate, ", score: ", keywordcandidates.get(candidate))

# sort candidates by score to determine top-scoring keywords
sortedKeywords = sorted(keywordcandidates.items(), key=operator.itemgetter(1), reverse=True)
totalKeywords = len(sortedKeywords)

# for example, you could just take the top third as the final keywords
for keyword in sortedKeywords[0:(totalKeywords // 3)]:  # the final keywords are the top third
    print("Keyword: ", keyword[0], ", score: ", keyword[1])

print(rake_object.run(text))  # outputs all the keywords and their scores
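To make the "behind the scenes" steps easier to follow without the rake module or a stopword file, here is a minimal, self-contained sketch of the same pipeline: split the text into candidate phrases at punctuation and stopwords, score each word, then score each phrase as the sum of its word scores. It assumes the usual RAKE word score of degree divided by frequency. The names candidate_phrases, word_scores, phrase_scores and the tiny STOPWORDS set are hypothetical and are not part of the rake module, so treat this as an illustration rather than the library's actual code.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list (assumption for this sketch);
# the example above loads SmartStoplist.txt instead.
STOPWORDS = {"a", "and", "are", "for", "of", "over", "the", "these"}


def candidate_phrases(text, stopwords):
    """Split text into candidate phrases (lists of words)."""
    phrases = []
    # Punctuation ends a fragment; a stopword ends a phrase inside a fragment.
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def word_scores(phrases):
    """Score each word as degree / frequency (assumed RAKE-style scoring)."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts every word in the phrase, including the word itself.
            degree[word] += len(phrase)
    return {word: degree[word] / freq[word] for word in freq}


def phrase_scores(phrases, scores):
    """Score each phrase as the sum of the scores of its words."""
    return {" ".join(p): sum(scores[w] for w in p) for p in phrases}


sample = "Criteria of compatibility of a system of linear Diophantine equations"
phrases = candidate_phrases(sample, STOPWORDS)
ranked = sorted(phrase_scores(phrases, word_scores(phrases)).items(),
                key=lambda item: item[1], reverse=True)
print(ranked)
```

Because a word's degree grows with the length of the phrases it appears in, longer phrases such as "linear diophantine equations" tend to outrank single words, which is why the example above keeps only the top-scoring third of the candidates.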