Open-Source Knowledge Extraction System: DeepDive
jopen
10 years ago
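DARPA has made available DeepDive, an open-source project similar to Watson, built mainly on SQL and Python. Watson is well known as a strong question-answering (QA) system, whereas DeepDive focuses on extracting structured information from unstructured data on the Internet, applying a series of post-processing steps, building knowledge bases, and extracting relations.
DeepDive differs from traditional systems in several ways: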
- DeepDive is aware that data is often noisy and imprecise: names are misspelled, natural language is ambiguous, and humans make mistakes. Taking such imprecision into account, DeepDive computes calibrated probabilities for every assertion it makes. For example, if DeepDive produces a fact with probability 0.9 it means the fact is 90% likely to be true.
- DeepDive is able to use large amounts of data from a variety of sources. Applications built using DeepDive have extracted data from millions of documents, web pages, PDFs, tables, and figures.
- DeepDive allows developers to use their knowledge of a given domain to improve the quality of the results by writing simple rules that inform the inference (learning) process. DeepDive can also take into account user feedback on the correctness of the predictions, with the goal of improving the predictions.
- DeepDive is able to use the data to learn "distantly". In contrast, most machine learning systems require tedious training for each prediction. In fact, many DeepDive applications, especially at early stages, need no traditional training data at all!
- DeepDive’s secret is a scalable, high-performance inference and learning engine. For the past few years, we have been working to make the underlying algorithms run as fast as possible. The techniques pioneered in this project are part of commercial and open source tools including MADlib, Impala, a product from Oracle, and low-level techniques, such as Hogwild!. They have also been included in Microsoft's Adam.
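
To make the idea of calibrated probabilities from the first point more concrete, here is a minimal Python sketch of how one might check calibration on a set of labeled predictions. It is not DeepDive code: the function name `calibration_table`, the bucket count, and the sample data are all hypothetical and chosen only for illustration. The idea is to group assertions by predicted probability and compare each group's average probability with the fraction of assertions that turned out to be true.

```python
from collections import defaultdict


def calibration_table(predictions, num_buckets=10):
    """Group (probability, is_true) pairs into equal-width buckets.

    Returns a list of (bucket_index, count, mean_probability, observed_accuracy)
    tuples. This is a hypothetical helper for illustration, not a DeepDive API.
    """
    buckets = defaultdict(list)
    for prob, is_true in predictions:
        # Clamp the index so that prob == 1.0 lands in the last bucket.
        idx = min(int(prob * num_buckets), num_buckets - 1)
        buckets[idx].append((prob, is_true))

    table = []
    for idx in sorted(buckets):
        items = buckets[idx]
        mean_prob = sum(p for p, _ in items) / len(items)
        accuracy = sum(1 for _, t in items if t) / len(items)
        table.append((idx, len(items), mean_prob, accuracy))
    return table


# Made-up data: ten facts asserted with probability 0.9, nine of them correct.
sample = [(0.9, True)] * 9 + [(0.9, False)]

for idx, count, mean_prob, accuracy in calibration_table(sample):
    print(f"bucket {idx}: n={count} mean_prob={mean_prob:.2f} observed={accuracy:.2f}")
```

In a well-calibrated system, the mean probability and the observed accuracy in each bucket should be close to each other, which is what "a fact with probability 0.9 is 90% likely to be true" means in practice.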