快速开发机器学习原型:Ramp

jopen 10年前

Ramp是一个Python库用于快速搭建机器学习解决方案原型。它是一个轻量级基于pandas的机器学习框架,可插入已有的Python机器学习和统计工具(如scikit-learn, rpy2等)。Ramp提供了一个简单的声明性语法探索功能,算法和快速,高效地转换。

Why Ramp?

  • Clean, declarative syntax

  • Complex feature transformations

    Chain and combine features:

Normalize(Log('x')) Interactions([Log('x1'), (F('x2') + F('x3')) / 2]) 
Reduce feature dimension:
DimensionReduction([F('x%d'%i) for i in range(100)], decomposer=PCA(n_components=3)) 
Incorporate residuals or predictions to blend with other models:
Residuals(simple_model_def) + Predictions(complex_model_def) 
  • Data context awareness

    Any feature that uses the target ("y") variable will automatically respect the current training and test sets. Similarly, preparation data (a feature's mean and stdev, for example) is stored and tracked between data contexts.

  • Composability

    All features, estimators, and their fits are composable, pluggable and storable.

  • Easy extensibility

    Ramp has a simple API, allowing you to plug in estimators from scikit-learn, rpy2 and elsewhere, or easily build your own feature transformations, metrics, feature selectors, reporters, or estimators.

快速入门

Getting started with Ramp: Classifying insults

Or, the quintessential Iris example:

import pandas  from ramp import *  import urllib2  import sklearn  from sklearn import decomposition      # fetch and clean iris data from UCI  data = pandas.read_csv(urllib2.urlopen(      "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"))  data = data.drop([149]) # bad line  columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']  data.columns = columns      # all features  features = [FillMissing(f, 0) for f in columns[:-1]]    # features, log transformed features, and interaction terms  expanded_features = (      features +      [Log(F(f) + 1) for f in features] +      [          F('sepal_width') ** 2,          combo.Interactions(features),      ]  )      # Define several models and feature sets to explore,  # run 5 fold cross-validation on each and print the results.  # We define 2 models and 4 feature sets, so this will be  # 4 * 2 = 8 models tested.  shortcuts.cv_factory(      data=data,        target=[AsFactor('class')],      metrics=[          [metrics.GeneralizedMCC()],          ],      # report feature importance scores from Random Forest      reporters=[          [reporters.RFImportance()],          ],        # Try out two algorithms      model=[          sklearn.ensemble.RandomForestClassifier(              n_estimators=20),          sklearn.linear_model.LogisticRegression(),          ],        # and 4 feature sets      features=[          expanded_features,            # Feature selection          [trained.FeatureSelector(              expanded_features,              # use random forest's importance to trim              selectors.RandomForestSelector(classifier=True),              target=AsFactor('class'), # target to use              n_keep=5, # keep top 5 features              )],            # Reduce feature dimension (pointless on this dataset)          [combo.DimensionReduction(expanded_features,                              decomposer=decomposition.PCA(n_components=4))],            # Normalized features          [Normalize(f) for f in expanded_features],      ]  )

项目主页:http://www.open-open.com/lib/view/home/1404267926436