Python, NLP, and Machine Learning resources collected by Tryolabs
A curated list of books, libraries, apps and papers we love at Tryolabs. We work with blazing startups and help them build complex projects using Python, NLP & Machine Learning.
Overview
We create amazing Internet & Mobile products for blazing startups. We combine the Python ecosystem with machine learning and natural language processing technologies to build backend-heavy apps with artificial intelligence components. We follow agile methodologies in order to develop MVPs and full products the lean way.
Training
The training period at Tryolabs is at least two weeks. Its goal is to get up to speed with the tools the company uses. This repo contains a list of tutorials and documentation useful for becoming familiar with the Django/Python ecosystem, as well as some ML and NLP techniques.
During the training period, we recommend pairing for at least an hour a day with a mentor who has experience on the team, to get to know the work process and the tools. The goal is for the mentor to talk the person in training through the tasks they are working on.
Development Tools
Python
virtualenv and virtualenvwrapper
A very useful development tool that lets us create an isolated Python environment for every project, keeping the set of libraries used in the project separate from the system's.
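As a minimal sketch of the idea, the snippet below creates an isolated environment programmatically. It uses the standard library's `venv` module as a stand-in for virtualenv itself, so it runs without extra installs; the helper names `create_env` and `env_python` and the directory name `demo-env` are hypothetical.

```python
import sys
import tempfile
import venv
from pathlib import Path


def create_env(parent: Path, name: str = "demo-env") -> Path:
    """Create an isolated environment under `parent` and return its path.

    `name` and this helper are illustrative only.
    """
    env_dir = parent / name
    # with_pip=False keeps the sketch fast and offline; real projects
    # usually want pip available inside the environment.
    venv.EnvBuilder(with_pip=False).create(env_dir)
    return env_dir


def env_python(env_dir: Path) -> Path:
    """Return the expected interpreter path inside the environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env = create_env(Path(tmp))
        print("interpreter:", env_python(env))
        print("config file present:", (env / "pyvenv.cfg").exists())
```

Libraries installed with the environment's own interpreter stay inside that directory, so deleting the directory removes the project's dependencies without touching the system Python.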
iOS
cocoapods
Package manager for iOS projects. Handles the setup and update of Xcode projects to speed up the integration of new components.
nomad
CLI for iOS projects. Has various tools to perform common tasks from the command line (e.g., generating, signing, and distributing an .ipa over the air).
Other
Vagrant
Vagrant is a tool for creating isolated, reproducible development environments using virtual machines. It is usually used with VirtualBox, but also supports VMware and other virtualization systems.
Docker
Docker is a tool for creating and managing software containers.
Metamon
Metamon is a tool to automatically set up an isolated execution environment for Django applications.
Source Control
Just use git. Good resources are the Pro Git book by Scott Chacon and GitHub's help site.
Editors and IDEs
Standards and Conventions
PEP 8 is the definitive reference for Python coding style. The pep8 package can be used to scan code and find parts that don't conform to it.
With Emacs, the emacs-pep8 package can be used to run the pep8.py script.
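To give a feel for what a style checker does, here is a toy sketch that flags a single PEP 8 rule, the 79-character line limit. It is not the pep8 tool and does not use its API; `find_long_lines` and the sample source are hypothetical.

```python
from typing import List, Tuple

MAX_LINE_LENGTH = 79  # PEP 8's limit for lines of code


def find_long_lines(
    source: str, limit: int = MAX_LINE_LENGTH
) -> List[Tuple[int, int]]:
    """Return (line_number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]


if __name__ == "__main__":
    # Second line is deliberately too long.
    sample = "x = 1\n" + "y = " + "1 + " * 25 + "1\n"
    for number, length in find_long_lines(sample):
        print(f"line {number}: {length} characters")
```

A real checker covers many more rules (whitespace, naming, imports, blank lines), which is why running the actual tool is preferable to hand-rolled checks.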
Deploying
We use Ansible for all our deployment and server orchestration tasks.
Databases
Relational
Just use Postgres. It's not just a database, it's a complete "relational database framework" that provides full-text search, GIS and extensive documentation of every knob and lever.
NoSQL
Are you sure Postgres can't do what you want?
Document
Key-Value
Graph
Libraries
Machine Learning
Web
Books
This list of books represents, in our opinion, a good balance between theory and practice. We don't expect everyone to read all of these; rather, each person should pick a few books from this common list.
Machine Learning
- Machine Learning: The Art and Science of Algorithms that Make Sense of Data
- Learning scikit-learn: Machine Learning in Python
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- Principles of Data Mining
- Foundations of Statistical Natural Language Processing
- Bayesian Reasoning and Machine Learning
- Gaussian Processes for Machine Learning
- Information Theory, Inference and Learning Algorithms
Information Retrieval
- Managing Gigabytes: Compressing and Indexing Documents and Images
- Introduction to Information Retrieval
- Information Retrieval: Algorithms and Heuristics
Computer Vision
- Concise Computer Vision
- Computer Vision: Algorithms and Applications
- Learning OpenCV: Computer Vision in C++ with the OpenCV Library
Scala
Software Architecture
Programming Language
Papers
Information Retrieval
General
- Functional Geometry
- Pictures: A simple structured graphics model
- The Problem with Threads (Threads are Evil)
Web Design
Aggregators
Icons
Libraries and Resources
Tech Stack
First things first: machines are meant to be identical. Ansible provisions your local Vagrant box the same way it provisions a server. This way the production environment is the same as the development one, and we avoid hard-to-find bugs while being fairly certain that if something works in dev, it will work in prod.
Specifically, machines look like this:
- The application runs inside a virtualenv, even if it's the only application on the server. This makes it easy to add other applications should the need arise. For instance, you might want to run an IPython Notebook server with a Notebook that provides some analytics and charts of the data in your database, without contaminating the app's environment with IPython's dependencies.
- Nginx is used as a reverse proxy, sending requests from the Internet to the Django server and responses the other way around. Nginx can take care of load balancing, caching, HTTP acceleration and some degree of security.
- Supervisor is used to keep the actual application server running, as well as running other scripts or processes. Every process is logged to disk for debugging.
- Postgres is the database, of course.
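The process-supervision item in the list above can be illustrated with a toy loop that restarts a process when it fails and appends its output to a log file. This is a sketch of the idea only, not how Supervisor is implemented or configured; `keep_running`, `max_restarts`, and the log file name are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def keep_running(
    cmd: List[str], log_path: Path, max_restarts: int = 3
) -> int:
    """Run `cmd`, restarting it each time it exits with a non-zero status.

    Output is appended to `log_path`. Returns how many times the command
    was started. A real supervisor runs indefinitely and backs off
    between restarts; the cap keeps this sketch finite.
    """
    starts = 0
    with open(log_path, "a") as log:
        while starts <= max_restarts:
            starts += 1
            log.write(f"starting attempt {starts}\n")
            log.flush()  # keep our marker ahead of the child's output
            result = subprocess.run(
                cmd, stdout=log, stderr=subprocess.STDOUT
            )
            if result.returncode == 0:
                break  # clean exit: nothing to restart
    return starts


if __name__ == "__main__":
    # A stand-in "application" that always crashes.
    failing = [
        sys.executable, "-c", "import sys; print('boom'); sys.exit(1)"
    ]
    with tempfile.TemporaryDirectory() as tmp:
        log_file = Path(tmp) / "app.log"
        print("attempts:", keep_running(failing, log_file, max_restarts=2))
        print(log_file.read_text())
```

Writing both the restart markers and the child's output to the same file mirrors the point made in the list: when something crashes, the log on disk shows what happened and how often.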
Our tech stack looks roughly like this on most projects:
This is, of course, an approximation. Some projects use NoSQL databases in addition to relational ones, others use other things like message queues, some use specific tools like Varnish instead of Nginx for HTTP acceleration.
https://github.com/tryolabs/awesome-tryo