Machine learning for everyone
High quality pythonic library
Community-driven development
In Parietal: decode brain activity (fMRI)
Widely used in: astronomy, genomics, etc.
Easy to use:
from sklearn import svm
classifier = svm.SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Consistent API for estimators
Optimised for speed: NumPy and Cython
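Because the API is consistent, estimators are interchangeable; a minimal sketch (dataset and models picked here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every estimator exposes the same fit/predict/score interface,
# so swapping models is a one-line change.
scores = {}
for Model in (SVC, RandomForestClassifier):
    clf = Model()
    clf.fit(X_train, y_train)
    scores[Model.__name__] = clf.score(X_test, y_test)
```

The same pattern carries over to pipelines, grid search and cross-validation, which only rely on this shared interface.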
~50 notifications per day from comments on PRs/issues
User support overwhelms the core devs. Reviewing PRs is the main bottleneck
Roads and Bridges by Nadia Eghbal
New York: 350 k$ Moore-Sloan grant
A. Mueller (full time). Students M. Kumar, V. Birodkar
Télécom ParisTech: 200 k€ WendelinIA grant + 12 k€ CDS
Programmers: T. Guillemot, T. Dupré. Students: M. Kumar, D. Sullivan, V.R. Rajagopalan, N. Goix
Inria Parietal: 120 k€ Inria, + 100 k€ WendelinIA + 50 k€ ANR + 30 k€ CDS
Programmers: O. Grisel, L. Estève, G. Lemaître, J. Van den Bossche. Students: A. Mensch, J. Schreiber, G. Patrini
> 400 k€ / year
http://contrib.scikit-learn.org
Not everything can (or needs to) go into scikit-learn
For cutting-edge algorithms, quick development, maturation
A nice template to start a project (testing, CI, ...) + visibility
Requirements: follow the scikit-learn API, provide docs and tests
10 projects in scikit-learn contrib currently
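Following the scikit-learn API essentially means implementing `fit` and `predict` with the usual conventions; a toy sketch (this `MajorityClassifier` is a made-up example, not an actual contrib project):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator: always predicts the most frequent class seen in fit."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        # Attributes learned during fit end with an underscore, by convention
        self.majority_ = values[np.argmax(counts)]
        return self  # fit returns self, so calls can be chained

    def predict(self, X):
        return np.full(len(X), self.majority_)
```

Because it respects these conventions, such an estimator plugs directly into scikit-learn tooling like `cross_val_score` and pipelines.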
Launched 2 weeks ago
companies: better visibility for the software they rely on, good for public relations
scikit-learn: permanent staff ("CDI") to consolidate the project, useful feedback from users
See blog post
Someone else may solve your problems
One advantage of being part of the very dynamic Python ecosystem
High-level interfaces: collections with an interface very similar to numpy/pandas
Use case: pandas dataframe bigger than RAM
Low-level interfaces for parallel computing
Other goodies:
Try dask in your browser via binder
dask-ml: fit scikit-learn models on data bigger than RAM, or parallelise scikit-learn on a cluster. Integrates with XGBoost.
dask-jobqueue: smoothly transition your existing Python code from your machine to an HPC cluster (SLURM, PBS, etc.)
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(<slurm_specific>)  # queue, walltime, memory, ...
cluster.scale(4)  # launch 4 jobs

client = Client(cluster)
# generic dask code, agnostic to the cluster it runs on
Similar packages exist for running dask on Kubernetes or on a Hadoop/YARN cluster.
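As a taste of the low-level interface, `dask.delayed` builds a task graph lazily and then executes it in parallel; a minimal sketch (the `load`/`process` functions are made-up stand-ins):

```python
from dask import delayed

@delayed
def load(i):
    # stand-in for reading one chunk of data
    return list(range(i))

@delayed
def process(chunk):
    return sum(chunk)

# Build the task graph lazily: nothing runs yet
totals = [process(load(i)) for i in range(4)]
grand_total = delayed(sum)(totals)

# Execute the graph (threaded scheduler by default)
result = grand_total.compute()
```

The same code runs unchanged on a laptop or, with a `distributed` Client attached, on a cluster.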
Lots of clever people in Particle Physics + well-organised and reasonably well-funded field
Why is there not more activity from Particle Physics in the Python open-source space?
Contrast with Astrophysics (astropy)
Very interested to hear your thoughts on this!
Vision: Machine learning as a means, not an end
Versatile library: the right level of abstraction. Close to research, but seeking different tradeoffs
NumPy arrays as data containers. Fast enough.
Ensure code quality and maintainability