@lesteve



https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=pypi&page=dataset
Source of truth, historical bug until July 2018

clickpy: https://clickpy.clickhouse.com
scikit-learn dashboard: https://clickpy.clickhouse.com/dashboard/scikit-learn
Copy and aggregation of the BigQuery DB. Historically there have been a few bugs, so you may want to double-check against Google BigQuery.

You can write your own SQL-like queries to get more information
Open to change on their issue tracker (UI and “packages that needs refresh” wording)
maybe one day: query + plots with Python using ibis on the ClickHouse DB
PyPI stats: https://pypistats.org 
PSF has recently taken over the maintenance https://github.com/crflynn/pypistats.org/issues/82
Advantage: can show numbers with mirrors and without mirrors. Only uses pip in its aggregated numbers. No per-package version info.
Pepy: https://pepy.tech
Advantage: can look at different versions.

Aggregate numbers. Paid option for a longer timeline and CI. Download numbers are inflated compared to pypistats.org (more info later)
top-pypi-packages: https://hugovk.github.io/top-pypi-packages
GitHub repo with one JSON file per month for the most downloaded projects. No need to write a SQL query. Small caveat: the query has changed a bit over time.
Hugo van Kemenade, CPython developer, with interesting blog posts on PyPI data
From Python packaging user guide
pip has a local cache (~/.cache/pip on Linux)
pip install my-package only counts once (per my-package version) (per Python version for packages with compiled code)
the scikit-learn sidekick https://github.com/probabl-ai/skore
Talk by Marie Sacksick “Enhancing ML workflows with skore” next (11:25) in Gaston Berger
Most downloads are from mirrors / something other than pip, uv, etc …

March 2025
| installer | downloads |
|---|---|
| bandersnatch | 771 |
| requests | 402 |
| uv | 336 |
| pip | 306 |
| Browser | 226 |
| NULL | 173 |
| … | … |
June 2025
| installer | downloads |
|---|---|
| pip | 1811 |
| uv | 814 |
| bandersnatch | 513 |
| NULL | 258 |
| Browser | 196 |
| requests | 78 |
| … | … |
installer type is the main source of discrepancy between the different websites
pypistats.org only keeps pip, the others keep all installers
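To make the discrepancy concrete, here is a minimal sketch that compares a pip-only count with an all-installer count, using only the rows listed in the two tables above. The "…" rows are not available, so the ratios are relative to the listed rows only, and the function names are made up for illustration.

```python
# Download counts copied from the two tables above (units as shown there).
# The "…" rows are unknown, so everything is relative to the listed rows.
MARCH_2025 = {
    "bandersnatch": 771, "requests": 402, "uv": 336,
    "pip": 306, "Browser": 226, "NULL": 173,
}
JUNE_2025 = {
    "pip": 1811, "uv": 814, "bandersnatch": 513,
    "NULL": 258, "Browser": 196, "requests": 78,
}


def installer_shares(counts):
    """Share of each installer among the listed rows."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def pip_only_ratio(counts):
    """pip downloads divided by downloads from all listed installers."""
    return counts["pip"] / sum(counts.values())


for label, counts in [("March 2025", MARCH_2025), ("June 2025", JUNE_2025)]:
    ratio = pip_only_ratio(counts)
    print(f"{label}: pip-only count is {ratio:.0%} of the all-installer count")
```

A site that keeps only pip and a site that keeps every installer can therefore report very different numbers for the same package and the same month.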
website statistics: 1.2M unique visitors per month
PyPI download stats: ~100M per month

Most downloaded scikit-learn release in 2025 is 1.0.2, released on Christmas Day 2021

Reminder: Python 3.7 EOL 2023-06-27
Found out through @hugovk blog post
BigQuery dataset has Linux distribution and version
92.4% of Python 3.7 downloads come from Amazon Linux 2
(Python 2 default, but you can install Python 3)
Query run in May 2025 (over one day):
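| name | version | fraction of downloads |
|---|---|---|
| Amazon Linux | 2 | 0.31 |
| Amazon Linux | 2023 | 0.22 |
| Ubuntu | 22.04 | 0.12 |
| Debian GNU/Linux | 12 | 0.07 |
| Ubuntu | 24.04 | 0.06 |
| Ubuntu | 20.04 | 0.05 |
| Debian GNU/Linux | 11 | 0.02 |
| Ubuntu | 18.04 | 0.01 |
| CentOS Linux | 7 | 0.01 |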
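A minimal sketch of how such a per-distribution breakdown could be queried. This is not the exact query used for the numbers above: the day is an arbitrary placeholder, the helper name is hypothetical, and the column names assume the public `bigquery-public-data.pypi.file_downloads` table. The code only builds the SQL string; running it requires a BigQuery client and a billing project.

```python
import datetime
import re

# Public PyPI downloads table on BigQuery.
TABLE = "bigquery-public-data.pypi.file_downloads"


def distro_query(project: str, day: datetime.date) -> str:
    """Build a query counting Linux downloads per distribution for one day."""
    # Reject anything that does not look like a project name, since the
    # value is interpolated into the SQL text.
    if not re.fullmatch(r"[A-Za-z0-9._-]+", project):
        raise ValueError(f"unexpected project name: {project!r}")
    return f"""
SELECT
  details.distro.name AS name,
  details.distro.version AS version,
  COUNT(*) AS downloads
FROM `{TABLE}`
WHERE file.project = '{project}'
  AND DATE(timestamp) = '{day.isoformat()}'
  AND details.system.name = 'Linux'
GROUP BY name, version
ORDER BY downloads DESC
""".strip()


# Placeholder day in May 2025; the actual day used is not stated.
print(distro_query("scikit-learn", datetime.date(2025, 5, 1)))
```

Restricting the query to a single day keeps the amount of scanned data, and hence the cost, small.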
scikit-learn depends on joblib
expectation: joblib is more downloaded than scikit-learn, right?
May 2025
scikit-learn 102M
joblib 87M
scikit-learn downloaded ~1.2x more than joblib 😱
possible hypothesis, no guarantee whatsoever:
conda: joblib downloaded 1.2x more than scikit-learn
❯ condastats overall joblib scikit-learn --start_month 2025-03 --end_month 2025-07 --monthly
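| pkg_name | time | downloads |
|---|---|---|
| joblib | 2025-03 | 2275295 |
| joblib | 2025-04 | 2616275 |
| joblib | 2025-05 | 2266739 |
| joblib | 2025-06 | 2216331 |
| joblib | 2025-07 | 2170978 |
| scikit-learn | 2025-03 | 1930584 |
| scikit-learn | 2025-04 | 2207607 |
| scikit-learn | 2025-05 | 1908129 |
| scikit-learn | 2025-06 | 1927962 |
| scikit-learn | 2025-07 | 1845839 |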
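The "1.2x" figures can be checked directly from the numbers above. A minimal sketch, with the conda counts copied from the condastats output and the PyPI counts from the May 2025 numbers (in millions):

```python
# Monthly conda download counts copied from the condastats output above.
JOBLIB = {
    "2025-03": 2275295, "2025-04": 2616275, "2025-05": 2266739,
    "2025-06": 2216331, "2025-07": 2170978,
}
SKLEARN = {
    "2025-03": 1930584, "2025-04": 2207607, "2025-05": 1908129,
    "2025-06": 1927962, "2025-07": 1845839,
}

# conda: joblib downloads relative to scikit-learn downloads, per month.
conda_ratios = {month: JOBLIB[month] / SKLEARN[month] for month in JOBLIB}
for month, ratio in conda_ratios.items():
    print(f"conda {month}: joblib / scikit-learn = {ratio:.2f}")

# PyPI, May 2025 (millions): the ratio goes the other way.
pypi_ratio = 102 / 87
print(f"PyPI 2025-05: scikit-learn / joblib = {pypi_ratio:.2f}")
```

On conda the dependency is downloaded more than the dependent for every month listed, which is the expected order; on PyPI the order is inverted.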
Exercise left to the reader: uv should show a similar pattern because it has a better constraint solver?
pip install scikit-learn but import sklearn
Historical decision: make pip install sklearn “just work”, and also prevent a malicious actor from taking the name
brownout strategy over one year (December 2022 - December 2023)
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True as last-resort escape mechanism
Naive attempt to disable the cache broke poetry
one sklearn release roughly once per month: 0.0.post3 for March 2023, 0.0.post5 for May, etc …
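A minimal sketch of what a brownout check with an escape hatch could look like. Only the environment variable name and the overall period come from the notes above; the schedule (dates and failing minutes per hour) and the function name are hypothetical and do not reproduce the actual sklearn package logic.

```python
import datetime

ESCAPE_VAR = "SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL"

# HYPOTHETICAL schedule: (start date, minutes per hour during which the
# install fails). The real schedule is not given in these notes.
HYPOTHETICAL_SCHEDULE = [
    (datetime.date(2022, 12, 1), 5),
    (datetime.date(2023, 6, 1), 30),
    (datetime.date(2023, 12, 1), 60),
]


def install_should_fail(now, environ):
    """Return True if `pip install sklearn` should error out at `now`."""
    # Last-resort escape mechanism: the user explicitly opts in.
    if environ.get(ESCAPE_VAR) == "True":
        return False
    # Pick the failing window of the latest schedule step already started.
    failing_minutes = 0
    for start, minutes in HYPOTHETICAL_SCHEDULE:
        if now.date() >= start:
            failing_minutes = minutes
    # Fail during the first `failing_minutes` minutes of every hour.
    return now.minute < failing_minutes


print(install_should_fail(datetime.datetime(2023, 1, 1, 10, 3), {}))
```

The point of a brownout is that failures become gradually more frequent, so users notice the deprecation before the install stops working entirely.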
PyPI stats seemed a good way to check that the brownout was working, right?

installer is mostly pip for 0.0.post5 and other transition releases, doesn’t make much sense …

boto3 x.y.z depends on botocore>=x.y.z
new release almost every day
similar inversion pattern for other boto3 dependencies: python-dateutil
botocore cached but not boto3 seems weird
six is in the top 20 most downloaded packages in 2025 (5 years after Python 2.7 EOL)
easy to misinterpret: people still care about Python 2 compatibility
reality: python-dateutil still depends on six
plenty of popular packages depend on python-dateutil: boto3, pandas, etc …
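A toy illustration of why six stays popular: every install of a package that transitively depends on it also downloads six. The graph below is limited to the packages named in these notes, plus the assumption that boto3 reaches python-dateutil through botocore; real dependency lists are longer.

```python
from collections import deque

# Toy dependency graph: package -> direct dependencies.
# Deliberately incomplete, restricted to packages mentioned in these notes.
DEPENDS_ON = {
    "boto3": ["botocore"],
    "botocore": ["python-dateutil"],
    "pandas": ["python-dateutil"],
    "python-dateutil": ["six"],
    "six": [],
}


def transitive_dependents(target, depends_on):
    """Return every package that pulls in `target`, directly or indirectly."""
    # Invert the graph: dependency -> packages that require it directly.
    required_by = {}
    for package, deps in depends_on.items():
        for dep in deps:
            required_by.setdefault(dep, set()).add(package)

    # Breadth-first walk up the inverted graph.
    found, queue = set(), deque([target])
    while queue:
        current = queue.popleft()
        for parent in required_by.get(current, ()):
            if parent not in found:
                found.add(parent)
                queue.append(parent)
    return found


print(sorted(transitive_dependents("six", DEPENDS_ON)))
```

None of the packages found this way needs to care about Python 2 for six to show up in the download numbers.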
details.ci field only in Google BigQuery db
CI, BUILD_ID, …
scikit-learn: only 10-15% of the downloads are from CI
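A minimal sketch of environment-variable based CI detection, which is the kind of signal a CI flag can be derived from. Only CI and BUILD_ID are named in these notes; the rest of the list is elided there and is not filled in here. The function name is made up for illustration.

```python
import os

# Only the two variables named in these notes; the full list is longer
# and is intentionally not guessed.
CI_ENV_VARS = ("CI", "BUILD_ID")


def looks_like_ci(environ=None):
    """Return True if any known CI environment variable is set."""
    environ = os.environ if environ is None else environ
    return any(name in environ for name in CI_ENV_VARS)


print(looks_like_ci({"CI": "true"}))
print(looks_like_ci({"HOME": "/home/user"}))
```

Such a check is a heuristic: CI systems that set none of the known variables are counted as regular downloads.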
Best highly hypothetical wild-guess
Goodhart’s law: When a measure becomes a target, it ceases to be a good measure
Not everything that can be counted counts, and not everything that counts can be counted (not Albert Einstein apparently)
Reality is complex => metrics help simplify the message, but they sometimes gain way too much importance and force people to play the metrics game IMO (e.g. Shanghai ranking for universities)
proxy metrics and big picture: I get it
numbers as a time-effective way to give credit to an opinion
bias: tend to stop looking when data matches the story you wanted to tell. Otherwise completely ignored.
Truth-O-meter in political fact-checking: mostly true, mostly false, pants on fire
Problem: fact-checking is always late to the party and never manages to correct wild claims/feel-good stories
seems fair
OKish
misleading

But at least I tried a little bit 😅
CHAOSS Practitioner Guide about interpreting metrics: https://chaoss.community/practitioner-guide-introduction/