scikit-learn Blog

Update on array API adoption in scikit-learn

2026-03-05T00:00:00+00:00

Author: Lucy Liu

Note: this blog post is a cross-post of a Quansight Labs blog post.

The Consortium for Python Data API Standards developed the Python array API standard to define a consistent interface for array libraries, specifying core operations, data types, and behaviours. This enables ‘array-consuming’ libraries (such as scikit-learn) to write array-agnostic code that can be run on any array API compliant backend. Adopting array API support in scikit-learn means that users can pass arrays from any array API compliant library to functions that have been converted to be array-agnostic. This is useful because it allows users to take advantage of array library features, such as hardware acceleration, most notably via GPUs.
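To make "array-agnostic" concrete, here is a minimal sketch of a function that looks up the namespace of its input and only calls functions defined by the standard. The helper `get_namespace` and the function `max_abs_scale` are hypothetical names used for illustration, not scikit-learn's internal implementation. The sketch assumes the input either exposes the standard's `__array_namespace__` method or is a NumPy array; libraries that do not expose that method natively would need a wrapper such as array-api-compat.

```python
import numpy as np


def get_namespace(x):
    """Return the array API namespace of `x`, falling back to NumPy.

    Hypothetical helper for illustration only.
    """
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np


def max_abs_scale(X):
    """Scale each column of `X` by its maximum absolute value."""
    xp = get_namespace(X)
    scale = xp.max(xp.abs(X), axis=0)
    # Leave all-zero columns untouched instead of dividing by zero.
    scale = xp.where(scale == 0, xp.ones_like(scale), scale)
    return X / scale


X = np.asarray([[1.0, -2.0, 0.0], [2.0, 4.0, 0.0]])
print(max_abs_scale(X))
```

Because the function never names NumPy directly after the namespace lookup, the same body could run on any array type whose namespace provides `max`, `abs`, `where` and `ones_like`.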

Indeed, GPU support in scikit-learn has been of interest for a long time - 11 years ago, we added an entry to our FAQ page explaining that we had no plans to add GPU support in the near future due to the software dependencies and platform specific issues it would introduce. By relying on the array API standard, however, these concerns can now be avoided.

In this blog post, I will provide an update on the array API adoption work in scikit-learn since its initial introduction in version 1.3 two years ago. Thomas Fan’s blog post provides details on the status when array API support was initially added.

Current status

Since the introduction of array API support in version 1.3 of scikit-learn, several key developments have followed.

Vendoring `array-api-compat` and `array-api-extra`

Scikit-learn now vendors both array-api-compat and array-api-extra. array-api-compat is a wrapper around common array libraries (e.g., PyTorch, CuPy, JAX) that bridges gaps to ensure compatibility with the standard. It enables adoption of backwards incompatible changes while still allowing array libraries time to adopt the standard slowly. array-api-extra provides array functions not included in the standard but deemed useful for array-consuming libraries.

We chose to vendor these now much more mature libraries in order to avoid the complexity of conditionally handling optional dependencies throughout the codebase. This approach also follows precedent, as SciPy also vendors these packages.

Array libraries supported

Scikit-learn currently supports CuPy ndarrays, PyTorch tensors (testing against all devices: ‘cpu’, ‘cuda’, ‘mps’ and ‘xpu’) and NumPy arrays. JAX support is also on the horizon. The main focus of this work is addressing in-place mutations in the codebase. Follow PR #29647 for updates.

Beyond these libraries, scikit-learn also tests against array-api-strict, a reference implementation that strictly adheres to the array API specification. The purpose of array-api-strict is to help automate compliance checks for consuming libraries and to enable development and testing of array API functionality without the need for GPU or other specialized hardware. Array libraries that conform to the standard and pass the array-api-tests suite should be accepted by scikit-learn and SciPy, without any additional modifications from maintainers.

Estimators and metrics with array API support

The full list of metrics and estimators that now support array API can be found in our Array API support documentation page. The majority of high impact metrics have now been converted to be array API compatible. Many transformers are also now supported, notably LabelBinarizer which is widely used internally and simplifies other conversions.

Conversion of estimators is much more complicated as it often involves benchmarking different variations of code or consensus gathering on implementation choices. It generally requires many months of work by several maintainers. Nonetheless, support for LogisticRegression, GaussianNB, GaussianMixture, Ridge (and family: RidgeCV, RidgeClassifier, RidgeClassifierCV), Nystroem and PCA has been added. Work on GaussianProcessRegressor is also underway (follow at PR #33096).

Handling mixed array namespaces and devices

scikit-learn takes a unique approach among ‘array-consuming’ libraries by supporting mixed array namespace and device inputs. This design choice enables the framework to handle the practical complexities of end-to-end machine learning pipelines.

String-valued class labels are common in classification tasks and enable users to work with interpretable categories rather than integer codes. NumPy is currently the only array library with string array support, meaning that any workflow involving both GPU-accelerated computation and string labels necessarily involves mixed array type inputs.

Mixed array input support also enables flexible pipeline workflows. Pipelines provide significant value by chaining preprocessing steps and estimators into reusable workflows that prevent data leakage and ensure consistent preprocessing. However, they have an intentional design limitation: pipeline steps can transform feature arrays (X) but cannot modify target arrays (y). Allowing mixed array inputs means a pipeline can include a FunctionTransformer step that moves feature data from CPU to GPU to leverage hardware acceleration, while allowing the target array, which cannot be modified, to remain on CPU.

For example, mixed array inputs enable a pipeline where string classification features are encoded on CPU (as only NumPy supports string arrays), converted to torch CUDA tensors, then passed to the array API-compatible RidgeClassifier for GPU-accelerated computation:

from functools import partial

import torch
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, TargetEncoder

pipeline = make_pipeline(
    # Encode string categories with average target values
    TargetEncoder(),
    # Convert feature array `X` to Torch CUDA device
    FunctionTransformer(partial(torch.asarray, dtype="float32", device="cuda")),
    RidgeClassifier(solver="svd"),
)

Work on adding mixed array type inputs for metrics and estimators is underway and expected to progress quickly. This work includes developing a robust testing framework, including for pipelines using mixed array types (follow PR #32755 for details).

Finally, we have also revived our work to support the ability to fit and predict on different namespaces/devices. This allows users to train models on GPU hardware but deploy predictions on CPU hardware, optimizing costs and accommodating different resource availability between training and production environments. Follow PR #33076 for details.

Challenges

The challenges of array API adoption remain largely unchanged from when this work began. These are also common to other array-consuming libraries, with a notable addition: the need to handle array movement between namespaces and devices to support mixed array type inputs.

Array API Standard is a subset of NumPy’s API

The array API standard only includes widely-used functions implemented across most array libraries, meaning many NumPy functions are absent. When such a function is encountered while adding array API support, we have the following options:

add the function to array-api-extra - this allows other array-consuming libraries to benefit and allows sharing of maintenance burden, but is only relevant for more widely used functions
add our own implementation in scikit-learn - these functions live in sklearn/utils/_array_api.py
check if SciPy implements an array API compatible version of the function

The quantile function illustrates this decision-making process. quantile is not included in the standard as it is not widely used (outside of scikit-learn) and while it is implemented in most array libraries, the set of quantile methods supported and their APIs vary. Currently, scikit-learn maintains its own array API compatible version that supports both weights and NaNs, but due to the maintenance burden we decided to investigate alternatives. SciPy has an array API compatible implementation, but it did not support weights. We thus investigated adding quantile to array-api-extra; however, during this effort, SciPy decided to add weight support. Thus, we ultimately decided to transition to the SciPy implementation once our minimum SciPy version allows.
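To illustrate what "supporting weights" means for a quantile function, here is a minimal NumPy-only sketch of one quantile method (the inverted CDF). It is not the scikit-learn or SciPy implementation: the function name is hypothetical, NaNs are not handled, and only a single quantile level `q` in [0, 1] is supported.

```python
import numpy as np


def weighted_quantile_inverted_cdf(x, weights, q):
    """Smallest value whose cumulative weight reaches q * total weight."""
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Sort the values and carry the weights along.
    order = np.argsort(x)
    x_sorted = x[order]
    cum_weights = np.cumsum(weights[order])
    # First position where the cumulative weight reaches the target.
    target = q * cum_weights[-1]
    idx = np.searchsorted(cum_weights, target, side="left")
    idx = min(int(idx), x_sorted.shape[0] - 1)
    return float(x_sorted[idx])


print(weighted_quantile_inverted_cdf([1, 2, 3, 4], [1, 1, 1, 1], 0.5))
print(weighted_quantile_inverted_cdf([1, 2, 3, 4], [1, 1, 1, 5], 0.5))
```

With equal weights the result matches the unweighted inverted-CDF quantile; putting more weight on a sample pulls the quantile towards it.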

Compiled code

Many performance-critical parts of scikit-learn are written using compiled code extensions in Cython, C or C++. These directly access the underlying memory buffers of NumPy arrays and are thus restricted to CPU.

Metrics and estimators that rely on compiled code handle this in one of two ways: they either convert arrays to NumPy first, or maintain two parallel branches of code, one for NumPy (compiled) and one for other array types (array API compatible). When performance is less critical or array API conversion provides no gains (e.g., confusion_matrix), we convert to NumPy. When performance gains are significant, we accept the maintenance burden of dual code paths. This was the case for LogisticRegression; the extensive process required for making such implementation decisions can be seen in PR #32644.

Unspecified behaviour in the standard

The array API standard intentionally leaves some function behaviors unspecified, permitting implementation differences across array libraries. For example, the order of unique elements is not specified for the unique_* functions and as of NumPy version 2.3, some unique_* functions no longer return sorted values. This will require code amendments in cases where sorted output was relied upon.

Similarly, NaN handling is also unspecified for sort; however, in this case, all array libraries currently supported by scikit-learn follow NumPy’s NaN semantics, placing NaNs at the end. This consistency eliminates the need for special handling code, though comprehensive testing remains essential when adding support for new array libraries.

Device transfer

Mixed array namespace and device inputs necessitate conversion of arrays between different namespaces and devices. This presented a number of considerations and challenges.

The array API standard adopted DLPack as the recommended data interchange protocol. This protocol is widely implemented in array libraries and offers an efficient, C ABI compatible protocol for array conversion. While this provided us with an easy way to implement these transfers, there were limitations. Cross-device transfer capability was only introduced in DLPack v1, released in September 2024. This meant that only the latest PyTorch and CuPy versions have support for DLPack v1. Moreover, not all array libraries have adopted support yet. We therefore implemented a ‘manual’ fallback; however, this requires conversion via NumPy when the transfer involves two non-NumPy arrays. Additionally, there are no DLPack tests in array-api-tests, a testing suite to verify standard compliance, leaving DLPack implementation bugs easier to overlook. Despite these challenges, scikit-learn will benefit from future improvements, such as addition of a C-level API for DLPack exchange that bypasses Python function calls, offering significant benefit for GPU applications.
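The "DLPack first, manual fallback second" idea can be sketched roughly as follows. This is an illustration under stated assumptions, not scikit-learn's actual conversion code: `convert_array` is a hypothetical name, `xp` is assumed to be a namespace providing the standard's `from_dlpack` and `asarray`, and the fallback assumes the source data can be viewed by NumPy (i.e. it lives on CPU).

```python
import numpy as np


def convert_array(x, xp, device=None):
    """Move `x` into namespace `xp`, preferring DLPack (hypothetical helper)."""
    try:
        if device is None:
            return xp.from_dlpack(x)
        # Cross-device transfer relies on DLPack v1 support in both libraries.
        return xp.from_dlpack(x, device=device)
    except (AttributeError, TypeError, ValueError, BufferError, RuntimeError):
        # Manual fallback: go through NumPy, which costs an extra conversion.
        x_np = np.asarray(x)
        if device is None:
            return xp.asarray(x_np)
        return xp.asarray(x_np, device=device)


source = np.arange(4.0)
print(convert_array(source, np))
```

The fallback branch is where the extra cost comes from: when neither side is NumPy, the data makes an intermediate stop in a NumPy array.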

Beyond the technical considerations, there were also user interface considerations. How should we inform users that these conversions, which incur memory and performance cost, are occurring? We decided against warnings, which risk being ignored or becoming a nuisance, and to instead clearly document this behaviour. Additionally, different devices have different data type limitations; for example, Apple MPS only supports float32. How best to handle these differences when performing conversions while ensuring users are informed of precision impacts is an ongoing consideration.

A quick benchmark

Array API support for Ridge regression was added in version 1.5, enabling GPU-accelerated linear models in scikit-learn. Combined with support of several transformers, this allows for complete preprocessing and estimation pipelines on GPU.

The following benchmark shows the use of the MaxAbsScaler transformer followed by Ridge regression using randomly generated data with 500,000 samples and 300 features. The benchmarks were run on AMD Ryzen Threadripper 2970WX CPU, NVIDIA Quadro RTX 8000 GPU and Apple M4 GPU (Metal 3).

The figure below shows the performance speed up on CuPy, Torch CPU and Torch GPU relative to NumPy.

Performance speedup relative to NumPy across different backends.

The observed speedups are representative of performance gains achievable with sufficiently large datasets on datacenter-grade GPUs for linear algebra-intensive workloads. Mobile GPUs, such as those in laptops, would typically yield more modest improvements.

Note that scikit-learn’s Ridge regressor currently only supports the ‘svd’ solver. We selected this solver for the initial implementation as it exclusively uses standard-compliant functions available across all backends and is the most stable solver. Support for the ‘cholesky’ solver is also underway (see details in PR #29318).

Looking forward

As of version 1.8, array API support is still in experimental mode and thus not enabled by default. However, we welcome early adopters and interested users to try it and report any issues. See our documentation for details on enabling array API support.

Before removing experimental status, we would like to:

develop a system for automatically documenting functions and classes that support array API, potentially with the ability to add relevant details
add mixed array type input support
support fit and predict on different hardware by allowing conversion of fitted estimators between namespaces/devices using utility functions
improve testing, in particular for the new mixed array type functionalities
improve documentation, including adding an example to our gallery
decide on the minimal dependency versions required
get real world user feedback

Alongside these infrastructure and framework improvements, we look forward to adding support for more estimators. These improvements will deliver production-ready GPU support and flexible deployment options to scikit-learn users. We welcome community involvement through testing and feedback throughout this development phase.

Acknowledgements

Work on array API in scikit-learn has been a combined effort from many contributors. This work was partly funded by CZI and NASA Roses.

I would like to thank Olivier Grisel, Tim Head and Evgeni Burovski for helping me with my array API questions.

Enhancing user experience through interactive inspection

2026-01-06T00:00:00+00:00

Author: Dea María Léon

User experience (UX) has always been an important focus for scikit-learn. UX encompasses many aspects, but here we will focus specifically on how easy it is for the user to understand scikit-learn models during development, especially while using tools like Jupyter notebooks.

First visualizations

Initial work to allow users to inspect their models interactively began in 2019, when Thomas J. Fan introduced HTML visualizations for estimators. He continued to build on this foundation with additional improvements in subsequent contributions.

Lack of resources to go forward

In June 2023, issue 26595 was opened by Gaël Varoquaux outlining several potential enhancements to the HTML displays. These ideas stemmed from direct interactions with users, which clearly highlighted the need for further work in this area. Unfortunately, due to a lack of resources, the issue remained open for approximately a year and a half.

Wellcome grant awarded to `scikit-learn`

It was not until the end of 2023, when Guillaume Lemaitre applied for a grant with the help of NumFOCUS, that the broader topic of Predictive model evaluation and inspection was formalized. Enhancing user experience through interactive inspection is an essential part of this effort and falls within the scope of the grant.

The grant was awarded to scikit-learn by the Chan Zuckerberg Initiative (CZI) through its Essential Open-Source Software for Science (EOSS) program. It is funded by The Wellcome Trust and administered by NumFOCUS. Thanks to this financial support, work is well underway, and several objectives from the issue have already been completed. See the grant application here.

First milestone: Added interactive parameters table for each element

The first milestone was introduced in scikit-learn version 1.7, released in June 2025. A parameters table was added to the HTML representation of models, displaying parameter names and their corresponding values. Non-default parameters (those explicitly set by the user) are highlighted. In addition, a copy-to-clipboard button is available for each parameter name. The parameter name that is copied to the clipboard is the fully qualified name, which is shown on hover as well. The parameters table is collapsed by default and can be opened by the user.

The following two images show a pipeline table before and after the milestone.

HTML visualization before scikit-learn version 1.7

HTML visualization with scikit-learn version 1.7

This feature was further enhanced in version 1.8, released in December 2025. We added tooltips that provide documentation for each parameter, as well as links to the online documentation. See the GIF below or this example for more details: Displaying estimators and complex pipelines.

HTML visualization after scikit-learn 1.8
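For readers who want to look at this representation outside a notebook, the following sketch renders a small pipeline to an HTML string with `sklearn.utils.estimator_html_repr`. The pipeline itself is an arbitrary example chosen for illustration; what the HTML contains (parameters table, tooltips) depends on the installed scikit-learn version.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# C=10.0 is a non-default value, so recent versions highlight it in the table.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=10.0))

# Same HTML that a notebook would display for `pipe`.
html = estimator_html_repr(pipe)
print(len(html))
```

The resulting string can be written to an `.html` file and opened in a browser.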

Planned improvements

More features are now being implemented. In particular, users will be able to visualize feature names and values, display fitted attributes and further improve the overall appearance of the interactive displays.

Interview with Virgil Chan, scikit-learn Team Member

2025-11-26T00:00:00+00:00

Author: Reshama Shaikh, Virgil Chan

BIO: Virgil Chan is currently a Forward Deployed Engineer - Pre-Sales at Union.ai. Before that, he worked as a consultant in the San Francisco Bay Area, specialising in predictive data analytics and machine learning. Earlier, he studied mathematics before moving into data science. Virgil joined the scikit-learn team as a Contributor Experience Team member in December 2024.

GitHub: @virchan
LinkedIn: @virgil-chan
Website: https://virchan.github.io

Tell us about yourself.

My name is Virgil, and I’m currently working as a Forward Deployed Engineer – Pre-Sales at Union.ai. Based in San Jose, California, I previously worked as a consultant, using libraries from the scientific Python ecosystem on data science and machine-learning projects, including medical data analysis, traffic-network prediction, and model evaluation. Before deciding that computers are more fun, I was doing mathematical research in topology.

How did you first get involved in open source?

I first got involved in open source during the COVID-19 lockdown. I used that time to study Python programming, data science, analytics, and machine learning, and that’s when I discovered libraries like NumPy, Pandas, scikit-learn, NetworkX, and TensorFlow. Once I became more confident in my skills, I started working as a consultant and used these libraries to deliver data-driven solutions for clients.

We would love to learn of your open source journey.

I was transitioning from academia into software development, and I quickly learnt that companies valued hands-on experience more than an advanced degree. At the same time, the rise of GPU-driven workloads and LLM-based solutions made my earlier consulting projects look less impressive on paper. I ended up stuck in the infinite loop of no-job-no-experience.

Even though I came from a non-traditional background, and my resume didn’t match what recruiters and ATS systems usually look for, I’ve always believed that my experience is something I can build myself. Since companies weren’t keen on training junior developers, open source became one of the not-so-many viable paths. I started looking for a project where I could grow, be useful, and apply my academic training in a meaningful way. That search naturally led me to scikit-learn.

How did you get involved in scikit-learn?

My first PR to scikit-learn (scikit-learn/scikit-learn#27913) was a classic “good first issue”: adding the URL of a scikit-learn example to the relevant places in the documentation. I opened it in December 2023 and it was merged into the main branch in March 2024. Maren helped me navigate the codebase and understand the CI workflow, which gave me a solid foundation for later contributions. Even though I’m now more experienced with the contributing workflow, I still revisit that PR from time to time to remind myself of the challenges first-time contributors face, and how I can support them.

My next PR (scikit-learn/scikit-learn#29709) was more technical, fixing a bug in the (root) mean squared log error function. The expected behaviour was to check that inputs were in the domain of $\log(1 + x)$, but the implementation at the time checked the domain of $\log(x)$ instead. It was one of the few issues I fully understood and knew how to solve, so I volunteered to create a PR. Adrin reviewed it and mentored me throughout the process. Once everything looked good and the CI passed, he asked me to add array API support to the function. And that’s where the fun began.

I had no idea what the array API was, but I already had the habit of reading discussions and merged PRs in my spare time. With a bit of Googling, I quickly understood what needed to be done and the broader importance of the array API project. In fact, completing the array API project has become one of my mid-term goals for my scikit-learn work. Under the guidance of Adrin, Guillaume, Olivier, and Omar, my PRs improved, and contributing became even more rewarding because of how supportive the maintainers were. I also started reviewing PRs, especially from first-time contributors working on the same “good first issue” I began with. In December 2024, I joined the scikit-learn team.

I’m honoured that the team welcomed me and trusted me with more responsibility, such as representing scikit-learn at the Scientific Python Developer Summit in May 2025, implementing temperature scaling as a new feature (with Christian), and having the ability to run CUDA CI myself. It feels good to pass the same positivity I received back into the community.

To which OSS projects and communities do you contribute?

I’m also interested in scaling machine-learning algorithms, so I’ve been exploring CUDA and cuML as well.

What is alluring about OSS?

Open source fosters a collaborative environment where everyone wins: end-users, maintainers, and contributors. Because it is volunteer-driven, it becomes easier to recognise that the problem itself is the problem, the bug or the issue, rather than the people involved. As a result, the usual institutional complications, such as power or ego struggles, conflicts of interest over funding, or pressure from deadlines, are far less likely to drag the project down. People have more freedom to focus on solving problems, which creates an ideal environment for exploration, experimentation, collaboration, learning, and growth.

Open source has given me the chance to grow, develop new skills, and broaden my perspective, something I’ve been battling since finishing college. By trading my time for responsibility, I’ve found open source to be a meaningful and genuinely rewarding experience.

What are your favorite resources, books, courses, conferences, etc?

I found the interview between scikit-learn and Code for Thought on YouTube. The maintainers shared their open-source journeys from how they got started to how they became involved in scikit-learn, which I found inspiring and motivating. For example, I can’t agree more with Gael’s point that “open source should be spontaneous” and that “a diversity of opinion will make better software.” I also learned from Adrin that I could get more involved in the project by becoming the second reviewer for a PR, which gave me the confidence to start reviewing PRs. I think this interview can help people understand the project from a more human and non-technical perspective.

What are your hobbies, outside of work and open source?

If I’m done with work and house chores, I usually listen to music. I enjoy classical music (Mozart, Brahms, Rachmaninoff, etc.), and I’m currently getting more exposure to Chopin’s work. I also like Rock ‘n’ Roll (Led Zeppelin, Eric Clapton, Deep Purple, etc.), and I find that AC/DC can “push me to eleven” whenever I’m stuck at work.

I also enjoy reading novels. At the moment I’m reading The Silmarillion by Tolkien, and my to-read list keeps growing.

I like hanging out with cats as well. I volunteer with an animal rescue group in San Jose, where I help care for the cats in their sanctuary and assist at adoption fairs.

scikit-learn Completes the GitHub Secure Open Source Training

2025-08-16T00:00:00+00:00

Author: Reshama Shaikh

Summary

scikit-learn was honored to be selected to participate in Cohort 2 of the GitHub Secure Open Source Fund (OSF) Training Program. Cohort 1 took place earlier in 2025 with 19 projects, and Cohort 2 took place with 52 projects during June 2025.

Original post: GH Secure OSS Announcement

It was an intense 3-week training program, with over 90 open source maintainers joining the training. Read the announcement from GitHub: Securing the supply chain at scale: Starting with 71 important open source projects

There were numerous workshops delivered by experts in the GitHub Security Lab. For many of these workshops, the learning materials are publicly available, and they are shared below.

GitHub Security Lab

GitHub has its own security department, and GitHub Security Lab’s mission is to empower developers and secure open source.

GitHub Security Lab: Resources

Original post: GitHub Security Lab

Resources for Security Training

The program included many sessions led by experts in the field. Below we share the training materials that are available to the public.

Configuring private vulnerability reporting for a repository

Owners and administrators of public repositories can allow security researchers to report vulnerabilities securely in the repository by enabling private vulnerability reporting.

OpenSSF Scorecard
Secure by design: A UX toolkit

CodeQL: From Zero to Hero

This workshop introduces fundamentals of security research and static analysis used when looking for vulnerabilities in software. They use an example of a simple vulnerability, walk through how CodeQL could detect it, and provide examples on how the audience could use CodeQL to find vulnerabilities themselves.

slides: Finding Vulnerabilities with CodeQL

Original post: Finding Vulnerabilities with CodeQL

Developing Secure Software

This course includes specific tips on how to use and develop open source and other software securely. Learn the security basics to develop software that is hardened against attacks, and understand how you can reduce the damage and speed the response when a vulnerability is exploited.

It was developed by the Open Source Security Foundation (OpenSSF), a cross-industry collaboration that brings together leaders to improve the security of open source software by building a broader community, targeted initiatives, and best practices.

Online, Self Paced
16-20 Hours of Course Material
Quizzes and Hands-on Labs

Original post: LFD121: Developing Secure Software

OSS-Fuzz

Fuzz testing is a well-known technique for uncovering programming errors in software.

Original post: OSS-Fuzz

Secure Code Game

Secure Code Game is a GitHub Security Lab initiative that provides an in-repo learning experience in which learners secure intentionally vulnerable code. At the same time, this is an open source project that welcomes your contributions as a way to give back to the community.

Original post: Secure Code Game

Participate in Future Cohorts of the GitHub Secure Open Source Training

If you are a maintainer of an open source project, this training is an excellent opportunity to secure your project with guidance from highly trained experts in the security field. Applications are open.

References

Securing the supply chain at scale: Starting with 71 important open source projects (11-Aug-2025)
TechCrunch: GitHub launches $1.25M open source fund with a focus on security (19-Nov-2024)
GitHub Secure Open Source Fund
Eclipse Foundation Security Policy
Linux Foundation Security Policy

Blogs from Participating Open Source Projects

OpenCV: OpenCV’s Participation in the GitHub Secure Open Source Fund
Bootstrap: Bootstrap at GitHub Secure Open Source Fund
Cobra & Viper: Cobra & Viper Fortify Security as Part of GitHub Secure Open Source Fund
Zitadel: A Leap Forward in Security: Our Journey with the GitHub Secure Open Source Fund

Acknowledgments

Thank you to the funders and ecosystem partners of the GitHub Secure Open Source Fund.

Funding Partners: Alfred P. Sloan Foundation, American Express, Chainguard, Datadog, Herodevs, Kraken, Mayfield, Microsoft, Shopify, Stripe, Superbloom, Vercel, Zerodha, 1Password

Ecosystem Partners: Ecosyste.ms, CURIOSS, Digital Data Design Institute Lab for Innovation Science, Digital Infrastructure Insights Fund, Microsoft for Startups, Mozilla, OpenForum Europe, Open Source Collective, OpenUK, Open Technology Fund, OpenSSF, Open Source Initiative, OpenJS Foundation, University of California, Santa Cruz OSPO, Sovereign Tech Agency, SustainOSS

Skolar: an open-source initiative to democratize open data science

2025-06-30T00:00:00+00:00

Author: Skolar, Pénélope Gittos

This blog post has been submitted by Probabl, a sponsor of scikit-learn. The scikit-learn project values educational efforts that build and nurture a strong, vibrant open-source community. The goal is straightforward: give everyone, everywhere, the tools they need to easily grasp, engage with, and meaningfully contribute to data science using open-source software. This mission is shared and actively supported by Probabl, a company that helps maintain scikit-learn by employing many of its core contributors and investing in its long-term sustainability. With Probabl’s support and a deep commitment from the community, the scikit-learn ecosystem continues building bridges between research, software, and education.

When the Inria scikit-learn MOOC (Massive Open Online Course) first went live, the community got a front-row seat to the amazing impact of practical, accessible and open learning. Created by several core developers and maintainers of scikit-learn—now working at Probabl—the MOOC has reached over 40,000 learners worldwide, clearly highlighting the demand for organized, hands-on resources that blend theory with real-world practice.

Today, Probabl is excited to introduce Skolar, a new, fully open-source educational initiative, built directly from your feedback and all the lessons we’ve learned along the way. Developed and extended by those same core developers of scikit-learn, Skolar is designed specifically for data science practitioners, offering hands-on, high-quality learning resources grounded in real-world applications and open-source values.

Skolar exists to boost our shared values: openness, teamwork, and practicality. It offers clear, interactive tutorials and structured courses carefully designed to match industry challenges and specialized use-cases. But even more importantly, it captures the true spirit of open source: encouraging collaboration, peer-to-peer learning, and guidance from experts.

Right now, we’re just at the beginning. Today, you can dive into our Scikit-learn Associate Practitioner online course, adapted from the popular Inria MOOC but enhanced with new material on unsupervised learning, especially clustering.

The next stages, professional and expert levels, will be released soon. We’ll also add more courses covering other open-source libraries such as skrub (for data wrangling), hazardous (for survival analysis), and fairlearn (for fairness). Additionally, our scikit-learn team is planning to create industry-specific modules tackling real-world needs in fields like healthcare, finance, medicine, and beyond.

At its core, Skolar is about empowering people through education, driven entirely by our passion for openness and collaboration. We firmly believe that true open data science begins with community-built learning resources. We warmly welcome you, whether you’re a contributor, learner, teacher, or just someone curious, to join us. Help shape Skolar’s future and support open-source education in data science.

Create your account on Skolar today: https://skolar.probabl.ai

Contribute to the scikit-learn course contents, or contribute to the learning platform’s backend or frontend.

Changes and development of scikit-learn’s developer API

2024-12-12T00:00:00+00:00

Author: Adrin Jalali

Historically, scikit-learn’s API has been divided into public and private. The public API is intended for users, and the private API is used internally in scikit-learn to develop new features and estimators. However, many of those private functionalities have become essential for third parties who develop scikit-learn estimators outside the scikit-learn codebase.

When it comes to our public API, we have very strict and high standards on backward compatibility. The rule of thumb is that no change should require a change in users’ code unless we warn about it for two release cycles, which means we give users a year to update their code.

On the other hand, we have no such guarantees or constraints on our private API. This creates a problem for third-party developers who would like to use the same methods that scikit-learn developers use to build their estimators: a private API that changes constantly and without prior warning makes their code hard to maintain.

As a result, we’ve been working on creating a developer API which would sit somewhere between our public and private API in terms of backward compatibility. That means we intend to try to keep that API stable, and if needed, introduce changes with one release cycle warning.

In the past few releases, we’ve slowly introduced more functionalities under this umbrella. __sklearn_clone__ and __sklearn_is_fitted__ are two examples.

In the 1.6 release, we focused on the testing infrastructure and estimator tag system. Estimator tags used to be private, and we were not sure about their design. In the 1.6 release, new tags are introduced and using them looks like the following:

from sklearn.base import BaseEstimator, ClassifierMixin

class MyEstimator(ClassifierMixin, BaseEstimator):

  ...

  def __sklearn_tags__(self):
    tags = super().__sklearn_tags__()
    # modify tags here
    tags.non_deterministic = True
    return tags

The new tags mostly follow the same structure as the old tags, but there are certain changes to them. The main change is that the old _xfail_checks is no longer present in the new tags. That tag was used to tell the common testing tools about the tests which are known to fail and are to be skipped. That information is now directly passed to the test functionalities. The old way of skipping a test was the following:

from sklearn.base import BaseEstimator, ClassifierMixin

class MyEstimator(ClassifierMixin, BaseEstimator):

  ...

  def _more_tags(self):
    return {
      "_xfail_checks": {
        "check_to_skip_name": "this check is known to fail",
        # ...
      }
    }

Calling check_estimator, or using parametrize_with_checks with pytest, would then automatically ignore those checks for the estimator.

Instead, in this release, you pass that information directly to those methods:

from sklearn.utils.estimator_checks import check_estimator, parametrize_with_checks

# MyEstimator is the estimator class defined in the snippets above.
CHECKS_EXPECTED_TO_FAIL = {
  "check_to_skip_name": "this check is known to fail",
  # ...
}

# Using check_estimator
def test_with_check_estimator():
  check_estimator(MyEstimator(), expected_failed_checks=CHECKS_EXPECTED_TO_FAIL)

# Using parametrize_with_checks
@parametrize_with_checks(
  [MyEstimator()],
  expected_failed_checks=lambda est: CHECKS_EXPECTED_TO_FAIL
)
def test_with_parametrize_with_checks(estimator, check):
  check(estimator)

While working on the testing infrastructure, we have also been improving our tests, so this release has a particularly high number of changes to their names and to what they do. The changes will make it easier for developers to fix issues with their estimators. Note that you can now pass legacy=False to both check_estimator and parametrize_with_checks to include only strictly API related tests.

The above changes mean developers need to update their estimators and, depending on what they use, write scikit-learn version specific code to support multiple scikit-learn versions. To make that process easier, we’ve worked on a package called sklearn_compat. You can either depend on it as a package dependency, or vendor a single file inside your project. At the moment this project is in its infancy and might change in the future, but hopefully it helps developers out there.
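For the tag system specifically, one way to support several scikit-learn versions without an extra dependency is to define both the new and the old hook on the same class. The sketch below is an assumption about how such code could look, not a description of what sklearn_compat does; it reuses the MyEstimator name and the non_deterministic tag from the examples above.

```python
from sklearn.base import BaseEstimator, ClassifierMixin


class MyEstimator(ClassifierMixin, BaseEstimator):

    def __sklearn_tags__(self):
        # Hook read by scikit-learn >= 1.6.
        tags = super().__sklearn_tags__()
        tags.non_deterministic = True
        return tags

    def _more_tags(self):
        # Hook read by scikit-learn < 1.6; older versions never call
        # __sklearn_tags__, so the method above is simply unused there.
        return {"non_deterministic": True}


est = MyEstimator()
old_style_tags = est._more_tags()
```

The cost of this approach is that every tag has to be declared twice, so the two methods need to be kept in sync by hand.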

If you think there are missing functionalities in the developer API, please let us know and give us feedback on our issue tracker.

Announcing the launch of the scikit-learn user survey

2024-09-02T00:00:00+00:00

Author: Inessa Pawson, François Goupil

We are excited to announce the launch of the scikit-learn user survey! Scikit-learn continues to evolve thanks to contributions from its diverse user community. As we plan for future releases, we want to ensure we are focusing on what matters most to you — our users.

The goal of this survey is to better understand how users interact with the library, identify any pain points, learn about the features you find most useful, and what’s missing. This is your chance to have a say in how the library grows and adapts to meet the evolving needs of the machine learning community.

The survey will take about 15 minutes of your time. It is available in Arabic, French, English, Japanese, Mandarin, Spanish, and Portuguese. You have the option to remain completely anonymous, and the data collected will be used solely for the purpose of improving scikit-learn.

This user survey is a truly collaborative effort. We would like to thank the teams from probabl, University of Oxford (UK), and POSSEE OpenTeams, as well as many scikit-learn contributors, for their time and effort in designing and translating it.

Once the survey closes, we’ll analyze the responses and publish the findings in a follow-up blog post.

To take the survey, visit: https://forms.gle/p5P7AweCJCbFMzfo6. The survey will remain open until October 14th, 2024, and we encourage you to share it with your colleagues and extended network.

We value every contribution in our community, and we’re committed to making scikit-learn even better. Your feedback is the foundation upon which scikit-learn will continue to grow and evolve. We look forward to hearing from you!

Chan Zuckerberg Initiative considers scikit-learn an Essential Open Source Software

2024-08-06T00:00:00+00:00

Author: Guillaume Lemaitre, Lucy Liu

We are delighted to announce that scikit-learn has been awarded a grant from the Chan Zuckerberg Initiative (CZI)’s Essential Open Source Software for Science (EOSS) program. This grant is funded by the Wellcome Trust. As in previous rounds, this cycle supports open-source software projects that are essential to biomedical research. This is the third time that the CZI EOSS program has supported scikit-learn.

In this new grant, we will focus on improving the evaluation and inspection of predictive models.

Predictive models evaluation & inspection

When building a machine learning pipeline for a specific research problem, two key aspects are closely connected: (i) design of the pipeline and (ii) assessment, analysis, and inspection of it. Researchers strive to identify the optimal pipeline, maximizing specific evaluation metrics, while also seeking to explain the validity and rationale behind the pipeline’s predictions. This is the cornerstone of answering research questions. With this proposal we aim to improve and extend the available scikit-learn tools.

scikit-learn provides building blocks for model evaluation and statistical analysis of results. Originally, this information was presented in a raw format and required expertise from scientists to create intuitive reports for outreach to peers and outsiders. Recently, the scikit-learn community developed displays to easily generate visual figures for communicating such results. However, these displays are still in their early development stages and do not leverage all available statistical analysis tools (e.g., cross-validation) from scikit-learn. Thus, we aim to expand these displays so that they use the right statistical tools and thereby promote the adoption of best practices when reporting results. Additionally, we intend to create new displays to support common analysis tasks that are not yet covered in scikit-learn.

In the domain of model inspection, we aim to address several areas: (i) model inspection during training, (ii) enhancing user experience through interactive inspection, and (iii) model explainability. First, during the training of a pipeline, researchers are interested in monitoring the internal characteristics of the model, a long-standing need that scikit-learn does not yet address. We want to build upon some initial work by implementing a “callback” framework that allows users to track these internal parameters. Next, researchers commonly use interactive tools such as Jupyter Notebook to develop pipelines. scikit-learn started some efforts to visually and interactively display pipelines in these environments. However, there is room for improvement in terms of user interaction and accessibility. Finally, as scikit-learn is widely used as a reference package, it is crucial to improve the section of the library dedicated to model explainability. We aim to improve the documentation and user experience with the existing explainability tools, making sure that users pick the appropriate tool for their use cases. In addition, we propose to work on a scikit-learn enhancement proposal (SLEP) to define a common API for model explainability within scikit-learn. Ultimately, the goal is to come to a consensus to provide scikit-learn end-users with a consistent experience when using model explainability tools.

On top of all these items, we intend to continue working on the general maintenance of the project, addressing bug reports and performance regressions. As a community-driven project, we also want to dedicate time to reviewing external contributions.

Involved people

To execute this project, we plan the following hires:

Lucy Liu (Quansight Labs) will work about half-time on the project, on topics related to displays and feature importance.
We will hire full-time interns to work on the other part of the project. The initial plan is to hire two interns for a period of 6 months each and repeat this process for the next 2 years. We want to provide opportunities to underrepresented groups in the field of machine learning and data science, similarly to previous initiatives (cf. NumFOCUS Small Development Grant).

Past CZI EOSS grants

In the past scikit-learn has been awarded two grants from the CZI EOSS program:

CZI EOSS Cycle 1 helped create the HistGradientBoostingClassifier and HistGradientBoostingRegressor estimators. These estimators are the equivalent of the gradient boosting models implemented in LightGBM and XGBoost.
CZI EOSS Cycle 4 extended scikit-learn to work better with missing values and categorical data in several estimators.

Both grants allowed us to maintain and enhance scikit-learn to better serve the community.

Interview with Adam Li, scikit-learn Team Member

2024-07-24T00:00:00+00:00

Author: Reshama Shaikh, Adam Li

BIO: Adam is currently a Postdoctoral Research Scientist at Columbia University in the Causal Artificial Intelligence Lab, directed by Dr. Elias Bareinboim. He is an NSF-funded Computing Innovation Research Fellow. He did his PhD in biomedical engineering, specializing in computational neuroscience and machine learning at Johns Hopkins University working with Dr. Sridevi V. Sarma in the Neuromedical Control Systems group. He also jointly obtained an MS in Applied Mathematics and Statistics with a focus in statistical learning theory, optimization and matrix analysis. He was fortunate to be an NSF-GRFP fellow, Whitaker International Fellow, Chateaubriand Fellow and ARCS Chapter Scholar during his time at JHU. Adam officially joined the scikit-learn team as a maintainer in July 2024.

GitHub: @adam2392
LinkedIn: @adam2392
Website: https://adam2392.github.io

Link to scikit-learn contributions (issues, pull requests):

Tell us about yourself.

I currently live in New York City, where I work on theoretical and applied AI research through the lens of causal inference, statistical modeling, dynamical systems and signal processing. My current research is focused on telling a causal story, specifically in the case where one has multiple distributions of data from the same causal system. For example, one may have access to brain recordings from monkeys and humans. Given these heterogeneous datasets, I am interested in answering: what causal relationships can we learn? This is known as the causal discovery problem, where given data, one attempts to learn what causes what. Another problem that I work on that is highly relevant to generative AI is the problem of causal representation learning. Here, I develop theory and train deep neural networks to understand causality among latent factors. Specifically, we demonstrate how to leverage multiple datasets and a causal neural network to generate data that is causally realistic. This can enable more robust data generation from general latent variable models.

How did you first become involved in open source and scikit-learn?

I first got involved in open source as a user. I was making the switch from Matlab to Python and started using packages like numpy and scipy pretty regularly. In my PhD research, I dealt with a lot of electrophysiological data (i.e. EEG brain recordings). I was writing hundreds of lines of code to load and preprocess data, and it was always changing based on different constraints. That was when I discovered MNE-BIDS, a Python package within the MNE framework for reading and writing brain recording data in a structured format. This changed my life because now my preprocessing and data loading code was a few lines of code that adhered to an open standard tested by thousands of researchers. I realized the value of open source, and began contributing in my spare time.

We would love to learn of your open source journey.

I first started contributing to open-source in the MNE organization. This package implements data structures for the processing and analysis of neural recording data (e.g. MEG, EEG, iEEG data). I contributed over 70 pull requests in the MNE-BIDS package, and subsequently was invited to be a maintainer for MNE-BIDS and MNE-Python. Later on, I participated in a Google Summer of Code to port the connectivity submodule within MNE-Python to a new package, known as MNE-Connectivity. I added new data structures and algorithms to support further development of connectivity algorithms for neural recording data. After that, I also worked with a team on porting a neural network architecture from Matlab to the MNE framework to automatically classify ICA derived components. This became known as MNE-ICALabel. These projects gave me the experience necessary to work in a large asynchronous team environment that is common in OSS. They also taught me how to begin contributing to an OSS project. This led me to scikit-learn.

I first got involved in scikit-learn as a user who was heavily interested in the decision tree models in scikit-learn (random forest, randomized trees). I was interested in contributing a new oblique decision tree model that was a generalization of the existing random forest model. However, the code was not easily added to scikit-learn, and currently the decision to include it is inconclusive. Throughout this process, I learned about the challenges and intricacies of maintaining such a large OSS project as scikit-learn. It is not trivial to simply add new features to a large OSS project because code comes with a maintenance cost, and should fit with the current internal design. At that point in time, there were very few maintainers who were able to maintain the tree submodule, and as such new features were included conservatively.

I was eager to improve the project to enable more exciting features for the community, so I began contributing to scikit-learn starting with smaller issues such as documentation improvements, or minor bug fixes to get acquainted with the codebase. I also refactored various Cython code to begin upgrading the codebase, especially in the tree submodule. Throughout this process, I identified other projects the maintainer team was working on, and also contributed there. For example, I added metadata routing to a variety of different functions and estimators in scikit-learn. I also began reviewing PRs for the tree submodule and metadata routing where I had knowledge. I also added missing-value support for extremely randomized tree models (called ExtraTrees in scikit-learn). This allows users to pass in data that contains missing values (encoded as np.nan) to ExtraTrees. Around this time, I was invited to join the maintainer team of scikit-learn. More recently, I have taken on the project to add categorical data support to the decision tree models, which will make random forests and extremely randomized tree models more performant and better able to handle real-world settings where categorical data is common.

To which OSS projects and communities do you contribute?

I currently primarily contribute to scikit-learn, PyWhy (a community for causal inference in Python), and also develop my own OSS project: treeple. Treeple is an exciting package that implements different decision tree models beyond those offered in scikit-learn with an efficient Cython implementation stemming from the scikit-learn tree internals.

What do you find alluring about OSS?

OSS is so exciting because of the impact it has. Everyone from private projects to other OSS projects will use OSS. Any fixes to documentation, performance improvements, or new features can impact the workflows of potentially millions of people. This is what makes contributing to OSS so exciting. Moreover, this impact ensures that best practices are usually carried out in these projects, and it’s a great playground to learn from the best, while giving back to the larger community.

What pain points do you observe in community-led OSS?

Right now, community-led OSS moves very slowly in most places. This is for a number of very good reasons: i) not releasing buggy features that may impact millions of people, and ii) backwards compatibility. One of the challenges of maintaining a high-quality OSS project is that you would like to satisfy your users, who may all utilize different components of the project from different versions. As such, many community-led OSS projects take a conservative approach when implementing new features and new ideas. However, there may be many exciting, better features that are already known by the community, but still lack an OSS implementation.

I think this can be partially solved by increased funding for OSS, so OSS maintainers and developers are able to dedicate more time to maintaining and improving the projects. In addition, I think this can be improved if more developers in the community contribute to said OSS projects. I hope that I have convinced you though that contributing to OSS is impactful and highly educational.

If we discuss how far OS has evolved in 10 years, what would you like to see happen?

I think more interoperability and integrated workflows for projects will make projects that utilize OSS more streamlined and efficient. For example, right now there are different array libraries (e.g. numpy, cupy, xarray, pytorch, etc.), which all support some manner of an n-dimensional array, but with a slightly different API. This makes it very painful to transition across different libraries that use different arrays. In addition, there are multiple dataframe libraries, such as pandas and polars, and this problem of API consistency also arises there.

Some work has been done on the Array-API front to allow different array libraries to serve as backends given a common API. This will enable GPU acceleration for free without a single code change, which is great! This will be exciting because users will eventually only have to write code in a single way, and can then leverage whichever array/dataframe library has the advantages that best fit their use case.

What are your hobbies, outside of work and open source?

I enjoy running, trying new restaurants and bars, cooking and reading. I’m currently training for a half-marathon, where my goal is to run under 8 minutes per mile. I’m also trying to perfect a salad with an Asian-themed dressing. In a past life, I was a bboy (breakdancer) for ten years until I stopped in graduate school because I got busy (and old).

Interview with Yao Xiao, scikit-learn Team Member

2024-07-18T00:00:00+00:00

Author: Reshama Shaikh, Yao Xiao

Yao Xiao recently earned his undergraduate degree in mathematics and computer science. He will be pursuing a Master’s degree in Computational Science and Engineering at Harvard SEAS. Yao joined the scikit-learn team in February 2024.

Tell us about yourself.

My name is Yao Xiao and I live in Shanghai, China. At the time of this interview, I have just received my Bachelor’s degree in Honors Mathematics and Computer Science at NYU Shanghai, and I’m going to pursue a Master’s degree in Computational Science and Engineering at Harvard SEAS. My current research interests are in networks and systems (e.g. sys4ml and ml4sys), but this may change in the future.

- GitHub: @Charlie
- LinkedIn: @yao-xiao
- Website: https://charlie-xiao.github.io

How did you first become involved in open source and scikit-learn?

In my junior year I took a course at NYU Courant called Open Source Software Development where we needed to make contributions to an open source software as our final project - and I chose scikit-learn.

We would love to learn of your open source journey.

I was lucky to get involved in a pretty easy meta-issue when I first started contributing to scikit-learn. I made quite a few PRs towards that issue, familiarizing myself with the coding standards, contributing workflow etc., during which I gradually explored the codebase and learned a lot from maintainers about how to write better code. After that meta-issue was completed, I decided to continue contributing since I enjoyed the experience, so I started looking through the open issues, tried reproducing and investigating them, then opened PRs for those that I was able to solve. It is a cycle of becoming familiar with more parts of the codebase and then being able to make more PRs, and so on. While contributing to scikit-learn, sometimes there are also issues to solve upstream, so I also had opportunities to contribute to projects like pandas and pydata-sphinx-theme. Up till today I’m still far from familiar with the entire scikit-learn project, but I will definitely continue the amazing open-source journey.

To which OSS projects and communities do you contribute?

I have contributed to scikit-learn, pandas, pydata-sphinx-theme, and sphinx-gallery. I’m also writing some small pieces of software that I have decided to make open source.

What do you find alluring about OSS?

It is amazing to feel that my code is being used by so many people all around the world through contributing to open source projects. Well, it might be inappropriate to say “my code”, but I do feel like I am making some actual contributions to the community instead of just writing code for myself. Also, OSS makes me care about code quality and so on instead of merely making things “work”, which is very important for programmers but not really taught in school.

What pain points do you observe in community-led OSS?

Collaboration can lead to better code but also slows down the development process. Especially when there are not enough reviewers around, issues and PRs can easily get stale or forgotten. But I would say it’s more like a tradeoff rather than a pain point.

If we discuss how far OS has evolved in 10 years, what would you like to see happen?

I couldn’t say about the past 10 years since I’ve only been involved for about one and a half years, but regarding the scientific Python ecosystem I would like to see better coordination across projects (which is already happening). For instance a common interface for array libraries and dataframe libraries would allow downstream dependents to easily provide more flexible support for different input/output types, etc. And as a Chinese I would also hope that open source can thrive in my country some day as well.

What are your favorite resources, books, courses, conferences, etc?

As for physical books I would recommend The Pragmatic Programmer by Andy Hunt and Dave Thomas, and Refactoring: Improving the Design of Existing Code by Martin Fowler and Kent Beck. As for courses I like MIT’s The Missing Semester of Your CS Education. For learning Python in particular, The Python Tutorial in the official Python documentation is good enough for me. By the way, I want to mention that the documentation of most languages and popular packages is very nice, and it is the best place to learn the most up-to-date information.

What are your hobbies, outside of work and open source?

I would say my largest hobby is programming (not for school, not for work, just for fun). I’ve recently been fascinated with Tauri and wrote a lot of small desktop applications for myself in my spare time. Apart from this I also love playing the piano and I’m an anime lover, so I often listen to or play piano versions of anime theme songs (mostly arranged by Animenz).