.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/miscellaneous/plot_set_output.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_miscellaneous_plot_set_output.py>`
        to download the full example code or to run this example in your browser via JupyterLite or Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_miscellaneous_plot_set_output.py:


================================
Introducing the `set_output` API
================================

.. currentmodule:: sklearn

This example will demonstrate the `set_output` API to configure transformers to
output pandas DataFrames. `set_output` can be configured per estimator by calling
the `set_output` method or globally by setting `set_config(transform_output="pandas")`.
For details, see
`SLEP018 <https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep018/proposal.html>`__.

.. GENERATED FROM PYTHON SOURCE LINES 16-17

First, we load the iris dataset as a DataFrame to demonstrate the `set_output` API.

.. GENERATED FROM PYTHON SOURCE LINES 17-24

.. code-block:: default

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(as_frame=True, return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    X_train.head()


.. GENERATED FROM PYTHON SOURCE LINES 25-27

To configure an estimator such as :class:`preprocessing.StandardScaler` to return
DataFrames, call `set_output`. This feature requires pandas to be installed.

.. GENERATED FROM PYTHON SOURCE LINES 27-36

.. code-block:: default


    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().set_output(transform="pandas")

    scaler.fit(X_train)
    X_test_scaled = scaler.transform(X_test)
    X_test_scaled.head()


.. GENERATED FROM PYTHON SOURCE LINES 37-38

`set_output` can be called after `fit` to configure `transform` after the fact.

.. GENERATED FROM PYTHON SOURCE LINES 38-48

.. code-block:: default

    scaler2 = StandardScaler()

    scaler2.fit(X_train)
    X_test_np = scaler2.transform(X_test)
    print(f"Default output type: {type(X_test_np).__name__}")

    scaler2.set_output(transform="pandas")
    X_test_df = scaler2.transform(X_test)
    print(f"Configured pandas output type: {type(X_test_df).__name__}")


.. GENERATED FROM PYTHON SOURCE LINES 49-51

In a :class:`pipeline.Pipeline`, `set_output` configures all steps to output
DataFrames.

.. GENERATED FROM PYTHON SOURCE LINES 51-61

.. code-block:: default

    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_selection import SelectPercentile

    clf = make_pipeline(
        StandardScaler(), SelectPercentile(percentile=75), LogisticRegression()
    )
    clf.set_output(transform="pandas")
    clf.fit(X_train, y_train)


.. GENERATED FROM PYTHON SOURCE LINES 62-64

Each transformer in the pipeline is configured to return DataFrames. This
means that the final logistic regression step contains the feature names of the input.

.. GENERATED FROM PYTHON SOURCE LINES 64-66

.. code-block:: default

    clf[-1].feature_names_in_


.. GENERATED FROM PYTHON SOURCE LINES 67-69

Next we load the titanic dataset to demonstrate `set_output` with
:class:`compose.ColumnTransformer` and heterogenous data.

.. GENERATED FROM PYTHON SOURCE LINES 69-76

.. code-block:: default

    from sklearn.datasets import fetch_openml

    X, y = fetch_openml(
        "titanic", version=1, as_frame=True, return_X_y=True, parser="pandas"
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)


.. GENERATED FROM PYTHON SOURCE LINES 77-79

The `set_output` API can be configured globally by using :func:`set_config` and
setting `transform_output` to `"pandas"`.

.. GENERATED FROM PYTHON SOURCE LINES 79-105

.. code-block:: default

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.impute import SimpleImputer
    from sklearn import set_config

    set_config(transform_output="pandas")

    num_pipe = make_pipeline(SimpleImputer(), StandardScaler())
    num_cols = ["age", "fare"]
    ct = ColumnTransformer(
        (
            ("numerical", num_pipe, num_cols),
            (
                "categorical",
                OneHotEncoder(
                    sparse_output=False, drop="if_binary", handle_unknown="ignore"
                ),
                ["embarked", "sex", "pclass"],
            ),
        ),
        verbose_feature_names_out=False,
    )
    clf = make_pipeline(ct, SelectPercentile(percentile=50), LogisticRegression())
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)


.. GENERATED FROM PYTHON SOURCE LINES 106-108

With the global configuration, all transformers output DataFrames. This allows us to
easily plot the logistic regression coefficients with the corresponding feature names.

.. GENERATED FROM PYTHON SOURCE LINES 108-114

.. code-block:: default

    import pandas as pd

    log_reg = clf[-1]
    coef = pd.Series(log_reg.coef_.ravel(), index=log_reg.feature_names_in_)
    _ = coef.sort_values().plot.barh()


.. GENERATED FROM PYTHON SOURCE LINES 115-117

This resets `transform_output` to its default value to avoid impacting other
examples when generating the scikit-learn documentation

.. GENERATED FROM PYTHON SOURCE LINES 117-119

.. code-block:: default

    set_config(transform_output="default")


.. GENERATED FROM PYTHON SOURCE LINES 120-124

When configuring the output type with :func:`config_context` the
configuration at the time when `transform` or `fit_transform` are
called is what counts. Setting these only when you construct or fit
the transformer has no effect.

.. GENERATED FROM PYTHON SOURCE LINES 124-129

.. code-block:: default

    from sklearn import config_context

    scaler = StandardScaler()
    scaler.fit(X_train[num_cols])


.. GENERATED FROM PYTHON SOURCE LINES 130-135

.. code-block:: default

    with config_context(transform_output="pandas"):
        # the output of transform will be a Pandas DataFrame
        X_test_scaled = scaler.transform(X_test[num_cols])
    X_test_scaled.head()


.. GENERATED FROM PYTHON SOURCE LINES 136-137

outside of the context manager, the output will be a NumPy array

.. GENERATED FROM PYTHON SOURCE LINES 137-139

.. code-block:: default

    X_test_scaled = scaler.transform(X_test[num_cols])
    X_test_scaled[:5]


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.000 seconds)


.. _sphx_glr_download_auto_examples_miscellaneous_plot_set_output.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/miscellaneous/plot_set_output.ipynb
        :alt: Launch binder
        :width: 150 px



    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/?path=auto_examples/miscellaneous/plot_set_output.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_set_output.py <plot_set_output.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_set_output.ipynb <plot_set_output.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_