.. _qtcuqu_solver:

************************************
cuQuantum Backend in QuTiP's Solvers
************************************

The ``qutip-cuquantum`` library provides a backend that leverages NVIDIA's cuQuantum SDK to accelerate the simulation of quantum systems.
This backend is specifically designed for simulating **large, composite quantum systems** by performing tensor network contractions on the GPU.

This document outlines how to activate and use this backend with QuTiP's solvers.

==================================
Enabling the cuQuantum Backend
==================================

To use the cuQuantum-based solvers, you must first enable the backend.
This is done by calling the ``set_as_default`` function and providing it with a ``WorkStream`` object from ``cuquantum.densitymat``.

.. code-block:: python

    import qutip
    import qutip_cuquantum
    from cuquantum.densitymat import WorkStream

    # Create a workstream for the backend.
    # This can be configured for multi-GPU and multi-node setups.
    stream = WorkStream()

    # Set cuQuantum as the default backend.
    qutip_cuquantum.set_as_default(stream)

The ``set_as_default`` function changes several QuTiP defaults to route computations through the cuQuantum library.
This includes setting the default data format for quantum objects (``Qobj``) to ``CuOperator`` and configuring the solvers to use GPU-compatible integrators.

This operation can be reversed with:

.. code-block:: python

    qutip_cuquantum.set_as_default(reverse=True)


The backend can also be enabled with a context:

.. code-block:: python

    with CuQuantumBackend(ctx):
        ...

However be careful when mixing core Qutip object and Qutip-cuQuantum's one.
Qutip's Qobj do not keep all the internal structure needed for cuQuantum's optimizations.
Qutip-cuQuantum's states can be distributed in multiple processes and unusable for many qutip's core features. 

==================================
Usage with Solvers
==================================

Once the backend is enabled, you can use compatible QuTiP solvers as you normally would.

Supported Solvers
-----------------
The backend has been tested and is known to work with the following solvers:

* ``qutip.sesolve``
* ``qutip.mesolve``
* ``qutip.mcsolve``

.. note::
    Other solvers have not been tested and may not function correctly with this backend.

Integration Methods
-------------------
The backend introduces its own set of ordinary differential equation (ODE) integration methods, which are wrappers around existing QuTiP methods, optimized for the ``CuOperator`` data type.
You must use one of the following methods in your solver options:

* ``"CuTsit5"``
* ``"CuVern7"``  (New default)
* ``"CuVern9"``

Using any other integration method will result in undefined behaviour.

Example
-------
Here is a simple example of running ``mesolve`` on the GPU:

.. code-block:: python

    # This example assumes the backend has already been set as shown above.

    N = 100
    H = qutip.tensor([qutip.rand_herm(N), qutip.qeye(2)])
    c_ops = [qutip.tensor(qutip.destroy(N), qutip.sigmax())]
    rho0 = qutip.tensor(qutip.basis(N, 0), qutip.basis(2, 1)).proj()
    times = [0.0, 1.0, 2.0]
    e_ops = [qutip.tensor(qutip.num(N), qutip.qeye(2))]

    # No options are needed if you want to use the default "CuVern7" method.
    result = qutip.mesolve(H, rho0, times, c_ops=c_ops, e_ops=e_ops)

=======================================
Performance and Advanced Usage
=======================================

When to Use This Backend
------------------------
The cuQuantum backend excels at simulating **large compound systems**.
Its performance advantage comes from using tensor network algorithms on the GPU, which is most effective when the Hilbert space is composed of many smaller subsystems.
For small, single-component systems, the overhead of GPU data transfers may make it slower than the default CPU-based solvers.

Multi-GPU and Multi-Node Execution
----------------------------------
The ``WorkStream`` object can be configured with MPI for distributed execution across multiple GPUs on a single node or across multiple nodes in a cluster.
For details on creating a multi-GPU or multi-node ``WorkStream``, please refer to the NVIDIA cuQuantum documentation.

However, users should be aware of the performance implications:

* **Single-Node, Multi-GPU**: Scaling from one to multiple GPUs on a single machine introduces significant overhead.
  A performance benefit is typically only seen for **very large** system sizes and requires a properly configured MPI and CUDA environment.

* **Multi-Node**: Extending to multiple nodes adds another layer of communication overhead.
  This is only beneficial for **extremely large** systems that cannot fit on a single node.
  Performance is highly dependent on the cluster's internode GPU communication protocols (e.g., NVLink, InfiniBand), which can be complex to debug and optimize.

==================================
Limitations and Best Practices
==================================

Please be aware of the following limitations when using the cuQuantum backend.

Rectangular Operators
---------------------
Rectangular operators (where the number of rows and columns are not equal) are not supported by the ``CuOperator`` data format.
If your calculation requires rectangular operators as an intermediate step, you must explicitly create them as dense matrices and only convert the final, square operator to the ``CuOperator`` format.

Time-Dependent Systems
----------------------
Time-dependent systems can be defined in two ways in QuTiP: as a list of ``[Operator, Coefficient]`` pairs or as a function that returns a ``Qobj`` at a given time ``t``.

* **Coefficient pairs**: This format is supported and works well with the backend.

* **Function-based `f(t) -> Qobj`**: This format works but can be inefficient if not constructed carefully.
  For best performance, the function should return a time-dependent operator defined on the **smallest possible Hilbert space**.
  The full operator can then be constructed using ``qutip.tensor`` and ``qutip.QobjEvo``.

For example, to create a time-dependent operator acting only on the first subsystem:

.. code-block:: python

    # Recommended approach:
    # Define the time-dependent part on the smallest Hilbert space.
    def g(t):
        # This returns a 2x2 Qobj.
        return (qutip.sigmax() * t).expm()

    # Build the full operator using QobjEvo and tensor products.
    oper = qutip.tensor(qutip.QobjEvo(g), qutip.qeye(M))


.. code-block:: python

    # Avoid this approach:
    # This function is inefficient as it creates a (2*M x 2*M) dense matrix
    # and lose the tensor structure used for efficient computations.
    def f(t):
        return qutip.tensor((qutip.sigmax() * t).expm(), qutip.qeye(M))

    oper = qutip.QobjEvo(f)