About MELGYM
************

.. image:: ../_static/images/logo.png
    :align: center
    :scale: 30 %

**MELGYM** is a Python tool designed to facilitate interactive control over `MELCOR 1.8.6 <https://en.wikipedia.org/wiki/MELCOR>`_ simulations.

Every control feature in MELCOR is governed by *Control Functions* (CFs). However, MELCOR's batch execution mode makes it difficult to interactively control and modify these functions under user-defined conditions. Control conditions must be defined *a priori* and often require chaining several CFs in a cumbersome and unintuitive way.

MELGYM enables the use of external, user-defined controllers—such as **reinforcement learning** agents or any other custom **control** algorithm.

.. figure:: ../_static/images/mdp-simp.png
    :align: center
    :scale: 27 %

\

.. warning:: The MELGEN and MELCOR executables are **NOT INCLUDED** in this package, as they are not freely distributable. MELGYM requires both executables, which must be located at the following paths: ``melgym/exec/MELGEN`` and ``melgym/exec/MELCOR``.

.. note:: Access to MELGEN/MELCOR executables can be requested via the `Sandia National Laboratories website <https://www.sandia.gov/MELCOR/code-distribution/>`_.

How MELGYM works
================

MELGYM's control system is based primarily on MELCOR's **restart** capabilities. The restart mechanism in MELCOR involves:

1. Dumping the current simulation state into a *restart file* generated by MELGEN (i.e., creating a *simulation checkpoint*).
2. Performing a warm start from the last recorded state.

.. figure:: ../_static/images/melcor.png
    :align: center
    :scale: 6 %

    The MELGEN/MELCOR execution process

These two steps are performed periodically, according to the restart frequency specified in the MELCOR input configuration.

MELGYM introduces **intermediate steps** between restart dumps and warm starts, during which control functions or other user-defined components can be modified.

This allows simulations to be **discretised** into small time intervals, inserting intermediate steps where the model is continuously adapted.

.. figure:: ../_static/images/timeline.png
    :align: center
    :scale: 20 %

    MELGYM introduces additional restart steps to enable fine-grained control over MELCOR elements

.. note:: The elements that MELGYM can modify depend on the overwrite capabilities of the MELCOR input. See the `MELCOR Code Manual <https://www.sandia.gov/MELCOR/publications/>`_ for more information.

Thus, MELGYM modifies the input model every few simulation cycles. After each warm start, the model resumes from the last saved state and continues under the new configuration defined by the **external controller**.

.. figure:: ../_static/images/restart.png
    :align: center
    :scale: 20 %

    Control actions modify the MELCOR model just before each warm start

The following image summarizes the main functions and the low-level agent-environment interaction.

.. figure:: ../_static/images/functions.png
    :align: center
    :scale: 20 %

    Low-level diagram of the agent–environment interaction loop

Environments
============

Every MELGYM environment inherits from the ``MelcorEnv`` class. To function properly, users should consider the following:

- The CFs to be controlled must be specified when creating or registering the environment.
- EDF records to be used as observations must be defined in the MELCOR input file.
- Environment-specific parameters, such as the action space, must be set in the environment constructor.
- Users must define the reward function and the termination/truncation conditions, as they are specific to each environment.
- The EDF dump and restart frequencies must be consistent with the ``control_horizon``, and properly defined in the MELCOR input file.

.. tip:: Refer to :ref:`api` for a detailed definition of the `MelcorEnv` class and its methods.

Reinforcement learning integration
==================================

Formulation
-----------

Reinforcement Learning (RL) algorithms are particularly useful for learning control policies through interaction between an agent and a simulated environment.

In RL, an *agent* interacts with a dynamic process—called the *environment*—over a discrete sequence of time steps :math:`\mathcal{T} = \{0,1,2,...\}`. In MELGYM, the environment corresponds to a MELCOR simulation with which an agent interacts.

.. figure:: ../_static/images/mdp-melgym.png
    :align: center
    :scale: 9 %

    The Partially Observable Markov Decision Process implemented in MELGYM

At each time step :math:`t`, the agent receives an *observation* :math:`o_t \in \mathcal{O}`, representing a subset of variables that define the current *state* :math:`s_t \in \mathcal{S}`. Based on this observation, it selects an *action* :math:`a_t \in \mathcal{A}`, which modifies the MELCOR model. The simulation then proceeds, generating a new state :math:`s_{t+1}` and a *reward* :math:`r_t \in \mathbb{R}` that evaluates the outcome of the transition, thereby guiding the learning process.

The agent follows a *policy* function :math:`\pi`, such that :math:`a_t \sim \pi(\cdot|s_t)`. The goal is to find an optimal policy :math:`\pi^*` that maximizes expected cumulative reward:  
:math:`G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}`,  
where :math:`\gamma \in [0,1]` is a discount factor weighting future rewards.

When RL is combined with neural networks, it yields *deep reinforcement learning* (DRL). These methods employ parameterized policies :math:`\pi_\theta`—such as neural networks with weights :math:`\theta`—and use gradient-based optimization to approximate the optimal policy.

RL control in MELGYM
--------------------

MELGYM enables the integration of DRL-based controllers into MELCOR simulations. This is achieved via the intermediate *restarts* described earlier.

.. figure:: ../_static/images/mdp.png
    :align: center
    :scale: 20 %

    Agent-environment interaction loop

EDF outputs from MELCOR are parsed into observations used by the agent to determine the next control action and calculate the associated reward. Actions involve modifying control elements permitted by the MELGYM environment. The set of observed variables, available actions, and the reward and termination conditions are defined by each specific environment.

.. note:: MELGYM adheres to the standard `Gymnasium <https://gymnasium.farama.org/>`_ interface, and its environments implement typical methods such as *reset*, *step*, and *render*. The agent implementation depends on user preferences. The `Stable-Baselines3 <https://stable-baselines3.readthedocs.io/en/master/>`_ library is a well-tested option that is highly recommended.

.. tip:: After this introduction, head to section :ref:`examples` for a practical guide on using a DRL controller with MELGYM.