.. currentmodule:: chi

.. _Dataset_1 : https://github.com/DavAug/chi/blob/main/docs/source/getting_started/code/dataset_1.csv
.. _Pints: https://github.com/pints-team/pints
.. _ArViz: https://python.arviz.org/en/stable/
.. _Issue: https://github.com/DavAug/chi/issues

**********************************
Fitting mechanistic models to data
**********************************

In the previous tutorial, :doc:`mechanistic_model`, we have seen how we can
implement and simulate treatment response models in Chi. For example, we can simulate
the time course of drug concentration levels by 1. implementing a model in SBML;
2. setting the route of administration and the dosing regimen; 3. simulating the
drug concentration for a given set of model parameters.

.. literalinclude:: code/3_fitting_models_1.py
    :lines: 112-146

.. raw:: html
   :file: images/3_fitting_models_1.html

This functionality to simulate treatment responses is pretty cool in its own
right! For example, it can help us to study nonlinearities in the treatment response
dynamics and to optimise dosing regimens to target a desired treatment response.

However, at this point, the simulated treatment responses have little to do with
real treatment responses. To describe *real* treatment
responses, i.e. treatment responses that we may observe in clinical practice, we need to somehow connect
our model to reality.

The most common approach to relate models to real treatment responses is to
compare the model predictions to measurements. Below, we have prepared an example
dataset with drug concentration measurements. These drug concentrations were
measured after repeatedly adminstering 2 mg of a drug every 24 hours.
You can download the dataset from the following link: Dataset_1_.

.. csv-table:: Drug concentration measurements
   :file: code/dataset_1.csv
   :widths: 4, 12, 12, 12, 12, 12, 12, 12, 12
   :header-rows: 1

The dataset contains one column that identifies measured individuals (``ID``),
two columns that specify measurement times and dose administration times
(``Time`` and ``Time unit``), three columns that specify measured values
(``Observable``, ``Value``, ``Observable unit``), and three columns that specify
dose administrations (``Dose``, ``Duration``, ``Dose unit``).

Downloading the file and saving it in the same directory as the Python script,
we can visualise the measurements by executing the below script

.. literalinclude:: code/3_fitting_models_1.py
    :lines: 176-195

.. raw:: html
   :file: images/3_fitting_models_2.html

The figure shows the drug concentration measurements as blue scatter points.
The treatment response dynamics indicated by the measurements
is similar to the simulated treatment response in the previous code block.
But looking more closely at the magnitude of the values,
it appears that the measured values are much smaller
than the simulated ones. We can therefore conclude that, at this point, our
model does not provide an accurate description of the measured treatment
response.

To find a better description of the treatment response, we have two options:
1. we can try to find parameter values that improve the proximity of the
model output to the measurements; or 2. we can define a new mechanistic model and see
whether this new model is able to describe the measurements better. This tutorial
will be about the former and will detail how we can find better model parameters
for a given model structure.

Estimating model parameters from data: Background
*************************************************

Before we can try to find better parameter values that describe the observed
treatment response, we first need to agree on what we mean by
"*better*" for the relationship between the mechanistic model output and the measurements.
An intuitive notion of "better" is "closer", quantified by the distance
between the measurements and the model output,
i.e. the difference between the measured values and the
simulated values. Then the model parameters that best
describe the measurements would be those parameter values that make the mechanistic
model output perfectly match the measurements, resulting in distances of 0 ng/mL
between the model output and the measurements at all measured time points.
However, as outlined in Sections 1.3 and 1.4 of the
:doc:`quick_overview`, measurements of treatment responses are imprecise and
noisy, and will therefore not perfectly represent the treatment response dynamics.
Consequently, if we were to match the model outputs to measurements perfectly,
we would end up with an inaccurate description of the treatment response
that is corrupted by measurement noise.

One way to overcome this limitation is to change our notion of "better" and incorporate the measurement
process into our computational model of the treatment response. This makes explicit
that we do not expect the mechanistic model output to match
the measurements perfectly. In Chi, the measurement process can be captured
using :class:`chi.ErrorModel` s. Error models promote the single value output
of mechanistic model simulations to a distribution of
values. This distribution characterises a
range of values around the mechanistic model output where measurements may be
expected.
We can use this measurement distribution in two ways: 1. for simulation; and 2. for
parameter estimation. For simulation, the distribution
can be used to sample measurement values and
imitate the measurement process of real treatment responses, see
Section 1.3 in the :doc:`quick_overview` for an example. For parameter estimation,
the distribution can be used to quantify the likelihood with which the observed
measurements would have been generated by our model,
see Section 1.4 in the :doc:`quick_overview`.

Formally, we denote the measurement distribution by :math:`p(y | \psi, t, r)`,
where :math:`y` denotes the measurement value, :math:`\psi` denotes the model parameters,
:math:`t` denotes the time point of the measurement, and :math:`r` denotes the adminstered
dosing regimen. With this measurement distribution in place, we can quantify the
likelihood with which any given set of measurements would have been generated by the
model. For example, the likelihood of a measurement :math:`y_1` at time :math:`t_1`
under dosing regimen :math:`r^*` is defined by the value of the probability density
of the measurement distribution evaluated at the measurement,
:math:`p(y_1 | \psi, t_1, r^*)`. Note that this
likelihood depends on the choice of model parameters, :math:`\psi`. The model
parameters with the maximum likelihood are
the parameter values that "best" describe the measurements.

.. note::
    The measurement distribution, :math:`p(y | \psi, t, r)`, is defined
    by the mechanistic model output and the error model. To illustrate this, let
    use denote the simulated drug concentration values of the
    the 1-compartment PK model by :math:`c(\psi, t, r)`. By definition of
    the model, the drug concentration values, :math:`c`, are
    a function of the model parameters, :math:`\psi = (a_0, k_a, k_e, v)`, the time,
    :math:`t`, and the dosing regimen, :math:`r`.

    1. If we choose a :class:`chi.GaussianErrorModel` to describe the difference
    between the model output and the measurements, we assume that measurements
    are distributed according to a Normal distribution around the model output

    .. math::
        p(y | \psi, t, r) = \frac{1}{\sqrt{2\pi \sigma ^2}}\mathrm{e}^{-\big(y - c(\psi, t, r)\big)^2 / 2\sigma ^2},

    where :math:`\sigma` defines the width of the distribution. For ease of notation,
    we extend the definition of the model parameters to include :math:`\sigma`,
    :math:`\psi = (a_0, k_a, k_e, v, \sigma)`.

    We can see that the mechanistic model output
    defines the mean or Expectation Value of the measurement distribution.

    2. If we choose a :class:`chi.LogNormalErrorModel` to describe the difference
    between the model output and the measurements, we assume that measurements
    are distributed according to a lognormal distribution around the model output

    .. math::
        p(y | \psi, t, r) = \frac{1}{\sqrt{2\pi \sigma ^2}}\frac{1}{y}\mathrm{e}^{-\big(\log y - \log c(\psi, t, r) + \sigma / 2\big)^2 / 2\sigma ^2}.

    One can show that also for this distribution the mechanistic model output defines the mean
    or Expectation Value of the measurement distribution.

    The main difference between the two distributions is the shape. The
    :class:`chi.GaussianErrorModel` is symmetrically distributed around the
    model output, while :class:`chi.LogNormalErrorModel` is scewed in such a
    way that measurements can never assume negative values. To visualise these
    differences, we recommend simulating many measurements with different
    error models, similar to Section 1.3 in :doc:`quick_overview`. But instead
    of choosing different times, sample all measurements at the same time. You
    can then histogram the samples, using for example ``go.Histogram``, as used
    in Section 1.4.2 in :doc:`quick_overview`, to visualise the shape of
    the probability density.


Assuming independence of measurements, the likelihood for a dataset with :math:`n` measurements,
:math:`\mathcal{D}=((t_1, y_1), (t_2, y_2), \ldots, (t_n, y_n), r)`, is given
by the product of the individual likelihoods

.. math::
    p(\mathcal{D}| \psi ) = \prod _{j=1}^n p(y_n | \psi, t_n, r),

where *independence* refers to the assumption that measurements are
independently and identically distributed according to our model of the
measurement process (this does not have to be the case, and is especially unlikely
to be the case when our model fails to describe the measurement process accurately).

With this likelihood in place, Chi makes it straightforward to run numerical optimisation algorithms in order
to derive the maximum likelihood estimates that best describe the measurements,
see Section 1.4.1 in the :doc:`quick_overview`.

Alternatively, Chi also supports Bayesian inference of the model parameters:
Bayesian inference is conceptually different from maximum likelihood estimation
as it does not seek to find a single set of model parameters that "best"
describes the observed measurements. Instead, Bayesian inference acknowledges
the fact that noisy measurements leave uncertainty about which model parameters
best describe the data, and therefore focuses instead on deriving
a distribution of parameter values which
are all consistent with the observed measurements, see Section 1.4.2 in :doc:`quick_overview`
for a more detailed discussion. This
distribution is derived from the likelihood using Bayes' rule

.. math::
    p(\psi| \mathcal{D} ) = \frac{p(\mathcal{D}| \psi )\, p(\psi)}{p(\mathcal{D} )},

where :math:`p(\psi)` denotes the prior distribution of the model parameters.
The prior distribution quantifies the baseline distribution of parameter values
before the parameter estimation, making it possible to inform the inference results
with otherwise hard-to-quantify knowledge of feasible or likely parameter values prior to the
inference.

Defining the log-posterior
**************************

In Chi, we can derive posterior distributions from measurements
using Markov chain
Monte Carlo (MCMC) algorithms implemented in the open source Python package
Pints_.
In Sections 1.4.1 and 1.4.2 in the :doc:`quick_overview`,
we showed in some detail how we can define (log-)posterior distributions,
:class:`chi.LogPosterior`, for this purpose. Here, we want to show
how we can use the :class:`chi.ProblemModellingController` to simplify and automate
this process as much as possible.

The tricky bit when implementing log-posteriors for treatment response models
is often that log-posteriors do not only
depend on the treatment response measurements, :math:`((t_1, y_1), (t_2, y_2), \ldots, (t_n, y_n))`,
but also on the administered dosing regimen, :math:`r`.
This can make it tedious to define log-posteriors,
especially when parameters across measurements of multiple
individuals with different dosing regimens are inferred, as dosing regimens of the
model will have to be manually set to the dosing regimens administered during the measurement
process. A mismatch between the dosing regimens would render the inference results invalid.

To eliminate this source of error, we have implemented the :class:`chi.ProblemModellingController`.
The :class:`chi.ProblemModellingController` facilitates the construction of
log-posteriors and reduces the workflow to a simple 4-step approach:

- 1. Definition of the mechanistic model
- 2. Definition of the error model
- 3. Definition of the measurements
- 4. Definition of the prior distribution

In the below code block, we illustrate this workflow for the above drug
concentration dataset, Dataset_1_.

.. literalinclude:: code/3_fitting_models_2.py
    :lines: 651-688

The first four blocks in the code define the individual components of the
log-posterior: the mechanistic model, the error model, the data, and the prior
distribution. Note that the administration of the dosing regimen is set
before passing the mechanistic model to the :class:`ProblemModellingController`.

The prior distribution defines marginal distributions for the parameters, and
is implemented using Pints_. In Bayesian inference, we can use the prior
distribution to bias the inference
results towards feasible areas of the parameter space. In this
case, we have little prior knowledge about feasible parameter ranges, so we
choose relatively uninformative prior
distributions (below will be an illustration of the prior distribution).
Note that we seem to be specifying marginal prior distributions for only
4 of the 6 model parameters. This is because we fix the values of the initial drug
amounts to ``0`` prior to the inference
in the lines below, reflecting our knowledge that the subject had no prior
exposure to the drug before starting the trial. This reduces the number of model
parameters from 6 to 4. The fixing of model parmaters
is optional, but can sometimes improve the inference results when some model
parameters are already well understood.

For the remaining 4 parameters, only positive values make biological sense, so
we choose prior distributions that focus on positive values. For
two model parameters, the volume of distribution and the scale parameter,
negative or zero values are particularly bad as they will break the simulation
(1. a volume of zero causes a division by zero error; and 2. the lognormal distribution
is only defined for positive sigma). We therefore use ``pints.LogNormalLogPrior``
to constrain those parameters to strictly positive values.

In the final block of the code, we define the log-posterior. In the first line,
we specify the mechanistic model and the error model. In the next line, we set
the dataset. Note that we need to use the ``output_observable_dict`` to map the
output variable of the model, ``global.drug_concentration``, to the Observable name
in the dataset, ``Drug concentration``. Other specifications are not required, and
dosing regimens are automatically set, when the dosing regimen related columns,
``Dose`` and ``Duration``, are present in the dataset. In the following line, we
fix the initial drug amounts to ``0`` using :meth:`ProblemModellingController.fix_parameters`.
You can use this method to fix any parameters of the model.
In the last two lines, we set the log-prior and implement the log-posterior using
the :class:`ProblemModellingController.get_log_posterior` method.

Inferring the posterior distribution
************************************

With this :class:`chi.LogPosterior` in place, we can infer the posterior
distribution using any MCMC algorithm of our choice. Recall that inference, in this context,
means the reconstruction of the posterior distribution from the
:class:`chi.LogPosterior`, which defines the log-pdf of the posterior distribution
up to an unknown constant shift

.. math::
    \log p(\psi| \mathcal{D} ) = \log p(\mathcal{D}| \psi ) + \log p(\psi) + \text{constant}.

By comparison to above Bayes rule, you will find that :math:`\text{constant} = -\log p(\mathcal{D})`,
so the constant shift may seem not so 'unkown' after all. However, *unknown* here
does not mean that we cannot write down an expression for it, it
refers to the fact that for most treatment response models
:math:`p(\mathcal{D})` is (practically) impossible to evaluate, as
evaluating :math:`p(\mathcal{D})` requires the numerical integration of the
likelihood-prior product over the full parameter space,
:math:`p(\mathcal{D}) = \int \mathrm{d} \psi \, p(\mathcal{D}, \psi ) = \int \mathrm{d} \psi \, p(\mathcal{D}| \psi )\, p(\psi)`.
This renders the value of the constant shift for all intents and purposes unknown.

The unknown shift makes it very difficult to make statements about the absolute probability
of parameter values from the :class:`chi.LogPosterior` alone. However, the uknown shift does allow for relative comparisons of
probabilities as the shift is the same for all parameter values -- a fact exploited by MCMC algorithms to circumvent the limitation
of the partially known log-posterior. MCMC algorithms use
the relative comparison of parameter probabilities to generate random samples from the
posterior distribution, opening a gateway to reconstruct the distribution. The
more random samples are generated, the closer the histogram over the samples will
approximate the original posterior distribution. In fact, one can show that the histogram
will converge to the posterior distribution as the number of samples approaches
infinity. This makes it possible for MCMC algorithms
to reconstruct any posterior distribution from a :class:`chi.LogPosterior`.

To illustrate this, let us run an MCMC algorithm to infer the above
defined posterior distribution of the 1-compartment PK
model.

.. literalinclude:: code/3_fitting_models_2.py
    :lines: 692-698

In the code block, we use an MCMC algorithm implemented in Pints_, called
``pints.HaarioBardenetACMC``. For technical reasons that we will discuss below,
we run the algorithm three times for 20,000 iterations.
Note that we use the :class:`chi.SamplingController` to set the number of
runs, the number of iterations, and to run the sampling algorithm. The
:class:`chi.SamplingController`  can also do other things, such as running the
chains in parallel, but we will not go into this additional functionality in this
tutorial, and refer instead to the API reference.

Executing the above code block will spawn a response of the below form.
The left most column indicates the current
iteration of the MCMC algorithm. The other columns show the total number of
log-posterior evaluations and the fraction of accepted MCMC proposals of each
chain. The target acceptance ratio depends on the details of the MCMC algorithm.
For the ``pints.HaarioBardenetACMC`` the target is ``0.23``.

.. code-block:: bash

    Using Haario-Bardenet adaptive covariance MCMC
    Generating 3 chains.
    Running in sequential mode.
    Iter. Eval. Accept.   Accept.   Accept.   Time m:s
    0     3      0         0         0          0:00.0
    1     6      0         0.5       0.5        0:00.0
    2     9      0.333     0.667     0.333      0:00.0
    3     12     0.25      0.75      0.5        0:00.0
    20    63     0.714     0.571     0.476      0:00.0
    40    123    0.756     0.61      0.561      0:00.0
    60    183    0.738     0.475     0.59       0:00.0
    80    243    0.716     0.407     0.642      0:00.1
    100   303    0.693     0.406     0.653      0:00.1
    120   363    0.736     0.421     0.686      0:00.1
    .
    .
    .

When the three runs of the algorithm terminate, the generated samples are returned in
form of an ``xarray.Dataset``.
We can visualise the samples using the code
block documented at the end of this section (we move the code block to the end
to avoid disrupting the flow of the tutorial with less relevant code snippets).

.. raw:: html
   :file: images/3_fitting_models_3.html

The left column of the
figure shows the histogram over the samples across the three runs
in orange, as well as the probability density of the prior distribution in black.
The first row shows the
samples of the absorption rate, the second row shows the samples of the elimination
rate, the third row shows the samples of the volume of distribution, and the
fourth row shows the samples of the scale parameter of the error model, sigma.
The right column of the figure shows the same
samples, this time visualised over the iterations at which they were drawn.
The samples from the different runs are illustrated in different
colours: run 1 (green); run 2 (red); run 3 (blue).

The posterior distribution
^^^^^^^^^^^^^^^^^^^^^^^^^^

The orange distribution is the result of the inference -- the posterior distribution.
It contains all parameter values that are consistent with the drug concentration
measurements and our prior knowledge, assigning each set of parameter values
with a probability of being the data-generating parameter values. Notably, the
figure shows that our prior knowledge and Dataset_1_ are insufficient to
conclude on a single set of parameter values (see orange distribution).
Instead, the measurements only
allow us to refine our understanding of feasible parameter values. For example,
we can see in the second row of the figure that the marginal posterior distribution
substantially differs from the marginal prior distribution. This is because the
drug concentration measurements contain important information about the elimination rate, rendering
rates above 1.5 1/day or below 0.25 1/day as extremely unlikely for the
model of the treatment response. This is in stark contrast to the relatively wide
range of model parameters that we deemed feasible prior to the inference
(see black line). However, the measurements are not conclusive enough
to reduce the distribution of feasible elimination rates to a single value. Similarly,
for the volume of distribution (row 3) and the error scale parameter
(row 4), the measurements lead to substantial updates relative to the
prior distribution.
In comparison, the measurements appear less informative about the absorption rate
(see row 1), given that the marginal posterior distribution of
the absorption rate is almost identical to its prior distribution.
We will have a closer look at an intuitive understanding of why the measurements
contain little information about
the absorption rate below. The take-away from this discussion is that inferring
distributions of parameter values consistent with the measurements is in many
ways more informative than just estimating a single set of model parameters,
as it can help to quantify the feasibility of different parameter values and the
overall remaining level of uncertainty.
In a Bayesian setting, inferring posterior distributions also
allows us to supplement the information contained in the measurements by prior
knowledge of feasible model parameters, facilitating the inference of parameter
values when treatment response measurements are limited.

Assessing convergence
^^^^^^^^^^^^^^^^^^^^^

Similar to most numerical inference
algorithms, MCMC algorithms have practical
limitations, making them not equally well suited to all inference problems.
Some MCMC algorithms are, for example, only well suited to estimate model
parameters when the total number of parameters is small, while other MCMC algorithms only
work when the curvature of the posterior surface is not too extreme.
One way to test the suitability of an MCMC algorithm for a particular inference
problem is the use of the :math:`\hat{R}` metric.
We will motivate the metric below and provide an intuitive explanation.
Technical details are left to interested readers to explore on their own.

Let us begin this section by revisiting the right column in the figure above. The column
shows the samples from the three MCMC algorithm runs at each
iteration. For early iterations of the algorithm,
the samples from the MCMC runs look quite distinct -- each run appears to sample
from a different area of the parameter space. In contrast, at later iterations
the MCMC runs are harder to distinguish and sample from the same area of the parameter space.

Intuitively,
it does not really make sense for the samples from the MCMC runs to look different
-- after all, we use the same MCMC algorithm to sample from the same posterior distribution.
The histogram over the samples *should* therefore be identical within the limits of
sampling variation. However, if
we plot the samples from each of the three runs separately,
we find that the histograms actually look quite different. For illustrative purposes,
we focus on the absorption rate in the code block.

.. literalinclude:: code/3_fitting_models_2.py
    :lines: 1025-1066

We select the samples of the absorption rate from the ``xarray.Dataset`` using
the name of the parameter in the mechanistic model,
``samples['dose.absorption_rate'].values``. The returned object is a numpy array
of shape ``(n_runs, n_iterations)``. We can select the samples from
run n using the index n-1, e.g. for run 1: ``samples['dose.absosption_rate'].values[0]``.
In addition, we subsample the MCMC samples in the above code block, which we can,
for now, regard as a cosmetic choice.

.. raw:: html
   :file: images/3_fitting_models_4.html

We can see that the histograms over the samples from the three runs look very
different. This seems contradictory to the orange distribution which we presented
above as the histogram over the MCMC samples across the runs. But also, how can it
makes sense for the three histograms to differ when the
posterior distribution is the same? -- It does not, and in fact, it tells us that the three histograms in the figure above do not represent the posterior
distribution. Although, we did not disclose it until now, the orange
distribution, respresenting the posterior distribution, is only based on the
second half of each MCMC run!

So why is it crucial that we only choose the second half of each MCMC run,
and is the particular choice of the *second* half important? The answer comes
back to a common limitation of all MCMC algorithm which we can see in the right
column of the figure presented earlier: MCMC algorithms generate samples
from the posterior distribution conditional on the latest generated sample.
For some MCMC algorithms, this conditioning has little influence on sequential samples
because the internal sampling strategy is advanced enough to
sufficiently decorrelate subsequent samples. But for
many MCMC algorithms the conditioned sample substantially influences the sampled value. That
means that if the last samples of two MCMC runs are far away from each other, the following
samples of the runs will also differ substantially from each other, see for example
the first 5000 iterations of the elimination rate in the second row of the figure
in the previous section.

At the start of an MCMC run, there is no "last sampled" parameter value,
and we have to manually provide a starting point to the MCMC algorithm.
The :class:`chi.SamplingController`
automatically samples these starting points from the prior distribution, to
reduce the number of manual steps during the inference in Chi.
This means that the above runs of the MCMC algorithm start from three
different positions in parameter space. These starting points have little
to do with the posterior distribution, and are potentially far away from the area of
the parameter space that is interesting for the posterior distribution.
It therefore makes sense that during the early iterations of the runs, the sampled
values of the MCMC runs do not agree. Fortunately, MCMC algorithms are constructed
in such a way that their samples are, at least in theory, guaranteed to converge
to the posterior distribution, meaning that runs of MCMC algorithms starting from
different areas in parameter space can be expected to converge to the same area.
In practice, numerical limitations may nevertheless
make it impossible, or very time-consuming for samples from MCMC algorithms not
suited to the inference problem to
converge to the posterior distribution. Starting multiple runs from different
areas in parameter space therefore provides an effective way of testing whether
a given MCMC algorithm is suited for the inference of the problem at hand. If
the runs do not converge, the MCMC algorithm either needs more iterations to converge, or
is not suited for the inference problem, in which case the MCMC samples cannot
be expected to represent the posterior distribution. If the runs do converge,
we can be confident that the inference results are correct (although there is always
a chance for edge cases where this convergence test fails to identify the
limitations of an algorithm).

So, starting multiple MCMC runs from different locations in parameter space is
a good idea to gain confidence in the inference results. We recommend running the
MCMC algorithm from 3-5 different starting points, randomly sampled from the prior distribution.
On the one hand, a large number of starting points will test the suitability
of the MCMC algorithm more thoroughly, but each run also comes at a computational
cost. 3-5 runs therefore provide a good tradeoff. In addition, starting points randomly sampled
from the prior usually make sure that the starting points are not too close together,
so that the suitability of the MCMC algorithm is indeed tested. At the same time, it also
ensures that the starting points are not extremely far away from each other, located in areas
of the parameter space that we do not even deem relevant for the inference result.
So, even if the MCMC algorithm would encounter problems in these extreme areas of parameter space,
it may not be very relevant for our inference problem. Randomly sampling from the
prior distribution therefore provides a good tradeoff between well dispersed
and not too extreme starting points.

In conclusion, the initial iterations of the MCMC runs can be used to establish the
validity of the inference results when the starting points of different runs are sufficiently
dispersed. At the same time, these initial samples have little to do
with the posterior distribution, as outlined above.
It is therefore common practice to exclude these early samples
from the inference result. To identify which samples should be discarded, and which
should be included in the result, the earlier presented
"trace plot", visualising the samples of the different runs at each iteration
can be useful. We can see, for example, that somewhere around the 8000s iteration the
three runs of the MCMC algorithm converge. We therefore choose the 10,000s iteration
as the cut-off for the "warm-up" (2000 extra iteration for good measure).
Below, we plot the histogram over the samples
from the three chains, this time using only the samples from the second half of
each run.

.. literalinclude:: code/3_fitting_models_2.py
    :lines: 1073-1114

.. raw:: html
   :file: images/3_fitting_models_5.html

We can see that the histograms over the samples are now in much better agreement!

This visual assessment of the convergence already gets us very far, but for those
that would prefer more quantitative metrics to assess the convergence of MCMC
runs, we recommend the open source library ArViz_. ArViz_ implements a number
of widely established metrics to assess the convergence of MCMC algorithms,
including the :math:`\hat{R}` metric and the effective sample size. We can estimate
both values (and many more) using the ``az.summary`` function. In the below code
block, we first estimate the values for all MCMC samples, and then estimate
the values only for the samples post warm-up.

.. literalinclude:: code/3_fitting_models_2.py
    :lines: 1121-1130

The return of ``az.summary`` is a ``pandas.DataFrame``, containing the estimated
values, as illustrated below

.. csv-table:: Summary: all samples
   :file: code/3_fitting_models_summary_1.csv
   :widths: 19, 9, 9, 9, 9, 9, 9, 9, 9, 9
   :header-rows: 1

.. csv-table:: Summary: samples post warm-up
   :file: code/3_fitting_models_summary_2.csv
   :widths: 19, 9, 9, 9, 9, 9, 9, 9, 9, 9
   :header-rows: 1

The summary output contains the estimates of the mean value, the standard deviation,
and many other values. For the assessment of the convergence, we would like to focus
on the :math:`\hat{R}` values in the right-most column of the dataframes, the ``r_hat`` columns.
We can see that the summary provides one estimate for each parameter. For the
samples post warm-up, the estimates are all ``1.00``, while the estimates for all
samples assume larger values. Loosely speaking, the :math:`\hat{R}` metric compares
the variance of samples across runs to the variance of samples within runs. If
samples from different runs are very different, the variance across runs is much larger
than the variance of samples within the same run, leading to an estimate :math:`>1`.
On the flip side, if the inter-run variance is the same as the intra-run variance,
the samples across runs look the same and the :math:`\hat{R}` metric returns a
value of :math:`1`. As a result, :math:`\hat{R}` estimates of :math:`1` indicate
convergence, while estimate :math:`>1` indicate that the runs have not yet fully converged.

.. literalinclude:: code/3_fitting_models_2.py
    :lines: 701-1018

Assessing convergence: Summary
******************************

1. Run the MCMC algorithm multiple times from different starting points. The :class:`chi.SamplingController` automatically samples starting points from the prior distribution. Recommended number of runs is 3-5.
2. Visually assess the convergence of the runs to the same area of parameter space using trace plots i.e. sample values on the y-axis and iteration at which each sample was obtained on the x-axis.
3. Discard the samples where the runs have not converged as "warm-up". If the runs do not converge, the MCMC algorithm has to run for more iterations until they converge, or the algorithm may not be suited for the inference problem.
4. Calculate the :math:`\hat{R}` value using the remaining samples. We recommend the ``az.rhat`` function implemented in ArViz_. Values :math:`<1.01` indicate good convergence. Larger values may indicate problems with the inference.
5. (Optional) For MCMC runs that require a large number of iterations for convergence, it can make sense to subsample the samples. Often, 1000-3000 samples are sufficient to represent a distribution. For ``n_iterations`` post warm-up, keeping every ``n_iterations // 1000`` th sample may therefore be sufficient.

For more details, please refer to the previous section.

Comparing model fits to data
****************************

Let us conclude this tutorial by comparing our model fit to the drug concentration
measurements. The simplest way to do this is to just focus on the means of the
posterior distributions, which we can extract from the summary dataframes presented
in Section 3.3.2.

.. literalinclude:: code/3_fitting_models_2.py
    :lines: 1138-1177

.. raw:: html
   :file: images/3_fitting_models_6.html

In the above code block, we first fix the initial amount parameters of the mechanistic
model using the :class:`chi.ReducedMechanisticModel` class. This class is a wrapper
for mechanistic models that implements the :meth:`ReducedMechanisticModel.fix_parameters`
method. This method works analogously to the :meth:`ProblemModellingController.fix_parameters`
and fixes model parameters to constant values. In the next lines, we extract the
dosing regimen from the :class:`ProblemModellingController`. The controller formats
the dosing regimens from the dataset, Dataset_1_, into a useful format for the :class:`chi.PKPDModel`, returning the
dosing regimens in form of a dictionary, using the
IDs of the individuals as keys. This makes it possible to access the dosing regimen
of the individual with ``'ID 1'`` in Dataset_1_ using the key ``'1'``.
In the remaining lines leading up to the plotting, we extract the
estimates of the parameter means from the summary dataframe.

We can see that our fit describes the measurements reasonably well. However, it is
a little bit unsatisfying that the uncertainty of the parameter estimates is not
reflected in the model fit at all. To resolve this shortcoming of just using the
parameter means to plot the model fit, Chi implements the :class:`chi.PosteriorPredictiveModel`,
which defines the posterior predictive distribution of the measurement
process

.. math::

    p(y | \mathcal{D}, r, t) = \int \mathrm{d} \psi \, p(y | \psi, r, t)\, p(\psi | \mathcal{D}).

The posterior predictive distribution averages our model of the measurement
distribution over the parameter values that we found to be consistent with the data.
Before discussing this in more detail, let us take a look at how the posterior predictive
distribution compares to the model fit where we just used the means of the parameters.
To this end, we sample measurements repeatedly from :math:`p(y | \mathcal{D}, r, t)`
and estimate the mean and some of its percentiles at each time point.

.. literalinclude:: code/3_fitting_models_2.py
    :lines: 1184-1206

.. raw:: html
   :file: images/3_fitting_models_7.html

In the first few lines of the code block, we implement the
:class:`chi.PosteriorPredictiveModel` by first retrieving the
:class:`chi.PredictiveModel` from the problem modelling controller, and then
combining it with the inferred posterior samples.
The :class:`chi.PredictiveModel` defines the measurement distribution,
:math:`p(y | \psi, r, t)`, for the purpose of simulating measurements. Note that
:math:`p(y | \psi, r, t)` is the same distribution as the one we used to construct the
likelihood in Section 3.1.
The important difference is that for the likelihood the values of :math:`(y, t, r)`
are fixed to the measurements, while for the :class:`chi.PredictiveModel`, the times
and the dosing regimen can be chosen at will to simulate any measurements of interest.

In the following 2 lines of the code block, we sample measurements from the posterior predictive distribution --
1000 samples at each simulated time point. The samples are returned as a ``pandas.DataFrame``,
which we reshape to a numpy array of shape ``(n_samples, n_times)`` in the next few lines.
This reshaping makes it very easy to calculate the mean value and the percentiles
of the samples at each time point, as demonstrated in the final lines of the code block.

The figure visualises the estimated mean and percentiles of the posterior predictive
distribution on top of our earlier fit. The code to reproduce the figure is documented below.
The area between the percentile estimates marks the region of drug concentration values
which can be expected to be occupied by 90% of all measurements,
assuming, of course, that our model correctly describes the measurement process.
The estimated percentiles are more jittery than the plot with the mean parameter values
because we use sampling to estimate the percentiles. As a result, the estimates
are subject to sampling error, which can be reduced by increasing the number of
samples, ``n_samples``. However, for the purpose of this analysis, the jittering
is not substantial enough to warrant a larger number of samples.

Plotting the posterior predictive distribution provides two important insights
that the fit with the means of the parameters cannot provide: 1. it can quantify
the uncertainty associated with the treatment response. This uncertainty is partially
due to measurement noise, but it also is due to the residual uncertainty in the
model parameters, as quantified by the posterior distribution. 2. It is
able to better reflect nonlinearities between model parameters and the model output
which are otherwise neglected when plotting the fit with the means of the model parameters.
We can see an indication of this in the figure by
noting that the fit with the parameter means is not the same as the mean of the
posterior predictive distribution. This difference is a result of the, in general,
nonlinear dependence of treatment response models on their parameters, meaning that the average
treatment response under parametric uncertainty is **not** the same as the treatment response
predicted with the mean parameters. In fact, these two predictions can differ substantially from
each other, especially when predictions are made for future times or for previously
unexplored dosing regimens.

.. literalinclude:: code/3_fitting_models_2.py
    :lines: 1208-1256

This concludes this tutorial. If you have any feedback or suggestions for
improvement, or would like to report any typos, mistakes or bugs, please do reach out to us, for example
by creating an Issue_. We are looking
forward to hearing from you!

Reference to ErrorModel, LogPDF and PredictiveModel API
*******************************************************

.. autosummary::

    ErrorModel
    GaussianErrorModel
    LogNormalErrorModel
    MultiplicativeGaussianErrorModel
    ConstantAndMultiplicativeGaussianErrorModel
    ReducedErrorModel
    LogLikelihood
    LogPosterior
    ProblemModellingController
    SamplingController
    PredictiveModel
    PosteriorPredictiveModel