I thoroughly enjoyed reading this **stunning** work by Susanne Still. The generality of the theoretical framework developed herein, as well as its uniqueness due to accounting for **partial observability**, helped spark many ideas in my mind. I recently watched a talk (it was so good!) by Jinghui Liu, a graduate student in the lab I'm co-advised in (FakhriLab@MIT), and her talk about topological dynamics and bosonic phases on the surface of a cell membrane got me thinking about treating topological defects (in wave propagation on the cell membrane) as effective particles in a nonequilibrium statistical mechanics setup. Re: this paper, can we think about the information processing capacity of a cell in an abstract topological sense?

A theoretical framework for the thermodynamics of memories interacting with partially observable systems demonstrates that **minimizing the lower bound of the dissipation** of such an information engine leads to an **optimal data representation strategy** when available knowledge and system manipulability are limited.

A particle is trapped in a box with a partition in the middle. If an agent observing the system knows which half of the box the particle lies in, it can extract work (of amount \(~kT\ln 2\)) by isothermal expansion. However, models such as this one **assume** that **(1) all agent choices are optimal** and **(2) all relevant degrees of freedom are observable**. This heavily **limits** the discussion of information engines to cases where all captured information can be converted into useful work. Most of the time, **in real systems, agents must act on partial knowledge** and predict the quantities relevant to work extraction from the limited available data. Modeling this more realistic aspect of information engines would be useful for systems such as **biomolecular machines** and **engineered nanotechnology**.
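As a quick back-of-the-envelope aside (my own, not from the paper): the maximal work from one Szilard cycle at room temperature is tiny in absolute terms.

```python
import math

# Maximal work extractable per Szilard cycle: kT ln 2
k_B = 1.380649e-23  # Boltzmann constant, J/K
T = 300.0           # room temperature, K (assumed for illustration)

W_max = k_B * T * math.log(2)  # on the order of 1e-21 J per cycle
```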

The following **conventions** are used:

- The **cost of information acquisition and decision making is included in the energy accounting** (otherwise information can be viewed as fuel supplied from the outside).
- The engine is **allowed to make use of temperature differences**.
- Energy flows *into* a system are **positive**.

The information engine contains the following **components**:

- A **partially observable system**, with microstate denoted by a random variable \(~Z\) with realizations \(~z \in Z\).
- An **agent**, implemented by another physical system, which turns measurements into a stable memory denoted by a random variable \(~M\) with realizations \(~m \in M\). The memory is used to decide on a work extraction protocol.
- A **work extraction device** that allows the agent to couple useful energy out of the system.

Each **cycle** runs as follows:

- From \(~t_0^i \rightarrow t_1^i\): The **agent performs a measurement and writes it into memory**. This protocol changes external control parameters on the memory as a **function of the observable data**. During this process the engine is connected to a heat bath at temperature \(~T\); the average amount of work done on the memory is \(~\langle W_M \rangle\), while the average amount of heat dissipated is \(~\langle -Q_M \rangle\).
- From \(~t_1^i \rightarrow t_2^i\): The **temperature of the heat bath is adjusted** to \(~T'\).
- From \(~t_2^i \rightarrow t_3^i\): **Work extraction** occurs via a protocol that is a **function of the agent's memory state**. During this process, the average amount of work extracted from the system is \(~\langle -W_E \rangle\) and the average heat absorbed from the heat bath is \(~\langle Q_E \rangle\). After this process, the memory no longer has any exploitable correlations with the system.
- From \(~t_3^i \rightarrow t_0^{i+1}\): The **system is prepared for the next cycle** by a **protocol that is invariant across cycles**. This process is assumed to require no work. In the example of the Szilard engine, this was removing the partition and re-inserting it into the middle of the box.

The **free energy change** at a time \(~t\) depends on the energy and entropy averaged over the joint distribution over the system states and memory states: \(~F_t = \langle E_t (m,z) \rangle_{p_t (m,z)} - kTH_t\), where the **Shannon entropy** is defined as: \(~H_t = - \langle \ln(p_t (m,z)) \rangle_{p_t (m,z)}\)

We further establish that:

$$\Delta F_M \equiv F_{t_1} - F_{t_0} = \Delta E_M (= W_M + Q_M) -kT\Delta H_M$$ $$\Delta F_E \equiv F_{t_3} - F_{t_2} = \Delta E_E (= W_E + Q_E) -kT\Delta H_E$$

The system can be decomposed into **observable (write-able into memory)** and non-observable components: \(~Z = (X, \bar X)\). It can also be decomposed into **manipulable (relevant for work extraction)** and non-manipulable components: \(~Z = (Y, \bar Y)\). The **mutual information \(~ I[X,Y]\) must be strictly positive** for work extraction to be possible.

The system cannot be manipulated in a way that changes anything but \(~y \in Y\), i.e. the state of the non-manipulable components \(~ \bar y \in \bar Y \) is unaffected by the state of the manipulable components and the state of the memory during the work extraction period.

$$p_{t_2}(\bar y \mid y,m) = p_{t_3}(\bar y \mid y,m) = p(\bar y \mid y,m)$$

The **stochastic mapping (i.e. the data representation) \(~ p(m \mid x)\) from observable data to memory state** is independent of unobservable data: \(~ p(m \mid x, \bar x) = p(m \mid x)\).

The **marginal distributions for the system are invariant** to all changes performed on the system. Thus, for all \(~k \in \{0,1,2,3\}\), \(~p_{t_k}(z) = p(z)\).

The **preparation step introduces a hidden variable \(~v\)** into the system, which, if discovered, reveals the system to be in a nonequilibrium state that can be exploited during work extraction.

The **marginal probability distribution of the memory** derives only from the statistical average over measurement outcomes: \(~ p(m) = \sum_x p(m \mid x) p(x)\).

The **agent's ability to predict the quantities relevant to work extraction from the memory** derives from a statistical average over measurement outcomes: \(~ p(y \mid m) = \sum_x p(y \mid x) p(x \mid m)\), utilizing that if a measurement outcome is given, the memory adds no new relevant information: \(~p(y \mid m, x) = p(y \mid x)\).

- At \(~t_0\), the system and memory are uncorrelated: \(~p_{t_0}(m,z) = p_{t_0}(m)p_{t_0}(z) = p(m)p(z) \).
- At \(~t_1\), **after the memory is constructed, the memory possesses useful correlations with the system**: \(~p_{t_1}(m,z) = p_{t_1}(m \mid x, \bar x)p_{t_1}(z) = p(m \mid x)p(z) \).
- Since **the uncertainty about the system decreases, the entropy decreases** by the amount of mutual information captured in the memory about the observable data: $$\Delta H_M = H[M \mid X] - H[M] = -I[M,X] $$
- This process happens at temperature \(~T\), so **the free energy change associated with memory construction** is \(~\Delta F_M = W_M + Q_M + kTI[M,X]\).
- By the Second Law, the work required to construct the memory must be greater than or equal to the resulting free energy change of the memory: \(~ W_M \geq \Delta F_M\). Thus, **operating a memory requires, at minimum, a dissipation proportional to the amount of information retained**: \(~ -Q_M \geq kTI[M,X]\).
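To make the bound \(~-Q_M \geq kTI[M,X]\) concrete, here is a small sketch (my own toy numbers, not the paper's) computing \(~I[M,X]\) for a noisy binary measurement channel \(~p(m \mid x)\):

```python
import math

def entropy(p):
    """Shannon entropy in nats of a distribution given as a list."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Toy setup: uniform binary observable X; the memory flips the reading with prob eps
p_x = [0.5, 0.5]
eps = 0.1
p_m_given_x = [[1 - eps, eps], [eps, 1 - eps]]  # rows: x, cols: m

# Joint and marginal distributions
p_mx = [[p_x[x] * p_m_given_x[x][m] for m in range(2)] for x in range(2)]
p_m = [sum(p_mx[x][m] for x in range(2)) for m in range(2)]

# I[M,X] = H[M] - H[M|X]
H_M = entropy(p_m)
H_M_given_X = sum(p_x[x] * entropy(p_m_given_x[x]) for x in range(2))
I_MX = H_M - H_M_given_X  # in nats

# Minimum dissipation required to run this memory, in units of kT
min_dissipation_kT = I_MX
```

A noiseless channel (eps = 0) would cost the full \(~kT\ln 2\); noise reduces the captured information and, with it, the minimal dissipation.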

- At \(~t_2\), the beginning of work extraction, **the correlations between the system and memory may provide useful information for predicting quantities relevant to work extraction**: \(~ p_{t_2}(m,z) = p_{t_2} (\bar y \mid y, m) p_{t_2}(y \mid m) p_{t_2}(m) = p(\bar y \mid y, m) p(y \mid m) p(m) \).
- At \(~t_3\), the end of the work extraction protocol, **all such correlations between manipulable system components and the memory are gone**, and thus: \(~ p_{t_3}(m,z) = p_{t_3} (\bar y \mid y, m) p_{t_3}(y) p_{t_3}(m) = p(\bar y \mid y, m) p(y) p(m) \).
- This again **increases uncertainty about the system, and thus the entropy of the joint system increases** by an amount equal to the mutual information between the memory and the manipulable system components: $$\Delta H_E = H[Y] - H[Y \mid M] = I[M,Y]$$
- This process occurs at temperature \(~T'\), so **the free energy change associated with work extraction** is \(~\Delta F_E = W_E + Q_E - kT'I[M,Y] \).
- By the Second Law, one can only extract an amount of work less than or equal to the decrease in free energy of the system during the extraction process: \(~W_E \geq \Delta F_E\) (remember that both quantities are negative). Thus, the **amount of heat that can be absorbed by the system and turned into work is bounded in proportion to the information retained in the agent's memory about the system components pertinent to work extraction**: \(~Q_E \leq kT'I[M,Y] \). This bound can only be saturated if the relevant quantities are fully observable.
- However, in general, the system components observable by the agent do not provide full information about the components relevant to work extraction. As such, not all of the energetic cost of running the memory may be recovered. **Out of the information \(~I_{mem} = I[M,X]\) captured in the memory, only some bits \(~I_{rel} = I[M,Y] \) are useful for prediction. The rest, \(~I_{irrel} = I_{mem} - I_{rel} \geq 0 \), is irrelevant information.**
- Since the memory depends only on the observable data and \(~p(y \mid m, x) = p(y \mid x)\), the chain \(~M \rightarrow X \rightarrow Y\) is Markov, so \(~I[M,Y] \leq I[M,X]\) and we can **loosen the upper bound on heat absorption to \(~Q_E \leq kT'I[M,X]\).**
- Does this imply that more heat can be absorbed and converted into work when we don't have full information? No: full work extraction can only occur in the ideal case where we have full information about the relevant quantities in the system, or at least when \(~Y \subset X\). The loosened bound simply can no longer be saturated.
- At this point, it may be helpful to recall that the mutual information of two random variables is $$I[X,Y] = D_{KL}(P_{(X,Y)} || P_X \otimes P_Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \ln\left(\frac{p(x,y)}{p(x)p(y)}\right)$$ i.e. the mutual information quantifies the correlations between two random variables as the divergence of the joint distribution from the product of marginals. A fun, tangentially related fact: the KL divergence between the probability distributions of forward and reverse trajectories of a process defines the arrow of time [cite Crooks].
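A quick numeric check (my own toy example) that the two expressions agree: computing \(~I[X,Y]\) both as \(~H[Y] - H[Y \mid X]\) and as the KL divergence of the joint from the product of marginals:

```python
import math

# Toy joint distribution p(x, y) over two binary variables (assumed numbers)
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(v for (xx, _), v in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(v for (_, yy), v in p_xy.items() if yy == y) for y in (0, 1)}

# Mutual information as KL(p(x,y) || p(x)p(y))
I_kl = sum(v * math.log(v / (p_x[x] * p_y[y])) for (x, y), v in p_xy.items())

# Mutual information as H[Y] - H[Y|X]
H_Y = -sum(v * math.log(v) for v in p_y.values())
H_Y_given_X = -sum(v * math.log(v / p_x[x]) for (x, y), v in p_xy.items())
I_ent = H_Y - H_Y_given_X  # equals I_kl
```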

If the information engine is connected to a heat bath of constant temperature for an entire cycle, it dissipates an average amount of heat \(~ -Q = -Q_M - Q_E \geq kTI_{irrel} \). **The irrelevant information retained in the memory sets a lower limit on the dissipation.** This lower bound is zero only when no irrelevant information is retained.

Let's talk about an example now to make calculations more concrete. Consider the Szilard engines below. In the left box, knowledge of the \(~x\) coordinate of the particle provides no information about the \(~y\) coordinate (i.e. which half of the box the particle is located in). However, in the right box, **introducing geometric constraints within the box creates correlations between the degrees of freedom**. As such, an agent observing the \(~x\) coordinate would have a non-trivial probability distribution for the \(~y\) coordinate, and thus would be able to engage in predictive inference to perform useful work.

Now note that **the coarse-graining strategy chosen for data representation in the agent's memory will affect how much work can be extracted** from the setup. For example, representing the particle's \(~x\) coordinate with a 2-state (left) or 3-state (right) memory affects the agent's capacity for predictive inference. The wall will be moved to the side believed to be empty with higher probability. This will be **incorrect (and result in compression instead of expansion) with probability \(~q(m)\). Accordingly, leaving a fractional volume \(~\rho (m) V\) unused is a good strategy.**

Thus, given \(~m\), we extract on average an amount of work: $$kT'(1-q(m))\ln (\frac{V - \rho (m)V}{\frac{V}{2}}) + kT'q(m)\ln (\frac{\rho (m)V}{\frac{V}{2}})$$ This is maximized when \(~\rho(m) = q(m)\). This leads to a total average extracted work, which saturates the bound derived earlier from the second law, because an isothermal transformation leaves the average energy of an ideal gas unchanged, so \(~-W_E = Q_E\): $$-W_E = kT' \sum_m p(m) [\ln 2 + (1-q(m))\ln (1-q(m)) + q(m)\ln (q(m))]$$ $$=kT'(H[Y] - H[Y \mid M])$$ $$=kT'I[M,Y]$$
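A small grid-search sanity check (my own, with an assumed error probability \(~q(m) = 0.2\)) that leaving fraction \(~\rho(m) = q(m)\) unused indeed maximizes the average extracted work per memory state:

```python
import math

def avg_work(rho, q):
    """Average extracted work (in units of kT') for error prob q and unused fraction rho."""
    return (1 - q) * math.log((1 - rho) / 0.5) + q * math.log(rho / 0.5)

q = 0.2  # assumed error probability, for illustration
# Grid search over the unused volume fraction rho
rhos = [i / 1000 for i in range(1, 1000)]
best_rho = max(rhos, key=lambda r: avg_work(r, q))

# The optimum sits at rho = q, and the optimal value matches
# ln 2 + (1-q) ln(1-q) + q ln q
optimal_value = math.log(2) + (1 - q) * math.log(1 - q) + q * math.log(q)
```

Setting the derivative of `avg_work` to zero gives \(~q(1-\rho) = \rho(1-q)\), i.e. \(~\rho = q\), which the grid search confirms.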

**For a 3-state memory**, \(~m \in \{1, 2, 3\}\), and $$~q(m) = \left\{\begin{array}{ll} 0 & \quad m = 1, 3 \\ \frac{1}{2} & \quad m = 2 \end{array} \right.$$ Thus \(~I[M,Y] = \frac{2}{3} \ln 2\), and at most \(~kT \frac{2}{3} \ln 2\) of work can be extracted, at a cost of \(~kT \ln 3\) to run the memory. Therefore **the lower limit on dissipation is \(~ -Q \geq kTI_{irrel} = kT(\ln 3 - \frac{2}{3} \ln 2) \approx 0.64kT\).**

**For a 2-state memory**, \(~m \in \{1, 2\}\), and \(~q(m) = \frac{1}{6}\) for \(~m = 1, 2\). Thus \(~I[M,Y] = \frac{5}{6} \ln 5 - \ln 3\), and at most \(~kT [\frac{5}{6} \ln 5 - \ln 3]\) of work can be extracted, at a cost of \(~kT \ln 2\) to run the memory. Therefore **the lower limit on dissipation is \(~ -Q \geq kTI_{irrel} = kT(\ln 6 - \frac{5}{6} \ln 5) \approx 0.45kT\).**
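Checking these numbers (a sketch assuming equiprobable memory states: \(~p(m) = 1/3\) for the 3-state memory and \(~p(m) = 1/2\) for the 2-state one):

```python
import math

def I_rel(p_m, q):
    """I[M,Y] = H[Y] - H[Y|M] in nats, with H[Y] = ln 2 and per-state error prob q(m)."""
    def h(qm):  # binary entropy, with the convention 0 ln 0 = 0
        if qm in (0.0, 1.0):
            return 0.0
        return -(qm * math.log(qm) + (1 - qm) * math.log(1 - qm))
    return sum(pm * (math.log(2) - h(qm)) for pm, qm in zip(p_m, q))

# 3-state memory: q = 0 for m = 1, 3 and q = 1/2 for m = 2; cost I_mem = ln 3
I_rel_3 = I_rel([1/3, 1/3, 1/3], [0.0, 0.5, 0.0])
dissipation_3 = math.log(3) - I_rel_3   # lower bound on -Q, in units of kT

# 2-state memory: q = 1/6 for both states; cost I_mem = ln 2
I_rel_2 = I_rel([1/2, 1/2], [1/6, 1/6])
dissipation_2 = math.log(2) - I_rel_2
```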

We therefore see that the less-detailed 2-state memory is more thermodynamically efficient because the cost of running a 3-state memory is relatively high.

However, **given access to two temperatures, the 3-state memory can become more efficient**. Let's say we form the memory at a temperature \(~T\) and extract work at a temperature \(~T' > T\). A 3-state memory becomes advantageous when:
$$\alpha \equiv \frac{T'}{T} > \alpha^* = -\frac{cost(2SM) - cost(3SM)}{gain(2SM)-gain(3SM)} = \frac{\ln 3 - \ln 2}{\ln 3 + \frac{2}{3} \ln 2 - \frac{5}{6} \ln 5} \approx 1.847$$
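Evaluating the crossover ratio numerically, and sanity-checking (my own arithmetic) that above it the 3-state memory indeed nets more work:

```python
import math

# Crossover temperature ratio alpha* above which the 3-state memory wins
alpha_star = (math.log(3) - math.log(2)) / (
    math.log(3) + (2/3) * math.log(2) - (5/6) * math.log(5)
)

# Net work (in units of kT) for each memory at alpha = 2 > alpha*
alpha = 2.0
net_3 = alpha * (2/3) * math.log(2) - math.log(3)
net_2 = alpha * ((5/6) * math.log(5) - math.log(3)) - math.log(2)
```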
While an isothermal information engine can at best only recover the energy required to run the memory, **an information engine operated at two temperatures can produce non-negative work output**. The earlier lower bound on dissipation generalizes to \(~-Q \geq k(TI_{mem} - T'I_{rel})\). There are some key assumptions in the use of two temperatures:

- The **heating and cooling of the information engine do not destroy the correlations** between the memory and the system.
- **No additional degrees of freedom are unlocked** at higher temperatures.
- The heating and cooling steps together **do not result in a net influx of heat.**

To find **the best strategy for data representation in memory**, the agent must:
$$\min_{p(m \mid x)} \left[ I[M,X] - \alpha I[M,Y] \right] \quad \text{s.t.} \quad \sum_m p(m \mid x) = 1$$
**\(~ \alpha \)** determines the relative importance of detail (finer coarse-graining) in the data representation, and **can also be interpreted as a Lagrange multiplier that controls the tradeoff between conciseness and relevance**. The generalized Szilard information engine presented above can be run in a process akin to a Carnot process (wow, I'm surprised that part of thermodynamics is useful XD; it is the most boring part for me).
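This objective is the information bottleneck. As a toy sketch (my own construction with assumed distributions, not the paper's example), one can brute-force the best *deterministic* encoder \(~p(m \mid x)\) for small alphabets:

```python
import math
from itertools import product

def mutual_info(p_joint):
    """Mutual information in nats from a joint distribution {(a, b): prob}."""
    p_a, p_b = {}, {}
    for (a, b), v in p_joint.items():
        p_a[a] = p_a.get(a, 0.0) + v
        p_b[b] = p_b.get(b, 0.0) + v
    return sum(v * math.log(v / (p_a[a] * p_b[b]))
               for (a, b), v in p_joint.items() if v > 0)

# Assumed toy problem: X uniform over 3 cells; the middle cell carries no
# information about the binary relevant variable Y
p_x = [1/3, 1/3, 1/3]
p_y_given_x = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

def objective(encoder, alpha):
    """I[M,X] - alpha * I[M,Y] for a deterministic encoder x -> m."""
    p_mx = {(encoder[x], x): p_x[x] for x in range(3)}
    p_my = {}
    for x in range(3):
        for y in range(2):
            key = (encoder[x], y)
            p_my[key] = p_my.get(key, 0.0) + p_x[x] * p_y_given_x[x][y]
    return mutual_info(p_mx) - alpha * mutual_info(p_my)

# For large enough alpha, a non-trivial 2-state encoder beats the trivial one
alpha = 3.0
best = min(product(range(2), repeat=3), key=lambda f: objective(f, alpha))
```

At small \(~\alpha\) the trivial (uninformative) encoder wins, since every retained bit costs more than it earns; past a threshold, grouping the uninformative middle cell with one endpoint becomes optimal.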

- The **data representation step can be an isothermal transformation** at temperature \(~T\), implementable with \(~ \Delta E_M = 0\), implying \(~W_M = - Q_M = kTI_{mem}\).
- The **work extraction protocol** can be implemented as follows in order to respect the assumptions stated earlier:
    - Isolate the box from the heat bath and perform an **isentropic (adiabatic and reversible) compression of the entire box**, taking it from volume \(~V \rightarrow V'\) and raising the temperature to \(~T'\). For such a process, \(~ \frac{V}{V'} = (\frac{T'}{T})^{\frac{d}{2}} \), where \(~d\) is the number of degrees of freedom of the box.
    - Connect the engine to a heat bath at temperature \(~T'\) and **extract work isothermally** by moving the partition to the most probable empty direction, leaving the optimized fractional volume \(~ \rho (m) V\).
    - Isolate the box and perform an isentropic expansion from \(~V' \rightarrow V\), which lowers the temperature back to \(~T\).
    - Remove the partition and re-insert it in the middle.
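For concreteness, a sketch of the operating point (my assumptions: \(~d = 2\) for a single particle in a 2D box, and \(~T'/T = 2.5\), chosen large enough for positive net work with the 3-state memory):

```python
import math

alpha = 2.5  # assumed temperature ratio T'/T
d = 2        # assumed degrees of freedom (particle in a 2D box)

# Isentropic step: V / V' = (T'/T)^(d/2)
compression_ratio = alpha ** (d / 2)

# Net work output per cycle for the 3-state memory, in units of kT
I_mem, I_rel = math.log(3), (2/3) * math.log(2)
net_work_kT = alpha * I_rel - I_mem  # positive: the engine outputs net work
```

Note that positive net work requires \(~\alpha > I_{mem}/I_{rel} \approx 2.38\) for this memory, a stricter condition than the \(~\alpha^* \approx 1.85\) crossover at which the 3-state memory merely beats the 2-state one.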


Thus the net work produced from the above process is \(~ -W_E - W_M = k(T'I_{rel} - TI_{mem})\), which saturates our lower bound on dissipation. The engine has an efficiency:
$$\eta = 1 - \frac{T}{T'} \frac{I_{mem}}{I_{rel}} = \eta_c - \frac{T}{T'} \frac{I_{irrel}}{I_{rel}}$$
where \(~ \eta_c = 1 - \frac{T}{T'}\). **The efficiency of the engine is only non-negative when \(~ \frac{I_{rel}}{I_{mem}} \geq \frac{T}{T'}\).** So the strategies to increase efficiency are to increase \(~ I_{rel}\) or increase \(~ T'\), the former of which is easier since it simply requires picking the smartest data representation strategy given the available data.
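Numerically checking the two equivalent forms of the efficiency (using the 3-state-memory values from above and an assumed \(~T'/T = 2.5\)):

```python
import math

# 3-state memory information quantities (in nats)
I_mem = math.log(3)
I_rel = (2/3) * math.log(2)
I_irrel = I_mem - I_rel

T, T_prime = 1.0, 2.5    # assumed temperatures; only the ratio matters
eta_c = 1 - T / T_prime  # Carnot efficiency

# Two equivalent expressions for the engine efficiency
eta_1 = 1 - (T / T_prime) * (I_mem / I_rel)
eta_2 = eta_c - (T / T_prime) * (I_irrel / I_rel)
```

The irrelevant information shaves the efficiency well below Carnot: here \(~\eta \approx 0.05\) versus \(~\eta_c = 0.6\).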

This method can be applied to many different systems, one of which is the **Boltzmann machine**, a machine learning algorithm in which input patterns drive the neural network from a parameter-dependent equilibrium state \(~q_{\theta}\) to a nonequilibrium state \(~ p\). The change in free energy during this process \(~ \Delta F = kTD_{KL}[p || q_{\theta}]\) is dissipated during the relaxation process involved in predicting labels on new patterns. Finding the optimal parameters that minimize \(~ D_{KL}[p || q_{\theta}]\) minimizes the lower bound on average dissipation encountered during prediction.
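A minimal numeric sketch (my own toy distributions, not from the paper) of this dissipation bound: the free energy change, and hence the minimal dissipation during relaxation, is \(~kT D_{KL}[p || q_{\theta}]\), so better-fit parameters lower the bound.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Assumed toy distributions over 3 network states
p = [0.7, 0.2, 0.1]          # nonequilibrium state driven by the input pattern
q_bad = [1/3, 1/3, 1/3]      # equilibrium of a poorly fit model
q_good = [0.65, 0.25, 0.10]  # equilibrium of a better-fit model

# Minimal dissipation (in units of kT) encountered during prediction;
# training q_theta toward p lowers this bound
bound_bad = kl(p, q_bad)
bound_good = kl(p, q_good)
```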

This paper focuses on optimizing an information engine for energy efficiency, but **a more realistic information engine would undergo multivariate optimization, accounting for other variables such as speed, accuracy, and robustness**. There are tradeoffs among all of these factors. This work can also be applied to quantum systems, and I will have to read the references cited in the paper to get a sense of how :)