Thermodynamic Cost and Benefit of Memory

Summary and Illustrations by Farita Tasnim



I thoroughly enjoyed reading this stunning work by Susanne Still. The generality of the theoretical framework developed herein, as well as its uniqueness in accounting for partial observability, sparked many ideas in my mind. I recently watched a talk (it was so good!) by Jinghui Liu, a graduate student in the lab I'm co-advised in (FakhriLab@MIT), about topological dynamics and bosonic phases on the surface of a cell membrane; it got me thinking about treating topological defects (in wave propagation on the cell membrane) as effective particles in a nonequilibrium statistical mechanics setup. Re: this paper, can we think about the information processing capacity of a cell in an abstract topological sense?


  In a Nutshell

A theoretical framework for the thermodynamics of memories interacting with partially observable systems demonstrates that minimizing the lower bound on the dissipation of such an information engine leads to an optimal data-representation strategy when available knowledge and system manipulability are limited.


  Szilard Engine

A particle is trapped in a box with a partition in the middle. If an agent observing the system knows which half of the box the particle lies in, it can extract work (of amount \(~kT\ln 2\)) by isothermal expansion. However, models such as this assume that (1) all agent choices are optimal and (2) all relevant degrees of freedom are observable. This limits the discussion of information engines to cases where all captured information can be converted into useful work. In most real systems, agents must act on partial knowledge and predict the quantities relevant to work extraction from the limited available data. Modeling this more realistic aspect of information engines would be useful for systems such as biomolecular machines and engineered nanotechnology.
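As a quick numeric aside (my own back-of-the-envelope, not from the paper): assuming room temperature \(~T \approx 300\) K, the work per Szilard cycle is a few zeptojoules.

```python
import math

k_B = 1.380649e-23   # Boltzmann constant in J/K (exact SI value)
T = 300.0            # assumed room temperature in K

# Maximum work extractable per Szilard cycle: kT ln 2
w_max = k_B * T * math.log(2)
print(f"{w_max:.3e} J")   # a few zeptojoules (1 zJ = 1e-21 J)
```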

[Figure: Szilard engine]


  General Information Engine Setup

The following conventions are used: the system state is \(~z \in Z\), the memory state is \(~m \in M\), \(~p_t\) denotes a probability distribution at time \(~t\), \(~k\) is Boltzmann's constant, and \(~T\) is the temperature of the heat bath.

The information engine contains the following components: a partially observable system, a memory, a heat bath (or two baths at temperatures \(~T\) and \(~T'\)), and an agent that measures and manipulates the system.

Each cycle runs as follows: the system is prepared (driven by a hidden variable \(~v\)) before \(~t_0\); the memory is formed between \(~t_0\) and \(~t_1\); and work is extracted between \(~t_2\) and \(~t_3\).

  Free Energy Changes

The free energy change at a time \(~t\) depends on the energy and entropy averaged over the joint distribution over the system states and memory states: \(~F_t = \langle E_t (m,z) \rangle_{p_t (m,z)} - kTH_t\), where the Shannon entropy is defined as: \(~H_t = - \langle \ln(p_t (m,z)) \rangle_{p_t (m,z)}\)
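To make the bookkeeping concrete, here is a minimal sketch computing \(~F_t\) from a joint distribution; the distribution and energy function below are made up for illustration, in units where \(~kT = 1\):

```python
import math

# Hypothetical joint distribution p_t(m, z) over (memory, system) states
p = {("m0", "z0"): 0.4, ("m0", "z1"): 0.1,
     ("m1", "z0"): 0.1, ("m1", "z1"): 0.4}

# Hypothetical energy function E_t(m, z)
E = {("m0", "z0"): 0.0, ("m0", "z1"): 1.0,
     ("m1", "z0"): 1.0, ("m1", "z1"): 0.0}

kT = 1.0  # natural units

avg_E = sum(p[s] * E[s] for s in p)         # <E_t> under p_t(m, z)
H = -sum(p[s] * math.log(p[s]) for s in p)  # Shannon entropy H_t (nats)
F = avg_E - kT * H                          # F_t = <E_t> - kT H_t
```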

We further establish that:

$$\Delta F_M \equiv F_{t_1} - F_{t_0} = \Delta E_M (= W_M + Q_M) -kT\Delta H_M$$ $$\Delta F_E \equiv F_{t_3} - F_{t_2} = \Delta E_E (= W_E + Q_E) -kT\Delta H_E$$


  System Decomposition

The system can be decomposed into observable (writable into memory) and non-observable components: \(~Z = (X, \bar X)\). It can also be decomposed into manipulable (relevant for work extraction) and non-manipulable components: \(~Z = (Y, \bar Y)\). The mutual information \(~I(X,Y)\) must be strictly positive in order to extract work.
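The mutual information between the observable and manipulable parts can be computed directly from a joint distribution; here is a small sketch with made-up distributions (a perfectly correlated pair and an independent pair, for contrast):

```python
import math

def mutual_information(p_xy):
    """I(X;Y) in nats from a joint distribution {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in p_xy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in p_xy.items() if p > 0)

# Perfectly correlated: I = ln 2 (work extractable); independent: I = 0
correlated  = {(0, 0): 0.5, (1, 1): 0.5}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
```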


  System Manipulation

The system cannot be manipulated in a way that changes anything but \(~y \in Y\), i.e. the state of the non-manipulable components \(~ \bar y \in \bar Y \) is unaffected by the state of the manipulable components and the state of the memory during the work extraction period.

$$p_{t_2}(\bar y \mid y,m) = p_{t_3}(\bar y \mid y,m) = p(\bar y \mid y,m)$$


  Data Representation

The stochastic mapping (i.e. the data representation) \(~ p(m \mid x)\) from observable data to memory state is independent of unobservable data:

$$ p_{t_1}(m \mid z) = p_{t_1}(m \mid x, \bar x) = p_{t_1}(m \mid x) \equiv p(m \mid x)$$


  Marginal Distributions

The marginal distributions for the system are invariant to all changes performed on the system. Thus \(~ \forall k \in \{1,2,3,4\}\)

$$ p_{t_k}(z) = p(z), ~~~p_{t_k}(y) = \sum_{\bar y} p_{t_k}(y, \bar y) = p(y), ~~~p_{t_k}(x) = \sum_{\bar x} p_{t_k}(x, \bar x) = p(x)$$

The preparation step introduces a hidden variable \(~v\) into the system; if \(~v\) were known, the system would appear to be in a nonequilibrium state that can be exploited during work extraction.

$$p_{t_0}(y) = \sum_v p(y \mid v) p(v)$$

The marginal probability distribution of the memory derives only from the statistical average over measurement outcomes.

$$p_{t_k}(m) = \sum_x p(m \mid x) p(x)$$



The agent's ability to predict the quantities relevant to work extraction from the memory derives from a statistical average over measurement outcomes:

$$p_{t_2}(y \mid m) = \sum_x p(y \mid x)\, p(x \mid m) = \frac{1}{p(m)} \sum_x p(y \mid x)\, p(m \mid x)\, p(x)$$

utilizing that, given a measurement outcome, the memory adds no new relevant information: \(~p(y \mid m, x) = p(y \mid x)\)
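This inference chain can be sketched in code; the toy distributions below (a noiseless 2-state memory that copies \(~x\), and a \(~y\) correlated with \(~x\)) are illustrative choices of mine, not from the paper:

```python
def predict_y_given_m(p_y_given_x, p_m_given_x, p_x):
    """Return p(y|m) = sum_x p(y|x) p(x|m), using Bayes on p(m|x) p(x)."""
    # Marginal p(m) = sum_x p(m|x) p(x)
    p_m = {}
    for x, px in p_x.items():
        for m, pmx in p_m_given_x[x].items():
            p_m[m] = p_m.get(m, 0.0) + pmx * px
    # Predictive distribution p(y|m)
    pred = {m: {} for m in p_m}
    for x, px in p_x.items():
        for m, pmx in p_m_given_x[x].items():
            w = pmx * px / p_m[m]  # p(x|m)
            for y, pyx in p_y_given_x[x].items():
                pred[m][y] = pred[m].get(y, 0.0) + pyx * w
    return pred

# Toy example: memory copies x exactly; y ("L"/"R") is correlated with x
p_x = {0: 0.5, 1: 0.5}
p_m_given_x = {0: {0: 1.0}, 1: {1: 1.0}}
p_y_given_x = {0: {"L": 0.9, "R": 0.1}, 1: {"L": 0.1, "R": 0.9}}
pred = predict_y_given_m(p_y_given_x, p_m_given_x, p_x)
```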


  Thermodynamic Cost of Memory

Forming the memory requires work: the second law bounds the average work needed to run the memory from below by the memorized information, \(~W_M \geq kTI_{mem}\), where \(~I_{mem} \equiv I[M,X]\).


  Thermodynamic Gain from Memory

The benefit of the memory is bounded by the relevant information it retains: the average extracted work obeys \(~-W_E \leq kTI_{rel}\), where \(~I_{rel} \equiv I[M,Y]\). The difference \(~I_{irrel} \equiv I_{mem} - I_{rel}\) is the irrelevant information retained in the memory.


  Lower Bound on Dissipation

If the information engine is connected to a heat bath of constant temperature for an entire cycle, it dissipates an average amount of heat \(~ -Q = -Q_M - Q_E \geq kTI_{irrel} \). The irrelevant information retained in the memory sets a lower limit on the dissipation. This lower bound is zero only when no irrelevant information is retained.


  An Illustrative Example

Let's work through an example to make the calculations more concrete. Consider the Szilard engines below. In the left box, knowledge of the \(~x\) coordinate of the particle provides no information about the \(~y\) coordinate (i.e. which half of the box the particle is located in). In the right box, however, geometric constraints within the box create correlations between the degrees of freedom. As such, an agent observing the \(~x\) coordinate would have a non-trivial probability distribution for the \(~y\) coordinate, and thus could engage in predictive inference to perform useful work.


[Figure: Szilard engines without (left) and with (right) geometric constraints]

Now note that the coarse-graining strategy chosen for the data representation in the agent's memory affects how much work can be extracted from the setup. For example, representing the particle's \(~x\) coordinate with a 2-state (left) or 3-state (right) memory changes the agent's capacity for predictive inference. The wall will be moved toward the side believed, with higher probability, to be empty. This guess will be incorrect (and result in compression instead of expansion) with probability \(~q(m)\). Accordingly, leaving a fractional volume \(~\rho (m) V\) unused hedges against this error.


[Figure: 2-state (left) and 3-state (right) memory coarse-grainings]

Thus, given \(~m\), we extract on average an amount of work: $$kT'(1-q(m))\ln (\frac{V - \rho (m)V}{\frac{V}{2}}) + kT'q(m)\ln (\frac{\rho (m)V}{\frac{V}{2}})$$ This is maximized when \(~\rho(m) = q(m)\). This leads to a total average extracted work, which saturates the bound derived earlier from the second law, because an isothermal transformation leaves the average energy of an ideal gas unchanged, so \(~-W_E = Q_E\): $$-W_E = kT' \sum_m p(m) [\ln 2 + (1-q(m))\ln (1-q(m)) + q(m)\ln (q(m))]$$ $$=kT'(H[Y] - H[Y \mid M])$$ $$=kT'I[M,Y]$$
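A quick numerical check (my own, with an assumed error probability \(~q = 0.2\)) that the unused fraction is optimal at \(~\rho = q\), and that the optimal work equals \(~kT'(\ln 2 - H_2(q))\), where \(~H_2\) is the binary entropy:

```python
import math

def avg_work_extracted(q, rho, kT=1.0):
    """Average extracted work per cycle for error probability q
    and unused volume fraction rho (kT' set to 1)."""
    return kT * ((1 - q) * math.log((1 - rho) / 0.5)
                 + q * math.log(rho / 0.5))

q = 0.2  # assumed probability that the guess about y is wrong
grid = [i / 1000 for i in range(1, 1000)]
best_rho = max(grid, key=lambda r: avg_work_extracted(q, r))

# At the optimum rho = q, the work reduces to ln 2 minus the binary
# entropy H2(q), matching -W_E = kT' I[M,Y] from the derivation above
h2 = -(q * math.log(q) + (1 - q) * math.log(1 - q))
w_opt = avg_work_extracted(q, q)
```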

We therefore see that, at a single temperature, the less-detailed 2-state memory is more thermodynamically efficient, because the cost of running the 3-state memory outweighs its additional predictive gain.


  Access to Two Temperatures

However, given access to two temperatures, the 3-state memory can become more efficient. Say we form the memory at temperature \(~T\) and extract work at a temperature \(~T' > T\). The 3-state memory becomes advantageous when: $$\alpha \equiv \frac{T'}{T} > \alpha^* = \frac{\mathrm{cost}(\mathrm{3SM}) - \mathrm{cost}(\mathrm{2SM})}{\mathrm{gain}(\mathrm{3SM}) - \mathrm{gain}(\mathrm{2SM})} = \frac{\ln 3 - \ln 2}{\ln 3 + \frac{2}{3} \ln 2 - \frac{5}{6} \ln 5} \approx 1.847$$ While an isothermal information engine can at best only recover the energy required to run the memory, an information engine operated at two temperatures can produce net positive work output. The earlier lower bound on dissipation generalizes to \(~-Q \geq k(TI_{mem} - T'I_{rel})\). The use of two temperatures relies on some key assumptions.
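As a sanity check of the arithmetic (my own, not from the paper), the quoted crossover ratio follows directly:

```python
import math

ln2, ln3, ln5 = math.log(2), math.log(3), math.log(5)

# alpha* = (ln 3 - ln 2) / (ln 3 + (2/3) ln 2 - (5/6) ln 5)
alpha_star = (ln3 - ln2) / (ln3 + (2 / 3) * ln2 - (5 / 6) * ln5)
print(round(alpha_star, 3))  # 1.847
```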



To find the best strategy for data representation in memory, the agent must solve: $$\min_{p(m \mid x)} \big[ I[M,X] - \alpha I[M,Y] \big] \quad \text{s.t.} \quad \sum_m p(m\mid x) = 1$$ \(~ \alpha \) determines the relative importance of detail (finer coarse-graining) in the data representation, and can also be interpreted as a Lagrange multiplier that controls the tradeoff between conciseness and relevance. The generalized Szilard information engine presented above can be run in a process akin to a Carnot process (wow I'm surprised that part of TD is useful XD - it is the most boring part for me).
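This objective can be explored numerically; below is a crude grid search over binary stochastic mappings \(~p(m \mid x)\). The distributions and the value of \(~\alpha\) are illustrative choices of mine, not the paper's:

```python
import math

def mi(p_joint):
    """Mutual information in nats from a joint distribution {(a, b): p}."""
    pa, pb = {}, {}
    for (a, b), p in p_joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in p_joint.items() if p > 0)

p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}

def objective(a, b, alpha):
    """I[M,X] - alpha * I[M,Y] for the mapping p(m=0|x=0)=a, p(m=0|x=1)=b."""
    p_m_given_x = {0: {0: a, 1: 1 - a}, 1: {0: b, 1: 1 - b}}
    p_mx = {(m, x): p_m_given_x[x][m] * p_x[x]
            for x in p_x for m in (0, 1)}
    p_my = {}
    for x in p_x:
        for m in (0, 1):
            for y in (0, 1):
                p_my[(m, y)] = (p_my.get((m, y), 0.0)
                                + p_m_given_x[x][m] * p_y_given_x[x][y] * p_x[x])
    return mi(p_mx) - alpha * mi(p_my)

# Grid search; with alpha = 2 a correlated (non-trivial) mapping wins
grid = [i / 50 for i in range(51)]
best = min((objective(a, b, 2.0), a, b) for a in grid for b in grid)
```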

Thus the net work produced from the above process is \(~ -W_E - W_M = k(T'I_{rel} - TI_{mem})\), which saturates our lower bound on dissipation. The engine has an efficiency: $$\eta = 1 - \frac{T}{T'} \frac{I_{mem}}{I_{rel}} = \eta_c - \frac{T}{T'} \frac{I_{irrel}}{I_{rel}}$$ where \(~ \eta_c = 1 - \frac{T}{T'}\). The efficiency of the engine is only non-negative when \(~ \frac{I_{rel}}{I_{mem}} \geq \frac{T}{T'}\). So the strategies to increase efficiency are to increase \(~ I_{rel}\) or increase \(~ T'\), the former of which is easier since it simply requires picking the smartest data representation strategy given the available data.
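In code, the efficiency tradeoff reads as follows (the temperatures and information values are illustrative assumptions of mine):

```python
def efficiency(T, T_prime, I_mem, I_rel):
    """eta = 1 - (T/T') * (I_mem / I_rel); equals Carnot when I_mem = I_rel."""
    return 1.0 - (T / T_prime) * (I_mem / I_rel)

T, T_prime = 300.0, 600.0   # assumed bath temperatures
eta_c = 1.0 - T / T_prime   # Carnot efficiency (0.5 here)

# Retaining irrelevant information (I_mem > I_rel) drags eta below eta_c
eta = efficiency(T, T_prime, I_mem=1.0, I_rel=0.8)
```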


  Closing Thoughts

This method can be applied to many different systems, one of which is the Boltzmann machine, a machine learning algorithm in which input patterns drive the neural network from a parameter-dependent equilibrium state \(~q_{\theta}\) to a nonequilibrium state \(~ p\). The change in free energy during this process \(~ \Delta F = kTD_{KL}[p || q_{\theta}]\) is dissipated during the relaxation process involved in predicting labels on new patterns. Finding the optimal parameters that minimize \(~ D_{KL}[p || q_{\theta}]\) minimizes the lower bound on average dissipation encountered during prediction.
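A sketch of this dissipation bound for a toy two-state Boltzmann-machine-like setting (the distributions are made up, and \(~k = T = 1\)):

```python
import math

def kl_divergence(p, q):
    """D_KL[p || q] in nats for distributions over the same states."""
    return sum(pi * math.log(pi / q[s]) for s, pi in p.items() if pi > 0)

p       = {0: 0.7, 1: 0.3}   # data-driven nonequilibrium state
q_theta = {0: 0.5, 1: 0.5}   # model's parameter-dependent equilibrium state

# Delta F = kT D_KL[p || q_theta]: lower bound on dissipation, in units of kT
delta_F = kl_divergence(p, q_theta)
```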

This paper focuses on optimizing an information engine for energy efficiency, but a more realistic information engine would undergo multivariate optimization, accounting for other variables such as speed, accuracy, and robustness; there are tradeoffs among all of these factors. This work can also be applied to quantum systems, and I will have to read the references cited in the paper to get a sense of how :)


  Original Paper