Entropy appeared in physics, in the context of thermodynamics. However, after Boltzmann, it became a crucial concept for the statistical mechanics and for the whole physics. Since Shannon’s paper of 1948  , information entropy was the basis of information theory on which current IT technology relies. It is really impressive that physical and informational entropies are essentially the same thing, and probably this common essence is so far not completely and properly understood. Physicist John Archibald Wheeler claimed in his speculations  : “It from bit” (all things physical are information-theoretic in origin). Namely, many open problems raised by modern physics will probably be clarified in a more general framework where informational concepts will be the main part of new physical theories. Entropy is directly related to information, probability, complexity, and time. Boltzmann’s investigations were aimed at reducing ther- modynamical entropy to mechanical concepts, by integrating Newton’s me- chanics with statistical distributions. The focus of this reduction was the deter- mination of a microscopical representation of entropy through which explaining the arrow of time implied by thermodynamics. This explanation was the so called H-Theorem that Boltzmann introduced in 1872  , showing that physical time is oriented as the decrease of the H-function (the entropy, apart from addi- tive and multiplicative constants and the change of sign). A satisfactory proof of this property was never completely found by Boltzmann.
In this paper, we prove a discrete version of the H-Theorem by using basic physical and informational concepts. Namely, Boltzmann’s difficulties to prove the H-Theorem are due to the information-theoretic nature of this theorem, more than fifty years before the foundation of Information Theory.
The present paper is mainly of methodological nature. In this perspective any discipline based on information-theoretic concepts may benefit from the analy- sis developed in the paper as it concerns with the link between the physical and the informational perspectives of entropy.
2. From Physical to Informational Entropy
In 1824, Sadi Carnot wrote a book about the power of fire in producing energy  . It was known that heat can provide mechanical energy. Moreover, as it was experimentally later shown by Joule  , mechanical energy can be entirely transformed into heat (first law of thermodynamics). However, it is impossible entirely transforming heat into mechanical energy. In fact, this is a way to formulate the second law of thermodynamics, which is also equivalent to the impossibility of spontaneous processes where heat passes from a colder body to a hotter body. Carnot introduced some ideal heat engines, called reversible engines, reaching the optimal efficiency in heat-work transformation. In this way, he proved a theorem giving an evaluation of the heat quantity that necessarily cannot be transformed into mechanical energy by heat engines. Namely, when an engine takes a quantity of heat from a heat source (boiler) at constant temperature , then a quantity has to be released to a colder body (condenser) at temperature , and only the difference can be transformed into mechanical work. For reversible engines working between , the released heat quantity reaches the minimum (positive) value, and the following equation holds  :
therefore, if we denote by the heat quantity when is the unitary temperature and we obtain:
corresponds to a thermodynamical quantity, later called by Clausius  the entropy (of ), corresponding to the minimum heat quantity that necessarily a heat engine, working between temperatures and 1 has to release (to the unitary condenser) and that cannot be transformed into mechanical work.
The famous formulation of the second law of thermodynamics, by means of entropy , asserts that in a closed system (that does not exchange energy with the external environment) the entropy cannot decrease in time.
In this scenario, in 1870s years, Ludwig Boltzmann started a research aimed at explaining the second law of thermodynamics in terms of Newtonian mechanics    . The main question was: “Where does time arrow came from?”. In fact, in mechanics all the laws are symmetric with respect to time and the same equations tell us what happens in the future, but also what happened in the past. In no equation there is an explicit indication about the direction of events in time (the first Chapter of Wiener “Cybernetics”  is devoted to the different notions of time in Newtonian mechanics and in biology).
The first step of Boltzmann’s project was a mechanical formulation of entropy. This formulation can be found by starting from the fundamental law of ideal gases, where is the pressure, the volume, the absolute (Kelvin) temperature, the number of gas moles, and is the gas constant:
If we pass from the gas moles to the number of molecules in the gas (by the relation where is the Avogadro constant), we get an equi- valent formulation, where is now called the Boltzmann constant:
Now, let us assume that the gas takes some heat by expanding from a volume to a volume . Then, the quantity of this heat is given by:
and by expressing according to Equation (3), we get:
Let assume to start from a unitary volume . If in Equation (6) , and is moved to the left member, then we obtain:
that, according to Carnot’s equation (2), gives:
where expresses the number of possible ways of allocating the mole- cules in volume cells. The passage from constant to constant and from moles to molecules, accordingly, is crucial to the microscopic reading of the formula (very often this is not adequately stressed when Boltzmann’s argument is analyzed).
We can assume that the gas is spatially homogeneous, that is, the same number of molecules are in any volume cell, so that spatial positions of cells do not matter. Therefore, an allocation of molecules in volume cells, is completely determined by the (molecule) velocities allocated in each volume cell , apart from multiplicative constants (factorials expressing the indiscernibility of molecules and cells), which in logarithmic terms are additive constants that are omitted. In conclusion, velocities partition the molecules of gas in a number of different velocity classes: the intervals (with a fixed increment value, and the numbers of molecules having velocities in the centered intervals, respectively). Hence, all the number of allocations of n molecules in the volume V corresponds to the number of different ways W that n molecules can be distributed into m different velocity classes (the number of different micro-states associated to a given thermo- dynamic macro-state). Whence, the final famous equation is obtained:
Equation (10) is reported in Boltzmann’s tomb in Wien. In this form, the equation was later given by Max Planck, who followed Boltzmann’s approach in his famous conference on December 14, 1900, from which Quantum Theory emerged  . The importance of this microscopic formulation of Entropy is the statistical approach that stemmed from it and that became crucial in the whole physics of twentieth century.
The computation of was obtained from Boltzmann by means of the so called Maxwell-Boltzmann statistics, related to the “Wahrscheinlichkeit” prin- ciple, and can be described by the following deduction. Namely, Let us consider all the molecules within the same velocity class as undistinguishable, then the different distributions of molecules into the velocity classes are given by the following multinomial expression:
therefore, by Equation (10):
that is, by using Stirling approximation  , and we get:
if, for , we express by means of , that is, , then we get (remember that ):
A discrete version of function can be defined as:
hence the equations above show that, apart from additive and multiplicative constants, , therefore the second law of thermodynamics, asserting the non-decrease of entropy (for closed systems) is equivalent to the law of non- increasing function.
This was the first time that probability become an essential conceptual framework for physical representations. Surely, Maxwell’s anticipations on gas velocity distribution and on the microscopic interpretation of temperature in terms of average velocity  were fundamental to Boltzmann’s speculations, but in his approach the probabilistic setting is intended as an essential key for explaining what otherwise was impossible to explain. The forthcoming physics will be strongly based on probability and almost all the important achievements of quantum physics would have been impossible without an extensive and innovative use of probability  .
If we consider the following informational entropy introduced by Shannon  for a discrete probability distribution (its extension to continuous probability distributions is obtained by replacing sum with integral):
then, it results from equations above that coincides with the thermody- namical entropy apart from a multiplicative constant.
The next Boltzmann’s result was the “H-Theorem”, which he tried to prove since 1872. However, all his attempts of finding a satisfactory proof of this theorem were not completely successful. The reason of these failures is the information-theoretic character of H-Theorem. Therefore, we need to enter in this theory that started in 1948 with the famous Shannon’s paper  “A mathe- matical theory of communication”.
3. Shannon’s Entropy
Shannon’s idea was entirely probabilistic. In fact, we can measure the infor- mation of an event by means of a function of its probability , because the more rare an event is, the more we gain information when it occurs. Moreover, this function has to be additive for events that are independent, that is:
if are the respective probabilities of , this means that the function has to verify:
and the simplest functions satisfying this requirement are the logarithmic functions. Therefore, if we use the logarithmic in base 2, denoted by , as it is customary in Information Theory, then we can define the information of an event with probability as:
With this probabilistic notion of information, given a discrete probability distribution , also called by Shannon an information source (a set where a probabilities are associated to its elements), information entropy ex- presses the probabilistic mean of the information quantities associated to the events , with respect to the probability distribution .
An anecdote reports that when Shannon asked John von Neumann to suggest him a name for the quantity , then von Neumann promptly answered: “Entropy. This is just entropy”, by adding that with this name the success of information theory was quite sure, because only few men knew exactly what entropy was.
Another remarkable fact is the paradoxical nature of entropy, due to its intrinsic probabilistic nature. This paradox is apparent at the beginning of Shannon’s paper  . In fact, Shannon says that he is searching for a measure of the Entropy (information) or uncertainty of events. But how can we consider these notions as equivalent? It would be something like searching for a measure of knowledge or ignorance. Can we reasonable identify a measures for opposite concepts? The answer to these questions can be found in the intrinsic orien- tation of events in time. When an event can happen with a probability , we can measure its information by a function of its a priori uncertainty , but after it happens, we gain, a posteriori, the same amount of information asso- ciated to , because a priory uncertainty was transformed into a gain of cer- tainty. The same kind of opposite orientation is always present in all informa- tional concepts and it is often a source of confusion when the perspective of considering them is not clearly defined  .
4. Boltzmann’s H-Theorem
A concept related to entropy is the entropic divergence between two probability distributions :
The following proposition (see  for a proof) is a basic property of entropic divergence.
Proposition 1. for any two probability distributions.
In an ideal gas the collisions between molecules are elastic, that is, the sum of the kinetic energies of two colliding molecules does not change after collision.
Proposition 2. In an isolated ideal gas the variance of velocity distributions is constant in time.
Proof. If the gas is isolated then its temperature is constant, and as Maxwell proved  its temperature is uniquely related to its average velocity. Therefore, the average kinetic energy is constant in time, and being the gas ideal, its whole kinetic energy is constant, that is, also the second order momentum of velocity distribution is constant. Therefore a fortiori the variance is constant too.
Let denote the normal probability distribution ,
and be the Shannon entropy of a probability distribution (here is the continuous version of Kullback-Leibler divergence). The fol- lowing propositions are proven in  , but we report their proofs for reader’s convenience.
Proof. (  ).
Proposition 4. Normal distributions are those for which entropy reaches the maximum value in the class of probability distributions having the same variance.
Proof. (  ). Let denote any probability distribution of variance .
( by Proposition 1)
Proposition 5. In an isolated ideal gas, for each cartesian component, the velocity distribution, when normalized as a probability distribution, tends, in time, to reach the normal distribution (of a given variance).
Proof. The proposition is a consequence of the central limit theorem  . In fact, if molecules collide randomly, the velocity of each molecule is a random variable. Let us consider the velocity intervals (of radius ). Then, a boolean random vector can be associated to each gas molecule and to any time point taking values in a set of discrete time steps, in such a way that, if at time molecule has velocity in the interval , then and , for every . In this way, the velocity distribution is the the sum . Therefore, a large number of random variables are cumulated in the cartesian components of velocity distribution.
From the previous propositions Boltzmann’s H-theorem follows.
Proposition 6. (H-Theorem) In an isolated ideal gas the function cannot increase in time.
Proof. By Proposition 5, velocities tend to distribute according to a normal law, and they keep the variance constant, according to Proposition 2. But, according to Proposition 4, this distribution is that one having the maximum entropy in the class of all probability distributions with a given variance.
4.1. Pythagorean Recombination Game
A simple experimental validation of the above proposition can be obtained with a simple game, which we call Pythagorean Recombination Game (PRG), based on the following rules  . Chose a “large” multiset of random numbers (numbers can occur in with repetitions) and execute, at each step, the following operations:
1) Randomly choose two numbers in ;
2) Randomly split into and , in such a way that ;
3) Randomly split into and , in such a way that ;
4) Replace in the pair with the pair , .
The idea of PRG is that of representing a two-dimensional gas where any number is a velocity and the rules 2) and 3) define a collision axis and the exchange the velocity component with respect to this axis. If we play this game for a sufficiently number of steps and we compute the H-function at each step, we can easily check that H approaches to a minimum value, and at same time, the distribution of numbers (velocities within some intervals) clearly appro- ximates to a distribution of freedom degree 2. This distribution corresponds to that of a random variable where both follow a normal distribution.
Proposition 7. In Pythagorean Recombination Game H-function does not increase and the distribution of velocity components tend to a normal distri- butions as the game goes on.
Proof. All the hypotheses of Proposition 6 are verified in PRG. In fact, collisions are elastic, because after each “collision” the kinetic energy remain unchanged. Moreover, the system is isolated because no energy is exchanged with the external world, and the numbers of colliding velocities and steps can be assumed to have values for which “large number laws” are acting in the process of velocity recombinations during the game.
The proposition above was experimentally verified by a suitable Matlab program implementing PRG (for suitable numerical parameters). Figure 1 shows the shape of velocity distributions after only 1000 steps. A detailed description of this experiment, with the data and the obtained curves can be found in  .
4.2. Entropy Generalizations and Applications
The schema of H-function definition from Boltzmann’s formula is well known studied and presented in a lot of papers of physical and infor- mational entropy. But there is an hidden aspect that has never properly inve- stigated. The H function that is proportional to Shannon entropy coincides, up to Stirling approximation, with:
Figure 1. Left: Velocities after 1000 steps of PRG (50 collisions per step, 4000 particles, range of initial random velocities 500 - 600, level of approximation 2 ( ). The shape of distribution of two freedom degrees appears. Right: H-function along the steps.
Here we briefly suggest a possible generalization of entropy that could be useful in the analysis of entropy for finite structures, that is something very relevant in the context of Machine Learning approach and computational geno- mics investigations   . First, let us give a proposition about digital repre- sentability.
Proposition 8 (Digital Information Bounds) The total number of digits necessary to encode objects as strings, over an alphabet of digits, verifies the following condition:
Proof. We present a simple geometric proof of this proposition. Assume to place the object we want to encode over digits along a top line at level 0 (encoding with a segment of unitary lengths). Under it, we place shorter lines such that the line of level has the length of the line above it minus . We can depict this by means of the following line arrangement (here lengths are not proportional to the values):
------------------------------------------------ Level 0 ( objects)
----------------------------------------- Level 1 ( objects)
-------------------------------- Level 2 ( objects)
Now, the minimum number of digits that we need for encoding objects is given by the sum of the lengths of all the lines of the arrangement. In fact, first line counts the initial number of digits that we need to assign at least one digit to each of the objects. The second line assigns digits to the objects that need at least two digits to be encoded. Then, iteratively, at the step we add the last number of digits for the objects encoded with strings of maximum length. More formally we have that:
where, for , , .
Therefore, the exact number of digits is given by Equation (21), while the condition stated by the proposition follows from the fact that the number of digits used from level 0 to level is surely less than , while the number of digits used from level 0 to level is surely greater than minus the missing parts (the gap with respect to n) of levels from 1 to , and this missing part provides a sum less than (as it easily
results by evaluating ). Therefore, the
number of digits used from level 0 to level is greater than , so the condition is proved.
The proof of H-Theorem given in the present paper substantially differs from classical proofs of the same type. In particular, Jaynes  developed a reinter- pretation of Statistical Mechanics in terms of information-theoretic concepts. In this analysis key concepts are the principle of maximum-entropy in the probability estimation (MEP) and Shannon’s Asymptotic Equipartition Principle (AEP). In  a proof of H-theorem is given where the informational approach to physical analysis provides (in a restricted case) important consequences about the relationship between Boltzmann and Gibbs approaches to the microscopic representation of thermodynamical states. However, a number of important differences can be individuated between papers   and the analysis of the present paper. First, in both Jaynes’ papers the theoretical setting is the classical continuous framework of physics. Moreover, the proof of H-theorem in  is not purely informational, but rather a physical proof where physical concepts are informationally reinterpreted, and the conclusions do not directly follow from properties of informational entropy. Other main differences are in the focus and in the perspectives, also due to the dates of papers   . In fact, no efficient Montecarlo methods were available at that times, which could give simple numerical evidences of H-function behavior as coherent with Boltzmann intuition (exactly in the opposite direction of  , where Gibbs’ formulation is preferred). Contrarily, the Pythagorean Recombination Game given in Section 4.1 easily shows that the arrow toward the H-decrease is completely informa- tional and corresponds exactly to Boltzmann’s H-function trend. The discrete representation of H-function provides in a very simple way the essence of the theoretical proof and of its numerical validation. Although Boltzmann’s intui- tion was surely based on a discretization of thermodynamical states, starting from Equation (9), the physical tradition prevailed in Boltzmann’s search for physical arguments supporting the H-Theorem. Therefore, Boltzmann’s revo- lution was in the crucial role of statistical distributions. Discreteness appeared as a theoretical necessity at the birth of Mechanical Statistics, but it becomes a key ingredient of physical modeling when Max Plank found the solution to the black body irradiation formula (starting from Boltzmann’s equation)  , where the discrete nature of quanta appears linked to the energy pro- bability distribution. In conclusion, both probability and discreteness have crucial roles in the comprehension of conceptual roots of statistical mechanics and were the conceptual seed of the subsequent development of the whole modern physics. This is the reason for which discrete proofs of H-Theorem provide a more general comprehension of the relationship between physical and informational entropy.
In many applications, where entropy is the empirical entropy computed from observed frequencies, the probability estimation is not the main aspect of the problem   . For example, when we deal with genomic information as obtained by -mer sequences of genomes   other crucial aspects emerge such as: i) efficient methods for computing entropy, which is com- putationally prohibitive when increases, and ii) the use of random genomes for determining anti-random aspects that concern with the biological functions emerging during evolution. In conclusion, the discrete and computational per- spectives are ingredients that are missing in classical discussions about infor- mation and physical entropies. But, exactly these perspectives are urgently re- quested in many contexts of application, where entropic concepts promise to disclose new possibilities of interpretation, especially in biological and medical fields. The next section wants just to give a broad idea of a possible discrete approach to entropic generalizations that, when combined with suitable algo- rithms, could provide useful tools for data analysis and interpretation  .
In  , a biological complexity measure, called biobit, is introduced for any genome , where two entropic components are combined: 1) the entropic component and 2) the anti-entropic component . The com- ponent is the empirical entropy of genomes computed with respect to the word length ( is the length of ), whereas the anti-entropic component is the gap between and , where results to be an upper bound to any empirical entropy of a genome long (corre- sponding to a suitable Kullback-Leibler divergence). It results that along evolu- tion the biobit measures of genomes increase. In a sense, this trend seems to contradict the law of increase for physical entropy. In fact, genomes are gigantic molecules that have to obey to the laws of physics, therefore, their biological complexity shows an evolutionary trend toward an opposite direction. The way of escaping from this paradox is given by the intrinsic symbolic nature of informational entropy of genomes, which transcends the physical limits of molecules. In fact, even if individual genomes are physical objects, the genomes of species are abstract symbolic sequences keeping their identity beyond their physical instances into individual genomes.
From the above Proposition it follows that the average length of strings encoding objects is between and , therefore can be assumed as (a good approximation to) the average amount of digital information that is intrinsically related to a set of objects. Now, if we go back to Equation (19), in the light of proposition above, we discover that Shannon Entropy is proportional to the minimal digital information of the set of permutations of objects into classes that leave unchanged the cardi- nalities of these (disjoint) classes. This phenomenon can be easily generalized in many ways, by providing new simple methods for computing other kinds of entropies over finite structures.
Let us shorty outline a general notion of Structural Entropy directly related to Equation (19). Let us consider a finite structure (of some type) defined over a set , called the support of . Let us consider the class of 1-to-1 functions from to . The subet of is the set of iso-confi- gurations of sending in itself (or in an equivalent structure with respect to a suitable notion of structural equivalence). Therefore, we define the Structural Entropy of (of base ) as the average digital information of the isoconfigurations (see Proposition 8, is set cardinality):
The above definition tells us that Boltzmann’s H-function corresponds to the Structural Entropy of a partition of undistinguishable objects into distinct sets, when is the class all -permutations. Topological entropy  is defined, following a similar idea, as the logarithm of a suitable set cardinality, but according to a limit process making hard its computational evaluation. New notions of structural entropies over finite structures, with efficient algorithms for computing them, are surely of great interest in all the cases where classical information entropy requires to be modified, for coping with the specific needs of many applicative contexts  .
The arrow of time is a consequence of central limit theorem plus the infor- mational law of maximum entropy of normal distributions (deduced from Kulback-Leibler entropic divergence positivity). For this reason, time irreversi- bility emerges when physical systems are complex enough and “large numbers” phenomena can act on them. This complexity, which is typical in living systems, is the origin of the intrinsic direction of biological time. Here a discrete version of H-theorem is proved, which shows the informational mechanisms on which it is based.
The proof of H-Theorem by means of concepts from probability and information theory has an interest going beyond the technical aspects of the proof. Namely, the focus is not on the result in itself (today many demonstrative approaches are known  ), but it is in the discrete and informational approach here outlined, which is mainly aimed at motivating and suggesting investigations toward the interplay between information and energy in their specific me- chanisms of interaction. This ambit remains mostly dense of enigmas, often related to concepts of quantum physics or of complex systems. In contexts where emergence phenomena occur, classical interpretations of the second laws seem to contradict the mechanisms of complexity, evolvability and autonomy related to new paradigms of informational inference and of distributed autonomous agent computations. The genomic laws given in  are an indication toward more comprehensive ways of considering entropic concepts, where the infor- mational perspective seems to provide new keys for the analysis of complex phenomena.
The ahutor wants to express his gratitude to Andreas Holzinger who encouraged him to write the present paper.
 Boltzmann, L. (1877) über die Beziehung zwischen dem Zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung resp. den Sätzen über das Wärmegleichgewicht, Sitzungsber. Kais. Akad. Wiss. Wien Math. Naturwiss. Classe, 76, 373-435.
 Sharp, K. and Matschinsky, F. (2015) Translation of Ludwig Boltzmann’s Paper “On the Relationship between the Second Fundamental Theorem of the Mechanical Theory of Heat and Probability Calculations Regarding the Conditions for Thermal Equilibrium’’. Entropy, 17, 1971-2009. https://doi.org/10.3390/e17041971
 Holzinger, A., Hörtenhuber, M., Mayer, C., Bachler, M., Wassertheurer, S., Pinho, A.J. and Koslicki, D. (2014) On Entropy-Based Data Mining. In: Holzinger, A. and Jurisica, I., Eds., Knowledge Discovery and Data Mining, Springer, Berlin, 209-226.
 Manca, V. (2016) Infogenomics: Genomes as Information Sources. In: Arabnia, H. and Tran, Q., Eds., Emerging Trends in Applications and Infrastructures for Computational Biology, Elsevier, Morgan, 317-324.
 Abhishek, S., Asok, R. and Shalabh, G. (2009) An Information-Theoretic Measure for Anomaly Detection in Complex Dynamical System. Mechanical Systems and Signal Processing, 23, 358-371. https://doi.org/10.1016/j.ymssp.2008.04.007