Half in Jest - Weblog

quantification of bitstring randomness

an uncomplicated method for quantifying bitstring randomness based on the repetition a bitstring contains

20180112

The maximum number of transitions from 0 to 1 or from 1 to 0 within a bitstring of a given length, n, equals n minus 1. There is also an ideal number of transitions, which is the maximum divided by 2. This would be expressed as follows:

T_max = n - 1, T_ideal=T_max / 2

A bitstring with exactly T_max transitions would be an alternating 0-1-0-[..] or 1-0-1-[..] bitstring, with an absolutely predictable order. A bitstring with zero transitions, i.e., 0-0-0-[..] or 1-1-1-[..], will also have absolute predictability.
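As a quick check of these bounds, here is a minimal Python sketch. The helper name count_transitions and the choice to represent a bitstring as a string of '0' and '1' characters are assumptions made for illustration only.

```python
def count_transitions(bits: str) -> int:
    """Count adjacent pairs of bits whose values differ."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


n = 8
t_max = n - 1        # most transitions a bitstring of length n can have
t_ideal = t_max / 2  # half of the maximum

print(count_transitions("01010101") == t_max)  # alternating bits
print(count_transitions("00000000") == 0)      # constant bits
```

Both comparisons should print True: the alternating bitstring reaches T_max and the constant one has no transitions.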

Bitstrings with exactly T_ideal transitions will be the most unpredictable by this measure, because each pair of adjacent bits is then equally likely to contain a transition or not.

In order to create an expression that will return the randomness of a bitstring, by these terms, the ratio of the actual number of transitions to the ideal will be taken into account.

This ratio lies at some distance from 1, which shows how far the bitstring is from ideal. The distance is squared, both to accentuate predictability at larger distances from 1 and to make the result symmetric about 1. The squared value is subtracted from 2, and the base-2 logarithm of the result is taken. The argument of the logarithm therefore ranges from 1 to 2: the output approaches 0 as the argument approaches 1 and reaches 1 when the argument equals 2, so a higher value indicates higher randomness. The full expression is as follows:

R = log_2(2 - (1 - T / T_ideal) ^ 2)
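The expression translates almost directly into code. The following is a sketch rather than the program referred to in this post; the function name randomness_score and its two-argument form are assumptions.

```python
import math


def randomness_score(transitions: int, n: int) -> float:
    """Score T transitions in a bitstring of length n with the expression above."""
    if n < 2:
        raise ValueError("a bitstring needs at least two bits to have transitions")
    t_ideal = (n - 1) / 2
    return math.log2(2 - (1 - transitions / t_ideal) ** 2)


# For n = 17, T_max is 16 and T_ideal is 8.
print(randomness_score(8, 17))   # ideal number of transitions
print(randomness_score(0, 17))   # no transitions
print(randomness_score(16, 17))  # maximum number of transitions
```

The first call should give 1 and the other two should give 0, which are the two extremes of the score.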

In order to grasp what shape this expression produces over the proportion of transitions in a sufficiently large bitstring, the following function is constructed:

$f(x) = log_2(2 - (1 - x / T_ideal) ^ 2), {x in Z : x in [0, T_max]}$

When this function is plotted over the range of possible transitions for a given bitstring, the following curve arises, with values of 0 given for bitstrings with the least ideal proportions of transitions and a value of 1 given for the ideal proportion of transitions, 0.5:

[Plot: f(x) against the number of transitions x, from 0 to T_max]
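The shape of the curve can also be seen by tabulating f(x) directly. This sketch assumes an example length of n = 17, chosen only so that T_ideal is a whole number, and prints a rough text bar for each score.

```python
import math


def f(x: int, t_ideal: float) -> float:
    """The score for x transitions, as defined above."""
    return math.log2(2 - (1 - x / t_ideal) ** 2)


n = 17                 # example length (an assumption for this sketch)
t_max = n - 1
t_ideal = t_max / 2

scores = [f(x, t_ideal) for x in range(t_max + 1)]
for x, score in enumerate(scores):
    bar = "#" * round(score * 40)
    # Columns: transitions, proportion of T_max, score, text bar
    print(f"{x:2d}  {x / t_max:.2f}  {score:.2f}  {bar}")
```

The bars grow from nothing at x = 0 to full length at the proportion 0.5 and shrink back symmetrically toward x = T_max.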

However, when this expression is used alone, clearly predictable bitstrings can easily be constructed that receive a score suggesting very low predictability. For example:

1100110011001100

This bitstring would be given the score of 0.99, a very high score for a bitstring with perfect predictability. The fact that almost a perfect number of transitions exist in this bitstring causes it to be judged as highly unpredictable, although these transitions are placed very predictably from each other.

In order to create a program that would justly judge this as the predictable bitstring that it is, another approach must be taken. I have proposed a way to evaluate a bitstring such that every consistent way of reorganizing it is examined, and the lowest score achieved is assigned to the bitstring as a more correct score.

The bitstring is broken up into segments of a given length, and its bits are placed into a new bitstring in the order in which they appear within the segments. Segment lengths from 1 up to the length of the bitstring could be tested. However, it is unnecessary to test segment lengths beyond half the length of the bitstring, because these rearrangements are the bitstring itself. For example, the bitstring above would be tested in segments of all lengths ranging from 1 to 8.

This bitstring will be separated into segments of a given length (e.g., 2). The first bit of each segment will be taken, and then the second bit of each segment will be taken. The resulting concatenated bitstring is evaluated. Following the 2-length example, the process is outlined in the following image:

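[Diagram: the bitstring split into segments of length 2, with the first bit of each segment collected, followed by the second bit of each segment]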

In this example, the rearranged bitstring is given a score of 0.00, because this consistent arrangement contains the maximum possible number of transitions.
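The rearrangement step can be sketched in a few lines of Python. All names here are illustrative. The handling of leftover bits is an assumption: bits that do not fill a whole segment are left in place at the end, which is consistent with the remark that segment lengths above half the bitstring length give back the bitstring itself.

```python
import math


def count_transitions(bits: str) -> int:
    """Count adjacent pairs of bits whose values differ."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


def randomness_score(transitions: int, n: int) -> float:
    """Score T transitions in a bitstring of length n."""
    t_ideal = (n - 1) / 2
    return math.log2(2 - (1 - transitions / t_ideal) ** 2)


def rearrange(bits: str, k: int) -> str:
    """Take the 1st bit of every length-k segment, then the 2nd, and so on.

    Assumption: trailing bits that do not fill a whole segment stay at the end.
    """
    full = (len(bits) // k) * k
    body, tail = bits[:full], bits[full:]
    return "".join(body[j::k] for j in range(k)) + tail


original = "1100110011001100"
shuffled = rearrange(original, 2)
before = randomness_score(count_transitions(original), len(original))
after = randomness_score(count_transitions(shuffled), len(shuffled))
print(shuffled)
print(f"{before:.4f} -> {after:.4f}")
```

With a segment length of 2, the example bitstring becomes a strictly alternating one, so its score falls from just under 1 to 0.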

The same type of evaluation can be applied to a longer bitstring, such as the binary representation of the ASCII string "This is an encoded ASCII string.":

0101010001101000011010010111001100100000011010010111001100100000011000010110111000100000011001010110111001100011011011110110010001100101011001000010000001000001010100110100001101001001010010010010000001110011011101000111001001101001011011100110011100101110

If evaluated without rearrangement, this bitstring would be given a score of 0.99. However, the consistency in ASCII character binary representations allows for a less random bitstring to be created when bits are taken in order from segments of length 8:

0000000000000000000000000000000011110110110111111101111101111110011111111111111111100000111111111001001000000000000010000111000001100100010010100000001100001101100000000101101111000000001001110001001001001110000011000101011100110110100101101001111101001010

Because of the lower randomness of this bitstring, a lower score of 0.92 is given. Scores are computed for the rearrangement at every tested segment length (the segment length is also known as the rearrangement parameter), and the lowest of these scores is assigned to the bitstring.
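Putting the pieces together, the search for the lowest score might look like the sketch below. The function names are hypothetical, the search range of 1 to half the bitstring length follows the text above, and leftover bits are assumed to stay in place at the end of the rearranged bitstring.

```python
import math


def count_transitions(bits: str) -> int:
    """Count adjacent pairs of bits whose values differ."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


def randomness_score(transitions: int, n: int) -> float:
    """Score T transitions in a bitstring of length n."""
    t_ideal = (n - 1) / 2
    return math.log2(2 - (1 - transitions / t_ideal) ** 2)


def rearrange(bits: str, k: int) -> str:
    """Interleave full length-k segments; leftover bits stay at the end (assumption)."""
    full = (len(bits) // k) * k
    return "".join(bits[:full][j::k] for j in range(k)) + bits[full:]


def min_score(bits: str) -> tuple[float, int]:
    """Return the lowest score and the rearrangement parameter that produced it."""
    n = len(bits)
    return min(
        (randomness_score(count_transitions(rearrange(bits, k)), n), k)
        for k in range(1, n // 2 + 1)
    )


text = "This is an encoded ASCII string."
bits = "".join(format(byte, "08b") for byte in text.encode("ascii"))
by_bytes = rearrange(bits, 8)

print(by_bytes[:64])                  # leading bits of every byte come first
print(min_score("1100110011001100"))  # short example from earlier
print(min_score(bits))                # lowest score for the ASCII example
```

Returning the parameter alongside the score makes it easy to see which segment length exposed the regularity, for instance 2 for the short example.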

If we define a bitstring as an n-tuple, for example:

bitstring = (0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0)

A set of all indices within this n-tuple, which is also the set of rearrangement parameters, is defined:

S = {i in Z : 1 <= i <= n}

Then the bitstring can be rearranged by the process described above, expressed as a mathematical formula. The actual rearrangement of the bitstring, given in the form of a piecewise function, is as follows:

I found a hack that allows the index formula in the first piece to be somewhat simplified (note the change in conditionals):

In order to calculate the number of transitions in each rearrangement, the following function is used:

Here, I have used a summation of the negated equality comparison of each adjacent pair within the bitstring to give the number of transitions within the bitstring. A randomness score is then taken for each rearrangement, and the minimum value generated from rearrangement for any i in S is calculated:
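One plausible way to write these two steps is given below. It is reconstructed from the description above, not copied from the original figures. The notation is an assumption: b_j is the j-th bit of a bitstring b of length n, the square bracket equals 1 when the condition inside it holds and 0 otherwise, and b^{(i)} denotes the bitstring rearranged with parameter i.

```latex
T(b) = \sum_{j=1}^{n-1} \left[\, b_j \neq b_{j+1} \,\right]

R_{\min} = \min_{i \in S} \, \log_2\!\left( 2 - \left( 1 - \frac{T\!\left(b^{(i)}\right)}{T_{ideal}} \right)^{2} \right)
```

The first line counts the adjacent pairs that differ, and the second applies the score from earlier to every rearrangement and keeps the smallest result.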

The future applications of this development are currently unclear, but I contend that the scores these bitstrings are given by this program are correlated to the compressibility of the bitstring, where lower scores correlate to higher compressibility.

I am sure that a program could be constructed which would more accurately represent the randomness of a given bitstring. However, I have not yet found a more holistic method. This is something I hope to investigate further.

I have created a much faster C implementation of this program for those who find this useful. It performs exactly the same function but much more quickly, and it is more portable.