forked from grantrostig/cpp_by_example
-
Notifications
You must be signed in to change notification settings - Fork 3
Expand file tree
/
Copy pathmain.cpp
More file actions
177 lines (140 loc) · 8.54 KB
/
main.cpp
File metadata and controls
177 lines (140 loc) · 8.54 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
/* Copyright Grant Rostig, License Boost 1.0
Bad name, should be "permutations" (not "combinations"), also the selection is considered to be "with replacement"
*/
#include <cassert>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <unordered_map>
constexpr int STRING_LEN{49};
auto s(auto i) {return i/2.09;} // s: Scale down the value by some factor with goal of having all requencies sum to 100% or probability of 1, or thereabouts.
std::unordered_map<char, double> char_occurance_freqs = { // First 26 values are the letter frequencies in English (in percentage), but we added upper case and more, so number must be scaled down with s().
{'a', s(8.167)}, {'b', s(1.492)}, {'c', s(2.782)}, {'d', s(4.253)}, {'e', s(12.702)},
{'f', s(2.228)}, {'g', s(2.015)}, {'h', s(6.094)}, {'i', s(6.966)}, {'j', s(0.153)},
{'k', s(0.772)}, {'l', s(4.025)}, {'m', s(2.406)}, {'n', s(6.749)}, {'o', s(7.507)},
{'p', s(1.929)}, {'q', s(0.095)}, {'r', s(5.987)}, {'s', s(6.327)}, {'t', s(9.056)},
{'u', s(2.758)}, {'v', s(0.978)}, {'w', s(2.360)}, {'x', s(0.150)}, {'y', s(1.974)},
{'z', s(0.074)},
{'A', s(8.167)}, {'B', s(1.492)}, {'C', s(2.782)}, {'D', s(4.253)}, {'E', s(12.702)},
{'F', s(2.228)}, {'G', s(2.015)}, {'H', s(6.094)}, {'I', s(6.966)}, {'J', s(0.153)},
{'K', s(0.772)}, {'L', s(4.025)}, {'M', s(2.406)}, {'N', s(6.749)}, {'O', s(7.507)},
{'P', s(1.929)}, {'Q', s(0.095)}, {'R', s(5.987)}, {'S', s(6.327)}, {'T', s(9.056)},
{'U', s(2.758)}, {'V', s(0.978)}, {'W', s(2.360)}, {'X', s(0.150)}, {'Y', s(1.974)},
{'Z', s(0.074)},
{'2', s(0.999)}, {'3', s(0.999)}, {'4', s(0.999)}, {'5', s(0.999)}, {'6', s(0.999)},
{'7', s(0.999)},
};
int main() {
// std::unordered_map<char, double> const letterProbabilities;
double character_set_radix{static_cast<double>(ssize(char_occurance_freqs))};
double log_x{log2(character_set_radix)};
double possible_permutations_with_replacement = pow(character_set_radix, STRING_LEN); // Total possible combinations without considering probabilities
double total_frequency{0.0};
for (auto const & pair : char_occurance_freqs) {
total_frequency += pair.second;
}
double entropy{0.0}; // Shannon entropy to be calculated, apparently only an estimate.
for (const auto& pair : char_occurance_freqs) {
double probability = pair.second / total_frequency;
// letterProbabilities[pair.first] = probability;
assert(probability > 0.0);
entropy -= probability * log2(probability);
}
double total_entropy = STRING_LEN * entropy;
double effective_permutations = pow(2, total_entropy);
std::cout << "Radix or number of chars in set: " << character_set_radix << std::endl;
std::cout << "Total Frequency of chars: " << total_frequency << ". Should 100." << std::endl;
std::cout << "Bits needed for radix: " << log_x << std::endl;
std::cout << "Entropy per char in random English text frequency on average:" << entropy << " bits" << std::endl;
std::cout << "Total entropy for char permutations of length "<< STRING_LEN <<": "<< total_entropy << " bits" << std::endl;
std::cout << "Effective number of char permutations: " << std::scientific << std::setprecision(8) << effective_permutations << std::endl;
std::cout << "Total possible permutations " << character_set_radix <<"^"<<STRING_LEN<<" is: " << possible_permutations_with_replacement << std::endl;
std::cout << "Total possible permutations 256 bits: 2^256: " << std::pow(2,256) << std::endl;
std::cout << "\nWARNING, my math could be completely wrong/naive. Feedback appreciated." << std::endl;
std::cout << "###" << std::endl;
return 0;
}
/* AI output: Understanding the Problem https://copilot.microsoft.com/chats/XAKYk1ptFdGrKmz2Ghf3H
Prompt by grostig: Write a c++ program that calculates number of possible values created by concatenating 4 alphabetic characters
where the probability of each character is that found in common english words.
Answer: We aim to calculate the number of possible values created by concatenating 4 alphabetic characters, where each character's
probability corresponds to its frequency in common English words. This isn't just a straightforward combinatorial calculation due to
the varying probabilities of each letter.
Frequencies of letters in English. These frequencies determine the likelihood of each letter appearing:
Letter Frequency (%)
A 8.167
B 1.492
C 2.782
D 4.253
E 12.702
F 2.228
G 2.015
H 6.094
I 6.966
J 0.153
K 0.772
L 4.025
M 2.406
N 6.749
O 7.507
P 1.929
Q 0.095
R 5.987
S 6.327
T 9.056
U 2.758
V 0.978
W 2.360
X 0.150
Y 1.974
Z 0.074
These frequencies can be converted into probabilities by normalizing them so that their sum equals 1.
Approach to the Problem: While there are 26^4 = 456,976
possible combinations of 4-letter strings using the English alphabet, the effective number of combinations considering the letter
probabilities is different. Some combinations are exceedingly rare due to the low probabilities of certain letters.
+Calculate the total number of unique 4-letter combinations, considering the probabilities.
+Assess the effective number of combinations, taking into account the frequency distribution.
+Implement a C++ program that performs these calculations.
The concept of the effective number of combinations can be approached using entropy.
Shannon entropy provides a measure of the uncertainty or surprise associated with random variables—in this case, the letters.
The formula for entropy 𝐻 is:
𝐻 = − ∑ (𝑝 of 𝑖) log(2𝑝 of 𝑖) Where 𝑝 of 𝑖 is the probability of letter 𝑖.
For a sequence of 4 letters, the total entropy:
𝐻 of 𝑡𝑜𝑡𝑎𝑙 = 4 * (𝐻 of letters)
The effective number of combinations:
𝑁 effective = 2^(𝐻 of 𝑡𝑜𝑡𝑎𝑙)
Let's proceed to write a C++ program that calculates:
The entropy of the letter distribution.
The total entropy for 4-letter combinations.
The effective number of combinations.
see above fixed/supplemented/editied by Grant Rostig.
Interpretation of Results
Entropy per letter (~4.176 bits): On average, each letter contains about 4.176 bits of information.
Total entropy (~16.704 bits): A 4-letter combination contains about 16.704 bits of information.
Effective number of combinations (~98,109): This is the number of combinations that effectively carry unique information, considering
the letter probabilities.
Total possible combinations (456,976): This is the maximum number of combinations without considering letter frequencies.
This means that due to the uneven letter distribution in English, the effective number of combinations is significantly less than the
total possible combinations.
Visualizing the Concept
Imagine a forest where each tree represents a 4-letter combination. In a uniformly distributed forest, all trees are equally likely,
spread out evenly across the landscape. However, in our English letter frequency forest, certain areas (combinations with common
letters like 'E' and 'T') are densely packed, while others (combinations with rare letters like 'Q' and 'Z') are sparse.
This uneven distribution reduces the effective diversity of our forest, similar to how the effective number of combinations is less
than the total possible combinations.
Further Exploration
Most Probable Combinations: We can modify the program to list the most probable 4-letter combinations. This could be useful in
cryptography or linguistics.
Random Word Generation: Using the probabilities, we can simulate the generation of random 4-letter words that mimic English word
patterns.
Impact on Data Compression: Understanding letter frequencies is crucial in compression algorithms like Huffman coding, where common
letters are assigned shorter codes.
Implications and Reflections
Delving into letter probabilities isn't just an academic exercise—it has real-world applications in:
Cryptography: Frequency analysis can help decode encrypted messages by exploiting the commonality of certain letters.
Linguistics: Analyzing letter and word frequencies aids in understanding language patterns and evolution.
Information Theory: Concepts like entropy are foundational in measuring information content and designing efficient communication
systems.
Isn't it fascinating how the subtle nuances of letter frequencies can unveil such depth in our understanding of language and
information? This exploration scratches the surface of the intricate dance between probability, language, and computation.
*/