Uncovering Cicada Wiki

Introduction to statistical analysis in the context of Liber Primus

This article explains how we know Liber Primus does not use some of the most frequently suggested ciphers, the ones that immediately get shot down. It also gives you the data and helps interpret some of it. For more information on the methods outlined here, please see the resources section at the bottom of the page. If you are just here for the data and already know what everything means, suffer and scroll. Your pain brings me joy.

What is frequency analysis?

Relative frequency of letters in the English language
Letter Texts(%) Dictionaries(%)
A 8.2 7.8
B 1.5 2.0
C 2.8 4.0
D 4.3 3.8
E 12.7 11.0
F 2.2 1.4
G 2.0 3.0
H 6.1 2.3
I 7.0 8.6
J 0.15 0.21
K 0.77 0.97
L 4.0 5.3
M 2.4 2.7
N 6.7 7.2
O 7.5 6.1
P 1.9 2.8
Q 0.095 0.19
R 6.0 7.3
S 6.3 8.7
T 9.1 6.7
U 2.8 3.3
V 0.98 1.0
W 2.4 0.91
X 0.15 0.27
Y 2.0 1.6
Z 0.074 0.44

Frequency analysis is one of many techniques used in cryptography to decipher encrypted messages. It relies on the fact that certain letters or combinations of letters occur more frequently in a given language. By analyzing the frequency of letters or patterns in the encrypted text, it is possible to make educated guesses about the corresponding letters or patterns in the original message.

The process involves counting the occurrences of each letter or pattern in the encrypted text and comparing them to the expected frequencies in the language being used. For example, 'e' is the most common letter in English, so if the cipher replaces each letter with one fixed substitute, the most frequent letter in the encrypted text likely corresponds to 'e' in the original message. Letter frequency varies by language. However, since we can assume the plaintext of the unsolved portion of Liber Primus is in English, English is our point of reference from here on out.
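The counting step can be sketched in a few lines of Python. This is only an illustration on Latin letters: the function name `letter_frequencies` and the sample sentence are assumptions for the example, and a real run on Liber Primus would count runes rather than Latin characters.

```python
from collections import Counter

def letter_frequencies(text):
    """Return each letter's share of all letters in text, as a percentage.

    Non-letters (spaces, punctuation) are ignored and case is folded.
    The result is ordered from most to least frequent.
    """
    letters = [ch.upper() for ch in text if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {letter: 100 * n / total for letter, n in counts.most_common()}

# Hypothetical sample text, far too short for real analysis.
sample = "AN INSTRUCTION COMMAND YOUR OWN SELF"
for letter, pct in letter_frequencies(sample).items():
    print(f"{letter}: {pct:.1f}%")
```

The percentages can then be compared against the table above. On short texts the ranking is noisy, so the comparison only becomes meaningful with a few hundred letters or more.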

What is a bigram? Why won't everyone shut up about IOC?

This is how we arrive at bigrams and the Index of Coincidence, abbreviated IOC or IC. A bigram is a pair of two adjacent letters; when both letters are the same, the bigram is also called a doublet. IC measures how likely it is that two letters selected at random from a given text are the same.
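As a rough sketch, IC can be computed from the letter counts alone: for each letter that occurs n times there are n(n-1) ordered ways to pick it twice, out of N(N-1) ways to pick any two positions. The function name `index_of_coincidence` and the optional `alphabet_size` scaling are assumptions for this example; multiplying by the alphabet size is a common convention that makes uniformly random text score about 1.

```python
from collections import Counter

def index_of_coincidence(symbols, alphabet_size=None):
    """Probability that two symbols drawn without replacement are equal.

    If alphabet_size is given, the result is multiplied by it, so that
    uniformly random text scores about 1 (a common normalization).
    """
    n = len(symbols)
    if n < 2:
        raise ValueError("need at least two symbols")
    counts = Counter(symbols)
    ic = sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    if alphabet_size is not None:
        ic *= alphabet_size
    return ic

# Toy input: Latin letters, 26-letter alphabet assumed for the scaling.
text = "ANINSTRUCTIONCOMMANDYOUROWNSELF"
print(index_of_coincidence(text))
print(index_of_coincidence(text, alphabet_size=26))
```

For Liber Primus the input would be the sequence of runes and the alphabet size would be 29.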

You may also see the term 'n-grams' floating around. The concept is the same as bigrams, except that instead of two characters it can refer to any number of characters (trigram, quadgram).

This is a list of character-1-grams:

      A,N,  I,N,S,T,R,U,C,T,IO,N,  C,O,M,M,A,N,D,  Y,O,U,R,  O,W,N,  S,E,L,F

We could also have character bigrams, word-1-grams, word bigrams, etc:

      (A,N),(N,I),(I,N),(N,S),(S,T),(T,R),(R,U),(U,C),(C,T),(T,IO),(IO,N),(N,C),(C,O),(O,M),(M,M),(M,A),(A,N),(N,D),(D,Y),(Y,O),(O,U),(U,R),(R,O),(O,W),(W,N),(N,S),(S,E),(E,L),(L,F) 
      AN,  INSTRUCTION,  COMMAND, YOUR, OWN, SELF
      (AN,INSTRUCTION), (INSTRUCTION,COMMAND), (COMMAND,YOUR), (YOUR,OWN), (OWN,SELF)
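The lists above can be reproduced with a small sliding-window helper. This is a minimal sketch; the name `ngrams` is an assumption for the example. The input is a list of tokens rather than a string, so that a single rune such as IO stays one unit.

```python
def ngrams(items, n):
    """Return all runs of n consecutive items as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Character level: each list element is one rune (IO is a single rune).
runes = ["A", "N", "I", "N", "S", "T", "R", "U", "C", "T", "IO", "N"]
# Word level: each list element is one word.
words = ["AN", "INSTRUCTION", "COMMAND", "YOUR", "OWN", "SELF"]

print(ngrams(runes, 2))   # character bigrams
print(ngrams(words, 1))   # word 1-grams
print(ngrams(words, 2))   # word bigrams
```

Note that the character bigrams here run across word boundaries, as in the list above.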

Why is this important? How do we know my idea won't work?

Because letter occurrence is not random, this helps us identify patterns in text that can tell us which encryption method has been used (or, in this case, which ones haven't). Frequency analysis and IOC can be effective against substitution ciphers such as Caesar and Vigenère, where each letter is replaced by another letter or symbol. They become less effective against more complex methods of encryption, and modern encryption methods are designed to resist being broken in this way. Things also get slightly more complicated when you remember that shifts need to be applied to the runes themselves and not their Latin character equivalents. However, we can still glean some interesting things through their use.

The reason we know Liber Primus is not encrypted using Caesar, Vigenère, Rot13, etc. is the distribution of characters. With Caesar or Rot13, for example, the shape of the distribution would remain the same as in normal English, even though the most frequently occurring letter would be different, as mentioned above. The unsolved pages instead show a nearly flat distribution, with an IC close to that of random text (see the table in the Bigram Distribution section). With Vigenère there is the Kasiski–Kerckhoff Method which, to make a long, boring, mathematical explanation brief, uses the distances between repeated n-grams, like the bigrams mentioned earlier, to determine the length of the key used and from there decrypt the ciphertext.
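To make the key-length idea concrete, here is a toy sketch of the repeated-sequence step on an ordinary 26-letter Vigenère cipher. Everything in it is an assumption for illustration: the helper names, the textbook plaintext and the key "ABCD". It is not the tooling used on Liber Primus, where the alphabet would be the 29 runes.

```python
from collections import Counter, defaultdict
import string

ALPHABET = string.ascii_uppercase

def vigenere_encrypt(plaintext, key):
    """Shift each letter by the matching key letter, repeating the key."""
    out = []
    for i, ch in enumerate(plaintext):
        shift = ALPHABET.index(key[i % len(key)])
        out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
    return "".join(out)

def repeat_distances(text, n=3):
    """Distances between consecutive occurrences of each repeated n-gram."""
    positions = defaultdict(list)
    for i in range(len(text) - n + 1):
        positions[text[i:i + n]].append(i)
    distances = []
    for starts in positions.values():
        distances.extend(b - a for a, b in zip(starts, starts[1:]))
    return distances

def key_length_votes(distances, max_length=20):
    """Count, for each candidate key length, how many distances it divides."""
    votes = Counter()
    for d in distances:
        for k in range(2, max_length + 1):
            if d % k == 0:
                votes[k] += 1
    return votes

ciphertext = vigenere_encrypt("CRYPTOISSHORTFORCRYPTOGRAPHY", "ABCD")
distances = repeat_distances(ciphertext)
print(ciphertext)
print(distances)
print(key_length_votes(distances).most_common())
```

In this toy case the word CRYPTO is encrypted twice at the same key offset, so every repeated trigram sits 16 positions apart, and the true key length 4 is among the divisors. On a text with no short repeating key, the distances share no such common structure.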

Methods of statistical analysis that haven't been done on Liber Primus are few and far between nowadays, but at least once a year someone crawls out of the woodwork with a new one. I will do my best to keep this page as up to date as possible, but do keep in mind the best resource for real-time solving updates is our Discord server. The author of this article would like to thank everyone there for their hard work doing nerd things on Liber Primus so that the author can steal it and make a mid article about it.

Bigram Distribution

Image: Bigram table showing the bigram distribution of Liber Primus as a whole

Analysis of bigrams shows the only deviation from random text: a lower than expected number of doublets. This can be seen in the image as the blue line across the diagonal. Doublets are not completely absent, but they are much rarer than expected. This is likely a direct result of the key or cipher used.

See the table below. The chapters are separated by artwork, as it is commonly believed that the encryption method changes with each section of marginalia.

LP's bigram distribution compared to a random text
Chapter            Pages  Number of runes  IC     Number of doublets  Doublet occurrence rate
Random ciphertext  None   None             1      None                3.45%
Cross              0-2    729              0.988  4                   0.549%
Spirals            3-7    1145             1.004  6                   0.524%
Branches           8-14   1729             0.999  9                   0.520%
Möbius             15-22  1903             1.000  10                  0.525%
Mayfly             23-26  1021             0.993  11                  1.078%
Wing/Tree          27-32  1433             0.991  13                  0.907%
Cuneiform          33-39  1680             0.996  12                  0.714%
Spiral/Branches    40-53  3008             1.001  18                  0.598%
Hollow             54-55  308              0.980  3                   0.977%
Total              0-55   12956            0.999  86                  0.663%

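The doublet occurrence rate of random text over the 29 runes is 1/29 = 3.45%. The IC values in the table are scaled so that random text scores 1; on that scale, the IC of plaintext English is expected to be 1.73.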
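The comparison with random text can be worked through directly. The sketch below counts doublets in any sequence and then uses the totals from the table (12956 runes, 86 doublets) to show how far the observed count falls below the random expectation. The function name `doublet_rate` and the constant names are assumptions for this example.

```python
def doublet_rate(symbols):
    """Return (number of doublets, doublets per adjacent pair)."""
    pairs = len(symbols) - 1
    if pairs < 1:
        return 0, 0.0
    doublets = sum(1 for a, b in zip(symbols, symbols[1:]) if a == b)
    return doublets, doublets / pairs

# Figures taken from the "Total" row of the table above.
ALPHABET_SIZE = 29
TOTAL_RUNES = 12956
OBSERVED_DOUBLETS = 86

expected_rate = 1 / ALPHABET_SIZE                      # about 3.45%
expected_doublets = (TOTAL_RUNES - 1) * expected_rate  # adjacent pairs * rate

print(f"expected rate:     {expected_rate:.2%}")
print(f"expected doublets: {expected_doublets:.1f}")
print(f"observed doublets: {OBSERVED_DOUBLETS}")
```

Random text of this length would be expected to contain roughly 447 doublets, so the 86 observed is about a fifth of that.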

Low doublet counts traditionally point to some form of autoclave or autokey cipher. Alternatively, 3301 may have created their own cipher or combined autokey with another encryption method. The rest of the bigrams are distributed relatively evenly, as seen in the image on the right.

Not only does this show that Liber Primus is not completely random, but it also points directly away from ciphers mentioned in the first section of this article.

Word repeats

Ciphertext in latin: D-J-U-B-E-I | B-M-R-N-M | O-U-N-W-M | O-F-L-OE-ING | I-M-ING-Y-A
Indices: 23-11-1-17-18-10 | 17-19-4-9-19 | 3-1-9-7-19 | 3-0-20-22-21 | 10-19-21-26-24
Starting position: 6555, 12950 | 5448, 12001 | 6985, 8016 | 7393, 12385 | 10671, 12764
Difference between starting positions: 6395 | 6553 | 1031 | 4992 | 2093
Ciphertext in runes: ᛞᛄᚢᛒᛖᛁ | ᛒᛗᚱᚾᛗ | ᚩᚢᚾᚹᛗ | ᚩᚠᛚᛟᛝ | ᛁᛗᛝᚣᚪ
Words containing sequence: ᛒᚠ-ᛞᛄᚢ-ᛒᛖᛁ-ᚫᚠ, ᚳᛠᛁᛗᚳᛉ-ᛞᛄᚢ-ᛒᛖᛁ | ᛒᛗᚱᚾᛗᚻᛗᛁᚾᚪᛞ, ᛗᛁᛄᛒᛗᚱᚾᛗ | ᛠᛈᛄᛞᚾᛟᚩᚢᚾᚹᛗ, ᚩᚢᚾᚹᛗ | ᚩᚠᛚ-ᛟᛝ, ᚩᚷᛗ-ᚠᛚᛟᛝᚦᛠ | ᛗᚠᛝᛉᛞ-ᛗᛝᚣᚪᛝᚠᛉᛁᛟᚷᛚ, ᛏᛝ-ᛗᛝᚣᚪ
Chapters: Wing_Tree, Hollow | Möbius, Spiral_Branches | Wing_Tree, Cuneiform | Wing_Tree, Spiral_Branches | Spiral_Branches, Hollow
Pages: 27 (second row); 55 (last two words) | 22 (row 4, word 2); 48 (penultimate row, first letters) | 28 (third last row, last word); 33 (row 4, third word) | 30 (row 5, words 2 and 3); 52 (row 9, third word) | 43 (last row, word 1 and 2); 54 (row 6 word 5)
Starting position in chapter: 28, 302 | 1845, 2361 | 458, 56 | 866, 2745 | 1031, 116


This is interesting to note as repeated words or phrases are often used in cryptanalysis to determine a repeating key or key length.
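One simple way to use such repeats is to check whether the distances between them share a common factor, since a repeating key of length k tends to produce distances that are multiples of k. The sketch below does this for the starting positions listed in the table. The variable names are assumptions for this example, and five repeats are far too few to settle anything by themselves.

```python
from functools import reduce
from math import gcd

# (first start, second start) for each repeated sequence in the table above.
start_positions = [
    (6555, 12950),
    (5448, 12001),
    (6985, 8016),
    (7393, 12385),
    (10671, 12764),
]

# Distance between the two occurrences of each sequence.
differences = [second - first for first, second in start_positions]

# Greatest common divisor of all distances; a value above 1 would be a
# candidate period for a repeating key.
common_factor = reduce(gcd, differences)

print(differences)
print(common_factor)
```

Here the greatest common divisor comes out as 1, so these five repeats do not by themselves suggest a single short key length shared across the whole text.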

Analysis of n-gram frequencies

This section contains a short analysis of the n-gram frequencies. Here we compare the n-gram frequencies of random text to those of the unsolved pages. "Counted unique n-grams" is the number of distinct n-grams in the text, "number of repeated n-grams" is the number of distinct n-grams that appear at least twice in the text, and "total number of repeated n-grams" is the sum of the occurrence counts of all n-grams that appear more than once.
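The three statistics, and a random baseline to compare them against, could be computed along these lines. This is a sketch under stated assumptions: the function names, the seed and the number of trials are made up for the example, and the random text is drawn uniformly from 29 symbols. The number of trials behind the figures below is not stated in this article.

```python
import random
from collections import Counter
from statistics import mean, stdev

def ngram_stats(symbols, n):
    """Return (unique n-grams, repeated n-grams, total occurrences of repeats)."""
    counts = Counter(
        tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)
    )
    unique = len(counts)
    repeated = sum(1 for c in counts.values() if c >= 2)
    total_repeated = sum(c for c in counts.values() if c >= 2)
    return unique, repeated, total_repeated

def random_baseline(length, n, alphabet_size=29, trials=100, seed=0):
    """Mean and standard deviation of each statistic over random texts.

    trials must be at least 2 so that a standard deviation exists.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        text = [rng.randrange(alphabet_size) for _ in range(length)]
        samples.append(ngram_stats(text, n))
    columns = list(zip(*samples))
    return [(mean(col), stdev(col)) for col in columns]

# Small demonstration; for the real comparison use length=12956.
print(ngram_stats(list("ABABAC"), 2))
print(random_baseline(length=500, n=2, trials=10))
```

A statistic from the unsolved pages that sits several standard deviations away from the random mean would be worth a closer look.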

---------------------------------------------

LP text:

Counted unique bigrams: 840

Number of repeated bigrams: 837

Total number of repeated bigrams: 12952

Random text:

Mean counted unique bigrams: 840.9997

Std counted unique bigrams: 0.01731790980459247

Mean number of repeated bigrams: 840.9963

Std number of repeated bigrams: 0.060714989911882546

Mean total number of repeated bigrams: 12954.9966

Std total number of repeated bigrams: 0.0582103083654433

---------------------------------------------

LP text:

Counted unique trigrams: 9945

Number of repeated trigrams: 2508

Total number of repeated trigrams: 5517.0

Random text:

Mean counted unique trigrams: 10050.1294

Std counted unique trigrams: 38.91237406841171

Mean number of repeated trigrams: 2433.5345

Std number of repeated trigrams: 31.238108293397026

Mean total number of repeated trigrams: 5337.4051

Std total number of repeated trigrams: 67.1163422274337

---------------------------------------------

LP text:

Counted unique quadgrams: 12825

Number of repeated quadgrams: 127

Total number of repeated quadgrams: 255

Random text:

Mean counted unique quadgrams: 12835.0622

Std counted unique quadgrams: 11.02112204632541

Mean number of repeated quadgrams: 117.2225

Std number of repeated quadgrams: 10.925364696430046

Mean total number of repeated quadgrams: 235.1603

Std total number of repeated quadgrams: 21.93048116002018

---------------------------------------------

Resources