353 points by brig90 21 hours ago | 113 comments
The Voynich Manuscript is a 15th-century book written in an unknown script. No one’s been able to translate it, and many think it’s a hoax, a cipher, or a constructed language. I wasn’t trying to decode it — I just wanted to see: does it behave like a structured language?
I stripped a handful of common suffix-like endings (aiin, dy, etc.) to isolate what looked like root forms. I know that’s a strong assumption — I call it out directly in the repo — but it helped clarify the clustering. From there, I used SBERT embeddings and KMeans to group similar roots, inferred POS-like roles based on position and frequency, and built a Markov transition matrix to visualize cluster-to-cluster flow.
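The pipeline described here can be sketched in a few lines. This is a minimal stdlib sketch, not the repo's actual code: the suffix list is only the three endings named in the post, the EVA-style tokens are illustrative, and the SBERT/KMeans steps are indicated in comments rather than implemented.

```python
from collections import Counter, defaultdict

# Suffix-like endings named in the post; the full list in the repo may differ.
SUFFIXES = ("aiin", "dy", "chy")

def strip_suffix(word: str) -> str:
    """Strip one trailing suffix-like ending to approximate a root form."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf):
            return word[: -len(suf)]
    return word

def transition_counts(cluster_seq):
    """Count cluster-to-cluster transitions for a Markov-style matrix."""
    counts = defaultdict(Counter)
    for a, b in zip(cluster_seq, cluster_seq[1:]):
        counts[a][b] += 1
    return counts

# Toy EVA-style tokens (illustrative, not an actual manuscript line):
words = ["qokaiin", "chedy", "shol", "qokaiin", "otchy"]
roots = [strip_suffix(w) for w in words]
# roots -> ['qok', 'che', 'shol', 'qok', 'ot']
# In the actual pipeline, roots would next be embedded with SBERT and
# grouped with KMeans; the cluster labels then feed transition_counts().
```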
It’s not translation. It’s not decryption. It’s structural modeling — and it revealed some surprisingly consistent syntax across the manuscript, especially when broken out by section (Botanical, Biological, etc.).
GitHub repo: https://github.com/brianmg/voynich-nlp-analysis
Write-up: https://brig90.substack.com/p/modeling-the-voynich-manuscrip...
I’m new to the NLP space, so I’m sure there are things I got wrong — but I’d love feedback from people who’ve worked with structured language modeling or weird edge cases like this.
cedws 2 minutes ago
My completely amateur take is that it's an elaborate piece of art or hoax.
patcon 19 hours ago
I've been working on a project related to a sensemaking tool called Pol.is [1], reprojecting its wiki survey data with these new algorithms instead of PCA, and it's amazing what new insight that uncovers!
https://patcon.github.io/polislike-opinion-map-painting/
Painted groups: https://t.co/734qNlMdeh
(Sorry, only really works on desktop)
[1]: https://www.technologyreview.com/2025/04/15/1115125/a-small-...
brig90 19 hours ago
khafra 3 hours ago
loxias 10 hours ago
This ain't your parents' "factor analysis".
staticautomatic 18 hours ago
patcon 15 hours ago
DonaldFisk 52 minutes ago
I'm not familiar with SBERT, or with modern statistical NLP in general, but SBERT works on sentences, and there are no obvious sentence delimiters in the Voynich Manuscript (only word and paragraph delimiters). One concern I have is "Strips common suffixes from Voynich words". Words in the Voynich Manuscript appear to be prefix + suffix, so as prefixes are quite short, you've lost roughly half the information before commencing your analysis.
You might want to verify that your method works for meaningful text in a natural language, and also for meaningless gibberish (encrypted text is somewhere in between, with simpler encryption methods closer to natural language and more complex ones to meaningless gibberish). Gordon Rugg, Torsten Timm, and I have produced text which closely resembles the Voynich Manuscript by different methods. Mine is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.h... and the equivalent EVA is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.t...
minimaxir 21 hours ago
The traditional NLP techniques of stripping suffixes and POS identification may actually harm embedding quality rather than improve it, since they remove relevant contextual data from the global embedding.
brig90 21 hours ago
Appreciate you calling that out — that’s a great push toward iteration.
thih9 18 hours ago
Does it make sense to check the process with a control group?
E.g. if we ask a human to write something that resembles a language but isn’t, then conduct this process (remove suffixes, attempt grouping, etc), are we likely to get similar results?
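The control-group idea sketched above can be tested cheaply before involving a human: generate structured gibberish, then run the same pipeline on it and on the real transliteration and compare the statistics. A minimal sketch, assuming nothing about the repo's code; the syllable inventory and weights are invented for illustration:

```python
import random
from collections import Counter

def fake_language(n_words: int, seed: int = 0):
    """Human-like gibberish: words built from a small syllable inventory,
    sampled with a skewed (non-uniform) distribution to mimic the way a
    person making up words tends to reuse favorite chunks."""
    rng = random.Random(seed)
    syllables = ["qo", "ke", "dy", "chol", "shed", "aiin", "ot", "ar"]
    weights = [8, 5, 5, 3, 2, 2, 1, 1]  # invented skew, not from the repo
    words = []
    for _ in range(n_words):
        k = rng.randint(1, 3)  # 1-3 syllables per word
        words.append("".join(rng.choices(syllables, weights=weights, k=k)))
    return words

def type_token_ratio(words):
    """One simple statistic to compare control vs. real text."""
    return len(set(words)) / len(words)

control = fake_language(2000)
# Run the same pipeline (suffix stripping, clustering, transitions) on
# `control` and compare its statistics against the real transliteration.
print(round(type_token_ratio(control), 3))
```

If the gibberish control produces clusters and transition structure that look just as "language-like", that would weaken the paper's structural findings; if it doesn't, that's evidence the manuscript's structure is non-trivial.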
flir 11 hours ago
awinter-py 14 hours ago
tetris11 21 hours ago
Reference mapping each cluster to all the others would be a nice way to indicate that there's no variability left in your analysis
brig90 21 hours ago
And yes to the cross-cluster reference idea — I didn’t build a similarity matrix between clusters, but now that you’ve said it, it feels like an obvious next step to test how much signal is really being captured.
Might spin those up as a follow-up. Appreciate the thoughtful nudge.
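The cross-cluster similarity matrix mentioned here is straightforward to build from cluster centroids. A stdlib-only sketch (the 2-D vectors are toy stand-ins for SBERT embeddings, which are ~384-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_similarity_matrix(clusters):
    """clusters: dict of cluster_id -> list of embedding vectors.
    Returns pairwise cosine similarity between cluster centroids."""
    ids = sorted(clusters)
    cents = {i: centroid(clusters[i]) for i in ids}
    return {(a, b): cosine(cents[a], cents[b]) for a in ids for b in ids}

# Toy 2-D "embeddings" standing in for SBERT vectors:
sim = cluster_similarity_matrix({
    0: [[1.0, 0.1], [0.9, 0.0]],
    1: [[0.0, 1.0], [0.1, 0.9]],
})
# A near-1 diagonal with low off-diagonal values means the clusters are
# well separated; high off-diagonal values would suggest KMeans split
# what is really one group.
```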
lukeinator42 20 hours ago
tetris11 19 hours ago
jszymborski 21 hours ago
(Before I get yelled out, this isn't prescriptive, it's a personal preference.)
minimaxir 18 hours ago
jszymborski 17 hours ago
I'd add that just because you can achieve separability with a method, the resulting visualization may not be super informative. The distances between clusters in t-SNE-projected space often have nothing to do with their distances in latent space, for example. So while you get nice separate clusters, it comes at the cost of the projected space greatly distorting/hiding the relationships between points across clusters.
tomrod 20 hours ago
empath75 19 minutes ago
Avicebron 21 hours ago
brig90 21 hours ago
I didn’t re-map anything back to glyphs in this project — everything’s built off those EVA transliterations as a starting point. So if "okeeodair" exists in the dataset, that’s because someone much smarter than me saw a sequence of glyphs and agreed to call it that.
us-merul 21 hours ago
The author made an assumption that Voynichese is a Germanic language, and it looks like he was able to make some progress with it.
I’ve also come across accounts that it might be an Uralic or Finno-Ugric language. I think your approach is great, and I wonder if tweaking it for specific language families could go even further.
veqq 20 hours ago
us-merul 20 hours ago
philistine 11 hours ago
It's not a mental issue, it's just a rare thing that happens. Voynich fits the whole bill for the work of a naive artist.
cronopios 6 hours ago
DonaldFisk 1 hour ago
It also applies to a range of natural phenomena, e.g. lunar craters and earthquakes: https://www.cs.cornell.edu/courses/cs6241/2019sp/readings/Ne...
So the fact that word frequencies in the Voynich Manuscript follow Zipf's law doesn't prove it's written in a natural language.
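The Zipf check itself is easy to reproduce: fit a line to log(frequency) vs. log(rank) and see whether the slope is near -1. A stdlib sketch with a synthetic corpus (the corpus is constructed to be Zipfian, purely to demonstrate the fit):

```python
import math
from collections import Counter

def zipf_slope(words):
    """Least-squares slope of log(frequency) vs log(rank).
    Natural languages typically give a slope near -1, but as noted
    above, so do many non-linguistic phenomena."""
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic corpus with Zipfian counts: word number r appears 1000//r times.
corpus = [f"w{r}" for r in range(1, 51) for _ in range(1000 // r)]
print(round(zipf_slope(corpus), 2))  # close to -1
```

Which is exactly the point above: a slope near -1 is consistent with natural language but doesn't prove it.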
poulpy123 1 hour ago
riffraff 8 hours ago
Not a recent hoax/scam, but an ancient one.
It's not like there weren't a ton of fake documents in the Middle Ages and Renaissance, from the Donation of Constantine to Prester John's letter.
GolfPopper 14 hours ago
renhanxue 12 hours ago
Edward Kelly was born over a hundred years later, so him "being at the right time" seems to be a bit of a stretch.
emmelaich 12 hours ago
Which is worse actually. Kelly may have semi-erased an existing valuable manuscript.
renhanxue 12 hours ago
emmelaich 11 hours ago
[0] https://manuscriptroadtrip.wordpress.com/2024/09/08/multispe...
codesnik 17 hours ago
Unless the author had written tens of books exactly like that before, which didn't survive, of course.
I don't think it's a very novel idea, but I wonder if there's been analysis of patterns like that. I haven't seen page-to-page consistency mentioned anywhere.
veqq 16 hours ago
A lot of work's been done here. There are believed to have been 2 scribes (see Prescott Currier), although Lisa Fagin Davis posits 5. Here's a discussion of an experiment working off of Fagin Davis' position: https://www.voynich.ninja/thread-3783.html
pawanjswal 11 hours ago
brig90 11 hours ago
That second part wasn’t super important though — this was more about learning and experimenting than trying to break new ground. Really appreciate the kind words, and hopefully it sparks someone to take it even further.
frozenseven 10 hours ago
bdbenton5255 12 hours ago
brig90 12 hours ago
Appreciate the nudge — always fascinating to see where people take this kind of thinking.
PaulDavisThe1st 12 hours ago
psychoslave 4 hours ago
On the other hand, it's a bit wild to build a whole city next to volcanos that are definitely going to wake up in less than a few centuries, to begin with.
poulpy123 1 hour ago
mach5 2 hours ago
Tade0 38 minutes ago
user32489318 18 hours ago
gthompson512 11 hours ago
brig90 11 hours ago
Clustering by sentence or page would be interesting too — I haven't gone that far yet, but it’d be fascinating to see if there’s consistency across visual/media sections. Appreciate the insight!
bpiroman 14 hours ago
bpiroman 14 hours ago
Nursie 12 hours ago
There's also a very long thread about it here - https://www.voynich.ninja/thread-2318.html - that seems to go from "that's really interesting, let's find out more about it" to "eh, seems about the same as other revelatory announcements about Romance, Hebrew etc"
marcodiego 17 hours ago
mellow_observer 6 hours ago
Peculiarities in the Voynich also suggest that one-to-one word mappings are very unlikely to yield a well-described language. For instance, there are cases of repeated word sequences you don't really see in regular text. There's a lack of the extremely common words you would expect to be necessary for a word-based structured grammar, there are signs of at least two 'languages', character distributions within words don't match any known language, etc.
If there still is a real unencoded language in here, it's likely to be entirely different from any known language.
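The repeated-word-sequence peculiarity mentioned above is easy to scan for in any transliteration. A small sketch (the tokens in the example are illustrative EVA-style words, not a real manuscript line):

```python
def repeated_runs(words, min_len=2):
    """Find runs of the same word repeated back-to-back, a pattern that is
    rare in natural-language text but reported in Voynich transliterations."""
    runs, i = [], 0
    while i < len(words):
        j = i
        while j + 1 < len(words) and words[j + 1] == words[i]:
            j += 1
        if j - i + 1 >= min_len:
            runs.append((words[i], j - i + 1))
        i = j + 1
    return runs

# Toy example echoing EVA-style tokens:
print(repeated_runs(["daiin", "daiin", "daiin", "chol", "qokedy", "qokedy"]))
# -> [('daiin', 3), ('qokedy', 2)]
```

Counting how often such runs occur per thousand words, versus the same count in a natural-language control text, would quantify the peculiarity.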
munchler 16 hours ago
raverbashing 6 hours ago
Mapping words 1:1 is not going to lead you anywhere (especially for a text that has stood undecoded for so long time)
It kiiiinda works for very close languages (think Dutch<>German or French<>Spanish) and even then.
brig90 17 hours ago
The challenge (as I understand it) is that the vocabulary size is pretty massive — thousands of unique words — and the structure might not be 1:1 with how real language maps. Like, is a “word” in Voynich really a word? Or is it a chunk, or a stem with affixes, or something else entirely? That makes brute-forcing a direct mapping tricky.
That said… using cluster IDs instead of individual words (tokens) and scoring the outputs with something like a language model seems like a pretty compelling idea. I hadn’t thought of doing it that way. Definitely some room there for optimization or even evolutionary techniques. If nothing else, it could tell us something about how “language-like” the structure really is.
Might be worth exploring — thanks for tossing that out, hopefully someone with more awareness or knowledge in the space sees it!
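The "score a cluster-ID sequence with a language model" idea can be prototyped with nothing fancier than a smoothed bigram model over cluster labels. A stdlib sketch; the training sequence and the candidates are toy data, not manuscript-derived:

```python
import math
from collections import Counter, defaultdict

def bigram_model(train_seq, vocab_size, alpha=1.0):
    """Laplace-smoothed bigram probabilities over cluster IDs."""
    bi = defaultdict(Counter)
    for a, b in zip(train_seq, train_seq[1:]):
        bi[a][b] += 1
    def prob(a, b):
        return (bi[a][b] + alpha) / (sum(bi[a].values()) + alpha * vocab_size)
    return prob

def perplexity(seq, prob):
    """Lower perplexity = the sequence looks more like the training data."""
    logp = sum(math.log(prob(a, b)) for a, b in zip(seq, seq[1:]))
    return math.exp(-logp / (len(seq) - 1))

# Toy: train on a strongly patterned cluster sequence, then compare a
# similarly patterned candidate against a scrambled one.
train = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
prob = bigram_model(train, vocab_size=3)
structured = [0, 1, 2, 0, 1]
scrambled = [2, 2, 0, 0, 1]  # hypothetical candidate ordering
print(perplexity(structured, prob) < perplexity(scrambled, prob))  # True
```

An evolutionary search could then mutate candidate cluster-to-word mappings and keep the ones whose output scores best under a real language model, as the parent comment suggests.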
marcodiego 17 hours ago
quantadev 17 hours ago
Maybe a version of scripture that had been "rejected" by some King, and was illegal to reproduce? Take the best radiocarbon dating, figure out who was King back then, and if they 'sanctioned' any biblical translations, and then go to the version of the bible before that translation, and this will be what was perhaps illegal and needed to be encrypted. That's just one plausible story. Who knows, we might find out the phrase "young girl" was simplified to "virgin", and that would potentially be a big secret.
edoceo 13 hours ago
quantadev 13 hours ago
tough 8 minutes ago
GTP 18 hours ago
brig90 17 hours ago
rossant 18 hours ago
adzm 16 hours ago
quantadev 17 hours ago
Also there might be some characters that are in there just to confuse. For example that bizarre capital "P"-like thing that has multiple variations seems to appear sometimes far too often to represent real language, so it might be just an obfuscator that's removed prior to decryption. There may be other characters that are abnormally "frequent" and they're maybe also unused dummy characters. But the "too many Ps" problem is also consistent with just pure fiction too, I realize.
ck2 18 hours ago
https://arstechnica.com/science/2024/09/new-multispectral-an...
but imagine if it was just a (wealthy) child's coloring book or practice book for learning to write lol
Avicebron 18 hours ago
Even if it was "just" an (extraordinarily wealthy and precocious) child with a fondness for plants, cosmology, and female bodies carefully inscribing nonsense by repeatedly doodling the same few characters in blocks that look like the illuminated manuscripts this child would also need access to, that's still impressive and interesting.
glimshe 21 hours ago
lolinder 20 hours ago
ahmedfromtunis 19 hours ago
That said, I just watched a video about the practice of "speaking in tongues" that some Christian congregations engage in. From what I understand, it's a practice where believers speak in gibberish during certain rituals.
Studying these "speeches", researchers found patterns and rhythms that the speakers followed without even being aware they exist.
I'm not saying that's what's happening here, but if this was a hoax (or a prank), maybe these patterns emerged just because they were inscribed by a human brain? At best, these patterns can be thought of as shadows of the patterns found in the writer's mother tongue.
InsideOutSanta 20 hours ago
People often assert this, but I'm unsure of any evidence. If I wrote a manuscript in a pretend language, I would expect it to end up with language-like patterns, some automatically and some intentionally.
Humans aren't random number generators, and they aren't stupid. Therefore, the implicit claim that a human could not create a manuscript containing gibberish that exhibits many language-like patterns seems unlikely to be true.
So we have two options:
1. This is either a real language or an encoded real language that we've never seen before and can't decrypt, even after many years of attempts
2. Or it is gibberish that exhibits features of a real language
I can't help but feel that option 2 is now the more likely choice.
neom 19 hours ago
buildsjets 12 hours ago
tonymillion 15 hours ago
CamperBob2 19 hours ago
It's harder to generate good gibberish than it appears at first.
cubefox 19 hours ago
vehemenz 17 hours ago
userbinator 12 hours ago
InsideOutSanta 19 hours ago
veqq 20 hours ago
There's certainly a system to the madness, but it exhibits rather different statistical properties from "proper" languages. Look at section 2.4: https://www.voynich.nu/a2_char.html At the moment, any apparently linguistic patterns are happenstance; the cypher fundamentally obscures its actual distribution (if a "proper" language.)
andoando 20 hours ago
Shud less kee chicken souls do be gooby good? Mus hess to my rooby roo!
edoceo 13 hours ago
Loughla 16 hours ago
int_19h 16 hours ago
vehemenz 17 hours ago
lolinder 17 hours ago
tough 6 minutes ago
poulpy123 1 hour ago
As far as I know it's just gibberish since it doesn't follow the statistics of the known languages or cyphers of the time.
himinlomax 19 hours ago
The age of the document can be estimated through various methods that all point to it being ~500 years old. The vellum parchment, the ink, the pictures (particularly clothes and architecture) are perfectly congruent with that.
The weirdest part is that the script has a very low number of different signs, fewer than any known language. That's about the only clue that could point to a hoax afaik.
andyjohnson0 21 hours ago
I have no background in NLP or linguistics, but I do have a question about this:
> I stripped a set of recurring suffix-like endings from each word — things like aiin, dy, chy, and similar variants
This seems to imply stripping the right-hand edges of words, with the assumption that the text was written left to right? Or did you try both possibilities?
Once again, nice work.
brig90 21 hours ago
veqq 20 hours ago
https://www.voynich.ninja/thread-4327-post-60796.html#pid607... is the main forum discussing precisely this. I quite liked this explanation of the apparent structure: https://www.voynich.ninja/thread-4286.html
> RU SSUK UKIA UK SSIAKRAINE IARAIN RA AINE RUK UKRU KRIA UKUSSIA IARUK RUSSUK RUSSAINE RUAINERU RUKIA
That is, there may be 2 "word types" with different statistical properties (as Feaster's video above describes), perhaps e.g. 2 different cyphers used "randomly" next to each other. Figuring out how to imitate the MS's statistical properties would let us determine the cypher system and make steps towards determining its language etc., so most credible work has gone in this direction over the last 10+ years.
This site is a great introduction/deep dive: https://www.voynich.nu/
brig90 20 hours ago
akomtu 20 hours ago
nine_k 21 hours ago
<quote>
Key Findings
* Cluster 8 exhibits high frequency, low diversity, and frequent line-starts — likely a function word group
* Cluster 3 has high diversity and flexible positioning — likely a root content class
* Transition matrix shows strong internal structure, far from random
* Cluster usage and POS patterns differ by manuscript section (e.g., Biological vs Botanical)
Hypothesis
The manuscript encodes a structured constructed or mnemonic language using syllabic padding and positional repetition. It exhibits syntax, function/content separation, and section-aware linguistic shifts — even in the absence of direct translation.
</quote>
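The quoted finding that the transition matrix is "far from random" can be made concrete by measuring the entropy of each cluster's outgoing-transition distribution. A stdlib sketch; the sequences are toy data, not the manuscript's actual cluster labels:

```python
import math
from collections import Counter, defaultdict

def row_entropies(cluster_seq):
    """Shannon entropy (bits) of each cluster's outgoing-transition
    distribution. A uniform-random sequence over k clusters approaches
    log2(k) per row; structured text sits well below that ceiling."""
    counts = defaultdict(Counter)
    for a, b in zip(cluster_seq, cluster_seq[1:]):
        counts[a][b] += 1
    ents = {}
    for a, row in counts.items():
        total = sum(row.values())
        ents[a] = -sum((c / total) * math.log2(c / total) for c in row.values())
    return ents

# Toy: a perfectly patterned sequence has zero-entropy rows, because each
# cluster is always followed by the same next cluster.
ents = row_entropies([0, 1, 2, 0, 1, 2, 0, 1, 2])
print(ents)
```

Comparing the manuscript's row entropies to those of a shuffled version of the same label sequence would put a number on "strong internal structure".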
brig90 21 hours ago
gchamonlive 21 hours ago
InsideOutSanta 20 hours ago
I don't see how it could be random, regardless of whether it is an actual language. Humans are famously terrible at generating randomness.
nine_k 19 hours ago
InsideOutSanta 19 hours ago
I wouldn't assume that the writer made decisions based on these goals, but rather that the writer attempted to create a simulacrum of a real language. However, even if they did not, I would expect an attempt at generating a "random" language to ultimately mirror many of the properties of the person's native language.
The arguments that this book is written in a real language rest on the assumption that a human being making up gibberish would not produce something that exhibits many of the properties of a real language; however, I don't see anyone offering any evidence to support this claim.
timonofathens 7 hours ago
cookiengineer 16 hours ago
brig90 16 hours ago
My main goal was to learn and see if the manuscript behaved like a real language, not necessarily to translate it. Appreciate the link — I’ll check it out (once I get my German up to speed!).
Nursie 11 hours ago
0points 8 hours ago
So, sorry but you are not busting any bubbles today.
ablanton 20 hours ago
https://www.researchgate.net/publication/368991190_The_Voyni...
Reubend 19 hours ago
For more info, see https://www.voynich.ninja/thread-3940-post-53738.html#pid537...
krick 17 hours ago
Yet 10 years later I still hear that the consensus is that there's no agreed-upon translation. So, what, all that Mandaic-Gypsy stuff was nothing? And all the coincidences were… coincidences?
mellow_observer 6 hours ago
So far none of these ideas have been shown to apply to the full text, though. What you would expect with a real translation is that the further you get, the easier it becomes to translate more. But with the attempts so far, we keep seeing that it becomes more and more difficult to pretend that other pages are just as translatable using the same scheme you came up with initially. It eventually just dies a quiet death.
cookiengineer 16 hours ago