392 points by RyanShook 1 week ago | 139 comments
graypegg 1 week ago
If you map it onto a Hilbert curve [0], the X and Y axes mean nothing, but points that are close together in the sorted list will be visually close together in the output image.
Since the first part of an ISBN is the country, the second part is the publisher, and the third part is the title, with a checksum at the end, I would remove the checksum and sort each one as a big number (no hyphens).
You should end up with "islands", where you see big areas covered by big publishing countries, with these "islands" having bright spots for the publisher codes.
Bonus points for labeling these areas!
I set up something a while ago [1] for an interview that does this with weather data. It makes the seasons really obvious since they're all grouped together.
[0] https://en.wikipedia.org/wiki/Hilbert_curve
[1] https://graypegg.com/hilbert (https://github.com/graypegg/hilbertcurveplayground code if anyone wants to go for the prize using this! Please at least mention me if you decide to reuse this code, but I can't stop ya lol)
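For anyone who wants to try the idea above, here is a minimal Python sketch, not graypegg's actual code. It assumes ISBN-13 input, and the names `isbn_sort_key` and `hilbert_d2xy` are made up for illustration. The index-to-coordinate conversion is the standard iterative Hilbert curve algorithm.

```python
def isbn_sort_key(isbn13: str) -> int:
    """Strip hyphens, drop the trailing check digit, return the rest as an int."""
    digits = [c for c in isbn13 if c.isdigit()]
    if len(digits) != 13:
        raise ValueError("expected 13 digits in an ISBN-13")
    return int("".join(digits[:12]))


def hilbert_d2xy(order: int, d: int) -> tuple[int, int]:
    """Map a position d along the curve to (x, y) on a 2**order by 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the sub-square so consecutive cells stay adjacent.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


if __name__ == "__main__":
    # Hypothetical sample ISBNs, sorted by key and placed by rank along the curve.
    sample = ["978-3-16-148410-0", "978-0-306-40615-7", "979-10-90636-07-1"]
    for rank, key in enumerate(sorted(isbn_sort_key(s) for s in sample)):
        print(key, hilbert_d2xy(4, rank))
```

Using the rank in the sorted list as `d` gives the layout described above. Scaling the key itself onto the curve is another option, and would leave unused ranges visible as gaps.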
abetusk 1 week ago
n2d4 1 week ago
The worry I have with Hilbert curves is that they make the result look like there are distinct "squares" of data [0] when really this is just an artifact of how Hilbert curves work. In that sense, the current visualization is more useful, because it's straightforward to identify the location of each country in it.
[0] https://raw.githubusercontent.com/jakubcerveny/gilbert/maste...
graypegg 1 week ago
And yeah that’s true! You end up with squares with Hilbert curves. But those squares are all "related" data. Then those squares are related to the squares near them. Zoom out more and that grouping of squares is related to the neighbouring macro-squares etc etc.
Basically the square shape is a positive. Kind of like how charting the derivative lets you see how random/related information is, grouping into these squares gives you a visualization of pattern-ness, rather than any specific measurement.
n2d4 1 week ago
But this is also true in Hilbert curves across the boundaries of the "squares" that I mentioned. The two center pixels in the top row are much more distant than any two pixels would be in a snake pattern.
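A rough way to check the claim about the two center pixels, as a sketch only: it assumes the common Hilbert orientation where the curve starts at (0, 0) and ends at the other end of the same row, with y = 0 taken as the top row. The function names are hypothetical.

```python
def hilbert_xy2d(n: int, x: int, y: int) -> int:
    """Position along the Hilbert curve of cell (x, y) on an n by n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip so the next level sees the sub-square in canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def snake_index(width: int, x: int, y: int) -> int:
    """Position of (x, y) in a back-and-forth row scan (even rows run left to right)."""
    return y * width + (x if y % 2 == 0 else width - 1 - x)


def center_gap_hilbert(n: int) -> int:
    """1D distance between the two middle pixels of row y = 0 on the Hilbert curve."""
    return abs(hilbert_xy2d(n, n // 2, 0) - hilbert_xy2d(n, n // 2 - 1, 0))


if __name__ == "__main__":
    for n in (4, 16, 64):
        # In a snake scan, two adjacent pixels are never more than 2*n - 1 apart.
        print(n, center_gap_hilbert(n), 2 * n - 1)
```

On a 4x4 grid the two center pixels of that row are 13 steps apart along the curve, while the worst case for adjacent pixels in a snake scan of the same width is 7.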
NooneAtAll3 1 week ago
2D neighbourhood is better than 1D one
> The worry I have with Hilbert curves is that they make the result look like there are distinct "squares" of data
that's the point, tho? instead of distinct lines of taken ISBNs in a row, you get distinct squares of taken ISBNs in a row - much more noticeable
WillAdams 1 week ago
A visualization using LoC or even Dewey Decimal would be far more useful, esp. if it also linked to public domain and copyright-free repositories/lists, say an interactive and visual version of John Mark Ockerbloom's:
est31 1 week ago
WillAdams 1 week ago
One can't use ISBNs alone to create a hierarchical listing of texts which is useful for anything beyond browsing by language/publisher/order in which the ISBN was generated.
A visual and interactive representation of books by LoC or some other cataloging system would actually be useful.
PaulHoule 1 week ago
Years later I was working at the library and got a little bit steamed because South End Press was reusing ISBNs after books went out of print, which was allowed but, I think, lame.
One of my strategies for researching a topic is looking a few up in the OPAC, finding them in the stacks, and finding more books on the topic in those areas. (In the Library of Congress system, machine vision could be under QA56 with the rest of computer science or around TA1630, thus "areas".)
From time to time I've thought about trying to replicate the feel of this with some kind of UI, given that our library moved a lot of the collection into deep archives and we have a very fast 'Borrow Direct' service with other peers.
convolvatron 1 week ago
MarceColl 1 week ago
Finnucane 1 week ago
NoMoreNicksLeft 1 week ago
The vast, vast majority have only been released as dead-tree versions. They have none of those. The books they scan may have an ISBN, but the scans do not have them. Like all Project Gutenberg books, their books have no ISBNs at all. From a strict point of view, they've released new editions of these books.
nickelpro 1 week ago
What you've described is that the archived content can be mapped to multiple ISBNs. It's clear the only element of concern here is the content itself. The failure to preserve a particular binding or printer's choice of typeface is irrelevant.
Failing to recognize this requires an almost malicious level of pedantry.
jameshart 1 week ago
Indeed a bigger problem is that it’s much harder to know which areas of the grid are never going to light up because the ISBN has not been used.
nickelpro 1 week ago
Lighting up the entire grid is still the goal, you're describing the problem of ensuring the right set of squares is illuminated for each piece of archived content. One is a problem of archiving the content, the other is a problem of bookkeeping.
NoMoreNicksLeft 1 week ago
Hardly worthless... oftentimes, the edition of the book matters as much as the title. Stephen King wrote two books named The Stand, and one isn't anything like the other. He pulled a Lucas pretty early on.
He's hardly the only author to ever do this. But it's not just authors either. Editors, collectors, translators all make their mark, and give you works that though they might be slightly different to you, the differences actually matter to the rest of us. It's not that you're ignorant that offends me, it's the arrogance about a subject you seem to know so little about that makes it difficult to tolerate.
There is no pedantry here, just a desire to actually preserve books and to organize them.
nickelpro 1 week ago
Then those two texts would map to different ISBNs, or perhaps each maps to multiple different ISBNs, it doesn't matter. That some texts exist with the same title but different content is similarly irrelevant.
The content is all that matters. Two different bodies of content, two different entries in the archive. Each entry may map to one or more ISBN numbers.
> the differences actually matter to the rest of us
The only differences that matter are the ones that matter to the archive that made the blog post. Your concerns are for entirely different things, which is fine, but don't say the OP's concerns or initiatives are impossible or ill-suited based on criteria you're projecting onto them.
mmooss 1 week ago
Are you saying they actively remove ISBN numbers from scans? If I downloaded one of the books, it wouldn't have an ISBN?
Why? That seems like a bunch of extra processing per book, makes it harder for users to specifically identify a book, and probably does nothing for legality. Also, can people search by ISBN?
Tomte 1 week ago
No, he's playing the pointless "well, actually a scan of a book is a different thing from the book itself" game.
NoMoreNicksLeft 1 week ago
nickelpro 1 week ago
> From a strict point of view, they've released new editions of these books.
And this is clearly a semantically worthless distinction from the point of view of the archive.
When different editions have different content, archiving those differences in that content may matter (arguably not for simple typographical corrections, printing errors, etc). When different ISBNs have identical content, it is totally irrelevant to the goals of the archive.
edflsafoiewq 1 week ago
> Until now, the only options to shrink the total size of our collection has been through more aggressive compression, or deduplication. However, to get significant enough savings, both are too lossy for our taste. Heavy compression of photos can make text barely readable. And deduplication requires high confidence of books being exactly the same, which is often too inaccurate, especially if the contents are the same but the scans are made on different occasions.
Finnucane 1 week ago
omoikane 1 week ago
The image contains 1000*800 pixels at 2500 ISBNs per pixel, so it's visualizing 2e9 ISBNs. ISBN-13 contains 12 digits plus one check digit, so we might have expected the image to be 500 times bigger/denser than the current image. The fact that it's at its current size suggests that only ISBNs with 978 and 979 prefixes are included, and since the bottom half is more sparse, that probably corresponds to the new 979 range.
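The arithmetic above can be checked in a few lines of Python. This is only a sanity check of the numbers quoted in the comment. The variable names are made up, and the 978/979 reading is the commenter's inference, not something confirmed by the archive.

```python
# Figures quoted in the comment above.
width, height = 1000, 800
isbns_per_pixel = 2500

shown = width * height * isbns_per_pixel   # ISBNs the image has room for
full_space = 10 ** 12                      # 12 free digits; the 13th is a check digit
per_prefix = 10 ** (12 - 3)                # 9 free digits once a 3-digit prefix is fixed

print(shown)                 # capacity of the image
print(full_space // shown)   # how many times larger the full 12-digit space is
print(shown // per_prefix)   # how many 3-digit prefixes fit in the image
```

The last number comes out to 2, which is consistent with the image covering exactly two 3-digit prefixes such as 978 and 979.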
skrebbel 1 week ago
saithound 1 week ago
Find the interactive visualiser by scrolling down, and switch it to "Files in Anna's Archive [md5]". This will highlight the location of the green pixels in grey.
Muehe 1 week ago
- Right-click the image and select "Inspect".
- Add a new CSS hue-rotate filter to the element:
element {
max-width: 100%;
margin: 0 auto;
filter: hue-rotate(-90deg);
}
Usually I use "filter: saturate(100);", but that didn't really work well for this image. You might have to adjust the rotation degree though, -90 worked best for me.
superzamp 1 week ago
Finnucane 1 week ago
glimshe 1 week ago
wayathr0w 1 week ago
If self-destruction is a necessary premise here, is that really a good thing?