Hacker Remix

The Tragedy of Google Books (2017)

193 points by lispybanana 9 hours ago | 87 comments

philipkglass 8 hours ago

These Google scans are also available in the HathiTrust [1], an organization built from the big academic libraries that participated in early book digitization efforts. The HathiTrust is better about letting the public read books that have actually fallen into the public domain. I have found many books that are "snippet view" only on Google Books but freely visible on HathiTrust.

If you are a student or researcher at one of the participating HathiTrust institutions, you can also get access to scans of books that are still in copyright.

The one advantage Google Books still has is that its search tools are much faster and sometimes better, so it can be useful to search for phrases or topics on Google Books and then jump over to HathiTrust to read specific books surfaced by the search.

[1] https://www.hathitrust.org/

dredmorbius 3 hours ago

HathiTrust is a fine example of a repository which is in theory useful but in practice all but useless.

Participation is limited to tertiary academic institutions, and possibly only four-year (rather than two-year) ones. This excludes local (city/county) libraries, as well as primary/secondary (grammar / middle / high school in the US) libraries.

Even public-domain records cannot be downloaded in whole, but rather can be saved one page at a time as PDFs. I'm pretty sure that those interested in more useful archival will and/or have created automated tools to do so, but HathiTrust remains the most notable point-of-access for such works, and the additional generation of conversion and republication further degrades the quality of original-publication formats. (It's less a problem for regenerated works from OCR'd or manually-converted documents, but those of course lose all the characteristics of original publication.)

And of course, many materials still under copyright are not accessible to the general public at all, no matter how obscure. I'd run into a case of this some months back trying to get a date attribution of an Alan Watts lecture which had been posted to HN:

<https://news.ycombinator.com/item?id=41231047> (thread).

And my request still stands. Anyone with an academic affiliation who can check <https://catalog.hathitrust.org/Record/000678503> and see how it relates to this post (<https://news.ycombinator.com/item?id=41230841>) would have my gratitude.

acidburnNSA 4 hours ago

HathiTrust has been absolutely transformative for me, as an amateur nuclear enterprise historian.

yonran 7 hours ago

> Dan Clancy, the Google engineering lead on the project who helped design the settlement, thinks that it was a particular brand of objector—not Google’s competitors but “sympathetic entities” you’d think would be in favor of it, like library enthusiasts, academic authors, and so on—that ultimately flipped the DOJ.

I was at Google in 2009 on a team adjacent to Dan Clancy when he was most excited about the Authors’ Guild negotiations to publish orphan works and create a portal to pay copyright holders who signed up, and I recall that one opponent that he was frustrated with was Brewster Kahle of the Internet Archive, who filed a jealous amicus brief (https://docs.justia.com/cases/federal/district-courts/new-yo...) complaining that the Authors’ Guild settlement would not grant him access to publishing orphan works too. In my opinion Kahle was wrong; the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing, which is what actually happened in the 15 years since then. Instead of one company selling out-of-print but in-copyright books, or multiple organizations, no one is allowed to sell them today.

Since then, of course, Brewster Kahle launched an e-library of copyrighted books without legal authorization anyway, which will probably be the death of the current organization that runs the Internet Archive. Tragic all around.

chambers 6 hours ago

I wish the contradiction you spotted was clear on their Wikipedia page. It demonstrates how far back IA's management troubles go, and how their clean image was maybe just an image.

I became concerned when they fibbed about why the Internet Archive Credit Union was liquidated. IA alleged it was shut down due to onerous regulations, but the government said IA never actually lived up to its goal of allowing local, low-income folk to sign up for its service. https://ncua.gov/newsroom/press-release/2016/internet-archiv...

mastazi 6 hours ago

This is an insightful comment and I thank you for sharing it but, after having looked at the brief you linked

> a jealous amicus brief that the Authors’ Guild settlement would not grant him access to publishing orphan works too

that's not a fair overview of the amicus brief, there are good points there about the process of notifying orphan works rights holders and about the risk of a monopolistic position. I do agree with you on this part though

> the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing

Edit: I also agree with you that the way the IA subsequently created its e-library was not ideal.

yonran 4 hours ago

> that's not a fair overview of the amicus brief, there are good points there about the process of notifying orphan works rights holders and about the risk of a monopolistic position

What I meant by “jealous” is that the Internet Archive’s interest was not to improve author notification or to protect foreign authors; it was to provide a competing service under similar or better terms than Google was able to negotiate without spending the time and money that Google did litigating. Kahle wanted what was in Google’s settlement.

And what I meant by “Kahle was wrong” is not that every argument that his lawyers thought up was false; I think the agreement was later amended to fix some issues. My point is that Kahle’s theory of change was wrong. He thought that when the settlement was rejected, then Google would push Congress to create an orphan works law which the Internet Archive could use to publish old books too. As he wrote in his op-ed, “We need to focus on legislation to address works that are caught in copyright limbo. … We are very close to having universal access to all knowledge. Let's not stumble now.” https://www.washingtonpost.com/wp-dyn/content/article/2009/0... As it turns out, the rejection of the class action settlement did not cause Congress to create an orphan works law. In retrospect, we would have been more likely to get an orphan works law if Google had been allowed to set up a proof of the concept, making the monopoly on orphan works temporary.

cxr 54 minutes ago

There's such a weird tone to your posts. It's as if they're meant to give the impression that Kahle had a substantial, if not singlehanded, influence over the outcome. In reality, his input probably didn't have even the impact that Kahle himself hoped for and the appropriate adjective to describe the effect is probably "negligible", if at all. It was a class action lawsuit with extremely dubious underpinnings where over 6,000 people wrote in to ask that they not be considered part of the class.

lokar 6 hours ago

I would say it’s much worse than “not ideal”; they may have poisoned the well for decades to come.

adastra22 5 hours ago

Maybe permanently, as societal stances on these sorts of issues tend to solidify over time. In a couple of generations the very idea of a library may be confined to history thanks to IA :(

jamiek88 7 hours ago

That pandemic library was a huge, obvious overstep by him.

It will have consequences far beyond the immediate lawsuit too.

The very concept has basically been iced for a generation, and the net is only getting more locked down, not less.

shkkmo 3 hours ago

> In my opinion Kahle was wrong; the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing

Maybe. I think that is a pretty optimistic view of Congress and our political process. I would argue that having a powerful, rich company with a monopoly to lose would have made passing such a law less likely, not more.

I do think we would have been better off with a Google monopoly on unpublished unclaimed books than with the lack of access we have today.

The article says:

> You’d get in a lot of trouble, they said, but all you’d have to do, more or less, is write a single database query. You’d flip some access control bits from off to on. It might take a few minutes for the command to propagate.

If it's so easy, I'm surprised nobody has done it and accepted the consequences. It seems one of the largest single positive impacts any person could make on the world. Once it's released, it'll never go back in the box. A modern Pandora.

caseysoftware 7 hours ago

I worked at the Library of Congress on their Digital Preservation Project, circa 2001-2003. The stated goal was to "digitize all of the Library's collections" and while most people think of books, I was in the Motion Picture Broadcast and Recorded Sound Division.

In our collection were Thomas Edison's first motion pictures, wire spool recordings from reporters at D-Day, and LPs of some of the greatest musicians of all time. And that was just our Division. Others - like American Heritage - had photos from the US Civil War and more.

Anyway, while the Rights information is one big, ugly tangled web, the other side is the hardware to read the formats. Much of the media is fragile and/or dangerous to use, so you have to be exceptionally careful. Then you have to document all the settings you used, because imagine that three months from now you learn some filter you used was wrong or the hardware was misconfigured: you need to go back and understand what was affected and how.

Cool space. I wish I'd worked there longer.

caseysoftware 7 hours ago

Also.. it was fun learning the answer to "what is the work?"

If you have an LP or wire spool recording, the audio is the key, obvious work. But then you have the album cover, the spool case, and the physical condition of the media. Being able to see an album cover or read a reporter's notes/labeling is almost as important as the audio.

ForHackernews 6 hours ago

Is the Library of Congress really beholden to copyright laws? I guess I assumed as the national deposit library they had a special exemption to copy any damn thing they pleased for archival purposes.

If they don't have that prerogative, they probably should, and Congress should legislate that to be the case.

aspenmayer 1 hour ago

The Library of Congress and its staff determine fair use exceptions in certain contexts so I’m not sure who could find fault with them, as they could simply authorize it before or after the fact, from what I understand.

Zigurd 8 hours ago

O'Reilly, for whom I've been a lead author and co-author, did this: https://www.oreilly.com/pub/pr/1042

They call it Founder's Copyright. They also use Creative Commons. The goal is to make out-of-print books available at no cost.

card_zero 8 hours ago

> A complete list of available titles is at www.oreilly.com/openbook

Exciting!

Follows link

Link no longer exists, gets O'Reilly front page instead

"Introducing the AI Academy, Help your entire org put GenAI to work"

Thanks O'Reilly.

stvltvs 7 hours ago

Looks like Openbook stuff is still there, just homeless. I had to do a web search to find it. For example:

https://www.oreilly.com/openbook/make3/book/

blacksmith_tb 6 hours ago

Yes, I see it all with

https://www.google.com/search?q=site%3Aoreilly.com+inurl%3Ao...

So it seems like it mainly lost the overview page?

tourmalinetaco 10 minutes ago

It looks as though they killed the page sometime between June 7th and June 26th, although the page on June 7th seems to try to redirect to “https://oreilly.janrainsso.com/static/server.html?origin=htt...”

https://web.archive.org/web/20240607220047/http://www.oreill...

Definitely perplexing. I can’t find a reason to kill what appears to be a simple HTML page unless they’ve killed the project entirely.

ToucanLoucan 7 hours ago

The original dream of the internet: Information, freely available to any who want it.

The new dream of the internet: Some information, that aligns with the values of our advertisers, delivered via an LLM that sometimes makes shit up.

MollyRealized 8 hours ago

It's okay, I'll just check the Wayb--shit

tourmalinetaco 10 minutes ago

Wayback Machine has been working for the past few days, look: https://web.archive.org/web/20240607220047/http://www.oreill...

microtherion 5 hours ago

It's somewhat ironic that, while the individual books are still accessible, their index pages https://www.oreilly.com/free and https://www.oreilly.com/openbook both redirect to some AI propaganda these days, with no links to the books left.

A third party page still has links to some (possibly all) of the books: https://zapier.com/blog/free-oreilly-press-books/