I don't get this article -- I do a lot of academic research and love online materials. When I have to access a book that is physical only, half an entire afternoon gets wasted. Online books are orders of magnitude faster to get, to search within, and you can highlight and annotate to your heart's delight.
Everyone agrees rare books and manuscripts need to be kept, which the article admits, so that's not an issue.
So ultimately the question is "regular books" and the crux of the article's argument seems to be:
> One glaring problem with libraries’ increasing their reliance on digital texts is the relatively large number of errors in the Google Books corpus. As archivist Paul Conway has shown, “the imperfection of digital surrogates is an obvious and nearly ubiquitous feature of Google Books.” But, even overlooking the outright errors in the scanned pages, my point is that “content”—what books contain—goes far beyond words in a particular order and beyond page images of a single representative copy. As we detach texts from books, we enable a certain range of procedures while disabling others. Online editions can offer up texts for investigation and discovery, but they are not replacements for the material they represent.
This makes no sense to me. I've never seen an "error" in a Google Books page image. The OCR for sure, but who cares, and that can always be redone. And in terms of "enabling a certain range of procedures while disabling others", online enables a ton, while I'm at a total loss of what is "disabled". Again -- for rare books and manuscripts, of course. A beautiful gilt binding, absolutely. But for just your everyday workhorse books that make up 99+% of a library's content... I just totally and utterly fail to see anything being lost at all. (And so many original covers/bindings are replaced with generic library covers/bindings when acquired, so it's not even that.)
Storing books digitally may make them more available in the present, but it will also likely mean that they'll be less available at some point in the future. Physical libraries with physical books have proven they can last for centuries, sometimes for millennia. For all we know, none of our digital storage media lasts significantly longer than decades, and even to pull that off, you have to constantly invest in copying and duplicating your data. Hard drives of the seventies, floppy disks, countless numbers of tape formats, CDs have all gone obsolete or are in the process of becoming obsolete and are only accessible with dedicated hardware that would be very expensive to replicate from scratch, if even possible. A book does not have to be decoded to be read; even microfilm is more secure than an SSD or spinning HD when it comes to readability in the future in a low-tech / no-tech / replicated tech scenario. When the grid comes down, tap water will become scarce and digital written literature will just vanish.
> Storing books digitally may make them more available in the present, but it will also likely mean that they'll be less available at some point in the future. Physical libraries with physical books have proven they can last for centuries, sometimes for millennia.
I don’t know about paper books lasting millennia. If you want to preserve books for millennia, you should carve them onto clay tablets and then bake them. The resulting baked clay tablets are pretty durable, even compared to the best paper.
Ask a real historian or research librarian about all the really important books that we know existed, but do not have, because there was no Devout Brothers of Document Preservation faithfully maintaining copies of them through the centuries.
> The beauty of digital libraries is (if they are free) endless copies can be made. That's where real longevity stems from.
Unfortunately this is demonstrably not the case with libraries’ current digital offerings. Not only are endless copies not allowed to be made, the digital copies are often more costly and come with a limited number of loans.
A lot of libraries, both public and private, are failing at this. I've asked a few librarians and IT folks at those libraries, and unfortunately the consensus seems to be that it does not make sense financially for the libraries. Yet. I hope this will change, but we are not there yet.
If copyright was more reasonable, we could also have avoided the whole hullabaloo about Roald Dahl: those who want censored versions could have them, and those who want the originals could have those. I think we'd see a rise in small-run book-on-demand style publishing, which I have always had a soft spot for.
Neither does Sci-Hub. I'm certainly willing to advocate for copyright reform through legislative action, for instance, but of all the hills I could die on it's not the one I'd choose above all others with respect to monetary fines or imprisonment.
First of all, this is what the Library of Congress is for. Sure it's not a bad idea to have one library with physical copies, or maybe even 2–3 or even 10, but we don't need thousands of them.
But second, no, physical libraries generally don't last for centuries and certainly almost never for millennia. They get sold off, torn down, they burn down, and so on.
On the other hand, you can keep an entire copy of Library Genesis on a home computer if you want. The sheer number of digital copies means none of this is going anywhere.
And if the grid goes down and society is collapsing, then a collection of a good proportion of the world's books on your hard drive powered by a generator or solar is going to be infinitely more valuable than some physical library you have to hike for days to get to with a chance that raiders will kill you on the way. ;)
There are two types of research. There's the "academic" style where you know what you want and you want to get it, where online materials shine.
Then there's the "browsing" type of research where you have an idea of what you want, and you go to a book on that, and you look at the books around it on the shelves; that's something you get from a library (or bookstore) that doesn't really happen online.
The "browsing" type of research is extremely important for academic research as well. I love texts being available online, but I've also found many new sources and leads to chase by physically browsing library stacks.
I do most of my materials hunting online now, and I think I'm pretty good at it, but for discovery/serendipity it is still inferior to going into the stacks. A big problem with digital searching is that even if everything you could want is digitized, the search interface channels you in ways you can't fully appreciate.
Advocates of going all digital argue that a sufficiently capable browsing interface could reproduce the experience of physical browsing. Even if that were true, it doesn't mean the resources would ever be allocated to make it happen. Once physical stacks are out of sight and mind, few will have any idea or care about what has been lost.
But you do get that. Better, actually, because you can approach it along multiple axes (something the real world is physically incapable of): if the author has published multiple books you can click that axis and follow along it, and if not, Amazon lists similar and related books.
For non-fiction, the books around it will just be either the author's other books, or books that are very close in the Dewey Decimal system. Again, trivial to get online.
It's trivial to browse titles online if there's an interface for it.
In fact, I find browsing related titles on Amazon to be a tremendously quicker way to find related books than browsing shelves organized by Dewey Decimal or similar at a library -- because collaborative recommendations, number of reviews, and actual review content are such strong signals that are totally absent in a library.
There you show the possibility of having selected an activity among many possible.
Your expression suggests /to seek/, ex IE sag, "to track", which is not /to search/, ex Lat circare, "look around" - to explore.
Part of exploration (the complement of exploitation) benefits from having a large amount of material in front of you, browsable. And having thousands of volumes immediately at hand remains invaluable; the electronic experience remains different.
I disagree -- electronic experience isn't different.
You can build any kind of interface for electronic "looking around" or "exploring" that you like.
While in a physical library, you are limited entirely to a single fixed physical layout.
There is nothing a physical library can do for exploring that can't be replicated on a computer. But a computer can add all sorts of incredible improvements.
And why would you only want to have thousands of volumes at hand for immediate access, when you could have millions instead?
Oh but you mustn't - because it is, it physically is. To simulate it, you would have to build a virtual reality system with a staggering resolution and FOV. Because a sauna and a jacuzzi are "not different" only from one perspective - and what is highlighted is the other one. You may say that one can be content with one of those ways - and we are replying that it is not necessary. ...You can have both.
> And why would you only want to have thousands of volumes at hand for immediate access, when you could have millions instead
First of all, "Real" book repositories can be on the order of hundreds of thousands of volumes. Bookstores of that magnitude still exist (though rare), and libraries holding millions are available.
Contextually, practically instead: this side of the Atlantic many publishers work through collections - so, when you wanted to e.g. search through the Poetry collection of publisher Alpha, you will have in front of you a few hundreds or thousands items, publication #1 to #123 or #1 to #1234 for the whole publishing effort, curated by experts within the publishing house. I.e. you will have the complete set in front of you. (This is more valid for special locations than in general - in libraries, the Dewey system tends to break it.)
Frankly, I don't want my taxpayer dollars or university tuition subsidizing the minuscule number of people who both want the "romance" of the stacks and actually use them regularly, when there's zero practical benefit over online versions.
I'd much rather those billions of dollars being spent on many thousands of libraries' physical collections went to paying teachers and professors instead, for instance.
You may have the wrong idea about what physical collections cost and how much money would be saved by replacing them with digital collections (the number is negative).
Source is I run a library. We pay outrageous prices for eBooks which 1) aren't actually owned, 2) can't be bought used or donated, 3) may expire after a fixed term or number of circulations, and 4) generally have to be managed within vendor-specific eBook lending platforms which in the not-that-long run are going to cost any library more than a set of shelves.
[Edit] As for whether this makes sense for libraries, we have to meet the demands and expectations of our users whether it makes sense or not. The hope (HOPE) is that given enough time and changes in the market, some of the terms under which we now acquire and pay for digital resources will become more favorable to libraries, but that could be a long way away. What I can tell you is that there are probably about 0 universities (in the US) who are actually prioritizing physical collections development over digital in spite of the costs.
Aside: your writing style is confusing and very hard to follow.
Collections are great, but from your example there are thousands of books on something rather obscure that very few libraries are going to have. Further, organizing all the books in such a collection is trivial - if you do it online.
> I just totally and utterly fail to see anything being lost at all.
It sounds like you don't find ordinary physical books to be valuable. I am of the opposite opinion. Having them digitized and available online is a good thing, but losing access to the physical copies is a real loss to a number of people. The two primary things I can think of at the moment are...
People who don't have access to the internet or have machines on which they can read electronic books. There are lots of such people.
People who find that reading a physical book is simply a better reading experience than reading on the screen. I am one of those. Reading on a computer or tablet is fine for consulting reference material, but if I want to actually read, quite a lot is lost by reading it on the screen. I've tried reading on screens, and it's just so uncomfortable that I can't do it and be happy about it. Which means that if I didn't have a physical copy, I'd probably just not read the book.
It doesn't have to be that way. If I was an ereader developer, there are a lot of things easily done to make it more physical book-like:
1. use scanned images, rather than reflowable text
2. render fonts that have subtle differences from one 'a' to the next 'a'
3. don't line the text up perfectly on the line
4. add an ink blotch here and there, and maybe a coffee ring
5. scan a blank page of real paper and use that as a background
6. scan a couple dozen of (5) and randomly use them
7. make the reader the size and shape of a pocketbook. I'm amazed that nobody does this. After all, they are "pocket" books, and ereaders don't fit in a pocket.
8. use a high res screen. While you don't really notice it consciously, a retina image of text is much easier on the eyes
9. make the ereader easily held and manipulated with one hand. Amazon's used to be like this, but then their "improved" models turned left and went over the embankment. The Kindle3 is the pinnacle of their ereaders
10. make an ereader display two facing pages at once. You know, like a book
11. make a larger ereader for the usual size of larger, non-pocketbook books. Even put a hinge between the two pages, so it opens and reads like a book
12. add the scent of paper to it. P.S. Some printed books smell like vomit. Don't emulate that.
13. have a "stack" of previous pages visited. So I can flip to another page, and reliably flip back to where I was
14. have a way to flip through the book, i.e. show pages sequentially, lingering on each for a second or so. Maybe have the flip speed sensitive to how hard you press the button
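Item 13, the "stack" of previously visited pages, is the easiest of these to pin down in code. Here is a minimal sketch in Python, assuming pages are plain integers; the class name `PageHistory` and its methods are made up for illustration and are not any real ereader's API.

```python
class PageHistory:
    """Back-stack of visited pages, similar to a browser's Back button."""

    def __init__(self, start_page: int = 1) -> None:
        self.current = start_page
        self._back: list[int] = []  # pages we jumped away from, oldest first

    def jump_to(self, page: int) -> None:
        """Go to another page, remembering where we were."""
        if page != self.current:
            self._back.append(self.current)
            self.current = page

    def go_back(self) -> int:
        """Return to the most recently left page; stay put if there is none."""
        if self._back:
            self.current = self._back.pop()
        return self.current


history = PageHistory(start_page=42)
history.jump_to(310)      # check an endnote
history.jump_to(7)        # glance at the table of contents
print(history.go_back())  # back at the endnote
print(history.go_back())  # back where we started reading
```

Ordinary page turns would probably not be pushed onto the stack, only jumps, so that "back" always means "back to where I was reading".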
It's always going to be cheaper for a library to lend internet/devices for reading than maintain a large physical collection for lending books.
And then there's print-on-demand. You don't need a library for that. That's awesome that you personally love reading on a physical page, but that's not a reason to warehouse hundreds of millions of physical books that get checked out once a decade.
Well, I'm hardly alone in my preferences, but my point was to answer the question "what is being lost". I gave two examples of real things that are being lost.
That loss may or may not be economically required -- but it's still a loss either way.
> It's always going to be cheaper for a library to lend internet/devices for reading
The economics of this may differ depending on where you are, but I'm 99% sure that this isn't the case where I am.
Virtually everywhere has cheap enough mobile data in 2023 to read books with, so internet is not an issue.
And I think you may be severely underestimating the cost of maintaining library-sized collections of books, in labor and utilities and maintenance and real estate. Consumer digital devices, especially things like e-ink readers, cost almost nothing compared to that.
> Virtually everywhere has cheap enough mobile data in 2023 to read books with, so internet is not an issue.
The expense wouldn't be in the communications. It would be in replacing the broken devices. And with the program I'm familiar with, there was no communications involved anyway. You get the ebook preloaded on the device.
> I think you may be severely underestimating the cost of maintaining library-sized collections of books, in labor and utilities and maintenance and real estate.
I'm very familiar with these costs in the library system where I live, and I am privy to the figures it was looking at when developing their own ebook program.
I know they vary a lot depending on where you are, though, so it wouldn't surprise me at all to learn that there are areas where it pencils out very differently.
However, as I've said, even if the ebook program saved big money, something important is still lost without physical books.
Are you sure about that? I've talked with librarians and stakeholders and it seems to be the opposite. There's no stable scaled infrastructure in place yet for books via lending internet/devices and print-on-demand and more. The library serves as that actually instead, especially in cities.
Here’s an issue I ran into recently where digitization fell flat.
Wanted images of a particular issue of a regional newspaper from the 1940s. I am in the US State where it was published and it was one of the top three newspapers at the time, so (I thought) no problem. The State library even sent me a copy of the digital archived copy of that issue.
But the scanning had been done from a bound copy of the newspaper and done with a (state-of-the-art at the time) low resolution scanner. Two or three words of each line (right side of right page, left side of the left-facing page) were missing…not just distorted, missing. Contacted them to get the original and they said they didn’t have the original physical copies of those issues from the 1940s, just those scans. (Library of Congress says the State library has the only complete physical collection.)
So I am dealing with lacunae (you know, the stuff archeologists wrestle with when texts are missing from 2000 year old burnt papyrus) from newspapers that are less than 100 years old. With 2000 year old burnt papyrus, new tech can maybe fix the problem. But there is no original source material to work with for this 1940s newspaper.
In addition, any early scanned photos from that issue run the gamut from bad to totally unusable. No future AI would be able to reliably reproduce anything from many of the photo scans…there is simply too little signal left. A trivial rescan issue with today’s scanning tech, but what can you do when the original is unavailable? The content in question relates to art and artists, so the published graphic images are important.
Early digitizations often don’t meet the needs of researchers. If the physical source material was tossed and unavailable for rescan, no matter how recently published, some text fragments and images are more “lost” than the burnt papyri of Egypt.
Physical archived copies of core documents are still important, even/especially bulky “ephemera”.
The public libraries I know request a form of subscription to access digitized material, involving the use of public identity - linked to the accessed books. As opposed to "access the shelves, pick the item, read it in place".
But some will deny personal data treatment, particularly in terms of matching accessed material and identity. So, general use of electronic documents through libraries will be off-limit to an important class of people. (Those who regard private intellectual activity as near top in the list of rightfully strictly private matters by principle.)
The public library I go to (NYPL) requires a proof of identity (driver's license, etc.) to get a library card which is required to check out material for use on-premises in a reading room.
There is only a tiny fraction (< 1%) of material that any random person can just walk in and browse.
And this is the case with all reference libraries that I'm aware of. (Edit: meant "research" library, mistyped "reference".)
Is there a research library (e.g. more than a million volumes) in Europe that you can browse most/all the collection of without even having to provide an ID card?
In my experience, Europe tends to require mandatory ID for things in general much more often than the US does.
And I find I remember next to nothing I read in digital books, even eink books, whereas I can constantly recall stuff from physical books I've read. Plus, the act of physically reading it does something, and I can generally remember roughly where I found something without need for a search.
Physical books just for me are a no brainer when I can get them. They're orders of magnitude better than anything digital for research, even if it isn't as 'efficient' and time gets 'wasted' (I'd also put forth an argument it isn't necessarily wasted).
> The OCR for sure, but who cares, and that can always be redone.
And that's only an issue at all for books which were scanned from print copies. Books which were uploaded in digital formats by the publisher won't have that issue.
The main problem with digitization is that it facilitates censorship and forgery.
We are already seeing a lot of this; especially in DRM’ed collections.
If everyone just downloaded the entire library locally (this is completely feasible with consumer drives), and used some sort of snapshotting rsync, then this problem would go away.
DRM prevents that, even for some public domain works.
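For a DRM-free local copy, the snapshot idea above can be sketched in a few lines of Python: hash every file, keep the manifest, and compare it against a later one to see what was silently removed or altered. This is only an illustration under assumptions, not what rsync itself does; the function names and the directory layout are made up, and reading whole files into memory is fine for books but would want chunking for very large files.

```python
import hashlib
from pathlib import Path


def build_manifest(library_dir: str) -> dict[str, str]:
    """Map each file's path (relative to library_dir) to its SHA-256 digest."""
    root = Path(library_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report files that vanished, appeared, or changed between two snapshots."""
    return {
        "removed": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }
```

Run `build_manifest` after each sync, save the result somewhere the publisher can't touch, and `diff_manifests(last_month, today)` tells you which titles were pulled or quietly edited.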
I agree that the article lacked structure. I'm curious: do you still do research at your local libraries? As in, physically go out to the library even if you're still using online/digital materials there.
Yes, instead of technical limitations of OCR, we’ll have inscrutable language models making up random shit, and people too lazy to look at the primary source material.
I foresee a day when the expectation for any peer-reviewed article in a major journal will be to provide proof that every citation was validated against a physical copy of the material cited, not just an online version of the citation. Give the grad students something useful to do…get a free flight to the appropriate physical repository, tour around the campus, meet some friends…
...Because if it meant "asking bots instead of reading",
not only of course we want the real, original material to be available - and much intellectual activity will still involve reading, analyzing and studying it,
but also, whenever actual artificial general intelligences will be available, they will be trained on said texts.
Indeed, it is partly an assumption on my part that OCR will get better with a GPT-like setup, but, as I said in another comment, I have trouble believing OCR won't be one of the better applications, as "is this an A" is much more straightforward than most of what GPTs have been doing.
I'm assuming that: OCR is something that's in a GPT's wheelhouse, and that OCR is something that can get much better with training. What is fact and fiction is hard on the internet for humans, look at how we have to teach youngsters how to not cite Wikipedia. But the difference between a T and an A seems much more cut and dry? I'm thinking OCR errors can be fixed / trained much easier than most domains.
Keep in mind there's one other problem with digitization: a lack of ownership. The law does not recognize any concept of digital sale: all digital copies are licensed property of the publisher[0], and it is not possible to legally sell them. This gives publishers the ability to retroactively wipe libraries' catalogs on the basis that they own them.
Yes, this isn't a problem with public domain works; but those have bigger preservation problems. DRM is a self-inflicted wound.
[0] I resisted the urge to write "author" here as American copyright is assignable. It shouldn't be, because this assignability is usually demanded up-front by publishers.
> The law does not recognize any concept of digital sale
The law does recognize a digital sale, including the right to resell and make archival copies. It turns out you can't trust the publisher's lawyers to give you good legal advice.
For example you don't need a license to run software you have legally acquired a copy of. Black letter law already gives you that right in the copyright act[1].
Capitol Records v. ReDigi says otherwise. In this case, a company called ReDigi claimed to have a way to digitally resell files without "copying" them. They even engineered their software to progressively delete the file on the seller's computer as bits are copied to the buyer.
The courts rejected that argument, because the bits are still not moved. They are just copied and deleted very quickly. Even if the file isn't in "two places at once", shredding the original and replacing it with a copy is still copyright infringement. In copyright law, 1 + -1 != 0 .
I will give you that the RAM Copy Doctrine[0] is a dead letter when it comes to mere execution of a computer program. But let's look at the actual black-letter law you're citing: 17 USC 117(b). It does not say that you can always sell computer programs, it just says that if you can sell them, then any backup or archival copies you make have to be sold (or destroyed) along with the original. Nothing in here actually mandates that originals have to be legally transferable.
In the realm of digital goods, the way that you acquire software is with a license agreement - because at the very least the basic act of digitally delivering you the goods requires some kind of contractual relationship between you and the service selling you the product. That provides enough consideration for them to start taking rights away, because courts let you contract away fair use or first sale in exchange for other rights[1]. So you don't have a basis to sell the program in question, because the agreement under which the product is delivered says you promise not to sell the program.
In the physical world this is harder to argue because most stores do not actually make you sign things in order to buy computer programs. There are already very well established rules about how sales work on physical goods in absence of a contractual agreement, too. You can use shrinkwrap licensing and there are court cases that respect that, but that's something the publishers had to fight tooth and nail in order to get. Digital was born with handcuffs.
For what it's worth, the Internet Archive is fighting in court with a bunch of book publishers to change this by establishing a legal basis for unlicensed digital lending. I do not expect them to win, even though I want them to win. The big hurdle with this is, again, that the courts will not allow you to make use of first sale digitally even if you slap DRM on the process.
[0] The result of MAI Systems Corp v. Peak Computer Inc; and legally binding in the 9th Circuit. Under this doctrine, any copy made, even just by turning on the computer or loading a program into memory in order to execute it, infringes the copyright owner's rights.
[1] This is also part of the reason why D&D's Open Gaming License was an utter piece of trash, even before they made it worse. It provided little to no rights in exchange for you dropping your fair use rights.
This story is from last summer, but it's still as relevant as ever. Things like this have been happening for a long time in libraries.
This is from last month:
Northern Vermont University, Castleton University, and Vermont Technical College are consolidating into one university, Vermont State University, by July 1st. As part of the consolidation, administrators announced they will be donating most of the library’s physical collection in favor of a digital-only library and repurposing the library building.
- most libraries going the “modern” route the author doesn’t like, because that’s what most people really need from a library
- some libraries keeping the “old school” approach for aesthetic and conservation reasons
- hopefully one or two “super preserver” libraries that are stockpiling and preserving books just to be an exhaustive source in case of, e.g., the collapse of civilization or the occasional importance of some physical copy to future researchers.
Does anyone know if #3 is happening?
I’m totally fine with this, it seems the only practical approach going forward. And I say this as someone who loves physical books and probably owns more than I’ll ever be able to read (over 10k at this point, I have a problem).
Legal deposit libraries are relatively rare - they can be e.g. one every 10 or 50 mil ppl, and sometimes the mandated time span of retention can be very limited.
It is not a given that this practice will increase rather than decrease or transform.
Aren't the various national libraries of the world attempting to do #3? If I remember high school correctly, the idea of those libraries is to keep at least 1 copy of every book published in that nation.
Or expand the coordination and collection mandate of the Federal Depository Library Program to include all physically published materials across all participating libraries, so that at least X (3?) libraries have a physical copy of any given published item in their collection/archives.
And develop a network of antiquarians/collectors who place a list of their private holdings in some shared database as a final resource for researchers. Someone is buying those deaccessioned materials…
Interesting read but this has happened before and is another case of technological innovation and lack of adaptability. I am a library enthusiast. I advocate for personal/private library of your own and public libraries of your community. I also like the larger private libraries of individuals/organizations who open up to the public and wish more people took advantage of these.
I think libraries are and will always be an integral part of daily life. Right now, the issues seem to stem from a failure of maintenance. And I mean the whole system itself and all related systems. Libraries themselves continue to lower their threshold for what maintenance is upheld: book maintenance, infrastructure maintenance (the handling, filing, recordkeeping, etc.), technological maintenance (many libraries have digitization efforts that have spanned decades, but most I've encountered seem to disregard redigitization, which is important for better quality, OCR capability, and future-proofing for better systemization), and organizational maintenance (libraries' administrative, legal, etc.), since there are a lot of layers and abstractions to these libraries. They are their own ecosystems. Some have become very complicated and hence their organizations have grown. A part of this failure of maintenance is the lack of advocacy for stewardship. Libraries should have long-term outlooks and goals, and a lot of the controversies and issues of recent years seem to really highlight this lack of long-term maintenance and stewardship.
I may be rambling and have shared concerns with my local librarians and IT individuals involved with libraries. I'm curious if anyone else has had similar conclusions.
Nothing. Digital is already more expensive because the digital publishers claim a physical book checked out x number of times needs replacing so if a digital book is checked out x number of times, the library is charged full price again. I rarely saw books checked out that many times to require a physically new copy due to wear and tear - libraries generally had staff that could repair the books.
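To make the arithmetic of that pricing model concrete, here is a rough Python sketch. Every number in it is hypothetical (the prices, the loan cap, the checkout count) and the function names are invented; it only shows the shape of the comparison between a metered license that must be re-bought after a fixed number of loans and a physical copy that is bought once and occasionally repaired.

```python
def metered_license_cost(checkouts: int, price: float, loans_per_license: int) -> float:
    """Cost of a title whose license must be re-bought every `loans_per_license` loans."""
    # Ceiling division; at least one license is needed to lend the title at all.
    licenses = max(1, -(-checkouts // loans_per_license))
    return licenses * price


def physical_copy_cost(price: float, repairs: int, repair_cost: float) -> float:
    """Cost of one physical copy that staff repair instead of replacing."""
    return price + repairs * repair_cost


# Hypothetical figures, for illustration only.
print(metered_license_cost(checkouts=60, price=40.0, loans_per_license=26))
print(physical_copy_cost(price=25.0, repairs=2, repair_cost=5.0))
```

With these made-up inputs the metered title is paid for three times while the physical copy is paid for once plus two repairs; real figures will vary by vendor and library.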