Google Burns the Library at Alexandria

May 28, 2011

Imagine a visit to the universal library: a building in which all books, manuscripts, scrolls, rolls and tablets from all civilisations and all ages have been placed next to each other on shelves running for tens and tens of miles. When Borges and others wrote about this fabulous place in generations past, theirs was only a dream. But with the coming of the Internet the dream became achievable because, though a massive undertaking – comparable in its way to rebuilding the pyramids on the Moon – a complete digital library is a credible endeavour and, what is better, all humanity can be invited to the opening party. How wonderful that Google (‘don’t be evil’) decided to enlist itself for this task on behalf of the world. How wonderful that it was not left to taxpayers or national interests to fund the library as a progressive’s indulgence or an arm of chauvinism. Instead, a multinational, representing the best of western entrepreneurship, set up the project philanthropically. But hiding away here is one of the saddest and one of the least told stories of the Internet age – something you’ll get no hint of on the slavish Wikipedia Google Books entry – for Google have sabotaged their own act of creation and in doing so perhaps destroyed the hope of a universal digital library for ever. This is not a reference to the arguments over recent books and copyright, but rather to the fact that in building their digital library – particularly in the scanning of pre-1950 books – Google are doing a piss-poor job.

The best way to show just how badly is through a brief comparison. Now archive.org (the best site on the world wide web?) has pooled all available pdf scans of books, together with the efforts of the early and heroic Project Gutenberg. Google pdf books are also included – along with ones by other bodies including MSN. But the Google scans are far inferior to the others in four crucial respects.

(i) Google scanning is often atrocious: books are sometimes incomplete or illegible.

(ii) It is not possible to take the text from Google pdfs with the cut and paste function.

(iii) As the pdfs are in black and white, image quality is typically inferior.

(iv) Google has chronically overcompensated for copyright laws, particularly outside the US, with the result that many users are blocked from downloading relevant pdfs even when these are out of copyright.

To illustrate the extent of Google’s mismanagement the present author took four random titles that have been scanned by Google but also by other organisations and that are to be found at archive.org: for details see below.* For all four books the non-Google scans are perfect. Of the Google scans, one is a train-wreck with missing and blurred pages; a second is borderline unusable – try reading a novel where the first pages are missing; a third is flawed with image problems; and a fourth, if you can get hold of a copy, passes muster – though the non-Google version remains superior here as well. It should be noted too that this is not an ‘unlucky’ or a manipulated sample (honest Indian). If anything Google got off lightly, thinking of some of the sorry Google-produced pdfs that this author has stumbled upon in his time…

The obvious counter-argument that Google will make is that ‘we are doing this as an act of kindness, free of charge: how dare you criticise us!’ And certainly it is true that for the first weeks in Google Books any new user will be gagging on the splendour of it all, quite unconcerned by such ‘tiny’ flaws. And, yes, it is an extraordinary experience to pretend to browse ALL books using Google Books’ search function. But, as imperfections like the ones mentioned above mount – and there is also the question of poor metadata – disillusionment sets in. After all, Google does not do all this free of charge. This is an extra service that brings millions of users to their search engines. They also have links on the book pages to Amazon and AbeBooks (not to mention Google Reader) so that when the book they have scanned is not available (for copyright reasons) the Internet user can purchase a paper copy: a bit cheeky when copyright has, in fact, expired, as is the case in three and perhaps four of the four volumes examined in the sample described above!! How long, after all, does it take a scanner to determine the death date (and hence copyright status) of a well-known author like Sabine Baring-Gould!? There is the suspicion – strengthened by the manner in which Google hides the pdf download button away and puts up anti-robot technology to ‘protect’ their pdfs – that Google would prefer that as few people as possible have the autonomy to read these texts on their own, free of charge.

Far more serious, though, is the fact that Google’s project to scan the world’s books and create a new Alexandria ‘for that library where every book shall lie open to one another’ prevents other better intentioned projects from doing the work with the requisite love and care – this is certainly the experience of funding-starved bodies trying to follow in Google’s footsteps. Let us say, for the sake of argument, that a British ornithological organisation wanted to scan to pdf and make freely available on the Internet all the five hundred or so British bird books written between 1850 and 1900. They would apply for cash (twenty thousand dollars?) to their own trustees, perhaps external bodies, perhaps local companies, perhaps even to a governmental body. In each case though they would be turned down, because ‘Google is already doing it’. But Google, as we have seen, are not already ‘doing it’. The illustrations in the bird books Google had done would be difficult to enjoy. Pages would be missing or blurred. And as the organisation is in the UK (i.e. the EU in copyright terms) UK users would not be able to download most of them anyway because of spurious copyright concerns!

Google, in historical terms, are building the library of Alexandria anew with one enormous digital hand and sloshing gasoline up and down the shelves and lighting matches with the other: Google, creator and destroyer of one of the most exciting projects that humanity has ever undertaken.

Don’t be evil!


28 May 2011: First Dianne writes in from Medieval Writing (that is well worth a visit). She suspects that Beachcombing is going over the top – and she may well be right: ‘‘Burns’, hmmm – a little hyperbole perhaps? The books are still there, even if the first, excessively hasty, attempts to digitise them are a trifle substandard. I did put up an earlier post about my problems with the Google books copyright issue, where I found myself having to print a large book one page at a time from the read online version because Google would not let me download a book that was, in fact, out of copyright. I couldn’t find a proxy server with a big enough bandwidth allowance to let me download the pdf. It has to be said that some of the Internet Archive scans are pretty crappy as well, and also the free Kindle versions they put up. I think there are people just going around university libraries doing very quick and dirty jobs on the whole thing. No doubt a lot of time and energy is being wasted, but I guess it represents the first steps to something like a universal digital library. I don’t think even a university funding body would suggest that a fine ornithological treatise had ‘been done’ in digital edition if the only copy was a grotty Google books edition. Instead, they would say it wasn’t in their immediate priorities and tell you to try to find private funding, but then they would do that anyway.’ Beach has heard lots of anecdotal material about funding being scotched – he would be grateful if anyone could send any specific examples in. As to non-Google scans being substandard, he has almost always found that they are first rate. Next up is Woodwose. She is more irate and some of these details did damage to Beach’s blood pressure: ‘I am generally a techno-innocent and babe in the woods when it comes to dealing with huge corporations-as-the-antichrist, but this really makes me angry.
My friends and I find our books – still in copyright – at least partially available on Google’s ‘service’, with no compensation.  I use Google books a fair amount and never realized that there was a way to cut and paste the text in out-of-copyright books. I admit that I have never run across corrupted files or badly scanned pages in a Google book – which must just be luck of the draw. The horror you describe reminds me of a situation we had with the Ohio Historical Society. (Brace yourself.) In the early days of scanning technology, the Historical Society decided to scan all available death certificates and then THROW AWAY THE ORIGINALS. [aaaargh!!!] There are so many errors that the certificate database is for most intents and purposes useless. And those originals are gone forever. I’m afraid that we did not learn from this and, thanks to Google, have plunged even further into this madness – the Ohio libraries Tsar has decreed that to eliminate wasteful duplication, only one or two copies of most books will be kept throughout the state. Duplicates are to be purged and sold or thrown away. No sense in having too many books – EVERYTHING is online, says the Tsar blithely. The director of the OSU Library (the largest University in Ohio) did such a bang-up job with the ‘remodeled library’ (read coffee-shop with free wi-fi) that he was hired by the Kent State Library expressly for the purpose of ‘getting rid of the books in the library’. I don’t know where this is going to end. The Google lawsuits don’t give me much hope that the world is going to change. The people in charge of libraries no longer seem to care about learning, but seem to be business majors–regarding books as commodities like cans of beans. They seem to grasp at the newest, shiniest technology as an end in itself, not as a means to promote learning. Anti-clutter people always say – you don’t need to buy books – everything’s available for free at the library!  
But, of course, that is no longer true – I feel I need to maintain my own library. My husband is about to retire and normally we would be thinking of downsizing to a smaller house. But we cannot give up our books. The traditional wisdom is, when you downsize, donate your books to a library! I have tried that – with disastrous results. Long, depressing story, will omit. So I have loads of books in subjects that no public library would want and which I refuse to donate to some college so they can sell them for pennies on the dollar through Better World Books. But, of course, they don’t need them anyway – because Google has put everything on line…. (Except 99% of my art/costume history collection.) My apologies for this wide-ranging rant which I have made All About Me. I believe in books the way some people believe in their deities and I am saddened by the malignant Fahrenheit 451 spirit that is abroad. Google has much to answer for, but I do not know how to stop them.’ Then finally Ricardo with some pointers on copyright law. ‘I basically defend a return to a 20 year copyright term and I’m against DRM technologies that place a key on culture, stopping things from entering the public domain after the copyright has expired (because, although the copyright has expired by law, you can’t break the DRM and yourself make a copy for others). About the 20 years… it might be even less. I think this term should be connected with the average time works have been shown to have a commercial value. I wouldn’t mind some registration process where you (the author) could renew your copyright explicitly while you are alive. You would just have to do it regularly. When you lost interest the work would simply expire and revert to public domain. Basically… where do you start?
With an ‘American’ mind frame, copyright should be given for encouraging creativity or in a ‘French’ mind frame, copyright is an expression of some ‘natural right’?’ For Beach the catastrophe here was the EU’s decision in the 90s to merge European copyright according to the strictest regime of all: the German, a system so strict that it was not strictly enforced in Germany! Beachcombing being a bit of a radical would even favour eternal copyright for author and family members as long as the family paid ten dollars per book every decade, say. Then on the question of Google Ricardo writes: ‘Google is a for-profit private entity. Trusting them to do ‘no evil’… it should always be understood that it will always be their definition of ‘evil’ in that phrase. So, what for them is Good might not be for the rest of us (and, in the long term, probably isn’t). Which can be seen by the way they jolly go along with China in matters of censorship. But yes, I concur with the gist of the post, question is, how to turn things around… One thing I see is not the competition for funding (it can be proven that, in fact, Google is sloppy work, not good for preservation and use) but the niche carving that Google has done, with this copyright thing, that stops others from also digitizing…’ Ricardo finished his email by introducing Beach to Europeana that is well worth a look. Thanks Ricardo, thanks Woodwose and thanks Dianne!

31 May 2011: First off Sami from ty.rannosaur.us writes in – Sami’s less than sympathetic comments were typical of about a third of the forty or so emails that Beach has got on this subject to date. Sami also makes the very important point that Google fixes problems. ‘I don’t understand what you are getting at. Are you upset at the quality of Google’s scans or the idea that Google doing these scans reduces funding for academia? Google Books has been a treasure trove of useful information for me and they have been fairly decent about going back and fixing them after I report them. I can’t really get upset about having instant access to millions of reference material that I don’t have personally. I will agree with you about metadata, Google’s current practice is atrocious.’ Then comes Professor Poe: ‘About 8 years ago (I think), I met some people in Michigan (at UMichigan?) who ran a project that solicited people to make good electronic books. At least I think that’s what they did. Maybe they corrected electronic books in Project Gutenberg. I can’t recall, other than to say it was a ‘distributed’ effort involving the preservation of old books. Lots of volunteers. I was going to do a magazine article on them. They knew about this Google thing, and talked my ear off about it. They did just what you did: compared good texts to Google’s scans. And they found a remarkably high rate of error. I lost touch with them (obviously), but I’m keenly interested in this problem. Do you know what is being done about it, if anything?’ Beachcombing fears absolutely nothing… Then comes MommaMackie: ‘Reading about Google’s project was like being told of the attack, violation and possible approaching death of my oldest and dearest friend. What is it about the science of information technology that brings out the stupidity in people?
The destruction and/or disposal of original and rare documents after they’ve been scanned into the system is irresponsibility on the highest level [this is a reference to Woodwose above]. Their failure to confirm the usability of the scanned data beforehand simply compounds their sin a hundredfold, in my opinion. And it is pretty obvious that the project was not being led by true librarians, with true love and respect for the value of the original bound and unbound documents. Selling them for pennies on the dollar or simply burning them? Had they sold the books, etc., at a fair market value, the project would have paid for itself many times over AND the originals been placed in hands that would treat them with respect.’ Thanks Sami, Marshall and Kathy!!!

25 June: Jonathan Jarrett over at A Corner of Tenth-Century Europe partially disagrees: ‘I do indeed care about this one, though I would have to agree that you may have gone over the top with the rhetoric. Google is of course not destroying this stuff, merely messing up good access to it. One can understand that the scale they’re working at (though the grunt work is usually being done by volunteers or temps in academic libraries, which is where the real quality problem comes from) permits minimum checking, though it is disheartening to see a service that works on keywords so inattentive to metadata; but, what they could do to solve that is make the metadata checking even easier and more responsive, and possibly also accept revised or improved scans – and there is two-way traffic between Google and archive.org, so bad Google copies are sometimes replaced – but they have a bottom line to make. For a lot of the awful early copies, for example, we have to blame not Google but Stanford, who were one of the earliest participants with Google in the whole Google Books venture. The real problem is that Google have been allowed to do this at all; given that, we can expect little better, and we can still hope that they will improve things. So on that score I refuse to lose hope. It needs a lot of user contribution but it can get better. There may of course be a better plan altogether, which I go on to below. I would meanwhile, however, raise a controversial finger and suggest that the real firelighters for the libraries have been electronic journal providers. Stories of people getting rid of books are thankfully rare, and even if they do, though it is painful for us to contemplate it, because we love such items, it is usually the books with the fewest readers that get the chop (the pulp? the second-hand shop?). The real danger is to journals.
Probably we can all think of actual institutions, direly in need of shelf space (because these worthy but misguided places still buy books too) who having acquired a JSTOR subscription or whatever at who knows what expense, have then junked their eighty-year run of the English Historical Review or similar because that really is online, in good quality, and they really need the space. And no-one will buy that, no-one has the space much though they would like it except for another library, which is probably in the same plight and sees no point. Then the library’s budget is cut, the subscription price goes up and they have to cancel it. (The money is probably put towards a sports coach.) Goodbye, access to all journals concerned; no going back. That’s where the fires have started, IMO. So, what can we do? We can join in with this:  and get it done properly. While I feel your pain, in other words, I think that circulating a solution may be better than becoming the Ginsberg of Google Books…’ Patrick from US Military History has even stronger opinions: ‘While I agree to the extent that often Google Books has corrupted or at least bad scans, at least the scans are there.  I too, tend to check Archive.org first because their digital copies are invariably of higher quality.  That being said, there are many books that are only available on Google Books and so I will still use them.  I live in Germany and I simply must rely on digital copies of books a lot of times because as much as I would like to, I don’t have the money to buy a physical copy of every book I use.  I therefore prioritize my buying and use the digital sites to make up that lack.  I too, think Google goes a little overboard with their copyright policy, but you yourself have pointed to an at least marginally quasi-legal work around through the use of proxies. The fact remains that Google Books does a good service by making the contents of some very eminent libraries available online.  
Libraries that many people will never be able to visit.  For that alone, Google is to be commended.  It is easy for people to get ten-different kinds of worked up when ‘evil’ big business does something that is not to their standards.  The question I have is, who else is making such a huge effort at digitizing books?  The answer is no one else is doing it on the scale Google is because only Google has the funding, we can complain and maybe Google will institute better QC methods into their scanning, but from where I sit it seems just as likely that they could decide it is a flawed product and stop doing it and for all the good Gutenberg and Archive.org do, they do not have anything like the deep pockets that Google has.  Then again, I don’t automatically assume that something is bad if a corporation is doing it.’ Beachcombing would reply by saying that he is certainly not anti business and he is glad that it has fallen to a corporation rather than a state to undertake the task. However, he has an overwhelming sense that humanity is only going to ever do this once and it is being done very, very badly… Thanks Jonathan and thanks Patrick!

*So as to pick randomly we took the first two titles using ‘Beach’ (this website’s moniker) as a search word and the second two titles using ‘strange history’ (our web address) as search words, picking only books that were in both Google’s and other scanners’ collections.

The four books that came up were:

A) William Henry Babcock (obit 1922) Cypress Beach (1890) [Google] [Other]

B) George B. Somerville (obit 19??) The Lure of Long Beach (1914) [Google - not downloadable, read via archive.org] [Other]

C) Sabine Baring-Gould (obit 1924) Strange survivals; some chapters in the history of man (1905) [Google] [Other]

D) Sabine Baring-Gould (obit 1924) Freaks of fanaticism and other strange events (1891) [Google - not downloadable, read via archive.org] [Other]

The contrast between the flawed Google operation and that of its rivals is striking. All non-Google scans are perfect and in colour. It is a simple matter to cut and paste from the text. The only negative thing to be said about them is that they are often several times larger in MB than their Google equivalents.

The Google scans, meanwhile, score poorly. Here are the details. Note that ‘EU users’ probably serves for all non-US users in what follows:

A) Google’s Cypress Beach. This book, despite being out of copyright in the EU, is not available to EU users in pdf form. It is in black and white and it is impossible to lift text from it employing the copy/cut/paste function. The Google version has also missed several pages between the contents leaf and page ten!

B) Google’s The Lure of Long Beach. This book is perhaps not in copyright in the EU – though this is uncertain because of the lack of an established obit – in any case, it is not available to EU Internet users. It is in black and white and it is impossible to lift text from it using the copy/cut/paste function. The Google version has all the pages this time but none of the (luscious) images, an important part of the work in question, on the online reader! This might be corrected in the pdf version but there seems to be no way to get this, even using some of the tricks described at the base of this page.

C) Google’s Strange survivals. The book, despite being out of copyright in the EU, is not available to EU users in pdf form. It is in black and white and it is impossible to lift text from it using the copy/cut/paste function. The Google version has also been badly scanned: pages 256 and 260 are blurred (though they can be read), while pages 247-248 and 267-268 are missing. If the reader is forced to read the text on archive.org’s online reader because they can’t download the pdf then they will find that pages 88-89 appear three times there (!) and that pages 247-248 and 267-268 but also 96-97 are missing.

D) Google’s Freaks of fanaticism. The book, despite being out of copyright in the EU, is not available to EU users from Google Books in pdf form. It is in black and white and it is impossible to lift text from it using the copy/cut/paste function. Here the scan on the online reader is almost perfect with only a couple of finger shots intruding at pages 8 and 441, though note that these do not impede reading. However, the present reader was not able to get the pdf scan because Google kept refusing access despite the button appearing!
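For anyone who would rather not count pages by hand, the kind of check carried out above (which pages are missing, which turn up more than once) can be sketched in a few lines of Python. This is only an illustration: the function name `audit_pages` is invented here, and the page list in the example is a reconstruction of one short stretch of Strange survivals as described under C), not data pulled from any actual scan.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def audit_pages(seen: Iterable[int], first: int, last: int) -> Tuple[List[int], List[int]]:
    """Return (missing, repeated) page numbers for the range first..last.

    `seen` is the sequence of printed page numbers as they appear in a scan.
    """
    counts = Counter(seen)
    # A page is missing if it never appears in the scan at all.
    missing = [p for p in range(first, last + 1) if counts[p] == 0]
    # A page is repeated if it appears more than once.
    repeated = sorted(p for p, n in counts.items() if n > 1)
    return missing, repeated


if __name__ == "__main__":
    # Hypothetical reconstruction of pages 85-100 in the online reader:
    # 96-97 absent, 88-89 present three times.
    pages = [p for p in range(85, 101) if p not in (96, 97)] + [88, 89, 88, 89]
    missing, repeated = audit_pages(pages, 85, 100)
    print("missing:", missing)
    print("repeated:", repeated)
```

The sketch only works once someone has already written down the page numbers in scan order, which is of course the tedious part.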

Note that for non-US users there are ways around this. Imagine you find a Google book that you want to read but on the relevant Google page there is no pdf download button. If the book was published prior to 1920 you have probably run into a copyright difficulty, with Google trying to respect your local copyright laws and perhaps overcompensating. There are three solutions that Beachcombing knows of. (i) Go to archive.org and look for the Google volume there: when you find it click on the ‘read online’ function. (ii) Go to archive.org and look for the Google volume there: when you find it click on the ‘Full Text’ function, though expect some difficulties! (iii) To disguise an EU Internet address and download the pdf as if you were Stateside employ an anonymous proxy. Beachcombing, being a law-abiding sort, should remind readers that EU copyright extends for seventy years from death (not publication!): clearly if you were to get a Google pdf with an anonymous proxy it would be important to establish personally that the text in question was within those limits so that you could respect local laws… Beachcombing certainly wouldn’t like to upset the authorities in Brussels, may they be blessed in their wisdom and munificence.
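The seventy-years-from-death arithmetic is simple enough to sketch in Python. What follows is a rough illustration and not legal advice: it assumes a single author, assumes the term runs to the end of the seventieth calendar year after the year of death (so the work is free from 1 January of the year after that), and ignores special cases such as joint authorship or posthumous publication. The function names are invented for the example; the death years are the ones given in the list of four books above, with the unknown one left as unknown.

```python
from datetime import date
from typing import Optional

EU_TERM_YEARS = 70  # life of the author plus seventy years, as stated above


def public_domain_from(death_year: int) -> date:
    """First day on which the work is out of copyright under the assumed rule."""
    # Assumption: the term is counted from 1 January of the year after death.
    return date(death_year + EU_TERM_YEARS + 1, 1, 1)


def is_public_domain(death_year: Optional[int], on: date) -> Optional[bool]:
    """True or False when the death year is known, None when it is not."""
    if death_year is None:
        return None  # no established obit, so no answer
    return on >= public_domain_from(death_year)


if __name__ == "__main__":
    sample = {
        "Cypress Beach (Babcock)": 1922,
        "The Lure of Long Beach (Somerville)": None,  # obit not established
        "Strange survivals (Baring-Gould)": 1924,
        "Freaks of fanaticism (Baring-Gould)": 1924,
    }
    post_date = date(2011, 5, 28)
    for title, death_year in sample.items():
        print(title, "->", is_public_domain(death_year, post_date))
```

Returning `None` for an unknown death year is deliberate: it matches the ‘three and perhaps four’ hedging in the post, where one author’s obit could not be pinned down.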