[8] However, it has also been criticized for potential copyright violations[8][9] and for a lack of editing to correct the many errors introduced into the scanned texts by the OCR process.
However, Tim Parks, writing in The New York Review of Books in 2014, noted that Google had stopped providing page numbers for many recent publications (likely the ones acquired through the Partner Program) "presumably in alliance with the publishers, in order to force those of us who need to prepare footnotes to buy paper editions."
With no need to flatten the pages or align them perfectly, Google's system not only achieved remarkable speed and efficiency but also helped protect the fragile collections from over-handling.
Afterwards, the crude images went through three levels of processing: first, de-warping algorithms used the LIDAR data to correct the pages' curvature.
Then, optical character recognition (OCR) software transformed the raw images into text, and, lastly, another round of algorithms extracted page numbers, footnotes, illustrations and diagrams.
Google devoted considerable resources to developing optimal compression techniques, aiming for high image quality while keeping file sizes minimal to enable access by internet users with low bandwidth.
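Google has never published the details of this pipeline, but the three described stages can be sketched with common open-source stand-ins. In the sketch below, OpenCV, pytesseract, and Pillow take the roles of the proprietary components; the de-warp step is only a placeholder where a LIDAR-derived surface model would drive the pixel remapping, and the file names are hypothetical.

```python
# Illustrative sketch only: Google's actual pipeline is proprietary.
import cv2                      # pip install opencv-python
import numpy as np
import pytesseract              # pip install pytesseract (needs Tesseract)
from PIL import Image           # pip install Pillow

def dewarp(page: np.ndarray, depth_map: np.ndarray) -> np.ndarray:
    """Stage 1: flatten page curvature. A real system would build a 3-D
    surface model from the LIDAR depth map; here an identity remap grid
    stands in as a placeholder."""
    h, w = page.shape[:2]
    map_x, map_y = np.meshgrid(np.arange(w, dtype=np.float32),
                               np.arange(h, dtype=np.float32))
    # In practice map_x/map_y would be offset by the modeled curvature.
    return cv2.remap(page, map_x, map_y, cv2.INTER_LINEAR)

def recognize(page: np.ndarray) -> str:
    """Stage 2: OCR the de-warped image into text."""
    return pytesseract.image_to_string(Image.fromarray(page))

def compress(page: np.ndarray, path: str) -> None:
    """Stage 3 analogue: store a small but legible image for low-bandwidth
    readers (the quality/size trade-off described above)."""
    Image.fromarray(page).save(path, "JPEG", quality=40, optimize=True)

raw = cv2.imread("scan_0001.png")            # hypothetical input scan
flat = dewarp(raw, depth_map=np.zeros(raw.shape[:2]))
text = recognize(flat)
compress(flat, "scan_0001_web.jpg")
```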
This page displays information extracted from the book (its publishing details, a high-frequency word map, the table of contents) as well as secondary material, such as summaries, reader reviews (not readable in the mobile version of the website), and links to other relevant texts.
Users can export the bibliographic data and citations in standard formats, write their own reviews, and add the book to their personal library, where it can be tagged, organized, and shared with other people.
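As a hypothetical illustration of such an export, the snippet below appends a BibTeX record, one such standard citation format, to a local bibliography file; the entry key, field values, and URL are all invented placeholders.

```python
# Hypothetical sketch: the kind of BibTeX entry an "export citation"
# action yields. Title, key, and URL here are invented.
entry = """@book{darwin1859origin,
  title     = {On the Origin of Species},
  author    = {Darwin, Charles},
  year      = {1859},
  publisher = {John Murray},
  url       = {https://books.google.com/books?id=PLACEHOLDER}
}"""

# Append to a local bibliography so a LaTeX \cite{darwin1859origin} resolves.
with open("citations.bib", "a", encoding="utf-8") as f:
    f.write(entry + "\n")
```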
[31] The project has received criticism that its stated aim of preserving orphaned and out-of-print works is at risk because the scanned data contains errors that go uncorrected.
To make this book available as an ePub formatted file we have taken those page images and extracted the text using Optical Character Recognition (or OCR for short) technology.
Our computer algorithms also have to automatically determine the structure of the book (what are the headers and footers, where images are placed, whether text is verse or prose, and so forth).
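The algorithms themselves are unpublished, but the kind of heuristic involved can be sketched. The toy functions below, written against an assumed pages-as-lists-of-lines representation, drop running headers and footers that repeat across pages and guess verse from short, ragged line lengths; the thresholds are arbitrary.

```python
# Toy sketch of structural heuristics; Google's actual algorithms are
# unpublished. Each page is assumed to be a list of text lines.
from collections import Counter
from statistics import mean, pstdev

def strip_running_headers(pages: list[list[str]]) -> list[list[str]]:
    """Drop a first/last line that repeats (near-)verbatim across many
    pages -- the classic signature of a running header or footer."""
    tops = Counter(p[0].strip() for p in pages if p)
    bottoms = Counter(p[-1].strip() for p in pages if p)
    cleaned = []
    for p in pages:
        body = list(p)
        if body and tops[body[0].strip()] > len(pages) // 2:
            body = body[1:]
        if body and bottoms[body[-1].strip()] > len(pages) // 2:
            body = body[:-1]
        cleaned.append(body)
    return cleaned

def looks_like_verse(lines: list[str]) -> bool:
    """Verse tends toward short, ragged-right lines, while prose fills
    the full measure; the cut-off values here are arbitrary."""
    lengths = [len(l) for l in lines if l.strip()]
    return bool(lengths) and mean(lengths) < 45 and pstdev(lengths) > 10
```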
[37] Scholars have frequently reported rampant errors in Google Books metadata, including misattributed authors and erroneous dates of publication.
[31] Other metadata errors reported include publication dates before the author's birth (e.g., 182 works by Charles Dickens dated before his birth in 1812); incorrect subject classifications (an edition of Moby Dick found under "computers", a biography of Mae West classified under "religion"); conflicting classifications (10 editions of Whitman's Leaves of Grass all classified as both "fiction" and "nonfiction"); incorrectly spelled titles, authors, and publishers (Moby Dick: or the White "Wall"); and metadata for one book incorrectly appended to a completely different book (the metadata for an 1818 mathematical work leads to a 1963 romance novel).
[38][39] One study reviewed the author, title, publisher, and publication-year metadata elements for 400 randomly selected Google Books records.
[40] Metadata errors such as incorrectly scanned dates have made research using the Google Books database difficult.
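A sanity check for the most glaring of these errors, a publication date preceding the author's birth, is straightforward to express; the sketch below uses invented records and field names.

```python
# Sketch of a check that would catch the error class cited above (e.g.,
# works by Dickens dated before his 1812 birth). Records are hypothetical.
records = [
    {"title": "A Tale of Two Cities", "author": "Charles Dickens",
     "pub_year": 1509},   # a mis-scanned date: 1859 read as 1509
]
birth_years = {"Charles Dickens": 1812}

def flag_impossible_dates(records, birth_years):
    for r in records:
        born = birth_years.get(r["author"])
        if born is not None and r["pub_year"] < born:
            yield r

for bad in flag_impossible_dates(records, birth_years):
    print(f'{bad["title"]}: dated {bad["pub_year"]}, '
          f'author born {birth_years[bad["author"]]}')
```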
Critics argue that because the vast majority of the books proposed for scanning are in English, scanning will result in a disproportionate representation of natural languages in the digital world.
This has led the makers of Google Scholar to start their own program to digitize and host older journal articles (in agreement with their publishers).
"[73] 2003: The team works to develop a high-speed scanning process as well as software for resolving issues in odd type sizes, unusual fonts, and "other unexpected peculiarities.
The announcement soon triggered controversy, as publisher and author associations challenged Google's plans to digitize not just books in the public domain but also titles still under copyright.
[82] October 2006: The University of Wisconsin–Madison announced that it would join the Book Search digitization project along with the Wisconsin Historical Society Library.
March 2007: The Bavarian State Library announced a partnership with Google to scan more than a million public domain and out-of-print works in German as well as English, French, Italian, Latin, and Spanish.
[87] May 2007: Mysore University announced that Google would digitize over 800,000 books and manuscripts, including around 100,000 manuscripts written in Sanskrit or Kannada on both paper and palm leaves.
[68] June 2007: The Committee on Institutional Cooperation (rebranded as the Big Ten Academic Alliance in 2016) announced that its twelve member libraries would participate in scanning 10 million books over the course of the next six years.
[11] August 2010: Google announced that it intended to scan all 129,864,880 known existing books within a decade, amounting to over 4 billion digital pages and 2 trillion words in total.
[116] The Authors Guild and the Association of American Publishers separately sued Google in 2005 over its book project, citing "massive copyright infringement."[117] Google countered that its project represented fair use and was the digital-age equivalent of a card catalog with every word in the publication indexed.
The settlement received significant criticism on a wide variety of grounds, including antitrust, privacy, and inadequacy of the proposed classes of authors and publishers.
[128] Also, in November that year, the China Written Works Copyright Society (CWWCS) accused Google of scanning 18,000 books by 570 Chinese writers without authorization.
On November 20, Google agreed to provide a list of the Chinese books it had scanned, but the company refused to admit that it had "infringed" copyright laws.
[130] Google's licensing of public domain works is also an area of concern, due to its use of digital watermarking techniques on the books.
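Google has not disclosed its watermarking method; as a generic illustration of the technique at issue, the sketch below hides an identifier in the least-significant bits of a page image's red channel. The file names and tag are hypothetical.

```python
# Generic LSB watermark sketch; Google's actual method is not public.
from PIL import Image

def embed(path_in: str, path_out: str, tag: str) -> None:
    img = Image.open(path_in).convert("RGB")
    bits = "".join(f"{b:08b}" for b in tag.encode("utf-8"))
    px = img.load()
    w, _ = img.size
    for i, bit in enumerate(bits):       # assumes the image has enough pixels
        x, y = i % w, i // w
        r, g, b = px[x, y]
        px[x, y] = ((r & ~1) | int(bit), g, b)   # overwrite red channel's LSB
    img.save(path_out, "PNG")            # lossless format, so the bits survive

embed("page.png", "page_marked.png", "GB:pd-scan-0001")  # hypothetical ID
```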