Help:Internet Archive

Internet Archive

Shortcut: H:IA

Guidelines to downloading files from and uploading files to the Internet Archive


The Internet Archive

The Internet Archive is a non-profit digital library that holds nearly 3 million digitised books as well as music, audio, video and other files. It is one of the main sources of DjVu files for use on Wikisource. As well as files based on their own scans, the Internet Archive will also derive files (including DjVu files) from scans uploaded by its users. This can be a useful way to convert user-made scans into a DjVu file compatible with Wikisource (as well as making the work available for others).

This help page focuses on DjVu files, because that is the most used file type on Wikisource, but the process can be used for any other file type available from the Internet Archive.

Getting files


Searching

Go to The Internet Archive
Search for the book (or other text) you want. The basic search has a text field and a drop-down list. Type the title of the book in the text field and set the drop-down to "Texts".
Click "Go"
If the correct file is found on the Archive, you should see it in the search results. If there are multiple appropriate files, select the one you deem the best. This is subjective, but a clear scan will work best for proofreading, so aim for the best quality available (also note that some scans may have dirt or writing on the pages, which may make proofreading harder). Different scans may come from different editions. If so, it is up to you which you pick, but the earliest edition available is a popular choice.
If unsuccessful, you can also try following links, searching by subject, searching by author, or using the Advanced Search function.

If you didn't find the intended book but found another that would be interesting to work on, it is strongly recommended that you check whether it is really suitable for Wikisource in licensing terms (e.g., whether it is a public domain work or licensed under a compatible copyleft licence). The Internet Archive accepts contributions that are still in copyright or under restrictive licensing terms, but Wikisource will not accept them automatically simply because they are available on archive.org; they must also meet the licensing requirements.

DjVu file

The DjVu file can be downloaded (and uploaded to Wikimedia Commons) by following the steps below or manually tweaking the URL to the default DjVu URL format.

1. On the left side of the details page, there will be a box with the title "View the Book", as shown in Fig. 1.

2. Click on the "HTTP" link to get to the list of files. This is indicated by the red arrow in Fig. 1.

Fig. 1. A basic form of the "View the Book" box, found in the Details pages of the Internet Archive.

3. This will open a list of files, as shown in Fig. 2.

4. Locate the file with the .djvu suffix. This is indicated by the red arrow in Fig. 2.

Other files can be downloaded instead of the DjVu. If required, proceed with the most appropriate file from the list.

An alternative format for text is the PDF document, with the .pdf suffix.
Audio files in the Ogg Vorbis format have the .ogg suffix.
Video files in the Ogg Theora format have the .ogv suffix.
The original scans are available from this list as well. In this example, the file sikhafghansinco00shahrich_jp2.zip is an archive of JPEG 2000 images of individual pages. This can sometimes be useful as it will contain high quality versions of illustrations, photographs and other elements of the book.

5. This is the file that needs to be uploaded to Wikimedia Commons. See Uploading (below).
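Choosing a file by its suffix can be sketched in a few lines of Python. This is only an illustration: the function name `pick_file` is invented here, and the listing is hypothetical apart from the `_jp2.zip` name, which is taken from the example above.

```python
def pick_file(filenames, suffix=".djvu"):
    """Return the first name ending in `suffix` (case-insensitive), or None."""
    wanted = suffix.lower()
    return next((name for name in filenames if name.lower().endswith(wanted)), None)


# Hypothetical listing: only the _jp2.zip name appears in the example above.
listing = [
    "sikhafghansinco00shahrich.djvu",
    "sikhafghansinco00shahrich.pdf",
    "sikhafghansinco00shahrich_jp2.zip",
]

print(pick_file(listing))              # the DjVu file, if present
print(pick_file(listing, "_jp2.zip"))  # the archive of original page scans
print(pick_file(listing, ".ogv"))      # None: no video in this listing
```

The same helper works for the other suffixes listed above (.pdf, .ogg, .ogv); it only inspects names and does not contact the Internet Archive.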

Fig. 2. Example of a list of files on the Internet Archive.

OR

The DjVu file download link can be retrieved by manually tweaking the book URL to the default DjVu download URL format.

https://archive.org/details/$File$ to https://archive.org/stream/$File$/$File$.djvu
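The rewrite above can be sketched as a small Python function. It assumes that `$File$` stands for the item identifier in the details URL; the function name is invented for illustration, and the identifier in the usage line is inferred from the `_jp2.zip` file name mentioned earlier.

```python
from urllib.parse import urlparse


def djvu_url_from_details(details_url: str) -> str:
    """Rewrite .../details/<id> into .../stream/<id>/<id>.djvu."""
    parts = [p for p in urlparse(details_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "details":
        raise ValueError(f"not an archive.org details URL: {details_url!r}")
    identifier = parts[1]
    return f"https://archive.org/stream/{identifier}/{identifier}.djvu"


print(djvu_url_from_details("https://archive.org/details/sikhafghansinco00shahrich"))
```

The function only builds the URL; whether a DjVu file actually exists at that address still has to be checked by opening it.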

Uploading

There are three main ways to upload the file to Wikimedia Commons.

One: ia-upload tool

Shortcut: H:IA-Upload

The ia-upload tool is currently the easiest way to upload files from archive.org to Wikimedia Commons. You can check or contribute to its open source code.

Go to IA-Upload. It will request an "OAuth" (permission to have restricted access) from your account on Wikimedia Commons at each run.
Insert the archive.org identifier-access (the $ID portion of the URL as in https://archive.org/details/$ID) in the first field.
Insert the desired filename for the file to be uploaded on Commons in the second field, without the File: prefix or .djvu suffix, and proceed.
Review the automatic metadata, changing it as and if needed. It will be based on Commons' {{book}} template.
Proceed, and after a few seconds you will find the file properly uploaded to Commons and listed in your contributions.
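The two values entered in the form can be prepared as follows. This is a minimal sketch with invented helper names; it mirrors the rules above (the identifier is the `$ID` part of the details URL, and the Commons name carries neither the File: prefix nor the .djvu suffix) and does not talk to IA-Upload itself.

```python
from urllib.parse import urlparse


def ia_identifier(url_or_id: str) -> str:
    """Accept a bare identifier or a details URL and return the identifier."""
    text = url_or_id.strip()
    if "://" not in text:
        return text  # already a bare identifier
    parts = [p for p in urlparse(text).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "details":
        return parts[1]
    raise ValueError(f"cannot find an identifier in {url_or_id!r}")


def commons_basename(name: str) -> str:
    """Strip a leading 'File:' and a trailing '.djvu' from a file name."""
    text = name.strip()
    if text.lower().startswith("file:"):
        text = text[len("file:"):].strip()
    if text.lower().endswith(".djvu"):
        text = text[: -len(".djvu")]
    return text


print(ia_identifier("https://archive.org/details/sikhafghansinco00shahrich"))
print(commons_basename("File:Example book.djvu"))
```

"Example book" is a placeholder title; use whatever name the file should have on Commons.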

Two: Automatic transfer

Use the URL2Commons tool to automatically transfer the DjVu file from the Internet Archive to Wikimedia Commons.

Refer to Help:URL2Commons for information on using the tool.
Right click on the appropriate file in the Internet Archive file list and select "Copy Shortcut" or equivalent.
Paste this into the top panel of the URL2Commons tool.
Proceed as described in the URL2Commons help document.

Three: Manual download and upload

Download the file to your own computer, then upload it to Wikimedia Commons manually.

To download, right click on the appropriate file in the Internet Archive file list and select "Save Target As.." or equivalent.
This may take some time, depending on the size of the file.

If you use download manager software of any kind, follow the instructions for that software.
Once downloaded, go to Wikimedia Commons' Upload Wizard (guided upload process with helpful steps) or Upload page (quicker but requires more knowledge of Commons' policies and methods).

Others

There are other ways to upload files to Wikimedia Commons, such as the bulk uploader Commonist. These still require downloading the file(s) to your own computer before uploading to Commons.

Adding files

Files can be added to the Internet Archive by any registered user. The following information is presented for ease of use and reference for Wikisource users. However, Wikisource is not affiliated with the Internet Archive and any or all of these stages may be changed by the Archive at any time. It is strongly recommended that anyone attempting this should refer to the Internet Archive's own instructions, and follow those above the steps listed here.

These instructions are:

Internet Archive FAQ — Uploading Content

The following Internet Archive blog posts might be useful as well:

How to produce a DjVu file

You need to log in (don't use OpenID; it won't work^[१]).


Uploading

Click "Upload" at the top-right corner. The Flash upload (the standard "Share" button) won't work with Firefox (use Opera or Internet Explorer instead^[२]) or on Linux. You can use the standard JavaScript non-Flash method instead (note that there is a file size limit of 2 GB with Firefox, but not with Chromium). FTP upload is deprecated because it is slower and prone to crashing, but it is the only easy-to-learn option if you have to upload many files (which shouldn't be the case here).

OCR tricks

When the upload has been completed, the Internet Archive will start the "derive" work: OCR to create an XML document of detected text based on the uploaded PDF file, then conversion of that to a DjVu file with embedded text, creation of plain text-only dump file, among others.^[३]

Don't forget to set the correct language in the metadata before the derive starts (it runs automatically after upload if there is something to derive); otherwise the OCR language will be set to English and results will be poor for works in any other language. It's not possible to set multiple OCR languages, but you can upload the same book twice, once per language, to get two OCRs.^[४] The processing time depends on the size and complexity of your file, as well as on the current Internet Archive backlog of conversion tasks.^[५] You can check your progress in the queue, and see more detailed information about the jobs you submitted (you must be logged in).

The Internet Archive uses professional, proprietary, commercial ABBYY software^[६] which gives quite good images and OCR output in many languages and fonts, with aggressive compression^[७] that maintains a high quality in the final DjVu file.^[८] However, the Internet Archive sometimes produces over-compressed DjVu files of poor quality. If this happens, you can often download a PDF document and convert it manually. You can also reduce the resolution the derivation aims at, which is normally set automatically by some "guessing", via the fixed-ppi field: setting it to 300 (dpi) or lower reduces sizes, processing time and (sometimes) errors.

Image formats

Book scans split into several tiff, jpg, jp2 format images (other formats are not accepted) are converted ("derived") as well, if you put them in a properly created tar or zip archive.^[९] It's usually better to upload uncompressed scans or JPEGs; the jp2 files produced in the derivation process are compressed in a way you won't be able to emulate without a lot of effort.

Troubleshooting

If you have severe problems with your deriving process and need admin intervention (tasks shown in red in your tasks list), ask for help at info@archive.org; the staff are usually amazingly helpful. General requests for help should be placed in the forums, though: don't bother the staff for nothing!

Step by step

Preparing the file

If uploading a collection of page scans:

The page scans should each be in an image format. For example, JPEG format.
The page scans should be named in the correct alphabetical order. It may be a good idea to use a naming format such as "MyScan001.jpg", "MyScan002.jpg" etc. If so, remember to use leading zeroes, otherwise page 10 will come after page 1 but before page 2.
Make sure that the page scans are the only files in the folder you are using.
Create a zip file of the folder containing your page scans. The file name should be in the format "Myscan_images.zip", where "Myscan" is whatever you want to call the file. The "_images" suffix is important; your scan may not derive properly later if this is omitted.
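The preparation steps above can be sketched in Python with the standard `zipfile` module. The helper names, the set of accepted suffixes and the flat layout inside the archive (files at the archive root) are assumptions made for this illustration; check the Internet Archive's own documentation for the layout it expects.

```python
import zipfile
from pathlib import Path

# Assumed set of page-image suffixes; adjust to the scans you actually have.
IMAGE_SUFFIXES = {".jpg", ".tif", ".jp2"}


def padded_name(stem: str, page: int, width: int = 3, suffix: str = ".jpg") -> str:
    """Build a name such as 'MyScan002.jpg' with leading zeroes."""
    return f"{stem}{page:0{width}d}{suffix}"


def build_images_zip(folder: Path, name: str) -> Path:
    """Zip the page scans in `folder` into '<name>_images.zip' beside the folder."""
    pages = sorted(
        (p for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES),
        key=lambda p: p.name,
    )
    if not pages:
        raise ValueError(f"no page scans found in {folder}")
    target = folder.parent / f"{name}_images.zip"
    # Scans are already compressed, so store them without recompression.
    with zipfile.ZipFile(target, "w", zipfile.ZIP_STORED) as archive:
        for page in pages:
            archive.write(page, arcname=page.name)
    return target


# Without leading zeroes, page 10 sorts between pages 1 and 2.
print(sorted(["MyScan1.jpg", "MyScan10.jpg", "MyScan2.jpg"]))
print([padded_name("MyScan", n) for n in (1, 2, 10)])
```

Calling `build_images_zip(Path("scans"), "Myscan")` on a hypothetical folder named `scans` would write `Myscan_images.zip` next to that folder, keeping the `_images` suffix that the derive process relies on.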

Files such as PDFs can just be uploaded as they are.

Uploading

Note: the following instructions are for the classic uploader, superseded by the 2013 upload and create item wizard. Most of the instructions below should be unnecessary and ignorable if you use the new, simpler uploader. A blog post How to upload a scanned book to the Internet Archive is available with many screenshots; ignore all the advice on identifiers and metadata, it's just the author's personal opinion.

Log in to the Internet Archive.
Click the "Upload" button at the top right of the screen.
Select the file to upload
Fill in the information requested and choose an appropriate licence (this will be similar to the licences on Wikisource).
- Title (required)
- Description (required)
- Keywords (required)
- Author
- Creative Commons Licence or Public Domain Mark
Wait for the upload to complete.
Click the "Share my File(s)" button.
You will see the message "Please wait while your page is created..." then "Your Page is Ready!" followed by a link to the page.
Clicking the link will result in a "Your item is not yet public" message.
Pick a collection for your file. The options will include "movie, audio, text, etree" and "community video, community audio, community text". You will probably be using "text" and "community text". Select the appropriate collection and click the "Submit" button to the right.
- At this stage, you might be told to wait and come back later. This text is: "Your item is in the process of being derived, and you may not replace the metadata until the derive has finished (because any changes queued now would roll back those being made by the derive). Please try this page again after your item has finished deriving. [Item History]" In this case, simply follow those instructions: try again later.
In the Metadata Editor complete more information (including the information from earlier stages).
Click the Submit button. This will enter the file into the log, which will take some time to complete.

Deriving

Derivation can take up to a few days. This can be monitored either through the filename or the 'Contributions' page. The various formats of the work should automatically be derived from the files that were uploaded. If this has not occurred, the "View the book" in the left-hand sidebar will not be showing the various available formats (DjVu, EPUB, Kindle, Daisy etc). Derivation failure can have numerous reasons, many of which are internal to IA and have nothing to do with the uploaded file.

First, force the derivation from the file page:

Click "Edit item"
You will see two choices: "change the information" and "change the files". Click "change the information".
Click "Item Manager"
Click "Derive"

In case this fails:

Go to the 'Contributions' page.
Click on 'See your contribution tasks that are not yet completed.'
The screen will display a list similar to this image.
If the derivation process is still running, then wait.
If the process has stopped and is marked in red as 'waiting for Admin', email info@archive.org, advise them of the problem and request a restart of the derivation process. Be sure to include the link to the uploaded page.

Requested uploads

You can request a mass upload of public domain book scans from any external website to the Internet Archive by preparing:

1) A list of URLs of the books to download
2) A CSV table with title, creator, date, description, sponsor (digitising institution) etc.
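The two items can be generated together with Python's `csv` module. This is a sketch under assumptions: the column names follow the fields listed above, but the exact headers and file layout expected for a request are not specified here, and the book record is a placeholder.

```python
import csv
import io

# Column names taken from the list above; the exact headers are an assumption.
FIELDS = ["title", "creator", "date", "description", "sponsor"]


def build_request(books):
    """Return (url_list_text, csv_text) for a list of book dictionaries."""
    urls = "".join(book["url"] + "\n" for book in books)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(books)  # the extra "url" key is ignored in the table
    return urls, buffer.getvalue()


# Placeholder record; replace with the real books to be requested.
books = [
    {
        "url": "https://example.org/scans/book1.pdf",
        "title": "Example Title, Volume 1",
        "creator": "Jane Doe",
        "date": "1890",
        "description": "Hypothetical description",
        "sponsor": "Example Library",
    }
]

url_list, table = build_request(books)
print(url_list, end="")
print(table, end="")
```

Using `csv.DictWriter` rather than joining strings by hand keeps titles that contain commas or quotes correctly escaped.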

Help:Internet Archive/Requested uploads

Admins who are also Wikisourcerors

Admins have a checkbox to rerun or interrupt pending tasks

Some Internet Archive volunteers are given admin status on specific collections and can edit all items in those collections. No volunteers are known to have admin status on the general "Community texts" collection, but they can still help in the simplest cases, namely a derive.php red row waiting for admin.

The following users are available for requests if you don't feel like disturbing the staff:

Nemo
Alex brollo (admin for opallibriantichi collection)

Notes

[१] See forums: Authentication error; not a valid OpenID, Login problems when I click "Share".
[२] See forum.
[३] If your original PDF has no embedded text layer, the derive process will automatically create a second, text-rich PDF for you by applying the text previously detected by OCR.
Please note, however, that if your PDF comes from Google Books and has a first-page disclaimer notice, the derive process will detect the disclaimer page's hidden text layer, assume that the rest of the pages in the PDF also have embedded hidden text layers (when they never do), and skip the automatic creation of the second PDF file altogether. Keeping the disclaimer page but stripping it of all hidden text is the optimal approach, for reasons having to do with the complementary creation of a DjVu file at the same time; swapping it with a suitable null or blank page will do just as well, and the last resort is deleting the disclaimer page.
[४] See forum.
[५] Example: Vocabolario degli accademici della Crusca, 1691, took 5.1 days to derive.
[६] Version 9.0 since 2013.
[७] In the example, the size is 1/6 of the djvudigital output.
[८] Example: this 205 MB PDF document of a 1691 book from Gallica is converted by the pdf2djvu.sh script into a hardly readable 382.4 MB DjVu, by djvudigital into a slightly more readable 316.7 MB DjVu, and by the Internet Archive into a better-quality 51.3 MB DjVu.
[९] FAQ; documentation of the format to use. Remember: put extensions in lowercase everywhere, use tif with a single f, and put the ppi value of the images in the metadata. If your archive of images is not recognized as such, it may help to edit the metadata and set its format to "Single Page Processed TIFF ZIP" (even if it's a TAR) and so on. You should probably first try the _images.zip archive format.
