Earlier tonight I was working on a project for a customer who wants to translate the Hebrew Interlinear Bible into English, which obviously has been done many times before. This customer, however, has some translations he wants to make for himself, so I needed to find a Hebrew Interlinear Bible in text or PDF format. I was able to locate one in PDF format, but there was a separate PDF for each chapter of each book, which comes to something like 930 different PDFs. Using the wget command described in detail below, I was able to download all of the PDFs with a single command on my Windows 7 computer.

Install wget Using Cygwin:

To use wget on Windows you can install Cygwin by following the directions in this article, which also describes adding the Cygwin applications to your Windows 7 environment PATH. This means you can open a command prompt, type wget, and have the application run without having to be in the Cygwin bin directory.
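Once the Cygwin bin directory is on your PATH you can confirm the setup from any command prompt; the command below simply prints the installed wget version and exits.

wget --version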

Once Cygwin is installed you can use the command below to download every file of a given type linked from a specific web page.

Use wget To Download All Files Located On A Web Page With Windows 7:

wget -r -A.pdf http://www.example.com/page-with-pdfs.htm

The command above will download every single PDF linked from the URL http://www.example.com/page-with-pdfs.htm. The “-r” switch tells wget to download files recursively by following links, and the “-A.pdf” switch tells wget to keep only files ending in .pdf. You could switch pdf to mp3, for instance, to download all MP3 files linked from the specified URL. If you also want the PDFs on secondary pages, meaning pages linked from the URL you specify, you can control how many levels of links wget follows with the “-l” switch as shown in the example below.
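For example, assuming a hypothetical page that links to MP3 files, the same pattern applies; note that “-A” also accepts a comma-separated list such as “-A pdf,mp3” if you want to keep more than one file type.

wget -r -A.mp3 http://www.example.com/page-with-mp3s.htm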

Download Every PDF Including PDFs On Secondary Pages Using wget:

wget -r -l2 -A.pdf http://www.example.com/page-with-pdfs.htm

As you can see above, the “-l” switch sets wget’s recursion depth. Files linked directly from the primary URL are one level down, so “-l2” also catches PDFs on the pages that URL links to. You can obviously raise that number to follow links however many levels down you want; if you omit “-l” entirely, “-r” defaults to a depth of five levels.
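As a rough sketch, if the PDFs on a hypothetical site sat three clicks away from the page you start at, you would simply raise the depth to match:

wget -r -l3 -A.pdf http://www.example.com/index.htm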

Example Output From Downloading Multiple PDFs On A Single Page Using wget:

C:\downloads\pdfs\newtest>wget -r -A.pdf http://www.example.com/pdfs/pdf-list.htm
--2010-12-22 01:28:41--  http://www.example.com/pdfs/pdf-list.htm
Resolving www.example.com (www.example.com)... 77.232.81.100
Connecting to www.example.com (www.example.com)|77.232.81.100|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73026 (71K) [text/html]
Saving to: `www.example.com/pdfs/pdf-list.htm'

100%[============================================================================================================>] 73,026      73.2K/s   in 1.0s

2010-12-22 01:28:43 (73.2 KB/s) - `www.example.com/pdfs/pdf-list.htm' saved [73026/73026]

Loading robots.txt; please ignore errors.
--2010-12-22 01:28:43--  http://www.example.com/robots.txt
Reusing existing connection to www.example.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 36 [text/plain]
Saving to: `www.example.com/robots.txt'

100%[============================================================================================================>] 36          --.-K/s   in 0s

2010-12-22 01:28:43 (1.04 MB/s) - `www.example.com/robots.txt' saved [36/36]

Removing www.example.com/pdfs/pdf-list.htm since it should be rejected.

--2010-12-22 01:28:43--  http://www.example.com/pdfs/another-list.htm
Reusing existing connection to www.example.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 23293 (23K) [text/html]
Saving to: `www.example.com/pdfs/another-list.htm'

100%[============================================================================================================>] 23,293      --.-K/s   in 0.003s

2010-12-22 01:28:43 (8.72 MB/s) - `www.example.com/pdfs/another-list.htm' saved [23293/23293]

Removing www.example.com/pdfs/another-list.htm since it should be rejected.

--2010-12-22 01:28:44--  http://www.example.com/some-dir/DatabaseInfo/DatabaseInfo.html
Reusing existing connection to www.example.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 10843 (11K) [text/html]
Saving to: `www.example.com/some-dir/DatabaseInfo/DatabaseInfo.html'

100%[============================================================================================================>] 10,843      --.-K/s   in 0s

2010-12-22 01:28:44 (29.4 MB/s) - `www.example.com/some-dir/DatabaseInfo/DatabaseInfo.html' saved [10843/10843]

Removing www.example.com/some-dir/DatabaseInfo/DatabaseInfo.html since it should be rejected.

--2010-12-22 01:28:44--  http://www.example.com/pdfs/QDpdf/gen1.pdf
Reusing existing connection to www.example.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 48194 (47K) [application/pdf]
Saving to: `www.example.com/pdfs/QDpdf/gen1.pdf'

100%[============================================================================================================>] 48,194      --.-K/s   in 0.1s

2010-12-22 01:28:44 (328 KB/s) - `www.example.com/pdfs/QDpdf/gen1.pdf' saved [48194/48194]

--2010-12-22 01:28:44--  http://www.example.com/pdfs/QDpdf/gen2.pdf
Reusing existing connection to www.example.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 40742 (40K) [application/pdf]
Saving to: `www.example.com/pdfs/QDpdf/gen2.pdf'

100%[============================================================================================================>] 40,742      --.-K/s   in 0.1s

2010-12-22 01:28:45 (277 KB/s) - `www.example.com/pdfs/QDpdf/gen2.pdf' saved [40742/40742]

--2010-12-22 01:28:45--  http://www.example.com/pdfs/QDpdf/gen3.pdf
Reusing existing connection to www.example.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 42678 (42K) [application/pdf]
Saving to: `www.example.com/pdfs/QDpdf/gen3.pdf'

100%[============================================================================================================>] 42,678      --.-K/s   in 0.1s

2010-12-22 01:28:45 (290 KB/s) - `www.example.com/pdfs/QDpdf/gen3.pdf' saved [42678/42678]

--2010-12-22 01:28:45--  http://www.example.com/pdfs/QDpdf/gen4.pdf
Reusing existing connection to www.example.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 42070 (41K) [application/pdf]
Saving to: `www.example.com/pdfs/QDpdf/gen4.pdf'

100%[============================================================================================================>] 42,070      --.-K/s   in 0.1s

2010-12-22 01:28:45 (304 KB/s) - `www.example.com/pdfs/QDpdf/gen4.pdf' saved [42070/42070]

--2010-12-22 01:28:45--  http://www.example.com/pdfs/QDpdf/gen5.pdf
Reusing existing connection to www.example.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 36710 (36K) [application/pdf]
Saving to: `www.example.com/pdfs/QDpdf/gen5.pdf'

100%[============================================================================================================>] 36,710      --.-K/s   in 0.009s

2010-12-22 01:28:45 (3.87 MB/s) - `www.example.com/pdfs/QDpdf/gen5.pdf' saved [36710/36710]

Notice how the first couple of files are not PDFs: wget downloads them so it can scan them for links, but it does not keep them since we specified only to save PDF files. Whenever a file other than a PDF is downloaded you will receive a message similar to “Removing blahblahblah since it should be rejected.”. Once wget has followed each link it will stop, and the PDF files will be saved in a directory tree mirroring the site’s structure (www.example.com/pdfs/QDpdf/ in this example), rooted in the directory you issued the command from.
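If you would rather have every PDF dropped straight into the current directory instead of a mirrored directory tree, wget’s “-nd” (no directories) switch combines with the same command:

wget -r -nd -A.pdf http://www.example.com/page-with-pdfs.htm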

The above information for wget will also work on any distribution of Linux. If the wget command is not available you simply need to install the wget package, which on CentOS Linux, for instance, can be done via the Yum package manager by typing “yum install wget”.
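On Debian or Ubuntu based distributions the equivalent step uses the APT package manager:

sudo apt-get install wget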
