Wget: the non-interactive network downloader.
Copyright (C) 1996, 1997, 1998, 2000, 2001 Free Software Foundation, Inc.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License" and "GNU Free Documentation License", with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".
GNU Wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.
This chapter is a partial overview of Wget's features.
By default, Wget is very simple to invoke. The basic syntax is:
wget [option]... [URL]...
Wget will simply download all the URLs specified on the command line. URL is a Uniform Resource Locator, as defined below.
However, you may wish to change some of the default parameters of Wget. You can do so in two ways: permanently, by adding the appropriate command to `.wgetrc' (see section Startup File), or for a single run, by specifying it on the command line.
URL is an acronym for Uniform Resource Locator. A uniform resource locator is a compact string representation for a resource available via the Internet. Wget recognizes the URL syntax as per RFC1738. This is the most widely used form (square brackets denote optional parts):
http://host[:port]/directory/file
ftp://host[:port]/directory/file
You can also encode your username and password within a URL:
ftp://user:password@host/path
http://user:password@host/path
Either user or password, or both, may be left out. If you leave out either the HTTP username or password, no authentication will be sent. If you leave out the FTP username, `anonymous' will be used. If you leave out the FTP password, your email address will be supplied as a default password.(1)
You can encode unsafe characters in a URL as `%xy', xy
being the hexadecimal representation of the character's ASCII
value. Some common unsafe characters include `%' (quoted as
`%25'), `:' (quoted as `%3A'), and `@' (quoted as
`%40'). Refer to RFC1738 for a comprehensive list of unsafe
characters.
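The quoting rule above can be tried out with a short sketch. It relies on Python's standard urllib.parse module rather than on anything from Wget itself, and the helper name encode_unsafe is made up for this illustration.

```python
from urllib.parse import quote, unquote

def encode_unsafe(text):
    """Percent-encode every character except letters, digits and _.-~"""
    # Hypothetical helper, not part of Wget. safe="" means that even "/"
    # is encoded, so the result can be used inside a single URL component.
    return quote(text, safe="")

# The three characters named in the text above.
for ch in "%:@":
    print(ch, "->", encode_unsafe(ch))

# Decoding reverses the quoting.
print(unquote("%40"))
```

This is mostly useful for usernames or passwords that themselves contain `@' or `:', which would otherwise be mistaken for URL delimiters.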
Wget also supports the type feature for FTP URLs. By
default, FTP documents are retrieved in the binary mode (type
`i'), which means that they are downloaded unchanged. Another
useful mode is the `a' (ASCII) mode, which converts the line
delimiters between the different operating systems, and is thus useful
for text files. Here is an example:
ftp://host/directory/file;type=a
Two alternative variants of URL specification are also supported, because of historical (hysterical?) reasons and their widespread use.
FTP-only syntax (supported by NcFTP):
host:/dir/file
HTTP-only syntax (introduced by Netscape):
host[:port]/dir/file
These two alternative forms are deprecated, and may cease being supported in the future.
If you do not understand the difference between these notations, or do
not know which one to use, just use the plain ordinary format you use
with your favorite browser, like Lynx or Netscape.
Since Wget uses GNU getopt to process its arguments, every option has a short form and a long form. Long options are easier to remember, but take longer to type. You may freely mix different option styles, or specify options after the command-line arguments. Thus you may write:
wget -r --tries=10 http://fly.srk.fer.hr/ -o log
The space between an option that accepts an argument and the argument itself may be omitted. Instead of `-o log' you can write `-olog'.
You may put several options that do not require arguments together, like:
wget -drc URL
This is a complete equivalent of:
wget -d -r -c URL
Since the options can be specified after the arguments, you may terminate them with `--'. So the following will try to download URL `-x', reporting failure to `log':
wget -o log -- -x
The options that accept comma-separated lists all respect the convention
that specifying an empty list clears its value. This can be useful to
clear the `.wgetrc' settings. For instance, if your `.wgetrc'
sets exclude_directories to `/cgi-bin', the following
example will first reset it, and then set it to exclude `/~nobody'
and `/~somebody'. You can also clear the lists in `.wgetrc'
(see section Wgetrc Syntax).
wget -X '' -X /~nobody,/~somebody
<base
href="url"> to the documents or by specifying
`--base=url' on the command line.
<base
href="url"> to HTML, or using the `--base' command-line
option.
bind() to ADDRESS on
the local machine. ADDRESS may be specified as a hostname or IP
address. This option can be useful if your machine is bound to multiple
IPs.
no-clobber" is actually a misnomer in this mode--it's not
clobbering that's prevented (as the numeric suffixes were already
preventing clobbering), but rather the multiple version saving that's
prevented.
When running Wget with `-r', but without `-N' or `-nc',
re-downloading a file will result in the new copy simply overwriting the
old. Adding `-nc' will prevent this behavior, instead causing the
original version to be preserved and any newer copies on the server to
be ignored.
When running Wget with `-N', with or without `-r', the
decision as to whether or not to download a newer copy of a file depends
on the local and remote timestamp and size of the file
(see section Time-Stamping). `-nc' may not be specified at the same
time as `-N'.
Note that when `-nc' is specified, files with the suffixes
`.html' or (yuck) `.htm' will be loaded from the local disk
and parsed as if they had been retrieved from the Web.
wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z
If there is a file named `ls-lR.Z' in the current directory, Wget will assume that it is the first portion of the remote file, and will ask the server to continue the retrieval from an offset equal to the length of the local file.
Note that you don't need to specify this option if you just want the current invocation of Wget to retry downloading a file should the connection be lost midway through. This is the default behavior. `-c' only affects resumption of downloads started prior to this invocation of Wget, and whose local files are still sitting around.
Without `-c', the previous example would just download the remote file to `ls-lR.Z.1', leaving the truncated `ls-lR.Z' file alone.
Beginning with Wget 1.7, if you use `-c' on a non-empty file, and it turns out that the server does not support continued downloading, Wget will refuse to start the download from scratch, which would effectively ruin existing contents. If you really want the download to start from scratch, remove the file.
Also beginning with Wget 1.7, if you use `-c' on a file which is of equal size as the one on the server, Wget will refuse to download the file and print an explanatory message. The same happens when the file is smaller on the server than locally (presumably because it was changed on the server since your last download attempt)---because "continuing" is not meaningful, no download occurs.
On the other side of the coin, while using `-c', any file that's bigger on the server than locally will be considered an incomplete download and only (length(remote) - length(local)) bytes will be downloaded and tacked onto the end of the local file. This behavior can be desirable in certain cases--for instance, you can use `wget -c' to download just the new portion that's been appended to a data collection or log file.
However, if the file is bigger on the server because it's been
changed, as opposed to just appended to, you'll end up
with a garbled file. Wget has no way of verifying that the local file
is really a valid prefix of the remote file. You need to be especially
careful of this when using `-c' in conjunction with `-r',
since every file will be considered as an "incomplete download" candidate.
Another instance where you'll get a garbled file if you try to use
`-c' is if you have a lame HTTP proxy that inserts a
"transfer interrupted" string into the local file. In the future a
"rollback" option may be added to deal with this case.
Note that `-c' only works with FTP servers and with HTTP
servers that support the Range header.
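The size comparison that `-c' performs can be summarized in a minimal sketch. This is not Wget's code: the function names are hypothetical, and the only outside fact used is the standard open-ended form of the HTTP Range header.

```python
def plan_continue(local_size, remote_size):
    """Return (offset, byte_count) for a resumed download, or None.

    Hypothetical helper: None stands for the cases where "continuing"
    is not meaningful (local file is as large as, or larger than, the
    remote one).
    """
    if local_size >= remote_size:
        return None
    return local_size, remote_size - local_size

def range_header(offset):
    """Build the HTTP request header asking for everything from offset on."""
    return f"Range: bytes={offset}-"

plan = plan_continue(100, 250)
if plan is not None:
    offset, count = plan
    print(range_header(offset), "expecting", count, "bytes")
```

Note that nothing here checks whether the local bytes really are a prefix of the remote file, which is exactly the limitation described above.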
default style each dot
represents 1K, there are ten dots in a cluster and 50 dots in a line.
The binary style has a more "computer"-like orientation--8K
dots, 16-dots clusters and 48 dots per line (which makes for 384K
lines). The mega style is suitable for downloading very large
files--each dot represents 64K retrieved, there are eight dots in a
cluster, and 48 dots on each line (so each line contains 3M).
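The line sizes quoted for each dot style follow directly from the dot size and the number of dots per line. The small table below is a sketch that restates the figures from the text; the names STYLES and line_size_kb are made up for the illustration.

```python
# style: (kilobytes per dot, dots per cluster, dots per line)
STYLES = {
    "default": (1, 10, 50),
    "binary": (8, 16, 48),
    "mega": (64, 8, 48),
}

def line_size_kb(style):
    """Kilobytes represented by one full line of dots."""
    dot_kb, _dots_per_cluster, dots_per_line = STYLES[style]
    return dot_kb * dots_per_line

for name in STYLES:
    print(name, line_size_kb(name), "K per line")
```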
Specifying `--progress=bar' will draw a nice ASCII progress bar
graphics (a.k.a "thermometer" display) to indicate retrieval. If the
output is not a TTY, this option will be ignored, and Wget will revert
to the dot indicator. If you want to force the bar indicator, use
`--progress=bar:force'.
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the functionality of real WWW spiders.
m suffix, in hours using h
suffix, or in days using d suffix.
Specifying a large value for this option is useful if the network or the
destination host is down, so that Wget can wait long enough to
reasonably expect the network error to be fixed before the retry.
No options        -> ftp.xemacs.org/pub/xemacs/
-nH               -> pub/xemacs/
-nH --cut-dirs=1  -> xemacs/
-nH --cut-dirs=2  -> .
--cut-dirs=1      -> ftp.xemacs.org/xemacs/
...
If you just want to get rid of the directory structure, this option is similar to a combination of `-nd' and `-P'. However, unlike `-nd', `--cut-dirs' does not discard subdirectories--for instance, with `-nH --cut-dirs=1', a `beta/' subdirectory will be placed in `xemacs/beta', as one would expect.
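The mapping in the table above can be reproduced with a few lines of code. The following is only a sketch of the rule as the table presents it; local_dir and its parameters are hypothetical names, and `-P' prefixes are not modeled.

```python
def local_dir(host, remote_dir, no_host=False, cut_dirs=0):
    """Compute the local directory for a remote directory.

    no_host corresponds to `-nH', cut_dirs to `--cut-dirs=number'.
    """
    # Split the remote path into components, ignoring empty ones.
    parts = [p for p in remote_dir.split("/") if p]
    # Drop the leading components that --cut-dirs asks to ignore.
    parts = parts[cut_dirs:]
    # Unless -nH is given, the host name becomes the top directory.
    if not no_host:
        parts.insert(0, host)
    return "/".join(parts) + "/" if parts else "."

print(local_dir("ftp.xemacs.org", "pub/xemacs/"))
print(local_dir("ftp.xemacs.org", "pub/xemacs/", no_host=True, cut_dirs=1))
```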
basic (insecure) or the
digest authentication scheme.
Another way to specify username and password is in the URL itself
(see section URL Format). For more information about security issues with
Wget, See section Security Considerations.
Set-Cookie header, and the client responds with the
same cookie upon further requests. Since cookies allow the server
owners to keep track of visitors and for sites to exchange this
information, some consider them a breach of privacy. The default is to
use cookies; however, storing cookies is not on by default.
wget --cookies=off --header "Cookie: name=value"
Content-Length headers, which makes Wget
go wild, as it thinks not all the document was retrieved. You can spot
this syndrome if Wget retries getting the same document again and again,
each time claiming that the (otherwise normal) connection has closed on
the very same byte.
With this option, Wget will ignore the Content-Length header--as
if it never existed.
wget --header='Accept-Charset: iso-8859-2' \
--header='Accept-Language: hr' \
http://fly.srk.fer.hr/
Specification of an empty string as the header value will clear all
previous user-defined headers.
basic authentication scheme.
User-Agent header field. This enables distinguishing the
WWW software, usually for statistical purposes or for tracing of
protocol violations. Wget normally identifies as
`Wget/version', version being the current version
number of Wget.
However, some sites have been known to impose the policy of tailoring
the output according to the User-Agent-supplied information.
While conceptually this is not such a bad idea, it has been abused by
servers denying information to clients other than Mozilla or
Microsoft Internet Explorer. This option allows you to change
the User-Agent line issued by Wget. Use of this option is
discouraged, unless you really know what you are doing.
root to run Wget in his or her directory. Depending on
the options used, either Wget will refuse to write to `.listing',
making the globbing/recursion/time-stamping operation fail, or the
symbolic link will be deleted and replaced with the actual
`.listing' file, or the listing will be written to a
`.listing.number' file.
Even though this situation isn't a problem, root should
never run Wget in a non-trusted user's directory. A user could do
something as simple as linking `index.html' to `/etc/passwd'
and asking root to run Wget with `-N' or `-r' so the file
will be overwritten.
wget ftp://gnjilux.srk.fer.hr/*.msg
By default, globbing will be turned on if the URL contains a globbing character. This option may be used to turn globbing on or off permanently.
You may have to quote the URL to protect it from being expanded by your shell. Globbing makes Wget look for a directory listing, which is system-specific. This is why it currently works only with Unix FTP servers (and the ones emulating Unix ls output).
wget -r -nd --delete-after http://whatever.com/~popular/page/
The `-r' option is to retrieve recursively, and `-nd' to not create directories.
Note that `--delete-after' deletes files on the local machine. It does not issue the `DELE' command to remote FTP sites, for instance. Also note that when `--delete-after' is specified, `--convert-links' is ignored, so `.orig' files are simply not created in the first place.
<IMG> tag
referencing `1.gif' and an <A> tag pointing to external
document `2.html'. Say that `2.html' is similar but that its
image is `2.gif' and it links to `3.html'. Say this
continues up to some arbitrarily high number.
If one executes the command:
wget -r -l 2 http://site/1.html
then `1.html', `1.gif', `2.html', `2.gif', and `3.html' will be downloaded. As you can see, `3.html' is without its requisite `3.gif' because Wget is simply counting the number of hops (up to 2) away from `1.html' in order to determine where to stop the recursion. However, with this command:
wget -r -l 2 -p http://site/1.html
all the above files and `3.html''s requisite `3.gif' will be downloaded. Similarly,
wget -r -l 1 -p http://site/1.html
will cause `1.html', `1.gif', `2.html', and `2.gif' to be downloaded. One might think that:
wget -r -l 0 -p http://site/1.html
would download just `1.html' and `1.gif', but unfortunately this is not the case, because `-l 0' is equivalent to `-l inf'---that is, infinite recursion. To download a single HTML page (or a handful of them, all specified on the commandline or in a `-i' URL input file) and its (or their) requisites, simply leave off `-r' and `-l':
wget -p http://site/1.html
Note that Wget will behave as if `-r' had been specified, but only that single page and its requisites will be downloaded. Links from that page to external documents will not be followed. Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to `-p':
wget -E -H -k -K -p http://site/document
To finish off this topic, it's worth knowing that Wget's idea of an external document link is any URL specified in an <A> tag, an <AREA> tag, or a <LINK> tag other than <LINK REL="stylesheet">.
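The hop counting in the `1.html', `2.html', ... example can be simulated without touching the network. The sketch below is a toy model under stated assumptions, not Wget's algorithm: page n is assumed to contain one inline image `n.gif' and one link to the next page, and the names crawl and last_page are hypothetical.

```python
from collections import deque

def crawl(max_depth, page_requisites=False, last_page=10):
    """Return the files fetched for `-r -l max_depth', optionally with `-p'."""
    fetched = []
    queue = deque([(1, 0)])  # (page number, hops away from 1.html)
    while queue:
        n, depth = queue.popleft()
        fetched.append(f"{n}.html")
        # The inline image is one hop further away than its page, so it is
        # only fetched within the depth limit, unless -p asks for requisites.
        if depth < max_depth or page_requisites:
            fetched.append(f"{n}.gif")
        # The link to the next page is followed only within the depth limit.
        if depth < max_depth and n < last_page:
            queue.append((n + 1, depth + 1))
    return fetched

print(crawl(2))
print(crawl(2, page_requisites=True))
print(crawl(1, page_requisites=True))
```

The three calls correspond to the first three commands discussed above; `-l 0' is not modeled, since it means infinite recursion rather than depth zero.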
wget -Ga,area -H -k -K -r http://site/document
However, the author of this option came across a page with tags like
<LINK REL="home" HREF="/"> and came to the realization that
`-G' was not enough. One can't just tell Wget to ignore
<LINK>, because then stylesheets will not be downloaded. Now the
best bet for downloading a single page and its requisites is the
dedicated `--page-requisites' option.
GNU Wget is capable of traversing parts of the Web (or a single HTTP or FTP server), following links and directory structure. We refer to this as recursive retrieval, or recursion.
With HTTP URLs, Wget retrieves and parses the HTML from
the given URL, retrieving the files the HTML document
refers to through markup like href or
src. If the freshly downloaded file is also of type
text/html, it will be parsed and followed further.
Recursive retrieval of HTTP and HTML content is breadth-first. This means that Wget first downloads the requested HTML document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.
The maximum depth to which the retrieval may descend is specified with the `-l' option. The default maximum depth is five layers.
When retrieving an FTP URL recursively, Wget will retrieve all
the data from the given directory tree (including the subdirectories up
to the specified depth) on the remote server, creating its mirror image
locally. FTP retrieval is also limited by the depth
parameter. Unlike HTTP recursion, FTP recursion is performed
depth-first.
By default, Wget will create a local directory tree, corresponding to the one found on the remote server.
Recursive retrieving can find a number of applications, the most important of which is mirroring. It is also useful for WWW presentations, and any other opportunities where slow network connections should be bypassed by storing the files locally.
You should be warned that recursive downloads can overload the remote servers. Because of that, many administrators frown upon them and may ban access from your site if they detect very fast downloads of big amounts of content. When downloading from Internet servers, consider using the `-w' option to introduce a delay between accesses to the server. The download will take a while longer, but the server administrator will not be alarmed by your rudeness.
Of course, recursive download may cause problems on your own machine as well. If left to run unchecked, it can easily fill up the disk. If downloading from a local network, it can also take up bandwidth on the system, as well as consume memory and CPU.
Try to specify the criteria that match the kind of download you are trying to achieve. If you want to download only one page, use `--page-requisites' without any additional recursion. If you want to download things under one directory, use `-np' to avoid downloading things from other directories. If you want to download all the files from one directory, use `-l 1' to make sure the recursion depth never exceeds one. See section Following Links, for more information about this.
Recursive retrieval should be used with care. Don't say you were not warned.
When retrieving recursively, one does not wish to retrieve loads of unnecessary data. Most of the time the users bear in mind exactly what they want to download, and want Wget to follow only specific links.
For example, if you wish to download the music archive from `fly.srk.fer.hr', you will not want to download all the home pages that happen to be referenced by an obscure part of the archive.
Wget possesses several mechanisms that allow you to fine-tune which links it will follow.
Wget's recursive retrieval normally refuses to visit hosts different than the one you specified on the command line. This is a reasonable default; without it, every retrieval would have the potential to turn your Wget into a small version of Google.
However, visiting different hosts, or host spanning, is sometimes a useful option. Maybe the images are served from a different server. Maybe you're mirroring a site that consists of pages interlinked between three servers. Maybe the server has two equivalent names, and the HTML pages refer to both interchangeably.
wget -rH -Dserver.com http://www.server.com/
You can specify more than one address by separating them with a comma, e.g. `-Ddomain1.com,domain2.com'.
wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu \
http://www.foo.edu/
When downloading material from the web, you will often want to restrict the retrieval to only certain file types. For example, if you are interested in downloading GIFs, you will not be overjoyed to get loads of PostScript documents, and vice versa.
Wget offers two options to deal with this problem. Each option description lists a short name, a long name, and the equivalent command in `.wgetrc'.
The `-A' and `-R' options may be combined to achieve even better fine-tuning of which files to retrieve. E.g. `wget -A "*zelazny*" -R .ps' will download all the files having `zelazny' as a part of their name, but not the PostScript files.
Note that these two options do not affect the downloading of HTML files; Wget must load all the HTMLs to know where to go at all--recursive retrieval would make no sense otherwise.
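How `-A' and `-R' combine can be sketched as follows. This is an approximation, not Wget's matching code: it assumes that a list entry containing a wildcard character is treated as a shell-style pattern and any other entry as a file name suffix, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase

def matches(name, rule):
    """Match a file name against one accept/reject list entry."""
    # Assumption: entries with wildcards are patterns, others are suffixes.
    if any(ch in rule for ch in "*?[]"):
        return fnmatchcase(name, rule)
    return name.endswith(rule)

def wanted(name, accept=(), reject=()):
    """Decide whether a file should be retrieved."""
    # HTML files are always loaded, since they are needed to find links.
    if name.endswith((".html", ".htm")):
        return True
    # With an accept list, the name must match at least one entry.
    if accept and not any(matches(name, rule) for rule in accept):
        return False
    # A match on the reject list always excludes the file.
    return not any(matches(name, rule) for rule in reject)

# Mirrors: wget -A "*zelazny*" -R .ps
for name in ["zelazny-amber.txt", "zelazny-amber.ps", "tolkien.txt", "index.html"]:
    print(name, wanted(name, accept=["*zelazny*"], reject=[".ps"]))
```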
Regardless of other link-following facilities, it is often useful to place the restriction of what files to retrieve based on the directories those files are placed in. There can be many reasons for this--the home pages may be organized in a reasonable directory structure; or some directories may contain useless information, e.g. `/cgi-bin' or `/dev' directories.
Wget offers three different options to deal with this requirement. Each option description lists a short name, a long name, and the equivalent command in `.wgetrc'.
wget -I /people,/cgi-bin http://host/people/bozo/
wget -r --no-parent http://somehost/~luzer/my-archive/
You may rest assured that none of the references to `/~his-girls-homepage/' or `/~luzer/all-my-mpegs/' will be followed. Only the archive you are interested in will be downloaded. Essentially, `--no-parent' is similar to `-I/~luzer/my-archive', only it handles redirections in a more intelligent fashion.
When `-L' is turned on, only the relative links are ever followed. Relative links are here defined as those that do not refer to the web server root. For example, these links are relative:
<a href="foo.gif">
<a href="foo/bar.gif">
<a href="../foo/bar.gif">
These links are not relative:
<a href="/foo.gif">
<a href="/foo/bar.gif">
<a href="http://www.server.com/foo/bar.gif">
Using this option guarantees that recursive retrieval will not span hosts, even without `-H'. In simple cases it also allows downloads to "just work" without having to convert links.
This option is probably not very useful and might be removed in a future release.
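The definition of a relative link used by `-L' is easy to express in code. The sketch below uses Python's standard urllib.parse module; is_relative is a hypothetical name and the check is an illustration of the definition above rather than Wget's own test.

```python
from urllib.parse import urlsplit

def is_relative(href):
    """True if href has no scheme, no host, and does not start at the root."""
    parts = urlsplit(href)
    return not parts.scheme and not parts.netloc and not href.startswith("/")

links = [
    "foo.gif",
    "foo/bar.gif",
    "../foo/bar.gif",
    "/foo.gif",
    "/foo/bar.gif",
    "http://www.server.com/foo/bar.gif",
]
for href in links:
    print(href, is_relative(href))
```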
The rules for FTP are somewhat specific, as it is necessary for them to be. FTP links in HTML documents are often included for purposes of reference, and it is often inconvenient to download them by default.
To have FTP links followed from HTML documents, you need to specify the `--follow-ftp' option. Having done that, FTP links will span hosts regardless of the `-H' setting. This is logical, as FTP links rarely point to the same host where the HTTP server resides. For similar reasons, the `-L' option has no effect on such downloads. On the other hand, domain acceptance (`-D') and suffix rules (`-A' and `-R') apply normally.
Also note that followed links to FTP directories will not be retrieved recursively further.
One of the most important aspects of mirroring information from the Internet is updating your archives.
Downloading the whole archive again and again, just to replace a few changed files is expensive, both in terms of wasted bandwidth and money, and the time to do the update. This is why all the mirroring tools offer the option of incremental updating.
Such an updating mechanism means that the remote server is scanned in search of new files. Only those new files will be downloaded in the place of the old ones.
A file is considered new if one of these two conditions is met:
To implement this, the program needs to be aware of the time of last modification of both local and remote files. We call this information the time-stamp of a file.
Time-stamping in GNU Wget is turned on using the `--timestamping'
(`-N') option, or through the timestamping = on directive in
`.wgetrc'. With this option, for each file it intends to download,
Wget will check whether a local file of the same name exists. If it
does, and the remote file is older, Wget will not download it.
If the local file does not exist, or the sizes of the files do not match, Wget will download the remote file no matter what the time-stamps say.
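The decision described in the last two paragraphs can be written down as a small function. This is a sketch of the rule as stated here, not Wget's implementation; should_download is a hypothetical name, and files are represented as (modification time, size) pairs.

```python
def should_download(local, remote):
    """Decide whether `-N' would fetch the remote file.

    local is None if no local file exists; otherwise both arguments are
    (mtime, size) tuples, with mtime as seconds since the epoch.
    """
    if local is None:
        return True  # nothing stored locally yet
    local_mtime, local_size = local
    remote_mtime, remote_size = remote
    if local_size != remote_size:
        return True  # sizes differ: download regardless of time-stamps
    return remote_mtime > local_mtime  # fetch only if the remote is newer

print(should_download(None, (1000, 42)))
print(should_download((1000, 42), (900, 42)))
print(should_download((1000, 42), (2000, 42)))
```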
The usage of time-stamping is simple. Say you would like to download a file so that it keeps its date of modification.
wget -S http://www.gnu.ai.mit.edu/
A simple ls -l shows that the time stamp on the local file equals
the state of the Last-Modified header, as returned by the server.
As you can see, the time-stamping info is preserved locally, even
without `-N' (at least for HTTP).
Several days later, you would like Wget to check if the remote file has changed, and download it if it has.
wget -N http://www.gnu.ai.mit.edu/
Wget will ask the server for the last-modified date. If the local file has the same timestamp as the server, or a newer one, the remote file will not be re-fetched. However, if the remote file is more recent, Wget will proceed to fetch it.
The same goes for FTP. For example:
wget "ftp://ftp.ifi.uio.no/pub/emacs/gnus/*"
(The quotes around that URL are to prevent the shell from trying to interpret the `*'.)
After download, a local directory listing will show that the timestamps match those on the remote server. Reissuing the command with `-N' will make Wget re-fetch only the files that have been modified since the last download.
If you wished to mirror the GNU archive every week, you would use a command like the following, weekly:
wget --timestamping -r ftp://ftp.gnu.org/pub/gnu/
Note that time-stamping will only work for files for which the server
gives a timestamp. For HTTP, this depends on getting a
Last-Modified header. For FTP, this depends on getting a
directory listing with dates in a format that Wget can parse
(see section FTP Time-Stamping Internals).
Time-stamping in HTTP is implemented by checking the
Last-Modified header. If you wish to retrieve the file
`foo.html' through HTTP, Wget will check whether
`foo.html' exists locally. If it doesn't, `foo.html' will be
retrieved unconditionally.
If the file does exist locally, Wget will first check its local
time-stamp (similar to the way ls -l checks it), and then send a
HEAD request to the remote server, demanding the information on
the remote file.
The Last-Modified header is examined to find which file was
modified more recently (which makes it "newer"). If the remote file
is newer, it will be downloaded; if it is older, Wget will give
up.(2)
When `--backup-converted' (`-K') is specified in conjunction with `-N', server file `X' is compared to local file `X.orig', if extant, rather than being compared to local file `X', which will always differ if it's been converted by `--convert-links' (`-k').
Arguably, HTTP time-stamping should be implemented using the
If-Modified-Since request.
In theory, FTP time-stamping works much the same as HTTP, only FTP has no headers--time-stamps must be ferreted out of directory listings.
If an FTP download is recursive or uses globbing, Wget will use the
FTP LIST command to get a file listing for the directory
containing the desired file(s). It will try to analyze the listing,
treating it like Unix ls -l output, extracting the time-stamps.
The rest is exactly the same as for HTTP. Note that when
retrieving individual files from an FTP server without using
globbing or recursion, listing files will not be downloaded (and thus
files will not be time-stamped) unless `-N' is specified.
The assumption that every directory listing is a Unix-style listing may sound extremely constraining, but in practice it is not, as many non-Unix FTP servers use the Unixoid listing format because most (all?) of the clients understand it. Bear in mind that RFC959 defines no standard way to get a file list, let alone the time-stamps. We can only hope that a future standard will define this.
Another non-standard solution is to use the MDTM command
that is supported by some FTP servers (including the popular
wu-ftpd), which returns the exact time of the specified file.
Wget may support this command in the future.
Once you know how to change default settings of Wget through command line arguments, you may wish to make some of those settings permanent. You can do that in a convenient way by creating the Wget startup file---`.wgetrc'.
Although `.wgetrc' is the "main" initialization file, it is convenient to have a special facility for storing passwords. Thus Wget reads and interprets the contents of `$HOME/.netrc', if it finds it. You can find the `.netrc' format in your system manuals.
Wget reads `.wgetrc' upon startup, recognizing a limited set of commands.
When initializing, Wget will look for a global startup file, `/usr/local/etc/wgetrc' by default (or some prefix other than `/usr/local', if Wget was not installed there) and read commands from there, if it exists.
Then it will look for the user's file. If the environment variable
WGETRC is set, Wget will try to load that file. If that fails, no
further attempts will be made.
If WGETRC is not set, Wget will try to load `$HOME/.wgetrc'.
The fact that user's settings are loaded after the system-wide ones means that in case of collision user's wgetrc overrides the system-wide wgetrc (in `/usr/local/etc/wgetrc' by default). Fascist admins, away!
The syntax of a wgetrc command is simple:
variable = value
The variable will also be called command. Valid values are different for different commands.
The commands are case-insensitive and underscore-insensitive. Thus `DIr__PrefiX' is the same as `dirprefix'. Empty lines, lines beginning with `#' and lines containing white-space only are discarded.
Commands that expect a comma-separated list will clear the list on an empty command. So, if you wish to reset the rejection list specified in global `wgetrc', you can do it with:
reject =
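The syntax rules above are simple enough to capture in a few lines. The following is a sketch only, with hypothetical function names; it additionally assumes that leading white-space before `#' is ignored, which the text does not state.

```python
def normalize(name):
    """Command names are case-insensitive and underscore-insensitive."""
    return name.replace("_", "").lower()

def parse_line(line):
    """Return (command, value), or None for a line that is discarded."""
    stripped = line.strip()
    # Empty lines, white-space-only lines and comments are discarded.
    if not stripped or stripped.startswith("#"):
        return None
    name, _, value = stripped.partition("=")
    return normalize(name.strip()), value.strip()

def parse_list(value):
    """An empty value clears a comma-separated list."""
    return value.split(",") if value else []

print(parse_line("DIr__PrefiX = /tmp/mirror"))
print(parse_line("reject ="))
print(parse_list(".ps,.gif"))
```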
The complete set of commands is listed below. Legal values are listed after the `='. Simple Boolean values can be set or unset using `on' and `off' or `1' and `0'. A fancier kind of Boolean allowed in some cases is the lockable Boolean, which may be set to `on', `off', `always', or `never'. If an option is set to `always' or `never', that value will be locked in for the duration of the Wget invocation--commandline options will not override.
Some commands take pseudo-arbitrary values. address values can be hostnames or dotted-quad IP addresses. n can be any positive integer, or `inf' for infinity, where appropriate. string values can be any non-empty string.
Most of these commands have commandline equivalents (see section Invoking), though some of the more obscure or rarely used ones do not.
Content-Length header; the same as
`--ignore-length'.
Content-Length.
This is the sample initialization file, as given in the distribution. It is divided into two sections--one for global usage (suitable for the global startup file), and one for local usage (suitable for `$HOME/.wgetrc'). Be careful about the things you change.
Note that almost all the lines are commented out. For a command to have any effect, you must remove the `#' character at the beginning of its line.
###
### Sample Wget initialization file .wgetrc
###

## You can use this file to change the default behaviour of wget or to
## avoid having to type many many command-line options. This file does
## not contain a comprehensive list of commands -- look at the manual
## to find out what you can put into this file.
##
## Wget initialization file can reside in /usr/local/etc/wgetrc
## (global, for all users) or $HOME/.wgetrc (for a single user).
##
## To use the settings in this file, you will have to uncomment them,
## as well as change them, in most cases, as the values on the
## commented-out lines are the default values (e.g. "off").

##
## Global settings (useful for setting up in /usr/local/etc/wgetrc).
## Think well before you change them, since they may reduce wget's
## functionality, and make it behave contrary to the documentation:
##

# You can set retrieve quota for beginners by specifying a value
# optionally followed by 'K' (kilobytes) or 'M' (megabytes). The
# default quota is unlimited.
#quota = inf

# You can lower (or raise) the default number of retries when
# downloading a file (default is 20).
#tries = 20

# Lowering the maximum depth of the recursive retrieval is handy to
# prevent newbies from going too "deep" when they unwittingly start
# the recursive retrieval. The default is 5.
#reclevel = 5

# Many sites are behind firewalls that do not allow initiation of
# connections from the outside. On these sites you have to use the
# `passive' feature of FTP. If you are behind such a firewall, you
# can turn this on to make Wget use passive FTP by default.
#passive_ftp = off

# The "wait" command below makes Wget wait between every connection.
# If, instead, you want Wget to wait only between retries of failed
# downloads, set waitretry to maximum number of seconds to wait (Wget
# will use "linear backoff", waiting 1 second after the first failure
# on a file, 2 seconds after the second failure, etc. up to this max).
waitretry = 10

##
## Local settings (for a user to set in his $HOME/.wgetrc). It is
## *highly* undesirable to put these settings in the global file, since
## they are potentially dangerous to "normal" users.
##
## Even when setting up your own ~/.wgetrc, you should know what you
## are doing before doing so.
##

# Set this to on to use timestamping by default:
#timestamping = off

# It is a good idea to make Wget send your email address in a `From:'
# header with your request (so that server administrators can contact
# you in case of errors). Wget does *not* send `From:' by default.
#header = From: Your Name <username@site.domain>

# You can set up other headers, like Accept-Language. Accept-Language
# is *not* sent by default.
#header = Accept-Language: en

# You can set the default proxies for Wget to use for http and ftp.
# They will override the value in the environment.
#http_proxy = http://proxy.yoyodyne.com:18023/
#ftp_proxy = http://proxy.yoyodyne.com:18023/

# If you do not want to use proxy at all, set this to off.
#use_proxy = on

# You can customize the retrieval outlook. Valid options are default,
# binary, mega and micro.
#dot_style = default

# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
# the default!
#robots = on

# It can be useful to make Wget wait between connections. Set this to
# the number of seconds you want Wget to wait.
#wait = 0

# You can force creating directory structure, even if a single file is
# being retrieved, by setting this to on.
#dirstruct = off

# You can turn on recursive retrieving by default (don't do this if
# you are not sure you know what it means) by setting this to on.
#recursive = off

# To always back up file X as X.orig before converting its links (due
# to -k / --convert-links / convert_links = on having been specified),
# set this variable to on:
#backup_converted = off

# To have Wget follow FTP links from HTML files by default, set this
# to on:
#follow_ftp = off
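The "linear backoff" that the sample file describes for waitretry can be worked through with a little shell arithmetic. The following is only a sketch of the schedule as the comment states it (1 second after the first failure, 2 after the second, and so on up to the maximum); the count of 12 consecutive failures is a made-up figure for illustration, and the script does not call Wget at all.

```shell
# Sketch of the waitretry schedule described in the sample file:
# the wait after the Nth failure is N seconds, capped at waitretry.
waitretry=10   # value used in the sample file
failures=12    # hypothetical number of consecutive failures

schedule=""
total=0
n=1
while [ "$n" -le "$failures" ]; do
    if [ "$n" -lt "$waitretry" ]; then
        wait_s=$n
    else
        wait_s=$waitretry
    fi
    schedule="$schedule $wait_s"
    total=$((total + wait_s))
    n=$((n + 1))
done
schedule=${schedule# }   # drop the leading space

echo "waits: $schedule"
echo "total: $total seconds"
```

With these numbers the waits climb from 1 to 10 seconds and then stay at 10, so a file that keeps failing costs at most waitretry seconds per further attempt.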
The examples are divided into three sections loosely based on their complexity.
wget http://fly.srk.fer.hr/
wget --tries=45 http://fly.srk.fer.hr/jpg/flyweb.jpg
wget -t 45 -o log http://fly.srk.fer.hr/jpg/flyweb.jpg &
The ampersand at the end of the line makes sure that Wget works in the background. To unlimit the number of retries, use `-t inf'.
wget ftp://gnjilux.srk.fer.hr/welcome.msg
wget ftp://prep.ai.mit.edu/pub/gnu/
links index.html
wget -i file
If you specify `-' as file name, the URLs will be read from standard input.
wget -r http://www.gnu.org/ -o gnulog
wget --convert-links -r http://www.gnu.org/ -o gnulog
wget -p --convert-links http://www.server.com/dir/page.html
The HTML page will be saved to `www.server.com/dir/page.html', and the images, stylesheets, etc., somewhere under `www.server.com/', depending on where they were on the remote server.
wget -p --convert-links -nH -nd -Pdownload \
http://www.server.com/dir/page.html
wget -S http://www.lycos.com/
wget -s http://www.lycos.com/
more index.html
wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/
wget -r -l1 --no-parent -A.gif http://www.server.com/dir/
More verbose, but the effect is the same. `-r -l1' means to retrieve recursively (see section Recursive Retrieval), with maximum depth of 1. `--no-parent' means that references to the parent directory are ignored (see section Directory-Based Limits), and `-A.gif' means to download only the GIF files. `-A "*.gif"' would have worked too.
wget -nc -r http://www.gnu.org/
wget ftp://hniksic:mypassword@unix.server.com/.emacs
wget -O - http://jagor.srce.hr/ http://www.srce.hr/
You can also combine the two options and make pipelines to retrieve the documents from remote hotlists:
wget -O - http://cool.list.com/ | wget --force-html -i -
crontab
0 0 * * 0 wget --mirror http://www.gnu.org/ -o /home/me/weeklog
wget --mirror --convert-links --backup-converted \
http://www.gnu.org/ -o /home/me/weeklog
wget --mirror --convert-links --backup-converted \
--html-extension -o /home/me/weeklog \
http://www.gnu.org/
Or, with less typing:
wget -m -k -K -E http://www.gnu.org/ -o /home/me/weeklog
This chapter contains all the stuff that could not fit anywhere else.
Proxies are special-purpose HTTP servers designed to transfer data from remote servers to local clients. One typical use of proxies is lightening network load for users behind a slow connection. This is achieved by channeling all HTTP and FTP requests through the proxy, which caches the transferred data. When a cached resource is requested again, the proxy will return the data from its cache. Another use for proxies is for companies that separate (for security reasons) their internal networks from the rest of the Internet. In order to obtain information from the Web, their users connect and retrieve remote data using an authorized proxy.
Wget supports proxies for both HTTP and FTP retrievals. The standard way to specify proxy location, which Wget recognizes, is using the following environment variables:
http_proxy
ftp_proxy
no_proxy
For instance, if the value of no_proxy is `.mit.edu', the proxy will
not be used to retrieve documents from MIT.
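As a minimal sketch, the three variables could be set in a Bourne-style shell as follows. The proxy host `proxy.example.com' and port 8001 are placeholders, not values from this manual; the `.mit.edu' value is the one used in the example above. The Wget invocation is shown only as a comment, so the snippet itself does not touch the network.

```shell
# Hypothetical proxy host and port; substitute your site's values.
http_proxy="http://proxy.example.com:8001/"
ftp_proxy="http://proxy.example.com:8001/"
# Domain for which the proxy should not be used.
no_proxy=".mit.edu"
export http_proxy ftp_proxy no_proxy

# A Wget started from this shell inherits the variables, e.g.:
#   wget http://www.gnu.org/
echo "http_proxy=$http_proxy"
echo "ftp_proxy=$ftp_proxy"
echo "no_proxy=$no_proxy"
```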
In addition to the environment variables, proxy location and settings may be specified from within Wget itself.
Some proxy servers require authorization to enable you to use them. The
authorization consists of username and password, which must
be sent by Wget. As with HTTP authorization, several
authentication schemes exist. For proxy authorization only the
Basic authentication scheme is currently implemented.
You may specify your username and password either through the proxy URL or through the command-line options. Assuming that the company's proxy is located at `proxy.company.com' at port 8001, a proxy URL location containing authorization data might look like this:
http://hniksic:mypassword@proxy.company.com:8001/
Alternatively, you may use the `--proxy-user' and
`--proxy-passwd' options, and the equivalent `.wgetrc'
settings proxy_user and proxy_passwd to set the proxy
username and password.
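The `.wgetrc' route might look like the sketch below, which reuses the example proxy and credentials from this section. To avoid touching a real `~/.wgetrc', it writes a scratch file; it assumes, as the startup file documentation describes, that Wget loads the file named by the WGETRC environment variable. The Wget command itself is left as a comment.

```shell
# Write a scratch startup file instead of editing ~/.wgetrc.
rc=$(mktemp "${TMPDIR:-/tmp}/wgetrc.XXXXXX")
chmod 600 "$rc"    # the file will hold a password

cat > "$rc" <<'EOF'
# Example proxy location and credentials from this section.
http_proxy = http://proxy.company.com:8001/
proxy_user = hniksic
proxy_passwd = mypassword
EOF

# Assumption: Wget reads the file named by WGETRC, e.g.:
#   WGETRC="$rc" wget http://www.gnu.org/
cat "$rc"
```

Keeping the credentials in a file readable only by its owner avoids exposing them in the proxy URL or on the command line.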
Like all GNU utilities, the latest version of Wget can be found at the master GNU archive site prep.ai.mit.edu, and its mirrors. For example, Wget 1.8.1 can be found at ftp://prep.ai.mit.edu/gnu/wget/wget-1.8.1.tar.gz
Wget has its own mailing list at wget@sunsite.dk, thanks to Karsten Thygesen. The mailing list is for discussion of Wget features and web, reporting Wget bugs (those that you think may be of interest to the public) and mailing announcements. You are welcome to subscribe. The more people on the list, the better!
To subscribe, send mail to wget-subscribe@sunsite.dk with the magic word `subscribe' in the subject line. Unsubscribe by mailing to wget-unsubscribe@sunsite.dk.
The mailing list is archived at http://fly.srk.fer.hr/archive/wget. An alternative archive is available at http://www.mail-archive.com/wget%40sunsite.auc.dk/.
You are welcome to send bug reports about GNU Wget to bug-wget@gnu.org.
Before actually submitting a bug report, please try to follow a few simple guidelines.
If Wget has crashed, try to run it in a debugger, e.g. gdb `which
wget` core and type where to get the backtrace.
Since Wget uses GNU Autoconf for building and configuring, and avoids using "special" ultra-mega-cool features of any particular Unix, it should compile (and work) on all common Unix flavors.
Various Wget versions have been compiled and tested under many kinds of Unix systems, including Solaris, Linux, SunOS, OSF (aka Digital Unix), Ultrix, *BSD, IRIX, and others; refer to the file `MACHINES' in the distribution directory for a comprehensive list. If you compile it on an architecture not listed there, please let me know so I can update it.
Wget should also compile on the other Unix systems, not listed in `MACHINES'. If it doesn't, please let me know.
Thanks to kind contributors, this version of Wget compiles and works on Microsoft Windows 95 and Windows NT platforms. It has been compiled successfully using MS Visual C++ 4.0, Watcom, and Borland C compilers, with Winsock as networking software. Naturally, it lacks some features available on Unix, but it should work as a substitute for people stuck with Windows. Note that the Windows port is neither tested nor maintained by me--all questions and problems should be reported to the Wget mailing list at wget@sunsite.dk where the maintainers will look at them.
Since the purpose of Wget is background work, it catches the hangup
signal (SIGHUP) and ignores it. If the output was on standard
output, it will be redirected to a file named `wget-log'.
Otherwise, SIGHUP is ignored. This is convenient when you wish
to redirect the output of Wget after having started it.
$ wget http://www.ifi.uio.no/~larsi/gnus.tar.gz &
$ kill -HUP %%     # Redirect the output to wget-log
Other than that, Wget will not try to interfere with signals in any way.
C-c, kill -TERM and kill -KILL should kill it alike.
This chapter contains some references I consider useful.
It is extremely easy to make Wget wander aimlessly around a web site, sucking all the available data in progress. `wget -r site', and you're set. Great? Not for the server admin.
While Wget is retrieving static pages, there's not much of a problem. But for Wget, there is no real difference between a static page and the most demanding CGI. For instance, a site I know has a section handled by an, uh, bitchin' CGI script that converts all the Info files to HTML. The script can and does bring the machine to its knees without providing anything useful to the downloader.
For such and similar cases various robot exclusion schemes have been devised as a means for the server administrators and document authors to protect chosen portions of their sites from the wandering of robots.
The more popular mechanism is the Robots Exclusion Standard, or RES, written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in `/robots.txt' in the server root, which the robots are supposed to download and parse.
Wget supports RES when downloading recursively. So, when you issue:
wget -r http://www.server.com/
First the index of `www.server.com' will be downloaded. If Wget finds that it wants to download more documents from that server, it will request `http://www.server.com/robots.txt' and, if found, use it for further downloads. `robots.txt' is loaded only once per server.
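To make the file format concrete, here is a small sketch that writes a made-up `robots.txt' and lists the path prefixes it asks robots to avoid. The two paths are hypothetical, and the `sed' filter is deliberately naive: it ignores which User-agent a record applies to, so it only illustrates the format and is not how Wget parses the file.

```shell
# A hypothetical /robots.txt: a User-agent line followed by
# Disallow lines naming the URL path prefixes to avoid.
robots=$(mktemp "${TMPDIR:-/tmp}/robots.XXXXXX")
cat > "$robots" <<'EOF'
User-agent: *
Disallow: /cgi-bin/
Disallow: /info2html/
EOF

# Naive listing of the disallowed prefixes. A real parser must also
# match each record's User-agent line against the robot's own name.
disallowed=$(sed -n 's/^Disallow: *//p' "$robots")
echo "$disallowed"
```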
Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/wc/norobots.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft `<draft-koster-robots-00.txt>' titled "A Method for Web Robots Control". The draft, which has, as far as I know, never made it to an RFC, is available at http://www.robotstxt.org/wc/norobots-rfc.txt.
This manual no longer includes the text of the Robot Exclusion Standard.
The second, less known mechanism enables the author of an individual
document to specify whether they want the links from the file to be
followed by a robot. This is achieved using the META tag, like
this:
<meta name="robots" content="nofollow">
This is explained in some detail at http://www.robotstxt.org/wc/meta-user.html. Wget supports this method of robot exclusion in addition to the usual `/robots.txt' exclusion.
When using Wget, you must be aware that it sends unencrypted passwords through the network, which may present a security problem. Here are the main issues, and some solutions.
Passwords given on the command line are visible using ps. If this
is a problem, avoid putting passwords on the command line--e.g. you
can use `.netrc' for this.
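A `.netrc' entry for the FTP example used earlier in this manual might be set up as sketched below. The snippet works in a scratch directory so that a real `~/.netrc' is left alone. The host, user name and password are the example values from this manual, and the commented Wget command assumes that Wget looks for `.netrc' in the directory named by HOME.

```shell
# Use a scratch directory in place of the real home directory.
scratch=$(mktemp -d "${TMPDIR:-/tmp}/netrc.XXXXXX")
netrc="$scratch/.netrc"

# Create the file with owner-only permissions before writing the secret.
( umask 077 && : > "$netrc" )
cat >> "$netrc" <<'EOF'
machine unix.server.com login hniksic password mypassword
EOF

# The password now stays off the command line (and out of ps output):
#   HOME="$scratch" wget ftp://unix.server.com/.emacs
perms=$(ls -l "$netrc" | cut -c1-10)
echo "$perms $netrc"
```

Note that this only hides the password from other local users; it is still sent unencrypted over the network.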
GNU Wget was written by Hrvoje Nikšić <hniksic@arsdigita.com>. However, its development could never have gone as far as it has, were it not for the help of many people, either with bug reports, feature proposals, patches, or letters saying "Thanks!".
Special thanks goes to the following people (no particular order):
ansi2knr-ization. Lots of
portability fixes.
Digest
authentication.
The following people have provided patches, bug/build reports, useful suggestions, beta testing services, fan mail and all the other things that make maintenance so much fun:
Ian Abbott, Tim Adam, Adrian Aichner, Martin Baehr, Dieter Baron, Roger Beeman, Dan Berger, T. Bharath, Paul Bludov, Daniel Bodea, Mark Boyns, John Burden, Wanderlei Cavassin, Gilles Cedoc, Tim Charron, Noel Cragg, Kristijan Čonkaš, John Daily, Andrew Davison, Andrew Deryabin, Ulrich Drepper, Marc Duponcheel, Damir Džeko, Alan Eldridge, Aleksandar Erkalović, Andy Eskilsson, Christian Fraenkel, Masashi Fujita, Howard Gayle, Marcel Gerrits, Lemble Gregory, Hans Grobler, Mathieu Guillaume, Dan Harkless, Herold Heiko, Jochen Hein, Karl Heuer, HIROSE Masaaki, Gregor Hoffleit, Erik Magnus Hulthen, Richard Huveneers, Jonas Jensen, Simon Josefsson, Mario Jurić, Hack Kampbjørn, Const Kaplinsky, Goran Kezunović, Robert Kleine, KOJIMA Haime, Fila Kolodny, Alexander Kourakos, Martin Kraemer, Hrvoje Lacko, Daniel S. Lewart, Nicolás Lichtmeier, Dave Love, Alexander V. Lukyanov, Jordan Mendelson, Lin Zhe Min, Tim Mooney, Simon Munton, Charlie Negyesi, R. K. Owen, Andrew Pollock, Steve Pothier, Jan Přikryl, Marin Purgar, Csaba Ráduly, Keith Refson, Tyler Riddle, Tobias Ringstrom, Edward J. Sabol, Heinz Salzmann, Robert Schmidt, Andreas Schwab, Chris Seawood, Toomas Soome, Tage Stabell-Kulo, Sven Sternberger, Markus Strasser, John Summerfield, Szakacsits Szabolcs, Mike Thomas, Philipp Thomas, Dave Turner, Russell Vincent, Charles G Waldman, Douglas E. Wegscheid, Jasmin Zainul, Bojan Ždrnja, Kristijan Zimmer.
Apologies to all who I accidentally left out, and many thanks to all the subscribers of the Wget mailing list.
GNU Wget is licensed under the GNU GPL, which makes it free software.
Please note that "free" in "free software" refers to liberty, not price. As some GNU project advocates like to point out, think of "free speech" rather than "free beer". The exact and legally binding distribution terms are spelled out below; in short, you have the right (freedom) to run and change Wget and distribute it to other people, and even--if you want--charge money for doing either. The important restriction is that you have to grant your recipients the same rights and impose the same restrictions.
This method of licensing software is also known as open source because, among other things, it makes sure that all recipients will receive the source code along with the program, and be able to improve it. The GNU project prefers the term "free software" for reasons outlined at http://www.gnu.org/philosophy/free-software-for-freedom.html.
The exact license terms are defined by this paragraph and the GNU General Public License it refers to:
GNU Wget is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
GNU Wget is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
A copy of the GNU General Public License is included as part of this manual; if you did not receive it, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
In addition to this, this manual is free in the same sense:
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License" and "GNU Free Documentation License", with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".
The full texts of the GNU General Public License and of the GNU Free Documentation License are available below.
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc. 675 Mass Ave, Cambridge, MA 02139, USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.
We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.
Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and modification follow.