Recursively Downloading Directories on the Web With wget
1. Introduction
Sometimes, we want to grab a directory on a web server that contains files we need. Or perhaps we want to crawl a website so that we can access a directory locally.
In this article, we'll get our hands dirty with the *wget* tool and learn how to download directories and subdirectories on the Web.
2. Mirroring an Entire Website
First, we'll see how to download an entire website. *wget* lets us mirror everything with the *--mirror* (*-m*) option:
```shell
$ wget -m https://www.blogdemo.com/
--2022-03-11 14:02:45--  https://www.blogdemo.com/
Resolving www.blogdemo.com (www.blogdemo.com)... 172.66.43.8, 172.66.40.248, 2606:4700:3108::ac42:2b08, ...
Connecting to www.blogdemo.com (www.blogdemo.com)|172.66.43.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.blogdemo.com/index.html’
www.blogdemo.com/in [ <=> ] 137,01K --.-KB/s in 0,1s
2022-03-11 14:02:45 (1,04 MB/s) - ‘www.blogdemo.com/index.html’ saved [140303]
Loading robots.txt; please ignore errors.
--2022-03-11 14:02:45--  https://www.blogdemo.com/robots.txt
Reusing existing connection to www.blogdemo.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘www.blogdemo.com/robots.txt’
www.blogdemo.com/ro [ <=> ] 72 --.-KB/s in 0s
2022-03-11 14:02:45 (5,23 MB/s) - ‘www.blogdemo.com/robots.txt’ saved [72]
...
```
Note that this operation takes some time and disk space, since it tries to download the entire website.
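Under the hood, *-m* is just a shorthand. According to the wget manual, it currently turns on recursion and timestamping, sets infinite recursion depth, and keeps FTP directory listings. A minimal sketch of the equivalence (the blogdemo.com URL is the article's example host):

```shell
# --mirror (-m) is currently equivalent to this combination of options:
mirror_opts="-r -N -l inf --no-remove-listing"

# So the mirror command above could also be spelled out as:
echo "wget $mirror_opts https://www.blogdemo.com/"
```

This is why mirroring re-runs cheaply: the *-N* timestamping part skips files that haven't changed on the server.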
3. Recursively Downloading a Desired Directory
Mirroring an entire website as above may not be helpful, since it isn't flexible. **In general, we want to fetch a specific directory according to our needs.** Fortunately, *wget* lets us do that, too. **We turn on recursive retrieval with the *--recursive* (*-r*) option to get the desired subdirectories.** In the following sections, we'll combine this option with other *wget* options to accomplish the desired operation.
3.1. *wget* With the *--no-host-directories* and *--cut-dirs* Options
The first way to achieve our goal with *wget* is to use the *--no-host-directories* (*-nH*) and *--cut-dirs* options. The *-nH* option disables the directory prefixed with the hostname. The second option, *--cut-dirs*, specifies the number of directory components to ignore. With these options, we can manipulate the recursive retrieval of directories.
For example, if we download the subdirectories of www.blogdemo.com/linux/category/web with only the *-r* option, we end up with a four-level directory path, www.blogdemo.com/linux/category/web. When we add the *-nH* option, we get the linux/category/web directory path instead. Furthermore, by setting the value of *--cut-dirs*, we can take this directory trick further. Setting it to 1 gives us category/web; with a value of 2, we get web, and so on. Let's see the full command:
```shell
$ wget -r -np -nH --cut-dirs=1 https://www.blogdemo.com/linux
--2022-03-11 15:26:49--  https://www.blogdemo.com/linux
Resolving www.blogdemo.com (www.blogdemo.com)... 172.66.43.8, 172.66.40.248, 2606:4700:3108::ac42:2b08, ...
Connecting to www.blogdemo.com (www.blogdemo.com)|172.66.43.8|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: [following]
--2022-03-11 15:26:49--
Reusing existing connection to www.blogdemo.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘linux’
linux [ <=> ] 109,52K --.-KB/s in 0,1s
2022-03-11 15:26:49 (877 KB/s) - ‘linux’ saved [112148]
Loading robots.txt; please ignore errors.
--2022-03-11 15:26:49--  https://www.blogdemo.com/robots.txt
Reusing existing connection to www.blogdemo.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘robots.txt’
robots.txt [ <=> ] 72 --.-KB/s in 0s
2022-03-11 15:26:49 (5,48 MB/s) - ‘robots.txt’ saved [72]
...
```
If we don't want to download the parent directories, we should make sure to keep the *--no-parent* (*-np*) option.
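To make the interplay of *-nH* and *--cut-dirs* concrete, here's a small shell sketch; the `local_path` helper is hypothetical and only mimics how wget derives the local directory from the remote path linux/category/web:

```shell
# Hypothetical helper: drop the hostname (what -nH does), then strip the
# first N directory components (what --cut-dirs=N does).
local_path() {  # $1 = path below the hostname, $2 = value of --cut-dirs
  echo "$1" | cut -d/ -f"$(($2 + 1))"-
}

local_path "linux/category/web" 0   # prints linux/category/web
local_path "linux/category/web" 1   # prints category/web
local_path "linux/category/web" 2   # prints web
```

In other words, *--cut-dirs* counts components from the left of the path, after the hostname component has already been removed by *-nH*.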
3.2. *wget* With the *--level* Option
The second way to achieve our goal with *wget* is to use the *--level* (*-l*) option. **This option limits the depth of subdirectories that *wget* will recurse into.** For example, if we download the subdirectories of www.blogdemo.com/linux with a level value of 1, *wget* retrieves the first-level subdirectories located in linux/, such as linux/category.
If we increase this level value to 2, then *wget* also descends into the subdirectories below linux/category. Let's see an example:
```shell
$ wget -np -r -l 2 https://www.blogdemo.com/linux/
--2022-03-11 16:17:38--  https://www.blogdemo.com/linux/
Resolving www.blogdemo.com (www.blogdemo.com)... 172.66.43.8, 172.66.40.248, 2606:4700:3108::ac42:28f8, ...
Connecting to www.blogdemo.com (www.blogdemo.com)|172.66.43.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.blogdemo.com/linux/index.html’
www.blogdemo.com/linux/i [ <=> ] 109,52K --.-KB/s in 0,1s
2022-03-11 16:17:39 (903 KB/s) - ‘www.blogdemo.com/linux/index.html’ saved [112148]
...
```
Note that we used the *-np* and *-r* options here again.
The default value of *--level* is 5. So, if we don't specify a value for this option, *wget* recurses five levels deep. In addition, a value of 0 for this option means infinite depth.
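So the depth variants discussed above differ only in the *-l* value. As a sketch (the URL is the article's example host):

```shell
# Build a depth-limited retrieval of www.blogdemo.com/linux:
depth=2
cmd="wget -np -r -l $depth https://www.blogdemo.com/linux/"
echo "$cmd"

# Omitting -l entirely falls back to the default depth of 5, while
# -l 0 (or -l inf) removes the limit altogether.
```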
4. Additional Features
*wget* is a very powerful tool that offers many more features for us to use. In this section, we'll examine some core options that are needed most of the time.
4.1. Changing the User Agent
Sometimes, we need to define a user agent manually to work around certain problems. The flexibility this option provides is quite useful in such situations. For example, *wget* may throw an error:
```shell
$ wget -r
--2022-03-11 16:39:18--
Resolving www.blogdemo.com (www.blogdemo.com)... 172.66.40.248, 172.66.43.8, 2606:4700:3108::ac42:2b08, ...
Connecting to www.blogdemo.com (www.blogdemo.com)|172.66.40.248|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2022-03-11 16:39:19 ERROR 403: Forbidden.
```
We can work around these kinds of problems by changing the user agent with *--user-agent* (*-U*). Let's give it a try:
```shell
$ wget -r --user-agent="Mozilla"
--2022-03-11 16:45:17--
Resolving www.blogdemo.com (www.blogdemo.com)... 172.66.40.248, 172.66.43.8, 2606:4700:3108::ac42:28f8, ...
Connecting to www.blogdemo.com (www.blogdemo.com)|172.66.40.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
...
```
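If we need a custom user agent regularly, we can also set it once in wget's startup file instead of passing *-U* on every invocation. A sketch of *~/.wgetrc* (the Mozilla string is just the example value used above):

```
# ~/.wgetrc -- wget reads these settings on startup
user_agent = Mozilla
```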
4.2. Converting Links for Local Viewing
If we want to make the links suitable for local inspection, we can use the *--convert-links* option. This option converts the links after the download:
```shell
$ wget -r --no-parent --convert-links https://www.blogdemo.com/linux/category/web
--2022-03-11 17:31:46--  https://www.blogdemo.com/linux/category/web
Resolving www.blogdemo.com (www.blogdemo.com)... 172.66.43.8, 172.66.40.248, 2606:4700:3108::ac42:28f8, ...
Connecting to www.blogdemo.com (www.blogdemo.com)|172.66.43.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
...
...
FINISHED --2022-03-11 17:01:46--
Total wall clock time: 3m 0s
Downloaded: 11 files, 452K in 0,7s (630 KB/s)
Converting links in www.blogdemo.com/linux/category/networking... 24-29
Converting links in www.blogdemo.com/linux/category/scripting... 24-46
Converting links in www.blogdemo.com/linux/category/security... 24-22
Converting links in www.blogdemo.com/linux/category/processes... 24-45
Converting links in www.blogdemo.com/linux/category/files... 24-60
Converting links in www.blogdemo.com/linux/category/administration... 24-43
Converting links in www.blogdemo.com/linux/category/search... 24-21
Converting links in www.blogdemo.com/linux/category/web... 24-18
Converting links in www.blogdemo.com/linux/category/filesystems... 24-29
Converting links in www.blogdemo.com/linux/category/installation... 24-17
Converted links in 10 files in 0,01 seconds.
```
4.3. Turning Off the Robots Exclusion File
*wget* follows the Robots Exclusion Standard, written by Martijn Koster et al. in 1994. Under this standard, a text file tells robots which directory paths to avoid during download operations. *wget* first requests this text file, robots.txt, to comply with the directives given by the web server's administrators. **This process can sometimes prevent us from retrieving the directories we want.** Therefore, we can turn off the robots exclusion file:
```shell
$ wget -r --level=1 --no-parent --convert-links -e robots=off -U "Mozilla"
--2022-03-11 17:48:36--
Resolving www.blogdemo.com (www.blogdemo.com)... 172.66.40.248, 172.66.43.8, 2606:4700:3108::ac42:28f8, ...
Connecting to www.blogdemo.com (www.blogdemo.com)|172.66.40.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
...
```