Outputting the Document and Headers With wget

2018-03-25 1700 words 4 minutes

1. Overview

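GNU Wget is the de facto standard program for downloading data from web servers. In this tutorial, we'll take a hands-on approach to learning a few ways to send both the document and the response headers to standard output with the wget command.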

2. Default Output Behavior

To understand the default output behavior of the wget command, let's use it to download data from google.com:

$ wget http://www.google.com
--2022-04-02 19:27:07--  http://www.google.com/
Resolving www.google.com (www.google.com)... 172.217.174.228, 2404:6800:4009:81d::2004
Connecting to www.google.com (www.google.com)|172.217.174.228|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'index.html'
index.html              [ <=>              ]  15.93K  --.-KB/s    in 0.03s
2022-04-02 19:27:08 (597 KB/s) - 'index.html' saved [16316]

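Notice that the output contains a lot of diagnostic information, while the actual document content doesn't appear on standard output. Instead, wget saves the document in a file named index.html. At first glance, the diagnostic information also appears to be written to stdout. In reality, wget sends it to the stderr stream.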

We can confirm that wget sends diagnostic information to stderr by redirecting stderr to a separate file and inspecting its contents:

$ wget http://www.google.com 2>stderr.dump
$ cat stderr.dump
--2022-04-02 19:33:57--  http://www.google.com/
Resolving www.google.com (www.google.com)... 142.250.67.196, 2404:6800:4009:81f::2004
Connecting to www.google.com (www.google.com)|142.250.67.196|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'index.html'
     0K .......... .....                                        533K=0.03s
2022-04-02 19:33:57 (533 KB/s) - 'index.html' saved [16289]
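The stream separation shown above can also be tried without network access. The following is only a sketch: fake_fetch is a hypothetical stand-in for wget (not part of wget itself) that writes its diagnostics to stderr and saves the "document" to index.html, mirroring the default behavior described here.

```shell
# Hypothetical stand-in for wget's default behavior (an assumption for
# illustration): diagnostics go to stderr, the document goes to a file.
fake_fetch() {
  echo "Connecting to example.invalid... connected." >&2
  echo "Saving to: 'index.html'" >&2
  echo "<!doctype html><title>Example</title>" > "$1/index.html"
}

# Count the lines in a file, trimming the padding some wc versions add.
count_lines() {
  wc -l < "$1" | tr -d ' '
}

workdir=$(mktemp -d)

# Capture each stream in its own file, as with 2>stderr.dump in the text.
fake_fetch "$workdir" 1>"$workdir/stdout.dump" 2>"$workdir/stderr.dump"

echo "stdout.dump lines: $(count_lines "$workdir/stdout.dump")"   # prints 0
echo "stderr.dump lines: $(count_lines "$workdir/stderr.dump")"   # prints 2

rm -rf "$workdir"
```

Nothing reaches stdout, and both diagnostic lines end up in the stderr capture, which is the same pattern we observed with the real command.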

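Let's keep this behavior in mind, because wget treats response headers as part of its diagnostic information and therefore sends them to stderr. Since we aren't interested in wget's default diagnostic output, we'll use the --quiet (-q) option to suppress this noise.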

3. wget With --output-document

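By default, wget writes the document content to a separate file. However, we can use the --output-document (-O) option to redirect the content to a file of our choice. As a special case, if we use - as the file name, wget writes the content to stdout.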

Let's see this in action by first redirecting the output to the content_from_google file:

$ wget -q --output-document content_from_google www.google.com
$ ls -l content_from_google
-rw-r--r--  1 tavasthi  192360288  16323 Apr  3 01:16 content_from_google

Next, let's send the output to stdout:

$ wget -q --output-document - www.google.com
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-IN"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title>

Great! We've learned how to send the document to stdout.
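One practical reason to write the document to stdout is that it can be piped straight into another command. As a minimal sketch, the hypothetical helper extract_title below (a name introduced here for illustration) pulls the text of the title element out of whatever HTML arrives on stdin. It assumes the whole element sits on a single line, as in the output above.

```shell
# Hypothetical helper: print the text of each <title>...</title> element
# found on stdin. Assumes the element does not span multiple lines.
extract_title() {
  grep -o '<title>[^<]*</title>' | sed -e 's/<title>//' -e 's/<\/title>//'
}

# Offline demonstration with a fragment of the output shown above.
printf '<html><head><title>Google</title></head>\n' | extract_title   # prints Google

# With network access, the same helper can read from wget directly:
#   wget -q --output-document - www.google.com | extract_title
```

Because -q silences the diagnostics, only the document flows through the pipe.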

In the next two sections, we'll focus on sending the headers to stdout.

4. wget With --save-headers

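With the --save-headers option, we can ask wget to prepend the headers to the actual document content, separating the two with a blank line after the headers. In this case, wget writes both the headers and the document content to the same target file.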

Let's see it in action:

$ wget -q --save-headers --output-document - www.google.com
HTTP/1.1 200 OK
Date: Sat, 02 Apr 2022 20:01:40 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2022-04-02-20; expires=Mon, 02-May-2022 20:01:40 GMT; path=/; domain=.google.com; Secure
Set-Cookie: AEC=AVQQ_LAeuGIhWDkqKiZiuP8N3P1Jz1x5Jkzoi0ckbpZotvhLRMeBQbD0F0I; expires=Thu, 29-Sep-2022 20:01:40 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
Set-Cookie: NID=511=fC52DE0Nqpm0zfbhAiW4qm6kdo7gy3dibVDFc6jos0QM32GcCFox_3VNLcgvSCaAeGHMp4LkqqvNda_nzO36w-NsjI4_ArdvfUnGuKIY6pgsTFPjIIb4L80X0m9ZU1a-zhSmObqwbEytIHaxMaP61L0qhJVRCgNpkCkfBubsEjQ; expires=Sun, 02-Oct-2022 20:01:40 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-IN"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title>
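Since --save-headers separates the headers from the document with a blank line, the combined stream can be split again at the first empty line. The two awk-based helpers below, headers_only and body_only, are hypothetical names used for this sketch. They assume the header lines may end in a carriage return, so they strip it before testing for emptiness.

```shell
# Hypothetical helper: print everything before the first blank line.
headers_only() {
  awk '{ sub(/\r$/, "") } /^$/ { exit } { print }'
}

# Hypothetical helper: print everything after the first blank line.
body_only() {
  awk 'found { print; next }
       { line = $0; sub(/\r$/, "", line) }
       line == "" { found = 1 }'
}

# Offline sample shaped like a --save-headers result.
sample_response() {
  printf 'HTTP/1.1 200 OK\r\nServer: gws\r\n\r\n<!doctype html>\n<p>hello</p>\n'
}

sample_response | headers_only   # prints the two header lines
sample_response | body_only      # prints the two document lines

# With network access:
#   wget -q --save-headers --output-document - www.google.com | headers_only
```

Splitting only at the first blank line matters, because the document itself may contain empty lines further down.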

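Furthermore, we can verify that wget doesn't send the header information to stderr by redirecting the two streams to separate files and checking their contents: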

$ wget -q --save-headers --output-document header_with_content www.google.com 2>stderr.out 1>stdout.out
$ test -s stderr.out
$ echo $?
1

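The -s option of the test command succeeds only when the file exists and has a size greater than zero. Since the exit status is 1 and the redirection has created the file, we can confidently say that stderr.out is empty.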
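To make the meaning of the exit status concrete, here is a small sketch that runs test -s against an empty file, a non-empty file, and a path that doesn't exist. The size_state helper and the file names are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: report what test -s says about a path.
size_state() {
  if test -s "$1"; then
    echo "non-empty"            # exit status 0
  else
    echo "empty-or-missing"     # exit status 1
  fi
}

workdir=$(mktemp -d)
: > "$workdir/empty.out"               # create a zero-byte file
printf 'data\n' > "$workdir/full.out"  # create a file with content

echo "empty.out:   $(size_state "$workdir/empty.out")"     # empty-or-missing
echo "full.out:    $(size_state "$workdir/full.out")"      # non-empty
echo "missing.out: $(size_state "$workdir/missing.out")"   # empty-or-missing

rm -rf "$workdir"
```

An exit status of 1 alone can't distinguish an empty file from a missing one. In our case the shell creates stderr.out as part of the redirection, so the file is known to exist.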

5. wget With --server-response

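wget also provides the --server-response (-S) option, which we can use to get the response headers. However, unlike --save-headers, the --server-response option treats the response headers as diagnostic information and sends them to the stderr stream.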

Let's use the --server-response option and redirect the stderr and stdout streams to the stderr.out and stdout.out files, respectively:

$ wget -q --server-response --output-document header_with_content www.google.com 2>stderr.out 1>stdout.out
$ cat stderr.out
  HTTP/1.1 200 OK
  Date: Sat, 02 Apr 2022 20:10:04 GMT
  Expires: -1
  Cache-Control: private, max-age=0
  Content-Type: text/html; charset=ISO-8859-1
  P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
  Server: gws
  X-XSS-Protection: 0
  X-Frame-Options: SAMEORIGIN
  Set-Cookie: 1P_JAR=2022-04-02-20; expires=Mon, 02-May-2022 20:10:04 GMT; path=/; domain=.google.com; Secure
  Set-Cookie: AEC=AVQQ_LCUA9Yq67FEAgNtMJs9LdKfaRbLx_iMk99w5qmdIaHRhFPYkxzPCw; expires=Thu, 29-Sep-2022 20:10:04 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
  Set-Cookie: NID=511=CeFWgDo2-TR1PqQSpvyZkMCfZdtLSEIELe9T6KhKT2LaMR_QD8gNU2IkhQppOPHPPccQK8emgfYCyBUZZHfZpKbNPqJ8NgCCizFAI-oOSuh5B3ISULBVxUuaIjL5MZ6wp0EGKc-qv_hVvmgmhlRe7rjRdjgwXs3Svp2ubTnWNFg; expires=Sun, 02-Oct-2022 20:10:04 GMT; path=/; domain=.google.com; HttpOnly
  Accept-Ranges: none
  Vary: Accept-Encoding
  Transfer-Encoding: chunked
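Once the headers are available as text, individual values can be picked out of them. As a sketch, the hypothetical helper last_status_code below reads header lines in the indented format shown above and prints the status code from the last status line. Taking the last one is an assumption that suits responses that pass through redirects, where several status lines may appear.

```shell
# Hypothetical helper: print the status code of the last "HTTP/..." status
# line on stdin. awk's default field splitting ignores the leading spaces
# that --server-response puts before each header line.
last_status_code() {
  awk '$1 ~ /^HTTP\// { code = $2 } END { if (code != "") print code }'
}

# Offline demonstration with two status lines, as after a redirect.
printf '  HTTP/1.1 301 Moved Permanently\n  Location: http://www.example.invalid/\n  HTTP/1.1 200 OK\n  Server: gws\n' \
  | last_status_code   # prints 200

# With network access, discard the document and read the headers from stderr:
#   wget -q --server-response --output-document /dev/null www.google.com 2>&1 | last_status_code
```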

Because the headers go to the stderr stream, we need an additional redirection from stderr to stdout if we want the headers on stdout. So, let's use the 2>&1 redirection to send stderr to stdout:

$ wget -q --server-response --output-document - www.google.com 2>&1
  HTTP/1.1 200 OK
  Date: Sat, 02 Apr 2022 20:23:01 GMT
  Expires: -1
  Cache-Control: private, max-age=0
  Content-Type: text/html; charset=ISO-8859-1
  P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
  Server: gws
  X-XSS-Protection: 0
  X-Frame-Options: SAMEORIGIN
  Set-Cookie: 1P_JAR=2022-04-02-20; expires=Mon, 02-May-2022 20:23:01 GMT; path=/; domain=.google.com; Secure
  Set-Cookie: AEC=AVQQ_LDoJE9yujyrwmXjPVoxYtDVSpHrcPvcTsAjCEgSRD_1iU0PUCHTW4E; expires=Thu, 29-Sep-2022 20:23:01 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
  Set-Cookie: NID=511=Vj-mKsn9Lba3zf7DaDQ4mqhY4jJkHtStKG07jt98OQzod_sexTrqs6A7i_H7L6VJjJ3Ev_5JWkpFMvIoUiHdNtu9rE18C5vxEypdxp6mYwiMkOqI4Z2m_28RbFYgzhpNn4OmXh44xom-TxKKMkszAUMtP5FaI637gJ7XrHvhx_s; expires=Sun, 02-Oct-2022 20:23:01 GMT; path=/; domain=.google.com; HttpOnly
  Accept-Ranges: none
  Vary: Accept-Encoding
  Transfer-Encoding: chunked
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-IN"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title>

Note that, unlike with the --save-headers option, no blank line separates the headers from the actual content.

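Let's verify that the above command really sends both the content and the headers to stdout by redirecting the stderr and stdout streams to separate files and checking their sizes: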

$ (wget -q --server-response --output-document - www.google.com 2>&1) 1>stdout.out 2>stderr.out
$ test -s stderr.out
$ echo $?
1

We can see that the stderr.out file has a size of zero. Therefore, we can confidently say that our approach works as expected.
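The subshell check can be reproduced offline as well. In the sketch below, emit_both is a hypothetical stand-in for wget with --server-response and --output-document - that writes one "header" line to stderr and one "body" line to stdout. Comparing the merged and unmerged runs shows why the outer stderr.out stays empty once 2>&1 is applied inside the parentheses.

```shell
# Hypothetical stand-in (an assumption for illustration): one line per stream.
emit_both() {
  echo "  HTTP/1.1 200 OK" >&2   # plays the role of the headers on stderr
  echo "<!doctype html>"         # plays the role of the document on stdout
}

# Count the lines in a file, trimming the padding some wc versions add.
count_lines() {
  wc -l < "$1" | tr -d ' '
}

workdir=$(mktemp -d)

# stderr merged into stdout inside the subshell: nothing is left for stderr.out.
(emit_both 2>&1) 1>"$workdir/stdout.out" 2>"$workdir/stderr.out"
echo "merged:   stdout=$(count_lines "$workdir/stdout.out") stderr=$(count_lines "$workdir/stderr.out")"
# prints: merged:   stdout=2 stderr=0

# Without the merge, each stream lands in its own file.
emit_both 1>"$workdir/stdout.out" 2>"$workdir/stderr.out"
echo "separate: stdout=$(count_lines "$workdir/stdout.out") stderr=$(count_lines "$workdir/stderr.out")"
# prints: separate: stdout=1 stderr=1

rm -rf "$workdir"
```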