Split a File at Given Line Numbers

codingman included in Linux

2017-11-04 3201 words 7 minutes


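1. Overview

When we work with a large file, we sometimes need to break it into several parts and process each part separately. We call this "splitting a file."

The handy split command can split a file for us in most cases. In this tutorial, however, we'll discuss one specific scenario: how to split a file at given line numbers.

2. Introduction to the Problem

The split command can split a file by size or by a fixed number of lines. Sometimes, however, we want to split a file at specific line numbers.

An example file helps us understand the problem quickly. Let's say we have a text file called input.txt: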

$ cat input.txt
01 is my line number.
02 is my line number.
03 is my line number.
04 is my line number.
05 is my line number.
06 is my line number.
07 is my line number.
08 is my line number.
09 is my line number.
10 is my line number.
11 is my line number.
12 is my line number.
13 is my line number.
14 is my line number.
15 is my line number.

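The file has 15 lines. Now, let's split it at three line numbers: 4, 7, and 12. That is, after the split, we'll have four files:

file1 contains lines 1-4 of input.txt (4 lines)
file2 contains lines 5-7 of input.txt (3 lines)
file3 contains lines 8-12 of input.txt (5 lines)
file4 contains lines 13-15 of input.txt (3 lines)

Since the resulting files contain different numbers of lines, we can't solve the problem with the split command alone.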
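To make the limitation concrete, here is a minimal sketch that splits a 15-line sample by line count. The demo_dir variable and the chunk_ prefix are names introduced here for illustration only, and the sample is regenerated with seq rather than taken from the real input.txt:

```shell
# Work in a throwaway directory so the demo does not touch real files.
demo_dir=$(mktemp -d)

# Recreate a 15-line sample that looks like input.txt.
seq -w 15 | sed 's/$/ is my line number./' > "$demo_dir/input.txt"

# split -l N starts a new chunk every N lines; the chunk size is fixed,
# so it cannot produce chunks of 4, 3, 5, and 3 lines in one run.
split -l 4 "$demo_dir/input.txt" "$demo_dir/chunk_"

# Show the line count of every chunk.
wc -l "$demo_dir"/chunk_*
```

With a chunk size of 4, the result is three 4-line chunks and one 3-line remainder, which is not the 4/3/5/3 layout we're after.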

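We'll address three approaches to the problem:

A shell script using the head and tail commands
A shell script based on the sed command
The awk command

Usually, when we need to split a file into chunks, we're likely facing a large file. Therefore, the performance of the solution matters.

We'll discuss the performance of the solutions and find out which one is the most efficient.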

3. Using the head and tail Commands

Using the head and tail commands with their -n options, we can extract lines from an input file.

Let's extract lines 3-7 from input.txt:

$ tail -n +3 input.txt | head -n $(( 7-3+1 ))
03 is my line number.
04 is my line number.
05 is my line number.
06 is my line number.
07 is my line number.
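The arithmetic generalizes: to print lines FROM through TO, tail skips ahead to line FROM and head keeps TO - FROM + 1 lines. A minimal sketch as a shell function follows; print_range and the sample variable are names introduced here for illustration and aren't part of the scripts in this tutorial:

```shell
# print_range FILE FROM TO
# Print lines FROM..TO (inclusive, 1-based) of FILE using tail and head.
print_range() {
    # tail -n +FROM starts the output at line FROM;
    # head -n COUNT stops after TO-FROM+1 lines.
    tail -n "+$2" "$1" | head -n "$(( $3 - $2 + 1 ))"
}

# Example: build a 15-line sample and print lines 3-7.
sample=$(mktemp)
seq -w 15 | sed 's/$/ is my line number./' > "$sample"
print_range "$sample" 3 7
```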

Therefore, we can write a shell script that wraps the tail | head pipeline to split a file at the given line numbers:

$ cat head_and_tail.sh
#!/bin/bash
INPUT_FILE="input.txt"  # The input file
LINE_NUMBERS=( 4 7 12 ) # The given line numbers (array)
START=1                 # The offset to calculate lines
IDX=1                   # The index used in the name of generated files: file1, file2 ...
for i in "${LINE_NUMBERS[@]}"
do
    # Extract the lines using the head and tail commands
    tail -n +$START "$INPUT_FILE" | head -n $(( i-START+1 )) > "file$IDX.txt"
    (( IDX++ ))
    START=$(( i+1 ))
done
# Extract the remainder: from the line after the last given line number to the end of the file
tail -n +$START "$INPUT_FILE" > "file$IDX.txt"

Now, let's run the script and check whether it splits input.txt into the expected chunks:

$ ./head_and_tail.sh
$ head file*
==> file1.txt <==
01 is my line number.
02 is my line number.
03 is my line number.
04 is my line number.
==> file2.txt <==
05 is my line number.
06 is my line number.
07 is my line number.
==> file3.txt <==
08 is my line number.
09 is my line number.
10 is my line number.
11 is my line number.
12 is my line number.
==> file4.txt <==
13 is my line number.
14 is my line number.
15 is my line number.

As the output above shows, the problem is solved.
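Whatever method produces the chunks, one way to double-check a split is to concatenate the chunks in order and compare the result with the original file. The sketch below is self-contained: it recreates the sample and the four chunks in a scratch directory (the work variable is a name introduced here) using the same tail | head logic as head_and_tail.sh, then compares with cmp:

```shell
# Scratch directory with a regenerated 15-line sample.
work=$(mktemp -d)
seq -w 15 | sed 's/$/ is my line number./' > "$work/input.txt"

# Produce the four chunks with the tail | head approach.
start=1
idx=1
for end in 4 7 12; do
    tail -n "+$start" "$work/input.txt" | head -n "$(( end - start + 1 ))" > "$work/file$idx.txt"
    idx=$(( idx + 1 ))
    start=$(( end + 1 ))
done
tail -n "+$start" "$work/input.txt" > "$work/file$idx.txt"

# Concatenate the chunks in order and compare byte-for-byte with the original.
if cat "$work/file1.txt" "$work/file2.txt" "$work/file3.txt" "$work/file4.txt" \
        | cmp -s - "$work/input.txt"; then
    echo "chunks reassemble to the original"
else
    echo "chunks differ from the original"
fi
```

The chunk names are listed explicitly instead of using file* so that the order stays correct even with ten or more chunks.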

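4. Using the sed Command

**The sed command supports an address range given by two line numbers.**

For example, we can write a short sed one-liner to extract lines 3-7 from input.txt: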

$ sed -n '3,7p; 8q' input.txt
03 is my line number.
04 is my line number.
05 is my line number.
06 is my line number.
07 is my line number.

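In the command above, the "8q" part tells sed to quit as soon as it reaches line 8, that is, right after printing line 7. Skipping the rest of the file gives better performance.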
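A quick way to see that "8q" really stops sed early is to feed it an endless input. In this sketch, the yes command (used here only as an illustrative infinite source) repeats the same line forever; the pipeline still terminates because sed quits at line 8:

```shell
# yes prints "some line" forever; without "8q", sed would never finish.
out=$(yes 'some line' | sed -n '3,7p; 8q')

# Five lines (3 through 7) were printed before sed quit.
printf '%s\n' "$out"
```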

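As we can see, **extracting lines with sed's address range is more straightforward than the head and tail combination.** So, to solve our problem, we only need to calculate the boundaries of each address range and pass them to the sed command: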

$ cat using_sed.sh  
#!/bin/bash
INPUT_FILE="input.txt"  # The input file
LINE_NUMBERS=( 4 7 12 ) # The given line numbers (array)
START=1                 # The start line number
IDX=1                   # The index used in the name of generated files: file1, file2 ...
for i in "${LINE_NUMBERS[@]}"
do
    # Extract the lines using sed command
    NEXT_LINE=$(( i+1 ))
    sed -n "$START, $i p; $NEXT_LINE q" "$INPUT_FILE" > "file$IDX.txt"
    (( IDX++ ))
    START=$NEXT_LINE
done
# Extract the remainder: from the line after the last given line number to the end of the file ($ is sed's last-line address)
sed -n "$START, $ p" "$INPUT_FILE" > "file$IDX.txt"

Now, let's run the script and check the files it creates:

$ ./using_sed.sh
$ head file*
==> file1.txt <==
01 is my line number.
02 is my line number.
03 is my line number.
04 is my line number.
==> file2.txt <==
05 is my line number.
06 is my line number.
07 is my line number.
==> file3.txt <==
08 is my line number.
09 is my line number.
10 is my line number.
11 is my line number.
12 is my line number.
==> file4.txt <==
13 is my line number.
14 is my line number.
15 is my line number.

Great! The problem is solved.
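One detail in the last sed call of using_sed.sh is easy to miss: the lone $ is sed's address for the last input line, not a shell variable. Inside double quotes, the shell leaves a $ followed by a space untouched, which is why the script works. A sketch of a more explicit spelling, escaping the $ so the intent is clear (the sample and start variables are illustrative names):

```shell
# Regenerate the 15-line sample in a temporary file.
sample=$(mktemp)
seq -w 15 | sed 's/$/ is my line number./' > "$sample"

start=13
# \$ reaches sed as a literal $, its address for the last input line,
# so sed receives the script: 13,$p
sed -n "${start},\$p" "$sample"
```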

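5. Using the awk Command

Because awk's scripting language natively supports arrays, loops, redirection, and many other features, we don't need to wrap the awk command in a shell script to solve the problem.

We could even solve it with an awk one-liner. However, we'll break it into multiple lines with proper indentation so that it's easier to understand: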

awk -v nums="4 7 12" '
    BEGIN {
        c=split(nums,b)
        for(i=1; i<=c; i++) a[b[i]]
        j=1; out = "file1.txt"
    } 
    { print > out }
    NR in a {
        close(out)
        out = "file" ++j ".txt"
    }' input.txt

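If we run the awk command above, we'll get four files, each holding the expected data.

Now, let's understand how it works:

-v nums="4 7 12": We assign the given line numbers to the variable nums
BEGIN { … }: The code in the BEGIN block runs only once, before awk reads the first line of the input file
- c=split(nums,b): Using the split() function, we split the three numbers into an array (b[]); the variable c holds the array's length (3)
- for(i=1; i<=c; i++) a[b[i]]: We create another associative array, a[], with the elements of b[] as its keys. For example: b[1]=4 -> a[4]; b[2]=7 -> a[7]; and so on
- j=1; out = "file1.txt": Here, we initialize a variable (out) to hold the name of the output file and a variable (j) to hold the index of each output file
{ print > out }: We print the current line to the current output file
NR in a { close(out); out = "file" ++j ".txt" }: If the current line number exists in the associative array a[], we close the current output file and increment the index in the filename, so the next line goes to a new file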
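The key trick is that merely referencing a[b[i]] creates that key, so "NR in a" becomes a cheap membership test. The following sketch runs only a BEGIN block (so no input file is needed) and prints the array length and two membership checks; the result variable is a name introduced here for illustration:

```shell
result=$(awk -v nums="4 7 12" '
    BEGIN {
        c = split(nums, b)                 # b[1]=4, b[2]=7, b[3]=12; c is 3
        for (i = 1; i <= c; i++) a[b[i]]   # referencing a[key] creates the key
        print c, (7 in a), (5 in a)        # 7 is a key, 5 is not
    }')

echo "$result"   # prints: 3 1 0
```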

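6. Performance

So far, we've learned three different approaches to the problem. Now it's time to discuss their performance.

Before we benchmark the scripts, let's review the three approaches and estimate the results.

Let's say we need to split an input file into n chunks: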

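head_and_tail.sh – starts 2n - 1 processes (a tail | head pair for each of the first n - 1 chunks, plus one final tail) and processes the input file n times
using_sed.sh – starts n processes (sed) and processes the input n times
The awk command – starts a single process (awk) and processes the input only once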

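Based on the analysis above, the awk solution seems to have the lowest cost and should perform best. By contrast, head_and_tail.sh should be the slowest.

Next, let's verify whether our estimate is correct.

6.1. Creating a Big Input File

Our input.txt has only 15 lines, which is too small for a performance test. Let's create an input file, big.txt, with 100 million lines: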

$ seq 100000000 > big.txt

$ du -h big.txt
848M	big.txt
$ wc -l big.txt
100000000 big.txt

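We'll use big.txt as the input file for our performance benchmark.

6.2. Performance Benchmark

We'll run each script or command under the time command to benchmark its performance.

Before we start testing:

We've changed the INPUT_FILE variable to point to big.txt
Since our input file now has 100 million lines, we've also changed the LINE_NUMBERS array to "( 400000 50000000 70000000 )"
We run all tests in the /tmp directory, which is on a tmpfs filesystem, to avoid any filesystem caching effects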
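Before comparing timings, it's worth confirming that all three methods produce identical chunks, so that the benchmark compares like with like. The sketch below does this on a scaled-down input of 100,000 lines, with split points 400, 50000, and 70000 chosen here purely for illustration. The directory names ht, sed, and awk and the variables work and same are also assumptions of this sketch, and the awk program parenthesizes ++j to keep the concatenation unambiguous:

```shell
# Scratch area: one output directory per method.
work=$(mktemp -d)
seq 100000 > "$work/big.txt"
mkdir "$work/ht" "$work/sed" "$work/awk"

# Methods 1 and 2: tail | head and sed, chunk by chunk.
start=1
idx=1
for end in 400 50000 70000; do
    next=$(( end + 1 ))
    tail -n "+$start" "$work/big.txt" | head -n "$(( end - start + 1 ))" > "$work/ht/file$idx.txt"
    sed -n "${start},${end}p; ${next}q" "$work/big.txt" > "$work/sed/file$idx.txt"
    idx=$(( idx + 1 ))
    start=$next
done
# Remainder of the file for both methods.
tail -n "+$start" "$work/big.txt" > "$work/ht/file$idx.txt"
sed -n "${start},\$p" "$work/big.txt" > "$work/sed/file$idx.txt"

# Method 3: a single awk pass, run inside its own output directory.
(cd "$work/awk" && awk -v nums="400 50000 70000" '
    BEGIN { c = split(nums, b); for (i = 1; i <= c; i++) a[b[i]]; j = 1; out = "file1.txt" }
    { print > out }
    NR in a { close(out); out = "file" (++j) ".txt" }' ../big.txt)

# diff -r compares same-named files across two directories.
if diff -r "$work/ht" "$work/sed" > /dev/null && diff -r "$work/ht" "$work/awk" > /dev/null; then
    same=yes
else
    same=no
fi
echo "identical chunks: $same"
```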

First, let's test our head_and_tail.sh script:

$ time ./head_and_tail.sh 
real 1.40
user 1.11
sys 1.00

Next, let's see how fast the using_sed.sh script runs:

$ rm file* ; time ./using_sed.sh
real 10.80
user 10.08
sys 0.68

Finally, let's test the awk command:

$ rm file* ; time awk -v nums="400000 50000000 70000000" ' .... '  big.txt
real 18.73
user 18.33
sys 0.38

6.3. Understanding the Results

The results are surprising!

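Even though the head_and_tail.sh script starts seven processes (three tail | head pairs plus one final tail) and reads the big input file four times, it's the fastest solution.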

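However, the awk command, which we expected to be the fastest solution, is about 13 times slower than the head_and_tail.sh script (18.73 s versus 1.40 s) and is the slowest of the three approaches.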

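The sed solution sits in between, but it's still about eight times slower than head_and_tail.sh.

Now the question is: why is the awk command, which reads the input file only once, so much slower than head_and_tail.sh?

This is **because the awk command reads every line of the file and, based on the given FS and RS, initializes internal state such as the fields, NF, and the record.** Then it goes through our awk script to check whether it should do something with the text. In our case, we do nothing but redirect the line to a file. Finally, awk writes the text to the file. All of this adds a lot of overhead that the problem at hand doesn't need.

On the other hand, **the head and tail commands only look for newline characters; they don't parse or hold the text of a line.** They scan until they reach the target line number. Again, they don't read lines into records and keep them. Instead, they simply dump the content to the output.

The sed command also reads and holds every line of the input file. Therefore, it's much slower than the head_and_tail solution as well.

However, sed does less initialization than awk, so the sed script is faster than the awk solution.

**Moreover, the addressed "q" command boosts sed's performance.** If we remove "$NEXT_LINE q" from the using_sed.sh script and test again, it gets slower: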

$ time ./using_sed_without_q.sh
real 15.69
user 14.69
sys 0.99