Is Cassandra a Column-Oriented or a Column-Family Database?

2018-08-07

1. Introduction

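Apache Cassandra is an open-source, distributed NoSQL database designed to handle large amounts of data across multiple data centers. Its data model is discussed in many documents and papers, and those discussions often end up confusing or contradictory. The reason is that Cassandra can store and access column families separately, which leads to it being misclassified as column-oriented rather than column-family.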

In this tutorial, we'll look at the differences between these data models and establish the nature of Cassandra's partitioned row store data model.

2. Database Data Models

The README file in the Apache Cassandra git repo states:

Cassandra is a partitioned row store. Rows are organized into tables with a required primary key.
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent matter. Cassandra will automatically repartition as machines are added and removed from the cluster.
Row store means that like relational databases, Cassandra organizes data by rows and columns.

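From this, we can conclude that Cassandra is a partitioned row store. However, column-family and wide-column are also suitable names, as we'll see below.

The column-family data model is not the same as the column-oriented model. A column-family database stores a row together with all of its column families, while a column-oriented database simply stores a data table by column rather than by row.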

2.1. Row-Oriented and Column-Oriented Data Stores

Let's take an Employees table as an example:

  ID         Last    First   Age
  1          Cooper  James   32
  2          Bell    Lisa    57
  3          Young   Joseph  45

A row-oriented database stores the above data as:

1,Cooper,James,32;2,Bell,Lisa,57;3,Young,Joseph,45;

A column-oriented database, by contrast, stores the data as:

1,2,3;Cooper,Bell,Young;James,Lisa,Joseph;32,57,45;
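
The two layouts above can be reproduced with a few lines of Python. This is only an illustrative sketch: the `employees` list and both function names are hypothetical, and a real storage engine writes binary pages or files rather than delimited strings.

```python
# Hypothetical in-memory copy of the Employees table from the example above.
employees = [
    (1, "Cooper", "James", 32),
    (2, "Bell", "Lisa", 57),
    (3, "Young", "Joseph", 45),
]


def serialize_row_oriented(rows):
    """Keep the values of one record together, one record after another."""
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)


def serialize_column_oriented(rows):
    """Keep the values of one column together, one column after another."""
    columns = zip(*rows)  # transpose: rows become columns
    return "".join(",".join(str(v) for v in col) + ";" for col in columns)


row_layout = serialize_row_oriented(employees)
column_layout = serialize_column_oriented(employees)
print(row_layout)     # 1,Cooper,James,32;2,Bell,Lisa,57;3,Young,Joseph,45;
print(column_layout)  # 1,2,3;Cooper,Bell,Young;James,Lisa,Joseph;32,57,45;
```

The only difference between the two functions is the transpose step, which is why reading a whole record is cheap in the first layout and scanning a single column is cheap in the second.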

Cassandra doesn't store its data the way either a row-oriented or a column-oriented database does.

2.2. Partitioned Row Store

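Cassandra uses a partitioned row store, which means that rows contain columns. A column-family database stores data as keys mapped to values, with the values grouped into multiple column families.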

In a partitioned row store, the Employees data looks like this:

"Employees" : {
           row1 : { "ID":1, "Last":"Cooper", "First":"James", "Age":32},
           row2 : { "ID":2, "Last":"Bell", "First":"Lisa", "Age":57},
           row3 : { "ID":3, "Last":"Young", "First":"Joseph", "Age":45},
           ...
     }

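A partitioned row store has rows that contain columns, but the number of columns doesn't have to be the same in every row (as in Bigtable). Some rows may have thousands of columns, while others may have only one.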
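
As a rough illustration of rows with differing column counts, the sketch below models each row as its own mapping from column name to value. The `sparse_rows` data and the `column_counts` helper are hypothetical and only mimic the idea; they say nothing about how Cassandra lays out data on disk.

```python
# Hypothetical rows: each row owns its columns, so rows need not share a shape.
sparse_rows = {
    "row1": {"ID": 1, "Last": "Cooper", "First": "James", "Age": 32},
    "row2": {"ID": 2, "Last": "Bell"},  # no First or Age column stored
    "row3": {"ID": 3},                  # a single column
}


def column_counts(rows):
    """Return how many columns are actually stored in each row."""
    return {row_key: len(columns) for row_key, columns in rows.items()}


counts = column_counts(sparse_rows)
print(counts)  # {'row1': 4, 'row2': 2, 'row3': 1}
```

A column that is absent from a row simply takes no space in this model, unlike a fixed-width table where every row would need a placeholder for it.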

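We can think of a partitioned row store as a two-dimensional key-value store, where a row key and a column key together are used to access data. To access the smallest unit of data (a column), we must first specify the row name (key) and then the column name.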
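
The two-step lookup described above can be sketched as a nested map. The `TwoDimensionalStore` class and its `put`/`get` methods are hypothetical names chosen for this example; this is a toy model of the access pattern, not Cassandra's API or storage engine.

```python
class TwoDimensionalStore:
    """Toy two-level map: row key -> column key -> value."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column_key, value):
        # Create the row on first write, then set the column inside it.
        self._rows.setdefault(row_key, {})[column_key] = value

    def get(self, row_key, column_key, default=None):
        # Resolve the row first, then the column inside that row.
        return self._rows.get(row_key, {}).get(column_key, default)


store = TwoDimensionalStore()
store.put("row1", "Last", "Cooper")
store.put("row1", "Age", 32)
store.put("row2", "Last", "Bell")

last_name = store.get("row1", "Last")  # row key, then column key
missing = store.get("row2", "Age")     # column never written for row2
print(last_name, missing)
```

Neither key is enough on its own: the row key selects a set of columns, and the column key selects a single value within that set.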