Data Cubes
DEFINITION: A data cube is a type of multidimensional matrix that lets users explore and analyze a collection of data from many different perspectives, usually considering three factors (dimensions) at a time.
When we try to extract information from a stack of data, we need tools to help us find what's relevant and what's important and to explore different scenarios. A report, whether printed on paper or viewed on-screen, is at best a two-dimensional representation of data, a table using columns and rows. That's sufficient when we have only two factors to consider, but in the real world we need more powerful tools.
Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another.
But data cubes aren't restricted to just three dimensions. Most online analytical processing (OLAP) systems can build data cubes with many more dimensions—Microsoft SQL Server 2000 Analysis Services, for example, allows up to 64 dimensions. We can think of a 4-D data cube as consisting of a series of 3-D cubes, though visualizing such higher-dimensional entities in spatial or geometric terms can be a problem.
In practice, therefore, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions.
Relational or Multidimensional?
Since data cubes are such a useful interpretation tool, most OLAP products are built around a structure in which the cube is modeled as a multidimensional array. These multidimensional OLAP, or MOLAP, products typically run faster than other approaches, primarily because it's possible to index directly into the data cube's structure to collect subsets of data.
However, for very large data sets with many dimensions, MOLAP solutions aren't always so effective. As the number of dimensions increases, the cube becomes sparser—that is, many cells representing specific attribute combinations are empty, containing no aggregated data. As with other types of sparse databases, this tends to increase storage requirements, sometimes to unacceptable levels. Compression techniques can help, but using them tends to destroy MOLAP's natural indexing. ?
Data cubes can be built in other ways. Relational OLAP uses the relational database model. The ROLAP data cube is implemented as a collection of relational tables (up to twice as many as the number of dimensions) instead of as a multidimensional array. Each of these tables, called a cuboid, represents a particular view.
Because the cuboids are conventional database tables, we can process and query them using traditional RDBMS techniques, such as indexes and joins. This format is likely to be efficient for large data collections, since the tables must include only data cube cells that actually contain data.
However, ROLAP cubes lack the built-in indexing of a MOLAP implementation. Instead, each record in a given table must contain all attribute values in addition to any aggregated or summary values. This extra overhead may offset some of the space savings, and the absence of an implicit index means that we must provide one explicitly.
From a structural perspective, data cubes are made up of two elements: dimensions and measures. Dimensions are already explained; measures are simply the actual data values.
It's important to keep in mind that the data in a data cube has already been processed and aggregated into cube form. Thus we normally don't perform calculations within a data cube. This also means that we're not looking at real-time, dynamic data in a data cube.
The data contained within a cube has already been summarized to show figures such as unit sales, store sales, regional sales, net sale profits and average time for order fulfillment. With this data, an analyst can efficiently analyze any or all of those figures for any or all products, customers, sales agents and more. Thus data cubes can be extremely helpful in establishing trends and analyzing performance. In contrast, tables are best suited to reporting standardized operational scenarios.
时文选读
数据立方体
定义:数据立方体是一类多维矩阵,让用户从多个角度探索和分析数据集,通常是一次同时考虑三个因素(维度)。
当我们试图从一堆数据中提取信息时,我们需要工具来帮助我们找到那些有关联的和重要的信息,以及探讨不同的情景。一份报告,不管是印在纸上的还是出现在屏幕上,都是数据的二维表示,是行和列构成的表格。在我们只有两个因素要考虑时,这就足矣,但在真实世界中我们需要更强的工具。
数据立方体是二维表格的多维扩展,如同几何学中立方体是正方形的三维扩展一样。 “立方体”这个词让我们想起三维的物体,我们也可以把三维的数据立方体看作是一组类似的互相叠加起来的二维表格。
但是数据立方体不局限于三个维度。大多数在线分析处理( OLAP)系统能用很多个维度构建数据立方体,例如,微软的SQL Server 2000 Analysis Services工具允许维度数高达64个(虽然在空间或几何范畴想像更高维度的实体还是个问题)。
在实际中,我们常常用很多个维度来构建数据立方体,但我们倾向于一次只看三个维度。数据立方体之所以有价值,是因为我们能在一个或多个维度上给立方体做索引。
关系的还是多维的?
由于数据立方体是一个非常有用的解释工具,所以大多数 OLAP产品都围绕着按多维阵列建立立方模型这样一个结构编制。这些多维的OLAP产品,即MOLAP产品,运行速度通常比其他方法更快,这是因为能直接把索引做进数据立方的结构,方便收集数据子集。
然而,对于非常大的多维数据集, MOLAP方案并不总是有效的。随着维度数目的增加,立方体变得更稀疏,即表示某些属性组合的多个单元是空的,没有集合的数据。相对于其他类型的稀疏数据库,数据立方体往往会增加存储需求,有时会达到不能接受的程度。压缩技术能有些帮助,但利用这些技术往往会破坏MOLAP的自然索引。
数据立方体还可以用其他的方法构建。关系 OLAP就利用了关系数据库模型。ROLAP数据立方体是按关系表格的集合实现的(最多可达维度数目的两倍),来代替多维阵列。其中的表格叫做立方单元,代表特定的视图。
由于立方单元是一个常规的数据库表格,所以我们能用传统的 RDBMS技术(如索引和连接)来处理和查询它们。这种形式对大量的数据集合可能是有效的,因为这些表格必须只能包含实际有数据的数据立方单元。
但是 ROLAP缺少了用MOLAP实现时所具有的内在索引功能。相反,给定表格中的每个记录必须包括所有的属性值而任何集合的或摘要的数据。这种额外的开销可能会抵消掉一些节省出来的空间,而隐性索引的缺少意味着我们必须提供显性的索引。
从结构角度看,数据立方体由两个单元构成:维度和测度。维度已经解释过了,测度就是实际的数据值。
记住这点是很重要的:数据立方体中的数据是已经过处理并聚合成立方形式。因此,通常不需要在数据立方体中进行计算。这也意味着我们看到数据立方体中的数据并不是实时的、动态的数据。
立方体中的数据已经过摘要,表示诸如计件销售、店面销售、区域销售、销售纯利和完成订单的平均时间等数据。有了这些数据,分析师能针对一个或全部产品、客户、销售代理等,就这些数字中的一个或全部进行分析。这样,在预测趋势和分析业绩时,数据立方体就非常有用,而表格最适合报告标准化的运作情况。