关于数据仓库的设计!
上一篇 / 下一篇 2008-02-13 17:16:09 / 个人分类:数据仓库专区
;Q4~i(NFk&a4iQ9BE0
Data Warehouse Concept
A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data designed to support management DSS needs.
1、Subject-Oriented
Subject-Oriented data is organized around major subject areas of an enterprise and is useful for an enterprise-wide understanding of those subjects. For Example, a banking operational system maintains independent records of customer savings, loans, and other transactions. A warehouse pulls this independent data together to provide financial information.
j9N0]$SWH iiR0E0The data from diverse sources is transformed so that it is consistent and meaningful for the warehouse.ITPUB个人空间iLkhv
So the main work is around the fact table.
x}j&n%C6W01.1 Original data from data source
oA?\Wt3J.d01.2 Dimension
P ]f7[Ctf4bx1]5K01.3MeasurementITPUB个人空间b0s*gt+Yl"`#`^.[K
1.4dimension granularity
gGdLG;L#H01.5Star model or snowflake modelITPUB个人空间1rC,du*h?z
1.6OLAP model
2、Integrated
Data on a given subject is integrated.ITPUB个人空间x6P&T)p1L*k
In many organizations, data resides in diverse independent systems, making it difficult to integrate the information into a single set of meaningful data from analysis. A key characteristic of a warehouse is that data is completely consolidated or integrated. Data is structured in a globally accepted manner, even when the underlying source data is structured differently (conforming dimension). Integration and transformation processes can be time-consuming and costly. It requires commitment from every part of the organization, particularly top-level managers who make the decisions and allocate resources and funds.
5IU9R ?X,Q+F6@0So the main work may be like :ITPUB个人空间b*Xh0Ta~)g$[
2.1 Universal dimension from different department.
Bs1q&zn }BK+K|02.2Dimension design include NULL value or violate constraintITPUB个人空间,K,Q9?[lT
2.3Define mid-exchange table2.4But we didn’t control the data quality caused by man-made
3、Nonvolatile
Typically, data in the data warehouse is read-only (less volatile than operational systems). Data is loaded into the data warehouse for the first-time load, and then refreshed regularly. Warehouse data is accessed by business users. Warehouse operations typically involve:ITPUB个人空间
D*A,iJ'k7sL,`sw L.M
Loading the initial set of warehouse data (often called the first-time load)ITPUB个人空间YN3qds
Refreshing the data regularly (called the refresh cycle)ITPUB个人空间A2x0jZ9r
So the main work is around:
UQC4lk0o:qe
X03.1 DTS design and scheduleITPUB个人空间7V@({(R7P a6t!Pt
3.2The dimension data is incremental, it is to say that they can be inserted and updated, but can’t be deleted. And dimension data must be unique.ITPUB个人空间2wn sD%c?
3.3 The original or fact data is incremental, but according to requirement, it can be updated or deleted under the control. The full process is danger and impossible when the data volume is too huge.
4、Time-Variant
Warehouse data is by nature historical; data is retained for a long time, from two to ten years, compared with one to three months of data for a typical operational system. The data allows for analysis of past and present trends, and for forecasting, using what-if scenarios.
ch_cgH1@|0Base the Time-Variant and data volume, we must consider :
rH+r3|8_?!P2H04.1 The design about the database4.2The design about the tableITPUB个人空间:Xn|x8|s?
4.3The OLAP increment
数据库设计,主要针对SQLServer2000而言,包括以下方面:
i'D
{}Op9n&kae0
Qz
m{Z4Z0数据库设计
4Y1Of*D8[|;F0文件组设计
2I!^)U%r7?.cns!E0历史数据表和当前数据表设计ITPUB个人空间P x6x4['F$]q!dJ
分区表设计ITPUB个人空间f4^#rP^
EL:|
数据库链接使用
M*[w@o l@#k0日志表
*b$nx3W]M,M G0增量数据抽取ITPUB个人空间O2X"Uk4p9z8I
维度数据抽取ITPUB个人空间L1?a1{3[lW6v\ PUS
原始数据抽取ITPUB个人空间(p6y/iqY ]g
DT
日结数据处理
5js5L-~v!t#|y0OLAP的增量处理和分区
;vGA2gN tWO/]V0数据库调优
数据库设计
数据库一般的分类方式有:ITPUB个人空间Voz%K%x1o p6k
按照业务来分(财务系统和销售系统)
.}*s,O
Q h"rU{0按照处理阶段来分(原始数据和日结数据)ITPUB个人空间@0^+LK(v
按照数据存储时间来分(当前数据和历史数据)
当然这些分类标准不是一成不变的也可以交叉分类,而且不同的关系数据库也不一致,如Oracle可以使用不同数据块大小的表空间存储不同数据,SQLServer和Sybase则采用不同的数据库实现对数据的存储。
3RwtEM5X2ma0建议数据库的分类按照不同的处理阶段进行数据存储,可以方便的进行数据库的备份和管理工作。如
-U3A)Maj|'[V0dataware_org 存储ODS层数据,保留一定期限的原始数据ITPUB个人空间#oCi@7P!Uv6p
dataware_fact 存储日结数据,可以较长时间的保留系统数据。
!q-n m j}+j0dataware_dim 存储维度基础数据。
待续........
关于文件组的设计
/xGidOC0
e3v
iD#fEdV0数据库可以按照以下几种情况进行文件组设计(其实同上,关键如何分类更加合理):ITPUB个人空间2Eu I:k{g
1、按照业务数据来源分类ITPUB个人空间$Aa,_ f
SF#Q"R"}\
2、按照ETL处理过程分类ITPUB个人空间
K.e^I9`#Tk ^?h-oN
3、按照数据的存储周期分类(历史数据还是临时数据)ITPUB个人空间 }d+^b9N
4、按照数据的物理存储类型分类(即索引还是数据)
vCLzE+N"c0ITPUB个人空间\%Va9|{ i(e{D/C
个人建议,采用按照物理存储类型和数据存储周期进行分库,如ITPUB个人空间
P \P fz_|3J J
历史数据文件组ITPUB个人空间0^,{:| }
Ih@i:g
临时表数据文件组ITPUB个人空间Y[0ucI
索引文件组
关于历史数据和临时数据的分开处理
+u%X+L7kFs5_ebrQum"@0
%T{
x%QI U2T
kM0众所周知,对于大数据量的数据存储任何数据库都与遭遇性能瓶颈。
)EQf&JJ]4Mm0因此建议对于大数据量的表采取分表处理:即将数据区分为临时数据和历史数据分开存储
1Rg9f6R?0
w3f2G(aA4f5{^'q:p0尽管可能会带来一些维护和处理上的不便,但是显而易见可以提升系统的性能。临时表中少量的数据可以有效地进行日结等处理,临时表数据需要进行定期地进行数据的转移工作。
+u&lN'n*cHI0还有一个更大的好处是,临时表和历史表放在不同的文件组或者数据库中,可以减少系统的IO冲突和备份周期的制定.
关于分月表的问题