APOSTILA_ADMDADOS_DW


  • 7/30/2019 APOSTILA_ADMDADOS_DW


    DATAPREV
    DNG - Diretoria de Negócios (Business Directorate)
    DETI.N - Departamento de Negócios, Tratamento de Informações (Business Department, Information Processing)

    Multidimensional Modeling for Data Warehouse

    Instructor: Roge Oliveira

    Contributors:

    Alfredo M. V. Martins

    Delmir Peixoto A Jr.

    CONTENTS

    Preface
    Course index
    Cross-reference table: Appendices x Chapters
    Appendices
    Exercise lists

    PREFACE

    This handout is a collection of publications on the subject of data warehousing. The authors of the respective articles are credited, and credit for the ideas belongs to them. Its purpose is to serve as a reference for the concepts presented in the course. Where possible, the articles are included directly in the body of the handout, some translated and others in the original. Those that are too long are not included, but can be found through the hyperlinks provided.

    The index of topics discussed in the course follows. A cross-reference table helps in choosing which articles best cover a given subject. At the end are the exercises proposed for development during the classes.

    COURSE INDEX

    1 - Introduction
    1.1 - Evolution of Databases
    1.2 - Data Warehousing Concepts
    1.3 - Exercises

    2 - Multidimensional Modeling
    2.1 - Definition of the elements used
    2.1.1 - Notation adopted
    2.1.2 - Fact Tables
    2.1.3 - Dimension Tables
    2.1.4 - Exercises
    2.2 - Comparison of approaches
    2.2.1 - Entity-Relationship Model
    2.2.2 - Star Schema Model
    2.2.3 - Snowflake Model
    2.2.4 - Exercises
    2.3 - Procedures for building a model
    2.3.1 - Starting from an E-R model
    2.3.2 - Starting from the queries to be answered
    2.3.3 - Demonstration of a practical case
    2.4 - Special cases
    2.4.1 - Surrogate keys and Slowly Changing Dimensions
    2.4.2 - NxM relationships
    2.4.3 - ODS
    2.4.4 - Factless fact tables

    3 - Model Integration
    3.1 - Designing a Data Warehouse
    3.2 - Designing independent Data Marts
    3.3 - Designing Data Marts integrated into a Data Warehouse
    3.4 - Exercises

    4 - Conclusion
    4.1 - Questions and final comments

    Cross-reference table: Appendices x Chapters

    Appendix  Title                                                        Chapters
    1         Dimensional Modeling and E-R Modeling In The Data Warehouse  2.2
    2         Managing Helper Tables                                       2.4.2
    3         There Are No Guarantees                                      2.2.1
    4         Design Principles for a Dimensional Data Warehouse           2.3.2
    5         Mapping Between the E/R and Star Models                      2 and 3
    6         Three Interesting Cases for Using Snowflakes                 2.2.3
    7         What Not To Do                                               2
    8         Data Mart Does Not Equal Data Warehouse                      3
    9         Data Warehouse Course                                        1, 2 and 3
    10        Strategies to Solutions: How to Implement a Data Warehouse   3
    11        The Anti-Architect                                           3
    12        Getting Started And Finishing Well                           3
    13        A Conceptual Modelling Perspective for DataWarehouses        1 and 2
    14        Information Strategy: Data Mart vs. Data Warehouse           3.3
    15        Business Intelligence                                        1
    16        Factless Fact Table                                          2.4.4
    17        Slowly Changing Dimensions                                   2.4.1
    18        Surrogate Keys                                               2.4.1
    19        Introduction: The Operational Data Store                     2.4.3
    20        Designing the ODS                                            2.4.3
    21        Relocating the ODS                                           2.4.3

    Appendix 1

    Dimensional Modeling and E-R Modeling In The Data Warehouse

    by Joseph M. Firestone, Ph.D.

    White Paper No. Eight
    June 22, 1998

    Introduction

    Dimensional Modeling (DM) is a favorite modeling technique in data warehousing. In DM, a model of tables and relations is constituted with the purpose of optimizing decision support query performance in relational databases, relative to a measurement or set of measurements of the outcome(s) of the business process being modeled. In contrast, conventional E-R models are constituted to (a) remove redundancy in the data model, (b) facilitate retrieval of individual records having certain critical identifiers, and (c) therefore, optimize On-line Transaction Processing (OLTP) performance.

    Practitioners of DM have approached developing a logical data model by selecting the business process to be modeled and then deciding what each individual low level record in the "fact table" (the grain of the fact table) will mean. The fact table is the focus of dimensional analysis. It is the table dimensional queries segment in the process of producing solution sets. The criteria for segmentation are contained in one or more "dimension tables" whose single part primary keys become foreign keys of the related fact table in DM designs. The foreign keys in a related fact table constitute a multi-part primary key for that fact table, which, in turn, expresses a many-to-many relationship. [1]

    In a DM, further, the grain of the fact table is usually a quantitative measurement of the outcome of the business process being analyzed, while the dimension tables are generally composed of attributes measured on some discrete category scale that describe, qualify, locate, or constrain the fact table quantitative measurements.

    Since a dimensional model is visually represented as a fact table surrounded by dimension tables, it is frequently called a star schema. Figure One is an illustration of a DM/star schema using a student academic fact database.
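
    Figure One is not reproduced here, so the following is only a minimal sketch of what such a star schema might look like, written in Python with the standard sqlite3 module. Every table and column name (student_dim, course_dim, term_dim, academic_fact and their columns) and every data row is a hypothetical stand-in, not taken from the paper. It shows the two properties described above: the fact table's multi-part primary key is built from the dimensions' single-part keys, and a dimensional query segments the fact table by a dimension attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: single-part primary keys, descriptive attributes (all hypothetical).
CREATE TABLE student_dim (student_key INTEGER PRIMARY KEY, name TEXT, major TEXT);
CREATE TABLE course_dim  (course_key  INTEGER PRIMARY KEY, title TEXT, department TEXT);
CREATE TABLE term_dim    (term_key    INTEGER PRIMARY KEY, term_name TEXT, year INTEGER);

-- Fact table: grain is one row per student, per course, per term.
-- Its multi-part primary key is made of the foreign keys to the dimensions.
CREATE TABLE academic_fact (
    student_key  INTEGER NOT NULL REFERENCES student_dim(student_key),
    course_key   INTEGER NOT NULL REFERENCES course_dim(course_key),
    term_key     INTEGER NOT NULL REFERENCES term_dim(term_key),
    credits      INTEGER NOT NULL,
    grade_points REAL    NOT NULL,
    PRIMARY KEY (student_key, course_key, term_key)
);

INSERT INTO student_dim VALUES (1, 'Ana', 'Math'), (2, 'Bruno', 'Physics');
INSERT INTO course_dim  VALUES (10, 'Calculus', 'Math'), (11, 'Mechanics', 'Physics');
INSERT INTO term_dim    VALUES (100, 'Fall', 1998);
INSERT INTO academic_fact VALUES
    (1, 10, 100, 4, 3.7),
    (1, 11, 100, 3, 3.0),
    (2, 11, 100, 3, 4.0);
""")

# A dimensional query: segment the fact table by a dimension attribute.
credits_by_department = conn.execute("""
    SELECT c.department, SUM(f.credits)
    FROM academic_fact AS f
    JOIN course_dim AS c ON c.course_key = f.course_key
    GROUP BY c.department
    ORDER BY c.department
""").fetchall()
print(credits_by_department)
```

    Inserting a second row with the same student, course, and term keys would violate the composite primary key, which is how the declared grain is enforced in this sketch.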


    While there is consensus in the field of data warehousing on the desirability of using DM/star schemas in developing data marts, there is an on-going controversy over the form of the data model to be used in the data warehouse. The "Inmonites" support a position identified with Bill Inmon, and contend that the data warehouse should be developed using an E-R model. The "Kimballites" believe in Ralph Kimball's view that the data warehouse should always be modeled using a DM/star schema. Indeed, Kimball has stated that while DM/star schemas have the advantages of greater understandability and superior performance relative to E-R models, their use involves no loss of information, because any E-R model can be represented as a set of DM/star schema models without loss of information.

    In this paper I will comment on two issues related to the controversy. First, the claim that any E-R model can be represented as an equivalent set of DM/star schema models [2], and second, the question of whether an E-R structured data warehouse, absent associative entities, i.e. fact tables, is a viable concept, given recent developments in data warehousing.

    Can DM Models Represent E-R Models?

    In a narrow technical sense, not every E-R model can be represented as a star schema or closely related dimensional model. It depends on the relationships in the conceptual model formalized by the logical data model.

    As Ralph Kimball has pointed out on numerous occasions, star schemas represent many-to-many relationships. If there are no many-to-many relationships in an underlying conceptual model, there is no opportunity to define a series of dimensional models. That is, the possibility of a dimensional model is associated with the presence of many-many relationships of whatever order. On the other hand, an E-R model can be defined whether or not many-many relationships exist. But without them it would have no fact tables.

    Having said the above, it really doesn't directly address the central question of whether an E-R data warehouse model can always be represented as a series of dimensional models. But it does shed some light on it. Specifically, the answer to the question depends on whether the underlying conceptual model of a data warehouse must always contain many-to-many relationships. I think the answer to this question is yes, and that it follows that an E-R data warehouse can be expressed as a star schema. Here are my reasons.

    (1) Data warehouses must contain "grain" attributes in the sense of the term specified by Ralph Kimball in The Data Warehouse Toolkit. This is a necessary conclusion for anyone who believes either in a queryable data warehouse, or in a data warehouse that will primarily serve as a feeder system for queryable data marts. In either case, the grain attributes must be available as part of the data warehouse, because they provide data on the extent to which any business is meeting its goals or objectives. Without such attributes, business performance can't be evaluated, and a primary DSS-related purpose of the data warehouse architecture can't be fulfilled.

    (2) If the grain attributes are present in the data warehouse, what kinds of relationships will be associated with them and what kinds of entities will contain them? In the underlying conceptual model of the data warehouse, there will be attributes that are causally related to the grain attributes, attributes that are effects of the grain attributes, and attributes such as product color, geographic level, and time period that are descriptive of the grain attributes. In the conceptual model, the grain attributes will be associated with many-many relations among these different classes of factors. How can these many-many relations be resolved in a formal model, whether E-R or dimensional?

    (3) The various causal, effect, and descriptive factors will be contained in fundamental entities, and perhaps in attributive entities, or sub-type entities as well. In a correct E-R or dimensional model, however, the entities containing the grain attributes can only be associative entities, because the grain attributes will not belong to any one fundamental entity in the model; but will be properties of a many-many relation (an n-ary association) among fundamental entities.

    Since fact tables are resolved many-many relations among fundamental entities, it follows that in a correct E-R model, fact tables are a necessary consequence of grain attributes and of standard E-R modeling rules requiring conceptual correctness and conceptual and syntactic completeness. It goes without saying that fact tables are also the means of resolving many-many relationships in dimensional models.

    (4) If fact tables must be present in correct E-R models, it still doesn't follow, however, that the fundamental entities related to them must be de-normalized dimension tables as specified in dimensional models. Here, in my view, is where the major distinction between dimensional and E-R data warehouse models will be found.

    In E-R models, normalization through addition of attributive and sub-type entities destroys the clean dimensional structure of star schemas and creates "snowflakes," which, in general, slow browsing performance. But in star schemas, browsing performance is protected by restricting the formal model to associative and fundamental entities, unless certain special conditions (pointed out in "Toolkit," and in Ralph Kimball's various DBMS columns) exist.

    So, that's it. In data warehouses, conventional E-R models and Star Schemas are both options, and this is due to the semantics of data warehouses as DSS applications requiring many-to-many relationships containing essential grain attributes. Kimball's position is therefore essentially correct: a data warehouse E-R model can be represented as a series of dimensional models. But this argument has an additional implication I'd like to see widely discussed.

    I emphasized earlier that both correct dimensional and E-R models rely on fact tables to resolve the many-many relations encompassing grain attributes that are so essential for the data warehouse. If this is true, then why are fact tables so frequently associated with dimensional data warehouse models and not with correct E-R data warehouse models? I suspect this may be because many E-R data warehouse models may not always explicitly recognize many-many relations and the need to resolve them with associative entities, i.e. fact tables. Instead, these models are being defined with fundamental entities containing some of the characteristics of associative entities but also carrying with them the risks of confusion, contradiction, and redundancy inherent in an incomplete resolution of many-to-many relationships, and ad hoc de-normalization of fundamental entities.

    I can't prove that this hunch of mine is valid, and that the problem in E-R data modeling I've inferred is widespread. But there are examples of the problem in the data warehousing literature. One good example is in the recent book by Silverston, Inmon, and Graziano (Wiley, 1997) [3], called "The Data Model Resource Book." Figure 10.2 on P. 266 presents a sample data warehouse data model. This data model contains no fact tables, but three tables come closest: CUSTOMER_INVOICES, PURCHASE_INVOICES, and BUDGET_DETAILS.

    Let's focus on CUSTOMER_INVOICES, which is typical of the three. The multi-part primary key is composed of: INVOICE_ID, and LINE_ITEM_SEQ.

    A number of foreign keys are included as mandatory attributes, but constitute no part of the primary key, and are not determined by it. These are: CUSTOMER_ID, SALES_REP_ID, and PRODUCT_CODE.

    Other mandatory attributes are: INVOICE_DATE, BILL_TO_ADDRESS_ID, MANAGER_REP_ID, ORGANIZATON_ID, ORG_ADDRESS_ID, QUANTITY, UNIT_PRICE, AMOUNT, and LOAD_DATE.

    An optional attribute is PRODUCT_COST.

    I believe that this entity diverges as much as it does from a fact table in a dimensional model, not because it is an E-R model-based entity, but because: (a) it fails to adequately model the conceptual distinction between customer invoice and customer sales, (b) doesn't recognize that unit price, amount, and quantity are attributes of a sale, related not only to an invoice but also to Sales Reps, Products, and Customers, and (c) in consequence doesn't correctly resolve the many-many relationship of Sales Reps, Customer Invoices, Products, and Customers. In short, the CUSTOMER_INVOICES entity, as constructed in the example, represents an error in the E-R model. That is why the QUANTITY, UNIT_PRICE, and AMOUNT attributes are not contained in a CUSTOMER_SALES associative entity, a true fact table, with a multi-part key drawn from SALES_REPS, CUSTOMER_INVOICES, PRODUCTS, and CUSTOMERS.
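
    As a rough illustration of the associative entity argued for above, the sketch below declares a CUSTOMER_SALES table whose multi-part key is drawn from SALES_REPS, CUSTOMER_INVOICES, PRODUCTS, and CUSTOMERS. The entity names and the key and measure column names come from the discussion; the column types, the descriptive columns (REP_NAME, CUSTOMER_NAME, PRODUCT_NAME), and the sample rows are assumptions made only so the example runs. It is written in Python with the standard sqlite3 module and is not the schema from either book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
-- Fundamental entities (descriptive columns are hypothetical).
CREATE TABLE SALES_REPS        (SALES_REP_ID INTEGER PRIMARY KEY, REP_NAME TEXT);
CREATE TABLE CUSTOMERS         (CUSTOMER_ID  INTEGER PRIMARY KEY, CUSTOMER_NAME TEXT);
CREATE TABLE PRODUCTS          (PRODUCT_CODE TEXT    PRIMARY KEY, PRODUCT_NAME TEXT);
CREATE TABLE CUSTOMER_INVOICES (INVOICE_ID   INTEGER PRIMARY KEY, INVOICE_DATE TEXT);

-- Associative entity (fact table): the key is made only of foreign keys,
-- and the grain attributes depend on the whole key.
CREATE TABLE CUSTOMER_SALES (
    SALES_REP_ID INTEGER NOT NULL REFERENCES SALES_REPS(SALES_REP_ID),
    INVOICE_ID   INTEGER NOT NULL REFERENCES CUSTOMER_INVOICES(INVOICE_ID),
    PRODUCT_CODE TEXT    NOT NULL REFERENCES PRODUCTS(PRODUCT_CODE),
    CUSTOMER_ID  INTEGER NOT NULL REFERENCES CUSTOMERS(CUSTOMER_ID),
    QUANTITY     INTEGER NOT NULL,
    UNIT_PRICE   REAL    NOT NULL,
    AMOUNT       REAL    NOT NULL,
    PRIMARY KEY (SALES_REP_ID, INVOICE_ID, PRODUCT_CODE, CUSTOMER_ID)
);

INSERT INTO SALES_REPS        VALUES (1, 'Rep A'), (2, 'Rep B');
INSERT INTO CUSTOMERS         VALUES (7, 'Acme');
INSERT INTO PRODUCTS          VALUES ('P1', 'Widget'), ('P2', 'Gadget');
INSERT INTO CUSTOMER_INVOICES VALUES (500, '1998-06-01'), (501, '1998-06-02');
INSERT INTO CUSTOMER_SALES VALUES
    (1, 500, 'P1', 7, 2, 10.0, 20.0),
    (1, 500, 'P2', 7, 1,  5.0,  5.0),
    (2, 501, 'P1', 7, 3, 10.0, 30.0);
""")

# The measures can now be analyzed along any of the four related entities.
amount_by_rep = conn.execute("""
    SELECT SALES_REP_ID, SUM(AMOUNT)
    FROM CUSTOMER_SALES
    GROUP BY SALES_REP_ID
    ORDER BY SALES_REP_ID
""").fetchall()
print(amount_by_rep)
```

    In contrast with the CUSTOMER_INVOICES entity criticized above, none of the foreign keys here sits outside the primary key, so QUANTITY, UNIT_PRICE, and AMOUNT are properties of the four-way association rather than of an invoice line.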


    This point is emphasized further by looking at the star schema design for sales analysis provided in Figure 11.1 on P. 271. This design is supposed to provide an example of a departmental specific data warehouse (or data mart). While this figure includes a CUSTOMER_SALES table that looks a lot like a fact table, it still reflects the conceptual confusion in the underlying model. Specifically, the multi-part key of this "fact table" includes INVOICE_ID and LINE_ITEM_SEQ as parts of the primary key. But neither attribute comes from a dimension table, nor are they degenerate dimension attributes, since they are part of the primary key. Instead they originate in the "fact table." And since from the previous CUSTOMER_INVOICES entity we know that INVOICE_ID and LINE_ITEM_SEQ constitute a unique primary key, it follows that CUSTOMER_SALES is not an associative entity or fact table at all, but instead is another fundamental entity, very similar to CUSTOMER_INVOICES, that again confuses the distinction between CUSTOMER_INVOICES and CUSTOMER_SALES.

    In short, Figure 11.1 is not a valid star schema design, as Figure 10.2 is not a valid E-R model, because neither the CUSTOMER_INVOICES entity in one, nor the CUSTOMER_SALES entity in the other, is an appropriately normalized entity whose non-key attributes are fully dependent on the primary key. If they were, they would present properly constructed associative entities resolving many-many relations including CUSTOMER_INVOICES and CUSTOMER_SALES.

    Again, how typical this example is of E-R modeling in data warehousing I can't say. That's the question I'd like to see more widely discussed. Is the widely perceived divergence between E-R and dimensional modeling in data warehousing due to the fact that dimensional modeling necessarily involves fact tables and E-R modeling normally does not, or is the perceived divergence due to the fact that E-R modeling practices in data warehousing are not faithful to E-R modeling principles; and if they were they would involve fact tables to exactly the same extent as dimensional models?

    Is An E-R Data Warehouse Model With No Fact Tables A Viable Concept?

    DM/Star schemas represent n-ary associations. N-ary associations are embodied in many-to-many relations. These may be resolved within a data model in an entity associating two or more entities. A star schema with one fact table (the associative entity) and two dimension tables represents a binary association. One with one fact table and three dimension tables represents a ternary association, and so on.

    As we have seen, E-R models can also represent n-ary associations. They differ from star schemas not in the presence of fact tables, but in the fact that their dimension tables are "snowflaked" to meet the requirements of normalization.

    Since star schemas and "snowflaked" E-R models represent n-ary associations, to say that another type of E-R model eliminating fact tables should be used to structure the data in the data warehouse is also to say that n-ary associations should not be used for this purpose. But n-ary associations are essential for analysis in the context of DBMS DSS applications, because analytical DSS queries employ many-to-many relationships and are frequently multi-stage in character. Many-to-many relationships can only be resolved in data models into (1) n-ary associations of various types with associative entities (fact tables), or (2) more atomic data dependency relationships in E-R models without fact tables. I think the second alternative ensures poor query response performance in large databases, and therefore discourages and often prevents execution of a multi-stage analysis process.

    It does so because it provides no structure for navigating the logic of the particular n-ary association implied by an analytical DSS query, and therefore requires that the DBMS engine construct the association "on the fly." In contrast, the first alternative provides a navigational structure for such a query, with consequent good query performance, and practical implementing of multistage analysis processes. Among associative models, however, a DM/Star design generally provides better navigation and performance than an E-R/Snowflake (in the absence of tools with special capability to handle the more complex snowflake model).

    If one accepts this argument (and if it's correct, 95% of it is in some way owed to Ralph Kimball, and if it's wrong, the correct 95% of it is still owed to Ralph Kimball), then the claim that dimensional modeling or "snowflaked" E-R models should not be employed in the data warehouse largely amounts to the claim that only the limited, constrained analysis supported by data dependency models without associative entities should be employed. That is, the data warehouse becomes no more than a big staging area for data marts, and has no independent analytical function of its own. I can't subscribe to this conclusion.

    After all, in recent data warehousing/data mart system architectures, we've added an Operational Data Store (ODS) [4], distinct from the data warehouse, and a non-queryable centralized staging area for storing extracted, cleansed, and transformed data and for gathering centralized metadata for implementing an Enterprise Data Mart Architecture (EDMA) [5]. Why then do we need yet another non-queryable staging area? Also, if the data warehouse is only a staging area and we can do analysis only in data marts, where do we go for enterprise-wide DSS?

    Conclusion

    In the context of the "Inmonite"/"Kimballite" dispute over the proper form of data warehouse data models, this paper examined: (1) the claim that any E-R model can be represented as an equivalent set of DM/star schema models; and (2) the question of whether an E-R structured data warehouse, absent associative entities, i.e. fact tables, is a viable concept given recent developments in data warehousing. A number of conclusions are supported by the arguments.

    - Not every E-R model can be represented as a set of star schemas containing equivalent information;
    - But every properly constructed E-R data warehousing model can be so represented;
    - Many E-R data warehouse models are not properly constructed in that they don't explicitly recognize many-many relations and the need to resolve them with associative entities, i.e. fact tables.
    - To use data warehousing E-R models specifying atomic data dependency relationships without fact tables is to ensure poor query response performance in large databases, and therefore discourage, and often prevent, execution of a multi-stage analysis process. In effect, it is to make the data warehouse no more than a big staging area for data marts, with no independent analytical function of its own.
    - Given the development of ODSs and non-queryable centralized staging areas for storing extracted, cleansed, and transformed data and for gathering centralized metadata for implementing an Enterprise Data Mart Architecture (EDMA), we don't need another non-queryable staging area called a data warehouse. What we do need, instead, is a dimensionally modeled data warehouse for enterprise-wide DSS, prepared to provide the best in query response performance and to support the most advanced OLAP [6] functionality we can devise.

    References

    [1] Ralph Kimball, The Data Warehouse Toolkit (New York, NY: John Wiley & Sons, Inc., 1996), Pp. 15-16
    [2] I thank Ralph Kimball for prodding myself and other participants in [email protected] list server group about the importance of examining this issue.
    [3] Len Silverston, W. H. Inmon, and Kent Graziano, The Data Model Resource Book (New York, NY: John Wiley & Sons, Inc., 1997)
    [4] W. H. Inmon, Claudia Imhoff, and Ryan Sousa, Corporate Information Factory (New York, NY: John Wiley & Sons, Inc., 1998), Pp. 87-100
    [5] Douglas Hackney, Understanding and Implementing Successful Data Marts (Reading, MA: Addison-Wesley, 1997), Pp. 52-54, 183-84, 257, 307-309
    [6] "What is OLAP?" The OLAP Report, revised February 19, 1998, http://www.olapreport.com/fasmi.htm


    Appendix 2

    Managing Helper Tables
    Author: Ralph Kimball
    Portuguese translation: Delmir Peixoto

    A careful look at many-to-many relationships between important dimensions.

    Multivalued dimensions are normally illegal in a dimensional design. Usually, once the grain of a fact table is declared, the only legal dimensions that can be attached to the fact table are those that take on a single value for that grain.

    For example, in banking, if the grain of the fact table is Account by Month, the Transaction dimension is excluded because it takes on many different values during the month. To see individual transactions, one would declare a finer grain, such as Account by Transaction by Time of Day.

    But every rule has exceptions. Sometimes, even when a dimension takes on multiple values in the presence of the fact table's grain, it is natural to attach the multivalued dimension to the fact table without changing the grain. It is very desirable, for example, to attach the Customer dimension to the bank fact table whose grain is Account by Month.

    The problem is that the number of customers associated with each account is open-ended. Someone may have a checking account in his own name, but he and his wife may also have a joint account. There may even be a family account with five or six customer names.

    The best way to handle multivalued dimensions is through a helper table, as shown in the following figure, where it is called the Account to Customer Map.

    Using Surrogate Keys

    The Account to Customer Map is a kind of fact table whose primary key (PK) is composed of multiple foreign keys (FK). The primary key in this example consists of the Account Key, the Customer Key, and the Begindate Key. An individual record in this table shows that a particular customer was part of a specific account during the interval defined by the begin and end dates. But this definition requires a careful look.

    It is very important that the customer and account foreign keys be surrogate keys referring to their respective dimensions, both of which are Type 2 slowly changing dimensions (SCDs). In other words, changes in the customer and account dimensions are tracked carefully, and new versions of records are continually issued in these dimensions to reflect the changes. In a Type 2 SCD, the natural keys for customer and account remain constant, but the surrogate keys change whenever a new record is inserted into the dimension.

    The helper table needs the surrogate keys so that the record of the customer's ownership of the account refers to the correct, current descriptions of the customer and account during the designated time interval. But this precision comes at a price: every time the customer or account undergoes a Type 2 change, a new record must be issued in the helper table to reflect the new key combinations. Thus the begin and end times in the helper table really refer to the span when the customer was part of the account and neither the customer description nor the account description had changed. Although this seems complicated, the next section shows that with twin timestamps one can write interesting queries without having to be a logic expert.
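
    The bookkeeping described above can be sketched as follows, in Python with the standard sqlite3 module. This is only an illustration under stated assumptions: the table layout, the ISO date strings, the '9999-12-31' placeholder for an open end date, the rule that the highest surrogate key is the current version, and the helper name apply_type2_change are all hypothetical and do not come from the article. The sketch shows a Type 2 change issuing a new customer record with a new surrogate key while the natural key stays constant, then closing the old map record and issuing a new one.

```python
import sqlite3
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # assumed placeholder for an open-ended end date

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customerkey INTEGER PRIMARY KEY,  -- surrogate key: new value for each version
    naturalid   TEXT NOT NULL,        -- natural key: constant across versions
    name        TEXT NOT NULL
);
CREATE TABLE map (
    accountkey  INTEGER NOT NULL,
    customerkey INTEGER NOT NULL REFERENCES customer(customerkey),
    begindate   TEXT NOT NULL,
    enddate     TEXT NOT NULL,
    PRIMARY KEY (accountkey, customerkey, begindate)
);
INSERT INTO customer VALUES (1, 'C-100', 'Maria Silva');
INSERT INTO map VALUES (10, 1, '2001-01-01', '9999-12-31');
""")

def apply_type2_change(conn, naturalid, new_name, change_date):
    """Issue a new customer version and re-issue its open map records."""
    # Assumption: the highest surrogate key for a natural key is the current version.
    old_key = conn.execute(
        "SELECT MAX(customerkey) FROM customer WHERE naturalid = ?", (naturalid,)
    ).fetchone()[0]
    new_key = conn.execute(
        "INSERT INTO customer (naturalid, name) VALUES (?, ?)", (naturalid, new_name)
    ).lastrowid
    day_before = (date.fromisoformat(change_date) - timedelta(days=1)).isoformat()
    open_rows = conn.execute(
        "SELECT accountkey FROM map WHERE customerkey = ? AND enddate = ?",
        (old_key, OPEN_END),
    ).fetchall()
    for (accountkey,) in open_rows:
        # Close the record that pointed at the old version ...
        conn.execute(
            "UPDATE map SET enddate = ? "
            "WHERE accountkey = ? AND customerkey = ? AND enddate = ?",
            (day_before, accountkey, old_key, OPEN_END),
        )
        # ... and issue a new record for the new key combination.
        conn.execute(
            "INSERT INTO map VALUES (?, ?, ?, ?)",
            (accountkey, new_key, change_date, OPEN_END),
        )
    return new_key

new_key = apply_type2_change(conn, "C-100", "Maria Souza", "2001-07-01")
map_rows = conn.execute("SELECT * FROM map ORDER BY begindate").fetchall()
print(new_key, map_rows)
```

    A Type 2 change to the account dimension would need the same treatment on the accountkey side; it is left out to keep the sketch short.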

    Using Twin Timestamps

    A list of the customers of an account called ABC123 at a particular point in time can be obtained with a very simple SQL query:

    SELECT customer.name
    FROM account, map, customer
    WHERE account.accountkey = map.Accountkey
      AND customer.customerkey = map.Customerkey
      AND account.naturalid = 'ABC123'
      AND '7/18/2001' BETWEEN map.begindate AND map.enddate

    This is not the usual way BETWEEN is written: it normally appears as field BETWEEN value AND value. This example uses the reverse form, value BETWEEN field AND field, which most modern relational databases, such as Oracle, support.
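
    For readers who want to try the query, here is a small runnable sketch in Python with the standard sqlite3 module. The table and column names follow the query above; everything else is an assumption for illustration: the sample rows, the helper name customers_on, the use of ISO date strings (so '7/18/2001' becomes '2001-07-18' and text comparison orders dates correctly), and '9999-12-31' as a stand-in for an open end date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account  (accountkey  INTEGER PRIMARY KEY, naturalid TEXT NOT NULL);
CREATE TABLE customer (customerkey INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE map (
    accountkey  INTEGER NOT NULL REFERENCES account(accountkey),
    customerkey INTEGER NOT NULL REFERENCES customer(customerkey),
    begindate   TEXT NOT NULL,   -- ISO dates (assumption), compared as text
    enddate     TEXT NOT NULL,
    PRIMARY KEY (accountkey, customerkey, begindate)
);
INSERT INTO account  VALUES (10, 'ABC123'), (11, 'XYZ999');
INSERT INTO customer VALUES (1, 'Ana'), (2, 'Bruno'), (3, 'Carla');
INSERT INTO map VALUES
    (10, 1, '2000-01-01', '9999-12-31'),
    (10, 2, '2000-01-01', '2001-03-31'),
    (10, 3, '2001-06-01', '9999-12-31'),
    (11, 2, '2001-04-01', '9999-12-31');
""")

def customers_on(conn, account_id, day):
    """Names of the customers attached to an account on a given day."""
    rows = conn.execute(
        """
        SELECT customer.name
        FROM account, map, customer
        WHERE account.accountkey = map.accountkey
          AND customer.customerkey = map.customerkey
          AND account.naturalid = ?
          AND ? BETWEEN map.begindate AND map.enddate
        ORDER BY customer.name
        """,
        (account_id, day),
    ).fetchall()
    return [name for (name,) in rows]

july_customers = customers_on(conn, "ABC123", "2001-07-18")
february_customers = customers_on(conn, "ABC123", "2001-02-01")
print(july_customers, february_customers)
```

    The value BETWEEN field AND field form is accepted by SQLite as well, so the predicate is carried over unchanged apart from the date format.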

    The disadvantage of using twin timestamps is that they complicate updating the helper tables. Every currently valid account map record has an open-ended ENDDATE, which is ugly. When a new record supersedes it, the ENDDATE has to be adjusted to the real value. The alternative of storing only the BEGINDATE makes the query much more complex. The query above would have to be changed to look for the greatest begin date less than or equal to the date of interest.


    Appendix 3

    There Are No Guarantees
    Author: Ralph Kimball
    Portuguese translation and summary: Delmir Peixoto

    Entity-relationship modeling is far from being a universal solution for data warehouse business rules.

    Business rules are the heart and soul of applications. If the systems obey the business rules, the data will be correct and the applications will work. But what exactly is a business rule? Where are business rules declared and enforced? They can occur at four levels:

    1. Simple field format definitions, enforced directly by the database.
       Example: The Payment field may be an amount interpreted as dollars.
    2. Multiple-field key relationships, enforced by key declarations residing in the database.
       Example: A product foreign key in a Sales table has a many-to-one relationship with the product primary key in the Product table.
    3. Relationships between entities, declared in an entity-relationship (E/R) diagram but not directly enforced by the database because the relationship is many-to-many.
       Example: Employee is a subtype of Person.
    4. Complex business logic, relating to business processes, and enforced perhaps only at data-entry time by a complex application.
       Example: When a security policy has been defined but not yet approved by the person responsible, the management date may be NULL, but once it is signed, the date must be valid and more recent than the date of the agreement that defined it.

    The core of database software manages only the first two levels: field format definitions and multiple-field key relationships. However, there is much more valuable business content in levels 3 and 4, relationships between entities and complex business logic.
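
    To make the first two levels concrete, the sketch below shows a database engine rejecting rows that break a field format rule (level 1) and a key relationship (level 2). It uses Python with the standard sqlite3 module; the product and sales tables, their columns, the CHECK expression, and the helper name accepted are assumptions chosen for illustration, loosely following the Payment and Sales/Product examples above. Nothing in it addresses levels 3 and 4, which is the article's point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE sales (
    sale_id     INTEGER PRIMARY KEY,
    -- Level 2: key relationship declared in the database (many sales to one product).
    product_key INTEGER NOT NULL REFERENCES product(product_key),
    -- Level 1: field format rule; the payment must be stored as a number.
    payment     REAL NOT NULL CHECK (typeof(payment) = 'real')
);
INSERT INTO product VALUES (1, 'Widget');
""")

def accepted(payment, product_key):
    """Return True if the database accepts the row, False if a rule rejects it."""
    try:
        conn.execute(
            "INSERT INTO sales (product_key, payment) VALUES (?, ?)",
            (product_key, payment),
        )
        return True
    except sqlite3.IntegrityError:
        return False

valid_row = accepted(19.90, 1)          # satisfies both rules
bad_format = accepted("abc", 1)         # breaks the level 1 format rule
bad_reference = accepted(5.0, 99)       # breaks the level 2 key relationship
print(valid_row, bad_format, bad_reference)
```

    A rule such as "the signing date must be more recent than the agreement date, unless it is still NULL" would typically live in application code or in separately written constraints, outside what an E/R diagram declares.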

    E/R modeling appears to be a comprehensive language for describing relationships between entities, but it is not.

    E/R modeling is a diagramming technique for specifying one-to-one, many-to-one, and many-to-many relationships between data elements. The E/R model is only a logical model; tools such as Computer Associates' ERwin convert an E/R diagram into data definition language (DDL) statements that declare key definitions and constraints between tables, enforcing the various kinds of relationships appropriately.

    Although E/R modeling is a useful technique for beginning the process of understanding business rules, it falls short on completeness and guarantees:

    - E/R modeling is incomplete. The entities and relationships in a given diagram represent only what the analyst decided to emphasize, or was told about. There is no test of an E/R model that determines whether the analyst has specified every possible one-to-one, many-to-one, or many-to-many relationship.
    - E/R modeling is not unique. A given set of data relationships can be represented by many different E/R diagrams.
    - Most data relationships are many-to-many. There are many varieties of many-to-many relationships, involving various conditions and degrees of correlation, that would be useful to include as business rules, but E/R modeling provides no extensions to the basic many-to-many declaration.
    - Most large E/R models are ideal, not real. Almost all corporate data models are an exercise in how things ought to be. They are a useful exercise for understanding the business, but unless it is physically populated with real data, the corporate data model is not worth using as the basis for a practical data warehouse implementation.
    - E/R models are rarely models of real data. A corollary of the previous point is that there are no tools that scan real data and then build E/R models. Almost always the E/R models are created first and the data is then fitted to the model. As a result, when dirty data arrives in the data staging area after being extracted from a primary production source, it cannot be loaded into the E/R model as if it were clean. It has to be cleaned. And, given the first two points of this list, even if the data is eventually cleaned and placed in the E/R model, there is no guarantee that the cleaning step is complete, unique, or has captured the data relationships that matter.
    - E/R models lead to absurdly complex schemas that lose sight of the original goal. Every programmer knows how complex an E/R model can become. The E/R models underlying Oracle Financials can easily require 2,000 tables, and the SAP model can easily require 10,000. These giant schemas are obstacles to the basic data warehouse goals of understandability and high performance.
    - The E/R model is completely unable to handle integrity constraints or business rules, except in a few special cases. Declarative rules are too complex to be captured as part of the business model and must be defined separately by the analyst/developer.

    E/R modeling is useful in transaction processing because it reduces data redundancy, and it is useful in a limited set of data cleaning activities, but it is far from being a comprehensive platform for data warehouse business rules.


    Annex 4

    Princípios de Projeto para um Data Warehouse Dimensional
    Maria Cláudia Cavalcanti
    Lawrence Zordam Klein
    Pablo Lopes Alenquer

    http://genesis.nce.ufrj.br/dataware/DataWarehouse/trabalhos_DW.html


    Annex 5

    Mapeamento Entre os Modelos E/R e Star
    Roberto Reis Monteiro Neto

    http://genesis.nce.ufrj.br/dataware/DataWarehouse/trabalhos_DW.html


    Visitors and customers are combined into a single logical dimension called Shopper. Each visitor or customer is given a single, permanent shopper ID, but the key of the table is a surrogate key, so that changes to the shopper can be tracked over time. The shopper dimension has the following attributes:

    - Shopper surrogate key
    - Shopper ID (fixed ID for each physical shopper)
    - Recency
    - Frequency

    Attributes of customers only:

    - 5 name attributes
    - 10 location attributes
    - 10 behavior attributes
    - 25 demographic attributes

    Note the importance of including recency and frequency as dimensional attributes rather than as facts, and of updating them over time. This decision makes the shopper dimension powerful: classic shopper segmentations can be run directly against the dimension, without navigating a fact table in a complex application.
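
As a minimal sketch of segmenting directly from the dimension, the example below builds a tiny shopper dimension in an in-memory SQLite database and groups shoppers by recency and frequency without touching any fact table. The table name, column names, thresholds, and sample rows are all assumptions made for illustration; they do not come from the original article.

```python
import sqlite3

# Hypothetical shopper dimension; names and sample rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE shopper_dim (
        shopper_key  INTEGER PRIMARY KEY,  -- surrogate key
        shopper_id   TEXT,                 -- fixed ID per physical shopper
        recency_days INTEGER,              -- days since the last visit
        frequency    INTEGER               -- visits in the tracking window
    )
""")
conn.executemany(
    "INSERT INTO shopper_dim VALUES (?, ?, ?, ?)",
    [
        (1, "S-001", 3, 12),
        (2, "S-002", 40, 1),
        (3, "S-003", 5, 9),
        (4, "S-004", 90, 2),
    ],
)

# Segment shoppers using only dimension attributes: no fact table is joined.
segments = conn.execute("""
    SELECT CASE
               WHEN recency_days <= 7 AND frequency >= 5 THEN 'active-loyal'
               ELSE 'other'
           END AS segment,
           COUNT(*) AS shoppers
    FROM shopper_dim
    GROUP BY segment
    ORDER BY segment
""").fetchall()
print(segments)
```

Because recency and frequency live in the dimension, the segmentation is a single-table query; the cost is that these attributes must be refreshed as shopper behavior changes.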

    Assuming that most of the last 50 customer attributes are textual, the total record width could be 500 bytes or more. Suppose there are 20 million shoppers (16 million visitors and 4 million registered customers). In that case, 80% of the records have the last 50 attributes empty. In a 10 GB dimension this percentage is significant.

    This is a clear case in which, depending on the database, introducing a snowflake is recommended. The dimension should be split into a base dimension and a snowflake subdimension. All visitors share a single record in the subdimension, which holds special null attribute values. See the figure below.


    In a database with fixed-width records, and under the previous assumptions, the base shopper dimension would be 20 million x 25 bytes = 500 MB, and the snowflake subdimension would be 4 million x 475 bytes = 1.9 GB. Compared with the original 10 GB dimension, the snowflake saves roughly 8 GB (10 GB - 2.4 GB = 7.6 GB).
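
The storage arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the record widths and row counts come from the text, while the use of decimal units (1 GB = 10**9 bytes) and the variable names are assumptions.

```python
# Figures taken from the text; decimal units (1 GB = 10**9 bytes) are an assumption.
shoppers = 20_000_000    # visitors plus registered customers
customers = 4_000_000    # registered customers only
full_width = 500         # bytes per record in the single wide dimension
base_width = 25          # bytes per record in the base dimension
sub_width = 475          # bytes per record in the snowflake subdimension

GB = 10**9
single_table = shoppers * full_width / GB
base = shoppers * base_width / GB
# One extra subdimension record is shared by all visitors; its size is negligible.
sub = (customers + 1) * sub_width / GB
saving = single_table - (base + sub)

print(f"single wide dimension: {single_table:.1f} GB")
print(f"base + subdimension:   {base + sub:.1f} GB")
print(f"saving:                {saving:.1f} GB")
```

The saving depends entirely on fixed-width storage; in a database that stores nulls compactly the benefit would be smaller, which is why the recommendation is qualified by "depending on the database".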

    Financial Product Dimensions

    Banks, brokerage houses, and insurance companies all struggle to model their product dimensions, because each individual product has many special attributes not shared by other products. A checking account looks little like a mortgage or a certificate of deposit. They all have different numbers of attributes.

    If one tries to build a single product dimension with the union of all possible attributes, the result is millions of attributes, many of them empty.

    The solution in this case is to build a context-dependent snowflake. Isolate the core attributes in a base product dimension table, and include in each base record a snowflake key that points to its own extended product subdimension. See the figure below.

    This solution is not a conventional relational join! The snowflake key must connect to the particular subdimension table that a specific product type defines.
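
One possible way to picture a context-dependent snowflake is sketched below with SQLite. The base table carries a product type and a snowflake key, and the product type decides which subdimension table the key is looked up in. All table names, column names, and rows are hypothetical, and the lookup function is just one way an application might resolve the join; the article does not prescribe an implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_base (
        product_key   INTEGER PRIMARY KEY,
        product_name  TEXT,
        product_type  TEXT,      -- decides which subdimension applies
        snowflake_key INTEGER    -- key into that type's subdimension
    );
    CREATE TABLE checking_sub (
        snowflake_key INTEGER PRIMARY KEY, monthly_fee REAL, overdraft_limit REAL
    );
    CREATE TABLE mortgage_sub (
        snowflake_key INTEGER PRIMARY KEY, term_years INTEGER, rate_type TEXT
    );
    INSERT INTO product_base VALUES (1, 'Basic Checking', 'checking', 10);
    INSERT INTO product_base VALUES (2, '30-Year Fixed',  'mortgage', 10);
    INSERT INTO checking_sub VALUES (10, 5.0, 500.0);
    INSERT INTO mortgage_sub VALUES (10, 30, 'fixed');
""")

# Table names come from this fixed map, never from user input.
SUBDIMENSION_BY_TYPE = {"checking": "checking_sub", "mortgage": "mortgage_sub"}

def extended_attributes(product_key):
    """Return the extended attributes of one product as a dict."""
    product_type, snowflake_key = conn.execute(
        "SELECT product_type, snowflake_key FROM product_base WHERE product_key = ?",
        (product_key,),
    ).fetchone()
    table = SUBDIMENSION_BY_TYPE[product_type]
    cursor = conn.execute(
        f"SELECT * FROM {table} WHERE snowflake_key = ?", (snowflake_key,)
    )
    columns = [col[0] for col in cursor.description]
    return dict(zip(columns, cursor.fetchone()))

print(extended_attributes(1))
print(extended_attributes(2))
```

Both sample products deliberately share snowflake key 10: the key alone is ambiguous, and only the product type tells the application which subdimension to read, which is why this is not a conventional foreign-key join.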


    Multi-Enterprise Calendar Dimensions

    Building a calendar dimension in a distributed data warehouse that spans multiple organizations is difficult, because each organization has its own fiscal periods, seasons, and holidays. Although a heroic effort can be made to reconcile incompatible calendar labels, one often wants to view the whole multi-enterprise environment from the perspective of just one of the organizations.

    Unlike the financial product dimensions, each of the calendars may have the same number of attributes describing fiscal periods, seasons, and holidays. But there may be hundreds of separate calendars. An international retailer may have to deal with a different calendar for each country.

    In this case, the snowflake design is modified so that the snowflake key joins to a single calendar subdimension (see the figure below). But the subdimension has higher cardinality than the base dimension! The key of the subdimension is the combination of the snowflake key and the organization key.

    In this situation, a single organization must be specified in the subdimension before the join between the tables is evaluated. When this is done correctly, the subdimension has a one-to-one relationship with the base dimension, as if the two tables were a single entity. The data warehouse of the multi-enterprise environment can then be queried through the calendar of any of its constituent organizations.

    Permissible Snowflakes

    These three examples show how variations of snowflake designs can be useful. When considering design alternatives, separate the physical aspects from the logical ones. Physical design drives performance. Logical design determines understandability. A snowflake can be used when it maximizes these two goals.
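
The calendar design described above can be sketched in SQLite as follows. The subdimension is keyed by snowflake key plus organization key, so it has more rows than the base dimension; constraining on one organization restores a one-to-one join. Table names, column names, organization codes, and dates are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calendar_base (
        date_key      INTEGER PRIMARY KEY,
        calendar_date TEXT,
        snowflake_key INTEGER
    );
    CREATE TABLE calendar_sub (
        snowflake_key INTEGER,
        org_key       TEXT,
        fiscal_period TEXT,
        is_holiday    INTEGER,
        PRIMARY KEY (snowflake_key, org_key)   -- composite key
    );
    INSERT INTO calendar_base VALUES (1, '2001-07-04', 1), (2, '2001-07-05', 2);
    INSERT INTO calendar_sub VALUES
        (1, 'US', 'FY2001-Q3', 1), (1, 'BR', 'FY2001-Q2', 0),
        (2, 'US', 'FY2001-Q3', 0), (2, 'BR', 'FY2001-Q2', 0);
""")

def calendar_for(org_key):
    """Join base and subdimension as seen by a single organization."""
    return conn.execute("""
        SELECT b.calendar_date, s.fiscal_period, s.is_holiday
        FROM calendar_base AS b
        JOIN calendar_sub AS s ON s.snowflake_key = b.snowflake_key
        WHERE s.org_key = ?
        ORDER BY b.date_key
    """, (org_key,)).fetchall()

us_rows = calendar_for("US")

# Without the organization constraint every base row matches one row per organization.
unconstrained = conn.execute("""
    SELECT COUNT(*)
    FROM calendar_base AS b
    JOIN calendar_sub AS s ON s.snowflake_key = b.snowflake_key
""").fetchone()[0]

print(us_rows)
print(unconstrained)
```

With the constraint, each of the two base rows matches exactly one subdimension row; without it the join returns four rows, which would multiply any facts joined through the calendar.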


    Annex 7

    What Not To Do
    Ralph Kimball

    http://www.rkimball.com/html/articles.html


    Annex 8

    Data Mart Does Not Equal Data Warehouse
    Bill Inmon

    Published in DM Direct in November 1999

    Portuguese translation: Alfredo M. V. Martins

    "The data warehouse is nothing more than the union of all the data marts...," Ralph Kimball, December 29, 1997.

    "You can catch all the minnows in the ocean and stack them together and they still do not make a whale," Bill Inmon, January 8, 1998.

    The most important challenge for the information technology manager this year is deciding whether to build the data warehouse first or to start with the data mart. Data mart vendors have claimed that data warehouses are hard and expensive to build, take a long time to design and develop, require thought and investment, and force the corporation to face difficult problems such as integrating legacy data, managing massive volumes of data, and justifying the cost of the DSS (decision support system)/data warehouse project to the management committee. The picture that data mart advocates paint of building a data warehouse is bleak. It is also self-serving and incorrect.

    Data mart vendors see the data warehouse as an obstacle between themselves and their sales revenue. Naturally, they want to keep the data warehouse from lengthening their sales cycle, regardless of the long-term effect of building a handful of data marts and no data warehouse. Data mart vendors are selling a very short-term perspective at the expense of long-term architectural success.

    Data mart advocates suggest that there may be alternative, much easier paths to successful decision support systems (DSS) than building a data warehouse. One of these paths is to build several data marts and, when they grow large enough, call them a data warehouse instead of building a real data warehouse. Data mart advocates argue that a data mart can be built much more quickly and cheaply than a warehouse. When a data mart is built, there is no need for a major organizational or disciplinary confrontation and no concern for the long-term architecture that the data marts create.

    Pg. 25

  • 7/30/2019 APOSTILA_ADMDADOS_DW

    26/69

    data mart incompatvel com outro data mart) e aplicaes de Sistemas de Apoio a Deciso(DSS) stovepipe so o resultado de se construir somente data marts. E um ambiente de Sistemade Apoio a Deciso sem integrao como um homem sem um esqueleto, dificilmente umaentidade vivel e til.

    A Change in Approach

    In the early days of the data warehouse market, data mart vendors tried to jump on the warehouse bandwagon by proclaiming that a data warehouse was the same thing as a data mart. In advertisement after advertisement, data mart vendors confused people with mistaken definitions of what a data warehouse is and what a data mart is. Data mart vendors spread half-truths and misinformation about data warehousing. The result was confusion.

    The confusion sown by the data mart vendors led some confused customers to build data marts with no real warehouse. After the third data mart, customers discovered that something was rotten in Denmark. The architectural deficiency of building only data marts was exposed. Customers discovered that when you do not build a data warehouse, there is:

    - massive redundancy of detailed and historical data from one data mart to another;
    - inconsistent and irreconcilable results from one data mart to another;
    - an unmanageable interface between the data marts and the legacy application environment;
    - and so on.

    In a short time, the world discovered that a DSS environment without a data warehouse was an extremely unsatisfactory reality.

    Now that the world has discovered that building only data marts is not the proper way to proceed in DSS, data mart vendors and their advertisers are back, sowing a different kind of confusion. This time they have changed their original words slightly and promise a new and improved path to easy success. In a slight shift from the original concept, the notion now being spread is that a data warehouse is merely a collection of integrated data marts (whatever that may be). The notion that multiple data marts can be integrated is paradoxical. The essential point about data marts is that their users build their own data store precisely so that they do not have to integrate it with other marts.

    Put simply, for a variety of very powerful reasons, you cannot build data marts, watch them grow, and magically turn them into a data warehouse when they reach a certain size. Likewise, integrating data across data marts is equally unthinkable, because each department that owns its data mart has its own specifications.


    To understand why one or more data marts cannot be turned into a data warehouse, you must first understand what a data mart is and what a data warehouse is.

    Different Architectural Structures

    A data mart and a data warehouse are essentially different architectural structures, even though they look alike when viewed from afar and superficially.

    What Is a Data Mart?

    A data mart is a collection of aggregated data organized to support decisions, based on the needs of a given department. Finance has its data mart, Marketing has its own, and so on. And the Marketing data mart bears only a vague resemblance to any other department's data mart.

    Most important, perhaps, is that the individual departments OWN the hardware, software, data, and programs that make up the data mart. These ownership rights allow the departments to bypass any attempt at control or discipline that could coordinate the data from the different departments.

    Each department has its own interpretation of what a data mart should look like, and each departmental data mart is unique and specific to its own needs. Typically, the database design for a data mart is built around a star-join structure that is optimal for the needs of the users in that department. To shape the star join, the requirements of the department's users must be gathered. The data mart contains only a little historical information and is granular only to the extent that suits the department's needs. The data mart is typically hosted on multidimensional technology, which is good for analytical flexibility but not optimal for large amounts of data. The data found in data marts are heavily indexed.

    There are two kinds of data marts: dependent and independent. A dependent data mart is one whose source is the data warehouse. An independent data mart is one whose source is the legacy application environment. All dependent data marts are fed by the same source, the data warehouse. Each independent data mart is fed uniquely and separately by the legacy application environment. Dependent data marts are architecturally and structurally sound. Independent data marts are unstable and architecturally unsound, at least for large-scale integration. The problem with independent data marts is that their deficiencies do not show up until the organization has built many data marts.

    What Is a Data Warehouse?

    Data warehouses are significantly different from data marts. Data warehouses are organized around the corporate subject areas found in the corporate data model. Normally the data warehouse is built and owned by centrally coordinated organizations, such as the classic information technology (IT) organization. The data warehouse represents a true corporate effort.

    There may or may not be a relationship between any department's subject areas and the corporation's subject areas. The data warehouse contains the most granular data the corporation has. Data mart data are usually much less granular than data warehouse data (that is, data warehouses contain much more detailed information, whereas most data marts contain more summarized or aggregated data). The data structure of the data warehouse is essentially a normalized structure. The structure and content of the data in a data warehouse do not reflect the pattern of any particular department but represent the data needs of the corporation. The volume of data found in a data warehouse is significantly different from the volume found in a data mart. Because of that volume, the data warehouse is lightly indexed. The data warehouse contains a large amount of historical data. The technology hosting the data warehouse is optimized for handling an industrial-strength amount of data. Data warehouse data are integrated from many legacy sources.

    In short, there are very significant differences between the structure and content of the data stored in a data warehouse and the structure and content of the data stored in a data mart.

    Figure 1 shows some of the differences between a data mart and a data warehouse.

    Because the data stored in a data warehouse are granular, integrated, and historical, the data warehouse attracts a significant volume of data. Because the warehouse attracts a significant volume of data, it is advisable to build it iteratively. If you do not build the warehouse iteratively, you will spend years building it.

    Ever since the first article on data warehousing was written, it has been recognized that there is an urgency to deliver concrete, tangible results to the end user as quickly as possible. The best advice of authors and consultants on building a data warehouse has been to build the warehouse quickly and to avoid long, drawn-out efforts.

    Interestingly, data mart advocates and their spokespeople claim that data warehouses take a long time to build. Only in the exaggerated rhetoric of data mart advocates is it suggested that the warehouse be built in gigantic proportions.

    Figure 2 shows the recommended construction path for data warehouses.

    The latest theory of data mart advocates is that you can build one or more data marts, integrate them (although no one is very clear about what this means), and then, when they grow to a certain size, (magically) turn them into a warehouse. Unfortunately, this suggestion is incorrect for a variety of reasons:

    - The data mart is designed to meet the needs of one department. Many departments with very different goals must be satisfied, which is why there are so many different data marts in the corporation, each with its own look and feel. The data warehouse is designed to meet the collective needs of the corporation as a whole. A given design may be optimal for a single department or for the corporation, but not for both. The design goals for the corporation are very different from the design goals for a given department.
    - The granularity of data in a data mart is very different from the granularity of data in a data warehouse. The data mart contains aggregated or summarized data. The data warehouse contains the most detailed data found in the corporation. Because the data mart's level of granularity is much higher (coarser) than the data warehouse's, you cannot easily decompose data mart granularity into data warehouse granularity. But you can always go in the opposite direction and summarize detailed units of data into aggregations.
    - The structure of data in a data mart (normally a star-join structure) is only remotely compatible with the structure of data in a warehouse (a normalized structure).
    - The amount of historical data found in a data mart is very different from the history found in a warehouse. Data warehouses contain a vast amount of history. Data marts contain only modest amounts of history.
    - The subject areas found in a data mart are only remotely related to the subject areas found in a data warehouse.
    - The relationships found in a data mart are not the relationships found in a data warehouse.
    - The queries served by a data mart are very different from those found in a data warehouse.
    - The type of users (farmers) found in data marts is quite different from the type of users (explorers) found in a data warehouse.
    - The key structures found in a data mart are significantly different from the key structures found in a data warehouse, and so on.
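
The granularity argument in the list above can be illustrated with a small sketch: detailed rows can always be rolled up into a summary, but two different sets of detail can produce the same summary, so the summary alone cannot be decomposed back into detail. The row layout and values below are hypothetical.

```python
from collections import defaultdict

# Hypothetical detailed (warehouse-grain) rows: (day, product, amount).
detail = [
    ("2001-01-01", "A", 10.0),
    ("2001-01-01", "B", 5.0),
    ("2001-01-02", "A", 7.0),
]

def summarize_by_product(rows):
    """Roll detailed rows up to one total per product (data mart grain)."""
    totals = defaultdict(float)
    for _day, product, amount in rows:
        totals[product] += amount
    return dict(totals)

mart = summarize_by_product(detail)
print(mart)

# A different set of detailed rows that rolls up to exactly the same summary.
other_detail = [
    ("2001-01-03", "A", 17.0),
    ("2001-01-03", "B", 5.0),
]
print(summarize_by_product(other_detail) == mart)
```

Since both detail sets yield the same per-product totals, nothing in the summarized data identifies which detail it came from; going from detail to summary is a computation, while the reverse direction requires information the summary no longer holds.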

    Reality

    There are simply too MANY significant differences between a data mart environment and a data warehouse environment. The claim that a data mart can be turned into a data warehouse when it reaches a certain size, or that data marts can be integrated together, is as invalid as claiming that a weed that grows large enough can be turned into an oak tree. Reality and genetics being what they are, it is true that a weed and an oak are, at some point in their lives, living green organisms planted in the soil and of about the same size. But just because the two plants share a few basic characteristics does not mean that a weed can turn into an oak. Only an uninformed person would mistake a weed for an oak at that stage of the plants' lives.


    Annex 9

    Curso de Data Warehouse
    Rubens Melo

    http://www.mcc.ufc.br/eti/etipages/eti2000/moddesc.htm#DWH


    Annex 10

    Strategies to Solutions: How to Implement a Data Warehouse
    Gary Clark

    http://www.dmreview.com/portal.cfm?NavID=91&EdID=660&PortalID=8&Topic=4


    Annex 11

    The Anti-Architect
    Ralph Kimball

    http://www.rkimball.com/html/articles.html


    Annex 13

    A Conceptual Modelling Perspective for Data Warehouses
    Jaroslav Pokorný
    Peter Sokolowsky

    http://wi99.iwi.uni-sb.de/Folien/Sek11_Pokorny.PDF


    Annex 14

    Information Strategy: Data Mart vs. Data Warehouse
    Jane Griffin
    Published in DM Review in February 1998.

    Printed from DMReview.com

    Do we need a single, enterprise-wide data warehouse, or are the information-intensive departments' data marts sufficient? This question is an industry debate and a common one for organizations that are considering an investment in an integrated information system. Data marts are often an attractive alternative to the mammoth job of implementing an enterprise-wide data warehouse.

    A data warehouse incorporates information about many subject areas--often the entire enterprise--while the data mart focuses on one or more subject areas. The data mart represents only a portion of an enterprise's data--perhaps data related to a business unit or work group. Typically, a data mart's data is targeted to a smaller audience of end users or used to present information on a smaller scope.

    The smaller-scale data mart is typically easier to build than the enterprise-wide warehouse; can be quickly implemented; and offers tremendous, fast payback for the users. The downside comes when several department-focused data marts are implemented with no forethought for a future data warehouse that serves the entire enterprise.

    What at first may seem like a quick and easy solution can cause a problem rather than solve it. Implementing several data marts to serve as reporting systems for individual departments can lead to data mart anarchy. Danger looms when individual departments select different hardware and software platforms, and the organization neglects to standardize and integrate information. This leaves the information technology (IT) department potentially supporting multiple databases, network operating systems and a variety of OLAP reporting tools.

    The ultimate goal with any integrated information system--whether it be a data mart or a data warehouse--is to provide consistent, accurate data about the organization to the users. Department-focused data marts have only the information that group needs. Each department has its own specific uses for a data mart, which often ignore the information needs of other areas.

    Having different departments with various data marts also escalates the number of problems and issues for the IT group to resolve. Unlike the enterprise-wide data warehouse, IT cannot manage and maintain these information stores from one central location. And one solution cannot address the myriad of problems that may arise from the data marts.

    Despite their potential pitfalls, data marts can pave the way for a large-scale IT investment in data warehousing. The key rests in designing and implementing a scalable technical infrastructure for the data marts that will allow the leveraging of information for an enterprise-wide data warehouse.


    A critical component of a scalable infrastructure involves using a standardized technical architecture across all data marts. Like building a data warehouse, the data mart's architecture should be stable, yet flexible. There are several key components that must be in place to ensure this flexibility and stability.

    One component is centralized, integrated meta data and consistent definitions. Such consistency will smooth the transition from data marts to data warehouse by making the individual systems compatible. All of the tools used to build the data marts, and eventually the data warehouse, must "speak" to each other. This communication is accomplished by selecting tools that have integrated meta data.

    The extraction, transformation and loading (ETL) tools selected for the data mart must transform data into common formats and integrate, match and index information from disparate sources. Using a standard technical platform--including a standard operating system, ETL, meta data management tool and reporting tool--can be an effective way to accomplish this task. Use of data marts with a standard infrastructure can offer unsurpassed business analysis and management capabilities.

    Data must ultimately be put into the hands of the people who are responsible for the achievement of business objectives and strategies. Issues to consider in information management are: data load times, synchronization, recovery, summarization levels, method of data security implementation, data distribution, data access and query speed, and ease of maintenance. All of these issues should be addressed when implementing the data mart. If not now, they will have to be considered when an enterprise-wide system is implemented.

    With these key components, organizations implementing data marts will be able to scale the technical architectures they put in place today into an enterprise-wide data warehouse to serve the information demands of tomorrow. While many of the issues are technical, the core issue of the data warehouse versus data mart is often political. Can IT deliver the data warehouse fast enough to meet the expectations and needs of the departments that are demanding them? How much does standardization slow down the organization?

    IT must wrestle with and overcome the time and standardization issues if they are to build a flexible, expansive, data warehousing architecture that meets the future needs of the organization.


    Annex 15

    Business Intelligence
    Valentim Silva

    http://www.dds.pt/docs/BI%20WhitePaper.pdf


    Annex 16

    Factless Fact Tables
    Two Types of Useful Fact Tables Contain No Facts At All.

    DBMS - September 1996

    Over the past year I have given many examples of fact tables in dimensional data warehouses. You should recall that fact tables are the large tables "in the middle" of a dimensional schema. Fact tables always have a multipart key, in which each component of the key joins to a single dimension table. Fact tables contain the numeric, additive fields that are best thought of as the measurements of the business, measured at the intersection of all of the dimension values.

    There has been so much talk about numeric additive values in fact tables that it may come as a surprise that two kinds of very useful fact tables don't have any facts at all! They may consist of nothing but keys. These are called factless fact tables. The first type of factless fact table is a table that records an event. Many event-tracking tables in dimensional data warehouses turn out to be factless. One good example is shown in Figure 1. Here you will track student attendance at a college. Imagine that you have a modern student tracking system that detects each student attendance event each day. With the heightened powers of dimensional thinking that you have developed over the past few months, you can easily list the dimensions surrounding the student attendance event. These dimensions include:

    - Date: one record in this dimension for each day on the calendar
    - Student: one record in this dimension for each student
    - Course: one record in this dimension for each course taught each semester
    - Teacher: one record in this dimension for each teacher
    - Facility: one record in this dimension for each room, laboratory, or athletic field


    http://www.dbmsmag.com/9609d05.html#figure1

    The grain of the fact table in Figure 1 is the individual student attendance event. When the student walks through the door into the lecture, a record is generated. It is clear that these dimensions are all well-defined and that the fact table record, consisting of just the five keys, is a good representation of the student attendance event. Each of the dimension tables is deep and rich, with many useful textual attributes on which you can constrain and from which you can form row headers in reports.

    The only problem is that there is no obvious fact to record each time a student attends a lecture or suits up for physical education. Tangible facts such as the grade for the course don't belong in this fact table. This fact table represents the student attendance process, not the semester grading process or even the midterm exam process. You are left with the odd feeling that something is missing.

    Actually, this fact table consisting only of keys is a perfectly good fact table and probably ought to be left as is. A lot of interesting questions can be asked of this dimensional schema, including:

    - Which classes were the most heavily attended?
    - Which classes were the most consistently attended?
    - Which teachers taught the most students?
    - Which teachers taught classes in facilities belonging to other departments?
    - Which facilities were the most lightly used?
    - What was the average total walking distance of a student in a given day?

    My only real criticism of this schema is the unreadability of the SQL. Most of the above queries end up as counts. For example, the first question starts out as:

    SELECT COURSE, COUNT(COURSE_KEY)
    FROM FACT_TABLE, COURSE_DIMENSION, ETC.
    WHERE ...
    GROUP BY COURSE

    In this case you are counting the course_keys non-distinctly. It is an oddity of SQL that you can count any of the keys and still get the same correct answer. For example:

    SELECT COURSE, COUNT(TEACHER_KEY)
    FROM FACT_TABLE, COURSE_DIMENSION, ETC.
    WHERE ...
    GROUP BY COURSE

    would give the same answer because you are counting the number of keys that fly by the query, not their distinct values. Although this doesn't faze a SQL expert (such as my fellow columnist Joe Celko), it does make the SQL look odd. For this reason, data designers will often add a dummy "attendance" field at the end of the fact table in Figure 1. The attendance field always contains the value 1. This doesn't add any information to the database, but it makes the SQL much more readable. Of course, select count (*) also works, but most query tools don't automatically produce the select count (*) alternative. The attendance field gives users a convenient and understandable place to make the query.


    Now your first question reads:

    SELECT COURSE, SUM(ATTENDANCE) FROM FACT_TABLE, COURSE_DIMENSION, ETC. WHERE ... GROUP BY COURSE
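The equivalence of these counting styles can be checked on a toy schema. The following is only a minimal sketch, assuming an in-memory SQLite database driven from Python; the table names (attendance_fact, course_dimension), the sample rows, and the helper attendance_by_course are hypothetical and are not taken from the article's Figure 1.

```python
import sqlite3

# Hypothetical miniature of the attendance schema: one dimension table and
# a factless fact table holding five keys plus the dummy "attendance" field.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course_dimension (course_key INTEGER PRIMARY KEY, course TEXT);
CREATE TABLE attendance_fact (
    date_key INTEGER, student_key INTEGER, course_key INTEGER,
    teacher_key INTEGER, facility_key INTEGER,
    attendance INTEGER DEFAULT 1   -- dummy fact, always 1
);
INSERT INTO course_dimension VALUES (1, 'Algebra'), (2, 'History');
INSERT INTO attendance_fact
    (date_key, student_key, course_key, teacher_key, facility_key)
VALUES (1, 101, 1, 11, 5), (1, 102, 1, 11, 5), (1, 103, 1, 11, 5),
       (1, 101, 2, 12, 6), (1, 102, 2, 12, 6);
""")

def attendance_by_course(aggregate):
    """Run the per-course attendance query with the given aggregate expression."""
    sql = f"""
        SELECT c.course, {aggregate}
        FROM attendance_fact f
        JOIN course_dimension c ON c.course_key = f.course_key
        GROUP BY c.course
        ORDER BY c.course
    """
    return conn.execute(sql).fetchall()

# Four spellings of the same question; each one counts fact rows per course.
by_course_key = attendance_by_course("COUNT(f.course_key)")
by_teacher_key = attendance_by_course("COUNT(f.teacher_key)")
by_star = attendance_by_course("COUNT(*)")
by_attendance = attendance_by_course("SUM(f.attendance)")
```

All four variables should hold the same list of (course, count) pairs. The agreement relies on the key columns never being NULL, since COUNT(column) skips NULL values.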

    You can think of these kinds of event tables as recording the collision of keys at a point in space and time. Your table simply records the collisions that occur. (Automobile insurance companies often literally record collisions this way.) In this case, the dimensions of the factless fact table could be:

    Date of Collision
    Insured Party
    Insured Auto
    Claimant
    Claimant Auto
    Bystander Witness
    Claim Type

    Like the college course attendance example, this collision database could answer many interesting questions. The author has designed a number of collision databases, including those for both automobiles and boats. In the case of boats, a variant of the collision database required a "dock" dimension as well as a boat dimension.

    A second kind of factless fact table is called a coverage table. A typical coverage table is shown in Figure 2. Coverage tables are frequently needed when a primary fact table in a dimensional data warehouse is sparse. Figure 2 also shows a simple sales fact table that records the sales of products in stores on particular days under each promotion condition. The sales fact table does answer many interesting questions but cannot answer questions about things that didn't happen. For instance, it cannot answer the question, "Which products were on promotion that didn't sell?" because it contains only the records of products that did sell. The coverage table comes to the rescue. A record is placed in the coverage table for each product in each store that is on promotion in each time period. Notice that you need the full generality of a fact table to record which products are on promotion. In general, which products are on promotion varies by all of the dimensions of product, store, promotion, and time. This complex many-to-many relationship must be expressed as a fact table. This is one of Kimball's Laws: Every many-to-many relationship is a fact table, by definition.

    Perhaps some of you would suggest just filling out the original fact table with records representing zero sales for all possible products. This is logically valid, but it would expand the fact table enormously. In a typical grocery store, only about 10 percent of the products sell on any given day. Including all of the zero sales could increase the size of the database by a factor of ten. Remember, too, that you would have to carry all of the additive facts as zeros. Because many big grocery store sales fact tables approach a billion records, this would be a killer. Besides, there is something obscene about spending large amounts of money on disk drives to store zeros.


    The coverage factless fact table can be made much smaller than the equivalent set of zeros described in the previous paragraph. The coverage table must only contain the items on promotion; the items not on promotion that also did not sell can be left out. Also, it is likely for administrative reasons that the assignment of products to promotions takes place periodically, rather than every day. Often a store manager will set up promotions in a store once each week. Thus we don't need a record for every product every day. One record per product per promotion per store each week will do. Finally, the factless format keeps us from storing explicit zeros for the facts as well.

    Answering the question, "Which products were on promotion that did not sell?" requires a two-step application. First, consult the coverage table for the list of products on promotion on that day in that store. Second, consult the sales table for the list of products that did sell. The desired answer is the set difference between these two lists of products.
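The two-step application can be sketched as a single set-difference query. This is an illustration only, assuming an in-memory SQLite database driven from Python; the table layouts (promotion_coverage, sales_fact), the shared time_key, the sample rows, and the helper promoted_but_unsold are hypothetical and are not taken from Figure 2.

```python
import sqlite3

# Hypothetical coverage and sales tables sharing time, store, promotion and
# product keys. Product 103 sells in store 1 without being on promotion.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE promotion_coverage (
    time_key INTEGER, store_key INTEGER, promotion_key INTEGER, product_key INTEGER
);
CREATE TABLE sales_fact (
    time_key INTEGER, store_key INTEGER, promotion_key INTEGER,
    product_key INTEGER, dollars REAL
);
INSERT INTO promotion_coverage VALUES
    (1, 1, 10, 100), (1, 1, 10, 101), (1, 1, 10, 102), (1, 2, 10, 101);
INSERT INTO sales_fact VALUES
    (1, 1, 10, 100, 12.50),
    (1, 1, 0, 103, 4.00),
    (1, 2, 10, 101, 7.25);
""")

def promoted_but_unsold(time_key, store_key):
    """Products on promotion in the given period and store that did not sell."""
    rows = conn.execute("""
        SELECT product_key FROM promotion_coverage
        WHERE time_key = :time_key AND store_key = :store_key
        EXCEPT
        SELECT product_key FROM sales_fact
        WHERE time_key = :time_key AND store_key = :store_key
        ORDER BY product_key
    """, {"time_key": time_key, "store_key": store_key}).fetchall()
    return [product_key for (product_key,) in rows]

unsold_store_1 = promoted_but_unsold(1, 1)
unsold_store_2 = promoted_but_unsold(1, 2)
```

Here the SQL EXCEPT operator plays the role of the set difference. With a weekly coverage grain and a daily sales grain, as the text suggests is common, the sales side would first have to be rolled up to the coverage period; the sketch sidesteps that by giving both tables the same time_key.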

    Coverage tables are also useful for recording the assignment of sales teams to customers in businesses in which the sales teams make occasional very large sales. In such a business, the sales fact table is too sparse to provide a good place to record which sales teams were associated with which customers. The sales team coverage table provides a complete map of the assignment of sales teams to customers, even if some of the combinations never result in a sale.


    FIGURE 1

    -- A factless fact table for recording student attendance on a daily basis at a college. The five dimension tables contain rich descriptions of dates, students, courses, teachers, and facilities. There are no additive, numeric facts.

    FIGURE 2

    -- A factless coverage table used in conjunction with an ordinary sales fact table to answer the question, "Which products were on promotion that did not sell?"

    Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet web page at http://www.rkimball.com.


    Annex 17

    Slowly Changing Dimensions

    Unlike OLTP Systems, Data Warehouses Can Track Historical Data.

    DBMS, April 1996


    One major difference between an OLTP system and a data warehouse is the ability to accurately describe the past. OLTP systems are usually very poor at correctly representing a business as of a month or a year ago. A good OLTP system is always evolving. Orders are being filled and, thus, the order backlog is constantly changing. Descriptions of products, suppliers, and customers are constantly being updated, usually by overwriting. The large volume of data in an OLTP system is typically purged every 90 to 180 days. For these reasons, it is difficult for an OLTP system to correctly represent the past. In an OLTP system, do you really want to keep old order statuses, product descriptions, supplier descriptions, and customer descriptions over a multiyear period?

    The data warehouse must accept the responsibility of accurately describing the past. By doing so, the data warehouse simplifies the responsibilities of the OLTP system. Not only does the data warehouse relieve the OLTP system of almost all forms of reporting, but the data warehouse contains special structures that have several ways of tracking historical data. (OLTP systems produce "flash reports" for management, and the people who run OLTP systems are proud of that capability. But beyond these simple daily and weekly summaries and counts, the OLTP environment is a very costly environment in which to do any kind of complex reporting. Whether an OLTP shop likes it or not, the economics of reporting favor the data warehouse.)


    A dimensional data warehouse database consists of a large central fact table with a multipart key. This fact table is surrounded by a single layer of smaller dimension tables, each containing a single primary key. In a dimensional database, these issues of describing the past mostly involve slowly changing dimensions. A typical slowly changing dimension is a product dimension in which the detailed description of a given product is occasionally adjusted. For example, a minor ingredient change or a minor packaging change may be so small that production does not assign the product a new SKU number (which the data warehouse has been using as the primary key in the product dimension), but nevertheless gives the data warehouse team a revised description of the product. The data warehouse team faces a dilemma when this happens. If they want the data warehouse to track both the old and new descriptions of the product, what do they use for the key? And where do they put the two values of the changed ingredient attribute?

    Other common slowly changing dimensions are the district and region names for a sales force. Every company that has a sales force reassigns these names every year or two. This is such a common problem that this example is something of a joke in data warehousing classes. When the teacher asks, "How many of your companies have changed the organization of your sales force recently?" everyone raises their hands.

    There are three main techniques for handling slowly changing dimensions in a data warehouse: overwriting, creating another dimension record, and creating a current value field. Each technique handles the problem differently. The designer chooses among these techniques depending on the users' needs.

    Overwriting

    The first technique is the simplest and fastest. But it doesn't maintain past history! Nevertheless, overwriting is frequently used when the data warehouse team legitimately decides that the old value of the changed dimension attribute is not interesting. For example, if you find incorrect values in the city and state attributes in a customer record, then overwriting would almost certainly be used. After the overwrite, certain old reports that depended on the city or state values would not return exactly the same values. Most of us would argue that this is the correct outcome.


    Creating Another Dimension Record

    The second technique is the most common and has a number of powerful advantages. Suppose you work in a manufacturing company and one of your main data warehouse schemas is the company's shipments. The product dimension is one of the most important dimensions in this dimensional schema. (See Figure 1.) A typical product dimension in a shipments schema would have several thousand detailed records, each representing a distinguishable product capable of being shipped. A good product dimension table would have at least 50 attributes describing the products, including hierarchical attributes such as brand and category, as well as nonhierarchical attributes such as flavor and package type. An important attribute provided by manufacturing operations is the SKU number assigned to the product. You should start by using the SKU number as the key to the product dimension table.

    Suppose that manufacturing operations makes a slight change in packaging of SKU #38, and the packaging description changes from "glued box" to "pasted box." Along with this change, manufacturing operations decides not to change the SKU number of the product, or the bar code (UPC) that is printed on the box. If the data warehouse team decides to track this change, the best way to do this is to issue another product record, as if the pasted box version were a brand new product. The only difference between the two product records is the packaging description. Even the SKU numbers are the same. The only way you can issue another record is if you generalize the key to the product dimension table to be something more than the SKU number. A simple technique is to use the SKU number plus two or three version digits. Thus the first instance of the product key for a given SKU might be SKU# + 01. When, and if, another version is needed, it becomes SKU# + 02, and so on. Notice that you should probably also park the SKU number in a separate dimension attribute (field) because you never want an application to be parsing the key to extract the underlying SKU number. Note the separate SKU attribute in the Product dimension in Figure 1.

    This technique for tracking slowly changing dimensions is very powerful because new dimension records automatically partition history in the fact table. The old version of the dimension record points to all history in the fact table prior to the change. The new version of the dimension record points to all history after the change. There is no need for a timestamp in the product table to record the change. In fact, a timestamp in the dimension record may be meaningless because the event of interest is the actual use of the new product type in a shipment. This is best recorded by a fact table record with the correct new product key.

    Another advantage of this technique is that you can gracefully track as many changes to a dimensional item as you wish. Each change generates a new dimension record, and each record partitions history perfectly. The main drawbacks of the technique are the requirement to generalize the dimension key, and the growth of the dimension table itself.
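The way a new dimension record partitions the fact table can be sketched on the SKU #38 example. This is a minimal illustration, assuming an in-memory SQLite database driven from Python; the table names, the sample shipments, and the helper add_product_version are hypothetical, and the key is built as the SKU number followed by two version digits as the text suggests.

```python
import sqlite3

# Hypothetical product dimension with a generalized key and a separate SKU
# attribute, plus a tiny shipments fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dimension (
    product_key TEXT PRIMARY KEY,  -- SKU number plus two version digits
    sku         INTEGER,           -- SKU parked in its own attribute
    pkg_type    TEXT
);
CREATE TABLE shipments_fact (ship_date TEXT, product_key TEXT, quantity INTEGER);
""")

def add_product_version(sku, pkg_type):
    """Issue another dimension record for the SKU and return its new key."""
    (versions,) = conn.execute(
        "SELECT COUNT(*) FROM product_dimension WHERE sku = ?", (sku,)
    ).fetchone()
    product_key = f"{sku}{versions + 1:02d}"
    conn.execute(
        "INSERT INTO product_dimension VALUES (?, ?, ?)",
        (product_key, sku, pkg_type),
    )
    return product_key

# Shipments before the packaging change use the first version of the key.
old_key = add_product_version(38, "glued box")
conn.execute("INSERT INTO shipments_fact VALUES ('1996-01-15', ?, 100)", (old_key,))

# After the change, new shipments carry the new key; old rows are untouched.
new_key = add_product_version(38, "pasted box")
conn.execute("INSERT INTO shipments_fact VALUES ('1996-03-01', ?, 40)", (new_key,))
conn.execute("INSERT INTO shipments_fact VALUES ('1996-03-02', ?, 60)", (new_key,))

JOIN = "FROM shipments_fact f JOIN product_dimension p ON p.product_key = f.product_key"
by_pkg_type = conn.execute(
    f"SELECT p.pkg_type, SUM(f.quantity) {JOIN} GROUP BY p.pkg_type ORDER BY p.pkg_type"
).fetchall()
by_sku = conn.execute(
    f"SELECT p.sku, SUM(f.quantity) {JOIN} GROUP BY p.sku"
).fetchall()
```

Grouping by pkg_type separates the shipments made before and after the change without any timestamp in the dimension, while grouping by the separate sku attribute reunites them, with no application having to parse the key.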


    Creating a Current Value Field

    You use the third technique when you want to track a change in a dimension value, but it is legitimate to use the old value both before and after the change. This situation occurs most often in the infamous sales force realignments, where although you have changed the names of your sales regions, you still have a need to state today's sales in terms of yesterday's region names, just to "see how they would have done" using the old organization. You can attack this requirement, not by creating a new dimension record as in the second technique, but by creating a new "current value" field.

    Suppose in a sales team dimension table, where the records represent sales teams, you have a field called "region." When you decide to rearrange the sales force and assign each team to newly named regions, you create a new field in the sales dimension table called "current_region." You should probably rename the old field "previous_region." (See Figure 2.) No alterations are made to the sales dimension record keys or to the number of sales team records. These two fields now allow an application to group all sales fact records by either the old sales assignments (previous region) or the new sales assignments (current region). This schema allows only the most recent sales force change to be tracked, but it offers the immense flexibility of being able to state all of the history by either of the two sales force assignment schemas. It is conceivable, although somewhat awkward, to generalize this approach to the two most recent changes. If many of these sales force realignments take place and it is desired to track them all, then the second technique should probably be used.
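Grouping the same fact records by either field can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite database driven from Python; the team names, region names, sales figures, and the helper sales_by are hypothetical, and the dimension is created directly in its post-realignment shape rather than altered in place.

```python
import sqlite3

# Hypothetical sales team dimension after a realignment: the keys and the
# number of team records are unchanged, only the second region field is new.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_team_dimension (
    sales_team_key  INTEGER PRIMARY KEY,
    team_name       TEXT,
    previous_region TEXT,  -- the old "region" field, renamed
    current_region  TEXT   -- new field added at the realignment
);
CREATE TABLE sales_fact (sales_team_key INTEGER, dollars INTEGER);

INSERT INTO sales_team_dimension VALUES
    (1, 'Team A', 'East', 'Northeast'),
    (2, 'Team B', 'East', 'Southeast'),
    (3, 'Team C', 'West', 'West');
INSERT INTO sales_fact VALUES (1, 100), (2, 200), (3, 300), (1, 50);
""")

REGION_FIELDS = ("previous_region", "current_region")

def sales_by(region_field):
    """Total all sales history by the chosen region assignment."""
    if region_field not in REGION_FIELDS:
        raise ValueError(f"unknown region field: {region_field}")
    sql = f"""
        SELECT t.{region_field}, SUM(f.dollars)
        FROM sales_fact f
        JOIN sales_team_dimension t ON t.sales_team_key = f.sales_team_key
        GROUP BY t.{region_field}
        ORDER BY t.{region_field}
    """
    return conn.execute(sql).fetchall()

old_view = sales_by("previous_region")
new_view = sales_by("current_region")
```

Both views cover every fact record, which is the flexibility the text describes. Only one earlier assignment is retained, so a second realignment would overwrite previous_region unless further fields were added.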

    Choosing a Technique

    The second and third techniques described here will handle the great majority of applications with slowly changing dimensions. The second technique, creating another dimension record, works very well for dimension tables with up to several hundred thousand records. Even the addition of many new records to these moderately large dimensions will not compromise performance in a DBMS with good indexing techniques, such as bit vector indexing. However, eventually a point may be reached in very large dimensions, such as multimillion record customer lists, where the second technique cannot be used. In this case, you are forced to resort to a cruder technique, appropriate for Monster Dimensions. This will be the subject of my column next month.


    Figure 1.

    -- A typical manufacturing shipments schema with five dimensions, showing the Product dimension expanded. In this article I show how to track a meaningful change of the package type (pkg_type) attribute over time when the OLTP system refuses to change the master product key (SKU #).

    Figure 2.

    -- A typical Sales Team dimension for almost any company that sells products. In this article I show how to track a change in the region attribute when you need to see both the old and new versions of the attribute over all historical data.

    Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet web page at http://www.rkimball.com.


    Annex 18

    Surrogate Keys

    Keep control over record identifiers by generating new keys for the data warehouse

    DBMS - May 1998

    According to the Webster's Unabridged Dictionary, a surrogate is an "artificial or synthetic product that is used as a substitute for a natural product." That's a great definition for the surrogate keys we use in data warehouses. A surrogate key is an artificial or synthetic key that is used as a substitute for a natural key.

    Actually, a surrogate key in a data warehouse is more than just a substitute for a natural key. In a data warehouse, a surrogate key is a necessary generalization of the natural production key and is one of the basic elements of data warehouse design. Let's be very clear: Every join between dimension tables and fact tables in a data warehouse environment should be based on surrogate keys, not natural keys. It is up to the data extract logic to systematically look up and replace every incoming natural key with a data warehouse surrogate key each time either a dimension record or a fact record is brought into the data warehouse environment.

    In other words, when we have a product dimension joined to a fact table, or a customer dimension joined to a fact table, or even a time dimension joined to a fact table, as shown in Figure 1, the actual physical keys on either end of the joins are not natural keys directly derived from the incoming data. Rather, the keys are surrogate keys that are just anonymous integers. Each one of these keys should be a simple integer, starting with one and going up to the highest number that is needed. The product key should be a simple integer, the customer key should be a simple integer, and even the time key should be a simple integer. None of the keys should be:

    Smart, where you can tell something about the record just by looking at the key

    Composed of natural keys glued together

    Implemented as multiple parallel joins between the dimension table and the fact table; so-called double- or triple-barreled joins.

    If you are a professional DBA, I probably have your attention. If you are new to data warehousing, you are probably horrified. Perhaps you are saying, "But if I know what my underlying key is, all my training suggests that I make my key out of the data I am given." Yes, in the production transaction processing environment, the meaning of a product key or a customer key is directly related to the record's content. In the data warehouse environment, however, a dimension key must be a generalization of what is found in the record.

    As the data warehouse manager, you need to keep your keys independent from the production keys. Production has different priorities from you. Production keys such as product keys or customer keys are generated, formatted, updated, deleted, recycled, and reused according to the dictates of production. If you use production keys as your keys, you will be jerked around by changes that can be, at the very least, annoying, and at the worst, disastrous. Suppose that you need to keep a three-year history of product sales in your large sales fact table, but production decides to purge their product file every 18 months. What do you do then? Let's list some of the ways that production may step on your toes:

    Production may reuse keys that it has purged but that you are still maintaining, as I described.

    Production may make a mistake and reuse a key even when it isn't supposed to. This happens frequently in the world of UPCs in the retail world, despite everyone's best intentions.

    Production may recompact its key space because it has a need to garbage-collect the production system. One of my customers was recently handed a data warehouse load tape with all the production customer keys reassigned!

    Production may legitimately overwrite some part of a product description or a customer description with new values but not change the product key or the customer key to a new value. You are left holding the bag and wondering what to do about the revised attribute values. This is the Slowly Changing Dimension crisis, which I will explain in a moment.

    Production may generalize its key format to handle some new situation in the transaction system. Now the production keys that used to be integers become alphanumeric. Or perhaps the 12-byte keys you are used to have become 20-byte keys.

    Your company has just made an acquisition, and you need to merge more than a million new customers into the master customer list. You will now need to extract from two production systems, but the newly acquired production system has nasty customer keys that don't look remotely like the others.

    The Slowly Changing Dimension crisis I mentioned earlier is a well-known situation in data warehousing. Rather than blaming production for not handling its keys better, it is more constructive to recognize that this is an area where the interests of production and the interests of the data warehouse legitimately diverge. Usually, when the data warehouse administrator encounters a changed description in a dimension record such as product or customer, the correct response is to issue a new dimension record. But to do this, the data warehouse must have a more general key structure. Hence the need for a surrogate key. I discussed Slowly Changing Dimensions in my April 1996 column. In next month's column, I will describe the low-level architecture for recognizing and processing Slowly Changing Dimensions at high speed.
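The lookup-and-replace step performed by the extract logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the architecture the author refers to: the class SurrogateKeyMap, the source system labels, and the sample records are all hypothetical.

```python
from itertools import count

class SurrogateKeyMap:
    """Hands out anonymous integer keys, starting at 1, for natural keys."""

    def __init__(self):
        self._next_key = count(1)
        self._by_natural_key = {}

    def lookup_or_assign(self, source_system, natural_key):
        # Qualifying the natural key by its source system keeps two
        # production systems with clashing or differently formatted keys apart.
        identity = (source_system, natural_key)
        if identity not in self._by_natural_key:
            self._by_natural_key[identity] = next(self._next_key)
        return self._by_natural_key[identity]

customer_keys = SurrogateKeyMap()

# Incoming records as (source system, production customer key, sale amount).
# The acquired system uses alphanumeric keys; the legacy one uses integers.
incoming = [
    ("legacy", 1001, 25.0),
    ("acquired", "CUST-00A7", 40.0),
    ("legacy", 1001, 10.0),
]

# Every natural key is replaced before the record reaches the fact table.
fact_rows = [
    (customer_keys.lookup_or_assign(source, natural_key), amount)
    for source, natural_key, amount in incoming
]
```

The fact rows carry only the integer surrogate, so a change in production key format does not ripple into the fact table. Issuing a second surrogate for the same natural key when a description changes would need an additional rule on top of this map.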

    There are still more reasons to use surrogate keys. One of the most important is the need to encode uncertain knowledge. You may need to supply a customer key to represent a transaction, but perhaps you don't know for certain who the customer is. This would be a common occurrence in a retail situation where cash transactions are anonymous, like most grocery stores. What is the customer key for the anonymous customer? Perhaps you have introduced a special key that stands for this anonymous customer. This is politely referred to as a "hack."

    If you think carefully about the "I don't know" situation, you may want more than just this one special key for the anonymous customer. You may also want to describe the situation where "the customer identification has not taken place yet." Or maybe, "there was a customer, but the data processing system failed to report it correctly." And also, "no customer is possible in this situation." All of these metasituations call for a data warehouse customer key that cannot be composed from the transaction production customer keys. Don't forget that in the data warehouse you must provide a customer key for every fact record in the schema shown in Figure 1. A null key automatically turns on the referential integrity alarm in your data warehouse because a foreign key (as in the fact table) can never be null.

    The "I don't know" situation occurs quite frequently for dates. You are probably using date-valued keys for your joins between your fact tables and your dimension tables. Once again, if you have done this you are forced to use some kind of real date to represent the special situations where a date value is not possible. I hope you have not been usi