WorkshopMaxtera_RevDados_11mar15

Data Revolution, Adriano Amaral, Director of Technology and Solutions, about.me/adriano_amaral

Transcript of WorkshopMaxtera_RevDados_11mar15

Page 1: WorkshopMaxtera_RevDados_11mar15

Data Revolution

Adriano Amaral, Director of Technology and Solutions

about.me/adriano_amaral

Page 2: WorkshopMaxtera_RevDados_11mar15

Labs

Space…

Page 3: WorkshopMaxtera_RevDados_11mar15

Objectives

Testing

Prototyping, Ideas, Trends

Technology Deep Dives, New Technologies

Innovation

Page 4: WorkshopMaxtera_RevDados_11mar15

Format

• Explanations with interruptions welcome, in a chat style;

• No intention of being the "owner of the truth";

• A Design Thinking exercise for problem solving;

• Define future themes and priorities;

Page 5: WorkshopMaxtera_RevDados_11mar15

The alchemists are coming…

Page 6: WorkshopMaxtera_RevDados_11mar15
Page 7: WorkshopMaxtera_RevDados_11mar15
Page 8: WorkshopMaxtera_RevDados_11mar15

Teradata Unified Data Architecture™

[Diagram labels. Sources: audio & video, images, text, web & social, machine logs, CRM, SCM, ERP. Platform: capture | store | refine, discovery platform, integrated data warehouse, independent data mart, data marts, data lab, dual systems, test/dev, analytical archive. Tools: languages, math & stats, data mining, business intelligence, applications, Viewpoint, support. Users: engineers, data scientists, business analysts, marketing, front-line workers, operational systems, customers/partners, executives.]

Page 9: WorkshopMaxtera_RevDados_11mar15

How do we turn data into gold?

Page 10: WorkshopMaxtera_RevDados_11mar15

Free the data…

Data, Information

Intelligence

Imperative

Decisions, Results

Knowledge

Need, Utility

Page 11: WorkshopMaxtera_RevDados_11mar15

Realizing Value: Maxtera

Partners, Stream Data

Fast Data

Big Data

Real Time

Apps Data

Sensors (IoT)

Unstructured

Structured

DATA ENGINE

Open Data

Decisions, Analyses

Forecasts, Fraud

Value

Data Scientist Consulting Team

Page 12: WorkshopMaxtera_RevDados_11mar15
Page 13: WorkshopMaxtera_RevDados_11mar15

In the early days...

Page 14: WorkshopMaxtera_RevDados_11mar15

How do we store data?

5/11/13 Bill Howe, UW 2

Page 15: WorkshopMaxtera_RevDados_11mar15

How do we store data?

5/11/13 Bill Howe, UW 3

###query | length | COG hit #1 | e-value #1 | identity #1 | score #1 | hit length #1 | description #1
chr_4[480001-580000].287 | 4500 | | | | | |
chr_4[560001-660000].1 | 3556 | | | | | |
chr_9[400001-500000].503 | 4211 | COG4547 | 2.00E-04 | 19 | 44.6 | 620 | Cobalamin biosynthesis protein CobT (nicotinate-mononucleotide:5, 6-dimethylbenzimidazole phosphoribosyltransferase)
chr_9[320001-420000].548 | 2833 | COG5406 | 2.00E-04 | 38 | 43.9 | 1001 | Nucleosome binding factor SPN, SPT16 subunit
chr_27[320001-404298].20 | 3991 | COG4547 | 5.00E-05 | 18 | 46.2 | 620 | Cobalamin biosynthesis protein CobT (nicotinate-mononucleotide:5, 6-dimethylbenzimidazole phosphoribosyltransferase)
chr_26[320001-420000].378 | 3963 | COG5099 | 5.00E-05 | 17 | 46.2 | 777 | RNA-binding protein of the Puf family, translational repressor
chr_26[400001-441226].196 | 2949 | COG5099 | 2.00E-04 | 17 | 43.9 | 777 | RNA-binding protein of the Puf family, translational repressor
chr_24[160001-260000].65 | 3542 | | | | | |
chr_5[720001-820000].339 | 3141 | COG5099 | 4.00E-09 | 20 | 59.3 | 777 | RNA-binding protein of the Puf family, translational repressor
chr_9[160001-260000].243 | 3002 | COG5077 | 1.00E-25 | 26 | 114 | 1089 | Ubiquitin carboxyl-terminal hydrolase
chr_12[720001-820000].86 | 2895 | COG5032 | 2.00E-09 | 30 | 60.5 | 2105 | Phosphatidylinositol kinase and protein kinases of the PI-3 kinase family
chr_12[800001-900000].109 | 1463 | COG5032 | 1.00E-09 | 30 | 60.1 | 2105 | Phosphatidylinositol kinase and protein kinases of the PI-3 kinase family
chr_11[1-100000].70 | 2886 | | | | | |
chr_11[80001-180000].100 | 1523 | | | | | |

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

What is the data model?

Page 16: WorkshopMaxtera_RevDados_11mar15
Page 17: WorkshopMaxtera_RevDados_11mar15

Modeling!

• Simplify reality;

• Define the sample space;

• Translate into a form the machine understands;

• Make manipulation possible;

Page 18: WorkshopMaxtera_RevDados_11mar15
Page 19: WorkshopMaxtera_RevDados_11mar15
Page 20: WorkshopMaxtera_RevDados_11mar15
Page 21: WorkshopMaxtera_RevDados_11mar15

What is a data model?

• 3 components

1. Structures;

2. Constraints;

3. Operations;

Page 22: WorkshopMaxtera_RevDados_11mar15

What is a data model?

1. Structures;

• Rows or columns?

• Nodes or elements?

• Key-value pairs?

• A sequence of bytes?

2. Constraints;

• Do all rows have the same number of columns?

• Must all values in a column have the same type?

• A child cannot have two parents?

3. Operations;

• Find the value of variable X

• Find the row where the column "surname" has the value "Oliveira"

• Get the next N bytes
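The three components can be made concrete with a small sketch. This is only an illustration: the `Table` class, its method names, and the sample people are hypothetical, not something from the slides.

```python
class Table:
    """Toy data model: a structure, two constraints, one operation."""

    def __init__(self, columns, types):
        # Structure: named columns, each with a declared type
        self.columns = list(columns)
        self.types = dict(zip(columns, types))
        self.rows = []

    def insert(self, *values):
        # Constraint: every row has the same number of columns
        if len(values) != len(self.columns):
            raise ValueError("wrong number of columns")
        # Constraint: every value in a column has that column's type
        for col, val in zip(self.columns, values):
            if not isinstance(val, self.types[col]):
                raise TypeError(f"{col} expects {self.types[col].__name__}")
        self.rows.append(dict(zip(self.columns, values)))

    def find(self, column, value):
        # Operation: return the rows where `column` has the given value
        return [row for row in self.rows if row[column] == value]


# Made-up sample data
people = Table(["first_name", "surname"], [str, str])
people.insert("Ana", "Oliveira")
people.insert("Rui", "Souza")
matches = people.find("surname", "Oliveira")
```

A key-value store or a byte stream would make different choices for the same three questions, which is the point of the slide.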

Page 23: WorkshopMaxtera_RevDados_11mar15

What is a database?

"A collection of information organized to afford efficient retrieval"

http://www.usg.edu/galileo/skills/unit04/primer04_01.phtml

Page 24: WorkshopMaxtera_RevDados_11mar15

What should we expect from a database?

• Which problems does a database solve?
  • Sharing
    • Allows simultaneous access by multiple readers and writers;
  • Enforcing a data model
    • Ensures that all applications see the same data format and organization;
  • Scale
    • Works with datasets too large to fit in memory;
  • Flexibility
    • Use the data in new ways that nobody has imagined yet!!!

Page 25: WorkshopMaxtera_RevDados_11mar15

Important questions!!

• How is this data physically organized on disk?

• Which kinds of queries does this model support efficiently, and which does it not?

• How complex is it to add or update a data item?

• What happens when new, unanticipated queries come up? Do I need to reorganize the data? How complicated is that?

Page 26: WorkshopMaxtera_RevDados_11mar15

History of Databases

• Network databases:

Historical Example: Network Databases

5/11/13 Bill Howe, UW 2

Database: A collection of information organized to afford efficient retrieval

[Diagram: network of linked records with nodes Orderer, Customer, Contact Rep, Screw, Nut, Washer]

Page 27: WorkshopMaxtera_RevDados_11mar15

History of Databases

• Hierarchical databases

Historical Example: Hierarchical Databases

5/11/13 Bill Howe, UW 3

[Diagram: master/detail hierarchy with nodes Orderer, Customer, Contact Rep, Screw, Nut, Nail, Washer]

Works great if you want to find all orders for a particular customer. But what if you want to find all Customers who ordered a Nail?
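A rough sketch of why the second question is harder in a hierarchy. The customer names and order contents are made-up sample data, and the nested dictionary is only a stand-in for a real hierarchical store.

```python
# Hierarchical layout: customers are the roots, their orders hang below them.
# All names and order contents are hypothetical sample data.
customers = {
    "Customer A": {"contact_rep": "Rep 1", "orders": ["Screw", "Nut", "Nail"]},
    "Customer B": {"contact_rep": "Rep 2", "orders": ["Screw", "Nut", "Washer"]},
}


def orders_for(customer):
    # Follows the hierarchy: one lookup at the root, then read the children
    return customers[customer]["orders"]


def customers_who_ordered(part):
    # Goes against the hierarchy: every customer subtree has to be visited
    return [name for name, record in customers.items() if part in record["orders"]]


orders_a = orders_for("Customer A")
nail_buyers = customers_who_ordered("Nail")
```

The first function touches one subtree; the second scans all of them, and its cost grows with the total number of customers.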

Page 28: WorkshopMaxtera_RevDados_11mar15

Relational Databases

"RDBMS (Relational Database Management Systems) were invented to let you use the data in multiple ways, including ways that had not been foreseen when the database was built and its first application designed"

Codd, 1970

Page 29: WorkshopMaxtera_RevDados_11mar15

Provide physical data independence…

5/11/13 Bill Howe, eScience Institute 2

Key Idea: “Physical Data Independence”

physical data independence

files and pointers

relations

SELECT seq FROM ncbi_sequences WHERE seq = 'GATTACGATATTA';

f = fopen("table_file"); fseek(10030440); while (True) { fread(&buf, 1, 8192, f); if (buf == GATTACGATATTA) { . . .
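A small sketch of the idea using Python's built-in `sqlite3` module. The table name comes from the slide; the sample rows and the index name `idx_seq` are assumptions for illustration. The query text stays the same while the physical organization changes underneath it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ncbi_sequences (seq TEXT)")
conn.executemany(
    "INSERT INTO ncbi_sequences (seq) VALUES (?)",
    [("TACCTGCCGTAA",), ("GATTACGATATTA",), ("CCCCCAATGAC",)],
)

query = "SELECT seq FROM ncbi_sequences WHERE seq = ?"
before = conn.execute(query, ("GATTACGATATTA",)).fetchall()

# Change the physical organization by adding an index (hypothetical name).
# The application's query does not change at all.
conn.execute("CREATE INDEX idx_seq ON ncbi_sequences (seq)")
after = conn.execute(query, ("GATTACGATATTA",)).fetchall()
conn.close()
```

With the file-and-pointer version, the same change would mean rewriting the offsets and the read loop in every application.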

Page 30: WorkshopMaxtera_RevDados_11mar15

Provide an algebra of records

5/11/13 Bill Howe, eScience Institute 3

Key Idea: An Algebra of Tables

select

project

join join

Other operators: aggregate, union, difference, cross product
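The three operators named above can be sketched over lists of dictionaries. This is a toy illustration, not how a database engine implements them; the function signatures and the sample tables are assumptions.

```python
def select(rows, predicate):
    # select: keep only the rows that satisfy the predicate
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    # project: keep only the named columns of each row
    return [{col: row[col] for col in columns} for row in rows]


def join(left, right, left_key, right_key):
    # join: pair rows whose key values match (simple hash join)
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


# Hypothetical sample tables
customers = [{"cust_id": 1, "name": "Customer A"}, {"cust_id": 2, "name": "Customer B"}]
orders = [
    {"order_cust": 1, "part": "Nail"},
    {"order_cust": 2, "part": "Washer"},
    {"order_cust": 1, "part": "Nut"},
]

# "Which customers ordered a Nail?" as a composition of operators
nail_customers = project(
    select(join(customers, orders, "cust_id", "order_cust"), lambda r: r["part"] == "Nail"),
    ["name"],
)
```

Because each operator takes tables and returns a table, they compose freely, which is what makes it an algebra.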

Page 31: WorkshopMaxtera_RevDados_11mar15

Relational vs. Analytical

Equivalent logical expressions; different costs

right associative: $\sigma_{p=\text{knows}}(R) \bowtie_{o=s} (\sigma_{p=\text{holdsAccount}}(R) \bowtie_{o=s} \sigma_{p=\text{accountHomepage}}(R))$

left associative: $(\sigma_{p=\text{knows}}(R) \bowtie_{o=s} \sigma_{p=\text{holdsAccount}}(R)) \bowtie_{o=s} \sigma_{p=\text{accountHomepage}}(R)$

cross product: $\sigma_{p_1=\text{knows} \,\wedge\, p_2=\text{holdsAccount} \,\wedge\, p_3=\text{accountHomepage}}(R \times R \times R)$

Same operation, different costs!
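A sketch of the cost difference, counting how many tuple combinations each plan examines. The triples in `R` are made-up sample data, the join plan shown is the left-associative one, and the cross-product plan spells out the object = subject join conditions explicitly so that both plans return the same answer.

```python
from itertools import product

# R holds (subject, predicate, object) triples; hypothetical sample data.
R = [
    ("alice", "knows", "bob"),
    ("carol", "knows", "bob"),
    ("bob", "holdsAccount", "acct1"),
    ("acct1", "accountHomepage", "http://example.org/bob"),
    ("dave", "knows", "erin"),
]


def sel(predicate):
    return [t for t in R if t[1] == predicate]


def join_plan():
    """Select first, then join on object = subject; count pairs examined."""
    work = 0
    knows, holds, home = sel("knows"), sel("holdsAccount"), sel("accountHomepage")
    step1 = []
    for k in knows:
        for h in holds:
            work += 1
            if k[2] == h[0]:
                step1.append((k, h))
    out = []
    for k, h in step1:
        for a in home:
            work += 1
            if h[2] == a[0]:
                out.append((k[0], a[2]))
    return sorted(out), work


def cross_product_plan():
    """Build R x R x R first, then filter; count triples of tuples examined."""
    work = 0
    out = []
    for t1, t2, t3 in product(R, repeat=3):
        work += 1
        if (
            (t1[1], t2[1], t3[1]) == ("knows", "holdsAccount", "accountHomepage")
            and t1[2] == t2[0]
            and t2[2] == t3[0]
        ):
            out.append((t1[0], t3[2]))
    return sorted(out), work


join_result, join_work = join_plan()
cross_result, cross_work = cross_product_plan()
```

On these five triples the join plan examines 5 combinations and the cross product examines 125, for the same result. Picking the cheap plan is the job of the query optimizer.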

Page 32: WorkshopMaxtera_RevDados_11mar15

Evolution of Analytic Environment Use

[Chart: axes Complexity and Data Sophistication; workload labels Batch, Ad Hoc, Analytics, continuous updates with fast queries, event-based actions]

• REPORTING: WHAT happened? Batches and ad hoc reports

• ANALYZING: WHY did it happen? Growth in ad hoc analyses

• PREDICTING: WHAT WILL happen? Analytical models grow

• OPERATIONALIZING: WHAT IS happening? Continuous, time-sensitive updates; queries become more important

• ACTIVATING: MAKING it happen! Event-based commands take over the environment

Page 33: WorkshopMaxtera_RevDados_11mar15

Big Data: Definition…

“Big Data is any data that is too expensive to manage and extract value from”

Michael Franklin, Thomas M. Siebel Professor of Computer Science, Director of the Algorithms, Machines and People Lab, University of California, Berkeley

Page 34: WorkshopMaxtera_RevDados_11mar15

Big Data: Challenges…

• Velocity
  • data latency in the face of diverse demands and growing interactivity;

• Variety
  • diversity of formats, quality, sources and structures;

• Volume
  • size of the data;

Page 35: WorkshopMaxtera_RevDados_11mar15

So where is the philosopher's stone?

Page 36: WorkshopMaxtera_RevDados_11mar15

Citizens/Consumers: demand for new services

Governments/Companies: adapting to meet those needs

Slide, Arturo Muente-Kunigami @n0wh3r3m4n

Page 37: WorkshopMaxtera_RevDados_11mar15

http://www.clickestudante.com/resultado-dos-protestos-pelo-brasil.html

Page 38: WorkshopMaxtera_RevDados_11mar15

Proposal…

1. Government/Companies as a platform

2. Citizens know more, and know better, than the government (sensors)

3. Small, loosely coupled systems

4. To earn trust, deliver!

5. Reuse systems and policies, and add innovation

6. Technology as an enabler

Page 39: WorkshopMaxtera_RevDados_11mar15

govLab: 1. Design 2. Build 3. Release 4. Review

Source: adapted from the Agile model

Page 40: WorkshopMaxtera_RevDados_11mar15
Page 41: WorkshopMaxtera_RevDados_11mar15

Open Linked Data

Page 42: WorkshopMaxtera_RevDados_11mar15

JUNE 2014

Open for Business: How Open Data Can Help Achieve the G20 Growth Target

A Lateral Economics report commissioned by Omidyar Network

Page 43: WorkshopMaxtera_RevDados_11mar15

Open data and development

13 trillion dollars over the next 5 years

1.1% growth in G20 GDP, out of the 2% projected over the 5 years

US$14.5 billion per year, and this figure is probably an underestimate… (Australian case)

Page 44: WorkshopMaxtera_RevDados_11mar15

The value of open data…

• Reduces the cost of government and private-sector services;

• Enables new services and improves the quality of existing ones;

• Increases trust in government through better governance, transparency and citizen engagement;

Page 45: WorkshopMaxtera_RevDados_11mar15

Data that generates the most value…

Education

Finance

Transport

Retail

Energy

Health

Agriculture

Employment

Source: Open for Business

Page 46: WorkshopMaxtera_RevDados_11mar15

Companies as catalysts!!!

[Diagram labels: Private Sector, Government, Civil Society (repeated five times), $$$]

Page 47: WorkshopMaxtera_RevDados_11mar15

£200M in prescriptions in the British public health system (NHS)

http://www.economist.com/news/britain/21567980-how-scrutiny-freely-available-data-might-save-nhs-money-beggar-thy-neighbour

Page 48: WorkshopMaxtera_RevDados_11mar15
Page 49: WorkshopMaxtera_RevDados_11mar15
Page 50: WorkshopMaxtera_RevDados_11mar15
Page 51: WorkshopMaxtera_RevDados_11mar15
Page 52: WorkshopMaxtera_RevDados_11mar15
Page 53: WorkshopMaxtera_RevDados_11mar15
Page 54: WorkshopMaxtera_RevDados_11mar15
Page 55: WorkshopMaxtera_RevDados_11mar15
Page 56: WorkshopMaxtera_RevDados_11mar15
Page 57: WorkshopMaxtera_RevDados_11mar15
Page 58: WorkshopMaxtera_RevDados_11mar15
Page 59: WorkshopMaxtera_RevDados_11mar15

WhereDoesMyMoneyGo.org

Page 60: WorkshopMaxtera_RevDados_11mar15

Open data index

Page 61: WorkshopMaxtera_RevDados_11mar15

@shevski okfn.org

Page 62: WorkshopMaxtera_RevDados_11mar15

OpenCorporates.com

http://opencorporates.com/viz/financial/index.html

Page 63: WorkshopMaxtera_RevDados_11mar15

Solving public problems with technology

Page 64: WorkshopMaxtera_RevDados_11mar15
Page 65: WorkshopMaxtera_RevDados_11mar15

CDO (Chief Data Officer)

Page 66: WorkshopMaxtera_RevDados_11mar15
Page 67: WorkshopMaxtera_RevDados_11mar15

So how do we do it?

Page 68: WorkshopMaxtera_RevDados_11mar15

Scaling…

• Operationally
  • In the past: it worked even if the data did not fit in memory;
  • Now: I can use many small (cheap) computers

• "Algorithmically"
  • In the past: for a given amount of data (N), I have a bounded number of operations (N^m): polynomial
  • Now: for a growing amount of data, I need to spread a larger volume of operations across k machines (N^m/k): parallelized polynomial
  • Soon: data arrives as a continuous stream from many sources, with queries running continuously (N*log(N)): stream data
    • Ex: sky-survey telescopes (30 TB/night)

Page 69: WorkshopMaxtera_RevDados_11mar15

Exploring possibilities

• Imagine searching for a DNA sequence
  • All sequences equal to:
    • GATTACGATATTA

GATTACGATATTATACCTGCCGTAA

Page 70: WorkshopMaxtera_RevDados_11mar15

GATTACGATATTA

TACCTGCCGTAA = GATTACGATATTA ?

Step = 0

GATTACGATATTA

CCCCCAATGAC = GATTACGATATTA ?

Step = 1

Page 71: WorkshopMaxtera_RevDados_11mar15

GATTACGATATTA

GATTACGATATTA = GATTACGATATTA ?

Step = 40

GATTACGATATTA

40 records = 40 comparisons
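The scan on these slides can be sketched as a plain linear search that counts its comparisons. The 37 filler records (`"ACGT"`) are made up so that the target sits in the last of 40 positions.

```python
def linear_search(records, target):
    """Compare the target against each record in turn; count comparisons."""
    comparisons = 0
    for position, record in enumerate(records):
        comparisons += 1
        if record == target:
            return position, comparisons
    return -1, comparisons


# 40 unsorted records; the filler entries are hypothetical.
records = ["TACCTGCCGTAA", "CCCCCAATGAC"] + ["ACGT"] * 37 + ["GATTACGATATTA"]
position, comparisons = linear_search(records, "GATTACGATATTA")
```

In the worst case the work grows in direct proportion to the number of records.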

Page 72: WorkshopMaxtera_RevDados_11mar15

Sort the sequences?

AAAATCCTGCA

AAACGCCTGCA

GATTACGATATTA

TTTACGTCAA

TTTTCGTAATT

Page 73: WorkshopMaxtera_RevDados_11mar15

GATTACGATATTA

CTGTACACAACCT

CTGTACACAACCT < GATTACGATATTA ?

0% 100%

We start by splitting in half!

Page 74: WorkshopMaxtera_RevDados_11mar15

GATTACGATATTA

GGATACACATTTA

GGATACACATTTA > GATTACGATATTA

0% 100%

Page 75: WorkshopMaxtera_RevDados_11mar15

GATTACGATATTA

0% 100%

40 records = 4 comparisons in this example; N records = about log2(N) comparisons
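A sketch of the halving strategy as a standard binary search that counts how many records it probes. The 39 short filler sequences are generated only to reach 40 sorted records. How many probes a search needs depends on where the target falls; for 40 records the worst case is 6.

```python
from itertools import product


def binary_search(sorted_records, target):
    """Halve the search range on each probe; count the probes."""
    lo, hi = 0, len(sorted_records) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_records[mid] == target:
            return mid, probes
        if sorted_records[mid] < target:
            lo = mid + 1  # target sorts after the probed record
        else:
            hi = mid - 1  # target sorts before the probed record
    return -1, probes


# 39 hypothetical three-letter sequences plus the target, sorted.
fillers = ["".join(p) for p in product("ACGT", repeat=3)][:39]
records = sorted(fillers + ["GATTACGATATTA"])

position, probes = binary_search(records, "GATTACGATATTA")
worst_case = max(binary_search(records, r)[1] for r in records)
```

Sorting has a one-time cost, but every later lookup needs only about log2(N) probes instead of N.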

Page 76: WorkshopMaxtera_RevDados_11mar15

Cut the data?

Page 77: WorkshopMaxtera_RevDados_11mar15

…….

40 records / 6 workers (N/k)
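A sketch of cutting N = 40 records into k = 6 chunks and scanning each chunk with its own worker. The helper names and filler records are assumptions. Threads are used only to show the structure; in CPython they would not speed up CPU-bound work, and a real system would spread the chunks across processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, k):
    """Cut the records into k nearly equal contiguous chunks."""
    size, extra = divmod(len(records), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def count_matches(chunk, target):
    # Each worker scans only its own chunk
    return sum(1 for record in chunk if record == target)


# Hypothetical data: 39 filler records plus the target
records = ["ACGT"] * 39 + ["GATTACGATATTA"]
chunks = split(records, 6)

with ThreadPoolExecutor(max_workers=6) as pool:
    partial = list(pool.map(lambda c: count_matches(c, "GATTACGATATTA"), chunks))

total = sum(partial)
```

Each worker handles about N/k records, and the partial counts are combined at the end.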

Page 78: WorkshopMaxtera_RevDados_11mar15

Cutting, transforming and simplifying?

f f f f f f

Page 79: WorkshopMaxtera_RevDados_11mar15

Lots of data…

Page 80: WorkshopMaxtera_RevDados_11mar15

map map map map map map

reduce reduce reduce reduce

3 5 4

Page 81: WorkshopMaxtera_RevDados_11mar15

MapReduce (2004)

Page 82: WorkshopMaxtera_RevDados_11mar15

5/15/13 Bill Howe, eScience Institute 21

Hadoop in One Slide

src: Huy Vo, NYU Poly

Page 83: WorkshopMaxtera_RevDados_11mar15
Page 84: WorkshopMaxtera_RevDados_11mar15

The Alchemical Process

• Nigredo: or the Black Operation, is the stage in which matter is dissolved and putrefied (associated with heat and fire);

• Albedo: or the White Operation, is the stage in which the substance is purified (associated with ablution with Aquae Vitae, with moonlight, the feminine, and silver);

• Citrinitas: or the Yellow Operation, is the stage in which the transmutation of metals takes place, from silver into gold, or from passive moonlight into active sunlight;

http://pt.wikipedia.org/wiki/Alquimia

Page 85: WorkshopMaxtera_RevDados_11mar15

Simplifying the Data Model

Data = file = bag of (key, value) pairs

Map
• Input = (inputkey, value)
• Output = (intermediatekey, value), distributed

Reduce
• Input = (intermediatekey, value)
• Output = (outputkey, value), regrouped
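A single-process sketch of this model, assuming hypothetical `mapper` and `reducer` functions that count DNA bases. A real framework would run the map calls and the reduce calls on different machines; here the grouping step stands in for the shuffle between them.

```python
from collections import defaultdict


def map_reduce(pairs, mapper, reducer):
    # Map: each (inputkey, value) yields (intermediatekey, value) pairs
    groups = defaultdict(list)
    for key, value in pairs:
        for ikey, ivalue in mapper(key, value):
            groups[ikey].append(ivalue)  # group by intermediate key
    # Reduce: each intermediate key and its values yield output pairs
    output = []
    for ikey in sorted(groups):
        output.extend(reducer(ikey, groups[ikey]))
    return output


def mapper(seq_id, sequence):
    # Emit (base, 1) for every base in the sequence
    for base in sequence:
        yield base, 1


def reducer(base, counts):
    yield base, sum(counts)


# Hypothetical input: a bag of (key, value) pairs
base_counts = map_reduce([("seq1", "GATTACA"), ("seq2", "TACC")], mapper, reducer)
```

The programmer writes only `mapper` and `reducer`; distribution and regrouping belong to the framework.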

Page 86: WorkshopMaxtera_RevDados_11mar15

Simplifying the Data Model… another one…

RDF Resource Description Framework

Page 87: WorkshopMaxtera_RevDados_11mar15

Implementing Joins with Map-Reduce

Employees

Name | ID
Adriano | 11111
José Rodrigo | 22222

Assigned Departments

EmpID | Department
11111 | Technology
22222 | Sales
22222 | Marketing

Employees ⋈ Assigned Departments

Name | ID | EmpID | Department
Adriano | 11111 | 11111 | Technology
José Rodrigo | 22222 | 22222 | Sales
José Rodrigo | 22222 | 22222 | Marketing

Page 88: WorkshopMaxtera_RevDados_11mar15

Joins: Before the Map Phase

Employees

Name | ID
Adriano | 11111
José Rodrigo | 22222

Assigned Departments

EmpID | Department
11111 | Technology
22222 | Sales
22222 | Marketing

Combine the data into one large block of records, tagging each record with its table of origin:

Employee, Adriano, 11111
Employee, José Rodrigo, 22222
Department, 11111, Technology
Department, 22222, Sales
Department, 22222, Marketing

Page 89: WorkshopMaxtera_RevDados_11mar15

Joins: The Map Function…

Employee, Adriano, 11111
Employee, José Rodrigo, 22222
Department, 11111, Technology
Department, 22222, Sales
Department, 22222, Marketing

(key, value) pairs:

key=11111, value=(Employee, Adriano, 11111)
key=22222, value=(Employee, José Rodrigo, 22222)
key=11111, value=(Department, 11111, Technology)
key=22222, value=(Department, 22222, Sales)
key=22222, value=(Department, 22222, Marketing)

Page 90: WorkshopMaxtera_RevDados_11mar15

Joins: The Reduce Phase

key=11111, value=[(Employee, Adriano, 11111), (Department, 11111, Technology)]

key=22222, value=[(Employee, José Rodrigo, 22222), (Department, 22222, Sales), (Department, 22222, Marketing)]

Adriano, 11111, 11111, Technology

José Rodrigo, 22222, 22222, Sales
José Rodrigo, 22222, 22222, Marketing
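The three join slides can be put together in a short single-process sketch. It uses English tags (`"Employee"`, `"Department"`) for the employee and department records, and the function names are hypothetical. The dictionary grouping stands in for the shuffle a real Map-Reduce framework performs between the two phases.

```python
from collections import defaultdict

# One large block of tagged records, as on the "before the map phase" slide
records = [
    ("Employee", "Adriano", 11111),
    ("Employee", "José Rodrigo", 22222),
    ("Department", 11111, "Technology"),
    ("Department", 22222, "Sales"),
    ("Department", 22222, "Marketing"),
]


def map_record(record):
    # The join key is the ID for employees and the EmpID for departments
    key = record[2] if record[0] == "Employee" else record[1]
    return key, record


def reduce_group(key, values):
    # Pair every employee with every department that shares the key
    employees = [v for v in values if v[0] == "Employee"]
    departments = [v for v in values if v[0] == "Department"]
    return [(e[1], e[2], d[1], d[2]) for e in employees for d in departments]


# Shuffle: group the mapped values by key
groups = defaultdict(list)
for record in records:
    key, value = map_record(record)
    groups[key].append(value)

joined = [row for key in sorted(groups) for row in reduce_group(key, groups[key])]
```

Each reduce call sees only the records for one key, so different keys can be joined on different machines.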

Page 91: WorkshopMaxtera_RevDados_11mar15

Map-Reduce Implementations

• DFS: Distributed File System
• Massively Parallel Processing (MPP)
• Fault tolerance by replicating "chunks" across parallel nodes
• The master node and the workers divide the work across the Mapping and Reducing phases

Page 92: WorkshopMaxtera_RevDados_11mar15

Map-Reduce Implementations

http://hao-deng.blogspot.com.br/2013/05/map-reduce-logical-data-flow.html

Page 93: WorkshopMaxtera_RevDados_11mar15

MPP: Massively Parallel Processing

Page 94: WorkshopMaxtera_RevDados_11mar15

MPP: Massively Parallel Processing

Page 95: WorkshopMaxtera_RevDados_11mar15

Architectures

A Survey of Shared-Nothing Parallel Database Management Systems

[Comparison between Teradata, Greenplum and Netezza implementations]

Thomas Müseler, University of Applied Science Darmstadt

Haardtring 100, 64295 Darmstadt, Germany

ABSTRACT

Distributed database systems can be implemented in many different ways. Mostly, they are customized for a special environment to handle big data problems. The data warehouse sector relies on these amounts, but has changed from a data storage to a real time management support during the last years [3]. The resulting increase of computation and storage capacity poses new requirements to the database systems. Previous approaches of a parallel database environment tried to solve this problem with shared disk and memory approaches.

The main contribution of this paper is the presentation of the current technology in the shared-nothing database sector. The concepts of the manufacturers Teradata, Greenplum and Netezza will be discussed for data warehouse requirements. Based on an architectural overview is a detailed insight of the index functionality given which is a crucial performance factor. Also data distribution algorithms of the manufacturers are analysed under data warehouse conditions.

At the end is a comparison to other shared concepts (shared-disk, shared-everything) given and the question raised, if the actual approach can be fulfilled by the manufacturers.

1. INTRODUCTION

Rising data rates and big data volumes are the new challenges for the today’s database systems. With computer unions is tried to counteract and therefore to distribute the computing loads to several instances.

An approach of distributed systems is the shared-nothing architecture in which each node can operate independently and separated from the other nodes. In contrast to the classical shared database concepts in which the main memory or hard disks are shared, each node has its own hardware components. In association with a high-speed network results a powerful computer network which is becoming increasingly popular through the excellent scalability.

This paper gives an analysis of the existing architecture concepts in shared-nothing database systems. It gives a conceptual insight of the manufacturer Teradata, Netezza and Greenplum in chapter 2. These three vendors are relatively small companies compared to the market leader (Oracle, IBM, SAP) and pursue interesting ideas in this sector. A focal point of this paper is to the index implementation in chapter 4 which takes a decisive influence on the data distribution to each node. Based on the application range in the data warehouse environment (chapter 5), a comparison is made to other architectural models and the scaling of these kinds of networks. Furthermore is the question raised, if the shared-nothing can be adapted to other application fields and is therefore a good opportunity for future implementations.

2. ARCHITECTURE

The classification of distributed database systems in terms of their architecture can be done at different levels. The most widely used approach of Michael Stonebraker [18] builds the basis for further architectural views and will be extended. Basically, we distinguish between three main approaches. The shared-everything (SE), the shared-disk (SD) and the shared-nothing (SN) architecture are one of the basic concepts in a shared database environment. While SE-systems are sharing the processors (P) / memory resources (M) and thus constitute a closed circuit, require the SD / SN variants a communication network (N) to integrate their components.

Figure 1: Stonebraker Architecture with shared-everything, shared-disk, shared-nothing

Within the shared-nothing architecture, each processor uses its own main memory and disk. A high-speed network is connecting the various machines and is required for sharing and organization operations. This makes it more complex compared to the other two variants concerning the management of a large number of processing nodes.

The construction of a shared-nothing architecture can be divided into the following steps [14]:

1. Create a partitioning schema
2. Data distribution to the instances
3. Load balancing setup
4. Repeat the process in case of a re-partitioning
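The partitioning and distribution steps can be sketched with a simple hash scheme. This is an illustrative assumption only: the `Node` class, the CRC32 hash and the helper names are not the algorithms used by Teradata, Greenplum or Netezza, and load balancing is left out.

```python
import zlib


class Node:
    """One shared-nothing node: its rows live in its own private storage."""

    def __init__(self, name):
        self.name = name
        self.rows = []


def node_for(key, nodes):
    # Partitioning scheme (assumed): a stable hash of the distribution key
    return nodes[zlib.crc32(str(key).encode()) % len(nodes)]


def distribute(rows, key_index, nodes):
    # Data distribution: each row goes to exactly one node
    for node in nodes:
        node.rows.clear()
    for row in rows:
        node_for(row[key_index], nodes).rows.append(row)


def repartition(old_nodes, new_nodes, key_index):
    # Re-partitioning: gather every row and distribute it again
    rows = [row for node in old_nodes for row in node.rows]
    distribute(rows, key_index, new_nodes)
    return new_nodes


# Hypothetical data: 100 rows keyed by their first column
nodes = [Node(f"node{i}") for i in range(4)]
rows = [(i, f"row{i}") for i in range(100)]
distribute(rows, 0, nodes)
bigger = repartition(nodes, [Node(f"node{i}") for i in range(5)], 0)
```

With a modulo scheme like this one, changing the number of nodes moves most rows, which is one reason re-partitioning is listed as a step of its own.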

A survey of Shared-Nothing Parallel DatabaseManagement Systems

[Comparison between Teradata, Greenplum and Netezza implementations]

Thomas MüselerUniversity of Applied Science Darmstadt

Haardtring 10064295 Darmstadt, Germany

[email protected]

ABSTRACTDistributed database systems can be implemented in a manydifferent ways. Mostly, they are customized for a specialenvironment to handle big data problems. The data ware-house sector relies on these amounts, but has changed froma data storage to a real time management support duringthe last years [3]. The resulting increase of compution andstorage capacity poses new requirements to the databasesystems. Previous approaches of a parallel database envi-ronment tried to solve this problem with shared disk andmemory approaches.The main contribution of this paper is the presentation ofthe current technology in the shared-nothing database sec-tor. The concepts of the manufacturers Teradata, Green-plum and Netezza will be discussed for data warehouse re-quirements. Based on an architectural overview is a detailedinsight of the index functionality given which is a crucialperformance factor. Also data distribution algorithms ofthe manufacturers are analysed under data warehouse con-ditions.At the end is a comparison to other shared concepts (shared-disk, shared-everything) given and the question raised, if theactual approach can be fulfilled by the manufacturers.

1. INTRODUCTIONRising data rates and big data volumes are the new chal-lenges for the today’s database systems. With computerunions is tried to counteract and therefore to distribute thecomputing loads to several instances.

An approach of distributed systems is the shared-nothingarchitecture in which each node can operate independentlyand separated from the other nodes. In contrast to the clas-sical shared database concepts in which the main memory orhard disks are shared, each node has its own hardware com-ponents. In association with a high-speed network results apowerful computer network which is becoming increasinglypopular through the excellent scalability.This paper gives an analysis of the existing architecture con-cepts in shared-nothing database systems. It gives an con-ceptual insight of the manufacturer Teradata, Netezza andGreenplum in chapter 2. These three vendors are relativelysmall companies compared to the market leader (Oracle,IBM, SAP) and pursue interesting ideas in this sector.A focal point of this paper is to the index implementationin chapter 4 which takes a decisive influence on the data

distribution to each node. Based on the application rangein the data warehouse environment (chapter 5), a compari-son is made to other architectural models and the scaling ofthese kinds of networks. Furthermore is the question raised,if the shared-nothing can be adapted to other applicationfields and is therefore a good opportunity for future imple-mentations.

2. ARCHITECTUREThe classification of distributed database systems in terms oftheir architecture can be done at different levels. The mostwidely used approach of Michael Stonebraker [18] builds thebasis for further architectural views and will be extended.Basically, we distinguish between three main approaches.The shared-everything (SE), the shared-disk (SD) and theshared-nothing (SN) architecture are one of the basic con-cepts in a shared database environment. While SE-systemsare sharing the processors (P) / memory resources (M) andthus constitute a closed circuit, require the SD / SN variantsa communication network (N) to integrate their components.

Figure 1: Stonebraker Architecture withshared-everything, shared-disk, shared-nothing

Within the shared-nothing architecture, each processor usesits own main memory and disk. A high-speed network isconnecting the various machines and is required for sharingand organization operations. This makes it more complexcompared to the other two variants concerning the manage-ment of a large number of processing nodes.The construction of a shared-nothing architecture can be di-vided into the following steps [14]:

1. Create a partitioning schema2. Data distribution to the instances3. Load balancing setup4. Repeat the process in case of a re-partitioning

A survey of Shared-Nothing Parallel DatabaseManagement Systems

[Comparison between Teradata, Greenplum and Netezza implementations]

Thomas MüselerUniversity of Applied Science Darmstadt

Haardtring 10064295 Darmstadt, Germany

[email protected]

ABSTRACTDistributed database systems can be implemented in a manydifferent ways. Mostly, they are customized for a specialenvironment to handle big data problems. The data ware-house sector relies on these amounts, but has changed froma data storage to a real time management support duringthe last years [3]. The resulting increase of compution andstorage capacity poses new requirements to the databasesystems. Previous approaches of a parallel database envi-ronment tried to solve this problem with shared disk andmemory approaches.The main contribution of this paper is the presentation ofthe current technology in the shared-nothing database sec-tor. The concepts of the manufacturers Teradata, Green-plum and Netezza will be discussed for data warehouse re-quirements. Based on an architectural overview is a detailedinsight of the index functionality given which is a crucialperformance factor. Also data distribution algorithms ofthe manufacturers are analysed under data warehouse con-ditions.At the end is a comparison to other shared concepts (shared-disk, shared-everything) given and the question raised, if theactual approach can be fulfilled by the manufacturers.

1. INTRODUCTIONRising data rates and big data volumes are the new chal-lenges for the today’s database systems. With computerunions is tried to counteract and therefore to distribute thecomputing loads to several instances.

One approach to distributed systems is the shared-nothing architecture, in which each node operates independently of the other nodes. In contrast to the classical shared database concepts, in which main memory or hard disks are shared, each node has its own hardware components. Combined with a high-speed network, the result is a powerful computer cluster that is becoming increasingly popular thanks to its excellent scalability.

This paper analyses the existing architecture concepts in shared-nothing database systems. Chapter 2 gives a conceptual insight into the manufacturers Teradata, Netezza and Greenplum. These three vendors are relatively small companies compared to the market leaders (Oracle, IBM, SAP) and pursue interesting ideas in this sector.

A focal point of this paper is the index implementation in chapter 4, which has a decisive influence on the data distribution to each node. Based on the application range in the data warehouse environment (chapter 5), a comparison is made with other architectural models and with the scaling of these kinds of networks. Furthermore, the question is raised whether the shared-nothing approach can be adapted to other application fields and is therefore a good option for future implementations.

2. ARCHITECTURE

The classification of distributed database systems in terms of their architecture can be done at different levels. The most widely used approach, by Michael Stonebraker [18], forms the basis for further architectural views and will be extended. Basically, we distinguish between three main approaches: the shared-everything (SE), the shared-disk (SD) and the shared-nothing (SN) architecture are the basic concepts in a shared database environment. While SE systems share the processors (P) and memory resources (M) and thus constitute a closed system, the SD and SN variants require a communication network (N) to integrate their components.

Figure 1: Stonebraker architecture with shared-everything, shared-disk, shared-nothing

Within the shared-nothing architecture, each processor uses its own main memory and disk. A high-speed network connects the machines and is required for sharing and coordination operations. Managing a large number of processing nodes makes this architecture more complex than the other two variants.

The construction of a shared-nothing architecture can be divided into the following steps [14]:

1. Create a partitioning schema
2. Distribute the data to the instances
3. Set up load balancing
4. Repeat the process in case of a re-partitioning
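The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the partitioning schema is a plain hash of one key column, `zlib.crc32` stands in for whatever hash function a real product uses, and the names `node_for`, `distribute`, `skew` and `repartition` are hypothetical helpers, not part of any vendor API.

```python
import zlib


def node_for(key, num_nodes):
    """Step 1 (partitioning schema): hash the key to a node number.

    crc32 is an assumed stand-in for a vendor-specific hash function.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key, num_nodes):
    """Step 2: place every row on the node its key hashes to."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key], num_nodes)].append(row)
    return nodes


def skew(nodes):
    """Step 3 (load-balancing check): largest node size / average size.

    A value close to 1.0 means the rows are spread evenly.
    """
    sizes = [len(node) for node in nodes]
    average = sum(sizes) / len(sizes)
    return max(sizes) / average if average else 0.0


def repartition(nodes, key, new_num_nodes):
    """Step 4: collect all rows and redistribute them over a new node count."""
    all_rows = [row for node in nodes for row in node]
    return distribute(all_rows, key, new_num_nodes)


rows = [{"id": i, "value": i * i} for i in range(1000)]
nodes = distribute(rows, "id", 4)
print([len(node) for node in nodes], round(skew(nodes), 2))

nodes = repartition(nodes, "id", 6)
print([len(node) for node in nodes], round(skew(nodes), 2))
```

Note that this naive re-partitioning moves most rows when the node count changes; real systems typically add an indirection layer (such as a bucket-to-node map) to limit data movement.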

Page 96: WorkshopMaxtera_RevDados_11mar15

Parallelism

Distributed Queries

Parallel Queries

Page 97: WorkshopMaxtera_RevDados_11mar15

Distributing…

5/15/13 Bill Howe, eScience Institute 38

Parallel Query Example: Teradata

AMP = unit of parallelism

Page 98: WorkshopMaxtera_RevDados_11mar15

Teradata Database Architecture Page 2

1 5/15/2013 Copyright © 2013 by Teradata Corporation

Teradata and MPP Systems

Teradata is the software that makes an MPP system appear to be a single system to users and administrators.

[Figure: four nodes (Node 0 to Node 3) connected by BYNET 0 and BYNET 1; each node runs PE and AMP vprocs on top of the PDE and the O.S.]

The major components of the Teradata Database are implemented as virtual processors (vproc).

• Parsing Engine (PE)

• Access Module Processor (AMP)

The Communication Layer or Message Passing Layer (MPL) consists of PDE and BYNET SW/HW and connects multiple nodes together.

MPP - Massively Parallel Processing

AMP - Access Module Processor

Page 99: WorkshopMaxtera_RevDados_11mar15

Teradata Database Architecture Page 6


The Parsing Engine

The Parsing Engine is responsible for:

• Managing individual sessions (up to 120)

• Parsing and Optimizing your SQL requests

• Dispatching the optimized plan to the AMPs

• Input conversion (EBCDIC / ASCII), if necessary

• Sending the answer set response back to the requesting client

[Figure: an SQL request enters the Parsing Engine (Parser, Optimizer, Dispatcher) and is sent through the Message Passing Layer to the AMPs; the answer set response returns to the client.]
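The parse, optimize and dispatch flow of the Parsing Engine can be imitated with a toy model. The sketch below is not Teradata code: the request format (`"column operator value"`), the in-memory `AMPS` lists and all function names are assumptions made for illustration, and threads stand in for the Message Passing Layer.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the set of rows owned by one AMP.
AMPS = [
    [{"id": 1, "amount": 10}, {"id": 5, "amount": 40}],
    [{"id": 2, "amount": 25}],
    [{"id": 3, "amount": 5}, {"id": 4, "amount": 60}],
]


def parse(request):
    """Parser: split a request such as 'amount > 20' and validate it."""
    column, operator, value = request.split()
    if operator not in (">", "<", "="):
        raise ValueError(f"unsupported operator: {operator}")
    return column, operator, int(value)


def optimize(parsed):
    """Optimizer: turn the parsed request into a step every AMP can run."""
    column, operator, value = parsed
    tests = {
        ">": lambda x: x > value,
        "<": lambda x: x < value,
        "=": lambda x: x == value,
    }
    return lambda row: tests[operator](row[column])


def dispatch(step, amps):
    """Dispatcher: run the step on all AMPs in parallel and merge the results."""
    def run_on_amp(amp_rows):
        return [row for row in amp_rows if step(row)]

    with ThreadPoolExecutor(max_workers=len(amps)) as pool:
        partial_results = list(pool.map(run_on_amp, amps))
    return [row for part in partial_results for row in part]


def parsing_engine(request, amps=AMPS):
    """Chain the three stages and return the answer set."""
    return dispatch(optimize(parse(request)), amps)


answer_set = parsing_engine("amount > 20")
print(answer_set)
```

Each AMP filters only its own rows, which is the point of the shared-nothing design: the Parsing Engine coordinates, while the data work happens in parallel where the data lives.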

Intelligence Core: Parsing Engines

Page 100: WorkshopMaxtera_RevDados_11mar15

Hardware: Finding the data…

Storing and Accessing Data Rows Page 12

1 5/16/2013 Copyright © 2013 by Teradata Corporation

Which AMP has the Row?

[Figure: the Parser passes SQL with primary index (PI) values and data to the hashing algorithm; for example, PI value = 197190 yields row hash 000A1F4A. The message (Table ID, Row Hash, PI values, HBN and data) travels through the Message Passing Layer (Hash Maps) to one of the AMPs (AMP 0 ... AMP n), which holds the data table.]

For example: Assume PI value is 197190

Summary

The MPL accesses the Hash Map using Hash Bucket Number (HBN) of 000A1.

Bucket # 000A1 contains the AMP number that has this hash value – effectively the AMP with this row.

HBN – Hash Bucket Number
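The lookup described in the summary can be sketched as follows. The only facts taken from the slide are the 32-bit row hash `000A1F4A` and its bucket number `000A1`, which implies that the bucket is the leading 20 bits of the row hash. Everything else is an assumption: `zlib.crc32` replaces the real hashing algorithm (so it will not reproduce `000A1F4A` for PI value 197190), the round-robin hash map is invented, and the function names are hypothetical.

```python
import zlib

ROW_HASH_BITS = 32
HASH_BUCKET_BITS = 20  # inferred from the slide: 000A1 is 5 hex digits
NUM_BUCKETS = 1 << HASH_BUCKET_BITS


def row_hash(pi_value):
    """Hash the primary index value to 32 bits (crc32 is an assumed stand-in)."""
    return zlib.crc32(str(pi_value).encode("utf-8")) & 0xFFFFFFFF


def hash_bucket(rh):
    """The hash bucket number is the leading 20 bits of the row hash."""
    return rh >> (ROW_HASH_BITS - HASH_BUCKET_BITS)


def build_hash_map(num_amps):
    """Hypothetical hash map: assign buckets to AMPs round-robin."""
    return [bucket % num_amps for bucket in range(NUM_BUCKETS)]


def amp_for(pi_value, hash_map):
    """Return the AMP number that owns the row with this PI value."""
    return hash_map[hash_bucket(row_hash(pi_value))]


hash_map = build_hash_map(num_amps=4)

# The slide's example row hash maps to bucket 000A1.
print(f"{hash_bucket(0x000A1F4A):05X}")
print("AMP for that bucket:", hash_map[hash_bucket(0x000A1F4A)])
print("AMP for PI value 197190 (with the stand-in hash):", amp_for(197190, hash_map))
```

Because the map, not the hash function, decides which AMP owns a bucket, adding AMPs only requires reassigning some buckets in the map.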

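[Figure: the Hash Maps translate each HBN into an AMP #. Rows are identified by a Row ID (Row Hash + Uniq Value) followed by the Row Data, with row hashes ranging from x'00000000' to x'FFFFFFFF'; the example row has Row Hash x'000A1F4A', Uniq Value 0000 0001 and data 38.]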

Page 101: WorkshopMaxtera_RevDados_11mar15

[Figure: data volume grows from ERP (gigabytes) through CRM (terabytes) and Web (petabytes) to Big Data (exabytes), with increasing data variety and complexity. Labels shown in the figure: User Generated Content, Mobile Web, SMS/MMS, Sentiment, External Demographics, HD Video, Speech to Text, Product/Service Logs, Social Network, Business Data Feeds, User Click Stream, Web Logs, Offer History, A/B Testing, Dynamic Pricing, Affiliate Networks, Search Marketing, Behavioral Targeting, Dynamic Funnels, Payment Record, Support Contacts, Customer Touches, Purchase Detail, Purchase Record, Offer Details, Segmentation.]

Big Data: from transactions to interactions

Behavior Analysis

ALL DATA

How do we extract business value?

Page 102: WorkshopMaxtera_RevDados_11mar15

5/15/13 Bill Howe, eScience Institute 31

Design Space

[Figure: design-space chart with the labels Throughput, Latency, Internet, Private data center, Data-parallel and Shared memory; annotations mark "The area we're discussing" and "In a few weeks".]

inspired by a slide by Michael Isard at Microsoft Research

Page 103: WorkshopMaxtera_RevDados_11mar15

Google 2010

Page 104: WorkshopMaxtera_RevDados_11mar15

Graph vs. SQL and SQL-MR

B has high betweenness. You get that from a graph.

Caller  Recipient  # of calls made
A       B          10
A       C          25
A       D          32
A       E          3
B       I          7
C       D          5

[Figure: call graph with nodes A to M]

SQL or SQL-MR will tell you A makes a lot of phone calls.
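The contrast can be made concrete with a small sketch. The call records below are the six rows from the slide; the full edge list of the slide's graph is not recoverable, so `EXTRA_EDGES` is a hypothetical completion chosen only so that B bridges several groups. The aggregation plays the role of a SQL `GROUP BY`, and `betweenness` is a plain implementation of Brandes' algorithm for unweighted, undirected graphs.

```python
from collections import deque

# Call records from the slide: (caller, recipient, number of calls).
CALLS = [
    ("A", "B", 10), ("A", "C", 25), ("A", "D", 32),
    ("A", "E", 3), ("B", "I", 7), ("C", "D", 5),
]

# HYPOTHETICAL edges, not taken from the slide: they only illustrate
# a topology in which B connects otherwise separate groups.
EXTRA_EDGES = [
    ("B", "F"), ("F", "G"), ("F", "H"), ("G", "H"),
    ("I", "J"), ("I", "K"), ("J", "K"),
]


def calls_per_caller(calls):
    """SQL-style aggregate: SUM(calls) GROUP BY caller."""
    totals = {}
    for caller, _recipient, count in calls:
        totals[caller] = totals.get(caller, 0) + count
    return totals


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def betweenness(adjacency):
    """Brandes' algorithm: number of shortest paths passing through each node."""
    score = dict.fromkeys(adjacency, 0.0)
    for source in adjacency:
        order = []
        predecessors = {v: [] for v in adjacency}
        paths = dict.fromkeys(adjacency, 0)
        paths[source] = 1
        distance = dict.fromkeys(adjacency, -1)
        distance[source] = 0
        queue = deque([source])
        while queue:  # breadth-first search from the source
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        dependency = dict.fromkeys(adjacency, 0.0)
        while order:  # accumulate dependencies, farthest nodes first
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair was counted from both endpoints.
    return {v: s / 2 for v, s in score.items()}


totals = calls_per_caller(CALLS)
graph = build_graph([(c, r) for c, r, _ in CALLS] + EXTRA_EDGES)
scores = betweenness(graph)

print("Most calls made:", max(totals, key=totals.get))
print("Highest betweenness:", max(scores, key=scores.get))
```

The aggregate only sees per-row counts, while betweenness depends on the shape of the whole network, which is why it needs a graph traversal rather than a single grouped query.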

Page 105: WorkshopMaxtera_RevDados_11mar15

?

Adriano Amaral, Director of Technology and Solutions, about.me/adriano_amaral