INF550 - Cloud Computing I
Apache Spark
Islene Calciolari Garcia
Instituto de Computação - Unicamp
July 2016
Summary
Review of the previous class...
  Objectives
  HDFS
  MapReduce
Hadoop Ecosystem
Spark
  RDDs and SparkContext
  How to test?
Quick Python review
  pyspark
Lab
Review of the previous class...
Objectives
- First part of the course
  - Cloud computing
  - Data centers
- Second part of the course
  - MapReduce programming model
  - Spark and the Hadoop ecosystem
HDFS: Reading a file
Data Flow
Anatomy of a File Read
To get an idea of how data flows between the client interacting with HDFS, the namenode, and the datanodes, consider Figure 3-2, which shows the main sequence of events when reading a file.
Figure 3-2. A client reading data from HDFS
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2). DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster's network; see "Network Topology and Hadoop" on page 70). If the client is itself a datanode (in the case of a MapReduce task, for instance), the client will read from the local datanode if that datanode hosts a copy of the block (see also Figure 2-2 and "Short-circuit local reads" on page 308).
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first [...]
Source: Hadoop: The Definitive Guide, Tom White
HDFS: Writing to a file
Anatomy of a File Write
Next we'll look at how files are written to HDFS. Although quite detailed, it is instructive to understand the data flow because it clarifies HDFS's coherency model.
We're going to consider the case of creating a new file, writing data to it, then closing the file. This is illustrated in Figure 3-4.
Figure 3-4. A client writing data to HDFS
The client creates the file by calling create() on DistributedFileSystem (step 1 in Figure 3-4). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.
As the client writes data (step 3), the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in [...]
Source: Hadoop: The Definitive Guide, Tom White
MapReduce: A colorful view
http://www.cs.uml.edu/~jlu1/doc/source/report/MapReduce.html
Word Count
http://www.cs.uml.edu/~jlu1/doc/source/report/img/MapReduceExample.png
Hadoop Ecosystem
Hadoop Ecosystem
http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview
Hadoop 1.0
- JobTracker
  - Management of the TaskTrackers (resources and failures)
  - Management of the job life cycle
- TaskTracker
  - Start and stop tasks
  - Send status to the JobTracker
- Scalability?
- Other programming models?
YARN
- The JobTracker was overloaded
  - Resource management
  - Application management
- Container: abstraction that bundles resources such as CPU, memory, disk, network...
- ResourceManager
  - Resource scheduler
- NodeManager
- ApplicationMaster
Testing YARN
$ sbin/start-yarn.sh
$ bin/hadoop dfs -put input /input
$ bin/yarn jar \
share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
wordcount /input /output
$ bin/hadoop dfs -get /output output
- Check the jobs at http://localhost:8088/
- When you are done:
$ sbin/stop-yarn.sh
Managing multiple applications
http://hortonworks.com/blog/apache-spark-yarn-ready-hortonworks-data-platform/
How does Spark manage to be so much faster? MapReduce processing model
MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Use a higher level language or DSL that does this for you
Typical MapReduce Workflows
[Diagram: jobs chained through HDFS. The output of Job 1 (Maps, Reduces) is written to a SequenceFile, which becomes the input to Job 2, and so on until the last job; every intermediate result is stored in HDFS.]
Carol McDonald: An Overview of Apache Spark
How does Spark manage to be so much faster? Resilient Distributed Datasets
Iterations
[Diagram: an iterative job as a sequence of steps that reuse the same data.]
In-memory Caching
• Data partitions read from RAM instead of disk
Carol McDonald: An Overview of Apache Spark
- RDD: the main abstraction in Spark
- Immutable (see the sketch below)
- Fault-tolerant
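Transformations never modify an existing RDD; they always return a new one. A minimal sketch in the pyspark shell (sc is already available there; the data is just an illustrative list):

>>> nums = sc.parallelize([1, 2, 3, 4])    # original RDD
>>> doubled = nums.map(lambda x: x * 2)    # new RDD; nums is left untouched
>>> nums.collect()
[1, 2, 3, 4]
>>> doubled.collect()
[2, 4, 6, 8]

Because each RDD remembers how it was derived from its parent, a lost partition can be recomputed from this lineage, which is what makes RDDs fault-tolerant.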
SparkContext
Spark Programming Model
[Diagram: the driver program creates a SparkContext, which connects to the cluster and schedules tasks on the worker nodes.]
sc = new SparkContext
rDD = sc.textfile("hdfs://…")
rDD.map
Carol McDonald: An Overview of Apache Spark
SparkContext: Cluster overview
http://spark.apache.org/docs/latest/cluster-overview.html
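Outside the interactive shells, the SparkContext has to be created explicitly by the driver program. A minimal standalone sketch in Python (the application name, master URL and file name below are only illustrative assumptions):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("inf550-example").setMaster("local[2]")
sc = SparkContext(conf=conf)           # the driver's entry point to the cluster
lines = sc.textFile("tcpdump.list")    # RDD partitioned across the workers
print lines.count()
sc.stop()

Such a script would typically be launched with spark-submit rather than run directly with the Python interpreter.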
SparkContext: RDDs and partitions
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• Fault-tolerant
• Read-only collection of elements
• Operated on in parallel
• Cached in memory or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Operations on RDDs
- Much more than Map and Reduce
- Transformations and Actions
- Lazy evaluation of transformations (see the sketch below)
http://databricks.com
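A quick way to observe the lazy evaluation in the pyspark shell (a sketch; the file name is the same DARPA trace used later): transformations only describe the computation, and nothing is actually read until an action runs.

>>> lines = sc.textFile("tcpdump.list")             # transformation: nothing is read yet
>>> telnet = lines.filter(lambda x: "telnet" in x)  # still nothing executed
>>> telnet.count()                                  # action: only now the file is read and filtered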
How to test?
How to test? Spark and Python
- Spark can easily be used with Scala, Java, or Python
- See the Spark Quick Start
- Shells
  - python shell
  - pyspark (a quick sanity check is sketched below)
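Once pyspark is up, a quick sanity check that needs no input files (a minimal sketch) is to parallelize a small local collection and run an action on it:

>>> sc.parallelize(range(10)).count()
10
>>> sc.parallelize(range(1, 10)).filter(lambda x: x % 2 != 0).collect()
[1, 3, 5, 7, 9]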
Quick Python review: String operations
Commands               Output
astring = "Spark"
print astring          Spark
print len(astring)     5
print astring[0]       S
print astring[1:3]     pa
print astring[3:]      rk
print astring[0:5:2]   Sak
print astring[::-1]    krapS
Python: More string operations
Commands               Output
line = " GNU is not Unix. "
line = line.strip()
print line             GNU is not Unix.
words = line.split()
print words            ['GNU', 'is', 'not', 'Unix.']
print words[1]         is
Quick Python review: Functions
def soma(a,b):
return a + b
def mult(a,b):
return a * b
def invertString(s):
return s[::-1]
Quick Python review: Lambda functions
Functions that are not bound to a name at run time
>>> lista = range(1,10)
>>> print lista
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> def impar(i) :
... return i % 2 != 0
>>> filter (impar, lista)
[1, 3, 5, 7, 9]
>>> filter (lambda x: x % 2 != 0, lista)
[1, 3, 5, 7, 9]
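The same anonymous functions also work with map, which is the pattern Spark's RDD operations build on; continuing with the same lista (a small illustrative example):

>>> map(lambda x: x * x, lista)
[1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> map(lambda x: x + 1, filter(impar, lista))
[2, 4, 6, 8, 10]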
pyspark: First SparkContext
Working With RDDs
textFile = sc.textFile("SomeFile.txt")
Carol McDonald: An Overview of Apache Spark
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ ‘/ __/ ’_/
/__ / .__/\_,_/_/ /_/\_\ version 1.6.1
/_/
Using Python version 2.7.11 (default, Jun 20 2016 14:45:23)
SparkContext available as sc, HiveContext available as sqlContext.
>>> lines = sc.textFile("tcpdump.list")
1998 DARPA Intrusion Detection Evaluation
ID  Start Date  Start Time  Duration  Serv  Src Port  Dest Port  Src IP  Dest IP  Score  Attack Name
1 01/27/1998 00:00:01 00:00:23 ftp 1755 21 192.168.1.30 192.168.0.20 0.31 -
2 01/27/1998 05:04:43 67:59:01 telnet 1042 23 192.168.1.30 192.168.0.20 0.42 -
3 01/27/1998 06:04:36 00:00:59 smtp 43590 25 192.168.1.30 192.168.0.40 12.0 -
4 01/27/1998 08:45:01 00:00:01 finger 1050 79 192.168.0.40 192.168.1.30 2.56 guess
5 01/27/1998 09:23:45 00:00:01 http 1031 80 192.168.1.30 192.168.0.40 -1.3 -
7 01/27/1998 15:11:32 00:00:12 sunrpc 2025 111 192.168.1.30 192.168.0.20 3.10 rpc
8 01/27/1998 21:53:17 00:00:45 exec 2032 512 192.168.1.30 192.168.0.40 2.95 exec
9 01/27/1998 21:58:21 00:00:01 http 1031 80 192.168.1.30 192.168.0.20 0.45 -
10 01/27/1998 22:57:53 26:59:00 login 2031 513 192.168.0.40 192.168.1.20 7.00 -
11 01/27/1998 23:57:28 130:23:08 shell 1022 514 192.168.1.30 192.168.0.20 0.52 guess
13 01/27/1998 25:38:00 00:00:01 eco/i - - 192.168.0.40 192.168.1.30 0.01 -
1998 DARPA Intrusion Detection Evaluation
https://www.ll.mit.edu/ideval/docs/index.html
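Each record is a single whitespace-separated line, so one split() already recovers all the fields. A sketch of extracting the service and the attack score (column positions taken from the sample above; short or header lines are filtered out first):

>>> lines = sc.textFile("tcpdump.list")
>>> fields = lines.map(lambda x: x.split()).filter(lambda f: len(f) > 10)
>>> fields.map(lambda f: (f[4], f[9])).take(3)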
Filter
Working With RDDs
[Diagram: each transformation produces a new RDD, forming a chain of RDDs.]
Transformations
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
Carol McDonald: An Overview of Apache Spark
How to print and filter an RDD
>>> lines = sc.textFile("tcpdump.list")
>>> lines.take(10)
>>> for x in lines.collect():
... print x
>>> telnet = lines.filter(lambda x: "telnet" in x)
>>> for x in telnet.collect():
... print x
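Besides printing on the driver, a filtered RDD can also be written back to storage as text, one output file per partition (a sketch; the output directory name is just an illustration):

>>> telnet.count()                        # how many telnet records there are
>>> telnet.saveAsTextFile("telnet-only")  # writes the filtered records to a directory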
Some simple transformations
map(func)             every element of the original RDD is transformed by func
flatMap(func)         every element of the original RDD is transformed into 0 or more items by func
filter(func)          returns only the elements selected by func
groupByKey()          given a (k, v) dataset, returns (k, Iterable<v>)
reduceByKey(func)     given a (k, v) dataset, returns another one in which the values of each key are aggregated by func
sortByKey(ascending)  given a (k, v) dataset, returns another one sorted in ascending or descending key order
See more in the Spark Programming Guide
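Combining a few of these on the DARPA trace (a sketch; it assumes the whitespace-separated layout shown earlier): count how many connections each service received and sort the result by service name.

>>> lines = sc.textFile("tcpdump.list")
>>> records = lines.filter(lambda x: len(x.split()) > 10)   # keep only full data records
>>> pairs = records.map(lambda x: (x.split()[4], 1))        # (service, 1)
>>> counts = pairs.reduceByKey(lambda a, b: a + b)          # (service, total)
>>> counts.sortByKey().collect()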
Some actions
count()     returns the number of elements in the dataset
collect()   returns all the elements of the dataset
take(n)     returns the first n elements of the dataset
See more in the Spark Programming Guide
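The choice between them matters on large datasets: count() and take(n) return small results to the driver, while collect() ships the whole RDD back. A quick illustration with the same trace:

>>> lines.count()       # computed on the workers, only a number comes back
>>> lines.take(5)       # only the first 5 records come back to the driver
>>> lines.collect()     # the entire dataset comes back: fine here, risky for very large RDDs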
Lab
- Install Spark
- Get a copy of the DARPA dataset
- Come up with interesting questions and operate on the data
- Submit code and report via Moodle
Simple example
How to sort the accesses by type of service
>>> lines = sc.textFile("tcpdump.list")
>>> servicePairs = lines.map(lambda x: (str(x.split()[4]), str(x)))
>>> sortedServ = servicePairs.sortByKey()
How could we dig through the data to identify the attacks?
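One possible starting point (a sketch under the assumption, taken from the sample shown earlier, that column 9 holds the attack score and column 10 the attack name): keep only the records that carry an attack label and count them by attack name.

>>> records = lines.map(lambda x: x.split()).filter(lambda f: len(f) > 10)
>>> attacks = records.filter(lambda f: f[10] != "-")
>>> attacks.map(lambda f: (f[10], 1)).reduceByKey(lambda a, b: a + b).collect()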
References
- Python Tutorial
- Apache Spark
- Spark Programming Guide
- Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics, Juwei Shi et al., IBM Research, China