INF550 - Cloud Computing I
Apache Spark
Islene Calciolari Garcia
Instituto de Computação - Unicamp
July 2016
Summary
Review of the previous class...
  Objectives
  HDFS
  MapReduce
Hadoop Ecosystem
Spark
  RDDs and SparkContext
  How to test?
Quick Python review
  pyspark
Lab
Review of the previous class...
Objectives
- First part of the course
  - Cloud computing
  - Data centers
- Second part of the course
  - MapReduce programming model
  - Spark and the Hadoop ecosystem
HDFS: Reading a file
Data Flow
Anatomy of a File Read
To get an idea of how data flows between the client interacting with HDFS, the namenode, and the datanodes, consider Figure 3-2, which shows the main sequence of events when reading a file.
Figure 3-2. A client reading data from HDFS
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2). DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster's network; see "Network Topology and Hadoop" on page 70). If the client is itself a datanode (in the case of a MapReduce task, for instance), the client will read from the local datanode if that datanode hosts a copy of the block (see also Figure 2-2 and "Short-circuit local reads" on page 308).
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first [...]
Source: Hadoop: The Definitive Guide, Tom White
HDFS: Writing to a file
Anatomy of a File Write
Next we'll look at how files are written to HDFS. Although quite detailed, it is instructive to understand the data flow because it clarifies HDFS's coherency model.
We're going to consider the case of creating a new file, writing data to it, then closing the file. This is illustrated in Figure 3-4.
Figure 3-4. A client writing data to HDFS
The client creates the file by calling create() on DistributedFileSystem (step 1 in Figure 3-4). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.
As the client writes data (step 3), the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in [...]
Source: Hadoop: The Definitive Guide, Tom White
MapReduce: A colorful view
http://www.cs.uml.edu/~jlu1/doc/source/report/MapReduce.html
Word Count
http://www.cs.uml.edu/~jlu1/doc/source/report/img/MapReduceExample.png
Hadoop Ecosystem
Hadoop Ecosystem
http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview
Hadoop 1.0
- JobTracker
  - Management of the TaskTrackers (resources and failures)
  - Management of the job life cycle
- TaskTracker
  - Start and stop tasks
  - Send status to the JobTracker
- Scalability?
- Other programming models?
YARN
- The JobTracker was overloaded
  - Resource management
  - Application management
- Container: abstraction that bundles resources such as CPU, memory, disk, network...
- ResourceManager
  - Resource scheduler
- NodeManager
- ApplicationMaster
Testing YARN
$ sbin/start-yarn.sh
$ bin/hadoop dfs -put input /input
$ bin/yarn jar \
share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
wordcount /input /output
$ bin/hadoop dfs -get /output output
- Check the jobs at http://localhost:8088/
- When you are done:
$ sbin/stop-yarn.sh
Managing multiple applications
http://hortonworks.com/blog/apache-spark-yarn-ready-hortonworks-data-platform/
How does Spark manage to be so much faster? MapReduce processing model
MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Use a higher level language or DSL that does this for you
Typical MapReduce Workflows
[Diagram: jobs chained through HDFS. The output of Job 1 (Maps, Reduces) is written to a SequenceFile, which becomes the input to Job 2, and so on until the last job; every intermediate result is stored in HDFS.]
Carol McDonald: An Overview of Apache Spark
How does Spark manage to be so much faster? Resilient Distributed Datasets
Iterations
[Diagram: an iterative job as a sequence of steps that reuse the same data.]
In-memory Caching
• Data partitions read from RAM instead of disk
Carol McDonald: An Overview of Apache Spark
- RDD: the main abstraction in Spark
- Immutable (see the sketch below)
- Fault-tolerant
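Transformations never modify an existing RDD; they always return a new one. A minimal sketch in the pyspark shell (sc is already available there; the data is just an illustrative list):

>>> nums = sc.parallelize([1, 2, 3, 4])    # original RDD
>>> doubled = nums.map(lambda x: x * 2)    # new RDD; nums is left untouched
>>> nums.collect()
[1, 2, 3, 4]
>>> doubled.collect()
[2, 4, 6, 8]

Because each RDD remembers how it was derived from its parent, a lost partition can be recomputed from this lineage, which is what makes RDDs fault-tolerant.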
SparkContext
Spark Programming Model
[Diagram: the driver program creates a SparkContext, which connects to the cluster and schedules tasks on the worker nodes.]
sc = new SparkContext
rDD = sc.textfile("hdfs://…")
rDD.map
Carol McDonald: An Overview of Apache Spark
SparkContext: Cluster overview
http://spark.apache.org/docs/latest/cluster-overview.html
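Outside the interactive shells, the SparkContext has to be created explicitly by the driver program. A minimal standalone sketch in Python (the application name, master URL and file name below are only illustrative assumptions):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("inf550-example").setMaster("local[2]")
sc = SparkContext(conf=conf)           # the driver's entry point to the cluster
lines = sc.textFile("tcpdump.list")    # RDD partitioned across the workers
print lines.count()
sc.stop()

Such a script would typically be launched with spark-submit rather than run directly with the Python interpreter.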
SparkContext: RDDs and partitions
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• Fault-tolerant
• Read-only collection of elements
• Operated on in parallel
• Cached in memory or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Operations on RDDs
- Much more than Map and Reduce
- Transformations and Actions
- Lazy evaluation of transformations (see the sketch below)
http://databricks.com
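A quick way to observe the lazy evaluation in the pyspark shell (a sketch; the file name is the same DARPA trace used later): transformations only describe the computation, and nothing is actually read until an action runs.

>>> lines = sc.textFile("tcpdump.list")             # transformation: nothing is read yet
>>> telnet = lines.filter(lambda x: "telnet" in x)  # still nothing executed
>>> telnet.count()                                  # action: only now the file is read and filtered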
How to test?
How to test? Spark and Python
- Spark can easily be used with Scala, Java, or Python
- See the Spark Quick Start
- Shells
  - python shell
  - pyspark (a quick sanity check is sketched below)
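Once pyspark is up, a quick sanity check that needs no input files (a minimal sketch) is to parallelize a small local collection and run an action on it:

>>> sc.parallelize(range(10)).count()
10
>>> sc.parallelize(range(1, 10)).filter(lambda x: x % 2 != 0).collect()
[1, 3, 5, 7, 9]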
Quick Python review: String operations
Commands               Output
astring = "Spark"
print astring          Spark
print len(astring)     5
print astring[0]       S
print astring[1:3]     pa
print astring[3:]      rk
print astring[0:5:2]   Sak
print astring[::-1]    krapS
Python: More string operations
Commands               Output
line = " GNU is not Unix. "
line = line.strip()
print line             GNU is not Unix.
words = line.split()
print words            ['GNU', 'is', 'not', 'Unix.']
print words[1]         is
Quick Python review: Functions
def soma(a,b):
return a + b
def mult(a,b):
return a * b
def invertString(s):
return s[::-1]
Quick Python review: Lambda functions
Functions that are not bound to a name at run time
>>> lista = range(1,10)
>>> print lista
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> def impar(i) :
... return i % 2 != 0
>>> filter (impar, lista)
[1, 3, 5, 7, 9]
>>> filter (lambda x: x % 2 != 0, lista)
[1, 3, 5, 7, 9]
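The same anonymous functions also work with map, which is the pattern Spark's RDD operations build on; continuing with the same lista (a small illustrative example):

>>> map(lambda x: x * x, lista)
[1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> map(lambda x: x + 1, filter(impar, lista))
[2, 4, 6, 8, 10]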
pyspark: First SparkContext
Working With RDDs
textFile = sc.textFile("SomeFile.txt")
Carol McDonald: An Overview of Apache Spark
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ ‘/ __/ ’_/
/__ / .__/\_,_/_/ /_/\_\ version 1.6.1
/_/
Using Python version 2.7.11 (default, Jun 20 2016 14:45:23)
SparkContext available as sc, HiveContext available as sqlContext.
>>> lines = sc.textFile("tcpdump.list")
1998 DARPA Intrusion Detection Evaluation
ID  Start Date  Start Time  Duration  Serv  Src Port  Dest Port  Src IP  Dest IP  Score  Attack Name
1 01/27/1998 00:00:01 00:00:23 ftp 1755 21 192.168.1.30 192.168.0.20 0.31 -
2 01/27/1998 05:04:43 67:59:01 telnet 1042 23 192.168.1.30 192.168.0.20 0.42 -
3 01/27/1998 06:04:36 00:00:59 smtp 43590 25 192.168.1.30 192.168.0.40 12.0 -
4 01/27/1998 08:45:01 00:00:01 finger 1050 79 192.168.0.40 192.168.1.30 2.56 guess
5 01/27/1998 09:23:45 00:00:01 http 1031 80 192.168.1.30 192.168.0.40 -1.3 -
7 01/27/1998 15:11:32 00:00:12 sunrpc 2025 111 192.168.1.30 192.168.0.20 3.10 rpc
8 01/27/1998 21:53:17 00:00:45 exec 2032 512 192.168.1.30 192.168.0.40 2.95 exec
9 01/27/1998 21:58:21 00:00:01 http 1031 80 192.168.1.30 192.168.0.20 0.45 -
10 01/27/1998 22:57:53 26:59:00 login 2031 513 192.168.0.40 192.168.1.20 7.00 -
11 01/27/1998 23:57:28 130:23:08 shell 1022 514 192.168.1.30 192.168.0.20 0.52 guess
13 01/27/1998 25:38:00 00:00:01 eco/i - - 192.168.0.40 192.168.1.30 0.01 -
1998 DARPA Intrusion Detection Evaluation
https://www.ll.mit.edu/ideval/docs/index.html
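Each record is a single whitespace-separated line, so one split() already recovers all the fields. A sketch of extracting the service and the attack score (column positions taken from the sample above; short or header lines are filtered out first):

>>> lines = sc.textFile("tcpdump.list")
>>> fields = lines.map(lambda x: x.split()).filter(lambda f: len(f) > 10)
>>> fields.map(lambda f: (f[4], f[9])).take(3)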
Filter
Working With RDDs
[Diagram: each transformation produces a new RDD, forming a chain of RDDs.]
Transformations
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
Carol McDonald: An Overview of Apache Spark
How to print and filter an RDD
>>> lines = sc.textFile("tcpdump.list")
>>> lines.take(10)
>>> for x in lines.collect():
... print x
>>> telnet = lines.filter(lambda x: "telnet" in x)
>>> for x in telnet.collect():
... print x
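Besides printing on the driver, a filtered RDD can also be written back to storage as text, one output file per partition (a sketch; the output directory name is just an illustration):

>>> telnet.count()                        # how many telnet records there are
>>> telnet.saveAsTextFile("telnet-only")  # writes the filtered records to a directory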
Some simple transformations
map(func)             every element of the original RDD is transformed by func
flatMap(func)         every element of the original RDD is transformed into 0 or more items by func
filter(func)          returns only the elements selected by func
groupByKey()          given a (k, v) dataset, returns (k, Iterable<v>)
reduceByKey(func)     given a (k, v) dataset, returns another one in which the values of each key are aggregated by func
sortByKey(ascending)  given a (k, v) dataset, returns another one sorted in ascending or descending key order
See more in the Spark Programming Guide
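Combining a few of these on the DARPA trace (a sketch; it assumes the whitespace-separated layout shown earlier): count how many connections each service received and sort the result by service name.

>>> lines = sc.textFile("tcpdump.list")
>>> records = lines.filter(lambda x: len(x.split()) > 10)   # keep only full data records
>>> pairs = records.map(lambda x: (x.split()[4], 1))        # (service, 1)
>>> counts = pairs.reduceByKey(lambda a, b: a + b)          # (service, total)
>>> counts.sortByKey().collect()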
Some actions
count()     returns the number of elements in the dataset
collect()   returns all the elements of the dataset
take(n)     returns the first n elements of the dataset
See more in the Spark Programming Guide
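The choice between them matters on large datasets: count() and take(n) return small results to the driver, while collect() ships the whole RDD back. A quick illustration with the same trace:

>>> lines.count()       # computed on the workers, only a number comes back
>>> lines.take(5)       # only the first 5 records come back to the driver
>>> lines.collect()     # the entire dataset comes back: fine here, risky for very large RDDs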
Lab
- Install Spark
- Get a copy of the DARPA dataset
- Come up with interesting questions and operate on the data
- Submit code and report via Moodle
Simple example
How to sort the accesses by type of service
>>> lines = sc.textFile("tcpdump.list")
>>> servicePairs = lines.map(lambda x: (str(x.split()[4]), str(x)))
>>> sortedServ = servicePairs.sortByKey()
How could we dig through the data to identify the attacks?
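One possible starting point (a sketch under the assumption, taken from the sample shown earlier, that column 9 holds the attack score and column 10 the attack name): keep only the records that carry an attack label and count them by attack name.

>>> records = lines.map(lambda x: x.split()).filter(lambda f: len(f) > 10)
>>> attacks = records.filter(lambda f: f[10] != "-")
>>> attacks.map(lambda f: (f[10], 1)).reduceByKey(lambda a, b: a + b).collect()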
References
- Python Tutorial
- Apache Spark
- Spark Programming Guide
- Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics, Juwei Shi et al., IBM Research, China