Dynamic Language VMs - Inside Ruby (Sapo Codebits 2009)

Post on 28-Jan-2015

The only efficient way to make the most of something is to understand its mechanics: a pilot has deep knowledge of many scientific factors and their effects on a plane. Why do so many developers fly blind? We'll take a peek into the Ruby 1.9 VM's internals with DTrace and observe the effect of some core components on the memory, IO and CPU subsystems. No prior knowledge of virtual machines or interpreters is assumed.

Interpreter-specific subjects touched upon:
* Source to runtime : Loading files, parsing to Nodes and eval
* VM : Symbol table, method cache, frames, method dispatch and optimizations
* Object model : Core types, Modules and variables
* Closures : Blocks and procedures
* POSIX, IO and Contexts : Signals, system calls and Thread / Fiber switches
* Garbage Collection : Heap space, alloc / dealloc and GC patterns

Transcript of Dynamic Language VMs - Inside Ruby (Sapo Codebits 2009)

Dynamic Language VMs
Inside Ruby - Lourens Naudé

Friday, 4 December 2009

Background
• Freelance Ruby/C/Systems Developer
• http://github.com/methodmissing
• Contractor at Trade2Win Ltd.
• Realtime Forex / Autotrading Platform

Process
Front-end (parsing)
Semantics
Back-end (runtime)

Roadmap
• Source to Nodes and the AST
• VM : Symbol table, caches, opcode dispatch and optimizations
• Object Model : Objects, methods and variables
• Garbage Collection
• Contexts : Threads, GVL, Fibers

Source to AST

Lexical Analysis
• Converts source code to a token stream
• Token identification (keyword_class, keyword_module etc.)
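A token stream can be inspected from Ruby itself. The sketch below uses the Ripper library that ships with Ruby 1.9 and later; the sample source string is made up for illustration, and Ripper's token names (such as `:on_kw`) are its own and differ from the parser's internal names like keyword_class.

```ruby
require 'ripper'

source = "class Foo; end"

# Ripper.lex returns one entry per token: [[line, column], type, text, ...]
tokens = Ripper.lex(source).map { |tok| [tok[1], tok[2]] }

tokens.each { |type, text| puts "#{type.to_s.ljust(10)} #{text.inspect}" }
```

Both `class` and `end` come back tagged as keywords, while `Foo` is tagged as a constant.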

Grammar
• Describes program syntax structure
• The semantics of a program are defined by its syntax
• Production rules : name and use case
• Object#block_arg(&block)
• Object#opt_block_arg(arg1, arg2, &block)

Abstract Syntax Tree

VM Architecture
• Reuse of some 1.8.x series architecture : parsing, AST nodes, Object, GC etc.
• Introduces a code generation phase to convert the AST to instruction sequences for better optimization hooks and a faster runtime
• No speedup for inherited MRI features such as string processing etc.

AST Annotations
• Represents grammar
• Sometimes referred to as an annotated AST
• Annotations / attributes attach semantics to nodes
• Literals, values, statements, callsite info ( file and line number )
• Can be augmented with semantic analysis
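To get a feel for an annotated tree, the following sketch parses a one-line program with Ripper and collects the node types. Ripper's S-expression is a view for tooling, not the VM's internal node structure, and the `node_types` helper is hypothetical, written only for this example.

```ruby
require 'ripper'
require 'pp'

# Ripper.sexp parses source into nested arrays (an S-expression view of the tree)
tree = Ripper.sexp("total = 1 + 2")
pp tree

# Walk the tree and collect the node type found at the head of each array
def node_types(node, acc = [])
  return acc unless node.is_a?(Array)
  acc << node.first if node.first.is_a?(Symbol)
  node.each { |child| node_types(child, acc) }
  acc
end

types = node_types(tree)
```

The leaf nodes in the printed tree carry a `[line, column]` pair, which is the kind of callsite annotation the slide refers to.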

AST Transformation
• Removes AST noise
• Refactors to features that map closer to machine instructions
• Usually yields more AST nodes, but reduces overall complexity

Intermediate Tree Nodes
• Minimal subset required for code generation
• Expressions and assignments
• Method calls, arguments and return values
• Conditional jumps - if/else, iterators
• Unconditional jumps - exceptions, retry, catch/throw

Code Generation
• Converts AST to code segments - a linear instruction set
• Selection : Which tree sections to rewrite ?
• AST Node -> instruction ordering
• Narrow tree scope considers only small subsets of the AST to reduce the inherent complexity of code generation

Codegen Workflow
• Preprocessing : AST node refactorings ( YARV doesn’t do this )
• Codegen : Nodes to instruction sequences
• Postprocessing : Generated instruction sequences replaced with optimal ones - compiled instruction sequences and peephole optimization
• Pre and Postprocessing phases may benefit from multiple passes
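The result of code generation can be inspected on CRuby through `RubyVM::InstructionSequence`. This is a minimal sketch; the sample source is made up, and the exact instructions listed vary between Ruby versions.

```ruby
# RubyVM::InstructionSequence is specific to CRuby (1.9 and later)
iseq = RubyVM::InstructionSequence.compile("a = 1 + 2")

# disasm returns a human readable listing of the generated instructions
listing = iseq.disasm
puts listing

# eval runs the compiled sequence on the VM and returns its value
result = iseq.eval
```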

VM Internals

Symbol (Hash) Table
• Access to int/char indexed values in almost constant time with a hash table
• Lookup of methods, ivars, global vars, encodings, VM instructions etc.
• Table defaults to 11 bins and a maximum of 5 entries per bin. The bin count can increase.
• Sequential lookup inside bins, thus a slowdown for a density of > 5
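The bin and density behaviour can be modelled in a few lines of Ruby. This is a toy sketch, not the VM's C implementation: the `ToyTable` class is hypothetical, it borrows the 11 bins and density of 5 from the slide, and the growth rule (double plus one) is an assumption chosen for brevity.

```ruby
# Toy chained hash table; bin count and density mirror the slide's numbers
class ToyTable
  DEFAULT_BINS = 11
  MAX_DENSITY  = 5

  attr_reader :bins

  def initialize
    @bins = Array.new(DEFAULT_BINS) { [] }
    @entries = 0
  end

  def []=(key, value)
    rehash if @entries >= @bins.size * MAX_DENSITY
    bin = @bins[key.hash % @bins.size]
    pair = bin.find { |k, _| k == key }   # sequential scan inside the bin
    if pair
      pair[1] = value
    else
      bin << [key, value]
      @entries += 1
    end
  end

  def [](key)
    pair = @bins[key.hash % @bins.size].find { |k, _| k == key }
    pair && pair[1]
  end

  private

  # Grow the bin array and redistribute entries once bins get too dense
  def rehash
    old = @bins.flatten(1)
    @bins = Array.new(@bins.size * 2 + 1) { [] }
    old.each { |k, v| @bins[k.hash % @bins.size] << [k, v] }
  end
end

table = ToyTable.new
100.times { |i| table[i] = "value#{i}" }
```

Lookups stay cheap only while bins stay short, which is why the table grows instead of letting the chains lengthen.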

Symbols - VMNS :-)
• An entity with both a String and Number representation
• It does NOT contain a String or Number, it simply points to a hash table entry
• The developer identifies it by name, the VM identifies it by its numeric representation
• Immutable (4 bytes per Symbol) for performance benefits
• DNS analogy : developers prefer named entities, the runtime prefers numerical representations
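The name / number duality is visible from plain Ruby. A small sketch, with `:dispatch` as an arbitrary example name:

```ruby
# The same name always resolves to the same Symbol object
same_symbol = :dispatch.equal?("dispatch".to_sym)
same_id     = :dispatch.object_id == "dispatch".to_sym.object_id

# Two Strings with equal content are still two distinct objects
a = String.new("dispatch")
b = String.new("dispatch")
same_string = a.equal?(b)

puts "symbols identical: #{same_symbol}, strings identical: #{same_string}"
```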

VM Opcodes
• Stateless functions that operate on a Stack Machine
• 79 instructions as of Dec 4, 2009
• Notation : instruction / opcode / operands

Instruction Categories
• variable : get or set local variable
• put : push an object onto the stack
• stack : pop from stack, empty the stack
• setting : is a given variable defined ?
• class/module : define a class / module
• method/iterator : invoking methods, calling blocks
• exception :
• jump : control flow branching
• optimization : redefines +, <<, * etc. in some cases

Pure Stack Machine
• 2 instruction types :
• Move / copy value(s) between top of stack and elsewhere
• Operate on the top stack element(s)
• SP : top of stack pointer
• BP : beginning of stack pointer

Stack Machine
• Put 3 strings on the stack, “a”, “b” and “c”
• Fetch the top 3 stack elements and create an Array from them
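The two steps on this slide can be replayed on a toy stack machine. The sketch is illustrative only: the `run` helper is hypothetical and the opcode names are loosely modelled on YARV's `putstring` and `newarray`.

```ruby
# Toy stack machine: each instruction is [opcode, operand]
def run(instructions)
  stack = []
  instructions.each do |opcode, operand|
    case opcode
    when :putstring then stack.push(operand)
    when :newarray  then stack.push(stack.pop(operand))  # pop N values, push one Array
    else raise ArgumentError, "unknown opcode #{opcode}"
    end
  end
  stack.pop
end

program = [[:putstring, "a"], [:putstring, "b"], [:putstring, "c"], [:newarray, 3]]
result = run(program)
p result
```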

Instruction Sequence
• A flat instruction sequence structure is much faster than traversing tree nodes, but instruction dispatch from this pipeline can be a bottleneck
• The ability to optimize simple instructions is very important
• Native code / language extensions are usually only a small subset of the hot path
• Native DB socket layer VS multi-model ORM in Ruby
• Direct Threaded Dispatch : fastest way to the next VM instruction
• Switch Dispatch : slower, but more portable

Switch Dispatch
• Most portable, but much slower due to excessive CPU branch mispredictions
• Executes more native instructions per opcode dispatch
• Average 50% slower than Threaded Dispatch

Direct Threaded Dispatch
• Represents an instruction by the address of the routine that implements it
• Jumps context to the address of the current instruction and bumps the PC
• Requires first class labels and some GCC help, hence a portability concern
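Ruby has no computed gotos, so the difference between the two dispatch styles can only be shown by analogy. In the sketch below, `run_switch` decodes each opcode through one central `case`, while `resolve` replaces every opcode with its handler once, up front, so the run loop simply calls the next handler. All names are hypothetical, and this says nothing about relative speed in Ruby itself.

```ruby
OPS = {
  push: ->(stack, arg) { stack.push(arg) },
  add:  ->(stack, _)   { stack.push(stack.pop + stack.pop) },
  mul:  ->(stack, _)   { stack.push(stack.pop * stack.pop) }
}

# Switch dispatch: every step goes back through one central branch
def run_switch(program)
  stack = []
  program.each do |op, arg|
    case op
    when :push then stack.push(arg)
    when :add  then stack.push(stack.pop + stack.pop)
    when :mul  then stack.push(stack.pop * stack.pop)
    end
  end
  stack.pop
end

# Threaded flavour: an instruction is represented by the routine implementing it
def resolve(program)
  program.map { |op, arg| [OPS.fetch(op), arg] }
end

def run_threaded(threaded)
  stack = []
  threaded.each { |handler, arg| handler.call(stack, arg) }
  stack.pop
end

program = [[:push, 2], [:push, 3], [:add], [:push, 4], [:mul]]
puts run_switch(program)
puts run_threaded(resolve(program))
```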

VM Versioning
• Each VM instance has a state counter used to scope caches to the current VM state
• Lazy cache invalidation : bumping the version value avoids any cache expiry overhead
• Expired on : const definition, constant removal, method definition, method removal and method cache changes (covered later)
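Lazy invalidation through a version counter can be sketched as follows. The `VersionedCache` class is hypothetical and far simpler than the VM's caches; it only shows that bumping a counter expires every entry without touching any of them.

```ruby
class VersionedCache
  attr_reader :version

  def initialize
    @version = 0
    @entries = {}
  end

  # Called on events such as a method or constant (re)definition
  def invalidate!
    @version += 1          # constant time: nothing is deleted or walked
  end

  def fetch(key)
    stamp, value = @entries[key]
    return value if stamp == @version      # entry is current: cache hit
    value = yield                          # stale or missing: recompute
    @entries[key] = [@version, value]
    value
  end
end

cache = VersionedCache.new
lookups = 0
cache.fetch(:greet) { lookups += 1; :slow_lookup_result }   # miss, block runs
cache.fetch(:greet) { lookups += 1; :slow_lookup_result }   # hit, block skipped
cache.invalidate!
cache.fetch(:greet) { lookups += 1; :slow_lookup_result }   # stale, block runs again
```

Stale entries are only discovered, and overwritten, when they are next looked up.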

Common Optimizations*
• Constant folding
• Constant propagation
• Dead code elimination
• Subexpression elimination
• Method in-lining

Static Analysis Notes
• Examining source code without execution
• Dynamic analysis : Runtime introspection
• Cannot assume much beyond literals in Ruby ...
• Constants can be redefined
• Open classes imply methods can be redefined at any time
• Object#method_missing
• Methods don't have an explicit return type
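Two of these obstacles are easy to reproduce. In this sketch the `Price` and `Ghost` classes are made up for illustration: the first has a method replaced while the program runs, the second answers to names that exist nowhere in the source.

```ruby
class Price
  def initialize(cents); @cents = cents; end
  def total; @cents; end
end

price = Price.new(100)
before = price.total

# Open classes: the method is replaced while the program is running
class Price
  def total; @cents * 2; end
end
after = price.total

# method_missing: the set of callable names is not known until runtime
class Ghost
  def method_missing(name, *args)
    name.to_s.start_with?("get_") ? name.to_s.sub("get_", "") : super
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("get_") || super
  end
end

answer = Ghost.new.get_anything
puts "before=#{before} after=#{after} answer=#{answer}"
```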

Constant Folding
• Compile time constant expression evaluation
• Strength reductions : replace operations with cheaper ones
• Null sequences : operations that can be removed
• Very hard to pull off due to the dynamic nature of the Ruby spec
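The rewrites below are applied by hand to show what folding and strength reduction would produce, and why they are risky in Ruby. The method names are hypothetical; nothing here is claimed to be what the VM actually emits.

```ruby
# Before: the expression is evaluated on every call
def seconds_per_day_unfolded
  24 * 60 * 60
end

# After constant folding: the literal result is used directly
def seconds_per_day_folded
  86_400
end

# Strength reduction: multiplying by a power of two becomes a shift
def times_eight(n);         n * 8;  end
def times_eight_reduced(n); n << 3; end

puts times_eight(5) == times_eight_reduced(5)

# The same rewrite changes meaning when the receiver is not an Integer:
# String#* repeats the string, String#<< appends a character
repeated = times_eight(String.new("ab"))
appended = times_eight_reduced(String.new("ab"))
puts repeated == appended
```

Because the receiver's class is only known at runtime, a compiler cannot apply the reduced form without some guard.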

Code Elimination
• Remove code segments without data flow
• Works very well with static analysis, but tricky to pull off in Ruby

Subexpression elimination
• Expression reuse by extracting to a temporary variable

Constant Propagation
• Replace a literal variable reference with its value

In-lining
• Replaces a method call with its body to reduce function call overhead
• Very efficient in iterator contexts
• Opportunity for further optimization
• Not a silver bullet - excessive in-lining can overload the instruction cache
• Some cases change semantics

Cloning
• Copies a method to replace a common call pattern
• Identified with static analysis, thus of limited use to Ruby

Peephole Optimization
• Replace generated instruction sequences with more efficient ones
• The benefit is directly proportional to the quality of the code generated
• Removes useless flow control

Object Model

Object Requirements
• Identity : unique identifier to represent the object at runtime
• Stateful : ability to maintain state
• Methods : exposes methods to change / query object state

Base Object Structure
• VALUE : a pointer type that represents addresses of language structures
• A pointer cast dereferences a VALUE to an object structure
• RBASIC(obj)->flags; /* ((struct RBasic *)obj)->flags */
• Flags : frozen, marked, tainted etc.

Classes / modules
• Symbol tables for methods, class and instance variables
• Class / module distinction through flags
• RCLASS(a_str)->ptr.super #=> Object
• RCLASS(a_fixnum)->ptr.super #=> Integer

Immediates
• Small enough to fit in a VALUE
• No runtime casting overheads
• nil = 4
• true = 2
• false = 0
• Symbols
• Fixnums <= 30 bits
• Float, Bignum are complex objects, hence poor FP benchmarks
• RFLOAT(float_obj)->float_value #=> a double

Object Layout
• Assuming a 32bit architecture ....
• sizeof(VALUE) is 4 bytes
• Objects are even - multiples of 4
• Symbols are even - multiples of 8
• Integers are odd
• Immediates < 4
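The "integers are odd" rule follows from tagging the low bit of a VALUE. The sketch below models that encoding with ordinary Ruby integers; the helper names are hypothetical and the code does not read real VALUEs out of the VM.

```ruby
# Toy model of a tagged VALUE: a set low bit marks an immediate integer
def encode_fixnum(n)
  (n << 1) | 1          # shift left and set the low bit as the integer tag
end

def fixnum?(value)
  (value & 1) == 1      # heap pointers are aligned, so their low bit is clear
end

def decode_fixnum(value)
  value >> 1            # arithmetic shift restores the sign as well
end

values = [0, 1, 42, -7].map { |n| encode_fixnum(n) }
p values
p values.map { |v| decode_fixnum(v) }
```

Since the integer lives inside the VALUE itself, no slot has to be allocated and no pointer followed to read it.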

Mutable Objects
• Mutable Strings and Arrays require the ability to shrink / grow capacity
• Allocates slightly more memory than is required to represent object data in order to avoid malloc, realloc and memmove operations in common cases
• Capacity for short strings and small arrays : “str” and %w(s t r)

Shared Objects
• Literal declarations of Arrays and Strings are shared amongst instances
• Avoids duplicates with this “copy-on-write” (COW) scheme
• An attempt to modify creates a copy of the object, and modifies the copy
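The visible contract of copy-on-write is that a write to one string never leaks into the other. Whether the two strings below actually share a buffer before the write is an interpreter detail that cannot be observed from this code; the sketch only shows the semantics.

```ruby
original = "x" * 1_000
copy = original.dup          # the interpreter may let both share one buffer here

copy << "!"                  # a write gives the copy its own data

puts original.size
puts copy.size
```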

Object Method Dispatch
• Loose typing and open classes mean that method calls can never be reduced to a single CALL instruction
• Method dispatch in OO languages requires methods to be searched for on the object itself, its superclasses etc.
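The search can be approximated from Ruby by walking the ancestor chain. This is a simplified sketch: the `find_owner` helper is hypothetical and ignores singleton classes and caching.

```ruby
# Simplified lookup: walk the ancestor chain until a module defines the method
def find_owner(object, name)
  object.class.ancestors.find do |mod|
    mod.instance_methods(false).include?(name) ||
      mod.private_instance_methods(false).include?(name)
  end
end

upcase_owner  = find_owner("abc", :upcase)          # defined on String itself
between_owner = find_owner("abc", :between?)        # found further up the chain
missing_owner = find_owner("abc", :no_such_method)  # nothing defines it

p upcase_owner, between_owner, missing_owner
```

Every call that misses the cache pays for a walk like this, which is what makes the method cache so valuable.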

Call VS Send
• object.__send__(:method)
• We don’t call functions / routines, rather send a command or query message to an object
• Ruby methods always return a value, thus RPC style messaging
• Method cache is like a router
• Method redefinition clears the method cache / router
• “Routing” overhead for subsequent method calls

Cache - before include

Cache - after include
• ALL methods on ALL classes invoked since VM startup are expired
• DON’T extend / include in a request / response cycle
• Rails busts the method cache multiple times on boot

Method cache - Warm
• Average 95% hit rate

Instance Variables
• Optimization : the first 3 ivars are embedded in the object, i.e. no symbol table lookups required
• Index table per class VS a symbol table per object on MRI 1.8
• Index table is shared by all instances of the same class
• Saves on the memory footprint of a table per instance
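The shared index table can be modelled with two small classes. `ToyClass` and `ToyObject` are hypothetical and only illustrate the layout: names are mapped to positions once per class, and each object stores nothing but a flat array of values.

```ruby
# Toy model: one name -> index table per class, one flat value array per object
class ToyClass
  attr_reader :ivar_index

  def initialize
    @ivar_index = {}                # shared by every instance of this class
  end

  def index_for(name)
    @ivar_index[name] ||= @ivar_index.size
  end
end

class ToyObject
  attr_reader :slots

  def initialize(klass)
    @klass = klass
    @slots = []                     # values only, no per-object name table
  end

  def ivar_set(name, value)
    @slots[@klass.index_for(name)] = value
  end

  def ivar_get(name)
    index = @klass.ivar_index[name]
    index && @slots[index]
  end
end

point = ToyClass.new
a = ToyObject.new(point)
a.ivar_set(:@x, 1)
a.ivar_set(:@y, 2)
b = ToyObject.new(point)
b.ivar_set(:@y, 20)                 # reuses the index already assigned to @y
```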

Garbage Collection

Process Memory Layout
• Code segment : executable code, read only area
• Stack segment : stack storage, addressed with stack pointers
• Heap : stretch of memory available for program / developer use

malloc / free layout
• Free chunks == the free list
• Linear search overhead to find free chunks

A better layout
• Free chunks indexed by size intervals

Garbage Collection
• Objects allocated explicitly on the heap
• Automatically reclaim memory chunks not accessible from the root set
• Root set : C stack, global vars, global constants (accessible without pointer scanning)
• Unreachable hooks : variable assignment (nil), method return etc.
• Stop the World : halts execution to reclaim memory, very disruptive when in the hot path
• Incremental : some collection actions occur for each allocation, smoother and suitable for realtime requirements

GC Algorithms
• Most scripting languages implement either of the following
• Mark and Sweep : identifies reachable chunks and assumes the remainder is garbage (concerned with garbage)
• Stop and Copy : 2 heap spaces, copies reachable chunks to the new active heap area (concerned with live chunks)

GC Issues
• Memory fragmentation
• Dangling pointers
• Memory leaks from incomplete recycling (circular garbage and conservative GC)
• Bursty allocation
• Knowledge of pointer and chunk layouts required

Ruby heap layout
• Multiple heaps, referenced through the heap list
• Heaps are freed when empty, IF all slots are tagged free
• Ballpark : Rails allocates 4 to 6 heaps on startup

Per heap slots layout
• Each slot references a single object
• 10 000 slots per Ruby heap
• Threshold of 4096 free slots per heap
• Free list points to the next free slot

Heaps and slots layout

Pointer Layout
• The pointer layout of both the program data area and the heap is self describing
• The RVALUE union can accommodate any Ruby object; the layout of Ruby frames, global variable structures etc. is well defined
• 20 bytes (32bit arch) of Ruby heap space are required to represent a slot

Ruby Heap VS OS Heap
• A slot points to the actual object data, on the OS / system heap
• A 20 byte (32bit arch) slot can reference e.g. a 2MB chunk on the system heap
• The RVALUE union can accommodate any Ruby object; the layout of Ruby frames, global variable structures etc. is well defined
• 20 bytes of Ruby heap space are required to represent a slot

CRuby: Mark and Sweep
• Conservative : cannot determine with certainty if a given value is a pointer or not, and assumes it’s in use
• Two phase implementation
• Mark phase : marks all objects reachable from the current program context
• Sweep phase : iterates through the object space and frees all objects not marked + unmarks the marked ones
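The two phases can be sketched on a toy object space. `ToyHeap` is hypothetical, it is precise rather than conservative, and its mark phase uses an explicit worklist instead of recursion to keep the example short.

```ruby
class ToyHeap
  Slot = Struct.new(:name, :refs, :marked)

  attr_reader :slots

  def initialize
    @slots = []
  end

  def allocate(name)
    slot = Slot.new(name, [], false)
    @slots << slot
    slot
  end

  # Returns the names of the slots that were freed
  def collect(roots)
    mark(roots)
    sweep
  end

  private

  # Mark phase: flag everything reachable from the root set
  def mark(roots)
    pending = roots.dup
    until pending.empty?
      slot = pending.pop
      next if slot.marked
      slot.marked = true
      pending.concat(slot.refs)
    end
  end

  # Sweep phase: free unmarked slots, clear the mark on survivors
  def sweep
    freed, @slots = @slots.partition { |slot| !slot.marked }
    @slots.each { |slot| slot.marked = false }
    freed.map(&:name)
  end
end

heap = ToyHeap.new
a, b, c, d = %w[a b c d].map { |n| heap.allocate(n) }
a.refs << b          # a -> b
c.refs << d          # c -> d
d.refs << c          # d -> c: a cycle with no path from the roots
freed = heap.collect([a])
p freed
```

The unreachable cycle is reclaimed because marking starts from the roots, not from reference counts.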

Pros and Cons
• Pauses program execution
• Work is proportional to the heap size
• Prone to memory fragmentation (no compaction)
• Recursive
• Every 8MB allocated triggers GC
• 8m malloc calls also trigger GC
• Frees all* memory that can be freed

Source representation

Objectspace

Objectspace - marked

Objectspace after sweep

Generational GC
• Vast majority of objects are short lived ( 80% + )
• Expensive to continuously account for long lived objects
• Partition objects by age and collect short lived ones more frequently OR
• Restrict GC to the most recently modified slots
• Perform a full GC only when the younger generation fails to meet current memory requirements

Context Switches

Threading
• First CRuby to support native OS Threads
• Ruby thread == pthread
• Scheduling, synchronization and creation are delegated to syscalls, which implies a user / kernel space context switch
• Can use multiple CPU cores - NOT at the same time though
• No parallel execution - Global VM Lock (GVL)
• ... although MacRuby doesn’t have a GVL
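Blocking calls are where native threads still pay off under a GVL. In this sketch `sleep` stands in for a blocking operation; the thread count and durations are arbitrary, and the timing is indicative only.

```ruby
# sleep stands in for a blocking call; the lock is released while a thread waits
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

threads = Array.new(4) { Thread.new { sleep 0.1 } }
threads.each(&:join)

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts format("4 threads x 0.1s sleep took %.2fs", elapsed)
```

Run serially the four sleeps would need about 0.4 seconds; overlapping waits finish in roughly the time of one.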

Global VM Lock (GVL)
• The thread that owns the GVL is allowed to execute
• Blocking operations should release the GVL to not block the process
• Also released during scheduling
• Allows for easy C extensions - authors don’t have to concern themselves with synchronization
• The kernel is better suited to load balancing multiple processes than most developers are to squeezing the same from a single process
• Constraintless Threading is a weapon of mass destruction
• The performance effect on existing apps that rely on user space threads from MRI 1.8 may be significant
• Unix pipes are often the best scheduler ....

Releasing the GVL
• Internal API exposed to release the GVL
• Blocking function : slow system call / computation
• Unblock function : called on Thread interrupt
• Dangerous territory - look for alternatives first
• Cannot access Ruby VALUEs in blocking functions
• No exception handling

Blocking VM Operations
• IO : potentially blocking reads / writes
• DNS resolution / connects : often have a lot more handshake overhead
• Expensive Bignum computations blocked 1.8 interpreters
• File locking
• Process.waitpid

Fibers
• Coroutines for lightweight concurrency (4k stack size)
• Very fast user space context switches
• Cooperative scheduling required - also not concurrent
• Common use cases are generators or blocking IO, e.g. Neverblock
• Fiber.yield pauses the activation record, which keeps context across multiple calls
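The generator use case fits in a few lines. A minimal sketch producing Fibonacci numbers; the variable names are arbitrary.

```ruby
fib = Fiber.new do
  a, b = 0, 1
  loop do
    Fiber.yield a        # pause here; locals survive until the next resume
    a, b = b, a + b
  end
end

first_six = Array.new(6) { fib.resume }
p first_six
```

Each `resume` continues right after the previous `Fiber.yield`, so the fiber's local state replaces an explicit state machine.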

The Road Ahead
• MVM : Multiple Virtual Machines
• Shared process space, cannot share state
• Distribute VMs across multiple cores
• Message passing / channel API for inter VM communication
• Many Ruby deployments are not thread safe - MVM is better suited for this use case
• A thread safe framework does not guarantee a thread safe application ...

Questions ?

Thanks for Listening !

@methodmissing
http://github.com/methodmissing
http://www.methodmissing.com