<?xml version="1.0"?>
<!--
    * Licensed to the Apache Software Foundation (ASF) under one
    * or more contributor license agreements.  See the NOTICE file
    * distributed with this work for additional information
    * regarding copyright ownership.  The ASF licenses this file
    * to you under the Apache License, Version 2.0 (the
    * "License"); you may not use this file except in compliance
    * with the License.  You may obtain a copy of the License at
    * 
    *   http://www.apache.org/licenses/LICENSE-2.0
    * 
    * Unless required by applicable law or agreed to in writing,
    * software distributed under the License is distributed on an
    * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    * KIND, either express or implied.  See the License for the
    * specific language governing permissions and limitations
    * under the License.    
-->
<document>
  <properties>
    <title>The Java Virtual Machine</title>
  </properties>

  <body>
    <section name="The Java Virtual Machine">
      <p>
        Readers already familiar with the Java Virtual Machine and the
        Java class file format may want to skip this section and proceed
        with <a href="bcel-api.html">section 3</a>.
      </p>

      <p>
        Programs written in the Java language are compiled into a portable
        binary format called <em>byte code</em>. Every class is
        represented by a single class file containing class related data
        and byte code instructions. These files are loaded dynamically
        into an interpreter (the <a
              href="http://docs.oracle.com/javase/specs/">Java
        Virtual Machine</a>, or JVM) and executed.
      </p>

      <p>
        <a href="#Figure 1">Figure 1</a> illustrates the procedure of
        compiling and executing a Java class: The source file
        (<tt>HelloWorld.java</tt>) is compiled into a Java class file
        (<tt>HelloWorld.class</tt>), loaded by the byte code interpreter
        and executed. In order to implement additional features,
        researchers may want to transform class files (drawn with bold
        lines) before they are actually executed. This application area
        is one of the main topics of this article.
      </p>

      <p align="center">
        <a name="Figure 1">
          <img src="../images/jvm.gif"/>
          <br/>
          Figure 1: Compilation and execution of Java classes</a>
      </p>

      <p>
        Note that the general term "Java" in fact has two meanings: on
        the one hand, Java as a programming language; on the other hand,
        the Java Virtual Machine, which is not targeted exclusively by
        the Java language, but may be used by <a
              href="http://www.robert-tolksdorf.de/vmlanguages.html">other
        languages</a> as well. We assume that the reader is familiar with
        the Java language and has a general understanding of the
        Virtual Machine.
      </p>

    <subsection name="Java class file format">
      <p>
        Giving a full overview of the design of the Java class file
        format and the associated byte code instructions is beyond the
        scope of this paper. We will just give a brief introduction
        covering the details that are necessary for understanding the rest
        of this paper. The format of class files and the byte code
        instruction set are described in more detail in the <a
              href="http://docs.oracle.com/javase/specs/">Java
        Virtual Machine Specification</a>. In particular, we will not deal
        with the security constraints that the Java Virtual Machine has to
        check at run-time, i.e., the byte code verifier.
      </p>

      <p>
        <a href="#Figure 2">Figure 2</a> shows a simplified example of the
        contents of a Java class file: It starts with a header containing
        a "magic number" (<tt>0xCAFEBABE</tt>) and the version number,
        followed by the <em>constant pool</em>, which can be roughly
        thought of as the text segment of an executable, the <em>access
        rights</em> of the class encoded by a bit mask, a list of
        interfaces implemented by the class, lists containing the fields
        and methods of the class, and finally the <em>class
        attributes</em>, e.g., the <tt>SourceFile</tt> attribute, which
        records the name of the source file. Attributes are a way of putting
        additional, user-defined information into class file data
        structures. For example, a custom class loader may evaluate such
        attribute data in order to perform its transformations. The JVM
        specification requires that unknown (i.e., user-defined) attributes
        be ignored by any Virtual Machine implementation.
      </p>
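
      <p>
        As a minimal sketch of the header layout described above, the
        following Java snippet decodes only the first fields of a class
        file: the magic number, the version and the constant pool count.
        The class <tt>ClassFileHeader</tt> and its method <tt>parse</tt>
        are hypothetical names chosen for illustration and are not part of
        any library; the sample bytes are written by hand rather than
        taken from a real class file.
      </p>

      <source><![CDATA[
```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only: decodes the fixed-size
// fields at the very start of a class file.
final class ClassFileHeader {
    static final int MAGIC = 0xCAFEBABE;

    final int minorVersion;
    final int majorVersion;
    final int constantPoolCount;

    private ClassFileHeader(int minorVersion, int majorVersion, int constantPoolCount) {
        this.minorVersion = minorVersion;
        this.majorVersion = majorVersion;
        this.constantPoolCount = constantPoolCount;
    }

    static ClassFileHeader parse(byte[] classFileBytes) {
        // Class files are big-endian, which is ByteBuffer's default order.
        ByteBuffer buf = ByteBuffer.wrap(classFileBytes);
        int magic = buf.getInt();                  // 4 bytes: magic number
        if (magic != MAGIC) {
            throw new IllegalArgumentException(
                "Not a class file, magic = 0x" + Integer.toHexString(magic));
        }
        int minor = buf.getShort() & 0xFFFF;       // 2 bytes: minor version
        int major = buf.getShort() & 0xFFFF;       // 2 bytes: major version
        int cpCount = buf.getShort() & 0xFFFF;     // 2 bytes: constant pool count
        return new ClassFileHeader(minor, major, cpCount);
    }

    public static void main(String[] args) {
        // Hand-written sample: magic, minor 0, major 52, constant pool count 10.
        byte[] header = {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52, 0, 10
        };
        ClassFileHeader h = parse(header);
        System.out.println("version " + h.majorVersion + "." + h.minorVersion
            + ", constant pool count " + h.constantPoolCount);
    }
}
```
]]></source>

      <p>
        The sketch stops where the constant pool begins, because the
        entries that follow have varying sizes and must be decoded one by
        one.
      </p>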

      <p align="center">
        <a name="Figure 2">
          <img src="../images/classfile.gif"/>
          <br/>
          Figure 2: Java class file format</a>
      </p>

      <p>
        Because all of the information needed to dynamically resolve the
        symbolic references to classes, fields and methods at run-time is
        coded with string constants, the constant pool makes up the
        largest portion of an average class file, approximately
        60%. This makes the constant pool an easy target for code
        manipulation. The byte code instructions themselves make up
        only about 12%.
      </p>

      <p>
        The upper right box shows a "zoomed" excerpt of the constant pool,
        while the rounded box below depicts some instructions that are
        contained within a method of the example class. These
        instructions represent the straightforward translation of the
        well-known statement:
      </p>

      <p align="center">
        <source>System.out.println("Hello, world");</source>
      </p>

      <p>
        The first instruction loads the contents of the field <tt>out</tt>
        of class <tt>java.lang.System</tt> onto the operand stack. This is
        an instance of the class <tt>java.io.PrintStream</tt>. The
        <tt>ldc</tt> ("load constant") instruction pushes a reference to
        the string "Hello, world" onto the stack. The next instruction
        invokes the instance method <tt>println</tt>, which takes both
        values as parameters (instance methods always implicitly take an
        instance reference as their first argument).
      </p>

      <p>
        Instructions, other data structures within the class file and
        constants themselves may refer to constants in the constant pool.
        Such references are implemented via fixed indexes encoded directly
        into the instructions. This is illustrated for some items of the
        figure emphasized with a surrounding box.
      </p>

      <p>
        For example, the <tt>invokevirtual</tt> instruction refers to a
        <tt>Methodref</tt> constant that contains information about the
        name of the called method, the signature (i.e., the encoded
        argument and return types), and the class to which the method
        belongs. In fact, as emphasized by the boxed value, the
        <tt>Methodref</tt> constant itself just refers to other entries
        holding the real data, e.g., it refers to a <tt>ConstantClass</tt>
        entry containing a symbolic reference to the class
        <tt>java.io.PrintStream</tt>.
        To keep the class file compact, such constants are typically
        shared by different instructions and other constant pool
        entries. Similarly, a field is represented by a <tt>Fieldref</tt>
        constant that includes information about the name, the type and
        the containing class of the field.
      </p>

      <p>
        The constant pool basically holds the following types of
        constants: references to methods, fields and classes, strings,
        integers, floats, longs, and doubles.
      </p>
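
      <p>
        The indirection between constant pool entries can be sketched
        with a small in-memory model. This is an illustration under
        simplifying assumptions, not the actual binary encoding: the class
        <tt>ConstantPoolSketch</tt>, its entry classes and the chosen
        indexes are all hypothetical.
      </p>

      <source><![CDATA[
```java
// Hypothetical, simplified model of a constant pool. Entries refer to
// each other by index instead of holding the data themselves.
final class ConstantPoolSketch {
    static final class Utf8Entry {
        final String text;
        Utf8Entry(String text) { this.text = text; }
    }

    static final class ClassEntry {
        final int nameIndex;
        ClassEntry(int nameIndex) { this.nameIndex = nameIndex; }
    }

    static final class NameAndTypeEntry {
        final int nameIndex;
        final int descriptorIndex;
        NameAndTypeEntry(int nameIndex, int descriptorIndex) {
            this.nameIndex = nameIndex;
            this.descriptorIndex = descriptorIndex;
        }
    }

    static final class MethodEntry {
        final int classIndex;
        final int nameAndTypeIndex;
        MethodEntry(int classIndex, int nameAndTypeIndex) {
            this.classIndex = classIndex;
            this.nameAndTypeIndex = nameAndTypeIndex;
        }
    }

    private final Object[] entries;

    ConstantPoolSketch(Object[] entries) { this.entries = entries; }

    private String utf8(int index) {
        return ((Utf8Entry) entries[index]).text;
    }

    // Follows the chain method -> class / name-and-type -> strings.
    String describeMethod(int index) {
        MethodEntry method = (MethodEntry) entries[index];
        ClassEntry owner = (ClassEntry) entries[method.classIndex];
        NameAndTypeEntry nt = (NameAndTypeEntry) entries[method.nameAndTypeIndex];
        return utf8(owner.nameIndex) + "." + utf8(nt.nameIndex) + utf8(nt.descriptorIndex);
    }

    static ConstantPoolSketch example() {
        return new ConstantPoolSketch(new Object[] {
            null,                                   // index 0 is not used
            new MethodEntry(2, 4),                  // 1: the method reference
            new ClassEntry(3),                      // 2: class, name at index 3
            new Utf8Entry("java/io/PrintStream"),   // 3
            new NameAndTypeEntry(5, 6),             // 4: name and signature
            new Utf8Entry("println"),               // 5
            new Utf8Entry("(Ljava/lang/String;)V")  // 6
        });
    }

    public static void main(String[] args) {
        // prints java/io/PrintStream.println(Ljava/lang/String;)V
        System.out.println(example().describeMethod(1));
    }
}
```
]]></source>

      <p>
        Because entries are addressed by index, several instructions can
        share the same string entries, which is what keeps the pool
        compact.
      </p>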

    </subsection>

    <subsection name="Byte code instruction set">
      <p>
        The JVM is a stack-oriented interpreter that creates a local stack
        frame of fixed size for every method invocation. The size of the
        local stack has to be computed by the compiler. Values may also be
        stored intermediately in a frame area containing <em>local
        variables</em>, which can be used like a set of registers. These
        local variables are numbered from 0 to 65535, i.e., a method can
        have at most 65536 local variables. The stack frames of the
        caller and the callee overlap, i.e., the caller
        pushes arguments onto the operand stack and the called method
        receives them in local variables.
      </p>

      <p>
        The byte code instruction set currently consists of 212
        instructions; 44 opcodes are marked as reserved and may be used
        for future extensions or intermediate optimizations within the
        Virtual Machine. The instruction set can be roughly grouped as
        follows:
      </p>

      <p>
        <b>Stack operations:</b> Constants can be pushed onto the stack
        either by loading them from the constant pool with the
        <tt>ldc</tt> instruction or with special "short-cut"
        instructions where the operand is encoded into the instructions,
        e.g.,  <tt>iconst_0</tt> or <tt>bipush</tt> (push byte value).
      </p>

      <p>
        <b>Arithmetic operations:</b> The instruction set of the Java
        Virtual Machine distinguishes its operand types using different
        instructions to operate on values of a specific type. Arithmetic
        operations starting with <tt>i</tt>, for example, denote an
        integer operation, e.g., <tt>iadd</tt>, which adds two integers
        and pushes the result back onto the stack. The Java types
        <tt>boolean</tt>, <tt>byte</tt>, <tt>short</tt>, and
        <tt>char</tt> are handled as integers by the JVM.
      </p>

      <p>
        <b>Control flow:</b> There are branch instructions like
        <tt>goto</tt>, and <tt>if_icmpeq</tt>, which compares two integers
        for equality. There is also a pair of instructions, <tt>jsr</tt>
        (jump to sub-routine) and <tt>ret</tt>, that older compilers used
        to implement the <tt>finally</tt> clause of <tt>try-catch</tt>
        blocks; current compilers no longer emit them.
        Exceptions may be thrown with the <tt>athrow</tt> instruction.
        Branch targets are coded as offsets from the current byte code
        position, i.e., with an integer number.
      </p>

      <p>
        <b>Load and store operations:</b> Local variables are accessed
        with instructions like <tt>iload</tt> and <tt>istore</tt>. There
        are also array operations like <tt>iastore</tt>, which stores an
        integer value into an array.
      </p>

      <p>
        <b>Field access:</b> The value of an instance field may be
        retrieved with <tt>getfield</tt> and written with
        <tt>putfield</tt>. For static fields, there are
        <tt>getstatic</tt> and <tt>putstatic</tt> counterparts.
      </p>

      <p>
        <b>Method invocation:</b> Static methods are called with
        <tt>invokestatic</tt>, whereas instance methods are bound virtually
        with the <tt>invokevirtual</tt> instruction. Super class methods,
        constructors and private methods are invoked with
        <tt>invokespecial</tt>. Interface methods are a special case and
        are invoked with <tt>invokeinterface</tt>.
      </p>

      <p>
        <b>Object allocation:</b> Class instances are allocated with the
        <tt>new</tt> instruction, arrays of basic type like
        <tt>int[]</tt> with <tt>newarray</tt>, arrays of references like
        <tt>String[][]</tt> with <tt>anewarray</tt> or
        <tt>multianewarray</tt>.
      </p>

      <p>
        <b>Conversion and type checking:</b> For stack operands of basic
        type there exist casting operations like <tt>f2i</tt> which
        converts a float value into an integer. The validity of a type
        cast may be checked with <tt>checkcast</tt> and the
        <tt>instanceof</tt> operator can be directly mapped to the
        equally named instruction.
      </p>

      <p>
        Most instructions have a fixed length, but there are also some
        variable-length instructions, in particular the
        <tt>lookupswitch</tt> and <tt>tableswitch</tt> instructions, which
        are used to implement <tt>switch()</tt> statements. Since the
        number of <tt>case</tt> clauses may vary, these instructions
        contain a variable number of jump targets.
      </p>
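
      <p>
        How the length of these two instructions varies can be sketched
        as follows, assuming the layout given in the JVM specification:
        the opcode is followed by zero to three padding bytes so that the
        operands start at an address that is a multiple of four, then by
        4-byte operands. The class and method names are hypothetical.
      </p>

      <source><![CDATA[
```java
// Hypothetical helper: computes the encoded length in bytes of the two
// switch instructions from their position and number of entries.
final class SwitchLength {
    // Padding bytes after the opcode at address pc, so that the first
    // operand starts at a multiple of four.
    static int padding(int pc) {
        return (4 - ((pc + 1) % 4)) % 4;
    }

    // tableswitch: opcode, padding, default, low, high,
    // then one 4-byte jump offset per value in low..high.
    static int tableSwitchLength(int pc, int low, int high) {
        return 1 + padding(pc) + 12 + 4 * (high - low + 1);
    }

    // lookupswitch: opcode, padding, default, npairs,
    // then one 8-byte (match, offset) pair per case.
    static int lookupSwitchLength(int pc, int npairs) {
        return 1 + padding(pc) + 8 + 8 * npairs;
    }

    public static void main(String[] args) {
        System.out.println(tableSwitchLength(0, 1, 3));  // prints 28
        System.out.println(lookupSwitchLength(3, 2));    // prints 25
    }
}
```
]]></source>

      <p>
        Note that the length depends on the address of the instruction
        itself, so moving a switch instruction can change its size.
      </p>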

      <p>
        We will not list all byte code instructions here, since these are
        explained in detail in the <a
              href="http://docs.oracle.com/javase/specs/">JVM
        specification</a>. The opcode names are mostly self-explanatory,
        so understanding the following code examples should be fairly
        intuitive.
      </p>

    </subsection>

    <subsection name="Method code">
      <p>
        Non-abstract (and non-native) methods contain an attribute
        "<tt>Code</tt>" that holds the following data: the maximum depth
        of the method's operand stack, the number of local variables and an
        array of byte code instructions. Optionally, it may also contain
        information about the names of local variables and source file
        line numbers that can be used by a debugger.
      </p>

      <p>
        Whenever an exception is raised during execution, the JVM performs
        exception handling by looking into a table of exception
        handlers. The table marks handlers, i.e., code chunks, as
        responsible for exceptions of certain types that are raised within
        a given area of the byte code. When there is no appropriate
        handler, the exception is propagated back to the caller of the
        method. The handler information is itself stored in a table
        contained within the <tt>Code</tt> attribute.
      </p>
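
      <p>
        The table lookup can be sketched as a linear search in which the
        first matching entry wins. This is a simplified illustration, not
        the code of an actual Virtual Machine: the class
        <tt>ExceptionTableSketch</tt> is hypothetical, and the offsets in
        the example table are chosen for illustration.
      </p>

      <source><![CDATA[
```java
import java.io.IOException;

// Hypothetical model of the exception table lookup.
final class ExceptionTableSketch {
    static final class Handler {
        final int startPc;    // start of the protected range (inclusive)
        final int endPc;      // end of the protected range (exclusive)
        final int handlerPc;  // address of the handler code
        final Class<? extends Throwable> catchType;

        Handler(int startPc, int endPc, int handlerPc,
                Class<? extends Throwable> catchType) {
            this.startPc = startPc;
            this.endPc = endPc;
            this.handlerPc = handlerPc;
            this.catchType = catchType;
        }
    }

    // Returns the handler address for an exception raised at pc,
    // or -1 if the exception must be propagated to the caller.
    static int findHandler(Handler[] table, int pc, Throwable thrown) {
        for (Handler h : table) {
            boolean inRange = pc >= h.startPc && pc < h.endPc;
            if (inRange && h.catchType.isInstance(thrown)) {
                return h.handlerPc;
            }
        }
        return -1;
    }

    // Illustrative table: one protected range with two handlers.
    static Handler[] exampleTable() {
        return new Handler[] {
            new Handler(4, 22, 25, IOException.class),
            new Handler(4, 22, 36, NumberFormatException.class)
        };
    }

    public static void main(String[] args) {
        Handler[] table = exampleTable();
        System.out.println(findHandler(table, 15, new IOException()));            // prints 25
        System.out.println(findHandler(table, 18, new NumberFormatException()));  // prints 36
        System.out.println(findHandler(table, 18, new IllegalStateException()));  // prints -1
    }
}
```
]]></source>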

    </subsection>

    <subsection name="Byte code offsets">
      <p>
        Targets of branch instructions like <tt>goto</tt> are encoded as
        relative offsets in the array of byte codes. Exception handlers
        and local variables refer to absolute addresses within the byte
        code. The former contain references to the start and the end of
        the <tt>try</tt> block, and to the start of the handler code. The
        latter mark the range in which a local variable is valid, i.e.,
        its scope. This makes it difficult to insert or delete code areas
        at this level of abstraction, since one has to recompute the
        offsets every time and update the referring objects. We will see
        in <a href="bcel-api.html#ClassGen">section 3.3</a> how <font
              face="helvetica,arial">BCEL</font> remedies this restriction.
      </p>
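
      <p>
        The bookkeeping this requires can be illustrated on raw bytes.
        The sketch below assumes the encoding of <tt>goto</tt> from the
        JVM specification (opcode <tt>0xA7</tt> followed by a signed
        16-bit offset relative to the address of the opcode); the class
        <tt>BranchOffsets</tt> and its methods are hypothetical and handle
        only this one instruction.
      </p>

      <source><![CDATA[
```java
// Hypothetical helper working directly on a byte code array.
final class BranchOffsets {
    static final int GOTO = 0xA7;
    static final int RETURN = 0xB1;

    // Absolute target address of the goto instruction located at pc.
    static int target(byte[] code, int pc) {
        if ((code[pc] & 0xFF) != GOTO) {
            throw new IllegalArgumentException("no goto at " + pc);
        }
        // The operand is a signed, big-endian 16-bit relative offset.
        int offset = (short) (((code[pc + 1] & 0xFF) << 8) | (code[pc + 2] & 0xFF));
        return pc + offset;
    }

    // Rewrites the operand so that the goto at pc reaches newTarget.
    static void retarget(byte[] code, int pc, int newTarget) {
        int offset = newTarget - pc;
        code[pc + 1] = (byte) (offset >> 8);
        code[pc + 2] = (byte) offset;
    }

    public static void main(String[] args) {
        // goto at address 0 that originally jumped over one nop to a
        // return at address 4. A second nop has been inserted at
        // address 3, so the return now sits at address 5.
        byte[] code = { (byte) GOTO, 0, 4, 0, 0, (byte) RETURN };
        System.out.println(target(code, 0));  // prints 4, now a nop
        retarget(code, 0, 5);
        System.out.println(target(code, 0));  // prints 5, the return
    }
}
```
]]></source>

      <p>
        Every branch whose range spans the insertion point needs this
        correction, as do the absolute addresses held by exception
        handlers and local variable entries.
      </p>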

    </subsection>

    <subsection name="Type information">
      <p>
        Java is a type-safe language and the information about the types
        of fields, local variables, and methods is stored in so-called
        <em>signatures</em>. These are strings stored in the constant pool
        and encoded in a special format. For example, the argument and
        return types of the <tt>main</tt> method
      </p>

      <p align="center">
        <source>public static void main(String[] argv)</source>
      </p>

      <p>
        are represented by the signature
      </p>

      <p align="center">
        <source>([Ljava/lang/String;)V</source>
      </p>

      <p>
        Classes are internally represented by strings like
        <tt>"java/lang/String"</tt>, basic types like <tt>float</tt> by an
        integer number. Within signatures, basic types are represented by
        single characters, e.g., <tt>I</tt> for integer, and class types
        by their name enclosed in <tt>L</tt> and <tt>;</tt>. Arrays are
        denoted with a <tt>[</tt> at the start of the signature.
      </p>
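
      <p>
        Method signatures in this format can be produced and parsed with
        the standard class <tt>java.lang.invoke.MethodType</tt>, which
        offers a convenient way to experiment with the encoding. The
        wrapper class <tt>SignatureDemo</tt> and its methods are
        hypothetical names used only for this sketch.
      </p>

      <source><![CDATA[
```java
import java.lang.invoke.MethodType;

// Hypothetical demo class: converts between Java types and the
// signature strings stored in the constant pool.
final class SignatureDemo {
    // Encodes a return type and parameter types as a signature string.
    static String encode(Class<?> returnType, Class<?>... parameterTypes) {
        return MethodType.methodType(returnType, parameterTypes)
                         .toMethodDescriptorString();
    }

    // Decodes a signature string back into its types.
    static MethodType decode(String signature) {
        return MethodType.fromMethodDescriptorString(
            signature, SignatureDemo.class.getClassLoader());
    }

    public static void main(String[] args) {
        // Signature of: public static void main(String[] argv)
        System.out.println(encode(void.class, String[].class));  // prints ([Ljava/lang/String;)V
        // Signature of: public static int fac(int n)
        System.out.println(encode(int.class, int.class));        // prints (I)I
        // Decoding recovers the parameter and return types.
        MethodType type = decode("([Ljava/lang/String;)V");
        System.out.println(type.parameterType(0) == String[].class);  // prints true
    }
}
```
]]></source>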

    </subsection>

    <subsection name="Code example">
      <p>
        The following example program prompts for a number and prints its
        factorial. The <tt>readLine()</tt> method, which reads from the
        standard input, may raise an <tt>IOException</tt>, and if a
        malformed number is passed to <tt>parseInt()</tt>, it throws a
        <tt>NumberFormatException</tt>. Thus, the critical area of code
        must be enclosed in a <tt>try-catch</tt> block.
      </p>

      <source>
import java.io.*;

public class Factorial {
    private static BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

    public static int fac(int n) {
        return (n == 0) ? 1 : n * fac(n - 1);
    }

    public static int readInt() {
        int n = 4711;
        try {
            System.out.print("Please enter a number&gt; ");
            n = Integer.parseInt(in.readLine());
        } catch (IOException e1) {
            System.err.println(e1);
        } catch (NumberFormatException e2) {
            System.err.println(e2);
        }
        return n;
    }
    
    public static void main(String[] argv) {
        int n = readInt();
        System.out.println("Factorial of " + n + " is " + fac(n));
    }
}
      </source>

      <p>
        This code example typically compiles to the following chunks of
        byte code:
      </p>

      <source>
        0:  iload_0
        1:  ifne            #8
        4:  iconst_1
        5:  goto            #16
        8:  iload_0
        9:  iload_0
        10: iconst_1
        11: isub
        12: invokestatic    Factorial.fac (I)I (12)
        15: imul
        16: ireturn

        LocalVariable(start_pc = 0, length = 16, index = 0:int n)
      </source>

      <p><b>fac():</b>
        The method <tt>fac</tt> has only one local variable, the argument
        <tt>n</tt>, stored at index 0. This variable's scope ranges from
        the start of the byte code sequence to the very end.  If the value
        of <tt>n</tt> (the value fetched with <tt>iload_0</tt>) is not
        equal to 0, the <tt>ifne</tt> instruction branches to the byte
        code at offset 8, otherwise a 1 is pushed onto the operand stack
        and the control flow branches to the final return.  For ease of
        reading, the offsets of the branch instructions, which are
        actually relative, are displayed as absolute addresses in these
        examples.
      </p>

      <p>
        If recursion has to continue, the arguments for the multiplication
        (<tt>n</tt> and <tt>fac(n - 1)</tt>) are evaluated and the results
        pushed onto the operand stack.  After the multiplication operation
        has been performed, the function returns the computed value from
        the top of the stack.
      </p>
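
      <p>
        To make the stack discipline concrete, the following toy
        interpreter steps through the listing of <tt>fac</tt> shown
        above, using the displayed absolute addresses as program counter
        values. It is a deliberately naive sketch and does not reflect
        how a real Virtual Machine is implemented; the class
        <tt>FacInterpreter</tt> is a hypothetical name.
      </p>

      <source><![CDATA[
```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy interpreter hard-wired to the byte code of fac().
final class FacInterpreter {
    static int run(int n) {
        int[] locals = { n };                       // local variable 0 holds n
        Deque<Integer> stack = new ArrayDeque<>();  // operand stack
        int pc = 0;                                 // address of the next instruction
        while (true) {
            switch (pc) {
                case 0:   // iload_0
                    stack.push(locals[0]); pc = 1; break;
                case 1:   // ifne #8
                    pc = (stack.pop() != 0) ? 8 : 4; break;
                case 4:   // iconst_1
                    stack.push(1); pc = 5; break;
                case 5:   // goto #16
                    pc = 16; break;
                case 8:   // iload_0
                    stack.push(locals[0]); pc = 9; break;
                case 9:   // iload_0
                    stack.push(locals[0]); pc = 10; break;
                case 10:  // iconst_1
                    stack.push(1); pc = 11; break;
                case 11: {  // isub
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left - right); pc = 12; break;
                }
                case 12:  // invokestatic Factorial.fac (I)I
                    stack.push(run(stack.pop())); pc = 15; break;
                case 15: {  // imul
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left * right); pc = 16; break;
                }
                case 16:  // ireturn
                    return stack.pop();
                default:
                    throw new IllegalStateException("no instruction at " + pc);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(run(5));  // prints 120
    }
}
```
]]></source>

      <p>
        Each recursive call gets its own local variables and operand
        stack, mirroring the per-invocation stack frames described
        earlier.
      </p>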

      <source>
        0:  sipush        4711
        3:  istore_0
        4:  getstatic     java.lang.System.out Ljava/io/PrintStream;
        7:  ldc           "Please enter a number&gt; "
        9:  invokevirtual java.io.PrintStream.print (Ljava/lang/String;)V
        12: getstatic     Factorial.in Ljava/io/BufferedReader;
        15: invokevirtual java.io.BufferedReader.readLine ()Ljava/lang/String;
        18: invokestatic  java.lang.Integer.parseInt (Ljava/lang/String;)I
        21: istore_0
        22: goto          #44
        25: astore_1
        26: getstatic     java.lang.System.err Ljava/io/PrintStream;
        29: aload_1
        30: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
        33: goto          #44
        36: astore_1
        37: getstatic     java.lang.System.err Ljava/io/PrintStream;
        40: aload_1
        41: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
        44: iload_0
        45: ireturn

        Exception handler(s) =
        From    To      Handler Type
        4       22      25      java.io.IOException(6)
        4       22      36      NumberFormatException(10)
      </source>

      <p><b>readInt():</b> First the local variable <tt>n</tt> (at index 0)
        is initialized to the value 4711.  The next instruction,
        <tt>getstatic</tt>, loads the reference held by the static
        <tt>System.out</tt> field onto the stack. Then a string is loaded
        and printed, and a number is read from the standard input and
        assigned to <tt>n</tt>.
      </p>

      <p>
        If one of the called methods (<tt>readLine()</tt> and
        <tt>parseInt()</tt>) throws an exception, the Java Virtual Machine
        calls one of the declared exception handlers, depending on the
        type of the exception.  The <tt>try</tt>-clause itself does not
        produce any code; it merely defines the range in which the
        subsequent handlers are active. In the example, the specified
        source code area maps to a byte code area ranging from offset 4
        (inclusive) to 22 (exclusive).  If no exception has occurred
        ("normal" execution flow), the <tt>goto</tt> instruction at
        offset 22 branches past the handler code. There the value of
        <tt>n</tt> is loaded and returned.
      </p>

      <p>
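        The handler for <tt>java.io.IOException</tt> starts at
        offset 25. It simply prints the error and branches back to the
        normal execution flow, i.e., as if no exception had occurred.
        The handler for <tt>NumberFormatException</tt> at offset 36
        works the same way, except that it falls through to offset 44
        without a <tt>goto</tt>.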
      </p>

    </subsection>
    </section>
  </body>

</document>