Transforming Bytecode to the Intermediate Representation

The translation from bytecode to the intermediate representation is essentially a simplified version of the algorithm presented by Demange, Jensen, and Pichardie [DJP10].

Their algorithm considers general JVM bytecode and is generic in the program property to be shown. Since I adapt the approach specifically for the DSD verification framework, a number of simplifications can be made:

• Neither error states nor the order of errors are treated — noninterference is defined on successful executions only.

• There is no return instruction, hence we do not need to consider special return states.

• One does not have to consider event traces, therefore there are no IR pseudo-instructions that write such a trace.

• There are no constructor methods. The allocation and initialization of objects are both performed by a single instruction. Therefore, we do not need to handle uninitialized objects and the order of their initialization with care.

• The original transformation by Demange et al. may create, for a single bytecode instruction, a sequence of IR assignment instructions, which then all have to be linked back to the same original instruction for the proof. The IR language, in contrast, has these assignment sequences built into the IR syntax as single block instructions.

• Only special bytecode programs are considered by the translation: the operand stack must be empty during a jump.

The last condition may appear to be a major restriction; however, the DSD compiler produces only such bytecode. In fact, empty-stack requirements have been given for previous bytecode verification techniques [Ler02], as it has been noticed that current Java compilers rarely take advantage of elaborate uses of stacks and local variables.

Nevertheless, the extended version of the paper by Demange et al. shows how to deal with arbitrary bytecode, which requires a proper extension of the concepts presented here. Besides these simplifications, I also need to consider the pseudo-instructions cjmp j and cpush j in the translation, which are not present in JVM bytecode.

5.3.1 The Bytecode Transformation Algorithm

The transformation operates on four levels: single bytecode instructions, multiple bytecode instructions, method bodies, and full bytecode programs. I will now present the parts in this order.


bytecode   input stack   IR output                             output stack     conditions
nop        as            block ε                               as
push c     as            block ε                               c::as
pop        e::as         block ε                               as
prim op    e₂::e₁::as    block ε                               (e₁ op e₂)::as
load x     as            block ε                               x::as
getf f     e::as         block ε                               e.f::as
new C      e::as         block [t_i⁰ := new C(e)]              t_i⁰::as         t_i⁰ ∉ as
bnz j      e::as         if e j                                as
jmp j      as            jmp j                                 as
cpush j    as            cpush j                               as
cjmp j     as            cjmp j                                as
store x    e::as         block [t_i⁰ := x; x := e]             as[t_i⁰/x]       t_i⁰ ∉ as
putf f     e'::e::as     block [t⃗_i := as; e.f := e']          t⃗_i              t⃗_i ∉ as
call m     e⃗::e::as      block [t⃗_i := as; t_i⁰ := e.m(e⃗)]     t_i⁰::t⃗_i        t_i⁰, t⃗_i ∉ as

Table 5.1: BC2IR_instr: Transformation of a single bytecode instruction

Algorithm 1 BC2IR_rng(BC, m, I)

  for all i ∈ sort(I) do
    if i ∈ jmpTgt^BC_m then
      AS_in[m,i] := ε
    end if
    (AS_out[m,i], IR[m,i]) := BC2IR_instr(i, BC(m,i), AS_in[m,i])
    if AS_out[m,i] ≠ ε and ∃ j ∈ succ(m,i) ∩ jmpTgt^BC_m then
      fail
    end if
    if i+1 ∈ succ(m,i) then
      AS_in[m,i+1] := AS_out[m,i]
    end if
  end for
  return IR

Algorithm 2 BC2IR_mtd(BC, m)
  AS_in[m, mentry(m)] := ε
  BC2IR_rng(BC, m, dom(BC(m)))

Transformation of single bytecode instructions. The transformation is based on abstract stacks, which are stacks of high-level expressions from Exp:

abstract stack: as ∈ Exp^∗

Table 5.1 defines the function BC2IR_instr, which takes a bytecode address, a bytecode instruction, and an abstract input stack, and produces an IR instruction and an abstract output stack. The function reconstructs the high-level expressions from bytecode instructions that manipulate the operand stack. The bytecode instructions store x, putf f, new C, and call m are translated to assignment blocks, where the result of the operation is immediately written to a temporary variable. There is a one-to-one correspondence between the BC instructions and the IR instructions.

The function assumes, for each method m and each instruction address i, the existence of arbitrarily many temporary variables t_i⁰, t_i¹, t_i², ... that are available for the instruction IR[m,i]. This way, each temporary variable t_i^k is assigned at exactly one point in the IR program, namely at address i. The side conditions require that an assigned temporary variable t_i^k must not have occurred in the input stack of the instruction at address i.

Special care has to be taken for instructions that may invalidate the contents of the abstract stack. For store x operations, the content of the variable x is first stored in a temporary variable t_i⁰ before x is overwritten. In the output stack, all occurrences of x are replaced by t_i⁰, which holds the saved old value of x. For putf f and call m, the entire abstract input stack as is saved in a sequence of temporary variables t⃗_i, and this sequence is then the output stack.
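The stack-reconstructing rules of Table 5.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the tuple encoding of expressions, the helper subst, and the function name bc2ir_instr are hypothetical; only a subset of the instructions is covered (putf, new, call, cpush, and cjmp are omitted); and the side condition t_i⁰ ∉ as is not checked.

```python
# Expressions are nested tuples (a hypothetical encoding, not from the source):
#   ("const", c), ("var", x), ("field", e, f), ("op", op, e1, e2)

def subst(e, x, t):
    """Replace every occurrence of variable x in expression e by variable t."""
    if e[0] == "var":
        return ("var", t) if e[1] == x else e
    if e[0] == "field":
        return ("field", subst(e[1], x, t), e[2])
    if e[0] == "op":
        return ("op", e[1], subst(e[2], x, t), subst(e[3], x, t))
    return e  # constants contain no variables


def bc2ir_instr(i, instr, stack):
    """Return (output stack, IR instruction) for one bytecode instruction.

    Stacks are Python lists with the top of the stack at index 0.
    """
    op, *args = instr
    if op == "nop":
        return stack, ("block", [])
    if op == "push":
        return [("const", args[0])] + stack, ("block", [])
    if op == "pop":
        return stack[1:], ("block", [])
    if op == "prim":
        e2, e1, *rest = stack
        return [("op", args[0], e1, e2)] + rest, ("block", [])
    if op == "load":
        return [("var", args[0])] + stack, ("block", [])
    if op == "getf":
        e, *rest = stack
        return [("field", e, args[0])] + rest, ("block", [])
    if op == "store":
        e, *rest = stack
        x, t = args[0], f"t_{i}^0"
        # Save the old value of x in the temporary, then overwrite x; the
        # remaining stack entries refer to the saved copy from now on.
        ir = ("block", [(t, ("var", x)), (x, e)])
        return [subst(r, x, t) for r in rest], ir
    if op == "bnz":
        e, *rest = stack
        return rest, ("if", e, args[0])
    if op == "jmp":
        return stack, ("jmp", args[0])
    raise NotImplementedError(op)


# A store to x while an old copy of x is still on the stack:
out, ir = bc2ir_instr(5, ("store", "x"), [("const", 1), ("var", "x"), ("var", "y")])
```

In this example the remaining stack entry for x is rewritten to the temporary t_5^0, while the entry for y is left unchanged, which mirrors the substitution as[t_i⁰/x] in the store rule.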

Transformation of multiple instructions. Ranges of bytecode instructions are transformed with BC2IR_rng(BC, m, I), shown as Algorithm 1. The function traverses a set of instruction addresses I sorted in ascending order, and writes the compiled instructions into the array IR[m,i]. At the same time, it defines the arrays AS_in[m,i] (input stack for instruction i) and AS_out[m,i] (output stack for instruction i).

The function BC2IR_rng(BC, m, I) relies on BC2IR_instr and chains the abstract stacks, such that the output stack of an instruction becomes the input stack of its immediate successor. Additionally, it ensures that the abstract input stacks at jump targets are empty, and fails if an instruction that has a jump target among its successors produces a non-empty output stack.
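The control structure of Algorithm 1 and Algorithm 2 can be sketched in Python as follows. This is a hedged illustration only: the helpers successors, jump_targets, and toy_instr are assumptions introduced for the example (toy_instr is a stand-in for BC2IR_instr that covers four instructions and represents expressions as strings), a method body is assumed to be a dictionary from addresses to instructions, and the successor relation of cpush and cjmp is not modeled.

```python
def successors(code, i):
    """Successor addresses of the instruction at address i (jmp and bnz only)."""
    op, *args = code[i]
    if op == "jmp":
        return {args[0]}
    if op == "bnz":
        return {args[0], i + 1}
    return {i + 1}


def jump_targets(code):
    """All addresses that some jmp or bnz instruction may jump to."""
    return {ins[1] for ins in code.values() if ins[0] in ("jmp", "bnz")}


def toy_instr(i, instr, stack):
    """Stand-in for BC2IR_instr; stacks are lists with the top at index 0."""
    op, *args = instr
    if op == "push":
        return [str(args[0])] + stack, ("block", [])
    if op == "load":
        return [args[0]] + stack, ("block", [])
    if op == "bnz":
        return stack[1:], ("if", stack[0], args[0])
    if op == "jmp":
        return stack, ("jmp", args[0])
    raise NotImplementedError(op)


def bc2ir_rng(code, addresses, as_in, instr_fn=toy_instr):
    """Translate the given addresses in ascending order, chaining the stacks."""
    targets = jump_targets(code)
    as_out, ir = {}, {}
    for i in sorted(addresses):
        if i in targets:
            as_in[i] = []  # the input stack at a jump target is empty
        as_out[i], ir[i] = instr_fn(i, code[i], as_in[i])
        if as_out[i] and successors(code, i) & targets:
            raise ValueError(f"non-empty stack before a jump target at {i}")
        if i + 1 in successors(code, i):
            as_in[i + 1] = as_out[i]  # pass the stack to the fall-through successor
    return ir, as_in, as_out


def bc2ir_mtd(code, entry=0):
    """Translate a method body, starting with an empty stack at the entry point."""
    return bc2ir_rng(code, code.keys(), {entry: []})


code = {0: ("load", "x"), 1: ("bnz", 3), 2: ("jmp", 3), 3: ("push", 0)}
ir, as_in, as_out = bc2ir_mtd(code)
```

For this small program the translation succeeds, because the stack is empty after the bnz and jmp instructions. A program such as {0: ("push", 1), 1: ("jmp", 0)} is rejected, since the jump at address 1 would carry a non-empty stack to the jump target 0.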

Note that it cannot happen that a side condition of BC2IR_instr fails when used within BC2IR_rng. This would only be possible if any of the variables t_i^k occurs in the input stack AS_in[m,i], but no instruction preceding the one at address i generates the variable t_i^k.

Transformation of method bodies. The function BC2IR_mtd(BC, m), shown as Algorithm 2, transforms methods by setting the input stack of the entry point to ε, and then using BC2IR_rng for the method body.

Transformation of bytecode programs. Finally, the full program translation function BC2IR(P_BC) = P_IR is defined as

BC2IR(≺, fields, methods, margs, BC, mentry, mexit) = (≺, fields, methods, margs, tvars, IR, mentry, mexit)

such that for all methods m ∈ dom(BC),

IR(m) = BC2IR_mtd(BC, m)
tvars(m) = {t | t occurs in IR(m)}

In particular, all method bodies are translated using the algorithm BC2IR_mtd.

Properties of the algorithm. We observe that a translated IR program covers the same range of instruction addresses as the original bytecode program; that the abstract stack is empty at jump targets; and that the output stack of each instruction equals the input stack of each of its successors.

Proposition 5.3 Let IR = BC2IR_rng(BC, m, [i_0, i_1[). Then dom(IR) = [i_0, i_1[.

PROOF By definition of BC2IR_rng.

Proposition 5.4 For all i ∈ jmpTgt^BC_m, AS_in[m,i] = ε. Also, AS_in[m, mentry(m)] = ε.

PROOF By definition of the BC2IR algorithm.

Proposition 5.5 If (m, i) ⟼^IR[m,i] (m, i'), then AS_out[m,i] = AS_in[m,i'].

PROOF If i' ∈ jmpTgt^BC_m, then i' ∈ succ(m,i) ∩ jmpTgt^BC_m. As the algorithm did not fail, we must have AS_out[m,i] = ε by definition of BC2IR. With Proposition 5.4, we have AS_in[m,i'] = ε = AS_out[m,i].

If i' ∉ jmpTgt^BC_m, then since i' ∈ succ(m,i), we must have i' = i+1 by definition of jmpTgt^BC_m and by definition of the IR semantics. With i+1 ∈ succ(m,i) and i' = i+1 ∉ jmpTgt^BC_m, we get AS_in[m,i'] = AS_out[m,i] by definition of BC2IR.

5.3.2 Translation of the Example Program

Our example program in bytecode form, as presented at the end of Section 4.2 on page 56, is translated to an IR program

P_IR = (≺, fields, methods, margs, tvars, IR, mentry, mexit)

such that ≺, fields, methods, margs, mentry, and mexit are as in the bytecode program.

The only method that needs to be translated from bytecode to IR is sendFile. Table 5.2 shows the method in bytecode form and the corresponding IR form, including the abstract input and output stacks for each instruction. Note how the IR instructions are a partial reconstruction of the original high-level method body from Section 2.2 on page 22. The translation also defines tvars(sendFile) = {t₈⁰, t₁₃⁰, t₁₄⁰, t₁₉⁰, t₂₀⁰}, because these are the temporary variables that occur in IR(sendFile).

i    BC(sendFile,i)   AS_in[sendFile,i]          IR[sendFile,i]                                AS_out[sendFile,i]
0    cpush 22         ε                          cpush 22                                      ε
1    load file        ε                          block ε                                       [file]
2    getf f_δ         [file]                     block ε                                       [file.f_δ]
3    load srv         [file.f_δ]                 block ε                                       [srv, file.f_δ]
4    getf f_δ         [srv, file.f_δ]            block ε                                       [srv.f_δ, file.f_δ]
5    prim ⊑           [srv.f_δ, file.f_δ]        block ε                                       [file.f_δ ⊑ srv.f_δ]
6    bnz 10           [file.f_δ ⊑ srv.f_δ]       if (file.f_δ ⊑ srv.f_δ) 8                     ε
7    push 0           ε                          block ε                                       [0]
8    store ret        [0]                        block [t₈⁰ := 0; ret := 0]                    ε
9    cjmp 22          ε                          cjmp 22                                       ε
10   load file        ε                          block ε                                       [file]
11   load file        [file]                     block ε                                       [file, file]
12   getf f_δ         [file, file]               block ε                                       [file.f_δ, file]
13   call read        [file.f_δ, file]           block [t₁₃⁰ := file.read(file.f_δ)]           [t₁₃⁰]
14   store data       [t₁₃⁰]                     block [t₁₄⁰ := t₁₃⁰; data := t₁₃⁰]            ε
15   load srv         ε                          block ε                                       [srv]
16   load file        [srv]                      block ε                                       [file, srv]
17   getf f_δ         [file, srv]                block ε                                       [file.f_δ, srv]
18   load data        [file.f_δ, srv]            block ε                                       [data, file.f_δ, srv]
19   call write       [data, file.f_δ, srv]      block [t₁₉⁰ := srv.write(file.f_δ, data)]     [t₁₉⁰]
20   store ret        [t₁₉⁰]                     block [t₂₀⁰ := t₁₉⁰; ret := t₁₉⁰]             ε
21   cjmp 22          ε                          cjmp 22                                       ε

Table 5.2: Translation of the example program from bytecode to IR
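The stack chaining stated in Proposition 5.5 can be checked by hand on the fall-through edges of the example. The following Python sketch does this for the instructions at addresses 1 to 8 of sendFile, with the stacks copied from the table above. The dictionary layout, the ASCII spellings f_delta and <= (for f_δ and ⊑), and the helper name chain_mismatches are assumptions made for the illustration.

```python
# (AS_in, AS_out) per address for addresses 1 to 8 of sendFile;
# expressions are plain strings, with the top of the stack first.
rows = {
    1: ([], ["file"]),
    2: (["file"], ["file.f_delta"]),
    3: (["file.f_delta"], ["srv", "file.f_delta"]),
    4: (["srv", "file.f_delta"], ["srv.f_delta", "file.f_delta"]),
    5: (["srv.f_delta", "file.f_delta"], ["file.f_delta <= srv.f_delta"]),
    6: (["file.f_delta <= srv.f_delta"], []),
    7: ([], ["0"]),
    8: (["0"], []),
}


def chain_mismatches(rows):
    """Addresses i whose output stack differs from the input stack of i + 1."""
    return [
        i
        for i in sorted(rows)
        if i + 1 in rows and rows[i][1] != rows[i + 1][0]
    ]


mismatches = chain_mismatches(rows)
print(mismatches)  # prints []
```

The check only covers edges from an address i to i + 1; the jump edge of the bnz instruction at address 6 is covered by the empty-stack condition at jump targets instead.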
