Writing Type-Stable Code in Julia
For many of the people I talk to, Julia’s main appeal is speed. But achieving peak performance in Julia requires that programmers absorb a few subtle concepts that are generally unfamiliar to users of dynamically typed languages.
One particularly subtle performance pitfall is the need to write type-stable code. Code is said to be type-stable if the type of every variable does not vary over time. To clarify this idea, consider the following two closely related function definitions:
function sumofsins1(n::Integer)
    r = 0
    for i in 1:n
        r += sin(3.4)
    end
    return r
end

function sumofsins2(n::Integer)
    r = 0.0
    for i in 1:n
        r += sin(3.4)
    end
    return r
end
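Both definitions compute the same answer, since adding sin(3.4) to the integer 0 promotes the result to Float64 without any loss; a quick REPL check should print true:

julia> sumofsins1(100_000) == sumofsins2(100_000)
true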
The only difference between these function definitions is that sumofsins1 initializes r to 0, whereas sumofsins2 initializes r to 0.0.
This seemingly minor distinction has important practical implications because the initialization of r to 0 means that the main loop of sumofsins1 begins with a single iteration in which the computer adds sin(3.4) to the integer 0. This single addition step transforms the type of r from Int, which is the type of 0, to Float64, which is the type of sin(3.4). This means that the type of r is not stable over the course of this loop.
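You can watch this transformation happen at the REPL (assuming a 64-bit machine, where integer literals default to Int64):

julia> typeof(0)
Int64

julia> typeof(sin(3.4))
Float64

julia> typeof(0 + sin(3.4))
Float64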
This instability has considerable effects on the performance of sumofsins1. To see this, let’s run some naive benchmarks. As always in Julia, we’ll start with a dry run to get the JIT to compile the functions being compared:
sumofsins1(100_000)
sumofsins2(100_000)
@time [sumofsins1(100_000) for i in 1:100];
@time [sumofsins2(100_000) for i in 1:100];
The results of this timing comparison are quite striking:
julia> @time [sumofsins1(100_000) for i in 1:100];
elapsed time: 0.412261722 seconds (320002496 bytes allocated)
julia> @time [sumofsins2(100_000) for i in 1:100];
elapsed time: 0.008509995 seconds (896 bytes allocated)
As you can see, the type-unstable code in sumofsins1 is roughly 50x slower than the type-stable code. The allocation counts tell the same story: sumofsins1 heap-allocates a boxed value for r on every iteration of its loop, amounting to hundreds of megabytes over the benchmark, while sumofsins2 allocates almost nothing. What might have seemed like a nitpicky point about the initial value of r has enormous performance implications.
To understand the reasons for this huge performance gap, it’s worth considering what effect type-instability has on the compiler. In this case, the compiler can’t optimize the contents of the main loop of sumofsins1 because it can’t be certain that the type of r will remain invariant throughout the entire loop. Without this crucial form of invariance, the compiler has to check the type of r on every iteration of the loop, which is a far more expensive computation than repeatedly adding a constant value to a Float64.
You can confirm for yourself that the compiler produces more complex code by examining the LLVM IR for both of these functions. First, we’ll examine the LLVM IR for sumofsins1:
julia> code_llvm(sumofsins1, (Int, ))
define %jl_value_t* @julia_sumofsins11067(i64) {
top:
%1 = alloca [5 x %jl_value_t*], align 8
%.sub = getelementptr inbounds [5 x %jl_value_t*]* %1, i64 0, i64 0
%2 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 2, !dbg !5145
store %jl_value_t* inttoptr (i64 6 to %jl_value_t*), %jl_value_t** %.sub, align 8
%3 = load %jl_value_t*** @jl_pgcstack, align 8, !dbg !5145
%4 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 1, !dbg !5145
%.c = bitcast %jl_value_t** %3 to %jl_value_t*, !dbg !5145
store %jl_value_t* %.c, %jl_value_t** %4, align 8, !dbg !5145
store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8, !dbg !5145
%5 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 3
store %jl_value_t* null, %jl_value_t** %5, align 8
%6 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 4
store %jl_value_t* null, %jl_value_t** %6, align 8
store %jl_value_t* inttoptr (i64 140379580131904 to %jl_value_t*), %jl_value_t** %2, align 8, !dbg !5150
%7 = icmp slt i64 %0, 1, !dbg !5151
br i1 %7, label %L2, label %pass, !dbg !5151
pass: ; preds = %top, %pass
%8 = phi %jl_value_t* [ %13, %pass ], [ inttoptr (i64 140379580131904 to %jl_value_t*), %top ]
%"#s6.03" = phi i64 [ %14, %pass ], [ 1, %top ]
store %jl_value_t* %8, %jl_value_t** %5, align 8, !dbg !5152
%9 = call %jl_value_t* @alloc_2w(), !dbg !5152
%10 = getelementptr inbounds %jl_value_t* %9, i64 0, i32 0, !dbg !5152
store %jl_value_t* inttoptr (i64 140379580056656 to %jl_value_t*), %jl_value_t** %10, align 8, !dbg !5152
%11 = getelementptr inbounds %jl_value_t* %9, i64 1, i32 0, !dbg !5152
%12 = bitcast %jl_value_t** %11 to double*, !dbg !5152
store double 0xBFD05AC910FF4C6C, double* %12, align 8, !dbg !5152
store %jl_value_t* %9, %jl_value_t** %6, align 8, !dbg !5152
%13 = call %jl_value_t* @jl_apply_generic(%jl_value_t* inttoptr (i64 140379586379936 to %jl_value_t*), %jl_value_t** %5, i32 2), !dbg !5152
store %jl_value_t* %13, %jl_value_t** %2, align 8, !dbg !5152
%14 = add i64 %"#s6.03", 1, !dbg !5152
%15 = icmp sgt i64 %14, %0, !dbg !5151
br i1 %15, label %L2, label %pass, !dbg !5151
L2: ; preds = %pass, %top
%.lcssa = phi %jl_value_t* [ inttoptr (i64 140379580131904 to %jl_value_t*), %top ], [ %13, %pass ]
%16 = load %jl_value_t** %4, align 8, !dbg !5153
%17 = getelementptr inbounds %jl_value_t* %16, i64 0, i32 0, !dbg !5153
store %jl_value_t** %17, %jl_value_t*** @jl_pgcstack, align 8, !dbg !5153
ret %jl_value_t* %.lcssa, !dbg !5153
}
Then we’ll examine the LLVM IR for sumofsins2:
julia> code_llvm(sumofsins2, (Int, ))
define double @julia_sumofsins21068(i64) {
top:
%1 = icmp slt i64 %0, 1, !dbg !5151
br i1 %1, label %L2, label %pass, !dbg !5151
pass: ; preds = %top, %pass
%"#s6.04" = phi i64 [ %3, %pass ], [ 1, %top ]
%r.03 = phi double [ %2, %pass ], [ 0.000000e+00, %top ]
%2 = fadd double %r.03, 0xBFD05AC910FF4C6C, !dbg !5156
%3 = add i64 %"#s6.04", 1, !dbg !5156
%4 = icmp sgt i64 %3, %0, !dbg !5151
br i1 %4, label %L2, label %pass, !dbg !5151
L2: ; preds = %pass, %top
%r.0.lcssa = phi double [ 0.000000e+00, %top ], [ %2, %pass ]
ret double %r.0.lcssa, !dbg !5157
}
The difference in size and complexity of code between these two functions in compiled form is considerable. And this difference is entirely attributable to the compiler’s need to recheck the type of r on every iteration of the main loop in sumofsins1, a check that can be optimized out in sumofsins2, where r has a stable type.
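In this case the cure is as simple as writing r = 0.0, but it’s worth knowing the more general idiom (a standard Julia pattern, not something specific to this comparison) of initializing an accumulator with zero of the appropriate type. A hypothetical sumofsins3 along these lines:

function sumofsins3(n::Integer)
    # zero(x) returns the additive identity of typeof(x), so r starts out
    # as a Float64 and its type never changes inside the loop.
    r = zero(sin(3.4))
    for i in 1:n
        r += sin(3.4)
    end
    return r
end

This compiles down to the same tight loop as sumofsins2, and the same pattern, such as r = zero(eltype(xs)) when summing over a collection xs, keeps accumulators type-stable in generic code.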
Given the performance impact of type-instability, every aspiring Julia programmer needs to learn to recognize potential sources of it in their own code. Future versions of Julia may be configured to issue warnings when type-unstable code is encountered, but, for now, the responsibility lies with the programmer. Thankfully, once you learn about type-stability, it becomes easy to recognize in most cases.
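One caveat to that last point: newer Julia releases (0.4 and later) do ship a tool for this, the @code_warntype macro, which prints the inferred type of every variable and highlights the non-concrete ones. Applied to our unstable function,

julia> @code_warntype sumofsins1(100_000)

it flags r with a non-concrete inferred type (in recent versions, Union{Float64, Int64}), which is precisely the instability discussed above.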