How to affect Delphi XEx code generation for Android/ARM targets?

Android, Delphi, Android-Ndk, Arm, Llvm

Android Problem Overview


Update 2017-05-17. I no longer work for the company where this question originated, and do not have access to Delphi XEx. While I was there, the problem was solved by migrating to mixed FPC+GCC (Pascal+C), with NEON intrinsics for some routines where it made a difference. (FPC+GCC is highly recommended also because it enables using standard tools, particularly Valgrind.) If someone can demonstrate, with credible examples, how they are actually able to produce optimized ARM code from Delphi XEx, I'm happy to accept the answer.


Embarcadero's Delphi compilers use an LLVM backend to produce native ARM code for Android devices. I have large amounts of Pascal code that I need to compile into Android applications and I would like to know how to make Delphi generate more efficient code. Right now, I'm not even talking about advanced features like automatic SIMD optimizations, just about producing reasonable code. Surely there must be a way to pass parameters to the LLVM side, or somehow affect the result? Usually, any compiler will have many options to affect code compilation and optimization, but Delphi's ARM targets seem to be just "optimization on/off" and that's it.
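For reference, the only source-level knob I'm aware of is the classic $O / $OPTIMIZATION directive, sketched below (the routine is just a placeholder of my own); there seems to be no way to pass finer-grained options through to the LLVM side:

{$O+}  // optimization on; also available as a project-level compiler option
function Scale(X : Integer) : Integer;
begin
  Result := X * 2;
end;
{$O-}  // optimization off again for the rest of the unit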

LLVM is supposed to be capable of producing reasonably tight and sensible code, but it seems that Delphi is using its facilities in a weird way. Delphi uses the stack very heavily, and it generally only utilizes the processor's registers r0-r3 as temporary variables. Perhaps craziest of all, it seems to load normal 32 bit integers as four 1-byte load operations. How can I make Delphi produce better ARM code, without the byte-by-byte hassle it creates on Android?

At first I thought the byte-by-byte loading was for swapping byte order from big-endian, but that was not the case; it really is just loading a 32 bit number with 4 single-byte loads. It might be doing this to read the full 32 bits without issuing an unaligned word-sized memory load (whether it should avoid that is another matter, and would hint at the whole thing being a compiler bug).
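To illustrate the kind of misalignment the compiler might be guarding against, here is a contrived sketch of my own (not from the real code): a PInteger may legally point into a packed record where the Integer is not 4-byte aligned, so a single word-sized load through it would be an unaligned access.

program MisalignDemo;

type
  TPackedRec = packed record
    Flag  : Byte;
    Value : Integer;  // sits at offset 1, so it is not 4-byte aligned
  end;

function ReadInteger(APInteger : PInteger) : Integer;
begin
  Result := APInteger^;
end;

var
  Rec : TPackedRec;
begin
  Rec.Value := 123;
  // @Rec.Value is a perfectly legal PInteger that is misaligned;
  // a single word-sized LDR through it would be an unaligned access on ARM
  WriteLn(ReadInteger(@Rec.Value));
end.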

Let's look at this simple function:

function ReadInteger(APInteger : PInteger) : Integer;
begin
  Result := APInteger^;
end;

Even with optimizations switched on, both Delphi XE7 with update pack 1 and XE6 produce the following ARM assembly code for that function:

Disassembly of section .text._ZN16Uarmcodetestform11ReadIntegerEPi:

00000000 <_ZN16Uarmcodetestform11ReadIntegerEPi>:
   0:	b580      	push	{r7, lr}
   2:	466f      	mov	r7, sp
   4:	b083      	sub	sp, #12
   6:	9002      	str	r0, [sp, #8]
   8:	78c1      	ldrb	r1, [r0, #3]
   a:	7882      	ldrb	r2, [r0, #2]
   c:	ea42 2101 	orr.w	r1, r2, r1, lsl #8
  10:	7842      	ldrb	r2, [r0, #1]
  12:	7803      	ldrb	r3, [r0, #0]
  14:	ea43 2202 	orr.w	r2, r3, r2, lsl #8
  18:	ea42 4101 	orr.w	r1, r2, r1, lsl #16
  1c:	9101      	str	r1, [sp, #4]
  1e:	9000      	str	r0, [sp, #0]
  20:	4608      	mov	r0, r1
  22:	b003      	add	sp, #12
  24:	bd80      	pop	{r7, pc}

Just count the number of instructions and memory accesses Delphi needs for that. And constructing a 32 bit integer from 4 single-byte loads... If I change the function a little bit and use a var parameter instead of a pointer, the result is slightly less convoluted, as shown below.
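The Pascal source of that variant is not quoted above; a minimal reconstruction (the routine name is taken from the mangled symbol below, the parameter name is my guess) would be:

function ReadIntegerVar(var AInteger : Integer) : Integer;
begin
  Result := AInteger;
end;

This is the Android disassembly Delphi produces for it: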

Disassembly of section .text._ZN16Uarmcodetestform14ReadIntegerVarERi:

00000000 <_ZN16Uarmcodetestform14ReadIntegerVarERi>:
   0:	b580      	push	{r7, lr}
   2:	466f      	mov	r7, sp
   4:	b083      	sub	sp, #12
   6:	9002      	str	r0, [sp, #8]
   8:	6801      	ldr	r1, [r0, #0]
   a:	9101      	str	r1, [sp, #4]
   c:	9000      	str	r0, [sp, #0]
   e:	4608      	mov	r0, r1
  10:	b003      	add	sp, #12
  12:	bd80      	pop	{r7, pc}

I won't include the disassembly here, but for iOS, Delphi produces identical code for the pointer and var parameter versions, and they are almost but not exactly the same as the Android var parameter version. Edit: to clarify, the byte-by-byte loading is only on Android. And only on Android, the pointer and var parameter versions differ from each other. On iOS both versions generate exactly the same code.

For comparison, here's what FPC 2.7.1 (SVN trunk version from March 2014) thinks of the function with optimization level -O2. The pointer and var parameter versions are exactly the same.

Disassembly of section .text.n_p$armcodetest_$$_readinteger$pinteger$$longint:

00000000 <P$ARMCODETEST_$$_READINTEGER$PINTEGER$$LONGINT>:

   0:	6800      	ldr	r0, [r0, #0]
   2:	46f7      	mov	pc, lr

I also tested an equivalent C function with the C compiler that comes with the Android NDK.

int ReadInteger(int *APInteger)
{
	return *APInteger;
}

And this compiles into essentially the same thing FPC made:

Disassembly of section .text._Z11ReadIntegerPi:

00000000 <_Z11ReadIntegerPi>:
   0:	6800      	ldr	r0, [r0, #0]
   2:	4770      	bx	lr

Android Solutions


Solution 1 - Android

> We are investigating the issue. In short, it depends on the potential mis-alignment (to 32 boundary) of the Integer referenced by a pointer. Need a little more time to have all of the answers... and a plan to address this.
>
> Marco Cantù, moderator on Delphi Developers

Also see https://stackoverflow.com/questions/27821277/why-are-the-delphi-zlib-and-zip-libraries-so-slow-under-64-bit, as the Win64 libraries are shipped without optimizations enabled.


In the QP report RSP-9922, Bad ARM code produced by the compiler, $O directive ignored?, Marco added the following explanation:

> There are multiple issues here:
>
> * As indicated, optimization settings apply only to entire unit files and not to individual functions. Simply put, turning optimization on and off in the same file will have no effect.
> * Furthermore, simply having "Debug information" enabled turns off optimization. Thus, when one is debugging, explicitly turning on optimizations will have no effect. Consequently, the CPU view in the IDE will not be able to display a disassembled view of optimized code.
> * Third, loading non-aligned 64bit data is not safe and does result in errors, hence the separate 4 one byte operations that are needed in given scenarios.
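Going by that explanation, the most one can apparently do from the source side is to enable optimization once for the whole unit and keep debug information off for it. Below is a minimal sketch of my own (the unit name is a placeholder; whether the unit-level {$D-} switch is sufficient, or only the project-level "Debug information" option counts, is not spelled out in the report):

unit HotPath;  // placeholder name

{$O+}  // optimization applies to the entire unit, not to individual functions
{$D-}  // debug information off; having it enabled turns optimization off

interface

function ReadInteger(APInteger : PInteger) : Integer;

implementation

function ReadInteger(APInteger : PInteger) : Integer;
begin
  Result := APInteger^;
end;

end.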

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: Side S. Fresh (View Question on Stackoverflow)
Solution 1 - Android: Kirk Strobeck (View Answer on Stackoverflow)