How to efficiently convert an audio sample in 16-bit signed integer format to a 32-bit float value on an ARM NEON CPU? And how to achieve bit-exact results?
There are several ways to do it in different projects:
s16 -> float |
float -> s16 |
|
| PulseAudio | flt = sample / (float) 0x7fff; |
sample = lrintf(clip_flt(flt) * 0x7fff) |
| libavresample | flt = sample / (float) (1<<15); |
sample = (s16) clip_s16(lrintf(flt * (1 << 15))); |
| RtAudio | flt = (sample + 0.5f) * (1 / 32767.5f); |
sample = (s16) (flt * 32767.5f - 0.5f); |
clip_s16() saturates a 16-bit short integer (-32768..32767); clip_flt() returns a float -1.0..1.0.
Observations regarding PulseAudio:
flt_to_s16(s16_to_flt(x)) != x for x == -32768
x / (float) 0x7fff != x * (1.0f / 0x7fff)
on the other hand
x / (float) (1<<15) == x * (1.0f / (1<<15))
and the second form would allow to avoid division in favour of
multiplication of the inverse; the problem with the first form is a slight deviation for certain input values
lrintf() rounds according to the current rounding mode which by default is
round-toward-nearest integer, toward-even for tie breaking). For example, this means:
12.3 -> 12 12.5 -> 12 (!) 12.7 -> 13 13.3 -> 13 13.5 -> 14 (!) 13.7 -> 14So .5 values are rounded to an even value.
static void float_to_s16(const float *src, int16_t *dst) {
__asm__ __volatile__ (
"vdup.f32 q2, %[two23] \n\t"
"vdup.f32 q3, %[scale] \n\t"
"vdup.u32 q4, %[mask] \n\t"
"vdup.f32 q5, %[mone] \n\t"
"vld1.32 {q0}, [%[src]]! \n\t" /* load x */
"vmaxq.f32 q0, q0, q5 \n\t" /* clip at -1.0 */
"vmul.f32 q0, q0, q3 \n\t" /* scale */
"vand.u32 q1, q0, q4 \n\t" /* get sign bit */
"vorr.u32 q1, q1, q2 \n\t" /* put sign on 2^23 */
"vadd.f32 q0, q1, q0 \n\t" /* sgn(x)*2^23 + x ... */
"vsub.f32 q0, q0, q1 \n\t" /* ... - sgn(x)*2^23 */
"vcvt.s32.f32 q0, q0 \n\t" /* convert to int */
"vqmovn.s32 d0, q0 \n\t" /* saturate and narrow */
"vst1.16 {d0}, [%[dst]]! \n\t"
: [dst] "+r" (dst), [src] "+r" (src) /* output operands (or input operands that get modified) */
: [scale] "r" (32767.0f), [two23] "r" (8.3886080000e+06f), [mask] "r" (0x80000000), [mone] "r" (-1.0f) /* input operands */
: "memory", "cc", "q0", "q1", "q2", "q3", "q4", "q5" /* clobber list */
);
}
Observations:
vmaxq instruction is needed to match PulseAudio's clipping sematics (clip -1.0f to -32767 instead of -32768); otherwise, vqmovn takes care of narrowing a 32-bit signed integer to 16-bit and saturation.
vand) and or-ing the sign to the two23 value.
lrintf() rounding we have: one extra vand, vor, vadd, vsub -- not bad!
sample / (float) 0x7fff; without actually performing the costly division. I am not saying that this makes much sense, but hey, we can
First we observe that a discrepancy (i.e. sample / (float) 0x7fff != sample * (1.0f / 0x7fff)) occurs when the binary representation of the input value
converted to float ends in 0x4000 (that is, q0 & 0xffff == 0x4000after the vcvt instruction). There are 1536 such problematic values over all possible inputs.
static void s16_to_float(const int16_t *src, float *dst) {
__asm__ __volatile__ (
"vdup.f32 q1, %[invscale] \n\t"
"vdup.u16 q3, %[mask] \n\t"
"vdup.u32 q4, %[one] \n\t"
"vld1.16 {d0}, [%[src]]! \n\t" /* load x */
"vmovl.s16 q0, d0 \n\t" /* s16 -> s32 */
"vcvt.f32.s32 q0, q0 \n\t" /* s32 -> float */
"vceq.u16 q2, q0, q3 \n\t" /* check for defect */
"vand.u32 q2, q2, q4 \n\t" /* prepare 1 if defect */
"vmul.f32 q0, q0, q1 \n\t" /* multiply by invscale */
"vadd.u32 q0, q0, q2 \n\t" /* correct if defect */
"vst1.32 {q0}, [%[dst]]! \n\t"
: [dst] "+r" (dst), [src] "+r" (src) /* output operands (or input operands that get modified) */
: [invscale] "r" (invscale), [mask] "r" (0x4000), [one] "r" (1) /* input operands */
: "memory", "cc", "q0", "q1", "q2", "q3", "q4" /* clobber list */
);
}
Observations:
vceq check for the problem condition; it sets each matching 16-bit word to 0xffff if is 0x4000.
vand just keeps the LSB in each 32-bit word, hence a 1 indicated the problem condition for each float.
vadd adds the correction bit to the multiplication result -- making multiplication by the inverse of 0x7fff identical to divsion by 0x7fff.
vceq, vand, vadd -- not bad!
posted at: 11:17 | path: /programming | permanent link | 0 comments
Hacked a very crude Linux driver for the Leopard Imaging LI-LBCM1M1 camera board with Aptina MT9M112 sensor (1.3 MPixel, 1280x1024).
Driver is based on patches by Laurent Pinchart and Linux 3.5-rc2. I have no clue how this media-controller, v4l2-subdev stuff really works... The host side is a beagleboard XM with an TI OMAP3 processor and a parallel synchronous data interface. The following captures a frame:
media-ctl -r -l '"mt9m112 2-0048":0->"OMAP3 ISP CCDC":0[1], "OMAP3 ISP CCDC":1->"OMAP3 ISP CCDC output":0[1]' media-ctl -f '"mt9m112 2-0048":0[YUYV2X8 1280x1024], "OMAP3 ISP CCDC":1[YUYV2X8 1280x1024]' yavta -p -f YUYV -s 1280x1024 -n 4 --capture=50 --skip 49 -F `media-ctl -e "OMAP3 ISP CCDC output"`
Let's see if I manage to clean-up and upstream the code...
Oh, a captures frame can be seen here. Looks good
Here is my professional setup:
posted at: 17:31 | path: /programming | permanent link | 0 comments
Note to myself:
convert -depth 8 -size 1280x1024 -sampling-factor 4:2:2 yuv:frame.bin -colorspace rgb frame.png
Calls ImageMagick's convert; frame.bin (2621440 Bytes) is in UYVY format, result is a PNG image.
Taken with a Leopard Imaging LI-LBCM1M1 camera board with Aptina MT9M112 sensor (1.3 MPixel, 1280x1024), click to enlarge.
posted at: 17:09 | path: /programming | permanent link | 0 comments
Some ActionScript code (triangle.as) which displays a triangle and a rotating rectangle:
_root.createEmptyMovieClip('triangle', 1);
with (_root.triangle) {
lineStyle(10, 0xff0000, 100);
moveTo(200, 200);
lineTo(300, 300);
lineTo(100, 300);
lineTo(200, 200);
};
var r:MovieClip = createEmptyMovieClip('rectangle', 2);
r.beginFill(0xcc00cc, 100);
r.moveTo(20, 20);
r.lineTo(140, 20);
r.lineTo(140, 140);
r.lineTo(20, 140);
r.endFill();
r._x = 230;
r._y = 190;
r.onEnterFrame = function() {
this._rotation += 5;
};
Compile it using ming (a library for generating Macromedia Flash files, .swf):
makeswf -r 12 -v 9 -o triangle.swf triangle.aswhere -r specifies the frames-per-second and -v gives the Flash format/version to create (up to 9 is supported).
Finally, embed the scary Flash thing:
<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0" width="550" height="400" id="home" align=""> <param name="movie" value='/as/triangle.swf'> <param name="quality" value="high"> <param name="bgcolor" value="#ffffff"> <embed src='/as/triangle.swf' quality="high" bgcolor="#ffffff" width="550" height="400" name="home" align="" type="application/x-shockwave-flash" pluginspage="http://www.macromedia.com/go/getflashplayer"> </object>
posted at: 09:57 | path: /programming | permanent link | 0 comments
The following code converts float values to 16-bit signed integer values using ARM NEON intrinsics (assuming n is a multiple of 4) -- for instance audio samples.
The vcvtq_s32_f32 instruction rounds towards zero, not towards
the nearest integer. In C, the semantics would be trunc() instead
of lrintf().
To overcome the issue, one could implement:
float a; short b = trunc(a + ((a > 0) ? 0.5 : - 0.5));To get rid of the condition, the trick is to get the sign bit (the MSB of a float) and
or it to the constant 0.5 before adding it to a.
In C:
float a; short b = trunc(a + float((uint32(a) & 0x8000000) | uint32(0.5)));
The complete code using ARM NEON intrinsics looks as follows:
void conv_s16_from_float(unsigned n, const float *a, short *b) {
unsigned i;
const float32x4_t plusone4 = vdupq_n_f32(1.0f);
const float32x4_t minusone4 = vdupq_n_f32(-1.0f);
const float32x4_t half4 = vdupq_n_f32(0.5f);
const float32x4_t scale4 = vdupq_n_f32(32767.0f);
const uint32x4_t mask4 = vdupq_n_u32(0x80000000);
for (i = 0; i < n/4; i++) {
float32x4_t v4 = ((float32x4_t *)a)[i];
v4 = vmulq_f32(vmaxq_f32(vminq_f32(v4, plusone4) , minusone4), scale4);
const float32x4_t w4 = vreinterpretq_f32_u32(vorrq_u32(vandq_u32(
vreinterpretq_u32_f32(v4), mask4), vreinterpretq_u32_f32(half4)));
((int16x4_t *)b)[i] = vmovn_s32(vcvtq_s32_f32(vaddq_f32(v4, w4)));
}
}
posted at: 13:35 | path: /programming | permanent link | 0 comments
http://apidocs.meego.com/1.0/mtf/gestures.html
http://www.enricoros.com/blog/2009/12/im-going-multi-touch/
http://who-t.blogspot.com/
http://doc.qt.nokia.com/latest/gestures-imagegestures.html
http://doc.qt.nokia.com/4.8/qtouchevent.html
http://www.slideshare.net/qtbynokia/using-multitouch-and-gestures-with-qt
posted at: 17:08 | path: /programming | permanent link | 0 comments
All measurements in seconds for 10000 repetitions. Run on Beagleboard-XM clocked at 900 MHz (no RunFast). Compiled using -O2 -march=armv7-a -ffast-math -fPIC -mfloat-abi=softfp -mfpu=neon.
| length (N) | ooura | djb | kiss | libav | fftw2 | fftw3 | fftw3/neon | fftw3/new |
| 2048 | 10.22 | 11.56 | 14.2 | 1.0 | 10.92 | 16.16 | 2.82 | 2.87 |
| 1024 | 4.5 | 5.2 | 5.61 | 0.46 | 5.11 | 7.22 | 1.16 | 1.16 |
| 512 | 2.07 | 2.3 | 2.98 | 0.2 | 2.59 | 2.89 | 0.36 | 0.34 |
| 256 | 0.88 | 1.0 | 1.12 | 0.08 | 1.01 | 1.12 | 0.12 | 0.11 |
| length (N) | ooura | djb | kiss | libav | fftw2 | fftw3 | fftw3/neon | fftw3/new |
| 2048 | 5.37 | - | 6.91 | 0.7 | 4.71 | 7.37 | 7.38 | 7.38 |
| 1024 | 2.49 | - | 3.45 | 0.32 | 2.19 | 3.14 | 3.13 | 3.13 |
| 512 | 1.09 | - | 1.43 | 0.2 | 1.09 | 1.2 | 1.2 | 1.2 |
| 256 | 0.49 | - | 0.72 | 0.08 | 0.41 | 0.46 | 0.46 | 0.47 |
oourafft (as of 2006/12/28) is free and available at http://www.kurims.kyoto-u.ac.jp/~ooura/fft.html.
djbfft is available at http://cr.yp.to/djbfft.html.
kissfft is under BSD license and available at http://sourceforge.net/projects/kissfft/.
fftw2 is GPL licensed (version 2.1.5), available at http://www.fftw.org/.
fftw3 is GPL licensed (version 3.2.2).
fftw3/neon is based on fftw 3.2.2 and has
ARM/NEON patches added.
fftw3/new is GPL licensed (version 3.3.1-beta) and has ARM/NEON support.
posted at: 14:00 | path: /programming | permanent link | 0 comments