pmeerw's blog

Tweets

Mon, 29 Oct 2012

Adventures in fast 16-bit audio sample to float conversion

How to efficiently convert an audio sample in 16-bit signed integer format to a 32-bit float value on an ARM NEON CPU? And how to achieve bit-exact results?

There are several ways to do it in different projects:
s16 -> float float -> s16
PulseAudio flt = sample / (float) 0x7fff; sample = lrintf(clip_flt(flt) * 0x7fff)
libavresample flt = sample / (float) (1<<15); sample = (s16) clip_s16(lrintf(flt * (1 << 15)));
RtAudio flt = (sample + 0.5f) * (1 / 32767.5f); sample = (s16) (flt * 32767.5f - 0.5f);
clip_s16() saturates a 16-bit short integer (-32768..32767); clip_flt() returns a float -1.0..1.0.

Observations regarding PulseAudio:

  1. flt_to_s16(s16_to_flt(x)) != x for x == -32768
  2. x / (float) 0x7fff != x * (1.0f / 0x7fff) on the other hand x / (float) (1<<15) == x * (1.0f / (1<<15)) and the second form would allow to avoid division in favour of multiplication of the inverse; the problem with the first form is a slight deviation for certain input values
  3. CPUs saturation instructions cannot be directly used; float_to_s16() in PulseAudio never produces -32768

lrintf() rounds according to the current rounding mode which by default is round-toward-nearest integer, toward-even for tie breaking). For example, this means:

12.3 -> 12
12.5 -> 12 (!)
12.7 -> 13
13.3 -> 13
13.5 -> 14 (!)
13.7 -> 14
So .5 values are rounded to an even value.

Efficient float_to_s16() on ARM NEON

I'm just showing the initialization and the code in the inner loop, processing 4 samples at a time.
static void float_to_s16(const float *src, int16_t *dst) {
  __asm__ __volatile__ (
    "vdup.f32   q2, %[two23]            \n\t"
    "vdup.f32   q3, %[scale]            \n\t"
    "vdup.u32   q4, %[mask]             \n\t"
    "vdup.f32   q5, %[mone]             \n\t"

    "vld1.32    {q0}, [%[src]]!         \n\t" /* load x */
    "vmaxq.f32  q0, q0, q5              \n\t" /* clip at -1.0 */
    "vmul.f32   q0, q0, q3              \n\t" /* scale */
    "vand.u32   q1, q0, q4              \n\t" /* get sign bit */
    "vorr.u32   q1, q1, q2              \n\t" /* put sign on 2^23 */
    "vadd.f32   q0, q1, q0              \n\t" /* sgn(x)*2^23 + x ... */
    "vsub.f32   q0, q0, q1              \n\t" /* ... - sgn(x)*2^23 */
    "vcvt.s32.f32 q0, q0                \n\t" /* convert to int */
    "vqmovn.s32  d0, q0                 \n\t" /* saturate and narrow */
    "vst1.16    {d0}, [%[dst]]!         \n\t"
                                                                       
    : [dst] "+r" (dst), [src] "+r" (src) /* output operands (or input operands that get modified) */
    : [scale] "r" (32767.0f), [two23] "r" (8.3886080000e+06f), [mask] "r" (0x80000000), [mone] "r" (-1.0f) /* input operands */
    : "memory", "cc", "q0", "q1", "q2", "q3", "q4", "q5" /* clobber list */                                        
  );
}
Observations: So for lrintf() rounding we have: one extra vand, vor, vadd, vsub -- not bad!

Efficient s16_to_float() on ARM NEON

On we turn to the inverse operation in a similar manner. Goal is to obtain bit-extact results for the operation sample / (float) 0x7fff; without actually performing the costly division. I am not saying that this makes much sense, but hey, we can :-)

First we observe that a discrepancy (i.e. sample / (float) 0x7fff != sample * (1.0f / 0x7fff)) occurs when the binary representation of the input value converted to float ends in 0x4000 (that is, q0 & 0xffff == 0x4000after the vcvt instruction). There are 1536 such problematic values over all possible inputs.

static void s16_to_float(const int16_t *src, float *dst) {
  __asm__ __volatile__ (
    "vdup.f32   q1, %[invscale]         \n\t"
    "vdup.u16   q3, %[mask]             \n\t"
    "vdup.u32   q4, %[one]              \n\t"

    "vld1.16    {d0}, [%[src]]!         \n\t" /* load x */
    "vmovl.s16  q0, d0                  \n\t" /* s16 -> s32 */
    "vcvt.f32.s32 q0, q0                \n\t" /* s32 -> float */
                  
    "vceq.u16   q2, q0, q3              \n\t" /* check for defect */
    "vand.u32   q2, q2, q4              \n\t" /* prepare 1 if defect */
                                  
    "vmul.f32   q0, q0, q1              \n\t" /* multiply by invscale */
    "vadd.u32   q0, q0, q2              \n\t" /* correct if defect */
    "vst1.32    {q0}, [%[dst]]!         \n\t"

    : [dst] "+r" (dst), [src] "+r" (src) /* output operands (or input operands that get modified) */
    : [invscale] "r" (invscale), [mask] "r" (0x4000), [one] "r" (1) /* input operands */
    : "memory", "cc", "q0", "q1", "q2", "q3", "q4" /* clobber list */
  );
}
Observations: So for the exact conversion / division we have: one extra vceq, vand, vadd -- not bad!

posted at: 11:17 | path: /programming | permanent link | 0 comments

Thu, 21 Jun 2012

LI-LBCM1M1 / MT9M112 Linux driver

Hacked a very crude Linux driver for the Leopard Imaging LI-LBCM1M1 camera board with Aptina MT9M112 sensor (1.3 MPixel, 1280x1024).

Driver is based on patches by Laurent Pinchart and Linux 3.5-rc2. I have no clue how this media-controller, v4l2-subdev stuff really works... The host side is a beagleboard XM with an TI OMAP3 processor and a parallel synchronous data interface. The following captures a frame:

media-ctl -r -l '"mt9m112 2-0048":0->"OMAP3 ISP CCDC":0[1], "OMAP3 ISP CCDC":1->"OMAP3 ISP CCDC output":0[1]'
media-ctl -f '"mt9m112 2-0048":0[YUYV2X8 1280x1024], "OMAP3 ISP CCDC":1[YUYV2X8 1280x1024]'
yavta -p -f YUYV -s 1280x1024 -n 4 --capture=50 --skip 49 -F `media-ctl -e "OMAP3 ISP CCDC output"`

Let's see if I manage to clean-up and upstream the code... Oh, a captures frame can be seen here. Looks good :-)

Here is my professional setup:

posted at: 17:31 | path: /programming | permanent link | 0 comments

Converting YUV to RGB with ImageMagick

Note to myself: convert -depth 8 -size 1280x1024 -sampling-factor 4:2:2 yuv:frame.bin -colorspace rgb frame.png

Calls ImageMagick's convert; frame.bin (2621440 Bytes) is in UYVY format, result is a PNG image.

Taken with a Leopard Imaging LI-LBCM1M1 camera board with Aptina MT9M112 sensor (1.3 MPixel, 1280x1024), click to enlarge.

posted at: 17:09 | path: /programming | permanent link | 0 comments

Wed, 30 May 2012

ActionScript with Linux

Some ActionScript code (triangle.as) which displays a triangle and a rotating rectangle:

_root.createEmptyMovieClip('triangle', 1);

with (_root.triangle) {
  lineStyle(10, 0xff0000, 100);
  moveTo(200, 200);
  lineTo(300, 300);
  lineTo(100, 300);
  lineTo(200, 200);
};
          
var r:MovieClip = createEmptyMovieClip('rectangle', 2);
          
r.beginFill(0xcc00cc, 100);
r.moveTo(20, 20);
r.lineTo(140, 20);
r.lineTo(140, 140);
r.lineTo(20, 140);
r.endFill();
          
r._x = 230;
r._y = 190;
          
r.onEnterFrame = function() {
  this._rotation += 5;
};

Compile it using ming (a library for generating Macromedia Flash files, .swf):

makeswf -r 12 -v 9 -o triangle.swf triangle.as
where -r specifies the frames-per-second and -v gives the Flash format/version to create (up to 9 is supported).

Finally, embed the scary Flash thing:

<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0" width="550" height="400" id="home" align="">
<param name="movie" value='/as/triangle.swf'>
<param name="quality" value="high">
<param name="bgcolor" value="#ffffff">
<embed src='/as/triangle.swf' quality="high" bgcolor="#ffffff" width="550" height="400" name="home" align="" type="application/x-shockwave-flash" pluginspage="http://www.macromedia.com/go/getflashplayer">
</object>

Here's the code and html.

posted at: 09:57 | path: /programming | permanent link | 0 comments

Mon, 09 Jan 2012

Convert float-to-int with ARM NEON intrinsics

The following code converts float values to 16-bit signed integer values using ARM NEON intrinsics (assuming n is a multiple of 4) -- for instance audio samples.

The vcvtq_s32_f32 instruction rounds towards zero, not towards the nearest integer. In C, the semantics would be trunc() instead of lrintf().

To overcome the issue, one could implement:

float a;
short b = trunc(a + ((a > 0) ? 0.5 : - 0.5));
To get rid of the condition, the trick is to get the sign bit (the MSB of a float) and or it to the constant 0.5 before adding it to a. In C:
float a;
short b = trunc(a + float((uint32(a) & 0x8000000) | uint32(0.5)));

The complete code using ARM NEON intrinsics looks as follows:

void conv_s16_from_float(unsigned n, const float *a, short *b) {
    unsigned i;
    
    const float32x4_t plusone4 = vdupq_n_f32(1.0f);
    const float32x4_t minusone4 = vdupq_n_f32(-1.0f);
    const float32x4_t half4 = vdupq_n_f32(0.5f);
    const float32x4_t scale4 = vdupq_n_f32(32767.0f);
    const uint32x4_t mask4 = vdupq_n_u32(0x80000000);
                        
    for (i = 0; i < n/4; i++) {
        float32x4_t v4 = ((float32x4_t *)a)[i];
        v4 = vmulq_f32(vmaxq_f32(vminq_f32(v4, plusone4) , minusone4), scale4);
                                           
        const float32x4_t w4 = vreinterpretq_f32_u32(vorrq_u32(vandq_u32(
            vreinterpretq_u32_f32(v4), mask4), vreinterpretq_u32_f32(half4)));
                                                                   
        ((int16x4_t *)b)[i] = vmovn_s32(vcvtq_s32_f32(vaddq_f32(v4, w4)));
    }
}

posted at: 13:35 | path: /programming | permanent link | 0 comments

Thu, 20 Oct 2011

Linux, multi-touch, Qt

http://apidocs.meego.com/1.0/mtf/gestures.html
http://www.enricoros.com/blog/2009/12/im-going-multi-touch/
http://who-t.blogspot.com/
http://doc.qt.nokia.com/latest/gestures-imagegestures.html
http://doc.qt.nokia.com/4.8/qtouchevent.html
http://www.slideshare.net/qtbynokia/using-multitouch-and-gestures-with-qt

posted at: 17:08 | path: /programming | permanent link | 0 comments

Mon, 19 Sep 2011

FFT performance on ARM Cortex-A8

All measurements in seconds for 10000 repetitions. Run on Beagleboard-XM clocked at 900 MHz (no RunFast). Compiled using -O2 -march=armv7-a -ffast-math -fPIC -mfloat-abi=softfp -mfpu=neon.

complex-to-complex

length (N) ooura djb kiss libav fftw2 fftw3 fftw3/neon fftw3/new
2048 10.22 11.56 14.2 1.0 10.92 16.16 2.82 2.87
1024 4.5 5.2 5.61 0.46 5.11 7.22 1.16 1.16
512 2.07 2.3 2.98 0.2 2.59 2.89 0.36 0.34
256 0.88 1.0 1.12 0.08 1.01 1.12 0.12 0.11

real-to-complex

length (N) ooura djb kiss libav fftw2 fftw3 fftw3/neon fftw3/new
2048 5.37 - 6.91 0.7 4.71 7.37 7.38 7.38
1024 2.49 - 3.45 0.32 2.19 3.14 3.13 3.13
512 1.09 - 1.43 0.2 1.09 1.2 1.2 1.2
256 0.49 - 0.72 0.08 0.41 0.46 0.46 0.47

oourafft (as of 2006/12/28) is free and available at http://www.kurims.kyoto-u.ac.jp/~ooura/fft.html.
djbfft is available at http://cr.yp.to/djbfft.html.
kissfft is under BSD license and available at http://sourceforge.net/projects/kissfft/.
fftw2 is GPL licensed (version 2.1.5), available at http://www.fftw.org/. fftw3 is GPL licensed (version 3.2.2). fftw3/neon is based on fftw 3.2.2 and has ARM/NEON patches added. fftw3/new is GPL licensed (version 3.3.1-beta) and has ARM/NEON support.

posted at: 14:00 | path: /programming | permanent link | 0 comments

Made with PyBlosxom