pmeerw's blog

Tweets

Mon, 09 Jan 2012

Convert float-to-int with ARM NEON intrinsics

The following code converts float values to 16-bit signed integer values using ARM NEON intrinsics (assuming n is a multiple of 4) -- for instance audio samples.

The vcvtq_s32_f32 instruction rounds towards zero, not towards the nearest integer. In C, the semantics would be trunc() instead of lrintf().

To overcome the issue, one could implement:

float a;
short b = trunc(a + ((a > 0) ? 0.5 : - 0.5));
To get rid of the condition, the trick is to get the sign bit (the MSB of a float) and or it to the constant 0.5 before adding it to a. In C:
float a;
short b = trunc(a + float((uint32(a) & 0x8000000) | uint32(0.5)));

The complete code using ARM NEON intrinsics looks as follows:

void conv_s16_from_float(unsigned n, const float *a, short *b) {
    unsigned i;
    
    const float32x4_t plusone4 = vdupq_n_f32(1.0f);
    const float32x4_t minusone4 = vdupq_n_f32(-1.0f);
    const float32x4_t half4 = vdupq_n_f32(0.5f);
    const float32x4_t scale4 = vdupq_n_f32(32767.0f);
    const uint32x4_t mask4 = vdupq_n_u32(0x80000000);
                        
    for (i = 0; i < n/4; i++) {
        float32x4_t v4 = ((float32x4_t *)a)[i];
        v4 = vmulq_f32(vmaxq_f32(vminq_f32(v4, plusone4) , minusone4), scale4);
                                           
        const float32x4_t w4 = vreinterpretq_f32_u32(vorrq_u32(vandq_u32(
            vreinterpretq_u32_f32(v4), mask4), vreinterpretq_u32_f32(half4)));
                                                                   
        ((int16x4_t *)b)[i] = vmovn_s32(vcvtq_s32_f32(vaddq_f32(v4, w4)));
    }
}

posted at: 13:35 | path: /programming | permanent link | 0 comments

Mon, 05 Dec 2011

Gestures and proximity sensors

GestureWatch: Using proximity detectors to control mobile devices with touchless hand gestures http://www.youtube.com/watch?v=IZCJRJRkhF8 Contactless gesture recognition system using proximity sensors http://www.google.at/url?sa=t&rct=j&q=contactless%20gesture%20recognition%20system%20using%20proximity%20sensors&source=web&cd=1&ved=0CCMQFjAA&url=http%3A%2F%2Frepository.cmu.edu%2Fcgi%2Fviewcontent.cgi%3Farticle%3D1017%26context%3Dsilicon_valley&ei=bS7dTrH1BMiWOuXm9cEO&usg=AFQjCNEgqYB3jihDa-BQTjzmDIQ5gqX1tw&sig2=bRt0nc46x6VOgJqoblEfiQ

posted at: 22:08 | path: /projects | permanent link | 0 comments

Thu, 20 Oct 2011

Linux, multi-touch, Qt

http://apidocs.meego.com/1.0/mtf/gestures.html
http://www.enricoros.com/blog/2009/12/im-going-multi-touch/
http://who-t.blogspot.com/
http://doc.qt.nokia.com/latest/gestures-imagegestures.html
http://doc.qt.nokia.com/4.8/qtouchevent.html
http://www.slideshare.net/qtbynokia/using-multitouch-and-gestures-with-qt

posted at: 17:08 | path: /programming | permanent link | 0 comments

Mon, 19 Sep 2011

FFT performance on ARM Cortex-A8

All measurements in seconds for 10000 repetitions. Run on Beagleboard-XM clocked at 900 MHz (no RunFast). Compiled using -O2 -march=armv7-a -ffast-math -fPIC -mfloat-abi=softfp -mfpu=neon.

complex-to-complex

length (N) ooura djb kiss libav fftw2 fftw3 fftw3/neon fftw3/new
2048 10.22 11.56 14.2 1.0 10.92 16.16 2.82 2.87
1024 4.5 5.2 5.61 0.46 5.11 7.22 1.16 1.16
512 2.07 2.3 2.98 0.2 2.59 2.89 0.36 0.34
256 0.88 1.0 1.12 0.08 1.01 1.12 0.12 0.11

real-to-complex

length (N) ooura djb kiss libav fftw2 fftw3 fftw3/neon fftw3/new
2048 5.37 - 6.91 0.7 4.71 7.37 7.38 7.38
1024 2.49 - 3.45 0.32 2.19 3.14 3.13 3.13
512 1.09 - 1.43 0.2 1.09 1.2 1.2 1.2
256 0.49 - 0.72 0.08 0.41 0.46 0.46 0.47

oourafft (as of 2006/12/28) is free and available at http://www.kurims.kyoto-u.ac.jp/~ooura/fft.html.
djbfft is available at http://cr.yp.to/djbfft.html.
kissfft is under BSD license and available at http://sourceforge.net/projects/kissfft/.
fftw2 is GPL licensed (version 2.1.5), available at http://www.fftw.org/. fftw3 is GPL licensed (version 3.2.2). fftw3/neon is based on fftw 3.2.2 and has ARM/NEON patches added. fftw3/new is GPL licensed (version 3.3.1-beta) and has ARM/NEON support.

posted at: 14:00 | path: /programming | permanent link | 0 comments

Fri, 16 Sep 2011

KissFFT and ARM NEON

I added ARM NEON SIMD support to kiss FFT. Beware, this primarily enables 2 and 4 parallel FFTs, it not necessarily speeds up a single transform (well, in fact it does :-))

Runtime for real-to-complex transform (N=256, forward and inverse transform, 10000 repetitions) in seconds:
float float
(RunFast)
float32x2_t float32x4_t
1.62 1.22 0.66 0.98
Note: float32x2_t and float32x4_t, respectively, compute two and four FFTs in parallel!

posted at: 15:33 | path: /programming | permanent link | 0 comments

ARM floating point performance & RunFast

The following code (from math_runfast.c) improves kiss FFT's real-to-complex transform (N=256) runtime from
1.62 to 1.22 seconds (forward and inverse transform, 10000 repetitions).

void enable_runfast() {
#ifdef __arm__
    static const unsigned int x = 0x04086060;
    static const unsigned int y = 0x03000000;
    int r;
    asm volatile (
        "fmrx   %0, fpscr                       \n\t"   //r0 = FPSCR
        "and    %0, %0, %1                      \n\t"   //r0 = r0 & 0x04086060
        "orr    %0, %0, %2                      \n\t"   //r0 = r0 | 0x03000000
        "fmxr   fpscr, %0                       \n\t"   //FPSCR = r0
        : "=r"(r)
        : "r"(x), "r"(y) );
#endif
}

In RunFast mode the VFP11 coprocessor, there are no user exception traps, rounding behaviour is slightly different (no negative zeros) and NaNs are handled differently.

Ideal speedup on Cortex-A8 for RunFast is reportedly 40%. There is a patch for eglibc on meego: http://permalink.gmane.org/gmane.comp.handhelds.meego.devel/7937

posted at: 13:13 | path: /programming | permanent link | 0 comments

How to use ARM NEON sqrt and reciprocal approximation

This is how I use ARM NEON intrinsics to speed up division and square root operations...

#include "arm_neon.h"

// approximative quadword float inverse square root
static inline float32x4_t invsqrtv(float32x4_t x) {
    float32x4_t sqrt_reciprocal = vrsqrteq_f32(x);
    
    return vrsqrtsq_f32(x * sqrt_reciprocal, sqrt_reciprocal) * sqrt_reciprocal;
}
        
// approximative quadword float square root
static inline float32x4_t sqrtv(float32x4_t x) {
    return x * invsqrtv(x);
}
            
// approximative quadword float inverse
static inline float32x4_t invv(float32x4_t x) {
    float32x4_t reciprocal = vrecpeq_f32(x);
    reciprocal = vrecpsq_f32(x, reciprocal) * reciprocal;
                                
    return reciprocal;
}
                                    
// approximative quadword float division
static inline float32x4_t divv(float32x4_t x, float32x4_t y) {
    float32x4_t reciprocal = vrecpeq_f32(y);
    reciprocal = vrecpsq_f32(y, reciprocal) * reciprocal;
                                                
    return x * invv(y);
}

// accumulate four quadword floats
static inline float accumv(float32x4_t x) {
    static const float32x2_t f0 = vdup_n_f32(0.0f);
    return vget_lane_f32(vpadd_f32(f0, vget_high_f32(x) + vget_low_f32(x)), 1);
}

posted at: 10:39 | path: /programming | permanent link | 0 comments

Made with PyBlosxom