The following code converts float values to 16-bit signed integer values using ARM NEON intrinsics (assuming n is a multiple of 4) -- for instance audio samples.
The vcvtq_s32_f32 instruction rounds towards zero, not towards
the nearest integer. In C, the semantics would be trunc() instead
of lrintf().
To overcome the issue, one could implement:
float a; short b = trunc(a + ((a > 0) ? 0.5 : - 0.5));To get rid of the condition, the trick is to get the sign bit (the MSB of a float) and
or it to the constant 0.5 before adding it to a.
In C:
float a; short b = trunc(a + float((uint32(a) & 0x8000000) | uint32(0.5)));
The complete code using ARM NEON intrinsics looks as follows:
void conv_s16_from_float(unsigned n, const float *a, short *b) {
unsigned i;
const float32x4_t plusone4 = vdupq_n_f32(1.0f);
const float32x4_t minusone4 = vdupq_n_f32(-1.0f);
const float32x4_t half4 = vdupq_n_f32(0.5f);
const float32x4_t scale4 = vdupq_n_f32(32767.0f);
const uint32x4_t mask4 = vdupq_n_u32(0x80000000);
for (i = 0; i < n/4; i++) {
float32x4_t v4 = ((float32x4_t *)a)[i];
v4 = vmulq_f32(vmaxq_f32(vminq_f32(v4, plusone4) , minusone4), scale4);
const float32x4_t w4 = vreinterpretq_f32_u32(vorrq_u32(vandq_u32(
vreinterpretq_u32_f32(v4), mask4), vreinterpretq_u32_f32(half4)));
((int16x4_t *)b)[i] = vmovn_s32(vcvtq_s32_f32(vaddq_f32(v4, w4)));
}
}
posted at: 13:35 | path: /programming | permanent link | 0 comments
GestureWatch: Using proximity detectors to control mobile devices with touchless hand gestures http://www.youtube.com/watch?v=IZCJRJRkhF8 Contactless gesture recognition system using proximity sensors http://www.google.at/url?sa=t&rct=j&q=contactless%20gesture%20recognition%20system%20using%20proximity%20sensors&source=web&cd=1&ved=0CCMQFjAA&url=http%3A%2F%2Frepository.cmu.edu%2Fcgi%2Fviewcontent.cgi%3Farticle%3D1017%26context%3Dsilicon_valley&ei=bS7dTrH1BMiWOuXm9cEO&usg=AFQjCNEgqYB3jihDa-BQTjzmDIQ5gqX1tw&sig2=bRt0nc46x6VOgJqoblEfiQ
posted at: 22:08 | path: /projects | permanent link | 0 comments
http://apidocs.meego.com/1.0/mtf/gestures.html
http://www.enricoros.com/blog/2009/12/im-going-multi-touch/
http://who-t.blogspot.com/
http://doc.qt.nokia.com/latest/gestures-imagegestures.html
http://doc.qt.nokia.com/4.8/qtouchevent.html
http://www.slideshare.net/qtbynokia/using-multitouch-and-gestures-with-qt
posted at: 17:08 | path: /programming | permanent link | 0 comments
All measurements in seconds for 10000 repetitions. Run on Beagleboard-XM clocked at 900 MHz (no RunFast). Compiled using -O2 -march=armv7-a -ffast-math -fPIC -mfloat-abi=softfp -mfpu=neon.
| length (N) | ooura | djb | kiss | libav | fftw2 | fftw3 | fftw3/neon | fftw3/new |
| 2048 | 10.22 | 11.56 | 14.2 | 1.0 | 10.92 | 16.16 | 2.82 | 2.87 |
| 1024 | 4.5 | 5.2 | 5.61 | 0.46 | 5.11 | 7.22 | 1.16 | 1.16 |
| 512 | 2.07 | 2.3 | 2.98 | 0.2 | 2.59 | 2.89 | 0.36 | 0.34 |
| 256 | 0.88 | 1.0 | 1.12 | 0.08 | 1.01 | 1.12 | 0.12 | 0.11 |
| length (N) | ooura | djb | kiss | libav | fftw2 | fftw3 | fftw3/neon | fftw3/new |
| 2048 | 5.37 | - | 6.91 | 0.7 | 4.71 | 7.37 | 7.38 | 7.38 |
| 1024 | 2.49 | - | 3.45 | 0.32 | 2.19 | 3.14 | 3.13 | 3.13 |
| 512 | 1.09 | - | 1.43 | 0.2 | 1.09 | 1.2 | 1.2 | 1.2 |
| 256 | 0.49 | - | 0.72 | 0.08 | 0.41 | 0.46 | 0.46 | 0.47 |
oourafft (as of 2006/12/28) is free and available at http://www.kurims.kyoto-u.ac.jp/~ooura/fft.html.
djbfft is available at http://cr.yp.to/djbfft.html.
kissfft is under BSD license and available at http://sourceforge.net/projects/kissfft/.
fftw2 is GPL licensed (version 2.1.5), available at http://www.fftw.org/.
fftw3 is GPL licensed (version 3.2.2).
fftw3/neon is based on fftw 3.2.2 and has
ARM/NEON patches added.
fftw3/new is GPL licensed (version 3.3.1-beta) and has ARM/NEON support.
posted at: 14:00 | path: /programming | permanent link | 0 comments
I added ARM NEON SIMD support to kiss FFT. Beware, this primarily enables 2 and 4
parallel FFTs, it not necessarily speeds up a single transform (well, in fact it does
)
Runtime for real-to-complex transform (N=256, forward and inverse transform, 10000 repetitions) in seconds:
| float | float (RunFast) |
float32x2_t | float32x4_t |
| 1.62 | 1.22 | 0.66 | 0.98 |
posted at: 15:33 | path: /programming | permanent link | 0 comments
The following code (from math_runfast.c) improves
kiss FFT's real-to-complex transform (N=256) runtime from
1.62 to 1.22 seconds (forward and inverse transform, 10000 repetitions).
void enable_runfast() {
#ifdef __arm__
static const unsigned int x = 0x04086060;
static const unsigned int y = 0x03000000;
int r;
asm volatile (
"fmrx %0, fpscr \n\t" //r0 = FPSCR
"and %0, %0, %1 \n\t" //r0 = r0 & 0x04086060
"orr %0, %0, %2 \n\t" //r0 = r0 | 0x03000000
"fmxr fpscr, %0 \n\t" //FPSCR = r0
: "=r"(r)
: "r"(x), "r"(y) );
#endif
}
In RunFast mode the VFP11 coprocessor, there are no user exception traps, rounding behaviour is slightly different (no negative zeros) and NaNs are handled differently.
Ideal speedup on Cortex-A8 for RunFast is reportedly 40%. There is a patch for eglibc on meego: http://permalink.gmane.org/gmane.comp.handhelds.meego.devel/7937
posted at: 13:13 | path: /programming | permanent link | 0 comments
This is how I use ARM NEON intrinsics to speed up division and square root operations...
#include "arm_neon.h"
// approximative quadword float inverse square root
static inline float32x4_t invsqrtv(float32x4_t x) {
float32x4_t sqrt_reciprocal = vrsqrteq_f32(x);
return vrsqrtsq_f32(x * sqrt_reciprocal, sqrt_reciprocal) * sqrt_reciprocal;
}
// approximative quadword float square root
static inline float32x4_t sqrtv(float32x4_t x) {
return x * invsqrtv(x);
}
// approximative quadword float inverse
static inline float32x4_t invv(float32x4_t x) {
float32x4_t reciprocal = vrecpeq_f32(x);
reciprocal = vrecpsq_f32(x, reciprocal) * reciprocal;
return reciprocal;
}
// approximative quadword float division
static inline float32x4_t divv(float32x4_t x, float32x4_t y) {
float32x4_t reciprocal = vrecpeq_f32(y);
reciprocal = vrecpsq_f32(y, reciprocal) * reciprocal;
return x * invv(y);
}
// accumulate four quadword floats
static inline float accumv(float32x4_t x) {
static const float32x2_t f0 = vdup_n_f32(0.0f);
return vget_lane_f32(vpadd_f32(f0, vget_high_f32(x) + vget_low_f32(x)), 1);
}
posted at: 10:39 | path: /programming | permanent link | 0 comments