First steps with Zephyr

About six months ago, while looking for RTOS alternatives, I found Zephyr. I read the docs and started to feel that it is a project with a future. In my current job we have many embedded boards running FreeRTOS, and it feels like living in the past. If you need any "advanced" feature you have to write it yourself, and contributing to the project feels like the 90s or, well, the 2000s. Don't get me wrong: I think FreeRTOS is, or was, a good project, and the mailing list is very responsive, but it lacks most of the features of a good open source project.

Anyway, back to Zephyr. The first thing I really loved about Zephyr is that it takes care of building itself instead of leaving that responsibility to the user, who would otherwise dig around and spend many hours trying to build a simple hello world... To be continued...

Hacking my power supply

A few days ago I got up (very early, as usual) and prepared mate for breakfast. After that, I sat down at my computer and started to hack. But after an hour I realized that the power supply wasn't charging my laptop's battery. My first thought was that it was unplugged from the mains. But no, it was plugged in. Then I thought something was wrong with the cable, but that wasn't it either. So I decided to take it apart. Let me tell you that laptop power supplies are not built to be repaired: they are sealed. You need to break the plastic case to get to the power supply board. I used an ordinary knife, cutting along the seam left when it was assembled. After half an hour and a lot of patience I got my first look at the power supply board.


After that, I plugged in my soldering iron and removed the Faraday cage that covers the board.


When I was young I earned money repairing electronics. So, drawing on that experience, my first step was to measure all the input components: fuse, varistor, diode, transistor, and resistors. All were fine except the fuse.


So I tried a homemade fuse (a very thin wire) and guess what: it worked, the power supply was working! But it started to make a weird noise, and when I plugged it into the laptop the noise got worse. So I guessed that it could be the rectifier's filter capacitor; back when I repaired TVs, monitors, video players, etc., this failure was very common. So I tried it with a big capacitor :), the first one I found.


When I plugged in the power supply with the capacitor replaced, it worked like a charm :). Then I found another capacitor, smaller than the original and with similar characteristics, and tried again successfully :). Now I'm testing whether the laptop charges OK.


Finally I put everything back together, and that's the end of the story.


I am going to DebConf13

I'm so happy to announce that I'm going to DebConf13 in Vaumarcus, Switzerland.

Yeah! It's my first time for two really big things:

  • My first DebConf.
  • My first trip outside my country.

In the middle of all this, I've become a Debian Developer and I'm so happy about that too. So, as a colleague says, this post is "Debianthurbation" :).

I want to thank my wife for her patience and for sharing her feelings with me all the way, and my mentor and friend dererk! I also want to thank Luis Canali a lot for making it possible for the CIII to be my travel sponsor.

Done. This post is very personal, but well... I want to spread my happiness :)

Optimizing using SSE

If you need to optimize a CPU-intensive application, then you need to know SSE, a SIMD instruction set.

There are four types you need to become familiar with:

__m64 MM register
__m128 packed single precision (XMM register)
__m128d packed double precision (XMM register)
__m128i packed integer (XMM register)

The intrinsics have the following format: _mm_op_suffix(), where op is the operation the intrinsic performs and suffix indicates the data type it operates on (for example ps, pd or epi16, which map onto the types mentioned above).
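To make the naming scheme concrete, here is a minimal sketch with one addition per register type. The function names (addFirstLanePs and friends) are made up for this illustration, and it assumes an x86 target where SSE2 is available.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE ones)
#include <stdint.h>

// _mm_add_ps: op = add, suffix = ps (packed single precision, __m128).
float addFirstLanePs(float x, float y) {
  __m128 a = _mm_set1_ps(x);      // copy x into all four lanes
  __m128 b = _mm_set1_ps(y);
  __m128 sum = _mm_add_ps(a, b);  // four float additions in one instruction
  return _mm_cvtss_f32(sum);      // read back lane 0
}

// _mm_add_pd: op = add, suffix = pd (packed double precision, __m128d).
double addFirstLanePd(double x, double y) {
  __m128d sum = _mm_add_pd(_mm_set1_pd(x), _mm_set1_pd(y));  // two additions
  return _mm_cvtsd_f64(sum);
}

// _mm_add_epi16: op = add, suffix = epi16 (packed 16-bit integers, __m128i).
int16_t addFirstLaneEpi16(int16_t x, int16_t y) {
  __m128i sum = _mm_add_epi16(_mm_set1_epi16(x), _mm_set1_epi16(y));  // eight additions
  // _mm_extract_epi16 zero-extends the lane, so cast back to a signed 16-bit value.
  return static_cast<int16_t>(_mm_extract_epi16(sum, 0));
}
```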

OK then... no, wait! Please do not rewrite your software using SSE intrinsics. Leave the dirty work to gcc :).

First of all, the intrinsic functions are not friendly. If you write some code using them, I promise that you'll need to rethink what you were trying to do when you look at your own code written just a week ago.

Second, your code will not be portable to machines that do not support SSE instructions.
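As a rough illustration of that portability cost, intrinsic code usually ends up wrapped in preprocessor guards with a plain fallback next to it. The sketch below assumes gcc or clang, which define the __SSE2__ macro when SSE2 is enabled; add4 is a name invented for this example.

```cpp
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Adds four pairs of 32-bit integers: out[i] = x[i] + y[i].
void add4(const int32_t *x, const int32_t *y, int32_t *out) {
#if defined(__SSE2__)
  // SSE2 path: one unaligned load per input, one packed add, one store.
  __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i *>(x));
  __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i *>(y));
  _mm_storeu_si128(reinterpret_cast<__m128i *>(out), _mm_add_epi32(vx, vy));
#else
  // Portable path for targets without SSE2.
  for (int i = 0; i < 4; ++i) {
    out[i] = x[i] + y[i];
  }
#endif
}
```

Every function written this way has two bodies to maintain and test, which is exactly the work the compiler can do for you.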

Let me show you a small program that performs a dot product between two vectors.

#include <limits>
#include <cstdlib>
#include <stdint.h>

#define VECTOR_SIZE 1000000

// Dot product of two vectors of 16-bit integers, accumulated in 32 bits.
int32_t dotProduct(size_t ndim, const int16_t *a, const int16_t *b) {
  int32_t acc = 0;
  for (size_t i = 0; i < ndim; ++i) {
    acc += a[i] * b[i];
  }
  return acc;
}

// Fill v with pseudo-random values in [0, INT16_MAX).
void populateVector(size_t dim, int16_t *v) {
  for (size_t i = 0; i < dim; ++i) {
    v[i] = static_cast<int16_t>(std::rand() % std::numeric_limits<int16_t>::max());
  }
}

int main(int argc, char *argv[]) {
  int16_t *a = NULL;
  int16_t *b = NULL;
  // Allocate both vectors aligned to 16 bytes; bail out if either allocation fails.
  if (posix_memalign(reinterpret_cast<void **>(&a), 16, VECTOR_SIZE * sizeof(int16_t)) != 0 ||
      posix_memalign(reinterpret_cast<void **>(&b), 16, VECTOR_SIZE * sizeof(int16_t)) != 0) {
    return 1;
  }
  populateVector(VECTOR_SIZE, a);
  populateVector(VECTOR_SIZE, b);
  return dotProduct(VECTOR_SIZE, a, b);
}

If you build this program passing the -msse2 flag to gcc, the compiler emits SSE instructions for you automagically (auto-vectorization is enabled at -O3). For better performance, SSE wants memory aligned to 16 bytes; on POSIX systems we can use the posix_memalign function. Don't forget to use the const keyword!

$ gcc -o dot_product -O3 -g -msse2
$ objdump -dS dot_product > dot_product.s

Disassembling the code

int32_t dotProduct(size_t ndim, const int16_t *a, const int16_t *b) {
8048561: 89 44 24 0c           mov    %eax,0xc(%esp)
8048565: 8b 04 24              mov    (%esp),%eax
8048568: 01 ed                 add    %ebp,%ebp
804856a: 8d 0c 2e              lea    (%esi,%ebp,1),%ecx
804856d: 31 db                 xor    %ebx,%ebx
804856f: 66 0f ef c9           pxor   %xmm1,%xmm1
8048573: 01 fd                 add    %edi,%ebp
8048575: 8d 76 00              lea    0x0(%esi),%esi
8048578: f3 0f 6f 45 00        movdqu 0x0(%ebp),%xmm0
804857d: 66 0f f5 01           pmaddwd (%ecx),%xmm0
8048581: 83 c3 01              add    $0x1,%ebx
8048584: 83 c1 10              add    $0x10,%ecx
8048587: 83 c5 10              add    $0x10,%ebp
804858a: 39 c3                 cmp    %eax,%ebx
804858c: 66 0f fe c8           paddd  %xmm0,%xmm1
8048590: 72 e6                 jb     8048578 <_Z10dotProductjPKsS0_+0x98>
8048592: 66 0f 6f c1           movdqa %xmm1,%xmm0
8048596: 66 0f 73 d8 08        psrldq $0x8,%xmm0
804859b: 8b 44 24 0c           mov    0xc(%esp),%eax
804859f: 66 0f fe c8           paddd  %xmm0,%xmm1
80485a3: 8b 4c 24 04           mov    0x4(%esp),%ecx
80485a7: 66 0f 6f c1           movdqa %xmm1,%xmm0
80485ab: 66 0f 73 d8 04        psrldq $0x4,%xmm0
80485b0: 66 0f fe c8           paddd  %xmm0,%xmm1
80485b4: 66 0f 7e 0c 24        movd   %xmm1,(%esp)
80485b9: 03 54 24 04           add    0x4(%esp),%edx
80485bd: 03 04 24              add    (%esp),%eax
80485c0: 39 4c 24 08           cmp    %ecx,0x8(%esp)
80485c4: 74 1e                 je     80485e4 <_Z10dotProductjPKsS0_+0x104>
80485c6: 8b 6c 24 30           mov    0x30(%esp),%ebp
80485ca: 8d b6 00 00 00 00     lea    0x0(%esi),%esi
80485d0: 0f bf 0c 56           movswl (%esi,%edx,2),%ecx
80485d4: 0f bf 1c 57           movswl (%edi,%edx,2),%ebx

As you can see, the code above implements the algorithm using SSE instructions. You can help gcc generate more optimized code. For example, if you are sure that your memory will be aligned, you can tell gcc as follows:

const int16_t *a_ = static_cast<const int16_t *>(__builtin_assume_aligned(a, 16));
const int16_t *b_ = static_cast<const int16_t *>(__builtin_assume_aligned(b, 16));

Then gcc will generate assembly using instructions that assume the memory is aligned. You can see here and compare the cost of the instructions with aligned and unaligned memory.
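Put together, the hint might look like the sketch below. It is the same loop as dotProduct, only reading through the pointers returned by __builtin_assume_aligned, a gcc builtin that clang also accepts. The name dotProductAligned is made up for this example, and the promise is yours to keep: passing pointers that are not 16-byte aligned is undefined behaviour.

```cpp
#include <cstddef>
#include <stdint.h>

// Same algorithm as dotProduct, but promising the compiler that a and b
// are aligned to 16 bytes so it is free to use aligned loads.
int32_t dotProductAligned(std::size_t ndim, const int16_t *a, const int16_t *b) {
  const int16_t *a_ = static_cast<const int16_t *>(__builtin_assume_aligned(a, 16));
  const int16_t *b_ = static_cast<const int16_t *>(__builtin_assume_aligned(b, 16));
  int32_t acc = 0;
  for (std::size_t i = 0; i < ndim; ++i) {
    acc += a_[i] * b_[i];
  }
  return acc;
}
```

Buffers obtained from posix_memalign with an alignment of 16, as in main above, satisfy the promise.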

Also, if you want to vectorize an algorithm, generally first rewrite your code without using SSE intrinsics. That way you are helping gcc vectorize the algorithm. You need to disassemble the code and check whether gcc is doing its job right. I recommend reading this presentation, which gives good tips for helping gcc.

I can conclude that you need to write code using intrinsics only when it is actually necessary. But first, you need to profile and find your bottleneck. After that, disassemble the code and try to understand how gcc is handling the algorithm. Then give gcc as many hints as possible. If you can't get gcc to do the work for you, then sadly you have to write the code using intrinsic functions. But I warn you, this only happens in weird and complex cases.
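For comparison, this is roughly what a hand-written version of the dot product could look like. It is a sketch that mirrors the instructions in the disassembly above (pmaddwd, paddd, psrldq), not gcc's exact output. The name dotProductSse2 is invented for the example, and it uses unaligned loads so it works with any pointers.

```cpp
#include <cstddef>
#include <emmintrin.h>
#include <stdint.h>

int32_t dotProductSse2(std::size_t ndim, const int16_t *a, const int16_t *b) {
  __m128i acc = _mm_setzero_si128();  // four 32-bit partial sums (pxor)
  std::size_t i = 0;
  // Main loop: eight 16-bit elements per iteration.
  for (; i + 8 <= ndim; i += 8) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));  // movdqu
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b + i));
    // pmaddwd multiplies the 16-bit pairs and adds neighbouring products
    // into four 32-bit values; paddd accumulates them.
    acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
  }
  // Horizontal reduction: fold the upper half onto the lower half, twice.
  acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));  // psrldq $8 + paddd
  acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));  // psrldq $4 + paddd
  int32_t result = _mm_cvtsi128_si32(acc);           // movd
  // Scalar tail for the elements that did not fill a whole register.
  for (; i < ndim; ++i) {
    result += a[i] * b[i];
  }
  return result;
}
```

Compare it with the five-line loop it replaces and it is easy to see why this should be the last resort.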

If you need to compute a dot product, use std::inner_product. The dotProduct function above is only meant as a simple example.
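A minimal sketch of the std::inner_product version, keeping the same signature as dotProduct (the name dotProductStd is made up here). The type of the initial value decides the accumulator type, so it is passed as an int32_t.

```cpp
#include <cstddef>
#include <numeric>
#include <stdint.h>

int32_t dotProductStd(std::size_t ndim, const int16_t *a, const int16_t *b) {
  // Computes init + a[0]*b[0] + ... + a[ndim-1]*b[ndim-1], accumulating in int32_t.
  return std::inner_product(a, a + ndim, b, static_cast<int32_t>(0));
}
```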