
xqcam speed figures



While it's certainly true that optimization should wait until
after the program is working correctly, it's also the case that
the actual frame-reading protocol (which consumes the bulk of the
time) probably won't change; and I got that old profiling itch, so...

I modified libqcam to inline the actual port I/O instructions 
(eliminates function call overhead, register spilling, 
lets the optimizer work, etc.); here are some speed figures.
I was mostly interested in the comment in the Makefile saying that
using optimization levels past 1 made xqcam slower. I found the reverse:
that -O2 is measurably faster than -O1 (and -O4 and -O6 are pretty
much indistinguishable from -O2). For what it's worth, I'm also
using gcc 2.7.0.

Frames per second on an otherwise unloaded 486DX4-100,
unidirectional mode, 320x240, 6BPP, whitebal=60, bright=135, contrast=200:
   
           (xqcam)         (scan loop)
        old     inline    old     inline
       ------   ------   ------   ------  
   --  0.853    0.848    0.895    0.898
   -O  1.234    1.282    1.315    1.382
   -O2 1.270    1.334    1.357    1.430
   -O4 1.272    1.335    1.358    1.431
   -O6 1.272    1.333    1.358    1.422 (*)

And, at 80x20, 4BPP:

           (xqcam)         (scan loop)
        old     inline    old     inline
       ------   ------   ------   ------
   --  2.497    2.685    2.535    2.731
   -O  5.415    5.475    5.621    5.674
   -O6 6.500    6.562    6.746    6.838

Random notes:
The "scan loop" column is the result of timing
a loop that does 100 calls to qc_scan() and discards the
results; it's the same as "xqcam" but without any output
to the screen.
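
For anyone who wants to repeat the "scan loop" measurement, a timing
harness along these lines should do. This is a minimal sketch, not the
code used for the numbers above: scan_fn, fake_scan and measure_fps are
hypothetical names, and the real qc_scan() call would have to be wrapped
to fit the scan_fn signature.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */
#include <assert.h>
#include <time.h>

/* Hypothetical callback type standing in for one frame grab.
 * The real qc_scan() interface is not modeled here. */
typedef void (*scan_fn)(void *camera);

/* Dummy grab so the sketch runs without a camera: it only counts calls. */
void fake_scan(void *camera)
{
    int *calls = camera;
    (*calls)++;
}

/* Wall-clock time in seconds from a monotonic clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Call scan() `frames` times, discard the results, and return
 * frames per second (0.0 if the clock did not advance). */
double measure_fps(scan_fn scan, void *camera, int frames)
{
    assert(scan != NULL && frames > 0);
    double start = now_seconds();
    for (int i = 0; i < frames; i++)
        scan(camera);
    double elapsed = now_seconds() - start;
    return elapsed > 0.0 ? frames / elapsed : 0.0;
}
```

Calling measure_fps(fake_scan, &counter, 100) exercises the loop with the
dummy grab; swapping in a wrapper around the real frame read gives a
figure comparable to the "scan loop" column.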

All these numbers are repeatable to within a percent or so,
some to within +-1 in the last digit (less than one part in a thousand).
Variation is probably due to daemons waking up during the
test runs. In particular, I think the point marked with a (*) is
a bobble: I reran the inline, -O6, 320x240x6 test and got 1.334
for xqcam and 1.431 for scan loop, which is more in keeping with
the other numbers (and the original figures were off by less than
1% anyway).

The image parameters have a significant effect on the frame rate;
the "exposure time" is not negligible. I can get a more than
fourfold change in frame rate (and a blank image) just by diddling 
the parameters.

A quick calculation shows that it takes a minimum of 320x240x(3/2)x6 
or 691200 I/O operations to fetch a 320x240x6 frame, which means
my fastest numbers are doing about 1 million I/O ops per second. 
My (unsupported) hunch is that the I/O is the big bottleneck.
My parallel port is on a card, not on the motherboard, and might
well be particularly slow.
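
The arithmetic above can be checked with a few lines of C. This is just
a sketch of the estimate as stated (width x height x (3/2) x bpp); the
function names are made up for illustration.

```c
#include <assert.h>

/* Minimum port I/O operations per frame, using the estimate
 * width x height x (3/2) x bpp from the text. */
long io_ops_per_frame(long width, long height, long bpp)
{
    assert(width > 0 && height > 0 && bpp > 0);
    return width * height * 3 * bpp / 2;
}

/* I/O operations per second implied by a measured frame rate. */
double io_ops_per_second(long ops_per_frame, double frames_per_second)
{
    return (double)ops_per_frame * frames_per_second;
}
```

io_ops_per_frame(320, 240, 6) gives 691200, and at the fastest scan-loop
rate of 1.431 frames per second that works out to roughly 989,000 I/O
operations per second, i.e. the "about 1 million" quoted above.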

This is the "stock" qcam-0.3; looks like it should be possible
to reduce the number of I/O ops by a third (especially nice
on systems where I/O requires a syscall!) as well as making
all the bit-shuffling more efficient, as has been discussed
here. But I have to go to work now...
