xqcam speed figures
While it's certainly true that optimization should wait until
after the program is working correctly, it's also the case that
the actual frame-reading protocol (which consumes the bulk of the
time) probably won't change; and I got that old profiling itch, so...
I modified libqcam to inline the actual port I/O instructions
(eliminates function call overhead, register spilling,
lets the optimizer work, etc.); here are some speed figures.
I was mostly interested in the comment in the Makefile saying that
using optimization levels past 1 made xqcam slower. I found the reverse:
that -O2 is measurably faster than -O1 (and -O4 and -O6 are pretty
much indistinguishable from -O2). For what it's worth, I'm also
using gcc 2.7.0.
Frames per second on an otherwise unloaded 486DX4-100,
unidirectional mode, 320x240, 6BPP, whitebal=60, bright=135, contrast=200:
           (xqcam)           (scan loop)
         old    inline      old    inline
        ------  ------     ------  ------
  --    0.853   0.848      0.895   0.898
  -O    1.234   1.282      1.315   1.382
  -O2   1.270   1.334      1.357   1.430
  -O4   1.272   1.335      1.358   1.431
  -O6   1.272   1.333      1.358   1.422 (*)
And, at 80x20, 4BPP:
  --    2.497   2.685      2.535   2.731
  -O    5.415   5.475      5.621   5.674
  -O6   6.500   6.562      6.746   6.838
Random notes:
The "scan loop" column is the result of timing
a loop that does 100 calls to qc_scan() and discards the
results; it's the same as "xqcam" but without any output
to the screen.
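
For anyone wanting to repeat the "scan loop" measurement, here is a rough sketch of such a timing harness. It is not the code used for the figures above. The scan_fn callback type, measure_fps() and the count_scan() stub are hypothetical names made up for illustration; the real qc_scan() signature is deliberately not assumed, so wrap whatever libqcam call you have in a function of this shape. It uses C11 timespec_get(), which reads the wall clock rather than a monotonic one.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a frame-grab call such as qc_scan().
 * The real libqcam signature is NOT assumed here; wrap it yourself. */
typedef void (*scan_fn)(void *ctx);

/* Wall-clock time in seconds (C11 timespec_get; not monotonic). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Call scan() `frames` times, discarding any results, and return the
 * achieved frames per second. Returns 0.0 if the elapsed time was too
 * short to measure. */
static double measure_fps(scan_fn scan, void *ctx, int frames)
{
    assert(scan != NULL && frames > 0);

    double start = now_seconds();
    for (int i = 0; i < frames; i++)
        scan(ctx);
    double elapsed = now_seconds() - start;

    return elapsed > 0.0 ? (double)frames / elapsed : 0.0;
}

/* Dummy "scan" that only counts calls, for trying out the harness
 * without a camera attached. */
static void count_scan(void *ctx)
{
    ++*(int *)ctx;
}
```

Calling measure_fps(count_scan, &n, 100) with an int n = 0 should leave n at 100; with a real frame grab substituted in, the return value corresponds to the "scan loop" column.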
All these numbers are repeatable to within a percent or so,
some to within +- 1 in the last digit (<one part in a thousand).
Variation is probably due to daemons waking up during the
test runs. In particular, I think the point marked with a (*) is
a bobble: I reran the inline, -O6, 320x240x6 test and got 1.334
for xqcam and 1.431 for scan loop, which is more in keeping with
the other numbers (and within about 1% of the first run).
The image parameters have a significant effect on the frame rate;
the "exposure time" is not negligible. I can get a more than
fourfold change in frame rate (and a blank image) just by diddling
the parameters.
A quick calculation shows that it takes a minimum of 320x240x(3/2)x6
or 691200 I/O operations to fetch a 320x240x6 frame, which means
my fastest numbers are doing about 1 million I/O ops per second.
My (unsupported) hunch is that the I/O is the big bottleneck.
My parallel port is on a card, not on the motherboard, and might
well be particularly slow.
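
The arithmetic above can be checked with a few lines of C. This is only a sketch of the estimate as stated in this message (width x height x 3/2 x bits per pixel); the function names are made up for illustration and are not part of libqcam.

```c
#include <assert.h>

/* Minimum port I/O operations per frame, using the estimate from the
 * text: width * height * (3/2) * bpp. Integer arithmetic; the 3/2
 * factor truncates if width * height * 3 is odd. */
static long io_ops_per_frame(long width, long height, long bpp)
{
    assert(width > 0 && height > 0 && bpp > 0);
    return width * height * 3 / 2 * bpp;
}

/* I/O operations per second implied by a measured frame rate. */
static double io_ops_per_second(long ops_per_frame, double fps)
{
    assert(ops_per_frame >= 0 && fps >= 0.0);
    return (double)ops_per_frame * fps;
}
```

io_ops_per_frame(320, 240, 6) gives 691200, and feeding that with the fastest scan-loop rate of 1.431 frames per second into io_ops_per_second() gives a little under 990,000, i.e. the "about 1 million" quoted above.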
This is the "stock" qcam-0.3; looks like it should be possible
to reduce the number of I/O ops by a third (especially nice
on systems where I/O requires a syscall!) as well as making
all the bit-shuffling more efficient, as has been discussed
here. But I have to go to work now...