Theorarm

Theorarm is an Ogg Theora/Vorbis decoding library optimised for use on ARM processors. It is based on the latest (at time of writing) Theora decoder as supplied by xiph.org, and my Tremolo library (which is in turn based upon the Tremor decoder also supplied by xiph.org).

Theora and Vorbis are (supposedly) patent unencumbered/royalty free video/audio compression schemes. Vorbis pretty much beats the pants off everything else out there (but of course the thing it gets compared to most is MP3). Theora is not quite as good as some of the alternatives out there (currently), but encoder optimisation is drastically closing the gap between it and its closest competitors. Everything you could possibly want to know about Ogg/Theora/Vorbis (and lots you don't) can again be found at xiph.org.

What Theorarm is:

The standard Theora decoder as supplied currently contains no ARM code whatsoever. Furthermore, it relies on various support libraries including libogg and libvorbis (available from the same source) to do Ogg bitstream handling and Vorbis decoding. Unfortunately libvorbis relies on floating point operations, which makes it a non-starter on the ARM platform.

The obvious solution to this would therefore seem to be to use the Tremor lib (the integer only vorbis decoder) with Theora, but this bundles its own versions of libogg and libvorbis within its code. These versions are not instantly compatible with the versions required by Theora.

For Theorarm, I have therefore made the Theora and Tremolo (my version of the Tremor lib) libraries compatible, and rewritten bits of Theora to work more efficiently on ARM machines. In some cases this involves tweaks to the C - in other cases, this means rewriting speed critical sections entirely in ARM code.

Options can be set in an assembler header (lib/dec/common.s) to select which type of code should be generated (ranging from vanilla ARMv4, through ARMv4+LDRD/STRD, ARMv6 and NEON). The code will run on anything from ARMv4 upwards, but enabling the newer options gives an increase in performance. There is still scope for more optimisation here. The bit reading sections of the code assume a little endian memory system, but this can probably be changed if required.
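
For illustration only, here is a minimal C sketch of how bit reading can be written so that it does not depend on the endianness of the memory system: words are assembled byte by byte rather than loaded directly. The names load_be32, bitreader and read_bits are hypothetical and do not appear in Theorarm, and the sketch assumes the most significant bit of each byte is read first. The real bit reader is ARM code and will look quite different.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from Theorarm): build a 32-bit value from four
 * bytes, treating the first byte as the most significant. Because it only
 * uses byte loads and shifts, the result is the same on any host. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical bit reader state: a byte buffer plus a bit cursor. */
typedef struct {
    const unsigned char *buf;
    size_t len;     /* buffer length in bytes */
    size_t bitpos;  /* index of the next bit to read */
} bitreader;

/* Read n bits (0..32), most significant bit of each byte first.
 * Bits past the end of the buffer read as zero. Deliberately simple
 * and slow; it shows the idea, not an optimised implementation. */
uint32_t read_bits(bitreader *br, int n)
{
    uint32_t v = 0;
    int i;
    for (i = 0; i < n; i++) {
        size_t byte = br->bitpos >> 3;
        int shift = 7 - (int)(br->bitpos & 7);
        uint32_t bit = 0;
        if (byte < br->len)
            bit = (uint32_t)(br->buf[byte] >> shift) & 1u;
        v = (v << 1) | bit;
        br->bitpos++;
    }
    return v;
}
```

An optimised version would normally fetch whole words and byte-swap them where needed; the byte-wise form above trades speed for portability.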

The API to the library is broadly the same as before (with the differences being due to the different version of libogg present in Tremor).

I Am Not A Lawyer:

Previous versions of Theorarm have been released under the GNU GPL, but thanks to a grant from Google it has now been re-released under the same BSD-style license as the original Theora library. Many thanks to everyone at Google for making this happen!

The nice thing about this is that it means we can start to roll the Theorarm changes back into the mainline Theora code, so everyone benefits. In particular it means that when the merge completes you should be able to just get the latest version from Xiph, and not have to worry that the two will have diverged.

A significant benefit of this merge will be that the mainline Theora encoder/decoder libs build using much more standard tools than my rather thrown together makefile. Theorarm in its current form is definitely best used by people used to handling sharp objects, whereas the merged trunk version should be much more approachable.

The current state of play can be seen by checking out: http://svn.xiph.org/branches/theorarm-merge-branch

Other things:

There are still some non-standard C-isms in there (the Tremor lib uses alloca, for example). VS2005 compiles and runs it fine though.
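
As a rough sketch of what removing an alloca call typically involves (this is not code from Tremor or Theorarm, and process_block is a made-up routine), the stack allocation is swapped for malloc, with an explicit failure check and a matching free on every exit path:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical routine: copies n samples into a scratch buffer and
 * sums them. Returns 0 on success, -1 if the allocation fails. */
int process_block(const int *in, size_t n, long *out_sum)
{
    int *scratch;
    long sum = 0;
    size_t i;

    if (n == 0) {            /* nothing to do; avoid malloc(0) */
        *out_sum = 0;
        return 0;
    }

    /* Previously this might have been: scratch = alloca(n * sizeof *scratch); */
    scratch = malloc(n * sizeof *scratch);
    if (scratch == NULL)
        return -1;           /* alloca cannot report failure; malloc can */

    memcpy(scratch, in, n * sizeof *scratch);
    for (i = 0; i < n; i++)
        sum += scratch[i];

    free(scratch);           /* alloca memory vanished on return; this must be explicit */
    *out_sum = sum;
    return 0;
}
```

The cost is a heap call per invocation, so in a decoder's inner loops a buffer allocated once and reused would usually be preferable.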

The assembly code is in ARM format, with a simple script to convert it automatically into gcc format at compile time.

Timings etc:

You'd probably like some timings, to show how good Theorarm is, right?

Release 0.03

My initial ARMv4 test device is an i-mate JAM running WinCE (that's a 416MHz XScale PXA272). My simple test app plays a 320x240, 25fps film, with 48kHz stereo audio (the interrogation chapter ripped from the R2 Matrix DVD and reencoded using the latest (at time of writing) Thusnelda encoder). Without post processing enabled, the code manages 38.5fps (i.e. comfortably full speed). With full post processing enabled, it only manages 23fps. (These figures are correct as of release 0.03)

I have done some limited, simplistic profiling, using a tool included in the source distribution. These figures should be taken as indicative, if possibly not 100% accurate.

When playing with post processing enabled, 28% of the time is spent in the YUV to RGB conversion code, 30% is spent in the (optional) post processing deblocker, and 9% is spent in the (optional) post processing dering code. 3.5% of the time is "unaccounted for" (presumably in system calls, such as reading data). Every routine that accounts for more than 1.5% of CPU time has been ARM coded.

Without the overhead of post processing, the figures are correspondingly changed, and the YUV to RGB conversion code becomes the dominating factor, accounting for 55% of CPU time. The largest other contributors are oc_frag_copy_list at 8.5%, oc_frag_recon_inter2 at 4% and the idct at 3% (all of which have been ARM coded).
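
To give a feel for the work the YUV to RGB step does per pixel, here is a minimal integer-only C sketch using the widely used fixed-point approximation of the BT.601 studio-range conversion. It is not the Theorarm routine (which is ARM coded and includes dithering); the function names and the choice of 8-bit RGB output are assumptions made for the example.

```c
/* Clamp a value that is scaled by 256 (with rounding already added)
 * down to the 0..255 range. Negative inputs are caught before the
 * shift so we never right-shift a negative int. */
static unsigned char clamp_scaled(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return (unsigned char)(v > 255 ? 255 : v);
}

/* Hypothetical helper: convert one studio-range YUV (YCbCr) sample
 * to 8-bit R, G, B using integer maths only. */
void yuv_to_rgb(unsigned char y, unsigned char u, unsigned char v,
                unsigned char rgb[3])
{
    int c = (int)y - 16;    /* luma with black offset removed */
    int d = (int)u - 128;   /* blue-difference chroma, centred */
    int e = (int)v - 128;   /* red-difference chroma, centred */

    rgb[0] = clamp_scaled(298 * c + 409 * e + 128);            /* R */
    rgb[1] = clamp_scaled(298 * c - 100 * d - 208 * e + 128);  /* G */
    rgb[2] = clamp_scaled(298 * c + 516 * d + 128);            /* B */
}
```

Even in this simple form there are several multiplies and three clamps for every pixel, which is some indication of why the conversion shows up so heavily in the profile.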

Release 0.04

I've now moved my primary testing platform to be a beagleboard (a Cortex-A8 based ARM development board) running at (I believe) 500MHz.

With post processing disabled, I can play a PAL DVD sized film (720x576x25fps, 48kHz stereo audio track) in realtime with software YUV2RGB. The limited profiling I've done, along with some back-of-an-envelope maths, suggests that we should just about be able to do 720p films if the YUV2RGB process is done by hardware.
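
One way such an estimate can be put together is sketched below in C. The assumptions are loud ones: that 720p means 1280x720 at 25fps, that decode cost scales with pixel rate, and that the 55% YUV to RGB share measured for release 0.03 on a different device carries over to the Cortex-A8. None of these is stated by the measurements above, and the function names are made up for the example.

```c
/* Pixels that must be decoded per second for a given format. */
double pixel_rate(int width, int height, int fps)
{
    return (double)width * (double)height * (double)fps;
}

/* If a stage taking the given fraction of CPU time (0 <= fraction < 1)
 * is moved off the CPU entirely, the remaining work runs this many
 * times faster. */
double speedup_without(double fraction)
{
    return 1.0 / (1.0 - fraction);
}
```

With these, pixel_rate(720, 576, 25) is 10,368,000 and pixel_rate(1280, 720, 25) is 23,040,000, so 720p needs roughly 2.2 times the throughput of PAL. speedup_without(0.55) also comes out at roughly 2.2, which is consistent with 720p being just about reachable under those assumptions.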

Release 0.05

Updated with some of the bugfixes from the tremor trunk, and a couple of fixes to the mdct ARM code. Relicensed under BSD. No change to performance expected.

More details may be added here later as they become clear.

So is that it?

This should be considered a work in progress. There is lots more I'd like to do to the code, but I think the changes here are significant enough to make them available. Work on this project is likely to be punctuated by significant delays as real life gets in the way.

Obvious things to try next are to continue investigating the use of ARMv6/NEON extensions to speed up the hotspots, and to remove the use of alloca.

Warranty:

The original Tremor lib included the following disclaimer:

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The same applies to Theorarm.

ChangeLog:

v0.01: First version released to net.
v0.02: Minor tweaks, including updated YUV 2 RGB code (with temporal and spatial dithering).
v0.03: Added support to testtheora.c for YUV444 and YUV420 (along with optimised ARM code for those cases).
v0.04: Added first batch of ARMv6/NEON optimisations, along with simple unix makefile to build for beagleboard.
v0.05: First "post Google" release under a BSD license.
