Well, first I need to do clipping (right now, triangles aren't clipped to the screen, so if they extend off of it, weird **** happens). Then I can release a little demo with arc-ball rotation and zooming and all that neat stuff. After that I need to make it fast: that means converting most of the vector operations to inline asm with SSE instructions, and converting the filler to fixed point, and then to asm. I should be able to parallelize most of that too. For example, I can increment all three interpolated values used for perspective-correct texture mapping with a single SSE vector add operation.
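To give a rough idea of that single vector add, here is a minimal sketch in C using SSE intrinsics rather than inline asm. It assumes the three interpolated values are u/z, v/z and 1/z, packed into one 128-bit register with the fourth lane unused. The `Interpolants` struct and the `interp_*` function names are made up for illustration and are not from the actual rasterizer.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Hypothetical layout: lanes hold (u/z, v/z, 1/z, unused). */
typedef struct {
    __m128 value;  /* current interpolated values at this pixel */
    __m128 step;   /* per-pixel increments along the scanline */
} Interpolants;

/* Pack the starting values and their per-pixel steps.
   Note that _mm_set_ps takes its arguments from the highest lane down. */
static Interpolants interp_init(float uz, float vz, float iz,
                                float duz, float dvz, float diz)
{
    Interpolants it;
    it.value = _mm_set_ps(0.0f, iz, vz, uz);
    it.step  = _mm_set_ps(0.0f, diz, dvz, duz);
    return it;
}

/* Advance one pixel: all three values are incremented by one vector add. */
static void interp_step(Interpolants *it)
{
    it->value = _mm_add_ps(it->value, it->step);
}

/* Recover the texture coordinates with the perspective divide. */
static void interp_texcoord(const Interpolants *it, float *u, float *v)
{
    float lanes[4];
    float z;
    _mm_storeu_ps(lanes, it->value);
    z = 1.0f / lanes[2];   /* lanes[2] holds 1/z */
    *u = lanes[0] * z;     /* (u/z) * z */
    *v = lanes[1] * z;     /* (v/z) * z */
}
```

The inner loop of a scanline would call `interp_step` once per pixel and do the divide either per pixel or every few pixels. This version stays in floating point, so it only covers the SSE half of the plan and not the fixed-point conversion.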
The goal is to at least match the performance of JK's software rasterizer at the same level of quality. However, I won't be using colormaps because I don't plan on dropping down to 8-bit. I'll drop to 16-bit if that's what it takes to get the performance I need, though.