OK, ok! I'll admit that the heading isn't entirely original, but what the hey- If you're reading this sentence, then it has done its job, with sincere apologies to a nocturnal vaudeville comedian named Dave.
Top Ten Stupid PET Tricks!
Copyright (c) - 1995, by Todd Elliott
Last Updated - 10/28/96.
I will attempt to illustrate the top ten down and dirty tricks for getting the best out of your ML programming on Commodore computers (well, mostly the C64/128), if you'll forgive the pun on my part about the PET series computers. Now, onward!
TOP TEN DOWN AND DIRTY ML TRICKS From the home office of Baltimore, MD.
I was going to put the undocumented opcodes here, but I felt that they were covered adequately in C=Hacking and numerous other publications. Additionally, these undocumented opcodes will be wholly incompatible with the emulation mode of the SuperCPU. Rather, I would hope that these tips help you somewhat to gain a deeper understanding of just exactly how the C= 8-bit machines work. Most tips are very esoteric in nature and I would never intend for you to use most of them in actual programming situations. They only serve to illustrate the inner workings of the 65xx CPU, and to demystify the ML programming aspect by showing its lovable quirks. ;) If you're seriously interested in pursuing this subject further, look up the C=Hacking pages maintained by Jim Brain, or the DisC=overy 'net-magazine' maintained by Mike Gordillo, and gaze into the ML abyss...
- Semi-documented JMP Instructions: BNE and BEQ Instructions.
Well, they have been documented as BRANCHing instructions, but their use as JMP-type instructions is usually ignored. Let me illustrate: LDA CODE...JMP NEWROUTE. This code takes up five or six bytes, depending on the addressing mode used in LDA CODE; the JMP alone accounts for three bytes and three cycles. But compare this with LDA CODE...BNE NEWROUTE. A branch is only two bytes long, so this code shaves off one byte, an improvement over the old code. The replacement works whenever the value just loaded is nonzero, which covers the vast majority of cases. However, if you do know that the zero flag will be set before the jump takes place, use the BEQ instruction instead. The main advantage is that the code becomes relocatable and saves a byte. Speed is a wash: a taken branch costs three cycles, the same as JMP, or four if it crosses a page boundary. The main disadvantage is that clarity is lost. A branch is intended to act on a certain condition, so using it as a JMP-type instruction makes the resulting code harder to read, and its reach is limited to 127 bytes forward or 128 bytes backward. Ah, it would be nice to have the BRA (Branch Always) instruction...
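Here is a minimal sketch of both forms. CODE and NEWROUTE are the labels from the example above; the only assumption is that you really do know what the load leaves in the zero flag.

```asm
; Case 1: the value loaded is known to be nonzero
        LDA CODE        ; CODE holds a nonzero value, so Z is cleared
        BNE NEWROUTE    ; always taken: 2 bytes instead of JMP's 3

; Case 2: the value loaded is known to be zero
        LDA #$00        ; Z is set
        BEQ NEWROUTE    ; always taken
```

If NEWROUTE ever drifts out of branch range, the assembler will complain, and that is your cue to go back to a plain JMP.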
- Chameleon JSR Instruction-
It changes into a JMP Instruction! How many times have you JSRed to a certain subroutine, only for a condition inside that subroutine to require an exit somewhere else other than through an RTS? You might as well change it into a JMP in the first place. Or...you don't have to! First, an understanding of what a JSR instruction does: When the 65xx microprocessor encounters a JSR instruction, it pushes the address of the last byte of that JSR (the return address minus one) onto the stack, high byte first and then low byte, and then JMPs to the subroutine. When the 65xx microprocessor encounters an RTS instruction, it pulls those two bytes back off the stack, low byte first, and JMPs to that address plus one. So, if you want to exit somewhere else, you can do this: PLA...PLA...JMP SOMEWHERE. The PLAs pull out the two bytes that contained the return address (trashing the accumulator on the way), and everything's all tidied up and neat, with the discarded addresses sent off to the PC orphanage in the sky. ;)
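A rough sketch of the idea follows. FLAG, CHECK and NORMAL are hypothetical names of my own; SOMEWHERE is the exit point from the text above.

```asm
MAIN    JSR CHECK       ; pushes the return address minus one: high, then low
        RTS             ; normal path: CHECK came back here

CHECK   LDA FLAG        ; FLAG is a hypothetical status byte
        BEQ NORMAL      ; zero: nothing special, go back to the caller
        PLA             ; throw away the low byte of the return address
        PLA             ; throw away the high byte
        JMP SOMEWHERE   ; leave for good; the stack is balanced again
NORMAL  RTS
```

This only works if CHECK has pushed nothing else of its own; anything it did push has to come off the stack before the two PLAs.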
- More Undocumented JMP instructions: RTS and BRK.
You think I'm jesting, huh? Nosiree, read on! Remember the previous tip explaining how the RTS instruction works? You can take the address you want to reach, subtract one, PHA its high byte and then its low byte onto the stack, and then RTS! By doing this, you created a quasi-JMP, and the irony is that you use the RTS instruction to accomplish this. Pretty sneaky, huh? As for the BRK instruction, it does what it says: it interrupts the program and continues at the address currently stored in the BRK vector at address $0316. You can modify the BRK vector to point to a new address in low byte/high byte format, and then all BRKs will go to this address, thereby creating a quasi-JMP to that address. This is used in ML monitor programs. Both techniques can be used in your programs, but I'd advise against it. They consume plenty of clock cycles, six for RTS and seven for BRK, on top of the instructions needed to set them up, and they make the resulting code almost incomprehensible. Why do I bring this up? For the upcoming CBM Trivial Pursuit game, that's why! ;) (Come to think of it, read some of the previous Commodore Trivia articles by Jim Brain, if you're up to snuff.)
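Both quasi-JMPs can be sketched as below. TARGET and HANDLER are hypothetical labels, and the #< and #> operators (low byte and high byte of an address) are common assembler syntax that your assembler may spell differently.

```asm
; Quasi-JMP with RTS
        LDA #>(TARGET-1) ; high byte of target minus one goes on first
        PHA
        LDA #<(TARGET-1) ; low byte goes on last, so RTS pulls it first
        PHA
        RTS              ; pulls the address, adds one, lands on TARGET

; Quasi-JMP with BRK (C64, KERNAL switched in)
        LDA #<HANDLER
        STA $0316        ; BRK vector, low byte
        LDA #>HANDLER
        STA $0317        ; BRK vector, high byte
        BRK              ; ends up at HANDLER
```

Unlike a real JMP, the BRK route does not arrive with a clean stack: on the C64 the KERNAL has pushed A, X and Y on top of the status byte and return address that BRK itself pushed, so HANDLER has to deal with those.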
- Self-Modifying Code.
You may have heard of it. Many ML tutorials hint at it, but shy away from its arcane and blasphemous structure. Why? It contradicts the time-honored principles of computer programming as fostered in a structured college curriculum, which emphasizes clean, clear and readable code above anything else and leaves no room for elegant 'trash' such as this. But it's definitely worth your time to understand this little-known technique of ML programming. Example:
LDA #$07:LDY #$00:STA ZP+1:STY ZP:LDA #$01:LDY #$00
LDX #$03:LOOP STA (ZP),Y:DEY:BNE LOOP
DEC ZP+1:DEX:BPL LOOP:RTS
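The preceding routine, which fills the four pages from $0400 to $07FF (covering the default screen memory) with the value $01, occupies 25 bytes and takes an estimated 11,300+ clock cycles to run. Compare this with the following routine which uses self-modifying code: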
LDA #$07:STA LOOP+2:LDA #$01:LDY #$00
LDX #$03:LOOP STA $FF00,Y:DEY:BNE LOOP
DEC LOOP+2:DEX:BPL LOOP:RTS
The preceding routine occupies 24 bytes and takes an estimated 10,200+ clock cycles. The main advantage of using self-modifying code here is speed: STA $FF00,Y (absolute indexed) costs five cycles where STA (ZP),Y costs six, and the inner loop runs 1,024 times, so you save roughly 1,000 clock cycles on every screen update. The $FF in STA $FF00,Y is only a placeholder; the routine overwrites it through LOOP+2 before the loop starts, and DEC LOOP+2 then steps it down a page at a time.
Also, please note that if you're programming in ML, you're ahead of the programming curve. Nowadays, Microsloth Corp. and other computer companies do not really sweat out their code or optimize their code. They just design the programs, and wait for the hardware to catch up and execute it at a satisfactory speed. We Commodore 8-bit users do not have that luxury, and resort to these dirty tricks to get the most out of our machine. By doing so, these users will be better prepared for programming apps for the current computer platforms and sneer at fellow employees, "You use C++? Ha, you're wimps! I assemble exclusively in 64-bit code for lunch!"
- Don't join them; BEAT them!
Create your own 'KERNAL' instead of using the CBM KERNAL. How many of you honestly think that commercial programs use the CBM KERNAL? I thought so... Most of those programs use their own 'KERNAL' developed in-house to speed up, optimize and maximize the computer's capabilities. Why would you replace those time-honored routines in the CBM KERNAL with ones of your own? Let's use an example: JSR $FFD2, or CHROUT. If you disassemble the KERNAL CHROUT routine, you will notice that it is a long one, checking for all possibilities, flags, errors, etc., before a single character physically appears onscreen. If you designed your own 'KERNAL' with a routine called PRINTCHAR, built to handle only the specific screen formatting you need, you'd be ahead in the game and a noticeable speed improvement would result. In no time, your constant programming and experimentation will accumulate into a 'library' of subroutines, a full-blown 'KERNAL' of your own that lets you develop future programs more quickly, with fewer bugs, and with that 'commercial' feel.
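To give a flavor of it, here is a bare-bones sketch of what a PRINTCHAR might look like. It is my own illustration, not lifted from any commercial program, and it assumes the default C64 screen at $0400, color RAM at $D800, a screen code (not PETSCII) in A, and output on the top row only.

```asm
; PRINTCHAR: hypothetical routine for illustration
; In:  A = screen code, X = column (0-39) on the top row
; Out: A is destroyed, X is preserved
PRINTCHAR
        STA $0400,X     ; drop the character straight into screen RAM
        LDA #$01        ; color 1 = white
        STA $D800,X     ; matching cell in color RAM
        RTS
```

That is a handful of cycles against the long trek through CHROUT, at the price of handling none of the cases CHROUT handles: no cursor movement, no scrolling, no control codes, no device redirection.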
- Use Zero Page Locations for your programs.
You aren't limited to locations such as the cassette buffer or the 4K free area starting at $C000 to store programs. You can 'plop' a small routine of up to around 124 bytes in zero page! When you enter the domain of ML, you forsake BASIC and its cozy environment. You can switch off BASIC and presto! You have around 124 contiguous free locations in zero page at your disposal. Why use it, you ask? Well, critical routines such as CHRGET have been placed in zero page to speed them up, and it is an excellent place if you are using a lot of variables. Suppose you have a critical screen updating routine: keep its pointers and counters in zero page rather than in the 4K $C000 area, and the zero-page version will win the race every time. The reason lies in the 65xx architecture. Instructions that address zero page are one byte shorter and one cycle faster than their absolute counterparts. The code itself is not fetched any faster from zero page than from $C000, but a routine living there can patch its own operands with those cheap zero-page instructions, which is exactly what CHRGET does with its text pointer. Oh, and shut off BASIC while you're doing this. One caveat: I did try this, and ran the program in a C64S emulator shareware version, and it crashed. One more reason to stick with the classic C64! ;)
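Where zero page measurably pays off is in the addressing modes. The comparison below uses $FB, one of the few zero page bytes left free on a stock C64, against an arbitrary address of my choosing in the $C000 area.

```asm
        LDA $FB         ; zero page: 2 bytes, 3 cycles
        LDA $C0FB       ; absolute:  3 bytes, 4 cycles

        INC $FB         ; zero page: 2 bytes, 5 cycles
        INC $C0FB       ; absolute:  3 bytes, 6 cycles
```

One cycle and one byte per access does not sound like much, but inside a loop that runs a thousand times it adds up.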
- Pep up that ML subroutine with the SEI and CLI instructions!
Consider this: Interrupts happen sixty times a second, and they do a lot. They blink the cursor, scan the keyboard, update the jiffy clock and generally wreak havoc. If you have an ML subroutine that just isn't up to par, put the SEI instruction before it, and when the subroutine is done, insert the CLI instruction to re-enable the interrupts. The theory goes that without the hassle and time consumed by the interrupt, your ML subroutine now commands nearly 100% of your microprocessor's time and runs quicker. Of course, an NMI request would still muck up your brilliant piece of compact ML code.
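In its simplest form the wrapper looks like this; FASTLOOP is a hypothetical name standing in for your time-critical routine.

```asm
        SEI             ; set the interrupt disable flag: IRQs are held off
        JSR FASTLOOP    ; hypothetical time-critical routine
        CLI             ; clear the flag: IRQs are serviced again
```

If the code might be called with interrupts already disabled, a slightly more careful variant saves and restores the status register instead of blindly enabling IRQs on the way out:

```asm
        PHP             ; save the status register, I flag included
        SEI
        JSR FASTLOOP
        PLP             ; restore the caller's I flag, whatever it was
```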
- The JMP ($xxFF) instruction.
It ain't a bug, it's a feature! I know you may raise an eyebrow or two, but bear with me for a minute. First of all, let me briefly explain what the indirect JuMP instruction does: JMP ($033C) means that the first address ($033C) contains the low byte, for example $00, and the second address ($033D) contains the high byte, for example $C0. When the 65xx microprocessor fetches these two bytes and forms an address ($C000), it jumps to that address. So far, so good. What if it was JMP ($03FF)? Then we have a problem. The 65xx microprocessor gets the low byte from the $03FF address, but due to a quirk (only the low byte of the pointer is incremented, and it wraps around without carrying into the high byte), it gets the high byte from the $0300 address, not the $0400 address as it should. Now, what to do? Easy... Just put the low byte into the $03FF address and the high byte into the $0300 address, and the JMP ($03FF) instruction will work. On the C64, mind you, $0300 already holds a BASIC vector, so pick a page you own. Why do it, you ask? No real reason. It runs no faster than a 'bug-free' indirect JuMP (five cycles either way), but knowing the quirk will save you a long debugging session the day one of your vectors happens to land on the last byte of a page.
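A sketch of the quirk in action, as it behaves on the NMOS 6502/6510: the high byte of the target is fetched from the start of the same page as the low byte. I've used a vector at $C1FF rather than $03FF so the example doesn't trample anything in the C64's own vector area; the addresses are otherwise arbitrary.

```asm
        LDA #$00
        STA $C1FF       ; low byte of the target address
        LDA #$C0
        STA $C100       ; high byte: start of the SAME page, not $C200
        JMP ($C1FF)     ; continues at $C000
```

Whatever sits in $C200 is never looked at.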
- The Maverick Bit-
We have Nine-Bit computers, not Eight-Bit computers! Technically speaking, yes, this may be true. In normal 65xx architecture, we have 8 bits traveling around at any given time. But not a lot of attention is paid to the Carry flag, which is affected by the addition, subtraction, rotate and shift opcodes. This bit can be easily manipulated by these opcodes, and we can use it to our advantage. With this technique, a single nine-bit field can represent 512 different possibilities, and this flexibility can be an asset. One good application for this would be compression. For a detailed explanation of how to use the maverick bit, look up the monitor decompress routine in the C128 starting at $b6a1. Another application is for joystick routines. Just read the joystick byte, LSR or ASL it, check the status of the carry flag to determine what action to take, and repeat, etc. It's that simple. No more ANDing or ORing the joystick byte as you would in BASIC.
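The joystick idea can be sketched like this for a stick in port 2 of a C64. The handler names (MOVEUP and friends) and the scratch byte at $FB are assumptions of mine. The routine shifts a copy in memory rather than the accumulator so that the handlers are free to use A.

```asm
JOY     = $FB           ; scratch byte in free zero page (assumption)

READJOY LDA $DC00       ; CIA 1 port A: joystick in port 2
        STA JOY         ; work on a copy
        LSR JOY         ; bit 0 into carry: up
        BCS NOUP        ; carry set = not pressed (lines are active low)
        JSR MOVEUP
NOUP    LSR JOY         ; former bit 1: down
        BCS NODOWN
        JSR MOVEDOWN
NODOWN  LSR JOY         ; former bit 2: left
        BCS NOLEFT
        JSR MOVELEFT
NOLEFT  LSR JOY         ; former bit 3: right
        BCS NORIGHT
        JSR MOVERIGHT
NORIGHT LSR JOY         ; former bit 4: fire button
        BCS NOFIRE
        JSR FIRE
NOFIRE  RTS
```

Each test is one shift and one branch, with no masks to remember.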
- 8-bit Addition to a 16-bit Word.
When I talk about 8-bit addition to a two-byte integer, I mean this routine:
CLC:LDA LOC:ADC #40:STA LOC ; low byte
LDA LOC+1:ADC #$00:STA LOC+1 ; hi byte
By adding a zero with the carry flag, we account for any overflow and update the hi-byte accordingly. Most of the code I've seen adds or subtracts a single byte to a two-byte word this way. This is inefficient. The whole routine runs in full every time it is called, consuming 22 clock cycles and 17 bytes. (This is assuming that absolute addressing is used.) But compare with the following routine:
CLC:LDA LOC:ADC #40:STA LOC ; low byte
BCC OUT:INC LOC+1 ; hi byte, bumped only when the low byte carried
OUT ; rest of code continues here
Roughly five times out of six, the routine does not run the INC instruction, since adding 40 produces a carry for only 40 of the 256 possible low-byte values. In that case it consumes only 15 clock cycles. Even with the INC instruction, the routine consumes 20 clock cycles. Not only is it faster than the ADC #$00 version, it is shorter as well, taking up a mere 14 bytes.
This is a much more efficient use of the addition routine and is quicker overall. The efficiency decreases as the value being added increases, because the carry, and with it the INC, occurs more often. The same holds true for subtraction, except that you use SEC and SBC, BCS OUT, and DEC the hi-byte. Called once or twice, the first routine would be okay, but if this routine is to be called repeatedly, the second routine will give you a speed boost. This tip does not apply to two-byte addition/subtraction routines.
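For completeness, here is the subtraction counterpart spelled out, using the same LOC and OUT labels as the addition routine. On the 65xx a set carry after SBC means no borrow occurred, so the branch skips the DEC in the common case.

```asm
        SEC             ; no borrow going in
        LDA LOC
        SBC #40
        STA LOC         ; low byte
        BCS OUT         ; carry still set: no borrow, hi byte is fine
        DEC LOC+1       ; borrow: take one from the hi byte
OUT     RTS
```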
If you have any questions or flames, feel free to email me at: firstname.lastname@example.org (slow) or email@example.com (fast)