Manufacturing is Hard
There’s a big difference between building one of something and making a repeatable process to build 10 of them, or 100. Unfortunately, I’m learning that the hard way while I try to get some more Floppy Emu boards ready to sell. If I had any hair, I’d be pulling it out! I never thought this would be so hard.
If you haven’t been following the earlier posts, Floppy Emu is a floppy disk drive emulator for vintage Macintosh computers. I built the first Floppy Emu for my personal use about a year ago, and while the soldering was a little challenging, everything worked once it was done. I posted the design on the BMOW web site, and since then I’d estimate about 10 other people have built their own Floppy Emu boards. Then in October I built two more boards from my remaining parts stock, and sold them on eBay. I tested those thoroughly before I sold them, so I’m confident those boards were working well.
The eBay sale generated lots of interest and requests for more boards, so in late October I created board revision 1.1 in preparation for a small hand-made “production run”. The board layout changed slightly to make room for mounting holes, and some board traces were moved or added. I switched to a different PCB supplier, changed to a different brand of 3.3V LDO regulator, and substituted the Atmega1284 for the Atmega1284P to save a few pennies.
I built four of the rev 1.1 boards, and initially none of them worked. As described in my previous post, the new brand of 3.3V regulator proved to be unstable when combined with the output capacitor I’d been using. The oscillations on the 3.3V and 5V supply lines caused all kinds of erratic behavior and malfunctions that drove me crazy. I’ve since found that replacing the 10 uF ceramic output capacitor with a 33 uF tantalum solves that particular problem. Yet even with the capacitor fix, one of the boards exhibited occasional random write errors, and I somehow toasted another one during assembly.
Later I discovered a flaw in my CPLD firmware that was shorting the Mac’s PWM drive speed control input to GND. Floppy Emu doesn’t actually use that input, but shorting it to ground is not very nice, and may have damaged the CPLD, the Mac, or both. This only affected the rev 1.1 boards. That firmware flaw is now fixed, hopefully without any permanent damage.
I’ve since built two more of the rev 1.1 boards. One worked fine, but the other showed the same pattern of occasional random write errors. Of the six rev 1.1 boards I’ve built, that means I only have three working boards. Arghh! 50% yield is not good. The random write error is maddening. It doesn’t happen very often, so it’s necessary to do a LOT of testing before I can be confident a particular board does or doesn’t have this problem. I spent a long time with a lens, an oscilloscope, and a debugger trying to explain what’s going wrong, but failed. My best theories are:
Software Bug – Perhaps there’s a problem with the Floppy Emu software, like a timing bug or uninitialized variable, and tiny variations in boards or components cause the bug to appear or disappear. This was my first guess, but if true I would expect a continuous distribution of bugginess across boards, rather than two groups of “working” and “not working” boards. I tested the working boards heavily, and they really do work 100%. I also made many experimental software changes that I thought might cause the problem to appear or disappear, but there was no change in behavior. And to my knowledge none of the rev 1.0 boards have this problem, even though they use the same software.
Soldering Mistake – I may have created a bad solder joint somewhere, leading to flaky behavior. That’s possible, but it seems pretty unlikely I’d make the exact same soldering mistake twice in six boards. And I’ve visually inspected the problem boards carefully with a 10x magnifier, and touched up all the likely problem points with an iron, without any success.
CPLD Damage – Some of the CPLDs might have been damaged by the firmware bug that shorted PWM to GND, resulting in buggy behavior even after the firmware was fixed. That’s certainly possible, but then why weren’t all the CPLDs damaged? Why just two of them? If this is the true explanation, then future rev 1.1 boards should all work OK now that the firmware bug is fixed.
Atmega1284 vs Atmega1284P Variation – Maybe some minor difference between the two types of the AVR microcontroller is causing unexpected problems. As far as I know, the only difference is that the “P” version uses Atmel’s Pico-Power system to enable very low power sleep modes. Since I’m not using those sleep modes, that difference shouldn’t matter.
Board Design Flaw – The rev 1.1 board could contain a design mistake not present in the original board, like substantial coupling between neighboring traces, signal reflections, or other noise that leads to intermittent problems. While the layout changes between rev 1.0 and 1.1 were minor, I can’t rule this possibility out.
Manufacturing Flaw – The rev 1.1 boards from Smart Prototyping might not be built to the same tolerances as the original boards from Dorkbot PDX. In terms of published specs like minimum trace width and spacing, the Smart Prototyping process should be fine, and I used their design rules file to verify my board in Eagle. I know other people have been successful with rev 1.0 boards not made by Dorkbot PDX, though I don’t think any have used Smart Prototyping specifically.
Unfortunately I’m at one of those points where I really don’t know where to go next. I could build a few more boards to test the CPLD damage theory. Or get some more Atmega1284P’s and build a few boards with those, or experiment with going back to the original PCB manufacturer or the rev 1.0 board design. But each of those experiments would require more time and money to test the theory. I’d need to see at least five good boards and zero bad ones before I had any confidence that I’d solved the problem. Spread across all the possible problem causes, I could end up building several dozen test boards, and still come up empty-handed if the true cause is a software bug or something else I haven’t considered.
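To put a rough number on the "five good boards" threshold: if each board failed independently with some fixed probability p, the chance of getting n good boards in a row by luck alone would be (1 - p)^n. The sketch below works that out for the two rates seen so far (2 of 6 boards with write errors, 3 of 6 bad overall). The independence assumption and the function name are mine, purely for illustration.

```python
def chance_of_clean_streak(fail_rate: float, boards: int) -> float:
    """Probability that `boards` consecutive boards all work by luck alone,
    assuming each board fails independently with probability `fail_rate`."""
    if not 0.0 <= fail_rate <= 1.0:
        raise ValueError("fail_rate must be between 0 and 1")
    if boards < 0:
        raise ValueError("boards must not be negative")
    return (1.0 - fail_rate) ** boards


if __name__ == "__main__":
    # Failure rates taken from the six rev 1.1 boards described above.
    observed = (
        ("write errors only (2 of 6)", 2 / 6),
        ("all failures (3 of 6)", 3 / 6),
    )
    for label, rate in observed:
        for n in (5, 10):
            p = chance_of_clean_streak(rate, n)
            print(f"{label}: {n} good boards in a row by chance = {p:.1%}")
```

Under those assumptions, five good boards in a row would still turn up by chance roughly 13% of the time at the write-error rate, and about 3% of the time at the overall failure rate, so five is closer to a minimum than a comfortable margin.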
11 Comments so far
Hi, I have never worked with the atmega, so take this for what it is worth:
Do you have the same cpu-clock, or are they drifting away?
Are there any additional config registers on the P-version?
-Sleep mode
-PLL-settings
-etc.
I think it likely that some CPLDs could fail but not others. If you ever test chips to their max, you will notice they are very different. I tested several 74HC04s (rated for 6 volts max) and some popped at 7 volts while others could take 12, and everywhere in between.
yeah the problem is being only one guy, you have to do everything.
manufacturing / production can get quite tricky, especially when you are the one wearing all the hats 🙂
well thanks for what you do! it’s great!
When i finally end up getting mine i will very much enjoy and appreciate it! for sure!
-uniserver
maybe i could help you with the Build/Test/Sales/Ship part?
Maybe your code (both the avr and the cpld) is too strict, so when the mac has a hiccup everything goes wrong.
Are you compiling the code specifically for the atmega1284p? That might cause problems. For example, the atmega644 and atmega644p have different register addresses for some things, so code for one will not work on the other; even the signature is different.
Ch00ftech made a production run for his QR-Clock using Myro-pcb services. You might take a look at his blog; he has written a very thorough explanation of what went well and what went wrong. If you really are thinking about a production run, read it:
http://ch00ftech.com/category/qr-clock/
I’ve built six more and they all tested good, so maybe the damaged CPLD theory was correct. It still seems a little fishy to me, though.
Thanks for the ch00ftech link – that is some great info! His experience with Myro-PCB assembly is sobering, with 50% of the assembled PCBs having some kind of defect.
Most of his problems were caused by the huge size of the PCB, and by not knowing how V-grooves are done in PCBs. Such a long, large PCB should be thicker; even if the resistors didn’t break in assembly, they could break in transportation. That’s too much stress on such a little component.
There are more assembly houses, all around the globe.
Just one question, hope you don’t mind it: does your code compile totally clean? By clean, I mean not a single warning.
Yes, no warnings. You can download the code here: http://www.bigmessowires.com/floppy-emu-source-1.0K-F11.zip
Really glad to see this project coming along. I wish I could be of more assistance, but hope I can support the project down the road. Thanks for your great work on this thus far. It’s definitely out of my technical range to assist more, but would be great to bring the older Macs back to life!
Funny. I had exactly the same kind of problems before… I finally decided to write a real test program to be sure that all is fine with the software before releasing anything (this is also very helpful for testing the hardware and the SD card too…).
http://sourceforge.net/p/hxcfloppyemu/code/HEAD/tree/HxCFloppyEmulator/HxCFloppyEmulator_TestTools/HxCFE_StressTest/PCDOS_TestSoftware/
The software tests all the floppy accesses: format / read sector / write sector (sequential and random access) with many disk layouts… The disk layout is generated randomly, as is the data to write. All written data (format and sector write) is read back and compared. The software automatically changes the loaded image on the emulator side.
I generally let this software run for some hours, up to some days 😉
Here is the test solution running (~9 hours of test).
http://hxc2001.com/img/hxcfe/RevC_And_KingstonSDClass4_2.jpg
http://hxc2001.com/img/hxcfe/Results_Screen_9hours.jpg
Having a good test solution is very important, so i recommend you to have something similar for the floppy Emu for Macintosh! Very helpful to test the hardware, the software and the SDCards!
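The write / read back / compare loop described above can be sketched in a few lines. This is only an illustration of the idea, not the HxC test tool or anything from the Floppy Emu source: `FakeDisk`, `stress_test`, and the 512-byte sector size are all hypothetical, and the "disk" is just a list in memory with an optional simulated fault.

```python
import random
from typing import List, Optional

SECTOR_SIZE = 512  # bytes per sector; an assumption for this sketch


class FakeDisk:
    """Hypothetical in-memory stand-in for the device under test."""

    def __init__(self, sectors: int, flaky_sector: Optional[int] = None):
        self.data = [bytes(SECTOR_SIZE) for _ in range(sectors)]
        self.flaky_sector = flaky_sector  # sector that corrupts writes, if any

    def write_sector(self, index: int, payload: bytes) -> None:
        if index == self.flaky_sector:
            # Simulate a write error by flipping every bit of the first byte.
            payload = bytes([payload[0] ^ 0xFF]) + payload[1:]
        self.data[index] = payload

    def read_sector(self, index: int) -> bytes:
        return self.data[index]


def stress_test(disk: FakeDisk, sectors: int, passes: int, seed: int = 0) -> List[int]:
    """Write random data to randomly chosen sectors, read each one back,
    and return the sector index of every write whose read-back differed."""
    rng = random.Random(seed)  # seeded so a failing run can be repeated
    failures = []
    for _ in range(passes):
        index = rng.randrange(sectors)
        payload = bytes(rng.getrandbits(8) for _ in range(SECTOR_SIZE))
        disk.write_sector(index, payload)
        if disk.read_sector(index) != payload:
            failures.append(index)
    return failures


if __name__ == "__main__":
    print("good disk failures:", len(stress_test(FakeDisk(8), 8, 200)))
    bad = stress_test(FakeDisk(8, flaky_sector=3), 8, 200)
    print("flaky disk failing sectors:", sorted(set(bad)))
```

Seeding the random generator is the main design choice here: a rare failure is much easier to chase if the exact same sequence of writes can be replayed.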
Regarding your problem, this may be some uninitialized variables: depending on the manufacturing process of the MCU, ambient temperature, and voltage rise at power-up, the internal memory may hold random states at power-up. On my side I have added a small piece of ASM code at the reset vector to clear the whole SRAM of the MCU at power-up. This way the starting state is well known 🙂 .
About the CPLD: I don’t think they are damaged, but it may be a metastability/timing problem. Are you sure that all asynchronous signals coming from the Mac are resynced with two or three FFs in the CPLD? This is a very common and very random issue 😉
Good idea! I do have a test program on the Mac side, but it’s nothing as elaborate as what you’ve got there.
Yes, the async signals are synchronized by the CPLD, and I’ve checked for uninitialized variables. It is strange, but the problem hasn’t appeared in any of the newer boards I’ve built, so either it was a damaged CPLD, my solder technique is getting better, or there’s some other cause I haven’t found yet.
HxC looks like a very nice product, by the way!