
Software Engineer Runs Generative AI On 20-Year-Old PowerBook G4 (macrumors.com) 48

A software engineer successfully ran Meta's Llama 2 generative AI model on a 20-year-old PowerBook G4, demonstrating how well-optimized code can push the limits of legacy hardware. MacRumors' Joe Rossignol reports: While hardware requirements for large language models (LLMs) are typically high, this particular PowerBook G4 model from 2005 is equipped with a mere 1.5GHz PowerPC G4 processor and 1GB of RAM. Despite this 20-year-old hardware, my brother was able to achieve inference with Meta's LLM model Llama 2 on the laptop. The experiment involved porting the open-source llama2.c project, and then accelerating performance with a PowerPC vector extension called AltiVec. His full blog post offers more technical details about the project.
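
For context on where the time goes: llama2.c spends essentially all of its inference time in a matrix-vector multiply (its matmul() routine), and AltiVec can do four single-precision fused multiply-adds per instruction. The snippet below is a minimal sketch of that kind of kernel, not the author's actual code; it assumes 16-byte-aligned buffers, an inner dimension n divisible by 4, and a GCC-style AltiVec toolchain, and the function name is made up for illustration.

    /* Sketch of an AltiVec matrix-vector multiply in the spirit of llama2.c's
       matmul(): out[i] = sum_j W[i][j] * x[j], for a d x n weight matrix.
       Illustrative only. */
    #include <altivec.h>

    static void matvec_altivec(float *out, const float *x, const float *w, int n, int d) {
        for (int i = 0; i < d; i++) {
            const float *row = w + (long)i * n;
            vector float acc = (vector float)vec_splat_u32(0);  /* four zeros */
            for (int j = 0; j < n; j += 4) {
                vector float wv = vec_ld(0, row + j);   /* load 4 weights     */
                vector float xv = vec_ld(0, x + j);     /* load 4 activations */
                acc = vec_madd(wv, xv, acc);            /* acc += wv * xv     */
            }
            float tmp[4] __attribute__((aligned(16)));
            vec_st(acc, 0, tmp);                        /* spill for the horizontal sum */
            out[i] = tmp[0] + tmp[1] + tmp[2] + tmp[3];
        }
    }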


Comments Filter:
  • Have to see what I can run on a G3 with 512MB of ram.
    • by ceoyoyo ( 59147 ) on Monday March 24, 2025 @08:41PM (#65256955)

      It's just matrix multiplies. If you've got a big enough hard drive and enough patience you should be able to run any model you want.
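
      (Rough numbers, to put the "patience" in perspective: a dense model needs about 2 FLOPs per parameter per generated token, so a 7B model is ~14 GFLOPs per token. A 1.5GHz G4's AltiVec peaks around 12 GFLOPS, so even perfectly vectorized you'd be looking at a second-plus of arithmetic per token - and in practice the few-GB/s memory bus, streaming gigabytes of weights per token, is the real wall. These figures are back-of-envelope, not measurements.)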

      • It's just matrix multiplies. If you've got a big enough hard drive and enough patience you should be able to run any model you want.

        Patience is the key. This project just shows that a really tiny model that sort of fits in memory can run as long as you don't care what the performance is. But we already knew that.

        • by ceoyoyo ( 59147 )

          I'm not sure most people do know that.

          To be fair, the performance might not be as bad as one might expect. You can run a distilled version of Deepseek R1 on a Raspberry Pi and it manages to produce output faster than most people I know type, so it's potentially useful. That does fit in RAM, so you'd be taking another big hit in speed if you had to go to an SSD. Probably best not to use a spinning disk. And AltiVec is pretty decent at matrix multiplies, for a CPU.

          • RAM is a minor part of the inference performance equation. No matter how optimal the engine implementation, memory I/O bandwidth is the primary limiting factor for inference performance. Nvidia SoCs don't have the fastest CPUs, but their I/O bandwidth is off the charts: 3.35TB/s for the H100, 8TB/s for the B100, 16TB/s for the GB200. For comparison, the fastest Mac M-series SoC is the M3 Ultra with 800GB/s. Inference is about walking memory. I'm sure you can run a lobotomized variant of R1 on a Pi, but y

            • RAM is a minor part of the inference performance equation

              To the contrary, it's the primary part of the inference performance equation.
              If your model doesn't fit in RAM, then your effective bandwidth is now whatever your I/O bandwidth is- and you're not a happy camper.
              After that, then yes, memory bandwidth is the primary limiting factor.

              Hint: they won't, because the hardware memory bandwidth makes it impossible; even with a hand-tuned assembly inference engine, it's just not possible to walk memory fast enough.

              Hint: no memory bandwidth constraint makes inference impossible, it just makes it unpalatable.
              I can run 1.2TB models on my M3 Max MBP, as long as I don't mind the 0.18t/s they operate at.
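
              As a back-of-envelope for that argument (my own illustrative numbers, not measurements): a dense model has to stream essentially all of its weights for every generated token, so tokens/s is bounded by roughly memory bandwidth divided by model size in bytes. Something like:

                  /* Illustrative upper bounds only; bandwidth figures are rough peak
                     numbers, and real inference lands well below them. */
                  #include <stdio.h>

                  int main(void) {
                      double model_bytes = 3.5e9;  /* e.g. a 7B model at ~4 bits per weight */
                      const char *name[] = { "PowerBook G4 (~2.7 GB/s)", "Raspberry Pi 5 (~17 GB/s)",
                                             "M3 Max (~400 GB/s)", "H100 (~3350 GB/s)" };
                      double bw[] = { 2.7e9, 17e9, 400e9, 3350e9 };
                      for (int i = 0; i < 4; i++)
                          printf("%-26s <= %8.2f tok/s\n", name[i], bw[i] / model_bytes);
                      return 0;
                  }

              Which is why "does it fit in RAM" is the first question and "how fast is that RAM" is the second.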

            • by ceoyoyo ( 59147 )

              Ah yes, memory IO is super important when you're running off an SSD.

              If that's not the bit you're referring to I'm not sure what point exactly you're trying to make.

          • You can run a distilled version of Deepseek R1 on a Raspberry Pi and it manages to produce output faster than most people I know type, so it's potentially useful.

            There are quite a few caveats there, lol.
            I would imagine it's Qwen 1.5B, and I'd imagine it's quantized down to 4bpp.

            Given the questionable use of such a thing, I'm not sure it can produce tokens quicker than most people you know can type random gibberish.

            And AltiVec is pretty decent at matrix multiplies, for a CPU.

            For its time- definitely.
            By contemporary measurements- nope. Not even close.
            He gets 0.77t/s on tinystories-110M.
            My current MacBook gets 22.6 t/s on that same test (CPU/SIMD), while the CPUs are basically idle (the workload can't even fill the CPU)
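
            For anyone wondering what "4bpp" means in practice: the weights are grouped into small blocks, and each block stores one float scale plus 4-bit integers. A minimal sketch of the idea (my own illustration, not any particular library's on-disk format):

                /* Block-wise 4-bit quantization: one float scale per 32 weights.
                   Real formats (e.g. llama.cpp's Q4 variants) differ in details. */
                #include <math.h>
                #include <stdint.h>

                #define QBLOCK 32

                void quantize_block_q4(const float *w, uint8_t *out, float *scale) {
                    float amax = 0.0f;
                    for (int i = 0; i < QBLOCK; i++)
                        if (fabsf(w[i]) > amax) amax = fabsf(w[i]);
                    *scale = amax / 7.0f;                      /* map weights into [-7, +7] */
                    float s = (*scale != 0.0f) ? *scale : 1.0f;
                    for (int i = 0; i < QBLOCK; i += 2) {
                        int q0 = (int)roundf(w[i]     / s) + 8;   /* 1..15 */
                        int q1 = (int)roundf(w[i + 1] / s) + 8;
                        out[i / 2] = (uint8_t)((q1 << 4) | q0);   /* two nibbles per byte */
                    }
                }

                float dequant_q4(uint8_t byte, int hi, float scale) {
                    int q = hi ? (byte >> 4) : (byte & 0x0F);
                    return scale * (float)(q - 8);
                }

            The point being that a 1.5B-parameter model shrinks to under a gigabyte this way, which is the only reason it fits on a Pi (or, at least on paper, a 1GB PowerBook) at all.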

            • by ceoyoyo ( 59147 )

              R1 16B I believe.

              It is funny how "useful" has suddenly turned into "can it run the latest and greatest thing that was released yesterday? No? Lol."

              • R1 is a 671B parameter model.

                There are several distillations into smaller models that are commonly misnamed as R1.
                Qwen-1.5B
                Qwen-7B
                Llama-8B
                Qwen-14B
                Qwen-32B
                Llama-70B

                Basically, they've trained these smaller models to use chain-of-thought using R1.
                A Pi gets 6.12t/s for Qwen-1.5B (R1 distilled) at 4bpp. That's not a great speed, but it's a reasonable speed.
                Of course, performance drops with the number of parameters, so you can see why I discounted the likelihood of it being a 7, 8, or 14B.

                Qwen-1.5B is alrea
        • by allo ( 1728082 )

          Llama 2 does not count as tiny. The smallest Llama 2 model is 7 billion parameters; that's a size that easily fits on cheaper graphics cards but isn't fun on a CPU anymore (with the usual inference software). I guess the experiment still took a lot of patience, though.
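
          (For scale: 7 billion parameters is roughly 14GB of weights at 16 bits each, ~7GB at 8-bit, and ~3.5GB at 4-bit - which is why it slots onto a mid-range GPU but grinds against a CPU's memory bus, and why the 0.77t/s figure quoted elsewhere in this thread is for the tiny tinystories-110M checkpoint rather than a 7B Llama 2.)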

    • by dgatwood ( 11270 )

      Have to see what I can run on a G3 with 512MB of ram.

      The G3 chip had no AltiVec, so it would be a total dog by comparison.

    • Your iBook G3 is Turing complete; you should be able to emulate a PowerBook G4 on it.

  • I miss my G4 (Score:3, Informative)

    by registrations_suck ( 1075251 ) on Monday March 24, 2025 @08:41PM (#65256953)

    My 12" PowerBook G4 might be my favorite machine of all time. I loved that form factor.

  • This is why so many of us were upset when Apple switched to Intel. It took several years for Intel's floating-point performance to get back to where the Mac was before. Mind you, the integer performance stomped the G5 into the ground, so the UI seemed faster, but for audio multitrack work, ugh.

    • Whaaaaaa-
      The Core Duo should have eaten any G4's lunch.

      I did a quick Google and only found one set of benchmarks, from AnandTech, and it basically confirmed that. There wasn't a single benchmark the G4 won, including various SIMD-heavy tasks like video encoding.
      There seems to be a certain mythology around the G4.
      • AltiVec was nicer to hand-code and a bit more flexible than earlier SSE revisions, but by SSE4 the capabilities of AltiVec were a generation behind, and the memory bandwidth of AltiVec-enabled PowerPC chips meant that x86 pretty much always won (somewhat the case for POWER too).

        AltiVec was exciting if you were doing work for PowerPC because the performance difference was huge compared to normal CPU/FPU operations on the same machine. I think this is how AltiVec earned its reputation, even though PowerPC ch

        • My recollection from the time, was that it was pretty easily demonstrable that a Core Solo would walk a G4 like a dog, even in SIMD loads, in the notebook space.
          I do recall that AltiVec did much better per clock, but ultimately the Core Duo clock advantage was insurmountable.
          I remember the G4 was rad when it came out. Intel had nothing in 1999 that could compete.
          But Apple stretched that thing until like 2006 IIRC, and at that time Intel was fielding dual cores with higher clocks.

          G5 I think is the same sto
          • by dgatwood ( 11270 )

            My recollection from the time, was that it was pretty easily demonstrable that a Core Solo would walk a G4 like a dog, even in SIMD loads, in the notebook space.

            Yeah, mostly because the G4 sucked in the notebook space. If you compared with any pre-Core Intel, back before Intel started taking their thermals seriously, back when everything had to be throttled to within an inch of its life to keep the keyboard from catching fire, the G4 was pretty good by comparison, but that's not saying much.

            The desktop space was a rather different story.

            I do recall that AltiVec did much better per clock, but ultimately the Core Duo clock advantage was insurmountable.

            Bear in mind that except for laptops, where Apple couldn't get IBM to build a G5 that was cool enough, Apple left the G4 behind b

          • My recollection from the time, was that it was pretty easily demonstrable that a Core Solo would walk a G4 like a dog, even in SIMD loads, in the notebook space.

            Maybe not the Core Solo (Yonah) but Core 2 Duo (Merom) was better at video codecs than AltiVec; it had a lot of very useful integer operations for that.
            Both the Core 1 and 2 beat the G4 on memory bandwidth, which is a major factor in getting good benchmarks in video processing.

            For compute and DNNs, the extra computational density of AltiVec is probably an advantage over SSSE3. Assuming you could do lots of computations while the vectors are held in registers and aren't dominated by memory bandwidth. Especi

            • Maybe not the Core Solo (Yonah) but Core 2 Duo

              Ya- the Solo too- but the reason for this was made pretty clear by the other person I was talking to.
              The G4 in the PowerBook was utterly fucking neutered to fit in thermal constraints, so comparing it to the CPU in the MacBook Pro requires some asterisks.
              The G4 in the PowerMac was far more formidable, though as we continued to discuss, the G4 and the G5 were still handily whooped by anything but the lowest Xeon in the Mac Pro.

      • by dgatwood ( 11270 )

        The Core Duo should have eaten any G4's lunch.

        First, notice I said G5. The Quad G5 was considerably faster on a lot of benchmarks than the second-generation (Core 2 Duo) MacBook Pro (Geekbench 2 [geekbench.com]). On some multi-core floating-point benchmarks, it was more than 10 times as fast.

        And the first-generation Mac Pro also lost [geekbench.com] rather badly.

        I think maybe the cores were slower, but the G5's memory bandwidth was so much faster that it ate Intel for breakfast for years afterwards.

        But even with the G4, you're not entirely correct [geekbench.com]. Yes, the Core 2 Duo was faster o

        • Fair- For the G5, we should be comparing the Mac Pro, of which the first generation still ate the G5's lunch, especially in SIMD instructions.
          Unless of course we pull some bullshit like you just tried to do, and compare the top-end G5 with the lowest-end Mac Pro with the 2GHz 5130.
          Let's try that again, using the base model Xeon [geekbench.com].
          Or, if we really want to blush, the top-end MacPro1,1 [geekbench.com].
          Dirty pool, dude.

          But you're right- we should be comparing the Core Duo with the G4, not the G5.
          • I'll actually dial back the dirty pool assertion, and some of that attitude. You did mention in a different comment to me that you compared the top to the bottom, but only because you couldn't find better.

            Again, I'll say, on a per core, and even a per-clock basis, the G5 was superior. There's no question.
            But the Intel offering had more cores, and higher clocks. Added up to a pretty devastating performance increase.
            • by dgatwood ( 11270 )

              But the Intel offering had more cores, and higher clocks. Added up to a pretty devastating performance increase.

              To be fair, that's only because Apple didn't want giant machines with a huge number of cores. The G5 was a Power4-based CPU, and the original Power4 design could handle up to 64 cores (32 chips with 2 CPUs), so the G5 could have been scaled up, too, and probably without a lot of difficulty.

              I'm pretty sure there's no technical reason why a G5-based design with more cores/CPUs couldn't have trounced what Intel was selling (other than perhaps the need for an office refrigerator to cool the thing, and perhaps

              • To be fair, that's only because Apple didn't want giant machines with a huge number of cores. The G5 was a Power4-based CPU, and the original Power4 design could handle up to 64 cores (32 chips with 2 CPUs), so the G5 could have been scaled up, too, and probably without a lot of difficulty.

                OK, that's not realistic, lol.
                You're talking about making Blue Gene racks, now.

                I'm pretty sure there's no technical reason why a G5-based design with more cores/CPUs couldn't have trounced what Intel was selling (other than perhaps the need for an office refrigerator to cool the thing, and perhaps the amount of effort on Apple's side required to make a northbridge that handled multiple elastic busses).

                I agree, Intel was not making supercomputers.
                However, it's not like the G5 scaled inherently better than the Xeons of the time. And no matter what, you weren't getting more than 2 cores per CPU until like POWER8.
                I'm sorry man, PPC lost the fight, on the merits. This is the same time period where Xeons started absolutely murdering POWER in the Top500 Supercomputers List.

                And of course, by the time the G5 came out, IBM's Power5 architecture (same time frame as the G5) could scale up to 128 cores (64 CPUs with 2 cores per CPU). I'm reasonably certain Apple could have used Power5 for desktops, but they were thoroughly screwed on the laptop front, because IBM couldn't deliver, and Freescale didn't want to bother, so they punted.

                Duuuuuuuuuuuuuuuude.
                64 CPUs? lol.

                Apple jilted PPC becaus

                • by dgatwood ( 11270 )

                  I'm pretty sure there's no technical reason why a G5-based design with more cores/CPUs couldn't have trounced what Intel was selling (other than perhaps the need for an office refrigerator to cool the thing, and perhaps the amount of effort on Apple's side required to make a northbridge that handled multiple elastic busses).

                  I agree, Intel was not making supercomputers.

                  However, it's not like the G5 scaled inherently better than the Xeons of the time. And no matter what, you weren't getting more than 2 cores per CPU until like POWER8.

                  True, but to be fair, IIRC, cores versus chips was also somewhat more critical on Intel because it didn't have an elastic bus with multiple fetches in flight, so it makes sense that they moved to cores with shared caching closer to the chip sooner than IBM did.

                  Apple jilted PPC because they couldn't make one go 3GHz, and they didn't like seeing benchmarks of Xeons spanking them on the fanny in single-core performance. PPC just wasn't keeping up. And it wasn't going to ever again.

                  Fair point. The speed bump on general UI behavior after switching was huge, largely because so much code is single-threaded. I'm not saying it wasn't a good change in terms of overall usability, just that for certain workloads (audio in particular), it

              • I will concede this, though:
                If Apple had picked up PA Semi sooner and aimed them at POWER architecture parts, they would have stayed with them.
                POWER couldn't go much past 1.5-2GHz, so it was always going to be a losing battle. Apple wasn't big enough to justify engaging in one-off MHz wars. With PA Semi, they wouldn't have needed to be.

                So the world could have been a different place- Apple Silicon could have ended up being POWER instead of arm.
                But, it didn't. And POWER is basically relegated to the dust bins of hi
  • Of course it did. (Score:4, Interesting)

    by eclectro ( 227083 ) on Monday March 24, 2025 @09:06PM (#65257017)

    The PowerPC architecture came from Big Blue in an era when it was still defining technology and wanted to leapfrog anything Intel had. I'd say it was a success, except that gaming and legacy barnacles remained the primary drivers of the majority of x86 desktop use. It was very forward thinking of Jobs to adopt the PowerPC processor. It could have ruled the roost if the x86 hegemony could have been broken at the time.

    What this does demonstrate is how important floating point performance is these days (something IBM spared no expense on with the PowerPC). If your micro doesn't have it now (I'm looking at you, RISC-V), it's nothing more than a microcontroller!

    • The PowerPC transition happened before Jobs came back. Original Power Macintosh 6100 / 7100 / 8100 was released in March of 1994. Jobs came back with the acquisition of NeXT in 1997.

      And the only reason they ditched it was that IBM wasn't concerned with making a power-efficient version that was acceptable for use in a notebook. You'll notice there was never a PowerBook G5. That would have doomed Apple in a world that increasingly wanted portable computing, and was less interested in desktop workstati

    • It could have ruled the roost if the x86 hegemony could have been broken at the time.

      Intel didn't win on architecture quality, it won on manufacturing. And they won on manufacturing because they put a LOT more money into it than everyone else.

    • It was very forward thinking of Jobs to adopt the PowerPC processor.

      It wasn't, because it was a limiting factor. When they got a big piece of the market, they ran into the problem of not being able to get enough chips, and they had to switch to something else. It would have been forward thinking of him to go to Intel sooner.

      What this does demonstrate is how important floating point performance is

      What this demonstrates is how important vector math is these days. That's what AltiVec is: a vector unit. It's in the name!

  • If one has the patience, pretty much anything that can do matrix math can (eventually) handle a basic LLM. I do applaud the person who did this, but I'm not sure of the point, other than the same reason someone sticks Doom on a pregnancy test kit.

    • by allo ( 1728082 )

      Maybe it shows that the whole "AI will need nuclear power plants!" thing is a bit overblown. Things are getting optimized, and of course everybody wants to cut costs and reuse old (though most of the time not that old) hardware. Programming in Python is quick and good for a lot of things, but optimized C code can still beat it. First comes the design and the prototype, then the optimization and the fast implementation. DeepSeek released quite a few techniques that allowed them to use older hardware (as they are embargoed on

      • There's a difference between training an AI with a few thousand video cards,
        and running a ready-made model on your laptop - which does not even need the GPU.

        • by allo ( 1728082 )

          Or maybe the training also got more efficient. You can train networks on your home PC today that required data centers a few years ago. A good GPU is a good idea for that, though we're still talking about Nvidia desktop cards, not data center cards.

      • Running the model (inferencing? not sure of the terminology) is not the "we'll need nukes" thing, even if it is technically much more power hungry than a non-LLM solution (if such a thing exists). Generating the models in the first place is the power hog, where you task an entire datacentre full of power-guzzling GPUs for months to make a few hundred gig or more of weights that then "only" peg your consumer CPU/GPU for a few seconds, once quantised enough to fit in your RAM.
  • That's the AI you don't want or need - and it seems that you don't need to upgrade the hardware either

  • I did something similar to this, running TinyLlama on my 20+ year old Linux workstation. The system configuration consists of a Supermicro X6DAL-TB2 motherboard with two EM64T 3.6GHz Xeons, 12GB ECC DDR2 and a GeForce GX 475. Without the NVIDIA driver installed I was getting 1.5 tokens/sec. Once the driver was installed, I was getting 15+ tokens/sec. It was just a proof of concept and more of a "let's see what this old girl can do" experiment. Performance was even better with a Quadro P2000.
