

DeepSeek-V3 Now Runs At 20 Tokens Per Second On Mac Studio 90
An anonymous reader quotes a report from VentureBeat: Chinese AI startup DeepSeek has quietly released a new large language model that's already sending ripples through the artificial intelligence industry -- not just for its capabilities, but for how it's being deployed. The 641-gigabyte model, dubbed DeepSeek-V3-0324, appeared on AI repository Hugging Face today with virtually no announcement (just an empty README file), continuing the company's pattern of low-key but impactful releases. What makes this launch particularly notable is the model's MIT license -- making it freely available for commercial use -- and early reports that it can run directly on consumer-grade hardware, specifically Apple's Mac Studio with M3 Ultra chip.
"The new DeepSeek-V3-0324 in 4-bit runs at > 20 tokens/second on a 512GB M3 Ultra with mlx-lm!" wrote AI researcher Awni Hannun on social media. While the $9,499 Mac Studio might stretch the definition of "consumer hardware," the ability to run such a massive model locally is a major departure from the data center requirements typically associated with state-of-the-art AI. [...] Simon Willison, a developer tools creator, noted in a blog post that a 4-bit quantized version reduces the storage footprint to 352GB, making it feasible to run on high-end consumer hardware like the Mac Studio with M3 Ultra chip. This represents a potentially significant shift in AI deployment. While traditional AI infrastructure typically relies on multiple Nvidia GPUs consuming several kilowatts of power, the Mac Studio draws less than 200 watts during inference. This efficiency gap suggests the AI industry may need to rethink assumptions about infrastructure requirements for top-tier model performance. "The implications of an advanced open-source reasoning model cannot be overstated," reports VentureBeat. "Current reasoning models like OpenAI's o1 and DeepSeek's R1 represent the cutting edge of AI capabilities, demonstrating unprecedented problem-solving abilities in domains from mathematics to coding. Making this technology freely available would democratize access to AI systems currently limited to those with substantial budgets."
"If DeepSeek-R2 follows the trajectory set by R1, it could present a direct challenge to GPT-5, OpenAI's next flagship model rumored for release in coming months. The contrast between OpenAI's closed, heavily-funded approach and DeepSeek's open, resource-efficient strategy represents two competing visions for AI's future."
"The new DeepSeek-V3-0324 in 4-bit runs at > 20 tokens/second on a 512GB M3 Ultra with mlx-lm!" wrote AI researcher Awni Hannun on social media. While the $9,499 Mac Studio might stretch the definition of "consumer hardware," the ability to run such a massive model locally is a major departure from the data center requirements typically associated with state-of-the-art AI. [...] Simon Willison, a developer tools creator, noted in a blog post that a 4-bit quantized version reduces the storage footprint to 352GB, making it feasible to run on high-end consumer hardware like the Mac Studio with M3 Ultra chip. This represents a potentially significant shift in AI deployment. While traditional AI infrastructure typically relies on multiple Nvidia GPUs consuming several kilowatts of power, the Mac Studio draws less than 200 watts during inference. This efficiency gap suggests the AI industry may need to rethink assumptions about infrastructure requirements for top-tier model performance. "The implications of an advanced open-source reasoning model cannot be overstated," reports VentureBeat. "Current reasoning models like OpenAI's o1 and DeepSeek's R1 represent the cutting edge of AI capabilities, demonstrating unprecedented problem-solving abilities in domains from mathematics to coding. Making this technology freely available would democratize access to AI systems currently limited to those with substantial budgets."
"If DeepSeek-R2 follows the trajectory set by R1, it could present a direct challenge to GPT-5, OpenAI's next flagship model rumored for release in coming months. The contrast between OpenAI's closed, heavily-funded approach and DeepSeek's open, resource-efficient strategy represents two competing visions for AI's future."
Re: (Score:2)
That might be true, but wouldn't a code audit reveal such problems?
Re:Beware of Pooh's Bearing gifts (Score:4, Funny)
I'm pretty sure we could just ask an AI if the codebase is safe. I can't think of any downside to that.
Re: (Score:2, Insightful)
By the time audit is done version 4 will be out.
And? Will that have changed the code you already downloaded? Once the people have their hands on the code, it doesn't matter how many revisions the Chinese make to the live stuff.. We'll know if they're trying to pull a fast one by examining what we have now.. and then looking at the next one and the one after that...
Are you being obtuse on purpose? Doubling down just makes you look like a clown (intentional rhyme).
Re:Beware of Pooh's Bearing gifts (Score:5, Informative)
But the code is the deep neural network... the ultimate word in obfuscation.
The only way to "inspect the code" is to inspect all the training data and retrain the LLM yourself.
Re: (Score:2)
Yes, the code is the network.
But the code doesn't run on your computer.
It runs in an interpreter that does exactly 2 things. Computes attention layer, and FMAs the feed-forward layer.
There is one thing, and one thing only this interpreter can do- generate text.
If you were to, say pipe that text into a terminal or something, then yes- one might be concerned about what the "code" will do.
But piping it to your screen? I think that's probably pretty safe.
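To make that concrete, here is a toy numpy sketch of the two per-layer operations the parent describes: an attention step and a feed-forward step. The shapes, initialization, and ReLU choice are illustrative only, not DeepSeek-V3's actual (mixture-of-experts) architecture:

```python
# Toy sketch of one transformer block: attention, then feed-forward.
# Illustrative shapes only; not the real DeepSeek-V3 architecture.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v                     # weighted sum of values

def feed_forward(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2   # matmul, ReLU, matmul (the "FMAs")

d, seq = 64, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((seq, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d, 4 * d)) * 0.1
W2 = rng.standard_normal((4 * d, d)) * 0.1

x = x + attention(x, Wq, Wk, Wv)   # residual + attention
x = x + feed_forward(x, W1, W2)    # residual + feed-forward
print(x.shape)   # final hidden states; a real model maps these to token probabilities
```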
Re: (Score:3)
Are you offering to audit the 641GB of code?
Re: (Score:2)
No. But then I'm not using the model either, nor do I expect anyone else to do so.
Re: (Score:3)
Re: Beware of Pooh's Bearing gifts (Score:5, Informative)
If you run the model locally how is it going to leak any of that data back to China?
Re: Beware of Pooh's Bearing gifts (Score:4, Informative)
If you run the model locally how is it going to leak any of that data back to China?
The model is essentially data. You run an engine (many available on GitHub) that understands the model's underlying structure, converts your question into tokens, and performs numerous matrix operations to generate a sequence of tokens as the answer.
Essentially, by using a trusted GitHub engine (typically consisting of a few thousand lines of code), the only way the model could leak information to the Chinese would be if the model's data were sophisticated enough to craft responses that subtly persuade you to share sensitive information.
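A minimal sketch of the separation described above, using llama-cpp-python as one example of a small, inspectable engine; the package choice and the GGUF file path are assumptions for illustration, not a claim about what DeepSeek ships:

```python
# Sketch: the model file is inert data; a small local engine does all the work.
# Uses llama-cpp-python as an example engine (pip install llama-cpp-python);
# the GGUF path is hypothetical -- point it at whatever quant you actually downloaded.
from llama_cpp import Llama

llm = Llama(model_path="./deepseek-v3-0324-q4.gguf", n_ctx=4096)  # hypothetical path

out = llm("What is 2 + 2?", max_tokens=32)
print(out["choices"][0]["text"])

# Nothing above opens a socket: tokenize -> matrix math over the weights -> detokenize.
# Block the process's network access (firewall, container, airgap) if you want to verify.
```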
Re:Beware of Pooh's Bearing gifts (Score:4, Insightful)
That physically can't happen when run locally. That said, it does contain the Great Firewall of China, rather crudely shoehorned into the finetune (to the point where it sometimes suddenly switches from first person to using the phrase "We" when speaking from the perspective of the CCP).
That said, a lot of this article summary is nonsensical hype. For example:
If you're running on an NVidia server that takes kilowatts of power, you're going to get WAY better performance than 20 tokens per second. A B200 is 20 petaflops (20000 teraflops) at fp4 precision. M3 Ultra is 115 teraflops at fp16 (AFAIK it doesn't accelerate lower precisions than that faster than fp16) - 25% slower than an outdated Nvidia RTX 3090 gaming card. These things are not the same.
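Taking the post's own figures at face value (and noting that fp4 and fp16 are not directly comparable), the raw compute gap works out roughly like this:

```python
# Rough ratio using the figures quoted above (fp4 vs fp16, so not apples-to-apples).
b200_fp4_tflops = 20_000      # 20 petaflops at fp4, per the post
m3_ultra_fp16_tflops = 115    # per the post

print(b200_fp4_tflops / m3_ultra_fp16_tflops)   # ~174x difference in raw throughput
```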
Re: (Score:3)
Eh, you're missing the point. The comparison you quote is only nonsensical if you recontextualize it with a strawman like you've done.
People are excited about the Mac solution because we're approaching usable non-cloud AI at home. Running a B200 is simply not even in the running -- price and power consumption make it a non-starter. The article is not suggesting that performance per watt is better, just that total power consumption is far lower.
Re: (Score:2)
Like the comparison between an M3 Ultra and an RTX3090.
First, you need 7 RTX3090s. 1 isn't going to do you any good.
Each RTX3090 is going to be about 17% faster than the M3 Ultra (FP16 compute isn't the limiting factor - the model is way too big to fit in cache - memory bandwidth is the bottleneck, period, full stop).
And since the transformer blocks have to be run sequentially, in the end, you're looking at 7 RTX3090s for a +17% performance increase.
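For reference, the ~17% figure matches the published memory-bandwidth specs, which is the whole point of the bandwidth argument: the RTX 3090's 936 GB/s versus the ~800 GB/s used for the M3 Ultra elsewhere in this thread:

```python
# Bandwidth-bound inference: tokens/s scales with memory bandwidth, not FLOPS.
rtx3090_bw_gbps = 936    # GDDR6X, published spec
m3_ultra_bw_gbps = 800   # unified memory figure used in this thread

print(rtx3090_bw_gbps / m3_ultra_bw_gbps - 1)   # ~0.17 -> about 17% faster per card
```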
Re: (Score:2)
Not nearly as impressive.
Re: (Score:2)
Are you having fun straw manning? This was about the article's claim that Deepseek + Mac M3 studio was some sort of super fast, efficient way to run a model. The mention of 3090 was only to have a flops comparison point as to how poor the performance of the M3 studio is. Nobody is saying that you can fit DeepSeek onto a 3090; that's a complete derailing of the topic. The topic is about how M3 studio's performance compares to NVidia servers that use "several kilowatts of power". Aka, not a gaming card like the 3090.
Re: (Score:2)
Are you having fun straw manning?
Actually, that's what you did.
This was about the article's claim that Deepseek + Mac M3 studio was some sort of super fast, efficient way to run a model.
The super-fast came from your imagination. As for efficiency? I don't think that can reasonably be argued against. It is vastly more efficient.
The mention of 3090 was only to have a flops comparison point as to how poor the performance of the M3 studio is.
It's not poor at all, particularly in the context that it can do things you need 7 3090s to do.
Is its performance 14% less than that of a 3090 burning 225% the power for models under 24GB? Yes. It is.
that's a complete derailing of the topic.
I see it rather as a mirror of exactly what you're trying to do.
The topic is about how M3 studio's performance compares to NVidia servers that use "several kilowatts of power". Aka, not a gaming card like the 3090.
Indeed. One wonders why you even brought it up. Some kind of bizarrely aimed dig?
Re: (Score:2)
Huh, I must have imagined that the summary said "The new DeepSeek-V3-0324 in 4-bit runs at > 20 tokens/second on a 512GB M3 Ultra with mlx-lm!" as if this is some sort of extreme performance figure. They even included an exclamation point for good measure. Or did I imagine that too?
Utter nonsense. It has a literal order of magnitude worse fp4 TFLOPS per watt.
Re: (Score:2)
Huh, I must have imagined that the summary said "The new DeepSeek-V3-0324 in 4-bit runs at > 20 tokens/second on a 512GB M3 Ultra with mlx-lm!" as if this is some sort of extreme performance figure. They even included an exclamation point for good measure. Or did I imagine that too?
First, summary != article.
Now you're conflating the summary with the article in your own critique. Remember, you said:
"This was about the article's claim that Deepseek + Mac M3 studio was some sort of super fast, efficient way to run a model."
Utter nonsense. It has a literal order of magnitude worse fp4 TFLOPS per watt.
You're doing it again.
Confusing compute with memory bandwidth.
The topic here is inference. It is bounded by memory bandwidth, not compute performance.
YOU are the only person here suggesting the absurd notion of using 7 3090s. The 3090 was only brought up to give a grounding of the level of compute power. NOT as a VRAM comparison. NOT as a "suggested alternative implementation". The fact that this has been pointed out to you multiple times, and yet you persist, has moved this well into straw man territory. You've decided what scenario you actually want to argue about - a scenario that was never suggested - and persist in trying to argue about it rather than defend the simply false case that the M3 is higher performance than modern NVidia servers.
You brought up an irrelevant data point, and I pointed out the stupidity of it.
It's ok, we all do it sometimes.
Re: (Score:2)
Hey, let's play a little game called "scroll up in the thread": "That said, a lot of this article summary is nonsensical hype"
Literally my very first post in the thread.
That said, everything in the summary is from the article, including that quote, so it doesn't matter which one is referred to.
I'm not "confusing" anything. As was laid out in detail above, compute is maxed in actual real-world usage. Which is the reaso
Re: (Score:2)
Look, if you want an olive branch here: If you're looking for a local machine for inference of large models for under $10k instead of tens to hundreds of thousands of dollars... yeah, the M3 ultra IS a good option. I do not object to this - at all.
What I object to is the nonsensical claim that it is "fast" or "efficient" compared to modern NVidia servers. It is not. At all. Unless you're making lazy, contrived scenarios, that is.
Re: (Score:2)
That said, everything in the summary is from the article, including that quote, so it doesn't matter which one is referred to.
I wasn't trying to imply that it didn't come from the article, I was merely pointing out that you were doing a lot of heavy lifting on intent from the summary.
compute is maxed in actual real-world usage
Compute is never maxed in inference, unless your machine has slower compute units than memory (which, simply put, it doesn't.)
If your GPU is maxed during inference, then it's not a very good GPU. Applies to your CPU as well.
Simplified- if the maximum performance of your computer can be achieved while scanning over really fucking slow DRAM, then your
Re: (Score:2)
Power usage at 100% utilization:
Me: 85W
3090: 350W
W per t/s:
Me: 1.04
3090: 2.42
The M3 Max- 130% more efficient in W/t/s. I.e., over double.
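Spelling out the ratio from the numbers above:

```python
# Watts per (token/second), from the figures in the parent post.
m_series = 1.04   # W per t/s
rtx3090  = 2.42   # W per t/s

print(rtx3090 / m_series)   # ~2.33 -> roughly 130% more efficient, i.e. over double
```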
Re: (Score:2)
That said, a lot of this article summary is nonsensical hype. For example:
It's not hype. It's just... woefully deficient at explaining the fundamental problem.
Trying to put even a Q4 671B parameter model into VRAM is a bit of a trick.
You have a couple of choices, really- several kilowatts of power in GPUs, or a Mac Studio.
Now- it should be clearly explained that you get a lot more for that several kW of power than you get for a Mac Studio, but still the distinction is relevant.
M3 Ultra is 115 teraflops at fp16 (AFAIK it doesn't accelerate lower precisions than that faster than fp16)
I didn't try FP8, but my M4 Max on Gemma 3 27B INT8 gets 15.67 t/s, and at FP16 gets 8.62 t/s on the same query.
Re: (Score:2)
Re: (Score:2)
Run it off a very fast SSD array?
Remember- LLM performance is bounded by memory bandwidth, period.
If you can come up with 800GB/s worth of disk bandwidth- then yes, you can run it at a speed comparable to the M3 Ultra Mac Studio.
My SSDs in my MacBook Pro get about 6GB/s, so I'd need 134 of them striped to compete.
The problem then is what bus you'd connect that many disks to that doesn't have an even lower bandwidth limit.
So no, we're stuck with "VRAM matters."
Re: (Score:2)
Re: (Score:2)
Nobody is proposing to run Deepseek on 3090s. That was simply a point of comparison for how few flops M3 studio has. Literally orders of magnitude less than actual modern Nvidia servers that use "several kilowatts of power".
Re: (Score:2)
The fact that people are translating "it has compute power less than an old gaming card" as "we should run it on old gaming cards!" is frankly blowing my mind when both I, and the erroneous Slashdot summary, were talking about servers that use kilowatts of power.
Re: (Score:2)
Re: (Score:2)
I fully get that you want to entirely ignore the compute capability of the M3 and avoid having to discuss it at all costs, because it's embarrassingly slow by the standards of AI tasks, and yes, this VERY much matters in the real world. Because if I were trying to argue your side, I'd likewise be trying to avoid having to deal with discussing how few FP4 FLOPS the M3 has.
Re: (Score:2)
I fully get that you want to entirely ignore the compute capability of the M3 and avoid having to discuss it at all costs, because it's embarrassingly slow by the standards of AI tasks
What are you talking about?
Inference is memory-bandwidth-bound, period.
This is basic CPU architecture.
If you're scanning over a memory structure that's larger than cache, you're limited by the memory bandwidth. Period.
5 execution units or 30 - doesn't matter - you'll only use as many as you can keep fed.
You're taking this personally... Are you upset that people with Macs have better inference capabilities than you, or something? VRAM envy?
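For what it's worth, a rough back-of-the-envelope supports the bandwidth-bound view. Assumptions: DeepSeek-V3 activates roughly 37B of its 671B parameters per token (it is a mixture-of-experts model), weights are 4-bit, and memory bandwidth is the ~800 GB/s figure used elsewhere in this thread:

```python
# Back-of-the-envelope: each generated token has to stream the active weights from memory.
active_params = 37e9     # DeepSeek-V3 activates ~37B of its 671B parameters per token (MoE)
bytes_per_param = 0.5    # 4-bit quantization
bandwidth = 800e9        # bytes/s, the unified-memory figure used in this thread

bytes_per_token = active_params * bytes_per_param   # ~18.5 GB read per token
print(bandwidth / bytes_per_token)                  # ~43 t/s theoretical ceiling
# The observed ~20 t/s is within a small factor of that ceiling, which is why adding
# more FLOPS (more execution units) would not help much.
```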
Re: (Score:1)
TFA says it seems as powerful as commercial models, yet is "open source".
But it's probably Xi's way to inject and/or vacuum our content and prompts, so it's kind of equivalent to a commercial product.
So have a look at the code... That's kinda the whole point of "Open Source". Personally, I trust the Chinese about as far as I can throw them. But you have access to the code.... So why the need to speculate or guess? That's like trying to guess the circumference of the Earth... There's no need to.. You have access to the information.
Re: (Score:2)
Can you point out on the doll where the big huge fucking list of vectors touched you?
Re: (Score:2)
The article is about local inference.
Can you point out on the doll where the big huge fucking list of vectors touched you?
Truly LOL!
Re: (Score:2)
We must worry about the unknown unknowns here.
Re: (Score:2)
Re: (Score:2, Funny)
All the damned time. I mean, what if it is talking to some Chinese satellite over those solar batteries?
Re: (Score:2)
Re: (Score:2)
If worried, run it under a rootless Docker container, or if really concerned, run it in a Linux VM under Parallels.
If there is something in code from a public GitHub repo that can jump out of a container or a VM, it will likely be detected by others quickly.
Re:Beware of Pooh's Bearing gifts (Score:4, Insightful)
Re: (Score:2)
seems like a pcap would find that pretty quick.
Re: (Score:2)
Re: (Score:2)
Not vacuum anything, since it can run offline, but it is a way to have CCP propaganda distributed to the rest of the world. The model has censorship already built in. Ask it about events that happened at Tiananmen Square.
Consumer hardware (Score:3)
The Mac 512K had an introductory price of $3,195 (equivalent to $9,670 in 2024). I think the collapse of home computer prices in the 1990's and 2000's has altered what we think a reasonable price for "consumer hardware" is.
Re: (Score:2)
The Mac 512K was hardly "consumer hardware", unless you lived in a very wealthy area.
I only saw it in businesses and universities. Consumers had a Commodore 64 if they were lucky.
Re: Mac 512K.... (Score:2)
Were you making a joke reference?
They specifically referred to a Mac Studio with M3 Ultra chip.
Re: (Score:2)
I still think it is a miracle that for a few C-notes, one can pick up a generic mini PC for gaming. I think about how something like the Macintosh IIfx came with a price tag over $10,000... in 1990s dollars, and was completely obsoleted by cheaper Quadras a year or two later.
It was a decade where, on average, $2,500 was the "sweet spot" for a computer, and that was pretty much a basic home machine. However, the one thing those machines often had which current ones don't was a tape drive, so one could do tape backups.
Re: (Score:2)
My $200 smartphone can emulate DOS faster than my 486DX2-66. The power of Moore's law (and Taiwan chipset manufacturing).
The 1990s were a terrible time to invest in a high-end workstation, and a great time to chase the PC upgrade train. Between the falling prices and the speed bumps, you could get an affordable computer that was faster than last year's but also obsolete in 18-24 months.
I remember the Beyond 2000 TV show promising holographic storage in crystals. We should be able to store an exabyte in write-once crystals by now.
Re: (Score:1)
You can order all that online.
The simplest is probably a Blu-ray writer.
Regarding USB/Thunderbolt tape drives, it is probably only a matter of money; just Google them.
Re: (Score:2)
That's the one I have on my desk. Can't you see it?
Re: (Score:2)
In US customary units, it is 13 swimming pools per Rhode Island, or eight jelly donuts per bald eagle.
Re: (Score:2)
So many people call them Imperial units, when they are in fact US Customary.
Our gallons are 3.785 liters, not 4.546 ffs.
66 BAUD (Score:2)
Is 66 English characters a second fast?
How would you feel about a 66 BAUD modem?
Re:66 BAUD (Score:5, Interesting)
"baud" , named after Émile Baudot, is bits per second, not bytes. ... what?
Any yes, 20 tokens per second is good for native inference. I don't know where you get 20 tokens = 66 char from, but it sounds reasonable, and a lot faster than people read.
This is a machine smarter than most people at a huge number of tasks, for the price of a used car, and you are complaining about the speed?
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
FSK: 1 bit per baud
PSK: 2 bits per baud
Then they got into encodings with 4 bits per baud, and then they started using compression.
Re: (Score:3, Insightful)
Nope.
It is signaling steps (symbols) per second; it has nothing to do with bits or bytes per se.
For example, if you can transfer 4 bits with one signal change, and can produce and recognize 1000 signal changes (steps) per second, then you are transmitting at 1000 baud but 4000 bits per second.
That is why high-end modems on 4 kHz phone lines ran at a few thousand baud but transferred tens of kilobits per second.
Re: (Score:2)
"baud" , named after Émile Baudot, is bits per second, not bytes.
The baud rate is a measure of "symbols" per second, where a symbol might be more than one bit, in which case the baud rate is lower than the bit rate. It's not compression so much as denser modulation: a single state change on the wire can carry several bits, something dial-up modems used regularly in the good old days.
Re: (Score:2)
Whether I got the capitalization wrong or not, whatever. Baud is one change in a carrier signal per second. Kinda similar to a change in tokens in a token stream. It was never about bits or bits per second, or about equating characters to bits.
According to this API doc, 1 English character ≈ 0.3 tokens:
https://api-docs.deepseek.com/... [deepseek.com]
20 tokens per second ÷ 0.3 tokens per English character ≈ 66.7 English characters per second.
20 tokens/second (Score:3)
I'm not up on my AI jargon. How should I feel about 20 tokens/second? What does that mean to me as some kind of user?
Re: 20 tokens/second (Score:2)
This is the one thing the summary needed to include to be valuable.
That said, I've been running the distilled 14B model, and it runs favourably on my M1 Max Pro with 32GB.
Re:20 tokens/second (Score:4, Informative)
20 tokens/sec is faster than you can read, so it is very usable.
And note that this is not a reasoning model, so you won't be waiting ages for it to start the response proper.
Re: (Score:1)
o1 is $15 per M tokens. I'd stick with o1.
Re: (Score:2)
At 20 tokens/second, you do about 630M tokens/year, which at 630 × $15 is worth about $9,450 -- just about the value of the desktop computer you need to run this.
And while it is true that o1 is better than deepseek, it is also true that $15 is a heavily subsidised price. I'm sure it costs OpenAI more than $10k to run o1 for a full year, not to mention electricity cost.
The point being that AI can be commodified, in the sense of enabling small outfits to buy a bunch of servers and start competing with the big guys.
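The arithmetic behind those figures, spelled out:

```python
# Value of a year of continuous local generation, priced at o1's quoted $15 per million tokens.
tokens_per_year = 20 * 60 * 60 * 24 * 365   # 630,720,000, i.e. ~630M tokens
value = tokens_per_year / 1e6 * 15          # ~$9,460 (the post rounds 630 * 15 to $9,450)

print(tokens_per_year, value)
```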
Re: (Score:1)
Pricing is definitely not subsidized, since they offer deep batch discounts, but yeah, it's not clear yet how they will make money.
An Anecdote (Score:1)
Well, I'm not quite sure either, but let me tell you what I experienced on an older Intel iMac Pro (2017).
I loaded up the largest model possible just to see what it would do... I entered some initial question, I forget what, and then got about 10-20 minutes of a "thinking" message.
Then, I got... an "H".
A few minutes later... an "I".
Yeah, it took about 30 minutes to begin a message with "Hi"; I gave up after a few hours.
So 20 tokens a second is sounding pretty good compared with that!
Re: (Score:2)
It's a bit faster than you can read.
On the other hand, R1 (built upon V3 but as the thinking alternative and not the successor) creates a wall of text of reasoning before the answer part (even though the reasoning is often helpful to read), which introduces some waiting time if you have "only" 20 T/s. Still quite good for running such a large model on hardware you can afford.
Re: (Score:2)
A token is a portion of a word. It doesn't equate to individual letters, nor entire words, but something in between. The average is around 0.75 words per token, so four "average" words take about 5 tokens.
So 20 tokens per second is perfectly fine for a single person interactively chatting with the LLM. If you're doing any sort of larger data processing (feeding in large documents, outputting large documents, or multiple users) it's pretty slow.
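If you want to sanity-check the tokens-per-word rule of thumb on your own text, here is a quick sketch using OpenAI's tiktoken library as a stand-in tokenizer; DeepSeek uses its own tokenizer, so its counts will differ somewhat:

```python
# Rough tokens-per-word check; tiktoken is a stand-in, not DeepSeek's actual tokenizer.
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox jumps over the lazy dog and keeps on running."
tokens = enc.encode(text)
words = len(text.split())

# Prints token count, word count, and tokens per word for this sample.
print(len(tokens), words, len(tokens) / words)
```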
Re: (Score:1)
First, this new V3 replaces an older version of V3, not V2. Second, it runs faster because a 4-bit quant of DeepSeek-V3-0324 was released.
Tokens Per Second (Score:3)
Re: Thanks for the TPS Report (Score:2)
No kidding, how much is that in furlongs per fortnight?
Re: Thanks for the TPS Report (Score:2)
It's not Slashdot's fault you failed to keep up.
Re: (Score:1)
Seriously. If you don't know what a token is in 2025, why are you even commenting in a LLM thread?
Re: 42 (Score:2)
Darn, I forgot my towel.
What's in a name? (Score:2)
OpenAI's closed, heavily-funded approach.
What's in a name?
Re: (Score:2)
Re: (Score:2)
Goodwill :)