Shared Chat
TinyGrad vs PyTorch

TinyGrad and PyTorch are both frameworks for building and training deep learning models.

TinyGrad is a lightweight deep learning framework developed by George Hotz, the founder of the tiny corp. It is designed to be minimalistic and expressive, letting developers write efficient code by eliminating boilerplate and unnecessary overhead. TinyGrad also has good ONNX support and aims to surpass ONNX Runtime on the ONNX compliance tests.

PyTorch, on the other hand, is a widely used and highly popular deep learning framework developed at Meta (formerly Facebook). It offers a rich set of features, extensive community support, and a mature, batteries-included API, and it has a much larger user base and more extensive documentation than TinyGrad.

In terms of performance, PyTorch is currently faster than TinyGrad on NVIDIA GPUs, with TinyGrad running about 5 times slower. However, Hotz has said there are specific reasons for this gap and that work is underway to close it. TinyGrad's approach of fusing operations and avoiding unnecessary memory round trips can provide performance benefits in some scenarios compared to eager PyTorch.

Ultimately, the choice between TinyGrad and PyTorch depends on the specific requirements and preferences of the developer. PyTorch offers a mature and feature-rich ecosystem with a large community, while TinyGrad aims to provide a minimalistic, efficient framework with strong ONNX support.

1
(someone): Just the NVIDIA open source driver and this open source repo can launch a CUDA kernel. So rewriting the user space runtime is doable. Rewriting the kernel driver, I don't even have docs. I don't have any docs for the GPU. Like it would just be a massive reverse engineering project. I wasn't complaining about it being slow. I wasn't complaining about PyTorch not compiling. I was complaining about the thing crashing my entire computer. It panics my kernel. And I have to wait five minutes while it reboots because it's a server motherboard and they take five minutes to reboot. So I was like, look, if you guys do not care enough to get me a decent kernel driver, there's no way I'm wasting my time on this, especially when I can use Intel GPUs. Intel GPUs have a stable kernel driver and they have all their hardware documented. You can go and you can find all the register docs on Intel GPUs. So I'm like, why don't I just use these now? There's a downside to them. Their GPU is $350. And you're like, what a deal. It's $350. You know, you get about $350 worth of performance. And if you're paying about 400 for the PCIe slot to put it in, right? Like between the power and all the other stuff, you're like, okay, nevermind.
2
(someone): For a long time, TinyGrad had a hard limit at a thousand lines of code. And what this would force you to do is really make sure you were not wasting lines. I got rid of the restriction because it became a little code golfy at the end, but once the core framework of TinyGrad was there in those thousand lines, the core framework, the ideas are expressed with no boilerplate. If you go read PyTorch, PyTorch I think is actually pretty good code. I think Facebook's pretty good, but there's so much boilerplate. Go in PyTorch and try to track down how an LGU actually works.
swyx: Just a lot of distractions.
(someone): Oh, you're going to be diving down a long stack from Python to C to custom libraries to dispatchers to, and then I don't even know how to read TensorFlow. I don't even know where's an LU in TensorFlow. Nobody knows. Someone at Google knows, maybe. Google as an organism knows. I don't know if anyone individual at Google knows.
Alessio Fanelli: What are like the important ergonomics, like for a developer, as you think about designing the TinyGrad API?
(someone): So the TinyGrad front end looks very similar to PyTorch. There's an even higher level front end you can use for TinyGrad, which is just ONNX. We have better support for ONNX than Core ML does.
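For context, here is a minimal sketch of what that PyTorch-like front end looks like in practice: a tiny two-layer network and one optimizer step. The import paths follow the tinygrad repo's nn package but have shifted between releases, so treat the exact paths and names as assumptions rather than a stable API.

```python
# Hedged sketch of TinyGrad's PyTorch-like front end: a tiny two-layer net and
# one SGD step. Import paths are assumptions; they have moved between releases.
from tinygrad.tensor import Tensor
from tinygrad.nn import Linear
from tinygrad.nn.optim import SGD

class TinyNet:
    def __init__(self):
        self.l1 = Linear(784, 128)
        self.l2 = Linear(128, 10)

    def __call__(self, x: Tensor) -> Tensor:
        return self.l2(self.l1(x).relu())

net = TinyNet()
params = [net.l1.weight, net.l1.bias, net.l2.weight, net.l2.bias]
opt = SGD(params, lr=3e-4)

Tensor.training = True                              # recent tinygrad optimizers expect this flag
x, y = Tensor.rand(32, 784), Tensor.rand(32, 10)    # stand-in batch, purely illustrative
loss = ((net(x) - y) ** 2).mean()                   # same shape as a torch training loop

opt.zero_grad()
loss.backward()
opt.step()
```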
3
(someone): That's not what's real. GRAPH=1 will show you the actual kernels that were dispatched to the GPU. You can also type DEBUG=2, which will print those kernels out in your command line, and it will tell you the exact number of flops and the exact number of memory accesses in each kernel. So you can immediately see, wait a second, okay, this kernel used this many flops, this was the gigaflops. This is how many bytes it read, and this was the gigabytes per second. And then you can profile without having to, like... okay, I mean, in theory in PyTorch, sure, use the NVIDIA Nsight profiler. No one does that. No one does, of course, because it's so difficult, right? Actually, NVIDIA used to, I think CUDA 9 was the last one they had. They had a command line one, but now it's like, okay, I'm going to generate this blob, use this NVIDIA GUI tool to convert it into a Chrome trace and then load it. Yeah, no one does that, right? Just type DEBUG=2 in any TinyGrad model, and it will show you all the kernels that it launches and the efficiency of each kernel, basically.
swyx: Yeah, this is something that John Carmack has often commented about, is that when you code, you need to build in your instrumentation or observability right into that.
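As a concrete illustration of that workflow, here is a minimal sketch. DEBUG and GRAPH are environment variables that TinyGrad reads, so they go on the command line; the script itself just does enough work to produce a kernel. Shapes and file name are placeholders.

```python
# profile_matmul.py -- hedged sketch of the DEBUG=2 / GRAPH=1 workflow described above.
# The env vars are read by TinyGrad, so set them when launching, e.g.:
#   DEBUG=2 python profile_matmul.py   # print each kernel with its FLOPs and bytes moved
#   GRAPH=1 python profile_matmul.py   # dump a graph of the kernels that were dispatched
from tinygrad.tensor import Tensor

a = Tensor.rand(1024, 1024)
b = Tensor.rand(1024, 1024)
out = (a @ b).relu()

out.realize()   # force the lazy graph to compile and launch, so the kernels show up
```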
4
(someone): I checked the matrix by hand, it matches TinyGrad, I don't understand. And I switch back to CPU and I'm like, oh. There are bugs, like if you transposed the matrix, because I think it has these multi-views in PyTorch and weird under-the-hood stuff that's not exposed to you. There's bugs, and maybe they've fixed them, but you know, it seems like there was a lot of momentum. Again, because, how many engineers care about making PyTorch work on M1? Thousands, tens of thousands. And you have an open development process. And guess what? It's going to be good. How many engineers care about AMD working? You got 10 guys that work for AMD. And then like a couple hobbyists.
swyx: You revealed an interesting detail about how you debug. You hand check the matrix math.
(someone): No, I don't hand check it. One of the best tests in TinyGrad is a file called testops.py. And it's just a hundred small examples written in TinyGrad and PyTorch. And it checks both the forwards and backwards to make sure they match.
swyx: the test suite. Yeah, very important.
(someone): That's, I mean, that's one of them where, like... I really put a lot of effort into CI for TinyGrad. I think CI is super important.
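To make that testing pattern concrete, here is an illustrative sketch in the spirit of what testops.py is described as doing: run the same small op in TinyGrad and PyTorch and check both the forward value and the gradients. This is not the actual file; the op and tolerances are placeholders.

```python
# Illustrative sketch of a forward+backward parity check against PyTorch,
# in the spirit of testops.py. Not the real file; op and tolerances are placeholders.
import numpy as np
import torch
from tinygrad.tensor import Tensor

def check_against_torch():
    x_np = np.random.randn(4, 4).astype(np.float32)
    y_np = np.random.randn(4, 4).astype(np.float32)

    # PyTorch reference: forward and backward.
    xt = torch.tensor(x_np, requires_grad=True)
    yt = torch.tensor(y_np, requires_grad=True)
    loss_t = (xt * yt + yt).sum()
    loss_t.backward()

    # Same computation in TinyGrad.
    xg = Tensor(x_np, requires_grad=True)
    yg = Tensor(y_np, requires_grad=True)
    loss_g = (xg * yg + yg).sum()
    loss_g.backward()

    # Forwards and backwards must match.
    np.testing.assert_allclose(loss_g.numpy(), loss_t.detach().numpy(), atol=1e-5)
    np.testing.assert_allclose(xg.grad.numpy(), xt.grad.numpy(), atol=1e-5)
    np.testing.assert_allclose(yg.grad.numpy(), yt.grad.numpy(), atol=1e-5)

check_against_torch()
```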
5
(someone): There's an even higher level front end you can use for TinyGrad, which is just ONNX. We have better support for ONNX than Core ML does. And we're going to have, I think we're going to pass ONNX Runtime soon too. Like people think ONNX Runtime, that's a gold standard for ONNX. No, you can do better. Pass them in what specifically? Test compliance tests. So ONNX has a big set of compliance tests that you can check out, and we have them running in TinyGrad, and there's some failures. We're below ONNX Runtime, but we're beyond Core ML. So that's where we are in ONNX support now, but we will pass ONNX Runtime soon, because it becomes very easy to add ops, because you don't need to do anything at the lower levels. You just do it at this very high level, and TinyGrad compiles it to something that's fast using these minimal ops. Most concretely, what TinyGrad can do that PyTorch can't really do is, if you have something like A times B plus C, right? If you write that in naive PyTorch, what it's going to do on the GPU is, well, read A, read B in a kernel, and then store A times B in memory, and then launch another kernel to do A times B plus C. Okay, got to do those loads from memory.
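For a sense of what that ONNX front end looks like from the outside, here is a hedged sketch. The tinygrad repo has shipped an ONNX runner (historically under extra/onnx.py), but treat the import path, the get_run_onnx signature, and the "model.onnx" / "input" names below as assumptions for illustration only.

```python
# Hedged sketch of running an ONNX model through TinyGrad. The helper shown here has
# historically lived in the tinygrad repo (extra/onnx.py); the import path and
# signature are assumptions, and "model.onnx" / "input" are placeholders.
import numpy as np
import onnx
from extra.onnx import get_run_onnx  # assumption: run from a tinygrad repo checkout

model = onnx.load("model.onnx")
run_onnx = get_run_onnx(model)       # returns a callable that executes the graph

# Input names must match the names in the ONNX graph; this one is illustrative.
outputs = run_onnx({"input": np.zeros((1, 3, 224, 224), dtype=np.float32)})
print(list(outputs.keys()))
```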
6
(someone): You're getting the same laziness, but you also can't get fusion, because PyTorch doesn't know that I'm then going to do plus C. There's no way for it to be like, whoa, whoa, whoa, don't launch that CUDA kernel, just do this one too, right? PyTorch is working on this, and, you know, it's a little bit harder. Like, at comma I felt like I was competing against a lot of idiots; here I'm competing against very smart people who've made some, I think, different trade-offs. Whereas if you're trying to build something that is just straight-up good on NVIDIA, and you have a lot of people and complexity to throw at it, yeah, PyTorch made a lot of the right choices. I'm trying to build something that manages complexity. You can always make your software do more. The magic is when you can make your software do more without adding complexity, because complex things eventually collapse under their own weight. How does fusing actually work? There's this thing called lazy.py. And when you do, like, A times B, it's put into a graph, but it's a very local graph. There's no global graph optimizations. And even this can change, right? Again, the programming model for TinyGrad does not preclude eagerness, right? Laziness is not guaranteed laziness. It's just going to try its best.
7
(someone): I have two minor things merged into PyTorch because it's very responsive, you know?
Alessio Fanelli: So that's kind of like the lowest level of the stack. And then at a slightly higher level, obviously there's TinyGrad, there's Mojo, there's GGML. How are you thinking about breadth versus like depth and like where you decided to focus early on?
(someone): So GGML is very much like a, okay, everyone has M1s, right? Actually, I was thinking. In the beginning, I was thinking of something more like GGML focused on the M1s, but GGML showed up and was just like, we're actually just focusing on the M1s. And actually, M1 PyTorch is considerably better than AMD PyTorch. M1 PyTorch works, it only gives wrong answers sometimes, and it only crashes sometimes. But like, some models kind of run. When I was writing the Metal back end, I was comparing to MPS PyTorch, and I had like a discrepancy. TinyGrad checks all its outputs compared to Torch. And I had one where it didn't match. I'm like, I checked the matrix by hand, it matches TinyGrad, I don't understand. And I switch back to CPU and I'm like, oh.
8
(someone): Laziness is not guaranteed laziness. It's just going to try its best. So you put in A times B, and that's a binary op, right? And then you put in A times B, like that's a node in the graph. It's a virtual node because it's not realized yet, plus C. Okay, here's a new node, which takes the C tensor in here and takes the output of A times B. It's like, whoa, wait, there's two binary ops. Okay, we'll just fuse those together. Okay, here I have a kernel. This kernel has A, B, and C as inputs. It does A times B plus C in the local registers, and then outputs that to memory. And you can GRAPH=1 in TinyGrad. Another amazing thing that TinyGrad has that I've not seen in any other framework is two things. GRAPH=1, which is an environment variable. It will output a complete graph of all the operations. A lot of people are like, oh, you can use PyTorch, export it to ONNX, and use Netron. Yeah, you can. But like, what? That's not what's real. GRAPH=1 will show you the actual kernels that were dispatched to the GPU.
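A minimal sketch of that behavior, assuming the current tinygrad Tensor API: nothing runs when the expression is built, and realize() is what forces the single fused kernel.

```python
# Hedged sketch of the laziness described above: a*b records a virtual node,
# "+ c" extends the same local graph, and nothing hits the GPU until realize().
from tinygrad.tensor import Tensor

a = Tensor.rand(256, 256)
b = Tensor.rand(256, 256)
c = Tensor.rand(256, 256)

out = a * b + c   # two binary ops recorded lazily; no kernel launched yet
out.realize()     # one fused kernel: reads a, b, c and writes a*b + c to memory

# Running this under DEBUG=2 should show a single elementwise kernel for the line above.
```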
9
(someone): Just type debug equals two in any TinyGrad model, and it will show you all the kernels that it launches and the efficiency of each kernel, basically.
swyx: Yeah, this is something that John Carmack has often commented about, is that when you code, you need to build in your instrumentation or observability right into that. I wonder if whatever John is working on, he's adopting this style and maybe we can sort of encourage it by, I don't know, naming it and coining a certain kind of debugging style.
(someone): If you would like to start contributing to TinyGrad, I'd be so happy. I've chatted with him a few times. I'm not really sure what his company's doing, but no, I mean, hopefully we get TinyGrad to a point where people actually want to start using it. So TinyGrad right now is uncompetitive on, it's uncompetitive on NVIDIA and it's uncompetitive on x86.
swyx: And specifically, what do you care about when you say uncompetitive?
(someone): Speed. Sheer speed. It's correct. The correctness is there. The correctness for both forwards and backwards passes is there. But on NVIDIA, it's about 5x slower than PyTorch right now. Like, 5x, wow, this is insurmountable. No, there's reasons it's 5x slower, and I can go through how we're going to make it faster.
10
(someone): You can write, most concretely... what TinyGrad can do that PyTorch can't really do is, if you have something like A times B plus C, right? If you write that in naive PyTorch, what it's going to do on the GPU is, well, read A, read B in a kernel, and then store A times B in memory, and then launch another kernel to do A times B plus C. Okay, got to do those loads from memory. And now I did a whole extra round trip to memory that I just didn't have to do. And you're like, yeah, but you can use the Torch JIT and it corrects this. Yeah, for that one example, for that one example of malloc, but, oh, now you did three multiplies, six multiplies? It won't compile arbitrary code.
swyx: And have you looked into the other approaches like PyTorch Lightning to accelerate PyTorch itself?
(someone): Well, PyTorch Lightning, my understanding is it's mostly a framework around PyTorch, right? PyTorch Lightning is not going to fix this fundamental problem of I multiply six tensors together. Why is it going to memory any more than a single read from each and a single write to the output? There are lower-level things in PyTorch that are... I'm not exactly sure what Dynamo does, but I know they're generating some Triton stuff, which is going to generate the kernels on the fly. But you know, PyTorch Lightning is at a higher level of abstraction.
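For contrast, here is a hedged sketch of the eager-PyTorch case described above, plus the Torch JIT caveat: in eager mode A times B is materialized to memory before C is added, while scripting the function gives the compiler a chance to fuse the elementwise chain for this specific pattern.

```python
# Hedged sketch of the eager-PyTorch round trip described above. In eager mode,
# a*b is written to a temporary tensor before a second kernel adds c; scripting
# the function lets the JIT fuse the elementwise chain for this specific pattern.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.rand(256, 256, device=device)
b = torch.rand(256, 256, device=device)
c = torch.rand(256, 256, device=device)

out_eager = a * b + c   # two kernels and one temporary buffer in eager mode

@torch.jit.script
def fused(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    return a * b + c    # the JIT can fuse this into one elementwise kernel on GPU

out_scripted = fused(a, b, c)
assert torch.allclose(out_eager, out_scripted)
```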
11
(someone): That's, I mean, that's one of them where, like... I really put a lot of effort into CI for TinyGrad. I think CI is super important. Like, I want that green check to mean I can merge this. Yeah, right. I don't want my tests to... and if you somehow managed to introduce a bug and still get the green check, okay, we're fixing the test, top priority. Mojo? It's closed source. No, I'm not that interested. Do you know what I mean? Like, look, I like Chris Lattner. I think he's going to do great things. And I understand the, like, kind of the wisdom even in keeping it closed source. But, you know, I'm interested when it's open.
swyx: Yeah. You have an interesting design deviation from him because he's decided to be, well, promised to be a superset of Python. And you have decided to break with PyTorch APIs. And I think that affects learnability and transportability of code.
(someone): You know, if the PyTorch thing ends up being like a stumbling block, I could write a perfect PyTorch. Instead of, like, import torch, you type import tinytorch as torch. And if that really becomes the stumbling block, I will do that.
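Purely as an illustration of that idea, here is what the first few lines of such a shim might look like. No tinytorch package is implied to exist; every name below is hypothetical and covers only a tiny slice of the torch surface.

```python
# tinytorch.py -- purely hypothetical sketch of the "import tinytorch as torch" idea.
# No such package is implied to exist; names are made up for illustration only.
from tinygrad.tensor import Tensor

def tensor(data, requires_grad: bool = False) -> Tensor:
    return Tensor(data, requires_grad=requires_grad)

def rand(*shape) -> Tensor:
    return Tensor.rand(*shape)

def matmul(a: Tensor, b: Tensor) -> Tensor:
    return a @ b

# Usage (hypothetical): a script written against torch-style names keeps working
# after swapping the import:
#   import tinytorch as torch
#   x = torch.rand(4, 4)
#   y = torch.matmul(x, x)
```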