
This Algorithm is 1,606,240% FASTER

May 28, 2024
1.6 million percent faster — it's so fast it's hard to even show visually, because nothing appears on the chart. Today I will show you the seven steps that were taken to improve the execution time of an algorithm, but here is the kicker: the slowest version was O(N), and the fastest one is still O(N). Along the way I'll use common optimization techniques that you can apply in any application, and then we'll get a little exotic and enter the world of bit manipulation to really speed things up. Of course, compiler and SIMD optimizations are part of the story too — everything to go incredibly fast. But first, the problem. The problem is simple: it's from Advent of Code 2022. Find the first run of 14 distinct characters in a long string; once you find 14 distinct characters, you report the position directly after them. If you're not familiar with Advent of Code, it's a fun Santa-themed Christmas adventure where you solve daily programming problems.
A possible solution would be to simply use a hash set: take the first 14 characters, put them all into the hash set, and then check the length of the set. If it's 14, we're done; if not, we move forward one position and try again. And again. And again... until, oh my goodness, look — we have found 14 distinct characters. Now all we have to do is report the position directly after that window, and we have solved the Advent of Code puzzle. So here is the very, very simple code: we walk through the string 14 characters at a time, collect each window into a hash set, check for a length of 14, and boom, we're done. Now, a very simple optimization can be made.
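As a rough Rust sketch of that baseline (the function name and exact shape are my assumptions, not the code shown in the video):

```rust
use std::collections::HashSet;

// Baseline sketch: slide a 14-byte window over the input, build a
// HashSet from each window, and report the index just past the first
// window whose set has 14 distinct elements.
fn find_marker_hashset(input: &[u8]) -> Option<usize> {
    input
        .windows(14)
        .position(|w| w.iter().collect::<HashSet<_>>().len() == 14)
        .map(|i| i + 14)
}

fn main() {
    // Advent of Code 2022 day 6 sample input; the 14-distinct marker
    // ends at position 19.
    let sample = b"mjqjpqmgbljsphdztnvjfqwrcgsmlb";
    assert_eq!(find_marker_hashset(sample), Some(19));
}
```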

This first optimization takes us to a runtime that is 92 percent faster, and it is, of course, inserting characters one at a time: the moment a duplicate is detected, we don't process that window any further and move on to the next set of 14.
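In Rust that early exit can be sketched like this (again my approximation; `HashSet::insert` returns `false` on a duplicate, so `all` short-circuits):

```rust
use std::collections::HashSet;

// Early-exit sketch: insert bytes one at a time and abandon the window
// the moment `insert` reports a duplicate, instead of always hashing
// all 14 characters.
fn find_marker_early_exit(input: &[u8]) -> Option<usize> {
    input
        .windows(14)
        .position(|w| {
            let mut seen = HashSet::with_capacity(14);
            // `insert` returns false if the byte was already present,
            // so `all` stops at the first duplicate.
            w.iter().all(|b| seen.insert(b))
        })
        .map(|i| i + 14)
}

fn main() {
    assert_eq!(
        find_marker_early_exit(b"mjqjpqmgbljsphdztnvjfqwrcgsmlb"),
        Some(19)
    );
}
```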

I know, it's shocking: 92 percent faster for such a simple if statement. Let's move on to the next optimization level. This one will improve the execution time by 8.9 times — practically an order of magnitude — and you may be surprised how we do it: by using a vector instead of a hash set. That might seem a bit counterintuitive. You're thinking, why would checking each position in a vector somehow be faster than a hash set lookup? Aren't both constant time? Yes, a single vector lookup is also constant time, but I want you to think about something: Big O of constant time is kind of a lie, in the sense that it's actually Big O of one multiplied by some constant.
What is that constant when it comes to a hash set? It is significantly larger than a vector's constant, and that should make sense: with a hash set you have to compute a hash, derive an index into a large table, go to that entry, potentially scan a list of elements if there are collisions, and check for equality. A vector is still heap-allocated just like a hash set, but you simply follow a reference and compute an offset into a contiguous region of memory, which is super fast by comparison. And the code isn't much more complex: we create a vector instead of a hash set, and we simply call contains instead of checking the return value of insert.
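A minimal sketch of the vector variant (my reconstruction, not the exact on-screen code):

```rust
// Vec sketch: a linear `contains` scan over at most 14 bytes beats
// hashing at this tiny scale, even though both are "constant time".
fn find_marker_vec(input: &[u8]) -> Option<usize> {
    input
        .windows(14)
        .position(|w| {
            let mut seen: Vec<u8> = Vec::with_capacity(14);
            w.iter().all(|&b| {
                if seen.contains(&b) {
                    false // duplicate: give up on this window
                } else {
                    seen.push(b);
                    true
                }
            })
        })
        .map(|i| i + 14)
}

fn main() {
    assert_eq!(find_marker_vec(b"mjqjpqmgbljsphdztnvjfqwrcgsmlb"), Some(19));
}
```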
It's really not much different, but it is significantly faster — and the same code in JavaScript is 6.7 times faster. It works in both languages because appending to and scanning a small list is, in general, significantly faster than a hash set at small scale, i.e. a handful of elements like 14. Okay, the next step should be obvious, and this one will be 26 times faster: using an array instead of a vector. The array is stack-allocated, so we get cache-locality benefits and it's much faster. Obviously this requires a little more work, because now I also have to keep track of an index: I don't want to keep checking all 14 slots each time, only the subset that has been filled. Still, it's not much more complicated than our previous solution.
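The array version can be sketched like this (assumed shape; the point is the fixed, stack-allocated buffer and the manual fill index):

```rust
// Array sketch: a fixed [u8; 14] plus a length counter replaces the
// heap-backed Vec, so only the filled prefix `seen[..len]` is scanned.
fn find_marker_array(input: &[u8]) -> Option<usize> {
    input
        .windows(14)
        .position(|w| {
            let mut seen = [0u8; 14];
            let mut len = 0;
            w.iter().all(|&b| {
                if seen[..len].contains(&b) {
                    false // duplicate: give up on this window
                } else {
                    seen[len] = b;
                    len += 1;
                    true
                }
            })
        })
        .map(|i| i + 14)
}

fn main() {
    assert_eq!(find_marker_array(b"mjqjpqmgbljsphdztnvjfqwrcgsmlb"), Some(19));
}
```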
Now, I know that at this point all of this has been a bit surprising, especially if you haven't used a real language — a real language being one that has static arrays. Sorry, JavaScript. "What about an ArrayBuffer?" Okay, calm down, they kind of do. But seriously, these next few optimizations are going to be a little wild, going from 26x to 233 times faster — we're starting to get into the "incredibly faster" category. Before we can go incredibly faster, which is technically a phrase, you first have to hit the subscribe button, and then I'll explain bits to you, because from here on every solution involves some bit manipulation to gain super speeds. We don't need an array anymore: we can use a single 32-bit number to store the state of our search.
One thing you should understand about ASCII characters is that each has an associated numeric value, and the lowercase letters happen to be contiguous, from 97 to 122. Second, a u32 is an unsigned number with 32 bits; you can think of it as an array that can store a true or a false in each of 32 positions — a true is a one, a false is a zero. That means any character value modulo 32 results in a number between 0 and 31, which is exactly the range of bit positions in a u32. Let's quickly walk through an example. Say our search state is zero — we haven't seen any characters yet — and we want to insert the character 'd'. 'd' is 100, and 100 modulo 32 is 4. The left shift operator is equivalent to multiplying by two for each position shifted, so shifting a one left by four positions gives us the binary value 10000. If our state is zero and we OR it with this new binary value we just created, the resulting state will be binary 10000.
Say our next character is 'a'; following the same equation, it produces the binary value 10. OR that into binary 10000 and we get binary 10010. But how do we check whether a bit has already been set before setting it? We simply use the logical AND operator. Using our previous example, running AND between the zero state and the new value gives zero, which tells us the two values share no common set bits. Now, I know that was a very condensed tour of binary; if you want more information about it, I'll post a link in the description.
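The steps above can be replayed as a tiny worked example:

```rust
// Worked example of the bit trick from the text: each lowercase ASCII
// letter maps to one bit of a u32 via `1 << (byte % 32)`, so an insert
// is a single OR and a membership test is a single AND.
fn main() {
    let mut state: u32 = 0;

    // Insert 'd' (ASCII 100): 100 % 32 == 4, so set bit 4 (binary 10000).
    state |= 1 << (b'd' % 32);
    assert_eq!(state, 0b10000);

    // Insert 'a' (ASCII 97): 97 % 32 == 1, so set bit 1 (binary 10).
    state |= 1 << (b'a' % 32);
    assert_eq!(state, 0b10010);

    // Duplicate check with AND: a nonzero result means the bit was set.
    assert_ne!(state & (1 << (b'd' % 32)), 0); // 'd' already seen
    assert_eq!(state & (1 << (b'z' % 32)), 0); // 'z' not seen yet
}
```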
Go check it out — it's directly below the like button. Here's Benny's algorithm, and it's actually pretty clever. We build the state we talked about before from the first 13 characters in the sequence, and here's the secret sauce: an XOR, the exclusive or. Think of XOR as a toggle: if there's a zero in that bit position, it flips it to a one; if there's a one, it flips it back to zero. Then we walk the input 14 characters at a time. If you're not familiar with the windows operation, it literally gives you slices 0 to 13, then 1 to 14, then 2 to 15 — it just creates a nice sliding window for you — and the position iterator adapter returns the index of the first window for which the closure returns true (the index, not the value). So for each window we take the first and last characters. Remember, our state holds the first 13 characters; we toggle in the 14th character, so now the state can hold up to 14. If the state has 14 distinct bits set, we have found the position we're looking for. Otherwise we toggle out the first character we saw — that takes the state from characters 1 through 14 back down to 13 toggles — then we loop again, toggling in the next window's last element, counting distinct bits again, toggling out the first, and that little crawling window keeps moving along. This clever algorithm made Benny's solution 233 times faster than the hash set, or roughly 23,000 percent faster.
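Reconstructing Benny's approach from that description gives something like this (a sketch — the variable names are mine):

```rust
// Sketch of Benny's XOR-toggle window: `filter` starts holding the
// first 13 bytes; each window toggles its last byte in, counts set
// bits, and toggles its first byte out before the window slides.
fn find_marker_benny(input: &[u8]) -> Option<usize> {
    let mut filter = 0u32;
    // Pre-toggle the first 13 characters into the state.
    input.iter().take(13).for_each(|b| filter ^= 1 << (b % 32));
    input
        .windows(14)
        .position(|w| {
            let (first, last) = (w[0], w[13]);
            filter ^= 1 << (last % 32); // toggle the 14th character in
            let distinct = filter.count_ones() == 14;
            filter ^= 1 << (first % 32); // toggle the 1st character out
            distinct
        })
        .map(|i| i + 14)
}

fn main() {
    assert_eq!(find_marker_benny(b"mjqjpqmgbljsphdztnvjfqwrcgsmlb"), Some(19));
}
```

Duplicates cancel themselves out under XOR, which is why `count_ones() == 14` only fires when all 14 characters are distinct.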
It's an incredibly smart algorithm — congratulations to Benny. Damn, Benny, what a smart man. And to think my solution was only 100 times faster than the hash set; I feel so stupid looking at Benny's super smart window-toggling solution. Let's get into our final form. Here we go: let's look at the algorithm that sped things up more than four times beyond Benny's — about a thousand X overall — using a fixed-size window. You'll notice that we take the input and get the slice from index to index plus 14, a constant-size window; that's very important to remember. Next we create a 32-bit state variable, just like Benny's solution, and then we iterate through that slice in reverse. Personally, when I saw this, I assumed it meant taking an iterator, collecting all the elements, and then walking backwards through them.
I thought there was no way this was going to be faster than Benny's. We do the same per-byte operation: take the byte modulo 32, test whether that bit is already set — if it is, we've found a duplicate — then set the bit using the OR operator, and return whether we had already seen this byte. So here's the first of two mind-blowing optimizations: because we scan the window from back to front, when we do find a duplicate we can jump the window start to that duplicate's position plus one. That is a huge optimization — it skips multiple characters at once, so we never have to look at them again. And if we never hit a repeated character (the false condition), we return the current index plus 14, which is the position directly after the 14 contiguous distinct characters. Now you're probably wondering how this was so fast — I didn't understand either why going backwards through a list would be faster, beyond that little jump; you still have to iterate over everything, right? So look at this: this is Compiler
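Piecing the description together, the reverse-scan-and-skip search looks roughly like this (a sketch; `rposition` walks the window back to front):

```rust
// Sketch of the reverse-scan skip: walk each 14-byte window from the
// back; the first duplicate found from the right lets us jump the
// window start past it, and a clean pass means we found the marker.
fn find_marker_skip(input: &[u8]) -> Option<usize> {
    let mut idx = 0;
    while let Some(window) = input.get(idx..idx + 14) {
        let mut state = 0u32;
        if let Some(pos) = window.iter().rposition(|b| {
            let bit = 1u32 << (b % 32);
            let seen = state & bit != 0; // AND: was this bit set already?
            state |= bit; // OR: set it now
            seen
        }) {
            // The byte at `pos` repeats later in the window, so every
            // window starting at or before idx + pos must fail too.
            idx += pos + 1;
        } else {
            return Some(idx + 14); // 14 distinct bytes: report the end
        }
    }
    None
}

fn main() {
    assert_eq!(find_marker_skip(b"mjqjpqmgbljsphdztnvjfqwrcgsmlb"), Some(19));
}
```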
Explorer — godbolt.org. I put Benny's solution in first so we can look at it. The first thing you'll notice is that, since the window is a fixed size, the compiler unrolled the loop during optimization: it brought it all inline, so there's no jumping back and repeating. Then we get to the windows-of-14 part, which is pretty much the main loop — we just run through this code over and over, so it's pretty fast. Now let's look at David's solution. The same thing happens: it unrolls, and the reverse iteration happens right there — look at that: 11, 10, 9, 8, 7, 5, 3, 2, 1 — followed by a pile of instructions I couldn't even read at first. I've been told that what's happening there is SIMD: single instruction, multiple data, meaning multiple calculations in one go, which lets the compiler massively speed up the work. So unlike Benny's, it's not just that the inner loop is unrolled; it is also SIMD-optimized, and that, my friends, is incredibly fast. But you're probably wondering: wait, wasn't the fastest one 16,000 times faster? No, this was just three orders of magnitude faster — the roughly thousand-X solution was 983 times faster. So how do we go faster? Of course: just use 64 threads that would otherwise sit doing nothing.
Come on, of course I'm going to do that, and it allowed us to go incredibly fast — in fact, 16,000 times faster than the original solution. I know someone will probably say that's kind of cheating. I mean, it is a bit of cheating, but it's part of the algorithm, right? We're just improving the algorithm, and this is a very simple working set to parallelize. In fact, it's so fast that Google Charts can't even draw a blue line for it — that's how fast it is. The original solution processed 3.84 megabytes of data per second, while the final multithreaded solution did 617 gigabytes per second. Of course, it only ran for about 11.9 milliseconds, because that's all it took to chew through about 800 megabytes of data. And of course, credit for the 983-times-faster single-threaded solution goes to David A Perez.
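The transcript doesn't show the threading code, but the idea can be sketched with scoped threads over overlapping chunks — each chunk extended by 13 bytes so no candidate window straddles a boundary (all names here are my assumptions):

```rust
use std::thread;

// Sequential scan used inside each thread (the Vec variant).
fn find_marker(input: &[u8]) -> Option<usize> {
    input
        .windows(14)
        .position(|w| {
            let mut seen: Vec<u8> = Vec::with_capacity(14);
            w.iter().all(|&b| !seen.contains(&b) && {
                seen.push(b);
                true
            })
        })
        .map(|i| i + 14)
}

// Parallel sketch: split the input into `threads` chunks, extend each
// by 13 bytes of overlap, search every chunk on its own thread, and
// take the smallest global hit.
fn find_marker_parallel(input: &[u8], threads: usize) -> Option<usize> {
    let chunk = (input.len() + threads - 1) / threads;
    thread::scope(|s| {
        let handles: Vec<_> = (0..threads)
            .map(|t| {
                let start = (t * chunk).min(input.len());
                let end = (start + chunk + 13).min(input.len());
                let slice = &input[start..end];
                // Map each chunk-local result back to a global position.
                s.spawn(move || find_marker(slice).map(|p| start + p))
            })
            .collect();
        handles.into_iter().filter_map(|h| h.join().unwrap()).min()
    })
}

fn main() {
    let sample = b"mjqjpqmgbljsphdztnvjfqwrcgsmlb";
    assert_eq!(find_marker_parallel(sample, 4), Some(19));
}
```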
Go follow him on GitHub — now that's incredibly fast. I hope you enjoyed it. Hit the like button and let me know below. It was so much fun doing this. Obviously, it was all done on Twitch; we went through all the optimizations and it was a glorious time. Thanks for watching. The name is ThePrimeagen.
