
Akka streams for high throughput data processing by Zack Loebel-Begelman

Jun 01, 2021
Hello everyone, thank you for coming. I just want to note that if you came to this talk in Chicago, it's pretty much the same thing, so unless you want to, you know, ask questions afterward or whatever, you might want to go somewhere else. Anyway, my name is Zack Loebel-Begelman. I'm a senior software engineer on the data platform team at a company called Credit Karma, and we're here today to talk about Akka Streams for high-throughput data processing. Make sure you rate the talk in the app so I can get some feedback; next time I do this I hope to do better. So anyway, what are we talking about today?
First of all, those who are familiar with Akka actors, show of hands. Great, most of the people here have written code using Akka actors. Okay, still quite a few hands up: but has anyone tried to solve a high-performance problem using Akka actors? Some people. For those of you who raised your hands, this is for you; you'll get to hear about my pain in doing so. So what are we talking about today? We'll start with data infrastructure at Credit Karma, and how, as we scaled our data, we started to run into problems. Then a little introduction to Akka Streams, some of the core concepts and how that stuff works, how Akka Streams saved my butt and made my life much better, and then a bit of benchmarking comparing streams to actors, ending with some results and learnings. So without further ado: what the data infrastructure was like at Credit Karma when I started. Credit Karma, for those of you who don't know, provides financial assistance that helps more than 60 million Americans and Canadians achieve financial progress.
So when I started at Credit Karma, this is what the data infrastructure looked like. We had just introduced Apache Kafka; for those of you who don't know, Apache Kafka is a very durable message queue that provides an at-least-once guarantee. It's very scalable, you can push a large amount of data through it very quickly, and it has high throughput. So we started putting JSON into Kafka, which is a semi-structured data format. Our analysts were comfortable using Vertica, an ACID-compliant relational data warehouse that is very strictly typed. So we have JSON in Kafka, but we need those strict types guaranteed before pushing into Vertica. The other thing with Vertica, because it's a column store, is that it likes big batch inserts: you don't want to insert one row at a time, that's going to be very slow; you want to do hundreds of thousands or even millions, so we needed to group our inserts into batches.
Another thing: because of the at-least-once guarantee in Kafka, you sometimes get duplicates; if a message doesn't appear to have arrived, it sometimes gets written again and you end up with a duplicate, so we wanted to remove those. Those are the three constraints, and the reason we built our data warehouse import, one of the applications we'll talk about today, written in Scala using Akka; more on that shortly. The other application we'll talk about, the analytics export service, came about when we partnered with a third party called Amplitude to provide us with product metrics and customer interactions, how people used our app and our website. Now, we never sent any financial data; it was always anonymous.
We take security very seriously at Credit Karma. This was another Scala service that used Akka actors, and those are the two applications we're going to talk about today: how we extract that data from Amplitude and push it into Kafka, and then how we get it from Kafka into Vertica. So this is really about our experience building these actor systems and ultimately moving to Akka Streams. So what about the scale of the data? Here we have megabytes per minute over the last four years. These are not huge numbers, it's not Google or Facebook, but it's still not trivial; it's enough that we needed something
we knew could scale. You'll notice in 2014 we had about a hundred and sixty-two megabytes per minute, and you can see we doubled over the next few years, which I thought was really cool. It isn't strictly doubling; it's actually exponential growth that we didn't necessarily expect, but either way we're glad we prepared for it. That's why we chose Scala: we wanted something on the JVM where we could drop down to threads as needed, but also Akka. Now, most of you know Akka actors; it's a different way of thinking about concurrent programming, leveraging the actor model, with no blocking and no synchronization, necessarily. An actor is a lot like a function, where the message is the input to that function.
It actually scales very well, because all the mutable state is contained within an actor, and it scales easily to other boxes or even other threads: you send a message, the actor does something; you can send that message over TCP if you want, and you can cluster it. Those are some of the reasons we chose Akka actors (Akka Streams wasn't available at the time). But it is a push-based model: send a message, do something, change your behavior, create more actors, whatever. So we chose Akka actors because we wanted to scale with our data, and we knew it would scale easily, hopefully.
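To make that push model concrete, here is a minimal, hypothetical sketch (not our production code; the message type, worker, and pool size are illustrative): a message goes to a router, some worker handles it, and any mutable state stays inside the actor.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Messages are wrapped in case classes, as is conventional with Akka actors.
final case class Process(payload: String)

class Worker extends Actor {
  private var handled = 0 // mutable state, safely confined to this actor
  def receive: Receive = {
    case Process(payload) =>
      handled += 1
      println(s"${self.path.name} handled '$payload' (total: $handled)")
  }
}

object PushModelSketch extends App {
  val system = ActorSystem("sketch")
  // The sender neither knows nor cares whether this is one actor or a pool of four.
  val workers = system.actorOf(RoundRobinPool(4).props(Props[Worker]), "workers")
  (1 to 10).foreach(i => workers ! Process(s"message-$i"))
}
```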
The other nice thing about Akka actors is that you can send a message to a single actor, or you can send it to a router of actors that you don't necessarily need to know about; you simply send the message, and it could be handled by any number of actors on the same box or on a different box. So what is the application we built using Akka actors here? Our data warehouse import. If you recall, we had three constraints: we want our inserts into Vertica batched, we need to deduplicate messages from Kafka, and we also need to convert our semi-structured JSON into a stricter structure
(TSV, actually, was preferred). So this is what we ended up creating. We had a single extractor for each topic in Kafka and each table in Vertica, and that extractor was responsible for creating another three actors: a reader to pull messages from Kafka, a deduplicator to make sure we only send unique items (we did this with a sliding window over a linked hash map, evicting the oldest ids as new ones arrive; a rough sketch of the idea follows below), and a processor. So what happened is that the extractor would send a message to the reader, which would read a message from Kafka and send it to the deduplicator, which would then send it to the processor. The processor would make sure all the fields were there and the types were correct, convert it to TSV, and send that message back to the extractor, which would write to disk, upload to Vertica when relevant, and then send another message to the reader to start the whole process over again. Something to keep in mind here: we only ever had one message in flight per extractor, so per table in Vertica and per topic in Kafka we only had one message from Kafka, I should say, in flight. This actor system worked very well for a while; so well, in fact, that when another use case came up we decided to build it the same way.
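Here is a rough, hypothetical sketch of that sliding-window deduplication idea; the key type and window size are placeholders, not what we actually ran with.

```scala
import scala.collection.mutable

// Keep the most recently seen keys in insertion order and evict the oldest as
// new ones arrive, so memory stays bounded while recent duplicates are caught.
class SlidingDeduplicator(windowSize: Int) {
  private val seen = mutable.LinkedHashSet.empty[String]

  /** True only the first time a key (e.g. a message id or hash) appears within the window. */
  def isNew(key: String): Boolean = {
    if (seen.contains(key)) false
    else {
      seen += key
      if (seen.size > windowSize) seen -= seen.head // drop the oldest entry
      true
    }
  }
}
```

In the stream version described later, the same idea sat behind a custom graph stage.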
That use case was the analytics export for Amplitude: why not use actors to build that too? So what did we have to do for this one? Well, we have a coordinator. Amplitude only makes this data available once an hour, as a gzipped tarball, so once an hour we pull it down, un-tar it and unzip it, and then we have a bunch of files on which we want to do some transforms and then send over the network to our ingest server. To finish it off, we had a router of transform workers to take the data from the Amplitude format and convert it to our internal format, and then a Kafka import worker that would actually make the network call to our ingest server, which would then receive the batch.
The worker would send messages over HTTP; the ingest server would write them to disk and then, once it had written them all to disk, it would respond. We had a separate agent on that box that would consume those messages from disk and send them to Kafka. That's the high level of what this does. We built it, we were very excited, we shipped it, and we immediately ran into problems. The first problem we encountered was missing data. This is the exception we saw after we shipped it: we were using spray at the time, with what I think is called the host-level connector,
where you pretty much have a pool of connections: you ask spray for a connection, and then you use that connection to make your actual outgoing network call. So we were seeing futures time out while asking spray for a connection, which meant the connection pool was effectively saturated and we couldn't get a new connection within 30 seconds. We figured this is probably not good; work is piling up in memory, futures are timing out, and data is being dropped, but it should be an easy fix. Thirty seconds? What if we just increase it to,
I don't know, 30 minutes? That should be fine; what's the worst that could happen? Out of memory is the worst that could happen. So whereas before we were missing data, now I'm getting called because our app crashes. We didn't really fix the root cause: we're still waiting on the network, except now, instead of the future timing out, an exception being thrown, and the data disappearing, the data builds up in memory, and eventually enough data builds up that the entire application crashes. We weren't really controlling the rate of work at all. Again, easy solution:
well, we just increase the heap space. So we did that, but we knew it probably wouldn't work forever, so we decided to take a closer look at the architecture of what exactly we were doing. We get these tarred, gzipped files from Amplitude, we decompress them and read them from disk, and as we read them from disk we do this in-memory transform, which is pretty fast; then we have a slightly slower network call, still decently fast; but then we have a very slow write to disk before the response comes back. So what we were literally seeing is work piling up: the ingest server is slow, the import workers are a little faster, but nothing is as fast as the in-memory transforms, so work just builds up at every step until the app crashes and I get called, which I don't want. We knew this was going to be a problem, but we weren't sure what the solution was. It was actually at that point that we looked at our data warehouse import and noticed that a similar paradigm was happening there too.
It has a pretty fast network read from Kafka; recent messages are mostly kept in memory, so it's mostly limited by the network. It still has a pretty fast in-memory transform, where we're deduplicating and converting the JSON to TSV, and then, surprise, we're writing to disk (and Vertica presumably also writes to disk when we load into it), so it wasn't long before we started having problems there as well. Here I actually have a throughput graph of the reads from Kafka, and that green line is our intake rate. 750,000 is our magic number: if we're not above that line, we're falling behind and will never catch up. So what change caused this strange behavior? It obviously works great at first, then it crashes, and we even have a couple of 20-minute stretches where we do no work at all. This is not good: we're using swap space, data is not loading into Vertica, and people are starting to chase me.
I'm trying to figure out what exactly changed, what's different here. Well, a new schema, a new topic in Kafka, had come online. Previously our largest schema was about 8,000 events per minute; this new one was more on the order of 80,000 events per minute, a ten-fold increase, and apparently that was enough to bring the entire system down. I'm not sure what's going on, there's this stalling behavior, we're using swap space (which is never good for a JVM application), and all I know is that people are chasing me, they're still calling me about the other app, and I'm thinking there has to be a better way. Akka Streams had actually just gone GA around this time, so we decided, okay, let's take a look at this, maybe it can solve our issues.
So what exactly is Akka Streams? Whereas with actors it's all a push-based model (send a message), Akka Streams introduces the concept of backpressure, a pull-based model. It's still actors under the covers, but it's a much friendlier API that's really optimized for performance, and it's a very modular, composable API. What happens is that everything actually starts at the end: you have your final subscriber, a sink in Akka's language, which sends a request for data, demand, and that demand propagates up the entire chain to the beginning; only then is data actually sent. This concept of backpressure was formalized in the Reactive Streams specification, which is what Akka Streams implements, and you can expose anything as a Reactive Streams subscriber or publisher.
These are all still non-blocking, asynchronous signals, because underneath they're actors, but again, it's a high-level API optimized for high throughput. I do want to make sure to mention that Akka Streams is limited to a single box; there is no remoting or clustering available, simply because these are all actor messages, and one of those messages sent over TCP could be lost. If that demand message is lost, your stream is effectively deadlocked, so it has to be restricted to a single box. That's the big caveat to be aware of. So what are some of the basics you'll encounter in Akka Streams?
It all starts with a source; this is where your data comes from. I also want to point out that everything is a graph stage. You have these source stages that have one output into the stream, because their input is something external to the stream: a file, a collection, an iterator, even an actor reference. There are a bunch of built-in APIs that provide backpressure for you, so no custom code is needed. Once your data is in the stream, through that exposed output, you have flows. A flow is a way to transform or change your data, something like a map or a filter (we'll look at some of those soon), and a lot of it will look like what you find in the Scala collections API. A flow has one input and one output; it's just a way to do some kind of transformation and to connect different things together.
The other thing that's really cool is that the API for these flows is really just the inputs, the outputs, and their types, so you can have composite flows: maybe you want to expose three flows that do a variety of things as a single flow, and that's possible simply because the API is pretty much 'what is your input, what is your output, and what are the types'. Once you have transformed or changed your data, it goes into a sink, the final resting place of the data: something with one input and zero outputs, because the output is something external to the stream again, such as a file, a collection, or an actor you send to. And once you have all your inputs and outputs connected, you have a runnable graph, which is your blueprint; that is the work that will be done.
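As a minimal sketch of those three pieces (using the ActorMaterializer API from the Akka 2.4/2.5 era this talk is about; the names are illustrative):

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

object StreamBasicsSketch extends App {
  implicit val system: ActorSystem = ActorSystem("basics")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val numbers = Source(1 to 10)            // source: one output, data enters the stream here
  val doubled = Flow[Int].map(_ * 2)       // flow: one input, one output, a transformation
  val printer = Sink.foreach[Int](println) // sink: one input, data leaves the stream here

  val blueprint = numbers.via(doubled).to(printer) // a RunnableGraph; nothing runs yet
  blueprint.run()                                  // resources allocated, work starts
}
```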
Still, it's just a blueprint (and the blueprint is actually shareable); no work is done until you actually call run on that runnable graph. That's the point at which all the resources are allocated and things materialize. Materialization means certain stages turn into something when run is called: a Sink.seq, for example, materializes a Future of a sequence. You'll also notice that anything you provide to a stream is a function, because that function is going to be called every time run is called. So let's take a look at some code. For our data warehouse import we were doing deduplication and processing, so I want to show how easy it is to write some unit tests, which lets me focus on my business logic and just feed in values. On the team we have our deduplicator, which is a flow that takes a String, emits a String, and materializes NotUsed, no value.
We had to write a custom graph stage for the deduplicator, just because that functionality didn't exist out of the box. For the processor we are effectively doing a map; I wrote it as a Flow, but normally you could just call map on a stream you already had, and you'll notice we're just calling our schema-processing function, which takes the string and returns an Option of a String. So what does the composition look like? We can say deduplicated via processed, and now we still have a flow, but one that takes a String and emits an Option of a String; that's the composite flow I mentioned, where we take two flows and put them together. Now let's plug in some values to write our test. First of all, we drop the Nones using a filter and just take the values; then we create some data we want to feed in, use a plain collection as our source, and send it to a Sink.seq, saying, okay, keep the materialized value.
You can see we have a runnable graph that will materialize a Future of a sequence of Strings. That's the other thing I really like about streams, personally: you get type safety. You know at compile time that your program is wired correctly, otherwise it won't compile, because the types of all the inputs and outputs are known and connected; that's something that isn't available today with Akka actors. We also need to create an actor system and then a materializer using that actor system, and then we call run; the materializer is an implicit value provided to the run call. We get our future, wait for it to resolve, and then just make sure the results are correct. So I can test my business logic, and I can plug other values into the source and sink at either end just to make sure my flows are correct.
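Here is a hedged sketch of a test in that style. The two flows are simplified stand-ins (the real deduplicator was a custom graph stage and the real processor validated a schema); the point is the wiring: composing flows, collecting the Somes, materializing Sink.seq with Keep.right, and awaiting the future.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}

import scala.concurrent.Await
import scala.concurrent.duration._

object FlowTestSketch extends App {
  implicit val system: ActorSystem = ActorSystem("test")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // Stand-in for the custom deduplication graph stage.
  val deduplicate: Flow[String, String, NotUsed] =
    Flow[String].statefulMapConcat { () =>
      val seen = scala.collection.mutable.Set.empty[String]
      msg => if (seen.add(msg)) List(msg) else Nil
    }

  // Stand-in for the schema-checking processor: Some on success, None on failure.
  val processed: Flow[String, Option[String], NotUsed] =
    Flow[String].map(msg => if (msg.nonEmpty) Some(msg.toUpperCase) else None)

  val futureResults =
    Source(List("a", "b", "a", ""))
      .via(deduplicate.via(processed))       // composing two flows yields another flow
      .collect { case Some(value) => value } // drop the Nones, keep the values
      .toMat(Sink.seq)(Keep.right)           // materialize a Future[Seq[String]]
      .run()

  assert(Await.result(futureResults, 5.seconds) == Seq("A", "B"))
  system.terminate()
}
```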
I can focus on what I want to test instead of wiring up all those actors and making sure all the case classes are correct. So what are some of the built-in APIs you get? For sources, things like an actor ref, a future, a collection, an iterator, or a path; there's built-in functionality for reading and writing the file system, and you get all of that out of the box with backpressure throughout, no custom code needed. Similarly for sinks, there are things like head and headOption (these should start to look familiar to anyone who has used the Scala collections API, which I assume is everyone here), plus last, foreach, fold and reduce, and you can materialize a future so you can read or write something at either end if you need to talk to the outside world. It's the same with many of the built-in processing stages: you'll see familiar functions like grouped, sliding, take, takeWhile and so on, and for any of them that expect a number of elements there's another version that also takes a timeout, so if you're building a batch and you don't get enough items within five minutes, or whatever number you prefer, it will just emit the values it has at that point; the stream is potentially infinite, potentially bounded, who knows, but if you don't get enough elements in time you still get your values. Likewise with the converters: there are input and output stream adapters, and even a Java 8 stream, so if you want to plug external things into your stream as sources or sinks, you can use those functions. So let's look back at our warehouse import, what exactly we were doing, and what it looks like with streams. As I mentioned before, we had effectively created a simplified version of backpressure; we had built our own little mini Akka Streams without realizing it, just because Kafka is potentially unbounded;
we had to. I also want to mention that, as a best practice, or at least as I understood it, with Akka you want to wrap every message to your actors in a case class. So the extractor would send a case class to the reader, which would get things from Kafka, put them in a case class, and send them to the deduplicator, which would wrap them in a case class for the processor, which would wrap them in a case class and send them to the extractor. So we're creating roughly five case classes for every message
we get from Kafka; that's important later, we'll come back to it. But what does it look like when we replace some of those actors with streams? Well, it's much simpler and there are fewer boxes in my diagram: you just have the stream. Now you have an extractor that instead spins up a stream, and that stream does all the work for you. So we have an extractor that creates one stream per topic in Kafka, per table in Vertica. It's easier to reason about, and we no longer need to
worry about whether everything is wired correctly, because the type safety means that if it compiles, it's correct. So what does the code look like? You'll see we have a RunnableGraph of NotUsed; that's because we don't materialize anything. As I said, we have a function to get our Kafka iterator. We also create a message parser, where we provide the schema up front so it can check that the types are what we expect; we send messages through the deduplicator we saw earlier; at that point we pretty much do a map to process, converting that JSON into TSV; and then we send it to a sink backed by an actor we already had, our schema file writer.
That used a trait that I think has since been deprecated, but it was really useful for us: the concept is that you can have a subscriber actor or a publisher actor, a trait you can mix in, and it provides backpressure if you're converting an existing actor. I don't think that's supported anymore, and these days you'd write a custom graph stage instead, which is pretty easy to do. Either way, we have backpressure throughout. Note that we don't actually have a running stream yet, we have a runnable graph, and you'll notice in our receive block (we're actually still inside an actor that pulls) that it's not until we get the start-pull message that we actually call run. That is the point at which the stream is actually created, resources are allocated, and work begins; that's where the magic happens, so to speak. So we ended up with something more performant, it's type safe, we know everything that's happening, and it's easier to reason about. So what about the other application, our analytics export?
What does that look like? Well, we had a few different routers here, and we were able to rip all of that out and replace it with a stream, similar to before. Since our unzip step produced multiple files, we ended up creating one stream per file; that's the high-level diagram. So what does the code look like? This one is a little more involved than the previous one. For each file, we first read from disk; as I said, that functionality is built into Akka Streams, so we can use the fromPath method on the FileIO object, which gives us raw bytes (a ByteString, I think). At that point you have to frame it, because we just have a lot of bytes coming in and we want to split on newlines; that's what Framing does: once it sees a newline it says, okay, this is a single event. Then we convert it to a UTF-8 string, and that's where we use something called mapAsync. There are some stages within Akka Streams that you can run asynchronously, and map is one of them. What mapAsync does is similar to how we had a router of actors before: we can now run our map in parallel. Instead of an A to a B, we now have an A to a Future of B, and we can run it in parallel by providing the desired level of parallelism to that function. What happens is that once all those slots are filled, the stage backpressures and stops asking for new work until one of those slots opens up. So here we used it to parse things into JSON objects; then we do a filter to remove the schemas we don't care about and convert to our familiar internal type; then we run mapAsync again for our internal transformation, which is something we want to parallelize because it's a little more expensive; then we group into the batch size we want for our network call; and then we make the network call. That's the last asynchronous knob, and it's where we had problems before, because work was piling up and we only had a limited number of connections. Now we can control it explicitly (a rough sketch of this pipeline shape follows).
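In this sketch, parseEvent, toInternal and sendBatch are placeholders for our real parsing, transformation and HTTP code, and the frame size, parallelism, batch size and timeout are illustrative numbers, not the ones we ran with; groupedWithin is the timed batching stage mentioned earlier.

```scala
import java.nio.file.Path

import akka.Done
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Framing, Sink}
import akka.util.ByteString

import scala.concurrent.Future
import scala.concurrent.duration._

object ExportPipelineSketch {
  implicit val system: ActorSystem = ActorSystem("export")
  implicit val materializer: ActorMaterializer = ActorMaterializer()
  import system.dispatcher // ExecutionContext for the futures below

  // Placeholders for the real parsing, transformation, and HTTP client code.
  final case class RawEvent(json: String)
  final case class InternalEvent(payload: String)
  def parseEvent(line: String): Future[RawEvent]         = Future(RawEvent(line))
  def toInternal(event: RawEvent): Future[InternalEvent] = Future(InternalEvent(event.json))
  def sendBatch(batch: Seq[InternalEvent]): Future[Int]  = Future(batch.size)

  def exportFile(path: Path): Future[Done] =
    FileIO.fromPath(path)                             // slow, backpressured read from disk
      .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 65536, allowTruncation = true))
      .map(_.utf8String)                              // raw bytes to one line per element
      .mapAsync(parallelism = 4)(parseEvent)          // parse in parallel, 4 slots
      .filter(_.json.nonEmpty)                        // drop schemas we don't care about
      .mapAsync(parallelism = 4)(toInternal)          // the more expensive transform
      .groupedWithin(10000, 1.minute)                 // batch by count or by time, whichever comes first
      .mapAsync(parallelism = 2)(sendBatch)           // the network call; when both slots are busy, backpressure
      .runWith(Sink.ignore)                           // a Future[Done] that completes with the file
}
```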
We provide that parallelism value to the function, and once all those slots are full, all those connections in use, we just backpressure: we wait, and once a slot opens up we can take another batch off the queue and send it. So that's where we make our network call, and then we send it to our completion sink. The idea of that sink is that when the stream completes it hands you a Try of success or failure, and (I think we were using akka-persistence for checkpointing) at that point we mark the file complete and continue. Again, it's easier to reason about, it provides type safety, we know the wiring is correct, we don't need to connect all those routers or manage them ourselves, and it's all wrapped up in a very friendly API. So was it more performant? Let's take a look. Before: here we have heap utilization in gigs, and you can see we had to reallocate several times (that's why they kept calling me and I had to keep increasing the numbers), but we're using around 28 gigs. If you look at the after, it's using about a third of that, and it's much more predictable and consistent throughout.
I think it was, yes, something on the order of 12 gigs, but either way, now we can run other things on that box. We have constant utilization and I don't get paged anymore, so I'm very happy about that. What about our warehouse import, though? That graph still haunted me; we had this stalling behavior and it clearly wasn't working for us. That was our before, so what's our after? It takes a little while to get up to a nice number around nine million, but we're consistently somewhere between 8 and 11 million, leaving our old 750,000 in the dust. Our intake rates, as you saw before, have continued to rise, so luckily we switched to streams in time to support that, and this version had no problem running, effectively at capacity. But I was still bothered by that pausing and stalling behavior, and by the 20-minute stretch where nothing happened at all. It took some benchmarking, but I started to notice some trends, and I think I know what the problem was.
Remember how I said before that we were creating about five case classes for every message we receive from Kafka? Well, I think that was too many, because if you take a look at the garbage collection times shown here, before on the top and after on the bottom, I want to make sure everyone notices that the before, at the top, is in seconds, and the after, at the bottom, is in milliseconds. So we're talking several orders of magnitude of difference. You can see there are periods in there where we literally spent more time collecting garbage than doing actual work.
You know, several pauses of 20 to 25 seconds; that's the stalling behavior you were seeing. The other thing is that if you're hitting swap space during garbage collection, it will take a very long time; that's something you want to make sure you never do, because chasing references out to disk is very slow. On the after side, at the bottom, we have 40 to 80 millisecond pauses; the application barely notices, throughput is barely hindered, and things run much better. So performance has improved, I can finally get some sleep, and people no longer show up at my desk asking, 'Hey Zack, where's my data?' So this was a big win for us. But let's be honest, there are some skeptics among you who might be thinking: okay, no, your actor code was probably just really bad; if you'd written it properly it would be much, much
better. For those people I decided: okay, what if I change hats? What if I write some code where I focus only on the business logic and do an apples-to-apples comparison of actors and streams? So that's what I did. Let's say I'm not Zack, a platform engineer at Credit Karma; instead I work for the Wikipedia Security Corporation, and due to the changing political climate we want to make sure Wikipedia is kept safe for eternity. For those of you who don't know (and thanks, Wikipedia, for this), Wikipedia actually publishes a dump of its entire site, all the data on Wikipedia, twice a month.
It's a fairly large compressed file that, once unzipped, weighs about 60 gigabytes. So I decided: okay, I'm working for the Wikipedia Security Corporation, I hear actors are great, streams are also pretty great, I don't know which one to use, so let's compare and contrast. I want to focus on my core business logic, which is encrypting Wikipedia's files. So I wrote an XML parser; well, I didn't write one, I used the one in the JDK, but practically speaking we need to parse some XML, and Wikipedia publishes the page text in this wiki markup language, so we also needed to parse that. Then we perform encryption:
we did RSA encryption, and then we write it back to disk, just to make sure everything is secure. So I wrote the business logic separately and then wired it into very thin actors and streams, and took a look. For consistency, here are our speed characteristics: we have a slow read from disk, a pretty fast in-memory transform, slightly slower encryption because it's RSA, and then a pretty slow write to disk. So what did we do for this? We rented an AWS box, a quad-core box with eight gigs of RAM, and I thought it would be interesting to look at three different disk speeds, since we need to parse the XML from disk.
I pre-downloaded the file, so it's around 60 gigs, and I want to make sure everyone notices that: a 60 gig file and 8 gigs of RAM, so with 60 being larger than 8 we will potentially have problems. We used the JDK's built-in streaming XML parser, found in the javax.xml.stream package; I wrote a trivial wiki markup parser (Scala has parser combinators built in, so it's easy to create one); and the RSA encryption is built into the JDK. I used a FileOutputStream for the actors and, as makes sense, the Akka Streams file API for the streams, because it has backpressure and it's all built into the stream, so why wouldn't I?
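To make the parsing side concrete, here is a hedged sketch of wrapping the JDK's streaming XML reader in an iterator that either implementation can pull pages from lazily; WikiPage and the element names follow the Wikipedia dump format, everything else is illustrative.

```scala
import java.io.InputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants, XMLStreamReader}

import akka.NotUsed
import akka.stream.scaladsl.Source

final case class WikiPage(title: String, text: String)

// Pulls one <page> at a time; nothing is read until someone asks for the next element.
class WikiPageIterator(in: InputStream) extends Iterator[WikiPage] {
  private val reader: XMLStreamReader =
    XMLInputFactory.newInstance().createXMLStreamReader(in)
  private var lookahead: Option[WikiPage] = advance()

  private def advance(): Option[WikiPage] = {
    var title, text = ""
    var page: Option[WikiPage] = None
    while (page.isEmpty && reader.hasNext) {
      reader.next() match {
        case XMLStreamConstants.START_ELEMENT if reader.getLocalName == "title" =>
          title = reader.getElementText
        case XMLStreamConstants.START_ELEMENT if reader.getLocalName == "text" =>
          text = reader.getElementText
        case XMLStreamConstants.END_ELEMENT if reader.getLocalName == "page" =>
          page = Some(WikiPage(title, text))
        case _ => () // ignore everything else in the dump
      }
    }
    page
  }

  def hasNext: Boolean = lookahead.isDefined
  def next(): WikiPage = {
    val page = lookahead.getOrElse(throw new NoSuchElementException("no more pages"))
    lookahead = advance()
    page
  }
}

object WikiPages {
  // The stream version simply wraps the iterator; backpressure drives how fast it is pulled.
  def source(in: InputStream): Source[WikiPage, NotUsed] =
    Source.fromIterator(() => new WikiPageIterator(in))
}
```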
So the first test we ran was a kind of baseline: just read the material from disk, parse the XML, and write it back out, with the idea that we shouldn't be able to go faster than this baseline. That took 16 minutes and 27 seconds. Alright, why don't we try it with actors? I work at the Wikipedia Security Corporation, I don't know any better. Can anyone guess what happens? Out of memory, because, again, 60 doesn't fit into 8, there's no backpressure or anything, and it's just actors. We eventually ran out of memory, but it turns out there's this concept called a bounded mailbox: you can define the size of the mailbox used for those actors, and if that mailbox is full, sends block until there's space in the mailbox. So once we used the bounded mailbox, we also used some routers, just to make it interesting.
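For reference, here is one way a bounded mailbox can be wired up (a sketch; the mailbox name, capacity and timeout are illustrative, and with akka.dispatch.BoundedMailbox a sender to a full mailbox blocks only up to the push timeout before the message goes to dead letters).

```scala
import akka.actor.{Actor, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

// Stand-in for the benchmark's encryption worker.
class EncryptWorker extends Actor {
  def receive: Receive = {
    case _: String => () // encrypt the page and forward it in the real benchmark
  }
}

object BoundedMailboxSketch extends App {
  private val config = ConfigFactory.parseString(
    """
      |bounded-mailbox {
      |  mailbox-type = "akka.dispatch.BoundedMailbox"
      |  mailbox-capacity = 1000
      |  mailbox-push-timeout-time = 10s
      |}
    """.stripMargin)

  val system = ActorSystem("bench", config.withFallback(ConfigFactory.load()))
  // A full mailbox now slows the sender down instead of letting work pile up on the heap.
  val worker = system.actorOf(Props[EncryptWorker].withMailbox("bounded-mailbox"))
}
```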
The actors finished in around 40 minutes and 27 seconds. The plain streams version was a bit slow as well; I suspect that because there were no asynchronous boundaries in my stream, everything got fused into a single stage. But either way, once we used mapAsync it was significantly faster: 28 minutes compared to 40 minutes. That's still a pretty big improvement, but it's not only about going faster; there were some other results I personally found interesting when we looked at the different disk speeds. In this graph lower is better, because it's execution time, and you can see we have three different disk speeds for the source: slow, medium and fast.
We have our baseline test where we don't really do anything, we have streams using mapAsync, raw streams, and then two different actor implementations using different routers, both with a bounded mailbox. What I thought was really interesting here is that with streams, as our source disk got faster, everything got quite a bit faster, which makes sense: it's designed for high throughput and for reducing bottlenecks. With the actors we actually see the reverse: as our source gets faster, everything gets slower. I suspect this is simply because work arrives a little faster and nothing is controlling the rate of work, so there's more overhead in terms of objects; your guess is as good as mine. But either way, streams are designed to reduce bottlenecks and you really see that here: as more speed is available to feed the system, the whole system goes faster, whereas with actors the reverse seems to happen. It wasn't only about performance, though; I also had some revelations while writing the code for this. So that you're not all trying to read the code on the screen,
I just want to illustrate a point. First of all, when implementing the actors, I obviously had to spin up all those actors and point them at each other; but then I realized I also needed to implement a parsing actor, and then a file-writing actor, and an encryption actor. Then I realized that, oh, the actor system doesn't actually know when it's done, so I needed to come up with a completion message to send once all the messages had gotten through. Either way, it looked like a bunch of boilerplate.
I wanted to focus on my business logic; if I wanted to write boilerplate I would just write Java. So there's a lot of boilerplate, and because we're pretty much just doing maps, I had to have all these actors just for that. Meanwhile, with streams it's a lot fewer lines of code, and it's also a lot more readable and easier to understand what's happening. If we take a look at what's happening here: we start with an XML file source. I had to write a custom source to pull those messages out of the XML, but I did the same thing for the actors, so I figured it was fair to leave it out of the comparison on both sides. You'll also notice that this is all one composed pipeline; the types carry from the source all the way through to the end, where we end up with a runnable graph. So we start by providing a file input stream to our source; then we parse the messages, giving us a Try of a wiki page (here you can see that one version has a mapAsync and the other a plain map); then we drop the ones that failed, because we don't really care about them; then we do our encryption, again either a plain map or a mapAsync; then we discard anything that failed, convert to a ByteString, and write to disk. So it's not just a higher-performance API: I end up focusing on my business logic instead of making sure all the pipes between all the actors are connected correctly.
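A hedged sketch of what that composed pipeline could look like; wikiPages, parseMarkup and encrypt stand in for the custom XML source, the wiki-markup parser and the RSA step, and the parallelism and output path are made up.

```scala
import java.nio.file.Paths

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Source}
import akka.util.ByteString

import scala.concurrent.Future
import scala.util.{Success, Try}

object EncryptDumpSketch extends App {
  implicit val system: ActorSystem = ActorSystem("bench")
  implicit val materializer: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  final case class WikiPage(title: String, markup: String)
  final case class ParsedPage(title: String, plainText: String)

  // Placeholders for the custom source, the markup parser, and the RSA encryption.
  def wikiPages: Source[WikiPage, akka.NotUsed]      = Source(List(WikiPage("Example", "''hello''")))
  def parseMarkup(page: WikiPage): Try[ParsedPage]   = Try(ParsedPage(page.title, page.markup))
  def encrypt(page: ParsedPage): Future[Array[Byte]] = Future(page.plainText.getBytes("UTF-8"))

  wikiPages
    .map(parseMarkup)                          // parse, capturing failures as a Try
    .collect { case Success(page) => page }    // drop the pages that failed to parse
    .mapAsync(parallelism = 4)(encrypt)        // the expensive step, run in parallel
    .map(ByteString(_))                        // to bytes the file sink understands
    .runWith(FileIO.toPath(Paths.get("encrypted.out"))) // backpressured write to disk
    .onComplete(_ => system.terminate())
}
```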
I can just focus on the business logic of what I'm trying to do here, which is keeping Wikipedia secure. But there were some other caveats. First of all, to make the actors work we had to use a bounded mailbox; we also had to use routers just to make it interesting (it took several hours when I tried without them, and I didn't even bother putting those results in); and we also had to come up with a completion message, whereas I get all of that more or less for free using streams. Some of you might also say: no, no, no, you can get more out of actors if you do this or that. That's the conversation I keep having with my coworkers, but it's not really the point I'm trying to make; I could spend my entire life trying to tune Akka, and that's not what I
get paid to do, necessarily. I want to focus on the business logic and leave the rest to the experts, or whoever wants to work on those things; that's my two cents. Let's take a look at some of the performance metrics from those benchmarks. We have the GC timings here: first our baseline, then our two stream implementations, then our two actor implementations. The streams are pretty good, at about 20 milliseconds of consistent GC pauses, while the actors are all over the place, which makes sense, since you're creating many more case classes; but again, usage is much more consistent with streams throughout, while with actors you have this high variance everywhere, which probably doesn't help performance. You see similar trends with load. This is a four-core box, so when you see our actor tests going up to 20 or 25, that box is not having a good time; it won't scale much higher if we wanted to do more with that implementation, and throughput would probably drop because it's already being pushed to the limit, whereas with streams it's mostly just under 10, hovering around 8 consistently, give or take. So one of these will scale much better than the other. You see similar trends with heap size: it barely moves with streams, sitting around 750 megabytes, while with the actors it climbs all the way up, and that's where our allocation and GC times come from, because the heap is all over the place; streams handle most of that for you and stay consistent throughout. So what were our learnings and our takeaways here?
Raw Akka actors are very powerful, but there is no backpressure, so you can really get into trouble when you go from fast stages to slow ones. They're not particularly composable either: you have to make sure the types line up and you have to point them at each other. They are good for low latency, and for when you need clustering; and frankly, if your data set is small and fits in memory, use whatever you want, it doesn't really matter, you probably won't have problems. Akka Streams, meanwhile, is optimized for high throughput. It's still built on Akka actors, but it's built to reduce bottlenecks, and it does that with a very friendly, high-level API. If your data doesn't fit on a single box, you'll probably have to use a source that scales along with your data, something like a Kafka consumer, or partition the files yourself; that partitioning is on you. So don't be like me: don't try to build something high-throughput out of a raw actor system. Just start with Akka Streams; the people at Lightbend have done it for you.
It's a very nice, very useful API that I enjoy working with, so thank you very much. Any questions? Question: thanks for the talk, but can you explain in more detail how you connected Akka Streams to Kafka; did you use the Kafka consumer as the source? Yes, I was using it as a source. In my specific example we were just using the Kafka iterator at the time; I think Reactive Kafka, which is now maintained under Akka, is a pretty good API too, but we were simply using the raw iterator that Kafka provides, and since you can create a stream from just an iterator, a source wrapping an iterator, that's what we did, and it worked for us at the time. There were a few things we were missing in terms of committing offsets, but for our use case at the time it worked. Does that answer your question? Okay. Next question: thanks for a very good example of a client implementation;
sure; the question is, did you consider Kafka Streams, which basically implements the same topology? So, I recently compared a few different tools, but unfortunately I didn't have a chance to evaluate Kafka Streams, so anything I said about it would be mostly anecdotal and I'll stay away from that; I can't say I tested it for this specific use case. At the time we already had an actor system, and it seemed sensible to convert those pieces; also, I don't think Kafka Streams even existed at the time. So no, we didn't really consider it. Great, thanks for coming, everyone.
