
Akka streams for high throughput data processing by Zack Loebel-Begelman

Jun 01, 2021
Hey everyone, thanks for coming. Just want to make a note: if you attended this talk in Chicago, it's pretty much the same, so unless you want to, you know, mob me with questions after or whatever, you might want to go somewhere else. Anyway, my name is Zack Loebel-Begelman, I'm a Senior Software Engineer on the data platform team at a company called Credit Karma, and today we're here to talk about Akka streams for high performance data processing.

Be sure to rate the talk in the app so I can get some feedback, and the next time I do this I hope I can do better. So anyway, what are we talking about today? First of all, who here is familiar with Akka actors? Just a show of hands. Great, most of the crowd here has written code using Akka actors, quite a few hands. But has anyone tried solving a high performance problem using Akka actors? Some people. So for those of you who raised your hand, this is for you; you can hear about my pain in doing that. So what are we talking about? Today we're going to start by talking about the data infrastructure at Credit Karma, and how, as we scaled our data, we started to have problems.
Then a little intro to Akka streams, some of the basics and how some of those things work; how Akka streams saved my butt and made my life so much better; and then a bit of benchmarking comparing streams against actors, ending with some results and learnings.

So without further ado, what was the data infrastructure at Credit Karma like when I first started? Credit Karma, for those of you who don't know, is a financial assistance company that helps over 60 million Americans and Canadians make financial progress. When I started at Credit Karma, this is what the data infrastructure looked like: we had just introduced Apache Kafka. For those of you who don't know, Apache Kafka is a very durable message queue that provides at-least-once guarantees; it's very scalable, you can transport a large amount of data through it very quickly, it's high performance. So we started introducing JSON into Kafka, which is a semi-structured data format.

Our analysts were comfortable using Vertica, an ACID-compliant relational data store that likes large batch inserts. Because it's ACID compliant, it's very strictly structured, so we have JSON in Kafka but we need these strict types guaranteed and inserted into Vertica. The other thing with Vertica, because it's a columnar store, is that it likes big batch inserts: you don't want to insert one row at a time, that will go very slowly; you want to do hundreds of thousands or even millions, so we need to batch our inserts. And the other thing, because of the at-least-once guarantee in Kafka, is that sometimes you get duplicates, right?
If an acknowledgement doesn't arrive, sometimes a message gets written again and you get a duplicate, so we wanted to remove those duplicates. Those are the three constraints, and the reason we built our data warehouse import, one of the applications we'll be talking about; it's all written in Scala using Akka, more on that shortly. The other app we're talking about, the analytics export service, came about when we partnered with a third party called Amplitude to provide us with product metrics and customer interactions, the way people used our application and our website. Now, we never send any financial data, it was always anonymous; we take security very seriously at Credit Karma. This was another Scala service that uses Akka actors. So these are the two apps we're going to talk about today: how we take that raw data and push it to Kafka, and then how do we get it from Kafka to Vertica?
So this is really about our experience building the actor system and ultimately switching to Akka streams. What about the scale of the data here? We've got megabytes per minute over the last four years, so these aren't huge numbers; it's not necessarily Google or Facebook, but it's still not trivial, it's enough that we needed something we knew could scale. You'll notice that in 2014 we were doing about a hundred and sixty-two megabytes per minute, and you can see us doubling over the next few years. What I thought was really cool is that it's not strictly doubling, right?
It's actually exponential growth; we didn't necessarily expect it to double, but we're glad we prepared for it anyway. That's why we chose to use Scala: we wanted something on the JVM where we could have threads spun up as needed, but also Akka. Now, most of you know Akka actors; it's a different way of thinking about concurrent programming, leveraging the actor model: no locks, no synchronization necessarily. Actors are like a lightweight unit of work where messages are the input to that work, and it actually scales very well because all mutable state is contained within an actor. It easily scales to other boxes or even other threads: you send a message, the actor does something, so you can send that message over TCP if you want, you can pool actors. These are some of the reasons we opted to use Akka actors; Akka streams also weren't available at the time. It's a push-based model: receive a message, do something, change your behavior, create more actors, whatever. So we chose Akka actors because we wanted to scale with our data and we knew it would be easy, hopefully. The other thing that's nice about Akka actors: you can send to just one actor, or you can send to a router of actors; you don't necessarily need to know which, you're just sending a message that can be handled by any number of actors on the same box or on a different box.
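To make that model concrete, here is a minimal, self-contained sketch of the push-based style described above; it is not the Credit Karma code, the actor and message names are made up, and it assumes the classic pre-2.6 Akka APIs of the talk's era:

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// A minimal push-based actor: messages are the only input and
// mutable state stays confined inside the actor.
class Counter extends Actor {
  private var seen = 0L
  def receive: Receive = {
    case msg: String =>
      seen += 1
      println(s"${self.path.name} handled '$msg' (total seen: $seen)")
  }
}

object ActorBasics extends App {
  val system = ActorSystem("demo")

  // Send to a single actor...
  val single = system.actorOf(Props[Counter](), "single")
  single ! "hello"

  // ...or to a router pool; the sender doesn't care which instance does the work.
  val pool = system.actorOf(RoundRobinPool(4).props(Props[Counter]()), "pool")
  (1 to 8).foreach(i => pool ! s"event-$i")
}
```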
So what's the application we built using Akka actors here? Our data warehouse import. As you recall, we had three constraints: we want big batch inserts into Vertica, we need to deduplicate Kafka messages, and we need to convert our semi-structured JSON into a stricter structure; it was actually TSV that was preferred. So this is what we ended up creating: we had a single extractor for each topic in Kafka, for each table in Vertica, and that extractor was responsible for creating three other actors: a reader to get messages out of Kafka, a deduplicator to make sure we only pushed unique items (we did this using a sliding hashmap, essentially a bounded LinkedHashMap; a rough sketch of that idea follows at the end of this section), and a processor. So what happened is the extractor would send a push message to the reader, which would read a Kafka message and send it to the deduplicator, which would then send a message to the processor, which would actually make sure all the fields were there and the types were correct, and convert it to TSV. It would then send that message back to the extractor, which would write to disk, load it into Vertica when relevant, and then send another message to the reader to start the whole process over again. There's something to note here: we only had one in-flight message per extractor, so between Vertica and Kafka we only ever had one message from Kafka in flight.

This system of actors worked really well for a while. It worked so well that we decided, well, we've got this other Amplitude use case coming up, this analytics export, why don't we just use it to build that? So what did we have to do for this one? Well, we have a coordinator; Amplitude only makes this data available once an hour, so it's this large gzipped file that we pull from Amplitude once an hour, untar it, unzip it, and then we have a bunch of files that we want to do some transforms on as well, and then we send them over the network to our ingest server. So we ended up having a router of transform workers to take it from the Amplitude format and convert it to our internal format; then we had a Kafka import worker that would actually make the network call to our ingest server, which would receive a batch of messages via HTTP, write them to disk, and then, once it had written them all to disk, actually respond. We had a separate agent sitting on the box to consume those messages from disk and send them to Kafka. That's the high level of what we were doing. We built it, we were very excited about it, we shipped it, and we immediately ran into issues. The first issue we ran into was missing data.
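Before getting to those issues, here is the promised sketch of that "sliding hashmap" deduplication idea. The talk doesn't show the real code, so this is an assumed reconstruction: a LinkedHashMap that evicts its eldest entry, so duplicate checks only cover the most recent keys.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}
import java.util.Map.Entry

// A "sliding hashmap": a LinkedHashMap that evicts its eldest entry once it
// grows past maxSize, so duplicate checks only look at the most recent keys.
class SlidingDeduper(maxSize: Int) {
  private val recent =
    new JLinkedHashMap[String, java.lang.Boolean](16, 0.75f, false) {
      override def removeEldestEntry(eldest: Entry[String, java.lang.Boolean]): Boolean =
        size() > maxSize
    }

  /** True the first time a key is seen within the sliding window. */
  def isFirstSighting(key: String): Boolean =
    if (recent.containsKey(key)) false
    else { recent.put(key, java.lang.Boolean.TRUE); true }
}

// Usage: keep a message only if deduper.isFirstSighting(message.id)
```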
So, that missing data: this is the exception we saw after shipping. We were using spray at the time; we used the host-level connector, I think it's called, where you have more or less a pool of connections, you request a connection and then use that connection to make your actual outgoing network call. So we're looking at a future timing out while requesting a connection from spray, which means the connection pool was effectively saturated and we couldn't get a new connection within 30 seconds. We figured, okay, this is probably not good but not a big deal: work is piling up in memory, futures are timing out and data is being dropped. Okay, should be an easy fix: 30 seconds? What if we increase that to, I don't know, 30 minutes? That should be fine; what's the worst that could happen? Out of memory is the worst that happened. So while we were missing data before, now they're paging me because our app is crashing. We didn't really fix the root cause here; we're still waiting on the network, except now, instead of that future timing out, throwing an exception and the data disappearing, the data is pooling in memory, and eventually enough data pools in memory that the whole app crashes. We weren't really controlling the rate of work. Well again, easy fix, just increase the heap space, so we did that, but we knew it probably wouldn't work forever, so we decided to take a closer look at the architecture of what exactly we were doing. We get these Amplitude gzip files, we decompress them, read them from disk, and while reading them from disk we would
actually do this in-memory transform, which was pretty fast. Then we have a slightly slower network call, so decently fast, but then we have a very slow write to disk before responding. So literally we were seeing work pile up: the ingest server is slow, the import workers are going a bit faster, but still nothing is as fast as the in-memory transforms, so work just piles up at each step until the app falls over and I get paged, which I don't want. We knew this was going to be a problem, but we weren't sure what the solution was.
It was actually around that time that we looked at our data warehouse import and noticed that a similar paradigm was going on there too. It has a pretty fast network read from Kafka; recent messages are really just kept in memory, so it's mostly bound by the network. It still has a pretty fast transformation in memory, right? We're deduplicating or transforming JSON to TSV, and then, what a surprise, we're writing to disk; Vertica is likely also writing to disk when we load it. So it wasn't long before we started having problems there as well. Here we have a performance graph reading from Kafka, and that green line is our ingest rate, so 750,000 is our magic number: if we're not above that line, we're falling behind and will never catch up. So what change caused this weird behavior? We even have a couple of 20-minute periods where we don't do any work, which is not good, and while we're using swap space, data isn't loading into Vertica, so people start hounding me.
I'm trying to figure out what exactly changed, what's different here. We had a new schema coming through Kafka that had just gone live; previously our largest schema was about 8,000 events per minute, and this new schema was more on the order of 80,000 events per minute, and the whole system falls over. I'm not sure what's going on. There's this stuttering behavior, we're using swap space, which is never good for a JVM application; all I know is that people are chasing me and there has to be a better way. Akka streams had actually just gone GA at this point, and we decided, okay, let's take a look at this, maybe it can fix our problems. So what are Akka streams exactly? Whereas with actors it's all a push-based model, send a message, Akka streams feature this concept of back pressure, a pull-based model. It's still actors under the covers, but it's a much friendlier API that's actually optimized for performance.
It's also a very modular API that can be composed. You have your end subscriber, or a sink in Akka streams parlance, that sends a request for data, or demand, and that demand walks all the way up the chain to the start, at which point the data actually gets sent. This concept of back pressure was formalized in the reactive streams manifesto, or the reactive streams spec, sorry, and that's what Akka streams implements, so you can treat anything as a reactive streams subscriber or publisher. These are still non-blocking, asynchronous signals, because they're actors under the covers, but again it's a high-level API optimized for high performance.
I want to make sure to mention that Akka streams are limited to a single box. There's no remoting or clustering available, just because these are all actor messages: it's possible for one of those messages, if it were sent over TCP, to be lost, and if that message is lost it possibly goes to dead letters, so a stream needs to be restricted to a single box. That's the big thing that's important to note.

So what are some of the basics you'll find in Akka streams? Everything starts at a source; that's where your data comes from. I also want to note that everything is a graph stage. You have these source stages that have one output into the stream, because their input is something external to the stream, something like a file; it could be a queue or an iterator, it could even be an actor ref. There are a bunch of these built-in APIs that provide back pressure, so no custom code is needed. Once you have your data in the stream, you have an output that's exposed, and you have a flow. A flow is a way to potentially transform or change your data, something like a map or a filter; we'll cover some of those soon, but actually a lot of the things you'll find in the Scala collections API are there. It has one input and one output and it's just a way to do some sort of transformation. And because the API is really just "what's your input, what's your output, and the types", you can have composite flows, where you might want to expose three flows that do a variety of things as a single flow; that's possible. Once you've transformed or changed your data, it goes to the sink as the final resting place of the data, something with one input and zero outputs, because the output is something external to the stream again: something like a file, a queue, or sending back to an actor. Once you have connected all your inputs and outputs, you have a runnable graph, which is your blueprint; that's the work to be done. You call run on that blueprint, that runnable graph, and that's the point when all the resources are allocated and things materialize. Materialization here means you can have certain stages that materialize into something; something like a sink that collects into a sequence would materialize a Future of a sequence when run is actually called, and that's where all those resources are allocated. Also keep in mind that anything you provide to a stream is going to be a function; that's because that function is going to be called every time run is called.
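As a reference point, here is a minimal sketch of those basics: a source, a flow and a sink wired into a runnable graph. It is illustrative rather than code from the slides, and it assumes the classic pre-2.6 Akka Streams API of the talk's era.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import scala.concurrent.Future

object StreamBasics extends App {
  implicit val system: ActorSystem = ActorSystem("demo")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  val source: Source[Int, NotUsed]         = Source(1 to 10)      // one output
  val double: Flow[Int, Int, NotUsed]      = Flow[Int].map(_ * 2) // one input, one output
  val collect: Sink[Int, Future[Seq[Int]]] = Sink.seq[Int]        // one input, materializes a Future

  // Wiring the pieces together only builds a blueprint; nothing runs yet.
  val blueprint = source.via(double).toMat(collect)(Keep.right)

  // run() is where the actors are spun up and the sink's Future materializes.
  val result: Future[Seq[Int]] = blueprint.run()
  result.foreach(println)(system.dispatcher)
}
```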

So let's take a look at some code. As you recall, for our data warehouse import we were doing some deduplication and processing, so I want to show you something nice: how easy it is to write some unit tests that let me focus on my business logic and just plug in values. We have our deduplicator, which is a flow that takes a string, outputs a string, and materializes NotUsed, since that value isn't used; we had to write a custom graph stage for the deduplicator just because that functionality didn't exist. For the processor, effectively we're doing a map; I've written it as Flow.fromFunction, but normally you could just call map on a flow you already had, and you'll notice we're just calling processSchema, which takes a string and produces an Option of a string. So what does the composite look like?
We can say deduplicate via process, and we actually still have a flow, but now it's a string to an Option of a string; that's the composite flow, where we take two and put them together. Now let's connect some values to write our test. First we drop the empty values using a filter and a get; then we create some data we want to insert, we use a collection here as our source, and then we just send it to a sink, a sink to a sequence, and say okay, keep the materialized value. You can see we have a runnable graph that will materialize a Future of a sequence of strings. That's the other thing I really like about Akka streams personally: you get type safety, you know at compile time whether your pipeline is wired up correctly, otherwise it won't compile, because the types of all the inputs and outputs are known, and once it's connected you know you're type safe, something not available with Akka actors. We also need to create an actor system and then a materializer using that actor system, and then we call run; that actor materializer is an implicit value provided to the run function. We get our future, wait for it to resolve, and then just make sure the results are correct. So I can test my business logic, and I can plug in other values for our sources or sinks just to make sure my flows are correct.
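The slide code isn't in the transcript, so the following is a reconstructed sketch of that kind of test. The deduplicate and processSchema stages here are stand-ins (the real deduplicator was a custom graph stage), but the wiring, a Source of test data, a composed flow, Sink.seq with Keep.right, and a call to run(), follows what's described above.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object DedupeProcessSpecSketch extends App {
  implicit val system: ActorSystem = ActorSystem("test")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  // Stand-ins: the real deduplicator was a custom graph stage and
  // processSchema validated fields against a schema before emitting TSV.
  val deduplicate: Flow[String, String, NotUsed] =
    Flow[String].statefulMapConcat { () =>
      val seen = scala.collection.mutable.Set.empty[String]
      msg => if (seen.add(msg)) msg :: Nil else Nil
    }
  def processSchema(msg: String): Option[String] = Some(msg.toUpperCase)

  // Composing two flows still gives a Flow, now String => Option[String].
  val pipeline: Flow[String, Option[String], NotUsed] = deduplicate.map(processSchema)

  // Plug in test values as the source and Sink.seq as the sink,
  // keeping the materialized Future[Seq[String]].
  val graph = Source(List("a", "b", "a", "c"))
    .via(pipeline)
    .collect { case Some(v) => v } // drop the values the flow filtered out
    .toMat(Sink.seq)(Keep.right)

  val results = Await.result(graph.run(), 5.seconds)
  assert(results == Seq("A", "B", "C"))
  system.terminate()
}
```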
I can focus on what it is I want to test instead of wiring up all those actors and making sure the case classes are correct and things like that. So what are some of the built-in APIs? For built-in sources you have things like an actor ref, a single element, a collection, an iterator, or a path; there's built-in functionality to read and write from the file system, all of which you get out of the box with back pressure end to end, no custom code needed. Similarly for sinks: we have things like head and headOption, which should be starting to look familiar to all of you who have used the Scala collections API, which I'm assuming is everyone here, but also things like last, foreach and fold, and you can materialize a queue that you can read or write on either end if you need to communicate with the outside world. Similarly, with a lot of the processing stages built in, you'll see some familiar functions: grouped, sliding, take, takeWhile, things like that. And for any of those that expect a number of elements, there's another "within" version where you can provide a timeout, so if you're building a batch or something and you don't get enough items within five minutes, or whatever number you prefer, it will just emit the values it has at that point, because the stream is potentially infinite, potentially bounded, who knows; if you don't get enough in time, it just emits the values anyway. And with some of the StreamConverters, you have an InputStream, an OutputStream, and even a Java 8 Stream, if you want to connect external things as sources or sinks in your stream; you can use those functions.
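For example, groupedWithin is the "within" variant of batching mentioned above: if a batch doesn't fill up before the timeout, it emits whatever it has. A small sketch, with hypothetical sizes and intervals:

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object GroupedWithinSketch extends App {
  implicit val system: ActorSystem = ActorSystem("batching")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  // Batch up to 1000 elements, but flush whatever has accumulated once
  // 5 seconds pass, so a slow upstream can't stall a batch forever.
  Source.tick(0.seconds, 10.millis, "event")
    .groupedWithin(1000, 5.seconds)
    .runWith(Sink.foreach(batch => println(s"batch of ${batch.size}")))
}
```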
So let's take a look at our data warehouse import, what exactly we were doing and what it looks like with flows. As I mentioned before, we had created a stripped-down version of back pressure; we inadvertently built our own little mini Akka streams, just because Kafka is potentially limitless. We had to wrap each message to our actors in a case class: the extractor would send a case class to the reader, which would get stuff from Kafka, put it in a case class and send it to the deduplicator, which would wrap its result in a case class and send it on, and the processor would wrap its output in a case class and send it to the extractor. So we're creating five case classes for every message we get from Kafka. That is important later.

When we remove some of those actors and replace them with streams, well, it's much simpler and there are fewer boxes in my diagram: you just have the stream now. You have an extractor which instead spins up a stream, and that stream does all the work for you. So we now have an extractor that creates a stream for every topic in Kafka, for every table in Vertica; it's easier to reason about and we don't have to worry about type safety anymore, because we know it's correct, otherwise it wouldn't compile.

So what does the code do? You'll see we have a runnable graph of NotUsed; that's because we didn't materialize anything. We start, like I said, with a function to get our Kafka iterator. We also created a message parser; that's where we predefine, hey, here's the schema you're using, the types to expect. We send it through our deduplicator that we saw earlier, at which point we pretty much map over it to process it, which is converting that JSON to TSV, and then we send it to a sink to an actor that we had previously, our schema file writer. This used the subscriber/publisher actor feature, which I think is deprecated at this point, but it was really useful for us; the concept is that you can have a subscriber or publisher actor and it will provide back pressure, which helps if you're converting an existing actor system. I think it's not really supported anymore and you'd have to use custom graph stages at this point, which are pretty easy to write. Either way, we have back pressure the whole way through, but we don't actually have a running stream yet; we still have a runnable graph. You'll notice in our receive block below, because we're actually still inside an actor, that extractor as you recall, it's not until we get the start-extract message that we actually call run, and that's the point at which the streams are actually created, the resources are allocated and the work starts; that's where the magic happens. So now it's more performant, the pipeline is type safe, and we know everything that's going on; it's easier to reason about.
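Here is a rough sketch of what that extractor shape might look like. The message name, helpers and sink are hypothetical stand-ins, but it shows the pattern from the talk: build the runnable graph up front and only call run() from the actor's receive block.

```scala
import akka.NotUsed
import akka.actor.Actor
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

// Hypothetical message name, reconstructed for illustration.
case object StartExtract

class ExtractorSketch(kafkaIterator: () => Iterator[String],
                      deduplicate: Flow[String, String, NotUsed],
                      toTsv: String => Option[String]) extends Actor {

  private implicit val mat: ActorMaterializer = ActorMaterializer()(context)

  // Still only a blueprint (a RunnableGraph[NotUsed]); nothing runs yet.
  private val blueprint =
    Source.fromIterator(kafkaIterator)
      .via(deduplicate)
      .map(toTsv)
      .collect { case Some(tsv) => tsv }
      .to(Sink.foreach { row =>
        // stand-in for the schema file writer that batched rows for Vertica
        println(row)
      })

  def receive: Receive = {
    case StartExtract => blueprint.run() // resources are allocated here
  }
}
```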
So what about our other service, the analytics export? We replaced everything with a stream. Similar to before, and since our unzip step created multiple files, we ended up creating one stream per file; that's our high-level diagram. What does the code look like? This one is a bit more complicated than the previous one. For each file we have, we're going to read from disk first; as I said, that functionality is built into Akka streams, so we can use the fromPath method on the FileIO object. That gives us raw bytes, a ByteString, so at that point you have to frame it, because we just have a bunch of bytes coming in and we want to split it by newlines. That's what framing does: once it sees a newline it says, okay, this is a single event. Then we convert it to a UTF-8 string, and that's where we use something called mapAsync. There are some built-in stages you can run asynchronously, map being one of them, and what mapAsync does is similar to how we had a router of actors before: we can now run our map in parallel, so instead of A to B we now have A to a Future of B, but we can run it in parallel by providing a desired level of parallelism to that function. What happens is that once all those slots are filled and all your maps are working, it will back pressure: it stops asking for new work until one of those slots opens up. So here we used it to parse the thing into a JSON object, then do a filter to throw out the schemas we don't care about, convert it to our internal type that we're familiar with, then run mapAsync again where we do our internal transformation; this is something we want to parallelize, it's a little more expensive. Then we batch it into the batch size we want for our network call, and then we make our network call; that's the last asynchronous map, and this is where we had problems before, because we had work piling up and we only had so many connections. Now we can explicitly provide this parallelism input to that function, and once all those slots are full, all those connections are in use, we just back pressure; we wait, and once a slot opens we can pull another batch and send it. So that's where we make our network call, and then we send it to our final sink, which I think was using Akka Persistence to checkpoint: cool, we're done with this file, move on. Again, it's easier to reason about, it provides type safety, we know we're doing the right thing; we don't need to connect all those routers, we don't need to manage those ourselves, it's all done for you in a very friendly API.

So was it more performant? Let's take a look. Before, here we have heap utilization in gigs: you can see we had to reallocate multiple times, that's why I kept getting paged and had to keep increasing the numbers, but we were using about 28 gigs. Then if you look at the after, we're using about a third of that, something on the order of 12, and it's a lot more predictable and consistent throughout. Either way, now we can run other things on that box, we have constant utilization, and I don't get paged anymore, so I'm pretty happy with it.
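A condensed sketch of that per-file pipeline follows. The file path, parsing/transform/HTTP stand-ins, parallelism levels and batch size are all assumptions; only the stage order (FileIO.fromPath, Framing.delimiter on newlines, utf8String, mapAsync parse, filter, mapAsync transform, batching, mapAsync for the network call) follows the description above.

```scala
import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Framing, Sink}
import akka.util.ByteString
import scala.concurrent.Future

object ExportPipelineSketch extends App {
  implicit val system: ActorSystem = ActorSystem("export")
  implicit val mat: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  // Hypothetical stand-ins for the real parsing / transform / HTTP steps.
  def parseJson(line: String): Future[Map[String, String]] = Future(Map("raw" -> line))
  def transform(event: Map[String, String]): Future[Map[String, String]] = Future(event)
  def postBatch(batch: Seq[Map[String, String]]): Future[Unit] = Future(())

  val done = FileIO.fromPath(Paths.get("/tmp/amplitude-export.json"))
    // Split the raw bytes on newlines so each element is one event.
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1 << 20, allowTruncation = true))
    .map(_.utf8String)
    // Parse in parallel; once 4 futures are in flight, this stage back pressures.
    .mapAsync(parallelism = 4)(parseJson)
    .filter(_.nonEmpty)
    .mapAsync(parallelism = 4)(transform)
    // Batch for the network call.
    .grouped(500)
    // Only 2 requests in flight at a time: no more saturated connection pools.
    .mapAsync(parallelism = 2)(postBatch)
    .runWith(Sink.ignore)

  done.onComplete(_ => system.terminate())
}
```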
What about our data warehouse import? That graph was still haunting me; we had this stuttering behavior and it clearly wasn't working for us. So that was our before; what does our after look like? It takes a bit to ramp up to its peak of nine million, but we consistently sit between 8 and 11 million, leaving our previous number of 750,000 in the dust. Our ingest rates, as you saw earlier, have continued to rise, so luckily we had switched to streams to be able to support that, and this thing had no problem, effectively running at probably full capacity. But I was still bothered by that pausing behavior, that 20-minute period where nothing even happened.
It took me a few benchmark tests, but I started to notice some trends, and I think I know what the problem was. Remember before I said we were creating about five case classes for every message we receive from Kafka? I think that was too many, because if you take a look, we have the garbage collection times shown here: the before is at the top and the after is at the bottom. I want to make sure everyone notices that the before at the top is in seconds and the after at the bottom is in milliseconds, so we're talking several orders of magnitude of difference. You can see there are some periods in there where we literally spend more time garbage collecting than doing the actual work, you know, several 20-25 second pauses, and that's the stuttering behavior I was seeing. The other thing is, if you're hitting swap space during your garbage collection, that's going to take a long time; that's something you want to make sure you never do, just because having to chase references out to disk is very slow. At the bottom we have 40 to 80 milliseconds: the application barely even notices, performance is barely hampered, and things work much better. So performance improved.
I can finally get some sleep, and nobody is showing up at my desk asking, hey Zack, where is my data? So this was a huge win for us. But let's be honest, there are some skeptics out there who might be thinking, okay, no, your code was probably just really bad; if you'd done it right it would have been a lot, a lot better. So for those people I decided, okay, what if I change my hat? What if I write some code where I focus on the business logic and do an apples-to-apples comparison of actors and streams? So that's what I did.
I decided, okay, let's pretend I'm not Zack, a platform engineer at Credit Karma; instead I work for the Wikipedia Security Corporation. Due to changing political climates, we want to make sure Wikipedia is secure for eternity. For those of you who don't know, Wikipedia actually publishes a dump of their entire website, all the data on Wikipedia, twice a month. It's a fairly large compressed file which, once unzipped, is about 60 gigabytes. So I decided, okay, I work for the Wikipedia Security Corporation, I hear actors are cool, streams are pretty cool too, I don't know which one to use, so let's compare and contrast.
I want to focus on my core business logic, which is encrypting Wikipedia files. So I wrote an XML parser; okay, I didn't write one, I used the one in the JDK, but more or less we need to parse some XML. Wikipedia publishes it in this wiki markup language, so we also need to do some parsing there; then we do some encryption, we used RSA, and then we write it back to disk, just to make sure it's all safe. I wrote the business logic separately, and then connected it to very thin actors and streams and took a look at each. Just for consistency, let's look at our speed distinctions here: we have a slow read from disk, a pretty fast in-memory transform, slightly slower encryption because it's RSA, and then a pretty slow write to disk. So what did we do for this?
We rented an AWS box, a quad-core box with 8 gigs of RAM, and I thought it would be interesting to look at three different disk speeds. So we need to parse the XML; I pre-downloaded the file, about 60 gigs. I want to make sure everyone notices: it's a 60 gig file and we have 8 gigs of RAM, so 60 being bigger than 8, we'll potentially have problems. We used the built-in JDK parser for XML, which is in the javax.xml.stream package. I wrote a trivial wiki markup parser; Scala has parser combinators built in, so it's easy to build one. RSA encryption is built into the JDK.
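For reference, here is a tiny sketch of driving the JDK's StAX parser from Scala; the file path and the element being pulled out are assumptions. The point is just that events are pulled one at a time, so the 60 GB dump never has to fit in memory.

```scala
import java.io.FileInputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object StaxSketch extends App {
  val factory = XMLInputFactory.newInstance()
  val reader  = factory.createXMLStreamReader(new FileInputStream("/tmp/enwiki.xml"))

  // Pull parsing: step through events and emit the text of each <title> element.
  while (reader.hasNext) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT && reader.getLocalName == "title")
      println(reader.getElementText)
  }
  reader.close()
}
```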
I used a FileOutputStream for the actors, and since it makes sense, I used the Akka streams file API for the streams, because I get back pressure and it's nice that everything is built into the stream; why wouldn't I? So the first test we ran was our baseline test: just read the stuff from disk, parse the XML, and write it back out. The idea is that we shouldn't be able to go any faster than this baseline. It took us 16 minutes and 27 seconds. Very good. Why don't we try it with actors? I work at the Wikipedia Security Corporation.
I don't know any better, so can anyone guess what happens? Out of memory: no back pressure or anything, we're just spamming these actors, and we eventually ran out of memory. But it turns out there's this concept called a bounded mailbox, so you can define the size of the mailbox used for those actors, and if that mailbox is full, sends will block until there's space in the mailbox (a rough config sketch follows after this paragraph). So once we used the bounded mailbox, and we also used some routers just to make it interesting, the actors finished; it took about 40 minutes and 27 seconds. Raw streams? Actually slower, and I suspect there are a few things going on here, but because there were no asynchronous boundaries in my stream, I think it fused everything into one stage and it was still a bit slow. Either way, once we used mapAsync it was significantly faster: 28 minutes compared to 40 minutes. That's still a big improvement, but it wasn't only about going faster; there were a few other results that I personally thought were interesting when we look at the different disk speeds. In this graph, lower is better, because this is running time, and you can see we have three different disk speeds for the source: slow, medium and fast. We have our baseline test where we don't really do anything, we have streams using mapAsync, raw streams, and then two different actor implementations using different routers, one using a balancing mailbox. What I thought was really interesting here is that with streams, as our source disk got faster, everything got faster; pretty good, makes sense, it's designed for high performance, reducing bottlenecks. We see the opposite with actors: as our source gets faster, everything actually gets slower.
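Here is the bounded-mailbox sketch referred to above; the capacity, timeout and worker actor are made up, but this is the general shape of configuring akka.dispatch.BoundedMailbox and attaching it to a worker.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

// Hypothetical worker; the benchmark's encrypt actor isn't shown in the talk.
class EncryptWorker extends Actor {
  def receive: Receive = {
    case _: String => () // encrypt the page and hand it off (omitted in this sketch)
  }
}

object BoundedMailboxSketch extends App {
  // A bounded mailbox blocks the sender (up to the push timeout) once
  // mailbox-capacity messages are queued, which is what kept the actor
  // version of the benchmark from running out of memory.
  val config = ConfigFactory.parseString(
    """bounded-mailbox {
      |  mailbox-type = "akka.dispatch.BoundedMailbox"
      |  mailbox-capacity = 1000
      |  mailbox-push-timeout-time = 10s
      |}""".stripMargin)

  val system = ActorSystem("bench", config)
  val worker = system.actorOf(Props[EncryptWorker]().withMailbox("bounded-mailbox"))
}
```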
Coming back to those disk-speed results: I suspect this is just because the work arrives a bit faster and nothing is controlling the rate of work in any way; there's more overhead in terms of objects, and beyond that your guess is as good as mine. But either way, streams are designed to reduce bottlenecks, and you really see that here: as there's more speed available to drive the system, the whole system goes faster, whereas with the actors the reverse seems to happen. So it wasn't only about the performance; I had some realizations while writing the code for this. Now, no one try to read the code on the screen.
I just want to illustrate a point. First of all, when I implemented the actors, obviously I had to spin up all of those actors and point them at each other, but then I realized I also needed to implement a parse actor, then a file-write actor and an encrypt actor. Then I figured, oh, the actor system doesn't know when it's done, so I needed to add a completion message to send once all the messages had come through. Anyway, it felt like a bunch of boilerplate.
I wanted to focus on my business logic; if I wanted to write boilerplate I would just write Java. So, a lot of boilerplate, even though we're pretty much just doing maps, and I had to write all these actors to do it. Meanwhile with streams: a lot fewer lines of code, and it's also a lot more readable, easier to understand what's going on. If we take a look at what's going on here: we start with a source of XML. I had to write a custom source to extract those messages from the XML, but I do something similar for the actors, so I thought it was fair to leave it out of both. You'll also notice this is all a composite source: the types on these are sources until we get to the end, when we actually have a runnable graph. So we start on the right: we provide a FileInputStream to our source, then we parse the messages, giving us a Try of the wiki page; here you can see one version has a mapAsync and the other has a plain old map. Then we remove the stuff that failed, because we don't really care about it, then we do our encryption, again another mapAsync or a map, then we discard the stuff that failed, convert it to a ByteString and write it to disk. So it's not just a higher-performing API; I get to focus on my business logic instead of trying to make sure all the pipes are properly connected between all the actors.
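A rough sketch of that stream pipeline follows. The parser, encryptor and paths are hypothetical stand-ins, and the custom XML-splitting source is reduced to a placeholder; only the shape, read, mapAsync parse, drop failures, mapAsync encrypt, drop failures, ByteString, FileIO.toPath, matches the description above.

```scala
import java.io.FileInputStream
import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Source, StreamConverters}
import akka.util.ByteString
import scala.concurrent.Future
import scala.util.{Success, Try}

object WikiPipelineSketch extends App {
  implicit val system: ActorSystem = ActorSystem("wiki")
  implicit val mat: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  // Hypothetical stand-ins: the benchmark's custom XML source, wiki-markup
  // parser and RSA encryptor are not shown in the talk.
  final case class WikiPage(title: String, text: String)
  def parsePage(chunk: String): Try[WikiPage]   = Try(WikiPage("title", chunk))
  def encrypt(page: WikiPage): Try[Array[Byte]] = Try(page.text.getBytes("UTF-8"))

  val pages: Source[String, _] =
    StreamConverters.fromInputStream(() => new FileInputStream("/tmp/enwiki.xml"))
      .map(_.utf8String) // placeholder for the custom XML-splitting source

  val done = pages
    .mapAsync(4)(chunk => Future(parsePage(chunk)))     // parse in parallel
    .collect { case Success(page) => page }             // drop failed parses
    .mapAsync(4)(page => Future(encrypt(page)))         // RSA is the expensive step
    .collect { case Success(bytes) => ByteString(bytes) }
    .runWith(FileIO.toPath(Paths.get("/tmp/enwiki.encrypted")))

  done.onComplete(_ => system.terminate())
}
```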
I can focus on the business logic of what I'm trying to do here, which is keeping Wikipedia safe. But there were a few other things, right? First of all, to get the actors to work we had to use a bounded mailbox; we also had to use routers just to make it interesting (it took several hours when I tried not to use them, and I didn't even bother to publish those results), and we had to add a completion message, while I get all of that more or less for free using Akka streams. Some of you might say, no, no, no, you can tune the actors more if you do this or that, because that's the conversation I keep having with my coworkers, but that's not really the point I'm trying to make here. I could spend my whole life trying to tune Akka; that's not what I'm paid to do. I want to focus on the business logic and leave the rest to the experts, or whoever wants to do those things. So that's my two cents.

Let's look at some of the performance metrics from those benchmarks. We have GC timings here: our baseline test first, then our two stream implementations, then our two actor implementations. The streams are pretty much fine, with about 20 milliseconds of consistent GC pauses, and you can see the actors are all over the place; it makes sense, you're creating a lot more case classes. Again, a much more consistent profile for streams the whole time, while with actors you have this high variance everywhere, which probably isn't going to help performance. You see similar trends with load: this is a four-core box, so when you see our actor tests going up to 20 or 25, that box is not having a good time; it's not going to scale much further, and if we wanted to do more with that implementation it would probably fall over, because it's already being pushed to the limit. Meanwhile the streams sit mostly a little under 10, around 8, consistently, more or less the whole time, so one of these will scale much better than the other. You'll see similar trends with heap size: it doesn't even move with the streams, it sat at about 750 megs, while the actors keep climbing, and that's where our allocation, or our GC times, come from, because there are heaps of objects everywhere. Streams manage most of that for you, and it's just consistent throughout the whole process. So what were our learnings?
Our takeaways: raw Akka actors are very powerful, but there's no back pressure, so you can really get into trouble, especially when you're going from fast to slow. They're also not especially composable: you need to make sure the types are correct, you need to point them at each other. They're good for low latency and if you need clustering; and frankly, if your dataset is small and fits in memory, use whatever you want, it doesn't necessarily matter, you probably won't run into a problem. Meanwhile, Akka streams are optimized for high throughput; they're still built on top of Akka actors, but they're designed to reduce bottlenecks and come with a very friendly high-level API. And if your data doesn't fit on a single box, you'll probably have to use a source that scales along with your data, something like a Kafka consumer, or split the files yourself, or whatever; you'll probably have to partition that yourself. So don't be like me: don't try to build something high performance with a raw actor system, just start with Akka streams. The folks at Lightbend have done it for you; it's a very nice and useful API that I enjoy working with. So thank you very much. Any questions?

Okay, thanks for the talk, but can you explain in more detail how you connected Akka streams to Kafka? Did you use the Kafka consumer as a source?

So yeah, I was using it as a source. In my specific example we were just using the Kafka iterator at the time. I think reactive-kafka, which is now maintained by Akka, is a pretty good API too, but we were just using the raw iterator that Kafka provides, and because you can create a source by wrapping an iterator, we just used the iterator, and that worked for us at that point. There were a few things we were missing in terms of committing offsets, but for our use case at the time it worked. Does that answer your question? Okay, thanks. Anyone else?

Thanks, that was a really good example of implementing a client.
Did you consider something like Kafka Streams? I've recently compared a few different tools; sadly I haven't had a chance to play with Kafka Streams, so anything I said about it would be mostly anecdotal, and I'll stay away from that for now. I can't say we tried it for that specific use case: at the time we already had an actor system and it seemed sensible to just convert all of those pieces, plus I don't think Kafka Streams existed at the time, so no, we didn't really consider it. Okay, thank you. All right, thanks for coming, everyone.
