
Natalie Dullerud | CaPC - Confidential and Private Collaborative Learning

Mar 10, 2024
Hello everyone, welcome to the Intelligent Cooperation group. This is a special series where we focus a lot on cryptography and the intersection of cryptography and machine learning, and in particular the intersection of cryptography and privacy-enhancing technologies, and the promise they hold for different cooperative structures between humans, and ultimately maybe also for AIs. To lay the groundwork a little bit: this is our first meeting this year, so I'm very excited about the year ahead. This group is now in its third year, so thanks also to those of you who have been here for a really long journey.
I also want to mention an event that we have in person; many of you were already there last year: our Crypto, Security & AI Workshop, which looks a little bit more into decentralized paths toward human and crypto-enabled AI cooperation. If you're interested in that, it's going to happen in the Bay Area in July. We also have a lot of really interesting seminars coming up, all in our seminar group; I'll share them here in the chat. We'll also have Divya.

We have Stuart Russell next, and today we have the wonderful Natalie Dullerud from Stanford, who is at least physically relatively close to me right now. We are very happy to have you here and very excited about the work that you are doing, and the talk today is on confidential and private collaborative learning, which is the main topic for this group. I will share more information about you here in the chat. I'll be in the chat, so if you have any questions, ask them in the chat or raise your hand afterwards and we'll get to them. Thank you so much for coming; it's a real pleasure to have you. Yeah, thank you so much for having me.
I'm going to go ahead and share my slides and then we can start the presentation. So yeah, thank you very much, Allison, for introducing me, and thank you to everyone here for coming to the presentation. As Allison said, my name is Natalie Dullerud, and I'm currently a PhD student at Stanford. Today I'm going to talk about a project that I did during my master's degree at the University of Toronto in Dr. Nicolas Papernot's lab. This was published at a machine learning conference last year: CaPC, confidential and private collaborative learning, which was a group effort with my amazing colleagues listed here. I'll give a brief overview of federated and distributed learning and the privacy issues that arise in this environment, and then our proposed solution and future work on the security of distributed learning.
I know many here are probably familiar with some of these topics, so I'll try to focus a little more on the technical details and on the specific solution we developed later in the presentation. So let's go ahead and start with some conceptualizations of distributed and federated learning. As I said, many in the audience are probably familiar with at least some context about distributed learning. Throughout the talk I'll use federated, collaborative, and distributed learning interchangeably, although in terms of the actual use of these terms there are some minor differences. In distributed learning, the high-level idea is that we have some set of parties or devices, each of which has its own data, and they interact to train a model with their collective data. There are different methods for how the interaction between these parties will occur.
I will quickly walk through the centralized framework and describe some variations, such as decentralized learning and heterogeneous learning. In the centralized distributed learning setting, there is a single model that the parties learn, and the model is usually held on a trusted central server. Each party or device involved in the learning process retains its own set of data, which is not shared between devices or with the central server. Assuming we are using stochastic gradient descent or some similar algorithm to update the model, a copy of the global model is sent to each party's device, which then computes the gradients with respect to the parameters of the global model on the locally stored data. These gradients are returned to the central server, which aggregates them across devices, usually through averaging, and updates the model via gradient descent. As with traditional machine learning methods, this is generally an iterative process that continues for a certain number of epochs. That is the centralized setup; in decentralized learning, rather than a central server mediating cross-device training, the parties or devices themselves coordinate to train a global model. There has also been growing interest in a concept called heterogeneous distributed learning: one can imagine that many application domains involve a large set of heterogeneous devices or parties that might want to train their own local models with unique model architectures, but still want to benefit from accessing data from other devices without having to copy that data to their own device.
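As a rough, hypothetical illustration of the centralized setup just described, here is a minimal sketch of one round of federated averaging; the linear model, synthetic party datasets, and learning rate are placeholder assumptions, not anything from the talk.

```python
import numpy as np

# Minimal sketch of one round of centralized federated learning (FedAvg-style).
# The linear model, synthetic party data, and learning rate are placeholders.

def local_gradient(weights, X, y):
    """Gradient of mean squared error for a linear model, computed on local data."""
    preds = X @ weights
    return X.T @ (preds - y) / len(y)

def federated_round(global_weights, party_datasets, lr=0.1):
    # 1. The server sends a copy of the global model to each party.
    # 2. Each party computes gradients on its own locally stored data.
    grads = [local_gradient(global_weights, X, y) for X, y in party_datasets]
    # 3. The server aggregates the gradients (here by simple averaging) and steps.
    return global_weights - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
parties = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(5)]
w = np.zeros(3)
for _ in range(10):  # iterate for some number of rounds/epochs
    w = federated_round(w, parties)
print(w)
```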
This is a nascent field and somewhat difficult to solve in the federated learning setup, and we'll come back to it later. Now that I've introduced some background on federated learning, I'll talk a little bit about why federated learning is increasingly seen as the new horizon in AI. In traditional machine learning, cost and resource efficiency are a big issue: to train a very high-performance deep neural network model, you need enough data, enough storage capacity, and access to GPUs for up to several days. With federated learning, we outsource the stochastic gradient descent (SGD) steps to potentially millions of parties, each of which has its own set of data.
Because computation happens at the device level, we save a lot of time on computation, and this setup allows for continuous learning on end-user or third-party devices while ensuring that end-user data does not leave the device. On that point, while we constantly discuss living in the era of big data, data is commonly distributed across devices. For example, every smartphone user has tons of text-prediction information from all the applications that take text input, and that is stored on their device. If we think at an institutional level, banks, hospitals, and other institutions often have large amounts of data stored locally and protected. Federated learning also allows the server to save storage
by requiring only the aggregation of model updates, without needing to store the aggregated data from all participating devices. One of the last important reasons why federated learning has become so big is that natural language processing models like BERT are increasing dramatically in size, are widely used, and are expanding in their use cases. NLP models usually require huge sets of text data that can be difficult to accumulate and store in a central location, and they often require a lot of GPUs to train; for that reason, the distributed computational and storage load offered by federated learning is ideal for NLP. Okay, now we can move on to understanding where privacy issues generally arise in standard federated learning. We naturally want to motivate why securing federated learning is an important issue, because intuitively, since the data is stored on each device separately and no data is passed between devices, in theory, and in a colloquial sense, this means there are no privacy violations. Someone would intuitively say that it seems like there would be no data privacy problem at all, and this is a commonly stated advantage of
federated learning. But the truth is that even if you don't have explicit dataset information exchanged directly between the server and the devices, or between the devices themselves, privacy concerns still exist. Even though the devices do not explicitly share data with any other party, the training computations are computed on the data, and for that reason information could be leaked through this mechanism. This is quite an important point: privacy as a concept is not really a question of whether an attacker can get all the information about private data, but of whether an attacker can get additional information, and whenever any kind of information related to private data is shared, we should be concerned. Researchers in distributed settings have found that there are a multitude of attack surfaces that need to be addressed in federated learning.
I'll go over the broad categories of attacks that have been identified, and then we can talk about some of the proposed tools to use as defenses. In the last few slides I introduced predominantly the privacy concern of data leakage, and with data leakage we're talking about, potentially, membership inference or plaintext access to private data; in this case, plaintext means unencrypted data. In this simplified example, we have the two colored parties here, and through some mechanism the red party gets unauthorized access to data from the green party, where the access they get is unencrypted. Alternatively, in inference attacks, we are concerned that based on the gradients returned by the green party to the central server, or sent to the red party in a decentralized setup, the red party could infer membership of data points in the green party's data. There has only been limited exploration of this attack, and in a restricted context, but the idea is that because the data points in the green party's data have been used to train the model, the model will perform very well on the green data, and for that reason the gradients will probably be very small on the green data. So the red party, based on the gradients with respect to different data points, could potentially discover which data points were in the green party's training data. Ultimately, the data leak results in the red party discovering the private data points stored as part of the green party's data.

Moving on to model leakage: this is a concern in the heterogeneous setting I mentioned earlier, where each party trains its own model architecture separately but leverages federated learning to increase the accuracy of its model. In many cases, model architectures could be considered intellectual property, so if the red party can discover the model architecture designed by the green party, the red party could then publicly claim ownership of a model they did not design.

A different kind of privacy issue, beyond model and data leakage (which are more intuitive to us, someone stealing something), is corrupted computation. This doesn't really involve any information leak, but let's say a device is adversarial and wants the training process to result in a poorly performing model. The device could simply return updates to the model that are completely random, corrupted, or adversarial in the sense that they explicitly hinder the model's performance. Instead of computing a valid gradient descent update on the data, they just return some random information, and that random information is then used to update the model, which will degrade it. Intuitively, this will be a minor problem if we have hundreds of thousands or millions of devices, but if many of the devices are compromised or corrupted, we can end up with a very poor model.

Finally, in the centralized configuration, or whenever there is a central execution or communication channel, we often assume that the server is trusted, but side-channel attacks are possible on the central server, and attackers could obtain information from the central server that they are not authorized to have.
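To make the membership-inference intuition above concrete, here is a toy, fully synthetic sketch (not from the talk): a model that overfits its training set has much lower loss on member points than on unseen points, and an attacker who can evaluate the model or see gradients can exploit that gap.

```python
import numpy as np

# Toy illustration of the membership-inference intuition described above:
# points used to train the model typically have much smaller loss (and gradients)
# than points the model has never seen. Everything here is synthetic.

rng = np.random.default_rng(1)
d, n_members, n_nonmembers = 15, 20, 20
true_w = rng.normal(size=d)

X_in = rng.normal(size=(n_members, d))
y_in = X_in @ true_w + rng.normal(size=n_members)
X_out = rng.normal(size=(n_nonmembers, d))
y_out = X_out @ true_w + rng.normal(size=n_nonmembers)

# "Green party's model": ordinary least squares fit on the member data only.
w_hat = np.linalg.lstsq(X_in, y_in, rcond=None)[0]

loss_in = np.mean((X_in @ w_hat - y_in) ** 2)
loss_out = np.mean((X_out @ w_hat - y_out) ** 2)
print(f"mean loss on training (member) points:   {loss_in:.3f}")
print(f"mean loss on unseen (non-member) points: {loss_out:.3f}")
# The gap between these two numbers is what a membership-inference attacker
# exploits, e.g. by thresholding per-point loss or gradient norms.
```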
Okay, what tools can we use to solve these problems and secure distributed learning? Based on the attack surfaces I discussed earlier, we can use, for example, differential privacy, which is generally used to protect against the inference attacks I mentioned. I'm not going to go into very extensive detail on the theoretical formulation, but essentially differential privacy in deep learning generally boils down to adding noise to the gradient updates passed between devices and the central server, to try to obfuscate any recoverable information about individuals in the private training dataset held by each party. That way, an adversarial party or device cannot use gradients to determine whether a data point was contained in another party's private data. But obviously, adding noise to gradients will cause some degradation in the utility of the model. We can also turn to cryptography for our remaining tools: computing on encrypted data and using trusted execution environments, also called TEEs, on the central server could restrict plaintext access to the data or gradients. However, TEEs are still vulnerable to side-channel attacks, so there are still some drawbacks to the two methods I have mentioned so far. Finally, we have verifiable computation, where an auditor engages in an interactive process with a device to ensure that the training computation was carried out correctly. This can be used to verify that the returned training updates were computed correctly, so that adversarial devices cannot return corrupted updates to the model. This approach to corrupted devices is still in its infancy in machine learning, but if federated learning is deployed at a large scale, it is an important piece of securing federated learning. There are currently many state-of-the-art approaches that use these tools to defend against the various attacks we have discussed, but in many cases these are predominantly research tools that have not yet been deployed, and there are still several drawbacks that could affect real deployment. Standard federated learning also has some disadvantages that we need to overcome to improve its applicability. In general, the standard configuration is inflexible: as I mentioned earlier, heterogeneous federated learning has large-scale applicability but is difficult to achieve cost-effectively in standard federated learning, and combined with restricted communication capabilities in centralized and decentralized frameworks, this leads to a kind of inflexibility.
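Here is a hedged sketch of the idea that differential privacy in deep learning "boils down to adding noise to gradient updates," in the spirit of DP-SGD-style aggregation; the clipping norm and noise multiplier are illustrative assumptions, and a real system would choose them with a proper privacy accountant.

```python
import numpy as np

# Sketch of DP-style gradient aggregation: clip each party's update, then add
# Gaussian noise calibrated to the clipping norm. Values are illustrative only.

def dp_aggregate(gradients, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    rng = rng or np.random.default_rng()
    clipped = []
    for g in gradients:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))  # bound each contribution
    total = np.sum(clipped, axis=0)
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(gradients)  # noisy average used to update the model

rng = np.random.default_rng(0)
party_grads = [rng.normal(size=10) for _ in range(50)]
print(dp_aggregate(party_grads, rng=rng))
```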
The second point is that standard federated learning typically requires millions of devices or parties to achieve an improvement over traditional machine learning. Access to millions of parties is quite realistic in some settings, for example if the parties are smartphones, but if the parties are institutional, like hospitals, that's not necessarily realistic; maybe a handful of hospitals will collaborate. And obviously, standard federated learning doesn't provide any level of privacy either. While the privacy-preserving methods that I presented on the last slide address this last point, privacy-preserving federated learning tends to have significant privacy-utility tradeoffs, in which increasing privacy leads to potentially much worse model utility or computation time, and that ultimately negates the purpose of collective model training. Something similar can be said of the computational overhead introduced by privacy measures, which hinders a supposed advantage of federated learning, namely efficiency. Furthermore, we have the problem that there is no elegant solution to address all the privacy concerns that we have: most approaches to private federated learning only address one of the attack surfaces we discussed, and at best what we have now is simply layering these defenses on top of each other, which is often not that elegant a solution.
Well then, what should we demand from an improved method that addresses some of these drawbacks? Mainly, we want to address the setbacks of standard federated learning that limit its applicability, and we also want private learning that doesn't introduce any meaningful computational or accuracy tradeoffs for training the model. This leads me to present our recent work on Confidential and Private Collaborative learning, or CaPC, which satisfies some of these requirements. How does it satisfy them? We propose a hybrid configuration where devices can communicate with each other, so we have some decentralization and facilitation of heterogeneous model architectures, but in addition we have a central party that we call the content service provider, or CSP, which handles certain private computations. We also require fewer parties to see an increase in model accuracy, by allowing a type of query between the parties in a way that preserves privacy and allows the parties to learn from each other's separate models. Besides that,
we protect against attacks on the central party, or a curious central party, by only allowing data to pass through the CSP encrypted, so the content service provider can't read it in plaintext. Ultimately, to address multiple attack surfaces, we combine encryption and differential privacy, and with our cross-party query approach we introduce a collaborative learning paradigm that enables improvements in accuracy. So we'll quickly dive into the learning paradigm at work here. The CaPC framework is based largely on PATE, which is a pretty well-known differential privacy algorithm, with some significant adaptations. Our framework contains two training stages. Remember that we have the hybrid decentralized setup that also has a central party, the CSP. We assume that each party has access to at least a small set of labeled data and potentially a very large set of unlabeled data, and this is a very realistic assumption, since data itself is cheap, but labeled data is very expensive, yet necessary to train a supervised machine learning model. First, each party separately trains its model with its own local labeled data, and then in the second stage we have the collaborative part. In this stage, the parties can communicate with each other and use their unlabeled data to query all or a subset of the other parties in order to get a label for their unlabeled data points. They are essentially taking advantage of the responses of the parties in this collaboration to label their unlabeled data, which they can then use to continue training their local model. This exchange between parties is completed privately, with encryption and differential privacy, so that we can query other models without privacy concerns.
Here I have a short infographic that describes the second stage of CaPC. Let's imagine that our querying party on the right is a hospital, and they have trained a model to, for example, determine a patient's diagnosis based on the patient's data. Now a new patient arrives who needs to be diagnosed, and this patient is unlabeled from the ML model's perspective, so we can use CaPC to get a label for the patient and update the hospital's model. To do this, the querying hospital encrypts its unlabeled patient data and sends its queries to
the other parties in the setup, which in our case would probably be partner hospitals, which we have named the answering parties here. The answering parties then feed this encrypted data through their models via a method called private inference, and the predictions from their encrypted models are aggregated with differential privacy by the central party to generate noisy votes for the different diagnosis predictions. In this case we see that hernia has the highest vote count among the answering parties' predictions, so the CSP will return the hernia label to the querying party, and the querying party will use it to label their patient and continue training their model. In the figure, the green bar for hernia and the red bar for appendicitis are essentially the true votes from the answering parties, and the dotted line around the top is the noise that would be added to those votes. We can see that, hopefully, in cases where the majority of the answering parties agree with each other, we still get the correct diagnosis.
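As a minimal sketch of the noisy voting step in this infographic, in the spirit of PATE's noisy aggregation: each answering party casts one vote per query, noise is added to the per-class counts, and the top class is returned. The class names, noise scale, and votes below are made-up examples, not values from the talk.

```python
import numpy as np

# Sketch of PATE-style noisy vote aggregation for a single query.
# Each answering party votes for one class; the aggregator adds noise to the
# per-class counts and returns the argmax. Labels and noise scale are invented.

CLASSES = ["hernia", "appendicitis", "gallstones"]

def noisy_label(votes, classes=CLASSES, noise_scale=1.0, rng=None):
    rng = rng or np.random.default_rng()
    counts = np.array([votes.count(c) for c in classes], dtype=float)
    counts += rng.laplace(scale=noise_scale, size=len(classes))  # DP noise on the tally
    return classes[int(np.argmax(counts))]

# Votes from, say, seven answering hospitals for one encrypted patient query.
votes = ["hernia", "hernia", "hernia", "appendicitis", "hernia", "appendicitis", "hernia"]
print(noisy_label(votes, rng=np.random.default_rng(3)))  # usually "hernia"
```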
The noise doesn't affect the final result too much, but we still maintain privacy, and in some cases the result could change. Here's a more technical diagram outlining the steps of CaPC to achieve encryption and differential privacy. The content service provider here is the central party that adds noise to the encrypted votes of the answering parties. We can go over each part of the approach. First, any party can start the protocol as the querying party. In step 1a, the querying party sends its encrypted data to all the answering parties, and the answering parties use private inference to obtain the logits of the model, which are the unnormalized model predictions for each class, on the encrypted data. In our current implementation we use a combination of homomorphic encryption and multi-party computation to achieve private inference through a neural network, but in theory this framework should work with any private inference method. Then, in step 1b, to avoid differential privacy leaks, each answering party secret-shares its logits with the querying party by subtracting a random vector r from them before sending; this acts as a form of encryption that prevents the querying party from gaining plaintext access to the logits before the PATE component of the setup is done. In step 1c, the querying party and each answering party participate in a two-party computation protocol that computes the one-hot encoding of the logits, which is a binary vector with a single one at the predicted class index. The querying party and the answering party each get a share of the one-hot encoding, so that the sum of the two shares reveals the true encoding, but each of them holds only a share; this prevents either party from seeing the plaintext one-hot encoding of the logits. Then, in step 2, each answering party sends its share to the content service provider, who sums the shares and adds noise to achieve the differential privacy guarantees of PATE, and at the same time the querying party, which has a share from each of the answering parties, sums its shares. Finally, in step 3, the content service provider and the querying party participate in a secure two-party computation protocol to sum their two share sums, which removes the secret sharing to reveal the plaintext but noisy value, and the result is the final differentially private label.
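To illustrate steps 1c through 3 numerically, here is a toy sketch of additive secret sharing of one-hot votes with noise added by the central party. It deliberately ignores the homomorphic encryption and two-party computation machinery described in the talk and only shows how the shares recombine into the noisy vote histogram; all numbers are hypothetical.

```python
import numpy as np

# Toy arithmetic illustration of CaPC steps 1c-3 (secret-shared one-hot votes).
# Real CaPC does this under MPC and homomorphic encryption; here we only show
# how additive shares recombine into the noisy vote histogram. Numbers are made up.

rng = np.random.default_rng(7)
num_classes = 3

def share_one_hot(predicted_class):
    """Split a one-hot vote into two additive shares (querier share, CSP share)."""
    one_hot = np.zeros(num_classes)
    one_hot[predicted_class] = 1.0
    r = rng.normal(size=num_classes)          # random mask
    return one_hot - r, r                     # neither share alone reveals the vote

# Each answering party's predicted class for the query (e.g. 0 = "hernia").
predictions = [0, 0, 1, 0, 2, 0, 0]
querier_shares, csp_shares = zip(*(share_one_hot(p) for p in predictions))

# Step 2: the querier sums its shares; the CSP sums its shares and adds DP noise.
querier_sum = np.sum(querier_shares, axis=0)
csp_sum = np.sum(csp_shares, axis=0) + rng.laplace(scale=1.0, size=num_classes)

# Step 3: a secure two-party computation would combine the two sums; the
# plaintext result is the noisy histogram, and its argmax is the returned label.
noisy_histogram = querier_sum + csp_sum
print(noisy_histogram, "-> label:", int(np.argmax(noisy_histogram)))
```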
Well, based on this, what exactly does CaPC protect against? CaPC prevents data leakage through inference attacks via the CSP, which adds noise to the shares of the answering parties. It also prevents data leakage in the sense of plaintext leakage, since all computations are performed on encrypted data, and similarly, because all model computations are performed on encrypted data, the model weights and architectures of all answering parties are protected. Finally, since the central party only sees shares from the answering parties, we don't have any plaintext privacy leak to the central party either. Okay, as I said before,
one significant advantage of CaPC is the performance improvement seen by the querying parties, who label their unlabeled data through the private voting of the answering parties. We see this performance increase for both homogeneous and heterogeneous model architectures, and across datasets, and we also have the flexibility of having very few parties compared to traditional federated learning: our results are obtained in collaborative learning between fewer than 200 parties. Here we see the accuracy of the querying party's local models before participating in CaPC, that is, between the first and second phases of training, in blue, and after participating in CaPC, after the second phase of training, in yellow, shown for each class label in the evaluation datasets and for both homogeneous and heterogeneous model architectures across parties. Overall we have an increase of up to five percent in accuracy, but this increase in performance becomes about twice as high when we experiment on skewed data domains, which is much more realistic.
The skewed or class-imbalanced data setup arises when each party has data drawn from a different data distribution. For example, we could imagine in a hospital setting that geographic location, or urban versus rural, heavily influences which diagnoses are seen more often in certain hospitals. We ran experiments in a setting where the answering parties have variable or different data distributions than the querying parties, and we found that when we combine querying with active learning techniques, we see performance increases of up to 10 percent, which is a very significant boost. As demonstrated, the main advantages of CaPC are the flexible collaborative paradigm, which allows both decentralized and centralized learning as well as easy heterogeneous learning, and the fact that significantly fewer parties are required to see an improvement in accuracy over standard distributed learning. CaPC is also fully private, according to both cryptographic (confidential) and differentially private definitions of privacy, with very low dependence on the trustworthiness of the central party. And, as shown on the last slide, we have significant improvements in utility, especially for non-identically distributed data.
However, as always, there are currently areas for improvement. For the private inference part of our protocol, we are using out-of-the-box private inference methods, which introduce a fairly significant time overhead, and furthermore, we still require on the order of dozens of parties to improve the accuracy of the model. We want to be able to do better, so that, for example, a handful of parties, maybe fewer than 10, can choose to collaborate and still see an increase in performance from this collaboration. Continuing to build on these areas of improvement, what does the future look like in terms of securing distributed learning and making privacy-preserving distributed learning methods ready for deployment?
In terms of realistic deployment scenarios, we largely consider domains in which we require both strict security constraints and excellent utility, and, as we saw, CaPC works very well in non-identically distributed data scenarios. These are high-risk domains like healthcare or finance, but we would also like to consider collaboration between companies that typically focus on personalized ads or recommendation systems, and so on. To help more deployment scenarios for secure distributed learning, there are more domain-specific problems that we need to address. One of them is vertically partitioned data. Right now we are looking at horizontally partitioned data, and there has been some work done on vertical partitioning of data in the federated learning setting. Essentially, a horizontal partition means that each party has the full range of features for every data point it holds, but the data points across the parties are different.
In the vertical partitioning scenario, many of the parties could have the same data points, so they have data on the same individuals, but the features are split unequally between the parties. You can think, for example, of a hospital setting where a patient has gone to several different hospitals and each of those hospitals holds different features about that patient; that would be the vertically partitioned setup (see the sketch below). We also want to address different target distributions: here we are talking about institutions that could have different classes, or different types of targets that they are interested in, in terms of labels, and we want to be able to at least support that in some way, by supporting some kind of union or intersection of labels between the parties, so that we can still take advantage of the collaboration. I also want to focus on more domain-specific attacks, and this can be a variety of things, but basically I'm looking at, for example in the financial sector, which very specific attacks we might want to protect against. One example relates to the current methods in CaPC and to other threats that we are concerned about: we have the central party, and as I said before, we protect against a curious central party, which could look at the data but couldn't learn anything about it. But we trust that the central party is honest in the sense that it will properly add the shares given to it and also properly add the noise, and that doesn't necessarily protect against a kind of delegated-computation corruption. If the central party is dishonest, it might just return some random noise instead of the shared sum with the differential privacy noise, which would lead to the querying party getting a completely false label. This is something I'd really like to address; there's no method in the literature that I know of that addresses corrupted computation, data leakage, and model leakage in federated learning all at once.
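To make the horizontal versus vertical partitioning distinction above concrete (the sketch referenced earlier), here is a small hypothetical example with a toy patient table; the column names, values, and party names are invented for illustration.

```python
import pandas as pd

# Toy illustration of horizontally vs vertically partitioned data.
# Column names and values are entirely made up.

patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [34, 58, 47, 71],
    "blood_pressure": [120, 140, 130, 150],
    "diagnosis": ["hernia", "appendicitis", "hernia", "gallstones"],
})

# Horizontal partition: every party has all features, but different patients.
hospital_a = patients.iloc[:2]          # patients 1-2, all columns
hospital_b = patients.iloc[2:]          # patients 3-4, all columns

# Vertical partition: parties share the same patients, but hold different features.
clinic_vitals = patients[["patient_id", "age", "blood_pressure"]]
clinic_diagnoses = patients[["patient_id", "diagnosis"]]

print(hospital_a, hospital_b, clinic_vitals, clinic_diagnoses, sep="\n\n")
```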
And just to point out, these attack surfaces are very different, but they might all need to be addressed to fully secure AI in federated learning scenarios, and hopefully we could find something better than simply layering these defenses on top of each other, since that would likely introduce a lot of computational overhead. Basically, we've probably already seen a lot of these kinds of delegated-computation corruption attacks today, but we don't really know how to detect them or how to verify them, at least at the moment, because federated learning is mainly deployed in the scenario where we have, potentially, millions of devices.
A corrupt party would have to control many of those devices to have a significant effect on, for example, the model, so there's no urgent need with the FL models currently deployed, but I think in the future, for FL to work, we really need to look at how to verify these computations. Some other things we might want to see are improvements to the computational time incurred by private inference methods. As I said before, in CaPC
we use homomorphic encryption combined with multi-party computation for private inference, but this introduces a really significant overhead, and the already-implemented tooling we use is not very robust at the moment. So potentially extending these private inference methods so that they are deployable would be another thing we'd like to see. Okay, I think I'm done, but thank you very much for listening, and I'm happy to start the discussion and take questions. Awesome stuff, thank you so much, this was really great; I'm really happy to see so much progress there. I'm going to jump right into the questions. I think Amy has a question, so maybe I'll let you start, and then maybe I'll ask some more general questions at the end, but this was really wonderful, thank you. Thank you so much, Natalie. Yeah, in terms of platforms: I posed the question there, which is probably more general, but in terms of platform implementation, how do you see platforms being created where there are continuous users in industry or in different organizations, for example in medical research, who want to work on something like drug discovery and use each other's models? There will be concerns, of course, about derivative work, and users may have data for discovery that doesn't benefit the original data owners, which we'll leave aside because it's a different conversation. But how easy do you think it could be to implement such models where collaboration is very fluid, where parties can come and go? And that relates to permanence, or the long-term need for access to the data, in your experience.
So yes, I think the collaboration is quite flexible in CaPC. All the answering parties need at the beginning is simply to have trained a model on their data, and the rest of the setup, the encryption and differential privacy, could hopefully, in theory, be done without much overhead, so if you wanted to join and then leave, that wouldn't be too difficult. The main thing that it could potentially affect is the differential privacy accounting: how many parties are involved in the collaboration matters for how much noise you would end up adding. In flexible environments it would still be fine if someone dropped out, but suppose we first had tons of parties involved; like I said, we would need on the order of dozens of parties for CaPC to see an improvement. If the number of parties then drops below 10, so that in a given query there are, say, six answering parties, it's still fine, CaPC will still work, it's just that you might see only a minor improvement. But it should still work, where anyone can join at any time as long as they have some model trained with their own data; they can respond to the querying party, or they can choose not to respond. So I hope I answered the question, I'm not sure. To some extent, yes; my thinking is about real collaboration at scale, basically platforms we build with CaPC, so there is seamless accounting of usage and cross-training. Yes, I think so, as long as, like you, I assume that the model can be easily loaded into such a platform. I'm not that familiar with the currently implemented software, but as long as your model can be loaded into the software, then if you receive a query, you can decide whether or not you want your model to run on that data and then respond in a differentially private and encrypted way. But like I said, privacy accounting would be the main thing: the differential privacy accounting depends on how many parties are responding, so I think that would be the main bottleneck, in that the central party would always need to know how many parties are participating in order to appropriately adjust the noise it adds to the shares. Does that make sense? Yes, thank you, that is helpful. Yes, okay, Micah, and then we have Morgan, great.
You mentioned several times that for the corrupted computation problem, if you have, say, a million devices connected, you have to control a large portion of them. It seems to me that generating legitimate data is very expensive, so it seems that a single party with a reasonable amount of processing power could easily overwhelm a network with corrupted data, because it's very cheap to generate. That could be true; I guess it depends on the configuration. If each step requires all devices to respond, then a corrupted party can only respond once at each step, and they only respond with, say, a gradient update that's then averaged across millions of devices. So let's say they generate some random noise that goes back to the central party, and the central party averages that random noise with millions of other legitimate gradient updates; in that scenario it seems like it might not be a big deal. But like you said, maybe if the SGD iterations are set up so that each device responds whenever it can, then what would happen is pretty much what you described: a device could generate, say, millions of data points that are completely fake and send them back to the central party, which keeps updating the model based on that. In that scenario I could see it happening. I think one thing is that, many times, to do fairly significant damage to a model in both scenarios, you actually have to adversarially generate data, which is potentially just as expensive as generating real data.
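A tiny synthetic sketch of the dilution argument just made: a single corrupted (random-noise) update, averaged together with many legitimate updates, barely shifts the aggregate. All numbers here are illustrative.

```python
import numpy as np

# Illustration of the point above: one corrupted update is diluted when
# averaged with many legitimate ones. Purely synthetic numbers.

rng = np.random.default_rng(0)
true_grad = np.ones(10)

honest = [true_grad + 0.01 * rng.normal(size=10) for _ in range(10_000)]
corrupted = 50.0 * rng.normal(size=10)   # one adversarial, random update

avg_honest = np.mean(honest, axis=0)
avg_with_attack = np.mean(honest + [corrupted], axis=0)
print("shift caused by one corrupted update:",
      np.linalg.norm(avg_with_attack - avg_honest))  # tiny compared to ||true_grad||
```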
I hope that answers the question. I could see that, in the scenario where each device just responds whenever it can, if a device could respond millions of times instead of just once, with tons of corrupted data points, then what you're describing could happen. Yeah, gotcha. So the idea is that you would want some kind of Sybil resistance to ensure that each device is somehow connected to a unique human being and that one person can't just pretend to be 10 million devices. Yes, I guess the verified computation setup is essentially that you have to provide some kind of proof, if you're familiar with things like proof of work, some kind of proof that you actually computed a gradient, so you can't just generate it; you fed data through a model and then you computed a gradient. And like I said at some point in the talk, the verified computation field is very nascent, it's not very developed yet, so there's been very, very limited work in this area on how to verify that someone has done gradient descent, for example. In fact, Nicolas's lab has a couple of recent papers called proof-of-learning that prove you computed a set of iterations of gradient steps, that you actually did the computation. So there's been a bit of work on it, but like I said, it's a little limited right now. I think so. Does that answer the question? Thank you very much. Next up is Morgan, and Morgan is a Foresight partner this year, so I'm very happy to have you here.
I met Morgan at a DC event a while ago, and yeah, I had heard a lot about your work before, and when I was in the Bay Area, so I'm really happy to have you join us with your focus on cybersecurity and AI policy, and maybe also dig deeper into crypto-related work. Thanks so much for joining. Thanks, Allison. Yeah, thanks, Natalie, for this talk, this is wonderful. I'm really curious, because you talked a little bit about the privacy budget aspect, and a big challenge in DP system design is often getting the accounting exactly right: designing elegant and appropriate noise addition that accounts for how much information leaks about users, and so on. This has been an area where there have been public failures, like the Australian cases and things like that. So I'm a little curious, because I was looking at your diagram and I was wondering if you could explain a little bit what the possible failure modes could be if you don't do privacy accounting correctly; for example, where in the process would it fail? And then, say, if you were an external auditor coming in and trying to understand how much privacy was being provided, where could you investigate this process and check?
So I think I'll answer your second question first, because I think it's a little bit easier. I think the central party is probably the main place to investigate, what the central party is doing, and if you look at what the central party is doing, you won't see any of the data in plaintext, so it still won't reveal any information, but you will see how much noise was added, which should basically give you an idea of what the real privacy guarantee is, and you should also be able to work out how many parties were involved; the calculation of how much noise to add is related to how many parties there are. There are a couple of other components that relate to this. One of them is, potentially, how many queries a single party can make, which might not be known correctly when accounting at the beginning, so once you hit your limit, so to speak, you won't necessarily be able to make many more queries until other parties have trained their models further, or we would potentially need to add more noise. The other component is what kind of differential privacy guarantee you want to provide. I don't know how familiar people are with the language of differential privacy, but there is the epsilon-delta formulation of differential privacy, and, as you said, there have been failures in differential privacy. I think the main thing is that both epsilon and delta, which parameterize differential privacy, are not necessarily intuitive, so in practice people sometimes use epsilon values of, say, eight or ten, while most theorists would say that epsilon is not useful if you are above one, and it is very difficult to achieve that kind of strong privacy because the noise you add is inversely related to epsilon: as epsilon gets smaller and smaller, the noise you add grows, and you potentially end up with something useless. For, let's say, an auditor looking at whether privacy accounting is done well in the context of CaPC, they would need to know the number of parties and then how much noise is being added by the central party, the content service provider, and hopefully that wouldn't reveal any information to the auditor about the parties' data, because all they need to know is the noise component. I think what you would also need is a lot of noise samples over time: the noise is drawn from, for example, a Gaussian or Laplace distribution, so you would need many noise samples to know whether or not that noise is actually generated from the proper distribution, and then you could do something like the Kolmogorov-Smirnov test to check whether or not those generated noise values are actually coming from the correct distribution. So I think that answers your second question, but in the process I forgot what your first question was, I'm so sorry.
You actually answered both. Oh, okay. Yeah, so I'm thinking a little bit about Mulligan's article, "Expose Your Epsilons," and some of the problems there have been with really being able to evaluate, when a company or entity says they're providing privacy, what their epsilon is in terms of the noise that's added, and whether they are really providing it. It sounds like, inside this model, it would provide that information about itself, and the epsilon and delta would provide enough information? Yes, but I think it would have the same drawbacks as differential privacy in general, where you know what noise is being added but you don't necessarily know more than that; you might know theoretically what the underlying distribution should be. Let's say you know the epsilon, you know the delta, you know how many queries all the parties have made, you know how many parties are collaborating at each step; then you would be able to calculate the distribution that the central party should be drawing from. But as I said, it would take a lot of points showing what the noise was to determine whether or not the central party was actually drawing from that distribution. And I guess, in general, those are the kinds of problems with DP: even if you know the noise, there are reasonable tests, like I said, such as the Kolmogorov-Smirnov test in statistics, so that if you have a bunch of points you can determine whether or not they've been drawn from some distribution, but such tests are only probably correct, so there's still a chance that they're wrong, and that probability of being wrong decreases as you have more and more points to measure. If you only have a few, then it's pretty unlikely that you'll be able to say anything meaningful about whether that noise was drawn from the correct distribution.
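Here is a hedged sketch of the auditing idea described above: collect many of the noise values released over time and run a Kolmogorov-Smirnov test against the distribution the central party claims to draw from. The claimed sigma, the sample sizes, and the "cheating" scenario are all hypothetical.

```python
import numpy as np
from scipy import stats

# Sketch of the auditing idea above: check whether observed noise samples are
# consistent with the Gaussian the central party claims to draw from.
# sigma_claimed and the sample sizes are hypothetical.

rng = np.random.default_rng(42)
sigma_claimed = 2.0

honest_noise = rng.normal(scale=sigma_claimed, size=5000)            # drawn correctly
cheating_noise = rng.normal(scale=0.5 * sigma_claimed, size=5000)    # too little noise

for name, samples in [("honest", honest_noise), ("cheating", cheating_noise)]:
    # Kolmogorov-Smirnov test against N(0, sigma_claimed^2).
    stat, p_value = stats.kstest(samples, "norm", args=(0, sigma_claimed))
    print(f"{name}: KS statistic={stat:.3f}, p-value={p_value:.3g}")

# With many samples the test flags the under-noised case (tiny p-value); with
# only a few samples, as noted in the talk, it is hard to say anything meaningful.
```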
I hope I answered your question and didn't miss anything. Yes, thanks, I appreciate it. Yeah, well, then I'm going to jump in now with a very similar, I suppose more general, question. I know people don't like to speculate, but if you could hold a carrot in front of people's faces and say, hey, this is where we could get to in about five more years or so, what kinds of potential applications could be useful here? I mean, there are potentially so many. How do you, in particular,
see the field making progress, especially in light of recent AI progress? Yeah, personally, a lot of my research has been focused on healthcare and AI applications for healthcare, so I think I'm most motivated by that application scenario, and I think there's a lot of potential for collaboration between parties there. Essentially, there have been a couple of recent papers showing that standard methods of differential privacy in the healthcare context, which is really one of the main motivators for differential privacy, actually work pretty poorly in most healthcare scenarios. One idea that some people have, and there's some debate about it, is that if you have a lot more data, hopefully that would solve the problem to some extent, and potentially hospital collaboration could be one way to get there. There are a lot of implementation issues, like aggregating electronic medical records from different hospitals, which is very difficult, but I think that's the main scenario I'm hoping for, especially as someone doing research in machine learning for healthcare, and also because of other things, like I was saying. Right now,
federated learning works very well with things like millions of parties, but consider, say, six parties or so, which is pretty realistic: let's say a town or a city wants all of its hospitals to collaborate; it's not necessarily that many, maybe six or so, but they all have a lot of data, and we would like them to get some benefit in the end from collaborating. Those are the scenarios I hope to see improve, and not even just in the federated setting: I hope we can make differential privacy work well in the healthcare context, because that was one of the motivating applications for differential privacy. So yeah.
I think Cody would agree; we also have the biotech and longevity group, and it's a big problem there too, so any solution would be very welcome in that space. And lastly, we previously had a talk on use cases from OpenMined, also focused on privacy-preserving machine learning for healthcare; maybe I'll leave a link to their work here in the chat. Okay, on to the next question. Hello, thank you very much, Natalie, for the talk; I'll be very fast, I know we're running short on time.
On the verifiable computation part of machine learning: the work you cited was just one example, so I was wondering whether there's a big difference between verifiable computation from the point of view of learning and verifiable computation from the point of view of, you have a model, and you have to prove that a computation was done correctly. Thank you very much. Yes, that's a good point. I have to say this is not necessarily my area of expertise, verified computation. I know there's a long history of verified computation in cryptography in general, and I think those methods are pretty robust.
I think one of the things that could potentially be difficult in applying verified computation, and I don't necessarily have a lot of ideas on how you would do it, is this: in the scenario where I give data to someone and say, can you perform a gradient update on this data, that's maybe easier to verify. But in the context of saying, I know you have some data and I am asking you to perform a gradient update on that data, even if you perform the gradient update correctly, let's say you just randomly generate some data to feed the model; it is difficult for me to check whether or not the data you put into the model was actually real data. I think that is maybe slightly outside of the verified computation setup, but it's still a problem with these kinds of corrupted computations.
I can still, let's say, perform the computation correctly, but if the data is just random noise that I made up, then it still may not be that useful. So I think the setting where you give data to someone and they compute something and give it back to you seems simpler, in terms of the jump from traditional verified computation to verified computation in learning, whereas the jump to "well, the data could be anything" is hard to make, and I'm not sure what the solution would be right now. We're at time and I don't want to run over, but I don't want to let you go before a final question: if people are very enthusiastic about your work now, what could they do to help your work in particular, or the work of your institution, very concretely, or very focused on you individually? That's a very good question. I guess, for the field in general, I think differential privacy would definitely benefit from a lot more minds; it feels very nascent even though it's actually been around for almost 10 years, which is kind of crazy. So I think putting more into that, and especially putting more into the privacy-fairness tradeoffs, is a big deal right now.
I think a lot of people are working on privacy-utility tradeoffs, but privacy-fairness matters too: most settings that require high privacy also require some bias mitigation, so I think that's what I would say. And yeah, also thank you so much for hosting me, and if anyone wants to reach out, I have a website, and we can also put my contact information in the chat if that would be helpful. Oh, thank you. This was really cool. I also received many private messages saying it was amazing. I'm going to watch it again later, so thank you so much for coming, and I hope it's not the last time we have you. I'll be in touch about the video, and it will be on YouTube soon. Thank you so much for joining, everyone; have a wonderful rest of your day. Yeah, great, thank you very much, bye.
