> If you were a Mercor contractor and you believe your voice may already be in circulation, ORAVYS will analyze the first three suspect samples free of charge.
Awesome, if you're a victim of an AI company having your voice, you can help yourself by sending another AI company your voice!
> Audio is never used to train commercial models without explicit consent
I'm sure Mercor has explicit consent as well, legal teams are reasonably good at legally covering their asses with license terms.
Reminds me of my experience when trying to remove my Airbnb account, they require my ID card scans of both sides. I said fuck it and never touch this company again
Per the WSJ article last week, I suspect Mercor's playing in a grey area of contracts. It wasn't just voice.[0]
A lot of people were basically wiretapping themselves AND their businesses!
While a lot of Mercor "contractors" claim Mercor over-reached with data gathering via Insightful, it's kind of smart because people are too afraid to complain too much knowing they'll not only lose their primary job, but also open themselves up to uncapped liability for willful misconduct.
This would eliminate the credit report, monitoring and fixing industry, which would be a good thing.
Court records are public in the US. If creditors want to know if you’ve been in financial trouble, they should check for bankruptcies and lawsuits, not the extrajudicial version of those that the credit reporting companies run based on hearsay.
I don't know if it's the reason you imply. In the 70s, there were big debates in Germany about privacy and data storage. They spoke of one's data shadow (Datenschatten). I suspect this word comes from that tradition. The reason the word exists would then be the reflection (Verwaltigung) on WW2.
There's also the other implication that the (East) Germans were Soviet just 35 years ago.
But yes. We Americans know Germans more for their silly big words. But statements like that can be misinterpreted as the German perspective of themselves doesn't quite match the American stereotypes.
If only our past 20 year old self data could be so ephemeral…
Who doesn’t want that old post going extinct forever when they were shit faced outside of a bar in Nashville but now they are in their mid-life and are “respectable” members of society.
I miss the pre-LLM days when you could make a decent argument that having any unnecessary data was just a liability. Now all anybody thinks is “more data for the AI!”
To some degree IMO big data is still a mindset when it might take a day to process your data in a normal SQL query. Some tech doesn't scale to the data size for all use cases, and you need different solutions.
We have thumb drives that can store petabytes of data?
Or did you mean the "big data" crowd which thought 500GB was noteworthy? I don't think anyone took those serious, neither in 2010s nor now. That was always "small" data
Most companies using term "big data" had datasets in TB region. One company I had a gig at had full Hadoop cluster setup and their whole dataset was 40GB. Their marketing had all the big data adjacent keywords over the brochures for clients.
Hell you mean a decade ago? I still see businesses running losses left right and center saying that they're gonna monetize user data, any day now.
Related "monetizing user data" seems to just mean ads. Ads on everything, forever, until the userbase gets fed up and moves to a new service that definitely won't do that, and the cycle repeats about every 3 years.
10+ years ago companies were hoovering up data for ML - trying to find correlations in high-dimensionality data. Mostly the results were garbage but occasionally you hit on a real, unexpected phenomenon.
Nowadays you just throw all the data into a black box and believe whatever it says blindly.
I see this whenever an LLM’s impact is assessed. We know. The issue is scale and the ability for smaller and smaller groups (down to individuals) to execute at scale.
Fake news always existed. Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.
That’s not the issue I’m hitting here primarily but yes.
My concern is that I can open up chatGPT and even with a free, “anonymous” account run an assembly line generating tens of thousands of words a day to pump to Twitter that are good enough to prop up multiple fake accounts and cause mayhem.
Now make it thousands of people like me doing it. Now add funding and political orgs. Add company leadership that turns a blind eye so long as it drives engagement. This scale and pipeline wasn’t possible 5 years ago, even if we clearly see the throughline.
I’m not even getting into fake images either. That used to require some know how. There are basically no hurdles and even if most people learn it’s fake, millions likely won’t. If you’re a little lucky, less scrupulous “news” outlets will amplify it for you as well for free.
Yes. This is pretty well established. Neural networks in general are considerably less sample-efficient than traditional ML methods. The reason they became so successful is that they scale better as you increase training data and model size. But only with modern compute power they became useful outside of academic toy model applications.
> Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.
I have the faintest possible hope that such things are going to be the death knell of social media. Yeah a lot of credulous idiots are happily giving AI thirst traps their money for stroking their confirmation bias, but that's just who's left at this point. It feels like every social media app I use is gradually bleeding users who aren't hopelessly addicted to the dopamine treadmill, because what's left is just plain unappealing to them, which selects for the people who are most vulnerable to AI shit, which is far from ideal, but also means those platforms are comprised ever more of that vulnerable population and nobody else. And the problem with all these businesses going through that is without a diverse, growing audience, you just become InfoWars, slinging the same slop to the same people every day, and every ounce of said slop is great for what's left of your audience, but absolute garbage for getting anyone new in it. And it just goes on that way until you sputter out and die (or harass the wrong group of parents I guess).
I wish all social media sites a very haha die in a fire.
Mate you're on a social media site right now that often has AI-generated content displayed at the top of whats "trending". Sure the general user-base does a better job here flagging that sort of stuff, as AI seems to be a shared interest in much of the community, but it still sneaks it's way by
Data can never be stolen, because it is not a physical thing. Data can be copied, and it can be erased - sometimes both happens at the same time. Data can be lost, that is when its last existing copy was erased.
Author here. Wrote this after watching Lapsus$ post the Mercor archive on their leak site earlier this month. The thing that struck me is the combination: voice samples paired with ID document scans. Most breaches leak one or the other. This one ships a deepfake-ready kit. Tried to keep the writeup practical: what an attacker can actually do with this combo (banking voiceprint bypass, Arup-style video calls, insurance fraud), and a 5-step checklist for the contractors who were in the dump.
Happy to discuss the forensic detection side. AudioSeal
watermarks, AASIST anti-spoofing, and how the detection landscape changes
once voice biometrics start leaking at scale.
Interesting - thanks for the rabbit hole today. ;)
Mercer hasn't released many public statements over the incident. Social media posts aren't necessarily public; but I did find this breach notification sample filed with CA - https://oag.ca.gov/ecrime/databreach/reports/sb24-621099 . I guess we'll see if our legislators finally take data privacy seriously.
So, they should all just rotate their voices ... right?
I jest but the majority of the "normal" people I know are happy to hand over biometrics because _it's easier_. We need to start branding biometrics as "forever passwords" or something to help people understand just what they're handing over when they validate access to their checking account or enter Disney World or whatever else.
Functionally, biometrics are closer to a username than a password.
Fingerprints, DNA, iris scans, gait patterns, etc. are all something you can't change (much like a permanent account ID) and are constantly being presented to the world (much like an email address). In addition under US law, police can compel presentation of fingerprints, but passwords are protected under the 5th amendment.
the "it's easier" people operate on a fundamentally different way than you or I. they thrive in the world of plausible deniability and social trust. They almost dont care what happens to them as long as it isnt their fault. And they do not consider putting themselves at risk to be the same as being at fault
in a certain light, it's kind of admirable. they live like the world is the way it should be
One of the problems is that "forever passwords" is a term used positively when I worked in banking, as it was a password that the customer could not forget and would not need support using.
So I could easily see a lot of people viewing this as a positive.
That's a really good point. It lays bare some of my biases when it comes to thinking about and communicating with "normal people" about this sort of thing.
Man that’s pretty shitty that Mercor tricked 40k contractors, and then did a poor job of securing their data. There should be stronger consequences for stuff like this.
What happens now is that a lot of clueless CTO that didn't know about this company now know it's name. So the outcome of this mess is probably more business for Mercor
I mean, just look at what happened to Crowdstrike....
The biometric pairing is what makes this particularly bad. A leaked password is recoverable. A leaked voiceprint combined with ID scans is permanent, you can not rotate your voice.
The deeper problem is that most of these companies collected this data because they could, not because they needed it for the core service. 'Datensparsamkeit' is the right frame: the voice samples were a liability sitting on a server waiting for exactly this.
I'm pretty sure Google and Apple already have some decent examples of a LOT of people's voices in concert with other data collation. Google Voice IIRC was bought for audio sampling voicemail in the first place. Not sure if Apple has done similar, but would be more surprised if they didn't... Let alone the voice search options for both.
I'm curious: if i create an online sample from my voice, might this make it a lot harder for an AI model to identify me if every trainingdata contains my particular voice sample?
I wonder how many of the current text-to-speech ML models have large parts of leaked or "stolen" data in their training data? Almost none of the TTS releases seem to talk about exactly where they get their training data from, for some reason. I also wonder if we'll see an explosion in SOTA TTS in ~6 months from now.
Not really, Mozilla Common Voice (the ImageNet of speech) is larger than this. Their English database has 3814 hours, 1.6 million sentences, from 100k speakers.
>Set up a verbal codeword with family and finance contacts. Pick a phrase that has never been spoken on a recording and never typed in chat. Brief the people who handle money on your behalf. If a call ever asks for a transfer, the codeword is mandatory.
good luck with this. most finance people deal with hundreds to thousands of clients. they obviously cant remember everyones code word. commonly used finance systems arent setup to securely store these codewords. they dont have processes or policies in place to implement or adhere to any sort of codeword verification.
>Rotate where voiceprints are still in use. [...] Do that now, ideally from a new recording in a different acoustic environment than the leaked sample.
would this even have an effect? i have never heard of "rotating" a voice print. isnt the whole point of a voice print that you cant really change it? if simply switching your environment completely changes your voice print, that would make voice prints utterly useless to begin with.
Someone who has hundreds or thousands of clients presumably couldn't remember every client's voice either, so no meaningful security is lost. They are approximately as secure or insecure as before
>presumably couldn't remember every client's voice either, so no meaningful security is lost
there are automated systems for this already. my bank, isp, etc. use them when you call in to skip the traditional verification steps. this fact is also highlighted in the article.
the problem is that there isnt typically a system in place for setting up or validating code words, so the advice given is not practical to implement.
With most US banks, you can ask them to put in a note on your account file for a code word, it will show up anytime the account file is pulled up. Now, whether or not a customer service agent will know to do so is another question. Maybe as attack vectors like this are utilized more often it will become part of their SOP. Or just stop using voice verification. In my experience, even if you pass voice verification, it only grants you access to the account and check balance and txs but still requires information like PIN or a code sent in the app or phone number. There are attack vectors for these as well but not guaranteed.
The other use cases (like calling payroll, etc) likely don’t have the same protections and probably would be more effective.
Yeah seems like nonsense advise. Have a code word that was never recorded? I don’t see how that would tote y anything. Like the point of these systems is they can say stuff you never said convincingly
The idea is that the attacker doesn't know the codeword. If the attacker finds out about the codeword then the attacker could indeed fake it. Hence why you shouldn't say/write it in recordings or chat messages.
Isn’t this going to immediately become daily news?
Half the time I call a company they say “we are recording your voice for security / authentication purposes”.
The companies that do that have all the information on me that they require for me to set up an account, so their data breaches will be just like this one, but 1000x larger.
Can we just fast forward through the part where this works for ID theft, past the firefox age verification plugin that uses these datasets, and even through the part where people in the plugin dataset are digital outcasts (this voice has been used too many times. Want to try another?)
At the end of this dark predictable tunnel, maybe there will be a ban on biometrics for important stuff, a repeal of the age verification laws, and actual privacy legislation with teeth.
You could have seen this coming a mile away. So far I have gotten away with never uploading my ID and/or interacting with one of those companies (though one idiot working for some VC thought it was ok to sign a document on my behalf by uploading my signature!!, never mind a bit of fraud) but it is getting harder and harder. Banks and in some cases even governments forcing you to send data to these operators is a very bad idea. But hey, who ever got hurt by some security theater?
I've had to open a bank account for a company here a few years ago and that was right on the bubble of this happening and they still had an option to come by in person with the proper documentation, which I did, now it is all outsourced.
These companies are the fattest targets and they're run by incompetents. You should assume that anything you give them will eventually be part of some hack.
Tell us more about that fraud story! Was the person your attorney or accountant? Or just some "smart" person who decided to wisely save time by doing fraud?
It was a fund administrator. I still find it unbelievable that they would so casually do this. And yes, they thought they were very smart... and helpful too...
Because historically that's how it worked, but officials just looked at the document and verified that it was the real thing. Then photocopiers came along and it became normalized to take copies of the documents. Then digital copies happened and that changed things completely when coupled with networking technology. What the officials in charge don't seem to understand is that by making digital copies in networked environments the IDs themselves lost their value completely, after all if the digital copy serves any purpose at all as a stand-in for the original then they have become that original.
I've been doing similar things on a different platform because as a uni student the pay is kinda nice, but I limit myself to task without voice/video and just input from mouse/keyboard to do reinforcement learning/data tagging. No way I'm trusting these companies or the companies they contract the work with
This kind of event is the best argument against needless data hoarding. But it would help if the law better provided for some kind of consequences for negligence.
Fidelity seemed to sign you up for this when you called them on the phone almost automatically. Ridiculous since it was defeated easily in a hacker movie from the 1990s using a tape recorder.
Mercor is the most scummy company out there, run by a bunch of sleazeball 20 somethings who are getting a lot of press as the youngest billionaires in the making.
they literally handed over their voice, their face, and their government id to train ai models for peanuts - and now lapsus is sitting on 4tb of 'you' that you can never change like a password
Awesome, if you're a victim of an AI company having your voice, you can help yourself by sending another AI company your voice!
> Audio is never used to train commercial models without explicit consent
I'm sure Mercor has explicit consent as well, legal teams are reasonably good at legally covering their asses with license terms.
A lot of people were basically wiretapping themselves AND their businesses!
While a lot of Mercor "contractors" claim Mercor over-reached with data gathering via Insightful, it's kind of smart because people are too afraid to complain too much knowing they'll not only lose their primary job, but also open themselves up to uncapped liability for willful misconduct.
[0] https://www.wsj.com/tech/ai/mercor-ai-startup-personal-data-...
Selling the solution to the problem you caused ought to be illegal.
Court records are public in the US. If creditors want to know if you’ve been in financial trouble, they should check for bankruptcies and lawsuits, not the extrajudicial version of those that the credit reporting companies run based on hearsay.
Germans (because of course) have a word for this: "Datensparsamkeit". Being frugal with your data.
I don't know if it's the reason you imply. In the 70s, there were big debates in Germany about privacy and data storage. They spoke of one's data shadow (Datenschatten). I suspect this word comes from that tradition. The reason the word exists would then be the reflection (Verwaltigung) on WW2.
But yes. We Americans know Germans more for their silly big words. But statements like that can be misinterpreted as the German perspective of themselves doesn't quite match the American stereotypes.
In the US of course the government buys this sort of information legally from corporations.
There is also the rather famous example of how earlier census data was used in the 40’s.
Once the government has your data, they have it. The next generation of representatives may not follow all the same rules and norms
Who doesn’t want that old post going extinct forever when they were shit faced outside of a bar in Nashville but now they are in their mid-life and are “respectable” members of society.
Or did you mean the "big data" crowd which thought 500GB was noteworthy? I don't think anyone took those serious, neither in 2010s nor now. That was always "small" data
We do?
https://www.tomshardware.com/pc-components/ssds/kioxia-unvei...
Or 250 of these ~$400 4tb flash drives and an insane number of dongles to connect them all:
https://www.slashgear.com/1847725/largest-usb-thumb-drive-hi...
Related "monetizing user data" seems to just mean ads. Ads on everything, forever, until the userbase gets fed up and moves to a new service that definitely won't do that, and the cycle repeats about every 3 years.
Nowadays you just throw all the data into a black box and believe whatever it says blindly.
I see this whenever an LLM’s impact is assessed. We know. The issue is scale and the ability for smaller and smaller groups (down to individuals) to execute at scale.
Fake news always existed. Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.
My concern is that I can open up chatGPT and even with a free, “anonymous” account run an assembly line generating tens of thousands of words a day to pump to Twitter that are good enough to prop up multiple fake accounts and cause mayhem.
Now make it thousands of people like me doing it. Now add funding and political orgs. Add company leadership that turns a blind eye so long as it drives engagement. This scale and pipeline wasn’t possible 5 years ago, even if we clearly see the throughline.
I’m not even getting into fake images either. That used to require some know how. There are basically no hurdles and even if most people learn it’s fake, millions likely won’t. If you’re a little lucky, less scrupulous “news” outlets will amplify it for you as well for free.
I have the faintest possible hope that such things are going to be the death knell of social media. Yeah a lot of credulous idiots are happily giving AI thirst traps their money for stroking their confirmation bias, but that's just who's left at this point. It feels like every social media app I use is gradually bleeding users who aren't hopelessly addicted to the dopamine treadmill, because what's left is just plain unappealing to them, which selects for the people who are most vulnerable to AI shit, which is far from ideal, but also means those platforms are comprised ever more of that vulnerable population and nobody else. And the problem with all these businesses going through that is without a diverse, growing audience, you just become InfoWars, slinging the same slop to the same people every day, and every ounce of said slop is great for what's left of your audience, but absolute garbage for getting anyone new in it. And it just goes on that way until you sputter out and die (or harass the wrong group of parents I guess).
I wish all social media sites a very haha die in a fire.
Except no company is learning this lesson.
The enterprise threat model includes "our own users", and the modus operandi is to maintain as much information on that threat as possible.
[0] https://www.opensourceshakespeare.org/views/plays/play_view....
[1] https://www.etymonline.com/word/data
Mercer hasn't released many public statements over the incident. Social media posts aren't necessarily public; but I did find this breach notification sample filed with CA - https://oag.ca.gov/ecrime/databreach/reports/sb24-621099 . I guess we'll see if our legislators finally take data privacy seriously.
Mercor has definitely released statements with boilerplate "investigations are underway."
I jest but the majority of the "normal" people I know are happy to hand over biometrics because _it's easier_. We need to start branding biometrics as "forever passwords" or something to help people understand just what they're handing over when they validate access to their checking account or enter Disney World or whatever else.
Fingerprints, DNA, iris scans, gait patterns, etc. are all something you can't change (much like a permanent account ID) and are constantly being presented to the world (much like an email address). In addition under US law, police can compel presentation of fingerprints, but passwords are protected under the 5th amendment.
in a certain light, it's kind of admirable. they live like the world is the way it should be
So I could easily see a lot of people viewing this as a positive.
I mean, just look at what happened to Crowdstrike....
The deeper problem is that most of these companies collected this data because they could, not because they needed it for the core service. 'Datensparsamkeit' is the right frame: the voice samples were a liability sitting on a server waiting for exactly this.
Even have a nice UI on top.
https://voicebox.sh/
https://commonvoice.mozilla.org/en/languages
good luck with this. most finance people deal with hundreds to thousands of clients. they obviously cant remember everyones code word. commonly used finance systems arent setup to securely store these codewords. they dont have processes or policies in place to implement or adhere to any sort of codeword verification.
>Rotate where voiceprints are still in use. [...] Do that now, ideally from a new recording in a different acoustic environment than the leaked sample.
would this even have an effect? i have never heard of "rotating" a voice print. isnt the whole point of a voice print that you cant really change it? if simply switching your environment completely changes your voice print, that would make voice prints utterly useless to begin with.
there are automated systems for this already. my bank, isp, etc. use them when you call in to skip the traditional verification steps. this fact is also highlighted in the article.
the problem is that there isnt typically a system in place for setting up or validating code words, so the advice given is not practical to implement.
The other use cases (like calling payroll, etc) likely don’t have the same protections and probably would be more effective.
Half the time I call a company they say “we are recording your voice for security / authentication purposes”.
The companies that do that have all the information on me that they require for me to set up an account, so their data breaches will be just like this one, but 1000x larger.
Can we just fast forward through the part where this works for ID theft, past the firefox age verification plugin that uses these datasets, and even through the part where people in the plugin dataset are digital outcasts (this voice has been used too many times. Want to try another?)
At the end of this dark predictable tunnel, maybe there will be a ban on biometrics for important stuff, a repeal of the age verification laws, and actual privacy legislation with teeth.
I've had to open a bank account for a company here a few years ago and that was right on the bubble of this happening and they still had an option to come by in person with the proper documentation, which I did, now it is all outsourced.
These companies are the fattest targets and they're run by incompetents. You should assume that anything you give them will eventually be part of some hack.
:)
Can't wait for them to crash and burn.