Vana plans to let users rent out their Reddit data to train AI (2024)

In the generative AI boom, data is the new oil. So why shouldn't you be able to sell your own?

From Big Tech firms to startups, AI makers are licensing e-books, images, videos, audio and more from data brokers, all in the pursuit of training up more capable (and more legally defensible) AI-powered products. Shutterstock has deals with Meta, Google, Amazon and Apple to supply millions of images for model training, while OpenAI has signed agreements with several news organizations to train its models on news archives.

In many cases, the individual creators and owners of that data haven't seen a dime of the cash changing hands. A startup called Vana wants to change that.

Anna Kazlauskas and Art Abal, who met in a class at the MIT Media Lab focused on building tech for emerging markets, co-founded Vana in 2021. Prior to Vana, Kazlauskas studied computer science and economics at MIT, eventually leaving to launch a fintech automation startup, Iambiq, out of Y Combinator. Abal, a corporate lawyer by training and education, was an associate at The Cadmus Group, a Boston-based consulting firm, before heading up impact sourcing at data annotation company Appen.

With Vana, Kazlauskas and Abal set out to build a platform that lets users "pool" their data -- including chats, speech recordings and photos -- into datasets that can then be used for generative AI model training. They also want to create more personalized experiences -- for instance, a daily motivational voicemail based on your wellness goals, or an art-generating app that understands your style preferences -- by fine-tuning public models on that data.

"Vana’s infrastructure in effect creates a user-owned data treasury," Kazlauskas told TechCrunch. "It does this by allowing users to aggregate their personal data in a non-custodial way … Vana allows users to own AI models and use their data across AI applications."

Here's how Vana pitches its platform and API to developers:

The Vana API connects a user's cross-platform personal data … to allow you to personalize your application. Your app gains instant access to a user's personalized AI model or underlying data, simplifying onboarding and eliminating compute cost concerns. … We think users should be able to bring their personal data from walled gardens, like Instagram, Facebook and Google, to your application, so you can create amazing personalized experiences from the very first time a user interacts with your consumer AI application.

Creating an account with Vana is fairly simple. After confirming your email, you can attach data to a digital avatar (e.g., selfies, a description of yourself and voice recordings) and explore apps built using Vana's platform and datasets. The app selection ranges from ChatGPT-style chatbots and interactive storybooks to a Hinge profile generator.

Now, why, you might ask -- in this age of increased data privacy awareness and ransomware attacks -- would someone ever volunteer their personal info to an anonymous startup, much less a venture-backed one? (Vana has raised $20 million to date from Paradigm, Polychain Capital and other backers.) Can any profit-driven company really be trusted not to abuse or mishandle any monetizable data it gets its hands on?

In response to that question, Kazlauskas stressed that the whole point of Vana is for users to "reclaim control over their data," noting that Vana users can self-host their data rather than store it on Vana's servers, and can control how their data is shared with apps and developers. She also argued that, because Vana makes money by charging users a monthly subscription (starting at $3.99) and levying a "data transaction" fee on devs (e.g., for transferring datasets for AI model training), the company is disincentivized to exploit users and the troves of personal data they bring with them.

"We want to create models owned and governed by users who all contribute their data," Kazlauskas said, "and allow users to bring their data and models with them to any application."

Now, while Vana isn't selling users' data to companies for generative AI model training (or so it claims), it wants to allow users to do this themselves if they choose -- starting with their Reddit posts.

This month, Vana launched what it's calling the Reddit Data DAO (Decentralized Autonomous Organization), a program that pools multiple users' Reddit data (including their karma and post history) and lets them decide together how that combined data is used. After joining with a Reddit account, submitting a request to Reddit for their data and uploading that data to the DAO, users gain the right to vote alongside other members of the DAO on decisions like licensing the combined data to generative AI companies for a shared profit.

We have crunched the numbers and r/datadao is now largest data DAO in history: Phase 1 welcomed 141,000 reddit users with 21,000 full data uploads.

— r/datadao (@rdatadao) April 11, 2024

It's an answer of sorts to Reddit's recent moves to commercialize data on its platform.

Reddit previously didn’t gate access to posts and communities for generative AI training purposes. But it reversed course late last year, ahead of its IPO. Since the policy change, Reddit has raked in over $203 million in licensing fees from companies, including Google.

"The broad idea [with the DAO is] to free user data from the major platforms that seek to hoard and monetize it," Kazlauskas said. "This is a first and is part of our push to help people pool their data into user-owned datasets for training AI models."

Unsurprisingly, Reddit -- which isn't working with Vana in any official capacity -- isn't pleased about the DAO.

Reddit banned Vana's subreddit dedicated to discussion about the DAO. And a Reddit spokesperson accused Vana of "exploiting" its data export system, which is designed to comply with data privacy regulations like the GDPR and California Consumer Privacy Act.

"Our data arrangements allow us to put guardrails on such entities, even on public information," the spokesperson told TechCrunch. "Reddit does not share non-public, personal data with commercial enterprises, and when Redditors request an export of their data from us, they receive non-public personal data back from us in accordance with applicable laws. Direct partnerships between Reddit and vetted organizations, with clear terms and accountability, matters, and these partnerships and agreements prevent misuse and abuse of people’s data."

But does Reddit have any real reason to be concerned?

Kazlauskas envisions the DAO growing to the point where it impacts the amount Reddit can charge customers for its data. That's a long ways off, assuming it ever happens; the DAO has just over 141,000 members, a tiny fraction of Reddit's 73-million-strong user base. And some of those members could be bots or duplicate accounts.
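
For a sense of scale, the figures quoted in this article can be checked with a little arithmetic. The sketch below uses only the numbers reported here (roughly 141,000 members, 21,000 full data uploads and a 73-million user base); the variable names are illustrative and not drawn from Vana or Reddit.

```python
# Figures as reported in the article; names are illustrative only.
dao_members = 141_000       # Reddit users who joined the DAO in Phase 1
full_uploads = 21_000       # members who completed a full data upload
reddit_users = 73_000_000   # Reddit user base cited in the article

# Share of Reddit's user base that has joined the DAO.
member_share = dao_members / reddit_users

# Share of DAO members who actually uploaded their full data.
upload_rate = full_uploads / dao_members

print(f"DAO members as a share of Reddit users: {member_share:.2%}")
print(f"DAO members with full data uploads: {upload_rate:.1%}")
```

On these numbers, the DAO covers about 0.19% of Reddit's users, and only about 15% of its members have contributed a full data export.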

Then there's the matter of how to fairly distribute payments that the DAO might receive from data buyers.

Currently, the DAO awards "tokens" -- cryptocurrency -- to users corresponding to their Reddit karma. But karma might not be the best measure of quality contributions to the dataset -- particularly in smaller Reddit communities with fewer opportunities to earn it.
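
To see why karma can be a blunt yardstick, consider a minimal sketch of a karma-weighted payout. The article says only that tokens "correspond" to karma, so the strictly proportional formula, the `allocate_tokens` function, the treatment of negative karma and the sample accounts below are all assumptions made for illustration, not Vana's actual scheme.

```python
def allocate_tokens(karma_by_user: dict[str, int], token_pool: float) -> dict[str, float]:
    """Split token_pool across users in proportion to their karma.

    Hypothetical scheme: the real DAO's formula is not described in the article.
    """
    # Assumption: negative karma earns nothing rather than reducing others' shares.
    weights = {user: max(karma, 0) for user, karma in karma_by_user.items()}
    total = sum(weights.values())
    if total == 0:
        # Nobody has positive karma, so nothing can be distributed proportionally.
        return {user: 0.0 for user in weights}
    return {user: token_pool * weight / total for user, weight in weights.items()}


# Made-up members: a prolific poster in a huge subreddit, an expert in a
# small one, and an occasional commenter.
members = {"big_sub_poster": 90_000, "niche_sub_expert": 9_000, "lurker": 1_000}
shares = allocate_tokens(members, token_pool=1_000.0)

for user, tokens in shares.items():
    print(f"{user}: {tokens:.1f} tokens")
```

Under this assumed rule, the poster in the large community receives ten times the tokens of the niche expert, regardless of whose posts are more useful as training data.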

Kazlauskas floats the idea that members of the DAO could choose to share their cross-platform and demographic data, making the DAO potentially more valuable and incentivizing sign-ups. But that would also require users to place even more trust in Vana to treat their sensitive data responsibly.

Personally, I don't see Vana's DAO reaching critical mass. The roadblocks standing in the way are far too many. I do think, however, that it won't be the last grassroots attempt to assert control over the data increasingly being used to train generative AI models.

Startups like Spawning are working on ways to allow creators to impose rules guiding how their data is used for training while vendors like Getty Images, Shutterstock and Adobe continue to experiment with compensation schemes. But no one's cracked the code yet. Can it even be cracked? Given the cutthroat nature of the generative AI industry, it's certainly a tall order. But perhaps someone will find a way -- or policymakers will force one.
