
How AI Secretly Gathers Data and What They're Not Telling You

Artificial Intelligence powers everything from chatbots to complex data analysis tools. But behind the sleek interfaces and impressive capabilities lies a hidden process – the collection of petabytes of data. We sat down with our CEO, Vytautas Savickas, to discuss the AI revolution and how data is being collected to fuel various tools.

Kotryna Ragaišytė

Mar 12, 2025

6 min read

The AI rush for data

AI models, especially large language models (LLMs) like ChatGPT, Perplexity, and others, are data-hungry beasts. They require vast amounts of information to learn, adapt, and generate human-like responses. This data doesn't magically appear – it's actively collected from various sources across the web. So, where does all this data come from?

Public web data

AI tools thrive on publicly available data online. From news articles and Wikipedia entries to social media posts and forums, every corner of the internet is a knowledge source for LLMs, GPTs, and other AI-powered tools.

AI systems harvest such data to train models capable of mimicking human-like intelligence. Think about it – every tweet, blog post, or product review you’ve ever written could be part of a dataset powering the next chatbot or recommendation engine.

Books and research papers

Online book collections are a treasure trove for AI training. They offer structured, high-quality information covering centuries of human knowledge.

Digitized books, whether from public domain collections like Project Gutenberg or proprietary datasets like Books3, provide AI with linguistic diversity and depth.

Academic research papers further enrich this pool by helping AI-powered tools explore scientific insights and formal writing styles.
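To make this concrete, here's a minimal sketch of how a public-domain book might be pulled into a training corpus. It assumes a Project Gutenberg plain-text URL and the usual START/END boilerplate markers; the specific title, URL, and output file are illustrative choices, not a description of any particular AI pipeline.

```python
# A minimal sketch of pulling one public-domain book into a text corpus.
# The URL and the START/END markers are typical of Project Gutenberg
# plain-text files, but treat them as assumptions for this example.
import requests

BOOK_URL = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"  # example title

def fetch_book_text(url: str) -> str:
    """Download a plain-text book and trim the boilerplate header and footer."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    text = response.text

    # Keep only the body between the START and END markers, if present.
    start = text.find("*** START OF")
    end = text.find("*** END OF")
    if start != -1 and end != -1:
        text = text[text.index("\n", start) + 1 : end]
    return text.strip()

if __name__ == "__main__":
    body = fetch_book_text(BOOK_URL)
    with open("corpus_sample.txt", "w", encoding="utf-8") as f:
        f.write(body)
    print(f"Saved {len(body):,} characters of training text")
```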

User-generated content

Every interaction you have with an AI, whether it’s a chatbot query or feedback on generated text, goes back into the system to improve its performance. Search engines and virtual assistants quietly collect user inputs to refine algorithms and enhance personalization.

"Every digital interaction, even something as small as correcting a typo in an AI-generated response, can become part of a training dataset. While this iterative learning process improves AI models, the key challenge is ensuring transparency. Users should know when and how their data is used, with clear options to opt out. Responsible AI development must prioritize informed participation over passive data collection." – Vytautas Savickas, CEO of Smartproxy

Proprietary datasets

Some AI developers go beyond public data by purchasing or creating proprietary datasets tailored to specific needs. These datasets can include anything from medical records (with proper anonymization) to financial transactions or industry-specific information.

Proprietary data is a competitive edge for companies looking to build specialized applications like healthcare diagnostics or fraud detection systems. However, these datasets often come at a high cost.
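The anonymization mentioned above can take many forms. As a simplified illustration, the sketch below replaces direct identifiers with salted hashes before a record joins a dataset; the field names and hashing scheme are assumptions for the example, and real medical or financial anonymization involves far stricter techniques.

```python
# A simplified sketch of pseudonymizing a record before it joins a dataset.
# Real-world anonymization (e.g., for medical data) is far more involved;
# field names and the salted-hash approach here are illustrative assumptions.
import hashlib

SALT = "replace-with-a-secret-salt"  # kept separate from the published dataset

def pseudonymize(record: dict, identifier_fields: list[str]) -> dict:
    """Return a copy of the record with direct identifiers hashed."""
    clean = dict(record)
    for field in identifier_fields:
        if field in clean:
            value = str(clean[field]) + SALT
            clean[field] = hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
    return clean

raw = {"patient_id": "12345", "name": "Jane Doe", "diagnosis": "hypertension"}
print(pseudonymize(raw, ["patient_id", "name"]))
# -> identifiers become opaque hashes; the clinical content stays usable
```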

These are only the primary sources of data for Artificial Intelligence. From public web data to a company's internal records, AI-powered solutions can draw learning materials from a wide range of channels.


How does AI collect data?

Web scraping is often the primary technique used to extract large amounts of data from websites automatically. With automated solutions like Web Scraping API, users can gather data from various sources – including public databases, community forums, and even news sites – and export the results in their preferred format to train their AI tools.

This technique is critical for AI developers building comprehensive datasets to train their AI models; a minimal code sketch follows the list below. Web scraping solutions offer several key benefits:

  • Efficiency – developers can automate public data extraction, saving countless hours of manual collection.
  • Scalability – web scraping tools can handle large-scale public data collection projects with ease.
  • Accuracy – data can be presented in a structured format, reducing errors and ensuring data quality.
  • Real-time data – users can get up-to-date information from various publicly available sources, which is crucial for AI applications requiring the latest data.
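As a product-agnostic illustration of the idea, the sketch below fetches a single public page and extracts its headlines into a structured file. The URL, CSS selector, and output format are placeholder assumptions; a production setup (for example, one built on a Web Scraping API) would add proxy rotation, JavaScript rendering, retries, and rate limiting.

```python
# A generic, minimal illustration of web scraping for dataset building:
# fetch a public page, extract text, and store it in a structured format.
# The URL, CSS selector, and output file are placeholders for this example.
import json
import requests
from bs4 import BeautifulSoup

TARGET_URL = "https://example.com/news"  # placeholder public page

def scrape_headlines(url: str) -> list[dict]:
    """Download one page and return its headlines as structured records."""
    response = requests.get(url, timeout=30,
                            headers={"User-Agent": "dataset-builder-demo/0.1"})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    return [
        {"source_url": url, "headline": tag.get_text(strip=True)}
        for tag in soup.select("h2")  # assumed selector for this sketch
    ]

if __name__ == "__main__":
    records = scrape_headlines(TARGET_URL)
    with open("scraped_headlines.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    print(f"Collected {len(records)} records")
```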

"AI is only as good as the data it learns from, and that data needs to be fresh, diverse, and reliable. Every day, we welcome more users who leverage our scraping solutions to collect publicly available data from various websites to build and train their AI tools. And while this use case is still fresh, we believe that web scraping will soon become one of the most essential drivers of AI innovation, enabling businesses to develop smarter, more adaptive models with real-world insights", Savickas added.

What they're not telling you

While AI developers are eager to showcase the capabilities of their AI tools, they're often less forthcoming about the data collection practices that power them. In January, security researchers found an exposed DeepSeek database containing sensitive information, including a large volume of chat histories, backend details, log data, and other operational information.

  • Massive data requirements – training AI models demands enormous amounts of data, often at petabyte scale. For example, IBM's AI training used over 14 petabytes of raw data from web crawls and other sources. For comparison, the average internet user generates 15.87 terabytes of data daily.
  • Opaque data practices – users frequently face challenges in understanding what data is collected, its usage, and its retention periods. This lack of transparency can erode trust and raise concerns about privacy and consent.
  • Bias in data – the data used to train AI models can contain biases, which the models may then replicate, leading to unfair or skewed outputs. Addressing these biases is crucial for ensuring AI systems provide accurate results.

Vytautas Savickas emphasizes the importance of ethical data collection, stating, "The responsibility for ethical data collection is not a mere organizational concern – it’s a collective imperative for the entire AI community."

Bottom line

AI's insatiable appetite for data fuels its rapid advancement, but this progress comes with its own challenges. While AI companies emphasize their technologies' capabilities, they often obscure the massive scale of data collection, the potential for bias in training sets, and the lack of transparency in data usage practices.

Please note that use of Smartproxy's Web Scraping API doesn’t grant you any rights with regards to the data you scrape with it, which may be subject to privacy, intellectual property or other rights. We advise you to consult a legal professional before starting your scraping activities.

About the author

Kotryna Ragaišytė

Head of Content & Brand Marketing

Kotryna Ragaišytė is a seasoned marketing and communications professional with over 8 years of industry experience, including 5 years specialized in tech industry communications. Armed with a Master's degree in Communications, she brings deep expertise in brand positioning, creative strategy, and content development to every project she tackles.


Connect with Kotryna via LinkedIn.

All information on Smartproxy Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Smartproxy Blog or any third-party websites that may be linked therein.

© 2018-2025 smartproxy.com, All Rights Reserved