How AI is made: having fun with the data behind language models


Beyond the Hype: How Data and Databases Power AI at Web Summit Lisbon 2025

(This article was generated with AI and is based on an AI-generated transcription of a real talk on stage. While we strive for accuracy, we encourage readers to verify important information.)

Alexey Milovidov

At Web Summit Lisbon 2025, Alexey Milovidov, Co-founder and CTO of ClickHouse, shifted the focus from AI hype to its data foundation. He emphasized that AI models are built from vast, diverse datasets, and his presentation explored how these massive collections are stored, managed, and analyzed, underscoring the critical role databases play in the AI ecosystem.

Mr. Milovidov showcased the open datasets vital for training language models, including Common Crawl (terabytes of new web pages published every month) and GitHub (used for training coding agents). Other sources included Wikipedia as a factual reference, and dynamic platforms such as Reddit and 4chan for more varied content. Common Crawl contains 2.5 billion web pages, amounting to 500 terabytes uncompressed or 10 terabytes of compressed text; Wikipedia, by comparison, is 45 gigabytes.

A key demonstration involved loading the FineWeb dataset into ClickHouse. This dataset, with 25 billion records spread across 46 terabytes of Parquet files, was ingested directly from Hugging Face. Remarkably, loading the entire collection into ClickHouse using 50 parallel threads took only 72 minutes, showcasing ClickHouse’s rapid ingestion and its compression, which shrank the 46 terabytes of Parquet to 33 terabytes on disk.
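The talk did not show the exact commands, but an ingestion like this maps naturally onto ClickHouse’s url() table function reading Parquet over HTTP. The sketch below is a minimal reconstruction in Python using the clickhouse-connect driver; the table schema, the Hugging Face file path and glob pattern, and the thread-count settings are all assumptions rather than the presenter’s actual setup.

```python
# Sketch: ingesting FineWeb-style Parquet files from Hugging Face into ClickHouse.
# The table name, column subset, URL pattern, and settings are illustrative
# assumptions; the talk only mentioned 50 parallel threads, not the exact syntax.
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', port=8123)

# A minimal target table; the real dataset carries more columns than shown here.
client.command("""
    CREATE TABLE IF NOT EXISTS fineweb
    (
        url  String,
        text String
    )
    ENGINE = MergeTree
    ORDER BY url
""")

# Read Parquet straight off Hugging Face with the url() table function and
# fan the insert out over many threads (hypothetical path and glob pattern).
client.command("""
    INSERT INTO fineweb
    SELECT url, text
    FROM url(
        'https://huggingface.co/datasets/HuggingFaceFW/fineweb/resolve/main/data/CC-MAIN-2024-10/{000..099}_00000.parquet',
        'Parquet'
    )
    SETTINGS max_insert_threads = 50, max_download_threads = 50
""")
```

Fanning the download and insert out across many threads in this way is most likely what the 50-way parallelism mentioned on stage refers to.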

Once the data was loaded, Mr. Milovidov presented a series of analytical queries. An analysis of the top domains within FineWeb revealed a dominance of sites like Wikipedia, The Guardian, Business Insider, and Fox News. Further analysis contrasted word frequencies across platforms: Bluesky showed “social,” “love,” and “hope,” while Hacker News featured “data,” “Google,” and “problem.” On Wikipedia, notably, “you” and “me” ranked among the least frequent words.
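Queries of this shape are straightforward once the text sits in a table. The sketch below shows one plausible way to compute the top-domain and word-frequency breakdowns described above; the fineweb table and its url and text columns carry over from the previous sketch and remain assumptions.

```python
# Sketch: the kind of aggregation behind "top domains" and "most frequent words".
# Table and column names are assumptions; the functions are standard ClickHouse SQL.
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', port=8123)

# Which domains contribute the most pages to the dataset?
top_domains = client.query("""
    SELECT domain(url) AS site, count() AS pages
    FROM fineweb
    GROUP BY site
    ORDER BY pages DESC
    LIMIT 10
""")

# Rough word-frequency profile: lowercase, split on whitespace, count tokens.
top_words = client.query("""
    SELECT arrayJoin(splitByWhitespace(lower(text))) AS word, count() AS hits
    FROM fineweb
    GROUP BY word
    ORDER BY hits DESC
    LIMIT 20
""")

for site, pages in top_domains.result_rows:
    print(f"{site:40} {pages}")
for word, hits in top_words.result_rows:
    print(f"{word:20} {hits}")
```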

Mr. Milovidov then introduced “style fingerprints” for websites: each token is hashed and mapped into a 1,024-dimensional vector, yielding a unique stylistic signature for every internet domain. Applying these fingerprints, he identified websites similar to ClickHouse.com, and the closest matches turned out to be other data-intensive technology companies such as Imply and Redpanda, validating the approach.
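The exact tokenization, hash function, and weighting behind the fingerprints were not spelled out in the talk, so the standalone Python sketch below is just one plausible reading of “hash each token into a 1,024-dimensional vector and compare domains by similarity.”

```python
# Sketch of a "style fingerprint": hash every token of a domain's text into one
# of 1,024 buckets and compare domains by cosine similarity. The tokenization,
# choice of hash, and weighting are assumptions; the talk only described hashing
# tokens into a 1,024-dimensional vector per domain.
import math
import re
import zlib

DIM = 1024

def fingerprint(text: str) -> list[float]:
    """Bucket-count token hashes, then L2-normalize the resulting vector."""
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already-normalized fingerprints."""
    return sum(x * y for x, y in zip(a, b))

# Toy usage: texts with a similar vocabulary score closer to 1.0.
fp_a = fingerprint("fast analytical queries over columnar data")
fp_b = fingerprint("columnar storage for fast analytical workloads")
fp_c = fingerprint("share your favourite sourdough recipes and photos")
print(cosine(fp_a, fp_b), cosine(fp_a, fp_c))
```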

The presentation also explored tracking word trends on platforms like Reddit. Examples included the growth of “ClickHouse,” the decline of “Hadoop,” and the volatile yet strong presence of “blockchain.” This illustrated how data analysis can effectively reveal shifts in public interest and technological adoption over time.
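A trend line like this boils down to a per-month count of comments mentioning a term. The sketch below assumes a hypothetical reddit_comments table with body and created_utc columns; the real schema used on stage was not shown.

```python
# Sketch: tracking how often a term appears over time, in the spirit of the
# Reddit trend charts. The reddit_comments table and its columns (body,
# created_utc) are assumptions about how such a dump might be loaded.
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', port=8123)

trend = client.query("""
    SELECT
        toStartOfMonth(toDateTime(created_utc)) AS month,
        countIf(hasToken(lower(body), 'clickhouse')) AS clickhouse_mentions,
        countIf(hasToken(lower(body), 'hadoop'))     AS hadoop_mentions
    FROM reddit_comments
    GROUP BY month
    ORDER BY month
""")

for month, clickhouse_mentions, hadoop_mentions in trend.result_rows:
    print(month, clickhouse_mentions, hadoop_mentions)
```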

Expanding to multimodal data, Mr. Milovidov discussed a dataset of one billion internet photos. He demonstrated loading these into ClickHouse and visualizing their density on a map, where users could interactively find the “best photo” for any location in real time, showcasing the platform’s versatility across diverse data types.
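One plausible way to back such a map is to bucket photos into coarse latitude/longitude cells and pick a top-ranked photo per cell. In the sketch below, the photos table, its columns, and the score used to rank the “best photo” are assumptions for illustration; the talk did not describe the actual schema or ranking.

```python
# Sketch: bucketing photos into coarse map cells and picking a "best" photo per
# cell. The photos table and its columns (latitude, longitude, url, score) are
# assumptions; the real schema and ranking were not shown in the talk.
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', port=8123)

grid = client.query("""
    SELECT
        round(latitude, 1)  AS lat_cell,   -- ~11 km cells at the equator
        round(longitude, 1) AS lon_cell,
        count()             AS photos_in_cell,
        argMax(url, score)  AS best_photo  -- photo with the highest score
    FROM photos
    WHERE latitude BETWEEN 38.6 AND 38.9   -- e.g. zoomed in on Lisbon
      AND longitude BETWEEN -9.3 AND -9.0
    GROUP BY lat_cell, lon_cell
    ORDER BY photos_in_cell DESC
    LIMIT 20
""")

for lat, lon, n, url in grid.result_rows:
    print(f"({lat:+.1f}, {lon:+.1f}) {n:>8} photos, top: {url}")
```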

In conclusion, Mr. Milovidov emphasized that large, messy datasets, which are usually challenging to work with, become manageable and even enjoyable with ClickHouse. Loading these extensive collections into the database simplifies manipulation and analysis, ultimately making the internet’s vast trove of information more accessible.
