TEXXR

Chronicles

The story behind the story

A look at the challenges some AI developers face in building models to extract trillions of high-quality tokens from PDFs, which are hard to parse, for training

Last November, the House Oversight Committee had just released 20,000 pages of documents from the estate of Jeffrey Epstein

The Verge Josh Dzieza

Discussion

  • @dorialexander Alexander Doria on x
    Ah ah. Another part of my timeline that has not moved in a year.
  • @prietschka Paul Rietschka on bluesky
    Uh, just going to point out that there's been zero progress on AI building “complex software,” and AI cannot, and will not, solve “advanced physics problems.” These claims are lies, plain and simple.
  • @theverge.com @theverge.com on bluesky
    Despite rapid progress in AI's ability to build complex software and solve advanced physics problems, the ubiquitous format of PDF remains something of a grand challenge. Read more from @joshdzieza.bsky.social: www.theverge.com/ai-artificia...
  • @joshdzieza Josh Dzieza on bluesky
    I wrote about the unconquerable frontier of AI capabilities: reading a pdf
  • @alex.bsky.team Alex on bluesky
    oh man I was working on ML for parsing PDFs over a decade ago and you could never get me to touch that space again www.theverge.com/ai-artificia...
  • @tcarmody Tim Carmody on bluesky
    The bill for storing so much information online in formats designed for printing rather than in HTML was always going to come due, although I confess this wasn't how I imagined it