claude-mem: The 18.8k-Star Memory Plugin for Claude Code

claude-mem: Claude Code Got Memory

  • GitHub Stars: 18,800+
  • Language: TypeScript
  • License: MIT

Why This Project is Trending

The biggest complaint from Claude Code users has been “It forgets everything after the session ends.” [GitHub] claude-mem addresses this directly: it automatically captures and compresses everything that happens during a coding session and injects it as context into the next one.

In simple terms, it’s a plugin that gives Claude Code long-term memory. With over 18,800 stars and 1,300 forks, it has become the most popular extension tool in the Claude Code ecosystem. [GitHub]

What Can You Do?

  • Persistent Memory: Context doesn’t disappear even after the session ends. When you continue fixing a bug you were working on yesterday, you don’t need to explain it from the beginning.
  • Progressive Disclosure: Searches memory layer by layer and retrieves only what is needed, providing accurate context while minimizing token costs (see the sketch after this list).
  • Natural Language Search: If you ask, “Where was the authentication logic I modified last week?”, it finds it in the project history.
  • Web UI Dashboard: You can check the real-time memory stream at localhost:37777. You can transparently see what is being saved.
  • Privacy Control: You can exclude sensitive information from memory with the <private> tag.
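
To make the progressive disclosure idea concrete, here is a minimal Python sketch. It is purely illustrative, not claude-mem’s actual code, and the names are hypothetical: cheap one-line summaries are searched first, and full session detail is loaded only for matching entries.

# Illustrative sketch of layered (progressive-disclosure) memory retrieval.
# Not claude-mem's real API; entry contents are made up.
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    summary: str   # compact layer, always searched
    detail: str    # full layer, loaded only when the summary matches

def recall(query: str, entries: list[MemoryEntry], max_hits: int = 3) -> list[str]:
    # Layer 1: scan the cheap summaries. Layer 2: expand only the hits.
    hits = [e for e in entries if query.lower() in e.summary.lower()]
    return [e.detail for e in hits[:max_hits]]

memory = [
    MemoryEntry("auth: fixed JWT refresh bug", "Session 2026-01-27: rewrote the token refresh flow ..."),
    MemoryEntry("db: added index on users.email", "Session 2026-01-28: migration 0042, query time down 80% ..."),
]
print(recall("auth", memory))  # only the matching entry's detail gets injected as context

claude-mem applies the same idea to compressed session transcripts rather than toy strings, which is how it keeps the injected context small.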

Quick Start

# Add from the plugin marketplace
> /plugin marketplace add thedotmack/claude-mem

# Install
> /plugin install claude-mem

After installation, just restart Claude Code. The context of the previous session will appear automatically. The key is that you don’t have to do anything manually. [GitHub]

Where Is It Useful?

It is essential for developers working on long-term projects with Claude Code. It especially shines when dealing with complex codebases or implementing features over several days.

Personally, I think it’s even more useful for freelancers and developers who juggle multiple projects. Context is stored separately per project, so your flow isn’t interrupted even if you switch from project A to project B and come back.

Points to Note

  • Token usage may increase. Because memory is injected as context, baseline token consumption goes up. The progressive-disclosure approach keeps this in check, though, so it is less severe than you might expect.
  • v9.0.12 is the latest version (released January 28, 2026). The project is stable, with 174 releases behind it, but the release cycle is fast, so it is worth checking for updates regularly.

Similar Projects

Cursor’s built-in context management serves a similar purpose. If you mainly use Claude Code, though, claude-mem is close to the only option. Note that it is a community plugin, not an official Anthropic feature.

Frequently Asked Questions (FAQ)

Q: Is it free to use?

A: It’s completely free. It is distributed under the MIT license and is an open source project. You can use all functions just by installing it without a separate subscription or payment. However, the token cost of Claude Code itself is separate.

Q: Where is the memory data stored?

A: It is stored locally. It is not sent to an external server, so you can use it without worrying about code security. You can directly check the stored content in the web UI and delete it if necessary.

Q: Does it conflict with existing Claude Code settings?

A: Because it works in the form of a plugin, it does not affect existing settings. If a problem occurs after installation, you can return to the original state by simply deactivating the plugin. Stability has been verified through 174 releases.


If this article was helpful, please subscribe to AI Digester.

Reference Materials

When AI Plays Poker and Mafia: Game Arena Changes the Benchmark

AI Plays Poker and Mafia: Game Arena Changes the Benchmark

  • Poker and Mafia (Werewolf) Added to Kaggle Game Arena
  • Gemini 3 Pro/Flash Ranked 1st and 2nd on Chess and Mafia Leaderboards
  • Three-Day Live Commentary Event with Hikaru Nakamura in Progress

What Happened?

Google DeepMind added Poker and Werewolf (Mafia) to the Kaggle Game Arena. [Google Blog] “Chess is a perfect information game. The real world isn’t,” DeepMind’s Oran Kelly said, explaining the reason for the expansion. [TechBuzz]

Why is it Important?

Frankly, existing AI benchmarks have clear limitations. Scores are hitting the ceiling, and data contamination is a serious problem. Game Arena takes a different approach.

Game  | Measured Ability                      | Characteristics
Chess | Strategic reasoning                   | Perfect information
Poker | Risk assessment                       | Imperfect information + probability
Mafia | Social reasoning, deception detection | Natural-language team game

Mafia is also very useful for AI safety research. By having models play both the deceiver and the truth-seeker, it tests an AI’s capacity for deception in a controlled environment. [TechBuzz]

Personally, I think it’s a necessary benchmark in the age of agent AI.

What Will Happen in the Future?

Gemini 3 Pro and Flash are ranked 1st and 2nd on the Chess and Mafia leaderboards. [Google Blog] A live event runs from February 2nd to 4th, with commentary from chess GM Hikaru Nakamura, poker pro Doug Polk, and others. [TechBuzz]

Future plans include expansion to multiplayer video games and real-world simulations. The open-source harness is available on GitHub. [GitHub]

Frequently Asked Questions (FAQ)

Q: Can models other than Gemini participate?

A: Yes. Kaggle Game Arena is an independent public benchmark platform. It is structured so that various frontier models compete against each other. Anyone can participate because new models can be easily added through the open-source harness.

Q: Do game benchmarks reflect actual AI performance?

A: It is more realistic than existing multiple-choice benchmarks. Poker tests decision-making under uncertainty, and Mafia tests natural language social reasoning. However, games are also limited environments. It does not fully capture real-world complexity.

Q: Can LLMs beat chess engines like Stockfish?

A: Not yet. Stockfish calculates millions of moves per second, but LLMs rely on pattern recognition. Interestingly, the reasoning of LLMs is similar to that of human players. They utilize concepts such as piece activity and pawn structure.


If this article was helpful, please subscribe to AI Digester.

Reference Materials

Google AI Decodes Genomes of 17 Endangered Species: The Backup of Life Has Begun

UPDATE (2026-02-03): Expanded to 17 species, added specific species names and EBP 4,386 species data

Google AI Decodes Genomes of 17 Endangered Species: Backup of Life Begins

  • Google uses AI to decode the genomes of 17 endangered species
  • The key is a set of 3: DeepVariant, DeepConsensus, and DeepPolisher
  • Earth BioGenome Project, 4,386 species secured

What happened?

Google used AI to decode the genomes of 17 endangered species, four more than the initial 13. [Google Blog] Simply put, it’s backing up genes before extinction.

What specific species are they? They include the cotton-top tamarin of the forests of northwestern Colombia, the golden mantella frog of Madagascar, and the African penguin off the coasts of South Africa and Namibia. [Google Blog]

Google.org is expanding the project by supporting the AI for Science Fund at Rockefeller University. It collaborates with the Vertebrate Genomes Project and the Earth BioGenome Project (EBP). [New Atlas]

Why is it important?

Genomic data is needed to develop conservation strategies. New Zealand’s kakapo (a nocturnal flightless parrot) is recovering from near extinction through genome analysis. [Google Blog]

The key is speed and accuracy. DeepVariant reduced genetic variation detection errors by 22-52%. [DeepVariant] DeepConsensus increased high-quality sequencing throughput by 250%. [GitHub] DeepPolisher further reduced genome assembly errors by 50%. [Blockchain News]

Personally, I think this, more than LLMs, is where the real value of Google AI lies.

What will happen in the future?

EBP has secured 4,386 genomes so far, with an initial target of 10,000 species by 2026. [EBP] The ultimate goal is to decode all 1.8 million species, at an estimated cost of roughly $5 billion. [Wikipedia]

Frequently Asked Questions (FAQ)

Q: Can genome decoding prevent extinction?

A: Not directly. But genetic diversity analysis can inform the design of breeding programs. As with the kakapo, it becomes a key tool for identifying inbreeding risk and maintaining a healthy population.

Q: What is the difference between DeepVariant, DeepConsensus, and DeepPolisher?

A: DeepVariant is a CNN-based tool for finding genetic variants in sequencing data. DeepConsensus is a transformer model that corrects errors in PacBio long-read data. DeepPolisher catches remaining errors at the genome assembly stage. Used together, they increase both accuracy and throughput.

Q: Can the general public contribute?

A: All three tools are open source, and researchers can use them directly from GitHub. The general public can contribute by joining EBP citizen-science programs or by supporting conservation organizations.


If this article was helpful, please subscribe to AI Digester.

Reference Materials

Anthropic $3 Billion Lawsuit: Allegations of 20,000 Illegal Downloads

Anthropic $3 Billion Lawsuit: Allegations of 20,000 Illegally Downloaded Songs

  • Concord and UMG sue Anthropic for $3 billion
  • Lawsuit jumps from 500 songs to 20,000
  • AI training is legal, but the method of acquisition is accused of piracy

What Happened?

Concord and UMG sued Anthropic for $3 billion.[TechCrunch] They claim more than 20,000 songs were illegally downloaded. The suit started with 500 songs, but thousands more were uncovered during discovery in the Bartz case.[The Wrap]

Why is this Important?

This lawsuit targets “data acquisition” rather than “AI training.” A judge ruled that AI training with copyrighted material is legal.[WebProNews] The problem is that the data was allegedly acquired through illegal downloads.

Personally, I think this will change the landscape of AI copyright lawsuits. “AI training = infringement” claims have consistently lost in court, but “illegal acquisition” is different. The Bartz case ended in a $1.5 billion settlement, and a $3 billion claim means the music industry now has a weapon to pressure AI companies.

What Happens Next?

Anthropic is likely to settle again; losing billions after already paying $1.5 billion would shake investor confidence. OpenAI and Google are probably nervous as well. Neither has disclosed the sources of its training data, and this case sets a precedent for lawsuits over how that data was acquired.

Frequently Asked Questions (FAQ)

Q: Wasn’t it legal to use copyrighted material for AI training?

A: Training is legal. But the problem is how the data was acquired. This lawsuit claims it was stolen through mass downloads without a license.

Q: Does $3 billion mean Anthropic will go bankrupt?

A: With a valuation of $35 billion, it won’t go bankrupt immediately. But having already paid $1.5 billion, losing billions more would shake confidence.

Q: Will other AI companies be sued?

A: It’s possible. OpenAI and Google have not disclosed the source of their training data. If the music and publishing industries move collectively, the AI industry could be shaken.


If you found this helpful, please subscribe to AI Digester.

References

What Medical AI Misses: Linguistic Blind Spots in Clinical Decision-Making Extraction

Medical AI Shows 24-58% Recall Variance in Narrative Clinical Notes

  • Clinical decision extraction accuracy of transformer models varies depending on language characteristics.
  • Extraction performance drops to less than half in narrative sentences.
  • Recall improves from 48% to 71% when applying boundary-tolerant evaluation.

What happened?

A study presented at the EACL HeaLing Workshop 2026 revealed that the clinical decision extraction performance of medical AI depends on the linguistic characteristics of sentences.[arXiv] Mohamed Elgaar and Hadi Amiri’s research team analyzed discharge summaries using the DICTUM framework. Drug-related decisions showed a recall of 58%, while narrative advice fell to 24%.

Why is it important?

The adoption of AI decision support systems is accelerating in the medical field. This study shows that current systems may systematically miss certain types of clinical information.[arXiv] While drug prescriptions are well extracted, patient advice or precautions are easily missed. This is a problem directly related to patient safety.

Boundary-tolerant matching increased recall to 71%. This suggests that most failures of exact matching were boundary mismatches.[arXiv]

What happens next?

The research team recommended adopting boundary-tolerant evaluation and extraction strategies. Clinical NLP systems need to strengthen their ability to process narrative text, and regulatory agencies may also begin to include performance variance by sentence type in their evaluation criteria.

Frequently Asked Questions (FAQ)

Q: How do transformers extract decisions from clinical notes?

A: They understand context bidirectionally using an attention mechanism, computing relationships between tokens to identify the span of the decision text. They are trained on DICTUM data to classify drug prescriptions, test instructions, patient advice, and so on.

Q: Why does extraction performance decrease in narrative sentences?

A: Narrative sentences contain many stop words, pronouns, and hedging expressions, so their semantic density is low. The lack of clear entities makes it hard for the model to pin down decision boundaries, and advice is often spread across several sentences, which does not suit single-span extraction.

Q: What is boundary-tolerant matching and why is it effective?

A: It is a method of recognizing partial overlap even if the extraction range does not exactly match the correct answer. It handles cases where the core content is successfully captured, but only the boundaries are different. The increase in recall from 48% to 71% shows that many errors are boundary setting problems.
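
To see why tolerant matching recovers so much recall, here is a small Python sketch of the idea. It is my own illustration, not the paper’s evaluation code: a predicted span counts as a hit if it overlaps the gold span beyond a threshold, rather than matching the exact character boundaries.

# Sketch of boundary-tolerant span matching (illustrative, not the DICTUM evaluation code).
def overlap_ratio(pred: tuple[int, int], gold: tuple[int, int]) -> float:
    """Fraction of the gold span covered by the predicted span (character offsets)."""
    start = max(pred[0], gold[0])
    end = min(pred[1], gold[1])
    return max(0, end - start) / (gold[1] - gold[0])

def tolerant_recall(preds, golds, threshold=0.5):
    """A gold span counts as recalled if any prediction overlaps it enough."""
    hits = sum(any(overlap_ratio(p, g) >= threshold for p in preds) for g in golds)
    return hits / len(golds)

# Exact matching would score 0 here: the prediction clips a few words off the gold span.
golds = [(10, 62)]   # e.g. "continue amoxicillin 500 mg twice daily for 7 days"
preds = [(10, 55)]   # same decision captured, slightly shorter boundary
print(tolerant_recall(preds, golds))  # 1.0 under tolerant matching, 0.0 under exact match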


If you found this article helpful, please subscribe to AI Digester.

References

TMK Prompting Triples LLM Planning Ability: From 31% to 97%

LLM Planning Performance Soars from 31% to 97%

  • TMK prompting improves reasoning model accuracy by more than 3x
  • Breaks through the limitations of existing Chain-of-Thought with a cognitive science framework
  • Induces a shift from linguistic reasoning to formal code execution paths

What Happened?

A research team at Georgia Tech significantly improved the planning performance of LLMs by applying the Task-Method-Knowledge (TMK) framework, derived from cognitive science, to LLM prompting.[arXiv] In experiments on the Blocksworld domain of the PlanBench benchmark, the existing accuracy of 31.5% increased to 97.3%. Erik Goh, John Kos, and Ashok Goel conducted this research.[arXiv]

Unlike existing hierarchical frameworks that only deal with what to do (Task) and how to do it (Method), TMK explicitly expresses why the action is performed (Knowledge). It captures causal and teleological structures that existing approaches such as HTN or BDI miss.[arXiv]

Why is it Important?

This research comes at a time when skepticism about the reasoning ability of LLMs is growing. Chain-of-Thought (CoT) prompting is widely used, but the debate continues as to whether it is actual reasoning or pattern matching. TMK structurally bypasses this limitation.

Of particular note is the ‘performance reversal’ phenomenon. The reasoning model showed its highest performance on opaque, symbolic tasks where it had previously failed at chance level. The research team interprets this as TMK activating a formal, code-like execution path rather than the model’s default linguistic mode.

From a practical point of view, it means that planning capabilities can be increased more than threefold with prompt engineering alone, without retraining the model. It can be immediately applied to agent systems or automated workflow design.

What Happens Next?

TMK prompting was first validated in the field of education; this work extends an approach that proved effective in AI tutoring systems to LLM reasoning. Generalizing it to other domains is the next research task.

The current experiment is limited to the classic planning problem of Blocksworld. It is necessary to verify whether the TMK effect is maintained in more complex real-world scenarios. However, the figure of 97.3% is impressive enough.

From a prompt-design perspective, a meta-prompting technique that automatically generates the TMK structure is also worth studying: the model would build its own task-decomposition structure without the user having to write the TMK by hand.

Frequently Asked Questions (FAQ)

Q: Why is TMK prompting better than Chain-of-Thought?

A: CoT lists sequential thinking processes, but TMK explicitly structures hierarchical decomposition and causality. In particular, the Knowledge element, which explains why a particular action is performed, activates the formal processing path of the reasoning model, improving symbolic manipulation ability.

Q: What types of tasks are most effective?

A: According to research, the effect is maximized in semantically opaque symbolic manipulation tasks. Performance jumped from 31% to 97% in problems with clear rules but little linguistic meaning, such as block stacking. It is more suitable for abstract planning problems than tasks that can be explained in everyday language.

Q: How do I apply TMK to a real project?

A: You can specify three elements in the prompt. Task is the target state, Method is the subtask decomposition and execution order, and Knowledge is the reason and preconditions for each action. You can try applying it to agent systems or workflow automation that require complex planning.
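
For illustration, here is roughly what such a prompt could look like for a small Blocksworld instance. This is my own sketch following the Task/Method/Knowledge breakdown described above; the field labels and wording are not taken from the paper.

# Hypothetical TMK-structured prompt for a Blocksworld-style planning task (illustrative only).
tmk_prompt = """
TASK (what): Reach the goal state where C is on B and B is on A.
  Initial state: A on table, B on table, C on A, hand empty.

METHOD (how): Decompose into ordered subtasks:
  1. Clear A (unstack C from A, put C down on the table).
  2. Pick up B and stack it on A.
  3. Pick up C and stack it on B.
  Allowed operators: pickup(x), putdown(x), stack(x, y), unstack(x, y).

KNOWLEDGE (why): Preconditions and rationale for each action:
  - stack(x, y) requires holding x and y being clear; it achieves on(x, y).
  - A block must be clear before it can be picked up or stacked onto.
  - Subtask 1 exists because stacking B on A requires A to be clear.

Output the full operator sequence that reaches the goal state.
"""
# The prompt is then sent to the reasoning model as an ordinary user message.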


If you found this article useful, please subscribe to AI Digester.

References

pi-mono: The 5.9k-Star AI Coding Agent Alternative to Claude Code

pi-mono: Build an AI Coding Agent Directly in Your Terminal

  • GitHub Stars: 5.9k
  • Language: TypeScript 96.5%
  • License: MIT

Why This Project is Trending

A developer felt that Claude Code had become too complex. Mario Zechner experimented with LLM coding tools for three years before finally deciding to build his own.[Mario Zechner]

pi-mono is an AI agent toolkit born from the philosophy of “don’t build it if you don’t need it.” It starts with a 1000-token system prompt and four core tools (read, write, edit, bash). Compared to Claude Code’s thousands of tokens worth of prompts, it’s extremely lightweight.[GitHub]

What Can You Do?

  • Integrated LLM API: Use over 15 providers, including OpenAI, Anthropic, Google, Azure, Mistral, and Groq, with a single interface.
  • Coding Agent CLI: Write, test, and debug code interactively in the terminal.
  • Session Management: Interrupt and resume tasks, and even branch like with Git.
  • Slack Bot: Delegate Slack messages to the coding agent.
  • vLLM Pod Management: Deploy and manage your own models on GPU pods.
  • TUI/Web UI Library: Create your own AI chat interface.

Quick Start

# Installation
npm install @mariozechner/pi-coding-agent

# Execution
npx pi

# Or build from source
git clone https://github.com/badlogic/pi-mono
cd pi-mono
npm install && npm run build
./pi-test.sh

Where Would It Be Useful?

If you find Claude Code’s monthly fee of ₩200,000 burdensome and you’re a developer who primarily works in the terminal, pi could be an alternative. You only pay for the API costs.

If you want to use a self-hosted LLM but existing tools don’t support it well, pi is the answer. It even has a built-in vLLM pod management feature.

Personally, I think “transparency” is the biggest advantage. Claude Code runs sub-agents internally, and you can’t see what they’re doing. With pi, you can inspect every model interaction directly.

Things to Note

  • Minimalism is the philosophy. MCP (Model Context Protocol) support is intentionally omitted.
  • Full access permissions, which they call “YOLO mode,” are the default. Be careful: permission checks are looser than Claude Code’s.
  • Documentation is still lacking. You need to read the AGENTS.md file carefully.

Similar Projects

Aider: Also an open-source terminal coding tool. It’s similar in that it’s model-agnostic, but pi covers a wider range (UI library, pod management, etc.).[AIMultiple]

Claude Code: Has more features, but requires a monthly subscription and has limited customization. pi allows you to freely add features with TypeScript extensions.[Northflank]

Cursor: An IDE with integrated AI. If you prefer a GUI over a terminal, Cursor is better.

Frequently Asked Questions (FAQ)

Q: Is it free to use?

A: pi itself is completely free under the MIT license. However, if you use external LLM APIs such as OpenAI or Anthropic, you will incur those costs. If you use Ollama or self-hosted vLLM locally, you can use it without API costs.

Q: Is the performance good enough to replace Claude Code?

A: In the Terminal-Bench 2.0 benchmark, pi with Claude Opus 4.5 attached showed results comparable to Codex, Cursor, and Windsurf. This proves that a minimalist approach doesn’t compromise performance.

Q: Is Korean supported?

A: The UI is in English, but if the LLM you connect to supports Korean, you can code while conversing in Korean. If you connect to Claude or GPT-4, you can write code with Korean prompts.


If you found this article helpful, please subscribe to AI Digester.

References

Claude Code Outages: 62 Incidents in 90 Days, Developers Ask “Again?”

Claude Code Outage: 62 Incidents in 90 Days, Developers Say “Again?”

  • Claude Code access failure at 10:24 AM ET on February 3rd
  • 62 outages in 90 days — average duration 1 hour 19 minutes
  • Claude API, claude.ai also affected

What happened?

Claude Code is down again. Reports surged on Downdetector at 10:24 AM ET on February 3rd. [DesignTAXI] There was also an outage the day before.

Claude API and claude.ai were also affected. Developers complained on social media.

Why is it important?

Anthropic has experienced a total of 62 outages in 90 days. The average duration is 1 hour and 19 minutes. [IsDown]

On January 14, the error rate spiked in Opus 4.5 and Sonnet 4.5, with over 1,500 reports received. [NewsBytes] It took 4 hours to recover.

Frankly, the $200/month Max subscribers are probably the most frustrated.

What’s next?

Anthropic stated that they have fixed configuration issues and added safeguards. [Claude Status] But with 62 incidents in 90 days, improving infrastructure stability is urgent.

Frequently Asked Questions (FAQ)

Q: What are the alternatives when Claude Code is down?

A: You can temporarily use GitHub Copilot, Cursor, or open-source Goose. It’s a good idea to learn one backup tool.

Q: What is the reliability of Anthropic services?

A: The official 90-day uptime is 99.67%. However, with 62 outages averaging 1 hour and 19 minutes, the total downtime is considerable.

Q: How can I check the outage status?

A: You can see the official status at status.claude.com and user reports on Downdetector.


If you found this article helpful, please subscribe to AI Digester.

References

OpenAI Reveals Sora Feed Philosophy: “We Won’t Allow Doomscrolling”

OpenAI Reveals Sora Feed Philosophy: “We Won’t Make You Doomscroll”

  • Creation first, consumption minimized is the core principle
  • New concept recommendation system that can adjust algorithms with natural language
  • Safety measures from the creation stage, a strategy completely opposite to TikTok

What Happened?

OpenAI officially announced the design philosophy of the recommendation feed for its AI video generation app, Sora.[OpenAI] The core message is clear: “A platform for creation, not doomscrolling.”

While TikTok has been criticized for optimizing viewing time, OpenAI has chosen the opposite direction. Instead of optimizing feed dwell time, it prioritizes exposing content that is likely to inspire users to create their own videos.[TechCrunch]

Why is it Important?

Frankly, this is a pretty significant experiment in social media history. Existing social platforms have maximized dwell time for advertising revenue. The longer users stay, the more money they make. The result was addictive algorithms and mental health problems.

OpenAI is already generating revenue with a subscription model (ChatGPT Plus). Since it doesn’t rely on advertising, it doesn’t need to “hold onto users.” Simply put, since the business model is different, the feed design can also be different.

Personally, I’m curious if this will actually work. Can a “creation-encouraging” feed actually maintain user engagement? Or will it eventually revert to dwell time optimization?

4 Principles of the Sora Feed

  • Optimize for Creation: Induce participation rather than consumption. The goal is active creation, not passive scrolling.[Digital Watch]
  • User Control: You can adjust the algorithm with natural language. Instructions such as “Only show me comedies today” are possible.
  • Connection First: Prioritize content from people you follow and people you know over viral global content.
  • Safety-Freedom Balance: Since all content is generated within Sora, harmful content is blocked at the creation stage.

How is it Technically Different?

OpenAI has developed a new type of recommendation algorithm using existing LLMs. The key differentiator is “natural language instructions.” Users can directly describe the type of content they want to the algorithm in words.[TechCrunch]

Personalization signals include Sora activity (likes, comments, remixes), IP-based location, ChatGPT usage history (can be turned off), and the number of followers of the creator. However, safety signals are also included, and harmful content is suppressed from exposure.
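
Mechanically, natural-language steering can be imagined as an extra rerank step: the user’s standing instruction is scored against each candidate by an LLM and folded into the ranking. The sketch below is my own simplification, not OpenAI’s implementation; score_with_llm stands in for a model call and is stubbed out here.

# Simplified illustration of natural-language feed steering (not OpenAI's actual system).
def rerank(candidates, instruction, score_with_llm):
    """Order candidates by how well they match the user's standing instruction.

    score_with_llm is a placeholder for an LLM call returning a 0-1 relevance score.
    Safety and personalization signals would be combined here too; this sketch keeps
    only the instruction-following term.
    """
    scored = [(score_with_llm(instruction, c["caption"]), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]

feed = rerank(
    candidates=[{"caption": "slapstick office comedy"}, {"caption": "moody sci-fi short"}],
    instruction="Only show me comedies today",
    score_with_llm=lambda instr, caption: float("comedy" in caption),  # stub for a real model call
)
print(feed)  # the comedy clip ranks first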

What Will Happen in the Future?

The Sora app ranked first in the App Store within 48 hours of its release. It had 56,000 downloads on the first day, and downloads tripled on the second.[TechCrunch] The initial reaction was enthusiastic.

But the problem is sustainability. As OpenAI acknowledged, this feed is a “living system.” It will continue to change based on user feedback. What happens if the creation-centric philosophy conflicts with actual user behavior? We’ll have to wait and see.

Frequently Asked Questions (FAQ)

Q: How is the Sora feed different from TikTok?

A: TikTok aims to keep users engaged by optimizing viewing time. Sora, on the other hand, prioritizes showing content that is likely to inspire users to create their own videos. It is designed with a focus on creation rather than consumption.

Q: What does it mean to adjust the algorithm with natural language?

A: Existing apps determine recommendations based only on behavioral data such as likes and viewing time. Sora lets users type instructions such as “Only show me sci-fi videos today” directly, and the algorithm adjusts accordingly.

Q: Are there any youth protection features?

A: Yes. You can turn off feed personalization or limit continuous scrolling through ChatGPT parental controls. Youth accounts are by default limited in the number of videos they can create per day, and the Cameo (video featuring others) function also has stricter permissions applied.


If this article was helpful, please subscribe to AI Digester.

Reference Materials

How to Reduce FID by 30% in Text-to-Image Model Training

Key Takeaways: 200K Step Secret, Muon Optimizer, Token Routing

  • REPA alignment is only an initial accelerator; it must be removed after 200K steps.
  • Achieved FID 18.2 → 15.55 (15% improvement) with just the Muon optimizer.
  • TREAD token routing pulls down FID to 14.10 at 1024×1024 high resolution.

What Happened?

The Photoroom team released Part 2 of their training-optimization guide for the PRX text-to-image model.[Hugging Face] While Part 1 covered the architecture, this time they share concrete ablation results on what to do, and how, when actually training the model.

Frankly, most technical documents of this kind end with “Our model is great,” but this is different. They also disclosed failed experiments and showed the trade-offs of each technique numerically.

Why is it Important?

Training a text-to-image model from scratch is incredibly expensive. Thousands of GPU hours can be wasted with just one wrong setting. The data released by Photoroom reduces this trial and error.

Personally, the most notable finding concerns REPA (representation alignment). Using REPA-DINOv3 drops the FID from 18.2 to 14.64. But there’s a catch: throughput decreases by 13%, and after 200K steps it actually hinders training. Simply put, it’s just an early booster.

There is also the BF16 weight-saving pitfall. Save checkpoints in BF16 instead of FP32 without knowing this, and FID jumps from 18.2 to 21.87, a 3.67-point regression. Surprisingly many teams fall into this trap.

Practical Guide: Resolution-Specific Strategies

Technique      | 256×256 FID | 1024×1024 FID | Throughput
Baseline       | 18.20       | –             | 3.95 b/s
REPA-E-VAE     | 12.08       | –             | 3.39 b/s
TREAD          | 21.61 ↑     | 14.10 ↓       | 1.64 b/s
Muon Optimizer | 15.55       | –             | –

At 256×256, TREAD actually degrades quality, but at 1024×1024 the results are completely different. The higher the resolution, the more token routing pays off.

What Will Happen Next?

Photoroom will release the entire training code in Part 3 and conduct a 24-hour “speedrun.” They’re going to show how quickly you can make a decent model.

Personally, I think this release will have a significant impact on the open-source image generation model ecosystem. This is the first time since Stable Diffusion that training know-how has been disclosed so specifically.

Frequently Asked Questions (FAQ)

Q: When should REPA be removed?

A: After about 200K steps. It accelerates learning in the beginning, but after that, it actually hinders convergence. This was clearly revealed in the Photoroom experiment. Missing the timing will degrade the final model quality.
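
In code, “removing REPA” usually just means zeroing its auxiliary loss weight past the cutoff, and skipping the extra feature-extraction forward pass to win back the throughput. A minimal sketch, assuming REPA enters training as one extra weighted loss term (variable names are illustrative, not Photoroom’s code):

# Illustrative schedule: keep the REPA alignment loss only as an early booster.
REPA_CUTOFF_STEP = 200_000   # per the ablation discussed above
REPA_WEIGHT = 0.5            # hypothetical weight while REPA is active

def repa_weight(step: int) -> float:
    return REPA_WEIGHT if step < REPA_CUTOFF_STEP else 0.0

def total_loss(diffusion_loss: float, repa_loss: float, step: int) -> float:
    # Past the cutoff, also skip computing repa_loss (the DINO-feature forward pass)
    # entirely to recover the roughly 13% throughput it costs.
    return diffusion_loss + repa_weight(step) * repa_loss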

Q: Should I use synthetic data or real images?

A: Use both. Train on synthetic images first to learn global structure, then switch to real images to capture high-frequency detail. With synthetic images alone, the FID looks good but the outputs don’t feel like photographs.

Q: How much better is the Muon optimizer than AdamW?

A: About 15% improvement based on FID. It dropped from 18.2 to 15.55. The computational cost is similar, so there’s no reason not to use it. However, hyperparameter tuning is a bit tricky.


If you found this article useful, please subscribe to AI Digester.

Reference Materials