<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="https://xavierlopez.me/feed.xml" rel="self" type="application/atom+xml" /><link href="https://xavierlopez.me/" rel="alternate" type="text/html" /><updated>2026-04-28T19:28:52-07:00</updated><id>https://xavierlopez.me/feed.xml</id><title type="html">Xavier G. Lopez | Platform Engineer</title><subtitle>Platform Engineer specializing in Infrastructure as Product and Executable Architecture.</subtitle><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;&quot;&quot;, &quot;location&quot;=&gt;&quot;Los Angeles, CA&quot;, &quot;email&quot;=&gt;&quot;xavier@xavierlopez.me&quot;, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;LinkedIn&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;, &quot;url&quot;=&gt;&quot;https://linkedin.com/in/zavelopez&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/zavestudios&quot;}]}</name><email>xavier@xavierlopez.me</email></author><entry><title type="html">Building a Production-Grade k3s Cluster on Spare Capacity</title><link href="https://xavierlopez.me/devops/building-production-grade-k3s-cluster-on-spare-capacity/" rel="alternate" type="text/html" title="Building a Production-Grade k3s Cluster on Spare Capacity" /><published>2026-01-27T04:00:00-08:00</published><updated>2026-01-27T00:00:00-08:00</updated><id>https://xavierlopez.me/devops/building-production-grade-k3s-cluster-on-spare-capacity</id><content type="html" xml:base="https://xavierlopez.me/devops/building-production-grade-k3s-cluster-on-spare-capacity/"><![CDATA[<h2 id="context">Context</h2>

<p>I needed a Kubernetes cluster that could run continuously for platform engineering work - deploying services, testing GitOps workflows, running applications. The cluster needed to be production-grade in its automation and reliability, even if it wasn’t serving production traffic.</p>

<p>I chose k3s running on KVM/libvirt virtual machines. The entire deployment is automated: Terraform provisions three VMs, cloud-init configures networking and installs k3s, and within about five minutes a working cluster is operational. No manual steps, no configuration drift, completely reproducible.</p>

<p>This post walks through the architecture, the automation decisions, and the technical solutions that make it work. If you’re building Kubernetes infrastructure on virtualization platforms, some of these patterns might be useful.</p>

<h2 id="the-architecture">The Architecture</h2>

<p>The cluster runs three virtual machines managed by KVM/libvirt:</p>

<ul>
  <li>One control plane node (k3s-cp-01)</li>
  <li>Two worker nodes (k3s-worker-01, k3s-worker-02)</li>
</ul>

<p>Each VM has 6 vCPUs, 10GB RAM, and 80GB disk. They run Ubuntu 24.04 LTS with the containerd runtime. The k3s version is pinned to v1.34.3+k3s1 - I’ll explain why that matters later.</p>

<p>The network uses libvirt’s default network (192.168.122.0/24) with static IP assignments:</p>

<ul>
  <li>Control plane: 192.168.122.10</li>
  <li>Worker 1: 192.168.122.11</li>
  <li>Worker 2: 192.168.122.12</li>
</ul>

<p>Static IPs are configured via cloud-init’s network_config, not DHCP reservations. This eliminates any dependency on DHCP lease stability and ensures nodes always come up with the correct addresses.</p>

<p>The deployment is managed entirely through infrastructure as code. Packer builds the base Ubuntu image with k3s prerequisites. Terraform provisions the VMs using the libvirt provider. Cloud-init handles node-specific configuration and k3s installation. The entire process is declarative and reproducible.</p>

<h2 id="automation-decisions">Automation Decisions</h2>

<h3 id="why-static-ips-instead-of-dhcp">Why Static IPs Instead of DHCP</h3>

<p>The initial implementation used DHCP with MAC address reservations in libvirt’s network configuration. This worked, but introduced instability. Occasionally nodes would come up with different IPs or fail to get leases at all. Debugging network issues in a distributed system is painful when you’re not sure if the problem is application-level or infrastructure-level.</p>

<p>Static IPs configured via cloud-init eliminate this entire class of problems. Each VM’s network configuration is defined in its cloud-init network_config. The IP addresses are assigned before any services start. There’s no lease negotiation, no timing dependencies, no opportunity for the network layer to behave differently between deployments.</p>

<p>The trade-off is that you need to manage IP address allocation manually. For a three-node cluster, that’s trivial. For larger deployments, you’d want tooling to generate network_config from an IP allocation database. But the reliability improvement is worth it.</p>
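<p>As a rough sketch of what that tooling could look like, the POSIX shell function below renders a network_config document for one node from a small allocation table. The function name <code class="language-plaintext highlighter-rouge">render_network_config</code>, the inline table, and the assumption that every node uses the ens3 interface and the DNS servers from this post are illustrative choices, not part of the actual repository.</p>

```shell
#!/bin/sh
# Hypothetical allocation table: one "hostname address" pair per line.
# The addresses match the static assignments listed in this post.
ALLOCATIONS="k3s-cp-01 192.168.122.10
k3s-worker-01 192.168.122.11
k3s-worker-02 192.168.122.12"

# Print a cloud-init network_config (version 2) for the given hostname.
# Returns 1 if the hostname has no entry in the allocation table.
render_network_config() {
  node="$1"
  ip=$(printf '%s\n' "$ALLOCATIONS" | awk -v n="$node" '$1 == n { print $2 }')
  if [ -z "$ip" ]; then
    echo "no address allocated for $node" >&2
    return 1
  fi
  printf 'version: 2\n'
  printf 'ethernets:\n'
  printf '  ens3:\n'
  printf '    addresses:\n'
  printf '      - %s/24\n' "$ip"
  printf '    gateway4: 192.168.122.1\n'
  printf '    nameservers:\n'
  printf '      addresses:\n'
  printf '        - 8.8.8.8\n'
  printf '        - 1.1.1.1\n'
}

# Example: render the document for the first worker.
render_network_config k3s-worker-01
```

<p>Keeping the table in one place means adding a node is a one-line change, and an unknown hostname fails loudly instead of producing a config with an empty address.</p>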

<h3 id="why-pinned-k3s-versions">Why Pinned k3s Versions</h3>

<p>K3s supports installing from a “stable” channel, which automatically pulls the latest stable release. This seems convenient - you always get the newest version without manual updates.</p>

<p>In practice, this creates deployment instability. The k3s version can change between cluster deployments, introducing variables when troubleshooting. Different nodes might end up with different versions if they pull from the channel at different times. And k3s releases sometimes introduce breaking changes that affect existing workloads.</p>

<p>Pinning to a specific version (v1.34.3+k3s1 in this case) makes deployments reproducible. Every node gets exactly the same k3s binary. If I destroy and recreate the cluster six months from now, it will be identical to today’s deployment. When I’m ready to upgrade, I test the new version, update the pin, and deploy deliberately.</p>

<p>This is a standard practice in production environments. Version pinning trades convenience for predictability. For a platform that exists to demonstrate capabilities, predictability matters more than running the absolute latest release.</p>
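<p>A pin is only useful if it is actually what the nodes run, so a quick post-deploy check can help. The sketch below reads “node version” pairs and flags anything that differs from the pin. The function name <code class="language-plaintext highlighter-rouge">check_pinned_version</code> is hypothetical, and the sample input stands in for real cluster output.</p>

```shell
#!/bin/sh
PINNED_VERSION="v1.34.3+k3s1"

# Read "node-name kubelet-version" lines on stdin and report any node whose
# version differs from the pin. Returns 0 only when every node matches.
check_pinned_version() {
  awk -v want="$1" '
    $2 != want { printf "%s runs %s, expected %s\n", $1, $2, want; bad = 1 }
    END { exit bad }
  '
}

# Sample input standing in for a live cluster. On the control plane the same
# lines could come from:
#   sudo k3s kubectl get nodes --no-headers \
#     -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
sample="k3s-cp-01 v1.34.3+k3s1
k3s-worker-01 v1.34.3+k3s1
k3s-worker-02 v1.34.3+k3s1"

if printf '%s\n' "$sample" | check_pinned_version "$PINNED_VERSION"; then
  echo "all nodes match $PINNED_VERSION"
fi
```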

<h3 id="why-cloud-init-network-validation">Why Cloud-Init Network Validation</h3>

<p>The k3s installation script runs automatically via cloud-init. If networking isn’t fully operational when k3s starts, the installation can fail in subtle ways - certificates might be generated with wrong IPs, the API server might bind to the wrong interface, nodes might not be able to join the cluster.</p>

<p>The cloud-init configuration includes network validation before k3s installation:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">runcmd</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="pi">|</span>
    <span class="s"># Wait for network interface to be operational</span>
    <span class="s">for i in $(seq 1 30); do</span>
      <span class="s">IFACE=$(ip -o -4 route show to default | awk '{print $5}' | head -n1)</span>
      <span class="s">if [ -n "$IFACE" ]; then</span>
        <span class="s">echo "Network interface $IFACE is up"</span>
        <span class="s">break</span>
      <span class="s">fi</span>
      <span class="s">echo "Waiting for network interface... ($i/30)"</span>
      <span class="s">sleep 2</span>
    <span class="s">done</span>
    
  <span class="pi">-</span> <span class="pi">|</span>
    <span class="s"># Verify DNS resolution</span>
    <span class="s">for i in $(seq 1 30); do</span>
      <span class="s">if nslookup google.com &gt; /dev/null 2&gt;&amp;1; then</span>
        <span class="s">echo "DNS resolution working"</span>
        <span class="s">break</span>
      <span class="s">fi</span>
      <span class="s">echo "Waiting for DNS... ($i/30)"</span>
      <span class="s">sleep 2</span>
    <span class="s">done</span>
    
  <span class="pi">-</span> <span class="pi">|</span>
    <span class="s"># Install k3s after network is confirmed operational</span>
    <span class="s">curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.34.3+k3s1 sh -s - server</span>
</code></pre></div></div>

<p>This adds maybe 10-15 seconds to deployment time, but eliminates an entire class of timing-related failures. Each check retries for up to a minute, so on a healthy host the k3s installation doesn’t start until the network is confirmed working. Simple, reliable, worth the wait.</p>
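<p>One caveat with the loops above: when every attempt fails, they fall through to the next step anyway. A minimal sketch of a stricter variant, assuming you would rather stop the boot script than install k3s on a broken network, is a small retry helper. The names <code class="language-plaintext highlighter-rouge">wait_for</code> and <code class="language-plaintext highlighter-rouge">has_default_route</code> are hypothetical, and the code sticks to POSIX sh because cloud-init runs string runcmd entries through sh.</p>

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempt budget is spent.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success and 1 if every attempt fails.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    echo "attempt $n/$attempts failed: $*" >&2
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Hypothetical check: succeed once a default IPv4 route exists.
has_default_route() {
  [ -n "$(ip -o -4 route show to default 2>/dev/null)" ]
}

# In a runcmd entry this could gate the installer, for example:
#   wait_for 30 2 has_default_route || exit 1
#   wait_for 30 2 nslookup google.com || exit 1
if wait_for 3 0 true; then
  echo "check passed"
fi
```

<p>The non-zero return is the point of the design: the caller decides whether a failed check aborts the run, instead of the loop silently running out of attempts.</p>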

<h3 id="why-clean-base-images">Why Clean Base Images</h3>

<p>The Packer template builds a base Ubuntu 24.04 image with k3s prerequisites installed. Early versions of this image had problems: hardcoded MAC addresses from the build environment, residual netplan configurations, cloud-init state that didn’t reset properly between VM deployments.</p>

<p>These issues caused VMs to come up with duplicate MAC addresses or incorrect network configurations. The solution was a cleanup script that runs at the end of the Packer build:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Remove machine-specific identifiers</span>
<span class="nb">rm</span> <span class="nt">-f</span> /etc/machine-id
<span class="nb">rm</span> <span class="nt">-f</span> /var/lib/dbus/machine-id

<span class="c"># Clean cloud-init state</span>
cloud-init clean <span class="nt">--logs</span> <span class="nt">--seed</span>

<span class="c"># Remove netplan configs (cloud-init will regenerate)</span>
<span class="nb">rm</span> <span class="nt">-f</span> /etc/netplan/<span class="k">*</span>.yaml

<span class="c"># Clean logs</span>
find /var/log <span class="nt">-type</span> f <span class="nt">-delete</span>
</code></pre></div></div>

<p>This ensures the base image is truly generic. Each VM that boots from this image gets fresh identifiers, clean cloud-init state, and network configuration from its own cloud-init data source.</p>
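<p>To catch regressions in the cleanup step, a check along the following lines could run against the image’s root filesystem before the image is published. This is a sketch under assumptions: the function name <code class="language-plaintext highlighter-rouge">check_image_clean</code> is hypothetical, it only covers the machine-id and netplan items from the script above, and how the image gets mounted is left out.</p>

```shell
#!/bin/sh
# Report leftover machine-specific state under an image root directory.
# Returns 0 when the tree looks generic and 1 otherwise.
check_image_clean() {
  root="$1"
  status=0
  # A non-empty machine-id means every clone would share one identity.
  if [ -s "$root/etc/machine-id" ]; then
    echo "machine-id is still populated"
    status=1
  fi
  # Any netplan YAML left behind would compete with cloud-init's config.
  for f in "$root"/etc/netplan/*.yaml; do
    if [ -e "$f" ]; then
      echo "leftover netplan config: $f"
      status=1
    fi
  done
  return "$status"
}

# Demonstration against a throwaway directory standing in for a mounted image.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/etc/netplan"
if check_image_clean "$demo_root"; then
  echo "image root is clean"
fi
rm -rf "$demo_root"
```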

<h2 id="technical-implementation">Technical Implementation</h2>

<h3 id="terraform-structure">Terraform Structure</h3>

<p>The Terraform configuration provisions VMs using the dmacvicar/libvirt provider. For each node, it creates:</p>

<ol>
  <li>A cloud-init ISO that contains user_data and network_config</li>
  <li>A disk volume cloned from the base image</li>
  <li>A domain (VM) with the disk and cloud-init ISO attached</li>
</ol>

<p>The libvirt provider connects to a remote libvirtd instance over SSH. This means Terraform runs on my laptop but manages VMs on a separate Linux host. The provider handles the SSH connection transparently.</p>

<p>Key Terraform resources:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">resource</span> <span class="s2">"libvirt_cloudinit_disk"</span> <span class="s2">"k3s_cp"</span> <span class="p">{</span>
  <span class="nx">name</span>           <span class="p">=</span> <span class="s2">"k3s-cp-01-cloudinit.iso"</span>
  <span class="nx">user_data</span>      <span class="p">=</span> <span class="nx">templatefile</span><span class="err">(</span><span class="s2">"${path.module}/cloud-init/k3s-cp.yml.tpl"</span><span class="err">,</span> <span class="p">{</span><span class="err">...</span><span class="p">}</span><span class="err">)</span>
  <span class="nx">network_config</span> <span class="p">=</span> <span class="nx">templatefile</span><span class="err">(</span><span class="s2">"${path.module}/cloud-init/network-config.yml.tpl"</span><span class="err">,</span> <span class="p">{</span><span class="err">...</span><span class="p">}</span><span class="err">)</span>
<span class="p">}</span>

<span class="nx">resource</span> <span class="s2">"libvirt_volume"</span> <span class="s2">"k3s_cp"</span> <span class="p">{</span>
  <span class="nx">name</span>           <span class="p">=</span> <span class="s2">"k3s-cp-01.qcow2"</span>
  <span class="nx">base_volume_id</span> <span class="p">=</span> <span class="nx">libvirt_volume</span><span class="err">.</span><span class="nx">base</span><span class="err">.</span><span class="nx">id</span>
  <span class="nx">size</span>           <span class="p">=</span> <span class="mi">85899345920</span>  <span class="c1"># 80GB</span>
<span class="p">}</span>

<span class="nx">resource</span> <span class="s2">"libvirt_domain"</span> <span class="s2">"k3s_cp"</span> <span class="p">{</span>
  <span class="nx">name</span>   <span class="p">=</span> <span class="s2">"k3s-cp-01"</span>
  <span class="nx">memory</span> <span class="p">=</span> <span class="mi">10240</span>
  <span class="nx">vcpu</span>   <span class="p">=</span> <span class="mi">6</span>
  
  <span class="nx">cloudinit</span> <span class="p">=</span> <span class="nx">libvirt_cloudinit_disk</span><span class="err">.</span><span class="nx">k3s_cp</span><span class="err">.</span><span class="nx">id</span>
  
  <span class="nx">disk</span> <span class="p">{</span>
    <span class="nx">volume_id</span> <span class="p">=</span> <span class="nx">libvirt_volume</span><span class="err">.</span><span class="nx">k3s_cp</span><span class="err">.</span><span class="nx">id</span>
  <span class="p">}</span>
  
  <span class="nx">network_interface</span> <span class="p">{</span>
    <span class="nx">network_name</span>   <span class="p">=</span> <span class="s2">"default"</span>
    <span class="nx">addresses</span>      <span class="p">=</span> <span class="p">[</span><span class="s2">"192.168.122.10"</span><span class="p">]</span>
    <span class="nx">wait_for_lease</span> <span class="p">=</span> <span class="kc">false</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">wait_for_lease = false</code> parameter is important. It tells Terraform not to wait for a DHCP lease, since we’re using static IPs. Without this, Terraform would hang waiting for a lease that will never come.</p>

<h3 id="cloud-init-configuration">Cloud-Init Configuration</h3>

<p>The cloud-init configuration has two parts: user_data (what to do) and network_config (how to configure networking).</p>

<p>The network_config is straightforward - static IP, gateway, DNS servers:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">version</span><span class="pi">:</span> <span class="m">2</span>
<span class="na">ethernets</span><span class="pi">:</span>
  <span class="na">ens3</span><span class="pi">:</span>
    <span class="na">addresses</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">192.168.122.10/24</span>
    <span class="na">gateway4</span><span class="pi">:</span> <span class="s">192.168.122.1</span>
    <span class="na">nameservers</span><span class="pi">:</span>
      <span class="na">addresses</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="s">8.8.8.8</span>
        <span class="pi">-</span> <span class="s">1.1.1.1</span>
</code></pre></div></div>

<p>The user_data is more complex. For the control plane:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#cloud-config</span>
<span class="na">hostname</span><span class="pi">:</span> <span class="s">k3s-cp-01</span>
<span class="na">fqdn</span><span class="pi">:</span> <span class="s">k3s-cp-01.local</span>

<span class="na">users</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">ubuntu</span>
    <span class="na">ssh_authorized_keys</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">${ssh_public_key}</span>
    <span class="na">sudo</span><span class="pi">:</span> <span class="s">ALL=(ALL) NOPASSWD:ALL</span>
    <span class="na">shell</span><span class="pi">:</span> <span class="s">/bin/bash</span>

<span class="na">runcmd</span><span class="pi">:</span>
  <span class="c1"># Network validation (shown earlier)</span>
  <span class="c1"># K3s installation</span>
  <span class="pi">-</span> <span class="s">curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.34.3+k3s1 sh -s - server</span>
  
  <span class="c1"># Wait for k3s to be ready</span>
  <span class="pi">-</span> <span class="pi">|</span>
    <span class="s">for i in $(seq 1 60); do</span>
      <span class="s">if sudo k3s kubectl get nodes &gt; /dev/null 2&gt;&amp;1; then</span>
        <span class="s">echo "k3s control plane is ready"</span>
        <span class="s">break</span>
      <span class="s">fi</span>
      <span class="s">sleep 5</span>
    <span class="s">done</span>
</code></pre></div></div>

<p>For worker nodes, the configuration is similar but joins the cluster instead of initializing it:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">runcmd</span><span class="pi">:</span>
  <span class="c1"># Network validation</span>
  <span class="c1"># K3s agent installation</span>
  <span class="pi">-</span> <span class="pi">|</span>
    <span class="s">curl -sfL https://get.k3s.io | \</span>
      <span class="s">INSTALL_K3S_VERSION=v1.34.3+k3s1 \</span>
      <span class="s">K3S_URL=https://192.168.122.10:6443 \</span>
      <span class="s">K3S_TOKEN=${k3s_token} \</span>
      <span class="s">sh -s - agent</span>
</code></pre></div></div>

<p>The k3s token comes from Terraform variables and is templated into the cloud-init configuration. In a production environment, you’d want to manage this more securely (maybe HashiCorp Vault or AWS Secrets Manager), but for a demonstration cluster, templating it directly works fine.</p>
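<p>One low-effort way to keep the token out of version control is to generate it locally and hand it to Terraform through the environment. The sketch below assumes the Terraform input variable is named <code class="language-plaintext highlighter-rouge">k3s_token</code>, matching the template placeholder; the helper name <code class="language-plaintext highlighter-rouge">generate_token</code> is hypothetical.</p>

```shell
#!/bin/sh
# Generate a 64-character hex token from the kernel's random source.
generate_token() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Terraform reads an environment variable named TF_VAR_NAME as the input
# variable NAME, so the token never has to be written to a .tfvars file.
# Generate it once per cluster and reuse it: a new value on every apply
# would change the rendered user_data.
TF_VAR_k3s_token=$(generate_token)
export TF_VAR_k3s_token
echo "token length: ${#TF_VAR_k3s_token}"
```

<p>The token still ends up in Terraform state and in the rendered user_data, so this narrows the exposure rather than removing it.</p>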

<h3 id="deployment-workflow">Deployment Workflow</h3>

<p>From a clean state, deploying the cluster:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Terraform provisions VMs and attaches cloud-init</span>
terraform apply

<span class="c"># Wait ~5 minutes for cloud-init to complete</span>
<span class="c"># VMs boot, network configures, k3s installs</span>

<span class="c"># Verify cluster is operational</span>
ssh ubuntu@192.168.122.10 <span class="s1">'sudo k3s kubectl get nodes -o wide'</span>
</code></pre></div></div>

<p>The entire process takes about five minutes. Most of that is waiting for VMs to boot and cloud-init to run. The actual k3s installation is fast once the network is ready.</p>
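<p>Instead of waiting a fixed five minutes, the verification step could poll until all three nodes report Ready. A minimal sketch of the counting part follows; <code class="language-plaintext highlighter-rouge">count_ready_nodes</code> is a hypothetical helper, and the sample text is trimmed to the first two columns of real output.</p>

```shell
#!/bin/sh
# Count nodes whose STATUS column is exactly "Ready" in the output of
# "kubectl get nodes --no-headers" supplied on stdin.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Sample output standing in for a cluster that is still converging.
sample="k3s-cp-01 Ready
k3s-worker-01 Ready
k3s-worker-02 NotReady"

ready=$(printf '%s\n' "$sample" | count_ready_nodes)
echo "$ready of 3 nodes Ready"

# Against the live cluster, a polling loop could look like:
#   until [ "$(ssh ubuntu@192.168.122.10 \
#       'sudo k3s kubectl get nodes --no-headers' | count_ready_nodes)" -eq 3 ]
#   do sleep 10; done
```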

<p>Destroying and recreating the cluster:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>terraform destroy
terraform apply
<span class="c"># Another 5 minutes, identical cluster</span>
</code></pre></div></div>

<p>Complete reproducibility. No manual steps. No configuration drift.</p>

<h2 id="whats-next">What’s Next</h2>

<p>This cluster is the foundation for platform services. The next steps are:</p>

<p><strong>Install Flux GitOps controllers</strong> to manage platform and application deployments declaratively. Flux will sync from Git repositories and keep the cluster state in sync with the desired state defined in code.</p>

<p><strong>Deploy Big Bang platform services</strong> - the DoD-maintained DevSecOps baseline that provides Istio service mesh, Prometheus monitoring, GitLab CI/CD, and other core capabilities. This gives the cluster a production-grade platform layer.</p>

<p><strong>Add applications</strong> - once the platform is operational, deploy actual applications to demonstrate the full stack working together.</p>

<p><strong>AWS compatibility</strong> - This architecture is designed with Kubernetes portability in mind. The application layer uses standard Kubernetes primitives. Environment-specific infrastructure differences (storage classes, load balancers, ingress) are handled through Kustomize overlays. The same manifests and GitOps workflows that work on k3s can deploy to AWS EKS.</p>

<p>Future posts will cover the Flux and Big Bang deployment process, and eventually demonstrate AWS integrations using services like Direct Connect, Storage Gateway, and Outposts.</p>

<h2 id="lessons-learned">Lessons Learned</h2>

<p><strong>Static networking is worth the small amount of extra configuration.</strong> DHCP adds moving parts and timing dependencies. For infrastructure that needs to be reliable, static IPs eliminate an entire class of problems.</p>

<p><strong>Version pinning trades convenience for predictability.</strong> Always getting the latest version sounds good until you need to debug why a deployment behaves differently than it did last month.</p>

<p><strong>Network validation before service installation prevents subtle failures.</strong> The 10-15 seconds spent confirming DNS works and interfaces are up saves hours of debugging certificate problems and API server binding issues.</p>

<p><strong>Clean base images are not optional.</strong> Residual state in images causes mysterious problems that are hard to diagnose. Taking the time to properly clean machine IDs, cloud-init state, and network configs is essential.</p>

<p><strong>Automation enables experimentation.</strong> Being able to destroy and recreate the cluster in 5 minutes means you can test ideas without fear. If something breaks, just rebuild. This changes how you approach learning and troubleshooting.</p>

<p>The cluster runs continuously now, stable and operational. It’s ready for the platform services layer. The automation is solid enough that I trust it, which means I can focus on building the interesting parts - the platform and applications - instead of fighting infrastructure problems.</p>]]></content><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;&quot;&quot;, &quot;location&quot;=&gt;&quot;Los Angeles, CA&quot;, &quot;email&quot;=&gt;&quot;xavier@xavierlopez.me&quot;, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;LinkedIn&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;, &quot;url&quot;=&gt;&quot;https://linkedin.com/in/zavelopez&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/zavestudios&quot;}]}</name><email>xavier@xavierlopez.me</email></author><category term="devops" /><summary type="html"><![CDATA[I built a 3-node k3s cluster that deploys automatically in 5 minutes using Terraform and cloud-init. Here's how it works and what I learned.]]></summary></entry><entry><title type="html">Operational Guardrails for Multi-Tenant PostgreSQL</title><link href="https://xavierlopez.me/databases/devops/operational-guardrails-for-multi-tenant-postgres/" rel="alternate" type="text/html" title="Operational Guardrails for Multi-Tenant PostgreSQL" /><published>2026-01-06T00:00:00-08:00</published><updated>2026-01-06T00:00:00-08:00</updated><id>https://xavierlopez.me/databases/devops/operational-guardrails-for-multi-tenant-postgres</id><content type="html" xml:base="https://xavierlopez.me/databases/devops/operational-guardrails-for-multi-tenant-postgres/"><![CDATA[<h2 id="context">Context</h2>

<p>Running PostgreSQL in a multi-tenant configuration is a powerful cost-optimization strategy—especially in environments where dozens or hundreds of isolated workloads coexist. But as I wrote previously in <a href="/databases/operations/platform/operational-realities-of-running-postgresql/">Operational Realities of Running PostgreSQL</a>, security isolation is only half the story.</p>

<p>PostgreSQL is extremely stable when respected, but it has sharp edges when pushed into resource contention. Multi-tenant architectures amplify those failure modes. Even with perfect security isolation (role-per-tenant, database-per-tenant, schema hardening), tenants still share a single set of physical resources:</p>

<ul>
  <li>CPU</li>
  <li>memory</li>
  <li>IOPS</li>
  <li>WAL throughput</li>
  <li>background workers</li>
  <li>connection slots</li>
</ul>

<p>If one tenant misbehaves, it can degrade the experience for all others—even without violating a single privilege boundary.</p>

<p>This post explains the operational guardrails required to ensure <strong>safe</strong>, <strong>predictable</strong>, and <strong>compliant</strong> multi-tenant PostgreSQL deployments. All guardrails described here are fully implemented and verifiable in the accompanying project:</p>

<p><strong>Project:</strong> <a href="https://github.com/zavestudios/pg">Multi-Tenant PostgreSQL Security &amp; Operational Isolation</a></p>

<hr />

<h2 id="why-operational-guardrails-matter">Why Operational Guardrails Matter</h2>

<p>Multi-tenant PostgreSQL is only viable when both of these are true:</p>

<h3 id="1-security-boundaries-must-be-provable">1. Security boundaries must be provable</h3>

<p>No tenant should ever be able to read or affect another tenant’s data.</p>

<h3 id="2-operational-behavior-must-be-controlled">2. Operational behavior must be controlled</h3>

<p>No tenant should be able to destabilize the shared database server.</p>

<p>The first requirement is handled by:</p>

<ul>
  <li>database-per-tenant</li>
  <li>role-per-tenant</li>
  <li>schema-per-tenant</li>
  <li>hardened <code class="language-plaintext highlighter-rouge">public</code> schema</li>
  <li>restricted <code class="language-plaintext highlighter-rouge">search_path</code></li>
  <li>default privilege hardening</li>
  <li>extension restrictions</li>
  <li>negative security tests</li>
</ul>

<p>The second requirement requires <strong>operational guardrails</strong>, which this post covers in detail.</p>

<p>Both sets of controls are implemented and actively tested in the project linked above.</p>

<hr />

<h2 id="operational-risks-in-multi-tenant-postgresql">Operational Risks in Multi-Tenant PostgreSQL</h2>

<h3 id="1-connection-exhaustion--the-classic-failure-mode">1. Connection Exhaustion — The Classic Failure Mode</h3>

<p>Every PostgreSQL instance has a <em>global</em> connection budget (<code class="language-plaintext highlighter-rouge">max_connections</code>). All tenants draw from this shared pool.</p>

<p>A single tenant with:</p>

<ul>
  <li>an oversized ORM pool</li>
  <li>idle-in-transaction leaks</li>
  <li>a bug sending excessive connections</li>
</ul>

<p>…can exhaust the shared connection slots and leave every other tenant unable to connect.</p>

<h3 id="guardrail-per-role-connection-limits">Guardrail: Per-role connection limits</h3>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ALTER</span> <span class="k">ROLE</span> <span class="n">tenant_a_app</span> <span class="k">CONNECTION</span> <span class="k">LIMIT</span> <span class="mi">2</span><span class="p">;</span>
</code></pre></div></div>

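<p>A small per-role limit caps how many of the shared connection slots one tenant can hold, which bounds the blast radius of a leak or a misconfigured pool.</p>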

<h3 id="in-the-project">In the project</h3>

<p>The test suite spawns multiple concurrent connections and confirms that the attempt beyond the limit is rejected.</p>

<hr />

<h3 id="2-runaway-or-long-running-queries">2. Runaway or Long-Running Queries</h3>

<p>A single long query—or a stuck transaction—can tie up CPU, I/O, locks, and memory.</p>

<h3 id="guardrail-per-role-timeouts">Guardrail: Per-role timeouts</h3>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ALTER</span> <span class="k">ROLE</span> <span class="n">tenant_a_app</span> <span class="k">SET</span> <span class="n">statement_timeout</span> <span class="o">=</span> <span class="s1">'3s'</span><span class="p">;</span>
<span class="k">ALTER</span> <span class="k">ROLE</span> <span class="n">tenant_a_app</span> <span class="k">SET</span> <span class="n">lock_timeout</span> <span class="o">=</span> <span class="s1">'2s'</span><span class="p">;</span>
<span class="k">ALTER</span> <span class="k">ROLE</span> <span class="n">tenant_a_app</span> <span class="k">SET</span> <span class="n">idle_in_transaction_session_timeout</span> <span class="o">=</span> <span class="s1">'10s'</span><span class="p">;</span>
</code></pre></div></div>

<p>These serve as circuit breakers against runaway behavior.</p>
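<p>With more than a handful of tenants, these statements are easier to generate than to hand-write. The sketch below prints the connection limit and the three timeouts for a given role. The function name <code class="language-plaintext highlighter-rouge">tenant_guardrails_sql</code> and the second tenant role are hypothetical, and the timeout values simply mirror the ones above.</p>

```shell
#!/bin/sh
# Print the ALTER ROLE statements for one tenant role.
# Usage: tenant_guardrails_sql ROLE CONNECTION_LIMIT
tenant_guardrails_sql() {
  role="$1"
  limit="$2"
  # Accept only simple identifiers so the role name is safe to interpolate.
  case "$role" in
    ''|[!a-z_]*|*[!a-z0-9_]*)
      echo "invalid role name: $role" >&2
      return 1
      ;;
  esac
  printf 'ALTER ROLE %s CONNECTION LIMIT %d;\n' "$role" "$limit"
  printf "ALTER ROLE %s SET statement_timeout = '3s';\n" "$role"
  printf "ALTER ROLE %s SET lock_timeout = '2s';\n" "$role"
  printf "ALTER ROLE %s SET idle_in_transaction_session_timeout = '10s';\n" "$role"
}

# Hypothetical tenant roles; the output could be piped to psql.
for r in tenant_a_app tenant_b_app; do
  tenant_guardrails_sql "$r" 2
done
```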

<h3 id="in-the-project-1">In the project</h3>

<p><code class="language-plaintext highlighter-rouge">SELECT pg_sleep(10)</code> is used to confirm the timeout fires predictably.</p>

<hr />

<h3 id="3-lock-contention--autovacuum-starvation--bloat">3. Lock Contention → Autovacuum Starvation → Bloat</h3>

<p>Long-lived locks and long-open transactions stop autovacuum from doing its job. The result:</p>

<ul>
  <li>rising dead tuples</li>
  <li>bloated indexes</li>
  <li>WAL amplification</li>
  <li>I/O latency spikes</li>
</ul>

<p>In multi-tenant environments, <em>all</em> tenants suffer.</p>

<h3 id="guardrails">Guardrails</h3>

<ul>
  <li>enforce idle transaction timeouts</li>
  <li>surface lock metrics</li>
  <li>alert on autovacuum lag</li>
</ul>

<p>These are documented operational expectations for production RDS deployments.</p>
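<p>As one illustration of surfacing the problem, the sketch below prints a query that lists sessions sitting idle inside an open transaction, oldest first. It relies only on the standard <code class="language-plaintext highlighter-rouge">pg_stat_activity</code> view; the wrapper name <code class="language-plaintext highlighter-rouge">idle_in_transaction_query</code> is hypothetical, and wiring the result into alerting is left out.</p>

```shell
#!/bin/sh
# Print a query listing sessions that are idle inside an open transaction,
# ordered so the oldest transaction comes first.
idle_in_transaction_query() {
  printf '%s\n' \
    "SELECT pid, usename, datname, now() - xact_start AS xact_age" \
    "FROM pg_stat_activity" \
    "WHERE state = 'idle in transaction'" \
    "ORDER BY xact_age DESC;"
}

# Against a live instance this could be run as, for example:
#   idle_in_transaction_query | psql -X -At
idle_in_transaction_query
```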

<hr />

<h3 id="4-shared-wal-checkpoints-and-io">4. Shared WAL, Checkpoints, and I/O</h3>

<p>PostgreSQL’s background processes operate at the <strong>instance</strong> level:</p>

<ul>
  <li>checkpointer</li>
  <li>WAL writer</li>
  <li>autovacuum workers</li>
</ul>

<p>A high-churn tenant can degrade performance for everyone.</p>

<h3 id="guardrails-1">Guardrails</h3>

<ul>
  <li>WAL monitoring</li>
  <li>Instance sizing</li>
  <li>Enforced workload limits</li>
</ul>

<hr />

<h3 id="5-backups-and-snapshots-include-all-tenants">5. Backups and Snapshots Include All Tenants</h3>

<p>On AWS RDS, a snapshot contains <strong>all tenant databases</strong>.</p>

<h3 id="guardrails-2">Guardrails</h3>

<ul>
  <li>strict IAM permissions for snapshot creation/restoration</li>
  <li>KMS key policy constraints</li>
  <li>auditing of all snapshot actions</li>
</ul>

<p>This is essential for IL4 workloads.</p>

<hr />

<h2 id="guardrails-implemented-in-the-project">Guardrails Implemented in the Project</h2>

<h3 id="security-controls">Security Controls</h3>

<ul>
  <li>role-per-tenant</li>
  <li>database-per-tenant</li>
  <li>schema-per-tenant</li>
  <li>hardened <code class="language-plaintext highlighter-rouge">public</code> schema</li>
  <li>restricted <code class="language-plaintext highlighter-rouge">search_path</code></li>
  <li>enforced default privileges</li>
  <li>blocked extension creation</li>
  <li>negative cross-tenant isolation tests</li>
</ul>

<h3 id="operational-controls">Operational Controls</h3>

<ul>
  <li>per-role connection limits</li>
  <li>per-role statement timeouts</li>
  <li>per-role lock timeouts</li>
  <li>per-role idle-in-transaction timeouts</li>
</ul>

<h3 id="automated-tests">Automated Tests</h3>

<ul>
  <li>connection-limit exceedance validated via concurrency</li>
  <li>long-query timeout enforcement</li>
  <li>concurrency behaviors tested safely and repeatably</li>
</ul>

<h3 id="compliance-documentation">Compliance Documentation</h3>

<ul>
  <li>NIST 800-53 mapping</li>
  <li>FedRAMP Moderate alignment</li>
  <li>DoD IL2/IL4 considerations</li>
  <li>pgAudit integration strategy</li>
</ul>

<hr />

<h2 id="when-not-to-use-multi-tenant-postgresql">When Not To Use Multi-Tenant PostgreSQL</h2>

<p>Avoid multi-tenant PostgreSQL when:</p>

<ul>
  <li>tenants require <em>strict</em> performance isolation</li>
  <li>tenants need independent backup/restore capabilities</li>
  <li>tenants have materially different compliance requirements</li>
  <li>tenant load is unpredictable or unbounded</li>
  <li>applications cannot follow connection pool discipline</li>
</ul>

<p>These constraints are architectural realities, not limitations of PostgreSQL itself.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Multi-tenant PostgreSQL can be secure, cost-effective, and IL4-aligned — but only when operational guardrails are enforced. These include:</p>

<ul>
  <li>per-tenant connection limits</li>
  <li>per-tenant timeouts</li>
  <li>lock and idle-in-transaction protection</li>
  <li>shared resource awareness (WAL, checkpoints, autovacuum)</li>
  <li>auditable configuration</li>
</ul>

<p>The accompanying project provides a complete, <a href="https://github.com/zavestudios/pg">reproducible reference architecture</a>.</p>

<p>Upcoming work: Terraform integration using the PostgreSQL provider, RDS automation, and CI/CD validation pipelines.</p>]]></content><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;&quot;&quot;, &quot;location&quot;=&gt;&quot;Los Angeles, CA&quot;, &quot;email&quot;=&gt;&quot;xavier@xavierlopez.me&quot;, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;LinkedIn&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;, &quot;url&quot;=&gt;&quot;https://linkedin.com/in/zavelopez&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/zavestudios&quot;}]}</name><email>xavier@xavierlopez.me</email></author><category term="databases" /><category term="devops" /><summary type="html"><![CDATA[Operational guardrails needed to safely run PostgreSQL in a multi-tenant configuration, including connection limits, timeouts, lock protection, shared resource considerations, and how these are enforced and tested in the multi-tenant Postgres project.]]></summary></entry><entry><title type="html">Recovering from Toolchain Drift on macOS</title><link href="https://xavierlopez.me/development/devops/recovering-from-toolchain-drift-on-macos/" rel="alternate" type="text/html" title="Recovering from Toolchain Drift on macOS" /><published>2024-03-11T01:00:00-07:00</published><updated>2025-01-13T00:00:00-08:00</updated><id>https://xavierlopez.me/development/devops/recovering-from-toolchain-drift-on-macos</id><content type="html" xml:base="https://xavierlopez.me/development/devops/recovering-from-toolchain-drift-on-macos/"><![CDATA[<h2 id="context">Context</h2>

<p>Modern development environments move fast.</p>

<p>Package managers update.
Libraries deprecate APIs.
Defaults change.
Previously working builds suddenly fail.</p>

<p>This post documents a real case of <strong>toolchain drift</strong> on macOS involving Homebrew and OpenSSL, and—more importantly—how to reason through recovery when the ecosystem moves out from under you.</p>

<p>This is not a recommendation to stay on old versions indefinitely.<br />
It’s about <strong>getting unstuck responsibly</strong>.</p>

<hr />

<h2 id="what-toolchain-drift-looks-like">What Toolchain Drift Looks Like</h2>

<p>Toolchain drift usually presents as:</p>

<ul>
  <li>build failures after an unrelated update</li>
  <li>cryptic linker or compilation errors</li>
  <li>software that worked yesterday but not today</li>
  <li>incompatibilities between system libraries and expected versions</li>
</ul>

<p>In this case, the symptoms appeared after routine updates to Homebrew and OpenSSL.</p>

<p>Nothing in the application code changed.</p>

<p>The environment did.</p>

<hr />

<h2 id="why-this-happens">Why This Happens</h2>

<p>On macOS, Homebrew:</p>

<ul>
  <li>aggressively tracks upstream releases</li>
  <li>removes or unlinks deprecated versions</li>
  <li>updates formulae with breaking changes</li>
</ul>

<p>OpenSSL:</p>

<ul>
  <li>has major version boundaries</li>
  <li>breaks ABI compatibility across those boundaries (for example, between 1.1 and 3.x)</li>
  <li>is depended on implicitly by many tools</li>
</ul>

<p>When those two collide, downstream tooling often breaks first.</p>

<p>This is not negligence. It’s the cost of a fast-moving ecosystem.</p>

<hr />

<h2 id="the-immediate-constraint">The Immediate Constraint</h2>

<p>At the moment of failure:</p>

<ul>
  <li>the project needed to build and run</li>
  <li>rewriting dependencies was not an option</li>
  <li>upgrading the application code was non-trivial</li>
  <li>time mattered</li>
</ul>

<p>The goal was <strong>restoration of functionality</strong>, not architectural perfection.</p>

<hr />

<h2 id="the-pragmatic-recovery-strategy">The Pragmatic Recovery Strategy</h2>

<p>The chosen approach was to:</p>

<ul>
  <li>temporarily switch to older, compatible versions</li>
  <li>restore a known-good toolchain</li>
  <li>unblock work</li>
  <li>document the decision</li>
</ul>

<p>This is a <strong>containment strategy</strong>, not a permanent fix.</p>

<hr />

<h2 id="reverting-homebrew-and-openssl-versions">Reverting Homebrew and OpenSSL Versions</h2>

<p>The recovery involved:</p>

<ul>
  <li>installing an older OpenSSL version</li>
  <li>ensuring it was correctly linked</li>
  <li>preventing accidental upgrades during the recovery window</li>
</ul>

<p>Commands like the following were used during diagnosis and recovery:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew info openssl
brew <span class="nb">install </span>openssl@1.1
brew <span class="nb">unlink </span>openssl
brew <span class="nb">link </span>openssl@1.1 <span class="nt">--force</span>
</code></pre></div></div>

<p>The exact commands matter less than the intent:</p>

<p>Restore the environment the software was built against.</p>

<p>Once the expected library versions were present again, the failures disappeared. Nothing about the application itself had changed. The mismatch between the application’s expectations and the system-provided libraries was the entire problem.</p>
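
<p>Homebrew treats OpenSSL as keg-only and may refuse to link it into the default prefix, so an alternative to relinking is to point the build at the older keg through environment variables and pin the formula for the recovery window. A minimal sketch, assuming the <code class="language-plaintext highlighter-rouge">openssl@1.1</code> formula is still installable on your system; which of these variables a given build honors depends on its build system:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Resolve the keg location instead of hard-coding a Homebrew prefix.
OPENSSL_PREFIX="$(brew --prefix openssl@1.1)"

export PATH="$OPENSSL_PREFIX/bin:$PATH"
export LDFLAGS="-L$OPENSSL_PREFIX/lib"
export CPPFLAGS="-I$OPENSSL_PREFIX/include"
export PKG_CONFIG_PATH="$OPENSSL_PREFIX/lib/pkgconfig"

# Hold the formula at its installed version until the recovery window closes.
brew pin openssl@1.1
</code></pre></div></div>

<p>When the upgrade work is done, <code class="language-plaintext highlighter-rouge">brew unpin openssl@1.1</code> releases the hold.</p>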

<h2 id="why-this-worked">Why This Worked</h2>

<p>Most native tooling is sensitive to ABI and linking changes. Tools often assume specific library versions, paths, or symbols will exist. When those assumptions are violated, failures surface in ways that look unrelated to the actual cause.</p>

<p>By reverting to a known-good toolchain, the assumptions held again. The system returned to a stable state without modifying application code.</p>

<p>This is why toolchain drift so often manifests as “random” build failures. The failure is deterministic, but the dependency chain is opaque.</p>
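
<p>One way to make that dependency chain less opaque on macOS is to ask a binary which shared libraries it was linked against. The path below is a placeholder, not something from the original incident:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical path; substitute the binary or native extension that fails to load.
otool -L /path/to/binary | grep -i -E 'ssl|crypto'
</code></pre></div></div>

<p>If the output names a library version or path that no longer exists on disk, the failure is a linking problem rather than an application bug.</p>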

<h2 id="risks-and-tradeoffs">Risks and Tradeoffs</h2>

<p>Downgrading or pinning dependencies is not without cost.</p>

<p>It can:</p>

<ul>
  <li>delay security updates</li>
  <li>make future upgrades more difficult</li>
  <li>introduce divergence between machines</li>
  <li>hide underlying upgrade work that still needs to happen</li>
</ul>

<p>This approach should always be treated as temporary. It is a recovery technique, not a long-term strategy.</p>

<p>The important part is not the downgrade itself, but the discipline around it: documenting the change, understanding why it was necessary, and planning how to move forward.</p>

<h2 id="what-id-do-differently-next-time">What I’d Do Differently Next Time</h2>

<p>With more time and less pressure, better long-term solutions include:</p>

<ul>
  <li>containerizing the build environment</li>
  <li>explicitly versioning toolchains</li>
  <li>documenting expected dependency versions</li>
  <li>avoiding reliance on system-wide libraries</li>
  <li>making upgrades deliberate rather than incidental</li>
</ul>

<p>The goal is not to freeze the environment forever, but to control when and how it changes.</p>
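
<p>Documenting expected versions does not need heavy tooling. As a small sketch, a snapshot of installed formula versions can be kept next to the project and compared later; the file name here is arbitrary:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Snapshot installed formula versions while the toolchain is known-good.
brew list --versions &gt; toolchain-versions.txt

# Later, on the same or another machine, show what has drifted.
brew list --versions | diff toolchain-versions.txt -
</code></pre></div></div>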

<h2 id="practical-guidance">Practical Guidance</h2>

<p>When toolchain drift causes failures:</p>

<ul>
  <li>identify what actually changed</li>
  <li>avoid trial-and-error fixes</li>
  <li>restore a known-good state first</li>
  <li>document the deviation</li>
  <li>plan a proper upgrade path</li>
</ul>

<p>Stability first. Improvements second.</p>

<h2 id="closing-thought">Closing Thought</h2>

<p>Fast-moving ecosystems are powerful, but unforgiving.</p>

<p>Toolchain drift is not a personal failure or a lack of skill. It is a reminder that reproducibility is something you must design for.</p>

<p>Sometimes the correct move is not forward.</p>

<p>It is back — briefly, intentionally, and with full awareness of why.</p>]]></content><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;&quot;&quot;, &quot;location&quot;=&gt;&quot;Los Angeles, CA&quot;, &quot;email&quot;=&gt;&quot;xavier@xavierlopez.me&quot;, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;LinkedIn&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;, &quot;url&quot;=&gt;&quot;https://linkedin.com/in/zavelopez&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/zavestudios&quot;}]}</name><email>xavier@xavierlopez.me</email></author><category term="development" /><category term="devops" /><summary type="html"><![CDATA[A real-world example of toolchain drift on macOS, why it happens, and how pinning or downgrading dependencies can be a pragmatic recovery strategy.]]></summary></entry><entry><title type="html">Operational Realities of Running PostgreSQL</title><link href="https://xavierlopez.me/databases/devops/operational-realities-of-running-postgresql/" rel="alternate" type="text/html" title="Operational Realities of Running PostgreSQL" /><published>2024-03-08T00:00:00-08:00</published><updated>2025-01-12T00:00:00-08:00</updated><id>https://xavierlopez.me/databases/devops/operational-realities-of-running-postgresql</id><content type="html" xml:base="https://xavierlopez.me/databases/devops/operational-realities-of-running-postgresql/"><![CDATA[<h2 id="context">Context</h2>

<p>PostgreSQL is often treated like a dependency:</p>

<ul>
  <li>install it</li>
  <li>point an app at it</li>
  <li>scale when it gets slow</li>
</ul>

<p>In reality, PostgreSQL is a <strong>stateful system</strong> with strong opinions about memory, disk, and durability. When those expectations aren’t met, performance problems and outages tend to look mysterious.</p>

<p>This post captures practical realities of running PostgreSQL in production—especially in containerized and Kubernetes environments—without turning into a tuning checklist.</p>

<hr />

<h2 id="postgresql-is-a-system-not-a-library">PostgreSQL Is a System, Not a Library</h2>

<p>PostgreSQL:</p>

<ul>
  <li>runs multiple cooperating processes</li>
  <li>manages its own memory aggressively</li>
  <li>assumes durable storage</li>
  <li>trades performance for correctness by default</li>
</ul>

<p>You don’t “embed” Postgres. You <strong>host</strong> it.</p>

<p>Treating it like a stateless service almost always leads to surprises.</p>

<hr />

<h2 id="memory-connections-matter-more-than-queries">Memory: Connections Matter More Than Queries</h2>

<p>One of the most common misconceptions is that PostgreSQL memory usage scales primarily with data size or query complexity.</p>

<p>In practice, it scales with <strong>connections</strong>.</p>

<p>Each connection:</p>

<ul>
  <li>consumes memory</li>
  <li>is served by its own backend process</li>
  <li>increases scheduling and locking overhead</li>
</ul>

<p>A large number of idle connections can be just as harmful as active ones.</p>

<p>This is why:</p>

<ul>
  <li>connection pooling matters</li>
  <li>unbounded client connections are dangerous</li>
  <li>“it works locally” doesn’t translate to production</li>
</ul>
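
<p>To see how connections are actually being used, <code class="language-plaintext highlighter-rouge">pg_stat_activity</code> can be grouped by state and compared against the configured ceiling. A minimal sketch, assuming PostgreSQL 10 or later (for the <code class="language-plaintext highlighter-rouge">backend_type</code> column) and that <code class="language-plaintext highlighter-rouge">psql</code> can already connect with your default settings:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Count client connections by state (active, idle, idle in transaction, ...).
psql -c "SELECT state, count(*) AS connections
         FROM pg_stat_activity
         WHERE backend_type = 'client backend'
         GROUP BY state
         ORDER BY connections DESC;"

# Compare against the server-wide limit.
psql -c "SHOW max_connections;"
</code></pre></div></div>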

<hr />

<h2 id="cpu-is-rarely-the-first-bottleneck">CPU Is Rarely the First Bottleneck</h2>

<p>When PostgreSQL is slow, adding CPU is often the first instinct.</p>

<p>In reality, PostgreSQL performance issues are more commonly caused by:</p>

<ul>
  <li>disk I/O latency</li>
  <li>WAL contention</li>
  <li>excessive connections</li>
  <li>lock contention</li>
  <li>memory pressure</li>
</ul>

<p>CPU becomes a bottleneck <em>after</em> those are addressed.</p>

<hr />

<h2 id="disk-and-wal-are-central-to-performance">Disk and WAL Are Central to Performance</h2>

<p>PostgreSQL’s durability guarantees rely heavily on the <strong>Write-Ahead Log (WAL)</strong>.</p>

<p>This means:</p>

<ul>
  <li>with default settings, every committed write must reach the WAL on disk before the commit is acknowledged</li>
  <li>latency matters more than raw throughput</li>
  <li>storage performance directly affects commit speed</li>
</ul>

<p>Slow or inconsistent disks show up as:</p>

<ul>
  <li>slow transactions</li>
  <li>replication lag</li>
  <li>unexplained query latency</li>
</ul>

<p>This is especially important in virtualized or networked storage environments.</p>
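
<p>PostgreSQL ships a small utility, <code class="language-plaintext highlighter-rouge">pg_test_fsync</code>, that gives a quick read on how fast a volume can complete synchronous writes. The path below is an assumption; the test file should sit on the same volume as the data directory:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical mount point; run for 2 seconds per sync method.
pg_test_fsync -f /var/lib/postgresql/data/pg_test_fsync.tmp -s 2
</code></pre></div></div>

<p>Treat the numbers as a relative signal when comparing storage options, not as a benchmark of the database itself.</p>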

<hr />

<h2 id="containers-dont-change-the-fundamentals">Containers Don’t Change the Fundamentals</h2>

<p>Running PostgreSQL in a container does not change how PostgreSQL works.</p>

<p>It still:</p>

<ul>
  <li>writes to disk</li>
  <li>uses shared memory</li>
  <li>expects predictable I/O</li>
  <li>assumes stable filesystem semantics</li>
</ul>

<p>Common container mistakes include:</p>

<ul>
  <li>ephemeral storage for data directories</li>
  <li>ignoring filesystem sync behavior</li>
  <li>assuming resource limits replace tuning</li>
  <li>placing Postgres on storage designed for stateless workloads</li>
</ul>

<p>Containers change packaging, not physics.</p>

<hr />

<h2 id="kubernetes-adds-indirection-not-immunity">Kubernetes Adds Indirection, Not Immunity</h2>

<p>Kubernetes can help manage PostgreSQL, but it does not remove operational requirements.</p>

<p>In Kubernetes:</p>

<ul>
  <li>PersistentVolumes define durability</li>
  <li>StorageClasses define behavior</li>
  <li>the underlying storage still matters</li>
  <li>noisy neighbors still exist</li>
</ul>

<p>If the storage layer is slow or misconfigured, PostgreSQL will faithfully surface those problems.</p>
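
<p>A quick look at what is actually backing a database volume often answers more than the manifests do. The namespace name here is hypothetical:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Which provisioners, reclaim policies, and binding modes are available?
kubectl get storageclass

# Which claim and class is the database using? (hypothetical namespace)
kubectl get pvc -n databases -o wide

# Which volumes back those claims?
kubectl get pv
</code></pre></div></div>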

<hr />

<h2 id="defaults-are-conservative-for-a-reason">Defaults Are Conservative (for a Reason)</h2>

<p>PostgreSQL defaults prioritize:</p>

<ul>
  <li>correctness</li>
  <li>durability</li>
  <li>broad compatibility</li>
</ul>

<p>They are intentionally conservative.</p>

<p>This is good for safety, but it means:</p>

<ul>
  <li>defaults are rarely optimal for high-throughput systems</li>
  <li>tuning should be intentional and informed</li>
  <li>copying random config snippets is risky</li>
</ul>

<p>Understanding <em>why</em> a setting exists matters more than memorizing values.</p>
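
<p>Before changing anything, it helps to see what a server is currently running with and where each value came from. A small sketch using <code class="language-plaintext highlighter-rouge">pg_settings</code>; the list of parameter names is only an example selection:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The "source" column shows whether a value is a built-in default
# or was set in a configuration file, per role, per database, etc.
psql -c "SELECT name, setting, unit, source
         FROM pg_settings
         WHERE name IN ('shared_buffers', 'work_mem', 'max_connections',
                        'synchronous_commit', 'wal_sync_method')
         ORDER BY name;"
</code></pre></div></div>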

<hr />

<h2 id="monitoring-tells-the-truth">Monitoring Tells the Truth</h2>

<p>PostgreSQL is verbose when asked correctly.</p>

<p>Key signals include:</p>

<ul>
  <li>connection counts</li>
  <li>transaction duration</li>
  <li>lock waits</li>
  <li>WAL write latency</li>
  <li>disk I/O wait times</li>
</ul>

<p>When Postgres is unhealthy, it usually tells you—just not always in the place people look first.</p>
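
<p>Several of these signals are visible in <code class="language-plaintext highlighter-rouge">pg_stat_activity</code> without any extra tooling. As a sketch, the query below lists the oldest open transactions along with what each session is waiting on:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Oldest open transactions first; wait_event_type = 'Lock' indicates a lock wait.
psql -c "SELECT pid, usename, state, wait_event_type, wait_event,
                now() - xact_start AS xact_age
         FROM pg_stat_activity
         WHERE xact_start IS NOT NULL
         ORDER BY xact_age DESC
         LIMIT 10;"
</code></pre></div></div>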

<hr />

<h2 id="common-anti-patterns">Common Anti-Patterns</h2>

<p>A few patterns show up repeatedly in troubled deployments:</p>

<ul>
  <li>treating Postgres like stateless infrastructure</li>
  <li>scaling application replicas without considering DB impact</li>
  <li>ignoring connection pooling</li>
  <li>placing data on slow or inconsistent storage</li>
  <li>assuming Kubernetes abstracts database concerns away</li>
</ul>

<p>None of these fail immediately. They fail under load.</p>

<hr />

<h2 id="practical-guidance">Practical Guidance</h2>

<ul>
  <li>plan connections before planning CPU</li>
  <li>treat storage latency as a first-class concern</li>
  <li>assume containers do not change database fundamentals</li>
  <li>understand WAL behavior before tuning performance</li>
  <li>observe before optimizing</li>
</ul>

<p>PostgreSQL rewards understanding. It punishes assumptions.</p>

<hr />

<h2 id="closing-thought">Closing Thought</h2>

<p>Most PostgreSQL outages aren’t caused by bugs.</p>

<p>They’re caused by mismatches between:</p>

<ul>
  <li>what PostgreSQL expects</li>
  <li>and what the platform provides</li>
</ul>

<p>Once you treat Postgres as a system with real physical constraints, its behavior becomes predictable—and manageable.</p>]]></content><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;&quot;&quot;, &quot;location&quot;=&gt;&quot;Los Angeles, CA&quot;, &quot;email&quot;=&gt;&quot;xavier@xavierlopez.me&quot;, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;LinkedIn&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;, &quot;url&quot;=&gt;&quot;https://linkedin.com/in/zavelopez&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/zavestudios&quot;}]}</name><email>xavier@xavierlopez.me</email></author><category term="databases" /><category term="devops" /><summary type="html"><![CDATA[Practical lessons about running PostgreSQL as a system: memory, storage, I/O, and why defaults and containers don’t remove operational responsibility.]]></summary></entry><entry><title type="html">Kubernetes ServiceAccount Tokens and CI/CD Authentication</title><link href="https://xavierlopez.me/devops/security/kubernetes-serviceaccount-tokens-and-ci-authentication/" rel="alternate" type="text/html" title="Kubernetes ServiceAccount Tokens and CI/CD Authentication" /><published>2024-03-05T00:00:00-08:00</published><updated>2025-01-11T00:00:00-08:00</updated><id>https://xavierlopez.me/devops/security/kubernetes-serviceaccount-tokens-and-ci-authentication</id><content type="html" xml:base="https://xavierlopez.me/devops/security/kubernetes-serviceaccount-tokens-and-ci-authentication/"><![CDATA[<h2 id="context">Context</h2>

<p>CI/CD systems frequently need non-interactive access to Kubernetes clusters.</p>

<p>Historically, this was straightforward:</p>

<ul>
  <li>create a ServiceAccount</li>
  <li>bind it with RBAC</li>
  <li>extract a token</li>
  <li>embed it in a kubeconfig</li>
  <li>deploy</li>
</ul>

<p>In Kubernetes 1.24 and later, that workflow quietly broke.</p>

<hr />

<h2 id="how-cicd-auth-to-kubernetes-works">How CI/CD Auth to Kubernetes Works</h2>

<p>In a CI/CD environment:</p>

<ul>
  <li>the job runs outside the cluster</li>
  <li>it uses a kubeconfig file</li>
  <li>the kubeconfig authenticates as a ServiceAccount</li>
  <li>Kubernetes evaluates RBAC rules for that identity</li>
</ul>

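<p>Because the job runs outside the cluster, it needs a credential it can hold and present on its own. Historically, that meant a long-lived token.</p>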

<hr />

<h2 id="how-serviceaccount-tokens-used-to-work">How ServiceAccount Tokens Used to Work</h2>

<p>Before Kubernetes 1.24:</p>

<ul>
  <li>ServiceAccounts automatically created token Secrets</li>
  <li>tokens were long-lived</li>
  <li>stored as Kubernetes Secrets</li>
  <li>easy to extract for CI usage</li>
</ul>

<p>Many pipelines relied on this behavior.</p>

<hr />

<h2 id="what-changed-in-kubernetes-124">What Changed in Kubernetes 1.24</h2>

<p>Starting in Kubernetes 1.24:</p>

<ul>
  <li>token Secrets are no longer auto-created</li>
  <li>Kubernetes uses bound ServiceAccount tokens</li>
  <li>tokens are short-lived</li>
  <li>by default, delivered only to pods as projected volumes</li>
  <li>not stored as Secrets</li>
</ul>

<p>This improves security but breaks external CI workflows.</p>

<hr />

<h2 id="why-cicd-pipelines-break">Why CI/CD Pipelines Break</h2>

<p>CI systems:</p>

<ul>
  <li>run outside the cluster</li>
  <li>cannot receive projected tokens</li>
  <li>cannot refresh short-lived credentials</li>
</ul>

<p>The ServiceAccount exists.
RBAC is correct.
But no token exists to authenticate with.</p>

<hr />

<h2 id="detecting-the-issue">Detecting the Issue</h2>

<p>You can confirm this by inspecting the ServiceAccount:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get serviceaccount deployer-service-account -o jsonpath='{.secrets}'
</code></pre></div></div>

<p>If the output is empty, no token Secret exists.</p>

<hr />

<h2 id="bound-tokens-vs-secret-tokens">Bound Tokens vs Secret Tokens</h2>

<p>Bound tokens:</p>

<ul>
  <li>short-lived</li>
  <li>pod-scoped</li>
  <li>secure by default</li>
  <li>unsuitable for external CI</li>
</ul>

<p>Secret-based tokens:</p>

<ul>
  <li>long-lived</li>
  <li>manually created</li>
  <li>usable by CI systems</li>
  <li>require explicit lifecycle management</li>
</ul>
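
<p>Bound tokens can also be requested on demand through the TokenRequest API, which <code class="language-plaintext highlighter-rouge">kubectl create token</code> (kubectl 1.24 and later) wraps. The caller still needs existing cluster access to make the request, so this does not by itself solve bootstrapping for an external CI system. A sketch, reusing the ServiceAccount name from the detection example and assuming it lives in the <code class="language-plaintext highlighter-rouge">default</code> namespace:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Request a short-lived token; the cluster may cap the requested duration.
kubectl create token deployer-service-account \
  --namespace default \
  --duration 1h
</code></pre></div></div>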

<hr />

<h2 id="creating-a-token-secret-explicitly">Creating a Token Secret Explicitly</h2>

<p>When CI access is required, a token Secret can be created manually:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl apply -f - &lt;&lt;EOF
apiVersion: v1
kind: Secret
metadata:
  name: deployer-sa-token
  annotations:
    kubernetes.io/service-account.name: deployer-service-account
type: kubernetes.io/service-account-token
EOF
</code></pre></div></div>

<p>Kubernetes will populate the token automatically.</p>
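
<p>Once the Secret has been populated, the token can be read back and placed in a dedicated kubeconfig for the pipeline. A minimal sketch; the user name <code class="language-plaintext highlighter-rouge">deployer</code> and the file name <code class="language-plaintext highlighter-rouge">ci.kubeconfig</code> are hypothetical:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Secret data is base64-encoded; decode it to get the raw bearer token.
TOKEN="$(kubectl get secret deployer-sa-token \
  -o jsonpath='{.data.token}' | base64 --decode)"

# Write the credential into a separate file rather than ~/.kube/config.
kubectl config --kubeconfig=ci.kubeconfig set-credentials deployer \
  --token="$TOKEN"
</code></pre></div></div>

<p>The resulting file still needs a cluster entry and a context before it is usable, and it should be stored as a CI secret, not committed.</p>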

<hr />

<h2 id="security-implications">Security Implications</h2>

<p>Manually created tokens:</p>

<ul>
  <li>reintroduce long-lived credentials</li>
  <li>require rotation discipline</li>
  <li>increase blast radius if leaked</li>
</ul>

<p>They should be treated as exceptions, not defaults.</p>

<hr />

<h2 id="modern-alternatives">Modern Alternatives</h2>

<p>More robust approaches include:</p>

<ul>
  <li>OIDC federation</li>
  <li>cloud IAM integrations</li>
  <li>exec-based kubeconfig plugins</li>
  <li>workload identity systems</li>
</ul>

<p>These eliminate static tokens entirely.</p>

<hr />

<h2 id="practical-guidance">Practical Guidance</h2>

<ul>
  <li>do not assume ServiceAccounts have tokens</li>
  <li>distinguish authentication from authorization</li>
  <li>validate permissions with <code class="language-plaintext highlighter-rouge">kubectl auth can-i</code></li>
  <li>treat long-lived tokens as transitional</li>
</ul>
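
<p>To check what the ServiceAccount itself is allowed to do, without first building a kubeconfig for it, the request can be impersonated. This sketch assumes the account lives in the <code class="language-plaintext highlighter-rouge">default</code> namespace and that your own user is permitted to impersonate it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ServiceAccount identities take the form system:serviceaccount:NAMESPACE:NAME.
kubectl auth can-i create deployments \
  --as=system:serviceaccount:default:deployer-service-account \
  -n default
</code></pre></div></div>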

<p>The system did not break.
The defaults changed.
The model finally caught up with security reality.</p>]]></content><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;&quot;&quot;, &quot;location&quot;=&gt;&quot;Los Angeles, CA&quot;, &quot;email&quot;=&gt;&quot;xavier@xavierlopez.me&quot;, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;LinkedIn&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;, &quot;url&quot;=&gt;&quot;https://linkedin.com/in/zavelopez&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/zavestudios&quot;}]}</name><email>xavier@xavierlopez.me</email></author><category term="devops" /><category term="security" /><summary type="html"><![CDATA[A practical explanation of how Kubernetes ServiceAccount authentication works for CI/CD systems, what changed in Kubernetes 1.24, and why previously working pipelines broke.]]></summary></entry><entry><title type="html">Creating and Understanding kubeconfig Files</title><link href="https://xavierlopez.me/devops/security/creating-and-understanding-kubeconfig-files/" rel="alternate" type="text/html" title="Creating and Understanding kubeconfig Files" /><published>2024-03-02T00:00:00-08:00</published><updated>2025-01-10T00:00:00-08:00</updated><id>https://xavierlopez.me/devops/security/creating-and-understanding-kubeconfig-files</id><content type="html" xml:base="https://xavierlopez.me/devops/security/creating-and-understanding-kubeconfig-files/"><![CDATA[<h2 id="context">Context</h2>

<p><code class="language-plaintext highlighter-rouge">kubectl</code> feels simple once it works.</p>

<p>But when access breaks—or when you need to create access from scratch—the kubeconfig file suddenly becomes mysterious. Tokens, certificates, contexts, users, clusters: everything is there, but rarely explained clearly.</p>

<p>This post explains what a kubeconfig actually is, how it’s structured, and how to create or modify one intentionally instead of relying on magic.</p>

<hr />

<h2 id="what-a-kubeconfig-really-is">What a kubeconfig Really Is</h2>

<p>A kubeconfig file is <strong>not credentials</strong>.</p>

<p>It is a <strong>configuration document</strong> that tells <code class="language-plaintext highlighter-rouge">kubectl</code>:</p>

<ul>
  <li>which cluster to talk to</li>
  <li>how to authenticate</li>
  <li>which identity to use</li>
  <li>which context ties those together</li>
</ul>

<p>Think of it as a <strong>connection profile</strong>, not a secret store. It can still embed or reference credentials, so protect the file accordingly.</p>

<hr />

<h2 id="the-four-core-concepts">The Four Core Concepts</h2>

<p>Every kubeconfig is built from four pieces.</p>

<h3 id="cluster">Cluster</h3>

<p>Defines:</p>

<ul>
  <li>API server endpoint</li>
  <li>CA certificate used to trust the server</li>
</ul>

<h3 id="user">User</h3>

<p>Defines:</p>

<ul>
  <li>how authentication happens</li>
  <li>certificates, tokens, or exec plugins</li>
</ul>

<h3 id="context-1">Context</h3>

<p>Binds:</p>

<ul>
  <li>one cluster</li>
  <li>one user</li>
  <li>optionally a namespace</li>
</ul>

<h3 id="current-context">Current Context</h3>

<p>Tells <code class="language-plaintext highlighter-rouge">kubectl</code> which context to use by default.</p>

<p>Nothing works unless all four line up.</p>
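
<p>Seeing the four pieces side by side in a file can help. The sketch below writes a minimal kubeconfig by hand, using the same placeholder names and paths as the step-by-step section of this post; the output file name is arbitrary:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Placeholder endpoint and certificate paths; replace before use.
cat &gt; example.kubeconfig &lt;&lt;'EOF'
apiVersion: v1
kind: Config
clusters:
- name: example-cluster
  cluster:
    server: https://api.example.internal:6443
    certificate-authority: /path/to/ca.crt
users:
- name: example-user
  user:
    client-certificate: /path/to/client.crt
    client-key: /path/to/client.key
contexts:
- name: example-context
  context:
    cluster: example-cluster
    user: example-user
    namespace: default
current-context: example-context
EOF
</code></pre></div></div>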

<hr />

<h2 id="inspecting-an-existing-kubeconfig">Inspecting an Existing kubeconfig</h2>

<p>To see what <code class="language-plaintext highlighter-rouge">kubectl</code> is currently using:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl config view
</code></pre></div></div>

<p>To see only the active context:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl config current-context
</code></pre></div></div>

<p>To list all contexts:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl config get-contexts
</code></pre></div></div>

<p>These commands are often enough to diagnose access confusion.</p>

<hr />

<h2 id="creating-a-kubeconfig-manually-step-by-step">Creating a kubeconfig Manually (Step by Step)</h2>

<p>Creating a kubeconfig intentionally makes the model click.</p>

<h3 id="step-1-define-the-cluster">Step 1: Define the Cluster</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl config set-cluster example-cluster <span class="se">\</span>
  <span class="nt">--server</span><span class="o">=</span>https://api.example.internal:6443 <span class="se">\</span>
  <span class="nt">--certificate-authority</span><span class="o">=</span>/path/to/ca.crt
</code></pre></div></div>

<p>This tells <code class="language-plaintext highlighter-rouge">kubectl</code> <em>where</em> the API server is and <em>how to trust it</em>.</p>

<hr />

<h3 id="step-2-define-the-user">Step 2: Define the User</h3>

<p>Example using a client certificate:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl config set-credentials example-user <span class="se">\</span>
  <span class="nt">--client-certificate</span><span class="o">=</span>/path/to/client.crt <span class="se">\</span>
  <span class="nt">--client-key</span><span class="o">=</span>/path/to/client.key
</code></pre></div></div>

<p>Other authentication methods exist, but the structure is the same.</p>

<hr />

<h3 id="step-3-create-a-context">Step 3: Create a Context</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl config set-context example-context <span class="se">\</span>
  <span class="nt">--cluster</span><span class="o">=</span>example-cluster <span class="se">\</span>
  <span class="nt">--user</span><span class="o">=</span>example-user <span class="se">\</span>
  <span class="nt">--namespace</span><span class="o">=</span>default
</code></pre></div></div>

<p>This binds identity to destination.</p>

<hr />

<h3 id="step-4-activate-the-context">Step 4: Activate the Context</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl config use-context example-context
</code></pre></div></div>

<p>At this point, <code class="language-plaintext highlighter-rouge">kubectl</code> is fully configured.</p>

<hr />

<h2 id="where-kubeconfig-files-live">Where kubeconfig Files Live</h2>

<p>By default, <code class="language-plaintext highlighter-rouge">kubectl</code> looks for:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/.kube/config
</code></pre></div></div>

<p>You can override this with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">KUBECONFIG</span><span class="o">=</span>/path/to/config kubectl get pods
</code></pre></div></div>

<p>Multiple kubeconfig files can be merged automatically via the <code class="language-plaintext highlighter-rouge">KUBECONFIG</code> environment variable.</p>
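
<p>A short sketch of that merge, assuming a hypothetical second file at <code class="language-plaintext highlighter-rouge">~/.kube/staging.config</code>. When the same name appears in more than one file, the first file listed takes precedence:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Entries are separated by ':' on macOS and Linux.
export KUBECONFIG="$HOME/.kube/config:$HOME/.kube/staging.config"

# Contexts from both files are now visible.
kubectl config get-contexts

# Write the merged view to a new file; never redirect onto a file being read.
kubectl config view --flatten &gt; "$HOME/.kube/merged.config"
</code></pre></div></div>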

<hr />

<h2 id="why-contexts-matter-more-than-credentials">Why Contexts Matter More Than Credentials</h2>

<p>Most access mistakes are <strong>context mistakes</strong>, not auth failures.</p>

<p>Common issues include:</p>

<ul>
  <li>talking to the wrong cluster</li>
  <li>using the wrong namespace</li>
  <li>reusing similarly named contexts</li>
  <li>assuming the current context is what you think it is</li>
</ul>

<p>Always check the context before acting.</p>
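
<p>In practice that check takes two commands, and naming the context explicitly removes the guesswork altogether. The context and namespace names are the placeholders used earlier in this post:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Which context, and which default namespace, is active right now?
kubectl config current-context
kubectl config view --minify -o jsonpath='{..namespace}'

# For anything destructive, pass the context and namespace explicitly.
kubectl --context=example-context -n default get pods
</code></pre></div></div>

<p>The namespace query prints nothing when the context does not set one, in which case <code class="language-plaintext highlighter-rouge">kubectl</code> falls back to <code class="language-plaintext highlighter-rouge">default</code>.</p>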

<hr />

<h2 id="how-this-relates-to-rbac">How This Relates to RBAC</h2>

<p>A kubeconfig:</p>

<ul>
  <li>defines <em>how</em> you authenticate</li>
  <li>does <strong>not</strong> define <em>what you can do</em></li>
</ul>

<p>Authorization is enforced by:</p>

<ul>
  <li>Kubernetes RBAC</li>
  <li>Roles and RoleBindings</li>
  <li>ClusterRoles and ClusterRoleBindings</li>
</ul>

<p>If access is denied, the kubeconfig is usually fine—the permissions are not.</p>

<hr />

<h3 id="verifying-access-with-kubectl-auth-can-i">Verifying Access with kubectl auth can-i</h3>

<p>Once authentication is working, the next question is authorization.</p>

<p><code class="language-plaintext highlighter-rouge">kubectl auth can-i</code> answers a simple but critical question:</p>

<blockquote>
  <p>“Is this identity allowed to do this action?”</p>
</blockquote>

<h3 id="basic-check">Basic check</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl auth can-i get pods
</code></pre></div></div>

<p>This checks whether the <strong>current context’s user</strong> is allowed to <code class="language-plaintext highlighter-rouge">get</code> pods in the current namespace.</p>

<h3 id="explicit-namespace-check">Explicit namespace check</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl auth can-i create deployments <span class="nt">-n</span> example-namespace
</code></pre></div></div>

<p>This avoids false assumptions caused by the active namespace.</p>

<h3 id="cluster-scoped-permissions">Cluster-scoped permissions</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl auth can-i list nodes
</code></pre></div></div>

<p>This verifies permissions that are not namespace-bound.</p>

<h2 id="why-this-command-is-so-valuable">Why This Command Is So Valuable</h2>

<p><code class="language-plaintext highlighter-rouge">kubectl auth can-i</code> is the fastest way to distinguish between:</p>

<ul>
  <li>authentication problems (kubeconfig, credentials)</li>
  <li>authorization problems (RBAC)</li>
</ul>

<p>If the command returns <code class="language-plaintext highlighter-rouge">no</code>, authentication succeeded but permissions are insufficient.</p>

<p>If the command errors, the kubeconfig itself may be misconfigured.</p>

<h2 id="make-it-a-habit">Make It a Habit</h2>

<p>Before debugging:</p>

<ul>
  <li>forbidden errors</li>
  <li>CI/CD access failures</li>
  <li>“works for me” discrepancies</li>
  <li>broken automation</li>
</ul>

<p>Run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl auth can-i &lt;verb&gt; &lt;resource&gt;
</code></pre></div></div>

<p>It turns RBAC from guesswork into something concrete.</p>
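
<p>The same check works as a guard in scripts, because the command’s exit code reflects the answer. A minimal sketch with a hypothetical namespace:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># --quiet suppresses the yes/no output so only the exit code is used.
if kubectl auth can-i create deployments -n example-namespace --quiet; then
  echo "allowed"
else
  echo "denied, or the request itself failed"
fi

# List everything the current identity may do in that namespace.
kubectl auth can-i --list -n example-namespace
</code></pre></div></div>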

<hr />

<h2 id="practical-guidance">Practical Guidance</h2>

<ul>
  <li>treat kubeconfig as connection metadata</li>
  <li>keep contexts clearly named</li>
  <li>never assume the current context</li>
  <li>avoid sharing kubeconfig files directly</li>
  <li>regenerate credentials instead of copying them</li>
</ul>

<p>Understanding kubeconfig reduces both mistakes and anxiety.</p>
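
<p>One way to act on “never assume the current context” is a small guard at the top of a script. This is a sketch only; <code class="language-plaintext highlighter-rouge">require_context</code> and the context name in the example are hypothetical.</p>

```shell
# require_context EXPECTED_NAME
# Hypothetical guard: stops a script when kubectl points somewhere unexpected.
require_context() {
  current="$(kubectl config current-context 2>/dev/null || true)"
  if [ "$current" != "$1" ]; then
    echo "refusing to continue: current context is '$current', expected '$1'"
    return 1
  fi
}

# Example: require_context example-dev || exit 1
```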

<hr />

<h2 id="why-this-mental-model-scales">Why This Mental Model Scales</h2>

<p>Once this clicks:</p>

<ul>
  <li>EKS/GKE/AKS configs make sense</li>
  <li>CI/CD kubeconfigs are less scary</li>
  <li>access rotation becomes manageable</li>
  <li>multi-cluster workflows are predictable</li>
</ul>

<p>The file didn’t change—your understanding did.</p>]]></content><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;&quot;&quot;, &quot;location&quot;=&gt;&quot;Los Angeles, CA&quot;, &quot;email&quot;=&gt;&quot;xavier@xavierlopez.me&quot;, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;LinkedIn&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;, &quot;url&quot;=&gt;&quot;https://linkedin.com/in/zavelopez&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/zavestudios&quot;}]}</name><email>xavier@xavierlopez.me</email></author><category term="devops" /><category term="security" /><summary type="html"><![CDATA[A practical mental model for creating and understanding kubeconfig files, including clusters, users, contexts, and how kubectl actually authenticates.]]></summary></entry><entry><title type="html">Understanding Byte Size Units (Without Overthinking Them)</title><link href="https://xavierlopez.me/systems/storage/understanding-byte-size-units/" rel="alternate" type="text/html" title="Understanding Byte Size Units (Without Overthinking Them)" /><published>2024-02-28T00:00:00-08:00</published><updated>2025-01-09T00:00:00-08:00</updated><id>https://xavierlopez.me/systems/storage/understanding-byte-size-units</id><content type="html" xml:base="https://xavierlopez.me/systems/storage/understanding-byte-size-units/"><![CDATA[<h2 id="context">Context</h2>

<p>Most engineers <em>know</em> that there’s a difference between decimal and binary byte units.</p>

<p>Fewer engineers can confidently say:</p>

<ul>
  <li>which one a given system is using</li>
  <li>when the distinction matters</li>
  <li>when it’s safe to ignore</li>
</ul>

<p>This post explains byte size units in the way that’s actually useful in practice—without turning it into a standards lecture.</p>

<hr />

<h2 id="the-two-systems-youll-encounter">The Two Systems You’ll Encounter</h2>

<p>There are <strong>two</strong> byte size systems in common use:</p>

<h3 id="decimal-base-10">Decimal (Base-10)</h3>

<p>Used primarily for:</p>

<ul>
  <li>disk marketing</li>
  <li>network throughput</li>
  <li>vendor specifications</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 KB = 1,000 bytes
1 MB = 1,000,000 bytes
1 GB = 1,000,000,000 bytes
1 TB = 1,000,000,000,000 bytes
</code></pre></div></div>

<p>These scale cleanly by powers of 10.</p>

<hr />

<h3 id="binary-base-2">Binary (Base-2)</h3>

<p>Used primarily by:</p>

<ul>
  <li>operating systems</li>
  <li>memory reporting</li>
  <li>filesystems</li>
  <li>low-level tooling</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 KiB = 1,024 bytes
1 MiB = 1,048,576 bytes
1 GiB = 1,073,741,824 bytes
1 TiB = 1,099,511,627,776 bytes
</code></pre></div></div>

<p>These scale by powers of 2.</p>

<hr />

<h2 id="why-the-names-look-so-similar">Why the Names Look So Similar</h2>

<p>The confusion comes from history.</p>

<p>For years, binary quantities were labeled using decimal names:</p>

<ul>
  <li>“KB” meant 1024 bytes</li>
  <li>“MB” meant 1024² bytes</li>
</ul>

<p>That shorthand stuck—long after it became misleading.</p>

<p>The <strong>IEC standard</strong> introduced:</p>

<ul>
  <li>KiB, MiB, GiB, TiB</li>
</ul>

<p>Not to complicate things—but to be precise.</p>

<hr />

<h2 id="where-this-actually-matters">Where This Actually Matters</h2>

<p>In practice, you’ll most often see:</p>

<ul>
  <li><strong>Disks advertised in GB/TB (decimal)</strong></li>
  <li><strong>Operating systems reporting GiB/TiB (binary)</strong></li>
  <li><strong>Memory measured in GiB</strong></li>
  <li><strong>Network speeds measured in Gb/s (decimal bits)</strong></li>
</ul>

<p>This is why a “1 TB disk” doesn’t show up as “1 TB” in your OS.</p>

<p>Nothing is missing. Nothing is broken.</p>

<p>The units changed.</p>

<hr />

<h2 id="a-practical-example">A Practical Example</h2>

<p>A disk advertised as <strong>1 TB</strong> contains:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1,000,000,000,000 bytes
</code></pre></div></div>

<p>Your OS reports in GiB:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1,000,000,000,000 ÷ 1,073,741,824 ≈ 931 GiB
</code></pre></div></div>

<p>That ~7% difference is expected.</p>

<p>It’s not overhead. It’s arithmetic.</p>
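
<p>The same arithmetic can be checked from a shell. A small sketch using <code class="language-plaintext highlighter-rouge">awk</code>; the helper name <code class="language-plaintext highlighter-rouge">to_gib</code> is made up for illustration.</p>

```shell
# to_gib BYTES
# Converts a byte count into GiB (1 GiB = 1024 * 1024 * 1024 bytes).
to_gib() {
  awk -v bytes="$1" 'BEGIN { printf "%.2f\n", bytes / (1024 * 1024 * 1024) }'
}

to_gib 1000000000000   # the advertised 1 TB; prints 931.32
```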

<hr />

<h2 id="when-you-should-care">When You Should Care</h2>

<p>You should pay attention to units when:</p>

<ul>
  <li>capacity planning</li>
  <li>comparing vendor claims</li>
  <li>sizing storage or memory limits</li>
  <li>troubleshooting “missing” space</li>
  <li>interpreting monitoring metrics</li>
</ul>

<p>This is especially true in:</p>

<ul>
  <li>Kubernetes resource limits</li>
  <li>cloud storage pricing</li>
  <li>filesystem usage reports</li>
</ul>
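
<p>Kubernetes resource quantities are a concrete case: the suffix <code class="language-plaintext highlighter-rouge">G</code> is decimal and <code class="language-plaintext highlighter-rouge">Gi</code> is binary, so <code class="language-plaintext highlighter-rouge">1G</code> and <code class="language-plaintext highlighter-rouge">1Gi</code> are different requests. A quick sketch of the gap in plain shell arithmetic:</p>

```shell
g_bytes=$((1000 * 1000 * 1000))     # 1G  (decimal)
gi_bytes=$((1024 * 1024 * 1024))    # 1Gi (binary)

echo "1G  = $g_bytes bytes"
echo "1Gi = $gi_bytes bytes"
echo "difference = $((gi_bytes - g_bytes)) bytes"
```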

<hr />

<h2 id="when-you-can-mostly-ignore-it">When You Can Mostly Ignore It</h2>

<p>You can often ignore the distinction when:</p>

<ul>
  <li>working at small scales</li>
  <li>eyeballing approximate usage</li>
  <li>doing relative comparisons within the same system</li>
</ul>

<p>Just don’t mix unit systems mid-calculation.</p>

<hr />

<h2 id="practical-guidance">Practical Guidance</h2>

<p>A simple rule of thumb:</p>

<ul>
  <li><strong>If it’s hardware, bandwidth, or marketing → decimal</strong></li>
  <li><strong>If it’s an OS, memory, or filesystem → binary</strong></li>
</ul>

<p>When precision matters, check the unit label explicitly.</p>

<p>If the tool says <code class="language-plaintext highlighter-rouge">GiB</code>, believe it.</p>

<hr />

<h2 id="why-this-is-still-worth-knowing">Why This Is Still Worth Knowing</h2>

<p>This confusion persists because:</p>

<ul>
  <li>both systems are valid</li>
  <li>both are widely used</li>
  <li>tools are inconsistent about labeling</li>
</ul>

<p>Understanding the distinction once prevents years of second-guessing.</p>

<p>It’s a small mental model with a long shelf life.</p>]]></content><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;&quot;&quot;, &quot;location&quot;=&gt;&quot;Los Angeles, CA&quot;, &quot;email&quot;=&gt;&quot;xavier@xavierlopez.me&quot;, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;LinkedIn&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;, &quot;url&quot;=&gt;&quot;https://linkedin.com/in/zavelopez&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/zavestudios&quot;}]}</name><email>xavier@xavierlopez.me</email></author><category term="systems" /><category term="storage" /><summary type="html"><![CDATA[A practical explanation of byte size units—KB vs KiB, MB vs MiB—and why the distinction matters in real systems.]]></summary></entry><entry><title type="html">Installing the NFS Subdir External Provisioner with Helm</title><link href="https://xavierlopez.me/devops/storage/installing-nfs-subdir-external-provisioner-with-helm/" rel="alternate" type="text/html" title="Installing the NFS Subdir External Provisioner with Helm" /><published>2024-02-26T00:00:00-08:00</published><updated>2025-01-08T00:00:00-08:00</updated><id>https://xavierlopez.me/devops/storage/installing-nfs-subdir-external-provisioner-with-helm</id><content type="html" xml:base="https://xavierlopez.me/devops/storage/installing-nfs-subdir-external-provisioner-with-helm/"><![CDATA[<h2 id="context">Context</h2>

<p>Understanding how Kubernetes storage works is one thing.<br />
Actually <strong>enabling</strong> that capability in a cluster is another.</p>

<p>If you want dynamic NFS-backed Persistent Volumes, Kubernetes needs a component that can:</p>

<ul>
  <li>watch for PersistentVolumeClaims</li>
  <li>create directories on an NFS server</li>
  <li>register those directories as PersistentVolumes</li>
</ul>

<p>That component is the <strong>NFS Subdir External Provisioner</strong>.</p>

<p>This post focuses on installing it intentionally using Helm—and understanding what you’re enabling when you do.</p>

<hr />

<h2 id="what-this-provisioner-does">What This Provisioner Does</h2>

<p>The NFS Subdir External Provisioner:</p>

<ul>
  <li>runs as a pod in your cluster</li>
  <li>listens for PVCs referencing its StorageClass</li>
  <li>creates subdirectories on an external NFS server</li>
  <li>dynamically provisions PersistentVolumes</li>
</ul>

<p>Kubernetes can mount NFS volumes, but it has <strong>no</strong> built-in dynamic provisioner for NFS.<br />
This provisioner fills that gap.</p>

<hr />

<h2 id="prerequisites">Prerequisites</h2>

<p>Before installing anything, you need:</p>

<ul>
  <li>a reachable NFS server</li>
  <li>an exported directory writable by the provisioner</li>
  <li>network connectivity from cluster nodes to the NFS server</li>
  <li>Helm installed and configured</li>
</ul>

<p>If the NFS server isn’t healthy, this installation will succeed—but provisioning will not.</p>

<hr />

<h2 id="adding-the-helm-repository">Adding the Helm Repository</h2>

<p>First, add the Helm repository that hosts the chart:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/

helm repo update
</code></pre></div></div>

<p>This makes the chart available locally.</p>

<hr />

<h2 id="installing-the-provisioner">Installing the Provisioner</h2>

<p>The core installation uses <code class="language-plaintext highlighter-rouge">helm install</code> with a small but important set of values.</p>

<p>Example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm install nfs-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --namespace storage-system \
  --create-namespace \
  --set nfs.server=192.0.2.50 \
  --set nfs.path=/exports/kubernetes \
  --set storageClass.name=managed-nfs
</code></pre></div></div>

<p>Key values explained:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">nfs.server</code></p>

    <p>Address of the external NFS server</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">nfs.path</code></p>

    <p>Base directory where subdirectories will be created</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">storageClass.name</code></p>

    <p>The StorageClass PVCs will reference</p>
  </li>
</ul>

<p>This command installs the provisioner and registers a new StorageClass.</p>

<hr />

<h2 id="what-helm-is-actually-creating">What Helm Is Actually Creating</h2>

<p>After installation, you should see:</p>

<ul>
  <li>a Deployment running the provisioner</li>
  <li>a Pod connected to the NFS server</li>
  <li>a StorageClass pointing to this provisioner</li>
</ul>

<p>Helm handles object creation, but <strong>you are responsible</strong> for understanding the consequences.</p>

<hr />

<h2 id="verifying-the-installation">Verifying the Installation</h2>

<p>Check that the pod is running:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get pods <span class="nt">-n</span> storage-system
</code></pre></div></div>

<p>Confirm the StorageClass exists:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get storageclass
</code></pre></div></div>

<p>You should see the <code class="language-plaintext highlighter-rouge">managed-nfs</code> StorageClass listed.</p>

<hr />

<h2 id="validating-dynamic-provisioning">Validating Dynamic Provisioning</h2>

<p>Create a simple PVC referencing the StorageClass:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl apply <span class="nt">-f</span> - <span class="o">&lt;&lt;</span><span class="no">EOF</span><span class="sh">
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-nfs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: managed-nfs
  resources:
    requests:
      storage: 1Gi
</span><span class="no">EOF
</span></code></pre></div></div>

<p>If provisioning works:</p>

<ul>
  <li>a PersistentVolume will be created automatically</li>
  <li>a new directory will appear on the NFS server</li>
  <li>the PVC will bind successfully</li>
</ul>

<p>This confirms end-to-end functionality.</p>
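
<p>A hedged sketch of checking those outcomes from the command line. It assumes the claim was created in your current namespace under the name used above; <code class="language-plaintext highlighter-rouge">check_test_claim</code> is a hypothetical wrapper, and nothing here inspects the NFS server itself.</p>

```shell
# Hypothetical helper: shows the test claim, then the volume bound to it.
check_test_claim() {
  # STATUS should read Bound once provisioning has finished.
  kubectl get pvc test-nfs-claim

  # The name of the bound PersistentVolume is recorded on the claim.
  pv_name="$(kubectl get pvc test-nfs-claim -o jsonpath='{.spec.volumeName}')"
  if [ -z "$pv_name" ]; then
    echo "claim is not bound to a volume yet"
    return 1
  fi
  kubectl get pv "$pv_name"
}

# When finished, remove the test claim:
# kubectl delete pvc test-nfs-claim
```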

<p>If the reported size appears smaller than expected, this is often a unit conversion issue rather than a provisioning failure.<br />
See: <a href="/2024/02/28/understanding-byte-size-units/">Understanding Byte Size Units (Without Overthinking Them)</a></p>

<hr />

<h2 id="common-failure-modes">Common Failure Modes</h2>

<p>If things don’t work, check:</p>

<ul>
  <li>NFS server permissions</li>
  <li>firewall rules</li>
  <li>pod logs for the provisioner</li>
  <li>correctness of <code class="language-plaintext highlighter-rouge">nfs.server</code> and <code class="language-plaintext highlighter-rouge">nfs.path</code></li>
  <li>whether the StorageClass name matches the PVC</li>
</ul>

<p>Most failures are external to Kubernetes.</p>

<hr />

<h2 id="when-this-is-and-isnt-the-right-choice">When This Is (and Isn’t) the Right Choice</h2>

<p>This approach works well for:</p>

<ul>
  <li>shared storage</li>
  <li>development clusters</li>
  <li>on-prem environments</li>
  <li>workloads needing <code class="language-plaintext highlighter-rouge">ReadWriteMany</code></li>
</ul>

<p>It may not be appropriate for:</p>

<ul>
  <li>high-performance databases</li>
  <li>latency-sensitive workloads</li>
  <li>cloud-native block storage replacements</li>
</ul>

<p>NFS is a tool, not a default.</p>

<hr />

<h2 id="how-this-fits-the-bigger-picture">How This Fits the Bigger Picture</h2>

<p>This installation enables the architecture described in:</p>

<blockquote>
  <p><em>How NFS-Backed Persistent Volumes Actually Work in Kubernetes</em></p>
</blockquote>

<p>Understanding the model first makes this step predictable instead of magical.</p>

<p>Helm just applies the intent—you still own the system.</p>]]></content><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;&quot;&quot;, &quot;location&quot;=&gt;&quot;Los Angeles, CA&quot;, &quot;email&quot;=&gt;&quot;xavier@xavierlopez.me&quot;, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;LinkedIn&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;, &quot;url&quot;=&gt;&quot;https://linkedin.com/in/zavelopez&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/zavestudios&quot;}]}</name><email>xavier@xavierlopez.me</email></author><category term="devops" /><category term="storage" /><summary type="html"><![CDATA[How to install and reason about the NFS Subdir External Provisioner using Helm, enabling dynamic NFS-backed Persistent Volumes in Kubernetes.]]></summary></entry><entry><title type="html">Fixing Kubernetes Namespaces Stuck in Terminating</title><link href="https://xavierlopez.me/devops/development/fixing-kubernetes-namespaces-stuck-in-terminating/" rel="alternate" type="text/html" title="Fixing Kubernetes Namespaces Stuck in Terminating" /><published>2024-02-22T00:00:00-08:00</published><updated>2025-01-07T00:00:00-08:00</updated><id>https://xavierlopez.me/devops/development/fixing-kubernetes-namespaces-stuck-in-terminating</id><content type="html" xml:base="https://xavierlopez.me/devops/development/fixing-kubernetes-namespaces-stuck-in-terminating/"><![CDATA[<h2 id="context">Context</h2>

<p>Most Kubernetes resources delete cleanly.<br />
Namespaces are the exception.</p>

<p>When a namespace gets stuck in <code class="language-plaintext highlighter-rouge">Terminating</code>, it’s usually not because Kubernetes is broken—it’s because Kubernetes is waiting for something <em>else</em> to finish its job.</p>

<p>Understanding why that happens requires understanding <strong>finalizers</strong>.</p>

<hr />

<h2 id="what-a-namespace-deletion-actually-means">What a Namespace Deletion Actually Means</h2>

<p>Deleting a namespace is not a single operation.</p>

<p>When you run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl delete namespace example-namespace
</code></pre></div></div>

<p>Kubernetes:</p>

<ol>
  <li>marks the namespace for deletion</li>
  <li>enumerates all namespaced resources</li>
  <li>waits for controllers to clean up what they own</li>
  <li>removes finalizers</li>
  <li>deletes the namespace object</li>
</ol>

<p>If <em>any</em> step stalls, the namespace remains in <code class="language-plaintext highlighter-rouge">Terminating</code>.</p>

<hr />

<h2 id="what-finalizers-are-conceptually">What Finalizers Are (Conceptually)</h2>

<p>A <strong>finalizer</strong> is a promise.</p>

<p>It says:</p>
<blockquote>
  <p>“Do not delete this object until I have cleaned something up.”</p>
</blockquote>

<p>Finalizers are commonly added by:</p>

<ul>
  <li>controllers</li>
  <li>operators</li>
  <li>storage provisioners</li>
  <li>custom resources</li>
</ul>

<p>They exist to prevent data loss and orphaned infrastructure.</p>

<p>The downside: if the controller is gone or broken, the promise is never fulfilled.</p>

<hr />

<h2 id="why-namespaces-get-stuck">Why Namespaces Get Stuck</h2>

<p>Namespaces typically get stuck when:</p>

<ul>
  <li>a controller was removed before cleanup finished</li>
  <li>a CRD was deleted before its instances</li>
  <li>a storage provisioner no longer exists</li>
  <li>a webhook or operator is failing</li>
  <li>finalizers reference resources that no longer respond</li>
</ul>

<p>At that point, Kubernetes is waiting for a cleanup step that will never occur.</p>

<hr />

<h2 id="confirming-the-problem">Confirming the Problem</h2>

<p>First, verify the namespace state:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get namespace example-namespace
</code></pre></div></div>

<p>If it shows:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>STATUS   Terminating
</code></pre></div></div>

<p>Inspect it more closely:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl describe namespace example-namespace
</code></pre></div></div>

<p>Often, you’ll see references to remaining resources or finalizers.</p>

<p>For deeper inspection:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get namespace example-namespace <span class="nt">-o</span> json
</code></pre></div></div>

<p>Look specifically at:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spec.finalizers
</code></pre></div></div>
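
<p>Finalizers on the namespace are only part of the picture; objects still living inside it can hold deletion up as well. A commonly used way to list what remains, sketched here as a hypothetical function <code class="language-plaintext highlighter-rouge">list_remaining</code>:</p>

```shell
# list_remaining NAMESPACE
# Asks the API server for every listable namespaced resource type,
# then queries each type inside the given namespace.
list_remaining() {
  kubectl api-resources --verbs=list --namespaced -o name |
    while read -r kind; do
      kubectl get "$kind" -n "$1" --ignore-not-found --show-kind
    done
}

# Example: list_remaining example-namespace
```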

<hr />

<h2 id="why-force-deletion-usually-doesnt-work">Why Force Deletion Usually Doesn’t Work</h2>

<p>Commands like:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl delete namespace example-namespace --force --grace-period=0
</code></pre></div></div>

<p>are commonly tried—and commonly ineffective.</p>

<p>That’s because:</p>

<ul>
  <li>finalizers live at the API level</li>
  <li>force deletion does not bypass finalizers</li>
  <li>Kubernetes is still honoring the contract</li>
</ul>

<p>Force only skips graceful termination, not cleanup guarantees.</p>

<hr />

<h2 id="the-last-resort-fix-removing-finalizers">The Last-Resort Fix: Removing Finalizers</h2>

<p>⚠️ <strong>This is an administrative recovery action.</strong><br />
You are explicitly telling Kubernetes to stop waiting.</p>

<p>Proceed only when:</p>

<ul>
  <li>you understand what’s stuck</li>
  <li>the owning controller no longer exists</li>
  <li>cleanup cannot complete naturally</li>
</ul>

<hr />

<h3 id="step-1-export-the-namespace-definition">Step 1: Export the Namespace Definition</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get namespace example-namespace <span class="nt">-o</span> json <span class="o">&gt;</span> namespace.json
</code></pre></div></div>

<hr />

<h3 id="step-2-remove-the-finalizers">Step 2: Remove the Finalizers</h3>

<p>Edit <code class="language-plaintext highlighter-rouge">namespace.json</code> and remove the <code class="language-plaintext highlighter-rouge">finalizers</code> field entirely.</p>

<p>Before:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nl">"finalizers"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="s2">"kubernetes"</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>After:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="p">{}</span><span class="w">
</span></code></pre></div></div>

<hr />

<h3 id="step-3-submit-the-finalized-object">Step 3: Submit the Finalized Object</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl replace <span class="nt">--raw</span> <span class="s2">"/api/v1/namespaces/example-namespace/finalize"</span> <span class="se">\</span>
  <span class="nt">-f</span> namespace.json
</code></pre></div></div>

<p>This bypasses the normal deletion workflow and tells the API server:</p>
<blockquote>
  <p>“Delete this namespace now.”</p>
</blockquote>

<p>If successful, the namespace disappears immediately.</p>
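
<p>Steps 1 through 3 are often collapsed into a single pipeline. The sketch below assumes <code class="language-plaintext highlighter-rouge">jq</code> is installed, which this post does not otherwise require, and wraps the pipeline in a hypothetical function <code class="language-plaintext highlighter-rouge">clear_namespace_finalizers</code>. The same warnings apply.</p>

```shell
# clear_namespace_finalizers NAMESPACE
# Assumes jq is available. Removes spec.finalizers from the exported
# object and submits the result to the namespace's finalize endpoint,
# reading the edited object from stdin instead of a file.
clear_namespace_finalizers() {
  kubectl get namespace "$1" -o json |
    jq 'del(.spec.finalizers)' |
    kubectl replace --raw "/api/v1/namespaces/$1/finalize" -f -
}

# Example: clear_namespace_finalizers example-namespace
```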

<hr />

<h2 id="what-youre-skipping-by-doing-this">What You’re Skipping by Doing This</h2>

<p>Removing finalizers means:</p>

<ul>
  <li>controllers do <strong>not</strong> clean up external resources</li>
  <li>storage or cloud artifacts may remain</li>
  <li>audit trails may be incomplete</li>
</ul>

<p>This is why this approach is <strong>corrective</strong>, not routine.</p>

<hr />

<h2 id="when-this-is-the-right-call">When This Is the Right Call</h2>

<p>This approach is appropriate when:</p>

<ul>
  <li>the cluster is already inconsistent</li>
  <li>the namespace is blocking automation</li>
  <li>recovery is impossible via normal controllers</li>
  <li>the resources are already orphaned</li>
</ul>

<p>In practice, this is often the only viable path forward.</p>

<hr />

<h2 id="preventing-this-in-the-future">Preventing This in the Future</h2>

<p>A few practices reduce the odds of hitting this:</p>

<ul>
  <li>delete CR instances before deleting CRDs</li>
  <li>remove operators last, not first</li>
  <li>monitor namespaces during teardown</li>
  <li>understand which controllers add finalizers</li>
  <li>treat namespace deletion as a process, not a command</li>
</ul>

<p>Finalizers are powerful—but they require discipline.</p>

<hr />

<h2 id="practical-takeaways">Practical Takeaways</h2>

<ul>
  <li>namespaces don’t delete instantly by design</li>
  <li>finalizers exist to protect external state</li>
  <li>stuck namespaces usually mean broken cleanup</li>
  <li>force deletion does not bypass finalizers</li>
  <li>removing finalizers is safe <em>only when cleanup is impossible</em></li>
</ul>

<p>This is one of those Kubernetes behaviors that feels mysterious—until it isn’t.</p>

<p>Once you understand the contract, the fix becomes deliberate instead of desperate.</p>]]></content><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;&quot;&quot;, &quot;location&quot;=&gt;&quot;Los Angeles, CA&quot;, &quot;email&quot;=&gt;&quot;xavier@xavierlopez.me&quot;, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;LinkedIn&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;, &quot;url&quot;=&gt;&quot;https://linkedin.com/in/zavelopez&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/zavestudios&quot;}]}</name><email>xavier@xavierlopez.me</email></author><category term="devops" /><category term="development" /><summary type="html"><![CDATA[A practical explanation of why Kubernetes namespaces get stuck in Terminating and how to safely resolve the issue by understanding and managing finalizers.]]></summary></entry><entry><title type="html">Creating Kubernetes Secrets from the Command Line (and When Not To)</title><link href="https://xavierlopez.me/devops/security/creating-kubernetes-secrets-from-the-command-line/" rel="alternate" type="text/html" title="Creating Kubernetes Secrets from the Command Line (and When Not To)" /><published>2024-02-19T00:00:00-08:00</published><updated>2025-01-06T00:00:00-08:00</updated><id>https://xavierlopez.me/devops/security/creating-kubernetes-secrets-from-the-command-line</id><content type="html" xml:base="https://xavierlopez.me/devops/security/creating-kubernetes-secrets-from-the-command-line/"><![CDATA[<h2 id="context">Context</h2>

<p>Kubernetes Secrets are often introduced early, but rarely explained clearly.</p>

<p>Most examples focus on <em>how</em> to create a Secret, not:</p>

<ul>
  <li><strong>why</strong> you’d choose one method over another</li>
  <li>what tradeoffs you’re making</li>
  <li>how Secrets fit into a broader operational model</li>
</ul>

<p>This post focuses specifically on creating Secrets from the command line using <code class="language-plaintext highlighter-rouge">kubectl create secret</code>, and—just as importantly—when <em>not</em> to do that.</p>

<hr />

<h2 id="what-kubectl-create-secret-actually-does">What kubectl create secret Actually Does</h2>

<p>At a high level, <code class="language-plaintext highlighter-rouge">kubectl create secret</code>:</p>

<ul>
  <li>takes input (literals, files, or env files)</li>
  <li>base64-encodes the values</li>
  <li>submits a Secret object to the Kubernetes API server</li>
</ul>

<p>It does <strong>not</strong>:</p>

<ul>
  <li>encrypt values by itself</li>
  <li>manage secret rotation</li>
  <li>track provenance</li>
  <li>enforce security policies</li>
</ul>

<p>It is a creation mechanism, not a secrets management system.</p>
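
<p>To make the “encoded, not encrypted” point concrete, here is a small sketch using the <code class="language-plaintext highlighter-rouge">base64</code> utility found on most systems. The value is a placeholder.</p>

```shell
# base64 is a reversible encoding: anyone who can read the Secret
# can recover the original value without any key.
encoded="$(printf '%s' 'example_password' | base64)"
decoded="$(printf '%s' "$encoded" | base64 -d)"

echo "encoded: $encoded"
echo "decoded: $decoded"
```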

<hr />

<h2 id="creating-a-secret-from-literal-values">Creating a Secret from Literal Values</h2>

<p>The most direct pattern uses literals:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl create secret generic example-db-creds \
  --from-literal=username=example_user \
  --from-literal=password=example_password
</code></pre></div></div>

<p>This is useful for:</p>

<ul>
  <li>quick experiments</li>
  <li>local clusters</li>
  <li>validating application wiring</li>
</ul>

<p>It is <strong>not</strong> ideal for long-lived or production secrets: the values are typed directly into the command and can end up in shell history.</p>
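
<p>If you want to see what would be submitted before anything is stored, <code class="language-plaintext highlighter-rouge">kubectl</code> can render the object locally. A sketch using the same placeholder values, wrapped in a hypothetical function <code class="language-plaintext highlighter-rouge">preview_secret</code>:</p>

```shell
# Hypothetical helper: prints the Secret manifest without creating it.
preview_secret() {
  kubectl create secret generic example-db-creds \
    --from-literal=username=example_user \
    --from-literal=password=example_password \
    --dry-run=client -o yaml
}
```

<p>The <code class="language-plaintext highlighter-rouge">data</code> values in the rendered manifest are base64-encoded, not encrypted.</p>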

<hr />

<h2 id="creating-a-secret-from-a-file">Creating a Secret from a File</h2>

<p>A more common pattern is file-based creation:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl create secret generic example-config \
  --from-file=application.yaml
</code></pre></div></div>

<p>This creates a Secret where:</p>

<ul>
  <li>the key is the filename</li>
  <li>the value is the file contents</li>
</ul>

<p>This works well for:</p>

<ul>
  <li>config blobs</li>
  <li>certificates</li>
  <li>structured files</li>
</ul>

<p>But it still raises questions about where that file lives and how it’s protected.</p>

<hr />

<h2 id="creating-secrets-from-environment-files">Creating Secrets from Environment Files</h2>

<p>Environment-style files can also be used:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl create secret generic example-env \
  --from-env-file=.env
</code></pre></div></div>

<p>This is convenient, but dangerous if:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">.env</code> files are committed accidentally</li>
  <li>shell history is not managed carefully</li>
  <li>multiple environments share similar filenames</li>
</ul>

<p>Convenience and risk scale together here.</p>

<hr />

<h2 id="namespaces-matter-more-than-syntax">Namespaces Matter More Than Syntax</h2>

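<p>By default, Secrets are created in the <strong>current namespace</strong>: the one set on the active kubectl context, or <code class="language-plaintext highlighter-rouge">default</code> if none is set.</p>

<p>Creating a Secret in the wrong namespace is one of the most common failure modes.</p>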

<p>Always be explicit:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl create secret generic example-db-creds <span class="se">\</span>
  <span class="nt">--from-literal</span><span class="o">=</span>username=example_user <span class="se">\</span>
  <span class="nt">--from-literal</span><span class="o">=</span>password=example_password <span class="se">\</span>
  <span class="nt">-n</span> example-namespace
</code></pre></div></div>

<p>To the workload that references it, a Secret in the wrong namespace is indistinguishable from a missing one.</p>
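
<p>When a command omits <code class="language-plaintext highlighter-rouge">-n</code>, it helps to know which namespace kubectl will fall back to. The sketch below reads it from the active context; the function name <code class="language-plaintext highlighter-rouge">current_namespace</code> is hypothetical, and the fallback to <code class="language-plaintext highlighter-rouge">default</code> mirrors what kubectl does when the context sets no namespace.</p>

```shell
# Hypothetical helper: print the namespace of the active kubectl context.
# Prints "default" when the context does not set one.
# Note: "ns" is a plain shell variable and stays set after the call.
current_namespace() {
  ns=$(kubectl config view --minify --output 'jsonpath={..namespace}')
  printf '%s\n' "${ns:-default}"
}
```

<p>Running <code class="language-plaintext highlighter-rouge">current_namespace</code> before a manual <code class="language-plaintext highlighter-rouge">kubectl create secret</code> is a cheap guard, though passing <code class="language-plaintext highlighter-rouge">-n</code> explicitly remains the more reliable habit.</p>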

<hr />

<h2 id="inspecting-what-you-created">Inspecting What You Created</h2>

<p>You can confirm a Secret exists with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get secret example-db-creds <span class="nt">-n</span> example-namespace
</code></pre></div></div>

<p>And inspect its metadata, which lists key names and value sizes but not the values themselves, with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl describe secret example-db-creds <span class="nt">-n</span> example-namespace
</code></pre></div></div>

<p>Avoid decoding values casually unless you need to verify wiring.</p>

<p>If you <em>do</em> decode, do it intentionally and clean up afterward: clear terminal scrollback and delete any temporary files.</p>
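
<p>Secret values are stored base64-encoded under <code class="language-plaintext highlighter-rouge">.data</code>, so an intentional decode can be limited to a single key. The helper below is a sketch with a hypothetical name; it assumes the key contains no dots, since a key such as <code class="language-plaintext highlighter-rouge">application.yaml</code> would need its dot escaped in the JSONPath expression.</p>

```shell
# Hypothetical helper: decode one key of a Secret to stdout.
#   $1 = Secret name, $2 = namespace, $3 = key under .data (no dots)
decode_secret_key() {
  kubectl get secret "$1" -n "$2" -o "jsonpath={.data.$3}" | base64 --decode
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">decode_secret_key example-db-creds example-namespace username</code> writes only that one value to the terminal. Because nothing is redirected to a file, there is less to clean up afterward.</p>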

<hr />

<h2 id="when-kubectl-create-secret-is-the-right-tool">When kubectl create secret Is the Right Tool</h2>

<p>This approach works well when:</p>

<ul>
  <li>bootstrapping a cluster</li>
  <li>validating application configuration</li>
  <li>working in ephemeral environments</li>
  <li>teaching or learning Kubernetes mechanics</li>
</ul>

<p>It’s a <strong>mechanical tool</strong>, not a long-term strategy.</p>

<hr />

<h2 id="when-kubectl-create-secret-becomes-a-liability">When kubectl create secret Becomes a Liability</h2>

<p>Problems arise when:</p>

<ul>
  <li>secrets are created manually and forgotten</li>
  <li>values live in shell history</li>
  <li>environments drift</li>
  <li>rotation becomes manual and error-prone</li>
  <li>auditability matters</li>
</ul>

<p>At scale, this approach does not age well.</p>
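
<p>Part of the rotation pain is mechanical: <code class="language-plaintext highlighter-rouge">kubectl create secret</code> fails when the Secret already exists. A common stopgap, sketched below with a hypothetical helper name, is to render the manifest client-side and pipe it to <code class="language-plaintext highlighter-rouge">kubectl apply</code>, which creates or updates as needed. Reading the value from a file is an assumption made here to keep it off the command line and out of shell history.</p>

```shell
# Hypothetical helper: create the Secret if it is absent, update it if present.
#   $1 = Secret name, $2 = namespace, $3 = path to a file holding the password
upsert_password_secret() {
  kubectl create secret generic "$1" -n "$2" \
    --from-file=password="$3" \
    --dry-run=client -o yaml |
    kubectl apply -f -
}
```

<p>Note that <code class="language-plaintext highlighter-rouge">--from-file</code> stores the file byte for byte, so a trailing newline in the file becomes part of the password. This makes manual rotation repeatable, but it does not address drift or auditability.</p>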

<hr />

<h2 id="better-patterns-for-the-long-term">Better Patterns for the Long Term</h2>

<p>As systems mature, secret creation usually moves toward:</p>

<ul>
  <li>GitOps workflows</li>
  <li>external secret managers</li>
  <li>sealed or encrypted manifests</li>
  <li>automated rotation</li>
</ul>

<p>In those models:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">kubectl create secret</code> is often replaced</li>
  <li>or used only as a bootstrap mechanism</li>
</ul>

<p>That’s not a failure—it’s progress.</p>

<hr />

<h2 id="practical-takeaways">Practical Takeaways</h2>

<ul>
  <li><code class="language-plaintext highlighter-rouge">kubectl create secret</code> is about <em>object creation</em>, not security</li>
  <li>be explicit about namespaces</li>
  <li>understand where secret material lives</li>
  <li>treat manual creation as transitional</li>
  <li>plan for replacement as systems grow</li>
</ul>

<p>Secrets are less about syntax and more about discipline.</p>

<p>Understanding the limits of your tools is part of operating Kubernetes responsibly.</p>]]></content><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;&quot;&quot;, &quot;location&quot;=&gt;&quot;Los Angeles, CA&quot;, &quot;email&quot;=&gt;&quot;xavier@xavierlopez.me&quot;, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;LinkedIn&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;, &quot;url&quot;=&gt;&quot;https://linkedin.com/in/zavelopez&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/zavestudios&quot;}]}</name><email>xavier@xavierlopez.me</email></author><category term="devops" /><category term="security" /><summary type="html"><![CDATA[Practical notes on creating Kubernetes Secrets from the command line, including when kubectl create secret is appropriate—and when it becomes a liability.]]></summary></entry></feed>