<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://engseclabs.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://engseclabs.com/" rel="alternate" type="text/html" /><updated>2026-05-09T05:17:54+00:00</updated><id>https://engseclabs.com/feed.xml</id><title type="html">EngSecLabs</title><subtitle>Practical security programs for early to growth-stage B2B SaaS companies. Real risk reduction, not compliance theater.</subtitle><entry><title type="html">Credential isolation and least privilege for AWS agents</title><link href="https://engseclabs.com/blog/iam-agent-proxy/" rel="alternate" type="text/html" title="Credential isolation and least privilege for AWS agents" /><published>2026-05-08T00:00:00+00:00</published><updated>2026-05-08T00:00:00+00:00</updated><id>https://engseclabs.com/blog/iam-agent-proxy</id><content type="html" xml:base="https://engseclabs.com/blog/iam-agent-proxy/"><![CDATA[<p>Two problems come up every time you give an AI agent AWS access: the agent has exfiltratable credentials, and you have to guess what permissions it needs in the form of an IAM policy.</p>

<p><a href="https://github.com/engseclabs/iam-agent-proxy">iam-agent-proxy</a> is an HTTPS proxy for AWS CLI/SDK calls that validates requests using fake AWS keys and re-signs with real credentials. And because the proxy intercepts every request, it can resolve each one to an IAM action string, generate, and even enforce a least-privilege policy from what the agent actually called.</p>

<h2 id="getting-started">Getting started</h2>

<p>Start the proxy with whatever AWS profile has the permissions your agent needs:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">AWS_PROFILE</span><span class="o">=</span>my-real-profile iam-agent-proxy
</code></pre></div></div>

<p>In a second terminal, point the agent at it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">AWS_PROFILE</span><span class="o">=</span>iam-agent-proxy
<span class="nb">export </span><span class="nv">HTTPS_PROXY</span><span class="o">=</span>http://localhost:8080
</code></pre></div></div>

<p>The agent gets proxy-issued fake keys — no IAM identity behind them:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"Version"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
  </span><span class="nl">"AccessKeyId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"AKIAPROXY0000000001"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"SecretAccessKey"</span><span class="p">:</span><span class="w"> </span><span class="s2">"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Expiration"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2026-05-08T15:00:00Z"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Make some AWS calls:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws sts get-caller-identity
aws s3 <span class="nb">ls</span>
</code></pre></div></div>

<p>The proxy terminal logs each resolved action:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[14:32:01] ALLOWED  sts:GetCallerIdentity
[14:32:09] ALLOWED  s3:ListAllMyBuckets
</code></pre></div></div>

<p>Run the agent through a representative workload, then extract the observed policy:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>iam-agent-proxy policy
</code></pre></div></div>

<p>That emits standard IAM policy JSON you can use as an inline policy or session policy. Set <code class="language-plaintext highlighter-rouge">PROXY_MODE=enforce</code>, point <code class="language-plaintext highlighter-rouge">ALLOWLIST_PATH</code> at that file, and the proxy starts blocking anything outside it, returning a well-formed <code class="language-plaintext highlighter-rouge">AccessDenied</code> 403 so the agent’s error handling works as designed.</p>
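
<p>Put together, the observe-then-enforce loop looks roughly like this (a sketch, assuming the <code class="language-plaintext highlighter-rouge">policy</code> subcommand writes the JSON to stdout; the file name is arbitrary):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># 1. Observe: run the agent through a representative workload, then capture the policy
iam-agent-proxy policy &gt; observed-policy.json

# 2. Enforce: restart the proxy with the observed policy as the allowlist
PROXY_MODE=enforce \
ALLOWLIST_PATH=./observed-policy.json \
AWS_PROFILE=my-real-profile \
iam-agent-proxy
</code></pre></div></div>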

<p>The workflow inverts the usual least-privilege approach: instead of guessing what the agent needs before it runs, you observe what it actually does and lock in that baseline.</p>

<p>Check it out at <a href="https://github.com/engseclabs/iam-agent-proxy">github.com/engseclabs/iam-agent-proxy</a>. If you’re building in this space or hit a case it doesn’t cover, reach out on <a href="https://www.linkedin.com/in/alex-smolen-8a59a31">LinkedIn</a> or <a href="https://infosec.exchange/@alsmola">Mastodon</a>.</p>]]></content><author><name>Alex Smolen</name></author><category term="aws," /><category term="iam," /><category term="ai-agents," /><category term="security," /><category term="credential-isolation" /><summary type="html"><![CDATA[A proxy that holds real AWS credentials, gives agents fake keys, re-signs outbound requests, and generates a least-privilege policy from observed behavior.]]></summary></entry><entry><title type="html">AWS Credential Isolation for Local AI Agents</title><link href="https://engseclabs.com/blog/agent-credential-isolation/" rel="alternate" type="text/html" title="AWS Credential Isolation for Local AI Agents" /><published>2026-04-27T00:00:00+00:00</published><updated>2026-04-27T00:00:00+00:00</updated><id>https://engseclabs.com/blog/agent-credential-isolation</id><content type="html" xml:base="https://engseclabs.com/blog/agent-credential-isolation/"><![CDATA[<p>If you run local agents, you need to make tough choices between autonomy and safety. Setting <code class="language-plaintext highlighter-rouge">dangerously-skip-permissions</code> while sword fighting on desk chairs and letting the tokens burn bright is an out-of-reach dream when you’re forced to babysit file access and “can I use the internet for this?” requests. But the impact of an agent doing something stupid needs to be pretty low for humans to check out of the loop.</p>

<p><img src="/assets/images/agent-credential-isolation.png" alt="Agent credential isolation diagram" /></p>

<p>I’ve been interacting with AWS from my local coding agents and ran into a source of flummoxation. If you’ve got even a moderately complicated AWS IAM setup, profile management gets gnarly. The problem has three parts:</p>

<ul>
  <li>The agent should only have access to a specific AWS identity, not whatever creds happen to be in your shell</li>
  <li>That identity should use short-lived credentials that refresh automatically, not static keys you rotate manually</li>
  <li>And (ideally) the permissions on that identity are scoped to what the agent actually needs for the task at hand</li>
</ul>

<p>Personal opinion, but <a href="https://github.com/engseclabs/trailtool">trailtool</a> is an awesome way to handle the third point - see my <a href="https://engseclabs.com/blog/cloudtrail-for-ai-agents#defining-least-privilege-iam-policies-for-roles">blog post</a> for how TrailTool can generate least privilege IAM roles from observed CloudTrail activity. For the first two, I ran into a tool called <a href="https://github.com/61418/elhaz">elhaz</a> that solves the credentials part nicely. By tying them together, you can get a sandboxed agent scoped to a specific AWS identity, with automatically refreshing short-lived credentials, locked down to least privilege.</p>

<h2 id="elhaz">elhaz</h2>

<p>Elhaz is a local credential broker daemon that manages in-memory AWS STS credentials and serves them over a Unix socket at <code class="language-plaintext highlighter-rouge">~/.elhaz/sock/daemon.sock</code>. A core concept is a <em>config</em> - a saved set of parameters for assuming an AWS role (the role ARN, session configuration, etc.). When you run <code class="language-plaintext highlighter-rouge">elhaz daemon add -n my-config</code> you’re telling the daemon to actually assume that role using your active AWS cred chain and start managing the session. From that point the daemon holds live, automatically refreshing credentials for that config in memory.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create a config for the role you want the agent to assume</span>
elhaz config add

<span class="c"># Start the daemon and initialize the session</span>
elhaz daemon start
elhaz daemon add <span class="nt">-n</span> my-agent-role

<span class="c"># Verify it's working</span>
elhaz <span class="nb">whoami</span> <span class="nt">-n</span> my-agent-role
</code></pre></div></div>

<p>A very cool aspect of elhaz is that it exposes the assumed role credentials via Unix socket IPC rather than through the standard env var or filesystem approaches that underlie the AWS credential chain. That’s what makes the sandboxing story clean.</p>

<h3 id="sandboxing-aws-credentials-in-a-container">Sandboxing AWS credentials in a container</h3>

<p>Agent sandboxing is moving fast and anything I say about specific tools will get stale quickly. There’s a whole ecosystem here: <a href="https://simonwillison.net/2025/Oct/22/living-dangerously-with-claude/">sandbox-exec</a>, <a href="https://github.com/containers/bubblewrap">Bubblewrap</a>, <a href="https://github.com/Use-Tusk/fence">fence</a>, <a href="https://developers.openai.com/codex/concepts/sandboxing">Codex</a> and <a href="https://code.claude.com/docs/en/sandboxing">Claude</a> sandboxes, and more landing constantly. The point is that these mechanisms restrict agent access to resources using OS-level primitives. The credential delivery mechanism needs to fit those affordances, and it doesn’t seem like env vars, files, and ports fit the bill. In my research, I’ve been using Docker for agent isolation, and what follows is what I found, listed in increasing order of how well each approach actually holds up.</p>

<p><em>Note</em>: this assumes you’re using ephemeral, role-based credentials throughout. IAM user access keys are static, long-lived, and <a href="https://aws.amazon.com/blogs/security/practical-steps-to-minimize-key-exposure-using-aws-security-services/">the top entry point for attackers</a> when they leak.</p>

<h3 id="environment-variables">Environment variables</h3>

<p>When you need isolated creds, you might consider reaching for <code class="language-plaintext highlighter-rouge">aws-vault exec</code> or <code class="language-plaintext highlighter-rouge">aws configure export-credentials</code> (<code class="language-plaintext highlighter-rouge">elhaz</code> also supports <code class="language-plaintext highlighter-rouge">export</code>), capturing the output, and injecting it into the container. It works for a one-off, but those are snapshot credentials. STS session tokens typically expire in an hour, at which point your long-running agent session breaks and you’re back to manually refreshing and re-injecting. You may be able to wire something like this up where the container stays updated over time, but it’s not a hands-off approach.</p>
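
<p>As a concrete sketch of the snapshot approach (and why it decays), something like this works until the STS session expires; the profile name and image are placeholders, and <code class="language-plaintext highlighter-rouge">export-credentials</code> needs AWS CLI v2:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Snapshot the current session creds from a host profile into env vars
eval "$(aws configure export-credentials --profile my-agent-role --format env)"

# Pass the snapshot into the container as plain environment variables
docker run --rm \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN \
  -e AWS_DEFAULT_REGION=us-east-1 \
  amazon/aws-cli sts get-caller-identity

# Works now; breaks when the session token expires and nothing refreshes it
</code></pre></div></div>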

<h3 id="credential-files">Credential files</h3>

<p>Mounting <code class="language-plaintext highlighter-rouge">~/.aws/</code> into the container is a step up since the SDK can re-read profiles and SSO tokens directly. But it gives the agent access to every profile in that directory, not just the one you intended. And SSO tokens cached on the host are not refreshable from inside the container without a browser. In a multi-agent setup the access boundary is too coarse: you can’t easily say “Agent A gets dev access and Agent B gets prod access” without a lot of manual wiring.</p>
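
<p>For reference, the coarse version is a single bind mount; every profile and cached SSO token in the directory rides along (profile name and image are placeholders):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Mount the whole AWS config/credentials directory, read-only
docker run --rm \
  -v ~/.aws:/root/.aws:ro \
  -e AWS_PROFILE=dev \
  amazon/aws-cli sts get-caller-identity

# The agent can switch to any other profile in ~/.aws/config,
# and cached SSO tokens can't be refreshed without a browser on the host
</code></pre></div></div>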

<p>One note: Claude Code has a <a href="https://code.claude.com/docs/en/sandboxing">native sandbox</a> worth understanding here, because it does not solve the credential isolation problem. The sandbox restricts writes to an allowlist of paths and filters outbound HTTP through a proxy. Reads are unrestricted by default. That means Claude can read your entire home directory out of the box, including <code class="language-plaintext highlighter-rouge">~/.aws/credentials</code> etc. You can add <code class="language-plaintext highlighter-rouge">~/.aws</code> to a deny list, but it doesn’t hold: the deny list applies to Claude’s built-in file read tool, not bash subprocesses. Running <code class="language-plaintext highlighter-rouge">cat ~/.aws/credentials</code> via Bash succeeds regardless.</p>

<h3 id="metadata-emulation">Metadata emulation</h3>

<p>The more sophisticated approach is emulating the <a href="https://www.wiz.io/blog/the-many-ways-to-obtain-credentials-in-aws">AWS instance metadata service</a> locally. This is how ECS delivers credentials to tasks in production: the SDK makes HTTP requests to a well-known local endpoint and gets fresh credentials back on demand. Tools like aws-vault replicate this locally via <code class="language-plaintext highlighter-rouge">--server</code> mode, pointing <code class="language-plaintext highlighter-rouge">AWS_CONTAINER_CREDENTIALS_FULL_URI</code> at a local HTTP server that vends credentials.</p>
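
<p>The mechanism itself is just an environment variable pointing at a local HTTP endpoint that the SDK polls for fresh credentials. A sketch of the shape (the port, path, and the server behind them are placeholders; tools like aws-vault wire this up for you):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The SDK's container credential provider fetches creds from this endpoint on demand
export AWS_CONTAINER_CREDENTIALS_FULL_URI=http://127.0.0.1:9099/creds

# The endpoint returns ECS-style JSON: AccessKeyId, SecretAccessKey, Token, Expiration
curl -s "$AWS_CONTAINER_CREDENTIALS_FULL_URI"

# Any SDK/CLI call in this environment now pulls (and re-pulls) creds from that endpoint
aws sts get-caller-identity
</code></pre></div></div>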

<p>The problem is that this doesn’t work on macOS, and the workarounds are dead ends. Docker Desktop runs the engine inside a Linux VM, so <code class="language-plaintext highlighter-rouge">--network host</code> reaches the VM’s loopback, not the Mac’s (<a href="https://docs.docker.com/engine/network/drivers/host/">Docker host networking doesn’t work on macOS</a>). And even if you could route it, the AWS SDK has a hardcoded allowlist for HTTP credential endpoints: only <code class="language-plaintext highlighter-rouge">localhost</code>, <code class="language-plaintext highlighter-rouge">127.0.0.1</code>, and the ECS/EKS metadata IPs are permitted. Adding <code class="language-plaintext highlighter-rouge">host.docker.internal</code> was <a href="https://github.com/aws/aws-sdk/issues/562">requested</a> and closed as “not planned.” The aws-vault issue tracker has <a href="https://github.com/99designs/aws-vault/issues/767">an open thread</a> on this that has been sitting unresolved for years.</p>

<h3 id="unix-sockets">Unix sockets</h3>

<p>Unix sockets are a convenient way to express dynamic AWS credential access in terms of filesystem access. If you don’t explicitly mount a socket into a container, that socket doesn’t exist in that container’s universe. Run two elhaz configs, get two socket files, and the volume mount itself becomes the authorization decision. Agent B cannot reach Agent A’s credentials because the socket paths are partitioned.</p>

<p>Here’s what it looks like in practice:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># On the host: start the daemon and assume the role</span>
elhaz daemon start
elhaz daemon add <span class="nt">-n</span> my-agent-role

<span class="c"># Run a container with only that socket mounted</span>
docker run <span class="nt">--rm</span> <span class="se">\</span>
  <span class="nt">-v</span> ~/.elhaz/sock/daemon.sock:/tmp/elhaz.sock <span class="se">\</span>
  <span class="nt">-e</span> <span class="nv">AWS_DEFAULT_REGION</span><span class="o">=</span>us-east-1 <span class="se">\</span>
  <span class="nt">-e</span> <span class="nv">AWS_PROFILE</span><span class="o">=</span>elhaz <span class="se">\</span>
  python:3.12-slim <span class="se">\</span>
  bash <span class="nt">-c</span> <span class="s1">'
pip install elhaz awscli -q &amp;&amp;
mkdir -p ~/.aws &amp;&amp;
cat &gt; ~/.aws/config &lt;&lt;EOF
[profile elhaz]
credential_process = elhaz --socket-path /tmp/elhaz.sock export --format credential-process -n my-agent-role
region = us-east-1
EOF
aws sts get-caller-identity
'</span>
</code></pre></div></div>

<p>The container never touches your <code class="language-plaintext highlighter-rouge">~/.aws</code> directory or environment variables. Credentials are scoped to a single named config, automatically refreshed by the daemon, and never written to disk inside the container. The same pattern likely works with non-Docker sandboxing approaches since Unix socket access is controlled by filesystem permissions, a primitive that most sandboxing tools expose in some form (swapping Docker for Bubblewrap, testing on different Linux platforms, etc. is left as an exercise to the reader).</p>]]></content><author><name>Alex Smolen</name></author><summary type="html"><![CDATA[How to give a local coding agent exactly the AWS access it needs, nothing more, using elhaz]]></summary></entry><entry><title type="html">TrailTool: CloudTrail for AI Agents</title><link href="https://engseclabs.com/blog/cloudtrail-for-ai-agents/" rel="alternate" type="text/html" title="TrailTool: CloudTrail for AI Agents" /><published>2026-03-23T00:00:00+00:00</published><updated>2026-03-23T00:00:00+00:00</updated><id>https://engseclabs.com/blog/cloudtrail-for-ai-agents</id><content type="html" xml:base="https://engseclabs.com/blog/cloudtrail-for-ai-agents/"><![CDATA[<p>Running security for AWS-centric companies means getting down and dirty with CloudTrail. Not only will you crawl the logs with SIEMs to “find the baddies” via IoCs; as a proactive engineering-focused security team, you’ll rely on them to implement access control, validate changes, and debug problems.</p>

<p>In the agentic AI era<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> these tasks can be carried out by Claude et al., but CloudTrail logs are hard to synthesize. Answering “did <code class="language-plaintext highlighter-rouge">contractor@company.com</code> update this S3 bucket in the last 30 days?” could mean sifting through terabytes of logs, toiling with custom queries, and correlating role assumptions by hand. You can connect agents with <a href="https://aws.amazon.com/about-aws/whats-new/2025/09/aws-cloudtrail-mcp-server-enhanced-security-analysis/">MCP tooling</a> and build skills to standardize query patterns, but it’s a fair amount of non-trivial configuration and undifferentiated heavy lifting. Every query (especially loops where the agent needs to learn/fix the syntax) wastes time and tokens while bloating the context window.</p>

<p>Enter <a href="https://github.com/engseclabs/trailtool">TrailTool</a>. The big idea is to process (Lambda) and cache (DynamoDB) CloudTrail based on access patterns, grouping events into entities (People, Sessions, Roles, Services, Resources). When you want to ask common questions (<em>what has this role accessed?</em>, <em>who accessed this resource?</em>, etc.) you get quick, trustable answers (<code class="language-plaintext highlighter-rouge">trailtool roles detail &lt;RoleName&gt;</code>). TrailTool is open source - deploy the Ingestor Lambda via SAM and your agent (or you) can query with a CLI that works with standard AWS credentials.</p>
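
<p>For a flavor of the CLI, here’s the role-level query mentioned above against a role that shows up later in this post; it runs with whatever standard AWS credentials you already have:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># One entity-level answer instead of a pile of custom log queries:
# everything this role has touched, pre-aggregated at ingest time
trailtool roles detail SandboxPowerUser
</code></pre></div></div>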

<p>Here are four workflows I’ve been running with it. For each one, there’s a prompt I gave Claude Code and the resulting session transcript.</p>

<h2 id="detecting-clickops-modifications">Detecting “ClickOps” modifications</h2>

<p>Ah, ClickOps, the primordial ooze from which fully realized cloud services emerge. Who amongst us hasn’t felt the rush, the thrill, of building software with only a faint understanding of the resources being created using a wizard UI and a prayer? It’s the original way to vibe software development.</p>

<p>If you do cloud security, you know that ClickOps resources bypass some kinda important security mechanisms like “change control” and “cloud hardening standards.” They may represent a drift from Infrastructure as Code that needs to be rectified. Or they may represent an opportunity to nudge someone onto an IaC pattern. At the very least, it’s an opportunity to review the resource for security best practices.</p>

<p><strong>Prompt:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Use trailtool to identify resources created or modified via ClickOps over the last 30 days, import them into Terraform state, and create the relevant Terraform configuration.
</code></pre></div></div>

<details>
<summary><strong>Session transcript</strong> (<a href="https://gist.github.com/alsmola/e51a5a100a4c537b8d19f2366f593b1a">view gist</a>)</summary>

<script src="https://gist.github.com/alsmola/e51a5a100a4c537b8d19f2366f593b1a.js"> </script>

</details>

<p>Detecting ClickOps in CloudTrail means filtering out traffic that happens via a web browser user agent, finding the mutating actions, pulling out the resource names, and sifting everything down to a list that can be sliced and diced by who, what service, what region. TrailTool has already done that work at ingest time, so the agent skips straight to reasoning about the results.</p>

<p>This prompt assumes you’re in catch-up mode and people are spinning stuff up without IaC. Rather than nagging them, you can fire something like this off to clean up after them. You could run a longer-lived agent looking for ClickOps in real time, nagging folks over Slack. Of course, the long-term fix is IAM policies strict enough to prevent it in the first place, which brings us to the next case.</p>

<h2 id="defining-least-privilege-iam-policies-for-roles">Defining least-privilege IAM policies for roles</h2>

<p><a href="https://alsmola.medium.com/designing-least-privilege-aws-iam-policies-for-people-ea4185c8a44b">Least privilege for IAM</a> is definitely a journey, not a destination. Especially when you consider humans, with all of their non-deterministic behavior and their “I’m an admin, get me out of here” entitlement. We all agree that things should be locked down, until we can’t do something we need for our job.</p>

<p>Often permissions start with the block of stone known as <code class="language-plaintext highlighter-rouge">AdministratorAccess</code>, which is then whittled by IAM artisans into an artful figure of “enough removed to satisfy security, enough retained to avoid complaints.” Like Michelangelo, who (apocryphally) stated that the creation of his masterpiece was simple: “I just removed everything that is not David.”</p>

<p>How do we know what to remove? CloudTrail is a pretty good way to figure it out. Generating least-privilege policy from CloudTrail logs is non-trivial, but tools like <a href="https://aws.amazon.com/iam/access-analyzer/">IAM Access Analyzer</a> and <a href="https://github.com/iann0036/iamlive">iamlive</a> have mapped out this path. TrailTool’s session-level analysis maps log lines into coherent narratives about what a user did over the course of an authenticated login, and uses iamlive mappings to translate that into IAM policy actions.</p>

<p><strong>Prompt:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Use trailtool to remove unused permissions from IAM policies for SandboxPowerUser role.
</code></pre></div></div>

<details>
<summary><strong>Session transcript</strong> (<a href="https://gist.github.com/alsmola/be4b61de4a1d514d23d359a3a4c606e3">view gist</a>)</summary>

<script src="https://gist.github.com/alsmola/be4b61de4a1d514d23d359a3a4c606e3.js"> </script>

</details>

<p>Use cases come and go, and access granted before may no longer be needed. You can run this as a recurring workflow: generate a policy, create a PR, review, deploy, repeat. Of course, this leads to what happens when you <em>over</em>-tighten, or when there are new use cases.</p>

<h2 id="responding-to-accessdenied-errors">Responding to AccessDenied errors</h2>

<p>Tightening down IAM policies is half the battle. The other half is knowing what to do when a role starts throwing AccessDenied errors. In my experience, they’re often a feedback signal - someone tried to do something legitimate and got blocked. Rather than having them file a ticket or ping you in Slack, use an agent to automatically identify the errors and draft the fix.</p>

<p><strong>Prompt:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Use trailtool to add permissions for events that received "AccessDenied" errors with IAM policies associated with SandboxPowerUser role.
</code></pre></div></div>

<details>
<summary><strong>Session transcript</strong> (<a href="https://gist.github.com/alsmola/3ab81be2dca0e1825548346cef3642e8">view gist</a>)</summary>

<script src="https://gist.github.com/alsmola/3ab81be2dca0e1825548346cef3642e8.js"> </script>

</details>

<p>The idea here is that least privilege needs to evolve, and someone bumping up against a permissions error is a signal that policies need to be loosened. A human-in-the-loop is a good idea here, as with all permission changes, but shortening the loop from “hey I tried to do this thing and I couldn’t” to “fixed, try again” by plumbing together all the implicit data into an agent-generated PR that’s ready to merge means less back-and-forth and faster forward progress for developers.</p>

<h2 id="validating-emergency-break-glass-access-justifications">Validating emergency break-glass access justifications</h2>

<p>One common pattern for implementing IAM access is the break-glass case. If there’s an incident or high-priority operation, an operator can ask for an exception, usually accompanied by a justification. The approver typically uses this justification as context for their decision, but:</p>
<ul>
  <li>The justification may be extremely brief</li>
  <li>The operator may end up doing something different after they receive the access</li>
</ul>

<p><strong>Prompt:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Use trailtool to investigate the session associated with the user alex@engseclabs.com assuming BreakGlassEmergency account around 8AM March 18, 2026 to see how it aligns with this justification: I was investigating an incident and needed to access SSM session for production instance.
</code></pre></div></div>

<details>
<summary><strong>Session transcript</strong> (<a href="https://gist.github.com/alsmola/91bd796e3ade4f33884ba9e9b5079b3c">view gist</a>)</summary>

<script src="https://gist.github.com/alsmola/91bd796e3ade4f33884ba9e9b5079b3c.js"> </script>

</details>

<p>This is where session-level analysis is useful. TrailTool lets you summarize activity at the session level, without manually correlating role assumptions and API calls across raw log files. By comparing this with stated justifications, it highlights discrepancies that might represent unwanted system changes or even attacks. It eliminates one of the foibles of <a href="https://alsmola.medium.com/access-approvals-considered-harmful-f24fa2fe2f87">access approvals</a>, closing the loop to ensure people do what they say they will.</p>

<h2 id="check-out-trailtool">Check out TrailTool</h2>

<p><a href="https://github.com/engseclabs/trailtool">TrailTool</a> is open source, so you can deploy it to your own account and start querying with the CLI. Or, you can check out <a href="https://trailtool.io">trailtool.io</a> for a more full-featured hosted version.</p>

<p>If you’re building AI-driven security workflows and CloudTrail analytics are slowing you down, let’s talk. Connect on <a href="https://www.linkedin.com/in/alex-smolen-8a59a31">LinkedIn</a> or <a href="https://infosec.exchange/@alsmola">Mastodon</a>.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I’ve apparently been trying to build this since 2017 — <a href="https://github.com/alsmola/cloudtrail-daily">cloudtrail-daily</a> was a Go CLI that walked S3 and printed a People → Services → Actions summary, which is basically TrailTool minus the persistence layer and eight years of IAM scar tissue. The problem hasn’t changed; just the consumer. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Alex Smolen</name></author><summary type="html"><![CDATA[TrailTool pre-aggregates CloudTrail events into entities, making AI-driven security analysis fast, cheap, and actionable.]]></summary></entry><entry><title type="html">GraphGRC v2: SOC 2 Compliance in GitHub</title><link href="https://engseclabs.com/blog/graphgrc-v2-soc2-compliance-in-github/" rel="alternate" type="text/html" title="GraphGRC v2: SOC 2 Compliance in GitHub" /><published>2026-01-13T00:00:00+00:00</published><updated>2026-01-13T00:00:00+00:00</updated><id>https://engseclabs.com/blog/graphgrc-v2-soc2-compliance-in-github</id><content type="html" xml:base="https://engseclabs.com/blog/graphgrc-v2-soc2-compliance-in-github/"><![CDATA[<blockquote>
  <p><strong>tl;dr</strong> — GraphGRC is available at <a href="https://github.com/engseclabs/graphgrc">github.com/engseclabs/graphgrc</a>. Fork it to get pre-written SOC 2 controls, policies, processes, and standards in Markdown. GitHub Actions validate your docs and generate a compliance site. Free and open source.</p>
</blockquote>

<p>You need SOC 2 for your first enterprise deal. The sales team says you need it yesterday. You look at Vanta or Drata - $12,000 per year minimum, probably more. They’ll give you pre-written policies and a nice dashboard. They’ll also lock all your compliance documentation in their proprietary platform, where it lives as long as you keep paying.</p>

<p>The alternative is Google Docs or Confluence. You copy policies from the internet, paste them into a folder somewhere, and hope your auditor accepts them. There’s no structure, no validation, no way to ensure your documentation stays current. When someone asks “what’s our incident response process?” you send them a link to a doc that was last updated 18 months ago.</p>

<p class="blog-lede">Your compliance documentation should live in version control, not a SaaS platform you're renting.</p>

<h2 id="grc-as-code">GRC as code</h2>

<p>GraphGRC is compliance documentation in GitHub. Fork the repo and you get pre-written controls, policies, processes, and standards. Everything’s in Markdown with semantic linking between documents. GitHub Actions validate your docs and generate a static compliance site. Free and open source.</p>

<p>This isn’t a full GRC platform - there’s no vendor management module, no training attestation tracker, no fancy dashboard. It’s the foundational documentation you need to pass SOC 2, structured in a way that actually makes sense and validates itself automatically.</p>

<h2 id="how-it-works">How it works</h2>

<p>The documentation model has four layers that map to each other:</p>

<p><strong>Controls</strong> map SOC 2 requirements to your actual documentation. Each control references the policies, processes, and standards that satisfy it. These are the things your auditor cares about.</p>

<p><strong>Standards</strong> describe technical requirements. AWS security baseline. GitHub security configuration. Laptop security standards. These are the concrete security configurations you need to maintain.</p>

<p><strong>Processes</strong> are step-by-step procedures. Incident response runbook. Access review process. Onboarding and offboarding workflows. These tell people what to do.</p>

<p><strong>Policies</strong> set objectives for different roles. What the security team expects from engineers, from support, from finance. Not generic boilerplate about password complexity - specific guidance about handling customer data, using AI tools, managing secrets.</p>

<p>Everything links together via heading-level anchors. A control references specific sections of policies and processes. A process references relevant standards. Change a policy and the validation checks ensure the control mappings stay accurate. Update a process and automated reviews flag when it needs attention.</p>

<p>The validation runs in GitHub Actions. It checks that every control maps to documentation that exists. It verifies review dates are current and owners are assigned. It ensures your processes reference standards that are actually maintained. When you open a pull request to update documentation, the checks tell you if you broke anything.</p>
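
<p>The real checks live in the repo’s Actions workflows, but the flavor is easy to sketch. Something like this (illustrative only, not GraphGRC’s actual script; it assumes controls live in <code class="language-plaintext highlighter-rouge">controls/</code> and link to sibling directories with relative paths) fails the build when a control references a document that doesn’t exist:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env bash
# Illustrative link check: fail if a control references a missing doc
set -eu
broken=0
for f in controls/*.md; do
  # Pull relative markdown link targets like (../policies/engineering.md#secrets)
  for target in $(grep -oE '\]\(\.\./[^)#]+' "$f" | sed 's/^](//'); do
    if [ ! -f "$(dirname "$f")/$target" ]; then
      echo "Broken reference in $f: $target"
      broken=1
    fi
  done
done
exit "$broken"
</code></pre></div></div>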

<h2 id="who-this-is-for">Who this is for</h2>

<p>This works if you’re preparing for SOC 2 and don’t want to pay for Vanta. It works if your security team values owning your compliance documentation and wants it in version control. It works if your team is comfortable with Git and Markdown workflows.</p>

<p>It doesn’t work if you need extensive hand-holding or want a full-featured GRC platform with vendor questionnaires and training modules. It requires some technical comfort - you need to be able to fork a repo, edit Markdown files, and run GitHub Actions.</p>

<p>The documentation is still being validated. I’ve used this model at multiple companies, but GraphGRC v2 as an open source project is new. Feedback is welcome. If you find gaps or things that don’t make sense, open an issue.</p>

<h2 id="try-it">Try it</h2>

<p>Check it out on GitHub: <a href="https://github.com/engseclabs/graphgrc">github.com/engseclabs/graphgrc</a></p>

<p>Fork it, customize it for your company, let me know what’s missing. Open to contributions and feedback. If you have questions or want to talk through whether this makes sense for your situation, reach out on <a href="https://www.linkedin.com/in/alexsmolen/">LinkedIn</a> or <a href="https://infosec.exchange/@alexsmolen">Mastodon</a>.</p>

<p>Your compliance documentation should be yours. This is a start.</p>]]></content><author><name>Alex Smolen</name></author><summary type="html"><![CDATA[GRC tools like Vanta cost $12K+/year and lock your compliance docs in proprietary systems. GraphGRC v2 gives you SOC 2 documentation in GitHub - pre-written controls, policies, and processes in Markdown with automated validation. Free and open source.]]></summary></entry><entry><title type="html">Fix Dependabot Security Alerts That Don’t Open Pull Requests</title><link href="https://engseclabs.com/blog/fix-dependabot-security-alerts-when-prs-fail/" rel="alternate" type="text/html" title="Fix Dependabot Security Alerts That Don’t Open Pull Requests" /><published>2026-01-10T00:00:00+00:00</published><updated>2026-01-10T00:00:00+00:00</updated><id>https://engseclabs.com/blog/fix-dependabot-security-alerts-when-prs-fail</id><content type="html" xml:base="https://engseclabs.com/blog/fix-dependabot-security-alerts-when-prs-fail/"><![CDATA[<p>Dependabot catches security vulnerabilities and opens pull requests to fix them. Except when it doesn’t. If Dependabot can’t create a PR for a security alert, <a href="https://github.com/engseclabs/dependabot-wolf/">dependabot-wolf</a> automatically sends the details to Copilot to figure it out. It’s a GitHub Action that monitors Dependabot security alerts and automatically sends any without PRs to GitHub Copilot for resolution.</p>

<p><strong>How it works:</strong></p>

<ol>
  <li>Checks for Dependabot security alerts that don’t have pull requests</li>
  <li>Extracts the vulnerability details and dependency conflict information</li>
  <li>Sends the context to an issue and assigns Copilot</li>
</ol>

<p>You need to create a fine-grained PAT and store it as an Actions secret, with permissions to create issues and assign them to Copilot.</p>
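
<p>The setup is a one-liner with the GitHub CLI; the secret name below is a placeholder, so use whatever name the action’s README expects:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Store the fine-grained PAT as an Actions secret (prompts for the value,
# or pipe it in on stdin). DEPENDABOT_WOLF_TOKEN is a placeholder name.
gh secret set DEPENDABOT_WOLF_TOKEN --repo your-org/your-repo
</code></pre></div></div>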

<h2 id="installation">Installation</h2>

<p>Add the action to your repository’s workflow:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="na">name</span><span class="pi">:</span> <span class="s">Dependabot Wolf</span>
<span class="na">on</span><span class="pi">:</span>
  <span class="na">schedule</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">cron</span><span class="pi">:</span> <span class="s1">'</span><span class="s">0</span><span class="nv"> </span><span class="s">0</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*'</span>  <span class="c1"># Daily check</span>
  <span class="na">workflow_dispatch</span><span class="pi">:</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">check-dependabot</span><span class="pi">:</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>
    <span class="na">permissions</span><span class="pi">:</span>
      <span class="na">contents</span><span class="pi">:</span> <span class="s">read</span>
      <span class="na">security-events</span><span class="pi">:</span> <span class="s">read</span>
      <span class="na">issues</span><span class="pi">:</span> <span class="s">write</span>
    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">engseclabs/dependabot-wolf@v1</span>
</code></pre></div></div>
<hr />

<p><strong>Found this useful?</strong> Check out the <a href="https://github.com/engseclabs/dependabot-wolf/">repo</a> or let me know if you run into issues.</p>]]></content><author><name>Alex Smolen</name></author><category term="dependabot" /><category term="github-security" /><category term="github-actions" /><category term="automation" /><category term="copilot" /><summary type="html"><![CDATA[Dependabot throws security alerts but sometimes can't create pull requests. Here's a GitHub Action that automatically sends failed alerts to Copilot for resolution.]]></summary></entry><entry><title type="html">Backyard APT: A Raccoon Story</title><link href="https://engseclabs.com/blog/raccoon-diaries/" rel="alternate" type="text/html" title="Backyard APT: A Raccoon Story" /><published>2025-11-07T00:00:00+00:00</published><updated>2025-11-07T00:00:00+00:00</updated><id>https://engseclabs.com/blog/raccoon-diaries</id><content type="html" xml:base="https://engseclabs.com/blog/raccoon-diaries/"><![CDATA[<p>Living in urban coastal California means you’re never far from an APT: an Advanced Persistent Trashpanda. Cute, yes. Harmless? Not even close. In my battle against backyard raccoons, I learned some things about security. I learned some things about myself. I learned some things about MCP servers and microscopic nematodes. Allow me to tell you my story.</p>

<p><img src="/assets/images/raccoon/raccoon.png" alt="Raccon" /></p>

<p><em>I Can Haz Grub?</em></p>

<p>Every fall, we’d observe their nocturnal wanderings through our little grass patch of a backyard. Sometimes they’d bring the whole family on their explorations - furry assorted-size blobs clambering over the fence. The raccoons began to wear out their welcome, though, when they started “rolling the grass” - tearing up big chunks of lawn to (as I learned from our gardener) look for grubs growing beneath the sod until their emergence as mature bugs. This foraging made our lawn an eyesore and muddy mess.</p>

<p><img src="/assets/images/raccoon/grass.jpg" alt="Grass" /></p>

<p><em>Looking for grubs in all the wrong places</em></p>

<p>Muddy lawns made me grumpy, but hardly battle-ready. That changed the night our beloved chihuahua Jolene darted out between my legs through a cracked-open back door to “sweep the perimeter” - part of her canine rituals. Perhaps she had sensed the family of raccoons perusing the yard at that moment. Once she spotted them, she immediately gave chase. The largest raccoon broke towards her and began a fierce attack as Jolene yelped in fear and pain.</p>

<p>Panicking, I leapt off the deck onto the lawn and gave chase. The raccoon had Jo and was heading under the patio. I bent down, snatched Jo from its clutches, and issued a stern Birkenstock-ed kick to the intruder.</p>

<video autoplay="" loop="" muted="" playsinline="" preload="metadata">
  <source src="/assets/video/raccoon.mp4" type="video/mp4" />
  Your browser doesn't support embedded videos.
</video>

<p><em>The fateful encounter</em></p>

<p>As I carried Jo back inside, the raccoon followed me, glaring at us from the back door as if to continue the altercation. My inner Jersey Shore came out as I raged - “you wanna go bro?” - but my wife wisely encouraged me to keep the door shut and alerted me to Jo’s wound: a bite mark opened on her belly. We spent that night at an emergency vet (waiting in our car, as was the style in those COVID-era times) before Jojo returned to us, stitched up, rabies-inoculated, and more than a little terrified by her assault.</p>

<p>I’m generally a peacenik, but I was stirred with a revenge-fueled craving for justice. First you come for my lawn, then you come for my dog? I made it a goal - a mission, a blood pact, even - to keep these pests out of my backyard. As a security thought leader, I first wanted to evaluate the effectiveness of my defenses. So I put out a camera to see the action.</p>

<p><img src="/assets/images/raccoon/jo.jpg" alt="Jo" /></p>

<p><em>Jo investigating the surveillance system</em></p>

<h2 id="security-lesson-1-you-cant-stop-advanced-threats-with-basic-defense">Security Lesson #1: You Can’t Stop Advanced Threats with Basic Defense</h2>

<p>I searched Amazon for “raccoon deterrent” and purchased a few different ultrasonic/flashing light yard stakes. Well, if you’re in my shoes in the future, let me save you a few dollars and suggest you skip these devices. The raccoons looked at my fancy deterrents and kept right on digging. Much like deploying a flashy new security product without tuning, my defenses looked good on paper and did nothing in the field. Maybe these things work for rural raccoons unaccustomed to strange noises and signs of humans, but my backyard baby bug burglars were entirely nonplussed.</p>

<p><img src="/assets/images/raccoon/motion.png" alt="Motion detector" /></p>

<p><em>Ornamental e-waste</em></p>

<h2 id="security-lesson-2-too-many-false-positives-spoil-the-detection">Security Lesson #2: Too Many False Positives Spoil the Detection</h2>

<p>My next defensive strategy: spray them with water. I bought a motion-activated sprinkler to soak their spirits. I did get a few good direct shots. But more than once, I took out the trash and forgot to disarm, activating the system and soaking myself - a serious case of “alert fatigue”.</p>

<p><img src="/assets/images/raccoon/sprinkler.png" alt="Motion-activated sprinkler" /></p>

<p><em>My petard on which I was hoist</em></p>

<p>I also found water just wasn’t a big deal to these raccoons. I’d see them outside and accost them with a hose and spray nozzle: full blast, point blank. They would give no ground, just stare at me with their glowing eyes from the tree branches.</p>

<p>I commiserated with neighbors, consulted message boards, and talked with friends. A buddy whose family grew up on a farm suggested cages to trap the raccoons. “Well what do you do then?”, I asked. “Well,” he said, “my dad was a bit of a softie and would take them somewhere far away and release them. My mom, though, she would go get the gun and…” I blanched at the notion of dispatching anything more sentient than a housefly, so blowing a raccoon’s brains out was a hard no. Also, the idea of Ubering raccoon after raccoon from Oakland to Vacaville didn’t seem like a sustainable strategy.</p>

<h2 id="security-lesson-3-custom-code-to-wire-defensive-controls-together-is-a-superpower">Security Lesson #3: Custom Code to Wire Defensive Controls Together is a Superpower</h2>

<p>I’d set up Home Assistant and been tinkering with a few different motion sensors and smart switches. That’s when I got an idea - what if I connected something that moved to my own motion sensor? This would also let me build increasingly sophisticated detection and response flows.</p>

<p>My first challenge: what could I plug in outdoors that would move around and be threatening to a raccoon? Of course - those wild wacky inflatable arm guys! It’s basically just a fan and a sock with both ends open. I had to solve two problems. One was figuring out the right search term to buy one (“air dancer”), and the second was finding a supplier that had the right size - “scare a backyard raccoon”-size rather than “pay attention to my used car lot from the freeway”-size.</p>

<p><img src="/assets/images/raccoon/airdancer.png" alt="Air dancer" /></p>

<p><em>Like a scarecrow, but make it wacky</em></p>

<p>After building a few automations to wire the motion sensor to the plug (plus some patio lights) and turn the whole formula on only at night, I watched in glee as raccoons startled at the initial whir of the fan and then retreated from the wacky air dancing.</p>

<p>Just when I thought my problem was solved, a new wrinkle emerged. While the raccoons had made the “flight” decision, another backyard visitor was provoked to aggression - the normally placid possum. Reviewing my video audit log footage one night, I was shocked to see a possum react to my air dancer not with deference, but instead grab the flapping sock in its mouth and carry it off to somewhere I never found it.</p>

<video autoplay="" loop="" muted="" playsinline="" preload="metadata">
  <source src="/assets/video/Possum.mp4" type="video/mp4" />
  Your browser doesn't support embedded videos.
</video>

<p><em>Fearless possum claiming its prize</em></p>

<p>Supposing this may have been a one-time act of bravery, I replaced the lost investment and purchased a spooky ghost version that I believe will intimidate even the most resolute possum.</p>

<video autoplay="" loop="" muted="" playsinline="" preload="metadata">
  <source src="/assets/video/ghost.mp4" type="video/mp4" />
  Your browser doesn't support embedded videos.
</video>

<p><em>The ghost air dancer - surely possum-proof</em></p>

<p>Through all of this, I’ve been using Home Assistant to power my surveillance and wildlife PsyOps architecture. In this AI era, I’ve started using Claude Code and the Home Assistant MCP server to make some sick dashboards.</p>

<p><img src="/assets/images/raccoon/dashboard.png" alt="Dashabord" /></p>

<p><em>Vibe coding and the vibe is - raccoons begone!</em></p>

<h2 id="security-lesson-4-remove-the-attacker-incentives">Security Lesson #4: Remove the Attacker Incentives</h2>

<p>Despite my creativity in keeping these garden grub guzzlers at bay, I’d still wander out some mornings to find a corner of lawn beyond sensor distance torn up. This is where I discovered the most effective element of my defensive system. In a line of thinking that has deep profundity for all security endeavors, I realized the best defense isn’t higher fences. It’s changing what makes the target appealing in the first place. That’s true for raccoons and ransomware alike.</p>

<p>Toxic pesticides may have worked, but I didn’t want to poison my poor beset small dog. Instead, I identified glorious nematodes - microscopic organisms that feed on grubs and can be bought by the millions, mixed with water, and sprayed on the lawn.</p>

<p><img src="/assets/images/raccoon/nematodes.png" alt="Nematode" /></p>

<p><em>Microscopic, macro-effective</em></p>

<p>With these latest technological advantages, I believe I’ve won the cat and mouse game - or rather, the raccoon and person game - at least for now. And though there have been costs to my lawn’s integrity, my dog’s safety, and my sanity, I gained valuable security insights which I share with you. Good luck out there, whether your APTs wear hoodies or fur coats.</p>

<hr />

<iframe src="https://www.linkedin.com/embed/feed/update/urn:li:share:7393730688509906944?collapsed=1" height="682" width="504" frameborder="0" allowfullscreen="" title="Embedded post"></iframe>

<p><em>Sharing the raccoon saga on LinkedIn</em></p>]]></content><author><name>Alex Smolen</name></author><summary type="html"><![CDATA[Raccoons are both advanced and persistent threats. After one attacked my chihuahua Jolene, I declared war on my backyard invaders. Through ultrasonic deterrents, motion-activated sprinklers, and wacky inflatable air dancers, I learned critical security lessons - including that removing attacker incentives beats detection every time.]]></summary></entry><entry><title type="html">Data Retention is Two Different Problems</title><link href="https://engseclabs.com/blog/data-retention-two-different-problems/" rel="alternate" type="text/html" title="Data Retention is Two Different Problems" /><published>2025-10-23T00:00:00+00:00</published><updated>2025-10-23T00:00:00+00:00</updated><id>https://engseclabs.com/blog/data-retention-two-different-problems</id><content type="html" xml:base="https://engseclabs.com/blog/data-retention-two-different-problems/"><![CDATA[<p>A common artifact in security programs is something called a <em>data retention policy</em>. Like other policies, there’s a lot of jargon, but the centerpiece is typically a big table, with categories of data pointing to specific timeframes - for example:</p>

<table>
  <thead>
    <tr>
      <th>Data Category</th>
      <th>Retention Period</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Customer data</td>
      <td>30 days</td>
    </tr>
    <tr>
      <td>Employee records</td>
      <td>2 years</td>
    </tr>
    <tr>
      <td>Security event logs</td>
      <td>5 years</td>
    </tr>
    <tr>
      <td>Financial records</td>
      <td>7 years</td>
    </tr>
  </tbody>
</table>

<p>To me, these policies are confusing because they conflate two different goals:</p>

<ul>
  <li><strong>Data preservation</strong> for the <em>minimum</em> amount of time you need to keep important data. This applies to stuff like audit logs you need if you get hacked, or financial records you need to investigate fraud.</li>
  <li><strong>Data deletion</strong> for the <em>maximum</em> amount of time you’re allowed to keep personal data. This is driven by contracts and privacy laws which require you to delete personal data when you don’t need it anymore.</li>
</ul>

<p>It’s a good idea to write clear data retention policies to detangle these two goals. But also - you should provide advice to the people who are storing data, at design time, to help make sure you actually do the thing you said you’d do. Two simple and valuable ideas I’ve brought to numerous conversational tables to support this fidelity  are <em>immutability</em> (for preservation) and <em>ephemerality</em> (for deletion).</p>

<h2 id="can-it-be-immutable">Can it be immutable?</h2>

<p>Remember that for data you need to preserve - audit logs, security events, compliance records - the numeric duration specified in the policy is a minimum. In the median case, I propose any duration other than <em>indefinitely</em> is cargo-cult copy-paste with zero explicit rationale. Presuming it’s not privacy-risk-bearing personal data, other traditional reasons to delete old data - storage costs, database performance - matter far less than they used to. Cloud storage is cheap. Modern data warehouses handle large datasets well. The operational complexity of managing data lifecycle policies costs more than just keeping the data.</p>

<p>Beyond keeping archival data indefinitely, consider immutability - write once, never modify or delete. For data in cloud-based SaaS, indefinite is the default retention strategy, but immutability is more rigorous. The difference is ensuring the data is not just retained - it also can’t be deleted. The controls for immutability are built into cloud storage systems like S3: <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html">object lock</a> or <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html">versioning</a>. A more indirect route to immutability is restricting access with <a href="https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html">service control policies</a> or <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/MultiFactorAuthenticationDelete.html">MFA delete</a>. Combined with resilient, isolated backups, these controls uphold retention policies even in black swan events like data breaches or accidental outages.</p>
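
<p>In S3 terms, the rigorous version is a small amount of one-time setup; roughly (bucket name and retention window are placeholders):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Object Lock must be enabled at bucket creation
# (add --create-bucket-configuration LocationConstraint=&lt;region&gt; outside us-east-1)
aws s3api create-bucket --bucket my-audit-logs --object-lock-enabled-for-bucket

# Versioning comes along with Object Lock, but being explicit doesn't hurt
aws s3api put-bucket-versioning --bucket my-audit-logs \
  --versioning-configuration Status=Enabled

# COMPLIANCE mode: no principal, root included, can delete or overwrite a version
# before its retention window passes
aws s3api put-object-lock-configuration --bucket my-audit-logs \
  --object-lock-configuration '{"ObjectLockEnabled": "Enabled", "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}}}'
</code></pre></div></div>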

<h2 id="can-it-be-be-ephemeral">Can it be be ephemeral?</h2>

<p>If you store personal data, you should delete it when you lose the rights to it, lest you suffer the wrath of privacy legislators. You may lose your rights when a contract ends, or when someone sends you a deletion request.</p>

<p>The traditional approach requires building deletion machinery to plumb these requests through every system where personal data lives: your production database, analytics warehouse, ML training datasets, etc. Each system needs its own deletion endpoint, and you need orchestration to coordinate all of it. Yuck.</p>

<p>Ephemeral data solves this by design. If you automatically expire old data with something like <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html">S3 lifecycle management</a>, your retention period becomes your deletion mechanism. You retain the data for a little while, and then it naturally disappears. This doesn’t work for your main durable stores of personal data (e.g. your production database), but is worth considering everywhere else.</p>
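
<p>The ephemeral version is similarly small. A lifecycle rule like this one (bucket name, prefix, and window are placeholders) ages the data out with no deletion machinery to build or operate:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Expire objects 30 days after creation; S3 performs the deletes automatically
aws s3api put-bucket-lifecycle-configuration --bucket my-app-logs \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-personal-data",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Expiration": {"Days": 30}
    }]
  }'
</code></pre></div></div>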

<p>Like, you probably don’t want to selectively delete data from database backups. Same with logs that capture personal identifiers or detailed analytics that include user-level information. Aging this data out means no-touch compliance with your data retention policies.</p>

<p>For data lakes and warehouses with personal data AND long-term analytics requirements, enforce a lifecycle retention on fact tables with personal data, then apply pseudonymization or anonymization on transformed tables as part of your data pipeline. I’ll note that validating the effective removal of personal information during these data transformations is an exercise left to the not-faint-of-heart reader.</p>

<h2 id="design-for-immutability-or-ephemerality-from-the-start">Design for immutability or ephemerality from the start</h2>

<p>So, when you’re doing security design reviews or threat modeling for systems that will store data, ask these questions:</p>

<ul>
  <li><strong>For archival data</strong>: Can we make this immutable?</li>
  <li><strong>For personal data</strong>: Can we make this ephemeral?</li>
</ul>

<p>If the answer is yes, use storage systems with built-in immutability or expiration. Make the data lifecycle automatic. If the answer is no, then you need to build custom machinery - either deletion controls for durable personal data, or integrity and availability controls for archival data. Pointing people in the right direction early - during design, not after the system is built - is what keeps operational complexity at bay.</p>

<p>Remember - the hard part of security isn’t writing a policy. It’s the work that goes into making it reality.</p>]]></content><author><name>Alex Smolen</name></author><summary type="html"><![CDATA[Data retention covers two different problems - preservation (minimum time you must keep archival data) and deletion (maximum time you can keep personal data). They require opposite technical approaches - one prevents deletion, the other enforces it. The elegant solutions? Indefinite and ephemeral data.]]></summary></entry><entry><title type="html">Role-Based Everything: Aligning Access Control, Policies, and Training</title><link href="https://engseclabs.com/blog/role-based-everything/" rel="alternate" type="text/html" title="Role-Based Everything: Aligning Access Control, Policies, and Training" /><published>2025-10-14T00:00:00+00:00</published><updated>2025-10-14T00:00:00+00:00</updated><id>https://engseclabs.com/blog/role-based-everything</id><content type="html" xml:base="https://engseclabs.com/blog/role-based-everything/"><![CDATA[<p>Your new hire’s first day looks something like this: they sit through two hours of generic security training about phishing and password hygiene, they click through a 47-page acceptable use policy that covers everything from clean desk requirements to data retention, and then they get added to whatever Slack channels and tools their manager remembers to request. Three months later, they’re blocked on something, so they ping their manager who pings their manager, and suddenly they have production access. The security policies they acknowledged on day one? Nobody’s looked at those since.</p>

<p>This is broken. We have three systems (RBAC, security policies, and security training) that don’t talk to each other. Policies gather dust in Confluence. Training is a checkbox exercise everyone resents. And RBAC becomes a tangled mess of individual exceptions and Okta group rules hanging off job titles that hiring managers invented on the spot. There’s a better way. What if we stopped pretending these are separate problems?</p>

<p class="blog-lede">If someone has access to production, they have specific security responsibilities. Those responsibilities should be written as policies. Those policies should be their training.</p>

<h2 id="roles-describe-responsibilities-not-just-access">Roles describe responsibilities, not just access</h2>

<p>With great power comes great responsibility. If someone has elevated access, they have specific security obligations. Those obligations should be documented as role-specific security policies.</p>

<p>Security policies, <a href="https://csrc.nist.gov/glossary/term/security_policy">properly defined</a>, are rules that guide how people do their work securely. They’re not just documents for the security team. They’re objectives the security team has for everyone else - telling engineers how to handle secrets, telling support how to verify customer requests, telling finance how to process vendor payments securely.</p>

<p>These role-specific policies should form the basis of role-specific training. Training is just teaching people how to meet the objectives you’ve set for them. It all fits together.</p>

<h2 id="what-this-looks-like-in-practice">What this looks like in practice</h2>

<p>Let’s say you create a “Product Engineer” role. This role gets you everything you need to contribute code to the main product: GitHub access to the relevant repos, your CI/CD pipeline, your observability stack, your cloud environments. All the things an engineer needs to do their job.</p>

<p>Because this role has access to production systems and customer data, it also comes with a “Product Engineering Security Policy”. Not generic copy about password complexity requirements or physical document shredding that was written in 2015 and hasn’t been updated since. Specific policies about things like: can you use AI coding assistants (and if so, with what guardrails), how do you handle customer data in development, what’s the process for emergency production access, how do we think about secrets management.</p>

<p>For this system to work, your policies need to actually be good. I’ve written about <a href="https://alsmola.medium.com/better-security-policies-66eae7d6f722">better security policies</a> in software engineering environments before. The core principle: write policies for the people who need to follow them, not for auditors. When policies drive training and connect to access control, that principle becomes critical.</p>

<p>With role-specific policies, you can build role-specific training. Your onboarding training for Product Engineers walks through these policies. You don’t need expensive learning management systems to make this work. Use your existing knowledge management software (Notion, Confluence, whatever) to host the content and track who’s viewed it. Add a Google Form at the end for attestation. Write some light scripts against your identity system to check completion, send reminder emails or Slack messages to people who haven’t finished, and generate the compliance reports your auditors want.</p>
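
<p>As a sketch of those light scripts, here’s a completion check, assuming you can export the role roster from your identity provider and the attestation responses from the Google Form as CSVs - the file names and column headers are hypothetical:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal sketch: compare a role roster export against attestation responses.
# CSV file names and column headers are hypothetical.
import csv

def load_emails(path, column):
    with open(path, newline="") as f:
        return {row[column].strip().lower() for row in csv.DictReader(f)}

role_holders = load_emails("product_engineer_role.csv", "email")
attested = load_emails("attestation_responses.csv", "email")

for email in sorted(role_holders - attested):
    # Swap the print for an email or Slack reminder in practice.
    print(f"Reminder: {email} has not completed Product Engineer training")
</code></pre></div></div>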

<p>Then there’s annual refresher training. Many compliance frameworks require it, and the spirit of the requirement is sound: security knowledge needs reinforcement and updating. But forcing people to take the exact same training again feels like a waste of time. It is a waste of time. The only people who benefit from unchanged annual training are security teams who don’t have to think creatively about how to fulfill the requirement.</p>

<p>Instead, use refreshers to do two things: briefly review your role-specific policies and highlight any recent changes or additions, then dive into substantive content that’s temporally relevant. If you have policies around secure coding standards for Product Engineers, your annual refresher should walk through the bugs from your bug bounty program over the past year. Show how they were introduced and how to prevent them. The best predictor of the next bug is the last bug. This is infinitely more valuable than generic OWASP content nobody remembers, and it directly reinforces the policies you’ve already established.</p>

<p>Here’s where it gets interesting: let’s say you have a product manager who wants to learn to code and contribute to the product. Great! They can get the Product Engineer role. But they also need to understand the policies that now apply to them, and they need to go through the training. Access and responsibility travel together.</p>

<h2 id="details-that-make-it-work">Details that make it work</h2>

<p><strong>Start with your highest-risk roles.</strong> Identify the 2-3 roles with the most elevated access (Product Engineer, Support, Infrastructure). If you have a data classification scheme, use it to think through who has access to your most sensitive data. Check with your auditors early that the evidence you’ll collect for security policies and training controls works for them. Document the policies for these roles, build the training, and get people through it. Once you have the pattern working, the next roles get easier.</p>

<p><strong>Keep the granularity manageable.</strong> Think 10 to 20 roles maximum, at the job family level. If you need more than you can count on your fingers and toes, you’ve gone too far. These are the big blocks, not the fine-grained technical permissions you’d configure in AWS or Salesforce.</p>

<p><strong>Integrate with hiring.</strong> This is where most RBAC systems go off the rails. Job descriptions get created by hiring managers, titles get invented, and then later someone tries to figure out what access these people need. Flip it around. When a new job description is created, security should be involved. What role does this map to? What access does this role need? Don’t retrofit access patterns onto org structures that were designed without any thought to security.</p>

<p><strong>Handle exceptions gracefully.</strong> You won’t be able to say that every product manager needs the Product Engineer role, and some people will need access their default role doesn’t cover. You need a system that’s flexible enough to handle those exceptions. But at a high level - the structural, big-block level - the roles should do most of the heavy lifting.</p>

<p><strong>Layer your policies.</strong> You can have a baseline employee security policy that applies to everyone. But then layer on role-specific policies for people with elevated access. Support engineers who have broad access to customer data and can perform administrative actions on behalf of customers? They get specific policies about customer privacy and how to verify requests. Infrastructure engineers with cloud admin permissions? They get specific policies about root account usage and infrastructure changes.</p>

<hr />

<p>This is hard work for the security team. You can’t just copy and paste policies from a SOC 2 in a box product. You can’t buy generic security awareness web video training and call it done. You actually need to own the underlying content, make sure it communicates what’s important, and build deep empathy with the people in your organization and the work they do. But when you align RBAC, policies, and training, you get better outcomes: people understand what matters to them specifically instead of drowning in generic noise, you spend less on overpriced compliance theater while actually reducing risk, and the system scales naturally as people move into new roles. This isn’t about finding a shortcut. It’s about doing the work that actually matters, respecting people’s intelligence enough to give them exactly what they need to know for the access they have.</p>]]></content><author><name>Alex Smolen</name></author><summary type="html"><![CDATA[Your new hire sits through generic security training, clicks through a 47-page policy, and gets random access over time. Three months later they ping for production access. The policies? Nobody's looked at them since day one. There's a better way.]]></summary></entry><entry><title type="html">Refocusing Vendor Security on Risk Reduction</title><link href="https://engseclabs.com/blog/refocusing-vendor-security-on-risk-reduction/" rel="alternate" type="text/html" title="Refocusing Vendor Security on Risk Reduction" /><published>2025-10-01T00:00:00+00:00</published><updated>2025-10-01T00:00:00+00:00</updated><id>https://engseclabs.com/blog/refocusing-vendor-security-on-risk-reduction</id><content type="html" xml:base="https://engseclabs.com/blog/refocusing-vendor-security-on-risk-reduction/"><![CDATA[<p>Modern software companies use a lot of software services. Data flows across organizational boundaries, and security risk moves from first-party to third-party, challenging security teams that have responsibility without control. Traditional security teams address this through certifications and questionnaires, supporting risk visibility and acceptance. What’s often overlooked is the opportunity to actually <em>reduce</em> risk by collaborating with implementation teams on secure configuration decisions specific to the software in question. Similar to the challenge of design involvement (threat modeling) in AppSec versus post-implementation analysis (running scans), this approach requires an embedded, empowered, and empathetic security team.</p>

<p class="blog-lede">Vendor security reviews focus on what vendors do. But the real risk (and opportunity) lies in what <em>you</em> do with their software.</p>

<p><em>The old way:</em> Marketing wants to buy HubSpot. Security asks for a SOC 2 and pen test, sends a 200-question spreadsheet, waits 3 weeks for responses, notes that the vendor doesn’t enforce password rotation, asks an exec to “accept the risk,” then moves on. Six months later, a developer connects HubSpot to the entire customer database via a Zapier integration using an API key that never expires, tied to an intern’s account.</p>

<p><em>The better way:</em> Security sits with marketing during onboarding. Together they enable OIDC, set up role-based access so only marketing can see marketing data, configure audit logging to the SIEM, and document that the Salesforce integration uses a read-only service account with auto-rotating credentials.</p>

<h2 id="avoid-generic-vendor-risk"><strong>Avoid generic “vendor risk”</strong></h2>

<p>Security teams that care about third-party SaaS risk find themselves involved in the procurement process. This is reasonable - knowing (asset inventory) is half the battle. But typical security team activities are surface-level and generic:</p>

<ul>
  <li>Request and review a SOC 2 report</li>
  <li>Request and review a penetration test</li>
  <li>Request and review responses to a security questionnaire</li>
  <li>Consult a third-party vendor risk service (e.g. SecurityScorecard)</li>
</ul>

<p>Sure, sometimes these exercises reduce risk. Don’t have a SOC 2? No purchase. Unresolved high-risk findings? Commit to fix before we sign a contract. But that only works when there’s a favorable power differential. While a large purchaser may influence a small supplier’s security practices, there’s less potential in more equal or inverse dynamics. If the supplier won’t budge, the security team needs to spend reputational capital (or convince someone else to spend it) to push back against compelling business interests.</p>

<p>In the median case, vendors have already prepared industry-standard responses by hiring audit and pen test firms incentivized to give them the sales-enabling materials they need. The riskiest outcome from a “vendor security review” is asking a harried executive to approve and accept risk for deficiencies divorced from the context of how the software will actually be used. A marketing team isn’t abandoning their new analytics platform because the vendor’s audit noted they didn’t offboard personnel within 30 days, or had a stored XSS vulnerability.</p>

<p>Perhaps most distressing for those who’ve seen the sausage made: SOC 2 reports, pen tests, and other external security attestations provide very little value for understanding risk. It’s extremely difficult to reason about first-party risk, let alone third-party risk. Large security teams spend tremendous effort identifying and reducing causes of security breaches in their own infrastructure and often fail. The idea that we can elucidate third-party system risk as part of a brief “vendor review” feels unrealistic. Cookie-cutter audits and commoditized pen tests simply don’t provide much assurance. We might get some signal about security maturity, but exhaustive security analysis that would let organizations make confident buy/no-buy decisions wouldn’t be cost-effective or practical for the hundreds of vendors organizations use.</p>

<p>Like many “supply chain” problems, there aren’t easy answers. As an ecosystem, we’ve accepted the risk of this free-flowing interconnectedness to enjoy the benefits of collaborative integrations and software specialization. That said, my take is that we spend too much time on risk visibility and not enough on risk reduction. For SaaS vendors, risk reduction activity is essentially hardening the service - making sure the decisions made while configuring the software are the best available security choices, accounting for tradeoffs.</p>

<p>I’m not saying don’t ask for SOC 2 or pen tests. <a href="https://www.latacora.com/blog/2020/03/12/the-soc-starting/">Latacora’s SOC 2 Starting Seven</a> suggests simply “tracking all the software you subscribe to, buy, or install in a spreadsheet and start doing some simple risk tracking”.  Maybe you even want security questionnaires to look good for auditors or cover your bases with regulators in the event of a breach. But if you do it and don’t do what I describe below, you’re missing an opportunity to optimize the energy you spend on your security program.</p>

<p>One note: these activities are different from the very “external” vendor security review process that asks for evidence and risk approvals. They require the security team to be deeply connected to the IT function that clicks the buttons to enable identity/access/logs, and to have strong relationships with teams using software so there’s trust to accept recommendations. Expect less paperwork and more teamwork.</p>

<h2 id="instead-support-smart-security-decisions"><strong>Instead, support smart security decisions</strong></h2>

<h3 id="understand-data-flows"><strong>Understand data flows</strong></h3>

<p>New software review should start by understanding what data the vendor will collect and how. Similar to threat modeling, this requires context gathering - sitting down with implementers to make sure the security team understands what the heck this software is doing in the first place.</p>

<p>Recommendations often come out of just this step: Can we self-host rather than use cloud? Do we intend to turn on such-and-such third-party integrations?</p>

<h3 id="define-access"><strong>Define access</strong></h3>

<p>Pretty much any software will have you make security-related decisions about access. Typically this means roles and permissions (coarse- or fine-grained) plus identity (username/password, MFA, OAuth, SAML). I’ve seen plenty of wild west situations with access to new software tooling. This is an ideal opportunity for a security team to deliver a win-win: make access easy out of the gate through well-defined role-based access control strategies that can be applied to new use cases.</p>

<p>This is where you make an impact: “Hey, they offer SAML and SCIM, let’s use that.”</p>

<h3 id="configure-auditing"><strong>Configure auditing</strong></h3>

<p>If the software holds particularly sensitive data, it may be wise to track access, changes, or sensitive operations. Consider what you’d do in the event of a breach - does the software offer configurable audit logs? Would it make sense to write detections? Can it ship audit logs to your SIEM?</p>
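
<p>If the vendor exposes an audit-log API and your SIEM has an HTTP collector, the glue can be small. The sketch below is entirely hypothetical - the endpoints, auth scheme, and payload shape are stand-ins for whatever the vendor and your SIEM actually provide:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch: poll a vendor audit-log API and forward events to a
# SIEM HTTP collector. Every URL, header, and field name here is made up.
import json
import os
import urllib.request

VENDOR_AUDIT_URL = "https://vendor.example.com/api/audit-events"    # hypothetical
SIEM_COLLECTOR_URL = "https://siem.example.com/services/collector"  # hypothetical

def fetch_events():
    req = urllib.request.Request(
        VENDOR_AUDIT_URL,
        headers={"Authorization": "Bearer " + os.environ["VENDOR_API_TOKEN"]},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def forward(event):
    req = urllib.request.Request(
        SIEM_COLLECTOR_URL,
        data=json.dumps({"event": event}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req).close()

for event in fetch_events():
    forward(event)
</code></pre></div></div>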

<h3 id="lock-down-integrations"><strong>Lock down integrations</strong></h3>

<p>This feels like the biggest area where vendor risk gets introduced, and the hardest for a security team to get a handle on, because these integrations are often nuanced and not well understood by anyone except the people configuring or administering the software.</p>

<p>Nearly all software these days integrates with other systems. Does it read/write from your data analytics pipeline? Tie into your observability stack? Does it need to integrate with your marketing website?</p>

<p>The boom in “Non-Human Identity” security points to the risk associated with these integrations if they’re created and maintained without security rigor. We see API keys stored in plaintext on endpoints, integrations granted full administrative access, and integrations tied to a real human’s account that break when that person leaves the company. Rather than trying to solve these problems after the fact with tools, providing guidance upfront during onboarding helps reduce risk from the get-go and obviates the need for potentially breaking changes later.</p>
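
<p>Part of that upfront guidance can be as simple as requiring the integration credential to live in a secrets manager with a named owner, rather than in plaintext on someone’s laptop. A minimal sketch, assuming AWS Secrets Manager - the secret name, tags, and key value are placeholders:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal sketch: store an integration API key centrally with an owner tag,
# instead of leaving it in plaintext on an endpoint. Names and values are placeholders.
import boto3

secrets = boto3.client("secretsmanager")
secrets.create_secret(
    Name="integrations/hubspot/zapier-api-key",
    Description="Read-only HubSpot key used by the Zapier integration",
    SecretString="example-api-key-value",
    Tags=[
        {"Key": "owner", "Value": "marketing-ops"},
        {"Key": "rotation-policy", "Value": "90-days"},
    ],
)
</code></pre></div></div>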

<h3 id="harden-with-security-guides"><strong>Harden with security guides</strong></h3>

<p>Beyond these standard security configuration areas, many software services have unique configuration knobs that can profoundly affect security. Larger vendors have started publishing “security guides” that help implementers understand their options and how they work. It should be the security team’s responsibility to consume these and ensure best practices are followed if they make sense for the organization’s context.</p>

<hr />

<p>The bottom line: vendor security isn’t just about collecting attestations and checking boxes. It’s about being embedded enough in your organization to understand how software actually gets used, building relationships strong enough that people trust your recommendations, and focusing your energy on the configuration decisions that actually reduce risk. That’s harder than asking for a SOC 2 report, but it’s also where the real security work happens.</p>]]></content><author><name>Alex Smolen</name></author><summary type="html"><![CDATA[Modern software companies use a lot of software services. Traditional security teams address third-party risk through certifications and questionnaires, but there's an opportunity to actually reduce risk by collaborating with implementation teams on secure configuration decisions.]]></summary></entry><entry><title type="html">What Should I Work on Next? A Framework for High-Impact Security Work</title><link href="https://engseclabs.com/blog/what-should-i-work-on-next/" rel="alternate" type="text/html" title="What Should I Work on Next? A Framework for High-Impact Security Work" /><published>2025-06-03T00:00:00+00:00</published><updated>2025-06-03T00:00:00+00:00</updated><id>https://engseclabs.com/blog/what-should-i-work-on-next</id><content type="html" xml:base="https://engseclabs.com/blog/what-should-i-work-on-next/"><![CDATA[<p>“So, my last project is basically wrapped up. What should I work on next?”</p>

<p>This question comes up often in my 1:1s, and how I respond influences whether my report will do high-impact work or spin their wheels. I believe one of the most valuable things I do as a security engineering manager is steering my team towards the best work. When a security engineer looks to me for advice on what to work on, it’s a high-leverage opportunity. I’ve succeeded by aligning people with work that mattered; I’ve also failed to recognize and redirect wasted energy. Over time, I’ve developed three criteria/questions to guide this discussion:</p>

<ul>
  <li><strong>Business Goals:</strong> What is the impact of this work on the business?</li>
  <li><strong>Implicit Interest:</strong> How engaged are you when doing this work?</li>
  <li><strong>Personal Growth:</strong> How much does this work align with your stated career interests?</li>
</ul>

<h2 id="side-note-decision-making-as-an-optimization-problem">Side-note: Decision making as an optimization problem</h2>

<p>Back in 2012 at Twitter, I joined an internal group completing the Coursera “Machine Learning” course. One insight that stuck: behind all the matrix math, we were asking a hill climbing question. Given a bunch of dimensions, what are the best weights to predict outcomes?</p>

<p>That optimization mindset applies to simpler problems too, like “what to work on next”. Without a decision-making rubric, you might jump to recent conversations with your boss, backlog items, or what would look good for promotion. Context is king and circumstances differ, but the best choice is unlikely to be based on just one criterion. The “right work” involves tradeoffs. Similar to “cheap/good/fast - pick two”, you can’t maximize everything. So, with that in mind, not all work will crush Business Goals, ignite Implicit Interests, <em>and</em> speed-run Personal Growth. But they’re all heuristics to consider for making optimal(-ish) choices about what to work on.</p>

<h2 id="business-goals">Business Goals</h2>

<p>Information security isn’t a first-order business goal. While no company wants bad security outcomes, security work rarely moves the needle for top company metrics like revenue growth. Like other “platform” work and technical debt, it’s difficult to show direct value. As I mentioned in <a href="https://alsmola.medium.com/building-effective-security-okrs-94f249230a39">Building Effective Security OKRs</a>, it’s easy to describe project value in ways that aren’t connected to business goals (“we shipped a thing”). Security engineers rarely explain their work’s value with spreadsheets full of numbers.</p>

<p>I’ve found one way to communicate security business value to stakeholders is through <em>narrative</em>. The story of why you’re doing what you’re doing needs to land with people thinking about business from the broadest perspective - executives, investors, customers, etc. The work may be a small part of this story. It may require explanation, and some squinting and hand waving. But work disconnected from business value won’t be watered and grown over time.</p>

<p>“Zero trust” is a narrative - “having network access shouldn’t get you access to services, there should be authentication/device verification involved”. There’s a deep risk discussion about why that’s good, but the narrative encodes it. Similar to “paved roads”, “detection engineering”, or “shift left” - these concepts connect projects to business goals via “strategy”. This is the connective tissue from grungy security work to IPOs/dollar signs.</p>

<p>Take static analysis tools as “shift left” thinking. The business narrative is preventing vulnerabilities before they become expensive bug bounty payouts or worse. But implementation details matter for that story to hold - you need the tool to catch vulnerability classes you’re paying for, not just generate noise engineers ignore. Similarly, when implementing detections, work backwards from credible threats people already understand - what’s in the news, what happened to similar companies - and tell the story from that threat-focused angle.</p>

<p>Some projects are inherently difficult to connect to business value. System rewrites (like migrating secret managers), building internal tooling that generates too many alerts to triage effectively, or <a href="https://alsmola.medium.com/access-approvals-considered-harmful-f24fa2fe2f87">access approval systems that slow productivity</a> have <em>very</em> indirect customer impact. This is a problem many engineering leaders deal with, and selecting your portfolio is an art in itself. Suffice it to say, it’s best to have clear examples of how you’re impacting users to offset the ennui around big projects where the successful outcome is “everything works like it did before”.</p>

<p>The other way security projects impact business is unlocking revenue through compliance and customer assurance. This may not excite security engineers the way a novel threat detection algorithm does, but it’s often the most direct path from security work to business value. When your SOC 2 Type II enables your first big enterprise deal or your FedRAMP certification opens government contracts, the ROI conversation becomes much easier. People focused on business outcomes recognize these compliance achievements as tangible value they can point to with customers and prospects.</p>

<p>Regardless of framing, workshopping a project’s business impact is a great way to prioritize it. How would you communicate the progress and outcome to the organization? If the message isn’t clearly “here’s how we’re driving our business forward”, take heed.</p>

<h2 id="implicit-interest">Implicit Interest</h2>

<p>People prefer different types of work. One person may love diving into obscure technical challenges - a big “nerd snipe” target. Another may relish presenting to the organization about how security works. Others may be “red team” or “blue team”-coded.</p>

<p>The more someone gravitates toward and is absorbed by their work, the more productive they’ll be. Ask someone who dreads writing to complete a big documentation exercise, and it’ll take forever. Some engineers are allergic to frontend code. But find a problem that fits someone’s implicit interest, and they’ll often fly with it, bulldozing through obstacles and bringing unexpected creativity to problem solving.</p>

<p>Don’t forget you can combine people with complementary interests on the same project. Pair the person who loves prototyping with someone who enjoys writing thorough documentation. Match the frontend enthusiast with the API builder. This often produces better outcomes than forcing one person to do everything.</p>

<p>Given work with equal business value, figure out which project is more interesting. It’s also a good signal of whether someone fits the team. If they’re not interested in the work, it’s hard to expect them to be more productive than a replacement who is.</p>

<h2 id="personal-growth">Personal Growth</h2>

<p>A final dimension to consider: how the work aligns with their stated career goals. A trap managers and reports fall into is losing the plot on the report’s growth. If a report doesn’t feel they’re headed in the direction they want at the pace they expect, their engagement and productivity will plummet, and their departure becomes imminent.</p>

<p>If people know what they want to be when they grow up, connect the work they’re doing to that path. For engineers interested in the management track, prioritize work involving mentoring interns or working cross-functionally with other teams. For those who want to build their external presence, look for projects that could become “conference talk worthy” - novel approaches or interesting problems the broader security community would want to hear about.</p>

<p>In an ideal world, we’d all have deep clarity about how our career journey should unfold. In reality, many people don’t know how they want to grow. Something like a career ladder can be useful as a default “here’s what growth looks like” description. Remember there’s no one true path for anyone’s career, and if you’re creative about how work can foster growth, you can connect world-weary engineers with new challenges and kindle productive energy.</p>

<p>The world of security is broad and deep, and people are unique. Figuring out how to get the most out of your team will continue to be the mark of effective security management. Use this three-part framework to evaluate options and get people onboard with winning projects.</p>]]></content><author><name>Alex Smolen</name></author><summary type="html"><![CDATA[A framework for helping security engineers choose high-impact work using three criteria - business goals, implicit interest, and personal growth.]]></summary></entry></feed>