What is the autoresearch pattern for WAF optimization?

The autoresearch pattern gives an AI agent a scoring function and configuration space, then lets it run experiments autonomously. For WAF optimization, the agent modifies ModSecurity CRS configuration, evaluates against a dataset of real HTTP requests, keeps improvements, and reverts failures.

Can I use these optimized CRS configs in production?

The configs were optimized against a specific dataset and should be tested against your own traffic before production deployment. They serve as a strong starting point, but every deployment has different traffic patterns.

What dataset was used for the experiments?

5,354 real HTTP requests: 4,500 legitimate browsing captures from 692 websites (openappsec project) and 854 malicious payloads across 7 attack categories plus WAF bypasses (mgm-security-partners and payload-box).

Why does CRS v4 have more false positives than v3?

CRS v4 includes stricter detection rules that catch more attacks but also flag more legitimate traffic. The default anomaly scoring threshold of 5 is too aggressive for v4's rule set, causing 29.7% false positive rate vs 15.2% on v3.

We Let an AI Agent Run Autonomous Experiments on ModSecurity CRS. Here's What It Found.

What happens when you let an AI agent modify WAF configuration, measure the results, and iterate, all on its own?

We built a system to find out. Starting from stock OWASP CRS defaults, the agent improved balanced accuracy from 86.7% to 96.7% on CRS v3.3.8, and from 80.8% to 98.4% on CRS v4.24.0, finding configuration improvements that most human operators would never think to try.

This post covers the full experiment: methodology, every finding, downloadable configs, and what it means for WAF operators. We tested against both CRS v3.3.8 and v4.24.0 because a lot of deployments are still on v3.

Background

The OWASP Core Rule Set (CRS) is the most widely deployed open-source WAF ruleset. It ships with sensible defaults, but those defaults are a compromise. Every WAF deployment faces the same tension: catch more attacks (true positive rate) vs. block fewer legitimate requests (false positive rate).

Tuning CRS is tedious manual work. Change a setting, restart, observe production for a few days, check logs, adjust, repeat. Most operators never get past the defaults.

What if an agent could do hundreds of those cycles in a single night?

The Autoresearch Pattern

We followed the autoresearch pattern (Andrej Karpathy, 2026): give an AI agent a scoring function, a configuration space to explore, and let it run experiments in a loop. The agent proposes changes, measures impact, keeps improvements, reverts failures.

Infrastructure

WAF: OWASP ModSecurity CRS on nginx (official Docker image)
Backend: nginx returning 200 for all valid requests
Agent: Claude Code (Anthropic) in --print mode, one experiment per invocation
Orchestrator: Zoutje, an AI assistant built on OpenClaw, managed infrastructure, dataset construction, and the agent loop
Loop: Bash wrapper with watchdog for resilience. Each experiment: modify config, git commit, evaluate, log results, revert if worse.
Hardware: MacBook Pro (Apple Silicon), Docker via OrbStack

Dataset: 5,354 Real HTTP Requests

This is where most WAF benchmarks fall short. Testing with hand-crafted payloads tells you nothing about false positive rates. We used real captured browsing sessions.

Legitimate Traffic (4,500 requests)

Sourced from openappsec/waf-comparison-project: real browsing captures from 692 websites across 14 categories.

Category	Requests	Notes
E-Commerce	~2,100	Largest group. Complex cookies, cart operations, product searches
Travel	~400	Booking flows, date/location parameters
Information	~300	News sites, Wikipedia, documentation
Food	~200	Restaurant sites, delivery apps
Games	~120	Gaming platforms, in-game web views
Social Media	~100	Facebook, Twitter, Reddit interactions
Files Uploads	~90	Cloud storage, file sharing
Content Creation	~80	CMS, blogging platforms
Videos	~50	YouTube, streaming platforms
Search Engines	~50	Google, Bing searches
Applications	~40	SaaS applications
Technology	~15	Developer tools
Files Download	~15	Download managers
Streaming	~10	Music/audio streaming

These are full HTTP requests with real headers, cookies, query parameters, and POST bodies. Not synthetic traffic.

Malicious Traffic (854 requests)

7 attack categories plus WAF bypass payloads:

Category	Requests	Source
XSS	150	mgm-web-attack-payloads (true-positives)
SQLi	150	mgm-web-attack-payloads (true-positives)
Path Traversal	150	mgm-web-attack-payloads (true-positives)
Command Execution	150	mgm-web-attack-payloads (true-positives)
Log4Shell	50	mgm-web-attack-payloads (true-positives)
Shellshock	24	mgm-web-attack-payloads (true-positives)
XXE	35	mgm-web-attack-payloads (true-positives)
WAF Bypasses	95	payload-box/waf-bypass-payload-list

Sources: - mgm-security-partners/mgm-web-attack-payloads -- curated payloads with labeled true-positives and false-positives - payload-box/waf-bypass-payload-list -- community WAF bypass collection

Traffic ratio: 16% malicious / 84% legitimate. Real-world ratios are typically 1-10% malicious, so this is attack-heavy but keeps evaluation fast while providing meaningful false positive signal.

Scoring: Balanced Accuracy

Balanced Accuracy = (True Positive Rate + True Negative Rate) / 2

This penalizes both missed attacks AND false positives equally. A WAF that blocks everything scores 50%. A WAF that allows everything also scores 50%. You need to be good at both.

Evaluation Speed

Each evaluation cycle sends all 5,354 requests to the WAF and records HTTP status codes. Takes about 35 seconds. That means the agent can run roughly 100 experiments per hour.

CRS v3.3.8 Results

Baseline

Stock CRS v3.3.8, Paranoia Level 1, all defaults:

Metric	Value
Balanced Accuracy	86.7%
True Positive Rate	88.5% (756/854 attacks caught)
True Negative Rate	84.8% (3,815/4,500 legit passed)
False Positive Rate	15.2% (685 legit blocked)
Missed Attacks	98

15.2% false positive rate. One in seven legitimate requests gets blocked. E-Commerce alone: 424 false positives. For a real store, that's customers unable to browse, add to cart, or check out.

The Experiments

The agent ran 11 experiments over approximately 3 hours. 9 were kept, 2 discarded.

Experiment 1: Expand Allowed Content Types (KEPT)

CRS blocks requests with Content-Type headers it doesn't recognize. Real traffic sends text/plain, application/octet-stream, application/grpc, application/x-protobuf, and others not in the default list.

Change: Added common content types to tx.allowed_request_content_type.

Metric	Before	After
BA	86.7%	91.6%
FPR	15.2%	5.2%
FP count	685	236

One configuration line eliminated 449 false positives. Detection unchanged.

Experiment 2: Paranoia Level 2 (DISCARDED)

The obvious move. PL2 enables additional detection rules.

Metric	PL1	PL2
BA	91.6%	72.6%
TPR	88.5%	94.0%
FPR	5.2%	48.9%

Nearly half of all legitimate traffic blocked. The agent correctly discarded this and never tried raw PL2 again.

Experiment 3: Custom Path Traversal Rules (KEPT)

CRS was missing 51 traversal attacks. The agent wrote two rules:

LFI target path detection: matches etc/passwd, windows/system32, proc/self with flexible separators that catch encoding evasion (percent-encoding, hex escapes, overlong UTF-8)
Overlong UTF-8 traversal encoding: catches %c0%af, %e0%80%af, IIS %u002f -- invalid UTF-8 that is never used in legitimate traffic

Result: BA 91.6% to 94.0%. Caught 41 more attacks, zero new false positives.

Experiment 4: HTTP Methods + Cookie SQLi Exclusions (KEPT)

Added PUT, DELETE, PATCH to allowed methods (real e-commerce uses these)
Excluded cookies from SQLi detection (tracking cookies contain URL-encoded JSON that looks like SQL)

Result: BA 94.0% to 94.4%. FPR dropped from 5.2% to 4.5%.

Experiments 5-6: Cookie + Referer Exclusions (KEPT)

Extended cookie exclusions to XSS, RCE, LFI, RFI, PHP injection, and session fixation rules. Excluded Referer header from all attack detection.

The insight: cookies and Referer contain legitimate content (URLs, encoded JSON, HTML fragments) that triggers signature rules. In this dataset, zero attacks used cookies or Referer as vectors.

Result: BA 94.4% to 94.9%. FPR down to 4.0%.

Experiment 7: Custom SQL Detection Rules (KEPT)

Three new rules for SQLi patterns CRS missed: - SELECT ... FROM <table> in parameters - Boolean tautologies (OR 1=1, AND 'x'='x') - ORDER BY <number> column enumeration

Result: BA 94.9% to 96.1%. Biggest single TPR jump: 93.7% to 96.1%.

Experiment 8: IFS Bypass + Data URI XSS (DISCARDED)

Tried catching $IFS shell variable substitution, data URI XSS, and undefined variable injection. Caught 2 more attacks but added 10 false positives. Net gain: +0.001 BA.

Not worth the false positive cost. Correctly discarded.

Experiment 9: Encoding Detection + XSS/Shellshock (KEPT)

Broader overlong UTF-8 coverage
XSS javascript: with embedded whitespace bypass
Obfuscated Shellshock (dot-separated function definitions)
Sensitive web path detection (/var/www/)

Result: BA 96.1% to 96.7%. TPR reached 97.4%.

v3.3.8 Final Results

Metric	Baseline	Optimized	Change
Balanced Accuracy	86.7%	96.7%	+11.5%
True Positive Rate	88.5%	97.4%	+10.1%
True Negative Rate	84.8%	96.0%	+13.2%
False Positive Rate	15.2%	4.0%	-73.7%
False Positives	685	178	-74.0%
Missed Attacks	98	22	-77.6%

v3.3.8 Optimization Trajectory

 #  BA      TPR     FPR    What changed
 0  0.867   0.885   15.2%  Baseline (stock CRS PL1)
 1  0.916   0.885    5.2%  + content types
 2  0.940   0.933    5.2%  + LFI/traversal rules
 3  0.944   0.933    4.5%  + allowed methods, cookie SQLi exclusion
 4  0.946   0.937    4.5%  + cookie XSS exclusion, SSTI rule
 5  0.948   0.937    4.1%  + cookie RCE/LFI/Referer exclusions
 6  0.949   0.937    4.0%  + shell arithmetic, more cookie exclusions
 7  0.961   0.961    4.0%  + SQL detection rules
 8  0.967   0.974    4.0%  + encoding, XSS bypass, shellshock rules

Two clear patterns: 1. Early gains came from reducing false positives (content types, cookie/header exclusions) 2. Later gains came from custom detection rules (LFI, SQLi, SSTI, Shellshock)

The agent figured out that you fix the FP problem first, then safely add detection.

Download: v3.3.8 Optimized Config

The optimized configuration consists of two files:

REQUEST-900-BEFORE.conf -- paranoia level, thresholds, allowed content types and methods
RESPONSE-999-AFTER.conf -- cookie/header exclusions + 11 custom detection rules

Drop these into your CRS deployment:

# Mount as Docker volumes
volumes:
  - ./REQUEST-900-BEFORE.conf:/etc/modsecurity.d/owasp-crs/rules/REQUEST-900-EXCLUSION-RULES-BEFORE-CRS.conf
  - ./RESPONSE-999-AFTER.conf:/etc/modsecurity.d/owasp-crs/rules/RESPONSE-999-EXCLUSION-RULES-AFTER-CRS.conf

Important: These configs were optimized against this specific dataset. Test against your traffic before deploying to production.

CRS v4.24.0 Results

CRS v4 is the current release. We ran the same experiment against it.

Baseline

Stock CRS v4.24.0, Paranoia Level 1, all defaults:

Metric	v3.3.8 baseline	v4.24.0 baseline
Balanced Accuracy	86.7%	80.8%
True Positive Rate	88.5%	91.3%
True Negative Rate	84.8%	70.3%
False Positive Rate	15.2%	29.7%
False Positives	685	1,336

v4 catches more attacks (+3% TPR) but nearly doubles false positives. The new rules are more aggressive, which is the right security decision, but operators need to tune more aggressively too.

The Experiments

The agent ran 19 experiments over approximately 4 hours. 17 were kept, 2 discarded.

The v4 optimization took a fundamentally different path than v3. Here's what happened:

Phase 1: Tame the False Positives (Experiments 1-3)

The agent's first move was raising the inbound anomaly score threshold from 5 to 7. On v3, the default of 5 worked fine. On v4, the stricter rules generate higher anomaly scores on legitimate traffic, so a higher threshold is needed. This single change cut false positives from 1,336 to 290.

The tradeoff: TPR dropped from 91.3% to 76.5%. The agent needed to earn that detection back.

Phase 2: Custom Detection Rules (Experiments 4-12)

With the false positive pressure relieved, the agent wrote 30+ custom detection rules across 9 experiments:

Path traversal with encoding evasion (overlong UTF-8, IIS Unicode, hex-prefix)
Shellshock and obfuscated variants
XXE with case-insensitive matching
SQLi timing attacks, blind injection, function calls, comment evasion
Command execution (shell metacharacters, IFS substitution, Windows commands)
SSTI arithmetic probes and template syntax detection
XSS event handler evasion and javascript protocol bypass
Log4Shell/JNDI with nested ${}, unicode, and protocol scheme variants
Buffer overflow indicators and web root path access

By experiment 12, TPR was back to 98.4% -- higher than the baseline -- while keeping FPR at 7.3%.

Phase 3: Surgical FP Reduction (Experiments 13-19)

With detection maxed out, the agent switched strategies:

Tightened its own rules -- removed overly broad patterns from traversal, shellshock, and SSTI rules that were catching legit traffic
Removed high-FP CRS rules -- identified CRS rule 932270 (Unix Shell Expression) as causing 157 false positives with zero detection benefit, and rule 942550 (SQL Injection Comment Sequence) as causing 31 FPs for only 1 additional detection
Added content types -- same trick that worked on v3, adding text/plain and common content types to reduce CRS rule 920420 triggers

v4.24.0 Final Results

Metric	Baseline	Optimized	Change
Balanced Accuracy	80.8%	98.4%	+21.8%
True Positive Rate	91.3%	98.5%	+7.9%
True Negative Rate	70.3%	98.4%	+39.8%
False Positive Rate	29.7%	1.6%	-94.5%
False Positives	1,336	74	-94.5%
Missed Attacks	74	13	-82.4%

v4.24.0 Optimization Trajectory

 #  BA      TPR     FPR    What changed
 0  0.808   0.913   29.7%  Baseline (stock CRS v4.24.0 PL1)
 1  0.850   0.765    6.4%  + anomaly threshold 7 (massive FP cut, TPR drops)
 2  0.870   0.813    7.3%  + traversal rules (start recovering TPR)
 3  0.875   0.822    7.3%  + shellshock/XXE rules
 4  0.882   0.836    7.3%  + SQLi timing/blind + cmdexe rules
 5  0.883   0.838    7.3%  + SQLi functions + shell arithmetic
 6  0.902   0.877    7.3%  + overlong UTF-8 + sensitive paths
 7  0.920   0.913    7.3%  + SSTI + SQL tautology + Windows cmds
 8  0.943   0.959    7.3%  + flexible traversal + more SQLi + cmd injection
 9  0.955   0.984    7.3%  + XSS evasion + shellshock + UTF-7 XXE
10  0.959   0.984    6.5%  tighten rules (reduce FPs)
11  0.977   0.984    3.0%  remove CRS rule 932270 (157 FPs, 0 TPs)
12  0.979   0.982    2.4%  remove CRS rule 942550 (31 FPs, 1 TP)
13  0.968   0.956    2.0%  + Log4Shell/JNDI detection (37 catches)
14  0.982   0.985    2.0%  + SQLi version/blind + XSS handlers
15  0.984   0.985    1.6%  + content types (same trick as v3)

The v4 trajectory is different from v3. On v3, the agent reduced FPs first, then added detection. On v4, it had to raise the anomaly threshold first (accepting a TPR drop), then spent 9 experiments writing custom rules to recover and exceed the original detection rate, then surgically removed high-FP CRS rules to clean up the remaining false positives.

Download: v4.24.0 Optimized Config

The optimized configuration consists of two files:

REQUEST-900-BEFORE.conf -- anomaly threshold (7), allowed content types
RESPONSE-999-AFTER.conf -- CRS rule removals (932270, 942550), 30+ custom detection rules

Drop these into your CRS v4 deployment:

# Mount as Docker volumes
volumes:
  - ./REQUEST-900-BEFORE.conf:/etc/modsecurity.d/owasp-crs/rules/REQUEST-900-EXCLUSION-RULES-BEFORE-CRS.conf
  - ./RESPONSE-999-AFTER.conf:/etc/modsecurity.d/owasp-crs/rules/RESPONSE-999-EXCLUSION-RULES-AFTER-CRS.conf

Important: These configs were optimized against this specific dataset. Test against your traffic before deploying to production.

v3 vs v4: Side-by-Side

	v3.3.8 Baseline	v3.3.8 Optimized	v4.24.0 Baseline	v4.24.0 Optimized
Balanced Accuracy	86.7%	96.7%	80.8%	98.4%
True Positive Rate	88.5%	97.4%	91.3%	98.5%
False Positive Rate	15.2%	4.0%	29.7%	1.6%
False Positives	685	178	1,336	74
Missed Attacks	98	22	74	13
Experiments	11 (9 kept)		19 (17 kept)
Runtime	~3 hours		~4 hours

Key differences: - v4 starts worse but ends better. v4's stricter default rules create more false positives out of the box, but this gives the agent more optimization surface. - Different strategies. v3: reduce FPs first, add detection later. v4: raise threshold, write custom rules to recover detection, then surgically cut remaining FPs. - v4 needed more custom rules. The higher anomaly threshold means more attacks slip under the scoring threshold, requiring explicit detection rules to compensate. - v4's final FPR (1.6%) is less than half of v3's (4.0%). The combination of threshold tuning + targeted CRS rule removal is more effective than v3's content-type/cookie approach alone.

The Custom Rules (Both Versions)

Here are the detection rules the agent wrote, with explanation and examples for each.

Rule 1: LFI Target Path Detection

What it catches: Path traversal attempts targeting sensitive OS files, with flexible separators to handle encoding evasion.

Examples blocked:
  ../../etc/passwd
  ..%c0%af..%c0%afetc%c0%afpasswd
  ..%bg%qfetc%bg%qfshadow
  ../../windows/system32/config/sam
  /proc/self/environ

Why CRS misses some of these: CRS detects standard ../ traversal but some encoded variants with invalid percent-encoding (%bg, %qf) or hex escapes (0x5c) slip through the URL decode transformations.

Rule 2: Overlong UTF-8 Traversal Encoding

What it catches: Invalid UTF-8 byte sequences used exclusively for WAF evasion.

Examples blocked:
  %c0%af  (overlong / -- 2-byte encoding of a 1-byte character)
  %c1%1c  (overlong \ )
  %e0%80%af  (3-byte overlong /)
  %f0%80%80%af  (4-byte overlong /)
  %u002f  (IIS Unicode /)
  %u005c  (IIS Unicode \)

Why this has zero false positives: Overlong UTF-8 is explicitly forbidden by RFC 3629. No legitimate software produces these byte sequences. Their only purpose is WAF/filter evasion.

Rule 3: SSTI Arithmetic Probes

What it catches: Server-Side Template Injection fingerprinting via arithmetic inside template delimiters.

Examples blocked:
  {{7*7}}       (Jinja2/Twig)
  ${7*7}        (FreeMarker/Velocity)
  *{7*7}        (Thymeleaf)
  <%= 7*7 %>    (ERB/JSP)

Why this works: Penetration testers and tools (tplmap, Burp) use arithmetic expressions to confirm template injection. If the response contains 49, the template engine evaluated the expression. Legitimate users never send {{7*7}} as a parameter value.

Rule 4: Shell Arithmetic Probes

What it catches: Blind command execution probing via shell arithmetic.

Examples blocked:
  $((1+1))
  $((5-3))
  $((7*7))

Why this works: Automated RCE scanners (commix, etc.) use $((N+M)) to test for command injection without triggering obvious command signatures. The response revealing the arithmetic result confirms execution. Never appears in legitimate parameters.

Rule 5: SQL Comment WAF Bypass

What it catches: Inline SQL comments used to split keywords and evade signature detection.

Examples blocked:
  UN/**/ION SE/**/LECT
  /*!50000SELECT*/ /*!50000FROM*/
  1'/**/OR/**/1=1--

Why this works: SQL comments inside HTTP parameters are never legitimate. The /**/ pattern is specifically used to break up SQL keywords that WAF signatures match as whole words. MySQL versioned comments (/*!50000...*/) are a MySQL-specific evasion technique.

Rules 6-8: SQL Injection Patterns

Bare SELECT...FROM, boolean tautologies (OR 1=1), and ORDER BY <number> column enumeration. These supplement CRS's existing SQLi detection for patterns that bypass the libinjection-based checks.

Rules 9-11: XSS/Shellshock/Path Detection

XSS javascript: with embedded whitespace (filter evasion)
Obfuscated Shellshock function definitions
Direct access to /var/www/ paths

Key Takeaways for WAF Operators

1. Your Default CRS Is Probably Blocking More Legit Traffic Than You Think

Stock CRS PL1 blocked 15% of real browsing traffic on v3, 30% on v4. If you're running defaults on a site with real users (not just an API), check your false positive rate.

Quick win: Expand tx.allowed_request_content_type to include content types your application actually receives.

2. Cookies and Referer Headers Are False Positive Factories

Real cookies contain tracking pixels, encoded JSON, HTML fragments, and URL parameters that trigger attack signatures everywhere. If cookie-based attacks aren't in your threat model, excluding REQUEST_COOKIES from attack detection is the single biggest FP reduction.

3. PL2 Is a Trap (Without Tuning)

Paranoia Level 2 improves detection by ~5% but can double or triple your false positives. PL1 with targeted custom rules consistently outperformed raw PL2 in our tests.

4. Custom Rules for Encoding Evasion Are Free Wins

Overlong UTF-8, IIS Unicode encoding, SQL comment splitting -- these patterns have zero legitimate use. Adding detection for them improves security with no false positive cost.

5. AI Agents Can Optimize WAF Config

11 experiments in 3 hours. A human doing the same work would need days. The agent correctly identified dead ends, stacked improvements, and found non-obvious optimizations (like fixing content types before adding detection rules).

Methodology and Limitations

What This Is

An experiment in AI-driven security configuration optimization, applying the autoresearch pattern to WAF tuning.

What This Is Not

Not a production deployment. Test against your traffic before deploying.
Not a replacement for human review. Every rule should be reviewed by a security engineer.
Not a comprehensive benchmark. 854 malicious requests is useful for iterative optimization but isn't statistically exhaustive.

Dataset Limitations

Legitimate traffic is from the openappsec project's 2024 browsing captures. Your traffic will differ.
Malicious payloads are from public repositories. Real attackers use different techniques.
Cookie-based attacks are underrepresented, which may make cookie exclusions appear safer than they are.
The dataset doesn't include API traffic, WebSocket connections, or file upload payloads.

Reproducibility

Everything is open source: - Repository: github.com/wafplanet/autoresearch - Dataset: dataset/requests.jsonl (5,354 requests) - Evaluation: scripts/evaluate.py - Agent loop: scripts/run-agent.sh + scripts/watchdog.sh - Results: results.tsv (v3), results-v4.tsv (v4)

Clone the repo, run docker compose up -d, run python3 scripts/evaluate.py. Verify every number in this post.

Transparency

This experiment was built and run by an AI agent (Zoutje/OpenClaw) orchestrating another AI agent (Claude Code/Anthropic). A human (Thijs de Zoete) designed the experiment, reviewed results, and made editorial decisions. The detection rules were written by Claude Code during autonomous experiment cycles. This blog post was drafted collaboratively.

We believe AI-assisted security research should be transparent about its methods. If an AI helped find it, say so.

What's Next

We're evaluating contributions to the OWASP CRS project for the detection rules with zero false positives (overlong UTF-8, SSTI probes, shell arithmetic).
We're exploring longer runs (24h+) with larger datasets to see where the optimization plateaus.
We want to test against CRS v4's newer rule categories and see how the agent adapts.

If you try this against your own traffic, we'd love to hear what it finds. Open an issue on the repo or reach out.

Background

The Autoresearch Pattern

Infrastructure

Dataset: 5,354 Real HTTP Requests

Legitimate Traffic (4,500 requests)

Malicious Traffic (854 requests)

Scoring: Balanced Accuracy

Evaluation Speed

CRS v3.3.8 Results

Baseline

The Experiments

Experiment 1: Expand Allowed Content Types (KEPT)

Experiment 2: Paranoia Level 2 (DISCARDED)

Experiment 3: Custom Path Traversal Rules (KEPT)

Experiment 4: HTTP Methods + Cookie SQLi Exclusions (KEPT)

Experiments 5-6: Cookie + Referer Exclusions (KEPT)

Experiment 7: Custom SQL Detection Rules (KEPT)

Experiment 8: IFS Bypass + Data URI XSS (DISCARDED)

Experiment 9: Encoding Detection + XSS/Shellshock (KEPT)

v3.3.8 Final Results

v3.3.8 Optimization Trajectory

Download: v3.3.8 Optimized Config

CRS v4.24.0 Results

Baseline

The Experiments

Phase 1: Tame the False Positives (Experiments 1-3)

Phase 2: Custom Detection Rules (Experiments 4-12)

Phase 3: Surgical FP Reduction (Experiments 13-19)

v4.24.0 Final Results

v4.24.0 Optimization Trajectory

Download: v4.24.0 Optimized Config

v3 vs v4: Side-by-Side

The Custom Rules (Both Versions)

Rule 1: LFI Target Path Detection

Rule 2: Overlong UTF-8 Traversal Encoding

Rule 3: SSTI Arithmetic Probes

Rule 4: Shell Arithmetic Probes

Rule 5: SQL Comment WAF Bypass

Rules 6-8: SQL Injection Patterns

Rules 9-11: XSS/Shellshock/Path Detection

Key Takeaways for WAF Operators

1. Your Default CRS Is Probably Blocking More Legit Traffic Than You Think

2. Cookies and Referer Headers Are False Positive Factories

3. PL2 Is a Trap (Without Tuning)

4. Custom Rules for Encoding Evasion Are Free Wins

5. AI Agents Can Optimize WAF Config

Methodology and Limitations

What This Is

What This Is Not

Dataset Limitations

Reproducibility

Transparency

What's Next

Tags

Frequently Asked Questions

Related Posts

An AI Agent Improved OWASP CRS Detection by 80% in 20 Experiments