INHUMAIN.AI
The Watchdog Platform for Inhuman Intelligence
Documenting What Happens When Intelligence Stops Being Human
AI Incidents (2026): 847 ▲ +23% | Countries with AI Laws: 41 ▲ +8 YTD | HUMAIN Partnerships: $23B ▲ +$3B | EU AI Act Fines: €14M ▲ New | AI Safety Funding: $2.1B ▲ +45% | OpenAI Valuation: $157B ▲ +34% | AI Job Displacement: 14M ▲ +2.1M | HUMAIN Watch: ACTIVE 24/7

How to Audit an AI System: Complete Technical and Legal Guide

Step-by-step methodology for auditing AI systems — bias detection, fairness assessment, risk evaluation, and compliance verification. Covers IBM AIF360, Google What-If Tool, Fairlearn, and emerging audit standards.

AI auditing is no longer optional. New York City requires annual bias audits for automated employment decision tools. The EU AI Act mandates conformity assessments for high-risk AI systems. Colorado’s AI Act requires impact assessments. The question is no longer whether to audit your AI systems, but how.

This guide provides a comprehensive methodology for auditing AI systems, covering technical assessment, legal compliance, and practical tooling. It is designed for audit teams, compliance officers, AI engineers, and organizational leaders responsible for AI governance.


What Is an AI Audit?

An AI audit is a structured, systematic evaluation of an AI system’s properties, behavior, impacts, and governance. Unlike traditional software testing — which focuses on whether a system meets functional specifications — an AI audit evaluates whether a system operates fairly, safely, transparently, and in compliance with applicable laws and organizational policies.

AI audits address questions that functional testing cannot:

  • Does the system produce biased outcomes across demographic groups?
  • Are the system’s decisions explainable to affected individuals?
  • Does the system comply with privacy regulations?
  • Has the system been tested for adversarial robustness?
  • Is human oversight adequate and effective?
  • Does the system’s real-world performance match its development-phase evaluation?

Types of AI Audits

Bias and Fairness Audit

Evaluates whether an AI system produces disparate outcomes across protected groups. This is the most commonly mandated audit type (NYC LL 144, Colorado AI Act).

Scope: Statistical analysis of system outputs across demographic categories (race, gender, age, disability, etc.).

Safety and Risk Audit

Evaluates whether an AI system operates within acceptable safety boundaries and whether risk management processes are adequate.

Scope: Technical testing for failure modes, edge cases, adversarial attacks, and cascading failures.

Compliance Audit

Evaluates whether an AI system and its governance processes comply with applicable legal requirements.

Scope: Documentation review, process verification, and conformity assessment against regulatory requirements (EU AI Act, GDPR, sector-specific regulation).

Ethics Audit

Evaluates whether an AI system’s design and deployment align with ethical principles and organizational values.

Scope: Value alignment assessment, stakeholder impact analysis, and ethical risk evaluation.

Technical Performance Audit

Evaluates whether an AI system performs as specified and whether its performance degrades over time.

Scope: Accuracy, precision, recall, latency, throughput, and performance drift analysis.


Step-by-Step Audit Methodology

Phase 1: Scoping and Planning

Step 1.1: Define Audit Objectives

Determine the purpose of the audit. Is it legally mandated (NYC LL 144 bias audit)? Is it part of a conformity assessment (EU AI Act)? Is it an internal governance exercise? The objectives determine the scope, methodology, and reporting requirements.

Step 1.2: Identify the AI System

Document the AI system under audit:

  • System name and version
  • Developer and deployer
  • Intended purpose and use cases
  • Deployment context and affected populations
  • Data sources and training methodology
  • Model architecture and type
  • Decision-making domain (hiring, credit, healthcare, etc.)
  • Integration points with other systems
  • Human oversight mechanisms

Step 1.3: Determine Applicable Requirements

Identify all legal, regulatory, and organizational requirements that apply to the system:

  • EU AI Act requirements (if applicable)
  • GDPR requirements (if processing personal data)
  • Sector-specific regulations (FDA, SEC, EEOC, etc.)
  • State/local requirements (NYC LL 144, Colorado AI Act)
  • Organizational AI governance policies
  • Industry standards (NIST AI RMF, ISO 42001)

Step 1.4: Establish Audit Team

The audit team should include:

  • AI/ML technical expertise (model evaluation, data science)
  • Legal expertise (regulatory compliance, privacy)
  • Domain expertise (understanding of the application domain)
  • Ethics expertise (fairness, impact assessment)
  • Independence (auditors should be independent from the development team)

Step 1.5: Develop Audit Plan

Document the audit scope, methodology, timeline, data access requirements, stakeholder interviews, and deliverables.

Phase 2: Documentation Review

Step 2.1: Technical Documentation

Review the system’s technical documentation, including:

  • System design documents
  • Model architecture and training methodology
  • Training data documentation (sources, curation, labeling)
  • Validation and testing results
  • Performance metrics and benchmarks
  • Known limitations and failure modes
  • Change log and version history

Step 2.2: Governance Documentation

Review the organizational governance framework:

  • AI risk management policies
  • Data governance procedures
  • Human oversight protocols
  • Incident response plans
  • Post-deployment monitoring processes
  • Roles and responsibilities

Step 2.3: Compliance Documentation

Review documentation specific to regulatory compliance:

  • Data Protection Impact Assessments (GDPR)
  • Fundamental rights impact assessments (EU AI Act)
  • Risk classification documentation
  • Conformity assessment records
  • EU database registration (if applicable)

Phase 3: Technical Assessment

Step 3.1: Data Quality Assessment

Evaluate the training, validation, and testing data:

  • Representativeness: Does the data adequately represent the populations the system will affect?
  • Completeness: Are there significant gaps in the data?
  • Accuracy: Are labels and annotations accurate?
  • Bias assessment: Are there systematic biases in the data that could produce discriminatory outcomes?
  • Currency: Is the data sufficiently current for the intended purpose?
  • Privacy compliance: Was the data collected and processed in compliance with applicable privacy laws?
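A representativeness check can be as simple as comparing each group's share of the training data against a reference population. The sketch below is illustrative: the group names, counts, and 5% tolerance are made up for demonstration, not drawn from any real audit.

```python
# Sketch: flag demographic groups whose share of the training data
# diverges from a reference population share by more than a tolerance.
# Group names, counts, and the tolerance are illustrative.

def representation_gaps(train_counts, reference_shares, tolerance=0.05):
    """Return {group: (train_share, reference_share)} for groups whose
    training-data share differs from the reference by > tolerance."""
    total = sum(train_counts.values())
    gaps = {}
    for group, ref_share in reference_shares.items():
        train_share = train_counts.get(group, 0) / total
        if abs(train_share - ref_share) > tolerance:
            gaps[group] = (round(train_share, 3), ref_share)
    return gaps

train_counts = {"group_a": 700, "group_b": 250, "group_c": 50}
reference = {"group_a": 0.55, "group_b": 0.30, "group_c": 0.15}
print(representation_gaps(train_counts, reference))
```

Here group_a is over-represented and group_c badly under-represented, which is exactly the kind of gap that later surfaces as disparate performance.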

Step 3.2: Model Performance Evaluation

Evaluate the system’s performance across relevant metrics:

  • Accuracy metrics: Overall accuracy, precision, recall, F1 score, AUC-ROC
  • Performance disaggregation: Break down performance metrics by demographic group to identify disparate performance
  • Calibration: Do the system’s confidence scores accurately reflect its actual accuracy?
  • Robustness: How does performance change with noisy, incomplete, or adversarial inputs?
  • Performance drift: Has the system’s performance degraded since deployment?
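Performance disaggregation is mechanically simple: compute the same metric separately per group and compare. A minimal sketch with illustrative labels, predictions, and group assignments:

```python
# Sketch: break accuracy down by demographic group to surface disparate
# performance. Labels, predictions, and group labels are illustrative.

from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} computed over each group's rows."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(accuracy_by_group(y_true, y_pred, groups))
```

The same pattern applies to precision, recall, or false positive rate; a gap between groups that is invisible in the aggregate metric is a core audit finding.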

Step 3.3: Fairness Assessment

Evaluate the system’s outputs for fairness across protected groups. Multiple fairness metrics exist, and no single metric captures all dimensions of fairness:

Metric                  | Definition                                                      | Tool Support
Demographic parity      | Equal positive outcome rates across groups                      | AIF360, Fairlearn
Equalized odds          | Equal true positive and false positive rates across groups      | AIF360, Fairlearn
Predictive parity       | Equal positive predictive values across groups                  | AIF360
Individual fairness     | Similar individuals receive similar outcomes                    | AIF360
Disparate impact ratio  | Selection rate ratio between groups (4/5ths rule)               | AIF360, Fairlearn, Aequitas
Counterfactual fairness | Outcome would be the same if protected attribute were different | Custom implementation
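The disparate impact ratio is the metric most often mandated, so it is worth seeing in full. This plain-Python sketch computes the same quantity AIF360 and Fairlearn report; the group names and outcome lists are illustrative.

```python
# Sketch: disparate impact ratio under the 4/5ths (80%) rule -- each
# group's selection rate divided by the highest group's rate.
# Group names and outcomes are illustrative.

def selection_rates(outcomes):
    """outcomes: {group: list of 0/1 decisions} -> {group: rate}."""
    return {g: sum(d) / len(d) for g, d in outcomes.items()}

def disparate_impact(outcomes):
    """Return (ratios, passes_80pct_rule)."""
    rates = selection_rates(outcomes)
    top = max(rates.values())
    ratios = {g: r / top for g, r in rates.items()}
    return ratios, all(r >= 0.8 for r in ratios.values())

outcomes = {
    "group_a": [1, 1, 1, 0, 1],  # 80% selected
    "group_b": [1, 0, 1, 0, 0],  # 40% selected
}
ratios, ok = disparate_impact(outcomes)
print(ratios, ok)  # group_b's ratio is 0.5, failing the 80% rule
```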

Step 3.4: Explainability Assessment

Evaluate whether the system’s decisions can be explained to affected individuals and oversight personnel:

  • Global explainability: Can the system’s overall decision-making logic be described?
  • Local explainability: Can individual decisions be explained?
  • Feature importance: Which input features most influence the system’s outputs?
  • Counterfactual explanations: What would need to change for a different outcome?
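For a simple scoring model, a counterfactual explanation can be computed in closed form: how far would one feature need to move for the decision to change? The weights, threshold, and feature names below are illustrative, not a real credit model.

```python
# Sketch: a counterfactual explanation for a linear scoring model --
# the change to a single feature that moves the score to the decision
# threshold, all else fixed. Weights and features are illustrative.

def counterfactual_delta(weights, x, threshold, feature):
    """Return how much `feature` must change for the weighted score
    to reach `threshold`, holding every other feature fixed."""
    score = sum(weights[f] * x[f] for f in weights)
    gap = threshold - score
    return gap / weights[feature]

weights = {"income": 0.5, "debt": -1.0}
applicant = {"income": 40, "debt": 10}   # score = 0.5*40 - 1.0*10 = 10
delta = counterfactual_delta(weights, applicant, threshold=15, feature="income")
print(delta)  # income must rise by 10 units to reach the threshold
```

For non-linear models this becomes a search problem, but the audit question is the same: is the required change actionable and plausible for the affected individual?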

Step 3.5: Security Assessment

Evaluate the system’s vulnerability to adversarial attacks and misuse:

  • Adversarial input testing (evasion attacks)
  • Data poisoning vulnerability assessment
  • Model extraction risk evaluation
  • Privacy attack testing (membership inference, model inversion)
  • Access control review
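Evasion testing can be illustrated on a hand-rolled logistic model: shift each input feature slightly in the direction that lowers the positive-class score, in the spirit of the fast gradient sign method. The weights, inputs, and epsilon below are illustrative; real assessments use a toolkit such as ART against the actual model.

```python
# Sketch: an FGSM-style evasion probe on a toy logistic model. For a
# linear score w.x, the gradient w.r.t. x is w, so moving each feature
# by eps against sign(w_i) lowers the positive-class score. The model
# weights, input, and eps are illustrative.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def fgsm_perturb(w, x, eps):
    """Shift each feature by eps against the model's weight sign."""
    sign = lambda v: (v > 0) - (v < 0)
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]

w = [2.0, -1.0]
x = [1.0, 0.5]
x_adv = fgsm_perturb(w, x, eps=0.6)
print(predict(w, x), predict(w, x_adv))  # decision flips across 0.5
```

The audit question is how small eps can be while still flipping decisions: if imperceptible perturbations change outcomes, the system is fragile.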

Phase 4: Operational Assessment

Step 4.1: Human Oversight Evaluation

Assess whether human oversight mechanisms are effective:

  • Can human overseers understand the system’s outputs?
  • Do they have the authority and ability to override the system?
  • Is the volume and pace of decisions compatible with meaningful human review?
  • Are overseers trained to identify system errors and biases?
  • Is there evidence of automation bias (humans deferring to AI without critical evaluation)?
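One rough, quantifiable signal for the last question is the override rate: how often human reviewers actually disagree with the AI recommendation. The log records below are illustrative; a near-zero rate sustained over many cases is a prompt for scrutiny, not proof of automation bias.

```python
# Sketch: measure how often human reviewers override AI
# recommendations. The review log below is illustrative.

def override_rate(log):
    """log: list of (ai_recommendation, human_decision) pairs."""
    overrides = sum(1 for ai, human in log if ai != human)
    return overrides / len(log)

review_log = [("approve", "approve"), ("deny", "deny"),
              ("approve", "approve"), ("deny", "approve"),
              ("approve", "approve")]
print(override_rate(review_log))  # 0.2
```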

Step 4.2: Monitoring and Incident Response

Assess post-deployment governance:

  • Are system outputs monitored for performance degradation and drift?
  • Are feedback mechanisms in place for affected individuals?
  • Is there an incident response plan for system failures or harmful outcomes?
  • Are incidents documented and reported as required?
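Drift monitoring usually rests on a distribution-comparison statistic. A common one is the Population Stability Index (PSI), sketched here in plain Python; the bin counts are illustrative, and the frequently cited rule of thumb treats PSI above roughly 0.2 as significant drift.

```python
# Sketch: Population Stability Index (PSI), a common drift metric that
# compares a score distribution in production against the validation
# baseline, bin by bin. Bin counts below are illustrative.

import math

def psi(expected_counts, actual_counts):
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct, a_pct = e / e_total, a / a_total
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total

baseline = [200, 300, 300, 200]   # score-bin counts at validation
current  = [100, 250, 350, 300]   # score-bin counts in production
print(round(psi(baseline, current), 3))
```

Note the formula is undefined for empty bins; production implementations smooth zero counts before taking the log.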

Step 4.3: Stakeholder Impact Assessment

Assess the system’s actual impact on affected individuals and communities:

  • What decisions does the system make or influence?
  • Who is affected by those decisions?
  • What are the consequences of incorrect decisions?
  • Are there adequate appeal and redress mechanisms?
  • Have affected communities been consulted?

Phase 5: Reporting

Step 5.1: Audit Findings

Document all findings, categorized by severity:

  • Critical: Findings indicating current harm, legal non-compliance, or imminent risk
  • Major: Findings indicating significant risk, material non-compliance, or systemic issues
  • Minor: Findings indicating areas for improvement or emerging risks
  • Observation: Findings noting best practice deviations without immediate risk

Step 5.2: Recommendations

Provide specific, actionable recommendations for each finding, including:

  • Remediation actions
  • Timeline for implementation
  • Resources required
  • Priority ranking

Step 5.3: Compliance Determination

For compliance audits, provide a clear determination of whether the system meets applicable requirements, with supporting evidence.


Open-Source Audit Tools

IBM AI Fairness 360 (AIF360)

Purpose: Comprehensive bias detection and mitigation toolkit.

Capabilities:

  • 70+ fairness metrics
  • Pre-processing bias mitigation (reweighting, disparate impact remover)
  • In-processing bias mitigation (adversarial debiasing, prejudice remover)
  • Post-processing bias mitigation (calibrated equalized odds, reject option classification)
  • Dataset bias analysis

Best for: Comprehensive fairness audits requiring multiple metrics and mitigation strategies.

Google What-If Tool

Purpose: Visual exploration of machine learning model behavior.

Capabilities:

  • Interactive visualization of model performance across data slices
  • Fairness metric comparison across groups
  • Counterfactual analysis (what-if scenarios)
  • Feature attribution and importance analysis
  • Integration with TensorBoard

Best for: Exploratory analysis and visual communication of audit findings to non-technical stakeholders.

Microsoft Fairlearn

Purpose: Fairness assessment and mitigation for machine learning models.

Capabilities:

  • Dashboard for fairness assessment across demographic groups
  • Mitigation algorithms (exponentiated gradient, grid search, threshold optimizer)
  • Group fairness metrics
  • Integration with scikit-learn ecosystem

Best for: Fairness audits in Python-based ML pipelines with a focus on mitigation.
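The idea behind Fairlearn's threshold optimizer can be sketched without the library: choose a per-group score cutoff so each group is selected at the same target rate. Everything below is illustrative; Fairlearn's ThresholdOptimizer does this properly, optimizing against a chosen fairness constraint.

```python
# Sketch of post-processing threshold optimization: pick a per-group
# score cutoff so each group's selection rate hits a target rate.
# Scores, groups, and the target rate are illustrative.

def threshold_for_rate(scores, target_rate):
    """Return the cutoff that selects roughly target_rate of `scores`."""
    ranked = sorted(scores, reverse=True)
    k = max(1, round(target_rate * len(ranked)))
    return ranked[k - 1]

groups = {
    "group_a": [0.9, 0.8, 0.7, 0.4, 0.3],
    "group_b": [0.6, 0.5, 0.4, 0.2, 0.1],
}
cutoffs = {g: threshold_for_rate(s, target_rate=0.4)
           for g, s in groups.items()}
print(cutoffs)  # selecting scores >= cutoff picks 40% of each group
```

Auditors should note the trade-off this makes explicit: equalizing selection rates requires applying different cutoffs to different groups, which may itself raise legal questions in some jurisdictions.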

Aequitas

Purpose: Bias and fairness audit toolkit developed by the University of Chicago.

Capabilities:

  • Audit disparities across multiple protected attributes
  • Group-level and individual-level fairness analysis
  • Web-based interface for non-technical users
  • Report generation

Best for: Quick fairness assessments with accessible visualization.

SHAP (SHapley Additive exPlanations)

Purpose: Model explainability through Shapley values.

Capabilities:

  • Feature importance for individual predictions
  • Global feature importance across the dataset
  • Interaction effects between features
  • Support for tree-based models, deep learning, and linear models

Best for: Explainability assessment and feature attribution analysis.
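The quantity SHAP approximates can be computed exactly for tiny models by enumerating feature coalitions, which makes the definition concrete. The toy model, inputs, and baseline below are illustrative.

```python
# Sketch: exact Shapley values by enumerating coalitions -- the
# quantity SHAP approximates efficiently for real models. Features
# absent from a coalition are set to a baseline value.

from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # standard Shapley coalition weight |S|!(n-|S|-1)!/n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if j in subset or j == i else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                total += weight * (f(with_i) - f(without_i))
        phi.append(total)
    return phi

model = lambda v: 2.0 * v[0] + 1.0 * v[1]   # a toy linear scorer
print(shapley_values(model, x=[3.0, 5.0], baseline=[0.0, 0.0]))  # [6.0, 5.0]
```

For a linear model each feature's Shapley value is just its weight times its deviation from baseline, and the values sum to f(x) − f(baseline); that additivity property is what makes Shapley attributions auditable.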

LIME (Local Interpretable Model-Agnostic Explanations)

Purpose: Local explainability for any machine learning model.

Capabilities:

  • Model-agnostic local explanations
  • Text, image, and tabular data support
  • Human-interpretable explanations for individual predictions

Best for: Generating human-understandable explanations of individual AI decisions.

Adversarial Robustness Toolbox (ART)

Purpose: Security testing for machine learning models.

Capabilities:

  • Adversarial attack simulation (evasion, poisoning, extraction, inference)
  • Defense implementation (preprocessing, postprocessing, trainer)
  • Robustness evaluation across attack types

Best for: Security assessments and adversarial robustness testing.

Additional Tools

Tool                      | Developer      | Focus
Evidently AI              | Evidently      | ML monitoring and testing
Great Expectations        | Community      | Data quality validation
Responsible AI Toolbox    | Microsoft      | Comprehensive RAI toolkit
TensorFlow Model Analysis | Google         | Model evaluation at scale
Alibi                     | SeldonIO       | ML model inspection and interpretation
AI Verify                 | IMDA Singapore | AI governance testing

Regulatory Audit Requirements

NYC Local Law 144

Requirement: Annual bias audit by an independent auditor for automated employment decision tools.

Specifications:

  • Must calculate selection or scoring rate for each demographic category (sex/gender, race/ethnicity, intersectional)
  • Must calculate disparate impact ratio using the 4/5ths (80%) rule
  • Results must be published on the employer’s website
  • Audit must cover the most recent year of data (or test data if insufficient historical data)
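The published LL 144 figures reduce to two numbers per category: the selection rate, and the impact ratio relative to the highest-rate category. A sketch with illustrative applicant counts (real audits also cover race/ethnicity and intersectional categories):

```python
# Sketch of the LL 144-style calculation: selection rate per category
# and each category's impact ratio relative to the highest-rate
# category. Applicant and selection counts below are illustrative.

def ll144_table(selected, applicants):
    """selected/applicants: {category: count} -> per-category stats."""
    rates = {c: selected[c] / applicants[c] for c in applicants}
    top = max(rates.values())
    return {c: {"selection_rate": round(r, 3),
                "impact_ratio": round(r / top, 3)}
            for c, r in rates.items()}

selected   = {"male": 60, "female": 40}
applicants = {"male": 100, "female": 100}
print(ll144_table(selected, applicants))
```

In this example the female category's impact ratio of 0.667 falls below the 0.8 benchmark, the kind of result that must be published on the employer's website.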

EU AI Act Conformity Assessment

Requirement: Conformity assessment before placing high-risk AI systems on the EU market.

Specifications:

  • Internal control procedure (most Annex III systems): Provider self-assessment against Articles 8-15
  • Third-party assessment (biometric identification): Independent notified body assessment
  • Must verify quality management system, technical documentation, and system conformity
  • Must result in EU declaration of conformity and CE marking

Colorado AI Act Impact Assessment

Requirement: Impact assessment for high-risk AI systems before deployment.

Specifications:

  • Purpose and intended use of the system
  • Analysis of the system’s benefits and risks
  • Assessment of data used by the system
  • Known or foreseeable risks of algorithmic discrimination
  • Mitigation measures implemented
  • Description of post-deployment monitoring

Building an Audit Program

Organizational Readiness

Before conducting individual system audits, establish organizational readiness:

  1. AI inventory: Maintain a comprehensive inventory of all AI systems in use or development
  2. Risk classification: Classify each system by risk level to prioritize audit resources
  3. Governance framework: Adopt a governance framework (NIST AI RMF, ISO 42001) that provides the structural foundation for auditing
  4. Audit capability: Build or acquire audit expertise (internal team, external auditor, or hybrid)
  5. Documentation standards: Establish documentation requirements that facilitate auditing
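As a starting point, an inventory entry can be a small structured record. The field names and risk tiers below are illustrative; adapt them to whichever governance framework (NIST AI RMF, ISO 42001) you adopt.

```python
# Sketch: a minimal AI-inventory record. Field names and risk-tier
# labels are illustrative placeholders.

from dataclasses import dataclass, field

@dataclass
class AISystemRecord:
    name: str
    version: str
    deployer: str
    purpose: str
    risk_tier: str                      # e.g. "high", "limited", "minimal"
    applicable_rules: list = field(default_factory=list)
    last_audit: str = "never"

record = AISystemRecord(
    name="resume-screener", version="2.1", deployer="HR Ops",
    purpose="rank job applicants", risk_tier="high",
    applicable_rules=["NYC LL 144", "EU AI Act"],
)
print(record.risk_tier, record.applicable_rules)
```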

Audit Frequency

System Risk Level                  | Recommended Frequency                              | Regulatory Minimum
High-risk (EU AI Act)              | Continuous monitoring + annual comprehensive audit | Conformity assessment (pre-market) + ongoing compliance
Consequential decisions (Colorado) | Annual + event-driven                              | Impact assessment before deployment
Employment AEDT (NYC)              | Annual bias audit                                  | Annual bias audit
Other regulated                    | Annual                                             | Varies by sector
Internal/low-risk                  | Biennial or event-driven                           | None

Audit Independence

Independence is essential to audit credibility. At minimum:

  • Auditors should not report to the AI development team
  • External auditors should have no financial relationship with the AI provider beyond the audit engagement
  • Audit findings should be reported to senior leadership or board level
  • Audit methodologies and tools should be disclosed

This guide is maintained by INHUMAIN.AI. For related coverage, see our Global AI Regulation Tracker, EU AI Act Complete Guide, AI Governance Frameworks Comparison, and AI Liability Guide.