How to Audit an AI System: Complete Technical and Legal Guide
Step-by-step methodology for auditing AI systems — bias detection, fairness assessment, risk evaluation, and compliance verification. Covers IBM AIF360, Google What-If Tool, Fairlearn, and emerging audit standards.
AI auditing is no longer optional. New York City requires annual bias audits for automated employment decision tools. The EU AI Act mandates conformity assessments for high-risk AI systems. Colorado’s AI Act requires impact assessments. The question is no longer whether to audit your AI systems, but how.
This guide provides a comprehensive methodology for auditing AI systems, covering technical assessment, legal compliance, and practical tooling. It is designed for audit teams, compliance officers, AI engineers, and organizational leaders responsible for AI governance.
What Is an AI Audit?
An AI audit is a structured, systematic evaluation of an AI system’s properties, behavior, impacts, and governance. Unlike traditional software testing — which focuses on whether a system meets functional specifications — an AI audit evaluates whether a system operates fairly, safely, transparently, and in compliance with applicable laws and organizational policies.
AI audits address questions that functional testing cannot:
- Does the system produce biased outcomes across demographic groups?
- Are the system’s decisions explainable to affected individuals?
- Does the system comply with privacy regulations?
- Has the system been tested for adversarial robustness?
- Is human oversight adequate and effective?
- Does the system’s real-world performance match its development-phase evaluation?
Types of AI Audits
Bias and Fairness Audit
Evaluates whether an AI system produces disparate outcomes across protected groups. This is the most commonly mandated audit type (NYC LL 144, Colorado AI Act).
Scope: Statistical analysis of system outputs across demographic categories (race, gender, age, disability, etc.).
Safety and Risk Audit
Evaluates whether an AI system operates within acceptable safety boundaries and whether risk management processes are adequate.
Scope: Technical testing for failure modes, edge cases, adversarial attacks, and cascading failures.
Compliance Audit
Evaluates whether an AI system and its governance processes comply with applicable legal requirements.
Scope: Documentation review, process verification, and conformity assessment against regulatory requirements (EU AI Act, GDPR, sector-specific regulation).
Ethics Audit
Evaluates whether an AI system’s design and deployment align with ethical principles and organizational values.
Scope: Value alignment assessment, stakeholder impact analysis, and ethical risk evaluation.
Technical Performance Audit
Evaluates whether an AI system performs as specified and whether its performance degrades over time.
Scope: Accuracy, precision, recall, latency, throughput, and performance drift analysis.
Step-by-Step Audit Methodology
Phase 1: Scoping and Planning
Step 1.1: Define Audit Objectives
Determine the purpose of the audit. Is it legally mandated (NYC LL 144 bias audit)? Is it part of a conformity assessment (EU AI Act)? Is it an internal governance exercise? The objectives determine the scope, methodology, and reporting requirements.
Step 1.2: Identify the AI System
Document the AI system under audit:
- System name and version
- Developer and deployer
- Intended purpose and use cases
- Deployment context and affected populations
- Data sources and training methodology
- Model architecture and type
- Decision-making domain (hiring, credit, healthcare, etc.)
- Integration points with other systems
- Human oversight mechanisms
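The identification checklist above can be made machine-checkable by capturing it as a structured record, so that incomplete system documentation is caught before the audit proceeds. A minimal sketch in Python; the class name, field names, and example values are all hypothetical, not drawn from any standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class AuditedSystem:
    """Minimal record of the AI system under audit (illustrative fields only)."""
    name: str
    version: str
    developer: str
    deployer: str
    intended_purpose: str
    decision_domain: str              # e.g. "hiring", "credit", "healthcare"
    affected_populations: list = field(default_factory=list)
    data_sources: list = field(default_factory=list)
    model_type: str = ""
    human_oversight: str = ""         # description of the oversight mechanism

# Hypothetical example system
system = AuditedSystem(
    name="ResumeScreener", version="2.3.1",
    developer="VendorCo", deployer="ExampleCorp",
    intended_purpose="Rank job applicants for recruiter review",
    decision_domain="hiring",
    affected_populations=["job applicants"],
    data_sources=["historical hiring records"],
    model_type="gradient-boosted trees",
    human_oversight="recruiter reviews every ranked shortlist",
)
```

In practice this record would feed the AI inventory discussed later in this guide, so each audit starts from the same structured description.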
Step 1.3: Determine Applicable Requirements
Identify all legal, regulatory, and organizational requirements that apply to the system:
- EU AI Act requirements (if applicable)
- GDPR requirements (if processing personal data)
- Sector-specific regulations (FDA, SEC, EEOC, etc.)
- State/local requirements (NYC LL 144, Colorado AI Act)
- Organizational AI governance policies
- Industry standards (NIST AI RMF, ISO 42001)
Step 1.4: Establish Audit Team
The audit team should include:
- AI/ML technical expertise (model evaluation, data science)
- Legal expertise (regulatory compliance, privacy)
- Domain expertise (understanding of the application domain)
- Ethics expertise (fairness, impact assessment)
- Independence (auditors should be independent of the development team)
Step 1.5: Develop Audit Plan
Document the audit scope, methodology, timeline, data access requirements, stakeholder interviews, and deliverables.
Phase 2: Documentation Review
Step 2.1: Technical Documentation
Review the system’s technical documentation, including:
- System design documents
- Model architecture and training methodology
- Training data documentation (sources, curation, labeling)
- Validation and testing results
- Performance metrics and benchmarks
- Known limitations and failure modes
- Change log and version history
Step 2.2: Governance Documentation
Review the organizational governance framework:
- AI risk management policies
- Data governance procedures
- Human oversight protocols
- Incident response plans
- Post-deployment monitoring processes
- Roles and responsibilities
Step 2.3: Compliance Documentation
Review documentation specific to regulatory compliance:
- Data Protection Impact Assessments (GDPR)
- Fundamental rights impact assessments (EU AI Act)
- Risk classification documentation
- Conformity assessment records
- EU database registration (if applicable)
Phase 3: Technical Assessment
Step 3.1: Data Quality Assessment
Evaluate the training, validation, and testing data:
- Representativeness: Does the data adequately represent the populations the system will affect?
- Completeness: Are there significant gaps in the data?
- Accuracy: Are labels and annotations accurate?
- Bias assessment: Are there systematic biases in the data that could produce discriminatory outcomes?
- Currency: Is the data sufficiently current for the intended purpose?
- Privacy compliance: Was the data collected and processed in compliance with applicable privacy laws?
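The representativeness check above can be quantified by comparing group shares in the audit sample against reference population shares. A minimal sketch, assuming a flat list of group labels and a chosen absolute-deviation tolerance (the 0.05 default is an illustrative threshold, not a standard):

```python
from collections import Counter

def representativeness_gaps(sample_groups, population_shares, tolerance=0.05):
    """Compare group shares in the sample against reference population
    shares; return groups whose share deviates by more than `tolerance`
    (absolute difference, positive = over-represented)."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    gaps = {}
    for group, expected in population_shares.items():
        observed = counts.get(group, 0) / n
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 3)
    return gaps

sample = ["A"] * 80 + ["B"] * 20          # training data: 80% A, 20% B
population = {"A": 0.60, "B": 0.40}       # reference population shares
print(representativeness_gaps(sample, population))
# {'A': 0.2, 'B': -0.2}
```

A real assessment would also use statistical tests on the deviations and consider intersectional subgroups, which shrink quickly and are easy to miss with marginal checks like this one.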
Step 3.2: Model Performance Evaluation
Evaluate the system’s performance across relevant metrics:
- Accuracy metrics: Overall accuracy, precision, recall, F1 score, AUC-ROC
- Performance disaggregation: Break down performance metrics by demographic group to identify disparate performance
- Calibration: Do the system’s confidence scores accurately reflect its actual accuracy?
- Robustness: How does performance change with noisy, incomplete, or adversarial inputs?
- Performance drift: Has the system’s performance degraded since deployment?
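The performance-disaggregation step can be sketched in a few lines of plain Python: compute accuracy, true positive rate, and false positive rate per group and compare. The function name and input shape (parallel lists of 0/1 labels) are illustrative; production audits would typically use scikit-learn or Fairlearn equivalents:

```python
def group_metrics(y_true, y_pred, groups):
    """Disaggregate accuracy, TPR, and FPR by demographic group.
    Inputs are parallel lists; labels and predictions are 0/1."""
    out = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        tp = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(yt, yp) if t == 0 and p == 1)
        pos = sum(yt)               # actual positives in this group
        neg = len(yt) - pos         # actual negatives in this group
        out[g] = {
            "accuracy": sum(1 for t, p in zip(yt, yp) if t == p) / len(yt),
            "tpr": tp / pos if pos else None,
            "fpr": fp / neg if neg else None,
        }
    return out

# Two groups with identical overall accuracy but very different error profiles
m = group_metrics(
    y_true=[1, 0, 1, 0, 1, 0, 1, 0],
    y_pred=[1, 0, 0, 0, 1, 1, 1, 0],
    groups=["A"] * 4 + ["B"] * 4,
)
```

In this toy example both groups score 0.75 accuracy, yet group A's TPR is 0.5 while group B's is 1.0 — exactly the kind of disparity that aggregate metrics hide and that disaggregation is meant to surface.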
Step 3.3: Fairness Assessment
Evaluate the system’s outputs for fairness across protected groups. Multiple fairness metrics exist, and no single metric captures all dimensions of fairness:
| Metric | Definition | Tool Support |
|---|---|---|
| Demographic parity | Equal positive outcome rates across groups | AIF360, Fairlearn |
| Equalized odds | Equal true positive and false positive rates across groups | AIF360, Fairlearn |
| Predictive parity | Equal positive predictive values across groups | AIF360 |
| Individual fairness | Similar individuals receive similar outcomes | AIF360 |
| Disparate impact ratio | Selection rate ratio between groups (4/5ths rule) | AIF360, Fairlearn, Aequitas |
| Counterfactual fairness | Outcome would be the same if protected attribute were different | Custom implementation |
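To make the first table row concrete, demographic parity can be checked by comparing positive-outcome (selection) rates across groups. A minimal plain-Python sketch; Fairlearn exposes an equivalent `demographic_parity_difference` metric, but the implementation below is a simplified illustration, not that library's code:

```python
def selection_rate(preds):
    """Fraction of positive (1) predictions."""
    return sum(preds) / len(preds)

def demographic_parity_difference(y_pred, groups):
    """Largest gap in selection rates across groups.
    Returns (max rate - min rate, per-group rates); 0.0 means
    perfect demographic parity on this sample."""
    rates = {}
    for g in set(groups):
        group_preds = [p for p, gg in zip(y_pred, groups) if gg == g]
        rates[g] = selection_rate(group_preds)
    return max(rates.values()) - min(rates.values()), rates

diff, rates = demographic_parity_difference(
    y_pred=[1, 1, 1, 0, 1, 0, 0, 0],
    groups=["A"] * 4 + ["B"] * 4,
)
print(diff, rates)
# 0.5 {'A': 0.75, 'B': 0.25}
```

Note that satisfying demographic parity can conflict with equalized odds or predictive parity on the same data, which is why the table lists multiple metrics and the audit must document which ones were chosen and why.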
Step 3.4: Explainability Assessment
Evaluate whether the system’s decisions can be explained to affected individuals and oversight personnel:
- Global explainability: Can the system’s overall decision-making logic be described?
- Local explainability: Can individual decisions be explained?
- Feature importance: Which input features most influence the system’s outputs?
- Counterfactual explanations: What would need to change for a different outcome?
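The counterfactual-explanation question above can be answered mechanically for simple models: search for the smallest input change that flips the decision. The brute-force sketch below assumes a hypothetical credit model and a hand-picked set of candidate edits; real counterfactual tooling searches the feature space far more efficiently and enforces plausibility constraints:

```python
def counterfactuals(model, instance, candidates, target=1):
    """Yield single-feature edits that flip the model's decision to
    `target`. `candidates` maps feature name -> alternative values.
    Brute-force and single-feature only: an illustration, not a tool."""
    for feat, values in candidates.items():
        for v in values:
            edited = dict(instance, **{feat: v})
            if model(edited) == target:
                yield feat, v

# Hypothetical credit model: approve (1) if income >= 50 and debt < 20
model = lambda x: int(x["income"] >= 50 and x["debt"] < 20)
applicant = {"income": 40, "debt": 10}          # currently denied

flips = list(counterfactuals(model, applicant, {"income": [50, 60], "debt": [5]}))
print(flips)
# [('income', 50), ('income', 60)]
```

The output reads directly as an explanation for the affected individual: "you would have been approved with an income of 50", which is the form of recourse the local-explainability requirement is ultimately after.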
Step 3.5: Security Assessment
Evaluate the system’s vulnerability to adversarial attacks and misuse:
- Adversarial input testing (evasion attacks)
- Data poisoning vulnerability assessment
- Model extraction risk evaluation
- Privacy attack testing (membership inference, model inversion)
- Access control review
Phase 4: Operational Assessment
Step 4.1: Human Oversight Evaluation
Assess whether human oversight mechanisms are effective:
- Can human overseers understand the system’s outputs?
- Do they have the authority and ability to override the system?
- Is the volume and pace of decisions compatible with meaningful human review?
- Are overseers trained to identify system errors and biases?
- Is there evidence of automation bias (humans deferring to AI without critical evaluation)?
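One quantitative signal for the automation-bias question is the per-reviewer override rate computed from decision logs. The sketch below assumes a hypothetical log format of (reviewer, AI decision, final decision) tuples; a near-zero rate over many cases can indicate rubber-stamping, though it may also just mean the system is accurate, so the number needs qualitative follow-up:

```python
def override_rates(log):
    """Per-reviewer rate at which the AI recommendation was overridden.
    `log` is a list of (reviewer, ai_decision, final_decision) tuples."""
    seen, overridden = {}, {}
    for reviewer, ai_decision, final_decision in log:
        seen[reviewer] = seen.get(reviewer, 0) + 1
        if ai_decision != final_decision:
            overridden[reviewer] = overridden.get(reviewer, 0) + 1
    return {r: overridden.get(r, 0) / n for r, n in seen.items()}

log = [("r1", 1, 1), ("r1", 1, 0), ("r2", 0, 0), ("r2", 1, 1)]
print(override_rates(log))
# {'r1': 0.5, 'r2': 0.0}
```

Comparing override rates across reviewers, and against the system's known error rate, helps distinguish meaningful human review from pro-forma sign-off.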
Step 4.2: Monitoring and Incident Response
Assess post-deployment governance:
- Are system outputs monitored for performance degradation and drift?
- Are feedback mechanisms in place for affected individuals?
- Is there an incident response plan for system failures or harmful outcomes?
- Are incidents documented and reported as required?
Step 4.3: Stakeholder Impact Assessment
Assess the system’s actual impact on affected individuals and communities:
- What decisions does the system make or influence?
- Who is affected by those decisions?
- What are the consequences of incorrect decisions?
- Are there adequate appeal and redress mechanisms?
- Have affected communities been consulted?
Phase 5: Reporting
Step 5.1: Audit Findings
Document all findings, categorized by severity:
- Critical: Findings indicating current harm, legal non-compliance, or imminent risk
- Major: Findings indicating significant risk, material non-compliance, or systemic issues
- Minor: Findings indicating areas for improvement or emerging risks
- Observation: Findings noting best practice deviations without immediate risk
Step 5.2: Recommendations
Provide specific, actionable recommendations for each finding, including:
- Remediation actions
- Timeline for implementation
- Resources required
- Priority ranking
Step 5.3: Compliance Determination
For compliance audits, provide a clear determination of whether the system meets applicable requirements, with supporting evidence.
Open-Source Audit Tools
IBM AI Fairness 360 (AIF360)
Purpose: Comprehensive bias detection and mitigation toolkit.
Capabilities:
- 70+ fairness metrics
- Pre-processing bias mitigation (reweighting, disparate impact remover)
- In-processing bias mitigation (adversarial debiasing, prejudice remover)
- Post-processing bias mitigation (calibrated equalized odds, reject option classification)
- Dataset bias analysis
Best for: Comprehensive fairness audits requiring multiple metrics and mitigation strategies.
Google What-If Tool
Purpose: Visual exploration of machine learning model behavior.
Capabilities:
- Interactive visualization of model performance across data slices
- Fairness metric comparison across groups
- Counterfactual analysis (what-if scenarios)
- Feature attribution and importance analysis
- Integration with TensorBoard
Best for: Exploratory analysis and visual communication of audit findings to non-technical stakeholders.
Microsoft Fairlearn
Purpose: Fairness assessment and mitigation for machine learning models.
Capabilities:
- Dashboard for fairness assessment across demographic groups
- Mitigation algorithms (exponentiated gradient, grid search, threshold optimizer)
- Group fairness metrics
- Integration with scikit-learn ecosystem
Best for: Fairness audits in Python-based ML pipelines with a focus on mitigation.
Aequitas
Purpose: Bias and fairness audit toolkit developed by the University of Chicago.
Capabilities:
- Audit disparities across multiple protected attributes
- Group-level and individual-level fairness analysis
- Web-based interface for non-technical users
- Report generation
Best for: Quick fairness assessments with accessible visualization.
SHAP (SHapley Additive exPlanations)
Purpose: Model explainability through Shapley values.
Capabilities:
- Feature importance for individual predictions
- Global feature importance across the dataset
- Interaction effects between features
- Support for tree-based models, deep learning, and linear models
Best for: Explainability assessment and feature attribution analysis.
LIME (Local Interpretable Model-Agnostic Explanations)
Purpose: Local explainability for any machine learning model.
Capabilities:
- Model-agnostic local explanations
- Text, image, and tabular data support
- Human-interpretable explanations for individual predictions
Best for: Generating human-understandable explanations of individual AI decisions.
Adversarial Robustness Toolbox (ART)
Purpose: Security testing for machine learning models.
Capabilities:
- Adversarial attack simulation (evasion, poisoning, extraction, inference)
- Defense implementation (preprocessing, postprocessing, trainer)
- Robustness evaluation across attack types
Best for: Security assessments and adversarial robustness testing.
Additional Tools
| Tool | Developer | Focus |
|---|---|---|
| Evidently AI | Evidently | ML monitoring and testing |
| Great Expectations | Community | Data quality validation |
| Responsible AI Toolbox | Microsoft | Comprehensive RAI toolkit |
| TensorFlow Model Analysis | Google | Model evaluation at scale |
| Alibi | SeldonIO | ML model inspection and interpretation |
| AI Verify | IMDA Singapore | AI governance testing |
Regulatory Audit Requirements
NYC Local Law 144
Requirement: Annual bias audit by an independent auditor for automated employment decision tools.
Specifications:
- Must calculate selection or scoring rate for each demographic category (sex/gender, race/ethnicity, intersectional)
- Must calculate disparate impact ratio using the 4/5ths (80%) rule
- Results must be published on the employer’s website
- Audit must cover the most recent year of data (or test data if insufficient historical data)
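The core LL 144 calculation — selection rate and impact ratio per category, flagged against the 4/5ths threshold — can be sketched as follows. Counts and category labels are hypothetical, and a published audit must follow the categories and rounding conventions set out in the implementing rules:

```python
def impact_ratios(selected, totals):
    """Selection rate and impact ratio per category.
    `selected` and `totals` map category -> applicant counts.
    Impact ratio = category rate / highest category rate;
    categories below 0.8 are flagged under the 4/5ths rule."""
    rates = {c: selected[c] / totals[c] for c in totals}
    top = max(rates.values())
    return {c: {"selection_rate": round(r, 3),
                "impact_ratio": round(r / top, 3),
                "below_4_5ths": r / top < 0.8}
            for c, r in rates.items()}

# Hypothetical hiring data for one demographic axis
report = impact_ratios(selected={"male": 60, "female": 40},
                       totals={"male": 100, "female": 100})
print(report)
# female selects at 0.4 vs 0.6: impact ratio 0.667, flagged
```

The same function applies per axis (sex/gender, race/ethnicity) and to intersectional categories by keying on tuples such as `("female", "Hispanic")`; small intersectional groups may need the test-data fallback the law allows.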
EU AI Act Conformity Assessment
Requirement: Conformity assessment before placing high-risk AI systems on the EU market.
Specifications:
- Internal control procedure (most Annex III systems): Provider self-assessment against Articles 8-15
- Third-party assessment (biometric identification): Independent notified body assessment
- Must verify quality management system, technical documentation, and system conformity
- Must result in EU declaration of conformity and CE marking
Colorado AI Act Impact Assessment
Requirement: Impact assessment for high-risk AI systems before deployment.
Specifications:
- Purpose and intended use of the system
- Analysis of the system’s benefits and risks
- Assessment of data used by the system
- Known or foreseeable risks of algorithmic discrimination
- Mitigation measures implemented
- Description of post-deployment monitoring
Building an Audit Program
Organizational Readiness
Before conducting individual system audits, establish organizational readiness:
- AI inventory: Maintain a comprehensive inventory of all AI systems in use or development
- Risk classification: Classify each system by risk level to prioritize audit resources
- Governance framework: Adopt a governance framework (NIST AI RMF, ISO 42001) that provides the structural foundation for auditing
- Audit capability: Build or acquire audit expertise (internal team, external auditor, or hybrid)
- Documentation standards: Establish documentation requirements that facilitate auditing
Audit Frequency
| System Risk Level | Recommended Frequency | Regulatory Minimum |
|---|---|---|
| High-risk (EU AI Act) | Continuous monitoring + annual comprehensive audit | Conformity assessment (pre-market) + ongoing compliance |
| Consequential decisions (Colorado) | Annual + event-driven | Impact assessment before deployment |
| Employment AEDT (NYC) | Annual bias audit | Annual bias audit |
| Other regulated | Annual | Varies by sector |
| Internal/low-risk | Biennial or event-driven | None |
Audit Independence
Independence is essential to audit credibility. At minimum:
- Auditors should not report to the AI development team
- External auditors should have no financial relationship with the AI provider beyond the audit engagement
- Audit findings should be reported to senior leadership or board level
- Audit methodologies and tools should be disclosed
This guide is maintained by INHUMAIN.AI. For related coverage, see our Global AI Regulation Tracker, EU AI Act Complete Guide, AI Governance Frameworks Comparison, and AI Liability Guide.