Microsoft Sentinel data lake: implementation guide

Introduction

What is Microsoft Sentinel data lake

Microsoft Sentinel data lake is a purpose-built, cloud-native security data platform that addresses the fundamental challenge organizations face between comprehensive security coverage and cost sustainability. The platform transforms how organizations manage and analyze security data through:

  1. Unifying security data across Microsoft Defender XDR, third-party sources, assets, activity logs, and threat intelligence
  2. Optimizing costs with tiered storage, on-demand data promotion, and single copy architecture
  3. Enabling deep security insights with up to 12 years of queryable security data
  4. Powering AI and automation for faster detection and response

The business challenge solved

Traditional SIEM solutions struggle with the cost and complexity of storing and querying long-term security data. Organizations often face impossible choices between comprehensive security coverage and budget constraints. Microsoft Sentinel data lake solves these challenges through centralized data management, cost-effective storage with significantly lower per-GB costs for historical data, maintained query capabilities ensuring no loss of analytical functionality, and automated data lifecycle management reducing operational overhead.

Two-tier architecture benefits

Data tiers architecture

Microsoft Sentinel data lake implements a sophisticated two-tier architecture optimized for both cost and performance:

| Feature | Analytics tier | Data lake tier |
| --- | --- | --- |
| Key characteristics | High-performance querying and indexing (hot/interactive retention) | Cost-effective long-term retention (cold storage) |
| Best use cases | Real-time analytics rules, alerting, hunting, workbooks, all Sentinel SIEM features | Compliance and regulatory logging, historical trend analysis and forensics, low-touch data not needed for real-time alerts |
| Query price included | Yes – queries included in ingestion cost | No – queries charged separately at $0.005/GB scanned |
| Query performance | Optimized – sub-second response for recent data | Slower – optimized for large-scale scans, good for auditing |
| Query capabilities | Full KQL in Defender and Azure portals via APIs | Full KQL on single table with lookup enrichment, scheduled KQL/Spark jobs, Jupyter notebooks |
| Real-time analytics features | Yes – all detection rules, alerts, workbooks, hunting queries | No – limitations on analytics rules, hunting queries, parsers, watchlists, workbooks, playbooks |
| Search jobs | Supported | Supported |
| Summary rules | Supported | Supported – Full KQL on single table, extendable with analytics table lookups |
| Restore capability | Supported | Not supported |
| Data export | Supported | Not supported |
| Retention period | 90 days (Sentinel) / 30 days (Defender XDR), extendable to 2 years with prorated monthly charge | Same as analytics by default, extendable to 12 years |

Critical Note: The data processing charge ($0.10/GB) applies ONLY to data ingested directly to the data lake tier. Data mirrored from the analytics tier incurs NO processing charge.

Key advantages over Azure Data Explorer (ADX)

While Azure Data Explorer remains a powerful platform, it demands significant administrative investment. Organizations using ADX must manually create and maintain table structures, custom functions, monitoring systems, and data mappings. The Sentinel data lake eliminates this operational burden by having Microsoft handle the underlying infrastructure management.

Impact on Auxiliary logging

When organizations enable the Sentinel data lake, auxiliary log tables automatically disappear from both Defender Advanced Hunting and the Microsoft Sentinel portal interface. However, this data isn’t lost—it’s seamlessly migrated to the data lake environment where analysts can access it through KQL queries or Jupyter notebooks. This transition represents an automatic upgrade from auxiliary table storage to a more scalable data lake architecture.

Prerequisites and requirements

Required permissions

You need specific roles across different scopes:

| Permission Scope | Required Role | Purpose |
| --- | --- | --- |
| Azure Subscription | Subscription Owner | Billing setup and resource creation |
| Microsoft Entra ID | Global Administrator OR Security Administrator | Cross-tenant security platform configuration |
| Workspace Level | Microsoft Sentinel Reader | Access to all participating workspaces |

Technical prerequisites

Infrastructure requirements:

  • Existing Azure subscription with billing access
  • Resource group (existing or new) – cannot be changed after setup
  • Microsoft Sentinel workspace(s) in correct region
  • Microsoft Sentinel primary workspace is connected to the Microsoft Defender XDR portal

Pricing model

Understanding the cost structure of Microsoft Sentinel data lake is essential for planning your implementation and maximizing cost savings. The platform uses a consumption-based pricing model with distinct meters for ingestion, storage, processing, and compute.

The pricing model underwent changes as the platform transitioned from public preview to general availability. Most notably, data processing and asset data ingestion now incur charges, though these costs are offset by the substantial savings on storage and the elimination of analytics tier ingestion costs for appropriately routed data.

Current pricing structure

| Cost component | Rate | What it covers |
| --- | --- | --- |
| Data lake ingestion | $0.05/GB | Data entering data lake tier directly |
| Data processing | $0.10/GB | All transformations (filtering, normalization, enrichment) |
| Data lake storage | $0.026/GB/month | Storage beyond 30-day included period (6:1 compression applied) |
| Query execution | $0.005/GB | Data scanned during KQL queries and jobs |
| Jupyter notebook compute | $0.15/compute hour | Notebook sessions (cores × session time) |
| Asset data ingestion | $0.05/GB | Entra ID, Microsoft 365, and Azure asset connector data |

For comparison – Analytics tier: $2.30-$4.60/GB ingestion (region and commitment dependent)

Critical pricing notes

Included with ingestion:

  • 30 days of storage included – No separate storage charges for first 30 days
  • No double billing – Data mirrored from analytics tier to data lake incurs no additional ingestion charges

Sentinel-specific advantage:

  • Filtering exemption – Sentinel workspaces have NO penalty charges for DCR filtering, regardless of volume reduction percentage (Standard Log Analytics charges penalties for filtering >50%)

Uniform compression:

  • 6:1 compression ratio – Applied consistently across all data sources for storage billing

October 2025 changes:

  • Data processing ($0.10/GB) – Now charged (was free during preview)
  • Asset data ingestion ($0.05/GB) – Now charged (was free during preview)
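
The meters above can be combined into a rough monthly estimate. The Python sketch below assumes a steady daily volume, a 30-day month, and that storage is billed only for data older than the included 30 days, at its 6:1 compressed size. The function name and this reading of the billing rules are assumptions for illustration, not Microsoft's billing logic.

```python
# Rates copied from the pricing table in this guide (USD). Adjust for your region.
INGESTION_PER_GB = 0.05
PROCESSING_PER_GB = 0.10
STORAGE_PER_GB_MONTH = 0.026
QUERY_PER_GB_SCANNED = 0.005
COMPRESSION_RATIO = 6        # 6:1 compression applied for storage billing
INCLUDED_STORAGE_DAYS = 30   # storage included with ingestion


def monthly_lake_cost(gb_per_day, retention_days, gb_scanned_per_month=0.0):
    """Rough steady-state monthly cost of ingesting directly to the data lake tier."""
    monthly_gb = gb_per_day * 30
    ingestion = monthly_gb * INGESTION_PER_GB
    processing = monthly_gb * PROCESSING_PER_GB

    # Assumption: only data older than the included period is billed,
    # and it is billed on its compressed size.
    billable_days = max(retention_days - INCLUDED_STORAGE_DAYS, 0)
    stored_gb = gb_per_day * billable_days / COMPRESSION_RATIO
    storage = stored_gb * STORAGE_PER_GB_MONTH

    queries = gb_scanned_per_month * QUERY_PER_GB_SCANNED
    total = ingestion + processing + storage + queries
    return {
        "ingestion": ingestion,
        "processing": processing,
        "storage": storage,
        "queries": queries,
        "total": total,
    }


if __name__ == "__main__":
    # Hypothetical workload: 100 GB/day, 1 year retention, 1,000 GB scanned per month.
    for meter, usd in monthly_lake_cost(100, 365, 1000).items():
        print(f"{meter:<11} ${usd:,.2f}")
```

Data mirrored from the analytics tier would skip the ingestion and processing terms, so this sketch only fits tables routed directly to the data lake tier.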

Data lake onboarding

Step 1: Initiate onboarding process

Method 1: Via onboarding banner

  1. Sign in to Microsoft Defender Portal
  2. Look for data lake onboarding banner at top of page
  3. Click Get started

Method 2: Via settings navigation

  1. Navigate to System → Settings → Microsoft Sentinel → Data lake
  2. Click Set up data lake

Step 2: Permission validation screen

If you encounter the permissions screen, refer to the requirements above for required permissions.

Step 3: Billing configuration

  • Subscription: Choose billing subscription carefully
  • Resource group: Select or create resource group for data lake resources

Click Set up data lake to begin provisioning process.

Step 4: Monitor setup progress

Setup timeline: Refer to the Critical Requirements Reference box above for detailed timeline expectations.

Step 5: Validate successful completion

Success indicators:

  • Completion banner “Your data lake is ready” with feature access cards
  • New navigation option: Data lake exploration under Sentinel
  • Enhanced Tables management capabilities
  • Jobs section becomes available

Data Connectors

Microsoft Sentinel data lake works with all existing Sentinel data connectors, including:

  • DNS, proxy, and email telemetry
  • All Microsoft Defender XDR and Microsoft Sentinel data sources
  • Microsoft 365
  • Microsoft Entra ID
  • Microsoft Resource Graph
  • Endpoint Detection and Response (EDR) platforms
  • Firewall and network logs
  • Cloud infrastructure and workload telemetry
  • Identity and access logs (Microsoft Entra, Okta, etc.)

The steps below show how to activate a connector, using the Defender XDR data connector as an example (in case you haven’t enabled it yet).

Access XDR data connector

  1. Navigate to Microsoft Defender Portal → Microsoft Sentinel → Configuration → Data connectors
  2. Search for “Microsoft Defender XDR” in the connector gallery
  3. Click Open connector page

Configure integration components

The XDR connector configuration includes three main sections:

  1. Connect incidents and alerts: Enables basic integration for incident synchronization between platforms
  2. Connect entities: Integrates on-premises Active Directory user identities through Microsoft Defender for Identity
  3. Connect events: Enables collection of raw advanced hunting events from Defender components

Understanding XDR data streams

Free data streams (included with XDR integration):

  • SecurityAlert: Alerts from all Defender products
  • SecurityIncident: Unified incident management data
  • Bi-directional synchronization: Incidents sync between Sentinel and Defender portals

Paid data streams (require analytics tier ingestion):

  • DeviceProcessEvents: Process execution and command-line data
  • DeviceNetworkEvents: Network connection and communication data
  • DeviceFileEvents: File creation, modification, and access events
  • DeviceLogonEvents: Authentication and session data
  • EmailEvents: Email security and communication data
  • CloudAppEvents: Cloud application activity and security events
  • IdentityLogonEvents: Identity provider authentication events

If you set up Sentinel earlier and later transitioned to the unified experience (Defender portal), existing connectors continue working without interruption. However, Microsoft Defender connectors (XDR, Endpoint, Identity, Office 365, Cloud Apps, and Defender for Cloud) won’t appear in the Data connectors tab under Sentinel because they’re automatically managed as part of the unified experience. To view these connectors, access Microsoft Sentinel through the Azure portal, or check Exposure Management (status only).

Retention configuration and management

Understanding data mirroring behavior

When you enable data lake, automatic mirroring begins for all tables from the onboarding point forward with these characteristics:

  • Forward-looking only: Historical data existing before onboarding is not mirrored
  • Automatic mirroring: All existing connectors automatically mirror data to data lake
  • Consistent retention initially: Data lake tier matches analytics tier retention settings
  • Single copy architecture: Same data serves both tiers without duplication

The Data lake retention model

As covered earlier, Data Lake uses a two-tier model:

| Tier | Default retention | Maximum retention | Purpose |
| --- | --- | --- | --- |
| Analytics tier (hot) | 30 days | 2 years | Real-time detection, hunting, analytics |
| Data lake tier (cold) | Mirrors analytics | 12 years | Long-term compliance, historical analysis |

Key cost insight: For Sentinel solution tables, 90 days of analytics tier storage is free (you only pay ingestion). For XDR tables, only 30 days is included in the XDR license. Data lake storage beyond analytics retention incurs additional costs.

Special case: XDR tables and data lake

XDR tables behave fundamentally differently than regular Sentinel solution tables, particularly around costs and retention:

Understanding XDR table tiers:

Before they’re ingested into your Sentinel workspace, XDR tables exist in what’s called the “XDR default tier” – essentially a 30-day buffer managed by the Defender XDR service. This data is:

  • Included in your Defender XDR license (already paid for)
  • Available via Advanced Hunting in Defender portal
  • View-only and doesn’t participate in the data lake architecture
  • Not ingested into Sentinel (so no Sentinel ingestion costs)

| Table type | Minimum analytics | Can use data lake? | Note |
| --- | --- | --- | --- |
| Sentinel/Custom | 30 days | ✅ Yes | Standard data lake behavior, 90 days free storage |
| XDR (not ingested) | N/A (30-day XDR default) | ❌ No | Must ingest first by setting retention > 30 days |
| XDR (ingested) | 30 days minimum | ✅ Yes | Now follows standard data lake model with costs |

Two methods for extending XDR table retention:

Method 1: Standard ingestion (covered in this section)

  • Set analytics retention > 30 days in Defender portal
  • XDR data ingests into Sentinel Analytics tier (expensive ingestion costs)
  • Then automatically mirrors to data lake

Method 2: DCR transformation (cost-optimized)

  • Bypass expensive Analytics tier ingestion entirely
  • Route XDR data directly to data lake using custom tables
  • Massive cost savings for compliance/archival scenarios
  • Covered in next section: “Optimizing XDR costs with DCR transformations”

How to check current retention configuration

Step 1: Navigate to table management

Open Microsoft Defender Portal → Microsoft Sentinel → Configuration → Tables

You’ll see a list of all tables with their current retention settings

Step 2: Review retention overview

The table list displays key information:

| Column | Description |
| --- | --- |
| Table name | Name of the table |
| Workspace | Which Log Analytics workspace |
| Tier | Analytics, Data Lake, or XDR default |
| Analytics retention | Hot storage duration |
| Total retention | Analytics + data lake combined |

Step 3: Check detailed retention for a specific table

  1. Click on any table name
  2. A details panel opens on the right showing:
    • Current tier (Analytics, data lake, or XDR default)
    • Analytics retention period
    • Total retention period

How to change retention configuration

For Sentinel and custom tables:

Step 1: Navigate to Microsoft Defender Portal → Microsoft Sentinel → Configuration → Tables

Step 2: Click on your table → Click Manage table

Step 3: Configure retention:

  • Analytics retention: 30 days to 2 years (hot storage)
  • Total retention: Equal to or greater than analytics, max 12 years (hot + cold)

Step 4: Click Save

Example configuration:

Scenario: Need 1 year total retention for a Sentinel solution table

Configuration:

  • Analytics retention: 90 days (free for Sentinel tables)
  • Total retention: 1 year (365 days)

Result: 90 days in analytics + 275 days in data lake

Cost: Only data lake storage for days 91-365 (analytics storage is free)
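
The split in the example above is simple subtraction, but it is easy to get wrong once several tables are involved. The following Python sketch checks a configuration against the ranges listed in Step 3 and returns the number of days spent in each tier. The function name is hypothetical, and treating a year as 365 days is an assumption; the portal may accept a different set of values.

```python
MIN_ANALYTICS_DAYS = 30
MAX_ANALYTICS_DAYS = 2 * 365    # "2 years", assuming 365-day years
MAX_TOTAL_DAYS = 12 * 365       # "12 years", assuming 365-day years


def split_retention(analytics_days, total_days):
    """Return (days in analytics tier, additional days in data lake tier)."""
    if not MIN_ANALYTICS_DAYS <= analytics_days <= MAX_ANALYTICS_DAYS:
        raise ValueError("analytics retention must be between 30 days and 2 years")
    if not analytics_days <= total_days <= MAX_TOTAL_DAYS:
        raise ValueError("total retention must be >= analytics retention and <= 12 years")
    return analytics_days, total_days - analytics_days


if __name__ == "__main__":
    hot, cold = split_retention(analytics_days=90, total_days=365)
    print(f"{hot} days in analytics + {cold} days in data lake")
```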

Cost tip: You can also fully skip the analytics tier for some tables – this results in lower cost, but loses real-time analytics features like alerting, hunting queries, and analytics rules. Make sure you understand what functionality breaks before doing this.

For XDR tables (enabling data lake – standard method):

IMPORTANT COST NOTICE: This method incurs significant Analytics tier ingestion costs. XDR tables only have 30 days included in your license (not 90 like Sentinel tables). Every day beyond 30 requires paying full Sentinel ingestion rates before data reaches the data lake. See the next section for a cost-optimized alternative via DCR.

Step 1: Navigate to Tables → Locate XDR table showing “XDR default tier”

Examples: DeviceEvents, EmailEvents, IdentityInfo, AlertEvidence, DeviceProcessEvents

Step 2: Click on the XDR table → Click Manage table

Step 3: Set analytics retention > 30 days

This triggers ingestion into Sentinel and incurs expensive ingestion costs.
Common values: 60, 90, 180, 365 days (preferably 90 days if using the standard method).

Step 4: Set total retention (equal to or greater than analytics)

Example: 365 days for 1-year compliance

Step 5: Review cost warnings → Click Save

What happens:

  • Ingestion begins immediately (costs start here)
  • Data flows: XDR → Sentinel Analytics tier (expensive) → Data lake mirror
  • Table type changes from “XDR default” to “Microsoft Sentinel”
  • You’re now paying for both Analytics ingestion + any data lake storage beyond analytics retention

To stop ingestion: Set both Analytics and Total retention back to 30 days

Cost breakdown example for XDR tables:

| Scenario | Analytics | Total | Cost structure |
| --- | --- | --- | --- |
| Stay in XDR default | 30 days | 30 days | $0 – Included in license |
| Standard method | 90 days | 90 days | Analytics ingestion for 90 days (expensive) |
| Standard + compliance | 90 days | 1 year | Analytics ingestion for 90 days + Data lake storage (days 91-365) |
Next section: Learn how to bypass these expensive ingestion costs using DCR transformations to send XDR data directly to data lake tier (especially handy for noisy tables).

| Scenario | Analytics | Total | Use case |
| --- | --- | --- | --- |
| Standard operations | 90 days | 90 days | Active security operations, no extended compliance |
| Balanced approach | 90 days | 365 days | Operations with 1-year compliance |
| Extended compliance | 90 days | 2-7 years | Regulatory requirements (GDPR, HIPAA) |
| Cost-optimized | 30 days | 365 days – 7 years | Minimal hot storage, extended cold storage |

Optimizing XDR costs with DCR transformations

The cost problem

As we explored in the previous section, extending XDR table retention beyond 30 days using the standard method triggers expensive Analytics tier ingestion. The warning dialog makes this clear: increasing retention moves your XDR data to Sentinel and incurs ingestion costs.

The impact: 1TB/day of XDR data = ~$96,000/month in ingestion costs alone. For many organizations, this makes long-term retention financially prohibitive, even when required for compliance.
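
The monthly figure above can be reproduced with simple arithmetic. The sketch below is an illustration only: it assumes 1 TB = 1,000 GB and a 30-day month. The $3.20/GB rate is not quoted anywhere in this guide; it is simply the rate that yields about $96,000, shown next to the $2.30 and $4.60 bounds of the analytics tier range given in the pricing section.

```python
def monthly_analytics_ingestion_cost(gb_per_day, rate_per_gb, days=30):
    """Analytics tier ingestion cost for one month at a flat per-GB rate."""
    return gb_per_day * days * rate_per_gb


if __name__ == "__main__":
    gb_per_day = 1000  # 1 TB/day, assuming 1 TB = 1,000 GB
    # 2.30 and 4.60 are the range bounds from this guide; 3.20 is an assumed mid-range rate.
    for rate in (2.30, 3.20, 4.60):
        cost = monthly_analytics_ingestion_cost(gb_per_day, rate)
        print(f"${rate:.2f}/GB -> ${cost:,.0f} per month")
```

Your actual rate depends on region and commitment tier, so treat the output as an order-of-magnitude check rather than a quote.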

Understanding the optimization approach

The key insight is that XDR data already exists for free in your environment – it’s included in your Defender XDR license and available via Advanced Hunting for 30 days. The problem is the expensive journey it takes when you need to keep it longer.

Standard flow (expensive):

XDR Advanced Hunting (30d free) → Analytics Tier Ingestion ($$$) → Data Lake Mirror

Optimized flow (cost-effective):

XDR Advanced Hunting (30d free) → Workspace DCR Transformation → Custom Data Lake Tables ($)

By using Data Collection Rules (DCR) with workspace transformations, we can route XDR data directly to custom tables in the data lake tier, completely bypassing expensive Analytics tier ingestion. This is a legitimate, Microsoft-supported feature that is simply not widely documented for this use case. The approach described here was inspired by Jeffrey Appel’s work on the topic.

Workspace transformation DCR

Key benefits and trade-offs

Benefits:

  • Eliminate Analytics tier ingestion costs entirely
  • Keep 30 days free in XDR Advanced Hunting for real-time detection
  • Store years of data in low-cost data lake tier
  • Perfect for compliance and historical analysis
  • Single DCR handles multiple tables

Trade-offs:

  • No real-time alerting or analytics rules on custom tables
  • Slower queries compared to Analytics tier
  • Requires initial setup and ongoing management
  • Custom tables have limited feature support

Critical decision: Only use this method for XDR data needed for compliance/historical purposes. If you need real-time detection beyond 30 days, keep those tables in Analytics tier. Many organizations use a hybrid approach.

Choosing which tables to optimize

Not all XDR tables are equal candidates. Base your decision on volume, analytical needs, and compliance requirements.

| XDR Table | Volume | Recommendation | Reasoning |
| --- | --- | --- | --- |
| DeviceProcessEvents | Very High | Optimize | Highest volume, mostly needed for forensics |
| DeviceNetworkEvents | Very High | Optimize | High noise-to-signal ratio |
| DeviceFileEvents | High | Optimize | Primarily used for investigations |
| CloudAppEvents | Medium | Consider | Depends on detection requirements |
| EmailEvents | Medium | Consider | May need for phishing detection |
| AlertEvidence | Low | Keep in Analytics | Critical for investigations |
| IdentityInfo | Very Low | Keep in Analytics | Required for UEBA |

Optimize when:

  • High-volume, low-value data (verbose logs)
  • Compliance retention only
  • Historical analysis and forensics needs

Keep in Analytics when:

  • Real-time alerting required
  • UEBA or behavioral analytics needed
  • Active detection rules depend on the data
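
The criteria above can be written down as a small decision helper, which is handy when reviewing many tables at once. This Python sketch is one possible encoding, not an official rule set. The volume labels come from the table above; the `needs_realtime_detection` flag and the choice to keep low-volume tables in Analytics (because the savings are small) are assumptions you should adapt to your own environment.

```python
HIGH_VOLUME_LABELS = {"High", "Very High"}


def recommend(volume, needs_realtime_detection):
    """Map a table's volume label and detection needs to a recommendation."""
    if needs_realtime_detection:
        # Alerting, UEBA, or active detection rules depend on this table.
        return "Keep in Analytics"
    if volume in HIGH_VOLUME_LABELS:
        return "Optimize"
    if volume == "Medium":
        return "Consider"
    # Assumption: low-volume tables save little, so leave them where they are.
    return "Keep in Analytics"


if __name__ == "__main__":
    # The real-time flags below are hypothetical inputs, not measured facts.
    candidates = [
        ("DeviceProcessEvents", "Very High", False),
        ("EmailEvents", "Medium", False),
        ("IdentityInfo", "Very Low", True),
    ]
    for table, volume, realtime in candidates:
        print(f"{table}: {recommend(volume, realtime)}")
```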

Prerequisites and tools

Before implementing this optimization, you’ll need a few things in place.

1. Marko Lauren’s Table Creator Tool

The tool is essential because Azure Data Lake tables (Auxiliary tier) don’t natively support dynamic data types, which many XDR tables use. Marko’s tool solves this by automating the table creation process and handling the schema complexity.

GitHub link

What it does:

  • Reads source table schema from your workspace
  • Creates matching custom table with correct structure
  • Handles dynamic fields via Analytics tier workaround
  • Enables seamless switching to Data Lake tier

The workaround: Create tables in Analytics tier first (which supports dynamic types), then switch to Data Lake tier via the portal. This preserves the dynamic field support.

2. Azure access requirements

  • Contributor or Owner permissions on Log Analytics workspace
  • Ability to deploy ARM templates
  • Access to Azure Portal and Defender portal

3. Optimization scope decision

Identify which XDR tables to optimize based on current ingestion volumes, real-time analytics requirements, compliance needs, and detection rule dependencies. Start with your highest-volume tables for maximum impact.

Implementation guide

The implementation follows three main steps: creating custom tables, configuring the DCR, and verifying data flow. Let’s walk through each one.

Step 1: Create custom data lake tables

We’ll use Marko Lauren’s tableCreator.ps1 script to automate the table creation process and avoid schema issues.

Download the tool

Download tableCreator.ps1 from the GitHub repository:

GitHub link

Run the table creator

The tool uses command-line parameters. Sign in with the Azure CLI first, then run the script from a PowerShell session with your workspace details:

# Login to Azure
az login

# Run the table creator script with parameters
.\tableCreator.ps1 -tableName DeviceProcessEvents -newTableName DeviceProcessEventsLake_CL -type analytics -retention 30 -totalRetention 365

Parameter explanation:

| Parameter | Value | Description |
| --- | --- | --- |
| -tableName | DeviceProcessEvents | Existing XDR table to replicate |
| -newTableName | DeviceProcessEventsLake_CL | New custom table name (must end with _CL) |
| -type | analytics | Start with analytics, not datalake/auxiliary |
| -retention | 30 | Analytics retention in days |
| -totalRetention | 365 | Total retention in days |
| -ConvertToString | (optional) | Use if you get dynamic type errors |
| -TenantId | (optional) | Specify if multi-tenant environment |

Why -type analytics first? Even though our final destination is Data Lake, we must create the table as analytics type initially to preserve dynamic field support. We’ll switch it to Data Lake tier in the next step.

Switch table to data lake tier

Now that the table exists with the correct schema, including dynamic fields, switch it to the data lake tier:

Navigate to Microsoft Defender Portal → Microsoft Sentinel → Configuration → Tables

  1. Locate your custom table (DeviceProcessEventsLake_CL)
  2. Click on the table → Click Manage table
  3. Under table tier options, select data lake tier only
  4. Configure retention (e.g., 1 year for compliance)
  5. Click Save

Your table is now in the Auxiliary (data lake) tier with full dynamic field support, ready to receive data.

Create additional tables

Repeat this process for each XDR table you want to optimize:

| Source Table | Custom Table Name | Use Case |
| --- | --- | --- |
| DeviceProcessEvents | DeviceProcessEventsLake_CL | Process execution history |
| DeviceNetworkEvents | DeviceNetworkEventsLake_CL | Network connection logs |
| DeviceFileEvents | DeviceFileEventsLake_CL | File activity forensics |
| CloudAppEvents | CloudAppEventsLake_CL | Cloud application activity |
| EmailEvents | EmailEventsLake_CL | Email flow history |

Step 2: Create workspace transformation DCR

Now we’ll create the Data Collection Rule that routes XDR data from original tables to your custom Data Lake tables. This is where the magic happens.

Initial DCR creation via portal

Start by creating the basic DCR structure through the Azure Portal:

Navigate to Azure Portal → Log Analytics workspaces → Select your workspace → Tables

  1. Locate any source XDR table (e.g., DeviceProcessEvents)
  2. Click on the table → Click Create transformation
  3. Configure the DCR:

| Field | Value | Note |
| --- | --- | --- |
| DCR name | dcr-sentinel-workspace-xdr-lake-routing | Use descriptive name |
| Resource group | Same as workspace | Keep together |
| Region | Same as workspace | Performance |
| Description | Routes XDR data to Data Lake custom tables | Clear purpose |

  4. Click Create

This creates a Workspace Transformation DCR (Kind: WorkspaceTransforms) that’s automatically associated with your workspace.

Important limitation: You can only have one Workspace Transformation DCR per workspace, but this single DCR can handle all your table routing.

Configure DCR via ARM template

The portal UI only allows configuring one table at a time. To efficiently route multiple XDR tables, we’ll edit the DCR’s ARM template directly.

Navigate to Azure Portal → Data Collection Rules → Locate your DCR

  1. Verify Kind shows WorkspaceTransforms in the Overview
  2. Click Export template → Deploy → Edit template
  3. Note the name value listed alongside workspaceResourceId (under logAnalytics); in this example it is 5511c2c9b1764fed860fa62716b60686:
        {
            "type": "Microsoft.Insights/dataCollectionRules",
            "apiVersion": "2023-03-11",
            "name": "[parameters('dataCollectionRules_dcr_sentinel_workspace_xdr_lake_routing_name')]",
            "location": "westeurope",
            "kind": "WorkspaceTransforms",
            "properties": {
                "dataSources": {},
                "destinations": {
                    "logAnalytics": [
                        {
                            "workspaceResourceId": "[parameters('workspaces_sec_management_siemlog_prd_001_externalid')]",
                            "name": "5511c2c9b1764fed860fa62716b60686"
                        }
                    ]
                },
  4. Replace the dataFlows section of the template with the configuration below, using the destination name from your own template. The configuration sets up data flows for multiple XDR tables simultaneously:
"dataFlows": [
  {
    "streams": [
      "Microsoft-Table-DeviceProcessEvents"
    ],
    "destinations": [
      "5511c2c9b1764fed860fa62716b60686"
    ],
    "outputStream": "Custom-DeviceProcessEventsLake_CL"
  },
  {
    "streams": [
      "Microsoft-Table-DeviceNetworkEvents"
    ],
    "destinations": [
      "5511c2c9b1764fed860fa62716b60686"
    ],
    "outputStream": "Custom-DeviceNetworkEventsLake_CL"
  },
  {
    "streams": [
      "Microsoft-Table-DeviceFileEvents"
    ],
    "destinations": [
      "5511c2c9b1764fed860fa62716b60686"
    ],
    "outputStream": "Custom-DeviceFileEventsLake_CL"
  },
  {
    "streams": [
      "Microsoft-Table-CloudAppEvents"
    ],
    "destinations": [
      "5511c2c9b1764fed860fa62716b60686"
    ],
    "outputStream": "Custom-CloudAppEventsLake_CL"
  },
  {
    "streams": [
      "Microsoft-Table-EmailEvents"
    ],
    "destinations": [
      "5511c2c9b1764fed860fa62716b60686"
    ],
    "outputStream": "Custom-EmailEventsLake_CL"
  }
]

You will end up with a template like this:

{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "dataCollectionRules_dcr_sentinel_workspace_xdr_lake_routing_name": {
            "defaultValue": "dcr-sentinel-workspace-xdr-lake-routing",
            "type": "String"
        },
        "workspaces_sec_management_siemlog_prd_001_externalid": {
            "defaultValue": "/subscriptions/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/resourceGroups/sec-management-siemlog-prd-001/providers/microsoft.operationalinsights/workspaces/sec-management-siemlog-prd-001",
            "type": "String"
        }
    },
    "variables": {},
    "resources": [
        {
            "type": "Microsoft.Insights/dataCollectionRules",
            "apiVersion": "2023-03-11",
            "name": "[parameters('dataCollectionRules_dcr_sentinel_workspace_xdr_lake_routing_name')]",
            "location": "westeurope",
            "kind": "WorkspaceTransforms",
            "properties": {
                "description": "Routes XDR tables directly to Data Lake custom tables, bypassing Analytics tier ingestion costs",
                "dataSources": {},
                "destinations": {
                    "logAnalytics": [
                        {
                            "workspaceResourceId": "[parameters('workspaces_sec_management_siemlog_prd_001_externalid')]",
                            "name": "5511c2c9b1764fed860fa62716b60686"
                        }
                    ]
                },
                "dataFlows": [
                    {
                        "streams": [
                            "Microsoft-Table-DeviceProcessEvents"
                        ],
                        "destinations": [
                            "5511c2c9b1764fed860fa62716b60686"
                        ],
                        "outputStream": "Custom-DeviceProcessEventsLake_CL"
                    },
                    {
                        "streams": [
                            "Microsoft-Table-DeviceNetworkEvents"
                        ],
                        "destinations": [
                            "5511c2c9b1764fed860fa62716b60686"
                        ],
                        "outputStream": "Custom-DeviceNetworkEventsLake_CL"
                    },
                    {
                        "streams": [
                            "Microsoft-Table-DeviceFileEvents"
                        ],
                        "destinations": [
                            "5511c2c9b1764fed860fa62716b60686"
                        ],
                        "outputStream": "Custom-DeviceFileEventsLake_CL"
                    },
                    {
                        "streams": [
                            "Microsoft-Table-CloudAppEvents"
                        ],
                        "destinations": [
                            "5511c2c9b1764fed860fa62716b60686"
                        ],
                        "outputStream": "Custom-CloudAppEventsLake_CL"
                    },
                    {
                        "streams": [
                            "Microsoft-Table-EmailEvents"
                        ],
                        "destinations": [
                            "5511c2c9b1764fed860fa62716b60686"
                        ],
                        "outputStream": "Custom-EmailEventsLake_CL"
                    }
                ]
            }
        }
    ]
}

Understanding the template:

Each dataFlow entry routes data from a source stream to a destination table. The streams value must match Microsoft’s exact naming (e.g., Microsoft-Table-DeviceProcessEvents). The outputStream must include the Custom- prefix for custom tables. The flows above omit transformKql, so no filtering is applied and all data passes through, which is the same behavior as setting "transformKql": "source". You can add KQL filtering here if needed (covered later).
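
Because every dataFlow entry follows the same pattern, the section can be generated instead of typed by hand, which helps avoid typos in stream names. The Python sketch below builds the same structure as the template above. The function name and the `Lake_CL` suffix parameter are illustrative choices, and the destination value is the example name used in this guide; replace it with the one from your own DCR.

```python
import json


def build_data_flows(tables, destination, suffix="Lake_CL"):
    """Build one dataFlow entry per source table, mirroring the template above."""
    return [
        {
            "streams": [f"Microsoft-Table-{table}"],
            "destinations": [destination],
            "outputStream": f"Custom-{table}{suffix}",
        }
        for table in tables
    ]


XDR_TABLES = [
    "DeviceProcessEvents",
    "DeviceNetworkEvents",
    "DeviceFileEvents",
    "CloudAppEvents",
    "EmailEvents",
]

# Example destination name from this guide; yours will differ.
flows = build_data_flows(XDR_TABLES, "5511c2c9b1764fed860fa62716b60686")

if __name__ == "__main__":
    print(json.dumps({"dataFlows": flows}, indent=2))
```

The printed JSON can be pasted into the dataFlows part of the exported template; the custom tables it references must already exist.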

Click Save → Review + create → Create to deploy the DCR.

Step 3: Verify data flow

After deploying the DCR, data should start flowing to your custom tables within 15-30 minutes. Let’s verify everything is working correctly.

Check 1: Original table status

DeviceProcessEvents
| where TimeGenerated > ago(1h)
| count

Expected: Low count or zero (data is being routed to the custom table)

Check 2: Custom table receiving data

DeviceProcessEventsLake_CL
| take 10

Expected: Data flowing with proper schema

Monitor cost savings

Track the effectiveness of your optimization with these queries:

Volume comparison:

let OriginalVolume = toscalar(DeviceProcessEvents | where TimeGenerated > ago(24h) | count);
let OptimizedVolume = toscalar(DeviceProcessEventsLake_CL | where TimeGenerated > ago(24h) | count);
print 
    OriginalTableEvents = OriginalVolume,
    DataLakeTableEvents = OptimizedVolume,
    RoutingSuccess = iff(OptimizedVolume > OriginalVolume, "Working", "Check DCR")

Cost savings estimate KQL example:

let DailyVolumeGB = toscalar(
    DeviceProcessEventsLake_CL
    | where TimeGenerated > ago(24h)
    | extend DataSizeBytes = estimate_data_size(*)
    | summarize TotalBytes = sum(DataSizeBytes)
    | extend GB = todouble(TotalBytes) / 1024 / 1024 / 1024  // todouble avoids integer division truncating the result
    | project GB
);
// Updated costs for 2025 GA pricing (East US region)
let AnalyticsIngestionCost = 4.30;  // Per GB (updated from $3.207)
let DataLakeIngestionCost = 0.05;    // Per GB
let DataLakeProcessingCost = 0.10;   // Per GB
let DataLakeStorageCost = 0.02;      // Per GB per month (compressed)
// Calculate monthly costs
let StandardMethodCost = DailyVolumeGB * 30 * AnalyticsIngestionCost;
let OptimizedMethodCost = (DailyVolumeGB * 30 * (DataLakeIngestionCost + DataLakeProcessingCost)) + (DailyVolumeGB * 30 * DataLakeStorageCost);
let MonthlySavings = StandardMethodCost - OptimizedMethodCost;
print 
    DailyVolumeGB = round(DailyVolumeGB, 2),
    MonthlyVolumeTB = round((DailyVolumeGB * 30) / 1024, 2),
    StandardMethodCost = strcat("$", round(StandardMethodCost, 2)),
    OptimizedMethodCost = strcat("$", round(OptimizedMethodCost, 2)),
    EstimatedMonthlySavings = strcat("$", round(MonthlySavings, 2)),
    SavingsPercentage = strcat(round((MonthlySavings / StandardMethodCost) * 100, 1), "%")
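To make the arithmetic in the query easier to check by hand, here is the same calculation as a small Python sketch. The per-GB rates are copied from the query above and should be treated as assumptions that vary by region and over time; the function name monthly_costs and the 100 GB/day input are illustrative only.

```python
# Rates copied from the KQL example above (USD, assumed; verify against current pricing).
ANALYTICS_INGESTION = 4.30   # per GB
LAKE_INGESTION = 0.05        # per GB
LAKE_PROCESSING = 0.10       # per GB
LAKE_STORAGE = 0.02          # per GB per month
DAYS_PER_MONTH = 30


def monthly_costs(daily_gb: float) -> dict:
    """Compare analytics-tier ingestion with data lake routing for one month."""
    monthly_gb = daily_gb * DAYS_PER_MONTH
    standard = monthly_gb * ANALYTICS_INGESTION
    optimized = (
        monthly_gb * (LAKE_INGESTION + LAKE_PROCESSING)  # ingestion + processing
        + monthly_gb * LAKE_STORAGE                      # first month of storage
    )
    savings = standard - optimized
    # Avoid dividing by zero when there is no volume.
    savings_pct = (savings / standard * 100) if standard else 0.0
    return {
        "standard": round(standard, 2),
        "optimized": round(optimized, 2),
        "savings": round(savings, 2),
        "savings_pct": round(savings_pct, 1),
    }


# Illustrative input: 100 GB/day routed to the data lake.
print(monthly_costs(100))
```

With these assumed rates, 100 GB/day works out to roughly $12,900 per month in the analytics tier versus about $510 in the data lake tier. Like the query, this counts only one month of storage and ignores query charges.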

Optional: Advanced filtering with transformKql

The template above routes all data without filtering; a transformKql value of "source" passes every record through unchanged. You can add KQL transformations to filter data before ingestion, providing even more cost savings.

Filtering at the DCR level provides maximum savings because filtered data never reaches your workspace. However, ensure you’re not filtering out data needed for compliance or future investigations. When in doubt, keep everything and filter during queries instead.
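As a rough illustration of what a filtered dataFlow entry could look like, the sketch below builds one in Python and prints it as JSON for pasting into the template. The helper build_data_flow is hypothetical, the ActionType filter is only an example and not a recommendation, and reusing the destination ID from the template above is an assumption.

```python
import json


def build_data_flow(stream, destination, output_stream, transform_kql="source"):
    """Build one dataFlows entry; the default transform passes all data through."""
    return {
        "streams": [stream],
        "destinations": [destination],
        "transformKql": transform_kql,
        "outputStream": output_stream,
    }


# Example filter (assumption): keep only process-creation events.
filtered_flow = build_data_flow(
    stream="Microsoft-Table-DeviceProcessEvents",
    destination="5511c2c9b1764fed860fa62716b60686",
    output_stream="Custom-DeviceProcessEventsLake_CL",
    transform_kql="source | where ActionType == 'ProcessCreated'",
)

print(json.dumps(filtered_flow, indent=4))
```

Keeping "source" as the default makes the unfiltered case explicit, so a filter is only applied where it was chosen on purpose.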

Data analysis and operations

Data lake exploration with KQL

Accessing data lake exploration interface

Navigation steps:

  1. Navigate to Microsoft Defender portal > Microsoft Sentinel > Data lake exploration
  2. Select target workspace from dropdown menu
  3. Configure appropriate time range for analysis
  4. Begin KQL query development and execution

Query capabilities and limitations

Supported KQL control commands: Microsoft’s data lake documentation shows the following control commands are currently supported:

.show version
.show databases
.show databases entities
.show database

Performance and functional limitations:

Limitation category | Specification | Recommended workaround
Query timeout | 8 minutes maximum | Use time-based filtering and sampling techniques
Result size | 500,000 rows or 64 MB | Implement summarization and aggregation
Concurrent queries | 45 per minute per workspace | Plan and stagger query execution schedules
Query scope | Single workspace per query | Design workspace-specific analysis approaches
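One way to apply the time-based filtering workaround is to split a long lookback into smaller windows and run one summarized query per window. The sketch below only generates the query text; the helper names, the 7-day window, and the dates are assumptions for illustration, and running the queries is left to whichever client you use.

```python
from datetime import datetime, timedelta, timezone


def split_time_range(start, end, window):
    """Yield (window_start, window_end) pairs that cover [start, end) without gaps."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + window, end)
        yield cursor, window_end
        cursor = window_end


def windowed_query(table, start, end):
    """Build a KQL query that filters on time first and summarizes per day."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (
        f"{table}\n"
        f"| where TimeGenerated >= datetime({start.strftime(fmt)}) "
        f"and TimeGenerated < datetime({end.strftime(fmt)})\n"
        f"| summarize Events = count() by bin(TimeGenerated, 1d)"
    )


# Illustrative 30-day range split into 7-day windows.
range_start = datetime(2025, 1, 1, tzinfo=timezone.utc)
range_end = datetime(2025, 1, 31, tzinfo=timezone.utc)
windows = list(split_time_range(range_start, range_end, timedelta(days=7)))

print(len(windows), "windows")
print(windowed_query("SigninLogs", *windows[0]))
```

Smaller windows keep each query further from the timeout and result-size limits, at the cost of more queries to schedule.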

Sample queries and optimization

Suspicious travel activity analysis:

SigninLogs
| where TimeGenerated >= ago(180d)
| where ResultType == "0"  // ResultType is a string column; "0" indicates a successful sign-in
| summarize CountriesAccessed = make_set(Location) by UserPrincipalName
| where array_length(CountriesAccessed) > 3  // Adjust threshold as needed

Time-first filtering approach:

// Efficient query pattern leveraging time partitioning
SecurityEvent
| where TimeGenerated > ago(7d)     // Partition elimination first
| where Computer startswith "DC"    // Apply specific filters second
| where EventID in (4624, 4625)     // Multiple conditions last
| summarize count() by Computer

Jupyter notebooks for security analytics

Step 1: Install and configure VS Code extension

Required components:

  1. Visual Studio Code (latest version)
  2. Microsoft Sentinel extension for VS Code
  3. Python environment with appropriate packages

Installation process:

  1. Open Visual Studio Code
  2. Navigate to Extensions (Ctrl+Shift+X)
  3. Search for “Microsoft Sentinel”
  4. Click Install on the Microsoft Sentinel extension
  5. Microsoft Sentinel shield icon appears in left toolbar

Step 2: Configure compute resources

Select runtime

Runtime pool options:

Pool size | Compute resources | Memory | Recommended use cases | Cost considerations
Small | 4 cores | 8 GB | Basic analytics, data exploration, development | Lower cost, suitable for small datasets
Medium | 8 cores | 16 GB | Standard security analytics, reporting | Moderate cost, production workloads
Large | 16 cores | 32 GB | ML models, complex analysis, large datasets | Higher cost, resource-intensive operations

Microsoft’s billing documentation shows that compute hours are calculated by multiplying the number of cores by the session duration, and this figure drives Advanced Data Insights billing.
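A quick sketch of that calculation, using the core counts from the runtime pool table above. The function name compute_hours and the session lengths are illustrative; the price per compute hour is deliberately left out because it is not stated here.

```python
# Core counts taken from the runtime pool table above.
POOL_CORES = {"Small": 4, "Medium": 8, "Large": 16}


def compute_hours(pool: str, session_minutes: float) -> float:
    """Compute hours = number of cores x session duration in hours."""
    return POOL_CORES[pool] * session_minutes / 60


# Illustrative session: 90 minutes on a Medium pool (8 cores x 1.5 hours).
print(compute_hours("Medium", 90))
```

The same 90-minute session on a Large pool would consume twice the compute hours, so it is worth starting on the smallest pool that fits the dataset.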

Step 3: Develop security analytics workflows

Basic data exploration example:

from sentinel_lake.providers import MicrosoftSentinelProvider
import pandas as pd
import matplotlib.pyplot as plt

# Initialize the data provider (the spark session is provided by the notebook runtime)
data_provider = MicrosoftSentinelProvider(spark)

# Read data from the custom data lake table created earlier
df = data_provider.read_table("DeviceProcessEventsLake_CL")

# FileName and ProcessCommandLine are DeviceProcessEvents columns; preview 10 rows
df.select("TimeGenerated", "DeviceName", "FileName", "ProcessCommandLine").show(10)

Automated jobs and scheduling

Understanding data lake jobs

Microsoft’s job documentation shows that jobs enable automated analysis and data promotion between tiers. Two primary job types are supported:

  1. KQL jobs: Query-based analytics using KQL syntax
  2. Notebook jobs: Python/Spark analytics using Jupyter notebooks

Creating and managing KQL jobs

  1. Navigate to Microsoft Defender portal > Microsoft Sentinel > Data lake exploration
  2. Develop and test your KQL query
  3. Click Create job button in upper right corner

Job limitations during preview:

Limitation category | Specification | Impact on implementation
Concurrent executions | 3 per tenant maximum | Requires queue management and scheduling coordination
Job timeout | 1 hour maximum execution time | Necessitates query optimization for complex operations
Enabled jobs | 100 per tenant maximum | Requires lifecycle planning and job prioritization
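The concurrency limit is easiest to respect by planning start times up front. The sketch below assigns jobs to waves of three, spaced one hour apart, on the conservative assumption that every job may run for the full timeout. The job names, start time, and helper stagger_jobs are hypothetical; the actual schedule is still configured per job in the portal.

```python
from datetime import datetime, timedelta

# Limits taken from the table above.
MAX_CONCURRENT_JOBS = 3
MAX_JOB_RUNTIME = timedelta(hours=1)
MAX_ENABLED_JOBS = 100


def stagger_jobs(job_names, first_start):
    """Assign start times so that no more than three jobs overlap."""
    if len(job_names) > MAX_ENABLED_JOBS:
        raise ValueError(f"{len(job_names)} jobs exceeds the {MAX_ENABLED_JOBS}-job limit")

    schedule = {}
    for index, name in enumerate(job_names):
        # Jobs 0-2 go in wave 0, jobs 3-5 in wave 1, and so on.
        wave = index // MAX_CONCURRENT_JOBS
        schedule[name] = first_start + wave * MAX_JOB_RUNTIME
    return schedule


# Seven hypothetical jobs starting at 01:00.
jobs = [f"kql-job-{n}" for n in range(1, 8)]
plan = stagger_jobs(jobs, datetime(2025, 1, 1, 1, 0))

for name, start in plan.items():
    print(name, start.isoformat())
```

If your jobs reliably finish well under an hour, a shorter spacing between waves would work just as well.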

Performance monitoring and optimization

Microsoft’s data lake exploration documentation shows search functionality within results and comprehensive query management capabilities.

Performance optimization guidelines:

Performance area | Limitation | Optimization strategy
Query execution time | 8 minutes maximum | Implement time-based filtering and data sampling
Result set size | 500,000 rows or 64 MB | Use summarization and aggregation techniques
Concurrent operations | 45 queries per minute | Plan and schedule query execution appropriately
Data processing | Single workspace scope | Design workspace-specific analysis workflows
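For scripted query batches, the 45-queries-per-minute limit can be handled by spacing submissions evenly rather than sending them all at once. This is a minimal pacing sketch under that assumption; start_offsets is a hypothetical helper and the batch size is illustrative.

```python
QUERIES_PER_MINUTE = 45  # limit from the table above


def start_offsets(query_count, per_minute=QUERIES_PER_MINUTE):
    """Return the delay in seconds before each query so the rate limit is not exceeded."""
    spacing = 60.0 / per_minute  # about 1.33 seconds between queries
    return [round(i * spacing, 3) for i in range(query_count)]


# Illustrative batch of 90 queries.
offsets = start_offsets(90)
print("last query starts after", offsets[-1], "seconds")
```

A script would sleep until each offset before submitting the corresponding query; even spacing also avoids bursts that compete with analysts running interactive queries.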

Troubleshooting common issues

Onboarding error codes

Microsoft’s troubleshooting documentation provides specific error codes for common onboarding issues:

DL101: Regional mismatch error

Error: Can't complete setup
Description: Primary workspace region differs from tenant home region
Resolution: Ensure workspace is in same region as tenant home region

DL102: Resource availability error

Error: Lack of Azure resources in region
Description: Insufficient Azure resources available during provisioning
Resolution: Retry setup or attempt during off-peak hours

DL103: Azure policy restrictions

Error: Azure policies prevent resource creation
Description: Organizational policies blocking required resource creation
Resolution: Create policy exemption for Microsoft.SentinelPlatformServices

Data integration troubleshooting

Data not appearing in data lake:

  1. Wait period: Refer to the sections above for estimated timeline expectations
  2. Connector verification: Verify data connector configuration and status
  3. Retention settings: Check tier configuration and retention policies
  4. Permissions: Validate managed identity permissions and role assignments

Best practices and governance

Security and access management:

  • Implement principle of least privilege access
  • Conduct regular reviews of access and role assignments
  • Use Unified RBAC for granular workspace-specific permissions
  • Monitor data access through comprehensive audit logs

Cost management best practices:

  • Use time-based filtering for optimal partition elimination
  • Implement appropriate sampling techniques for large datasets
  • Monitor query patterns and associated costs regularly
  • Balance operational needs with cost optimization objectives

Conclusion

When I first started testing with Microsoft Sentinel data lake, I’ll be honest – the pricing models and retention tiers felt overwhelming. But after implementing it across multiple environments and seeing the cost savings firsthand, I’m convinced this is one of the most significant improvements Microsoft has made to Sentinel.

Why this matters to me (and probably to you too):

We’ve all been in those budget meetings where leadership questions why security data retention costs so much. The data lake finally gives us a real answer to that problem. Being able to keep years of security logs without burning through budget means we can actually do our jobs properly – hunt for threats, investigate incidents thoroughly, and meet compliance requirements without constantly worrying about costs.

My honest take on implementation:

The standard retention configuration through the Defender portal is straightforward – you can set it up in minutes. For most Sentinel tables, it just works and the 90 days of free analytics storage is genuinely helpful. Where it gets tricky is with XDR tables. That 30-day limit catches a lot of people off guard, and suddenly you’re looking at massive ingestion bills if you’re not careful.

What I’ve learned along the way:

Start simple. Don’t try to optimize everything at once. Get comfortable with basic retention configuration first, understand your data volumes, and then decide if the advanced DCR approach makes sense for your environment. Monitor your costs closely in the first few months – you’ll quickly see which tables are driving expenses and where optimization efforts will have the biggest impact.

Also, don’t underestimate the value of having historical data readily available. I’ve been in too many investigations where we needed logs from 6+ months ago and they were either gone or stuck in some archived format that took hours to access. The data lake solves that problem elegantly.

Looking ahead:

Microsoft Sentinel data lake is now GA, which means it’s stable and production-ready. The pricing has settled down, the features are solid, and the performance is good. I’m excited to see where Microsoft takes this – the foundation is there for some really powerful capabilities around AI-driven analysis and long-term behavioral detection. I will try to cover MCP and Graph in my next posts, and dig deeper into Jupyter notebooks soon :-).

If you’re still on the fence about enabling data lake, my advice is simple: just do it. The automatic mirroring from analytics to data lake costs nothing, and you’ll immediately have a safety net for your data. You can always optimize later, but at least you won’t lose historical data while you’re figuring out your strategy.

Final thoughts:

Security operations is hard enough without having to choose between doing the right thing and staying within budget. The data lake removes that impossible choice. Yes, there’s complexity to manage. Yes, you need to understand the cost model. But the alternative – deleting security logs because you can’t afford to keep them – is far worse.

Take the time to understand your options, configure things properly, and you’ll have a data retention strategy that actually works for both security and finance. And honestly? That’s a win we don’t get very often in this field.
