
Osmos AI Data Wrangler


Introduction

Accelerate your Fabric deployment with Osmos AI data agents. Seamlessly ingest, transform, and structure data using agentic AI.

How Does Osmos Work?

Osmos empowers your data team with agentic AI to seamlessly manage your Fabric environment. The ‘Osmos Data Engineer’ leverages AI Agents to autonomously create execution-ready Spark notebooks for data engineering and ETL tasks, and the ‘Osmos Data Wrangler’ wrangles even your messiest data, hands-free, with AI Data Agents. Osmos can run on any Fabric Capacity provisioned with F2 or higher.

Osmos integrates effortlessly with Microsoft Fabric, creating a robust, scalable, and secure environment for data processing. With this integration, your organization can benefit from Fabric’s advanced analytics and business intelligence capabilities, all powered by pristine, well-structured data.

What is Osmos Used For?

Osmos provides the tools needed to make data transformation accessible across the organization. Key applications include:

  • Data Wrangling and Transformation: Clean, align, and prepare datasets for analytics and AI with minimal manual effort.

  • Data Integration: Streamline data ingestion from multiple sources, enabling rapid analysis and decision-making.

  • Business Intelligence: Ensure your data is consistently structured and accurate to amplify the effectiveness of tools like Microsoft Power BI.

  • Regulatory Compliance: Maintain data integrity and documentation for compliance in industries with strict data regulations.

Osmos enhances your team’s ability to work with data at scale, minimizing the need for technical resources traditionally required for extensive data preparation.

Why Use Osmos with Microsoft Fabric?

By integrating with Microsoft Fabric, Osmos creates a unified environment where data can be processed, analyzed, and transformed within a single ecosystem. Osmos AI Data Wrangler, available as a native workload in Microsoft Fabric, provides several distinct advantages:

  • Scalability: Easily manage data wrangling for both small and large datasets.

  • Enhanced Collaboration: Fabric’s collaborative environment supports teamwork on data projects, allowing insights and workflows to be shared seamlessly.

  • Real-Time Monitoring: Fabric's centralized monitoring tools allow you to track the status of your data transformation jobs across your organization.

  • Security and Compliance: Microsoft Fabric’s enterprise-grade security ensures your data remains protected throughout the wrangling process.

Osmos on Fabric allows you to maximize the potential of your data without expanding your data engineering team.

Fabric Tenant Settings

Steps to Add the Osmos Workload and enable the AI Data Wrangler

Enable the Fabric tenant settings and capacities

To begin working with the AI Data Wrangler, you must enable tenant settings and capacities in Fabric. The steps below require admin access. This is a tenant-wide activity.

  1. Log in to Fabric

  2. Go to Settings and select Admin portal

  3. Select Tenant Settings

  4. In the Additional Workloads (preview) section:

    • In the Capacity admins and contributors can add and remove additional workloads section, select Enabled.

    • In the Users can see and work with uncertified partner workloads section, select Enabled.

    • Apply to the entire organization or specific groups.

Note: Users can add workloads if they meet the following criteria:

  • They have been granted permission to add workloads by a Fabric admin

  • They're a capacity admin or have permission to assign capacities to workspaces

Note: The AI Data Wrangler requires Microsoft Fabric with a minimum F2 Capacity

Getting Started with Microsoft Fabric

Steps to enable your free Microsoft Fabric Trial.

Enabling a Microsoft Fabric Trial

Follow these steps to enable a free Microsoft Fabric trial. This gives you temporary access to all Fabric features—including Premium Capacity (F64)—so you can explore the Osmos AI Data Wrangler before committing to a paid subscription.

Overview

A Microsoft Fabric trial provides a 60-day window to explore everything Fabric has to offer:

  • Access to Fabric experiences such as the Osmos AI Data Wrangler.

  • Integration with OneLake allows you to unify your data in a single, secure location.

  • Premium features (e.g., larger data model sizes) via a dedicated trial capacity.

Once the trial expires, your Fabric workspace(s) will revert to shared capacity, but any data will remain accessible within OneLake—no data loss will occur.

Important: This trial requires a work account under Microsoft Entra ID (formerly Azure Active Directory). Personal Microsoft accounts (Outlook, Hotmail, etc.) are not supported.

Prerequisites

  1. Work Account: You must use a Microsoft 365 work account to sign up for the trial.

  2. Tenant Permissions: Self-service sign-up must be allowed in your organization’s tenant. If self-service sign-up is disabled, you’ll need an admin to enable the trial on your behalf.

  3. Geographic Availability: Check the regional availability to ensure your region offers the Fabric trial.

Steps to Start Your Fabric Trial

  1. Sign In

    1. Navigate to the Microsoft Fabric Trial page or visit fabric.microsoft.com.

    2. Select Sign In and enter your work Microsoft 365 account credentials.

  2. Initiate the Trial

    1. Look for the Try Fabric or Start free trial button.

    2. If prompted, review and accept any terms and conditions.

    3. No credit card is required; however, you may need to verify your identity based on your organization’s tenant settings.

  3. Select a Workspace

    1. The free capacity trial applies to your My Workspace in Power BI / Microsoft Fabric by default.

    2. If you want to enable the trial for an organizational workspace, ensure you have the appropriate admin or contributor permissions in that workspace.

  4. Confirm Activation

    1. After accepting the trial, you’ll see a confirmation banner in the Fabric portal or your Power BI service interface.

    2. The banner typically includes your trial end date (60 days from activation).

    3. During this trial, your workspaces use the dedicated trial capacity, allowing you to test out premium features, including the Osmos AI Data Wrangler.

Managing Your Trial

  • Duration: The trial lasts for 60 days. You can continue exploring any Fabric workloads during this period.

  • Expiration: Once the trial ends, your workspace reverts to shared capacity, but all data in OneLake remains intact. You’ll lose premium features (e.g., large model sizes) unless you purchase a capacity or license.

  • Early Cancellation: If you no longer need the trial, you can let it expire naturally or deactivate it in the admin portal (if you have the required permissions).

Note: If you purchase a paid capacity (e.g., Fabric capacity in the Power BI admin portal) before your trial ends, your workspace seamlessly transitions to the paid tier without downtime.

Welcome to Osmos

Osmos | AI Agents for Enterprise Data

Let AI agents that reason, act, and adapt automate ingestion, cleanup, and transformation—no human hand-holding required.

Getting started is easy!

1. Log into Microsoft Fabric: Go to app.fabric.microsoft.com. You can sign up for a Microsoft Fabric Trial if you don't have an account.

2. Add the Osmos Workload: Once you're in, add the Osmos Workload to start using the AI Data Wrangler, which helps automate data transformations.

3. Get In Touch: If you’d like a more personalized product walkthrough, you can schedule a time with an expert through the provided link.

4. Support: If you have any questions or issues, email [email protected].

Get In Touch With Us

If you want a product walkthrough, click here to find time with an expert!

AI Data Wrangler

AI Data Wrangler | Hands-Free Data Clean Up

Let Osmos AI Data Wrangler handle your messiest files—Excel, PDFs, CSVs, and more—and deliver clean, SQL-ready data without manual effort.

Common Fabric Issues & Troubleshooting

These are common issues when enabling the Microsoft Fabric Trial

  1. No 'Try Fabric' Button

    1. You may already have a premium capacity license, or your organization might have disabled self-service sign-up.

    2. Verify with your administrator or check the Power BI Admin Portal under Settings → Admin Portal → Capacity settings.

  2. Account or Permission Problems

    1. Ensure you’re using a work Microsoft 365 account.

    2. If self-service is blocked, an admin must enable it or assign a trial capacity directly.

  3. Region Unavailable

    1. Confirm that Microsoft Fabric supports your tenant region. If not, you won’t be able to access the Fabric trial.

  4. Capacity Limitations

    1. During the trial, usage limits (e.g., max memory capacity, concurrency) may apply. These limits are explained in the Fabric Trial FAQ.

Learn More

  • Fabric Trial FAQ

  • Workspace License Information

  • Workspace Roles

Wrangler Context

What is Wrangler Context?

Best Practices for Column Descriptors

  1. Define the valid data that can be stored in this column.

  2. Do not specify the source data. Focus on the destination column.

  3. Column Descriptors are tied to the destination table. If multiple Wranglers are pointed to the same table, they share the same descriptors.

  4. Provide examples

  5. Tell us how you want null values handled.

  6. Tell us how you want errors handled; ideally, provide an example.

  7. Be careful not to create contradictions between the data type and the column descriptor. For instance, if the data type is int32, do not ask to round to zero decimal places.

  8. If information is already in a descriptor, be careful of deleting it, especially if it is being shared across Wranglers. It is best practice to add to a description by editing the descriptor rather than deleting it.

Wrangler Data Statuses

Here are the various statuses for a Wrangler file.

Wrangler Statuses

  1. Once you select your file(s), they will be listed on the bottom half of your Wrangler.

  2. Each file will have a status that updates as it moves through Wrangler processing.

Statuses include:

  • Queued

  • AI Processing

  • Ready for Review

  • Completed

  • Failed

  • Rejected

  • Cancelled

File Metadata

File Name Available in the Column Mapping

The originating source file name can automatically be added to a destination column.

  1. Simply add a column named Filename.

  2. When you choose a file, it will automatically add the file name to the filename column.
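
For reference, a Fabric notebook can achieve the same effect manually. The sketch below is a minimal example, assuming a hypothetical lakehouse path and destination table, of capturing the originating file name with PySpark's input_file_name—which is what the Wrangler does for you when a Filename column exists.

```python
from pyspark.sql.functions import input_file_name

# Hypothetical lakehouse folder and destination table, for illustration only.
df = (
    spark.read.option("header", True).csv("Files/uploads/*.csv")
    # Record which source file each row came from, mirroring what the
    # Wrangler writes into a destination column named Filename.
    .withColumn("Filename", input_file_name())
)

df.write.mode("append").saveAsTable("customer_data")
```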

AI Data Engineer

AI Data Engineer | Autonomous ETL with Spark

Let Osmos AI Data Engineer autonomously create execution-ready Spark notebooks for data engineering and ETL tasks.


Adding the Osmos Workload

Steps to Add the Osmos Workload

Microsoft Fabric's "workloads" refer to distinct components or capabilities integrated into the Fabric framework, such as Data Warehouse and Power BI, that enhance the service's usability within the Fabric workspace, allowing users to perform specific tasks without leaving the environment.

Adding the Osmos Workload

Once the tenant settings and capacity are enabled and configured, you can add the Osmos Workload.

  1. From the Fabric homepage, navigate to the left-hand side of the page and select the "Workloads" tab.

  2. Select "More Workloads", then select the Osmos AI Data Wrangler tile to visit the Osmos listing within Fabric.

Potential prerequisite: A Fabric admin may be required to enable the Osmos Workload.

Users do not need to be admins as long as the Fabric admin has not turned off the relevant tenant setting, which is on by default.

Learn More

  • Microsoft Fabric Workloads

Adding the Osmos Workspace

How to create a new Workspace

Workspaces are places where colleagues collaborate to create collections of items, such as AI Data Wranglers, lakehouses, warehouses, and reports, and to create task flows.

How to create a new Workspace

  1. Go to the Osmos Workload Homepage

  2. Select New Workspace

  3. Add the following:

  • Name - Required field. Name the workspace

  • Description - Optional field. Describe the workspace

  • Domain - Optional field. Assign this workspace to a relevant domain to help people discover its content. Each workspace can be assigned to one domain.

  4. Select Apply

How to access an existing Workspace

There are two ways to access a Workspace easily.

  1. Side Panel

    1. From the side panel, select Workspaces.

    2. Your workspaces will be listed for your Workload.

  2. Workload Homepage

    1. From the homepage, select See workspaces.

    2. A list of workspaces will be displayed for your workload.

Learn More

  • Workspaces In Microsoft Fabric

Adding Data into a Lakehouse

How to add Data into a Lakehouse

Before you can wrangle your data, it must exist in a lakehouse and be available to ingest into AI Data Wrangler.

How to upload a local file

One easy method to upload a file stored on your local machine is through the Lakehouse Explorer.

  1. Go to the Workspace where the AI Data Wrangler resides and select the Lakehouse.

  2. In the Lakehouse Explorer, select Files.

  3. Select Upload files.

    1. Select files from your local machine.

    2. Hit Upload.

Your file will now be available in your AI Data Wrangler.

Reminder: Data must exist in a lakehouse before it is available to ingest into AI Data Wrangler.

Methods to add data into a Lakehouse

In Microsoft Fabric, here are several ways to get data into a lakehouse:

  • File upload from local computer

  • Run a copy tool in pipelines

  • Set up a dataflow

  • Apache Spark libraries in notebook code (see the sketch after this list)

  • Stream real-time events with Eventstream

  • Get data from Eventhouse
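
As a sketch of the "Apache Spark libraries in notebook code" option referenced above, the snippet below shows one way to land a file as a lakehouse Delta table from a Fabric notebook; the file path and table name are placeholder assumptions.

```python
# Minimal sketch: load a raw CSV and save it as a Delta table in the
# lakehouse attached to the notebook. Path and table name are placeholders.
df = spark.read.option("header", True).csv("Files/raw/orders.csv")

df.write.mode("overwrite").format("delta").saveAsTable("orders_raw")
```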

Learn More

  • Microsoft Fabric Lakehouse

Running a Wrangler

Here are the steps to run a Wrangler.

Choose Source File

  1. In the Osmos Wrangler, select the file(s) you wish to process.

  2. Click on the Choose Files icon.

  3. Choose the Lakehouse that contains the source file and hit Connect. Note that the Lakehouse turns light gray when selected.

  4. Select the file(s) and hit Save. The source file(s) will automatically begin to process.

The file will initiate with a status of Queued.

Cancelling a Run

You can cancel a Wrangler run once it moves into the AI Processing status (right after Queued).

  1. Go to the Fabric Monitor.

  2. Wait until your Wrangler run moves into a status of In Progress.

  3. Hover on the right side of the Activity Name until you see the Cancel icon.

  4. Select the Cancel icon.

This will initiate a run cancellation and update the Wrangler run to a Cancelled status.

View Failed Run Detail

You can view more information about the failed run.

  1. Go to the failed Wrangler file

  2. Select the ellipsis button to the right of the Status column

  3. Select View failure details

Rerun a Failed Run

To rerun a failed file:

  1. Go to the failed Wrangler run.

  2. Select the ellipsis button to the right of the Status column

  3. Select Redo failed run.

This action will queue the file to run again.

Descriptors

What are Descriptors?

Descriptors define and enforce schema-level constraints to ensure structural consistency across datasets. Essentially, they give users a way to describe a field, adding details about the column definition such as acronyms and any other relevant information. Field descriptors are not aware of the source data. They are optional and scoped to a destination table, which means all Wranglers pointing to the same table share a common set of field descriptors.

Users can provide descriptions for column headings when they want to guide the column cleaning to achieve better results. These column header description fields are called column descriptors. Column descriptors are applied during the file review process. Once a column descriptor has been entered, you must save and rerun the file to apply the changes.

Descriptors are scoped to the destination table, not a Wrangler.

Here are suggestions for adding column descriptors to drive the most effective outcomes.

  • Describe valid data for this field.

  • Describe this field’s relationship to other fields in this table.

  • Describe any business rules that govern how this field should be populated.

Step 1: Access the Column Descriptors

  1. When the file is ready, select Ready for Review.

  2. In the review screen, select Retry (note: it will default to Approve).

  3. A Retry file processing information message will pop up; select Got It.

Step 2: Adding a Column Descriptor

  1. In the Retry screen, select Add Descriptor, which is located directly below the column header field.

  2. When you select the Descriptor, the instructions box will open on the right.

  3. In the box, describe how your data should be cleaned.

  4. Select Save Descriptor.

Note: The column descriptor will update from Add Descriptor to Edit Descriptor.

Step 3: Applying the Descriptor

  1. To apply the descriptors to the file, select Rerun File in the lower right-hand corner.

Step 4: Editing a Descriptor

  1. Descriptors can be modified by selecting Edit Descriptor.

  2. Update the instructions.

  3. Select Save Descriptor.

Create an AI Data Wrangler

Here are the steps to create a new AI Data Wrangler.

Step 1: Create the new AI Data Wrangler

  1. Go into the Workspace where you wish to add the AI Data Wrangler.

  2. Select New Item.

  3. Select the Osmos AI Data Wrangler item.

  4. Name the new AI Data Wrangler.

    1. Note: Names can only start and end with a letter, number, or underscore.

  5. Select Create

Step 2: Choose Your Destination

  1. Select Choose your destination table.

  2. Choose the Lakehouse that contains the destination table.

  3. Select Connect.

  4. Choose Destination - where your data will go by selecting a table.

  5. Hit Save.

Now you are ready to process a file!

Instructions

What are Instructions?

Instructions provide guardrails that guide the AI, ensuring transformations stay within defined constraints and follow business intent. Users can upload one or more files, such as Business Requirements Documents and Information Architecture documentation. This documentation is used to generate descriptors, validators, and instructions that guide the Wrangler.

Enter instructions specific to the Wrangler, such as cleaning rules, formatting guidelines, or preprocessing steps. These will only apply to the Wrangler's operations and won't affect others tied to the destination table.

There are two types of Instructions in the Wrangler.

  1. Auto-configure Wrangler using documentation

    1. Select a folder with documents that define destination data, source data, and how to transform from source to destination. The Wrangler will use this information to create instructions, descriptors and validators for review.

  2. Provide Instructions

    1. Manually enter specific instructions

Unlike Descriptors, Instructions are scoped to the Wrangler.

Connect a Destination Table

Connect to Destination Table

  1. Select Connect to destination.

  2. Choose the Lakehouse that contains the destination table.

  3. Select Connect.

  4. Choose Destination - where your data will go by selecting a table.

  5. Hit Save.

Now you have a destination connected!

Create an AI Data Engineer

Here are the steps to create a new AI Data Engineer.

Create the new AI Data Engineer

  1. Go into the Workspace where you wish to add the AI Data Engineer.

  2. Select New Item.

  3. Select the Osmos AI Data Engineer item.

  4. Name the new AI Data Engineer.

    1. Note: Names can only start and end with a letter, number, or underscore.

  5. Select Create

Your AI Data Engineer is created!

Writing to the Destination

Step 5: Accept Your File and Write to the Destination

Review and approve your file(s) before writing to the Lakehouse.

  1. Select Ready for Review.

  2. To accept and write to the destination, select Approve.

  3. If you do not wish to write the current file to the destination, select Reject.

    1. The file will not be written to the destination.

    2. It will update to a Failed status.

    3. Reasons for rejection may vary; for example, the user initially chose the wrong file to process.

  4. If you select Retry, it will take you to the Retry Processing window.

    1. The retry screen allows you to incorporate column descriptors.

    2. If you choose to rerun the file, the outcome of the previous run(s) will not be saved.

Support

Support Offerings Overview for AI Data Wrangler

Osmos provides the following Support Offering tiers.

Support Tiers


Starter Support

  • Email Support

  • 24-hour response during business hours

Scale Support

  • Slack and Email Support

  • 12-hour response during business hours

  • 4 Onboarding sessions

Enterprise Support

  • Slack, Teams, Email, and Phone Support

  • 4-hour response during business hours

  • Use case review and guidance

  • Named customer success manager

  • On-demand onboarding and training sessions

  • First priority support

  • Custom contracts

Mission Critical Support

(Add-on)

Everything in Enterprise Support, plus:

  • 1-hour response time, 24x7x365

  • Weekly touchpoints

  • Product expertise

  • Train-the-trainer sessions

  • Ongoing support for Product Owners/SMEs

  • Influence on the Osmos roadmap

  • Named Customer Engineer

  • An expert point of contact for Product/Project Managers

  • Engineering design guidance for the onboarding process and workflow

  • Operational process and workflow guidance

  • Support with complex data scenarios

Get In Touch With Us 🤝

Contact us at [email protected].

AI Data Engineer Overview

Overview

Built for complexity. Designed for autonomy.

The Osmos AI Data Engineer is your intelligent agent for solving high-scale, high-complexity data engineering problems. It autonomously builds execution-ready, reusable Spark notebooks, transforming fragmented datasets into pipeline-ready code—so you can scale faster, build better, and worry less.

From intricate transformations to massive data integrations, Osmos AI Data Engineers help you orchestrate your Fabric environment with precision, visibility, and speed.

🚀 What It Is

The AI Data Engineer is a purpose-built AI agent that generates thoroughly tested Python Spark notebooks tailored to your ETL and data engineering workloads. Whether working with relational databases, JSONs, or hundreds of interrelated CSVs, it autonomously engineers pipelines that are production-grade, versionable, and ready for reuse.

👥 Who It’s For

  • Primary User Persona: Data Engineering Teams

  • Secondary Personas: Data Services and Platform Engineering Teams

🎯 Key Use Cases

  • Building Spark-based ETL pipelines for:

    • Relational data (SQL, warehouse exports)

    • CSV, JSON, XML, Parquet, and log files

    • Interrelated or hierarchical datasets

    • Huge files or complex schema mappings

  • Automating workspace orchestration in Fabric or other lakehouse environments

  • Generating repeatable and testable pipelines for ongoing data integration

🛠️ How It Works

1. Tell Your Data Engineer About the Use Case

Upload whatever you have—source files, schema docs, designs, prior instructions. The AI will use this context to understand your goal.

2. AI Builds a Purpose-Built Notebook

The AI Data Engineer:

  • Samples and profiles input data intelligently

  • Identifies transformation logic and schema alignment

  • Writes Spark-based Python code with built-in test coverage

  • Iterates on logic if errors are detected during validation

3. Oversee the Engineer’s Work

You stay in control:

  • Monitor progress

  • Validate the output

  • Review or modify code

  • Provide additional feedback or instructions

4. Integrate Into Your Workflow

  • Schedule notebooks to run automatically

  • Plug them into Fabric, Airflow, dbt, or your existing pipelines

  • Store in Git for version control and long-term reuse
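
As one hedged example of plugging a generated notebook into an existing pipeline, the sketch below runs it nightly from an Airflow DAG via the papermill CLI. The DAG id, schedule, and notebook paths are hypothetical; Osmos does not prescribe a specific orchestrator.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG that executes an Osmos-generated notebook once a day.
# Adjust the schedule, paths, and operator to match your environment.
with DAG(
    dag_id="osmos_generated_notebook",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # "schedule_interval" on older Airflow releases
    catchup=False,
) as dag:
    run_notebook = BashOperator(
        task_id="run_notebook",
        bash_command=(
            "papermill /notebooks/ingest_orders.ipynb "
            "/notebooks/runs/ingest_orders_{{ ds }}.ipynb"
        ),
    )
```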

🧩 Key Capabilities

  • Autonomous by Design: Just describe what you need; the agent handles data sampling, transformation, and validation.

  • Built to Adapt: Capable of transforming complex, nested, or massive datasets.

  • Code You Can Trust: Fully tested and version-controlled Python Spark notebooks.

  • Production-Ready Output: Ready for deployment into orchestration tools and lakehouse pipelines.

⚙️ AI Decision-Making Logic

Much like a skilled mid-level engineer, the AI Data Engineer:

  • Samples files and learns from structure and patterns

  • Adjusts logic based on runtime errors and data anomalies

  • Writes and executes test cases

  • Reprocesses files until success

  • Outputs clean, reusable, and reliable notebooks

The goal: execution-ready notebooks that can be plugged into your production systems with confidence.

🧪 Example Scenarios

  • Hundreds of CSVs needing to be merged and normalized → Consolidated schema, clean joins, and a Spark notebook ready for scheduling

  • Relational data dump + JSON blobs → Parsed relationships, flattened nested data, transformed into analytics tables

  • Terabytes of log data from multiple sources → Partitioned and transformed Spark pipeline that generates usable summaries

  • Mixed XML + Parquet sources for ML preprocessing → Cleaned, reshaped training dataset exported via a structured Spark workflow

Summary

The Osmos AI Data Engineer is where modern data engineering meets intelligent automation. Designed to tackle what traditional ETL tools can’t, it generates reusable, testable code you can trust—letting you move faster, break less, and deliver more.

Focus on what matters most. Let the AI handle the engineering.

Auto-Configure Instructions

Overview

The Auto-Configure Instructions feature enables users to define how Osmos AI Data Agents (such as the AI Data Engineer) should transform and process source data, without requiring any code. Whether you provide instructions manually or let the AI derive them from documentation in a folder, this feature puts you in control of the configuration logic while the AI does the heavy lifting.

What It Is

Auto-Configure is an intelligent setup assistant that helps you:

  • Create a structured instruction set for your AI agent

  • Define destination schema and transformation logic

  • Provide either typed instructions or a folder of documentation

  • Preview and edit the AI-generated configuration before execution

It transforms your documentation, business rules, or direct input into clean, human-readable instructions the AI can follow for repeatable, controlled data processing.

Auto-configure instructions

  1. Select Add Instructions.

  2. Select Manually provide Instructions.

  3. You now have the option to upload a folder and/or provide instructions manually.

Two Ways to Configure Instructions

1. Manual Instructions

Directly input your configuration logic using a structured form:

  • Destination Tables: Specify where your output should go. This often pre-populates from your Fabric workspace.

  • Source Files: Identify what data is being transformed.

  • Ingestion Instructions: Describe transformations, validation rules, mappings, and business logic.

✅ Best for cases where you know exactly what the AI needs to do or when dealing with new, one-off logic.

2. Choose an Instructions Folder

Point the AI to a folder containing relevant documentation. The AI will:

  • Read all provided files (up to 10)

  • Extract transformation logic, schemas, and business intent

  • Generate editable instructions from:

    • Business requirements docs

    • Data models and schema designs

    • Code snippets or prior scripts (SQL, Python, etc.)

    • Sample source/output files

✅ Best for existing projects, historical context, or when repurposing prior data transformation logic.

How It Works Behind the Scenes

The Osmos AI Data Engineer uses generative AI to analyze your inputs and convert them into a structured instruction set, including:

  • Target schema details

  • Source-to-destination mapping logic

  • Transformation functions and validations

  • Edge cases and data quality checks

These instructions act as guardrails for the AI, helping it:

  • Stay aligned with business rules

  • Ensure data integrity

  • Avoid brittle or incorrect transformations

🧩 Real-World Example

Let’s say your destination is an employee_pets table, and you want the AI to extract employee and pet information from messy spreadsheets. You could either:

  • Manually configure: “Destination table is employee_pets. Use all files in the folder. Map emp_type to one of [Full-time, Part-time, Contractor]. Standardize phone numbers. Extract date from header.”

  • Use a folder: Upload a folder containing:

    • A document describing the target schema

    • A sample table of cleaned data

    • A script with useful regex patterns

    • Notes about mapping rules

The AI will parse that information and present it back to you as an editable instruction template. You can then adjust as needed.
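
To make that example concrete, here is a rough PySpark sketch of the kind of transformation logic the AI might derive from those instructions. The column names, value mappings, and file locations are illustrative assumptions, not the agent's actual output.

```python
from pyspark.sql import functions as F

# Hypothetical source files and destination table for the employee_pets example.
raw = spark.read.option("header", True).csv("Files/hr_exports/*.csv")

emp_type_norm = F.lower(F.trim(F.col("emp_type")))

clean = (
    raw
    # Map emp_type onto the allowed set [Full-time, Part-time, Contractor].
    .withColumn(
        "emp_type",
        F.when(emp_type_norm.isin("ft", "full time", "full-time"), "Full-time")
         .when(emp_type_norm.isin("pt", "part time", "part-time"), "Part-time")
         .when(emp_type_norm.isin("contract", "contractor", "c2c"), "Contractor")
         .otherwise(F.col("emp_type")),
    )
    # Standardize phone numbers by stripping everything except digits.
    .withColumn("phone", F.regexp_replace("phone", r"[^0-9]", ""))
)

clean.write.mode("append").saveAsTable("employee_pets")
```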

🔁 Iteration & Feedback

  • After reviewing the generated instructions, you can:

    • Edit them inline

    • Add edge-case handling

    • Strengthen constraints (e.g., "fail if source columns change"; see the sketch after this list)

  • If the result isn't right, update your instructions and regenerate

  • Use real-time feedback to refine and guide the AI’s behavior
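
For instance, a constraint like "fail if source columns change" might translate into a defensive check along these lines; the expected column set is a hypothetical placeholder.

```python
# Fail fast when the source schema drifts from what the instructions assume.
EXPECTED_COLUMNS = {"employee_id", "emp_type", "phone", "pet_name"}  # hypothetical

source_df = spark.read.option("header", True).csv("Files/hr_exports/latest.csv")

actual_columns = set(source_df.columns)
if actual_columns != EXPECTED_COLUMNS:
    missing = sorted(EXPECTED_COLUMNS - actual_columns)
    unexpected = sorted(actual_columns - EXPECTED_COLUMNS)
    raise ValueError(
        f"Source columns changed: missing={missing}, unexpected={unexpected}"
    )
```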

🔒 Folder Constraints

When using a folder to generate instructions:

  • Limit of 10 files per instruction set

  • All files must be relevant to the current data use case

  • Accepted formats include DOCX, PDF, TXT, CSV, XLSX, SQL, and code files


Generate Notebook

Overview

The Generate Notebook feature is at the heart of how the Osmos AI Data Engineer turns your configuration into powerful, reusable Python code. With one click, it produces a fully functional Spark-based notebook that is ready to run, schedule, version, and integrate into pipelines.

These notebooks do more than ingest and transform data—they represent long-living, production-grade workflows that evolve with your needs, while putting human reviewers in complete control.

What It Is

Generate Notebook triggers the AI Data Engineer to build a ready-to-run Python notebook based on your configuration instructions, source files, and destination schemas. The notebook is:

  • Execution-ready: Includes logic for ingestion, transformation, and validation

  • Reusable: Can be versioned, re-executed, and adapted for new data

  • Pipeline-ready: Built for integration into orchestration systems (e.g., Fabric, Airflow)

  • Autonomous but supervised: All actions are user-initiated, ensuring full control

Think of it as saying: “Hey engineer, write me a Python notebook for this job.” And the AI does it—intelligently, iteratively, and at scale.

What the AI Does Behind the Scenes

When you click Generate Notebook, the AI Data Engineer will:

  1. Sample & Analyze Files It inspects your input data (CSV, JSON, XML, Parquet, etc.) to understand schemas, anomalies, and transformations.

  2. Write the Code It generates Spark-based Python code that:

    • Ingests your data

    • Transforms it according to your instructions

    • Includes built-in schema checks and validation logic

  3. Write Its Tests The notebook includes test cases to catch data issues, logic gaps, or structural inconsistencies.

  4. Handle Errors Automatically If tests fail, the AI:

    • Resamples the data

    • Revises the code

    • Re-generates logic until a working solution is found

  5. Add Bookkeeping Built-in logic tracks what data has been processed, avoiding duplicates or reprocessing in future runs.
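
To give a sense of the overall shape, here is a heavily simplified sketch of a notebook that ingests, transforms, validates, and keeps bookkeeping of processed files. The paths, table names, and checks are illustrative assumptions, not the literal code the AI Data Engineer produces.

```python
from pyspark.sql import functions as F

SOURCE_PATH = "Files/raw/sales/*.csv"   # hypothetical source files
DEST_TABLE = "sales_clean"              # hypothetical destination table
LOG_TABLE = "osmos_processed_files"     # hypothetical bookkeeping table

# 1. Ingest, keeping track of which file each row came from.
raw = (
    spark.read.option("header", True).csv(SOURCE_PATH)
    .withColumn("_source_file", F.input_file_name())
)

# 2. Skip files that an earlier run already processed (bookkeeping).
if spark.catalog.tableExists(LOG_TABLE):  # available in Spark 3.3+
    processed = spark.table(LOG_TABLE).select("file_name")
    raw = raw.join(processed, raw["_source_file"] == processed["file_name"], "left_anti")

# 3. Transform according to the configured instructions (illustrative step).
clean = raw.withColumn("amount", F.col("amount").cast("double"))

# 4. Validate before writing; raise so the run surfaces a clear error.
if clean.filter(F.col("amount").isNull()).count() > 0:
    raise ValueError("Validation failed: 'amount' contains non-numeric values")

# 5. Write results and record the files handled in this run.
clean.drop("_source_file").write.mode("append").saveAsTable(DEST_TABLE)
(
    raw.select(F.col("_source_file").alias("file_name")).distinct()
    .write.mode("append").saveAsTable(LOG_TABLE)
)
```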

🔄 Iteration & Control

You're always in the loop:

  • Preview the generated notebook

  • Edit the code or configuration as needed

  • Regenerate if something’s missing or incorrect

  • Schedule or manually trigger runs

  • Version in Git or any repo of your choice

Notebooks are written defensively. They handle schema shifts gracefully, raising clear errors for issues that need intervention.

🧩 Key Capabilities

  • Reusable & Versionable: Notebooks are long-living and can be stored, shared, and reused.

  • Fully Tested: AI includes test scripts and validation checks.

  • Pipeline Integration: Designed to plug into workflows and orchestration platforms.

  • Bookkeeping Logic: Automatically tracks processed files for repeatable, safe operations.

  • Performance Optimized: Passes through an AI profiler for better runtime and scaling.

  • Human-in-the-Loop: All notebook generation and execution are initiated and reviewed by users.

✅ Summary

The Generate Notebook feature lets you go from config to code—automatically. Whether you're managing a complex data lake or building a repeatable ingestion flow, Osmos AI writes high-quality notebooks for you.

No boilerplate. No hand coding. Just reliable, production-ready notebooks you can trust.

AI Data Wrangler Overview

Overview

What is the AI Data Wrangler? The Osmos AI Data Wrangler is an autonomous data agent that transforms your messiest, most irregular files into clean, structured data—hands-free. It’s purpose-built to automate the wrangling of complex file formats found across SharePoint, FTPs, GDrive, and more, enabling faster, more reliable decision-making without scaling up your data engineering team.

Currently available within Microsoft Fabric, the AI Data Wrangler helps organizations prepare lakehouse data with precision and minimal effort by leveraging generative AI.

🚀 What It Is

The AI Data Wrangler uses GenAI to intelligently and autonomously clean and reshape messy files into structured, SQL-ready datasets. Whether it's inconsistent Excel exports, broken PDFs, or fixed-width legacy system files, the Wrangler selects the most effective processing strategy—writing custom code or chunking through LLMs—so you don’t have to.

No rules. No templates. No manual rework.

👥 Who It’s For

  • Primary User Persona: Business, Operations, and Data Services Teams

🎯 Key Use Cases

  • Preparing messy source files to deliver SQL-ready data for downstream analytics

  • Wrangling input from:

    • Excel files with inconsistent headers and merged rows

    • PDFs with embedded or unstructured data

    • Fixed-width or custom-delimited exports

    • “Not really” CSVs from legacy tools

  • Mapping irregular data to a standardized schema (e.g., customer master table)

🛠️ How It Works

1. Submit Your Files

Upload messy files from sources like SharePoint, GDrive, FTPs, or internal systems. Supports a wide range of formats with irregular structure.

2. Provide Instructions—or Don’t

You can:

  • Point to a golden schema or a Fabric destination table (see the schema sketch after this list)

  • Let the AI infer expectations from instructions, example files, or even code

  • Use Autoconfigure to ingest prior docs and extract transformation logic
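
If it helps to picture what a "golden schema" can look like, the sketch below expresses a hypothetical customer master schema in PySpark; in practice you can simply point the Wrangler at an existing Fabric destination table instead.

```python
from pyspark.sql.types import StructType, StructField, StringType, DateType

# Hypothetical golden schema for a customer master table.
customer_master_schema = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("customer_name", StringType(), nullable=False),
    StructField("email", StringType(), nullable=True),
    StructField("phone", StringType(), nullable=True),
    StructField("signup_date", DateType(), nullable=True),
])
```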

3. Leave the Dirty Work to Osmos

The Wrangler decides:

  • Whether to generate transformation code

  • Whether to chunk and semantically analyze the file using LLMs

  • How to best get your clean, validated tabular data

Each file is processed independently. No brittle code. No manual tuning.

4. Review, Approve, Repeat

  • Review outputs before committing

  • Compare the output side-by-side with the input for validation

  • Request changes or reprocess with new instructions

  • Accept the result and move on

🧩 Key Capabilities

  • Fully Autonomous: AI decides optimal logic per file—LLM, code, or both.

  • Flexible File Support: Handles PDFs, Excels, fixed-width, delimited, malformed CSVs, and more.

  • Golden Schema Mapping: Aligns source data with your lakehouse schema and business expectations.

  • Instant Review Cycles: See results in minutes, give feedback, or approve with a click.

  • Built for Fabric: Seamlessly manages and prepares data in your Microsoft Fabric environment.

⚙️ AI Decision-Making Logic

The Wrangler processes each file independently and flexibly:

  • Infers structure and formatting quirks

  • Chooses between LLM chunking and custom code generation

  • Validates results through in-process checks

  • Supports multiple data types in a single run

The output is always clean tabular data, not reusable code, because messy files change constantly, and brittle code breaks.

🧪 Example Scenarios

  • Broken Excel with multi-row headers → Extracted proper columns, standardized formats, aligned to schema

  • PDF invoices with nested info → Parsed PO numbers, product descriptions, and quantities cleanly

  • Fixed-width export with missing headers → Inferred headers, extracted fields by position, produced structured output

  • Custom-delimited file with inconsistent rows → Detected delimiters, normalized row lengths, created clean flat file

  • Semi-structured CSV with embedded fields → Split merged fields into columns, matched values to categories

Summary

The Osmos AI Data Wrangler turns unstructured, irregular data chaos into consistent, actionable insights—fast. With no need for templates or hand-written transformations, it autonomously learns what your data should look like and delivers results you can trust.

Whether you’re prepping data for analytics or just trying to get invoice PDFs into your lakehouse, the AI Data Wrangler is your hands-free, error-free solution.

From chaos to clean in minutes. Powered by generative AI. Available now in Microsoft Fabric.

Adding Workspace Items

How to add Workspace Items

Microsoft Fabric's workspace is a container for organizing and collaborating on Fabric items, including AI Data Wranglers, lakehouses, and Notebooks. It allows users to manage and share data assets.

How to create a new Workspace Item

  1. Go to the Workspace.

  2. Select New Item.

  3. Select an item such as AI Data Wrangler.

  4. Name the new Item and hit Create.

Learn More

  • Workspace Items In Microsoft Fabric