
Osmos AI Data Wrangler


Introduction

Accelerate your Fabric deployment with Osmos AI data agents. Seamlessly ingest, transform, and structure data using agentic AI.

How Does Osmos Work?

Osmos empowers your data team with agentic AI to seamlessly manage your Fabric environment. The ‘Osmos Data Engineer’ leverages AI Agents to autonomously create execution-ready Spark notebooks for data engineering and ETL tasks, and the ‘Osmos Data Wrangler’ wrangles even your messiest data, hands-free, with AI Data Agents. Osmos can run on any Fabric Capacity provisioned with F2 or higher.

Osmos integrates effortlessly with Microsoft Fabric, creating a robust, scalable, and secure environment for data processing. With this integration, your organization can benefit from Fabric’s advanced analytics and business intelligence capabilities, all powered by pristine, well-structured data.

What is Osmos Used For?

Osmos provides the tools needed to make data transformation accessible across the organization. Key applications include:

  • Data Wrangling and Transformation: Clean, align, and prepare datasets for analytics and AI with minimal manual effort.

  • Data Integration: Streamline data ingestion from multiple sources, enabling rapid analysis and decision-making.

  • Business Intelligence: Ensure your data is consistently structured and accurate to amplify the effectiveness of tools like Microsoft Power BI.

  • Regulatory Compliance: Maintain data integrity and documentation for compliance in industries with strict data regulations.

Osmos enhances your team’s ability to work with data at scale, minimizing the need for technical resources traditionally required for extensive data preparation.

Why Use Osmos with Microsoft Fabric?

By integrating with Microsoft Fabric, Osmos creates a unified environment where data can be processed, analyzed, and transformed within a single ecosystem. Osmos AI Data Wrangler, available as a native workload in Microsoft Fabric, provides several distinct advantages:

  • Scalability: Easily manage data wrangling for both small and large datasets.

  • Enhanced Collaboration: Fabric’s collaborative environment supports teamwork on data projects, allowing insights and workflows to be shared seamlessly.

  • Real-Time Monitoring: Fabric's centralized monitoring tools allow you to track the status of your data transformation jobs across your organization.

  • Security and Compliance: Microsoft Fabric’s enterprise-grade security ensures your data remains protected throughout the wrangling process.

Osmos on Fabric allows you to maximize the potential of your data without expanding your data engineering team.

Fabric Tenant Settings

Steps to Add the Osmos Workload and enable the AI Data Wrangler

Enable the Fabric tenant settings and capacities

To begin working with the AI Data Wrangler, you must enable tenant settings and capacities in Fabric. The steps below require admin access. This is a tenant-wide activity.

  1. Log in to Fabric

  2. Go to Settings and select Admin portal

  3. Select Tenant Settings

  4. In the Additional Workloads (preview) section:

    • In the Capacity admins and contributors can add and remove additional workloads section, select Enabled.

    • In the Users can see and work with uncertified partner workloads section, select Enabled.

    • Apply to the entire organization or specific groups.

Note: Users can add workloads if they meet the following criteria:

  • They have been granted permission to add workloads by a Fabric admin

  • They're a capacity admin or have permission to assign capacities to workspaces

Note: The AI Data Wrangler requires Microsoft Fabric with a minimum F2 Capacity

Getting Started with Microsoft Fabric

Steps to enable your free Microsoft Fabric Trial.

Enabling a Microsoft Fabric Trial

Follow these steps to enable a free Microsoft Fabric trial. This gives you temporary access to all Fabric features—including Premium Capacity (F64)—so you can explore the Osmos AI Data Wrangler before committing to a paid subscription.

Overview

A Microsoft Fabric trial provides a 60-day window to explore everything Fabric has to offer:

  • Access to Fabric experiences such as the Osmos AI Data Wrangler.

  • Integration with OneLake allows you to unify your data in a single, secure location.

  • Premium features (e.g., larger data model sizes) via a dedicated trial capacity.

Once the trial expires, your Fabric workspace(s) will revert to shared capacity, but any data will remain accessible within OneLake—no data loss will occur.

Important: This trial requires a work account under Microsoft Entra ID (formerly Azure Active Directory). Personal Microsoft accounts (Outlook, Hotmail, etc.) are not supported.

Prerequisites

  1. Work Account: You must use a Microsoft 365 work account to sign up for the trial.

  2. Tenant Permissions: Self-service sign-up must be allowed in your organization’s tenant. If self-service sign-up is disabled, you’ll need an admin to enable the trial on your behalf.

  3. Geographic Availability: Check the regional availability to ensure your region offers the Fabric trial.

Steps to Start Your Fabric Trial

  1. Sign In

    1. Navigate to the Microsoft Fabric Trial page or visit fabric.microsoft.com.

    2. Select Sign In and enter your work Microsoft 365 account credentials.

  2. Initiate the Trial

    1. Look for the Try Fabric or Start free trial button.

    2. If prompted, review and accept any terms and conditions.

    3. No credit card is required; however, you may need to verify your identity based on your organization’s tenant settings.

  3. Select a Workspace

    1. The free capacity trial applies to your My Workspace in Power BI / Microsoft Fabric by default.

    2. If you want to enable the trial for an organizational workspace, ensure you have the appropriate admin or contributor permissions in that workspace.

  4. Confirm Activation

    1. After accepting the trial, you’ll see a confirmation banner in the Fabric portal or your Power BI service interface.

    2. The banner typically includes your trial end date (60 days from activation).

    3. During this trial, your workspaces use the dedicated trial capacity, allowing you to test out premium features, including the Osmos AI Data Wrangler.

Managing Your Trial

  • Duration: The trial lasts for 60 days. You can continue exploring any Fabric workloads during this period.

  • Expiration: Once the trial ends, your workspace reverts to shared capacity, but all data in OneLake remains intact. You’ll lose premium features (e.g., large model sizes) unless you purchase a capacity or license.

  • Early Cancellation: If you no longer need the trial, you can let it expire naturally or deactivate it in the admin portal (if you have the required permissions).

Note: If you purchase a paid capacity (e.g., Fabric capacity in the Power BI admin portal) before your trial ends, your workspace seamlessly transitions to the paid tier without downtime.

Welcome to Osmos

Osmos | AI Agents for Enterprise Data

Let AI agents that reason, act, and adapt automate ingestion, cleanup, and transformation—no human hand-holding required.

Getting started is easy!

1. Log into Microsoft Fabric: Go to app.fabric.microsoft.com. You can sign up for a Microsoft Fabric Trial if you don't have an account.

2. Add the Osmos Workload: Once you're in, add the Osmos Workload to start using the AI Data Wrangler, which helps automate data transformations.

3. Get In Touch: If you’d like a more personalized product walkthrough, you can schedule a time with an expert through the provided link.

4. Support: If you have any questions or issues, email [email protected].

Get In Touch With Us

If you want a product walkthrough, click here to find time with an expert!

AI Data Wrangler

AI Data Wrangler | Hands-Free Data Clean Up

Let Osmos AI Data Wrangler handle your messiest files—Excel, PDFs, CSVs, and more—and deliver clean, SQL-ready data without manual effort.

Common Fabric Issues & Troubleshooting

These are common issues when enabling the Microsoft Fabric Trial

  1. No 'Try Fabric' Button

    1. You may already have a premium capacity license, or your organization might have disabled self-service sign-up.

    2. Verify with your administrator or check the Power BI Admin Portal under Settings → Admin Portal → Capacity settings.

  2. Account or Permission Problems

    1. Ensure you’re using a work Microsoft 365 account.

    2. If self-service is blocked, an admin must enable it or assign a trial capacity directly.

  3. Region Unavailable

    1. Confirm that Microsoft Fabric supports your tenant region. If not, you won’t be able to access the Fabric trial.

  4. Capacity Limitations

    1. During the trial, usage limits (e.g., max memory capacity, concurrency) may apply. These limits are explained in the Fabric Trial FAQ.

Learn More

  • Fabric Trial FAQ

  • Workspace License Information

  • Workspace Roles

Wrangler Context

What is Wrangler Context?

Best Practices for Column Descriptors

  1. Define the valid data that can be stored in this column.

  2. Do not specify the source data. Focus on the destination column.

  3. Column Descriptors are tied to the destination table. If multiple Wranglers are pointed to the same table, they share the same descriptors.

  4. Provide examples

  5. Tell us how you want null values handled.

  6. Tell us how you want errors handled; ideally, provide an example.

  7. Be careful not to create contradictions between the data type and the column descriptor. For instance, if the data type is int32, do not ask to round to zero decimal places.

  8. If information is already in a descriptor, be careful of deleting it, especially if it is being shared across Wranglers. It is best practice to add to a description by editing the descriptor rather than deleting it.

Wrangler Data Statuses

Here are the various statuses for a Wrangler file.

Wrangler Statuses

  1. Once you select your file(s), they will be listed on the bottom half of your Wrangler.

  2. Each file will have a status that updates as it moves through Wrangler processing.

Statuses include:

  • Queued

  • AI Processing

  • Ready for Review

  • Completed

  • Failed

  • Rejected

  • Cancelled

File Metadata

File Name Available in the Column Mapping

The originating source file name can automatically be added to a destination column.

  1. Simply add a column named Filename.

  2. When you choose a file, it will automatically add the file name to the filename column.
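
For reference, a Fabric notebook can achieve the same effect manually. The sketch below is a minimal example, assuming a hypothetical lakehouse path and destination table, of capturing the originating file name with PySpark's input_file_name—which is what the Wrangler does for you when a Filename column exists.

```python
from pyspark.sql.functions import input_file_name

# Hypothetical lakehouse folder and destination table, for illustration only.
df = (
    spark.read.option("header", True).csv("Files/uploads/*.csv")
    # Record which source file each row came from, mirroring what the
    # Wrangler writes into a destination column named Filename.
    .withColumn("Filename", input_file_name())
)

df.write.mode("append").saveAsTable("customer_data")
```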

AI Data Engineer

AI Data Engineer | Autonomous ETL with Spark

Let Osmos AI Data Engineer autonomously create execution-ready Spark notebooks for data engineering and ETL tasks.


Adding the Osmos Workload

Steps to Add the Osmos Workload

Microsoft Fabric's "workloads" refer to distinct components or capabilities integrated into the Fabric framework, such as Data Warehouse and Power BI, that enhance the service's usability within the Fabric workspace, allowing users to perform specific tasks without leaving the environment.

Adding the Osmos Workload

Once the tenant settings and capacity are enabled and configured, you can add the Osmos Workload.

  1. From the Fabric homepage, navigate to the left-hand side of the page and select the "Workloads" tab.

  2. Select "More Workloads", then select the Osmos AI Data Wrangler tile to visit the Osmos listing within Fabric.

Potential prerequisite: A Fabric admin may be required to enable the Osmos Workload.

Users do not need to be admins as long as the Fabric admin has not turned off the relevant tenant setting, which is on by default.

Learn More

  • Microsoft Fabric Workloads

Adding the Osmos Workspace

How to create a new Workspace

Workspaces are places where colleagues collaborate to create collections of items, such as AI Data Wranglers, lakehouses, warehouses, and reports, and to create task flows.

How to create a new Workspace

  1. Go to the Osmos Workload Homepage

  2. Select New Workspace

  3. Add the following:

  • Name - Required field. Name the workspace

  • Description - Optional field. Describe the workspace

  • Domain - Optional field. Assign this workspace to a relevant domain to help people discover its content. Each workspace can be assigned to one domain.

  4. Select Apply

How to access an existing Workspace

There are two ways to access a Workspace easily.

  1. Side Panel

    1. From the side panel, select Workspaces.

    2. Your workspaces will be listed for your Workload.

  2. Workload Homepage

    1. From the homepage, select See workspaces.

    2. A list of workspaces will be displayed for your workload.

Learn More

  • Workspaces In Microsoft Fabric

Adding Data into a Lakehouse

How to add Data into a Lakehouse

Before you can wrangle your data, it must exist in a lakehouse and be available to ingest into AI Data Wrangler.

How to upload a local file

One easy method to upload a file stored on your local machine is through the Lakehouse Explorer.

  1. Go to the Workspace where the AI Data Wrangler resides and select the Lakehouse.

  2. In the Lakehouse Explorer, select Files.

  3. Select Upload files.

    1. Select files from your local machine.

    2. Hit Upload.

Your file will now be available in your AI Data Wrangler.

Reminder: Data must exist in a lakehouse before it is available to ingest into AI Data Wrangler.

Methods to add data into a Lakehouse

In Microsoft Fabric, here are several ways to get data into a lakehouse:

  • File upload from local computer

  • Run a copy tool in pipelines

  • Set up a dataflow

  • Apache Spark libraries in notebook code (see the sketch after this list)

  • Stream real-time events with Eventstream

  • Get data from Eventhouse
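
As a sketch of the "Apache Spark libraries in notebook code" option referenced above, the snippet below shows one way to land a file as a lakehouse Delta table from a Fabric notebook; the file path and table name are placeholder assumptions.

```python
# Minimal sketch: load a raw CSV and save it as a Delta table in the
# lakehouse attached to the notebook. Path and table name are placeholders.
df = spark.read.option("header", True).csv("Files/raw/orders.csv")

df.write.mode("overwrite").format("delta").saveAsTable("orders_raw")
```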

Learn More

  • Microsoft Fabric Lakehouse

Running a Wrangler

Here are the steps to run a Wrangler.

Choose Source File

  1. In the Osmos Wrangler, select the file(s) you wish to process.

  2. Click on the Choose Files icon.

  3. Choose the Lakehouse that contains the source file and hit Connect. Note that the Lakehouse turns light gray when selected.

  4. Select the file(s) and hit Save. The source file(s) will automatically begin to process.

The file will initiate with a status of Queued.

Cancelling a Run

You can cancel a Wrangler run once it moves into the AI Processing status (right after Queued).

  1. Go to the Fabric Monitor.

  2. Wait until your Wrangler run moves into a status of In Progress.

  3. Hover on the right side of the Activity Name until you see the Cancel icon.

  4. Select the Cancel icon.

This will initiate a run cancellation and update the Wrangler run to a Cancelled status.

View Failed Run Detail

You can view more information about the failed run.

  1. Go to the failed Wrangler file

  2. Select the ellipsis button to the right of the Status column

  3. Select View failure details

Rerun a Failed Run

To rerun a failed file:

  1. Go to the failed Wrangler run.

  2. Select the ellipsis button to the right of the Status column

  3. Select Redo failed run.

This action will queue the file to run again.

Descriptors

What are Descriptors?

Descriptors define and enforce schema-level constraints to ensure structural consistency across datasets. Essentially, they give users a way to describe a field, adding details about the column definition such as acronyms and any other relevant information. Field descriptors are not aware of the source data. They are optional and scoped to a destination table, which means all Wranglers pointing to the same table share a common set of field descriptors.

Users can provide descriptions for column headings when they want to guide the column cleaning to achieve better results. These column header description fields are called column descriptors. Column descriptors are applied during the file review process. Once a column descriptor has been entered, you must save and rerun the file to apply the changes.

Descriptors are scoped to the destination table, not a Wrangler.

Here are suggestions for adding column descriptors to drive the most effective outcomes.

  • Describe valid data for this field.

  • Describe this field’s relationship to other fields in this table.

  • Describe any business rules that govern how this field should be populated.

Step 1: Access the Column Descriptors

  1. When the file is ready, select Ready for Review.

  2. In the review screen, select Retry (note: it will default to Approve).

  3. A Retry file processing information message will pop up; select Got It.

Step 2: Adding a Column Descriptor

  1. In the Retry screen, select Add Descriptor, which is located directly below the column header field.

  2. When you select the Descriptor, the instructions box will open on the right.

  3. In the box, describe how your data should be cleaned.

  4. Select Save Descriptor.

Note: The column descriptor will update from Add Descriptor to Edit Descriptor.

Step 3: Applying the Descriptor

  1. To apply the descriptors to the file, select Rerun File in the lower right-hand corner.

Step 4: Editing a Descriptor

  1. Descriptors can be modified by selecting Edit Descriptor.

  2. Update the instructions.

  3. Select Save Descriptor.

Create an AI Data Wrangler

Here are the steps to create a new AI Data Wrangler.

Step 1: Create the new AI Data Wrangler

  1. Go into the Workspace where you wish to add the AI Data Wrangler.

  2. Select New Item.

  3. Select the Osmos AI Data Wrangler item.

  4. Name the new AI Data Wrangler.

    1. Note: Names can only start and end with a letter, number, or underscore.

  5. Select Create

Step 2: Choose Your Destination

  1. Select Choose your destination table.

  2. Choose the Lakehouse that contains the destination table.

  3. Select Connect.

  4. Choose Destination - where your data will go by selecting a table.

  5. Hit Save.

Now you are ready to process a file!

Instructions

What are Instructions?

Instructions provide guardrails that guide the AI, ensuring transformations stay within defined constraints and follow business intent. Users can upload one or more files, such as Business Requirements Documents and Information Architecture documentation. This documentation is used to generate descriptors, validators, and instructions that guide the Wrangler.

Enter instructions specific to the Wrangler, such as cleaning rules, formatting guidelines, or preprocessing steps. These will only apply to the Wrangler's operations and won't affect others tied to the destination table.

There are two types of Instructions in the Wrangler.

  1. Auto-configure Wrangler using documentation

    1. Select a folder with documents that define destination data, source data, and how to transform from source to destination. The Wrangler will use this information to create instructions, descriptors and validators for review.

  2. Provide Instructions

    1. Manually enter specific instructions

Unlike Descriptors, Instructions are scoped to the Wrangler.

Connect a Destination Table

Connect to Destination Table

  1. Select Connect to destination.

  2. Choose the Lakehouse that contains the destination table.

  3. Select Connect.

  4. Choose Destination - where your data will go by selecting a table.

  5. Hit Save.

Now you have a destination connected!

Create an AI Data Engineer

Here are the steps to create a new AI Data Engineer.

Create the new AI Data Engineer

  1. Go into the Workspace where you wish to add the AI Data Engineer.

  2. Select New Item.

  3. Select the Osmos AI Data Engineer item.

  4. Name the new AI Data Engineer.

    1. Note: Names can only start and end with a letter, number, or underscore.

  5. Select Create

Your AI Data Engineer is created!

Writing to the Destination

Step 5: Accept Your File and Write to the Destination

Review and approve your file(s) before writing to the Lakehouse.

  1. Select Ready for Review.

  2. To accept and write to the destination, select Approve.

  3. If you do not wish to write the current file to the destination, select Reject.

    1. The file will not be written to the destination.

    2. It will update to a Failed status.

    3. Reasons for rejection may vary; for example, the user initially chose the wrong file to process.

  4. If you select Retry, it will take you to the Retry Processing window.

    1. The retry screen allows you to incorporate column descriptors.

    2. If you choose to rerun the file, the outcome of the previous run(s) will not be saved.

Support

Support Offerings Overview for AI Data Wrangler

Osmos provides the following Support Offering tiers.

Support Tiers


Starter Support

  • Email Support

  • 24-hour response during business hours

Scale Support

  • Slack and Email Support

  • 12-hour response during business hours

  • 4 Onboarding sessions

Enterprise Support

  • Slack, Teams, Email, and Phone Support

  • 4-hour response during business hours

  • Use case review and guidance

  • Named customer success manager

  • On-demand onboarding and training sessions

  • First priority support

  • Custom contracts

Mission Critical Support

(Add-on)

Everything in Enterprise Support, plus:

  • 1-hour response time, 24x7x365

  • Weekly touchpoints

  • Product expertise

  • Train-the-trainer sessions

  • Ongoing support for Product Owners/SMEs

  • Influence on the Osmos roadmap

  • Named Customer Engineer

  • An expert point of contact for Product/Project Managers

  • Engineering design guidance for the onboarding process and workflow

  • Operational process and workflow guidance

  • Support with complex data scenarios

Get In Touch With Us 🤝

Contact us at [email protected].

AI Data Engineer Overview

Overview

Built for complexity. Designed for autonomy.

The Osmos AI Data Engineer is your intelligent agent for solving high-scale, high-complexity data engineering problems. It autonomously builds execution-ready, reusable Spark notebooks, transforming fragmented datasets into pipeline-ready code—so you can scale faster, build better, and worry less.

From intricate transformations to massive data integrations, Osmos AI Data Engineers help you orchestrate your Fabric environment with precision, visibility, and speed.

🚀 What It Is

The AI Data Engineer is a purpose-built AI agent that generates thoroughly tested Python Spark notebooks tailored to your ETL and data engineering workloads. Whether working with relational databases, JSONs, or hundreds of interrelated CSVs, it autonomously engineers pipelines that are production-grade, versionable, and ready for reuse.

👥 Who It’s For

  • Primary User Persona: Data Engineering Teams

  • Secondary Personas: Data Services and Platform Engineering Teams

🎯 Key Use Cases

  • Building Spark-based ETL pipelines for:

    • Relational data (SQL, warehouse exports)

    • CSV, JSON, XML, Parquet, and log files

    • Interrelated or hierarchical datasets

    • Huge files or complex schema mappings

  • Automating workspace orchestration in Fabric or other lakehouse environments

  • Generating repeatable and testable pipelines for ongoing data integration

🛠️ How It Works

1. Tell Your Data Engineer About the Use Case

Upload whatever you have—source files, schema docs, designs, prior instructions. The AI will use this context to understand your goal.

2. AI Builds a Purpose-Built Notebook

The AI Data Engineer:

  • Samples and profiles input data intelligently

  • Identifies transformation logic and schema alignment

  • Writes Spark-based Python code with built-in test coverage

  • Iterates on logic if errors are detected during validation

3. Oversee the Engineer’s Work

You stay in control:

  • Monitor progress

  • Validate the output

  • Review or modify code

  • Provide additional feedback or instructions

4. Integrate Into Your Workflow

  • Schedule notebooks to run automatically

  • Plug them into Fabric, Airflow, dbt, or your existing pipelines

  • Store in Git for version control and long-term reuse
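
As one hedged example of plugging a generated notebook into an existing pipeline, the sketch below runs it nightly from an Airflow DAG via the papermill CLI. The DAG id, schedule, and notebook paths are hypothetical; Osmos does not prescribe a specific orchestrator.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG that executes an Osmos-generated notebook once a day.
# Adjust the schedule, paths, and operator to match your environment.
with DAG(
    dag_id="osmos_generated_notebook",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # "schedule_interval" on older Airflow releases
    catchup=False,
) as dag:
    run_notebook = BashOperator(
        task_id="run_notebook",
        bash_command=(
            "papermill /notebooks/ingest_orders.ipynb "
            "/notebooks/runs/ingest_orders_{{ ds }}.ipynb"
        ),
    )
```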

🧩 Key Capabilities

  • Autonomous by Design: Just describe what you need; the agent handles data sampling, transformation, and validation.

  • Built to Adapt: Capable of transforming complex, nested, or massive datasets.

  • Code You Can Trust: Fully tested and version-controlled Python Spark notebooks.

  • Production-Ready Output: Ready for deployment into orchestration tools and lakehouse pipelines.

⚙️ AI Decision-Making Logic

Much like a skilled mid-level engineer, the AI Data Engineer:

  • Samples files and learns from structure and patterns

  • Adjusts logic based on runtime errors and data anomalies

  • Writes and executes test cases

  • Reprocesses files until success

  • Outputs clean, reusable, and reliable notebooks

The goal: execution-ready notebooks that can be plugged into your production systems with confidence.

🧪 Example Scenarios

  • Hundreds of CSVs needing to be merged and normalized → Consolidated schema, clean joins, and a Spark notebook ready for scheduling

  • Relational data dump + JSON blobs → Parsed relationships, flattened nested data, transformed into analytics tables

  • Terabytes of log data from multiple sources → Partitioned and transformed Spark pipeline that generates usable summaries

  • Mixed XML + Parquet sources for ML preprocessing → Cleaned, reshaped training dataset exported via a structured Spark workflow

Summary

The Osmos AI Data Engineer is where modern data engineering meets intelligent automation. Designed to tackle what traditional ETL tools can’t, it generates reusable, testable code you can trust—letting you move faster, break less, and deliver more.

Focus on what matters most. Let the AI handle the engineering.

Auto-Configure Instructions

Overview

The Auto-Configure Instructions feature enables users to define how Osmos AI Data Agents (such as the AI Data Engineer) should transform and process source data, without requiring any code. Whether you provide instructions manually or let the AI derive them from documentation in a folder, this feature puts you in control of the configuration logic while the AI does the heavy lifting.

What It Is

Auto-Configure is an intelligent setup assistant that helps you:

  • Create a structured instruction set for your AI agent

  • Define destination schema and transformation logic

  • Provide either typed instructions or a folder of documentation

  • Preview and edit the AI-generated configuration before execution

It transforms your documentation, business rules, or direct input into clean, human-readable instructions the AI can follow for repeatable, controlled data processing.

Auto-configure instructions

  1. Select Add Instructions.

  2. Select Manually provide Instructions.

  3. You now have the option to upload a folder and/or provide instructions manually.

Two Ways to Configure Instructions

1. Manual Instructions

Directly input your configuration logic using a structured form:

  • Destination Tables: Specify where your output should go. This often pre-populates from your Fabric workspace.

  • Source Files: Identify what data is being transformed.

  • Ingestion Instructions: Describe transformations, validation rules, mappings, and business logic.

✅ Best for cases where you know exactly what the AI needs to do or when dealing with new, one-off logic.

2. Choose an Instructions Folder

Point the AI to a folder containing relevant documentation. The AI will:

  • Read all provided files (up to 10)

  • Extract transformation logic, schemas, and business intent

  • Generate editable instructions from:

    • Business requirements docs

    • Data models and schema designs

    • Code snippets or prior scripts (SQL, Python, etc.)

    • Sample source/output files

✅ Best for existing projects, historical context, or when repurposing prior data transformation logic.

How It Works Behind the Scenes

The Osmos AI Data Engineer uses generative AI to analyze your inputs and convert them into a structured instruction set, including:

  • Target schema details

  • Source-to-destination mapping logic

  • Transformation functions and validations

  • Edge cases and data quality checks

These instructions act as guardrails for the AI, helping it:

  • Stay aligned with business rules

  • Ensure data integrity

  • Avoid brittle or incorrect transformations

🧩 Real-World Example

Let’s say your destination is an employee_pets table, and you want the AI to extract employee and pet information from messy spreadsheets. You could either:

  • Manually configure: “Destination table is employee_pets. Use all files in the folder. Map emp_type to one of [Full-time, Part-time, Contractor]. Standardize phone numbers. Extract date from header.”

  • Use a folder: Upload a folder containing:

    • A document describing the target schema

    • A sample table of cleaned data

    • A script with useful regex patterns

    • Notes about mapping rules

The AI will parse that information and present it back to you as an editable instruction template. You can then adjust as needed.
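
To make that example concrete, here is a rough PySpark sketch of the kind of transformation logic the AI might derive from those instructions. The column names, value mappings, and file locations are illustrative assumptions, not the agent's actual output.

```python
from pyspark.sql import functions as F

# Hypothetical source files and destination table for the employee_pets example.
raw = spark.read.option("header", True).csv("Files/hr_exports/*.csv")

emp_type_norm = F.lower(F.trim(F.col("emp_type")))

clean = (
    raw
    # Map emp_type onto the allowed set [Full-time, Part-time, Contractor].
    .withColumn(
        "emp_type",
        F.when(emp_type_norm.isin("ft", "full time", "full-time"), "Full-time")
         .when(emp_type_norm.isin("pt", "part time", "part-time"), "Part-time")
         .when(emp_type_norm.isin("contract", "contractor", "c2c"), "Contractor")
         .otherwise(F.col("emp_type")),
    )
    # Standardize phone numbers by stripping everything except digits.
    .withColumn("phone", F.regexp_replace("phone", r"[^0-9]", ""))
)

clean.write.mode("append").saveAsTable("employee_pets")
```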

🔁 Iteration & Feedback

  • After reviewing the generated instructions, you can:

    • Edit them inline

    • Add edge-case handling

    • Strengthen constraints (e.g., "fail if source columns change"; see the sketch after this list)

  • If the result isn't right, update your instructions and regenerate

  • Use real-time feedback to refine and guide the AI’s behavior
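
For instance, a constraint like "fail if source columns change" might translate into a defensive check along these lines; the expected column set is a hypothetical placeholder.

```python
# Fail fast when the source schema drifts from what the instructions assume.
EXPECTED_COLUMNS = {"employee_id", "emp_type", "phone", "pet_name"}  # hypothetical

source_df = spark.read.option("header", True).csv("Files/hr_exports/latest.csv")

actual_columns = set(source_df.columns)
if actual_columns != EXPECTED_COLUMNS:
    missing = sorted(EXPECTED_COLUMNS - actual_columns)
    unexpected = sorted(actual_columns - EXPECTED_COLUMNS)
    raise ValueError(
        f"Source columns changed: missing={missing}, unexpected={unexpected}"
    )
```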

🔒 Folder Constraints

When using a folder to generate instructions:

  • Limit of 10 files per instruction set

  • All files must be relevant to the current data use case

  • Accepted formats include DOCX, PDF, TXT, CSV, XLSX, SQL, and code files


Generate Notebook

Overview

The Generate Notebook feature is at the heart of how the Osmos AI Data Engineer turns your configuration into powerful, reusable Python code. With one click, it produces a fully functional Spark-based notebook that is ready to run, schedule, version, and integrate into pipelines.

These notebooks do more than ingest and transform data—they represent long-living, production-grade workflows that evolve with your needs, while putting human reviewers in complete control.

What It Is

Generate Notebook triggers the AI Data Engineer to build a ready-to-run Python notebook based on your configuration instructions, source files, and destination schemas. The notebook is:

  • Execution-ready: Includes logic for ingestion, transformation, and validation

  • Reusable: Can be versioned, re-executed, and adapted for new data

  • Pipeline-ready: Built for integration into orchestration systems (e.g., Fabric, Airflow)

  • Autonomous but supervised: All actions are user-initiated, ensuring full control

Think of it as saying: “Hey engineer, write me a Python notebook for this job.” And the AI does it—intelligently, iteratively, and at scale.

What the AI Does Behind the Scenes

When you click Generate Notebook, the AI Data Engineer will:

  1. Sample & Analyze Files It inspects your input data (CSV, JSON, XML, Parquet, etc.) to understand schemas, anomalies, and transformations.

  2. Write the Code It generates Spark-based Python code that:

    • Ingests your data

    • Transforms it according to your instructions

    • Includes built-in schema checks and validation logic

  3. Write Its Tests The notebook includes test cases to catch data issues, logic gaps, or structural inconsistencies.

  4. Handle Errors Automatically If tests fail, the AI:

    • Resamples the data

    • Revises the code

    • Re-generates logic until a working solution is found

  5. Add Bookkeeping Built-in logic tracks what data has been processed, avoiding duplicates or reprocessing in future runs.
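
To give a sense of the overall shape, here is a heavily simplified sketch of a notebook that ingests, transforms, validates, and keeps bookkeeping of processed files. The paths, table names, and checks are illustrative assumptions, not the literal code the AI Data Engineer produces.

```python
from pyspark.sql import functions as F

SOURCE_PATH = "Files/raw/sales/*.csv"   # hypothetical source files
DEST_TABLE = "sales_clean"              # hypothetical destination table
LOG_TABLE = "osmos_processed_files"     # hypothetical bookkeeping table

# 1. Ingest, keeping track of which file each row came from.
raw = (
    spark.read.option("header", True).csv(SOURCE_PATH)
    .withColumn("_source_file", F.input_file_name())
)

# 2. Skip files that an earlier run already processed (bookkeeping).
if spark.catalog.tableExists(LOG_TABLE):  # available in Spark 3.3+
    processed = spark.table(LOG_TABLE).select("file_name")
    raw = raw.join(processed, raw["_source_file"] == processed["file_name"], "left_anti")

# 3. Transform according to the configured instructions (illustrative step).
clean = raw.withColumn("amount", F.col("amount").cast("double"))

# 4. Validate before writing; raise so the run surfaces a clear error.
if clean.filter(F.col("amount").isNull()).count() > 0:
    raise ValueError("Validation failed: 'amount' contains non-numeric values")

# 5. Write results and record the files handled in this run.
clean.drop("_source_file").write.mode("append").saveAsTable(DEST_TABLE)
(
    raw.select(F.col("_source_file").alias("file_name")).distinct()
    .write.mode("append").saveAsTable(LOG_TABLE)
)
```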

🔄 Iteration & Control

You're always in the loop:

  • Preview the generated notebook

  • Edit the code or configuration as needed

  • Regenerate if something’s missing or incorrect

  • Schedule or manually trigger runs

  • Version in Git or any repo of your choice

Notebooks are written defensively. They handle schema shifts gracefully, raising clear errors for issues that need intervention.

🧩 Key Capabilities

  • Reusable & Versionable: Notebooks are long-living and can be stored, shared, and reused.

  • Fully Tested: AI includes test scripts and validation checks.

  • Pipeline Integration: Designed to plug into workflows and orchestration platforms.

  • Bookkeeping Logic: Automatically tracks processed files for repeatable, safe operations.

  • Performance Optimized: Passes through an AI profiler for better runtime and scaling.

  • Human-in-the-Loop: All notebook generation and execution are initiated and reviewed by users.

✅ Summary

The Generate Notebook feature lets you go from config to code—automatically. Whether you're managing a complex data lake or building a repeatable ingestion flow, Osmos AI writes high-quality notebooks for you.

No boilerplate. No hand coding. Just reliable, production-ready notebooks you can trust.

AI Data Wrangler Overview

Overview

What is the AI Data Wrangler? The Osmos AI Data Wrangler is an autonomous data agent that transforms your messiest, most irregular files into clean, structured data—hands-free. It’s purpose-built to automate the wrangling of complex file formats found across SharePoint, FTPs, GDrive, and more, enabling faster, more reliable decision-making without scaling up your data engineering team.

Currently available within Microsoft Fabric, the AI Data Wrangler helps organizations prepare lakehouse data with precision and minimal effort by leveraging generative AI.

🚀 What It Is

The AI Data Wrangler uses GenAI to intelligently and autonomously clean and reshape messy files into structured, SQL-ready datasets. Whether it's inconsistent Excel exports, broken PDFs, or fixed-width legacy system files, the Wrangler selects the most effective processing strategy—writing custom code or chunking through LLMs—so you don’t have to.

No rules. No templates. No manual rework.

👥 Who It’s For

  • Primary User Persona: Business, Operations, and Data Services Teams

🎯 Key Use Cases

  • Preparing messy source files to deliver SQL-ready data for downstream analytics

  • Wrangling input from:

    • Excel files with inconsistent headers and merged rows

    • PDFs with embedded or unstructured data

    • Fixed-width or custom-delimited exports

    • “Not really” CSVs from legacy tools

  • Mapping irregular data to a standardized schema (e.g., customer master table)

🛠️ How It Works

1. Submit Your Files

Upload messy files from sources like SharePoint, GDrive, FTPs, or internal systems. Supports a wide range of formats with irregular structure.

2. Provide Instructions—or Don’t

You can:

  • Point to a golden schema or a Fabric destination table (see the schema sketch after this list)

  • Let the AI infer expectations from instructions, example files, or even code

  • Use Autoconfigure to ingest prior docs and extract transformation logic
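
If it helps to picture what a "golden schema" can look like, the sketch below expresses a hypothetical customer master schema in PySpark; in practice you can simply point the Wrangler at an existing Fabric destination table instead.

```python
from pyspark.sql.types import StructType, StructField, StringType, DateType

# Hypothetical golden schema for a customer master table.
customer_master_schema = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("customer_name", StringType(), nullable=False),
    StructField("email", StringType(), nullable=True),
    StructField("phone", StringType(), nullable=True),
    StructField("signup_date", DateType(), nullable=True),
])
```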

3. Leave the Dirty Work to Osmos

The Wrangler decides:

  • Whether to generate transformation code

  • Whether to chunk and semantically analyze the file using LLMs

  • How to best get your clean, validated tabular data

Each file is processed independently. No brittle code. No manual tuning.

4. Review, Approve, Repeat

  • Review outputs before committing

  • Compare the output side-by-side with the input for validation

  • Request changes or reprocess with new instructions

  • Accept the result and move on

🧩 Key Capabilities

  • Fully Autonomous: AI decides optimal logic per file—LLM, code, or both.

  • Flexible File Support: Handles PDFs, Excels, fixed-width, delimited, malformed CSVs, and more.

  • Golden Schema Mapping: Aligns source data with your lakehouse schema and business expectations.

  • Instant Review Cycles: See results in minutes, give feedback, or approve with a click.

  • Built for Fabric: Seamlessly manages and prepares data in your Microsoft Fabric environment.

⚙️ AI Decision-Making Logic

The Wrangler processes each file independently and flexibly:

  • Infers structure and formatting quirks

  • Chooses between LLM chunking and custom code generation

  • Validates results through in-process checks

  • Supports multiple data types in a single run

The output is always clean tabular data, not reusable code, because messy files change constantly, and brittle code breaks.

🧪 Example Scenarios

  • Broken Excel with multi-row headers → Extracted proper columns, standardized formats, aligned to schema

  • PDF invoices with nested info → Parsed PO numbers, product descriptions, and quantities cleanly

  • Fixed-width export with missing headers → Inferred headers, extracted fields by position, produced structured output

  • Custom-delimited file with inconsistent rows → Detected delimiters, normalized row lengths, created clean flat file

  • Semi-structured CSV with embedded fields → Split merged fields into columns, matched values to categories

Summary

The Osmos AI Data Wrangler turns unstructured, irregular data chaos into consistent, actionable insights—fast. With no need for templates or hand-written transformations, it autonomously learns what your data should look like and delivers results you can trust.

Whether you’re prepping data for analytics or just trying to get invoice PDFs into your lakehouse, the AI Data Wrangler is your hands-free, error-free solution.

From chaos to clean in minutes. Powered by generative AI. Available now in Microsoft Fabric.

Adding Workspace Items

How to add Workspace Items

Microsoft Fabric's workspace is a container for organizing and collaborating on Fabric items, including AI Data Wranglers, lakehouses, and Notebooks. It allows users to manage and share data assets.

How to create a new Workspace Item

  1. Go to the Workspace.

  2. Select New Item.

  3. Select an item such as AI Data Wrangler.

  4. Name the new Item and hit Create.

Learn More

  • Workspace Items In Microsoft Fabric