For your AI models to work well — whether for search, recommendations, or anything else — you need good data. But raw data is usually messy and incomplete. Before you can train a model or use it for search, you have to turn that raw data into clear, meaningful features.
Feature engineering is the process of cleaning up your data and creating new signals that actually help your model learn or make better predictions. This step is just as important for preparing training data as it is for powering your AI in production.
It can take some work to get from raw data to useful features. Let’s look at a simple example to see why this matters and how it’s done.
The Challenge: Manual Feature Engineering
Imagine you are building a product recommendation system for a large e-commerce platform. The goal is to find items that are genuinely “similar” to a product a user is currently viewing. This notion of “similarity” must be sophisticated: it goes far beyond a simple text match on the product description and needs to incorporate nuanced business concepts like popularity, value, brand equity, and key product attributes.

Step 1: The Raw Data Table
We start with a raw data table in our data lakehouse. We’ll call it `products_raw`. This table contains the basic, unprocessed information scraped from our product catalog. An embedding model could be applied directly to the `description` column for semantic search (a naive version is sketched after the table below). However, that would only capture a fraction of the story.
Table: `products_raw`
| product_id | title | description | category | price | original_price | review_count | avg_rating |
|---|---|---|---|---|---|---|---|
| 101 | V-Neck Tee | “A soft, 100% cotton v-neck shirt.” | T-Shirt | 25.00 | 25.00 | 1200 | 4.8 |
| 102 | Designer Tee | “Limited edition organic cotton tee.” | T-Shirt | 90.00 | 150.00 | 25 | 4.9 |
| 103 | New Tee | “A new v-neck t-shirt.” | T-Shirt | 22.00 | 22.00 | 1 | 5.0 |
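Here is what that naive approach looks like: a minimal sketch that embeds only the raw `description` text. The `sentence-transformers` library and model name are illustrative assumptions; any text embedding model would have the same blind spots.

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# The products_raw rows from the table above (text columns only).
df = pd.DataFrame({
    "product_id": [101, 102, 103],
    "description": [
        "A soft, 100% cotton v-neck shirt.",
        "Limited edition organic cotton tee.",
        "A new v-neck t-shirt.",
    ],
})

# Embeds only the free-text description; the signals discussed below
# (popularity, discounts, attributes) are invisible to these vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
description_vectors = model.encode(df["description"].tolist())
```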
Step 2: The Problem with Raw Data
Using this raw data to generate embeddings for a recommendation model is destined for failure. Critical business signals that define true product similarity are either misleading, hidden, or entirely absent.

- Misleading Popularity: The “New Tee” boasts a perfect 5.0 rating, but with only a single review. Is it truly more “popular” or “better” than the “V-Neck Tee” with 1200 reviews and a 4.8 rating? A naive model would think so, and that leads to poor recommendations.
- Missing Price Context: The model sees a price of $90.00 for the “Designer Tee.” It has no intrinsic understanding that this represents a steep 40% discount. This is a powerful purchasing signal. The model also doesn’t know how this price compares to the average price of other T-shirts.
- Hidden Attributes: Key attributes like “organic” or “limited edition” are buried within the free-text `description`. They are invisible to any model that doesn’t perform sophisticated text analysis, yet they are crucial for matching user preferences.
Step 3: Manual Feature Engineering
To address these challenges, a data scientist or ML engineer typically embarks on a manual, code-intensive journey. The goal is to extract hidden signals and craft new, meaningful features. This process often takes place in a Jupyter notebook and involves several intricate steps.

- Engineer `popularity_score`: To create a more accurate measure of popularity, combine the average rating with the volume of reviews. A common approach is logarithmic scaling, which prevents products with massive review counts from overwhelming the score.
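A minimal pandas sketch, assuming `products_raw` is loaded as a DataFrame `df`. The exact weighting is a design choice; `avg_rating * log1p(review_count)` is one plausible formula, and it roughly reproduces the scores in the enriched table below.

```python
import numpy as np

# log1p damps large review counts: 1200 reviews at 4.8 stars scores ~34,
# while a single 5.0-star review scores only ~3.5.
df["popularity_score"] = df["avg_rating"] * np.log1p(df["review_count"])
```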
- Engineer `price_tier`: To help the model understand value, bucket raw prices into clear tiers such as ‘budget’, ‘mid-range’, or ‘premium’. Logic: use conditional logic (for example, `CASE WHEN` in SQL or `np.select` in Python) to assign a tier based on price thresholds.
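A sketch using `np.select`; the $30 and $80 cut-offs are illustrative assumptions (they happen to reproduce the tiers in the enriched table).

```python
import numpy as np

conditions = [df["price"] < 30, df["price"] < 80]
tiers = ["budget", "mid-range"]
# Anything matching neither condition falls through to 'premium'.
df["price_tier"] = np.select(conditions, tiers, default="premium")
```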
- Engineer `discount_pct`: To explicitly signal a “deal” to the model, calculate the discount percentage by combining the original and current prices.
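The calculation itself is a one-liner; the division-by-zero guard is a defensive assumption.

```python
import numpy as np

# (original_price - price) / original_price, e.g. (150 - 90) / 150 = 0.4
# for the Designer Tee.
df["discount_pct"] = np.where(
    df["original_price"] > 0,
    (df["original_price"] - df["price"]) / df["original_price"],
    0.0,
)
```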
- Engineer `price_vs_cat_avg`: To contextualize a product’s price, compare it to the average price within its category. This requires an aggregation step to compute the average price per category, then calculating the ratio for each product.
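One way to express this in pandas is a grouped `transform`, which broadcasts the per-category average back onto every row. A ratio near 1.0 means “priced about average for its category.”

```python
# Average price per category, aligned back to each product's row.
category_avg = df.groupby("category")["price"].transform("mean")
df["price_vs_cat_avg"] = df["price"] / category_avg
```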
- Engineer `is_organic`: To surface key product attributes, process the `description` text to identify important keywords. Logic: use a regular expression or string search for the keyword “organic” to create a boolean (true/false) flag.
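A sketch of the keyword flag; the word-boundary regex is an assumption so that, say, “inorganic” does not match.

```python
# Case-insensitive whole-word search over the free-text description.
df["is_organic"] = df["description"].str.contains(r"\borganic\b", case=False)
```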
Step 4: The Enriched Table
After executing this chain of logic, we produce a new, enriched table. Its features are far more potent and ready to be fed into an embedding model, producing high-quality vectors for our recommendation system (one way to do this is sketched after the table).

Table: `products_engineered`
| product_id | … | popularity_score | price_tier | discount_pct | price_vs_cat_avg | is_organic |
|---|---|---|---|---|---|---|
| 101 | … | 33.6 | ‘budget’ | 0.0 | 0.54 | false |
| 102 | … | 15.9 | ‘premium’ | 0.4 | 1.95 | true |
| 103 | … | 3.5 | ‘budget’ | 0.0 | 0.47 | false |
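As a rough illustration of that final step, one simple pattern (an assumption, not the only option) is to concatenate the standardized engineered signals onto the text embedding, so similarity search reflects popularity and value as well as wording. The `sentence-transformers` model name is again an illustrative assumption.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
text_vecs = model.encode(df["description"].tolist())

# Standardize the numeric features so no single one dominates distances.
numeric = df[["popularity_score", "discount_pct", "price_vs_cat_avg"]].to_numpy(dtype=float)
numeric = (numeric - numeric.mean(axis=0)) / (numeric.std(axis=0) + 1e-9)

# Each product vector now carries both semantic and business signals.
product_vectors = np.hstack([text_vecs, numeric])
```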