For your AI models to work well — whether for search, recommendations, or anything else — you need good data. But raw data is usually messy and incomplete. Before you can train a model or use it for search, you have to turn that raw data into clear, meaningful features.
Feature engineering is the process of cleaning up your data and creating new signals that actually help your model learn or make better predictions. This step is just as important for preparing training data as it is for powering your AI in production.
It can take some work to get from raw data to useful features. Let’s look at a simple example to see why this matters and how it’s done.
The Challenge: Manual Feature Engineering
Imagine you are building a product recommendation system for a large e-commerce platform. The goal is to find items that are genuinely “similar” to a product a user is currently viewing. This notion of “similarity” must be sophisticated: it goes far beyond a simple text match on the product description and needs to incorporate nuanced business concepts like popularity, value, brand equity, and key product attributes.

Step 1: The Raw Data Table
We start with a raw data table in our data lakehouse. We’ll call it `products_raw`. This table contains the basic, unprocessed information scraped from our product catalog. An embedding model could be applied directly to the `description` column for semantic search (a naive version is sketched after the table below). However, that would only capture a fraction of the story.
Table: `products_raw`
| product_id | title | description | category | price | original_price | review_count | avg_rating |
|---|---|---|---|---|---|---|---|
| 101 | V-Neck Tee | “A soft, 100% cotton v-neck shirt.” | T-Shirt | 25.00 | 25.00 | 1200 | 4.8 |
| 102 | Designer Tee | “Limited edition organic cotton tee.” | T-Shirt | 90.00 | 150.00 | 25 | 4.9 |
| 103 | New Tee | “A new v-neck t-shirt.” | T-Shirt | 22.00 | 22.00 | 1 | 5.0 |
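Here is what that naive approach looks like: a minimal sketch that embeds only the raw `description` text. The `sentence-transformers` library and model name are illustrative assumptions; any text embedding model would have the same blind spots.

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# The products_raw rows from the table above (text columns only).
df = pd.DataFrame({
    "product_id": [101, 102, 103],
    "description": [
        "A soft, 100% cotton v-neck shirt.",
        "Limited edition organic cotton tee.",
        "A new v-neck t-shirt.",
    ],
})

# Embeds only the free-text description; the signals discussed below
# (popularity, discounts, attributes) are invisible to these vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
description_vectors = model.encode(df["description"].tolist())
```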
Step 2: The Problem with Raw Data
Using this raw data to generate embeddings for a recommendation model is destined for failure. Critical business signals that define true product similarity are either misleading, hidden, or entirely absent.

- Misleading Popularity: The “New Tee” boasts a perfect 5.0 rating, but with only a single review. Is it truly more “popular” or “better” than the “V-Neck Tee” with 1200 reviews and a 4.8 rating? A naive model would think so, and that leads to poor recommendations.
- Missing Price Context: The model sees a price of $90.00 for the “Designer Tee.” It has no intrinsic understanding that this represents a steep 40% discount. This is a powerful purchasing signal. The model also doesn’t know how this price compares to the average price of other T-shirts.
- Hidden Attributes: Key attributes like “organic” or “limited edition” are buried within the free-text `description`. They are invisible to any model that doesn’t perform sophisticated text analysis, yet they are crucial for matching user preferences.
Step 3: Manual Feature Engineering
To address these challenges, a data scientist or ML engineer typically embarks on a manual, code-intensive journey. The goal is to extract hidden signals and craft new, meaningful features. This process often takes place in a Jupyter notebook and involves several intricate steps.

- Engineer `popularity_score`: To create a more accurate measure of popularity, combine the average rating with the volume of reviews. A common approach is logarithmic scaling, which prevents products with massive review counts from overwhelming the score.
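A minimal pandas sketch, assuming `products_raw` is loaded as a DataFrame `df`. The exact weighting is a design choice; `avg_rating * log1p(review_count)` is one plausible formula, and it roughly reproduces the scores in the enriched table below.

```python
import numpy as np

# log1p damps large review counts: 1200 reviews at 4.8 stars scores ~34,
# while a single 5.0-star review scores only ~3.5.
df["popularity_score"] = df["avg_rating"] * np.log1p(df["review_count"])
```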
- Engineer `price_tier`: To help the model understand value, bucket raw prices into clear tiers such as ‘budget’, ‘mid-range’, or ‘premium’. Logic: use conditional logic (for example, `CASE WHEN` in SQL or `np.select` in Python) to assign a tier based on price thresholds.
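A sketch using `np.select`; the $30 and $80 cut-offs are illustrative assumptions (they happen to reproduce the tiers in the enriched table).

```python
import numpy as np

conditions = [df["price"] < 30, df["price"] < 80]
tiers = ["budget", "mid-range"]
# Anything matching neither condition falls through to 'premium'.
df["price_tier"] = np.select(conditions, tiers, default="premium")
```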
- Engineer `discount_pct`: To explicitly signal a “deal” to the model, calculate the discount percentage by combining the original and current prices.
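The calculation itself is a one-liner; the division-by-zero guard is a defensive assumption.

```python
import numpy as np

# (original_price - price) / original_price, e.g. (150 - 90) / 150 = 0.4
# for the Designer Tee.
df["discount_pct"] = np.where(
    df["original_price"] > 0,
    (df["original_price"] - df["price"]) / df["original_price"],
    0.0,
)
```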
- Engineer `price_vs_cat_avg`: To contextualize a product’s price, compare it to the average price within its category. This requires an aggregation step to compute the average price per category, then calculating the ratio for each product.
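One way to express this in pandas is a grouped `transform`, which broadcasts the per-category average back onto every row. A ratio near 1.0 means “priced about average for its category.”

```python
# Average price per category, aligned back to each product's row.
category_avg = df.groupby("category")["price"].transform("mean")
df["price_vs_cat_avg"] = df["price"] / category_avg
```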
- Engineer `is_organic`: To surface key product attributes, process the `description` text to identify important keywords. Logic: use a regular expression or string search for the keyword “organic” to create a boolean (true/false) flag.
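A sketch of the keyword flag; the word-boundary regex is an assumption so that, say, “inorganic” does not match.

```python
# Case-insensitive whole-word search over the free-text description.
df["is_organic"] = df["description"].str.contains(r"\borganic\b", case=False)
```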
Step 4: The Enriched Table
After executing this chain of logic, we produce a new, enriched table. Its features are far more potent and ready to be fed into an embedding model, producing high-quality vectors for our recommendation system (one way to do this is sketched after the table).

Table: `products_engineered`
| product_id | … | popularity_score | price_tier | discount_pct | price_vs_cat_avg | is_organic |
|---|---|---|---|---|---|---|
| 101 | … | 33.6 | ‘budget’ | 0.0 | 0.54 | false |
| 102 | … | 15.9 | ‘premium’ | 0.4 | 1.95 | true |
| 103 | … | 3.5 | ‘budget’ | 0.0 | 0.47 | false |
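As a rough illustration of that final step, one simple pattern (an assumption, not the only option) is to concatenate the standardized engineered signals onto the text embedding, so similarity search reflects popularity and value as well as wording. The `sentence-transformers` model name is again an illustrative assumption.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
text_vecs = model.encode(df["description"].tolist())

# Standardize the numeric features so no single one dominates distances.
numeric = df[["popularity_score", "discount_pct", "price_vs_cat_avg"]].to_numpy(dtype=float)
numeric = (numeric - numeric.mean(axis=0)) / (numeric.std(axis=0) + 1e-9)

# Each product vector now carries both semantic and business signals.
product_vectors = np.hstack([text_vecs, numeric])
```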