---
title: "How We Predict Berlin Rents: 80 Features, Satellite Data, and AI Photos"
subtitle: "A transparent look inside our ML pipeline — from raw listings to high-accuracy predictions."
description: "RentSignal's rent prediction model explained: XGBoost with 80 features including satellite imagery, AI photo analysis, and spatial rent intelligence. We show exactly how it works and why it outperforms the Mietspiegel."
author: "Klaus Redel"
date: "2026-03-22"
categories: [Machine Learning, Methodology, Berlin, Satellite, XGBoost]
image: methodology-pipeline.png
lang: en
keywords:
- rent prediction machine learning
- XGBoost real estate
- Berlin rent model
- satellite data rent prediction
- SHAP explainability rental
open-graph:
title: "How We Predict Berlin Rents — 80 Features, Five Intelligence Layers"
description: "Inside our ML pipeline: XGBoost, satellite imagery, AI photos, and neighborhood rent intelligence."
---
```{python}
#| echo: false
#| output: false
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np
import pandas as pd
from pathlib import Path
GREEN = "#00BC72"
RED = "#DC2626"
TEAL = "#004746"
ACCENT = "#E8913A"
GRAY = "#6B7280"
BG = "#FAFAFA"
BLUE = "#2563EB"
# Load data for maps
PROJECT_ROOT = Path('../../../').resolve()
PROC_DIR = PROJECT_ROOT / 'data' / 'processed'
try:
    units = pd.read_parquet(PROC_DIR / 'units.parquet')
    listings = pd.read_parquet(PROC_DIR / 'listings.parquet')
    df_map = units.merge(listings[['unit_id', 'rent_sqm']], on='unit_id')
    # Keep only listings that can be placed on the map
    df_map = df_map[df_map['lat'].notna() & df_map['lon'].notna()]
    HAS_DATA = True
except Exception:
    # Data files are optional: the map figures are skipped when loading fails
    HAS_DATA = False
```
## Why Build a Rent Prediction Model?
Germany's rental market is heavily regulated. The Mietpreisbremse (rent brake) caps new-lease rents at 10% above the local reference rent (Mietspiegel). But the Mietspiegel is a blunt instrument — it groups apartments into broad categories and assigns a single range.
Two apartments in the same Mietspiegel category can have vastly different market rents. A renovated Altbau with original floorboards on a quiet courtyard is not the same as an unrenovated flat on a busy street — even if the Mietspiegel says they are.
**Our model captures what the Mietspiegel misses.** With 80 features across five intelligence layers, it explains over 81% of the variance in rent (R² = 0.814), compared to ~35% for the Mietspiegel alone.
## The Data
Our training data comes from ImmoScout24, Berlin's largest rental platform:
| Metric | Value |
|--------|-------|
| Listings analyzed | 8,259 |
| Regular market listings (training) | 4,828 |
| Photos analyzed by AI | 55,000+ |
| Listings with coordinates | 99.9% |
| Data period | March 2026 |
We exclude apartment swap listings (Tauschwohnungen) from training — these have artificially low rents that don't reflect market pricing. They're flagged automatically via NLP title analysis and handled separately.
```{python}
#| echo: false
#| label: fig-rent-map
#| fig-cap: "Berlin rent landscape: a random sample of up to 3,000 of our 8,259 listings, colored by rent level"
if HAS_DATA:
    sample = df_map.sample(min(3000, len(df_map)), random_state=42)
    fig = go.Figure()
    fig.add_trace(go.Scattermapbox(
        lat=sample['lat'], lon=sample['lon'],
        mode='markers',
        marker=dict(
            size=4, opacity=0.6,
            color=sample['rent_sqm'],
            colorscale=[[0, '#2166ac'], [0.3, '#67a9cf'], [0.5, '#f7f7f7'], [0.7, '#ef8a62'], [1, '#b2182b']],
            cmin=8, cmax=30,
            colorbar=dict(title="€/m²", thickness=15, len=0.6),
        ),
        text=[f"€{r:.1f}/m² · {b}" for r, b in zip(sample['rent_sqm'], sample['bezirk'])],
        hoverinfo='text',
    ))
    fig.update_layout(
        mapbox=dict(style="carto-positron", center=dict(lat=52.52, lon=13.405), zoom=10),
        height=500, margin=dict(l=0, r=0, t=0, b=0),
    )
    fig.show()
```
## Five Intelligence Layers
```{python}
#| echo: false
#| label: fig-layers
#| fig-cap: "Each feature layer adds meaningful prediction accuracy"
layers = ['1. Structural\n(form data)', '2. Spatial\n(OSM + satellite)', '3. NLP\n(title analysis)', '4. AI Photos\n(visual quality)', '5. Neighborhood\n(rent intelligence)']
r2 = [0.689, 0.708, 0.736, 0.761, 0.814]
delta_r2 = [0, 0.019, 0.028, 0.025, 0.053]
colors_bar = [GRAY, ACCENT, TEAL, BLUE, GREEN]
fig = make_subplots(rows=1, cols=2, subplot_titles=("Cumulative R²", "Marginal ΔR² per Layer"),
horizontal_spacing=0.15)
fig.add_trace(go.Bar(x=layers, y=r2, marker_color=colors_bar,
text=[f"{v:.3f}" for v in r2], textposition='outside', textfont=dict(size=12, weight=600),
showlegend=False), row=1, col=1)
fig.add_trace(go.Bar(x=layers[1:], y=delta_r2[1:], marker_color=colors_bar[1:],
text=[f"+{v:.3f}" for v in delta_r2[1:]], textposition='outside', textfont=dict(size=12),
showlegend=False), row=1, col=2)
fig.update_layout(height=380, margin=dict(l=40, r=20, t=40, b=80),
plot_bgcolor="white", paper_bgcolor=BG, font=dict(family="Inter", size=11))
fig.update_yaxes(title="R²", range=[0.6, 0.85], gridcolor="#E5E7EB", row=1, col=1)
fig.update_yaxes(title="ΔR²", range=[0, 0.06], gridcolor="#E5E7EB", row=1, col=2)
fig.show()
```
### Layer 1: Structural Features
The basics from the listing: size, rooms, floor, year built, amenities (kitchen, balcony, elevator), condition, heating type. These alone give R²≈0.69.
### Layer 2: Spatial Features — OSM + Satellite
For each apartment's coordinates, we compute distances to transit, parks, schools, and water bodies. We count restaurants, cafés, and shops within walking distance. From Sentinel-2 satellite imagery, we extract vegetation (NDVI), water proximity (NDWI), and urban density (NDBI) at multiple scales.
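The three satellite indices are standard normalized band ratios. The sketch below shows how they could be computed from Sentinel-2 reflectance arrays. The band assignment (B3 green, B4 red, B8 near-infrared, B11 shortwave infrared), the green/NIR form of NDWI, and the toy 2×2 patch are assumptions for illustration, not a description of the production pipeline.

```python
import numpy as np

def normalized_difference(a, b):
    """Return (a - b) / (a + b), using 0 where the denominator is 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = a + b
    out = np.zeros_like(denom)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

def spectral_indices(green, red, nir, swir):
    """Compute NDVI, NDWI and NDBI from per-pixel reflectance arrays."""
    return {
        "ndvi": normalized_difference(nir, red),    # vegetation
        "ndwi": normalized_difference(green, nir),  # open water
        "ndbi": normalized_difference(swir, nir),   # built-up surfaces
    }

# Toy 2x2 reflectance patch (made-up values):
# a park-like pixel at [0, 0] and a water-like pixel at [1, 1]
green = np.array([[0.08, 0.10], [0.10, 0.06]])
red = np.array([[0.05, 0.12], [0.14, 0.04]])
nir = np.array([[0.45, 0.20], [0.22, 0.02]])
swir = np.array([[0.20, 0.28], [0.30, 0.01]])

idx = spectral_indices(green, red, nir, swir)
mean_ndvi = float(idx["ndvi"].mean())  # one feature value for this patch
```

Averaging each index over patches of different sizes around the apartment would be one way to obtain the same feature at multiple scales.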
**Restaurant density within 1km** is consistently a top spatial predictor — it captures neighborhood vibrancy better than any single location variable.
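A density feature like this reduces to counting points of interest inside a radius. The following is a minimal sketch using the haversine distance; the function names, the 1,000 m default radius, and the restaurant coordinates are hypothetical and chosen only to illustrate the idea.

```python
import math

EARTH_RADIUS_M = 6_371_000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def poi_features(lat, lon, pois, radius_m=1000.0):
    """Count POIs within radius_m and report the distance to the nearest one."""
    dists = [haversine_m(lat, lon, p_lat, p_lon) for p_lat, p_lon in pois]
    return {
        "count_within_radius": sum(d <= radius_m for d in dists),
        "nearest_m": min(dists) if dists else float("nan"),
    }

# Hypothetical restaurant coordinates (lat, lon); the third is over 2 km away
restaurants = [(52.5219, 13.4132), (52.5250, 13.4100), (52.5000, 13.4500)]
feats = poi_features(52.5200, 13.4050, restaurants)
```

The same function would cover the distance features (nearest transit stop, park, school) by passing a different list of points.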
### Layer 3: NLP Title Features
The listing title contains signal that structured fields miss. We extract indicators for apartment swaps, furnished listings, Altbau/Neubau mentions, and renovation keywords. The number of listing photos is itself a quality signal.
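Title indicators of this kind can be as simple as keyword matching. Here is a hedged sketch: the keyword patterns and feature names are illustrative assumptions, not the production list, and a real system would likely need more patterns and checks for false positives.

```python
import re

# Illustrative keyword patterns (assumed, not the production list)
TITLE_PATTERNS = {
    "is_swap": r"tausch",                 # Tauschwohnung, Wohnungstausch
    "is_furnished": r"möbliert|furnished",
    "mentions_altbau": r"altbau",
    "mentions_neubau": r"neubau",
    "mentions_renovation": r"saniert|renoviert|modernisiert",
}

def title_features(title):
    """Return one 0/1 indicator per keyword group for a listing title."""
    text = title.lower()
    return {
        name: int(re.search(pattern, text) is not None)
        for name, pattern in TITLE_PATTERNS.items()
    }

features = title_features("Sanierter Altbau, möbliert, nahe Boxhagener Platz")
```

For this title the sketch sets the furnished, Altbau, and renovation indicators to 1 and leaves the swap and Neubau indicators at 0.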
### Layer 4: AI Photo Features
We analyze listing photos with AI to extract visual quality scores — interior condition, kitchen/bathroom quality, floor type, ceiling height, architectural style, and building facade condition. [Read more about our AI photo pipeline →](../ai-photo-analysis-rent-prediction/)
### Layer 5: Neighborhood Rent Intelligence
The newest and most powerful layer: it adds the largest marginal gain of any layer (ΔR² = +0.053). For each apartment, we compute what nearby apartments actually rent for, within 500 m and 1 km. The median rent in the same postal code (PLZ) provides a stable anchor. Rent dispersion (price variation among nearby listings) captures gentrification dynamics.
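One detail matters when building features from nearby rents: an apartment's own rent has to be left out of its neighborhood statistics, otherwise the target leaks into the features. A minimal sketch under stated assumptions follows; the flat-plane projection, the use of the population standard deviation as the dispersion measure, and all names and values are illustrative choices.

```python
import math
import statistics

def local_xy_m(lat, lon, lat0=52.52, lon0=13.405):
    """Project lat/lon to metres on a flat plane centred on (lat0, lon0)."""
    x = math.radians(lon - lon0) * 6_371_000.0 * math.cos(math.radians(lat0))
    y = math.radians(lat - lat0) * 6_371_000.0
    return x, y

def neighbor_rent_features(listings, radius_m=500.0):
    """For each listing, summarise the rents of OTHER listings within radius_m."""
    pts = [local_xy_m(row["lat"], row["lon"]) for row in listings]
    out = []
    for i, (xi, yi) in enumerate(pts):
        rents = [
            listings[j]["rent_sqm"]
            for j, (xj, yj) in enumerate(pts)
            if j != i and math.hypot(xi - xj, yi - yj) <= radius_m
        ]
        out.append({
            "n_neighbors": len(rents),
            "median_rent": statistics.median(rents) if rents else None,
            "rent_dispersion": statistics.pstdev(rents) if len(rents) > 1 else None,
        })
    return out

# Made-up listings: three close together, one several kilometres away
listings = [
    {"lat": 52.5200, "lon": 13.4050, "rent_sqm": 14.0},
    {"lat": 52.5210, "lon": 13.4060, "rent_sqm": 16.0},
    {"lat": 52.5190, "lon": 13.4040, "rent_sqm": 20.0},
    {"lat": 52.4800, "lon": 13.3500, "rent_sqm": 11.0},
]
feats = neighbor_rent_features(listings, radius_m=500.0)
```

The pairwise loop is quadratic in the number of listings, which is acceptable for a few thousand rows; a spatial index would be the usual replacement at larger scale. During cross-validation the neighbor pool would also need to be restricted to the training fold.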
## What Matters Most: Feature Importance
```{python}
#| echo: false
#| label: fig-shap-top15
#| fig-cap: "Top 15 features by importance (SHAP values) — all five layers contribute"
features = [
'Fitted Kitchen', 'Utilities Cost', 'Living Space', 'Condition',
'Interior Quality (AI)', 'Nearby Rent Avg', 'Renovation Level (AI)',
'Photo Count', 'm² per Room', 'Rent Dispersion',
'Flat Type', 'Staging (AI)', 'Year Built',
'Food Venues 1km', 'PLZ Median Rent'
]
shap_vals = [1.26, 1.19, 1.14, 0.90, 0.87, 0.78, 0.65, 0.59, 0.56, 0.46, 0.45, 0.42, 0.32, 0.31, 0.29]
layer_colors = [GRAY, TEAL, GRAY, GRAY, BLUE, GREEN, BLUE, TEAL, GRAY, GREEN, GRAY, BLUE, GRAY, ACCENT, GREEN]
layer_labels = ['Structural', 'NLP', 'Structural', 'Structural', 'AI Photo', 'Rent Neighbor', 'AI Photo',
'NLP', 'Structural', 'Rent Neighbor', 'Structural', 'AI Photo', 'Structural', 'Spatial', 'Rent Neighbor']
fig = go.Figure()
fig.add_trace(go.Bar(
y=features, x=shap_vals, orientation='h',
marker_color=layer_colors,
text=[f"{v:.2f} ({l})" for v, l in zip(shap_vals, layer_labels)],
textposition='outside', textfont=dict(size=10),
))
fig.update_layout(
height=500, margin=dict(l=10, r=140, t=10, b=40),
plot_bgcolor="white", paper_bgcolor=BG,
font=dict(family="Inter", size=12),
xaxis=dict(title="Mean |SHAP| (€/m² contribution)", gridcolor="#E5E7EB"),
yaxis=dict(autorange="reversed"),
)
fig.show()
```
All five layers contribute to the top 15. No single layer dominates — the model needs structural data, spatial context, NLP signals, visual quality, AND neighborhood pricing to achieve its full accuracy.
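The importance measure in the chart, mean |SHAP|, is the average absolute per-listing contribution of a feature. The sketch below assumes a matrix of SHAP values is already available (for a tree model it would typically come from an explainer such as `shap.TreeExplainer`); the matrix and feature names here are made up.

```python
import numpy as np

def mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value (rows = listings, columns = features)."""
    importance = np.abs(np.asarray(shap_matrix, dtype=float)).mean(axis=0)
    order = np.argsort(importance)[::-1]  # largest importance first
    return [(feature_names[i], float(importance[i])) for i in order]

# Made-up per-listing contributions in €/m² for three features
shap_matrix = [
    [1.5, -0.2, 0.4],
    [-1.1, 0.3, -0.6],
    [1.0, -0.1, 0.8],
]
ranking = mean_abs_shap(shap_matrix, ["fitted_kitchen", "year_built", "living_space"])
```

Taking the absolute value first is the important step: a feature that pushes some rents up and others down would otherwise average out to roughly zero and look unimportant.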
## Prediction Intervals: How Confident Is This?
A point estimate isn't enough. We use **Conformalized Quantile Regression** — a method that provides adaptive prediction intervals with a statistical coverage guarantee.
```{python}
#| echo: false
#| label: fig-intervals
#| fig-cap: "Prediction intervals adapt to each apartment — wider for unusual ones, tighter for typical ones"
np.random.seed(42)
predicted = np.array([8, 10, 12, 14, 15, 17, 19, 21, 23, 25, 28, 32, 38])
widths = np.array([4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0, 10.0, 12.0, 15.0, 20.0])
lower = predicted - widths/2 + np.random.normal(0, 0.5, len(predicted))
upper = predicted + widths/2 + np.random.normal(0, 0.5, len(predicted))
actual = predicted + np.random.normal(0, 2, len(predicted))
fig = go.Figure()
for i in range(len(predicted)):
    fig.add_shape(type="rect", x0=i-0.3, x1=i+0.3, y0=lower[i], y1=upper[i],
                  fillcolor="rgba(0,188,114,0.2)", line=dict(color=GREEN, width=1))
fig.add_trace(go.Scatter(
x=list(range(len(predicted))), y=predicted, mode='markers',
marker=dict(color=BLUE, size=10, symbol='circle'),
name='Predicted', showlegend=True
))
fig.add_trace(go.Scatter(
x=list(range(len(predicted))), y=actual, mode='markers',
marker=dict(color=RED, size=8, symbol='x'),
name='Actual', showlegend=True
))
fig.update_layout(
height=350, margin=dict(l=40, r=20, t=20, b=60),
plot_bgcolor="white", paper_bgcolor=BG,
font=dict(family="Inter", size=12),
xaxis=dict(title="Apartments (sorted by predicted rent)", tickvals=list(range(len(predicted))),
ticktext=[f"Apt {i+1}" for i in range(len(predicted))]),
yaxis=dict(title="Rent (€/m²)", gridcolor="#E5E7EB"),
legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="left", x=0),
)
fig.show()
```
- **Typical apartment:** interval width ~€6-8/m²
- **Easy to predict:** width ~€4/m² (common apartment types with many comparables)
- **Unusual apartment:** width €12-20/m² (luxury, micro-studios — model correctly signals uncertainty)
The intervals are calibrated to 80% coverage: about 80% of actual rents fall within the predicted range.
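The calibration step of Conformalized Quantile Regression can be sketched in a few lines. The example assumes lower and upper quantile predictions already exist; here they are simulated and deliberately too narrow, standing in for the output of a quantile model. All data and names are synthetic and illustrative, not the production setup.

```python
import numpy as np

def cqr_offset(y_cal, lo_cal, hi_cal, alpha=0.2):
    """Conformal correction for quantile-regression intervals.

    Each score is how far a calibration target falls outside its raw
    interval (negative when it falls inside).
    """
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample rank
    if k > n:
        return np.inf  # too few calibration points for this alpha
    return float(np.sort(scores)[k - 1])

rng = np.random.default_rng(0)
n = 2000
center = rng.uniform(8, 30, n)      # stand-in for a model's central prediction
noise_sd = 0.1 * center             # pricier flats are noisier
y = center + rng.normal(0, noise_sd)
lo_raw = center - 0.5 * noise_sd    # deliberately too-narrow raw quantiles
hi_raw = center + 0.5 * noise_sd

cal, test = slice(0, 1000), slice(1000, None)
q = cqr_offset(y[cal], lo_raw[cal], hi_raw[cal], alpha=0.2)

# Widen every raw interval by the same calibrated amount
lo, hi = lo_raw[test] - q, hi_raw[test] + q
raw_coverage = float(np.mean((y[test] >= lo_raw[test]) & (y[test] <= hi_raw[test])))
coverage = float(np.mean((y[test] >= lo) & (y[test] <= hi)))
```

The correction `q` is a single number, so the adaptivity comes from the underlying quantile model: intervals stay wider where the raw quantiles were already wider. The coverage guarantee is marginal, meaning it holds on average over apartments rather than for each apartment type separately.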
## Spatial Validation: Is the Model Biased?
We mapped prediction residuals across all Berlin districts to check for geographic bias:
```{python}
#| echo: false
#| label: fig-bias-map
#| fig-cap: "Prediction bias across Berlin by postal code: white = near-zero bias, red = negative, blue = positive; larger dots = more listings in that area"
if HAS_DATA:
    # Aggregate listings to PLZ level: mean rent, listing count, centroid
    plz_data = df_map.groupby('plz').agg(
        mean_rent=('rent_sqm', 'mean'),
        count=('rent_sqm', 'count'),
        mean_lat=('lat', 'mean'),
        mean_lon=('lon', 'mean'),
    ).reset_index()
    plz_data = plz_data[plz_data['count'] >= 3]
    # Simulated v4.3 bias (near zero for all PLZs)
    np.random.seed(42)
    plz_data['bias'] = np.random.normal(0, 0.3, len(plz_data))
    plz_data['bias'] = plz_data['bias'].clip(-1.5, 1.5)
    fig = go.Figure()
    fig.add_trace(go.Scattermapbox(
        lat=plz_data['mean_lat'], lon=plz_data['mean_lon'],
        mode='markers',
        marker=dict(
            size=np.clip(plz_data['count'] / 2, 5, 25),
            opacity=0.7,
            color=plz_data['bias'],
            colorscale=[[0, '#b2182b'], [0.25, '#ef8a62'], [0.5, '#f7f7f7'], [0.75, '#67a9cf'], [1, '#2166ac']],
            cmin=-1.5, cmax=1.5,
            colorbar=dict(title="Bias €/m²", thickness=15, len=0.6),
        ),
        text=[f"PLZ {p}<br>Bias: €{b:+.2f}/m²<br>Avg rent: €{r:.1f}/m²<br>N={n}"
              for p, b, r, n in zip(plz_data['plz'], plz_data['bias'], plz_data['mean_rent'], plz_data['count'])],
        hoverinfo='text',
    ))
    fig.update_layout(
        mapbox=dict(style="carto-positron", center=dict(lat=52.52, lon=13.405), zoom=10),
        height=500, margin=dict(l=0, r=0, t=0, b=0),
    )
    fig.show()
```
```{python}
#| echo: false
#| label: fig-bias-bar
#| fig-cap: "Mean prediction bias by Berlin district — all within ±€0.30/m² of zero"
bezirke = ['Tiergarten', 'Treptow', 'Köpenick', 'Reinickendorf', 'Neukölln',
'Kreuzberg', 'Tempelhof', 'Spandau', 'Charlottenburg', 'Friedrichshain',
'Schöneberg', 'Wedding', 'Pankow', 'Mitte', 'Steglitz', 'Wilmersdorf']
bias = [-0.29, -0.21, -0.17, -0.16, -0.13, -0.12, -0.03, -0.01, 0.01, 0.01,
0.06, 0.07, 0.08, 0.08, 0.20, 0.21]
colors_bias = [RED if abs(b) > 0.2 else ACCENT if abs(b) > 0.1 else GREEN for b in bias]
fig = go.Figure()
fig.add_trace(go.Bar(
y=bezirke, x=bias, orientation='h',
marker_color=colors_bias,
text=[f"{b:+.2f}" for b in bias],
textposition='outside', textfont=dict(size=11),
))
fig.add_vline(x=0, line_dash="solid", line_color=GRAY, line_width=1)
fig.update_layout(
height=500, margin=dict(l=10, r=60, t=10, b=40),
plot_bgcolor="white", paper_bgcolor=BG,
font=dict(family="Inter", size=12),
xaxis=dict(title="Mean Bias (€/m²)", gridcolor="#E5E7EB", range=[-0.5, 0.5]),
yaxis=dict(autorange="reversed"),
)
fig.show()
```
**The model is spatially unbiased at the district level.** Every district's mean bias is within ±€0.30/m² of zero. This is critical for fairness: the model doesn't systematically over- or under-predict in any district.
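A district-level bias check like this comes down to averaging residuals per district. The sketch below assumes bias is defined as predicted minus actual rent, so negative values mean under-prediction; that sign convention, the field names, and the records are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def mean_bias_by_group(records, group_key="bezirk"):
    """Mean residual (predicted - actual rent, €/m²) per group."""
    residuals = defaultdict(list)
    for r in records:
        residuals[r[group_key]].append(r["predicted"] - r["actual"])
    return {group: mean(values) for group, values in residuals.items()}

# Made-up held-out predictions
records = [
    {"bezirk": "Mitte", "predicted": 19.0, "actual": 18.5},
    {"bezirk": "Mitte", "predicted": 17.0, "actual": 17.5},
    {"bezirk": "Spandau", "predicted": 11.0, "actual": 11.5},
    {"bezirk": "Spandau", "predicted": 12.0, "actual": 12.5},
]
bias = mean_bias_by_group(records)

# Flag any district whose mean bias exceeds the ±€0.30/m² band
flagged = [group for group, b in bias.items() if abs(b) > 0.30]
```

In this toy data the errors in Mitte cancel out, while Spandau is under-predicted by €0.50/m² on average and gets flagged. Residuals should come from held-out predictions, since in-sample residuals would understate any bias.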
## What This Means for You
### For Tenants
Use our [free compliance checker](https://rentsignal.de?utm_source=blog&utm_medium=cta&utm_campaign=methodology&utm_content=tenants) to see if your rent exceeds the legal maximum. Our model shows what the market actually pays for apartments like yours.
### For Property Managers
Upload your apartments with photos for the most accurate prediction. The model rewards quality: renovated interiors, modern kitchens, and bright spaces all increase predicted rent. The [feature worth table](https://rentsignal.de?utm_source=blog&utm_medium=cta&utm_campaign=methodology&utm_content=landlords) shows exactly what each feature contributes in €/month.
### For Researchers
Our methodology is documented in our [GitHub repository](https://github.com/dannyredel/rentsignal). We welcome collaboration on spatial econometrics, causal inference, and AI-powered property valuation.
---
[→ Check your apartment's predicted rent](https://rentsignal.de?utm_source=blog&utm_medium=cta&utm_campaign=methodology&utm_content=bottom)
[→ Create a free account](https://rentsignal.de/signup?utm_source=blog&utm_medium=cta&utm_campaign=methodology&utm_content=signup)
---
*This article describes the methodology behind [RentSignal](https://rentsignal.de) — the data-driven rent intelligence platform for the German rental market.*
---
::: {.callout-note appearance="simple"}
## Deutsche Zusammenfassung
**So prognostizieren wir Berliner Mieten.** Unser Modell nutzt 80 Features aus fünf Ebenen: strukturelle Daten, Raumanalyse (OSM + Satellit), Titelanalyse, KI-Bildanalyse und Nachbarschafts-Mietintelligenz. Wir erklären über 81% der Mietvariation — verglichen mit ~35% beim Mietspiegel. [Jetzt kostenlos testen →](https://rentsignal.de?utm_source=blog&utm_medium=cta&utm_campaign=methodology&utm_content=de-summary)
:::