How We Predict Berlin Rents: 80 Features, Satellite Data, and AI Photos

A transparent look inside our ML pipeline — from raw listings to high-accuracy predictions.

Machine Learning

Methodology

Berlin

Satellite

XGBoost

RentSignal’s rent prediction model explained: XGBoost with 80 features including satellite imagery, AI photo analysis, and spatial rent intelligence. We show exactly how it works and why it outperforms the Mietspiegel.

Author

Klaus Redel

Published

March 22, 2026

Keywords

rent prediction machine learning, XGBoost real estate, Berlin rent model, satellite data rent prediction, SHAP explainability rental

Why Build a Rent Prediction Model?

Germany’s rental market is heavily regulated. The Mietpreisbremse (rent brake) caps new-lease rents at 10% above the local reference rent (Mietspiegel). But the Mietspiegel is a blunt instrument — it groups apartments into broad categories and assigns a single range.
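To make the cap arithmetic concrete, here is a minimal sketch. The function name and the example figures are hypothetical, and statutory exceptions to the rent brake are deliberately ignored.

```python
def max_legal_rent_sqm(reference_rent_sqm: float, cap: float = 0.10) -> float:
    """Upper bound on a new-lease rent per m² under a simple percentage cap."""
    if reference_rent_sqm < 0:
        raise ValueError("reference rent must be non-negative")
    return reference_rent_sqm * (1.0 + cap)


# Hypothetical example: reference rent of €10.00/m² for a 60 m² flat.
limit_sqm = max_legal_rent_sqm(10.00)
limit_month = limit_sqm * 60
print(f"cap: €{limit_sqm:.2f}/m², €{limit_month:.2f}/month")
```

The cap depends entirely on which reference rent applies, which is why the category an apartment falls into matters so much.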

Two apartments in the same Mietspiegel category can have vastly different market rents. A renovated Altbau with original floorboards on a quiet courtyard is not the same as an unrenovated flat on a busy street — even if the Mietspiegel says they are.

Our model captures what the Mietspiegel misses. With 80 features across five intelligence layers, it explains over 81% of rent variation (R² ≈ 0.81), compared to ~35% for the Mietspiegel alone. For legal compliance, we use the exact official Mietspiegeltabelle 2024 (163 rows) with address-level Wohnlage resolution via the Berlin Senate’s WFS geodata service.

The Data

Our training data comes from ImmoScout24, Berlin’s largest rental platform:

| Metric | Value |
|--------|-------|
| Listings analyzed | 8,259 |
| Regular market listings (training) | 4,828 |
| Photos analyzed by AI | 55,000+ |
| Listings with coordinates | 99.9% |
| Data period | March 2026 |

We exclude apartment swap listings (Tauschwohnungen) from training — these have artificially low rents that don’t reflect market pricing. They’re flagged automatically via NLP title analysis and handled separately.

Figure 1: Berlin rent landscape — our 8,259 listings colored by rent level

Five Intelligence Layers

Figure 2: Each feature layer adds meaningful prediction accuracy
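The marginal gains in Figure 2 are simply differences between consecutive cumulative scores. A short sketch, using the cumulative R² values plotted in the figure (variable names are ours):

```python
# Cumulative R² after adding each layer, in the order the layers are added.
layers = ["Structural", "Spatial", "NLP", "AI Photos", "Neighborhood"]
cumulative_r2 = [0.689, 0.708, 0.736, 0.761, 0.814]

# Marginal gain of a layer = its cumulative score minus the previous one.
marginal = [round(b - a, 3) for a, b in zip(cumulative_r2, cumulative_r2[1:])]

for name, gain in zip(layers[1:], marginal):
    print(f"{name:<13} +{gain:.3f}")
```

Marginal gains computed this way depend on the order in which layers are added, so they describe this particular stacking rather than each layer's value in isolation.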

Layer 1: Structural Features

The basics from the listing: size, rooms, floor, year built, amenities (kitchen, balcony, elevator), condition, heating type. These alone give R²≈0.69.

Layer 2: Spatial Features — OSM + Satellite

For each apartment’s coordinates, we compute distances to transit, parks, schools, and water bodies. We count restaurants, cafés, and shops within walking distance. From Sentinel-2 satellite imagery, we extract vegetation (NDVI), water proximity (NDWI), and urban density (NDBI) at multiple scales.
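All three satellite indices are normalized band differences. The sketch below uses the standard textbook definitions. Which Sentinel-2 bands and aggregation windows the pipeline uses is not stated here, so the band mapping in the comments (B3 green, B4 red, B8 near-infrared, B11 short-wave infrared) is a common choice rather than a description of our code.

```python
def normalized_difference(a: float, b: float) -> float:
    """(a - b) / (a + b), defined as 0.0 when both bands are zero."""
    total = a + b
    return 0.0 if total == 0 else (a - b) / total


def spectral_indices(green: float, red: float, nir: float, swir: float) -> dict:
    """Standard index definitions on surface-reflectance values."""
    return {
        "ndvi": normalized_difference(nir, red),    # vegetation (B8 vs B4)
        "ndwi": normalized_difference(green, nir),  # open water (B3 vs B8)
        "ndbi": normalized_difference(swir, nir),   # built-up surfaces (B11 vs B8)
    }


# Hypothetical reflectances for a vegetated pixel.
indices = spectral_indices(green=0.08, red=0.05, nir=0.40, swir=0.20)
print({name: round(value, 3) for name, value in indices.items()})
```

Averaging these per-pixel values over windows of different sizes around an apartment is one way to get the same index "at multiple scales".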

Restaurant density within 1km is consistently a top spatial predictor — it captures neighborhood vibrancy better than any single location variable.
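A density feature like this reduces to counting points of interest inside a radius. Here is a minimal sketch using the haversine distance; the function names and coordinates are hypothetical, and a production pipeline would normally use a spatial index instead of a linear scan.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_within(lat: float, lon: float, points, radius_m: float = 1000.0) -> int:
    """Number of (lat, lon) points no farther than radius_m from the origin."""
    return sum(1 for plat, plon in points if haversine_m(lat, lon, plat, plon) <= radius_m)


# Hypothetical apartment and venue coordinates in central Berlin.
apartment = (52.5219, 13.4132)
venues = [(52.5225, 13.4140), (52.5260, 13.4200), (52.5500, 13.4500)]
print(count_within(*apartment, venues))
```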

Layer 3: NLP Title Features

The listing title contains signal that structured fields miss. We extract indicators for apartment swaps, furnished listings, Altbau/Neubau mentions, and renovation keywords. The number of listing photos is itself a quality signal.
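Title indicators of this kind can be sketched as keyword flags. The patterns below are illustrative only; they are not the keyword lists used in production.

```python
import re

# Illustrative patterns (assumptions, not the production keyword lists).
PATTERNS = {
    "is_swap": re.compile(r"tausch", re.IGNORECASE),
    "is_furnished": re.compile(r"möbliert|furnished", re.IGNORECASE),
    "mentions_altbau": re.compile(r"altbau", re.IGNORECASE),
    "mentions_neubau": re.compile(r"neubau", re.IGNORECASE),
    "mentions_renovation": re.compile(r"saniert|renoviert|modernisiert", re.IGNORECASE),
}


def title_features(title: str) -> dict:
    """Map a listing title to 0/1 indicator features."""
    return {name: int(bool(pattern.search(title))) for name, pattern in PATTERNS.items()}


print(title_features("Tauschwohnung: sanierter Altbau in Kreuzberg"))
```

Plain substring matching has obvious failure modes: "unsaniert" would also trigger the renovation flag, so real keyword rules need negation handling.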

Layer 4: AI Photo Features

We analyze listing photos with AI to extract visual quality scores — interior condition, kitchen/bathroom quality, floor type, ceiling height, architectural style, and building facade condition. Read more about our AI photo pipeline →

Layer 5: Neighborhood Rent Intelligence

This is the newest layer and the one with the largest marginal gain (ΔR² = +0.053, lifting cumulative R² from 0.761 to 0.814). For each apartment, we compute what nearby apartments actually rent for, within 500 m and within 1 km. The median rent in the same postal code provides a stable anchor, and rent dispersion (price variation) captures gentrification dynamics.
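A neighborhood rent feature can be sketched as a median over listings inside a radius. This example skips the target listing so that its own rent cannot leak into its own feature. That exclusion, the function names, and the coordinates are assumptions of this sketch, not a description of the production code.

```python
import math
from statistics import median


def approx_distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Equirectangular approximation of distance in metres; adequate at city scale."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 6_371_000 * math.hypot(dx, dy)


def nearby_median_rent(idx: int, listings, radius_m: float):
    """Median €/m² of the other listings within radius_m, or None if there are none."""
    lat, lon, _ = listings[idx]
    rents = [
        rent
        for j, (plat, plon, rent) in enumerate(listings)
        if j != idx and approx_distance_m(lat, lon, plat, plon) <= radius_m
    ]
    return median(rents) if rents else None


# Hypothetical listings: (lat, lon, rent in €/m²).
listings = [
    (52.5200, 13.4050, 18.0),
    (52.5210, 13.4060, 16.0),
    (52.5230, 13.4080, 20.0),
    (52.5270, 13.4120, 14.0),
]
print(nearby_median_rent(0, listings, 500), nearby_median_rent(0, listings, 1000))
```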

What Matters Most: Feature Importance

Figure 3: Top 15 features by importance (SHAP values) — all five layers contribute

All five layers appear in the top 15. No single layer dominates: the model needs structural data, spatial context, NLP signals, visual quality, and neighborhood pricing to reach its full accuracy.
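The ranking in Figure 3 uses the mean absolute SHAP value per feature. Given a matrix of per-listing SHAP values, the aggregation is simple. The feature names and numbers below are hypothetical, and the step that produces the SHAP values themselves is omitted.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across listings."""
    n = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for k, value in enumerate(row):
            totals[k] += abs(value)
    scores = {name: total / n for name, total in zip(feature_names, totals)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values in €/m² (rows = listings, columns = features).
names = ["living_space", "fitted_kitchen", "year_built"]
rows = [
    [-1.0, 1.5, 0.2],
    [1.2, -1.1, -0.4],
    [-0.8, 1.3, 0.3],
]
ranked = mean_abs_shap(rows, names)
for name, score in ranked:
    print(f"{name:<15} {score:.2f}")
```

Taking the absolute value before averaging matters: a feature that pushes rents up for some listings and down for others would otherwise average out to near zero.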

Prediction Intervals: How Confident Is This?

A point estimate isn’t enough. We use Conformalized Quantile Regression — a method that provides adaptive prediction intervals with a statistical coverage guarantee.

Figure 4: Prediction intervals adapt to each apartment — wider for unusual ones, tighter for typical ones

- Typical apartment: interval width ~€6-8/m²
- Easy to predict: width ~€4/m² (common apartment types with many comparables)
- Unusual apartment: width €12-20/m² (luxury, micro-studios; the model correctly signals uncertainty)

The intervals target 80% coverage: about 80% of actual rents fall within the predicted range.
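The calibration step of Conformalized Quantile Regression can be sketched in a few lines. It takes lower and upper quantile predictions on a held-out calibration set and computes one margin that is then added to every interval. The data and function names are hypothetical, and the quantile models that produce the raw intervals are left out.

```python
import math


def cqr_margin(y_cal, lo_cal, hi_cal, alpha: float = 0.2) -> float:
    """Conformal correction for quantile-regression intervals.

    Each calibration point is scored by how far it falls outside its
    interval (negative when inside). The margin is the
    ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = sorted(max(lo - y, y - hi) for y, lo, hi in zip(y_cal, lo_cal, hi_cal))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def conformalize(lo: float, hi: float, margin: float):
    """Widen (or shrink, if margin is negative) an interval symmetrically."""
    return lo - margin, hi + margin


# Hypothetical calibration data in €/m².
y_cal = [12.0, 15.5, 9.0, 20.0, 14.0, 17.5, 11.0, 25.0, 13.0]
lo_cal = [10.0, 12.0, 8.0, 15.0, 12.5, 14.0, 10.5, 18.0, 11.0]
hi_cal = [14.0, 16.0, 11.0, 19.0, 16.0, 18.0, 13.0, 23.0, 15.0]

margin = cqr_margin(y_cal, lo_cal, hi_cal, alpha=0.2)
print(margin, conformalize(13.0, 18.0, margin))
```

The raw quantile intervals supply the adaptivity (wider for unusual apartments), while the single margin supplies the coverage guarantee. That guarantee holds on average across listings, not for each listing individually.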

Spatial Validation: Is the Model Biased?

We mapped prediction residuals across all Berlin districts to check for geographic bias:

Figure 5: Prediction bias across Berlin — green = accurate, larger dots = more listings in that area

Figure 6: Mean prediction bias by Berlin district — all within ±€0.30/m² of zero

The model shows no systematic geographic bias at the district level. Every district’s mean bias is within ±€0.30/m² of zero. This is critical for fairness: the model doesn’t systematically over- or under-predict in any district.
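A district-level bias check amounts to averaging residuals per group and comparing each mean against a tolerance. In this sketch the sign convention (predicted minus actual), the function name, and the records are assumptions for illustration.

```python
from collections import defaultdict


def mean_bias_by_group(records) -> dict:
    """Mean residual (predicted - actual) per group label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, predicted, actual in records:
        sums[group] += predicted - actual
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Hypothetical (district, predicted €/m², actual €/m²) triples.
records = [
    ("Mitte", 19.0, 18.5),
    ("Mitte", 17.0, 17.5),
    ("Spandau", 11.0, 11.5),
    ("Spandau", 12.0, 12.0),
]
bias = mean_bias_by_group(records)
flagged = [group for group, value in bias.items() if abs(value) > 0.30]
print(bias, flagged)
```

A mean near zero can hide large errors that cancel out, so a check like this complements per-listing error metrics rather than replacing them.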

What This Means for You

For Tenants

Use our free compliance checker to see if your rent exceeds the legal maximum. Our model shows what the market actually pays for apartments like yours.

For Property Managers

Upload your apartments with photos for the most accurate prediction. The model rewards quality: renovated interiors, modern kitchens, and bright spaces all increase predicted rent. The feature worth table shows exactly what each feature contributes in €/month.

For Researchers

Our methodology is documented in our GitHub repository. We welcome collaboration on spatial econometrics, causal inference, and AI-powered property valuation.

→ Check your apartment’s predicted rent

→ Create a free account

This article describes the methodology behind RentSignal — the data-driven rent intelligence platform for the German rental market.

German Summary (translated)

How we predict Berlin rents. Our model uses 80 features from five layers: structural data, spatial analysis (OSM + satellite), title analysis, AI image analysis, and neighborhood rent intelligence. We explain over 81% of rent variation, compared with ~35% for the Mietspiegel. Try it for free →
