Dense vs. Sparse Rewards: The Core Tradeoff
Sparse rewards — task completion only, no intermediate signal — are theoretically clean and leave no proxy to hack. But sparse rewards fail catastrophically for manipulation tasks exceeding roughly 10 sequential steps because of credit assignment: when the policy sees a reward only at episode completion, it has no signal indicating which of the 200 preceding actions contributed to success. The number of samples required grows roughly exponentially with task horizon under sparse rewards.
Dense rewards — a per-step progress signal based on some proxy for task advancement — dramatically improve sample efficiency. A policy learning to place an object with a dense distance-to-goal reward converges in 10–50× fewer environment steps than the sparse equivalent. The cost is reward hacking: the agent finds ways to maximize your proxy without solving your actual task.
To put numbers on this: in our Isaac Lab benchmarks for a pick-and-place task on a simulated OpenArm, sparse reward required approximately 200M environment steps (around 8 hours on an RTX 4090 with 4,096 parallel envs) to reach 70% success. Dense reward with distance shaping reached the same success rate in 12M steps (~30 minutes). But the dense-reward agent also discovered three separate reward hacks in the process, which required iterative reward redesign. The practical skill in manipulation RL is building dense rewards that capture the task without opening loopholes.
Potential-Based Shaping: The Theoretically Safe Dense Reward
The Ng, Harada, and Russell (1999) theorem on potential-based reward shaping provides a mathematically grounded approach to adding dense rewards without changing the optimal policy. The construction is: F(s, a, s') = γΦ(s') − Φ(s), where Φ is any real-valued potential function over states. Adding F to your base reward r produces a shaped reward r + F that shares the same set of optimal policies as r alone.
For manipulation, the most natural potential function is negative distance to goal: Φ(s) = −d(object, target_position). This produces a shaping term that gives positive reward when the agent moves the object toward the target and negative reward when it moves away — without ever rewarding the agent for just being near the object without having grasped it (which would be a non-potential-based dense reward and would change optimal behavior).
Implementing Potential-Based Shaping in Practice
The theorem is clean. Implementation is not. The main subtlety: potential-based shaping preserves optimality only for the undiscounted or properly discounted infinite-horizon case. In practice, you run finite episodes with early termination, which breaks the theoretical guarantee. The shaping term can now create incentives to prolong episodes (accumulating more positive shaping) rather than complete the task quickly.
The fix is to combine potential-based shaping with a per-step time penalty and a dominant terminal reward. The time penalty ensures episodes don't drag on, and the terminal reward ensures task completion always dominates any accumulated shaping signal. In our standard configuration:
```python
def compute_reward(self, obs, action, next_obs, done, info):
    # Potential-based shaping (preserves optimal policy)
    phi_current = -torch.norm(obs["object_pos"] - obs["target_pos"], dim=-1)
    phi_next = -torch.norm(next_obs["object_pos"] - next_obs["target_pos"], dim=-1)
    shaping = self.gamma * phi_next - phi_current

    # Grasp bonus: activated only on first contact (vectorized over parallel envs)
    first_contact = (info["contact_force"] > self.contact_threshold) & ~info["has_grasped"]
    grasp_reward = 0.5 * first_contact.float()

    # Sparse task completion (dominates all dense signals)
    success = next_obs["object_at_target"].float()
    task_reward = 10.0 * success

    # Time penalty (discourages episode prolongation)
    time_penalty = -0.01

    # Action smoothness penalty (second difference of commands as a jerk proxy)
    jerk = torch.norm(action - 2 * obs["prev_action"] + obs["prev_prev_action"], dim=-1)
    smooth_penalty = -0.02 * jerk

    total = task_reward + 0.3 * shaping + grasp_reward + time_penalty + smooth_penalty
    return total
```
The key ratios: the terminal task reward should be at least 10× the maximum cumulative shaping over an episode. A pure potential-based term telescopes, so its episode sum is bounded by the shaping weight times the range of Φ (here 0.3 × 1.0m ≈ 0.3), but early termination and non-potential terms like the grasp bonus break the telescope. Budget conservatively against the worst case of positive shaping every step: up to 0.3 per step over 200 steps is 60 per episode. Against that bound, a task reward of 10.0 is too low — you'd want 100+ at minimum, and the full 600 implied by the 10× rule to guarantee dominance. We use 10.0 here for clarity; in production, scale appropriately.
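As a sanity check, the dominance budget is a one-line calculation. A hypothetical helper (the function name and the conservative per-step bound are ours, using the numbers from the example above):

```python
def required_terminal_reward(shaping_weight, max_phi_range, episode_len,
                             dominance_factor=10.0):
    """Conservative terminal-reward budget: assume every step could yield
    positive shaping up to shaping_weight * max_phi_range (no telescoping)."""
    max_cumulative_shaping = shaping_weight * max_phi_range * episode_len
    return dominance_factor * max_cumulative_shaping

# Example numbers: weight 0.3, 1.0 m potential range, 200-step episodes
print(required_terminal_reward(0.3, 1.0, 200))  # 600.0
```

Running this against your own episode length and shaping weight takes seconds and catches under-scaled terminal rewards before training does.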
Multi-Phase Potential Functions
Many manipulation tasks have multiple sequential phases: approach, grasp, transport, place. A single potential function (distance to final target) provides no shaping signal for the approach and grasp phases. The solution is phase-aware potential functions that switch based on task progress:
- Phase 1 — Approach: Φ(s) = −d(end_effector, object). Provides gradient toward the object.
- Phase 2 — Grasp: Φ(s) = −d(end_effector, object) + grasp_quality(s). Rewards stable contact.
- Phase 3 — Transport: Φ(s) = −d(object, target). Only active after grasp is detected.
- Phase 4 — Place: Φ(s) = −d(object, target) − orientation_error(object, target_orientation). Active when object is within 5cm of target.
Phase transitions should be detected by physical signals (contact force thresholds, object height changes), not by time steps. Time-based phases create exploitation opportunities where the agent learns to trigger phase transitions prematurely.
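The phase logic above can be sketched as a single gated potential. A toy version in plain Python (thresholds and the `grasp_quality`/`orientation_error` inputs are placeholders for whatever your environment provides):

```python
import math

def phase_potential(ee_pos, obj_pos, target_pos, contact_force,
                    contact_threshold=5.0, place_radius=0.05,
                    grasp_quality=0.0, orientation_error=0.0):
    """Phase-aware potential Phi(s), gated on physical signals (contact force,
    object-to-target distance) rather than time steps."""
    if contact_force <= contact_threshold:
        # Approach/grasp phases: pull the end-effector toward the object
        return -math.dist(ee_pos, obj_pos) + grasp_quality
    if math.dist(obj_pos, target_pos) > place_radius:
        # Transport phase: pull the grasped object toward the target
        return -math.dist(obj_pos, target_pos)
    # Place phase: within 5 cm of the target, also penalize orientation error
    return -math.dist(obj_pos, target_pos) - orientation_error
```

One caveat: switching Φ between phases introduces a discontinuity, which produces a one-step shaping spike at the boundary. A common remedy is to offset each phase's potential so the values match at the transition, so the switch itself is neither rewarded nor punished.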
Manipulation-Specific Reward Components
| Component | Normalization | Weight | Activate When | Purpose |
|---|---|---|---|---|
| EE-to-Object Distance | Workspace diagonal (0.8–1.2m) | 0.1–0.3× | Before first contact | Encourage approach |
| Object-to-Target Distance | Workspace diagonal | 0.4–0.6× | After grasp detected | Primary transport signal |
| Grasp Quality | [0, 1] continuous | 0.1–0.2× | During contact | Prevent dragging |
| Orientation Alignment | Quaternion error [0, π] | 0.1–0.2× | Last 20% of transport | Correct placement angle |
| Action Smoothness (jerk) | Max expected jerk | −0.05–0.1× | Always | Hardware-safe motions |
| Joint Limit Penalty | Binary per joint | −0.1× per violation | Always | Stay within workspace |
| Collision Penalty | Binary | −1.0 terminal | Self-collision or table | Physical safety |
| Time Penalty | Constant per step | −0.01× | Always | Discourage stalling |
End-Effector to Object Distance: Normalize by workspace diagonal (typically 0.8–1.2m) to keep the component in [0, 1]. Weight: 0.1–0.3× total reward scale. Purpose: encourage approaching the object. Disable after first contact to avoid the hover-near-object hack. In practice, we use a decaying activation weight: full weight when no contact has occurred, linearly decaying to zero over 10 steps after first contact, then permanently off.
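The decaying activation can be written as a small helper. A sketch (`has_contacted` and `steps_since_contact` are assumed to be tracked per environment by your own bookkeeping):

```python
def ee_proximity_weight(has_contacted, steps_since_contact,
                        decay_steps=10, full_weight=0.3):
    """Full EE-proximity weight before first contact, linear decay to zero
    over `decay_steps` steps after first contact, then permanently off."""
    if not has_contacted:
        return full_weight
    frac = max(0.0, 1.0 - steps_since_contact / decay_steps)
    return full_weight * frac
```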
Object to Target Distance: Activate only after grasp is detected (contact force > threshold). Normalize by workspace. Weight: 0.4–0.6×. This is the primary dense signal for the transport phase. For tasks with tight placement tolerances (<2mm), add an exponential bonus in the final 5cm: bonus = exp(-100 * d^2) where d is in meters. This creates a strong gradient for the precision placement phase without affecting the broader transport behavior.
Grasp Quality: A binary or continuous measure of contact stability — e.g., whether the contact normals achieve force closure, or a GQ-CNN score if you have one. Weight: 0.1–0.2×. Without this, the agent learns to drag objects rather than grasp them. In simulation, you can compute antipodal grasp quality from the contact points and normals provided by the physics engine. In MuJoCo: check that at least 2 contact points exist on opposite sides of the object with contact normals differing in direction by >150 degrees.
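The antipodal check reduces to a pairwise angle test. A rough, geometry-only sketch (assumes the engine exposes contact normals as unit vectors; real grasp-quality metrics also account for friction cones):

```python
import math

def is_antipodal(normals, min_angle_deg=150.0):
    """True if any pair of contact normals points in nearly opposite
    directions, i.e. the angle between them exceeds min_angle_deg."""
    cos_thresh = math.cos(math.radians(min_angle_deg))  # cos(150°) ≈ -0.866
    for i in range(len(normals)):
        for j in range(i + 1, len(normals)):
            dot = sum(a * b for a, b in zip(normals[i], normals[j]))
            if dot < cos_thresh:  # angle between normals > min_angle_deg
                return True
    return False
```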
Action Smoothness: Penalize jerk (the third time derivative of joint position) to discourage abrupt motions that damage hardware. Compute it as the finite difference of joint accelerations, ‖q̈_t − q̈_{t−1}‖, or as the second difference of commanded actions. Weight: 0.05–0.1× as a negative penalty. Critical for sim-to-real transfer since jerky sim policies fail on real hardware due to unmodeled joint elasticity. On OpenArm specifically, we found that policies trained without jerk penalties caused the wrist joint servo to overheat within 15 minutes of continuous operation.
Reward Hacking: A Field Guide
The following hacks appear repeatedly across manipulation reward designs. They are listed as warnings, not discoveries — your agent will find them independently.
| Hack Name | Mechanism | Frequency | Fix |
|---|---|---|---|
| Hover Exploitation | EE hovers near object, accumulating proximity reward without grasping | Very common | Disable EE-proximity after first contact; add time penalty for non-contact phases |
| Table Push for Termination | Pushes object off table to trigger early episode end (less accumulated negative reward) | Common | Large negative terminal reward (−10×) for out-of-bounds; verify termination conditions |
| Success Oscillation | Rapidly oscillates object through target zone to trigger success on rolling window check | Common | Require success maintained for 10+ consecutive steps |
| Gravity Exploitation | Drops object near target from height, scoring placement reward during free-fall | Moderate | Check that EE is within 5cm of object at success time; require low object velocity (<0.05 m/s) |
| Contact Cycling | Repeatedly touches and releases object to accumulate first-contact bonus | Moderate | Make grasp bonus one-time per episode using a boolean flag |
| Joint Limit Surfing | Jams joints against limits to achieve workspace positions unreachable with smooth motion | Occasional | Add continuous joint-limit proximity penalty, not just at-limit penalty |
| Sim Physics Exploitation | Exploits penetration or tunneling bugs to teleport objects | Rare but catastrophic | Validate object positions between steps; flag teleports >5cm as invalid |
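Two of the fixes above — sustained-success checks and teleport validation — reduce to a few lines of bookkeeping. A sketch (class and method names are ours; positions are plain 3-tuples):

```python
class SuccessValidator:
    """Requires success to be held for `hold_steps` consecutive steps, and
    flags physically impossible jumps in object position between steps."""

    def __init__(self, hold_steps=10, max_step_displacement=0.05):
        self.hold_steps = hold_steps
        self.max_step_displacement = max_step_displacement  # 5 cm default
        self.consecutive = 0
        self.prev_pos = None

    def step(self, object_pos, at_target):
        # Teleport check: >5 cm displacement in one step indicates a physics bug
        if self.prev_pos is not None:
            disp = sum((a - b) ** 2 for a, b in zip(object_pos, self.prev_pos)) ** 0.5
            if disp > self.max_step_displacement:
                self.consecutive = 0
                self.prev_pos = object_pos
                return "invalid_teleport"
        self.prev_pos = object_pos
        self.consecutive = self.consecutive + 1 if at_target else 0
        return "success" if self.consecutive >= self.hold_steps else "running"
```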
Anti-Hacking Design Principles
- Terminal Reward Dominance: The task completion terminal reward should be at least 10× the maximum possible cumulative dense reward over an episode. This ensures the policy always prefers actually completing the task over exploiting dense reward accumulation. Calculate the maximum: (max shaping per step) × (max episode length) × (shaping weight). Your terminal reward must exceed this by an order of magnitude.
- Component Normalization: Keep all dense reward components in [-1, 1] and set explicit weights. Un-normalized reward components at different scales interact unpredictably and make the reward function very sensitive to implementation details. Log each component's running statistics during training — if any component's variance is >10× another's, you have a normalization problem.
- Adversarial Testing: Before full training runs, test your reward function with a scripted adversarial policy that deliberately tries to maximize reward without completing the task. If the adversary finds a high-reward path that isn't task completion, redesign before wasting GPU time. At SVRC we maintain a library of ~20 adversarial scripts for common manipulation tasks. Running them takes 5 minutes and saves days.
- Reward Component Logging: Log every reward component individually during training, not just the total. When reward curves look healthy but success rate stalls, the component breakdown reveals which term the agent is exploiting. Use TensorBoard or Weights & Biases to track each component as a separate scalar.
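A minimal pattern for the per-component logging principle, framework-agnostic (swap the dict for `wandb.log` calls or a TensorBoard writer; Welford's online method keeps the running statistics numerically stable):

```python
from collections import defaultdict

class RewardComponentLogger:
    """Tracks running mean and variance per reward component (Welford's method)."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"n": 0, "mean": 0.0, "m2": 0.0})

    def log(self, components):
        """`components` is a dict like {"shaping": 0.12, "task": 0.0, ...}."""
        for name, value in components.items():
            s = self.stats[name]
            s["n"] += 1
            delta = value - s["mean"]
            s["mean"] += delta / s["n"]
            s["m2"] += delta * (value - s["mean"])

    def variance(self, name):
        s = self.stats[name]
        return s["m2"] / s["n"] if s["n"] > 1 else 0.0
```

Comparing `variance()` across components directly implements the >10× variance check above.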
Curriculum Reward Progression
For complex multi-phase tasks — assemble a gear, insert a peg with clearance <1mm, sort mixed objects — a single static reward function rarely works. The approach that consistently succeeds is curriculum reward progression: start with a simplified reward that makes the task learnable, then gradually increase difficulty and precision requirements.
Stage 1: Coarse Reaching (Steps 0–5M)
Reward only end-effector proximity to the object. No grasp quality requirement, no placement precision. The goal is to learn basic arm control and workspace navigation. Success threshold: EE within 5cm of object.
Stage 2: Grasp Acquisition (Steps 5M–20M)
Add grasp quality reward. Tighten the proximity threshold. The agent now needs to actually close fingers around the object and establish stable contact. Success threshold: object lifted 10cm above table surface.
Stage 3: Transport (Steps 20M–50M)
Activate object-to-target distance reward. The agent must grasp, lift, and move the object toward the target. Success threshold: object within 5cm of target.
Stage 4: Precision Placement (Steps 50M–100M)
Tighten all thresholds. Add orientation alignment reward. Enable the exponential precision bonus. Success threshold: object within 2mm of target with orientation error <5 degrees. This is where you need the most patience and the most careful reward tuning.
```python
class CurriculumRewardManager:
    """Manages reward curriculum progression based on training progress."""

    def __init__(self, env_cfg):
        self.stages = [
            {"name": "reach", "threshold": 0.70, "min_steps": 2_000_000},
            {"name": "grasp", "threshold": 0.60, "min_steps": 5_000_000},
            {"name": "transport", "threshold": 0.50, "min_steps": 10_000_000},
            {"name": "precise", "threshold": None, "min_steps": 0},
        ]
        self.current_stage = 0
        self.total_steps = 0

    def maybe_advance(self, success_rate, steps_in_stage):
        stage = self.stages[self.current_stage]
        if (stage["threshold"] is not None
                and success_rate >= stage["threshold"]
                and steps_in_stage >= stage["min_steps"]
                and self.current_stage < len(self.stages) - 1):
            self.current_stage += 1
            print(f"Advancing to stage: {self.stages[self.current_stage]['name']}")

    def get_reward_weights(self):
        """Returns reward component weights for current curriculum stage."""
        if self.current_stage == 0:    # reach
            return {"ee_dist": 1.0, "grasp": 0.0, "obj_dist": 0.0, "orient": 0.0}
        elif self.current_stage == 1:  # grasp
            return {"ee_dist": 0.5, "grasp": 0.5, "obj_dist": 0.0, "orient": 0.0}
        elif self.current_stage == 2:  # transport
            return {"ee_dist": 0.2, "grasp": 0.3, "obj_dist": 0.5, "orient": 0.0}
        else:                          # precise
            return {"ee_dist": 0.1, "grasp": 0.2, "obj_dist": 0.4, "orient": 0.3}
```
The stage advancement thresholds (70%, 60%, 50%) are deliberately set below 100% — you want the policy to move on once it has learned the essential behavior at each stage, not once it has overfit to the easy version. If you wait for 95% success before advancing, the policy develops habits optimized for the easy reward that are hard to unlearn when precision requirements increase.
Sim-to-Real Reward Gaps
A reward function that works perfectly in simulation may produce completely different behavior on real hardware, even with domain randomization. The reward gap has three main sources:
1. Contact Dynamics Mismatch
Sim contact models (MuJoCo's soft-contact, Isaac Sim's GPU-accelerated rigid body) differ from reality in friction coefficients, contact stiffness, and deformation behavior. A reward that depends on fine-grained contact states (grasp quality, sliding friction) will produce different gradients in sim vs. real. Mitigation: randomize friction coefficients (0.3–1.2× nominal), contact stiffness (±50%), and object mass (±30%) during training. SVRC's RL environment includes these randomization ranges as defaults.
2. Observation Delay
Real robot observations arrive with variable latency (1–15ms for joint encoders, 30–80ms for camera frames). Reward components computed from observations that are stale by different amounts produce inconsistent signals. In sim, all observations are synchronized. Mitigation: add random observation delays (0–2 control steps) to simulated sensors during training. This is more important than most practitioners realize — we've seen policies that achieve 90% success in sim drop to 30% on real hardware purely from observation delay.
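A toy version of observation-delay randomization using a ring buffer (real setups delay each sensor stream independently and resample the delay each episode, as described above; the class name is ours):

```python
import random
from collections import deque

class DelayedSensor:
    """Returns observations delayed by a random 0-2 control steps,
    resampled per episode to mimic variable real-world latency."""

    def __init__(self, max_delay=2, seed=None):
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.delay = self.rng.randint(0, self.max_delay)  # per-episode delay
        self.buffer = deque(maxlen=self.max_delay + 1)

    def observe(self, obs):
        self.buffer.append(obs)
        # Until the buffer fills, return the oldest available observation
        idx = max(0, len(self.buffer) - 1 - self.delay)
        return self.buffer[idx]
```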
3. Actuator Model Mismatch
Simulated motors respond to commands instantly and exactly. Real servos have bandwidth limits (typically 5–15 Hz for position-controlled servos), backlash (0.1–0.5 degrees on typical harmonic drives), and torque ripple. Rewards that depend on precise end-effector positioning are affected because the real robot cannot achieve the positions the sim policy expects. Mitigation: model actuator dynamics as a first-order low-pass filter with frequency randomized between 5–20 Hz, and add uniform backlash noise of ±0.3 degrees per joint. On OpenArm specifically, the wrist joint has the most backlash (~0.4 degrees) and benefits most from this randomization.
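A minimal per-joint sketch of that actuator model (parameter names are illustrative; the filter coefficient follows the standard discrete first-order low-pass form for the sampled cutoff frequency):

```python
import math
import random

class ActuatorModel:
    """First-order low-pass on position commands with per-episode randomized
    cutoff (5-20 Hz) plus uniform backlash noise, approximating a real servo."""

    def __init__(self, dt=0.02, backlash_deg=0.3, seed=None):
        self.dt = dt
        self.backlash = math.radians(backlash_deg)
        self.rng = random.Random(seed)
        self.reset()

    def reset(self, initial_pos=0.0):
        cutoff_hz = self.rng.uniform(5.0, 20.0)  # randomized each episode
        # Discrete-time coefficient of a first-order low-pass at cutoff_hz
        self.alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz * self.dt)
        self.pos = initial_pos

    def step(self, command):
        self.pos += self.alpha * (command - self.pos)  # lag toward the command
        return self.pos + self.rng.uniform(-self.backlash, self.backlash)
```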
Reward Functions for Specific Tasks
The following are field-tested reward specifications for common manipulation tasks, tuned through multiple design iterations at SVRC:
Pick and Place (Simple)
Two-phase: approach + transport. EE-distance shaping (weight 0.3) until contact, then object-distance shaping (weight 0.5) after grasp. Terminal: +100 for object within 3cm of target for 10 steps. Time penalty: −0.01/step. Total training: ~15M steps in Isaac Lab (4096 envs), ~45 minutes on RTX 4090.
Peg Insertion (Tight Tolerance <1mm)
Four-phase curriculum required. Approach, grasp, coarse alignment (peg axis within 10 degrees of hole axis), fine insertion. The critical insight: use force-based reward in the final insertion phase rather than position-based. The reason is that sub-millimeter position accuracy is unreliable from vision, but the force profile during successful insertion is distinctive and consistent. Reward the characteristic force signature (increasing Z-force with bounded XY-forces) rather than position. Total training: ~100M steps, ~5 hours.
Object Reorientation (In-Hand)
Single-phase but requires dexterous finger control. Reward: orientation error between current and target orientation, measured as geodesic distance on SO(3). Critical addition: penalize object linear velocity (the object should rotate in place, not fly across the workspace). Also penalize finger-object slip events detected by sudden drops in contact force. This task is notoriously hard — expect 200M+ steps and success rates of 40–60% even with careful reward design. Works best with the Allegro hand in sim; on real hardware, the Shadow hand's tactile feedback helps but introduces sensor noise.
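The geodesic distance on SO(3) is usually computed from unit quaternions. A sketch, assuming (w, x, y, z) ordering:

```python
import math

def geodesic_angle(q1, q2):
    """Geodesic distance on SO(3) between two unit quaternions, in radians.
    The abs() handles the double cover: q and -q are the same rotation."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # clamp floating-point overshoot before acos
    return 2.0 * math.acos(dot)
```

Using this instead of a naive quaternion-component error avoids spurious gradients near the q/−q ambiguity, which matters for in-hand reorientation where the target orientation is approached from arbitrary sides.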
Bimanual Assembly
Requires separate reward streams for each arm, plus a coordination bonus. Each arm gets its own approach/grasp/transport phases. The coordination bonus rewards both arms being in the correct relative configuration for the assembly step. This is where reward design becomes genuinely difficult: too much weight on coordination causes both arms to freeze in the correct relative pose without actually completing the task; too little weight and the arms collide. We use 0.15× weight on coordination, with the coordination potential defined as negative distance between the two grasped parts' mating surfaces.
Common Reward Design Mistakes
- Reward on observation, not state: Computing reward from camera observations (which have noise) rather than privileged sim state (which is exact). During training in sim, always compute reward from ground-truth state. The policy observes noisy sensors, but the reward should be clean.
- Forgetting episode boundaries: Potential-based shaping needs to handle episode resets correctly. If you don't zero out the potential at episode start, the first step of each episode gets a spurious large shaping signal from the difference between terminal state potential and initial state potential.
- Reward delay mismatch: If your reward depends on contact detection and your contact detection has a 2-step delay, the reward signal arrives 2 steps late. The policy learns to act 2 steps ahead of when it "should," causing oscillation. Ensure reward computation and observation timing are aligned.
- Not testing at boundaries: Your reward might work for objects at the center of the workspace but break for objects near edges (where distance normalization behaves differently) or near the robot's joint limits (where reachability constraints interact with the reward gradient). Test with objects at all four corners of the workspace.
- Single-seed evaluation: RL is notoriously seed-sensitive. A reward that produces 90% success on one seed may produce 20% on another. Always evaluate across at least 3 seeds before concluding your reward works. At SVRC, we use 5 seeds as standard and report mean ± std.
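The episode-boundary mistake in particular is cheap to guard against: zero the shaping term on the first step after a reset so the potential difference never spans two episodes. A sketch (`is_first_step` is assumed to come from the environment's reset bookkeeping):

```python
def safe_shaping(phi_prev, phi_next, is_first_step, gamma=0.99, weight=0.3):
    """Potential-based shaping that is masked on reset steps, so the
    potential difference never mixes the terminal state of one episode
    with the initial state of the next."""
    if is_first_step:
        return 0.0
    return weight * (gamma * phi_next - phi_prev)
```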
Tools and Frameworks
The following frameworks provide reward function infrastructure for manipulation RL:
- Isaac Lab (NVIDIA): GPU-parallelized envs with built-in reward function classes. Best for scale (4,096–16,384 parallel envs). The `RewardManager` class supports weighted multi-component rewards with per-component logging out of the box. Recommended for production training.
- MuJoCo + Gymnasium: More accurate contact physics, fewer parallel envs (typically 1–64 on CPU). Better for reward prototyping where you need to inspect individual episodes in detail. The `mujoco.viewer` is invaluable for diagnosing reward hacks — watching the agent's behavior at 0.1× speed reveals what the reward curves hide.
- robosuite (Stanford): Pre-built manipulation tasks with standard reward functions. Good starting point for reward design — study their implementations before writing your own. The lift, stack, and nut assembly rewards are well-tested.
- SVRC RL Environment: Pre-configured Isaac Lab environments with OpenArm and other SVRC hardware models, validated reward functions, and domain randomization presets. See documentation.
Related Reading
Common Mistakes in Imitation Learning · ACT Policy Explained · Robot Startup Tech Stack · Robotics Glossary · SVRC RL Environments
SVRC's simulation environment includes reference reward implementations for standard manipulation tasks with anti-hacking safeguards built in. See the RL environment documentation for available tasks and reward specifications.