How I Built a Gesture Controller in Next.js
9 min read
August 25, 2025
I have rarely used a canvas element in my projects. Either the project doesn't need one, or I find an alternative instead, which sometimes comes down to negotiating the design requirements. The one time I used canvas extensively was on a Three.js project, where I imported a canvas element from the react-three-fiber package to render a chaotic, manually adjusted model. I call the model chaotic because the amount of computation needed just to meet the responsiveness requirements was ridiculous. You can check out the project here. Even then, I didn't manipulate the canvas parameters significantly.
I thought I could experiment with the canvas element again, but not for the purpose I had used it for before. That's how this project started.
I had some spare time a couple of weeks back, since I had finished all my tasks ahead of the agreed timeline, and I wanted to spend it on something interesting and productive. I decided to make a webpage interactive based on hand signals. The idea is to give users an intuitive experience while keeping their hands off the trackpad, keyboard, and mouse: nobody should need any of those peripherals to scroll when it can be done with their own hands. Honestly, I had a fun time building this, and I'm sharing the process here in case you want to try something similar.
I have kept the design minimal here to focus on the functionality, but you can populate your page with as many aesthetic elements as you like.
I used Next.js as my stack, so if you already have a page set up, you can follow the steps below.
<main className="w-screen min-h-screen flex flex-col items-center justify-center">
<GestureController />
</main>
"use client";

import { FC, useCallback, useEffect, useRef, useState } from "react";

const GestureController: FC<GestureControllerProps> = (props) => {
  const videoRef = useRef<HTMLVideoElement>(null);
  const canvasRef = useRef<HTMLCanvasElement>(null);
  // loading / error state used by initializeCamera further below
  const [isLoading, setIsLoading] = useState(false);
  const [error, setError] = useState("");
  return (
    <section className="w-full h-full items-center justify-center relative">
      <video
        ref={videoRef}
        className="w-full max-w-2xl rounded-lg"
        playsInline
        autoPlay
        muted
      />
      <canvas
        ref={canvasRef}
        className="absolute inset-0 max-w-2xl mx-auto pointer-events-none"
      />
    </section>
  );
};

export default GestureController;
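One thing the snippet above doesn't show is the GestureControllerProps interface. Based on the callbacks used later in this post, it could look something like this (a sketch, not necessarily my exact definition):

interface GestureControllerProps {
  onThumbUp?: () => void;
  onThumbDown?: () => void;
  onOpenPalm?: () => void;
  onClosedFist?: () => void;
  onPointingUp?: () => void;
  onVictory?: () => void;
  onILoveYou?: () => void;
  // normalized pointer position (0..1) driven by the index fingertip
  onPointerMove?: (x: number, y: number) => void;
  onPinchClick?: () => void;
}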
In the markup above, the video and canvas elements carry refs that give you access to the frame data needed to recognize the signs. The following independent functions give a clear idea of how a video element can be wired to a canvas element.
const initializeCamera = useCallback(async () => {
try {
setIsLoading(true);
setError("");
const stream = await navigator.mediaDevices.getUserMedia({
video: {
width: { ideal: 600 },
height: { ideal: 480 },
facingMode: "user",
},
});
if (videoRef.current) {
videoRef.current.srcObject = stream;
await videoRef.current.play();
}
setIsLoading(false);
} catch (err) {
setError("Failed to access camera. Please grant camera permissions.");
setIsLoading(false);
}
}, []);
const drawVideoToCanvas = useCallback(async () => {
const video = videoRef.current;
const canvas = canvasRef.current;
if (!video || !canvas || video.readyState !== 4) return;
const ctx = canvas.getContext("2d");
if (!ctx) return;
canvas.width = video.videoWidth;
canvas.height = video.videoHeight;
ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
}, []);
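These two functions are not wired up on their own. If you just want to sanity-check the camera-to-canvas plumbing before bringing in the gesture model, a minimal sketch of an effect that mirrors the video onto the canvas could look like this (the recognizer effect later opens the camera itself, so treat this as a standalone check):

useEffect(() => {
  let raf = 0;

  const loop = () => {
    drawVideoToCanvas();
    raf = requestAnimationFrame(loop);
  };

  initializeCamera().then(() => {
    raf = requestAnimationFrame(loop);
  });

  return () => cancelAnimationFrame(raf);
}, [initializeCamera, drawVideoToCanvas]);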
I have used the CDN approach here instead of hosting the model locally. You can try either approach, but if you decide to go with the CDN, please refer to this link to get a better understanding of the model and the MediaPipe library I'm using. I'll walk you through it step by step.
You have to install the following library to access the MediaPipe APIs.
pnpm add @mediapipe/tasks-vision
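Once the package is installed, the snippets below assume these imports at the top of the component file:

import {
  FilesetResolver,
  GestureRecognizer,
  type GestureRecognizerResult,
} from "@mediapipe/tasks-vision";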
useEffect(() => {
  // Note: recognizer, raf, and frame are declared further down in this same
  // effect; the next snippet shows those declarations.
  (async () => {
    // 1. open camera (works on https or localhost)
const stream = await navigator.mediaDevices.getUserMedia({
video: { facingMode: "user", width: 1200, height: 600 },
});
if (!videoRef.current) return;
videoRef.current.srcObject = stream;
await videoRef.current.play();
// 2. load wasm & model, create recognizer in VIDEO mode
const vision = await FilesetResolver.forVisionTasks(
"https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm"
);
recognizer = await GestureRecognizer.createFromOptions(vision, {
baseOptions: {
modelAssetPath:
"https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/1/gesture_recognizer.task",
},
runningMode: "VIDEO",
numHands: 2,
cannedGesturesClassifierOptions: {
// example: only keep a few gestures
categoryAllowlist: [
"Open_Palm",
"Closed_Fist",
"Thumb_Up",
"Thumb_Down",
],
},
});
raf = requestAnimationFrame(frame);
})();
}, []);
Apart from the model and the library, I should mention the importance of requestAnimationFrame here. It schedules the recognition loop to run once per browser repaint, which keeps the drawing smooth without hogging the main thread the way a tight loop or an aggressive setInterval would. And just as a setInterval is cleared when the component unmounts, the requestAnimationFrame loop can be cancelled in the effect's cleanup.
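As a sketch, the cleanup returned at the end of that effect could cancel the loop and release the camera and recognizer (this assumes the recognizer and raf variables declared inside the effect):

return () => {
  cancelAnimationFrame(raf); // stop the recognition loop
  recognizer?.close(); // release the MediaPipe resources
  const stream = videoRef.current?.srcObject as MediaStream | null;
  stream?.getTracks().forEach((track) => track.stop()); // turn the camera off
};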
useEffect(() => {
  // This is the same effect as above; these are the declarations that
  // recognizer, raf, and frame refer to.
  let recognizer: GestureRecognizer | null = null;
  let raf = 0;
  let lastVideoTime = -1;
// landmark indices per MediaPipe
const indexTipId = 8;
const thumbTipId = 4;
// simple smoothing for pointer
let smoothX = 0.5,
smoothY = 0.5;
const drawLandmarks = function (res: GestureRecognizerResult) {
const ctx = canvasRef.current?.getContext("2d");
const video = videoRef.current;
if (!ctx || !video) return;
canvasRef.current!.width = video.videoWidth;
canvasRef.current!.height = video.videoHeight;
ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);
ctx.lineWidth = 2;
    (res.landmarks ?? []).forEach((points) => {
// draw points
ctx.beginPath();
points.forEach((p) => {
ctx.moveTo(p.x * ctx.canvas.width, p.y * ctx.canvas.height);
ctx.arc(
p.x * ctx.canvas.width,
p.y * ctx.canvas.height,
3,
0,
Math.PI * 2
);
});
ctx.stroke();
});
};
const distance = (
a: { x: number; y: number },
b: { x: number; y: number }
) => {
const dx = a.x - b.x,
dy = a.y - b.y;
return Math.hypot(dx, dy);
};
const actOnResults = function (res: GestureRecognizerResult) {
const gestureForHands = res.gestures ?? [];
if (gestureForHands.length > 0 && gestureForHands[0].length > 0) {
const top = gestureForHands[0][0];
switch (top.categoryName) {
case "Thumb_Up":
props.onThumbUp?.();
break;
case "Thumb_Down":
props.onThumbDown?.();
break;
case "Open_Palm":
props.onOpenPalm?.();
break;
case "Closed_Fist":
props.onClosedFist?.();
break;
case "Pointing_Up":
props.onPointingUp?.();
break;
case "Victory":
props.onVictory?.();
break;
case "ILoveYou":
props.onILoveYou?.();
break;
}
}
    const hands = res.landmarks ?? [];
if (hands.length > 0) {
const pts = hands[0];
const idx = pts[indexTipId];
const thumb = pts[thumbTipId];
// Pointer position (normalized 0..1) with light smoothing
const targetX = idx.x,
targetY = idx.y;
const alpha = 0.35;
smoothX = smoothX + alpha * (targetX - smoothX);
smoothY = smoothY + alpha * (targetY - smoothY);
props.onPointerMove?.(smoothX, smoothY);
// Normalize pinch by hand size (wrist(0) to middle_mcp(9))
const wrist = pts[0],
middleMcp = pts[9];
const handScale = Math.max(0.001, distance(wrist, middleMcp));
const pinch = distance(idx, thumb) / handScale;
if (pinch < 0.35) props.onPinchClick?.();
}
};
const frame = () => {
    const video = videoRef.current;
if (!recognizer || !video) return;
const now = performance.now();
if (video.currentTime !== lastVideoTime) {
const result = recognizer.recognizeForVideo(video, now);
drawLandmarks(result);
actOnResults(result);
lastVideoTime = video.currentTime;
}
raf = requestAnimationFrame(frame);
};
}, []);
I have initialized the recognizer variable inside the same useEffect hook. The video element's frames are fed to the recognizer inside the frame function, and based on the result it returns, the actOnResults function confirms which signal was made. The categories are already trained into the model extensively; if you want to add a new hand signal, you can download the entire model and train it further. The incredible part of this model is that it can also track actions across a sequence of frames, but this blog only covers frame-based confirmation.
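If you want that frame-based confirmation to be a little more robust, one option is to require the same gesture to win for a few consecutive frames with a decent score before firing a callback. Here is a sketch (the 0.6 threshold and 5-frame count are arbitrary values, not something from my implementation):

let lastGesture = "";
let stableFrames = 0;

const confirmGesture = (res: GestureRecognizerResult): string | null => {
  const top = res.gestures?.[0]?.[0];
  if (!top || top.score < 0.6) {
    lastGesture = "";
    stableFrames = 0;
    return null;
  }
  if (top.categoryName === lastGesture) {
    stableFrames += 1;
  } else {
    lastGesture = top.categoryName;
    stableFrames = 1;
  }
  // only report the gesture once it has been stable for 5 frames in a row
  return stableFrames >= 5 ? top.categoryName : null;
};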
The canvas element is used to draw the points on my hand based on the landmarks in the response. To make this happen, you start by matching the canvas width and height to the video's proportions and creating a 2D context. I learned the importance of the 2D context when I was developing a whiteboard: you can only manipulate the stroke properties once you have created the 2D context.
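For example, stroke properties such as color and width live on the context and have to be set before you stroke a path; something like this (the color and radius are just placeholder values):

const ctx = canvasRef.current?.getContext("2d");
if (ctx) {
  ctx.lineWidth = 2; // outline thickness of each landmark dot
  ctx.strokeStyle = "#00ff88"; // any CSS color works here
  ctx.beginPath();
  ctx.arc(100, 100, 3, 0, Math.PI * 2); // one landmark point at (100, 100)
  ctx.stroke();
}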
(Math.hypot() returns the square root of the sum of the squares of its arguments, which is what the distance helper above relies on.)
The following root page file shows what I have done with the predictions.
"use client";
import { FC, useEffect, useRef, useState } from "react";
import GestureController from "@/app/_components/projects/gesture-controller";
interface PageProps {}

const Page: FC<PageProps> = ({}) => {
const scrollInterval = useRef<NodeJS.Timeout | null>(null);
const [gesture, setGesture] = useState<"up" | "down" | null>(null);
// Start continuous scroll
const startScrolling = (direction: "up" | "down") => {
if (scrollInterval.current) return;
scrollInterval.current = setInterval(() => {
window.scrollBy({
top: direction === "up" ? -40 : 40,
behavior: "smooth",
});
}, 30);
};
// Stop continuous scroll
const stopScrolling = () => {
if (scrollInterval.current) {
clearInterval(scrollInterval.current);
scrollInterval.current = null;
}
};
// Whenever gesture changes
useEffect(() => {
if (gesture === "up") {
startScrolling("up");
} else if (gesture === "down") {
startScrolling("down");
} else {
stopScrolling();
}
return () => stopScrolling();
}, [gesture]);
return (
<main className="w-screen min-h-screen flex flex-col items-center justify-center">
<GestureController
onThumbUp={() => setGesture("down")}
onThumbDown={() => setGesture("up")}
      onClosedFist={() => setGesture(null)}
/>
<section className="w-screen h-screen bg-amber-700 flex gap-2 flex-col items-center justify-center">
<div className="border shadow-lg backdrop-blur-2xl flex gap-2 flex-col items-center justify-center p-8 rounded-md">
<h1 className="text-5xl text-white font-normal">
Thumbs Up - Scroll Down
</h1>
<h1 className="text-5xl text-white font-normal">
Thumbs Down - Scroll Up
</h1>
<h1 className="text-5xl text-white font-normal">
Close Fist - Pause
</h1>
</div>
</section>
<section className="w-screen h-screen bg-slate-600" />
<section className="w-screen h-screen bg-blue-700" />
</main>
);
};
export default Page;