The Scanner — Walking Your Filesystem Into the DB
RecursiveDirectoryIterator to walk your media folders. INSERT IGNORE skips files already in the DB. Detect audio vs video by file extension. Use mime_content_type for the MIME. We do not extract metadata yet; that's a separate, optional step.
Why CLI, not a web page?
You might be wondering why we're putting the scanner in bin/ as a command-line script instead of making it a web page like the rest of the app. Reasonable question. Two big reasons.
First, web requests have a time limit — by default PHP kills any script that runs longer than 30 seconds. Scanning a media library can take minutes if you have a lot of files. You'd hit the limit and the scan would just die halfway through. CLI scripts have no such limit by default, so they can run for as long as they need to. We could raise the web time limit, but it's the wrong tool for the job.
Second, scanning is something you do occasionally — when you add new files, or when you first set up the server. It's not part of the visitor experience. Putting it behind a URL is a security risk (someone could trigger it constantly and DoS your disk), but also it's just conceptually backwards. Maintenance scripts belong on the command line.
The pattern you're learning here, "build the engine as a CLI tool, build the visitor-facing layer as web pages", is the same split mature CMSs use: WordPress (with WP-CLI) and Drupal (with Drush) both keep maintenance work on the command line, away from the pages visitors see. Cron jobs run the maintenance, web requests serve the visitors. Two different lifecycles.
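If you want the script itself to refuse to run through the web server, one option is a small guard at the top. This is only a sketch: the helper name running_from_cli() is made up for illustration and is not part of the scanner we build below.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: true when PHP was started from the command line.
function running_from_cli(): bool
{
    // PHP_SAPI is 'cli' for command-line runs; web SAPIs report
    // names such as 'fpm-fcgi' or 'apache2handler' instead.
    return PHP_SAPI === 'cli';
}

if (!running_from_cli()) {
    http_response_code(403);
    exit("This script must be run from the command line.\n");
}

// The CLI SAPI defaults max_execution_time to 0, which means "no limit".
echo 'max_execution_time = ', ini_get('max_execution_time'), "\n";
```

Keeping bin/ outside the web root does the same job at the server level; the guard is just a second line of defence.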
The plan for the scanner
Here's the algorithm in plain English. For each top-level media folder you care about: recursively walk every file underneath. For each file: check its extension. If it's audio (mp3, flac, wav, ogg, m4a, opus, aac) or video (mp4, mkv, webm, avi, mov, m4v), insert a row into the database with the full path, the detected type, the file size, the MIME type, and a placeholder title derived from the filename. Skip everything else (text files, sidecar .srt subtitles, hidden files, you know the drill). At the end, print a summary so you know it worked.
That's it. No fancy stuff yet — no metadata extraction, no thumbnails, no album art. Just "build me a list of what exists." Keep it simple, ship it, then improve.
RecursiveDirectoryIterator, the PHP way to walk a folder tree
PHP has a built-in for "walk every file under this folder" that's slightly clunky-looking but super useful once you've used it twice. It's called RecursiveDirectoryIterator, paired with RecursiveIteratorIterator (yes, that's two classes, both required, terrible naming, what can you do).
Quick mental model: RecursiveDirectoryIterator is a thing that knows how to walk into subfolders. RecursiveIteratorIterator wraps it and gives you back a flat iterator over every file at every depth. You loop over it with a regular foreach and you get every file in the tree.
🐍 Python: This is basically PHP's version of os.walk() or pathlib.Path.rglob('*'). Same job, just clunkier syntax because PHP didn't get to design its filesystem API from scratch.
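To make the two-class pairing concrete, here is a minimal sketch that builds a tiny throwaway folder tree and lists every file in it. The helper name list_files() and the demo folder names are assumptions for illustration only.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: return every regular file under $root, at any depth.
function list_files(string $root): array
{
    $it = new RecursiveIteratorIterator(
        // SKIP_DOTS hides the "." and ".." entries of each folder.
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    $paths = [];
    foreach ($it as $fileinfo) {           // each item is an SplFileInfo object
        if ($fileinfo->isFile()) {
            $paths[] = $fileinfo->getPathname();
        }
    }
    sort($paths);                          // iteration order depends on the filesystem
    return $paths;
}

// Demo on a throwaway tree under the system temp folder.
$root = sys_get_temp_dir() . '/walk_demo_' . uniqid();
mkdir($root . '/albums/discovery', 0777, true);
touch($root . '/readme.txt');
touch($root . '/albums/discovery/01 One More Time.mp3');

foreach (list_files($root) as $path) {
    // Print each path relative to the demo root.
    echo substr($path, strlen($root) + 1), "\n";
}
```

The nested file shows up in the same flat loop as the top-level one, which is the whole point of the wrapper.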
Detecting audio vs video by extension
The honest way to know if a file is audio or video is to look at its content — read the first few bytes and see what format it actually is. The pragmatic way is to look at the file extension. For a personal home server with files you put there yourself, the pragmatic way is more than fine. If you've named a music file "movie.mp4" then yeah, it'll get miscategorized, but also: who's doing that? Let's keep it simple.
$audio_exts = ['mp3', 'flac', 'wav', 'ogg', 'm4a', 'opus', 'aac'];
$video_exts = ['mp4', 'mkv', 'webm', 'avi', 'mov', 'm4v'];
$ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
if (in_array($ext, $audio_exts, true)) $type = 'audio';
elseif (in_array($ext, $video_exts, true)) $type = 'video';
else continue; // skip everything else
The strtolower handles the fact that some files are named "Song.MP3" with caps. pathinfo(..., PATHINFO_EXTENSION) is PHP's built-in for getting just the extension without the dot. Passing true as the third argument to in_array forces strict (===) comparison. Without it, in_array compares loosely with ==, and loose comparison sometimes does silly things (for example, the strings "1e1" and "10" count as equal because both look like numbers). Always pass true to in_array.
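The snippet above uses continue, so it only works inside a loop. Here is the same check wrapped in a function you can poke at on its own. The name media_type_for() is hypothetical; the extension lists are the ones from this chapter.

```php
<?php
declare(strict_types=1);

const AUDIO_EXTS = ['mp3', 'flac', 'wav', 'ogg', 'm4a', 'opus', 'aac'];
const VIDEO_EXTS = ['mp4', 'mkv', 'webm', 'avi', 'mov', 'm4v'];

// Hypothetical helper: 'audio', 'video', or null for anything we skip.
function media_type_for(string $filename): ?string
{
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    if (in_array($ext, AUDIO_EXTS, true)) return 'audio';
    if (in_array($ext, VIDEO_EXTS, true)) return 'video';
    return null;
}

var_dump(media_type_for('Song.MP3'));    // "audio": caps are lowercased first
var_dump(media_type_for('movie.mkv'));   // "video"
var_dump(media_type_for('notes.txt'));   // NULL: not in either list
var_dump(media_type_for('README'));      // NULL: no extension at all

// Why the third argument matters: numeric-looking strings
// are compared as numbers when the comparison is loose.
var_dump(in_array('1e1', ['10']));       // bool(true)
var_dump(in_array('1e1', ['10'], true)); // bool(false)
```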
Building the title from the filename
For now, the "title" of a track is just its filename without the extension. So "01 One More Time.mp3" becomes "01 One More Time". Not glamorous, but it gets us a working library immediately. In chapter 7 we'll extract real ID3 tag titles for music and prettier names for video — but we don't need to wait for that to start streaming things. Ship the rough thing now, polish later.
$title = pathinfo($filename, PATHINFO_FILENAME);
That's the whole "title generation" code for now. Good enough.
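A quick sketch of what that one-liner does with a few awkward names. The wrapper title_from_filename() is a made-up name, and the sample filenames are just examples.

```php
<?php
declare(strict_types=1);

// Hypothetical wrapper around the one-liner above.
function title_from_filename(string $filename): string
{
    // PATHINFO_FILENAME drops the folder part and the last extension only.
    return pathinfo($filename, PATHINFO_FILENAME);
}

echo title_from_filename('01 One More Time.mp3'), "\n";   // 01 One More Time
echo title_from_filename('Live.at.Wembley.flac'), "\n";   // Live.at.Wembley
echo title_from_filename('/media/music/Song.MP3'), "\n";  // Song
```

Only the final extension is removed, so dots inside a title survive.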
INSERT IGNORE — the dedup trick
Here's a small but lovely SQL pattern. When you run the scanner a second time, most of your files will already be in the database. We don't want errors. We don't want duplicates. We don't want to write a separate "does this row already exist?" check. MariaDB has us covered:
INSERT IGNORE INTO media (path, type, title, size_b, mime)
VALUES (:path, :type, :title, :size, :mime);
The IGNORE means: "if this insert would violate a unique constraint, silently skip it instead of erroring." Since we have UNIQUE KEY uq_path (path) on the path column, INSERT IGNORE silently skips files we've already seen. Re-running the scanner is now idempotent — same result whether you run it once or twenty times.
The trade-off: INSERT IGNORE downgrades OTHER errors to warnings too (like "you tried to insert NULL into a NOT NULL column"), so a bad row can slip in with an adjusted value instead of failing loudly. Slight risk. For our case, we're inserting known-good data, so the convenience wins.
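If you want to watch the dedup behaviour without touching your real database, here is a stand-in demo. Big assumption: it uses an in-memory SQLite database through PDO (needs the pdo_sqlite extension), not MariaDB, and SQLite spells the statement INSERT OR IGNORE. The cut-down table and the function name are invented for the demo; the idea is the same.

```php
<?php
declare(strict_types=1);

// Stand-in demo: in-memory SQLite, NOT MariaDB.
// SQLite's equivalent of INSERT IGNORE is INSERT OR IGNORE.
function demo_idempotent_insert(): array
{
    $pdo = new PDO('sqlite::memory:');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Cut-down version of the media table; the UNIQUE on path is the key part.
    $pdo->exec('CREATE TABLE media (
        id   INTEGER PRIMARY KEY,
        path TEXT NOT NULL UNIQUE,
        type TEXT NOT NULL
    )');

    $insert = $pdo->prepare(
        'INSERT OR IGNORE INTO media (path, type) VALUES (:path, :type)'
    );

    $counts = [];
    // The third path repeats the first one on purpose.
    foreach (['/m/a.mp3', '/m/b.mp3', '/m/a.mp3'] as $path) {
        $insert->execute(['path' => $path, 'type' => 'audio']);
        $counts[] = $insert->rowCount();   // 1 = inserted, 0 = skipped
    }

    $total = (int) $pdo->query('SELECT COUNT(*) FROM media')->fetchColumn();
    return ['counts' => $counts, 'total' => $total];
}

print_r(demo_idempotent_insert());
```

The repeated path should report a rowCount() of 0 and the table should end up with two rows, which is the same signal the scanner uses to count new files.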
Detecting the MIME type
The browser needs to know whether a file is an MP4 video or an MP3 audio so it can render the right player and decoder. We could guess from the extension again, but PHP has a better way:
$mime = mime_content_type($filepath); // "audio/mpeg", "video/mp4", etc.
This function (part of PHP's fileinfo extension) reads the actual file bytes (the magic numbers at the start of the file) and returns the real MIME type, or false if detection fails. More reliable than trusting the extension. Once we've got it, we store it in the database so we don't have to redetect on every stream request.
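Content sniffing can come back empty-handed, or with the generic "application/octet-stream" answer, so it helps to have a fallback. A minimal sketch, assuming a hypothetical helper detect_mime() and a small extension-to-MIME map that is illustrative, not exhaustive:

```php
<?php
declare(strict_types=1);

// Assumed fallback map for this sketch; extend it to match your library.
const FALLBACK_MIME = [
    'mp3'  => 'audio/mpeg',
    'flac' => 'audio/flac',
    'ogg'  => 'audio/ogg',
    'wav'  => 'audio/wav',
    'mp4'  => 'video/mp4',
    'webm' => 'video/webm',
    'mkv'  => 'video/x-matroska',
];

// Hypothetical helper: sniff the content first, fall back to the extension.
function detect_mime(string $path): string
{
    $sniffed = false;
    if (is_file($path) && function_exists('mime_content_type')) {
        $sniffed = @mime_content_type($path);   // false on failure
    }

    // Treat the generic binary answer as "don't know".
    if (is_string($sniffed) && $sniffed !== ''
        && $sniffed !== 'application/octet-stream') {
        return $sniffed;
    }

    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return FALLBACK_MIME[$ext] ?? 'application/octet-stream';
}

// The file does not exist, so the extension map answers.
echo detect_mime('/no/such/folder/track.flac'), "\n";   // audio/flac
```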
Build: The Scanner
Okay, time to write the actual scanner. This is going to feel satisfying — you'll point it at your music folder, run it, and watch the database fill up. Real "I made the thing" energy.
One small thing before we start: this CLI script runs as the user you're logged in as (probably erictey), not as www-data. That means it can read anywhere your user can read, but the web app won't be able to read those same files unless www-data also has access. We'll come back to permissions in the streaming chapter. For now, scan files that live under your own home folder and we'll be fine.
- Create /home/erictey/server/homestream/bin/scan.php.
- Paste this in:

#!/usr/bin/env php
<?php
declare(strict_types=1);

require __DIR__ . '/../lib/db.php';

// Folders to scan. Add more if your media lives in multiple places.
$roots = [
    '/home/erictey/media/music',
    '/home/erictey/media/movies',
];

$audio_exts = ['mp3','flac','wav','ogg','m4a','opus','aac'];
$video_exts = ['mp4','mkv','webm','avi','mov','m4v'];

$insert = db()->prepare("
    INSERT IGNORE INTO media (path, type, title, size_b, mime)
    VALUES (:path, :type, :title, :size, :mime)
");

$scanned = 0;
$added = 0;

foreach ($roots as $root) {
    if (!is_dir($root)) {
        echo " (skipping missing folder: $root)\n";
        continue;
    }
    echo "Scanning $root\n";

    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, RecursiveDirectoryIterator::SKIP_DOTS)
    );

    foreach ($it as $fileinfo) {
        if (!$fileinfo->isFile()) continue;

        $path = $fileinfo->getPathname();
        $name = $fileinfo->getFilename();
        $ext = strtolower(pathinfo($name, PATHINFO_EXTENSION));

        $type = null;
        if (in_array($ext, $audio_exts, true)) $type = 'audio';
        elseif (in_array($ext, $video_exts, true)) $type = 'video';
        if ($type === null) continue;

        $scanned++;
        $title = pathinfo($name, PATHINFO_FILENAME);
        $size = $fileinfo->getSize();
        $mime = mime_content_type($path) ?: ($type === 'audio' ? 'audio/mpeg' : 'video/mp4');

        $insert->execute([
            'path' => $path,
            'type' => $type,
            'title' => $title,
            'size' => $size,
            'mime' => $mime,
        ]);

        // rowCount() = 1 if inserted, 0 if skipped by IGNORE
        if ($insert->rowCount() === 1) {
            $added++;
            echo " + $title\n";
        }
    }
}

echo "\nDone. Saw $scanned media files. Added $added new ones.\n";

- Make it executable: chmod +x /home/erictey/server/homestream/bin/scan.php. The shebang at the top (#!/usr/bin/env php) means you can run it directly without typing "php" first.
- Drop at least one or two audio files into /home/erictey/media/music/. If you don't have any handy, grab any free Internet Archive track. You just need SOMETHING to scan.
- Run it: /home/erictey/server/homestream/bin/scan.php (or php /home/erictey/server/homestream/bin/scan.php if the executable bit didn't take).
- You should see output like "Scanning /home/erictey/media/music" then a line per file added, then a final summary count.
Stretch goals:
- Run the scanner a second time. Notice the added count is 0 — INSERT IGNORE is doing its job. Now drop a new file in and re-run. Only the new file shows up. That's the idempotency we wanted.
- Verify in MariaDB: sudo mariadb -e "SELECT id, type, title, size_b FROM homestream.media LIMIT 10;". You should see real rows with real data.
- Add a --dry-run flag that prints what it would insert without actually inserting. Useful for testing changes safely.
- Time the scan with time bin/scan.php. Note how long it takes per thousand files. If it ever gets slow, the bottleneck is almost certainly the mime_content_type call (it reads each file); you could replace it with extension-based detection for a speedup.
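If you want a head start on the --dry-run stretch goal, here is one possible shape for it. Everything here is a sketch: is_dry_run() and record() are hypothetical helpers, and the callable stands in for the real $insert->execute(...) call in scan.php.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: true when --dry-run appears among the CLI arguments.
function is_dry_run(array $argv): bool
{
    // Skip $argv[0], which is the script name.
    return in_array('--dry-run', array_slice($argv, 1), true);
}

// Hypothetical helper: either perform the insert or just describe it.
function record(array $row, bool $dryRun, callable $insert): string
{
    if ($dryRun) {
        return "would add: {$row['title']} ({$row['type']})";
    }
    $insert($row);
    return "+ {$row['title']}";
}

$dryRun = is_dry_run($argv ?? []);
$saved  = [];   // stands in for the database in this demo
$row    = ['path' => '/m/a.mp3', 'type' => 'audio', 'title' => 'a'];

echo record($row, $dryRun, function (array $r) use (&$saved): void {
    $saved[] = $r;   // in scan.php this would call $insert->execute($r)
}), "\n";
```

Running it as php demo.php --dry-run should print the "would add" line and leave $saved empty.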
What you flexed: CLI PHP scripts with shebangs, RecursiveDirectoryIterator for filesystem walking, pathinfo() for splitting filenames, prepared statements with named placeholders, INSERT IGNORE for idempotent imports, rowCount() to detect whether an insert actually happened. That's a real production-shape script you wrote.