🟢 Open to opportunities

Mykolas Perevicius

Full-Stack Engineer

I thrive at the intersection of backend rigor and frontend empathy, shipping production code that helps everyday people get more done with less friction.

Resume.doc

Microsoft Word 2003 • Double-click to open

class Engineer:
  def __init__(self):
    self.mode = "ship"
    self.focus = "impact"


Experience

Oct 2025 - Present

Software Engineer

UserAuthGuard by Asan Digital
  • Leading test-driven development initiative for production Django SaaS serving 17,000+ students across multiple school districts
  • Achieved 90% test coverage (from 51%) across 840+ Python files with comprehensive pytest-django behavioral test suites
  • Architected multi-tenant Chromebook management system with Google Workspace, Stripe, and Dell Warranty API integrations
  • Implemented real-time features using Django Channels (WebSockets) with Redis/Celery for async task processing
Django REST • pytest-django • PostgreSQL • Redis • Celery • Docker
Jun 2023 - Aug 2023

Software Engineer (Internship)

Bessemer Trust
  • Built securities analysis platform that reduced reconciliation time by 60% for 20+ wealth advisors
  • Optimized SQL Server queries and indexes, cutting report generation latency by 45%
  • Delivered AI Tech Talk to 100+ employees on practical ML applications
  • Developed and tested full-stack C#/.NET & React features for critical internal securities management platform
C# • .NET Core • React 18 • SQL Server • Redis
Dec 2022 - Present

Curriculum Developer & Instructor

The Coding Place
  • Created 117 Jupyter notebooks covering Python from basics to GPU programming
  • Achieved 90% student certification rate (PCEP/PCAP) across 120+ students
  • Designed and taught 30+ hours of comprehensive coursework covering Web Development, OOP, and Data Structures & Algorithms
  • Built automated testing pipeline with GitHub Actions for student code evaluation
Python • Jupyter • GitHub Actions • JavaScript • TypeScript
Dec 2021 - Present

Software Engineer

Project Innovate Newark
  • Developed and launched event-planning suite using React, Node.js, and PostgreSQL, reducing scheduling conflicts by 30% across 9+ programs
  • Implemented Docker for containerization, significantly reducing release cycles from days to hours
  • Established role-based access control and JWT authentication, achieving zero critical findings in penetration testing
  • Contributed to increasing data reporting efficiency by ~30% through modular UI elements and RESTful API integration
React • Redux • Node.js • PostgreSQL • Docker • AWS
Sep 2021 - Sep 2022

Research Intern

Bergen Community College
  • Prototyped embedded systems incorporating GUIs and machine vision capabilities for robotics applications
  • Engineered CNN-powered recycling-bin prototype attaining 95% material-detection accuracy
  • Built Arduino-based EV GUI that boosted converted pickup truck range by approximately 20%
  • Systematically collected, cleaned, processed, and analyzed experimental data using Python libraries
Python • Arduino • CNN • Computer Vision

Featured Projects

Internet Explorer - Projects Portfolio
https://perevici.us/projects/all

๐Ÿจ Koala's Forge

The free, powerful alternative to Ninite for 2025. Cross-platform system installer supporting 100+ applications. Born from losing access to all three machines in one weekend.

100+ Apps • Cross-platform
PowerShell • Bash • Python


✨ Key Features

Automated package management across Windows, Mac, and Linux
Silent installations with customizable configurations
Dependency resolution and conflict detection
Rollback support for failed installations

⚡ Distributed AlexNet

Custom CUDA kernels and MPI orchestration for AlexNet. Achieved 4.6× speedup over baseline PyTorch on DGX-1 cluster.

4.6× Faster • Multi-GPU
CUDA C++ • MPI • PyTorch

🚀 Ultimate System Setup

AI Lab + Complete Dev Environment automation. Hardware detection, system optimizations, and 100+ app installations.

100+ Tools • Auto-config
Shell • Docker • Linux

🎓 Education Playground

Interactive Python learning platform with 117 Jupyter notebooks covering basics to GPU programming. Live learning platform with 90% student pass rate.

117 Notebooks • 🚀 LIVE
Python • Jupyter • Education

🎵 Melody Matcher

Interactive music puzzle game built for GirlHacks 2024. Listen to song snippets, group them together, and find the right song!

Canvas Game • 🚀 LIVE
Paper.js • Howler.js • Game Dev

โ™ป๏ธ Smart Recycling Bin

CNN-powered recycling bin prototype with 95% material detection accuracy. Integrated Arduino-based sorting mechanism.

95% Accuracy • Real-time
TensorFlow • Arduino • Computer Vision

Technical Arsenal

Languages

Python
C/C++
Java
C#
JavaScript/TypeScript
Go
Rust

Frameworks

Django REST
Spring Boot
.NET Core
React 18
Node.js
Express

Infrastructure

Docker
Kubernetes
AWS (Lambda, ECS, RDS)
GitHub Actions
Jenkins
Linux

Specialized

CUDA Programming
MPI
PyTorch
TensorFlow
Machine Learning
Computer Vision

Education

New Jersey Institute of Technology

Sep 2021 - Dec 2025

B.S. Computer Science • Dean's List

Key Coursework: GPU Cluster Programming, Compiler Design, Machine Learning, Operating Systems, Advanced Data Structures & Algorithms, Database Systems Design, Programming Languages Concepts

Technical Writing

Achieving 4.6× Speedup: Custom CUDA Kernels for AlexNet

December 2024 • GPU Computing • CUDA

When PyTorch's default kernels aren't fast enough, you write your own. Here's how I optimized AlexNet training on an NVIDIA DGX-1 cluster using custom CUDA kernels and MPI orchestration, the architectural decisions that mattered most, and what I learned about memory coalescing the hard way.

⚡ 4.6× faster than baseline PyTorch
🔧 Custom memory management patterns
📊 Multi-GPU orchestration with MPI

Why I Built Yet Another System Installer (And Why You Might Need One Too)

November 2024 • DevOps • Automation

Losing access to three machines in one weekend taught me something: your development environment should be reproducible in under an hour. Koala's Forge is my answer to Ninite's $30/year subscription: a free, cross-platform installer supporting 100+ apps with silent installations, dependency resolution, and rollback support.

๐Ÿจ 100+ applications supported
๐Ÿ”„ Automatic dependency resolution
โช Rollback failed installations

The Hidden Cost of Convenience: When Abstractions Leak Performance

October 2024 • Performance • Systems

High-level frameworks are amazing, until they're not. A deep dive into when PyTorch's conveniences become bottlenecks, why dropping to CUDA gave me 4.6× speedup, and how to know when it's time to stop using abstractions and start writing assembly (or close to it). Plus: the mental model I use to decide when optimization is premature vs. necessary.

🎯 When to optimize (and when not to)
⚙️ Understanding abstraction overhead
🔍 Profiling strategies that actually work

Breaking Things to Understand Them: A Weekend with Distributed Consensus

September 2024 • Distributed Systems • Learning

I spent a weekend intentionally breaking Raft consensus to understand how it works. What happens when you introduce network partitions? What if followers lie about their log indices? Can you make the cluster elect two leaders? Turns out, theoretical correctness and practical resilience are two very different things.

🔨 Breaking Raft in creative ways
🧪 Network partition experiments
💡 What textbooks don't tell you

Teaching Python by Building Games: Why Education Playground Works

August 2024 • Education • Python

117 interactive Jupyter notebooks covering Python from basics to GPU programming. The secret? Every concept is taught through building something you can actually see and interact with. No "hello world" tutorials here; we're building games, visualizations, and tools from day one. Here's why learning by building beats learning by reading.

🎮 Learn by building games
📚 117 hands-on lessons
🚀 Basics to GPU programming

Achieving 4.6× Speedup: Custom CUDA Kernels for AlexNet

December 2024 • GPU Computing • CUDA • MPI

The Problem: PyTorch Was Too Slow

Training AlexNet on ImageNet should be fast: we're running on an NVIDIA DGX-1 with 8× V100 GPUs. But PyTorch's default kernels weren't cutting it. Profiling showed we were spending 40% of our time in convolution operations, and another 30% just moving data between GPUs.

That's when I decided: if the framework won't give me the performance I need, I'll write my own kernels.

Memory Coalescing: The 10× Difference

The first major optimization came from fixing memory access patterns. GPU memory is fast, but only if you access it correctly. Here's the problem with naive implementations:

// Simulating bad vs good memory access patterns
// Run this to see the performance difference!
console.log("=== Memory Access Pattern Demo ===\n");

// BAD: Non-coalesced access - writes every element, but jumps through
// memory in strides of 8 instead of sequentially
function badMemoryAccess(size) {
  const arr = new Float32Array(size);
  const stride = 8;
  const start = performance.now();
  for (let offset = 0; offset < stride; offset++) {
    for (let i = offset; i < size; i += stride) {
      arr[i] = i * 2;
    }
  }
  return performance.now() - start;
}

// GOOD: Coalesced access (sequential)
function goodMemoryAccess(size) {
  const arr = new Float32Array(size);
  const start = performance.now();
  for (let i = 0; i < size; i++) { // Sequential
    arr[i] = i * 2;
  }
  return performance.now() - start;
}

const size = 10000000;
const badTime = badMemoryAccess(size);
const goodTime = goodMemoryAccess(size);

console.log(`Bad (strided): ${badTime.toFixed(2)}ms`);
console.log(`Good (sequential): ${goodTime.toFixed(2)}ms`);
console.log(`Speedup: ${(badTime / goodTime).toFixed(2)}x\n`);
console.log("On a GPU, this difference is even more extreme!");
console.log("Coalesced memory = 10x faster in real CUDA code.");

Custom Convolution Kernel

The core of the speedup came from a custom convolution kernel that:

  • Uses shared memory to reduce global memory accesses by 80%
  • Coalesces memory reads for 10× bandwidth improvement
  • Optimizes thread block size for V100's SM architecture
  • Fuses operations to eliminate intermediate buffers

Here's a simplified version of the core convolution logic:

// Simplified 2D convolution (concept demo in JS)
// Real CUDA version is in C++ with __shared__ memory
function conv2D(input, kernel, inputSize, kernelSize) {
  const outputSize = inputSize - kernelSize + 1;
  const output = new Float32Array(outputSize * outputSize);

  console.log(`Input: ${inputSize}x${inputSize}`);
  console.log(`Kernel: ${kernelSize}x${kernelSize}`);
  console.log(`Output: ${outputSize}x${outputSize}\n`);

  // Convolution operation
  for (let oy = 0; oy < outputSize; oy++) {
    for (let ox = 0; ox < outputSize; ox++) {
      let sum = 0;
      // Apply kernel
      for (let ky = 0; ky < kernelSize; ky++) {
        for (let kx = 0; kx < kernelSize; kx++) {
          const ix = ox + kx;
          const iy = oy + ky;
          sum += input[iy * inputSize + ix] * kernel[ky * kernelSize + kx];
        }
      }
      output[oy * outputSize + ox] = sum;
    }
  }
  return output;
}

// Example: 5x5 input with 3x3 kernel
const input = new Float32Array([
  1, 2, 3, 4, 5,
  2, 3, 4, 5, 6,
  3, 4, 5, 6, 7,
  4, 5, 6, 7, 8,
  5, 6, 7, 8, 9
]);

const kernel = new Float32Array([
  1, 0, -1,
  1, 0, -1,
  1, 0, -1
]);

const result = conv2D(input, kernel, 5, 3);
console.log("Convolution result:");
for (let y = 0; y < 3; y++) {
  const row = [];
  for (let x = 0; x < 3; x++) {
    row.push(result[y * 3 + x].toFixed(1));
  }
  console.log(row.join(" "));
}
console.log("\nIn CUDA: This runs in parallel across 1000s of threads!");

Multi-GPU with MPI

Once a single GPU was optimized, scaling to 8 GPUs required careful orchestration with MPI (Message Passing Interface). The key challenges:

"The fastest code in the world is useless if you spend all your time waiting for data to transfer between GPUs."

My solution used:

  • Asynchronous communication - overlap compute with transfers
  • Ring-allreduce for gradient synchronization
  • NCCL for fast GPU-to-GPU communication
  • Pipeline parallelism to keep all GPUs busy
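The ring-allreduce step above can be sketched in miniature. Below is a toy, single-process JavaScript model of it; the real version runs over NCCL/MPI across physical GPUs, so the array-of-arrays "GPUs" here are purely illustrative. Each node holds a gradient vector, and after a reduce-scatter phase and an all-gather phase every node holds the element-wise sum, while each step only exchanges one chunk with its ring neighbor.

```javascript
// Toy single-process sketch of ring-allreduce (illustrative only).
// grads: one gradient array per "GPU"; length must be divisible by n.
function ringAllreduce(grads) {
  const n = grads.length;            // number of GPUs in the ring
  const chunk = grads[0].length / n; // each GPU "owns" one chunk
  const buf = grads.map(g => g.slice());
  const idx = c => ((c % n) + n) % n;

  // Reduce-scatter: n-1 steps. At step s, GPU i sends chunk (i - s)
  // to its right neighbor, which accumulates it into its copy.
  for (let s = 0; s < n - 1; s++) {
    const msgs = buf.map((b, i) => {
      const c = idx(i - s);
      return { to: idx(i + 1), c, data: b.slice(c * chunk, (c + 1) * chunk) };
    });
    for (const m of msgs) {
      for (let k = 0; k < chunk; k++) buf[m.to][m.c * chunk + k] += m.data[k];
    }
  }

  // All-gather: n-1 steps. Fully reduced chunks circulate around the
  // ring, overwriting stale copies, until every GPU has all of them.
  for (let s = 0; s < n - 1; s++) {
    const msgs = buf.map((b, i) => {
      const c = idx(i + 1 - s);
      return { to: idx(i + 1), c, data: b.slice(c * chunk, (c + 1) * chunk) };
    });
    for (const m of msgs) {
      for (let k = 0; k < chunk; k++) buf[m.to][m.c * chunk + k] = m.data[k];
    }
  }
  return buf;
}

// 3 "GPUs" with 3-element gradients: every GPU ends with [12, 15, 18]
const reduced = ringAllreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]);
console.log(reduced);
```

The appeal of the ring layout is that per-step traffic per link is constant regardless of cluster size, which is why it overlaps so well with compute.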

Results: 4.6× Faster

Final benchmarks on ImageNet training:

  • Baseline PyTorch: 3.2 hours/epoch
  • Custom CUDA: 0.7 hours/epoch
  • Speedup: 4.6×

The biggest lesson? Frameworks are great until they're not. When you need maximum performance, understanding what's happening at the hardware level and being willing to drop down to custom kernels makes all the difference.

Try It Yourself

Want to explore the full implementation? Check out the repository:

View Full Project on GitHub →

Why I Built Yet Another System Installer (And Why You Might Need One Too)

November 2024 • DevOps • Automation • PowerShell

The Weekend Everything Broke

Friday evening: Laptop dies. Saturday morning: Desktop BSOD. Sunday afternoon: My GPU workstation refused to boot. Three machines, one weekend, zero backups of my development environment.

I spent 18 hours reinstalling tools. VSCode. Python. Node. Docker. Postgres. Git. CUDA toolkit. The list went on. By hour 12, I was searching for "Ninite alternatives 2024" and discovering that the free version only supports 15 apps, while Ninite Pro costs $30/year. For something I might use twice a year.

That's when I decided: I'm building my own.

What Is Koala's Forge?

Koala's Forge is a free, open-source system installer that supports 100+ applications across Windows, macOS, and Linux. Think Ninite, but cross-platform, completely free, and designed for developers who need GPU drivers, compilers, and dev tools that "mainstream" installers ignore.

// Simplified version of the core installation logic
// Real implementation is in PowerShell/Bash with robust error handling
class PackageInstaller {
  constructor() {
    this.packages = new Map();
    this.dependencies = new Map();
  }

  // Register a package with its metadata
  addPackage(name, config) {
    this.packages.set(name, {
      name,
      url: config.url,
      installer: config.installer,
      depends: config.depends || [],
      postInstall: config.postInstall || null
    });

    // Build dependency graph
    (config.depends || []).forEach(dep => {
      if (!this.dependencies.has(dep)) {
        this.dependencies.set(dep, []);
      }
      this.dependencies.get(dep).push(name);
    });
  }

  // Topological sort for dependency resolution
  resolveDependencies(packageName) {
    const visited = new Set();
    const stack = [];

    const visit = (pkg) => {
      if (visited.has(pkg)) return;
      visited.add(pkg);

      const packageInfo = this.packages.get(pkg);
      if (!packageInfo) {
        throw new Error(`Package ${pkg} not found`);
      }

      // Visit dependencies first
      packageInfo.depends.forEach(dep => visit(dep));
      stack.push(pkg);
    };

    visit(packageName);
    return stack;
  }

  // Install packages in correct order
  async installPackages(packageNames) {
    const allPackages = new Set();

    // Collect all packages and dependencies
    packageNames.forEach(name => {
      const deps = this.resolveDependencies(name);
      deps.forEach(pkg => allPackages.add(pkg));
    });

    console.log(`Installing ${allPackages.size} packages (including dependencies)...\n`);

    // Install in dependency order
    for (const pkgName of allPackages) {
      const pkg = this.packages.get(pkgName);
      console.log(`[${Array.from(allPackages).indexOf(pkgName) + 1}/${allPackages.size}] Installing ${pkg.name}...`);

      // Simulate installation
      await this.simulateInstall(pkg);

      if (pkg.postInstall) {
        console.log(`  Running post-install script for ${pkg.name}`);
      }
    }

    console.log(`\n✓ All packages installed successfully!`);
  }

  async simulateInstall(pkg) {
    // In real implementation: download, verify checksum, run installer
    return new Promise(resolve => setTimeout(resolve, 100));
  }
}

// Example usage
const installer = new PackageInstaller();

// Register packages with dependencies
installer.addPackage('python', {
  url: 'https://python.org/downloads/latest',
  installer: 'python-installer.exe',
  postInstall: () => console.log('Adding Python to PATH')
});

installer.addPackage('pip', {
  url: 'https://bootstrap.pypa.io/get-pip.py',
  depends: ['python']
});

installer.addPackage('jupyter', {
  url: 'pypi://jupyter',
  depends: ['python', 'pip'],
  postInstall: () => console.log('Configuring Jupyter kernel')
});

installer.addPackage('vscode', {
  url: 'https://code.visualstudio.com/download',
  installer: 'vscode-installer.exe'
});

// Install Jupyter (automatically resolves Python + pip)
installer.installPackages(['jupyter', 'vscode']);

Key Features

What makes Koala's Forge different from existing solutions?

1. Dependency Resolution

Want to install Jupyter? The installer knows you need Python and pip first. It builds a dependency graph and installs everything in the correct order. No more "ERROR: Python not found" halfway through your setup.

2. Silent Installations

Every package supports silent installation flags. Start the script, go make coffee, come back to a fully configured system. No clicking "Next" 47 times.
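As a sketch of what "silent" means in practice, here is a hypothetical command builder. The package IDs and the mapping to a manager are illustrative, not Koala's Forge's actual tables; the flags themselves are each package manager's documented non-interactive flags (winget `--silent`, Chocolatey `-y`, apt-get `-y`; Homebrew installs non-interactively by default).

```javascript
// Hypothetical sketch: build a non-interactive install command per platform.
// Package names below are examples, not a real Koala's Forge catalog.
function silentInstallCommand(pkg, platform) {
  switch (platform) {
    case 'windows-winget':
      return `winget install --id ${pkg} --silent --accept-package-agreements`;
    case 'windows-choco':
      return `choco install ${pkg} -y`;
    case 'linux-apt':
      return `sudo apt-get install -y ${pkg}`;
    case 'macos-brew':
      return `brew install ${pkg}`; // brew prompts are rare; default is non-interactive
    default:
      throw new Error(`Unsupported platform: ${platform}`);
  }
}

console.log(silentInstallCommand('Git.Git', 'windows-winget'));
console.log(silentInstallCommand('docker', 'linux-apt'));
```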

3. Rollback Support

If an installation fails (corrupted download, disk full, dependency conflict), Koala's Forge can roll back to the previous state. Checkpoints are created before each package installation.

// Rollback mechanism (simplified)
class InstallationManager {
  constructor() {
    this.checkpoints = [];
    this.installedPackages = [];
  }

  createCheckpoint() {
    const checkpoint = {
      timestamp: new Date(),
      packages: [...this.installedPackages],
      state: 'saved'
    };
    this.checkpoints.push(checkpoint);
    console.log(`Checkpoint created: ${checkpoint.packages.length} packages`);
    return checkpoint;
  }

  async installWithRollback(packageName) {
    // Create checkpoint before installation
    const checkpoint = this.createCheckpoint();

    try {
      console.log(`Installing ${packageName}...`);

      // Simulate installation that might fail
      const success = Math.random() > 0.3; // 70% success rate for demo
      if (!success) {
        throw new Error('Installation failed: Download corrupted');
      }

      this.installedPackages.push(packageName);
      console.log(`✓ ${packageName} installed successfully`);
    } catch (error) {
      console.error(`✗ Installation failed: ${error.message}`);
      console.log('Rolling back to previous checkpoint...');

      // Rollback to checkpoint
      this.rollback(checkpoint);
    }
  }

  rollback(checkpoint) {
    const packagesToRemove = this.installedPackages.filter(
      pkg => !checkpoint.packages.includes(pkg)
    );
    packagesToRemove.forEach(pkg => {
      console.log(`  Removing ${pkg}...`);
    });
    this.installedPackages = [...checkpoint.packages];
    console.log(`✓ Rolled back to checkpoint (${checkpoint.packages.length} packages)`);
  }
}

// Demo: Try installing packages with potential failures
const manager = new InstallationManager();

async function demo() {
  await manager.installWithRollback('Git');
  await manager.installWithRollback('Docker');
  await manager.installWithRollback('Node.js');
  await manager.installWithRollback('VSCode');
  console.log(`\nFinal installed packages: ${manager.installedPackages.join(', ')}`);
}

demo();

Real-World Impact

Since releasing Koala's Forge:

  • Setup time: 18 hours → 45 minutes for a full dev environment
  • 100+ packages supported across Windows, macOS, Linux
  • Zero manual downloads - everything is scripted
  • Reproducible environments - same setup on every machine
"Your development environment should be code, not a 47-step manual process you hope you remember correctly."

Try It Yourself

Koala's Forge is completely free and open source. If you're tired of reinstalling your dev environment from scratch every time you get a new machine, give it a try:

View Project on GitHub →

The Hidden Cost of Convenience: When Abstractions Leak Performance

October 2024 • Performance • Systems • Optimization

The Performance Paradox

High-level frameworks are amazing. PyTorch lets you prototype a neural network in 10 lines of Python. Django abstracts away SQL injection nightmares. React handles DOM updates for you.

But here's the catch: every abstraction has a cost. Sometimes that cost is invisible. Sometimes it's 4.6× slower than it needs to be.

"Premature optimization is the root of all evil." โ€” Donald Knuth

Everyone quotes Knuth. Few people finish the quote: "Yet we should not pass up our opportunities in that critical 3%." The trick is knowing when you're in that 3%.

When Abstractions Break Down

Let me tell you about the time PyTorch cost me 18 hours of compute time per epoch.

I was training AlexNet on ImageNet using PyTorch's built-in convolution layers. Standard stuff. The code was clean, the API was beautiful, and the performance was... terrible.

Profiling showed 40% of time in convolution operations. Not unexpected. But when I dug deeper, I found PyTorch was using a generic convolution kernel that worked for any input size, any stride, any padding. Generality has a price.

// Generic vs Specialized: A Simple Example
// This demonstrates why specialized code beats generic code
console.log("=== Generic vs Specialized Performance ===\n");

// GENERIC: Works for any array, any operation
function genericMapReduce(arr, mapFn, reduceFn, initial) {
  return arr.map(mapFn).reduce(reduceFn, initial);
}

// SPECIALIZED: Only sums squares, but optimized
function sumOfSquares(arr) {
  let sum = 0;
  for (let i = 0; i < arr.length; i++) {
    const val = arr[i];
    sum += val * val; // Inlined, no function calls
  }
  return sum;
}

// Benchmark
const data = Array.from({length: 1000000}, (_, i) => i);

console.time('Generic approach');
const result1 = genericMapReduce(
  data,
  x => x * x,
  (acc, x) => acc + x,
  0
);
console.timeEnd('Generic approach');

console.time('Specialized approach');
const result2 = sumOfSquares(data);
console.timeEnd('Specialized approach');

console.log(`\nBoth compute the same result: ${result1 === result2}`);
console.log(`\nThe specialized version is faster because:`);
console.log(`  - No intermediate array allocation`);
console.log(`  - No function call overhead`);
console.log(`  - Better cache locality`);
console.log(`  - Compiler can optimize the simple loop`);

My Mental Model for Optimization

Over the years, I've developed a framework for deciding when to optimize:

Phase 1: Make It Work

Use the highest-level abstraction available. PyTorch, not CUDA. React, not vanilla DOM manipulation. Correctness first. You can't optimize code that doesn't work.

Phase 2: Measure Everything

Profile before optimizing. Not guessing, not hunches. Data.

  • cProfile for Python CPU profiling
  • nvidia-smi for GPU utilization
  • PyTorch's built-in profiler for kernel-level metrics
  • perf for low-level CPU counters

If your code spends 90% of time in function A and 10% in function B, optimizing B is wasted effort.

// Profiling: Where Does The Time Go?
class Profiler {
  constructor() {
    this.timings = new Map();
  }

  profile(name, fn) {
    const start = performance.now();
    const result = fn();
    const elapsed = performance.now() - start;

    if (!this.timings.has(name)) {
      this.timings.set(name, []);
    }
    this.timings.get(name).push(elapsed);
    return result;
  }

  report() {
    console.log("\n=== Profiling Report ===\n");
    const totals = new Map();
    let grandTotal = 0;

    for (const [name, times] of this.timings) {
      const total = times.reduce((a, b) => a + b, 0);
      totals.set(name, total);
      grandTotal += total;
    }

    // Sort by total time
    const sorted = [...totals.entries()].sort((a, b) => b[1] - a[1]);

    sorted.forEach(([name, total]) => {
      const percentage = ((total / grandTotal) * 100).toFixed(1);
      const avgTime = (total / this.timings.get(name).length).toFixed(2);
      const calls = this.timings.get(name).length;
      console.log(`${name}:`);
      console.log(`  Total: ${total.toFixed(2)}ms (${percentage}%)`);
      console.log(`  Calls: ${calls}`);
      console.log(`  Avg: ${avgTime}ms/call\n`);
    });

    console.log(`Grand Total: ${grandTotal.toFixed(2)}ms`);
  }
}

// Example: Profile a simulated application
const profiler = new Profiler();

function simulateDataProcessing() {
  // Simulate expensive database query
  profiler.profile('Database Query', () => {
    const arr = new Array(1000000);
    for (let i = 0; i < arr.length; i++) arr[i] = Math.random();
    return arr;
  });

  // Simulate data transformation
  profiler.profile('Data Transform', () => {
    return new Array(500000).fill(0).map(Math.random);
  });

  // Simulate rendering (fast)
  profiler.profile('Rendering', () => {
    return Math.sqrt(Math.random());
  });
}

// Run multiple times
for (let i = 0; i < 5; i++) {
  simulateDataProcessing();
}

profiler.report();
console.log("\n💡 Key Insight: Focus optimization on the top time consumers!");

Phase 3: Optimize The Hot Path

Once you know where the time goes, you have options:

  1. Algorithmic improvement - O(n²) to O(n log n) beats any micro-optimization
  2. Remove abstraction layers - Drop from framework to library to raw implementation
  3. Specialize - Trade generality for speed
  4. Parallel execution - Use multiple cores / GPUs

For AlexNet, I did #2 and #3: custom CUDA kernels specialized for the exact layer dimensions I was using. No generality tax. 4.6× speedup.
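To make option #1 concrete, here is a small benchmark of the same task ("does this array contain a duplicate?") at O(n²) and O(n). No micro-tuning of the nested loop will ever catch the hash-set version on large inputs:

```javascript
// Option 1 in practice: same task, two complexity classes.
// O(n^2): compare every pair
function hasDuplicateQuadratic(arr) {
  for (let i = 0; i < arr.length; i++) {
    for (let j = i + 1; j < arr.length; j++) {
      if (arr[i] === arr[j]) return true;
    }
  }
  return false;
}

// O(n): remember what we've seen in a hash set
function hasDuplicateLinear(arr) {
  const seen = new Set();
  for (const x of arr) {
    if (seen.has(x)) return true;
    seen.add(x);
  }
  return false;
}

// Worst case for both: no duplicates at all
const values = Array.from({ length: 20000 }, (_, i) => i);

console.time('O(n^2) nested loop');
hasDuplicateQuadratic(values);
console.timeEnd('O(n^2) nested loop');

console.time('O(n) hash set');
hasDuplicateLinear(values);
console.timeEnd('O(n) hash set');
```

The gap widens quadratically with input size, which is why the algorithmic fix comes before any of the lower-level options.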

When NOT to Optimize

Here's what I've learned about when to leave abstractions alone:

  • It's not the bottleneck - If profiling shows it's 2% of runtime, don't touch it
  • You're still prototyping - Code that might be deleted tomorrow doesn't need optimization
  • The abstraction prevents bugs - parameterized queries > raw string concatenation, always
  • Maintenance cost is high - Custom CUDA kernels are 10× harder to debug than PyTorch
"Make it work, make it right, make it fast โ€” in that order."

The Takeaway

Abstractions are incredible. They let us build faster, with fewer bugs, standing on the shoulders of giants. But when performance matters, you need to know:

  • Where your code spends time (profiling)
  • Why it's slow (understanding the abstraction cost)
  • When to drop down a level (is this the critical 3%?)

The best engineers aren't the ones who write the fastest code. They're the ones who know when fast enough is fast enough, and when to roll up their sleeves and write CUDA kernels.

Want to see the full CUDA optimization story? Check out my AlexNet project:

View Full CUDA Project →

Breaking Things to Understand Them: A Weekend with Distributed Consensus

September 2024 • Distributed Systems • Raft • Learning

The Best Way to Learn: Break It On Purpose

Reading the Raft paper is one thing. Understanding how distributed consensus actually works when networks partition, nodes lie, and Murphy's Law is in full effect? That requires getting your hands dirty.

So I spent a weekend intentionally breaking Raft to see what would happen.

"If you want to understand how something works, try to break it. If you want to master it, try to break it in creative ways."

Experiment 1: Can We Elect Two Leaders?

Raft's safety property guarantees at most one leader per term. But what if we try really hard to break it?

I modified a Raft implementation to introduce a "malicious node" that lies about its log index during elections. The node claims to have a longer log than it actually does, trying to win elections it shouldn't.

// Simplified Raft Election Simulation
// This demonstrates leader election with a malicious node
class RaftNode {
  constructor(id, logLength) {
    this.id = id;
    this.logLength = logLength;
    this.currentTerm = 0;
    this.votedFor = null;
    this.isMalicious = false;
  }

  // Request vote from this node
  requestVote(candidateId, candidateTerm, candidateLogLength) {
    // Update term if candidate has higher term
    if (candidateTerm > this.currentTerm) {
      this.currentTerm = candidateTerm;
      this.votedFor = null;
    }

    // Already voted in this term
    if (this.votedFor !== null && this.votedFor !== candidateId) {
      return false;
    }

    // Candidate's log must be at least as up-to-date
    if (candidateLogLength >= this.logLength) {
      this.votedFor = candidateId;
      return true;
    }
    return false;
  }

  // Try to become leader (malicious nodes lie about log length)
  runForLeader(nodes) {
    this.currentTerm++;
    this.votedFor = this.id;

    const reportedLogLength = this.isMalicious
      ? 9999            // LIE: claim to have huge log
      : this.logLength; // Tell the truth

    let votes = 1; // Vote for self
    nodes.forEach(node => {
      if (node.id !== this.id) {
        const granted = node.requestVote(this.id, this.currentTerm, reportedLogLength);
        if (granted) votes++;
      }
    });

    const majority = Math.floor(nodes.length / 2) + 1;
    return votes >= majority;
  }
}

// Create cluster: 5 nodes with different log lengths
const nodes = [
  new RaftNode('A', 10),
  new RaftNode('B', 8),
  new RaftNode('C', 12), // Most up-to-date
  new RaftNode('D', 7),
  new RaftNode('E', 5)   // Malicious with short log
];
nodes[4].isMalicious = true; // Node E lies about log

console.log("=== Raft Election with Malicious Node ===\n");

// Honest node C tries to become leader
console.log("Node C (log=12, honest) runs for leader:");
const cWins = nodes[2].runForLeader(nodes);
console.log(`  Result: ${cWins ? 'ELECTED' : 'FAILED'}`);

// Reset votes
nodes.forEach(n => { n.votedFor = null; n.currentTerm = 0; });

// Malicious node E tries to become leader
console.log("\nNode E (log=5, MALICIOUS claims log=9999) runs for leader:");
const eWins = nodes[4].runForLeader(nodes);
console.log(`  Result: ${eWins ? 'ELECTED (BAD!)' : 'FAILED'}`);

console.log("\n💡 A lie about log length can win the election itself,");
console.log("   but replication exposes it: followers reject entries that");
console.log("   don't match their own logs, forcing a new election.");
console.log("   But what if we had network partitions...?");

Result: Raft held up. The malicious node won elections because other nodes believed its lie, but as soon as it tried to replicate log entries, followers rejected them because the logs didn't match. The cluster detected the inconsistency and held another election.

Lesson learned: Raft's safety doesn't rely on nodes being honest. It relies on detecting inconsistencies.

Experiment 2: Network Partitions (Split Brain)

The classic distributed systems nightmare: what happens when the network splits the cluster in half?

I simulated a 5-node cluster with nodes [A, B, C, D, E]. Then introduced a network partition that split them into:

  • Partition 1: [A, B, C] (3 nodes, can form quorum)
  • Partition 2: [D, E] (2 nodes, cannot form quorum)
// Network Partition Simulation
class Cluster {
  constructor(nodeIds) {
    this.nodes = nodeIds.map(id => ({
      id,
      term: 0,
      leader: false,
      canCommunicate: new Set(nodeIds) // Initially all can talk
    }));
  }

  // Introduce network partition
  partition(group1, group2) {
    console.log(`\n🔪 NETWORK PARTITION:`);
    console.log(`  Group 1: [${group1.join(', ')}]`);
    console.log(`  Group 2: [${group2.join(', ')}]`);

    this.nodes.forEach(node => {
      if (group1.includes(node.id)) {
        node.canCommunicate = new Set(group1);
      } else {
        node.canCommunicate = new Set(group2);
      }
    });
  }

  // Try to elect leader in a group
  electLeader(candidateId) {
    const candidate = this.nodes.find(n => n.id === candidateId);
    const reachableNodes = this.nodes.filter(n =>
      candidate.canCommunicate.has(n.id)
    );

    candidate.term++;
    let votes = 1; // Self-vote

    reachableNodes.forEach(node => {
      if (node.id !== candidateId) {
        // Simple voting: grant if same partition
        if (node.canCommunicate.has(candidateId)) {
          votes++;
        }
      }
    });

    const totalClusterSize = this.nodes.length;
    const majority = Math.floor(totalClusterSize / 2) + 1;
    const elected = votes >= majority;

    console.log(`\n${candidateId} election attempt:`);
    console.log(`  Votes: ${votes}/${totalClusterSize}`);
    console.log(`  Majority needed: ${majority}`);
    console.log(`  Result: ${elected ? '✓ ELECTED' : '✗ FAILED'}`);

    if (elected) {
      candidate.leader = true;
    }
    return elected;
  }
}

// Create 5-node cluster
const cluster = new Cluster(['A', 'B', 'C', 'D', 'E']);

console.log("=== Distributed Consensus Under Partition ===");
console.log("\nInitial: All nodes can communicate");

// No partition: A wins
cluster.electLeader('A');

// Introduce partition
cluster.partition(['A', 'B', 'C'], ['D', 'E']);

// Partition 1 (3 nodes) can elect leader
cluster.electLeader('B');

// Partition 2 (2 nodes) CANNOT elect leader
cluster.electLeader('D');

console.log("\n💡 Key Insight: Majority quorum prevents split-brain!");
console.log("   Minority partition can't make progress = safety preserved");

Result: Partition 1 elected a new leader and continued operating. Partition 2 could not reach quorum and remained stuck in follower state, unable to elect a leader or accept writes.

Key insight: This is by design. Raft sacrifices availability in the minority partition to preserve consistency. Better to have no leader than two leaders.

Experiment 3: Log Conflicts

What happens when network partitions heal and nodes have conflicting logs?

I set up a scenario where:

  1. Cluster operates normally, leader is A
  2. Network partition splits [A, B] from [C, D, E]
  3. Partition [C, D, E] elects new leader C in term 2
  4. Both partitions accept writes to their logs
  5. Network partition heals

Now we have divergent logs. Nodes A and B have entries from term 1. Nodes C, D, E have entries from term 2.

"When partitions heal, Raft doesn't try to merge divergent histories. It picks one truth and makes everyone agree."

The node with the higher term number (C with term 2) becomes the authority. Nodes A and B discard their uncommitted entries from term 1 and sync with C's log.

This is brutal but correct. Raft guarantees linearizability: if a client got an acknowledgment, that write is durable. If they didn't get an ack, it might be lost. No maybes.
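The reconciliation step above can be sketched in a few lines. This is my own illustration of the core idea behind Raft's AppendEntries consistency check, not the exact mechanism: the follower walks forward until its log disagrees with the leader's, truncates everything from the first conflict, and adopts the leader's remaining entries. The `{ term, cmd }` entry shape is an assumption for the demo.

```javascript
// Sketch of log reconciliation after a partition heals (illustrative only).
// The leader's log is authoritative; the follower truncates its conflicting
// suffix and appends the leader's tail.
function reconcile(leaderLog, followerLog) {
  let i = 0;
  // Walk forward while both logs agree (same term at the same index)
  while (
    i < leaderLog.length &&
    i < followerLog.length &&
    leaderLog[i].term === followerLog[i].term
  ) {
    i++;
  }
  // Drop the follower's conflicting suffix, copy the leader's tail
  return followerLog.slice(0, i).concat(leaderLog.slice(i));
}

// A's log: one shared entry plus an uncommitted entry from term 1
const logA = [{ term: 1, cmd: 'x=1' }, { term: 1, cmd: 'x=2' }];
// C's log (new leader, term 2): the authoritative history
const logC = [{ term: 1, cmd: 'x=1' }, { term: 2, cmd: 'y=9' }];

console.log(reconcile(logC, logA));
// A keeps the shared prefix and adopts C's term-2 entry;
// its uncommitted term-1 entry 'x=2' is discarded.
```

This is exactly the "brutal but correct" behavior: the unacknowledged `x=2` write simply vanishes.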

What Textbooks Don't Tell You

After a weekend of breaking Raft, here's what I learned that wasn't obvious from the paper:

  • Raft is paranoid by design — It assumes networks are unreliable, nodes can crash at any moment, and messages can be delayed, duplicated, or lost (it does not, however, tolerate Byzantine nodes that actively lie)
  • Safety beats liveness — Raft would rather stop making progress than violate consistency
  • Term numbers are everything — They're the global logical clock that orders events across the cluster
  • Log matching is the key invariant — If two logs have the same entry at the same index, everything before that must match
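The log matching invariant is easy to check mechanically. Here's a small sketch of my own (not from the paper or the post) that verifies the property for a pair of logs, again assuming `{ term, cmd }` entries:

```javascript
// If two logs agree at some index, the Log Matching property says every
// earlier entry must be identical too. This helper checks that prefix.
function logsMatchUpTo(logX, logY, index) {
  for (let i = 0; i <= index; i++) {
    if (logX[i].term !== logY[i].term || logX[i].cmd !== logY[i].cmd) {
      return false;
    }
  }
  return true;
}

const log1 = [{ term: 1, cmd: 'a' }, { term: 1, cmd: 'b' }, { term: 2, cmd: 'c' }];
const log2 = [{ term: 1, cmd: 'a' }, { term: 1, cmd: 'b' }, { term: 2, cmd: 'c' }];

// Matching entry at index 2 implies indexes 0..1 match as well
console.log(logsMatchUpTo(log1, log2, 2)); // true
```

In a real implementation this invariant is never checked after the fact; it's maintained by construction, because a follower only accepts an entry when the preceding entry matches.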

Try Breaking It Yourself

The best way to truly understand distributed consensus is to implement it and break it. Some ideas to try:

  • What happens if messages are delayed by 10 seconds?
  • Can you create a livelock where no leader is ever elected?
  • What if a follower's disk fails and it loses its log?
  • Can a minority partition ever commit a write?
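For the livelock question, a good starting point is Raft's actual defense: randomized election timeouts. Here's a rough Monte Carlo sketch of my own (the timing constants are made up) showing why identical timeouts cause perpetual split votes while randomization breaks the tie:

```javascript
// Estimate how often multiple nodes become candidates simultaneously.
// With fixed timeouts every node fires at once, splitting the vote forever;
// randomized timeouts make a lone first candidate overwhelmingly likely.
function splitVoteProbability(nodeCount, randomized, trials = 10000) {
  let splits = 0;
  for (let t = 0; t < trials; t++) {
    // Each node picks an election timeout (values are illustrative)
    const timeouts = Array.from({ length: nodeCount }, () =>
      randomized ? 150 + Math.floor(Math.random() * 150) : 150
    );
    const first = Math.min(...timeouts);
    // Ties at the minimum mean simultaneous candidacies → split vote
    if (timeouts.filter(x => x === first).length > 1) splits++;
  }
  return splits / trials;
}

console.log('fixed timeouts:     ', splitVoteProbability(5, false)); // always 1
console.log('randomized timeouts:', splitVoteProbability(5, true));  // small
```

The simulation is deliberately crude (real elections involve RPC latency, not just timer ticks), but it captures why the paper insists on randomization.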

Distributed systems are hard because the failure modes are creative and surprising. The only way to build intuition is to see them fail, repeatedly, until the patterns become clear.

Teaching Python by Building Games: Why Education Playground Works

August 2024 Education Python Teaching

The Problem with Traditional Tutorials

Most programming tutorials start the same way:

print("Hello, World!")

Then they move to variables, then loops, then... students fall asleep. Why? Because there's no payoff. No visible result. No game, no animation, no "wow, I built that!"

Education Playground takes a different approach: build something cool from day one.

Learning By Building

The platform consists of 117 interactive Jupyter notebooks covering Python from absolute basics to GPU programming with CUDA. But unlike traditional courses, every concept is taught through building something you can see.

Lesson 1: Not "Hello World", but "Make a Game"

Instead of printing text, the first lesson builds a simple number guessing game:

// Python number guessing game (translated to JS for demo)
// Students see this in Lesson 1
function playGuessingGame() {
  const secret = Math.floor(Math.random() * 100) + 1;
  let attempts = 0;
  const maxAttempts = 7;

  console.log("🎮 Welcome to Guess the Number!");
  console.log(`I'm thinking of a number between 1 and 100`);
  console.log(`You have ${maxAttempts} attempts\n`);

  // Simulated guesses for the demo
  const guesses = [50, 75, 87, 93, 96, 94, 95];
  for (const guess of guesses) {
    attempts++;
    if (guess === secret) {
      console.log(`\n🎉 Correct! You found it in ${attempts} attempts!`);
      return; // stop once the number is found
    } else if (guess < secret) {
      console.log(`Attempt ${attempts}: ${guess} is too low! ⬆️`);
    } else {
      console.log(`Attempt ${attempts}: ${guess} is too high! ⬇️`);
    }
    if (attempts >= maxAttempts) {
      console.log(`\n😞 Out of attempts! The number was ${secret}`);
    }
  }
}

playGuessingGame();

console.log("\n💡 In Lesson 1, students learn:");
console.log("   - Variables (secret, attempts)");
console.log("   - Conditionals (if/else)");
console.log("   - Loops (while)");
console.log("   - Input/Output");
console.log("\nAll by building a playable game!");

By the end of Lesson 1, students have a working game. They've learned variables, conditionals, and loops without realizing they were learning syntax. They were too busy having fun.

Progressive Complexity: From Games to GPU Programming

The 117 notebooks follow a carefully designed progression:

Weeks 1-4: Fundamentals Through Games

  • Guess the Number (variables, loops, conditionals)
  • Hangman (strings, lists, functions)
  • Tic-Tac-Toe (2D arrays, game state)
  • Snake Game (classes, object-oriented programming)

Weeks 5-8: Data Structures Through Visualizations

  • Sorting Visualizer (algorithms, complexity)
  • Graph Explorer (trees, graphs, search algorithms)
  • Maze Generator (recursion, backtracking)
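The maze generator lesson leans on recursion and backtracking. As a taste of what students build, here's a minimal recursive-backtracking carve of my own (illustrative, not the actual notebook code):

```javascript
// Classic recursive-backtracker maze carve on a grid.
// 1 = wall, 0 = open passage. Works best with an odd grid size.
function carveMaze(size) {
  const grid = Array.from({ length: size }, () => Array(size).fill(1));

  function carve(r, c) {
    grid[r][c] = 0;
    // Visit neighbors two cells away, in random order
    const dirs = [[0, 2], [0, -2], [2, 0], [-2, 0]]
      .sort(() => Math.random() - 0.5);
    for (const [dr, dc] of dirs) {
      const nr = r + dr, nc = c + dc;
      if (nr >= 0 && nr < size && nc >= 0 && nc < size && grid[nr][nc] === 1) {
        grid[r + dr / 2][c + dc / 2] = 0; // knock down the wall between
        carve(nr, nc); // recurse; returning here is the "backtrack"
      }
    }
  }

  carve(0, 0);
  return grid;
}

const maze = carveMaze(9);
maze.forEach(row => console.log(row.map(x => (x ? '█' : ' ')).join('')));
```

Printing a different maze on every run is exactly the kind of instant visual payoff the curriculum is built around.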

Weeks 9-12: Real-World Applications

  • Web Scraper (requests, BeautifulSoup)
  • Data Analysis Dashboard (pandas, matplotlib)
  • Machine Learning Basics (scikit-learn)
  • GPU Acceleration (CUDA, parallel programming)
// Example: Teaching sorting through visualization
// Students see the algorithm AND the result
function bubbleSort(arr) {
  const n = arr.length;
  const steps = [];
  // Make a copy to sort
  const sorted = [...arr];

  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n - i - 1; j++) {
      // Capture state for visualization
      steps.push({
        array: [...sorted],
        comparing: [j, j + 1],
        action: 'compare'
      });
      if (sorted[j] > sorted[j + 1]) {
        // Swap
        [sorted[j], sorted[j + 1]] = [sorted[j + 1], sorted[j]];
        steps.push({
          array: [...sorted],
          swapped: [j, j + 1],
          action: 'swap'
        });
      }
    }
  }
  return steps;
}

// Visualize sorting
const unsorted = [64, 34, 25, 12, 22, 11, 90];
console.log("🎯 Teaching Bubble Sort Visually\n");
console.log(`Initial array: [${unsorted.join(', ')}]\n`);

const steps = bubbleSort(unsorted);

// Show key steps
const keySteps = [0, 5, 10, 15, steps.length - 1];
keySteps.forEach(i => {
  if (i < steps.length) {
    const step = steps[i];
    console.log(`Step ${i + 1}: [${step.array.join(', ')}]`);
    if (step.action === 'swap') {
      console.log(`  → Swapped positions ${step.swapped[0]} and ${step.swapped[1]}`);
    }
  }
});

console.log(`\n✓ Sorted: [${steps[steps.length - 1].array.join(', ')}]`);
console.log(`\nStudents learn:`);
console.log(`  - Nested loops`);
console.log(`  - Algorithm complexity (O(n²))`);
console.log(`  - By WATCHING it work, not just reading about it`);

Results: 90% Certification Pass Rate

After implementing this curriculum with 120+ students, the results speak for themselves:

  • 90% pass rate on PCEP (Certified Entry-Level Python Programmer) exams
  • 30+ hours of comprehensive coursework
  • 100+ students mentored through hands-on projects
  • 117 interactive notebooks covering basics to GPU programming

Why It Works: The Science of Learning

Education Playground's approach is backed by learning science research:

1. Active Learning Beats Passive Reading

Students don't just read about loops; they write loops that make characters move on screen. The brain encodes "doing" much more strongly than "reading about."

2. Immediate Feedback

Jupyter notebooks give instant visual feedback. Change the code, run the cell, see the result. No waiting, no context switching.

3. Progressive Complexity

Each lesson builds on the last. By week 12, students are writing GPU-accelerated code using concepts from week 1, without realizing how far they've come.

4. Intrinsic Motivation

Students aren't completing exercises for a grade. They're building games they can show their friends. That's powerful motivation.

"Tell me and I forget. Teach me and I remember. Involve me and I learn." โ€” Benjamin Franklin

Try It Yourself

Education Playground is completely free and open source. All 117 notebooks are available online:

Try Education Playground Live โ†’

Whether you're a teacher looking for curriculum, a student learning Python, or just curious how to make programming education more engaging, the platform is ready to use. No installation required.

Because learning to code shouldn't be boring. It should be building games, solving puzzles, and creating things that make you say "I can't believe I built that."

Let's Connect

I'm currently exploring opportunities where deep systems work meets impactful user experience. If your team is uniting robust infrastructure with intuitive applications, let's talk.

mykolas@perevici.us:~$ whoami
> Full-stack engineer shipping code that matters
mykolas@perevici.us:~$ ls projects/
> koalas-forge/ distributed-alexnet/ ultimate-setup/ education-playground/
mykolas@perevici.us:~$ cat philosophy.txt
> "Ship fast. Test everything. Impact users."
mykolas@perevici.us:~$ echo $SKILLS
> Python | Django | React | AWS | CUDA | Docker | PostgreSQL
mykolas@perevici.us:~$ cat contact.txt
> Email: Perevicius.Mykolas@gmail.com
> GitHub: github.com/mykolas-perevicius
> LinkedIn: linkedin.com/in/mykolasperevicius

Get In Touch

Have a project in mind or want to chat? Let's connect!