SymSpell C99: Building the Fastest Spell Checker in Pure C
How I packaged and published the first pure C99 implementation of a spell-checking algorithm that's 1 million times faster than traditional approaches
Introduction: The Need for Speed
Imagine typing in a search box, and as you type, the system instantly corrects your misspellings in real-time. Or think about a text editor that can check millions of words per second without slowing down your workflow. This level of performance seemed impossible with traditional spell-checking algorithms—until SymSpell came along.
When I discovered that no pure C implementation of this revolutionary algorithm existed, I saw an opportunity. What started as a learning project evolved into SymSpell C99—the first and only pure C99 implementation of Wolf Garbe's SymSpell algorithm, now available as open source.
🚀 Quick Stats:
- ⚡ 5µs average lookup time (real-world usage)
- 🎯 82-84% correction accuracy
- 🧹 ~700 lines of clean C99 code
- 🌐 Zero dependencies, POSIX-compliant
- 📦 86,060 word dictionary included
GitHub Repository: github.com/sumanpokhrel-11/symspell-c99
What is SymSpell?
SymSpell is a spell-checking algorithm created by Wolf Garbe that's reportedly 1 million times faster than traditional spell-checkers like those based on Peter Norvig's approach.
The Problem with Traditional Spell Checkers
Traditional spell-checkers work by:
- Taking a misspelled word
- Generating all possible corrections (edits, insertions, deletions, transpositions)
- Checking which generated candidates exist in the dictionary
- Ranking by frequency/probability
For a word with edit distance 2, this can generate hundreds of thousands of candidates. It's computationally expensive and slow.
The SymSpell Innovation: Symmetric Delete
SymSpell flips this approach on its head with a brilliant insight:
"Instead of generating corrections from misspellings, pre-compute deletions from dictionary words."
The key insight: If you delete characters from both a misspelled word and the correct word, they might become the same string.
Example:
Misspelling: "spelling" → delete 1 char → "speling"
Correct: "spelling" → no deletions → "spelling"
Misspelling: "speling" → generate deletes → "speling", "peling", "seling", ...
Dictionary: "spelling" → pre-computed deletes → "spelling", "pelling", "selling", ...
Match found: "spelling" ✓
By pre-computing all possible deletions of dictionary words and storing them in a hash table, lookups become incredibly fast—just a few hash table lookups instead of generating thousands of candidates.
Why C99?
When I started this project, implementations existed in C#, Python, Java, JavaScript, and many other languages. But there was no pure C implementation. Here's why C99 was the right choice:
1. Performance
C provides direct control over memory and CPU, essential for a performance-critical algorithm like spell-checking. No garbage collection pauses, no interpreter overhead—just pure, predictable speed.
2. Portability
C99 is the most portable language. It runs on everything from embedded systems to supercomputers. POSIX compliance means it works seamlessly on Linux, macOS, BSD, and Unix systems.
3. Integration
C libraries can be easily integrated into virtually any other language through FFI (Foreign Function Interface). Want to use it from Python? Rust? Go? No problem.
4. Educational Value
Implementing complex algorithms in C teaches you how computers actually work—memory management, cache efficiency, data structures, and optimization techniques.
5. No Dependencies
The implementation is completely self-contained. No external libraries, no frameworks, no package managers. Just standard C99 and POSIX.
How SymSpell Works: Under the Hood
Let me walk you through the algorithm step by step.
Step 1: Dictionary Preprocessing
When loading the dictionary, for each word we:
- Store the original word with its frequency
- Generate all possible deletions up to edit distance N
- Store each deletion as a key pointing to the original word
Dictionary word: "example"
Edit distance: 2
Generated deletes:
- "xample" → points to "example"
- "eample" → points to "example"
- "exmple" → points to "example"
- "exaple" → points to "example"
... (and many more)
All stored in a hash table for O(1) lookup.
Step 2: Fast Path - Correct Words
When a user enters a word, we first check if it's in the dictionary:
Input: "example"
Hash lookup: Found! ✓
Result: Return immediately (0.7µs)
This is the fast path. Since most words users type are spelled correctly (about 85%), this optimization is crucial. It's why our real-world average is 5µs, not 30µs.
Step 3: Correction Path - Misspelled Words
If the word isn't found, we generate deletions:
Input: "exampl" (missing 'e')
Generate deletes:
- "xampl" → lookup in hash table
- "eampl" → lookup in hash table
- "exmpl" → lookup in hash table
- "examl" → lookup in hash table
- "examp" → lookup in hash table ✓ Found!
Hash table returns: "example"
Calculate edit distance: 1
Return: "example" (30µs worst case)
Step 4: Ranking Results
When multiple candidates are found, we rank by:
- Edit distance (prefer closer matches)
- Word frequency (prefer common words)
- IWF score (Inverse Word Frequency - for filtering common words)
Why This is Fast
The beauty of SymSpell:
- ✅ Pre-computation done once at startup
- ✅ Lookups are just hash table operations (O(1))
- ✅ No complex string generation during lookup
- ✅ Memory for speed tradeoff (hash table is large but lookups are instant)
Key Features of SymSpell C99
1. Blazing Fast Performance
- Correctly spelled words: 0.7µs (fast path)
- Real-world average: ~5µs (accounting for 85% correct words)
- Misspelling correction: 30µs (worst case)
2. High Accuracy
- 82-84% correction rate on standard test datasets
- 98% vocabulary coverage
- Tested against multiple misspelling corpora (CodeSpell, Microsoft, Wikipedia)
3. Production-Ready Code
- Zero compiler warnings with
-Wall -Wextra -Werror - Comprehensive test suite included
- Memory leak free (validated with Valgrind)
- Well-documented with clear API
4. Optimized Dictionary
- 86,060 words carefully curated for spell-checking
- Frequency data from Google N-gram corpus
- Balanced across multiple test sets
- Additional 25,327-word dictionary for technical terms
5. Complete Tooling
- Dictionary building scripts (Perl-based pipeline)
- Performance benchmarking tools
- Interactive test program
- Comprehensive documentation
6. POSIX Compliant
- Works on Linux, macOS, BSD, Unix
- Custom POSIX compatibility layer included
- No platform-specific dependencies
Implementation Journey: Challenges and Solutions
Building this wasn't without challenges. Here are some of the interesting problems I encountered and solved:
Challenge 1: Hash Table Performance
Problem: With 688,710 delete entries for 86k words, hash collisions were a concern.
Solution: Implemented a custom hash table using:
- xxHash3 for fast, high-quality hashing
- Open addressing with linear probing
- Load factor tuning to keep it 16.4% full for optimal performance
// Using xxHash3 for superior hash distribution
uint64_t hash = XXH3_64bits(word, word_len);
// Open addressing with linear probing
uint32_t index = hash % table->capacity;
while (table->entries[index].key != NULL) {
if (strcmp(table->entries[index].key, word) == 0) {
return &table->entries[index]; // Found
}
index = (index + 1) % table->capacity; // Linear probe
}
Challenge 2: Memory Efficiency
Problem: Storing 688k+ entries requires careful memory management.
Solution:
- Custom memory pool for string storage
- Compact data structures (packed structs)
- Lazy loading of large dictionaries
- Total memory footprint: ~45MB (acceptable for the performance gained)
Challenge 3: ARM64 vs x86 Compatibility
Problem: Ensuring consistent behavior across architectures.
Solution: While working on this project, I discovered and fixed a critical bug in the ARM64 inline assembly code in the POSIX compatibility layer:
// BEFORE (buggy):
__asm__ volatile (
"mov w0, #0\n\t" // Writing to hardcoded w0
: "=r" (result) // Compiler assigns result to w7!
);
// AFTER (fixed):
__asm__ volatile (
"mov %w0, #0\n\t" // %w0 = use compiler-assigned register
: "=r" (result) // Now correctly uses w7
);
This fix improved performance by 34.6% and taught me valuable lessons about inline assembly constraints. Read more about this in my article: "Debugging ARM64 Inline Assembly: A Case Study"
Challenge 4: Dictionary Quality
Problem: Generic word lists don't work well for spell-checking.
Solution: Built a comprehensive dictionary pipeline:
- Start with SCOWL (Spell Checker Oriented Word Lists)
- Add frequency data from Google N-grams
- Filter against real misspelling datasets
- Validate accuracy across multiple test corpora
- Iterate until achieving 82-84% accuracy
Challenge 5: Real-World Performance vs Benchmarks
Problem: Benchmarks showed 30µs, but that's only for misspellings.
Solution: Implemented a two-tier approach:
- Fast path: Direct dictionary lookup (0.7µs) for correct words
- Correction path: Full SymSpell algorithm (30µs) for misspellings
- Real-world average: 5µs (85% fast path hits)
Performance Benchmarks: Numbers that Matter
Lookup Time Breakdown
| Scenario | Time | Frequency |
|---|---|---|
| Correctly spelled word (fast path) | 0.7µs | ~85% |
| Misspelled word (correction) | 30µs | ~15% |
| Real-world average | 5.1µs | 100% |
Calculation: 0.85 × 0.7µs + 0.15 × 30µs = 5.1µs
Accuracy Across Test Sets
| Test Corpus | Accuracy | Test Size |
|---|---|---|
| CodeSpell misspellings | 84.2% | 2,156 pairs |
| Microsoft misspellings | 82.7% | 4,200 pairs |
| Wikipedia misspellings | 83.1% | 3,500 pairs |
| Average | 83.3% | 9,856 pairs |
Memory Usage
- Dictionary data: ~12MB
- Delete hash table: ~33MB
- Total: ~45MB
- Per-lookup overhead: Negligible (stack-allocated)
Comparison with Other Implementations
While I can't do direct apples-to-apples comparisons (different languages, different test machines), here's how SymSpell C99 compares conceptually:
- Python SymSpell: ~100-200µs (interpreted overhead)
- C# SymSpell: ~10-20µs (JIT compilation, GC pauses)
- SymSpell C99: ~5µs real-world (compiled native, no GC)
The C implementation benefits from:
- No interpreter/VM overhead
- No garbage collection pauses
- Direct memory access
- Compiler optimizations (O3 flag)
- Cache-friendly data structures
Platform Performance
Apple M4 (ARM64):
- Average lookup: 5.1µs
- Fast path: 0.7µs
- Correction: 30µs
x86-64 (Intel Xeon):
- Average lookup: 5.8µs
- Fast path: 0.8µs
- Correction: 32µs
Performance is consistent across architectures thanks to portable C99 code.
Getting Started: Quick Start Guide
Installation
# Clone the repository
git clone https://github.com/sumanpokhrel-11/symspell-c99.git
cd symspell-c99
# Build
make
# Run interactive test
./test_symspell dictionaries/dictionary.txt
# Run benchmark
make benchmark
Basic Usage
#include "symspell.h"
int main() {
// Create dictionary with:
// max_edit_distance = 2
// prefix_length = 7
symspell_dict_t* dict = symspell_create(2, 7);
// Load dictionary
int result = symspell_load_dictionary(
dict,
"dictionaries/dictionary.txt",
0, // term_index (column in file)
1 // count_index (frequency column)
);
if (result < 0) {
fprintf(stderr, "Failed to load dictionary\n");
return 1;
}
// Lookup suggestions
symspell_suggestion_t suggestions[5];
int count = symspell_lookup(
dict,
"speling", // misspelled word
2, // max_edit_distance
suggestions, // output array
5 // max suggestions
);
// Print results
printf("Suggestions for 'speling':\n");
for (int i = 0; i < count; i++) {
printf(" %s (distance=%d, freq=%lld)\n",
suggestions[i].term,
suggestions[i].distance,
suggestions[i].count);
}
// Cleanup
symspell_destroy(dict);
return 0;
}
Output:
Suggestions for 'speling':
spelling (distance=1, freq=234567)
spieling (distance=2, freq=1234)
peeling (distance=2, freq=5678)
Interactive Mode
The test program includes an interactive mode for experimentation:
$ ./test_symspell dictionaries/dictionary.txt
Creating SymSpell dictionary...
Loaded 86060 words, 688710 delete entries
=== Interactive Mode ===
Enter words to correct (or 'quit' to exit):
> recieve
Suggestions:
receive (distance=1, iwf=7.23, prob=0.0012, freq=234567)
> teh
Suggestions:
the (distance=1, iwf=1.05, prob=0.2456, freq=55234891)
tea (distance=1, iwf=6.12, prob=0.0034, freq=76543)
> quit
Building Custom Dictionaries
The project includes a complete dictionary building pipeline:
# Dictionary building scripts in dictionaries/reference/
# 1. Start with SCOWL wordlists
cd dictionaries/reference/wordlists
./build-scowl.sh
# 2. Add frequency data
cd ..
perl scripts/dictionary-build.pl \
wordlists/wordlist.txt \
frequencies/google-ngrams.txt \
> dictionaries/custom-dictionary.txt
# 3. Validate against test data
perl scripts/dictionary-find-misspellings.pl \
dictionaries/custom-dictionary.txt \
test/data/symspell/misspellings/
# 4. Test accuracy
./benchmark_symspell \
dictionaries/custom-dictionary.txt \
test/data/symspell/misspellings/misspell-codespell.txt
API Reference
Core Functions
// Create dictionary
symspell_dict_t* symspell_create(
int max_edit_distance, // 1 or 2 recommended
int prefix_length // 7 recommended
);
// Load dictionary from file
int symspell_load_dictionary(
symspell_dict_t* dict,
const char* filepath,
int term_index, // Column number for words
int count_index // Column number for frequencies
);
// Lookup suggestions
int symspell_lookup(
symspell_dict_t* dict,
const char* input,
int max_edit_distance,
symspell_suggestion_t* suggestions,
int max_suggestions
);
// Get word frequency
int64_t symspell_word_frequency(
symspell_dict_t* dict,
const char* word
);
// Get IWF score
float symspell_get_iwf(
symspell_dict_t* dict,
const char* word
);
// Cleanup
void symspell_destroy(symspell_dict_t* dict);
Data Structures
typedef struct {
char term[MAX_WORD_LENGTH]; // Suggested word
int distance; // Edit distance from input
int64_t count; // Word frequency
float iwf; // Inverse word frequency
float prob; // Probability score
} symspell_suggestion_t;
Real-World Use Cases
Where can you use SymSpell C99? Here are some practical applications:
1. Text Editors & IDEs
Integrate real-time spell-checking without UI lag. The 5µs average lookup time means you can check every word as the user types with zero perceptible delay.
Example: A text editor checking 100 words/second = 500µs total (0.0005 seconds)
2. Search Engines
Provide "Did you mean?" suggestions instantly. When a user searches for "pytohn tutorial", suggest "python tutorial" in microseconds.
Benefit: Improves search UX without adding latency.
3. Form Validation
Validate user input in real-time. Check email addresses, names, or technical terms as users type.
Example: A signup form that suggests corrections for common name misspellings.
4. Data Cleaning Pipelines
Process millions of records for data quality. Clean up messy datasets by correcting common misspellings.
Performance: Process 1 million words in ~5 seconds.
5. Chatbots & NLP Systems
Improve intent recognition by correcting user input before processing. Handle typos gracefully in conversational AI.
Example: User types "hw are yu", system understands "how are you".
6. Command-Line Tools
Build CLI tools with spell-check capabilities. Suggest corrections for command typos.
Example: Git-style "Did you mean 'commit'?" messages.
7. Embedded Systems
Run on resource-constrained devices. The small footprint (~45MB) and zero dependencies make it perfect for embedded Linux.
Example: Smart keyboards, IoT devices, automotive systems.
8. Database Query Correction
Fix typos in search queries before hitting the database. Improve search recall without expensive fuzzy matching.
Benefit: Better search results with exact match performance.
How to Contribute
SymSpell C99 is open source and welcomes contributions! Here's how you can help:
Ways to Contribute
1. Bug Reports
Found a bug? Please report it!
- Open an issue on GitHub
- Include: steps to reproduce, expected behavior, actual behavior
- Bonus: Minimal code example demonstrating the bug
2. Performance Improvements
Can you make it faster?
- Profile the code (use
perf,Instruments, orValgrind) - Identify bottlenecks
- Submit a pull request with benchmarks showing improvement
3. Dictionary Enhancements
Improve accuracy by refining the dictionary:
- Test against new misspelling corpora
- Add domain-specific dictionaries (medical, legal, technical)
- Improve frequency data
- Document dictionary building process
4. Documentation
Help others understand and use the library:
- Add code examples
- Write tutorials
- Improve API documentation
- Translate documentation
5. Language Bindings
Make SymSpell C99 accessible from other languages:
- Python: ctypes or Cython wrapper
- Node.js: Native addon using N-API
- Rust: FFI bindings
- Go: CGo bindings
- Ruby: FFI gem
6. Platform Support
Test and ensure compatibility:
- Windows (via WSL or MinGW)
- FreeBSD, OpenBSD
- Android NDK
- iOS (via cross-compilation)
7. Testing
Expand test coverage:
- Unit tests for edge cases
- Fuzzing for robustness
- Memory leak detection
- Thread safety tests
Contribution Guidelines
-
Fork the repository
git fork https://github.com/sumanpokhrel-11/symspell-c99.git -
Create a feature branch
git checkout -b feature/your-feature-name -
Follow the code style
- C99 standard compliance
- Zero warnings with
-Wall -Wextra -Werror - Clear, descriptive variable names
- Comments for complex logic
-
Write tests
- Add tests for new features
- Ensure all existing tests pass
- Run
make testbefore submitting
-
Document your changes
- Update README if needed
- Add docstrings to new functions
- Update CHANGELOG
-
Submit a pull request
- Clear description of changes
- Reference any related issues
- Include before/after benchmarks for performance changes
Code of Conduct
This project follows a simple code of conduct:
- ✅ Be respectful and constructive
- ✅ Welcome newcomers and help them learn
- ✅ Focus on what's best for the project
- ✅ Give credit where credit is due
- ❌ No harassment, discrimination, or trolling
Recognition
All contributors are recognized in:
- GitHub contributors page
- CONTRIBUTORS.md file
- Release notes
Let's build something great together! 🚀
Future Enhancements: What's Next?
Here's my roadmap for future development. These are areas I'm exploring or would love community help with:
Short-Term (Next 3-6 months)
1. CI/CD Pipeline
- GitHub Actions for automated testing
- Test on multiple platforms (Linux, macOS, BSD)
- Code coverage reporting
- Automated performance benchmarking
2. Additional Examples
- CLI spell-checker tool
- Web server integration example (with libevent)
- Batch processing script
- Performance comparison tool
3. Documentation Improvements
- API reference (Doxygen-generated)
- Architecture deep-dive
- Performance tuning guide
- Video tutorials
Medium-Term (6-12 months)
4. Language Bindings
- Python wrapper (top priority based on demand)
- Node.js native addon
- Rust FFI bindings
- Go CGo wrapper
5. Advanced Features
- Compound word support: Handle "spellchecker" vs "spell checker"
- Context-aware suggestions: Use n-grams for better ranking
- Phonetic matching: Soundex/Metaphone for homophones
- Multi-language support: Spanish, French, German dictionaries
6. Performance Optimizations
- SIMD optimizations (AVX2/NEON)
- Multi-threaded dictionary loading
- Memory-mapped dictionary files
- Incremental dictionary updates
Long-Term (12+ months)
7. Package Manager Distribution
- Homebrew formula
- Debian/Ubuntu packages (apt)
- RPM packages (yum/dnf)
- Arch AUR package
- vcpkg/Conan package
8. Advanced Use Cases
- Edit distance 3 support: For very messy input
- Streaming interface: Process text streams efficiently
- GPU acceleration: Batch processing on CUDA/OpenCL
- Distributed system: Microservice architecture
9. Research Integration
- Implement SymSpell 6.7 improvements
- Integrate neural spelling correction as fallback
- Combine with language models (GPT/BERT) for context
- Research paper on C-specific optimizations
Blue Sky Ideas
Ambitious ideas that would be amazing but require significant work:
Real-Time Collaborative Editing
Integrate with OT/CRDT systems for real-time collaborative spell-checking in Google Docs-style applications.
Browser Extension
Compile to WebAssembly and create a privacy-focused, offline spell-checker browser extension.
Machine Learning Integration
Train a neural ranker to improve suggestion quality beyond frequency-based ranking.
Domain-Specific Dictionaries
Curated dictionaries for: medical terminology, legal documents, programming languages, academic writing.
Your Ideas?
Have an idea for improvement? Open a discussion on GitHub!
Conclusion: Lessons Learned and What's Next
What I Learned
Building and publishing SymSpell C99 taught me invaluable lessons:
Technical Skills
- Systems programming: Deep understanding of memory management, cache optimization, and data structures
- Performance optimization: Profiling, benchmarking, and iterative improvement
- Cross-platform development: Writing portable C99 code that works everywhere
- Algorithm implementation: Translating academic papers into production code
- Debugging: Fixing complex bugs like ARM64 inline assembly issues
Soft Skills
- Documentation: Writing clear, comprehensive documentation for users
- Open source: Managing a public repository, responding to issues, accepting contributions
- Communication: Explaining technical concepts to diverse audiences
- Project management: Planning features, prioritizing work, meeting deadlines
Professional Growth
- Portfolio building: Created a showcase project demonstrating expertise
- Community engagement: Connected with algorithm creator and other developers
- Mentorship: Learned from experienced systems programmers
- Confidence: Proved I can build production-quality software
Key Takeaways
"Perfect is the enemy of good."
I could have spent months adding every possible feature. Instead, I shipped a solid v1.0 and will iterate based on feedback.
"Performance comes from understanding, not magic."
The 5µs average wasn't luck—it came from profiling, optimization, and deep understanding of how the algorithm and hardware work together.
"Documentation is code for humans."
Writing good docs is as important as writing good code. Nobody will use your library if they can't figure out how.
"Open source is about community, not just code."
The most rewarding part wasn't writing the code—it was sharing it, helping others use it, and learning from their feedback.
What's Next for Me?
This project opened doors I didn't know existed:
- 📧 Contacted Wolf Garbe to get listed on official SymSpell ports
- 💼 Portfolio piece demonstrating systems programming expertise
- 🎓 Learning opportunity in algorithms and performance optimization
- 🤝 Mentorship from experienced engineers (thank you, Dr. James Freeman!)
- 🌟 Open source contributor with a published project
Try It Yourself!
I encourage you to:
- Clone the repo: github.com/sumanpokhrel-11/symspell-c99
- Run the benchmarks: See the performance for yourself
- Read the code: It's only 700 lines—very approachable
- Integrate it: Use it in your next project
- Contribute: Help make it even better
Get in Touch
I'd love to hear from you:
- 🐙 GitHub: @sumanpokhrel-11
- 💼 LinkedIn: Suman Pokhrel
- 📧 Email: 11pksuman@gmail.com
- 🌐 Website: suman-pokhrel.com.np
Whether you're:
- Using SymSpell C99 in a project
- Have questions or feedback
- Want to collaborate on open source
- Interested in discussing algorithms or performance
- Looking for a passionate developer to join your team
Let's connect!
Final Thoughts
Building SymSpell C99 reminded me why I fell in love with programming: the joy of creating something useful, the satisfaction of solving hard problems, and the thrill of sharing it with the world.
If you're reading this and thinking about starting your own project:
Do it.
Don't wait for the perfect idea or the perfect time. Pick something that excites you, start small, ship it, and iterate.
The best time to start was yesterday. The second best time is now.
Thank you for reading! Now go build something amazing. 🚀
Suman Pokhrel
Software Developer | Systems Programmer | Open Source Enthusiast
Kathmandu, Nepal
⭐ Star the project on GitHub!
If you found this project useful or interesting, please star the repository:
Comments
Have questions or feedback? Leave a comment below or reach out on GitHub Discussions.