SymSpell C99: Building the Fastest Spell Checker in Pure C

How I packaged and published the first pure C99 implementation of a spell-checking algorithm that's 1 million times faster than traditional approaches

October 24, 2025 15 min read By Suman Pokhrel
C Programming Algorithms Open Source Performance

Introduction: The Need for Speed

Imagine typing in a search box, and as you type, the system instantly corrects your misspellings in real-time. Or think about a text editor that can check millions of words per second without slowing down your workflow. This level of performance seemed impossible with traditional spell-checking algorithms—until SymSpell came along.

When I discovered that no pure C implementation of this revolutionary algorithm existed, I saw an opportunity. What started as a learning project evolved into SymSpell C99—the first and only pure C99 implementation of Wolf Garbe's SymSpell algorithm, now available as open source.

🚀 Quick Stats:

  • ⚡ 5µs average lookup time (real-world usage)
  • 🎯 82-84% correction accuracy
  • 🧹 ~700 lines of clean C99 code
  • 🌐 Zero dependencies, POSIX-compliant
  • 📦 86,060 word dictionary included

GitHub Repository: github.com/sumanpokhrel-11/symspell-c99

What is SymSpell?

SymSpell is a spell-checking algorithm created by Wolf Garbe that's reportedly 1 million times faster than traditional spell-checkers like those based on Peter Norvig's approach.

The Problem with Traditional Spell Checkers

Traditional spell-checkers work by:

  1. Taking a misspelled word
  2. Generating all possible corrections (edits, insertions, deletions, transpositions)
  3. Checking which generated candidates exist in the dictionary
  4. Ranking by frequency/probability

For a word with edit distance 2, this can generate hundreds of thousands of candidates. It's computationally expensive and slow.

The SymSpell Innovation: Symmetric Delete

SymSpell flips this approach on its head with a brilliant insight:

"Instead of generating corrections from misspellings, pre-compute deletions from dictionary words."

The key insight: If you delete characters from both a misspelled word and the correct word, they might become the same string.

Example:

Misspelling: "spelling"  → delete 1 char → "speling"
Correct:     "spelling"  → no deletions  → "spelling"

Misspelling: "speling"   → generate deletes → "speling", "peling", "seling", ...
Dictionary:  "spelling"  → pre-computed deletes → "spelling", "pelling", "selling", ...

Match found: "spelling" ✓
            

By pre-computing all possible deletions of dictionary words and storing them in a hash table, lookups become incredibly fast—just a few hash table lookups instead of generating thousands of candidates.

Why C99?

When I started this project, implementations existed in C#, Python, Java, JavaScript, and many other languages. But there was no pure C implementation. Here's why C99 was the right choice:

1. Performance

C provides direct control over memory and CPU, essential for a performance-critical algorithm like spell-checking. No garbage collection pauses, no interpreter overhead—just pure, predictable speed.

2. Portability

C99 is the most portable language. It runs on everything from embedded systems to supercomputers. POSIX compliance means it works seamlessly on Linux, macOS, BSD, and Unix systems.

3. Integration

C libraries can be easily integrated into virtually any other language through FFI (Foreign Function Interface). Want to use it from Python? Rust? Go? No problem.

4. Educational Value

Implementing complex algorithms in C teaches you how computers actually work—memory management, cache efficiency, data structures, and optimization techniques.

5. No Dependencies

The implementation is completely self-contained. No external libraries, no frameworks, no package managers. Just standard C99 and POSIX.

How SymSpell Works: Under the Hood

Let me walk you through the algorithm step by step.

Step 1: Dictionary Preprocessing

When loading the dictionary, for each word we:

  1. Store the original word with its frequency
  2. Generate all possible deletions up to edit distance N
  3. Store each deletion as a key pointing to the original word
Dictionary word: "example"
Edit distance: 2

Generated deletes:
- "xample" → points to "example"
- "eample" → points to "example"
- "exmple" → points to "example"
- "exaple" → points to "example"
... (and many more)

All stored in a hash table for O(1) lookup.
            

Step 2: Fast Path - Correct Words

When a user enters a word, we first check if it's in the dictionary:

Input: "example"
Hash lookup: Found! ✓
Result: Return immediately (0.7µs)
            

This is the fast path. Since most words users type are spelled correctly (about 85%), this optimization is crucial. It's why our real-world average is 5µs, not 30µs.

Step 3: Correction Path - Misspelled Words

If the word isn't found, we generate deletions:

Input: "exampl" (missing 'e')
Generate deletes:
- "xampl" → lookup in hash table
- "eampl" → lookup in hash table
- "exmpl" → lookup in hash table
- "examl" → lookup in hash table
- "examp" → lookup in hash table ✓ Found!

Hash table returns: "example"
Calculate edit distance: 1
Return: "example" (30µs worst case)
            

Step 4: Ranking Results

When multiple candidates are found, we rank by:

  1. Edit distance (prefer closer matches)
  2. Word frequency (prefer common words)
  3. IWF score (Inverse Word Frequency - for filtering common words)

Why This is Fast

The beauty of SymSpell:

Key Features of SymSpell C99

1. Blazing Fast Performance

2. High Accuracy

3. Production-Ready Code

4. Optimized Dictionary

5. Complete Tooling

6. POSIX Compliant

Implementation Journey: Challenges and Solutions

Building this wasn't without challenges. Here are some of the interesting problems I encountered and solved:

Challenge 1: Hash Table Performance

Problem: With 688,710 delete entries for 86k words, hash collisions were a concern.

Solution: Implemented a custom hash table using:


// Using xxHash3 for superior hash distribution
uint64_t hash = XXH3_64bits(word, word_len);

// Open addressing with linear probing
uint32_t index = hash % table->capacity;
while (table->entries[index].key != NULL) {
    if (strcmp(table->entries[index].key, word) == 0) {
        return &table->entries[index];  // Found
    }
    index = (index + 1) % table->capacity;  // Linear probe
}
            

Challenge 2: Memory Efficiency

Problem: Storing 688k+ entries requires careful memory management.

Solution:

Challenge 3: ARM64 vs x86 Compatibility

Problem: Ensuring consistent behavior across architectures.

Solution: While working on this project, I discovered and fixed a critical bug in the ARM64 inline assembly code in the POSIX compatibility layer:


// BEFORE (buggy):
__asm__ volatile (
    "mov w0, #0\n\t"           // Writing to hardcoded w0
    : "=r" (result)            // Compiler assigns result to w7!
);

// AFTER (fixed):
__asm__ volatile (
    "mov %w0, #0\n\t"          // %w0 = use compiler-assigned register
    : "=r" (result)            // Now correctly uses w7
);
            

This fix improved performance by 34.6% and taught me valuable lessons about inline assembly constraints. Read more about this in my article: "Debugging ARM64 Inline Assembly: A Case Study"

Challenge 4: Dictionary Quality

Problem: Generic word lists don't work well for spell-checking.

Solution: Built a comprehensive dictionary pipeline:

  1. Start with SCOWL (Spell Checker Oriented Word Lists)
  2. Add frequency data from Google N-grams
  3. Filter against real misspelling datasets
  4. Validate accuracy across multiple test corpora
  5. Iterate until achieving 82-84% accuracy

Challenge 5: Real-World Performance vs Benchmarks

Problem: Benchmarks showed 30µs, but that's only for misspellings.

Solution: Implemented a two-tier approach:

Performance Benchmarks: Numbers that Matter

Lookup Time Breakdown

Scenario Time Frequency
Correctly spelled word (fast path) 0.7µs ~85%
Misspelled word (correction) 30µs ~15%
Real-world average 5.1µs 100%

Calculation: 0.85 × 0.7µs + 0.15 × 30µs = 5.1µs

Accuracy Across Test Sets

Test Corpus Accuracy Test Size
CodeSpell misspellings 84.2% 2,156 pairs
Microsoft misspellings 82.7% 4,200 pairs
Wikipedia misspellings 83.1% 3,500 pairs
Average 83.3% 9,856 pairs

Memory Usage

Comparison with Other Implementations

While I can't do direct apples-to-apples comparisons (different languages, different test machines), here's how SymSpell C99 compares conceptually:

The C implementation benefits from:

Platform Performance

Apple M4 (ARM64):

x86-64 (Intel Xeon):

Performance is consistent across architectures thanks to portable C99 code.

Getting Started: Quick Start Guide

Installation


# Clone the repository
git clone https://github.com/sumanpokhrel-11/symspell-c99.git
cd symspell-c99

# Build
make

# Run interactive test
./test_symspell dictionaries/dictionary.txt

# Run benchmark
make benchmark
            

Basic Usage


#include "symspell.h"

int main() {
    // Create dictionary with:
    //   max_edit_distance = 2
    //   prefix_length = 7
    symspell_dict_t* dict = symspell_create(2, 7);
    
    // Load dictionary
    int result = symspell_load_dictionary(
        dict, 
        "dictionaries/dictionary.txt",
        0,  // term_index (column in file)
        1   // count_index (frequency column)
    );
    
    if (result < 0) {
        fprintf(stderr, "Failed to load dictionary\n");
        return 1;
    }
    
    // Lookup suggestions
    symspell_suggestion_t suggestions[5];
    int count = symspell_lookup(
        dict,
        "speling",     // misspelled word
        2,             // max_edit_distance
        suggestions,   // output array
        5              // max suggestions
    );
    
    // Print results
    printf("Suggestions for 'speling':\n");
    for (int i = 0; i < count; i++) {
        printf("  %s (distance=%d, freq=%lld)\n",
               suggestions[i].term,
               suggestions[i].distance,
               suggestions[i].count);
    }
    
    // Cleanup
    symspell_destroy(dict);
    
    return 0;
}
            

Output:

Suggestions for 'speling':
  spelling (distance=1, freq=234567)
  spieling (distance=2, freq=1234)
  peeling (distance=2, freq=5678)
        

Interactive Mode

The test program includes an interactive mode for experimentation:

$ ./test_symspell dictionaries/dictionary.txt

Creating SymSpell dictionary...
Loaded 86060 words, 688710 delete entries

=== Interactive Mode ===
Enter words to correct (or 'quit' to exit):

> recieve
  Suggestions:
    receive (distance=1, iwf=7.23, prob=0.0012, freq=234567)

> teh
  Suggestions:
    the (distance=1, iwf=1.05, prob=0.2456, freq=55234891)
    tea (distance=1, iwf=6.12, prob=0.0034, freq=76543)

> quit
            

Building Custom Dictionaries

The project includes a complete dictionary building pipeline:


# Dictionary building scripts in dictionaries/reference/

# 1. Start with SCOWL wordlists
cd dictionaries/reference/wordlists
./build-scowl.sh

# 2. Add frequency data
cd ..
perl scripts/dictionary-build.pl \
    wordlists/wordlist.txt \
    frequencies/google-ngrams.txt \
    > dictionaries/custom-dictionary.txt

# 3. Validate against test data
perl scripts/dictionary-find-misspellings.pl \
    dictionaries/custom-dictionary.txt \
    test/data/symspell/misspellings/

# 4. Test accuracy
./benchmark_symspell \
    dictionaries/custom-dictionary.txt \
    test/data/symspell/misspellings/misspell-codespell.txt
            

API Reference

Core Functions


// Create dictionary
symspell_dict_t* symspell_create(
    int max_edit_distance,  // 1 or 2 recommended
    int prefix_length       // 7 recommended
);

// Load dictionary from file
int symspell_load_dictionary(
    symspell_dict_t* dict,
    const char* filepath,
    int term_index,         // Column number for words
    int count_index         // Column number for frequencies
);

// Lookup suggestions
int symspell_lookup(
    symspell_dict_t* dict,
    const char* input,
    int max_edit_distance,
    symspell_suggestion_t* suggestions,
    int max_suggestions
);

// Get word frequency
int64_t symspell_word_frequency(
    symspell_dict_t* dict,
    const char* word
);

// Get IWF score
float symspell_get_iwf(
    symspell_dict_t* dict,
    const char* word
);

// Cleanup
void symspell_destroy(symspell_dict_t* dict);
            

Data Structures


typedef struct {
    char term[MAX_WORD_LENGTH];  // Suggested word
    int distance;                 // Edit distance from input
    int64_t count;               // Word frequency
    float iwf;                   // Inverse word frequency
    float prob;                  // Probability score
} symspell_suggestion_t;
            

Real-World Use Cases

Where can you use SymSpell C99? Here are some practical applications:

1. Text Editors & IDEs

Integrate real-time spell-checking without UI lag. The 5µs average lookup time means you can check every word as the user types with zero perceptible delay.

Example: A text editor checking 100 words/second = 500µs total (0.0005 seconds)

2. Search Engines

Provide "Did you mean?" suggestions instantly. When a user searches for "pytohn tutorial", suggest "python tutorial" in microseconds.

Benefit: Improves search UX without adding latency.

3. Form Validation

Validate user input in real-time. Check email addresses, names, or technical terms as users type.

Example: A signup form that suggests corrections for common name misspellings.

4. Data Cleaning Pipelines

Process millions of records for data quality. Clean up messy datasets by correcting common misspellings.

Performance: Process 1 million words in ~5 seconds.

5. Chatbots & NLP Systems

Improve intent recognition by correcting user input before processing. Handle typos gracefully in conversational AI.

Example: User types "hw are yu", system understands "how are you".

6. Command-Line Tools

Build CLI tools with spell-check capabilities. Suggest corrections for command typos.

Example: Git-style "Did you mean 'commit'?" messages.

7. Embedded Systems

Run on resource-constrained devices. The small footprint (~45MB) and zero dependencies make it perfect for embedded Linux.

Example: Smart keyboards, IoT devices, automotive systems.

8. Database Query Correction

Fix typos in search queries before hitting the database. Improve search recall without expensive fuzzy matching.

Benefit: Better search results with exact match performance.

How to Contribute

SymSpell C99 is open source and welcomes contributions! Here's how you can help:

Ways to Contribute

1. Bug Reports

Found a bug? Please report it!

2. Performance Improvements

Can you make it faster?

3. Dictionary Enhancements

Improve accuracy by refining the dictionary:

4. Documentation

Help others understand and use the library:

5. Language Bindings

Make SymSpell C99 accessible from other languages:

6. Platform Support

Test and ensure compatibility:

7. Testing

Expand test coverage:

Contribution Guidelines

  1. Fork the repository
    git fork https://github.com/sumanpokhrel-11/symspell-c99.git
  2. Create a feature branch
    git checkout -b feature/your-feature-name
  3. Follow the code style
    • C99 standard compliance
    • Zero warnings with -Wall -Wextra -Werror
    • Clear, descriptive variable names
    • Comments for complex logic
  4. Write tests
    • Add tests for new features
    • Ensure all existing tests pass
    • Run make test before submitting
  5. Document your changes
    • Update README if needed
    • Add docstrings to new functions
    • Update CHANGELOG
  6. Submit a pull request
    • Clear description of changes
    • Reference any related issues
    • Include before/after benchmarks for performance changes

Code of Conduct

This project follows a simple code of conduct:

Recognition

All contributors are recognized in:

Let's build something great together! 🚀

Future Enhancements: What's Next?

Here's my roadmap for future development. These are areas I'm exploring or would love community help with:

Short-Term (Next 3-6 months)

1. CI/CD Pipeline

2. Additional Examples

3. Documentation Improvements

Medium-Term (6-12 months)

4. Language Bindings

5. Advanced Features

6. Performance Optimizations

Long-Term (12+ months)

7. Package Manager Distribution

8. Advanced Use Cases

9. Research Integration

Blue Sky Ideas

Ambitious ideas that would be amazing but require significant work:

Real-Time Collaborative Editing

Integrate with OT/CRDT systems for real-time collaborative spell-checking in Google Docs-style applications.

Browser Extension

Compile to WebAssembly and create a privacy-focused, offline spell-checker browser extension.

Machine Learning Integration

Train a neural ranker to improve suggestion quality beyond frequency-based ranking.

Domain-Specific Dictionaries

Curated dictionaries for: medical terminology, legal documents, programming languages, academic writing.

Your Ideas?

Have an idea for improvement? Open a discussion on GitHub!

Conclusion: Lessons Learned and What's Next

What I Learned

Building and publishing SymSpell C99 taught me invaluable lessons:

Technical Skills

Soft Skills

Professional Growth

Key Takeaways

"Perfect is the enemy of good."

I could have spent months adding every possible feature. Instead, I shipped a solid v1.0 and will iterate based on feedback.

"Performance comes from understanding, not magic."

The 5µs average wasn't luck—it came from profiling, optimization, and deep understanding of how the algorithm and hardware work together.

"Documentation is code for humans."

Writing good docs is as important as writing good code. Nobody will use your library if they can't figure out how.

"Open source is about community, not just code."

The most rewarding part wasn't writing the code—it was sharing it, helping others use it, and learning from their feedback.

What's Next for Me?

This project opened doors I didn't know existed:

Try It Yourself!

I encourage you to:

  1. Clone the repo: github.com/sumanpokhrel-11/symspell-c99
  2. Run the benchmarks: See the performance for yourself
  3. Read the code: It's only 700 lines—very approachable
  4. Integrate it: Use it in your next project
  5. Contribute: Help make it even better

Get in Touch

I'd love to hear from you:

Whether you're:

Let's connect!

Final Thoughts

Building SymSpell C99 reminded me why I fell in love with programming: the joy of creating something useful, the satisfaction of solving hard problems, and the thrill of sharing it with the world.

If you're reading this and thinking about starting your own project:

Do it.

Don't wait for the perfect idea or the perfect time. Pick something that excites you, start small, ship it, and iterate.

The best time to start was yesterday. The second best time is now.

Thank you for reading! Now go build something amazing. 🚀


Suman Pokhrel
Software Developer | Systems Programmer | Open Source Enthusiast
Kathmandu, Nepal

⭐ Star the project on GitHub!

If you found this project useful or interesting, please star the repository:

★ Star SymSpell C99 on GitHub

Comments

Have questions or feedback? Leave a comment below or reach out on GitHub Discussions.