I'm excited to announce the initial release of TextParser, a new Elixir library for extracting and validating structured tokens from text. Whether you need to parse URLs, hashtags, mentions, or custom patterns, TextParser provides a flexible and extensible solution.
Why TextParser?
TextParser was born from real-world needs at justcrosspost.app, where processing tags, mentions, and URLs for Bluesky required precise handling of text tokens. The library has been designed with several key features in mind:
- Accurate Position Tracking: Each extracted token includes exact byte positions in the original text
 - Built-in Token Types: Ready-to-use parsers for URLs, hashtags, and @-mentions
 - Custom Token Support: Easy creation of custom token extractors
 - Validation Rules: Flexible token validation through pattern matching and custom rules
 - Unicode Support: Proper handling of emoji and other Unicode characters
 
Quick Start
Add TextParser to your project's dependencies:
def deps do
  [
    {:text_parser, "~> 0.1"}
  ]
end
Basic usage is straightforward:
text = "Check out https://elixir-lang.org #elixir"
result = TextParser.parse(text)
# Extract URLs
urls = TextParser.get(result, TextParser.Tokens.URL)
# => [%TextParser.Tokens.URL{value: "https://elixir-lang.org", position: {10, 32}}]
# Extract hashtags
tags = TextParser.get(result, TextParser.Tokens.Tag)
# => [%TextParser.Tokens.Tag{value: "#elixir", position: {33, 40}}]
Custom Token Types
One of TextParser's strengths is its extensibility. Here's an example of a custom token for extracting ISO 8601 dates:
defmodule MyParser.Tokens.Date do
  use TextParser.Token,
    pattern: ~r/(?:^|\s)(\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]))/,
    trim_chars: [",", ".", "!", "?"]
  def is_valid?(date_text) when is_binary(date_text) do
    case Date.from_iso8601(date_text) do
      {:ok, _date} -> true
      _ -> false
    end
  end
end
# Usage
text = "Meeting on 2024-01-15, party on 2024-12-31!"
result = TextParser.parse(text, extract: [MyParser.Tokens.Date])
Custom Validation Rules
Need custom validation? TextParser provides a behaviour that you can use to implement your own validation rules:
defmodule BlueskyParser do
  use TextParser
  def validate(%TextParser.Tokens.Tag{value: value} = tag) do
    if String.length(value) >= 66,
      do: {:error, "tag too long"},
      else: {:ok, tag}
  end
end
What's Next?
This initial release provides a solid foundation for text token extraction, but this is just a good start 🙂 Here some things I'm planning to work on next:
- Additional built-in token types
 - Integration with NimbleParsec for simpler and more composable extraction rules
 - Integration guides for popular frameworks
 - Removal of a couple of bluesky-specific pieces in Tag handling
 
Get Started
- GitHub: https://github.com/solnic/text_parser
 - Documentation: https://hexdocs.pm/text_parser
 - Issues & Feature Requests: GitHub Issues
 
Contributions and feedback are welcome! Whether you find a bug, have a feature request, or want to contribute code, please feel free to get involved.