Regex vs Search Terms – Finding What You Need In Your Logs


    Full-text searches are a marvel of modern computing. In less than a second, search engines can match a query against hundreds of millions of documents. In the early days of search engines, you often had to use specific search operators and terms to get accurate results.
    Since then, techniques such as keyword indexing, fuzzy searching, and natural language processing have allowed search engines to return more accurate results from simpler searches.
    Today, text searches are commonly performed using search terms or regular expressions (regex).

    What is Regex?

    Regex, or regular expressions, is a language used to define search patterns based on a set of rules. In order to perform a full-text search using regex, a developer has to explicitly define the pattern that they’re looking for using string literals, delimiters, control characters, operators, and more. This offers greater precision but requires a significant amount of expertise. If you want to learn the basics of regex, read our guide.
    We chose a Google-like search interface so anybody could search through their logs without having to learn a new language. This translates to the quickest turnaround possible, no matter how many log lines exist.
    Although developers have traditionally used regex to grep through local log files, search terms offer several benefits, including simpler expressions, support for indexed fields, easier comparison operations, and better performance.

    Simpler Expressions

    The flexibility of regex is both a strength and a weakness. As a general-purpose utility, regex fits a wide array of use cases including searching, parsing, and string manipulation. The problem is that a significant number of rules, operators, and standards were added to the language in order to support this functionality. This makes even basic expressions difficult to understand and even harder to debug.

    Regex Examples:

    Let’s look at a common use case for logging: searching for logs over a date range. Our logs are sent from a local syslog server to Mezmo, formerly known as LogDNA. With regex, we need to define a search pattern that specifically matches the syslog message format, such as the following event:

    Oct 24 16:43:04 debian-logdna syslog  localhost
    @ - - [2018-10-24 16:43:01.394735381 -0400 EDT]
    "GET / HTTP/1.1" 302 96 "" "curl/7.52.1" 0.410829

    For example, imagine we want to analyze events logged between 3pm and 6pm last Friday. To do this, we need to create an expression that searches on both date and time. We also need to make sure that we only compare against the timestamp appearing at the start of the message, in case the same pattern appears multiple times in the message. We can do this using the following regex:

    ^Oct 19 1[5-7]:[0-9]{2}:[0-9]{2}

    We start by using an anchor (^) to perform our match on the start of the message. We then search for messages beginning with Oct 19, which limits the range to a specific date. Finally, we match times from 15:00:00 through 17:59:59 (1[5-7]:[0-9]{2}:[0-9]{2}). Widening the hour class to [5-8] would not stop at 6pm: it would also match everything up to 18:59:59. While this works, it’s extremely verbose and requires several additional operators just to make sure that we match on the correct text. We also need to rewrite this expression whenever the current week changes, or if we collect logs stored in different formats.
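    To check where an hour class actually starts and stops, here is a minimal Python sketch using the standard `re` module. The sample lines are made up for illustration and only mimic the syslog layout above. It compares an hour class of 1[5-7] with 1[5-8], and includes one line that repeats an Oct 19 timestamp in its body to show what the anchor guards against.

```python
import re

# Hypothetical sample lines modeled on the syslog event shown earlier.
LINES = [
    "Oct 19 14:59:59 debian-logdna syslog  before the window",
    "Oct 19 15:00:00 debian-logdna syslog  window opens",
    "Oct 19 17:59:59 debian-logdna syslog  last second before 6pm",
    "Oct 19 18:30:00 debian-logdna syslog  half past six",
    "Oct 24 16:43:04 debian-logdna syslog  replayed event from Oct 19 16:00:00",
]

# Hours 15 through 17, i.e. 15:00:00 up to 17:59:59.
UNTIL_SIX = re.compile(r"^Oct 19 1[5-7]:[0-9]{2}:[0-9]{2}")
# Hours 15 through 18, which also admits 18:00:00 up to 18:59:59.
THROUGH_SIX = re.compile(r"^Oct 19 1[5-8]:[0-9]{2}:[0-9]{2}")


def matching_times(pattern, lines):
    """Return the HH:MM:SS part of every line whose leading timestamp matches."""
    return [line[7:15] for line in lines if pattern.match(line)]


print(matching_times(UNTIL_SIX, LINES))
print(matching_times(THROUGH_SIX, LINES))
```

    Neither pattern matches the Oct 24 line, because the anchor ties the match to the leading timestamp rather than the one repeated in the message body.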
    Alternatively, we could use natural language to describe the logs that we’re looking for. We can do this in Mezmo using the following query:
    last friday 3pm to last friday 6pm
    This returns the same results, but is much more intuitive and much less cumbersome to write. Additionally, we can reuse the query without having to change it per week.
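    To illustrate why a relative query survives from week to week, the sketch below resolves "last friday 3pm to last friday 6pm" into concrete timestamps and filters on parsed values instead of text. It is an assumption-laden illustration, not a description of how Mezmo parses time ranges: the helper names are hypothetical, "last friday" is read as the most recent Friday strictly before today, and the year is supplied by hand because the syslog timestamp omits it.

```python
from datetime import date, datetime, timedelta

FRIDAY = 4  # date.weekday() counts Monday as 0, so Friday is 4


def last_friday(today):
    """Most recent Friday strictly before `today` (one assumed reading of "last friday")."""
    days_back = (today.weekday() - FRIDAY) % 7 or 7
    return today - timedelta(days=days_back)


def friday_window(today, start_hour=15, end_hour=18):
    """Return (start, end) datetimes for the 3pm to 6pm window on last Friday."""
    day = last_friday(today)
    return (datetime(day.year, day.month, day.day, start_hour),
            datetime(day.year, day.month, day.day, end_hour))


def parse_syslog_time(line, year):
    """Parse the leading 'Oct 19 15:00:00' timestamp of a syslog line."""
    # "%b" expects English month abbreviations under the default C locale.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


# Hypothetical sample lines in the syslog layout used above.
LINES = [
    "Oct 19 14:59:59 debian-logdna syslog  before the window",
    "Oct 19 16:20:00 debian-logdna syslog  inside the window",
    "Oct 19 18:00:00 debian-logdna syslog  window closes",
    "Oct 24 16:43:04 debian-logdna syslog  a different day",
]

start, end = friday_window(date(2018, 10, 24))  # pretend "today" is Wed, Oct 24, 2018
in_window = [line for line in LINES
             if start <= parse_syslog_time(line, start.year) <= end]
print(start, end)
print(len(in_window))
```

    Because the window is computed from the current date, the same call keeps working next week, whereas the regex would need a new date literal.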

    Support for Indexed Fields

    Regex is designed for plain text searches, such as finding text within a document. For regex to work with logs, it needs to process the raw log data. This ignores a key benefit of formats such as JSON and syslog, which is the ability to store data in different fields. This could cause a significant amount of overhead, especially with large logs and complex expressions.
    For example, let’s search for messages originating from a specific host. Syslog automatically stores the name of the host in the hostname field. Some messages repeat the hostname throughout the body, meaning we need to define rules for matching specifically on the hostname field:

    Oct 24 16:43:04 debian-logdna syslog  localhost @
    - - [2018-10-24 16:43:01.394735381 -0400 EDT]
    "GET / HTTP/1.1" 302 96 "" "curl/7.52.1" 0.410829
    Oct 24 16:43:04 debian-logdna syslog  My unqualified host
    name (debian-logdna) unknown; sleeping for retry
    Oct 24 16:43:07 debian-logdna auth.log  pam_unix(sudo:session):
    session opened for user root by (uid=0)

    We need to create an expression that searches for text that follows the hostname specification, while also immediately following the date at the start of the message. What we end up with is the following expression, where hostname is the host that we’re looking for:
    (^[a-zA-Z]{3} [0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} \b)hostname\b
    The bulk of this expression ((^[a-zA-Z]{3} [0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} \b)) is just to ensure that we match the hostname in the field following the date. However, since Mezmo and other log management tools automatically index syslog fields, we can return the same results much faster by searching on the hostname field:
    source:hostname
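    The idea behind a field search can be sketched in a few lines of Python: parse each event into named fields once, then look the host up directly instead of re-scanning raw text on every query. This is a simplified illustration under stated assumptions, not Mezmo's indexing implementation. The `FIELDS` pattern, the `build_host_index` helper, and the `web-01` event are hypothetical.

```python
import re
from collections import defaultdict

# Sample events in the layout used above: "<Mon> <day> <time> <host> <app>  <message>".
# The web-01 line is invented to show a hostname appearing only in a message body.
LINES = [
    "Oct 24 16:43:04 debian-logdna syslog  My unqualified host name (debian-logdna) unknown; sleeping for retry",
    "Oct 24 16:43:07 debian-logdna auth.log  pam_unix(sudo:session): session opened for user root by (uid=0)",
    "Oct 24 16:43:09 web-01 syslog  lost connection to debian-logdna",
]

# Split a line into named fields once, at ingestion time.
FIELDS = re.compile(
    r"^(?P<timestamp>[A-Za-z]{3} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}) "
    r"(?P<host>\S+) (?P<app>\S+)\s+(?P<message>.*)$"
)


def build_host_index(lines):
    """Map each hostname to the parsed events that came from it."""
    index = defaultdict(list)
    for line in lines:
        match = FIELDS.match(line)
        if match:
            index[match.group("host")].append(match.groupdict())
    return index


index = build_host_index(LINES)
print(len(index["debian-logdna"]))
print([event["app"] for event in index["debian-logdna"]])
```

    The web-01 event mentions debian-logdna in its message, but it is filed under its own host, so a lookup on the host field never sees it.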

    Easier Comparison Operations

    A common use case with logs is searching for a range of values. Consider logging HTTP requests for a web server. If we want to find requests that resulted in an error, we would need to search the HTTP status field for codes in the 4xx (client error) and 5xx (server error) ranges. Three-digit numbers are not uncommon in log messages, and the same error message may be repeated in multiple events:

    Oct 24 17:59:58 debian-logdna workhorse info localhost @ - -
    [2018-10-24 17:59:58.074327724 -0400 EDT] "GET /invalid-url
    HTTP/1.1" 404 2440 "" "Mozilla/5.0 (X11; Linux x86_64;
    rv:52.0) Gecko/20100101 Firefox/52.0" 0.054947
    Oct 24 17:59:59 debian-logdna syslog  localhost @ - -
    [2018-10-24 17:59:58.074327724 -0400 EDT] "GET /invalid-url
    HTTP/1.1" 404 2440 "" "Mozilla/5.0 (X11; Linux x86_64;
    rv:52.0) Gecko/20100101 Firefox/52.0" 0.054947
    Oct 24 17:59:59 debian-logdna daemon.log  localhost @ - -
    [2018-10-24 17:59:58.074327724 -0400 EDT] "GET /invalid-url
    HTTP/1.1" 404 2440 "" "Mozilla/5.0 (X11; Linux x86_64;
    rv:52.0) Gecko/20100101 Firefox/52.0" 0.054947

    The goal is to match on the first instance of a three-digit number immediately following the request method, URL, and protocol (“GET /invalid-url HTTP/1.1”). Since these can change, we need to create a rule that is flexible enough to accommodate this:

    (\"[A-Z]{3,4} \/[a-z0-9-\/]+ HTTP\/[1-2]\.[0-9]" )[4-5][0-9]{2}

    The first part of the expression ((\"[A-Z]{3,4} \/[a-z0-9-\/]+ HTTP\/[1-2]\.[0-9]" )) only searches for the method, URL, and protocol. The second part ([4-5][0-9]{2}) searches for the actual status code and only matches on codes beginning with a 4 or 5. This ensures that we only find 4xx and 5xx codes placed in a specific location in the message.
    Alternatively, in Mezmo, we can simply search the response field for values greater than or equal to 400 and less than 600:
    response:(>=400 <600)
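    The same contrast shows up in code. Once the status code is extracted as a number, the range check becomes an ordinary comparison rather than a character class. The sketch below is a hedged illustration in Python. The `REQUEST` pattern and helper names are hypothetical, and the sample lines are shortened versions of the events above plus one invented 503 response.

```python
import re

# Shortened sample events; the 503 line is invented for illustration.
LINES = [
    'localhost @ - - "GET /invalid-url HTTP/1.1" 404 2440 "" "Mozilla/5.0" 0.054947',
    'localhost @ - - "GET / HTTP/1.1" 302 96 "" "curl/7.52.1" 0.410829',
    'localhost @ - - "POST /api/items HTTP/1.1" 503 512 "" "curl/7.52.1" 1.204100',
]

# Locate the quoted request and capture the three-digit status that follows it.
REQUEST = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/(?P<version>[0-9.]+)" (?P<status>[0-9]{3}) '
)


def status_of(line):
    """Return the HTTP status as an int, or None if the line has no request."""
    match = REQUEST.search(line)
    return int(match.group("status")) if match else None


def error_statuses(lines):
    """Keep statuses that are >= 400 and < 600, using a numeric comparison."""
    statuses = (status_of(line) for line in lines)
    return [s for s in statuses if s is not None and 400 <= s < 600]


print(error_statuses(LINES))
```

    Changing the range, for example to server errors only, means editing two numbers instead of reworking a pattern.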

    Better Performance

    Regex performs well enough for small searches, but performance gets significantly worse as the search field grows and the expressions become more complex. With log data, regex is often used on individual entries to parse fields and perform real-time analysis. However, using regex to search through gigabytes or terabytes of log data is incredibly slow and resource-intensive no matter how well optimized the expression is.
    The challenge is writing expressions that are both effective and efficient. Our earlier example on searching hostnames takes 31 steps to complete. If we remove the anchor (^) at the start of the expression, the number of steps more than doubles. Inexperienced developers might use wildcards and lookarounds to add flexibility to their expressions without realizing the performance penalties that both of these incur. Catastrophic backtracking is a dangerous example of this and can even be exploited for denial-of-service attacks.
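    Catastrophic backtracking is easy to reproduce. The sketch below uses Python's standard `re` module, a backtracking engine, and the textbook pattern (a+)+$, which is not taken from the examples above. It times a failing match against inputs of growing length and compares it with the flat pattern a+$, which accepts the same strings. Exact timings depend on the machine, so none are shown.

```python
import re
import time

NESTED = re.compile(r"(a+)+$")  # nested quantifiers: prone to catastrophic backtracking
FLAT = re.compile(r"a+$")       # accepts the same strings with far less backtracking


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one anchored match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


for n in (12, 16, 20):
    text = "a" * n + "b"  # the trailing "b" forces every attempt to fail
    nested_result, nested_secs = time_match(NESTED, text)
    flat_result, flat_secs = time_match(FLAT, text)
    print(f"n={n}: nested {nested_secs:.4f}s, flat {flat_secs:.6f}s")
```

    The work done by the nested pattern roughly doubles with each extra character, so a short attacker-supplied string is enough to stall a search, while the flat pattern stays fast.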

    Conclusion

    Regex and Google-like search terms are two of the most popular ways to search through log files. Regex syntax lets users pinpoint exactly what they’re looking for, but it comes with two tradeoffs. The first is the ramp-up time needed to learn how to craft the expressions you need. The second is slower performance on large sets of log data, along with the penalties that complex queries incur. Search terms provide speed, accessibility, and support for indexed data. This is how Mezmo manages to provide fast searches regardless of log volume. To learn more about how searching works in Mezmo, visit the search documentation or sign into the Mezmo web app.
