I like the regex option, and I _think_ that the anchor at the beginning
(along with the lack of backtracking) shouldn't cause horrible performance
degradation.
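
To make that concrete, here is a rough sketch of the kind of anchored match I
have in mind. It uses plain java.util.regex rather than the real mime-magic
matching code, and the class and method names (HeaderPrefixSketch,
startsWithUnsurePrefix, looksLikeRfc822) are made up for illustration. The
"\nReceived:" follow-up check is only a stand-in for the shared block of real
headers, not the actual list.

```java
import java.util.regex.Pattern;

// Illustrative only: not Tika code, just the anchored-alternation idea.
final class HeaderPrefixSketch {
    // The three "could be a header, not sure" prefixes from the thread,
    // anchored at the start and using a non-capturing group.
    static final Pattern UNSURE_PREFIX = Pattern.compile("^(?:X-|DKIM-|ARC-)");

    // lookingAt() only attempts a match at the beginning of the input,
    // so input that does not start with a prefix is rejected after
    // inspecting a handful of characters.
    static boolean startsWithUnsurePrefix(CharSequence head) {
        return UNSURE_PREFIX.matcher(head).lookingAt();
    }

    // Assumed follow-up: one shared check for a "real" header further down,
    // written once instead of being repeated per prefix.
    static boolean looksLikeRfc822(String head) {
        return startsWithUnsurePrefix(head) && head.contains("\nReceived:");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeRfc822("DKIM-Signature: v=1\nReceived: from a"));
        System.out.println(looksLikeRfc822("X-ray notes\nnothing mail-like here"));
    }
}
```

Since the alternatives are short literals tried only at offset 0, the cost per
file should stay small and roughly constant, which is why I'm not too worried
about performance.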

On Tue, Jun 9, 2020 at 7:04 AM Nick Burch <[email protected]> wrote:

> Hi All
>
> At the moment, to detect RFC822 emails, we try and check for a bunch of
> common header lines right at the start. If none match, we check for a few
> "could be an unusual header, could be some text" prefixes, followed by
> checking for common headers in a larger area of text below.
>
> For example, starts with "Received:" or starts with "X-" and has
> "\nReceived:" near that, in mime-magic it's
>
> https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L6100
>
> After a recent bug, we now have 3 different "could be a header not sure"
> blocks at the start (X-, DKIM- or ARC-), all with exactly the same block
> of possible real headers below. These need to be kept in sync between the
> 3 initial matches, and if they drift apart that could cause bugs
>
> Ideally, I'd like to group those three together to avoid that + simplify +
> make it easier to understand
>
>
> One option might be to make the first bit a regexp, so we can do eg
> ^((X-)|(DKIM-)|(ARC-)) to match all of them. Not sure if that's clearer,
> nor the performance? Could maybe even then add the other headers to check
> in after, if that doesn't make it too hard to understand?
>
> Alternately, we could maybe tweak the xml to support an or construct, so
> you could give multiple ones to match at one level with multiple "normal
> or's" below?
>
> Or something else?
>
> Any thoughts anyone?
>
> Thanks
> Nick
>
