What needs to be improved with the current file type sniffing?

The current system has a problem if:

  1. two or more formats share the same file extension (e.g. .log), and none of them has a magic marker on the first line of the file.
  2. a file has neither an extension nor a magic marker on the first line.
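
Both failure cases can be reproduced with a simplified model of the lookup. This is only a sketch, not the actual implementation: the Grammar record, the candidates function, and the three example grammars are hypothetical, and it assumes a first-line match takes precedence over the extension.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Grammar:
    # Hypothetical record; the real grammar format is not modelled here.
    name: str
    extensions: tuple                 # file extensions claimed by this grammar
    first_line: Optional[str] = None  # regex for a magic marker on line 1


def candidates(grammars, filename, first_line):
    """Return every grammar that could apply to the file."""
    # Assumption: a first-line (magic marker) match wins outright.
    by_marker = [g for g in grammars
                 if g.first_line and re.search(g.first_line, first_line)]
    if by_marker:
        return by_marker
    # Otherwise fall back to the extension, if the file has one.
    ext = filename.rsplit(".", 1)[1] if "." in filename else None
    return [g for g in grammars if ext in g.extensions]


# Made-up grammars, for illustration only.
grammars = [
    Grammar("Web Server Log", ("log",)),
    Grammar("Build Log", ("log",)),
    Grammar("Shell Script", ("sh",), r"^#!.*\bsh\b"),
]

# Case 1: shared extension, no marker. Two candidates, no way to choose.
ambiguous = candidates(grammars, "server.log", "127.0.0.1 - - GET /")

# Case 2: no extension, no marker. No candidates at all.
unknown = candidates(grammars, "README", "Hello")
```

A result with exactly one entry is the only unambiguous outcome; the two cases above return two entries and zero entries respectively.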

So what file types are we talking about?

  • property lists: these have a dict or plist extension, but can be binary, ASCII, or XML. The current system handles this by assigning the XML plist grammar to both extensions and using a first-line match of ^\{\s*$ for the ASCII variant. This is not ideal, since not all ASCII property lists conform to that first-line match.

  • shell scripts with a shebang of exec and the actual interpreter on the next line (note to self: dig up an example of this.)
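
For the property list case, one possible alternative to the first-line match is to look at the leading bytes of the file. The sketch below is an assumption rather than an existing function: plist_variant is a made-up name, it relies on binary property lists starting with the bytes "bplist", and it treats anything that is neither binary nor XML as ASCII without validating it.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def plist_variant(data: bytes) -> str:
    """Guess which property list variant the given file content is."""
    # Binary property lists start with the magic bytes "bplist".
    if data.startswith(b"bplist"):
        return "binary"
    # Skip an optional UTF-8 byte order mark and leading whitespace.
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    # XML variants open with "<?xml", "<!DOCTYPE" or "<plist".
    if data.lstrip().startswith(b"<"):
        return "xml"
    # Assumption: everything else is the ASCII variant (not validated).
    return "ascii"


one_line_dict = b'{ name = "x"; }'
top_level_array = b'(\n  "a",\n  "b"\n)'
```

Both one_line_dict and top_level_array are ASCII property lists whose first line does not match ^\{\s*$, so they show the kind of input the current first-line rule misses.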

I guess we could add something about HTML/SGML/XML doc types. In practice, though, I think we always have an extension to go by, so the doc type would only be used to pick e.g. an XHTML Strict language grammar over the plain HTML grammar, and we didn’t pursue the “have different language grammars for different document types” idea.