Mozilla Charset Detectors

2017-05-22 Thread Gabriel Sandor
Greetings,

I recently came across the Mozilla Charset Detectors tool, at
https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on
a C# project where I could use a port of this library (e.g.
https://github.com/errepi/ude) for advanced charset detection.

However, I'm not sure whether this tool is deprecated or still
recommended by Mozilla for use in modern applications. The tool page is
archived and most of the links are dead, while the code seems to be at
least 7-8 years old. Could you please tell me what the status of this
tool is, and whether I should use it in my project or look for something else?

Thanks in advance.
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Mozilla Charset Detectors

2017-05-23 Thread Gabriel Sandor
Hello Henri,

I was afraid this might be the case, so the library really is deprecated.

The project I'm working on involves a multilingual environment, users, and
files, so yes, having a good encoding detector is important. Thanks for the
alternative recommendations. I see that they are C/C++ libraries, but in
theory they can be wrapped into a managed C++/CLI assembly and consumed by
a C# project. I haven't yet seen any existing C# ports that also handle
charset detection.

On Mon, May 22, 2017 at 5:49 PM, Henri Sivonen  wrote:

> On Mon, May 22, 2017 at 12:13 PM, Gabriel Sandor
>  wrote:
> > I recently came across the Mozilla Charset Detectors tool, at
> > https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working
> on
> > a C# project where I could use a port of this library (e.g.
> > https://github.com/errepi/ude) for advanced charset detection.
>
> It's somewhat unfortunate that chardet got ported over to languages
> like Python and C# with its shortcomings. The main shortcoming is that
> despite the name saying "universal", the detector was rather arbitrary
> in what it detected and what it didn't. Why Hebrew and Thai but not
> Arabic or Vietnamese? Why have a Hungarian-specific frequency model
> (that didn't actually work) but no models for e.g. Polish and Czech
> from the same legacy encoding family?
>
> The remaining detector bits in Firefox are for Japanese, Russian and
> Ukrainian only, and I strongly suspect that the Russian and Ukrainian
> detectors are doing more harm than good.
>
> > I'm not sure however if this tool is deprecated or not, and still
> > recommended by Mozilla for use in modern applications. The tool page is
> > archived and most of the links are dead, while the code seems to be at
> > least 7-8 years old. Could you please tell me what's the status of this
> > tool and whether I should use it in my project or look for something
> else?
>
> I recommend not using it. (I removed most of it from Firefox.)
>
> I recommend avoiding heuristic detection unless your project
> absolutely can't do without. If you *really* need a detector, ICU and
> https://github.com/google/compact_enc_det/ might be worth looking at,
> though this shouldn't be read as an endorsement of either.
>
> With both ICU and https://github.com/google/compact_enc_det/ , watch
> out for the detector's possible guess space containing very rarely
> used encodings that you really don't want content detected as by
> mistake.
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
>


Re: Mozilla Charset Detectors

2017-05-30 Thread Gabriel Sandor
They can come from arbitrary sources that are out of my control. Therefore
I may not get the charset of the original document, so all I'm left with is
heuristic detection for those fragments. The application must be able to
deal with any XML it receives; it doesn't impose any particular structure
or content (think of XML editors like Notepad++).

Besides XML, there are also plain text files, which have no standard way
of declaring their encoding at all.

No matter how much I'd like to avoid it, there are cases where heuristic
encoding detection is the only option.

On Fri, May 26, 2017 at 9:45 PM, Daniel Veditz  wrote:

> On Fri, May 26, 2017 at 4:12 AM,  wrote:
>
>> Still, sometimes XML fragments come up and even if they are not 100% XML
>> spec compliant I still have to process them. This includes encoding
>> detection as well, when the XML declaration is missing from the fragments.
>>
>
> ​Where do the fragments come from? If you pulled them out of a document
> then you should have a charset (even if we have to guess at the document
> level). If you only get the fragments through an API the charset should be
> passed along as an argument to the API, otherwise treat them as Henri
> described above.
>
> -Dan Veditz
>