On Mon, Dec 26, 2011 at 12:54 PM, Brian Smith <bsm...@mozilla.com> wrote:
> Henri Sivonen wrote:
>> I suspect some of our localizations have inappropriate defaults for
>> the fallback character encoding. On the conceptual level, telemetry
>> could be used to discover:
>> a) how often pages end up relying on the fallback character encoding
>> b) what the fallback encoding is in those cases (i.e. has the user
>> changed it, and to what)
>> c) how often users override the character encoding on a per-page basis
>>
>> It turns out there are two problems here:
>> 1) Telemetry doesn't seem to have nice ready-made tools for dealing
>> with enumerations.
>> 2) These metrics would only make sense to measure on a
>> per-localization basis, and there doesn't appear to be a way to do that.
>
> Regarding privacy, would we even want to include these two pieces of
> information in the telemetry data?
>
> * what the fallback encoding was and/or what it was changed to
> * what the locale was
How do you suggest we answer the question: "Do we have the most
successful encoding defaults for all our localizations?"

> This is the kind of data that, in combination with other such data, would
> seem to cause trouble along the same lines as the Netflix Prize [1]. For
> example, if the locale is Romanian, then you have reduced the search space
> for an individual from 7,000,000,000 to 25,000,000. Let's say the platform is
> Android. Then, you've probably narrowed the search space down to less than
> 1,000 people. With just these two factors, you aren't far from isolating
> individual users from their telemetry data, even before you combine it with
> other factors. At least in theory, this could be cross-referenced with other
> datasets, such as Romanian-language tweets sent from the Twitter for Android
> app at times close to the times the Telemetry data was sent, to identify
> telemetry-providing users by name with a high degree of accuracy.

Do I understand correctly that the problem is that even if we stored
the locale-related telemetry data separately from other telemetry data
and threw out the IP address and timestamp right away, we couldn't
prove that to users? The actual ability to correlate the data could be
removed by discarding the originating IP address, the exact time, and
the association with other telemetry data, right?

On Mon, Dec 26, 2011 at 7:30 PM, Doug Turner <doug.tur...@gmail.com> wrote:
> Brian, you are right. Sid's teams should audit these closely. Our
> privacy policy says:
>
> """
> Beginning with version 7, Firefox includes functionality that is
> turned off by default to send to Mozilla non-personal usage,
> performance, and responsiveness statistics about user interface
> features, memory, and hardware configuration.
> """

The data I'm interested in would fall under "user interface features"
first and "performance" second.
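(Aside for readers: the anonymity-set argument quoted above can be made concrete with a rough back-of-the-envelope sketch. The population figures are the approximate ones from the example; the per-platform narrowing fraction is an illustrative assumption chosen to reproduce the "about 1,000 people" figure, not measured data.)

```python
# Rough anonymity-set estimate: each quasi-identifier released with a
# telemetry ping shrinks the set of candidate users multiplicatively.
WORLD_POPULATION = 7_000_000_000
ROMANIAN_SPEAKERS = 25_000_000   # locale = ro (figure from the example above)
ANDROID_FRACTION = 1 / 25_000    # assumed fraction; chosen to yield ~1,000

anonymity_set = WORLD_POPULATION
for narrowing_fraction in (ROMANIAN_SPEAKERS / WORLD_POPULATION,
                           ANDROID_FRACTION):
    anonymity_set *= narrowing_fraction

# Two quasi-identifiers already shrink 7 billion people to roughly 1,000.
print(int(anonymity_set))
```

The point of the sketch is that "non-personal" dimensions compound: each additional factor (timestamp, other telemetry fields) multiplies the narrowing further, which is why correlating locale with other dimensions is the privacy concern here.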
I'd like to have answers to these questions, and I think telemetry
might be able to provide the answers:

1) Do we have the most successful encoding defaults for our locales?

The reason I want to know: by inspection, I suspect that the locales
turned up by the search
http://mxr.mozilla.org/l10n-central/search?string=charset.default\s*%3D\s*UTF-8&regexp=1&find=\.properties%24&findi=&filter=^[^\0]*%24&hitlimit=&tree=l10n-central
have inappropriate defaults, because the fallback encoding exists for
misauthored legacy content that predates UTF-8. Inappropriate defaults
may lead to user frustration or to users choosing another browser over
Firefox.

2) Instead of having locale-specific defaults, could we decide the
fallback encoding based on the top-level domain of the site?

The reason I want to know: currently, the Web-exposed behavior of
Firefox depends on the localization. It's bad that the way sites work
depends on the UI language of the browser. In principle, you should be
able to read e.g. Russian-language sites as successfully with an
Estonian-language Firefox as with a Russian-language Firefox.

3) Could we use a pan-Chinese encoding detector?

Currently, it appears that our zh-TW localization turns on the
universal detector. The universal detector has various problems (it
isn't actually universal), so if the use case is that Taiwanese users
often read both Traditional Chinese and Simplified Chinese legacy
content while also reading some English content, maybe a detector that
doesn't try to detect stuff as Cyrillic encodings could be more
successful and perform faster.

4) Do users actually use the Character Encoding menu enough to warrant
keeping that UI around? Can we get rid of the menu already?

Reasons why it would be nice to get rid of the menu:
* We shouldn't be signaling to Web authors that they have the option
to leave this problem to users instead of getting their authoring act
together.
* Less opportunity for users to introduce data corruption by using the
menu and then submitting a form.
* Less code complexity.
* Less code to maintain and fix. (I just wrote a fix in this area.
Even though the code changes weren't that difficult, writing the unit
tests was quite time-consuming.)

Can answers to any of these questions be pursued with telemetry, given
our policies? Has collecting impression data from the snippet service
or the Metrics Data Ping changed the thinking on whether it's okay to
measure usage of Firefox features that correlate with language and/or
geography?

Alternatives to telemetry that I can think of:
For #1: Instead of measuring our own success, seeing what the most
popular browser in each locale does and doing the same.
For #2: Doing a massive Web crawl.
For #3: Shipping a pan-Chinese detector and seeing if anyone complains.
For #4: Measuring the menu usage frequency without measuring what gets
overridden and to what.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform