Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Henrique de Moraes Holschuh
On Tue, 03 Apr 2018, Michael Lange wrote:
> I believe (please anyone correct me if I am wrong) that "text" files
> won't contain any null byte; many text editors even refuse to open such a

Depends on the encoding.  For ASCII, ISO-8859-* and UTF-8 (and any other
modern encoding AFAIK, other than modified UTF-8), any zero bytes map
one-to-one to the NUL character/code point.  I don't recall how it is on
other common encodings of the 80's and 90's, though.

Some even-more-modern encodings (modified UTF-8 :p) simply do NOT use
bytes with the value of zero when encoding characters, so NUL is encoded
by a different sequence, and you can safely use a byte with the value of
zero for some out-of-band control (like zero-terminated strings that can
contain NULs, etc) -- note that NUL is a character, and it might be
represented by a sequence of bytes that has nothing to do with zeroes on
a particular encoding...

(in fact, C strings are *zero-terminated*, not NUL-terminated, but most
of the time this is irrelevant :p).

Also, a text file MAY contain NULs (the character), it is just
considered bad practice (nowadays?).  Don't assume you won't see any.
For example, received e-mail is *more* likely to have NULs in it than
normal text due to the quality of some mail agents out there.  I recall
postfix would reject a *lot* of crap when we configured it to refuse to
accept NULs outside of 8-bit bodies, because Cyrus-IMAPd *refuses* any
such crap, and we wanted it bounced as early as possible.

(note that NULs are forbidden in MIME-compliant email text and ESMTP,
unless encoded or guarded by an 8-bit transfer area of known size, so
there you have it: NULs in one text format that actually forbids them!).

> Probably it is the same with some other control characters like 04 (End
> of Transmission). When I look at https://en.wikipedia.org/wiki/ASCII
> it seems like 1C (File Separator) or 1E (Record Separator) might be 
> appropriate choices for you. I'm no expert on this, though.

Well, ASCII control characters were inherited by ISO-8859-* and Unicode,
so yes, you can use them.  But so could the data file.  It would be
perfectly ok for a text data file to use the record separator control
characters to delimit records in a table, for example...

Here's a good definition of them (follow the hyperlinks for the
definition of each control character):
https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)


Here is also a proper solution: use modified UTF-8 (which encodes NUL so
that zero bytes are *never* present in the stream): encode every input
format to modified UTF-8, then add the zero-byte separators you want.

You'll have to normalize the input data set into known charset/encodings
and then recode them to modified UTF-8, of course.  You can't blindly
call any random data "UTF-8" (let alone modified UTF-8) and expect
things not to break horribly.
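
A minimal Python sketch of the zero-free idea above. It covers only the NUL
handling of modified UTF-8 (U+0000 written as the two bytes C0 80); full
modified UTF-8 as used by Java also treats characters outside the BMP
differently, which is left out here. The function names are made up for
illustration, and the input is assumed to be already-decoded Unicode text.

```python
def encode_nul_free(text: str) -> bytes:
    """UTF-8, except that U+0000 becomes the overlong pair C0 80."""
    # In UTF-8 a zero byte can only come from U+0000, so a byte-level
    # replace is safe here.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_free(data: bytes) -> str:
    """Inverse of encode_nul_free()."""
    # 0xC0 never occurs in valid UTF-8, so every C0 80 pair is one of ours.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


def join_records(records):
    """Join records with a zero byte as out-of-band separator."""
    return b"\x00".join(encode_nul_free(r) for r in records)


def split_records(blob: bytes):
    """Split on zero bytes and decode each record."""
    return [decode_nul_free(chunk) for chunk in blob.split(b"\x00")]
```

A record containing a NUL character survives the round trip because the only
zero bytes left in the joined stream are the separators.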

-- 
  Henrique Holschuh



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Nicolas George
rhkra...@gmail.com (2018-04-03):
>   and the data is stored in mbox formatted files.

DO NOT DO THAT.

This is the only good advice you can have for that project. Store your
data in a decent format.

Regards,

-- 
  Nicolas George


signature.asc
Description: Digital signature


Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
Sorry, I already have 300 MB plus stored in that format.  Where were you in 
2000 when I started the project?

On Wednesday, April 04, 2018 07:23:25 AM Nicolas George wrote:
> rhkra...@gmail.com (2018-04-03):
> > and the data is stored in mbox formatted files.
> 
> DO NOT DO THAT.
> 
> This is the only good advice you can have for that project. Store your
> data in a decent format.
> 
> Regards,



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Nicolas George
rhkra...@gmail.com (2018-04-04):
> Sorry, I already have 300 MB plus stored in that format.

Then convert. Small extra work now. Many less headaches later.

Regards,

-- 
  Nicolas George


signature.asc
Description: Digital signature


Re: utf

2018-04-04 Thread Henrique de Moraes Holschuh
On Tue, 03 Apr 2018, Darac Marjal wrote:
> If these things matter to you, it's better to convert from UTF-8 to Unicode,

UTF-8 *is* Unicode :p  What you mean is either UCS-4 or UTF-32 (which
are just another encoding for Unicode).  But all of them are Unicode.

UTF-* are only used for Unicode encodings: one implies the other.  You
could encode generic binary data using a bit-packing scheme identical to
UTF-8, but it would have to be called something else.

> first. I tend to think of Unicode as an arbitrarily large code page. Each
> character maps to a number, but that number could be 1, 1000 or 500_000
> (Unicode seems to be growing without an end in sight). Internally, you

It won't go past 32 bits without becoming something other than Unicode,
and the real limit is lower (0x10FFFF, from RFC 3629).  What the Unicode
consortium is doing is to *fill in* this range.  There is still a lot of
unallocated/reserved space, but yes, we are perfectly capable of filling
it up with junk given a few decades.
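
A small Python sketch of that upper bound, assuming the limit of U+10FFFF
that RFC 3629 sets for UTF-8 and the usual exclusion of the surrogate range
D800..DFFF. The helper names are illustrative only.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point allowed by RFC 3629


def is_scalar_value(cp: int) -> bool:
    """True if cp is in range and is not a UTF-16 surrogate."""
    return 0 <= cp <= MAX_CODE_POINT and not (0xD800 <= cp <= 0xDFFF)


def encodable_as_utf8(cp: int) -> bool:
    """Ask Python's own codec whether this code point can be encoded."""
    try:
        chr(cp).encode("utf-8")
        return True
    except ValueError:  # UnicodeEncodeError is a subclass of ValueError
        return False
```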

> might store those code points as Integers or Quad Words or whatever you
> like. Only once you're ready to transfer the text to another process (print

"whatever you like" is a rather bad idea.

Use whatever is appropriate for the internal representation of Unicode
on whatever programming language you are dealing with, and fall back to
UCS-4 (unsigned 32-bit integer) if there isn't one.

For C, you'd use uint32_t to store the codepoints, and UCS-4 (or UTF-32)
to encode them (they are the same if you don't care for detecting
illegal code points and are not serializing that data directly to the
outside world).  There is wchar_t, but that thing is bad news if you
need to be portable _and_ handle anything outside of the Unicode BMP.

IMO, you will want to avoid UCS-2 and UTF-16 as much as you can.

> Basically, you consider UTF-8 to be a transfer-only format (like Base64). If
> you want to do anything non-trivial with it, decode it into Unicode.

You mean decode it to UCS-4/UTF-32, but yes, that's the idea.
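
As a rough illustration in Python (not the C approach described above), here
is the same text as UTF-8 bytes, as a list of code points, and as UTF-32. The
function names are hypothetical; the point is only that all three carry the
same Unicode text.

```python
import struct


def utf8_to_code_points(data: bytes):
    """Decode UTF-8 into a list of integer code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


def code_points_to_utf32le(code_points):
    """Serialize code points as little-endian unsigned 32-bit integers."""
    return struct.pack("<%dI" % len(code_points), *code_points)
```

For text such as "a", "é", "€" and an emoji, the UTF-8 form uses 1, 2, 3 and
4 bytes respectively, while the UTF-32 form always uses 4 bytes per code
point.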

-- 
  Henrique Holschuh



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Greg Wooledge
On Wed, Apr 04, 2018 at 01:23:25PM +0200, Nicolas George wrote:
> rhkra...@gmail.com (2018-04-03):
> > and the data is stored in mbox formatted files.
> 
> DO NOT DO THAT.
> 
> This is the only good advice you can have for that project. Store your
> data in a decent format.

Perhaps an sqlite database.  At least, that is my first thought.
You might come up with a better solution depending on your specific
needs.  That solution won't be "an mbox folder".



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
I'll convert the file format after you convert the programs to work with the 
different file format.  Those programs include kmail, nail, (essentially all 
email programs that use mbox as the file format), recoll (conversion should not 
be difficult), various editors (nedit, kate, for which I've written syntax 
highlighters / folders for the current format), and all scintilla based 
editors, for which I'm working on a highlighter / folder.

Let me know when you're almost finished, so I can make the conversion.

On Wednesday, April 04, 2018 08:09:11 AM Nicolas George wrote:
> rhkra...@gmail.com (2018-04-04):
> > Sorry, I already have 300 MB plus stored in that format.
> 
> Then convert. Small extra work now. Many less headaches later.
> 
> Regards,



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Nicolas George
rhkra...@gmail.com (2018-04-04):
> I'll convert the file format after you convert the programs to work with the 
> different file format.  Those programs include kmail, nail, (essentially all 
> email programs that use mbox as the file format), recoll (conversion should 
> not 
> be difficult), various editors (nedit, kate, for which I've written syntax 
> highlighters / folders for the current format), and all scintilla based 
> editors, for which I'm working on a highlighter / folder.
> 
> Let me know when you're almost finished, so I can make the conversion.

I have given you advice (for free), you are not taking it. Too bad for
you. Good day.

-- 
  Nicolas George


signature.asc
Description: Digital signature


Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 08:26:41 AM Greg Wooledge wrote:
> On Wed, Apr 04, 2018 at 01:23:25PM +0200, Nicolas George wrote:
> > rhkra...@gmail.com (2018-04-03):
> > >   and the data is stored in mbox formatted files.
> > 
> > DO NOT DO THAT.
> > 
> > This is the only good advice you can have for that project. Store your
> > data in a decent format.
> 
> Perhaps an sqlite database.  At least, that is my first thought.
> You might come up with a better solution depending on your specific
> needs.  That solution won't be "an mbox folder".

Past experience with "databases" (before I switched to Linux)--things like 
dBase (III, III+, IV), Microsoft Access, and others that I can't recall atm 
made the fixed (and even variable) length fields.

The key thing for me was making something that worked reasonably like askSam, 
which is / was free format, fully searchable, and other things I can't recall 
atm.

I don't know enough about sqlite to know what capabilities it has for variable 
length fields (if any--how many are possible per record, is the content fully 
searchable, etc.)

But, it really doesn't matter, I am not interested in changing the data 
format.

(Besides, with the current design, almost any email client is a client for my 
mashup.)



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread tomas
On Wed, Apr 04, 2018 at 08:18:23AM -0300, Henrique de Moraes Holschuh wrote:
> On Tue, 03 Apr 2018, Michael Lange wrote:
> > I believe (please anyone correct me if I am wrong) that "text" files
> > won't contain any null byte; many text editors even refuse to open such a
> 
> Depends on the encoding.  For ASCII, ISO-8859-* and UTF-8 (and any other
> modern encoding AFAIK, other than modified UTF-8), any zero bytes map
> one-to-one to the NUL character/code point.  I don't recall how it is on
> other common encodings of the 80's and 90's, though.

Try UTF-16, what Microsoft (and a couple of years ago Apple) love to
call "Unicode": in more "Western" contexts every second byte is NULL!
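
A quick way to see this in Python. The helper name is made up, and the
explicit "utf-16-le" codec is used so that no byte-order mark is added.

```python
def zero_byte_ratio(text: str, encoding: str) -> float:
    """Fraction of bytes equal to zero after encoding text."""
    data = text.encode(encoding)
    if not data:
        return 0.0
    return data.count(0) / len(data)
```

For plain ASCII text the ratio is 0.5 in UTF-16 and 0.0 in UTF-8.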

> Some even-more-modern encodings (modified UTF-8 :p) simply do NOT use
> bytes with the value of zero when encoding characters, so NUL is encoded
> by a different sequence, and you can safely use a byte with the value of
> zero for some out-of-band control [...]

Yes, the problem is that someone else before you could have been doing
exactly that.

I'd guard against that. It's not exactly difficult, the traditional
"escape" mechanism (aka character stuffing) does it pretty well...

Cheers
-- tomás



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Andre Majorel
On 2018-04-04 14:55 +0200, Nicolas George wrote:

> I have given you advice (for free), you are not taking it. Too bad for
> you. Good day.

Is advice that comes with condescension truly free ?

-- 
André Majorel 
I trust bugs.debian.org to not publish my email address for
spammers to harvest.



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Greg Wooledge
On Wed, Apr 04, 2018 at 04:15:48PM +0200, Andre Majorel wrote:
> On 2018-04-04 14:55 +0200, Nicolas George wrote:
> 
> > I have given you advice (for free), you are not taking it. Too bad for
> > you. Good day.
> 
> Is advice that comes with condescension truly free ?

Any advice that stops the OP from storing structured data in mbox format
flat files is a bargain at any price.

Sadly, it seems we've failed to find the magic argument to achieve
that result.  At some point, you just have to walk away.



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 10:15:48 AM Andre Majorel wrote:
> On 2018-04-04 14:55 +0200, Nicolas George wrote:
> > I have given you advice (for free), you are not taking it. Too bad for
> > you. Good day.
> 
> Is advice that comes with condescension truly free ?

Thank you!



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 10:24:06 AM Greg Wooledge wrote:
> On Wed, Apr 04, 2018 at 04:15:48PM +0200, Andre Majorel wrote:
> > On 2018-04-04 14:55 +0200, Nicolas George wrote:
> > > I have given you advice (for free), you are not taking it. Too bad for
> > > you. Good day.
> > 
> > Is advice that comes with condescension truly free ?
> 
> Any advice that stops the OP from storing structured data in mbox format
> flat files is a bargain at any price.

Why do you call it "structured data"--it is free format data, no structure, or 
at least no regular / consistent structure.  It is anything that I want to put 
into it.


> 
> Sadly, it seems we've failed to find the magic argument to achieve
> that result.  At some point, you just have to walk away.



Re: utf

2018-04-04 Thread deloptes
Nicolas George wrote:

> No, the length of the string is hardly relevant, and when it is it is
> not enough anyway.

@Nicolas, I think OP does not understand you - perhaps it is not worth the
effort. My impression is that you refer to a string (properly) as sequence
of bytes and other refer to it as number of chars, which is not consistent
with utf.

From my work with UTF, it is possible but not satisfying to guess encoding.
I wonder why no one suggested a kind of markup (xml) instead of byte
delimiter.

And regarding the mbox thing, well mbox was deprecated for many reasons. I
guess if it was that good it wouldn't be deprecated.

@OP, at some point of time everyone has to redesign and reimplement because
technology evolves and all the tools listed can be updated to the new
format.
Recent example of redesign I worked with is gnupg - huge changes from v1.x
to v2.1

regards



Re: utf

2018-04-04 Thread Nicolas George
deloptes (2018-04-04):
> @Nicolas, I think OP does not understand you - perhaps it is not worth the
> effort. My impression is that you refer to a string (properly) as sequence
> of bytes and other refer to it as number of chars, which is not consistent
> with utf.

Not at all, I am indeed speaking of text strings, made up of characters,
expressed as sequences of Unicode code points, and encoded as any
convenient data structure.

What I am trying to explain (not to the OP who is focussing on another
thing entirely and obviously beyond help anyway), is that if you are
thinking of strings in terms of "access the n-th char", "find the length
of the string in chars", etc., then you completely have missed the point
of them.

Find me a case where you need to access the n-th char of a string, with
n completely out of the blue, and I will explain how somebody botched
their design.

Regards,

-- 
  Nicolas George



Re: utf

2018-04-04 Thread Greg Wooledge
On Wed, Apr 04, 2018 at 07:07:01PM +0200, Nicolas George wrote:
> Find me a case where you need to access the n-th char of a string, with
> n completely out of the blue, and I will explain how somebody botched
> their design.

Does it count if we want the 1st char, then the 2nd char, then the 3rd
char, then the 4th char, and so on?  Or is that not blue enough?

How about the last char?  Or the last two chars?  Or all chars starting
just after the last slash or period?

How about performing a checksum like the
 on a user input string
which is supposed to be a 10-digit
 ?

Might it be useful to check the length of an input string before
bothering to decompose it into individual digits and perform the
arithmetic?  And here, by "length", I mean "number of characters".
You can see how that might be a handy thing, right?

Length as in "number of bytes required to store it" is also an
important value, of course.

Character length is also useful when displaying strings on
CHARACTER-ORIENTED OUTPUT MEDIA.  Like terminals.  You know, those
things that Unix-like systems use all the time?  How else are you
going to space-pad the fields so that the output columns line up,
if you don't know how many extra spaces you need, because you don't
know the length of the string?

(It's frankly disturbing to me that when I talked about length being
relevant when printing strings, you immediately jumped to "pixels" and
"fonts".  This tells me that you no longer accept the terminal as your
lord and savior.  If you ever did.)

All of these things matter, and are real, and don't necessarily indicate
"botched design".



Re: utf

2018-04-04 Thread Nicolas George
Greg Wooledge (2018-04-04):
> Does it count if we want the 1st char, then the 2nd char, then the 3rd
> char, then the 4th char, and so on?  Or is that not blue enough?

It is not out of the blue, it is in sequence.

> How about the last char?  Or the last two chars?

Ditto.

>   Or all chars starting
> just after the last slash or period?

There is no n here, you have to first look for the last slash.


> How about performing a checksum like the
>  on a user input string
> which is supposed to be a 10-digit
>  ?

Again, in sequence.

> Might it be useful to check the length of an input string before
> bothering to decompose it into individual digits and perform the
> arithmetic?

And to check it is made of all digits. Checking the length is only a
byproduct.

>  And here, by "length", I mean "number of characters".
> You can see how that might be a handy thing, right?

Still, no.

> Length as in "number of bytes required to store it" is also an
> important value, of course.

You have to compute it to store it, indeed.

> Character length is also useful when displaying strings on
> CHARACTER-ORIENTED OUTPUT MEDIA.  Like terminals.  You know, those
> things that Unix-like systems use all the time?  How else are you
> going to space-pad the fields so that the output columns line up,
> if you don't know how many extra spaces you need, because you don't
> know the length of the string?

I do not know. Please tell me, how do you handle control characters,
escape sequence, double-width characters, etc., without walking the
string in sequence?
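
To make the point concrete, here is a rough Python sketch of computing a
terminal column width by walking the string. It is only an approximation: it
assumes combining marks take no column and East Asian wide or fullwidth
characters take two, and it ignores control characters and escape sequences
entirely. The function names are illustrative.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate number of terminal columns used by text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks sit on the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1
    return width


def pad_to(text: str, columns: int) -> str:
    """Right-pad text with spaces up to the given column count."""
    return text + " " * max(0, columns - display_width(text))
```

Here the number of code points and the number of columns differ for both
"e" followed by a combining accent (two code points, one column) and CJK
text (one code point, two columns each).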

> (It's frankly disturbing to me that when I talked about length being
> relevant when printing strings, you immediately jumped to "pixels" and
> "fonts".  This tells me that you no longer accept the terminal as your
> lord and savior.  If you ever did.)

I do not have a "lord and savior". I use the terminal a lot, but I am
aware of the hidden complexities.

> All of these things matter, and are real, and don't necessarily indicate
> "botched design".

All these things matter, but they do not require random access in a
string by char number.

Regards,

-- 
  Nicolas George



Re: utf

2018-04-04 Thread deloptes
Nicolas George wrote:

> Find me a case where you need to access the n-th char of a string, with
> n completely out of the blue, and I will explain how somebody botched
> their design.

ok, thanks. I understood the part above, but not sure if I understand this
part. A standard text editing operation is find and replace, where you get
the start and end point in the string. Of course it is not "n completely
out of the blue".

regards



Re: utf

2018-04-04 Thread Nicolas George
deloptes (2018-04-04):
> ok, thanks. I understood the part above, but not sure if I understand this
> part. A standard text editing operation is find and replace, where you get
> the start and end point in the string. Of course it is not "n completely
> out of the blue".

I am not sure exactly what is your example, but you got its flaw right:
n is not out of the blue, it was obtained by previously walking the
string. And in that case, you have all freedom to express n as a more
convenient entity than an index expressed in terms of chars. A pointer
maybe, or a pair with both the char index and the octet offset.
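
One way to picture the "pair" idea in Python, assuming the text will be
stored as UTF-8. The function name is hypothetical; it returns both the
character index and the byte offset of a match so that later steps can use
whichever is convenient.

```python
def find_with_offsets(text: str, needle: str):
    """Return (char_index, utf8_byte_offset) of needle, or None."""
    char_index = text.find(needle)
    if char_index < 0:
        return None
    # The byte offset is the encoded length of everything before the match.
    byte_offset = len(text[:char_index].encode("utf-8"))
    return char_index, byte_offset
```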

Regards,

-- 
  Nicolas George



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Don Armstrong
On Tue, 03 Apr 2018, rhkra...@gmail.com wrote:
> I am building (have built several iterations) of a free format
> database to work something like askSam. It is a mashup of several
> applications, things like recol, kmail, nail, kate and the data is
> stored in mbox formatted files.
> 
> Each record is treated as an email.

You should consider looking at using Maildir with notmuch and using
things which integrate notmuch.[1]

> Most likely this would be only a temporary addition, and I would need
> to do things like make sure that one byte will be unique in the file.
> It sounds like there are at least a few candidates.

Maildir is the solution to this. While you *can* handle mbox and do all
the escape rules properly (From to >From and back), it's a pain. Let
your filesystem handle it for you.
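
For reference, a sketch of the escape rule in Python. It assumes the
reversible "mboxrd" convention, where any body line that starts with zero or
more ">" followed by "From " gets one extra ">"; other mbox variants differ,
and the function names are made up.

```python
import re

# Lines that look like "From ", ">From ", ">>From ", ... at line start.
_NEEDS_ESCAPE = re.compile(rb"^(>*From )", re.MULTILINE)
_IS_ESCAPED = re.compile(rb"^>(>*From )", re.MULTILINE)


def escape_body(body: bytes) -> bytes:
    """Add one '>' in front of every From-like body line."""
    return _NEEDS_ESCAPE.sub(rb">\1", body)


def unescape_body(body: bytes) -> bytes:
    """Remove one '>' from every escaped From-like body line."""
    return _IS_ESCAPED.sub(rb"\1", body)
```

Because already-quoted lines gain a ">" as well, unescaping restores the
original body exactly.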

[I'm speaking from experience; I currently maintain debbugs, which
basically stores everything in a custom format mbox. This inevitably
makes things slow, as you have to search through the mbox linearly to
find any message in the mbox unless you also write indexes for the
mailbox.]
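
A toy Python version of such an index, to show the idea only (this is not
how debbugs does it). It assumes a binary stream in which every line
starting with "From " begins a new message, which holds when body lines are
escaped as described above. The function names are illustrative.

```python
def index_mbox(stream):
    """Scan once and return the byte offset of each message start."""
    stream.seek(0)
    offsets = []
    position = 0
    for line in stream:
        if line.startswith(b"From "):
            offsets.append(position)
        position += len(line)
    return offsets


def read_message(stream, offsets, n):
    """Read message n directly, without scanning the earlier ones."""
    start = offsets[n]
    stream.seek(start)
    if n + 1 < len(offsets):
        return stream.read(offsets[n + 1] - start)
    return stream.read()
```

The scan is still linear, but it happens once; afterwards each lookup is a
single seek.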

1: Notmuch itself uses xapian to do the heavy lifting.
-- 
Don Armstrong  https://www.donarmstrong.com

Cheop's Law: Nothing ever gets built on schedule or within budget.
 -- Robert Heinlein _Time Enough For Love_ p242



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Nicolas George
Don Armstrong (2018-04-04):
> You should consider looking at using Maildir with notmuch and using
> things which integrate notmuch.[1]

Maildir is not that much better than mbox. Sure, it eliminates most of
its worse flaws, but it brings flaws of its own, like trashing the inode
and dentries caches, requiring extra disk reads and cache due to partial
file ends, or causing much more seeking (granted, this one is becoming
less of an issue with non-mechanical storage).

The filesystem is really the least common factor of database systems
(no, I did not mix LCM and GCD). There is a reason people designed more
advanced and optimized formats on top of it.

Regards,

-- 
  Nicolas George



Re: utf

2018-04-04 Thread Greg Wooledge
On Wed, Apr 04, 2018 at 07:35:37PM +0200, Nicolas George wrote:
> I am not sure exactly what is your example, but you got its flaw right:
> n is not out of the blue, it was obtained by previously walking the
> string. And in that case, you have all freedom to express n as a more
> convenient entity than an index expressed in terms of chars. A pointer
> maybe, or a pair with both the char index and the octet offset.

The problem is, you reject every single example that everyone gives
you.  I don't know what you expect from us.

You just seem to have Decided, for reasons known only to you, that
The Character Length Of A String Is Not Useful.  Despite literally
decades of programs that have used strlen() in various ways.

Have you never been given ANY kind of problem that involves analysis
of character strings?  Ever?  At all?

What if the question is "Find all the English words that have an E
in the 5th position and a U in the 7th"?

I mean, seriously, at some point you either have to accept that one
of our examples is good enough to justify the existence of strlen()
and character-based string indexing, or we just label you a loon and
ignore everything you say henceforth.

I think most of the REST of us can agree that the character length
and the byte length of a string are BOTH useful quantities, important
in countless ways to countless programs.  And that sometimes your
algorithm wants to treat a string as an indexed array of characters,
and retrieve the nth character.  And that we shouldn't have to dig
up an entire textbook worth of examples to explain this.
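
A small Python sketch of both quantities and of the word puzzle above. Note
the assumptions: Python's len() on a str counts code points (not bytes and
not terminal columns), positions are 1-based as in the question, and the
function names are made up for illustration.

```python
def char_and_byte_length(text: str):
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


def words_matching(words, constraints):
    """Keep words with the given letters at 1-based positions.

    constraints maps a position to a lowercase letter, e.g. {5: "e", 7: "u"}.
    """
    return [
        word
        for word in words
        if all(
            len(word) >= position and word[position - 1].lower() == letter
            for position, letter in constraints.items()
        )
    ]
```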



Re: Unknown Systemd version

2018-04-04 Thread Laurent Lyaudet
2018-04-03 22:14 GMT+02:00 Abdullah Ramazanoglu :
> On Tue, 3 Apr 2018 14:24:54 -0500 David Wright said:
>
>> On Tue 03 Apr 2018 at 19:58:23 (+0200), Laurent Lyaudet wrote:
>>
>>> I don't understand why I have apache2-bin installed but apache is
>>> not there???
>>
>> $ aptitude why apache2-bin
>
> Or, if you have not installed aptitude,
>
> $ apt-cache --installed rdepends apache2-bin
>
> Regards
> --
> Abdullah Ramazanoglu
>
>
Thanks David and Abdullah.
It's good to know that gnome-user-share requires apache2-bin.
I've seen that file sharing on the network is off by default anyway.

Best regards,
   Laurent Lyaudet



Re: utf

2018-04-04 Thread Nicolas George
Greg Wooledge (2018-04-04):
> The problem is, you reject every single example that everyone gives
> you.

I do not reject them, I refute them.

> I don't know what you expect from us.

Acknowledge that I am right once I have refuted all your examples and
you have eventually understood my point.

At this time, you have not yet understood.

> You just seem to have Decided, for reasons known only to you, that
> The Character Length Of A String Is Not Useful.  Despite literally
> decades of programs that have used strlen() in various ways.

Decades of programs that were variously limited or flawed. Most of them
working only with a subset of English and English-like languages.

> Have you never been given ANY kind of problem that involves analysis
> of character strings?  Ever?  At all?

Analysis? Yes, of course. Tons of them. They are all about SCANNING the
string, not jumping randomly in it.

> What if the question is "Find all the English words that have an E
> in the 5th position and a U in the 7th"?

Yes, what? Who would ever ask such a question? What is the point of such
a question?

The point of such a question is only to try and disprove my point, but
my point is about useful operations, and therefore artificial questions
like that will not dent it.

> I mean, seriously, at some point you either have to accept that one
> of our examples is good enough to justify the existence of strlen()
> and character-based string indexing, or we just label you a loon and
> ignore everything you say henceforth.

To be honest, I do not care much what "you" think about me.

Regards,

-- 
  Nicolas George



mbox vs maildir vs better formats [Re: Invalid UTF-8 byte? (was: Re: utf)]

2018-04-04 Thread Don Armstrong
On Wed, 04 Apr 2018, Nicolas George wrote:
> Don Armstrong (2018-04-04):
> > You should consider looking at using Maildir with notmuch and using
> > things which integrate notmuch.[1]
> 
> Maildir is not that much better than mbox. Sure, it eliminates most of
> its worse flaws, but it brings flaws of its own, like trashing the
> inode and dentries caches, requiring extra disk reads and cache due to
> partial file ends, or causing much more seeking (granted, this one is
> becoming less of an issue with non-mechanical storage).

There are definitely better formats than Maildir, like Dovecot's
multi-dbox.[1]

These issues are why almost everyone who uses Maildir just uses it as
the backing message store and uses the index on top to avoid ever
reading all of the messages in the Maildir.

1: https://wiki2.dovecot.org/MailboxFormat/dbox
-- 
Don Armstrong  https://www.donarmstrong.com

A Bill of Rights that means what the majority wants it to mean is worthless. 
 -- U.S. Supreme Court Justice Antonin Scalia



Re: mbox vs maildir vs better formats [Re: Invalid UTF-8 byte? (was: Re: utf)]

2018-04-04 Thread Nicolas George
Don Armstrong (2018-04-04):
> There are definitely better formats than Maildir, like Dovecot's
> multi-dbox.[1]
> 
> These issues are why almost everyone who uses Maildir just uses it as
> the backing message store and uses the index on top to avoid ever
> reading all of the messages in the Maildir.

Glad to read this. There are too many maildir zealots out there.

Regards,

-- 
  Nicolas George



Re: utf

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 12:58:57 PM deloptes wrote:
> And regarding the mbox thing, well mbox was deprecated for many reasons. I
> guess if it was that good it wouldn't be deprecated.

Oh, I wasn't aware that mbox was deprecated--can you shed more light on that.  
AFAIK, it is not defined in an RFC and is used by quite a few email programs.



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Jonathan de Boyne Pollard

rhkramer:

The reason I wanted such a byte was to use it as a record separator in 
a set of text files (that I use as an askSam "workalike" (or 
"worksimilar") so that I could use msort (which depends on a 1 byte 
record separator to --separate the records ;-) while sorting. Some of 
the files already include UTF-8, and, in the future, I anticipate all 
will be in UTF-8.


Note that ISO 646, hence ISO 8859, hence ISO 10646, has had a 
single-byte Record Separator character since the 1960s.  (-:
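
A sketch in Python of using that character (U+001E, a single byte 0x1E in
ASCII and UTF-8) as the record separator for sorting. This only illustrates
the idea and is not a stand-in for msort; the function names are made up,
and the check in join_records() reflects the caveat that the data itself
might already contain the separator.

```python
RS = "\x1e"  # ASCII / Unicode Record Separator


def join_records(records):
    """Join records with RS, refusing records that contain RS."""
    for record in records:
        if RS in record:
            raise ValueError("record already contains the separator")
    return RS.join(records)


def sort_records(blob: str) -> str:
    """Split on RS, sort the non-empty records, and join them again."""
    records = [r for r in blob.split(RS) if r]
    return join_records(sorted(records))
```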




Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 01:36:15 PM Don Armstrong wrote:
> On Tue, 03 Apr 2018, rhkra...@gmail.com wrote:
> > I am building (have built several iterations) of a free format
> > database to work something like askSam. It is a mashup of several
> > applications, things like recol, kmail, nail, kate and the data is
> > stored in mbox formatted files.
> > 
> > Each record is treated as an email.
> 
> You should consider looking at using Maildir with notmuch and using
> things which integrate notmuch.[1]
> 
> > Most likely this would be only a temporary addition, and I would need
> > to do things like make sure that one byte will be unique in the file.
> > It sounds like there are at least a few candidates.
> 
> Maildir is the solution to this. While you *can* handle mbox and do all
> the escape rules properly (From to >From and back), it's a pain. Let
> your filesystem handle it for you.
> 
> [I'm speaking from experience; I currently maintain debbugs, which
> basically stores everything in a custom format mbox. This inevitably
> makes things slow, as you have to search through the mbox linearly to
> find any message in the mbox unless you also write indexes for the
> mailbox.]
> 
> 1: Notmuch itself uses xapian to do the heavy lifting.

I'll probably look into notmuch, just for kicks.

I've considered maildir--it meets some of my requirements (that is, to make 
something close to an askSam workalike), but one drawback is that it is 
essentially one email (i.e., my "record").  One of the desirable features of 
askSam is that you did not have to create a new file to add a new note / 
record, you just start typing in an existing open record and then, as time or 
other constraints allow, you can add more "tags" or a record separator.  (It's 
been so long since I've used askSam I actually forget what had to be done (if 
anything) to separate a new record from the previous record).

askSam basically stores all its records in one file, although it is (of 
course) possible to separate them.



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Henrique de Moraes Holschuh
On Wed, 04 Apr 2018, to...@tuxteam.de wrote:
> On Wed, Apr 04, 2018 at 08:18:23AM -0300, Henrique de Moraes Holschuh wrote:
> > On Tue, 03 Apr 2018, Michael Lange wrote:
> > > I believe (please anyone correct me if I am wrong) that "text" files
> > > won't contain any null byte; many text editors even refuse to open such a
> > 
> > Depends on the encoding.  For ASCII, ISO-8859-* and UTF-8 (and any other
> > modern encoding AFAIK, other than modified UTF-8), any zero bytes map
> > one-to-one to the NUL character/code point.  I don't recall how it is on
> > other common encodings of the 80's and 90's, though.
> 
> Try UTF-16, what Microsoft (and a couple of years ago Apple) love to
> call "Unicode": in more "Western" contexts every second byte is NULL!

Ah, yes.  I forgot about them, indeed.  UTF-16BE and UTF-16LE will have
zero bytes in the resulting byte stream.  And I suppose one could call
them "modern encodings", even if they are horrifying to use when
compared to UTF-8 (UTF-16 has byte-order issues) or UTF-32 (UTF-16 has
surrogate pairs).
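
To make the zero-byte and surrogate-pair points concrete, here is a
minimal Python sketch.  The sample strings are my own choice, nothing
in the thread depends on them:

```python
# ASCII-range text: UTF-8 produces no zero bytes, while both UTF-16
# byte orders pad every 16-bit code unit with one.
text = "Debian"
utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")
utf16be = text.encode("utf-16-be")
print(utf8.count(0), utf16le.count(0), utf16be.count(0))  # 0 6 6

# Same text, different byte order: the zero byte moves.
print(utf16le[:2], utf16be[:2])

# A code point outside the BMP needs a surrogate pair in UTF-16
# (two 16-bit units) but a single unit in UTF-32.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2
utf32_units = len(clef.encode("utf-32-le")) // 4
print(utf16_units, utf32_units)  # 2 1
```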

> > Some even-more-modern encodings (modified UTF-8 :p) simply do NOT use
> > bytes with the value of zero when encoding characters, so NUL is encoded
> > by a different sequence, and you can safely use a byte with the value of
> > zero for some out-of-band control [...]
> 
> Yes, the problem is that someone else before you could have been doing
> exactly that.

You can use modified-UTF-8 bit-packing to encode anything, and the result
will be zero-free (and it will restore the zeroes when decoded).  The
price is a size increase (it is a variant of UTF-8 that uses two bytes
to encode NUL, which would take just one byte in normal UTF-8).  There
are much better bit packing schemes if you just need to escape zeroes
;-)

That said, it is always safe to break valid "modified UTF-8" into
records using zeroes, as long as you don't expect the result to be valid
UTF-8 (it isn't valid UTF-8 because NULs will be encoded using a
non-minimal byte sequence that *will* decode to a zero even if it is
invalid) or valid modified UTF-8 (it isn't valid modified UTF-8 because
0 is not valid as an encoding for NUL in modified UTF-8).  But a lax
UTF-8 or modified UTF-8 decoder *would* parse "modified UTF-8 with zero as
record separators" and reconstruct the unicode text properly (but it
would read the record separators as NULs, so you'd get extra NULs in the
resulting text).

That, of course, assumes you have unicode text as the input (encoding
doesn't matter, as long as you know it), and recode it to modified UTF-8
before you add the zeroes as end-of-record marks.  This is not about
bit-packing generic binary data.
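
A rough Python sketch of the scheme, under stated assumptions: the
helper names (encode_record, decode_record, pack, unpack) are made up
for illustration, and only the NUL handling of modified UTF-8 is
implemented.  Real modified UTF-8 also encodes supplementary
characters as surrogate pairs, which this sketch ignores.

```python
OVERLONG_NUL = b"\xc0\x80"  # two-byte form of U+0000 used by modified UTF-8


def encode_record(text: str) -> bytes:
    """Encode text so that the result contains no zero byte."""
    # In UTF-8 a zero byte can only be U+0000, so a byte-level replace is safe.
    return text.encode("utf-8").replace(b"\x00", OVERLONG_NUL)


def decode_record(data: bytes) -> str:
    """Undo encode_record: restore the NULs, then decode as strict UTF-8."""
    # 0xC0 never occurs in valid UTF-8, so C0 80 can only be an escaped NUL.
    return data.replace(OVERLONG_NUL, b"\x00").decode("utf-8")


def pack(records):
    """Join records, using a zero byte as the end-of-record mark."""
    return b"".join(encode_record(r) + b"\x00" for r in records)


def unpack(blob: bytes):
    """Split on zero bytes; the chunk after the last mark is empty."""
    return [decode_record(chunk) for chunk in blob.split(b"\x00")[:-1]]


records = ["plain text", "with\x00embedded NUL", "na\u00efve \u2603"]
blob = pack(records)
print(blob.count(0) == len(records))  # only the record marks are zero bytes
print(unpack(blob) == records)        # the NULs inside records survive
```

A strict UTF-8 decoder rejects the C0 80 sequence, which is why the
intermediate stream is not valid UTF-8, as described above.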

> I'd guard against that. It's not exactly difficult, the traditional
> "escape" mechanism (aka character stuffing) does it pretty well...

Yes, any bitstuffing/escape-based wrapping would do.

-- 
  Henrique Holschuh



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Jonathan de Boyne Pollard

rhkramer:
> Where were you in 2000 when I started the project?

I cannot speak for anyone else, but I was probably once again giving a 
frequently given answer that I eventually put up on a WWW page.


http://jdebp.eu./FGA/mail-mbox-formats.html



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Don Armstrong
On Wed, 04 Apr 2018, rhkra...@gmail.com wrote:
> I've considered maildir--it meets some of my requirements (that is, to
> make something close to an askSam workalike), but one drawback is that
> it is essentially one email (i.e., my "record"). One of the desirable
> features of askSam is that you did not have to create a new file to
> add a new note / record, you just start typing in an existing open
> record and then, as time or other constraints allow, you can add more
> "tags" or a record separator. (It's been so long since I've used
> askSam I actually forget what had to be done (if anything) to separate
> a new record from the previous record).

You might want to consider looking at org-mode too.[1] There are
even integrations for notmuch+mutt+Maildir there.

1: https://www.youtube.com/watch?v=oJTwQvgfgMM

-- 
Don Armstrong  https://www.donarmstrong.com

I really wanted to talk to her.
I just couldn't find an algorithm that fit.
 -- Peter Watts _Blindsight_ p294



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Jonathan de Boyne Pollard

Henrique de Moraes Holschuh:
> Also, a text file MAY contain NULs (the character), it is just
> considered bad practice (nowadays?). Don't assume you won't see any.
> For example, received e-mail is *more* likely to have NULs in it than
> normal text due to the quality of some mail agents out there.


I suspect not as likely as anything that was in the process of being 
appended to on a not-fully-journalling filesystem when a dirty shutdown 
happens.  (-:


* https://askubuntu.com/questions/356981/

Or anything that "rotates" output files by truncating them and pulls the 
rug out from underneath an old-style simplistic indefinitely-running 
text output writer.


* http://jdebp.eu./FGA/do-not-use-logrotate.html#Background



Re: utf

2018-04-04 Thread Joel Roth
On Wed, Apr 04, 2018 at 02:20:17PM -0400, rhkra...@gmail.com wrote:
> On Wednesday, April 04, 2018 12:58:57 PM deloptes wrote:
> > And regarding the mbox thing, well mbox was deprecated for many reasons. I
> > guess if it was that good it wouldn't be deprecated.

> Oh, I wasn't aware that mbox was deprecated--can you shed more light on that?
>  
> AFAIK, it is not defined in an RFC and is used by quite a few email programs.
 
Not exactly deprecated, but it's considered a less reliable storage format, 
because of potential problems. 

https://en.wikipedia.org/wiki/Mbox

I converted to Maildir for better compatibility with the mu
indexing programs (package maildir-utils). 

cheers,
 

-- 
Joel Roth
  



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread tomas
On Wed, Apr 04, 2018 at 03:44:23PM -0300, Henrique de Moraes Holschuh wrote:

[...]

> That said, it is always safe to break valid "modified UTF-8" into
> records using zeroes, as long as you don't expect the result to be valid
> UTF-8 (it isn't valid UTF-8 because NULs will be encoded using a
> non-minimal byte sequence that *will* decode to a zero even if it is
> invalid) or valid modified UTF-8 (it isn't valid modified UTF-8 because
> 0 is not valid as an encoding for NUL in modified UTF-8).  But a lax
> UTF-8 or modified UTF-8 *would* parse "modified UTF-8 with zero as
> record separators" and reconstruct the unicode text properly (but it
> would read the record separators as NULs, so you'd get extra NULs in the
> resulting text).

You are a nasty guy, aren't you ;-)

Pretty cunning...

Cheers
-- t



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 02:10:16 PM Jonathan de Boyne Pollard wrote:
> rhkramer:
> > The reason I wanted such a byte was to use it as a record separator in
> > a set of text files (that I use as an askSam "workalike" (or
> > "worksimilar") so that I could use msort (which depends on a 1 byte
> > record separator to --separate the records ;-) while sorting. Some of
> > the files already include UTF-8, and, in the future, I anticipate all
> > will be in UTF-8.
> 
> Note that ISO 646, hence ISO 8859, hence ISO 10646, has had a
> single-byte Record Separator character since the 1960s.  (-:

Ok, thanks, I see that is decimal 30, hex 1E.

A quick look at the UTF-8 table in the Wikipedia article on UTF-8 seems to 
indicate that byte is a valid UTF-8 byte, which makes it unsuitable for my use.
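
Whether that rules it out depends on whether the records themselves
can contain the Record Separator character.  In UTF-8 the byte 0x1E
only ever encodes U+001E itself, because every byte of a multi-byte
sequence is 0x80 or higher.  A small Python check, with sample strings
made up for the purpose:

```python
RS = "\x1e"  # ISO 646 / ASCII Record Separator, U+001E


def multibyte_bytes_are_high(ch: str) -> bool:
    """True if ch is one byte in UTF-8, or all of its bytes are >= 0x80."""
    encoded = ch.encode("utf-8")
    return len(encoded) == 1 or all(b >= 0x80 for b in encoded)


# Two-, three- and four-byte characters: none of them can contain 0x1E.
samples = ["\u00e9", "\u00df", "\u20ac", "\u2603", "\U0001F600"]
all_high = all(multibyte_bytes_are_high(c) for c in samples)
print(all_high)

# So splitting the raw bytes on 0x1E recovers the records, provided no
# record contains U+001E as a character of its own.
records = ["first note, na\u00efve", "second note: 10 \u20ac", "third \u2603"]
blob = RS.join(records).encode("utf-8")
chunks = blob.split(b"\x1e")
print(chunks == [r.encode("utf-8") for r in records])
```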



Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 01:36:15 PM Don Armstrong wrote:
> On Tue, 03 Apr 2018, rhkra...@gmail.com wrote:
> > I am building (have built several iterations) of a free format
> > database to work something like askSam. It is a mashup of several
> > applications, things like recoll, kmail, nail, kate and the data is
> > stored in mbox formatted files.
> > 
> > Each record is treated as an email.
> 
> You should consider looking at using Maildir with notmuch and using
> things which integrate notmuch.[1]

Ahh, OK, notmuch looks like it could be an alternative to recoll as part of my 
mashup.  Thanks, I'll probably dig deeper as time goes on.



tcp_probe module missing

2018-04-04 Thread Ireneusz Szcześniak

Hi,

I'm running an up-to-date Debian Stretch on an AMD64 computer.  I 
would like to use the tcpprobe module, and so I'm trying to do:


sudo modprobe tcp_probe

But I get:

modprobe: FATAL: Module tcp_probe not found in directory ...

Why is this module missing?

Is there a quick way of getting it?


Thanks & best,
Irek



Re: utf

2018-04-04 Thread deloptes
Nicolas George wrote:

>> What if the question is "Find all the English words that have an E
>> in the 5th position and a U in the 7th"?
> 
> Yes, what? Who would ever ask such a question? What is the point of such
> a question?
> 
> The point of such a question is only to try and disprove my point, but
> my point is about useful operations, and therefore artificial questions
> like that will not dent it.

I agree with you, I get the point and it is correct.
In the above example it is not clear, first of all, where we would look for
those English words. Assuming you have them in a string, you again need the
offset (the start of each word) and then have to check the fifth and seventh
positions, so more iteration.
I never bothered to look at how this is implemented in stdc++ or libc, for
example the C++ string at() operation. Can someone enlighten us, please?



Re: utf

2018-04-04 Thread deloptes
rhkra...@gmail.com wrote:

> Oh, I wasn't aware that mbox was deprecated--can you shed more light on
> that. AFAIK, it is not defined in an RFC and is used by quite a few email
> programs.

Yes, but the Maildir format was introduced for a couple of reasons (as were
other formats). I wouldn't store my mail in mbox anyway. For local
system/user mails as a simple default storage perhaps yes - it might be OK,
but for public mail, where you have 1000+ mails and perhaps multiple
interfaces ... no chance.

https://wiki2.dovecot.org/MailboxFormat

https://wiki2.dovecot.org/MailboxFormat/mbox
https://wiki2.dovecot.org/MailboxFormat/Maildir

I have worked on a cloud mail solution using dovecot with a mysql backend for
18 million customers. Another company was using dbmail with mysql with very
good results.
But this is drifting off topic with regard to the original UTF question.

The only advantage I see with mbox is that it is really simple.

regards




Re: Invalid UTF-8 byte?

2018-04-04 Thread Ben Caradoc-Davies

On 05/04/18 02:09, to...@tuxteam.de wrote:

> Try UTF-16, what Microsoft (and a couple of years ago Apple) love to
> call "Unicode": in more "Western" contexts every second byte is NULL!


The Java platform uses UTF-16 internally:

"The char data type (and therefore the value that a Character object 
encapsulates) are based on the original Unicode specification, which 
defined characters as fixed-width 16-bit entities."

https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html

Kind regards,

--
Ben Caradoc-Davies 
Director
Transient Software Limited 
New Zealand



Re: utf

2018-04-04 Thread Stefan Monnier
> You just seem to have Decided, for reasons known only to you, that
> The Character Length Of A String Is Not Useful.  Despite literally
> decades of programs that have used strlen() in various ways.

strlen was mostly used in a context where char-length = byte-length =
display-width.  Most of those calls to strlen have nothing to do with
char-length but are more interested in display-width or byte-length.

In the context of Unicode, using utf-8 doesn't make byte-length any
harder than with ASCII.  And in the context of Unicode, display-width
is a lot more complex than strlen regardless of which encoding you use
because any given Unicode char can have a display-width of 0, 1, or
2 (even if you disregard proportional fonts and other fancy rendering
tricks).  So utf-8 doesn't make the computation of display-width any
more complex than utf-32.
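
To illustrate the three different "lengths", here is a rough Python
sketch.  The display_width helper is a simplification written for this
example: it treats combining marks as width 0 and East Asian wide or
fullwidth characters as width 2, and ignores control characters,
joiners and everything else a real wcwidth() has to deal with.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate number of terminal cells needed to show text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no cell of their own
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        width += 2 if wide else 1
    return width


# Columns: string, byte length in UTF-8, code points, approximate width.
for s in ["abc", "e\u0301", "\u65e5\u672c"]:
    print(repr(s), len(s.encode("utf-8")), len(s), display_width(s))
```

The byte length comes straight from the encoded buffer, and the width
calculation walks the characters one by one whichever encoding they
are stored in.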

> What if the question is "Find all the English words that have an E
> in the 5th position and a U in the 7th"?

That can be answered just as easily and efficiently from a utf-8
representation of the string as from a utf-32 representation.
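
A sketch of how that could look in Python, working directly on the
UTF-8 bytes.  The helper names and the word list are made up for the
example; the only UTF-8 property used is that continuation bytes have
the form 10xxxxxx.

```python
def char_at(data: bytes, n: int):
    """Return the n-th character (1-based) of UTF-8 data as bytes, or None."""
    count = 0
    start = None
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:  # not a continuation byte: a character starts here
            if start is not None:
                return data[start:i]  # the wanted character ended just before i
            count += 1
            if count == n:
                start = i
    return data[start:] if start is not None else None


def matches(word: bytes) -> bool:
    """True if the word has E in 5th and U in 7th character position."""
    return char_at(word, 5) == b"E" and char_at(word, 7) == b"U"


words = ["CONSEQUENCE", "LIQUEFY", "PLATEAU", "BRUSQUE", "CH\u00c2TEAU"]
encoded = [w.encode("utf-8") for w in words]
hits = [w.decode("utf-8") for w in encoded if matches(w)]
print(hits)
```

Each word is scanned once, front to back.  A UTF-32 buffer would allow
direct indexing inside a word, but the search still has to visit every
word in the list either way.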


Stefan



Re: tcp_probe module missing

2018-04-04 Thread deloptes
Ireneusz Szcześniak wrote:

> Hi,
> 
> I'm running an up-to-date Debian Stretch on an AMD64 computer.  I
> would like to use the tcpprobe module, and so I'm trying to do:
> 
> sudo modprobe tcp_probe
> 
> But I get:
> 
> modprobe: FATAL: Module tcp_probe not found in directory ...
> 
> Why is this module missing?
> 
> Is there a quick way of getting it?
> 
> 
> Thanks & best,
> Irek

Because you normally don't need it (99.99% of the time). It is part of the
network testing utilities and flagged with "use with caution".

If you are asking here, I doubt that you are qualified to use it. You can,
however, rebuild the kernel after enabling this module under "Network
testing".


 Symbol: NET_TCPPROBE [=n]
 Type  : tristate
 Prompt: TCP connection probing
   Location:
     -> Networking support (NET [=y])
       -> Networking options
 (1)     -> Network testing
   Defined at net/Kconfig:339
   Depends on: NET [=y] && INET [=y] && PROC_FS [=y] && KPROBES [=n]

 CONFIG_NET_PKTGEN:

 This module will inject preconfigured packets, at a configurable
 rate, out of a given interface.  It is used for network interface
 stress testing and performance analysis.  If you don't understand
 what was just said, you don't need it: say N.

 Documentation on how to use the packet generator can be found
 at .

 To compile this code as a module, choose M here: the
 module will be called pktgen.

 Symbol: NET_PKTGEN [=n]
 Type  : tristate
 Prompt: Packet Generator (USE WITH CAUTION)
   Location:
     -> Networking support (NET [=y])
       -> Networking options
         -> Network testing
   Def

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread deloptes
rhkra...@gmail.com wrote:

> I'll probably look into notmuch, just for kicks.
> 
> I've considered maildir--it meets some of my requirements (that is, to
> make something close to an askSam workalike), but one drawback is that it
> is essentially one email (i.e., my "record").  One of the desirable
> features of askSam is that you did not have to create a new file to add a
> new note / record, you just start typing in an existing open record and
> then, as time or other constraints allow, you can add more "tags" or a
> record separator.  (It's been so long since I've used askSam I actually
> forget what had to be done (if anything) to separate a new record from the
> previous record).
> 
> askSam basically stores all its records in one file, although it is (of
> course) possible to separate them.

I still don't understand why you don't use XML, as long as the file doesn't
get too big (which is also a problem with mbox). You may need to adapt your
applications, but it won't be more effort than making shit out of shit -
sorry for my language.

regards



Re: Invalid UTF-8 byte?

2018-04-04 Thread Michael Stone

On Thu, Apr 05, 2018 at 09:42:19AM +1200, Ben Caradoc-Davies wrote:

> On 05/04/18 02:09, to...@tuxteam.de wrote:
> > Try UTF-16, what Microsoft (and a couple of years ago Apple) love to
> > call "Unicode": in more "Western" contexts every second byte is NULL!
>
> The Java platform uses UTF-16 internally:


Yes, many people thought UCS-2 was the answer back when 16 bits was 
enough for anybody.


Mike Stone



Re: utf

2018-04-04 Thread Richard Hector
On 05/04/18 05:53, Nicolas George wrote:
>> What if the question is "Find all the English words that have an E
>> in the 5th position and a U in the 7th"?
>
> Yes, what? Who would ever ask such a question? What is the point of such
> a question?

Solving a crossword puzzle?

Richard





Re: utf

2018-04-04 Thread tomas
On Wed, Apr 04, 2018 at 11:33:13PM +0200, deloptes wrote:

[...]
> other formats). I wouldn't store my mail in mbox anyway. For local
> system/user mails as a simple default storage perhaps yes - it might be OK,
> but for public mail, where you have 1000+ mails and perhaps multiple
> interfaces ... no chance.

Increase that by 2-3 orders of magnitude and I'd agree (FWIW, I'm working
off an mbox with ~15K mails. Works fine).

Cheers
-- t