From glen at delfi.ee Tue Mar 6 10:35:36 2007
From: glen at delfi.ee (Elan =?utf-8?q?Ruusam=C3=A4e?=)
Date: Tue Mar 6 10:36:10 2007
Subject: [cvsspam-devel] diffs not character safe
Message-ID: <200703061235.36645.glen@delfi.ee>
appears that when passed --charset utf-8 to collect_diffs the diffs are not
characterwise but bytewise
and as cvsspamm appears to make diffs on same line coloured darker, it breaks
multibytes
so if the diff would be:
- 'map_tab_label' => 'карта',
+ 'map_tab_label' => 'Карта',
cvsspam hilights after first byte of letter 'k' because it's unicode first
part is the same byte.
i've attached the mail fragment as it can't be displayed properly in this
utf8-encoded email.
--
glen
-------------- next part --------------
- 'map_tab_label' => 'карта',
+ 'map_tab_label' => 'Карта',
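A minimal sketch of the problem reported above, assuming Ruby 1.9 or later (where strings carry an encoding) and UTF-8 source text. The helper name `common_prefix_len` is hypothetical and is not taken from cvsspam.rb. It measures the common prefix of the two lines first in bytes, then in characters:

```ruby
# encoding: utf-8

# Hypothetical helper (not part of CVSspam): length of the common prefix
# of two arrays, e.g. arrays of bytes or arrays of characters.
def common_prefix_len(a, b)
  n = 0
  n += 1 while n < a.length && n < b.length && a[n] == b[n]
  n
end

old_line = "'map_tab_label' => 'карта',"
new_line = "'map_tab_label' => 'Карта',"

# Byte-wise: 'к' is D0 BA and 'К' is D0 9A in UTF-8, so the lead byte D0
# counts as part of the common prefix.
puts common_prefix_len(old_line.bytes.to_a, new_line.bytes.to_a)  # 21

# Character-wise: the prefix stops before the changed letter.
puts common_prefix_len(old_line.chars.to_a, new_line.chars.to_a)  # 20
```

The one-byte difference between the two results is the shared lead byte; a highlight boundary placed at the byte-wise offset falls in the middle of the changed character.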
From dave at badgers-in-foil.co.uk Wed Mar 7 16:05:54 2007
From: dave at badgers-in-foil.co.uk (David Holroyd)
Date: Wed Mar 7 16:06:39 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <200703061235.36645.glen@delfi.ee>
References: <200703061235.36645.glen@delfi.ee>
Message-ID: <20070307160554.GA25917@badgers-in-foil.co.uk>
On Tue, Mar 06, 2007 at 12:35:36PM +0200, Elan Ruusamäe wrote:
> appears that when passed --charset utf-8 to collect_diffs the diffs are not
> characterwise but bytewise
You are correct. The --charset option only sets up the email headers
with the given value; it's not used during processing at all.
> and as cvsspamm appears to make diffs on same line coloured darker, it breaks
> multibytes
>
> so if the diff would be:
> - 'map_tab_label' => 'карта',
> + 'map_tab_label' => 'Карта',
>
> cvsspam hilights after first byte of letter 'k' because it's unicode first
> part is the same byte.
I hadn't considered that possibility. Maybe the within-a-line colouring
should be disabled when a multibyte encoding is detected?
I don't know a huge amount about handling multibyte encodings in Ruby,
but have the impression that it's a bit of a black art (until Ruby 2
comes out). Fixing this might require a rewrite of the highlighting
code, and that code is a horrible mess. I am scared of it :(
--
http://david.holroyd.me.uk/
From glen at delfi.ee Wed Mar 7 17:20:19 2007
From: glen at delfi.ee (Elan =?iso-8859-1?q?Ruusam=E4e?=)
Date: Wed Mar 7 17:20:32 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <20070307160554.GA25917@badgers-in-foil.co.uk>
References: <200703061235.36645.glen@delfi.ee>
<20070307160554.GA25917@badgers-in-foil.co.uk>
Message-ID: <200703071920.19896.glen@delfi.ee>
On Wednesday 07 March 2007 18:05:54 David Holroyd wrote:
> > cvsspam hilights after first byte of letter 'k' because it's unicode
> > first part is the same byte.
>
> I hadn't considered that possibility. Maybe the within-a-line colouring
> should be disabled when a multibyte encoding is detected?
as quick fix, would be nice. but how you detect the charset is multibyte? just
match /utf-?.+/i ?
--
glen
From dave at badgers-in-foil.co.uk Wed Mar 7 17:41:45 2007
From: dave at badgers-in-foil.co.uk (David Holroyd)
Date: Wed Mar 7 17:41:47 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <200703071920.19896.glen@delfi.ee>
References: <200703061235.36645.glen@delfi.ee>
<20070307160554.GA25917@badgers-in-foil.co.uk>
<200703071920.19896.glen@delfi.ee>
Message-ID: <20070307174145.GA29725@badgers-in-foil.co.uk>
On Wed, Mar 07, 2007 at 07:20:19PM +0200, Elan Ruusamäe wrote:
> On Wednesday 07 March 2007 18:05:54 David Holroyd wrote:
> > > cvsspam hilights after first byte of letter 'k' because it's unicode
> > > first part is the same byte.
> >
> > I hadn't considered that possibility. Maybe the within-a-line
> > colouring should be disabled when a multibyte encoding is detected?
>
> as quick fix, would be nice. but how you detect the charset is
> multibyte? just match /utf-?.+/i ?
My use of 'detect' was incorrect :)
Yeah, a regexp or just a simple list of encodings was about what I had
in mind.
--
http://david.holroyd.me.uk/
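The two options mentioned above (a regexp, or a simple list of encodings) could look roughly like this. This is only a sketch: the constant and method names are hypothetical, the pattern is a tightened variant of the /utf-?.+/i suggestion, and the list entries are illustrative examples of multibyte charsets rather than anything CVSspam defines.

```ruby
# Hypothetical names throughout; none of these exist in CVSspam.

# Option 1: a regexp, anchored at the start so that only names beginning
# with "utf" match.
MULTIBYTE_PATTERN = /\Autf-?(8|16|32)/i

def multibyte_by_pattern?(charset)
  !charset.nil? && !!(charset =~ MULTIBYTE_PATTERN)
end

# Option 2: an explicit list of charset names (illustrative, not exhaustive).
MULTIBYTE_NAMES = %w[utf-8 utf-16 shift_jis euc-jp gb2312 big5].freeze

def multibyte_by_list?(charset)
  !charset.nil? && MULTIBYTE_NAMES.include?(charset.downcase)
end

puts multibyte_by_pattern?("UTF8")       # true: the hyphen is optional
puts multibyte_by_list?("UTF8")          # false: this spelling is not listed
puts multibyte_by_list?("Shift_JIS")     # true
puts multibyte_by_pattern?("iso-8859-1") # false
```

The trade-off is the usual one: the pattern tolerates spelling variants but only covers the UTF family, while the list can name any encoding but needs every accepted spelling spelled out.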
From glen at delfi.ee Wed Mar 7 19:06:28 2007
From: glen at delfi.ee (Elan =?iso-8859-1?q?Ruusam=E4e?=)
Date: Wed Mar 7 19:06:38 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <20070307174145.GA29725@badgers-in-foil.co.uk>
References: <200703061235.36645.glen@delfi.ee>
<200703071920.19896.glen@delfi.ee>
<20070307174145.GA29725@badgers-in-foil.co.uk>
Message-ID: <200703072106.28077.glen@delfi.ee>
On Wednesday 07 March 2007 19:41:45 David Holroyd wrote:
> On Wed, Mar 07, 2007 at 07:20:19PM +0200, Elan Ruusamäe wrote:
> > On Wednesday 07 March 2007 18:05:54 David Holroyd wrote:
> > > > cvsspam hilights after first byte of letter 'k' because it's unicode
> > > > first part is the same byte.
> > >
> > > I hadn't considered that possibility. Maybe the within-a-line
> > > colouring should be disabled when a multibyte encoding is detected?
> >
> > as quick fix, would be nice. but how you detect the charset is
> > multibyte? just match /utf-?.+/i ?
>
> My use of 'detect' was incorrect :)
>
> Yeah, a regexp or just a simple list of encodings was about what I had
> in mind.
ok. waiting for patch :)
--
glen
From dave at badgers-in-foil.co.uk Wed Mar 7 23:59:59 2007
From: dave at badgers-in-foil.co.uk (David Holroyd)
Date: Thu Mar 8 00:00:12 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <200703072106.28077.glen@delfi.ee>
References: <200703061235.36645.glen@delfi.ee>
<200703071920.19896.glen@delfi.ee>
<20070307174145.GA29725@badgers-in-foil.co.uk>
<200703072106.28077.glen@delfi.ee>
Message-ID: <20070307235959.GA3600@badgers-in-foil.co.uk>
On Wed, Mar 07, 2007 at 09:06:28PM +0200, Elan Ruusamäe wrote:
> On Wednesday 07 March 2007 19:41:45 David Holroyd wrote:
> > On Wed, Mar 07, 2007 at 07:20:19PM +0200, Elan Ruusamäe wrote:
> > > On Wednesday 07 March 2007 18:05:54 David Holroyd wrote:
> > > > > cvsspam hilights after first byte of letter 'k' because it's unicode
> > > > > first part is the same byte.
> > > >
> > > > I hadn't considered that possibility. Maybe the within-a-line
> > > > colouring should be disabled when a multibyte encoding is detected?
> > >
> > > as quick fix, would be nice. but how you detect the charset is
> > > multibyte? just match /utf-?.+/i ?
> >
> > My use of 'detect' was incorrect :)
> >
> > Yeah, a regexp or just a simple list of encodings was about what I had
> > in mind.
>
> ok. waiting for patch :)
Please test...
--
http://david.holroyd.me.uk/
-------------- next part --------------
Index: cvsspam.rb
===================================================================
--- cvsspam.rb (revision 255)
+++ cvsspam.rb (working copy)
@@ -936,7 +936,10 @@
   addInfixSize = line.length - (prefixLen+suffixLen)
   oversize_change = deleteInfixSize*100/@lineJustDeleted.length>33 || addInfixSize*100/line.length>33
-  if prefixLen==1 && suffixLen==0 || deleteInfixSize<=0 || oversize_change
+  # avoid doing 'within-a-line highlighting' if a multibyte encoding
+  # is suspected, as all the suffix/prefix stuff above is byte, not
+  # character based
+  if multibyte_encoding? || prefixLen==1 && suffixLen==0 || deleteInfixSize<=0 || oversize_change
     print(htmlEncode(@lineJustDeleted))
   else
     print(htmlEncode(@lineJustDeleted[0,prefixLen]))
@@ -1297,6 +1300,11 @@
   end
 end
+# guess if the users selected encoding is multibyte, since some CVSspam code
+# isn't multibyte-safe, and needs to be disabled.
+def multibyte_encoding?
+  $charset && ["utf-8", "utf-16"].include?($charset.downcase)
+end
 cvsroot_dir = "#{ENV['CVSROOT']}/CVSROOT"
 $config = "#{cvsroot_dir}/cvsspam.conf"
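The patch above sidesteps the problem by switching the within-a-line highlighting off. For comparison, here is a sketch of what a character-based prefix/suffix split might look like, assuming Ruby 1.9 or later and UTF-8 input. `split_change` is a hypothetical helper, not code from cvsspam.rb, and it leaves out the size heuristics (such as oversize_change) that the real code applies.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of CVSspam): split two versions of a line
# into [common prefix, changed part of old, changed part of new, common suffix],
# counting characters rather than bytes.
def split_change(old_line, new_line)
  a = old_line.chars.to_a
  b = new_line.chars.to_a
  limit = [a.length, b.length].min

  # Longest common run of characters at the start.
  prefix = 0
  prefix += 1 while prefix < limit && a[prefix] == b[prefix]

  # Longest common run at the end, not overlapping the prefix.
  suffix = 0
  suffix += 1 while suffix < limit - prefix && a[-1 - suffix] == b[-1 - suffix]

  [a[0, prefix].join,
   a[prefix, a.length - prefix - suffix].join,
   b[prefix, b.length - prefix - suffix].join,
   a[a.length - suffix, suffix].join]
end

parts = split_change("'map_tab_label' => 'карта',", "'map_tab_label' => 'Карта',")
p parts
```

For the lines from the original report, the changed part is the single character 'к' on the old side and 'К' on the new side, so no character is cut in half.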
From glen at delfi.ee Fri Mar 16 00:17:32 2007
From: glen at delfi.ee (Elan =?utf-8?q?Ruusam=C3=A4e?=)
Date: Fri Mar 16 00:17:46 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <20070307235959.GA3600@badgers-in-foil.co.uk>
References: <200703061235.36645.glen@delfi.ee>
<200703072106.28077.glen@delfi.ee>
<20070307235959.GA3600@badgers-in-foil.co.uk>
Message-ID: <200703160217.32258.glen@delfi.ee>
On Thursday 08 March 2007, David Holroyd wrote:
> Please test...
not sure what happened, but it still seems broken.
my loginfo contains:
^test/utf8 /usr/share/cvsspam/collect_diffs.rb --charset utf-8 --from $USER --to glen@delfi.ee %{sVv}
the email contains (seen with less):
-+ 'map_tab_label' => 'карта',
++ 'map_tab_label' => '<9A>арта'
so my guess is, it worked for "removed", but did not for "added"
--
glen
-------------- next part --------------
-+ 'map_tab_label' => 'карта',
++ 'map_tab_label' => 'Карта',
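One possible reading of the `<9A>` that less shows, offered as a sketch rather than a diagnosis: 'К' is encoded in UTF-8 as the bytes D0 9A, so cutting the added line after the byte-wise common prefix leaves a stray 9A at the start of the highlighted part. That would fit the guess above, since the hunk in the patch only changes the condition in front of the @lineJustDeleted output; whether the added-line branch has a separate check is not visible in the thread. The helper `split_at_byte` is hypothetical, and the code assumes Ruby 1.9.3 or later for `String#byteslice`.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of CVSspam): cut a string at a byte offset,
# the way byte-based prefix/suffix arithmetic effectively does.
def split_at_byte(line, offset)
  [line.byteslice(0, offset), line.byteslice(offset, line.bytesize - offset)]
end

new_line = "'map_tab_label' => 'Карта',"

# 21 is the byte-wise common prefix with the old line: 20 ASCII bytes
# plus the shared lead byte D0.
head, tail = split_at_byte(new_line, 21)

puts head.valid_encoding?              # false: ends with a lone lead byte D0
puts tail.valid_encoding?              # false: starts with a continuation byte
puts format("%02X", tail.bytes.first)  # 9A
```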