From glen at delfi.ee  Tue Mar  6 10:35:36 2007
From: glen at delfi.ee (Elan =?utf-8?q?Ruusam=C3=A4e?=)
Date: Tue Mar  6 10:36:10 2007
Subject: [cvsspam-devel] diffs not character safe
Message-ID: <200703061235.36645.glen@delfi.ee>

appears that when passed --charset utf-8 to collect_diffs the diffs are not
characterwise but bytewise

and as cvsspam appears to make diffs on same line coloured darker, it breaks
multibytes

so if the diff would be:
-	'map_tab_label'			=> 'карта',
+	'map_tab_label'			=> 'Карта',

cvsspam hilights after first byte of letter 'k' because it's unicode first
part is the same byte.

i've attached the mail fragment as it can't be displayed properly in this
utf8-encoded email.

-- 
glen
-------------- next part --------------
-	'map_tab_label'			=> 'карта',
+	'map_tab_label'			=> 'Карта',
From dave at badgers-in-foil.co.uk  Wed Mar  7 16:05:54 2007
From: dave at badgers-in-foil.co.uk (David Holroyd)
Date: Wed Mar  7 16:06:39 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <200703061235.36645.glen@delfi.ee>
References: <200703061235.36645.glen@delfi.ee>
Message-ID: <20070307160554.GA25917@badgers-in-foil.co.uk>

On Tue, Mar 06, 2007 at 12:35:36PM +0200, Elan Ruusamäe wrote:
> appears that when passed --charset utf-8 to collect_diffs the diffs are not 
> characterwise but bytewise

You are correct.  The --charset option only sets up the email headers
with the given value; it's not used during processing at all.


> and as cvsspam appears to make diffs on same line coloured darker, it breaks 
> multibytes
> 
> so if the diff would be:
> -	'map_tab_label'			=> 'карта',
> +	'map_tab_label'			=> 'Карта',
> 
> cvsspam hilights after first byte of letter 'k' because it's unicode first 
> part is the same byte.

I hadn't considered that possibility.  Maybe the within-a-line colouring
should be disabled when a multibyte encoding is detected?

I don't know a huge amount about handling multibyte encodings in Ruby,
but have the impression that it's a bit of a black art (until Ruby 2
comes out).  Fixing this might require a rewrite of the highlighting
code, and that code is a horrible mess.  I am scared of it  :(
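
As a rough illustration of the byte/character mismatch described above (not
code from cvsspam.rb): the sketch below assumes Ruby 2.0 or later, and the
helper names common_prefix_bytes and common_prefix_chars are hypothetical.
It compares the common prefix of the two strings from the report, counted
once in bytes and once in characters.

```ruby
# encoding: utf-8
# Hypothetical helpers for illustration only; they are not part of CVSspam.

# Length of the common prefix of a and b, counted in bytes.
def common_prefix_bytes(a, b)
  a_bytes = a.bytes
  b_bytes = b.bytes
  n = 0
  n += 1 while n < a_bytes.length && n < b_bytes.length && a_bytes[n] == b_bytes[n]
  n
end

# Length of the common prefix of a and b, counted in characters.
def common_prefix_chars(a, b)
  a_chars = a.chars
  b_chars = b.chars
  n = 0
  n += 1 while n < a_chars.length && n < b_chars.length && a_chars[n] == b_chars[n]
  n
end

old_text = "'карта'"
new_text = "'Карта'"

byte_prefix = common_prefix_bytes(old_text, new_text)
char_prefix = common_prefix_chars(old_text, new_text)

puts byte_prefix   # 2: the quote plus the shared UTF-8 lead byte 0xD0
puts char_prefix   # 1: only the quote

# Cutting the string at the byte-wise prefix splits a character in half.
puts new_text.byteslice(0, byte_prefix).valid_encoding?   # false
```

A character-based prefix/suffix computation along these lines would avoid
splitting a character, but it would touch the same highlighting code that
the message above is wary of rewriting.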


-- 
http://david.holroyd.me.uk/

From glen at delfi.ee  Wed Mar  7 17:20:19 2007
From: glen at delfi.ee (Elan =?iso-8859-1?q?Ruusam=E4e?=)
Date: Wed Mar  7 17:20:32 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <20070307160554.GA25917@badgers-in-foil.co.uk>
References: <200703061235.36645.glen@delfi.ee>
	<20070307160554.GA25917@badgers-in-foil.co.uk>
Message-ID: <200703071920.19896.glen@delfi.ee>

On Wednesday 07 March 2007 18:05:54 David Holroyd wrote:
> > cvsspam hilights after first byte of letter 'k' because it's unicode
> > first part is the same byte.
>
> I hadn't considered that possibility. Maybe the within-a-line colouring
> should be disabled when a multibyte encoding is detected?

as quick fix, would be nice. but how you detect the charset is multibyte? just 
match /utf-?.+/i ?

-- 
glen

From dave at badgers-in-foil.co.uk  Wed Mar  7 17:41:45 2007
From: dave at badgers-in-foil.co.uk (David Holroyd)
Date: Wed Mar  7 17:41:47 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <200703071920.19896.glen@delfi.ee>
References: <200703061235.36645.glen@delfi.ee>
	<20070307160554.GA25917@badgers-in-foil.co.uk>
	<200703071920.19896.glen@delfi.ee>
Message-ID: <20070307174145.GA29725@badgers-in-foil.co.uk>

On Wed, Mar 07, 2007 at 07:20:19PM +0200, Elan Ruusamäe wrote:
> On Wednesday 07 March 2007 18:05:54 David Holroyd wrote:
> > > cvsspam hilights after first byte of letter 'k' because it's unicode
> > > first part is the same byte.
> >
> > I hadn't considered that possibility. Maybe the within-a-line
> > colouring should be disabled when a multibyte encoding is detected?
> 
> as quick fix, would be nice. but how you detect the charset is
> multibyte? just match /utf-?.+/i ?

My use of 'detect' was incorrect :)

Yeah, a regexp or just a simple list of encodings was about what I had
in mind.
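
A minimal sketch of the "regexp or simple list" idea, assuming the charset
arrives as a plain string (or nil when --charset was not given). The name
multibyte_charset? and the particular charsets listed are assumptions for
illustration, not an agreed or exhaustive list.

```ruby
# Hypothetical pattern: charset names that are treated as multibyte.
# The set of names is an assumption and would need reviewing.
MULTIBYTE_CHARSET_PATTERN =
  /\A(?:utf-?(?:8|16|32)|shift_jis|euc-jp|euc-kr|gb2312|big5)\z/i

# True when the given charset name looks like a multibyte encoding.
# A nil charset (option not given) is treated as single-byte.
def multibyte_charset?(charset)
  !(charset.to_s =~ MULTIBYTE_CHARSET_PATTERN).nil?
end

puts multibyte_charset?("UTF-8")       # true
puts multibyte_charset?("iso-8859-1")  # false
puts multibyte_charset?(nil)           # false
```

Anchoring the pattern with \A and \z keeps it stricter than /utf-?.+/i,
which would also match any longer string that merely contains "utf".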

-- 
http://david.holroyd.me.uk/

From glen at delfi.ee  Wed Mar  7 19:06:28 2007
From: glen at delfi.ee (Elan =?iso-8859-1?q?Ruusam=E4e?=)
Date: Wed Mar  7 19:06:38 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <20070307174145.GA29725@badgers-in-foil.co.uk>
References: <200703061235.36645.glen@delfi.ee>
	<200703071920.19896.glen@delfi.ee>
	<20070307174145.GA29725@badgers-in-foil.co.uk>
Message-ID: <200703072106.28077.glen@delfi.ee>

On Wednesday 07 March 2007 19:41:45 David Holroyd wrote:
> On Wed, Mar 07, 2007 at 07:20:19PM +0200, Elan Ruusamäe wrote:
> > On Wednesday 07 March 2007 18:05:54 David Holroyd wrote:
> > > > cvsspam hilights after first byte of letter 'k' because it's unicode
> > > > first part is the same byte.
> > >
> > > I hadn't considered that possibility. Maybe the within-a-line
> > > colouring should be disabled when a multibyte encoding is detected?
> >
> > as quick fix, would be nice. but how you detect the charset is
> > multibyte? just match /utf-?.+/i ?
>
> My use of 'detect' was incorrect :)
>
> Yeah, a regexp or just a simple list of encodings was about what I had
> in mind.

ok. waiting for patch :)

-- 
glen

From dave at badgers-in-foil.co.uk  Wed Mar  7 23:59:59 2007
From: dave at badgers-in-foil.co.uk (David Holroyd)
Date: Thu Mar  8 00:00:12 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <200703072106.28077.glen@delfi.ee>
References: <200703061235.36645.glen@delfi.ee>
	<200703071920.19896.glen@delfi.ee>
	<20070307174145.GA29725@badgers-in-foil.co.uk>
	<200703072106.28077.glen@delfi.ee>
Message-ID: <20070307235959.GA3600@badgers-in-foil.co.uk>

On Wed, Mar 07, 2007 at 09:06:28PM +0200, Elan Ruusamäe wrote:
> On Wednesday 07 March 2007 19:41:45 David Holroyd wrote:
> > On Wed, Mar 07, 2007 at 07:20:19PM +0200, Elan Ruusamäe wrote:
> > > On Wednesday 07 March 2007 18:05:54 David Holroyd wrote:
> > > > > cvsspam hilights after first byte of letter 'k' because it's unicode
> > > > > first part is the same byte.
> > > >
> > > > I hadn't considered that possibility. Maybe the within-a-line
> > > > colouring should be disabled when a multibyte encoding is detected?
> > >
> > > as quick fix, would be nice. but how you detect the charset is
> > > multibyte? just match /utf-?.+/i ?
> >
> > My use of 'detect' was incorrect :)
> >
> > Yeah, a regexp or just a simple list of encodings was about what I had
> > in mind.
> 
> ok. waiting for patch :)

Please test...

-- 
http://david.holroyd.me.uk/
-------------- next part --------------
Index: cvsspam.rb
===================================================================
--- cvsspam.rb	(revision 255)
+++ cvsspam.rb	(working copy)
@@ -936,7 +936,10 @@
         addInfixSize = line.length - (prefixLen+suffixLen)
         oversize_change = deleteInfixSize*100/@lineJustDeleted.length>33 || addInfixSize*100/line.length>33
 
-        if prefixLen==1 && suffixLen==0 || deleteInfixSize<=0 || oversize_change
+        # avoid doing 'within-a-line highlighting' if a multibyte encoding
+        # is suspected, as all the suffix/prefix stuff above is byte, not
+        # character based
+        if multibyte_encoding? || prefixLen==1 && suffixLen==0 || deleteInfixSize<=0 || oversize_change
           print(htmlEncode(@lineJustDeleted))
         else
           print(htmlEncode(@lineJustDeleted[0,prefixLen]))
@@ -1297,6 +1300,11 @@
   end
 end
 
+# guess if the users selected encoding is multibyte, since some CVSspam code
+# isn't multibyte-safe, and needs to be disabled.
+def multibyte_encoding?
+  $charset && ["utf-8", "utf-16"].include?($charset.downcase)
+end
 
 cvsroot_dir = "#{ENV['CVSROOT']}/CVSROOT"
 $config = "#{cvsroot_dir}/cvsspam.conf"
From glen at delfi.ee  Fri Mar 16 00:17:32 2007
From: glen at delfi.ee (Elan =?utf-8?q?Ruusam=C3=A4e?=)
Date: Fri Mar 16 00:17:46 2007
Subject: [cvsspam-devel] diffs not character safe
In-Reply-To: <20070307235959.GA3600@badgers-in-foil.co.uk>
References: <200703061235.36645.glen@delfi.ee>
	<200703072106.28077.glen@delfi.ee>
	<20070307235959.GA3600@badgers-in-foil.co.uk>
Message-ID: <200703160217.32258.glen@delfi.ee>

On Thursday 08 March 2007, David Holroyd wrote:
> Please test...

not sure what happened, but it still seems broken.

my loginfo contains:
^test/utf8  /usr/share/cvsspam/collect_diffs.rb --charset utf-8  --from $USER --to glen@delfi.ee %{sVv}

the email contains (seen with less):
 
-+       'map_tab_label'                 => 'карта',
 
++       'map_tab_label'                 => '<9A>арта'

so my guess is, it worked for "removed", but did not for "added"
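
One possible reading of the <9A> above, as a sketch assuming Ruby 2.0 or
later: in UTF-8, 'К' is the bytes D0 9A and 'к' is D0 BA, so a byte-wise
common prefix ends after D0 and the remainder of the added line starts with
a stray 9A. The patch hunk posted earlier only touches the branch that
prints @lineJustDeleted; whether the added-line branch needs the same check
is not visible from the hunk, so that part is a guess.

```ruby
# encoding: utf-8
# Illustration only: reproduce the stray byte by cutting the added line
# at a byte offset that falls in the middle of a character.

added = "'Карта'"

# Byte-wise common prefix with "'карта'": the quote (0x27) and the lead
# byte 0xD0 that both Cyrillic letters share.
prefix_len = 2

# Everything after the byte-wise prefix, i.e. the part that would be
# wrapped in highlighting markup.
rest = added.byteslice(prefix_len, added.bytesize - prefix_len)

printf("%02X\n", rest.getbyte(0))   # 9A
puts rest.valid_encoding?           # false
```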

-- 
glen
-------------- next part --------------
-+       'map_tab_label'                 => 'карта',
++       'map_tab_label'                 => 'Карта',