perl - 如何计算文件中的所有字符，包括 Control 和 Unicode？

转载作者：行者123 更新时间：2023-12-01 14:09:45

首先，对于一个很长的问题，我深表歉意。我一直在寻找一个脚本，该脚本可以按字符逐项列出文件中的所有内容。我遇到了一个脚本并决定扩展它以显示控制字符和 unicode。以下是我对此的尝试，但这并不完全正确。所以我寻求一些帮助。我一直在研究如何正确读取 UTF-8 格式的文件，有很多关于如何不这样做的评论，但很少有适合我的方法。

使用我的 mac 中的 .DS_Store 文件，我得到以下输出。我想了解如何解决警告(即不仅仅是忽略它们，而是正确处理它们)。我也在寻找一种方法来验证我是否做对了。例如。 od -c .DS_Store 是一种方法，但我没有看到与我的输出一对一匹配。

>charlist_v4 .DS_Store
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
               Dec     Hex  Letter   Count  Desc

     1       0  0x0000  [NUL]        6,020  C0 Control Character Set - Null (^@ \0)                                         
     2       1  0x0001  [SOH]           59  C0 Control Character Set - Start of Header (^A)                                 
     3       2  0x0002  [STX]            8  C0 Control Character Set - Start of Text (^B)                                   
     4       3  0x0003  [ETX]            1  C0 Control Character Set - End of Text (^C)                                     
     5       4  0x0004  [EOT]            7  C0 Control Character Set - End of Transmission (^D)                             
     6       8  0x0008  [BS]             9  C0 Control Character Set - Backspace (^H \b)                                    
     7      11  0x000B  [VT]             2  C0 Control Character Set - Vertical Tabulation (^K \v)                          
     8      16  0x0010  [DLE]            9  C0 Control Character Set - Data Line Escape (^P)                                
     9      24  0x0018  [CAN]            1  C0 Control Character Set - Cancel (^X)                                          
    10      32  0x0020  [SP]             7  Space                                                                           
    11      37  0x0025  [%]          2  PERCENT SIGN                                                                    
    12      48  0x0030  [ ]          6  DIGIT ZERO                                                                      
    13      49  0x0031  [1]          1  DIGIT ONE                                                                       
    14      56  0x0038  [8]          6  DIGIT EIGHT                                                                     
    15      64  0x0040  [@]          7  COMMERCIAL AT                                                                   
    16      66  0x0042  [B]          2  LATIN CAPITAL LETTER B                                                          
    17      68  0x0044  [D]          2  LATIN CAPITAL LETTER D                                                          
    18      69  0x0045  [E]          1  LATIN CAPITAL LETTER E                                                          
    19      83  0x0053  [S]          1  LATIN CAPITAL LETTER S                                                          
    20      92  0x005C  [\]          6  REVERSE SOLIDUS                                                                 
    21      96  0x0060  [`]          1  GRAVE ACCENT                                                                    
    22     100  0x0064  [d]          1  LATIN SMALL LETTER D                                                            
    23     117  0x0075  [u]          1  LATIN SMALL LETTER U                                                            
    24     120  0x0078  [x]          6  LATIN SMALL LETTER X

  #!/usr/bin/perl
  # ========== ========== ========== ========== ========== ========== ==========
  # charlist2.pl
  #
  # count every character in a file
  #
  # Version 1: 16 Aug 05  bb
  # Version 2: 21 Sep 05 jw v2 modified layout of output file
  # Version 3: 2005-10-15 bh Added -f and -r options
  # Version 4: 31 Jan 2010 EDP - added UTF-8 functionality
  # ========== ========== ========== ========== ========== ========== ==========
  $| = 1;             # Do not buffer output
  use strict;
  use warnings;
  use Encode qw(encode :fallbacks);


  #use open IO => ':utf8'; # all I/O in utf8
  #no warnings 'utf8'; # but ignore utf-8 warnings
  #binmode( STDIN, ":utf8" );
  #binmode( STDOUT, ":utf8" );
  #binmode( STDERR, ":utf8" );

  use Unicode::UCD 'charinfo';
  use Cwd 'abs_path'; # get full absolute path to files, regardless of where it is ran from
  {
    no warnings;      # warnings doesn't like $0 below
    use constant {
      PROGRAM  => abs_path( $0 ),  # get full path, not relative path
      DEBUG    => $ENV{ 'DEBUG' }  # to turn on debugging:  export DEBUG=1
    };
  }

  # ---------- ---------- ----------
  our $Version = "4.0";


  # ---------- ---------- ----------
  use Getopt::Std;
  our ( $opt_f, $opt_r );
  getopts( 'fr' );

  # ---------- ---------- ----------
  die <<"eof" unless $#ARGV >= 0;
  Usage:
    charlist2.pl [-f] [-r]  infile > outfile

  Given a text file, count the number of times each character occurs.
  Print out the count, also giving the decimal equivalent of each character.

  -f sort by frequency

  -r reverse sort order

  Version $Version
  eof
  my $file = $ARGV[0];
  my %ctrls;




    sub commify {
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------
    # Description : commify a number
    #
    # Arguments   : number
    #
    # Returns     : string equivalent with commas every three numbers to the
    #               left of the decimal
    #
    # Example     : $num_str = commify 1234.5678  # == 1,234.5678
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------

      my $text = reverse $_[0];
      $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
      return scalar reverse $text;

    } # commify


    sub trim {
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------
    # Description : Trim spaces before and after a string
    #
    # Arguments   : string
    #
    # Returns     : regex out any leading/trailing spaces
    #
    # Example     : print trim( '     a  ' )  # 'a'
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------

      my ( $str ) = shift =~ m!^\s*(.+?)\s*$!i;
      defined $str ? return $str : return '';

    } # trim

    sub ident {
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------
    # Description : Identify everything about this character
    #
    # Arguments   : line counter
    #               character code (i.e. space = 32)
    #               count of how many we found
    #
    # Returns     : output line to STDOUT
    #
    # Example     : ident( line_num=>$cnt,
    #                      char_code=>$idx,
    #                      count=>$count[$idx] );
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------

    my %args = @_;
    my $line_num = $args{line_num} || die 'ident( line_num=> ) paramer required';
    my $char_code = $args{char_code} ;#|| die 'ident( char_code=> ) paramer required';
    my $count = $args{count} || die 'ident( count=> ) paramer required';

    my ( $c, $h, $n );

    # ---------- ---------- ----------
    # Gather what unicode information about this character
    # ---------- ---------- ----------
    my $info=eval { charinfo( $char_code ) };

    # ---------- ---------- ----------
    # and we find something
    # ---------- ---------- ----------
    if ( defined $info )
    {

      # ---------- ---------- ----------
      # what if it is one of the control
      # characters defined at the end of
      # this file?
      # ---------- ---------- ----------
      if ( defined $ctrls{$char_code} )
      {

        $c = trim( $ctrls{$char_code}[0] );
        $h = $info->{code};
        $n = trim( $ctrls{$char_code}[1] );

      }
      else
      {

        # ---------- ---------- ----------
        # what did we find?
        # ---------- ---------- ----------
        $c = chr( $char_code ) || ' ';
        eval {

          no warnings;
          if ( $info->{combining} > 0 )
          {
            $c = ' ' . $c;
          }

        };
        $h = $info->{code} || ' ';
        $n = trim( $info->{name} ) || ' ';

      }

    }
    else
    {

      # ---------- ---------- ----------
      # we didn't find anything in the system files.
      # it may not be up-to-date
      # ---------- ---------- ----------
      $n = '<undef>';

    }
    print sprintf( "%6d", $line_num ) . "\t";
    print sprintf( "%6d", $char_code ) ."\t";
    print '0x' . $h . "\t";
    print sprintf( "[%-1s]\t", $c );
    print sprintf( "%10s", commify( $count ) ) . "\t";
    print sprintf( "%-80s", $n );
    print "\n";
    } # ident



  # ---------- ---------- ----------
  # Load special control characters from DATA below
  # ---------- ---------- ----------
  while ( <DATA> )
  {

    chomp;
    last unless /\S/;
    my ( $key, @data ) = split /,/;
    $ctrls{$key} = \@data;

  }



  # ---------- ---------- ----------
  # Read the file
  # ---------- ---------- ----------
  my $line;
  my @count;

  #open( my $fh, '<', $file ) or die "Unable to open $file - $!\n";
  #while ( $line = <$fh> )

  open( my $fh, '<:encoding( UTF-8 )', $file ) or die "Unable to open $file - $!\n";
  while ( $line = encode( 'UTF-8', <$fh>, FB_PERLQQ ) )
  {

    my @chars = split( //, $line );
    foreach my $char ( @chars )
    {

  #    utf8::decode( $char ) or die "unable to change [$char] to utf8";
      $count[ ord( $char ) ]++;

    }

  }
  close $fh or die "Unable to close $file: $!\n";


  # ---------- ---------- ----------
  #  http://unicode.org/faq/utf_bom.html#gen6
  #  1114111 = 0x10FFFF - max possible value in Unicode UTF-8 v.5.2.
  # ---------- ---------- ----------
  my @list = ( 0 .. 1114111 );
  @list = sort { $count[$a] || 0 <=> $count[$b] || 0 } @list if $opt_f;
  @list = reverse @list if $opt_r;

  # ---------- ---------- ----------
  # Show what we found
  # ---------- ---------- ----------
  print "\t   Dec\t   Hex\tLetter\t Count\tDesc\n\n";
  my $cnt = 1;
  for my $idx ( @list )
  {

    if ( $count[$idx] )
    {

      print "line_num=>$cnt\tchar_code=>$idx\tcount=>$count[$idx]\n" if DEBUG;
      ident( line_num=>$cnt,
             char_code=>$idx,
             count=>$count[$idx] );
      $cnt++;

    }

  }

  # ---------- ---------- ----------
  # All done
  # ---------- ---------- ----------
  exit;

  # ========== ========== ========== ========== ========== ========== ==========

  # ---------- ---------- ----------
  # These special characters don't have all
  # this extra definition, so let's make this list
  # ---------- ---------- ----------
  __DATA__
  0,NUL,C0 Control Character Set - Null (^@ \0)
  1,SOH,C0 Control Character Set - Start of Header (^A)
  2,STX,C0 Control Character Set - Start of Text (^B)
  3,ETX,C0 Control Character Set - End of Text (^C)
  4,EOT,C0 Control Character Set - End of Transmission (^D)
  5,ENQ,C0 Control Character Set - Enquiry (^E)
  6,ACK,C0 Control Character Set - Acknowledge (^F)
  7,BEL,C0 Control Character Set - Bell(^G \a)
  8,BS,C0 Control Character Set - Backspace (^H \b)
  9,HT,C0 Control Character Set - Horizontal Tabulation (^I \t)
  10,LF,C0 Control Character Set - Line Feed (^J \n)
  11,VT,C0 Control Character Set - Vertical Tabulation (^K \v)
  12,FF,C0 Control Character Set - Form Feed (^L \f)
  13,CR,C0 Control Character Set - Carriage Return (^M \r)
  14,SO,C0 Control Character Set - Shift Out (^N)
  15,SI,C0 Control Character Set - Shift In (^O)
  16,DLE,C0 Control Character Set - Data Line Escape (^P)
  17,DC1,C0 Control Character Set - Device Control One (^Q) - XON
  18,DC2,C0 Control Character Set - Device Control Two (^R)
  19,DC3,C0 Control Character Set - Device Control Three (^S) - XOFF
  20,DC4,C0 Control Character Set - Device Control Four (^T)
  21,NAK,C0 Control Character Set - Negative Acknowledge (^U)
  22,SYN,C0 Control Character Set - Synchronous Idle (^V)
  23,ETB,C0 Control Character Set - End of Transmission Block (^W)
  24,CAN,C0 Control Character Set - Cancel (^X)
  25,EM,C0 Control Character Set - End of Medium (^Y)
  26,SUB,C0 Control Character Set - Substitute (^Z)
  27,ESC,C0 Control Character Set - Escape (^[, \e)
  28,FS,C0 Control Character Set - File Separator (^\)
  29,GS,C0 Control Character Set - Group Separator (^])
  30,RS,C0 Control Character Set - Record Separator (^^)
  31,US,C0 Control Character Set - Unit Separator (^_)
  32,SP,Space
  127,DEL,Delete (^?)
  128,PAD,C1 Control Character Set - Padding Character
  129,HOP,C1 Control Character Set - High Octet Preset
  130,BPH,C1 Control Character Set - Break Permitted Here
  131,NBH,C1 Control Character Set - No Break Here
  132,IND,C1 Control Character Set - Index
  133,NEL,C1 Control Character Set - Next Line
  134,SSA,C1 Control Character Set - Start of Selected Area
  135,ESA,C1 Control Character Set - End of Selected Area
  136,HTS,C1 Control Character Set - Horizontal Tabulation Set
  137,HTJ,C1 Control Character Set - Horizontal Tabulation with Justification
  138,VTS,C1 Control Character Set - Vertical Tabulation Set
  139,PLD,C1 Control Character Set - Partial Line Down
  140,PLU,C1 Control Character Set - Partial Line Up
  141,RI,C1 Control Character Set - Reverse Index
  142,SS2,C1 Control Character Set - Single-Shift Two
  143,SS3,C1 Control Character Set - Single-Shift Three
  144,DCS,C1 Control Character Set - Device Control String
  145,PU1,C1 Control Character Set - Private Use One
  146,PU2,C1 Control Character Set - Private Use Two
  147,STS,C1 Control Character Set - Set Transmit State
  148,CCH,C1 Control Character Set - Cancel Character
  149,MW,C1 Control Character Set - Message Waiting
  150,SPA,C1 Control Character Set - Start of Guarded Protected Area
  151,EPA,C1 Control Character Set - End of Guarded Protected Area
  152,SOS,C1 Control Character Set - Start of String
  153,SGCI,C1 Control Character Set - Single Graphic Character Introducer
  154,SCI,C1 Control Character Set - Single Character Introducer
  155,CSI,C1 Control Character Set - Control Sequence Introducer
  156,ST,C1 Control Character Set - String Terminator
  157,OSC,C1 Control Character Set - Operating System Command
  158,PM,C1 Control Character Set - Privacy Message
  159,APC,C1 Control Character Set - Application Program Command
  __END__

  # ========== ========== ========== ========== ========== ========== ==========

最佳答案

简单的回答

这是一个大纲。 永远不要自己手动解码!我唯一一次不得不这样做是处理一个文件，其中的编码从一行到下一行都不一样。相反，始终在流上设置编码，无论是通过以下方式之一:

PERLUNICODE 环境变量:std{in,out,err} 的标准 S 和危险的 D对于文件
use open 编译指示。
在 3⁺‐arg open 的模式参数中。
在 binmode 的第二个参数中。

这是一个大纲:

use warnings;
use warnings FATAL => "utf8";
use charnames ();
my %seen = ();
binmode(STDOUT, ":utf8") || die "binmode failed";
binmode(STDIN, ":encoding(UTF-8)") || die "binmode failed";

while (<STDIN>) {
    $seen{$_}++ for split //;
}
close(STDIN) || die "can't close STDIN: $!";

现在您有一个 %seen 哈希，它由每个字符索引，其关联值为实例计数。

简单回答

这是一个完整的解决方案，假设所有输入都是 UTF-8。如果您不喜欢代码点顺序，它可以生成漂亮的输出，您可以对不同的列进行排序。

#!/usr/bin/env perl
#
# unicount - count code points in input
# Tom Christiansen <tchrist@perl.com>

use v5.12;
use strict;
use sigtrap;
use warnings;
use open qw( :encoding(UTF-8) :std );
use charnames ();

use List::Util          qw(max);
use Unicode::UCD        qw(charinfo charblock);

my $total = 0;
my %seen = ();

while (<>) {
    $total += length;
    $seen{$_}++ for split //;
};

my $dec_width = length($total);
my $hex_width = max(4, length sprintf("%x", max map { ord } keys %seen));

for (sort keys %seen) {
    my $count = $seen{$_};
    my $gcat  = charinfo(ord())->{category};
    my $name  = charnames::viacode(ord())
             || "<unnamed code point in @{[charblock(ord())]}>";

    printf "%*d U+%0*X GC=%2s %s\n",
            $dec_width => $count,
            $hex_width => ord(),
            $gcat      => $name;
}

exit;

奢侈的回答

这不再假设输入是 UTF-8。

它使用 magic open 修剪掉 .gz 类型的扩展。
它在 podfiles 中查找嵌入的 =encoding。这可以扩展到 html 和 xml 文件。
如果文件的扩展名与有效的编码别名匹配，则使用该编码。例如，foo.latin1、foo.utf8、foo.cp1252、foo.utf16、foo.utf16be、foo.macroman。我坚信没有纯文本文件这样的东西，因此应该立即禁止使用 .txt 扩展名。
否则二进制文件假定为字节，否则假定为 utf8。

处理可以按行而不是整个文件，但我将其作为练习留给读者。

#!/usr/bin/env perl
#
# unicount - count code points in input
# Tom Christiansen <tchrist@perl.com>

use v5.12;
use strict;
use sigtrap;
use warnings;
use charnames ();

use Carp                qw(carp croak confess cluck);
use List::Util          qw(max);
use Unicode::UCD        qw(charinfo charblock);

sub fix_extension;
sub process_input   (&) ;
sub set_encoding    (*$);
sub yuck            ($) ;

my $total = 0;
my %seen = ();

# deep magic here
process_input {
    $total += length;
    $seen{$_}++ for split //;
};

my $dec_width = length($total);
my $hex_width = max(4, length sprintf("%x", max map { ord } keys %seen));

for (sort keys %seen) {
    my $count = $seen{$_};
    my $gcat  = charinfo(ord())->{category};
    my $name  = charnames::viacode(ord())
             || "<unnamed code point in @{[charblock(ord())]}>";

    printf "%*d U+%0*X GC=%2s %s\n",
            $dec_width => $count,
            $hex_width => ord(),
            $gcat      => $name;
}

exit;

##################################################

sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}

sub process_input(&) {
    my $function = shift();
    my $enc;

    if (@ARGV == 0 && -t STDIN && -t STDERR) {
        print STDERR "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:

    for my $file (@ARGV) {
        # don't let magic open make an output handle
        next if -e $file && ! -f _;
        my $quasi_filename = fix_extension($file);
        $file = "standard input" if $file eq q(-);
        $quasi_filename =~ s/^(?=\s*[>|])/< /;

        no strict "refs";
        my $fh = $file;   # is *so* a lexical filehandle! ###98#
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            # could just do this a line at a time, but not if counting \R's
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };

        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        do {
            # much faster to alias than to copy
            local *_ = \$whole_file;
            &$function;
        };

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }

    } # foreach file

}

# Encoding set to (after unzipping):
#    if file.pod => use whatever =encoding says
#    elsif file.ENCODING for legal encoding name -> use that one
#    elsif file is binary => use bytes
#    else => use utf8
#
# Note that gzipped stuff always shows up as bytes this way, but
#   it internal unzipped bytes are still counted after unzipping
#
sub set_encoding(*$) {
    my ($handle, $path) = @_;

    my $enc_name = (-f $path && -B $path) ? "bytes" : "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;

        if ($ext eq "pod") {
            my $int_enc = qx{
                perl -C0 -lan -00 -e 'next unless /^=encoding/; print \$F[1]; exit' $path
            };
            if ($int_enc) {
                chomp $int_enc;
                $ext = $int_enc;
              ##print STDERR "$0: reset encoding to $ext on $path\n";
            }
        }

        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
      ##print STDERR qq(binmode($handle, ":$enc_name")\n);
        binmode($handle, ":$enc_name") || die "binmode to $enc_name failed";
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }

    yuck("set_encoding: $@");

    return undef;
}

sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z       =>  "zcat",
        z       => "gzcat",            # for uncompressing
        gz      => "gzcat",
        bz      => "bzcat",
        bz2     => "bzcat",
        bzip    => "bzcat",
        bzip2   => "bzcat",
        lzma    => "lzcat",
    );

    if ($path =~ m{ \. ( [^.\s] +) \z }x) {
        if (my $prog = $Compress{$1}) {
            # HIP HIP HURRAY! for magic open!!!
            # HIP HIP HURRAY! for magic open!!!
            # HIP HIP HURRAY! for magic open!!!
            return "$prog $path |";
        }
    }

    return $path;
}

END {
    close(STDIN)  || die "couldn't close stdin: $!";
    close(STDOUT) || die "couldn't close stdout: $!";
}

UNITCHECK {
    $SIG{  PIPE  } = sub { exit };
    $SIG{__WARN__} = sub {
        confess "trapped uncaught warning" unless $^S;
    };
}

关于perl - 如何计算文件中的所有字符，包括 Control 和 Unicode？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/7246501/

文章推荐： java - 只返回某些字段的查询？

文章推荐： java - 打印按键事件

文章推荐： java - 从一个 ArrayList 提取并修改到另一个 ArrayList

文章推荐： java - GEF EditParts 中的直接编辑确认

unicode - 何时使用 Unicode(除了非 unicode!)
我还没有找到太多关于何时使用 Unicode 的(简明)信息。我知道很多人说最佳实践是始终使用 Unicode。但 Unicode 字符串确实有更多的内存占用。我是否正确地说，必须仅在以下情况下使用
unicode - 如何从 Unicode 网站获取完整的 Unicode 表情符号列表
我正在构建一个需要使用表情符号的应用程序，特别是生成大量随机表情符号序列。这需要有一个大列表可供选择。而不是采取方法 detailed here通过循环硬编码十六进制范围，我决定采用不同的方法并从 t
unicode - 组合任意 Unicode 符号或具有负宽度的 Unicode 字符
早在 ZX Spectrum 的早期，就有一种方法可以将一个字形打印在另一个字形之上，从而在 OVER 1 指令的帮助下创建复合字形。我想知道是否有 Unicode 方法可以在现代计算机上执行相同的
unicode - 如何将 Unicode 代码点转换为 Unicode 字符串？
我有一个表示 Unicode 代码点的字符串，例如 "272d"。如何将其转换为 "✭"？ Elixir 当然理解 Unicode: iex> > "✭" iex> "x{272d}" "✭" 但我需
unicode - 在源文件中使用 Unicode 并且缺少 unicode 符号
自从我了解到 clang 能够编译用 Unicode 编写的 c++ 源文件后，我在编写与数学相关的代码时就开始大量使用它。比较 uₙ₊₁ᵖ = A*uₙ + B*uₙ₋₁; uₙ₊₁ᶜ = π *
unicode - Unicode 拉丁下标字母列表
感谢jmcnamara我发现了一种在 xlsxwriter 图表中使用 Unicode 字符的好方法:xlsxwrter: rich text format in chart title 我需要一个所
unicode - Unicode 中可能的组合字符序列
有些字符不包含在 Unicode 中(即带重音的西里尔字母)，但可以使用组合序列创建。据我了解，可能的组合字符序列是在布局引擎和/或使用的字体中定义的。我对吗？那么，如何得到所有可能的组合序列呢？最
unicode - Unicode 的换行规则是否要求最后一个字符是强制换行符？
我正在尝试使用 libunibreak ( https://github.com/adah1972/libunibreak ) 来标记某些给定 unicode 文本中可能的换行符。 Libunibre
unicode - Unicode 字母字符列表
我需要具有属性 Alphabetic 的 Unicode 字符范围列表如 http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Alphabetic 中所定
unicode - Unicode 中视觉相同的字符
我想为 Unicode 中的特定字符找到视觉上相同的字符。我知道如何找到一个字符的规范或兼容性分解；但他们没有给我我想要的。我想找到视觉上相同(不相似)的字符，它们唯一的区别可能是它们的大小。例
unicode - 规范化 unicode
假设我有包含此字符串的 Apache Solr 索引文档: Klüft skräms inför 我希望能够使用此关键字通过搜索找到它(注意“u”-“ü”): kluft 有没有办法做到这一点？最
unicode - Unicode 是否有定义的最大代码点数？
我已经阅读了很多文章以了解 Unicode 代码点的最大数量，但我没有找到最终答案。我知道 Unicode 代码点已最小化，以使所有 UTF-8 UTF-16 和 UTF-32 编码都能够处理相同数
unicode - 通过浏览器支持 Unicode
我正在使用 CSS Buttons With Icons But No Images . 图标是使用 unicode 值生成的。在这方面，我遇到了一些浏览器不支持某些 unicode 值的问题。因此，
unicode - Unicode 文本的罗马化
我正在寻找一种方法将 Unicode 字母字符从任何语言音译为带重音的拉丁字母。目的是让外国人深入了解以任何非拉丁文字书写的姓名和单词的发音。例子: 希腊语:Romanize("Αλφαβητικό
unicode - Unicode 标准中的颜色？
Unicode 6.0 添加了几个带有描述的字符，表明这些字符应该以特定颜色呈现: 红苹果 U+1F34E 青苹果 U+1F34F 蓝心U+1F499 绿心U+1F49A 黄心U+1F49B 紫心U+
unicode - Unicode 字符如何映射到字体中的字形？
我想知道，Unicode 中的每个字符都有一个代码点；字体中字符的类似术语是什么？当解码文件需要映射到字体(或字体，通过一些现代字体替换技术)时，我从来没有理解过程的一部分。例如，当文本编辑器从其
unicode - Unicode 可打印字符的范围是多少？
谁能告诉我 Unicode 可打印字符的范围是多少？ [例如。 Ascii 可打印字符范围为\u0020 -\u007f] 最佳答案参见，http://en.wikipedia.org/wiki/U
unicode - 为什么我们所做的一切都不是 Unicode？
鉴于Unicode有been around for 18 years ，为什么还有不支持 Unicode 的应用程序？甚至我对某些操作系统和 Unicode 的体验至少可以说是痛苦的。正如乔尔·斯波尔
unicode - Unicode 可以映射多少个字符？
我要求计算 Unicode 中所有可能的有效组合的数量并附上解释。我知道一个 char 可以编码为 1、2、3 或 4 个字节。我也不明白为什么连续字节有限制，即使该字符的起始字节清除了它应该有多长。
unicode - Unicode 中汉字的完整范围是多少？
Unicode 为中文字符分配了 U+4E00..U+9FFF。这是全套的一部分，但不是全部。最佳答案最终列表可以在 Unicode Character Code Charts 找到;在页面中搜索

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

perl - 如何计算文件中的所有字符，包括 Control 和 Unicode？

简单的回答

简单回答

奢侈的回答