[PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
This series basically contains a cleanup from all those years of converting files to ReST.
During the conversion period, several tools like LaTeX, pandoc, DocBook and some specially-written scripts were used in order to convert existing documents.
Such conversion tools - plus some text editors like LibreOffice or similar - have a set of rules that turn some typed ASCII characters into UTF-8 alternatives, for instance converting straight quotes into curly quotes and adding non-breaking spaces. All of those are meant to produce better results when the text is displayed in HTML or PDF formats.
While it is perfectly fine to use UTF-8 characters in Linux, and especially in the documentation, it is better to stick to the ASCII subset in this particular case, for a couple of reasons:
1. it makes life easier for tools like grep;
2. they are easier to edit with some commonly used text/source code editors.
Also, Sphinx already does such conversion automatically outside literal blocks, as described at:
https://docutils.sourceforge.io/docs/user/smartquotes.html
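For instance, with smartquotes enabled (the default), a ReST source line like:

    "foo" -- bar ... baz

is rendered in the HTML output roughly as:

    “foo” – bar … baz

(just an illustration of the docutils rules linked above).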
In this series, the following UTF-8 symbols are replaced:
- U+00a0 (' '): NO-BREAK SPACE
- U+00ad (''): SOFT HYPHEN
- U+00b4 ('´'): ACUTE ACCENT
- U+00d7 ('×'): MULTIPLICATION SIGN
- U+2010 ('‐'): HYPHEN
- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
- U+2212 ('−'): MINUS SIGN
- U+2217 ('∗'): ASTERISK OPERATOR
- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
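Just as an illustration, the affected files can be listed with a PCRE-enabled GNU grep in a UTF-8 locale:

    $ grep -rlP '\x{00a0}|\x{00ad}|\x{00b4}|\x{00d7}|\x{2010}|\x{2018}|\x{2019}|\x{201c}|\x{201d}|\x{2212}|\x{2217}|\x{feff}' Documentation/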
---
v2:
- removed EM/EN DASH conversion from this patchset;
- removed a few fixes, as those were addressed on a separate series.
PS.: The first version of this series was posted with a different name:
https://lore.kernel.org/lkml/cover.1620641727.git.mchehab+huawei@kernel.org/
I also changed the patch descriptions, in order to better describe the patches' goals.
Mauro Carvalho Chehab (40):
  docs: hwmon: Use ASCII subset instead of UTF-8 alternate symbols
  docs: admin-guide: Use ASCII subset instead of UTF-8 alternate symbols
  docs: admin-guide: media: ipu3.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: admin-guide: perf: imx-ddr.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: admin-guide: pm: Use ASCII subset instead of UTF-8 alternate symbols
  docs: trace: coresight: coresight-etm4x-reference.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: driver-api: ioctl.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: driver-api: thermal: Use ASCII subset instead of UTF-8 alternate symbols
  docs: driver-api: media: drivers: Use ASCII subset instead of UTF-8 alternate symbols
  docs: driver-api: firmware: other_interfaces.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: fault-injection: nvme-fault-injection.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: usb: Use ASCII subset instead of UTF-8 alternate symbols
  docs: process: code-of-conduct.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: userspace-api: media: fdl-appendix.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: userspace-api: media: v4l: Use ASCII subset instead of UTF-8 alternate symbols
  docs: userspace-api: media: dvb: Use ASCII subset instead of UTF-8 alternate symbols
  docs: vm: zswap.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: filesystems: f2fs.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: filesystems: ext4: Use ASCII subset instead of UTF-8 alternate symbols
  docs: kernel-hacking: Use ASCII subset instead of UTF-8 alternate symbols
  docs: hid: Use ASCII subset instead of UTF-8 alternate symbols
  docs: security: tpm: tpm_event_log.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: security: keys: trusted-encrypted.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: networking: scaling.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: networking: devlink: devlink-dpipe.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: networking: device_drivers: Use ASCII subset instead of UTF-8 alternate symbols
  docs: x86: Use ASCII subset instead of UTF-8 alternate symbols
  docs: scheduler: sched-deadline.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: power: powercap: powercap.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: ABI: Use ASCII subset instead of UTF-8 alternate symbols
  docs: PCI: acpi-info.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: gpu: Use ASCII subset instead of UTF-8 alternate symbols
  docs: sound: kernel-api: writing-an-alsa-driver.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: arm64: arm-acpi.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: infiniband: tag_matching.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: misc-devices: ibmvmc.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: firmware-guide: acpi: lpit.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: firmware-guide: acpi: dsd: graph.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: virt: kvm: api.rst: Use ASCII subset instead of UTF-8 alternate symbols
  docs: RCU: Use ASCII subset instead of UTF-8 alternate symbols
 ...sfs-class-chromeos-driver-cros-ec-lightbar | 2 +-
 .../ABI/testing/sysfs-devices-platform-ipmi | 2 +-
 .../testing/sysfs-devices-platform-trackpoint | 2 +-
 Documentation/ABI/testing/sysfs-devices-soc | 4 +-
 Documentation/PCI/acpi-info.rst | 22 +-
 .../Data-Structures/Data-Structures.rst | 52 ++--
 .../Expedited-Grace-Periods.rst | 40 +--
 .../Tree-RCU-Memory-Ordering.rst | 10 +-
 .../RCU/Design/Requirements/Requirements.rst | 122 ++++----
 Documentation/admin-guide/media/ipu3.rst | 2 +-
 Documentation/admin-guide/perf/imx-ddr.rst | 2 +-
 Documentation/admin-guide/pm/intel_idle.rst | 4 +-
 Documentation/admin-guide/pm/intel_pstate.rst | 4 +-
 Documentation/admin-guide/ras.rst | 86 +++---
 .../admin-guide/reporting-issues.rst | 2 +-
 Documentation/arm64/arm-acpi.rst | 8 +-
 .../driver-api/firmware/other_interfaces.rst | 2 +-
 Documentation/driver-api/ioctl.rst | 8 +-
 .../media/drivers/sh_mobile_ceu_camera.rst | 8 +-
 .../driver-api/media/drivers/zoran.rst | 2 +-
 .../driver-api/thermal/cpu-idle-cooling.rst | 14 +-
 .../driver-api/thermal/intel_powerclamp.rst | 6 +-
 .../thermal/x86_pkg_temperature_thermal.rst | 2 +-
 .../fault-injection/nvme-fault-injection.rst | 2 +-
 Documentation/filesystems/ext4/attributes.rst | 20 +-
 Documentation/filesystems/ext4/bigalloc.rst | 6 +-
 Documentation/filesystems/ext4/blockgroup.rst | 8 +-
 Documentation/filesystems/ext4/blocks.rst | 2 +-
 Documentation/filesystems/ext4/directory.rst | 16 +-
 Documentation/filesystems/ext4/eainode.rst | 2 +-
 Documentation/filesystems/ext4/inlinedata.rst | 6 +-
 Documentation/filesystems/ext4/inodes.rst | 6 +-
 Documentation/filesystems/ext4/journal.rst | 8 +-
 Documentation/filesystems/ext4/mmp.rst | 2 +-
 .../filesystems/ext4/special_inodes.rst | 4 +-
 Documentation/filesystems/ext4/super.rst | 10 +-
 Documentation/filesystems/f2fs.rst | 4 +-
 .../firmware-guide/acpi/dsd/graph.rst | 2 +-
 Documentation/firmware-guide/acpi/lpit.rst | 2 +-
 Documentation/gpu/i915.rst | 2 +-
 Documentation/gpu/komeda-kms.rst | 2 +-
 Documentation/hid/hid-sensor.rst | 70 ++---
 Documentation/hid/intel-ish-hid.rst | 246 +++++++++---------
 Documentation/hwmon/ir36021.rst | 2 +-
 Documentation/hwmon/ltc2992.rst | 2 +-
 Documentation/hwmon/pm6764tr.rst | 2 +-
 Documentation/infiniband/tag_matching.rst | 4 +-
 Documentation/kernel-hacking/hacking.rst | 2 +-
 Documentation/kernel-hacking/locking.rst | 2 +-
 Documentation/misc-devices/ibmvmc.rst | 8 +-
 .../device_drivers/ethernet/intel/i40e.rst | 8 +-
 .../device_drivers/ethernet/intel/iavf.rst | 4 +-
 .../device_drivers/ethernet/netronome/nfp.rst | 12 +-
 .../networking/devlink/devlink-dpipe.rst | 2 +-
 Documentation/networking/scaling.rst | 18 +-
 Documentation/power/powercap/powercap.rst | 210 +++++++--------
 Documentation/process/code-of-conduct.rst | 2 +-
 Documentation/scheduler/sched-deadline.rst | 2 +-
 .../security/keys/trusted-encrypted.rst | 4 +-
 Documentation/security/tpm/tpm_event_log.rst | 2 +-
 .../kernel-api/writing-an-alsa-driver.rst | 68 ++---
 .../coresight/coresight-etm4x-reference.rst | 16 +-
 Documentation/usb/ehci.rst | 2 +-
 Documentation/usb/gadget_printer.rst | 2 +-
 Documentation/usb/mass-storage.rst | 36 +--
 .../media/dvb/audio-set-bypass-mode.rst | 2 +-
 .../userspace-api/media/dvb/audio.rst | 2 +-
 .../userspace-api/media/dvb/dmx-fopen.rst | 2 +-
 .../userspace-api/media/dvb/dmx-fread.rst | 2 +-
 .../media/dvb/dmx-set-filter.rst | 2 +-
 .../userspace-api/media/dvb/intro.rst | 6 +-
 .../userspace-api/media/dvb/video.rst | 2 +-
 .../userspace-api/media/fdl-appendix.rst | 64 ++---
 .../userspace-api/media/v4l/crop.rst | 16 +-
 .../userspace-api/media/v4l/dev-decoder.rst | 6 +-
 .../userspace-api/media/v4l/diff-v4l.rst | 2 +-
 .../userspace-api/media/v4l/open.rst | 2 +-
 .../media/v4l/vidioc-cropcap.rst | 4 +-
 Documentation/virt/kvm/api.rst | 28 +-
 Documentation/vm/zswap.rst | 4 +-
 Documentation/x86/resctrl.rst | 2 +-
 Documentation/x86/sgx.rst | 4 +-
 82 files changed, 693 insertions(+), 693 deletions(-)
The conversion tools used during the DocBook/LaTeX/Markdown->ReST conversion, and some automatic rules which exist in certain text editors like LibreOffice, turned ASCII characters into some UTF-8 alternatives that are better displayed on html and PDF.
While it is OK to use UTF-8 characters in Linux, it is better to use the ASCII subset instead of a UTF-8 equivalent character, as it makes life easier for tools like grep, and ASCII is easier to edit with some commonly used text/source code editors.
Also, Sphinx already does such conversion automatically outside literal blocks: https://docutils.sourceforge.io/docs/user/smartquotes.html
So, replace the occurrences of the following UTF-8 characters:
- U+00a0 (' '): NO-BREAK SPACE
- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 .../kernel-api/writing-an-alsa-driver.rst | 68 +++++++++----------
 1 file changed, 34 insertions(+), 34 deletions(-)
diff --git a/Documentation/sound/kernel-api/writing-an-alsa-driver.rst b/Documentation/sound/kernel-api/writing-an-alsa-driver.rst
index e6365836fa8b..201ced3bba6e 100644
--- a/Documentation/sound/kernel-api/writing-an-alsa-driver.rst
+++ b/Documentation/sound/kernel-api/writing-an-alsa-driver.rst
@@ -533,7 +533,7 @@ Management of Cards and Components
 Card Instance
 -------------
 
-For each soundcard, a “card” record must be allocated.
+For each soundcard, a "card" record must be allocated.
 
 A card record is the headquarters of the soundcard. It manages the
 whole list of devices (components) on the soundcard, such as PCM,
 mixers,
@@ -980,7 +980,7 @@ The role of destructor is simple: disable the hardware (if already
 activated) and release the resources. So far, we have no hardware
 part, so the disabling code is not written here.
 
-To release the resources, the “check-and-release” method is a safer way.
+To release the resources, the "check-and-release" method is a safer way.
 For the interrupt, do like this:
 
 ::
@@ -1133,7 +1133,7 @@ record:
 The ``probe`` and ``remove`` functions have already been defined in
 the previous sections. The ``name`` field is the name string of this
-device. Note that you must not use a slash “/” in this string.
+device. Note that you must not use a slash "/" in this string.
 
 And at last, the module entries:
 
@@ -1692,8 +1692,8 @@ Typically, you'll have a hardware descriptor as below:
   The other possible flags are ``SNDRV_PCM_INFO_PAUSE`` and
   ``SNDRV_PCM_INFO_RESUME``. The ``PAUSE`` bit means that the pcm
-  supports the “pause” operation, while the ``RESUME`` bit means that
-  the pcm supports the full “suspend/resume” operation. If the
+  supports the "pause" operation, while the ``RESUME`` bit means that
+  the pcm supports the full "suspend/resume" operation. If the
   ``PAUSE`` flag is set, the ``trigger`` callback below must handle
   the corresponding (pause push/release) commands. The suspend/resume
   trigger commands can be defined even without the ``RESUME``
@@ -1731,7 +1731,7 @@ Typically, you'll have a hardware descriptor as below:
   ``periods_min`` define the maximum and minimum number of periods in
   the buffer.
 
-  The “period” is a term that corresponds to a fragment in the OSS
+  The "period" is a term that corresponds to a fragment in the OSS
   world. The period defines the size at which a PCM interrupt is
   generated. This size strongly depends on the hardware. Generally,
   the smaller period size will give you more interrupts, that is,
@@ -1756,7 +1756,7 @@ application.
 This field contains the enum value ``SNDRV_PCM_FORMAT_XXX``.
 
 One thing to be noted is that the configured buffer and period sizes
-are stored in “frames” in the runtime. In the ALSA world, ``1 frame =
+are stored in "frames" in the runtime. In the ALSA world, ``1 frame =
 channels * samples-size``. For conversion between frames and bytes,
 you can use the :c:func:`frames_to_bytes()` and
 :c:func:`bytes_to_frames()` helper functions.
@@ -1999,7 +1999,7 @@ prepare callback
 
     static int snd_xxx_prepare(struct snd_pcm_substream *substream);
 
-This callback is called when the pcm is “prepared”. You can set the
+This callback is called when the pcm is "prepared". You can set the
 format type, sample rate, etc. here. The difference from ``hw_params``
 is that the ``prepare`` callback will be called each time
 :c:func:`snd_pcm_prepare()` is called, i.e. when recovering after
@@ -2436,8 +2436,8 @@ size is aligned with the period size.
 The hw constraint is a very much powerful mechanism to define the
 preferred PCM configuration, and there are relevant helpers.
-I won't give more details here, rather I would like to say, “Luke, use
-the source.”
+I won't give more details here, rather I would like to say, "Luke, use
+the source."
 
 Control Interface
 =================
@@ -2518,50 +2518,50 @@ Control Names
 -------------
 
 There are some standards to define the control names. A control is
-usually defined from the three parts as “SOURCE DIRECTION FUNCTION”.
+usually defined from the three parts as "SOURCE DIRECTION FUNCTION".
 
 The first, ``SOURCE``, specifies the source of the control, and is a
-string such as “Master”, “PCM”, “CD” and “Line”. There are many
+string such as "Master", "PCM", "CD" and "Line". There are many
 pre-defined sources.
 
 The second, ``DIRECTION``, is one of the following strings according to
-the direction of the control: “Playback”, “Capture”, “Bypass Playback”
-and “Bypass Capture”. Or, it can be omitted, meaning both playback and
+the direction of the control: "Playback", "Capture", "Bypass Playback"
+and "Bypass Capture". Or, it can be omitted, meaning both playback and
 capture directions.
 
 The third, ``FUNCTION``, is one of the following strings according to
-the function of the control: “Switch”, “Volume” and “Route”.
+the function of the control: "Switch", "Volume" and "Route".
 
-The example of control names are, thus, “Master Capture Switch” or “PCM
-Playback Volume”.
+The example of control names are, thus, "Master Capture Switch" or "PCM
+Playback Volume".
 
 There are some exceptions:
 
 Global capture and playback
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-“Capture Source”, “Capture Switch” and “Capture Volume” are used for the
-global capture (input) source, switch and volume. Similarly, “Playback
-Switch” and “Playback Volume” are used for the global output gain switch
+"Capture Source", "Capture Switch" and "Capture Volume" are used for the
+global capture (input) source, switch and volume. Similarly, "Playback
+Switch" and "Playback Volume" are used for the global output gain switch
 and volume.
 
 Tone-controls
 ~~~~~~~~~~~~~
 
-tone-control switch and volumes are specified like “Tone Control - XXX”,
-e.g. “Tone Control - Switch”, “Tone Control - Bass”, “Tone Control -
-Center”.
+tone-control switch and volumes are specified like "Tone Control - XXX",
+e.g. "Tone Control - Switch", "Tone Control - Bass", "Tone Control -
+Center".
 
 3D controls
 ~~~~~~~~~~~
 
-3D-control switches and volumes are specified like “3D Control - XXX”,
-e.g. “3D Control - Switch”, “3D Control - Center”, “3D Control - Space”.
+3D-control switches and volumes are specified like "3D Control - XXX",
+e.g. "3D Control - Switch", "3D Control - Center", "3D Control - Space".
 
 Mic boost
 ~~~~~~~~~
 
-Mic-boost switch is set as “Mic Boost” or “Mic Boost (6dB)”.
+Mic-boost switch is set as "Mic Boost" or "Mic Boost (6dB)".
 
 More precise information can be found in
 ``Documentation/sound/designs/control-names.rst``.
@@ -3368,7 +3368,7 @@ This ensures that the device can be closed and the driver unloaded
 without losing data.
 
 This callback is optional. If you do not set ``drain`` in the struct
-snd_rawmidi_ops structure, ALSA will simply wait for 50 milliseconds
+snd_rawmidi_ops structure, ALSA will simply wait for 50 milliseconds
 instead.
 
 Miscellaneous Devices
@@ -3506,20 +3506,20 @@ fixed as 4 bytes array (value.iec958.status[x]). For the ``info``
 callback, you don't specify the value field for this type (the count
 field must be set, though).
 
-“IEC958 Playback Con Mask” is used to return the bit-mask for the IEC958
-status bits of consumer mode. Similarly, “IEC958 Playback Pro Mask”
+"IEC958 Playback Con Mask" is used to return the bit-mask for the IEC958
+status bits of consumer mode. Similarly, "IEC958 Playback Pro Mask"
 returns the bitmask for professional mode. They are read-only controls,
 and are defined as MIXER controls (iface =
 ``SNDRV_CTL_ELEM_IFACE_MIXER``).
 
-Meanwhile, “IEC958 Playback Default” control is defined for getting and
+Meanwhile, "IEC958 Playback Default" control is defined for getting and
 setting the current default IEC958 bits. Note that this one is usually
 defined as a PCM control (iface = ``SNDRV_CTL_ELEM_IFACE_PCM``),
 although in some places it's defined as a MIXER control.
 
 In addition, you can define the control switches to enable/disable or
 to set the raw bit mode. The implementation will depend on the chip, but
-the control should be named as “IEC958 xxx”, preferably using the
+the control should be named as "IEC958 xxx", preferably using the
 :c:func:`SNDRV_CTL_NAME_IEC958()` macro.
 
 You can find several cases, for example, ``pci/emu10k1``,
@@ -3547,7 +3547,7 @@ function.
 
 Usually, ALSA drivers try to allocate and reserve a large contiguous
 physical space at the time the module is loaded for the later use. This
-is called “pre-allocation”. As already written, you can call the
+is called "pre-allocation". As already written, you can call the
 following function at pcm instance construction time (in the case of
 PCI bus).
 
@@ -4163,7 +4163,7 @@ The typical coding would be like below:
 
 Also, don't forget to define the module description and the
 license. Especially, the recent modprobe requires to define the
-module license as GPL, etc., otherwise the system is shown as “tainted”.
+module license as GPL, etc., otherwise the system is shown as "tainted".
 
 ::
 
@@ -4181,7 +4181,7 @@ So far, you've learned how to write the driver codes. And you might
 have a question now: how to put my own driver into the ALSA driver
 tree? Here (finally :) the standard procedure is described briefly.
 
-Suppose that you create a new PCI driver for the card “xyz”. The card
+Suppose that you create a new PCI driver for the card "xyz". The card
 module name would be snd-xyz. The new driver is usually put into the
 alsa-driver tree, ``sound/pci`` directory in the case of PCI cards.
On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
v2:
- removed EM/EN DASH conversion from this patchset;
Are you still thinking about doing the
EN DASH --> "--" EM DASH --> "---"
conversion? That's not going to change what the documentation will look like in the HTML and PDF output forms, and I think it would make life easier for people who are reading and editing the Documentation/* files in text form.
- Ted
On Wed, 12 May 2021 10:14:44 -0400, "Theodore Ts'o" tytso@mit.edu wrote:
On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
v2:
- removed EM/EN DASH conversion from this patchset;
Are you still thinking about doing the
EN DASH --> "--" EM DASH --> "---"
conversion?
Yes, but I intend to submit it as a separate patch series, probably after having this one merged. Let's first clean up the large part of the conversion-generated UTF-8 char noise ;-)
That's not going to change what the documentation will look like in the HTML and PDF output forms, and I think it would make life easier for people who are reading and editing the Documentation/* files in text form.
Agreed. I'm also considering to add a couple of cases of this char:
- U+2026 ('…'): HORIZONTAL ELLIPSIS
As Sphinx also replaces "..." with HORIZONTAL ELLIPSIS.
-
Anyway, I'm opting to submit those separately because it seems that at least some maintainers added EM/EN DASHes intentionally.
So, it may generate case-per-case discussions.
Also, IMO, at least a couple of EN/EM DASH cases would be better served with a single hyphen.
Thanks,
Mauro
On Wed, 2021-05-12 at 17:17 +0200, Mauro Carvalho Chehab wrote:
On Wed, 12 May 2021 10:14:44 -0400, "Theodore Ts'o" tytso@mit.edu wrote:
On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
v2:
- removed EM/EN DASH conversion from this patchset;
Are you still thinking about doing the
EN DASH --> "--" EM DASH --> "---"
conversion?
Yes, but I intend to submit it as a separate patch series, probably after having this one merged. Let's first clean up the large part of the conversion-generated UTF-8 char noise ;-)
That's not going to change what the documentation will look like in the HTML and PDF output forms, and I think it would make life easier for people who are reading and editing the Documentation/* files in text form.
Agreed. I'm also considering to add a couple of cases of this char:
- U+2026 ('…'): HORIZONTAL ELLIPSIS
As Sphinx also replaces "..." with HORIZONTAL ELLIPSIS.
Er, what?
The *only* part of this whole enterprise that actually seemed to make even a tiny bit of sense — rather than seeming like a thinly veiled retrospective excuse for dragging us back in time by 30 years — was the bit about making it easier to grep.
But if I understand you correctly, you're talking about using something like C trigraphs to represent the perfectly reasonable text emdash character ("—") as two hyphen-minuses ("--") in the source code of the documentation? Isn't that going to achieve precisely the *opposite*? If I select some text in the HTML output of the docs and then search for it in the source code, that's going to *stop* it matching my search?
Your title 'Use ASCII subset' is now at least a bit *closer* to describing what the patches are actually doing, but it's still a bit misleading because you're only doing it for *some* characters.
And the wording is still indicative of a fundamentally *misguided* motivation for doing any of this. Your commit comments should be about fixing a specific thing, nothing to do with "use ASCII subset", which is pointless in itself.
On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
Such conversion tools - plus some text editors like LibreOffice or similar - have a set of rules that turn some typed ASCII characters into UTF-8 alternatives, for instance converting straight quotes into curly quotes and adding non-breaking spaces. All of those are meant to produce better results when the text is displayed in HTML or PDF formats.
And don't we render our documentation into HTML or PDF formats? Are some of those non-breaking spaces not actually *useful* for their intended purpose?
While it is perfectly fine to use UTF-8 characters in Linux, and especially in the documentation, it is better to stick to the ASCII subset in this particular case, for a couple of reasons:
- it makes life easier for tools like grep;
Barely, as noted, because of things like line feeds.
- they are easier to edit with some commonly used text/source code editors.
That is nonsense. Any but the most broken and/or anachronistic environments and editors will be just fine.
On Wed, 12 May 2021 18:07:04 +0100, David Woodhouse dwmw2@infradead.org wrote:
On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
Such conversion tools - plus some text editors like LibreOffice or similar - have a set of rules that turn some typed ASCII characters into UTF-8 alternatives, for instance converting straight quotes into curly quotes and adding non-breaking spaces. All of those are meant to produce better results when the text is displayed in HTML or PDF formats.
And don't we render our documentation into HTML or PDF formats?
Yes.
Are some of those non-breaking spaces not actually *useful* for their intended purpose?
No.
The thing is: non-breaking space can cause a lot of problems.
We even had to disable Sphinx usage of non-breaking space for PDF outputs, as this was causing bad LaTeX/PDF outputs.
See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")
The aforementioned patch disables Sphinx's default behavior of using NON-BREAKABLE SPACE on literal blocks and strings, using this special setting: "parsedliteralwraps=true".
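For reference, this kind of override lives in the Sphinx conf.py. A minimal sketch (the kernel's Documentation/conf.py sets more options than just this one):

    # Make the Sphinx LaTeX builder wrap long lines in parsed-literal
    # blocks, instead of padding them with non-breaking spaces.
    latex_elements = {
        'sphinxsetup': 'parsedliteralwraps=true',
    }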
When NON-BREAKABLE SPACE characters were used on PDF outputs, several parts of the media uAPI docs violated the document margins by far, causing text to be truncated.
So, please **don't add NON-BREAKABLE SPACE**, unless you test (and keep testing from time to time) that the output in all formats properly supports it on different Sphinx versions.
-
Also, most of those came from conversion tools, together with other eccentricities, like the usage of the U+FEFF (BOM) character at the start of some documents. The remaining ones seem to have come from cut-and-paste.
For instance, bibliographic references (there are a couple of those on media) sometimes have a NON-BREAKABLE SPACE. I'm pretty sure that those came from cut-and-pasting the document titles from the original PDF documents or web pages that are referenced.
While it is perfectly fine to use UTF-8 characters in Linux, and specially at the documentation, it is better to stick to the ASCII subset on such particular case, due to a couple of reasons:
- it makes life easier for tools like grep;
Barely, as noted, because of things like line feeds.
You can use grep with "-z" to seek for multi-line strings(*), Like:
$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
Documentation/RCU/Design/Data-Structures/Data-Structures.rst
(*) Unfortunately, while "git grep" also has a "-z" flag, it seems that this is (currently?) broken with regard to handling multi-line matches:
$ git grep -Pzl 'grace period started,\s*then'
$
- they are easier to edit with some commonly used text/source code editors.
That is nonsense. Any but the most broken and/or anachronistic environments and editors will be just fine.
Not really.
I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely on the US-intl keyboard settings, which allow me to type "'a" for á. However, there's no shortcut for non-Latin UTF-8 codes, as far as I know.
So, if I needed to type a curly quote in the text editors I normally use for development (vim, nano, kate), I would need to cut-and-paste it from somewhere[1].
[1] If I have a table with UTF-8 codes handy, I could type the UTF-8 number manually... However, it seems that this is currently broken at least on Fedora 33 (with Mate Desktop and US intl keyboard with dead keys).
Here, <CTRL><SHIFT>U is not working. No idea why. I haven't tested it for *years*, as I didn't see any reason why I would need to type UTF-8 characters by number until we started this thread.
In practice, in the very rare cases where I need to write non-Latin UTF-8 chars (maybe once a year or so, like when I need to use a Greek letter or some weird symbol), chances are high that I wouldn't remember its UTF-8 code.
So, if I need to spend time seeking a specific symbol, after finding it, I just cut-and-paste it.
But even in the best case scenario where I know the UTF-8 code and <CTRL><SHIFT>U works, if I wanted to use, for instance, curly quotes, the keystroke sequence would be:
<CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d
That's a lot harder to type, and has a higher chance of mistakenly adding a wrong symbol, than just typing:
"some string"
Knowing that both will produce *exactly* the same output, why should I bother doing it the hard way?
-
Now, I'm not arguing that you can't use whatever UTF-8 symbol you want in your docs. I'm just saying that, now that the conversion is over and a lot of documents ended up getting some UTF-8 characters by accident, it is time for a cleanup.
Thanks,
Mauro
On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
On Wed, 12 May 2021 18:07:04 +0100, David Woodhouse dwmw2@infradead.org wrote:
On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
Such conversion tools - plus some text editors like LibreOffice or similar - have a set of rules that turn some typed ASCII characters into UTF-8 alternatives, for instance converting straight quotes into curly quotes and adding non-breaking spaces. All of those are meant to produce better results when the text is displayed in HTML or PDF formats.
And don't we render our documentation into HTML or PDF formats?
Yes.
Are some of those non-breaking spaces not actually *useful* for their intended purpose?
No.
The thing is: non-breaking space can cause a lot of problems.
We even had to disable Sphinx usage of non-breaking space for PDF outputs, as this was causing bad LaTeX/PDF outputs.
See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")
The aforementioned patch disables Sphinx's default behavior of using NON-BREAKABLE SPACE on literal blocks and strings, using this special setting: "parsedliteralwraps=true".
When NON-BREAKABLE SPACE characters were used on PDF outputs, several parts of the media uAPI docs violated the document margins by far, causing text to be truncated.
So, please **don't add NON-BREAKABLE SPACE**, unless you test (and keep testing from time to time) that the output in all formats properly supports it on different Sphinx versions.
And there you have a specific change with a specific fix. Nothing to do with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to do with the fact that, like *every* character in every kernel file except the *binary* files, it's representable in UTF-8.
By all means fix the specific characters which are typographically wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering the documentation.
Also, most of those came from conversion tools, together with other eccentricities, like the usage of the U+FEFF (BOM) character at the start of some documents. The remaining ones seem to have come from cut-and-paste.
... or which are just entirely redundant and gratuitous, like a BOM in an environment where all files are UTF-8 and never 16-bit encodings anyway.
While it is perfectly fine to use UTF-8 characters in Linux, and especially in the documentation, it is better to stick to the ASCII subset in this particular case, for a couple of reasons:
- it makes life easier for tools like grep;
Barely, as noted, because of things like line feeds.
You can use grep with "-z" to search for multi-line strings(*), like:
$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
Documentation/RCU/Design/Data-Structures/Data-Structures.rst
Yeah, right. That works if you don't just use the text that you'll have seen in the HTML/PDF "grace period started, then", and if you instead craft a *regex* for it, replacing the spaces with '\s*'. Or is that [[:space:]]* if you don't want to use the experimental Perl regex feature?
$ grep -zlr 'grace[[:space:]]+period[[:space:]]+started,[[:space:]]+then' Documentation/RCU
Documentation/RCU/Design/Data-Structures/Data-Structures.rst
And without '-l' it'll obviously just give you the whole file. No '-A5 -B5' to see the surroundings... it's hardly a useful thing, is it?
(*) Unfortunately, while "git grep" also has a "-z" flag, it seems that this is (currently?) broken with regard to handling multi-line matches:
$ git grep -Pzl 'grace period started,\s*then'
$
Even better. So no, multiline grep isn't really a commonly usable feature at all.
This is why we prefer to put user-visible strings on one line in C source code, even if it takes the lines over 80 characters — to allow for grep to find them.
- they are easier to edit with some commonly used text/source code editors.
That is nonsense. Any but the most broken and/or anachronistic environments and editors will be just fine.
Not really.
I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely on the US-intl keyboard settings, which allow me to type "'a" for á. However, there's no shortcut for non-Latin UTF-8 codes, as far as I know.
So, if I needed to type a curly quote in the text editors I normally use for development (vim, nano, kate), I would need to cut-and-paste it from somewhere[1].
That's entirely irrelevant. You don't need to be able to *type* every character that you see in front of you, as long as your editor will render it correctly and perhaps let you cut/paste it as you're editing the document if you're moving things around.
[1] If I have a table with UTF-8 codes handy, I could type the UTF-8 number manually... However, it seems that this is currently broken at least on Fedora 33 (with Mate Desktop and US intl keyboard with dead keys).
Here, <CTRL><SHIFT>U is not working. No idea why. I haven't tested it for *years*, as I didn't see any reason why I would need to type UTF-8 characters by number until we started this thread.
Please provide the bug number for this; I'd like to track it.
But even in the best case scenario where I know the UTF-8 code and <CTRL><SHIFT>U works, if I wanted to use, for instance, curly quotes, the keystroke sequence would be:
<CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d
That's a lot harder to type, and has a higher chance of mistakenly adding a wrong symbol, than just typing:
"some string"
Knowing that both will produce *exactly* the same output, why should I bother doing it the hard way?
Nobody's asked you to do it the "hard way". That's completely irrelevant to the discussion we were having.
Now, I'm not arguing that you can't use whatever UTF-8 symbol you want in your docs. I'm just saying that, now that the conversion is over and a lot of documents ended up getting some UTF-8 characters by accident, it is time for a cleanup.
All text documents are *full* of UTF-8 characters. If there is a file in the source code which has *any* non-UTF8, we call that a 'binary file'.
Again, if you want to make specific fixes like removing non-breaking spaces and byte order marks, with specific reasons, then those make sense. But it's got very little to do with UTF-8 and how easy it is to type them. And the excuse you've put in the commit comment for your patches is utterly bogus.
On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely on the US-intl keyboard settings, which allow me to type "'a" for á. However, there's no shortcut for non-Latin UTF-8 codes, as far as I know.
So, if I needed to type a curly quote in the text editors I normally use for development (vim, nano, kate), I would need to cut-and-paste it from somewhere
For anyone who doesn't know about it: X has this wonderful thing called the Compose key[1]. For instance, type ⎄--- to get —, or ⎄<" for “. Much more mnemonic than Unicode codepoints; and you can extend it with user-defined sequences in your ~/.XCompose file. (I assume Wayland supports all this too, but don't know the details.)
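For example, user-defined entries in ~/.XCompose look like this (the two sequences below are made-up mnemonics, not the system defaults):

    include "%L"   # keep the locale's default compose table
    <Multi_key> <m> <d> : "—" U2014  # EM DASH
    <Multi_key> <e> <l> : "…" U2026  # HORIZONTAL ELLIPSIS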
On 14/05/2021 10:06, David Woodhouse wrote:
Again, if you want to make specific fixes like removing non-breaking spaces and byte order marks, with specific reasons, then those make sense. But it's got very little to do with UTF-8 and how easy it is to type them. And the excuse you've put in the commit comment for your patches is utterly bogus.
+1
-ed
On Fri, 14 May 2021 12:08:36 +0100, Edward Cree ecree.xilinx@gmail.com wrote:
For anyone who doesn't know about it: X has this wonderful thing called the Compose key[1]. For instance, type ⎄--- to get —, or ⎄<" for “. Much more mnemonic than Unicode codepoints; and you can extend it with user-defined sequences in your ~/.XCompose file.
Good tip. I haven't used Compose for years, as US-intl with dead keys is enough for 99.999% of my needs.
Btw, at least on Fedora with Mate, Compose is disabled by default. It has to be enabled first using the same tool that allows changing the keyboard layout[1].
Yet, typing an EN DASH, for example, would be "<compose>--.", which is 4 keystrokes instead of just two ('--'). It means twice the effort ;-)
[1] KDE, GNOME, Mate, ... have different ways to enable it and to select which key is treated as <compose>:
https://dry.sailingissues.com/us-international-keyboard-layout.html https://help.ubuntu.com/community/ComposeKey
Thanks,
Mauro
On Fri, 14 May 2021 10:06:01 +0100, David Woodhouse dwmw2@infradead.org wrote:
On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
On Wed, 12 May 2021 18:07:04 +0100, David Woodhouse dwmw2@infradead.org wrote:
On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
Such conversion tools - plus some text editors like LibreOffice or similar - have a set of rules that turn some typed ASCII characters into UTF-8 alternatives, for instance converting straight quotes into curly quotes and adding non-breaking spaces. All of those are meant to produce better results when the text is displayed in HTML or PDF formats.
And don't we render our documentation into HTML or PDF formats?
Yes.
Are some of those non-breaking spaces not actually *useful* for their intended purpose?
No.
The thing is: non-breaking space can cause a lot of problems.
We even had to disable Sphinx usage of non-breaking space for PDF outputs, as this was causing bad LaTeX/PDF outputs.
See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")
The aforementioned patch disables Sphinx's default behavior of using NON-BREAKABLE SPACE on literal blocks and strings, using this special setting: "parsedliteralwraps=true".
When NON-BREAKABLE SPACE characters were used on PDF outputs, several parts of the media uAPI docs violated the document margins by far, causing text to be truncated.
So, please **don't add NON-BREAKABLE SPACE**, unless you test (and keep testing from time to time) that the output in all formats properly supports it on different Sphinx versions.
And there you have a specific change with a specific fix. Nothing to do with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to do with the fact that, like *every* character in every kernel file except the *binary* files, it's representable in UTF-8.
By all means fix the specific characters which are typographically wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering the documentation.
Also, most of those came from conversion tools, together with other eccentricities, like the usage of the U+FEFF (BOM) character at the start of some documents. The remaining ones seem to have come from cut-and-paste.
... or which are just entirely redundant and gratuitous, like a BOM in an environment where all files are UTF-8 and never 16-bit encodings anyway.
Agreed.
While it is perfectly fine to use UTF-8 characters in Linux, and especially in the documentation, it is better to stick to the ASCII subset in this particular case, for a couple of reasons:
- it makes life easier for tools like grep;
Barely, as noted, because of things like line feeds.
You can use grep with "-z" to search for multi-line strings(*), like:
$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
Documentation/RCU/Design/Data-Structures/Data-Structures.rst
Yeah, right. That works if you don't just use the text that you'll have seen in the HTML/PDF "grace period started, then", and if you instead craft a *regex* for it, replacing the spaces with '\s*'. Or is that [[:space:]]* if you don't want to use the experimental Perl regex feature?
$ grep -zlr 'grace[[:space:]]+period[[:space:]]+started,[[:space:]]+then' Documentation/RCU
Documentation/RCU/Design/Data-Structures/Data-Structures.rst
And without '-l' it'll obviously just give you the whole file. No '-A5 -B5' to see the surroundings... it's hardly a useful thing, is it?
(*) Unfortunately, while "git grep" also has a "-z" flag, it seems that this is (currently?) broken with regard to handling multi-line matches:
$ git grep -Pzl 'grace period started,\s*then'
$
Even better. So no, multiline grep isn't really a commonly usable feature at all.
This is why we prefer to put user-visible strings on one line in C source code, even if it takes the lines over 80 characters — to allow for grep to find them.
Makes sense, but in case of documentation, this is a little more complex than that.
Btw, the theme used when building html by default[1] has a search box (written in JavaScript) that may be able to find multi-line patterns, working somewhat similarly to "git grep foo -a bar".
[1] https://github.com/readthedocs/sphinx_rtd_theme
[1] If I have a table with UTF-8 codes handy, I could type the UTF-8 number manually... However, it seems that this is currently broken at least on Fedora 33 (with Mate Desktop and US intl keyboard with dead keys).
Here, <CTRL><SHIFT>U is not working. No idea why. I haven't tested it for *years*, as I didn't see any reason why I would need to type UTF-8 characters by number until we started this thread.
Please provide the bug number for this; I'd like to track it.
Just opened a BZ and added you as c/c.
Now, I'm not arguing that you can't use whatever UTF-8 symbol you want on your docs. I'm just saying that, now that the conversion is over and a lot of documents ended getting some UTF-8 characters by accident, it is time for a cleanup.
All text documents are *full* of UTF-8 characters. If there is a file in the source code which has *any* non-UTF8, we call that a 'binary file'.
Again, if you want to make specific fixes like removing non-breaking spaces and byte order marks, with specific reasons, then those make sense. But it's got very little to do with UTF-8 and how easy it is to type them. And the excuse you've put in the commit comment for your patches is utterly bogus.
Let's take one step back, in order to return to the intent of this UTF-8 work, as the discussions here are not centered on the patches, but instead on what to do and why.
-
This discussion started originally at linux-doc ML.
While discussing an issue where a machine's locale was not set to UTF-8 on a build VM, we discovered that some converted docs ended up with BOM characters. Those specific changes were introduced by some of my conversion patches, probably converted via pandoc.
So, I went ahead in order to check what other possible weird things were introduced by the conversion, where several scripts and tools were used on files that already had a different markup.
I actually checked the current UTF-8 issues, and asked people at linux-doc to comment on which of those are valid use cases, and which should be replaced by plain ASCII.
Basically, this is the current situation (at docs/docs-next) for the ReST files under Documentation/, excluding translations:
1. Spaces and BOM
- U+00a0 (' '): NO-BREAK SPACE
- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
Based on the discussions there and on this thread, those should be dropped, as BOM is useless and NO-BREAK SPACE can cause problems in the html/pdf output;
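Files still containing either of them can be listed with, e.g.:

    $ grep -rlP '\x{00a0}|\x{feff}' Documentation/

(assuming a PCRE-enabled GNU grep and a UTF-8 locale).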
2. Symbols
- U+00a9 ('©'): COPYRIGHT SIGN
- U+00ac ('¬'): NOT SIGN
- U+00ae ('®'): REGISTERED SIGN
- U+00b0 ('°'): DEGREE SIGN
- U+00b1 ('±'): PLUS-MINUS SIGN
- U+00b2 ('²'): SUPERSCRIPT TWO
- U+00b5 ('µ'): MICRO SIGN
- U+03bc ('μ'): GREEK SMALL LETTER MU
- U+00b7 ('·'): MIDDLE DOT
- U+00bd ('½'): VULGAR FRACTION ONE HALF
- U+2122 ('™'): TRADE MARK SIGN
- U+2264 ('≤'): LESS-THAN OR EQUAL TO
- U+2265 ('≥'): GREATER-THAN OR EQUAL TO
- U+2b0d ('⬍'): UP DOWN BLACK ARROW
Those seem OK on my eyes.
On a side note, both MICRO SIGN and GREEK SMALL LETTER MU are used in several docs to represent microseconds, microvolts and microamperes. If we write an orientation document, it probably makes sense to recommend using MICRO SIGN in such cases.
3. Latin
- U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
- U+00df ('ß'): LATIN SMALL LETTER SHARP S
- U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
- U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
- U+00e6 ('æ'): LATIN SMALL LETTER AE
- U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
- U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
- U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
- U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
- U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
- U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
- U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
- U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
- U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
- U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
- U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
- U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
- U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE
Those should be kept as well, as they're used for non-English names.
4. arrows and box drawing symbols:

- U+2191 ('↑'): UPWARDS ARROW
- U+2192 ('→'): RIGHTWARDS ARROW
- U+2193 ('↓'): DOWNWARDS ARROW
- U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
- U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
- U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
- U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT
Also should be kept.
In summary, based on the discussions we have so far, I suspect that there's not much to be discussed for the above cases.
So, I'll post a v3 of this series, changing only:
- U+00a0 (' '): NO-BREAK SPACE
- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
---
Now, this specific patch series also addresses this extra case:
5. curly quotes:
- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
IMO, those should be replaced by plain ASCII quotes: ' and ".
The rationale is simple:
- most were introduced during the conversion from DocBook, markdown
  and LaTeX;
- they don't add any extra value, as using "foo" or “foo” means the
  same thing;
- Sphinx already uses "fancy" quotes at the output.
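Just to illustrate the kind of replacement (not the actual script used for this series; the file name is a placeholder):

    $ perl -CSD -i -pe "s/[\x{201c}\x{201d}]/\"/g; s/[\x{2018}\x{2019}]/'/g" Documentation/foo.rst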
I guess I will put this on a separate series, as this is not a bug fix, but just a cleanup from the conversion work.
I'll re-post those cleanups on a separate series, for patch per patch review.
---
The remaining cases are future work, outside the scope of this v2:
6. Hyphen/Dashes and ellipsis
- U+2212 ('−'): MINUS SIGN
- U+00ad (''): SOFT HYPHEN
- U+2010 ('‐'): HYPHEN
Those three are used in places where a normal ASCII hyphen/minus should be used instead. There are even a couple of C files which use them instead of '-' in comments.
IMO, those are fixes/cleanups from conversions and bad cut-and-paste.
- U+2013 ('–'): EN DASH
- U+2014 ('—'): EM DASH
- U+2026 ('…'): HORIZONTAL ELLIPSIS
Those are auto-replaced by Sphinx from "--", "---" and "...", respectively.
I guess those are a matter of personal preference about whether to use ASCII or UTF-8.
My personal preference (and Ted seems to have a similar opinion) is to let Sphinx do the conversion.
For those, I intend to post a separate series, to be reviewed patch per patch, as this is really a matter of personal taste. We'll hardly reach a consensus here.
7. math symbols:
- U+00d7 ('×'): MULTIPLICATION SIGN
This one is used mostly to describe video resolutions, but this is a smaller changeset than the ones that use the "x" letter.
- U+2217 ('∗'): ASTERISK OPERATOR
This is used only here:

    Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.
Probably added by some conversion tool. IMO, this one should also be replaced by an ASCII asterisk.
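The fix there could be a trivial one-liner, e.g. (assuming a UTF-8 locale):

    $ sed -i 's/∗/*/g' Documentation/filesystems/ext4/blockgroup.rst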
I guess I'll post a patch for the ASTERISK OPERATOR.

Thanks,
Mauro
On Sat, 2021-05-15 at 10:22 +0200, Mauro Carvalho Chehab wrote:
Here, <CTRL><SHIFT>U is not working. No idea why. I haven't tested it for *years*, as I didn't see any reason why I would need to type UTF-8 characters by number until we started this thread.
Please provide the bug number for this; I'd like to track it.
Just opened a BZ and added you as c/c.
Thanks.
Let's take one step back, in order to return to the intent of this UTF-8 work, as the discussions here are not centered on the patches, but instead on what to do and why.
This discussion started originally at linux-doc ML.
While discussing an issue where a machine's locale was not set to UTF-8 on a build VM,
Stop. Stop *right* there before you go any further.
The machine's locale should have *nothing* to do with anything.
When you view this email, it comes with a Content-Type: header which explicitly tells you the character set that the message is encoded in, which I think I've set to UTF-7.
When showing you the mail, your system has to interpret the bytes of the content using *that* character set encoding. Anything else is just fundamentally broken. Your system locale has *nothing* to do with it.
If your local system is running EBCDIC that doesn't *matter*.
Now, the character set encoding of the kernel source and documentation text files is UTF-8. It isn't EBCDIC, it isn't ISO8859-15 or any of the legacy crap. It isn't system locale either, unless your system locale *happens* to be UTF-8.
UTF-8 *happens* to be compatible with ASCII for the limited subset of characters which ASCII contains, sure — just as *many*, but not all, of the legacy 8-bit character sets are also a superset of ASCII's 7 bits.
But if the docs contain *any* characters which aren't ASCII, and you build them with a broken build system which assumes ASCII, you are going to produce wrong output. There is *no* substitute for fixing the *actual* bug which started all this, and ensuring your build system (or whatever) uses the *actual* encoding of the text files it's processing, instead of making stupid and bogus assumptions based on a system default.
You concede keeping U+00a9 © COPYRIGHT SIGN. And that's encoded in UTF-8 as two bytes 0xC2 0xA9. If some broken build system *assumes* those bytes are ISO8859-15, it'll take them to mean two separate characters:
    U+00C2 Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    U+00A9 © COPYRIGHT SIGN
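You can see exactly that mis-decoding with iconv, for instance:

    $ printf '\xc2\xa9' | iconv -f ISO8859-15 -t UTF-8
    Â©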
Your broken build system that started all this is never going to be *anything* other than broken. You can only paper over the cracks and make it slightly less likely that people will notice in the common case, perhaps? That's all you do by *reducing* the use of non-ASCII, unless you're going to drag us all the way back to the 1980s and strictly limit us to pure ASCII, using the equivalent of trigraphs for *anything* outside the 0-127 character ranges.
And even if you did that, systems which use EBCDIC as their local encoding would *still* be broken, if they have the same bug you started from. Because EBCDIC isn't compatible with ASCII *even* for the first 7 bits.
we discovered that some converted docs ended up with BOM characters. Those specific changes were introduced by some of my conversion patches, probably converted via pandoc.
So, I went ahead in order to check what other possible weird things were introduced by the conversion, where several scripts and tools were used on files that already had a different markup.
I actually checked the current UTF-8 issues, and asked people at linux-doc to comment on which of those are valid use cases, and which should be replaced by plain ASCII.
No, these aren't "UTF-8 issues". Those are *conversion* issues, and would still be there if the output of the conversion had been UTF-7, UCS-16, etc. Or *even* if the output of the conversion had been trigraph-like stuff like '--' for emdash. It's *nothing* to do with the encoding that we happen to be using.
Fixing the conversion issues makes a lot of sense. Try to do it without making *any* mention of UTF-8 at all.
In summary, based on the discussions we have so far, I suspect that there's not much to be discussed for the above cases.
So, I'll post a v3 of this series, changing only:
- U+00a0 (' '): NO-BREAK SPACE
- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
Ack, as long as those make *no* mention of UTF-8. Except perhaps to note that BOM is redundant because UTF-8 doesn't have a byteorder.
Now, this specific patch series also addresses this extra case:
curly quotes:
- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
IMO, those should be replaced by plain ASCII quotes: ' and ".
The rationale is simple:
- most were introduced during the conversion from DocBook, markdown and LaTeX;
- they don't add any extra value, as using "foo" or “foo” means the same thing;
- Sphinx already uses "fancy" quotes at the output.
I guess I will put this on a separate series, as this is not a bug fix, but just a cleanup from the conversion work.
I'll re-post those cleanups on a separate series, for patch per patch review.
Makes sense.
The left/right quotation marks exist to make human-readable text much easier to read, but the key point here is that they are redundant because the tooling already emits them in the *output*, so they don't need to be in the source, yes?
As long as the tooling gets it *right* and uses them where it should, that seems sane enough.
However, it *does* break 'grep', because if I cut/paste a snippet from the documentation and try to grep for it, it'll no longer match.
Consistency is good, but perhaps we should actually be consistent the other way round and always use the left/right versions in the source *instead* of relying on the tooling, to make searches work better? You claimed to care about that, right?
The remaining cases are future work, outside the scope of this v2:
Hyphen/Dashes and ellipsis
- U+2212 ('−'): MINUS SIGN
- U+00ad (''): SOFT HYPHEN
- U+2010 ('‐'): HYPHEN

Those three are used in places where a normal ASCII hyphen/minus
should be used instead. There are even a couple of C files which use
them instead of '-' in comments.

IMO, those are fixes/cleanups from conversions and bad cut-and-paste.
That seems to make sense.
- U+2013 ('–'): EN DASH
- U+2014 ('—'): EM DASH
- U+2026 ('…'): HORIZONTAL ELLIPSIS

Those are auto-replaced by Sphinx from "--", "---" and "...",
respectively.

I guess those are a matter of personal preference about whether to
use ASCII or UTF-8.

My personal preference (and Ted seems to have a similar opinion) is
to let Sphinx do the conversion.

For those, I intend to post a separate series, to be reviewed patch
per patch, as this is really a matter of personal taste. We'll hardly
reach a consensus here.
Again using the trigraph-like '--' and '...' instead of just using the plain text '—' and '…' breaks searching, because what's in the output doesn't match the input. Again consistency is good, but perhaps we should standardise on just putting these in their plain text form instead of the trigraphs?
math symbols:
- U+00d7 ('×'): MULTIPLICATION SIGN

This one is used mostly to describe video resolutions, but this is a
smaller changeset than the ones that use the "x" letter.
I think standardising on × for video resolutions in documentation would make it look better and be easier to read.
- U+2217 ('∗'): ASTERISK OPERATOR

This is used only here:

    Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.

Probably added by some conversion tool. IMO, this one should also be
replaced by an ASCII asterisk.
I guess I'll post a patch for the ASTERISK OPERATOR.
That makes sense.
On Sat, 15 May 2021 10:24:28 +0100, David Woodhouse dwmw2@infradead.org wrote:
On Sat, 2021-05-15 at 10:22 +0200, Mauro Carvalho Chehab wrote:
Here, <CTRL><SHIFT>U is not working. No idea why. I haven't tested it for *years*, as I didn't see any reason why I would need to type UTF-8 characters by number until we started this thread.
Please provide the bug number for this; I'd like to track it.
Just opened a BZ and added you as c/c.
Thanks.
Let's take one step back, in order to return to the intent of this UTF-8 work, as the discussions here are not centered on the patches, but instead on what to do and why.
This discussion started originally at linux-doc ML.
While discussing an issue where a machine's locale was not set to UTF-8 on a build VM,
Stop. Stop *right* there before you go any further.
The machine's locale should have *nothing* to do with anything.
When you view this email, it comes with a Content-Type: header which explicitly tells you the character set that the message is encoded in, which I think I've set to UTF-7.
When showing you the mail, your system has to interpret the bytes of the content using *that* character set encoding. Anything else is just fundamentally broken. Your system locale has *nothing* to do with it.
If your local system is running EBCDIC that doesn't *matter*.
Now, the character set encoding of the kernel source and documentation text files is UTF-8. It isn't EBCDIC, it isn't ISO8859-15 or any of the legacy crap. It isn't system locale either, unless your system locale *happens* to be UTF-8.
UTF-8 *happens* to be compatible with ASCII for the limited subset of characters which ASCII contains, sure — just as *many*, but not all, of the legacy 8-bit character sets are also a superset of ASCII's 7 bits.
But if the docs contain *any* characters which aren't ASCII, and you build them with a broken build system which assumes ASCII, you are going to produce wrong output. There is *no* substitute for fixing the *actual* bug which started all this, and ensuring your build system (or whatever) uses the *actual* encoding of the text files it's processing, instead of making stupid and bogus assumptions based on a system default.
You concede keeping U+00a9 © COPYRIGHT SIGN. And that's encoded in UTF-8 as two bytes 0xC2 0xA9. If some broken build system *assumes* those bytes are ISO8859-15 it'll take them to mean two separate characters:
U+00C2 Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
U+00A9 © COPYRIGHT SIGN
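A minimal Python sketch of that failure mode:

    # The UTF-8 encoding of U+00A9 COPYRIGHT SIGN is the two bytes 0xC2 0xA9.
    data = "\u00a9".encode("utf-8")
    print(data.hex())                 # c2a9

    # A build system that wrongly assumes ISO8859-15 sees two characters:
    print(data.decode("iso8859-15"))  # Â© - the classic mojibake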
Your broken build system that started all this is never going to be *anything* other than broken. You can only paper over the cracks and make it slightly less likely that people will notice in the common case, perhaps? That's all you do by *reducing* the use of non-ASCII, unless you're going to drag us all the way back to the 1980s and strictly limit us to pure ASCII, using the equivalent of trigraphs for *anything* outside the 0-127 character range.
And even if you did that, systems which use EBCDIC as their local encoding would *still* be broken, if they have the same bug you started from. Because EBCDIC isn't compatible with ASCII *even* for the first 7 bits.
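That incompatibility is easy to check with Python's cp037 codec (one common EBCDIC code page); even unaccented letters land on different byte values:

    # ASCII and EBCDIC disagree even in the plain-letter range:
    print("A".encode("ascii").hex())  # 41
    print("A".encode("cp037").hex())  # c1 - EBCDIC (US/Canada) code page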
Now, you're making a lot of wrong assumptions here ;-)
1. I didn't report the bug. Another person reported it at linux-doc;
2. I fully agree with you that the building system should work fine whatever locale the machine has;
3. Sphinx's supported charset for the ReST input and its output is UTF-8.
Despite that, it seems that there are some issues in the building tool set, at least under certain circumstances. One of the hypotheses mentioned there is that the Sphinx logger crashes when it tries to print a UTF-8 message while the machine's locale is not UTF-8.
That said, I tried forcing a non-UTF-8 locale on some tests I did to try to reproduce it, but the build went fine.
So, I was not able to reproduce the issue.
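For what it's worth, one way such a logger crash could be provoked, assuming it writes non-ASCII text to a stream whose encoding follows the locale (a hypothetical sketch, not the actual bug; note that Python >= 3.7 coerces the plain C locale to UTF-8 per PEP 538, which may be exactly why the reproduction attempt failed):

    # Run as: PYTHONCOERCECLOCALE=0 LC_ALL=C python3 repro.py
    # (without PYTHONCOERCECLOCALE=0, the C locale is silently coerced
    # to UTF-8 and the failure disappears)
    import sys

    print("stdout encoding:", sys.stdout.encoding)  # ANSI_X3.4-1968 (ASCII)
    print("\u00d7")  # MULTIPLICATION SIGN -> UnicodeEncodeError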
This series doesn't address the issue. It is just a side effect of the discussions where, while trying to understand the bug, we noticed several UTF-8 characters introduced during the conversion that weren't the original author's intent.
So, with regards to the original bug report, if I find a way to reproduce and address it, I'll post a separate series.
If you want to discuss this issue further, let's not discuss here, but instead, at the linux-doc thread:
https://lore.kernel.org/linux-doc/20210506103913.GE6564@kitsune.suse.cz/
we discovered that some converted docs ended up with BOM characters. Those specific changes were introduced by some of my conversion patches, probably converted via pandoc.
So, I went ahead to check what other weird things had been introduced by the conversion, where several scripts and tools were used on files that already had a different markup.
I actually checked the current UTF-8 issues, and asked people at linux-doc to comment on which of those are valid use cases, and which should be replaced by plain ASCII.
No, these aren't "UTF-8 issues". Those are *conversion* issues, and would still be there if the output of the conversion had been UTF-7, UTF-16, etc. Or *even* if the output of the conversion had been trigraph-like stuff like '--' for em dash. It's *nothing* to do with the encoding that we happen to be using.
Yes. That's what I said.
Fixing the conversion issues makes a lot of sense. Try to do it without making *any* mention of UTF-8 at all.
In summary, based on the discussions we have so far, I suspect that there's not much to be discussed for the above cases.
So, I'll post a v3 of this series, changing only:
- U+00a0 (' '): NO-BREAK SPACE
- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
Ack, as long as those make *no* mention of UTF-8. Except perhaps to note that BOM is redundant because UTF-8 doesn't have a byteorder.
I need to say which UTF-8 code points are replaced, as otherwise the patch wouldn't make much sense to reviewers: both U+00a0 and an ordinary space are displayed the same way, and the BOM is invisible.
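Since both characters are effectively invisible in a diff, a small scanner can help reviewers spot them (an illustrative sketch; find_invisible.py is a hypothetical name):

    # find_invisible.py - report NO-BREAK SPACE and BOM occurrences
    import sys

    SUSPECTS = {
        "\u00a0": "NO-BREAK SPACE",
        "\ufeff": "ZERO WIDTH NO-BREAK SPACE (BOM)",
    }

    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, 1):
                for ch, name in SUSPECTS.items():
                    if ch in line:
                        print(f"{path}:{lineno}: U+{ord(ch):04X} {name}")

Usage would be along the lines of: python3 find_invisible.py Documentation/admin-guide/*.rst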
Now, this specific patch series also addresses this extra case:
curly quotes:
- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
IMO, those should be replaced by ASCII quotes: ' and ".
The rationale is simple:
- most were introduced during the conversion from DocBook, Markdown and LaTeX;
- they don't add any extra value, as using "foo" or “foo” means the same thing;
- Sphinx already uses "fancy" quotes in the output (see the sketch below).
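For reference, that output-side behaviour is controlled by Sphinx's smartquotes options in conf.py (the values below are the documented defaults, shown for illustration, not a proposed change):

    # conf.py - the SmartQuotes transform, inherited from docutils,
    # rewrites ASCII punctuation in the rendered output:
    #   '...' / "..." -> curly quotes
    #   --, ---       -> en dash, em dash
    #   ...           -> horizontal ellipsis
    smartquotes = True          # enable the transform (the default)
    smartquotes_action = "qDe"  # q = quotes, D = dashes, e = ellipses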
I guess I will put this on a separate series, as this is not a bug fix, but just a cleanup from the conversion work.
I'll re-post those cleanups on a separate series, for patch-by-patch review.
Makes sense.
The left/right quotation marks exist to make human-readable text much easier to read, but the key point here is that they are redundant because the tooling already emits them in the *output*, so they don't need to be in the source, yes?
Yes.
As long as the tooling gets it *right* and uses them where it should, that seems sane enough.
However, it *does* break 'grep', because if I cut/paste a snippet from the documentation and try to grep for it, it'll no longer match.
Consistency is good, but perhaps we should actually be consistent the other way round and always use the left/right versions in the source *instead* of relying on the tooling, to make searches work better? You claimed to care about that, right?
That's indeed a good point. It would be interesting to have more opinions on that matter.
There are a few things to consider:
1. It is (usually) trivial to discover what document produced a certain page in the documentation.
For instance, if you want to know where the text under this file came from, or to grep a text from it:
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
You can click on the "View page source" button on the first line. It will show the .rst file used to produce it:
https://www.kernel.org/doc/html/latest/_sources/admin-guide/cgroup-v2.rst.tx...
2. If all you want is to search for a text inside the docs, you can click on the "Search docs" box, which is part of the Read the Docs theme.
3. The kernel has several extensions for Sphinx, in order to make life easier for kernel developers:
Documentation/sphinx/automarkup.py
Documentation/sphinx/cdomain.py
Documentation/sphinx/kernel_abi.py
Documentation/sphinx/kernel_feat.py
Documentation/sphinx/kernel_include.py
Documentation/sphinx/kerneldoc.py
Documentation/sphinx/kernellog.py
Documentation/sphinx/kfigure.py
Documentation/sphinx/load_config.py
Documentation/sphinx/maintainers_include.py
Documentation/sphinx/rstFlatTable.py
Those (in particular automarkup and kerneldoc) will also dynamically change things during ReST conversion, which may cause grep to not work.
4. Some PDF tools like evince will match curly quotes if you type an ASCII quote in their search boxes.
5. Some developers prefer to only deal with the files inside the kernel tree. Those are very unlikely to grep with curly quotes.
My opinion on that matter is that we should make life easier for developers to grep text files, as the ones using the web interface are already served by the search box for the HTML output, or by tools like evince.
So, my vote here is to keep quotation marks as plain ASCII.
The remaining cases are future work, outside the scope of this v2:
Hyphen/Dashes and ellipsis
- U+2212 ('−'): MINUS SIGN
- U+00ad (''): SOFT HYPHEN
- U+2010 ('‐'): HYPHEN

Those three are used in places where a normal ASCII hyphen/minus should be used instead. There are even a couple of C files which use them instead of '-' in comments. IMO these are fixes/cleanups from conversions and bad cut-and-paste.
That seems to make sense.
- U+2013 ('–'): EN DASH
- U+2014 ('—'): EM DASH
- U+2026 ('…'): HORIZONTAL ELLIPSIS

Those are auto-replaced by Sphinx from "--", "---" and "...", respectively. I guess those are a matter of personal preference about whether to use ASCII or UTF-8. My personal preference (and Ted seems to have a similar opinion) is to let Sphinx do the conversion. For those, I intend to post a separate series, to be reviewed patch by patch, as this is really a matter of personal taste. We'll hardly reach a consensus here.
Again using the trigraph-like '--' and '...' instead of just using the plain text '—' and '…' breaks searching, because what's in the output doesn't match the input. Again consistency is good, but perhaps we should standardise on just putting these in their plain text form instead of the trigraphs?
Good point.
While I don't have any strong preferences here, there's something that annoys me with regards to EM/EN DASH:
With the monospaced fonts I'm using here - both in my e-mail client and in my terminals - EM and EN DASH are displayed *exactly* the same.
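When the glyphs render identically, the code points still tell them apart; a tiny Python check:

    # Distinguish visually identical dash characters by name:
    import unicodedata

    for ch in "\u2013\u2014-":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+2013 EN DASH
    # U+2014 EM DASH
    # U+002D HYPHEN-MINUS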
math symbols:
- U+00d7 ('×'): MULTIPLICATION SIGN

This one is mostly used to describe video resolutions, but it appears in a smaller changeset than the ones that use the letter "x".
I think standardising on × for video resolutions in documentation would make it look better and be easier to read.
- U+2217 ('∗'): ASTERISK OPERATOR

This is used only here:

Documentation/filesystems/ext4/blockgroup.rst: filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.

Probably added by some conversion tool. IMO, this one should also be replaced by an ASCII asterisk.
I guess I'll post a patch for the ASTERISK OPERATOR.
That makes sense.
Thanks, Mauro
On Sat, 2021-05-15 at 13:23 +0200, Mauro Carvalho Chehab wrote:
On Sat, 15 May 2021 10:24:28 +0100, David Woodhouse dwmw2@infradead.org wrote:
Let's take one step back, in order to return to the intent of this UTF-8 discussion, as the discussions here are not centered on the patches, but instead on what to do and why.
This discussion started originally at linux-doc ML.
While discussing an issue where a machine's locale was not set to UTF-8 on a build VM,
Stop. Stop *right* there before you go any further.
The machine's locale should have *nothing* to do with anything.
Now, you're making a lot of wrong assumptions here ;-)
- I didn't report the bug. Another person reported it at linux-doc;
- I fully agree with you that the building system should work fine whatever locale the machine has;
- Sphinx's supported charset for the ReST input and its output is UTF-8.
OK, fine. So that's an unrelated issue really, and just happened to be what historically triggered the discussion. Let's set it aside.
I actually checked the current UTF-8 issues …
No, these aren't "UTF-8 issues". Those are *conversion* issues, and … *nothing* to do with the encoding that we happen to be using.
Yes. That's what I said.
Er… I'm fairly sure you *did* call them "UTF-8 issues". Whatever.
Fixing the conversion issues makes a lot of sense. Try to do it without making *any* mention of UTF-8 at all.
In summary, based on the discussions we have so far, I suspect that there's not much to be discussed for the above cases.
So, I'll post a v3 of this series, changing only:
- U+00a0 (' '): NO-BREAK SPACE
- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
Ack, as long as those make *no* mention of UTF-8. Except perhaps to note that BOM is redundant because UTF-8 doesn't have a byteorder.
I need to say which UTF-8 code points are replaced, as otherwise the patch wouldn't make much sense to reviewers: both U+00a0 and an ordinary space are displayed the same way, and the BOM is invisible.
No. Again, this is *nothing* to do with UTF-8. The encoding we choose to map between bytes in the file and characters is *utterly* irrelevant here. If we were using UTF-7, UTF-16, or even (in the case of non-breaking space) one of the legacy 8-bit charsets that includes it, like ISO8859-1, the issue would be precisely the same.
It's about the *character* U+00A0 NO-BREAK SPACE; nothing to do with UTF-8 at all. Don't mention UTF-8. It's *irrelevant* and just shows that you can't actually be bothered to stop and do any critical thinking about the matter at all.
As I said, the only time that it makes sense to mention UTF-8 in this context is when talking about *why* the BOM is not needed. And even then, you could say "because we *aren't* using an encoding where endianness matters, such as UTF-16", instead of actually mentioning UTF-8. Try it ☺
Now, this specific patch series also addresses this extra case:
curly quotes:
- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
IMO, those should be replaced by ASCII quotes: ' and ".
The rationale is simple:
- most were introduced during the conversion from DocBook, Markdown and LaTeX;
- they don't add any extra value, as using "foo" or “foo” means the same thing;
- Sphinx already uses "fancy" quotes in the output.
I guess I will put this on a separate series, as this is not a bug fix, but just a cleanup from the conversion work.
I'll re-post those cleanups on a separate series, for patch-by-patch review.
Makes sense.
The left/right quotation marks exist to make human-readable text much easier to read, but the key point here is that they are redundant because the tooling already emits them in the *output*, so they don't need to be in the source, yes?
Yes.
As long as the tooling gets it *right* and uses them where it should, that seems sane enough.
However, it *does* break 'grep', because if I cut/paste a snippet from the documentation and try to grep for it, it'll no longer match.
Consistency is good, but perhaps we should actually be consistent the other way round and always use the left/right versions in the source *instead* of relying on the tooling, to make searches work better? You claimed to care about that, right?
That's indeed a good point. It would be interesting to have more opinions on that matter.
There are a few things to consider:
1. It is (usually) trivial to discover what document produced a certain page in the documentation.
For instance, if you want to know where the text under this file came from, or to grep a text from it:
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
You can click on the "View page source" button on the first line. It will show the .rst file used to produce it:
https://www.kernel.org/doc/html/latest/_sources/admin-guide/cgroup-v2.rst.tx...
2. If all you want is to search for a text inside the docs, you can click on the "Search docs" box, which is part of the Read the Docs theme.
3. The kernel has several extensions for Sphinx, in order to make life easier for kernel developers:
Documentation/sphinx/automarkup.py
Documentation/sphinx/cdomain.py
Documentation/sphinx/kernel_abi.py
Documentation/sphinx/kernel_feat.py
Documentation/sphinx/kernel_include.py
Documentation/sphinx/kerneldoc.py
Documentation/sphinx/kernellog.py
Documentation/sphinx/kfigure.py
Documentation/sphinx/load_config.py
Documentation/sphinx/maintainers_include.py
Documentation/sphinx/rstFlatTable.py
Those (in particular automarkup and kerneldoc) will also dynamically change things during ReST conversion, which may cause grep to not work.
4. Some PDF tools like evince will match curly quotes if you type an ASCII quote in their search boxes.
5. Some developers prefer to only deal with the files inside the kernel tree. Those are very unlikely to grep with curly quotes.
My opinion on that matter is that we should make life easier for developers to grep text files, as the ones using the web interface are already served by the search box for the HTML output, or by tools like evince.
So, my vote here is to keep quotation marks as plain ASCII.
OK, but all your reasoning is about the *character* used, not the encoding. So try to do it without mentioning ASCII, and especially without mentioning UTF-8.
Your point is that the *character* is the one easily reachable on standard keyboard layouts, and the one which people are most likely to enter manually. It has *nothing* to do with charset encodings, so don't conflate it with talking about charset encodings.
The remaining cases are future work, outside the scope of this v2:
Hyphen/Dashes and ellipsis
- U+2212 ('−'): MINUS SIGN
- U+00ad (''): SOFT HYPHEN
- U+2010 ('‐'): HYPHEN

Those three are used in places where a normal ASCII hyphen/minus should be used instead. There are even a couple of C files which use them instead of '-' in comments. IMO these are fixes/cleanups from conversions and bad cut-and-paste.
That seems to make sense.
- U+2013 ('–'): EN DASH
- U+2014 ('—'): EM DASH
- U+2026 ('…'): HORIZONTAL ELLIPSIS

Those are auto-replaced by Sphinx from "--", "---" and "...", respectively. I guess those are a matter of personal preference about whether to use ASCII or UTF-8. My personal preference (and Ted seems to have a similar opinion) is to let Sphinx do the conversion. For those, I intend to post a separate series, to be reviewed patch by patch, as this is really a matter of personal taste. We'll hardly reach a consensus here.
Again using the trigraph-like '--' and '...' instead of just using the plain text '—' and '…' breaks searching, because what's in the output doesn't match the input. Again consistency is good, but perhaps we should standardise on just putting these in their plain text form instead of the trigraphs?
Good point.
While I don't have any strong preferences here, there's something that annoys me with regards to EM/EN DASH:
With the monospaced fonts I'm using here - both in my e-mail client and in my terminals - EM and EN DASH are displayed *exactly* the same.
Interesting. They definitely show differently in my terminal, and in the monospaced font in email.
participants (4)
- David Woodhouse
- Edward Cree
- Mauro Carvalho Chehab
- Theodore Ts'o