The newline Guide to Bash Scripting - Part 2

Text#

The most useful abstraction ever devised for computers is arguably human–readable text. At the lowest level, computers manipulate electrons. Above that level are bits, then numbers, and then text. This chapter will look at how text is structured and some useful details around the mapping from numbers to text.

Newlines#

In one of the great tragedies of computing history, the three main operating system classes chose three different ways to indicate the separation between one line of text and the next one. Two of these are still with us:

  • Linux uses a line feed, Unicode code point U+000A, as the line terminator, meaning that each line, including the last one in a file, has a trailing line feed character.
  • Windows uses a carriage return, Unicode code point U+000D, followed by a line feed as a two–character line separator, meaning that by convention the last line in a file does not have a trailing carriage return + line feed. This means that a Windows file which ends in a carriage return and line feed has an empty last line.
  • The third style, a carriage return terminator, was used by older Apple operating systems.

Synonyms for “line feed” include “newline,” “line break,” and “end of line” (EOL).

Carriage return is a valid character within a Bash script, but maybe not in the way you might think. Try creating a simple script like this, containing Windows newlines:

$ printf '%s\r\n' '#!/usr/bin/env bash' '(( "$?" == 0 ))' > test.bash
$ chmod u+x test.bash
$ ./test.bash
/usr/bin/env: ‘bash\r’: No such file or directory

That carriage return character is now part of the shebang line! In general Bash will treat carriage return like any other character in your script, so this can wreak various kinds of havoc depending on how the script is run and whether every line has a Windows newline. Say for example you bypass the shebang line:

$ bash test.bash
test.bash: line 2: syntax error in conditional expression
'est.bash: line 2: syntax error near `))
'est.bash: line 2: `(( "$?" == 0 ))

By now you might be thinking “‘est.bash? What happened to the file name?” In the previous error message the carriage return was printed as \r , a character which returns the carriage (the position where the next character will be printed) to the origin, the left margin. Unfortunately Bash does not escape the carriage return in this error message, so it actually prints “test.bash: line 2: syntax error near `))”, then returns the carriage to the left margin, and then prints the “’” character which is part of the error message, overwriting the initial “t” character in the terminal output.

A similar warning goes for processing files originating on a Windows system, where the results might still contain carriage returns. If the input has been shuffled in any way, such as reorganizing CSV columns, you might find that the carriage return isn’t even at the end of the line.
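
A hedged sketch of how that can happen: swapping two CSV columns with awk moves the carriage return, which is attached to the last field, into the middle of the line:

$ printf 'a,b\r\n' | awk -F ',' '{ print $2 "," $1 }' | od --address-radix=n --format=c
   b  \r   ,   a  \n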

All this is to say that Windows newlines in Bash scripts can cause inscrutable errors. Fortunately there is a tool which solves most such issues. To convert from Windows to Unix newlines use dos2unix FILE… . Conversely, unix2dos FILE… converts from Unix to Windows newlines, adding the carriage return before each line feed.

Remember how Linux has line terminators and Windows has line separators? This means a “normal” file on Linux should have a trailing newline and on Windows it should not have a trailing newline. But unix2dos and dos2unix do not add or remove the newline at the end of the file. This can cause problems with certain tools, most notably read . After running dos2unix you can check the end of a file by converting the last character of the file to a human readable representation. Let’s try with a simple comma–separated value file:

$ printf '%s,%s\r\n%s,%s' 'Key' 'Value' 'pi' '3.14' > example.csv
$ dos2unix example.csv
dos2unix: converting file example.csv to Unix format...
$ tail --bytes=1 example.csv | od --address-radix=n --format=c --width=1
   4

In the last line, tail --bytes=1 prints just the last (“tail end”) byte of example.csv before passing it to od , which we’ve encountered before.

The last character in the file is “4”, not the “\n” we would expect for processing with Linux tools. To fix this simply add a single newline to the end of the file using echo >> FILE :

$ echo >> example.csv
$ tail --bytes=1 example.csv | od --address-radix=n --format=c --width=1
  \n

If instead you want to unconditionally make sure a file ends with a line feed, just run sed --in-place '$a\' FILE… .

The sed script $a\ can be read as “on the last line ( $ ) append ( a\ ) nothing (the empty string after a\ ). sed implicitly adds a newline to every line it processes, so this ends up adding a newline. It can basically be considered as a no–op which happens to have a useful side effect.
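
For example, on a file missing its final newline:

$ printf '%s' 'no newline at end' > file.txt
$ sed --in-place '$a\' file.txt
$ tail --bytes=1 file.txt | od --address-radix=n --format=c --width=1
  \n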

Encoding#

One of the first data types you learn to work with as a programmer is strings. But this seemingly simple data type has very complex depths, so complex that it’s still changing, and various programming languages still handle strings in very different ways. In Bash, the contents of a string variable are “simply” stored as a sequence of bytes (values 0 through 255) with a NUL byte at the end. For scripting purposes the NUL byte is not part of the variable value, and this terminator means that if you try to store arbitrary binary data in a variable the value will be cut off at the first occurrence of a NUL byte:

$ value=$'foo\0bar'
$ echo "$value"
foo
$ echo "${#value}"
3

As you can see, the value is cut off at the first NUL byte, and the NUL itself is not considered part of the string.

That takes care of Bash variables: series of bytes with no special meaning, internally terminated by a NUL byte. To get to what humans would consider a string you have to add an encoding: a mapping from byte values to code units, and in the case of multi–byte encodings another mapping from code units to code points (often called “characters” although this is a heavily overloaded word). Let’s first check which encoding the current shell is using:

$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=

locale prints the settings rather than the variable assignments. So if you want to get the current collation setting in a script you should inspect the output of locale rather than the value of $LC_COLLATE (“collate” is synonymous with “sort” and “order”). Even if $LC_COLLATE is set it may be overridden by $LC_ALL , and if it is unset the value of $LANG applies.
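
For example, a sketch of extracting the effective collation setting in a script (assuming the quoted form shown in the output above):

$ collation="$(locale | sed --quiet 's/^LC_COLLATE="\(.*\)"$/\1/p')"
$ echo "$collation"
C.UTF-8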

The values except for LANGUAGE are formatted as language[_territory][.codeset][@modifier] (documented in man 3 setlocale ). We’re only interested in the LC_CTYPE (locale character type) “codeset” part, “UTF-8”, which tells the shell how to interpret byte sequences as code points. Let’s see what it does:

$ currency='€'
$ echo "${#currency}"
1
$ printf '%s' "$currency" | wc --bytes
3
$ printf '%s' "$currency" | xxd -groupsize 1
00000000: e2 82 ac       

So in Unicode “€” is one code point (the Bash variable length in the first command), and when encoded as UTF–8 it takes up three bytes whose hexadecimal representation is 0xe282ac. And, crucially, those bytes need to be treated as UTF–8 to get back the original code point!

You can also enter Unicode code points by their hexadecimal representations. For example, the Euro sign “€” is U+20AC. This can be represented in Bash as either a literal string or as $'\u20ac' (lowercase “u” can be used for code points up to FFFF), meaning that [[ '€' == $'\u20ac' ]] . Uppercase “U” works with the entire Unicode range, so U+1F600, the grinning face emoji, is $'\U0001f600' .

Handling strings#

So how do you actually deal with strings (including text files, CSV, HTML, database records, etc.) in shell scripts?

First, you need to know which encoding applies to each string. If you don’t know, the file utility can make an educated guess based on the first mebibyte:

$ printf '%s\n' 'abc' | file -
/dev/stdin: ASCII text
$ printf '%s\n' 'abc' '€' | file -
/dev/stdin: UTF-8 Unicode text

The result of using the wrong encoding to decode some text is usually a lot of Unicode replacement characters (“�”, which looks different in each font) or mojibake, text intermingled with garbled symbols. For example, if we try to treat a UTF–8 string as ISO 8859–1 and convert it back to UTF–8:

$ echo '€1.50' | iconv --from-code=iso-8859-1 --to-code=utf-8
â¬1.50

Second, convert everything to the same encoding before processing. Doing this means you won’t have to worry about silly things like sorting the same code point in different ways. UTF-8 is very handy here, because Unicode contains every character used in every standardized encoding, so any string in any encoding can be converted to a Unicode encoding without losing code points. If you can, standardize on UTF–8 across your entire system and convert everything to that. Then you won’t have to worry about these conversions again. iconv can do this conversion:

$ printf '%s\n' $'\xDF\xE0\xD8\xD2\xD5\xE2' > input.txt
$ cat input.txt
������
$ iconv --from-code=iso-8859-5 --to-code=utf-8 --output=result.txt input.txt
$ cat result.txt
привет

Third, if you need to keep some strings in other encodings, make sure to convert everything back to the original encoding before saving. Saving a UTF–8 string into a Windows–1251 encoded database column is going to ruin someone’s day.

To get to what you see on screen, the machine also has to translate:

  • code points to graphemes or extended grapheme clusters (abstract graphical units)
  • graphemes to glyphs, which are defined by fonts

These last two steps are rarely relevant for shell scripting.

Math#

Bash on its own only supports integer numbers, and often in surprising ways. That said, we can get a lot of mileage out of integers. We won’t look at every one of the 30+ arithmetic operators available in Bash — they are documented in the “Arithmetic Evaluation” section of man bash — but we are going to look at how operators are used in practice. For tasks involving complex math, Bash is often the wrong language, but we’ll look briefly at an example at the end of the chapter.

Arithmetic expansion#

If you want to perform a calculation and use the resulting base 10 string for something the syntax is $((EXPRESSION)) . This is called arithmetic expansion because Bash expands (replaces) the arithmetic expression with a string when running the command, and the inside of the parentheses is called a numeric context because the expression is treated as arithmetic rather than a string. An example, printing the result of x² – y², should make this clearer:

$ x='5'
$ y='3'
$ echo "$(( ("$x" ** 2) - ("$y" ** 2) ))"
16

Strictly speaking, the inner parentheses are not needed, because exponentiation has higher precedence than subtraction, but it often pays to be explicit. Also, whitespace is ignored but helpful.

Arithmetic evaluation#

Sometimes we care about the exit code or side–effect of the expression but not the string value of it. We can evaluate (that is, calculate) an expression without printing anything using the syntax ((EXPRESSION)) . One common use case is incrementing a counter:

$ count='0'
$ ((++count))
$ echo "$count"
1

Notice how the variable in the arithmetic expression is not preceded by a dollar sign. This is intentional: to increment a variable we need the expression to refer to the name of the variable, not its value.

((++count)) is called pre–increment: the variable is incremented and the new value is the value of the expression. There is also post–increment, ((count++)) , where the original value before incrementing is the value of the expression. In other words, given count=0 , the value of ((++count)) is 1 and the value of ((count++)) is 0. This relates to an important quirk: arithmetic evaluations return 1 (indicating failure) if the value of the calculation is zero. This is why I used pre–increment above: post–incrementing a variable with a value of 0 results in 0, which in turn results in an exit code of 1:

$ count=0
$ ((count++))
$ echo "$?"
1

For this reason it is important to consider whether any arithmetic evaluation could result in a value of zero, and to handle or avoid this situation to avoid terminating the program early. For a uniformly increasing counter starting at 0 this leads to an obvious pattern: always pre–increment.

To be clear, the value of $count is incremented even though the exit code is 1. It’s just not very helpful if the program exits because some arbitrary calculation yielded 0.

When doing anything more complex it might be worth allowing exit code 1 specifically. For example, when doing arithmetic on user input basically any operation can result in zero. When such an operation fails we can check whether the exit code was 1. This results in a command where exit code 0 or 1 is treated as success, and any other exit code is treated as a failure, as in the sketch below:

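A minimal sketch of this pattern (the original math/count.bash listing is not included here; the variable names are illustrative):

value=0
((value++)) || exit_code="$?"   # Post-increment of 0 yields 0, so the exit code is 1
if (( "${exit_code:-0}" > 1 ))  # Only exit codes above 1 indicate a real failure
then
    exit "$exit_code"
fi
echo "value is now $value"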

Another common use case is comparison. The exit code of a comparison will be 0 if true and 1 if false:

$ x='5'
$ y='7'
$ if (("$x" > "$y"))
> then
>     echo "x is bigger"
> else
>     echo "y is bigger"
> fi
y is bigger

Best practice#

You might have seen bareword variables like x in Bash arithmetic, but the use of "$x" above is deliberate. First, "$x" is consistent with how variables are used in general, and using different syntax in arithmetic contexts is just a needless complexity. Second, a subtle difference is that if $x is not set x will be coerced to zero while "$x" results in a syntax error. Debugging a rogue zero is going to be much more difficult than a syntax error, so we should always quote within arithmetic contexts. Example:

$ x='5'
$ echo "$(( (x ** 2) - (y ** 2) ))" # Don't do this!
25
$ echo "$(( ("$x" ** 2) - ("$y" ** 2) ))"
bash: (5 ** 2) - ( ** 2) : syntax error: operand expected (error token is "** 2) ")

$(expr EXPRESSION) is similar to $((EXPRESSION)) , but not recommended. expr requires you to put in quotes any part of the expression which has special meaning, like * , and it will be slower because it runs a new process for each expression, unlike $((EXPRESSION)) which is calculated by the current shell process.

let EXPRESSION is similar to ((EXPRESSION)) , but also not recommended because (like expr ) it requires you to quote any parts of the expression which have special meaning.

The result of the expansion is a string representing a base 10 number. It is possible to set an integer attribute on a variable by using declare -i VARIABLE in order to treat any subsequent assignment to it as an arithmetic expansion, but the value will still be stored in memory as a string. This also makes the code more confusing, because you need to know about the state of a variable to know which value it will take. For these reasons declare -i is not recommended.

$ count='2 * 2'
$ echo "$count"
2 * 2
$ declare -i count
$ echo "$count"
2 * 2
$ count='2 * 2'
$ echo "$count"
4

Arithmetic operators#

Another common numeric context is conditionals with arithmetic operators, documented in the “Conditional Expressions” part of man bash . The arithmetic operators are -eq (equal), -ne (not equal), -lt (less than), -le (less than or equal), -gt (greater than), and -ge (greater than or equal). [[ "$count" -le '3' ]] is equivalent to (( "$count" <= '3' )) : both sides are coerced to numbers and compared as such. The latter form should be more familiar from other languages, but there is another reason for avoiding numeric comparisons in [[ : confusing numbers with strings. For example, the numeric operators coerce each argument to a number before comparing them:

$ month='07'
$ [[ "$month" -eq 7 ]]
$ echo "$?"
0
$ [[ "$month" == 7 ]]
$ echo "$?"
1

You will sometimes see = used instead of == in [[ comparisons, and both are valid. Some prefer = because it’s defined in POSIX, but since most languages use = only for assignments and == only for comparisons, following the convention by using == in Bash is more universally readable.

In the first example “07” is treated as the number 7, which is equal to the second number 7. In the second example “07” is treated as a string, and is therefore not equal to the second string “7”. “07” is numerically equal to “7”, but not string equal.

Using a numeric operator for strings has a different problem — in a numeric context a string like “August” is treated as a variable name, and an unset variable is coerced to zero. So two strings which are not numbers (and not the names of set variables) are both numerically equal to zero:

$ month='August'
$ [[ "$month" -eq 'June' ]]
$ echo "$?"
0
$ [[ "$month" == 'June' ]]
$ echo "$?"
1

(( "$month" == 'June' )) has the same problem of not reporting invalid numbers.

The following example is much closer to what other languages do — the comparison is in parentheses, similar to languages like Java, and the numeric comparison operator is == :

$ month='07'
$ (( "$month" == 7 ))
$ echo "$?"
0

In short, I would recommend using [[ when dealing with strings and (( when dealing with numbers.

Bash integers are 64–bit signed, so the valid range is –2^63^ through 2^63^ – 1. Over– and underflow are silent, so if you want to deal with numbers outside that range you should probably be using a different language.
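
For example, the maximum value plus one silently wraps around to the minimum:

$ echo "$(( 2 ** 63 - 1 ))"
9223372036854775807
$ echo "$(( 2 ** 63 - 1 + 1 ))"
-9223372036854775808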

Bases#

The result of any Bash arithmetic is a string representing a base 10 integer, but the inputs to an arithmetic expression can be in any of several different formats. Base 10 is the most obvious, consisting of only numbers zero through nine.

Octal numbers (base 8) start with a zero followed by numbers zero through seven. For example, 035 means “three eights plus five,” which in decimal is 29. Or as a Bash expression, (( 035 == 29 )) . This is convenient in a few areas to express a collection of three bits (eight unique values) per character, such as file permissions.

Hexadecimal numbers (base 16) start with 0x followed by any one of zero through nine or the first six letters of the alphabet to represent ten through fifteen. One common number you might see is 0xff , which represents the maximum value of a single byte, 255 or 2^8^ – 1. So a single letter in hexadecimal represents half a byte, also known as a “nibble.” Who said programmers don’t know how to have fun?
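
Both forms can be used directly in arithmetic expansion:

$ echo "$(( 035 ))"
29
$ echo "$(( 0xff ))"
255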

Finally, BASE#NUMBER supports bases 2 through 64 (including the ones above).

  • BASE is a decimal number and NUMBER is the string representing a number in that base.
  • NUMBER characters include:
    • “0” through “9”
    • “a” through “z” (10 through 35 in base 10)
    • “A” through “Z” (36 through 61)
    • “@” (62)
    • “_” (63)

So (( 64#_@Zz9 == 1073469641 )) , or (63 × 64^4^) + (62 × 64^3^) + (61 × 64^2^) + (35 × 64) + 9.

Bases other than 8, 10 and 16 are rare in practice. Base 64 could theoretically be used to store big numbers in few bytes, but there are much better ways to compress such data, including custom binary protocols.

Numeric base 64 should not be confused with Base64 or the related command base64 , a way to encode binary data in 64 ASCII characters (not the same as above).

A confusing technicality: in bases between 11 and 36 the digits above nine fit within a single alphabet, so you’re allowed to mix upper and lower case. This means that (( 36#A == 10 )) but (( 37#A == 36 )) . For consistency I recommend using lowercase in bases less than 37.

Negative numbers need to have the hyphen before the base, as in -64#_ .
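
A quick check in arithmetic expansion:

$ echo "$(( -64#_ ))"
-63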

Floating-point and advanced math#

As mentioned, Bash only supports integer math. Trying to use non–integer numbers results in an error message like this:

$ echo "$((2 * 1.1))"
bash: 2 * 1.1: syntax error: invalid arithmetic operator (error token is ".1")

bc is commonly used for math involving large or decimal numbers. It can be run as an interactive shell by starting it with no parameters and no redirected standard input. In the shell you can type expressions to get answers:

$ bc --quiet
2 ^ 100
1267650600228229401496703205376
quit

--quiet avoids printing a few lines of extra information when starting the shell.

The first and last lines after the command are input, and the second is the output. As we can see there are two differences from Bash arithmetic: The power operator is ^ rather than ** and the values do not wrap at 2^63^ – 1. In fact, bc can handle both huge and tiny numbers. The operators and limits are documented in info bc .

Floating-point numbers have a fixed size. That is, each floating-point number uses a specific number of bits to store a value, which means there are practical limits to both size and precision. As far as I can tell bc does not support IEEE 754 floating-point math, but most popular programming languages do.

You can also run bc non–interactively by passing expressions on standard input. This is the more common way to use it in scripts:

$ bc <<< '4 * 3'
12

By default bc does not print decimal digits, and rounds toward 0:

$ bc <<< '5 / 3'
1
$ bc <<< '-5 / 3'
-1

We can also control both the input and output bases with the ibase and obase variables, respectively. For example, we can convert hexadecimal 0xff to binary:

$ bc <<< 'ibase = 16; obase = 2; FF'
11111111

The hexadecimal characters must be uppercase, otherwise the input is coerced to zero.

Another important feature of bc is the square root function, sqrt :

$ bc <<< 'scale = 2; sqrt(2)'
1.41
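
Since bc reads its expression from standard input, shell variables can be interpolated into it:

$ x='10'
$ y='3'
$ bc <<< "scale = 4; $x / $y"
3.3333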

Anything much more advanced than this should probably be implemented in another language.

Time#

We’ll be looking at three different ways of dealing with time:

  1. Printing and modifying datetimes
  2. Checking how much time a command takes
  3. Limiting how much time a command takes

Datetime processing#

Anyone who has worked for a while with computers will tell you that dealing with time is hard. A gigantic amount of effort has gone into trying to map the human understanding of time, combining a messy universe with thousands of years of human culture and innovation, to something computers can actually deal with.

We’ll look at how to deal with some common and fairly simple cases using the date command. date by default prints a human–readable datetime with second resolution and the current time zone:

$ date
Thu Dec 31 23:59:59 NZDT 2020

Input formatting#

We can also give date a specific input datetime with --date=DATETIME . It has complex rules for parsing all manner of human–readable input, but the safest is to specify a machine–readable RFC 3339 string:

$ date --date='2000-11-30 22:58:58+00:00'
Fri Dec  1 11:58:58 NZDT 2000

Output formatting#

For RFC 3339 output use --rfc-3339=PRECISION :

$ date --rfc-3339=date
2020-12-31
$ date --rfc-3339=seconds
2020-12-31 23:59:59+13:00
$ date --rfc-3339=ns
2020-12-31 23:59:59.123456789+13:00

You may have heard of ISO 8601 as the machine–readable datetime format, but RFC 3339 has some advantages: it allows a space separator between the date and time for readability, and it only allows a full stop between the integer and fractional seconds. date --iso-8601=ns on the other hand uses a comma separator, which is not appropriate in an English–speaking locale.

In addition to making the datetime easy to parse, RFC 3339 datetimes (in the same time zone) can be trivially sorted. So if you ever want to parse your datetimes or sort your lines simply use RFC 3339.
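
For example, a lexical sort puts RFC 3339 datetimes in chronological order because the larger units come first:

$ printf '%s\n' '2021-01-02 11:00:00+13:00' '2020-12-31 23:59:59+13:00' | sort
2020-12-31 23:59:59+13:00
2021-01-02 11:00:00+13:00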

Time zones#

You can use the TZ variable to specify the output time zone:

$ TZ='UTC' date
Sun May 24 22:29:51 UTC 2020
$ TZ='Pacific/Auckland' date
Mon May 25 10:29:51 NZST 2020

I would recommend using UTC everywhere you can. Time zones are a cultural construct which frequently change – there were a full 90 new releases of the time zone database in the 2010s. Some time zones also have daylight saving time or other discontinuities. Better to use UTC and avoid any conversions or ambiguity.

To use UTC for the duration of a shell and its sub–shells simply export TZ='UTC' rather than specifying it for each date command:

$ export TZ='UTC'
$ date
Sun May 24 22:29:51 UTC 2020

Specifying the input time zone is also possible:

$ TZ='UTC' date --date='TZ="Pacific/Auckland" 2020-12-31 23:59:59' --rfc-3339=seconds
2020-12-31 10:59:59+00:00

Now we can apply human–readable offsets:

$ TZ='UTC' date --date='2000-01-01 08:00:00 + 5 days 3 seconds' --rfc-3339=seconds
2000-01-06 08:00:03+00:00

You can specify a custom output format. Because of the availability of shorthands like --rfc-3339=seconds , the only other format we’re likely to encounter is the Unix timestamp, the number of seconds since 1970–01–01 00:00:00 UTC minus leap seconds:

$ date --date='2020-12-31 23:59:59+13:00' +%s
1609412399

The + character at the start of the word indicates that this parameter is a format string. %s is the format specifier “seconds since 1970-01-01 00:00:00 UTC.” The full set of specifiers is documented in info date .

Because of the leap second handling we can’t treat UTC and Unix timestamps as interchangeable. As with time zones, I would advise to simply convert everything to UTC and go from there. Timestamp arithmetic may be tempting, but it leads to real bugs which are hard to reproduce and fix. Relevant to this is the classic Falsehoods programmers believe about time.

Use @NUMBER to specify a Unix timestamp as input:

$ TZ='UTC' date --date='@1000000000' --rfc-3339=seconds
2001-09-09 01:46:40+00:00

Timing a command#

time COMMAND will tell you how long a command took to run:

$ time sleep 3

real	0m3.011s
user	0m0.003s
sys	0m0.005s

The “user” and “system” times show how much CPU time was spent in user space and kernel space, respectively. In our case they are much smaller than the real time, because the sleep command uses barely any CPU, but they are counted per CPU, so they can actually exceed the elapsed time for a busy, multi–threaded program.

time prints the timing information even when interrupted.

For most regular programmers only the first line is interesting. It shows the actual amount of elapsed time, or “wall–clock” time.

It may be surprising that on a machine capable of processing billions of instructions per second the timing was off by a full 11 milliseconds. This is because sleep does not give any hard guarantees about how long it runs. It will try to run for approximately the given time, but we should never rely on it as an accurate time–keeper.

This can be devastating in programs which assume that, on average, the Nth run of something will be N times M seconds after the first run. A typical example would be two programs which are meant to run in lockstep: the first one produces a new file every minute, the second one starts 30 seconds later and runs every minute, processing the file for the previous minute. Such timing is guaranteed to drift, and the second process will sometimes process the same file twice and sometimes skip a file. A common “fix” is to run the second command at a shorter interval, but that assumes that each program will have a very regular run time and that the file is only available if the first program has finished writing to it, neither of which is usually true. A reliable solution will either send a signal from the first to the second program when a new file is ready for processing or will use file system notifications (such as inotify) as a proxy, as in the sketch below.
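
A sketch of the notification approach, assuming the inotifywait command from the inotify-tools package and a hypothetical process command:

$ inotifywait --monitor --event close_write ./incoming/ |
> while read -r directory event file
> do
>     process "${directory}${file}"
> done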

Timeouts#

timeout DURATION COMMAND kills the command and returns a non–zero exit code if it is still running after the duration has elapsed. While this command looks helpful, it would be a code smell in a production pipeline. Long–running processes are common, and not necessarily a sign of a problem. Using a flat timeout for the running time of a whole command is usually too simple an approach to be useful. And finally, a lot of programs which deal with requests which may time out, such as ping or curl , have their own timeout mechanisms which are probably more useful because they will be able to do their own reporting of the failure.
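
Its behavior is easy to demonstrate, though — when the command is killed the exit code is 124:

$ timeout 2 sleep 5
$ echo "$?"
124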

Output#

Sometimes pipelines are not enough, and we just want to emit a simple string. Let’s look at a few ways to output text.

echo#

You might have heard of echo as the command to “print lines.” Technically it emits each of the arguments in turn, separated by a space character, and prints a single newline at the end. Let’s explore this with some examples.

$ echo "It’s pitch black. You’re likely to be eaten by a grue."
It’s pitch black. You’re likely to be eaten by a grue.
$ █

As you can see the cursor appears on a new line below the output. Since the prompt does not contain a newline, that newline must be part of the output of echo . We can confirm this by printing the output character representations:

$ echo 'A' | od --address-radix=n --format=c --width=1
   A
  \n

od is a tool for converting arbitrary data between different formats. In this case we’re telling it to:

  • not display address offsets periodically ( --address-radix=n )
  • format the input as printable ASCII character equivalents ( --format=c )
  • print one byte representation per line ( --width=1 )

hexdump and xxd have similar features, but they have different ways of expressing the same conversions.

This shows us that the echo command emitted a byte corresponding to the ASCII character “A” plus a byte corresponding to a newline. This is a very common requirement when dealing with human–readable output, but most of the time it’s best to follow the Unix philosophy and think of the output of your program as the input to another program. For that, there’s printf .

printf#

printf takes a format specifier as the first argument, and substitutes the following arguments in the format repeatedly, giving us complete control of the output. We’ll look into the two main placeholders used in the format specifier, %s and %q . See man 1 printf for more details.

%s - String#

$ names=('Ada Lovelace' 'Alan Turing')
$ printf 'Name: %s\n' "${names[@]}"
Name: Ada Lovelace
Name: Alan Turing

Let’s deconstruct that format specifier:

  • %s formats an argument as a string. This is the part of the format string which is replaced by subsequent arguments (from the $names array variable in the above code).
  • The rest of the string (the prefix “Name: ” and the escaped newline “\n”) is printed as–is every time the format is repeated.

To reproduce this with echo we would have to loop over the entries in the array manually. Analogous to HTML and CSS in web development, printf cleanly separates the content (the $names array) from the formatting (the “Name: ” prefix, string substitution and newline suffix).
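
For comparison, a sketch of the same output using echo and an explicit loop:

$ for name in "${names[@]}"
> do
>     echo "Name: $name"
> done
Name: Ada Lovelace
Name: Alan Turing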

%q#

We can also put several % format specifiers within a format string to consume more than one entry at a time. One handy use case for this is if we have an array with pairs of keys and values, and want to create a Bash configuration file with key=value assignments to source later. Let’s try it using the %s format specifier:

$ assignments=('developer' 'Ada Lovelace' 'license' 'GPLv3+')
$ printf '%s=%s\n' "${assignments[@]}" > settings.bash # Don't do this!
$ cat settings.bash
developer=Ada Lovelace
license=GPLv3+
$ . settings.bash
Lovelace: command not found

What happened here? The main problem is that developer=Ada Lovelace is two words as far as Bash is concerned, because the space character is neither quoted nor escaped. So the meaning of this line is that Bash tries to run the Lovelace command with developer=Ada as a temporary environment variable assignment. Fortunately it’s easy to avoid this by using the %q format specifier, which automatically quotes or escapes the output so that it’s reusable in Bash:

$ printf '%q=%q\n' "${assignments[@]}" > settings.bash
$ cat settings.bash
developer=Ada\ Lovelace
license=GPLv3+
$ . settings.bash
$ echo "$developer"
Ada Lovelace

Unfortunately, trying to share setting files like this one between languages is generally a Bad Idea™. Variable assignment rules are completely different in different languages, and reimplementing one language’s parsing rules in another language is going to be time consuming and brittle. In such situations I recommend using a well–supported language like JSON to store assignments. At least then the only thing we should have to worry about is the value encoding.
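
A hedged sketch of storing the same assignments as JSON using jq (introduced later in this chapter); the settings.json filename is illustrative:

$ jq --null-input --arg developer 'Ada Lovelace' --arg license 'GPLv3+' \
> '{developer: $developer, license: $license}' > settings.json
$ cat settings.json
{
  "developer": "Ada Lovelace",
  "license": "GPLv3+"
}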

logger#

When starting out with a script it may be appropriate to simply redirect printf output to a file, but to scale we can use logger :

  • Parallel processes can’t write safely to the same file, because file writes are buffered. The buffer size varies, but on my system a simple character repetition ended up producing interleaved sequences of 4096 As and Bs:
$ printf 'A%.0s' {1..10000} >> log &
$ printf 'B%.0s' {1..10000} >> log &
$ wait
$ cat log
[As and Bs, interleaved]
  • logger will interact with the configured system log directly, which hopefully is set up to handle this gracefully. In systemd, for example, each log message is associated with the PID of the logger command, and two concurrent logger messages can be easily distinguished.
  • Once we end up with many scripts it becomes arduous to keep all the log formats in sync to enable easy debugging. logger passes the message with metadata to the logging system, so there is no need to propagate the log format in each script.
  • logger supports The Syslog Protocol, so interacting with compliant servers for centralized logging is available out of the box.

logger MESSAGE logs the given message to the currently active log mechanism, which may be anything at all. Many systems used to be configured by default to log to the file /var/log/syslog. Nowadays it’s more common to log to a journal on desktop systems and to a log aggregation service on servers. If you are on a system with systemd, for example, you can run journalctl --follow to show log entries as they show up in the systemd journal. Open another terminal and see what shows up in the journal window when experimenting with various commands and options. For example, to send a user–level warning to the journal and also to standard error:

$ logger --priority=user.warning --stderr "It's a trap!"
<12>Feb 23 20:14:41 jdoe: It's a trap!

The standard error line starts with a numeric representation of the priority between angle brackets, followed by the datetime the message was recorded, the username, and the message itself. Within the journal this shows up as “Feb 23 20:28:00 box jdoe: It’s a trap!”: the datetime, hostname, username, process ID in brackets, and the message.

The priority number is calculated as the facility number times eight plus the level. By convention, the user facility is number 1 and warning is number 4. 1 * 8 + 4 = 12, as shown above.

The journal output might look superficially similar to a normal log file, but it is much more flexible. The pieces of information shown above and many more are stored in a structured database, so we can retrieve exactly the information we require in any of several formats. For example, we can easily get JSON objects with all the fields of all the user warnings since we started logging:

$ journalctl --facility=user --priority=warning --output=json-pretty
{
        "__REALTIME_TIMESTAMP" : "1614064481380349",
        "_BOOT_ID" : "5bfd1a8ffbb34361a134f1df6110e1fa",
        "_MACHINE_ID" : "37c339f51ea74ab5be6e9f1d0ec63b86",
        "SYSLOG_TIMESTAMP" : "Feb 23 20:14:41 ",
        "MESSAGE" : "It's a trap!",
        "SYSLOG_FACILITY" : "1",
        "_UID" : "1000",
        "__MONOTONIC_TIMESTAMP" : "855112431841",
        "_GID" : "1000",
        "SYSLOG_IDENTIFIER" : "victor",
        "PRIORITY" : "4",
        "_TRANSPORT" : "syslog",
        "_PID" : "553862",
        "_HOSTNAME" : "big",
        "__CURSOR" : "s=bde7e70adbe5411280494c7d43e32068;i=31e3b9;b=5bfd1a[…]",
        "_SOURCE_REALTIME_TIMESTAMP" : "1614064481380109"
}

Other useful options:

  • --boot to show messages since last boot (equivalent to --boot=0 ), --boot=-1 to show the logs from the previous boot if saved to disk, etc
  • --lines=N to show the last N entries
  • --reverse to show entries in reverse chronological order
  • --output-fields=FIELD[,FIELD…] to limit the fields
  • --utc to show UTC timestamps rather than local ones
  • --catalog to augment some log messages with explanations
  • --unit=UNIT to show the log of a specific unit (such as journalctl --unit=sshd.service for the SSH daemon)
  • --since=DATETIME to show messages after some datetime
  • --until=DATETIME to show messages before some datetime

See man journalctl for more.

JSON#

JSON is an incredibly useful data structure that we encounter all the time in web-based applications. Here we’ll look at a few tools that make working with JSON easier.

Filtering#

jq is the program to manipulate JSON in Bash. The most common use is to pull some information out of JSON files. It uses a filter language to manipulate the input. The simplest filter is the identity, . (a single full stop). This is similar to mathematical identity: the input becomes the output. Let’s create a file containing some non–trivial JSON and apply the identity filter to it:

$ cat > ./credentials.json << 'EOF'
> {"username":"jdoe","password":"sec\\ret$"}
> EOF
$ jq . ./credentials.json
{
  "username": "jdoe",
  "password": "sec\\ret$"
}

The first command uses a quoted here document to save a string with special characters without having to escape them.

Since this is an interactive session jq pretty–prints and syntax–highlights, which is a good way to get familiar with new JSON structures. As we can see, the semantics of the output is exactly the same as the original file.

The pretty–printing is a useful reminder that JSON is a more flexible format than CSV. A JSON file can have whitespace anywhere between tokens, it can be arbitrarily nested, and it has its own rules for escaping which are subtly different from Bash. All this means that reliably manipulating JSON with line–based tools like grep , sed or awk is basically impossible.

In a shell script you may want to pull out a property of credentials.json into a variable. To do this, simply use the --raw-output flag to avoid any kind of formatting of the output and use an “object identifier–index” filter, .IDENTIFIER :

$ password="$(jq --raw-output .password credentials.json)"
$ printf '%s\n' "$password"
sec\ret$

If the identifier contains any special characters you need to double–quote it, at which point single–quoting the whole pattern avoids having to escape characters:

$ jq '."$ amount"' <<< '{"$ amount": "5"}' "5"

XML (and HTML)#

Pulling out a value

Probably the most common use case for transforming XML in Bash is pulling out a simple value such as a configuration item. When doing so you need to be careful that any XML escaping is undone – you’ll want any &NAME; entities replaced by their literal characters in your script. xml_grep is one tool which can do this:

$ cat > ./test.xml << 'EOF'
> <configuration>
>   <password>foo &amp; bar</password>
> </configuration>
> EOF
$ password="$(xml_grep --text_only '/configuration/password' ./test.xml)"
$ printf '%s\n' "$password"
foo & bar

This method also handles character data sections:

$ cat > ./test.xml << 'EOF'
> <configuration>
>   <password><![CDATA[<&;>\'✓]]></password>
> </configuration>
> EOF
$ password="$(xml_grep --text_only '/configuration/password' ./test.xml)"
$ printf '%s\n' "$password"
<&;>\'✓

XML does not permit NUL characters, so any XML content is representable in a Bash variable.

Transforming

Because of the complexity of XML, transforming from one document structure to another can be arbitrarily complex. XSLT is the main language for declaring such transformations, and xsltproc can apply such transformations to XML documents. Let’s look at a simple use case, starting with an attribute–based configuration file v1.xml:
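
The v1.xml listing is not reproduced here; a minimal sketch consistent with the v2.xml output shown below (all values inferred from that output) could be:

<?xml version="1.0"?>
<configuration>
  <db name="app" host="example.org" user="jdoe" password="foo &amp; bar&quot;"/>
  <host name="0.0.0.0" port="443"/>
</configuration>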

We’d like to keep the indentation but put each of the configuration values into a separate element for easier parsing. To do that we

  1. specify strip-space and output elements to control the indentation

  2. match “/”, the root of the document

  3. match the “db” and “host” child elements, transforming each of them separately

The result is v1-to-v2.xslt:
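
The v1-to-v2.xslt listing itself is missing here; a sketch following the three steps above (element and attribute names inferred from the sample files) might look like:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>
  <xsl:template match="/">
    <configuration>
      <xsl:apply-templates/>
    </configuration>
  </xsl:template>
  <xsl:template match="db">
    <db>
      <name><xsl:value-of select="@name"/></name>
      <host><xsl:value-of select="@host"/></host>
      <user><xsl:value-of select="@user"/></user>
      <password><xsl:value-of select="@password"/></password>
    </db>
  </xsl:template>
  <xsl:template match="host">
    <host>
      <name><xsl:value-of select="@name"/></name>
      <port><xsl:value-of select="@port"/></port>
    </host>
  </xsl:template>
</xsl:stylesheet>
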
The xmlns (“XML namespace”) declaration at the start says that any element starting with xsl: belongs to the XSLT namespace, as defined by the URL. This makes it easy to mix namespaces: anything starting with xsl: is an XSLT directive, and anything else is part of the “default” (unnamed) namespace, which we generally use for the output elements.

Now we can run xsltproc --output ./v2.xml ./v1-to-v2.xslt ./v1.xml to transform the original configuration into v2.xml:

<?xml version="1.0"?>
<configuration>
  <db>
    <name>app</name>
    <host>example.org</host>
    <user>jdoe</user>
    <password>foo &amp; bar"</password>
  </db>
  <host>
    <name>0.0.0.0</name>
    <port>443</port>
  </host>
</configuration>

This has taken care of one of the subtleties of XML escaping: double quotes need to be escaped within an attribute value delimited by double quotes, but within element text they do not. So in the interest of readability xsltproc avoids unnecessary escaping.

Creating

xsltproc can also help with creating XML files from scratch, and take care of escaping and formatting easily. To do this we start with a parametric XSLT file:
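
The authentication.xslt listing is likewise not included; a minimal sketch that would produce the output below, using xsl:param names matching the --stringparam arguments:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:param name="username"/>
  <xsl:param name="password"/>
  <xsl:template match="/">
    <authentication>
      <username><xsl:value-of select="$username"/></username>
      <password><xsl:value-of select="$password"/></password>
    </authentication>
  </xsl:template>
</xsl:stylesheet>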

We then use --stringparam NAME VALUE to specify the values of the parameters:

$ xsltproc --output ./authentication.xml \
> --stringparam username jdoe --stringparam password 'foo > bar' \
> ./authentication.xslt - <<< '<x/>'

xsltproc does not have an option analogous to jq’s --null-input, so we have to pass a dummy XML document.

This results in a properly formatted authentication.xml:

<?xml version="1.0"?>
<authentication>
  <username>jdoe</username>
  <password>foo &gt; bar</password>
</authentication>

Auto-Formatting

All the XML and XSLT files above have been formatted using xmllint --format. There are many other XML formatters, for example tidy, xml_pp and xmlstarlet. If you track your XML files in version control I would recommend using a formatter you’re comfortable with to give yourself an advantage when comparing revisions. Some formatters like xmllint can validate the input against a schema definition. If you’re dealing with XML which is consumed and/or produced by different systems I would recommend creating and enforcing a schema at each system to ensure interoperability.

Images

Software developers often have to deal with collections of images, creating or manipulating them in bulk. We’ll look at some common tasks and how to achieve them easily in Bash.

Taking screenshots

Keyboard shortcuts

Most desktop environments come with a built–in screenshot facility. Try pressing the Print Screen button (often labeled with an abbreviation like “PrtScn”), and you might see the screen blink, hear a camera shutter sound, or get some other indication that a screenshot file is available. If this works other variants might be available:

  • Alt–Print Screen to capture only the currently–focused application

  • Shift–Print Screen to capture a custom rectangle of the screen

Ubuntu saves screenshots in the ~/Pictures directory.

gnome-screenshot

This and many other tools offer convenient features beyond keyboard shortcuts:

  • gnome-screenshot --interactive opens a window to set up a screenshot.

  • gnome-screenshot --include-pointer includes the mouse pointer in the image.

  • gnome-screenshot --delay=5 waits for five seconds before taking the screenshot. This allows you to capture any process which would be broken by pressing a key, or to easily capture screenshots at an interval.

  • --file=FILE allows you to set a custom filename, which you could, for example, use in a loop to create sequential image files:

index=0
while true
do
    gnome-screenshot --delay=1 --file="./$((++index)).png"
done

Currently it is not easy to disable the screen flash and shutter sound of gnome-screenshot (bug report). If you want to avoid these right now you might want to look into another tool such as scrot or shutter, or search your package manager for something like “screenshot”.

Cropping, resizing and converting

Let’s say you have a bunch of screenshots like this one:

You want to get rid of everything except the text area in the middle. By selecting that region in GIMP you’ve established that it’s 729 pixels wide by 434 pixels tall and starts 0 pixels from the left side and 54 pixels below the top. You then want to resize it to 50% in both the X and Y directions to make it a quarter–size image. ImageMagick’s mogrify can do both changes in a single command:

mogrify -crop '729x434+0+54' -resize '50%' ./*.png

The result: [image]

You can stack as many operations as you want between the input and the output filenames. As you can see from the above they are applied left to right.

If you want to keep the original files you can use the convert command, which supports the same operations as mogrify but only operates on one file at a time. Its synopsis is convert [INPUT_OPTION…] INPUT_FILE [OUTPUT_OPTION…] OUTPUT_FILE.

Creating videos

Creating a video from a set of images is simple with ffmpeg:

ffmpeg -i '%d.png' time-lapse.webm

This assumes that the files are called 1.png, 2.png, etc, like the ones created by the screenshot loop above. Because %d in the pattern specifically means an integer without leading zeros FFmpeg inserts the images into the video in numeric order, which is typically what you want. You can also use -pattern_type glob -i './*.png' to match a glob. This is useful for alphabetically ordered files (for example, screenshots with RFC 3339 datetimes like “2101-12-31 23:59:59.png”). Just be aware that alphabetic ordering is not the same as numeric ordering:

$ cd "$(mktemp --directory)"
$ touch ./{1..10}.png
$ echo ./*
./10.png ./1.png ./2.png ./3.png ./4.png ./5.png ./6.png ./7.png ./8.png ./9.png

Another common option is -framerate VALUE, which must be put before -i. -framerate 60 creates a 60 FPS video. This option also supports fractions such as -framerate 1/5 and abbreviations such as -framerate ntsc.
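
Putting the options together for glob input, a hedged example:

$ ffmpeg -framerate 60 -pattern_type glob -i './*.png' time-lapse.webm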

Creating graphs

DOT is a text format for representing abstract graphs, for example architecture.gv below:
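
A sketch of what architecture.gv might contain (the node names are illustrative, not from the original listing):

graph {
    "Load balancer" -- "Web server A";
    "Load balancer" -- "Web server B";
    "Web server A" -- "Database";
    "Web server B" -- "Database";

    bgcolor = transparent;
    dpi = 300;
    size = "8.5,11!";
}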

“DOT” is not an abbreviation, just an all–caps name.

Replace graph with digraph and -- with -> to create a directed graph.

The last few lines of the DOT file were used to create an image suitable for inclusion in this book, and are completely optional. bgcolor = transparent; sets the background to be transparent rather than a fixed color. dpi = 300; sets the dots per inch to use to calculate the resulting image resolution. size = “8.5,11!”; creates an image scaled up to where either the width is 8.5", the height is 11", or both, at the given DPI.

The Graphviz package contains the dot tool which converts DOT files (usually with a .gv extension) to images:
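
The exact invocation is not shown here; a typical command, assuming the filename above, would be:

$ dot -Tpng ./architecture.gv > ./architecture.png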

-Tpng specifies the output format as Portable Network Graphics, a popular bitmap image format. Other output formats are documented in man dot. -Tsvg outputs Scalable Vector Graphics, probably the most popular vector image format.

The resulting image: [image]

As you can see, architecture.gv contains absolutely no directives about how the graph should be presented, only the nodes and their connections. The layout is created by dot, which tries to position the nodes in a sensible way automatically so that there is minimal or no overlap in nodes or connections. Which in turn means you can create simple (and even not so simple) graphs really quickly and even programmatically. A common use case is generating diagrams such as package dependencies as part of a CI/CD pipeline to include in documentation and as a developer aid.

Graphviz also comes with neato, which takes the same command line options as dot but uses a different layout algorithm. If you’re generating complex graphs I would recommend comparing the outputs of both to see which one works best. If neither of them creates usable graphs you can still salvage the situation by splitting the graph, creating subgraphs, or tweaking the graph properties.

Even though the settings allow it, I would not recommend manually positioning the nodes. At that point most of the usefulness of the layout tools is gone, and you might as well use a manual graphing tool.

The Wikipedia article does a good job of going through the basics of the DOT language, and the online documentation is a good technical reference.

A heavily interconnected graph can become unreadable very quickly, at maybe 10 nodes or so. Lightly connected nodes such as a typical package dependency graph can easily go into hundreds of nodes before it’s unreadable, since most of the time we’re only interested in part of the graph rather than the overall structure. Judicious use of grouping, different formats for different types of items, and other advanced settings and techniques can make an otherwise impenetrable graph much more useful.

Metadata

It is common for digital cameras to embed metadata such as date, time, location and camera model in the files they create in the form of an Exif header. This information can be useful for editing tools and web sites, but sometimes you might want to inspect or even change it locally.

By default the exif tool simply prints all the Exif “tags” (key/value pairs) it finds in the file, if any:

$ exif ./image.jpg
EXIF tags in './image.jpg' ('Intel' byte order):
--------------------+----------------------------------------------------------
Tag                 |Value
--------------------+----------------------------------------------------------
Manufacturer        |Canon
Model               |Canon EOS 500D
[…]
Focal Plane Resoluti|Inch
GPS Tag Version     |2.2.0.0
--------------------+----------------------------------------------------------
EXIF data contains a thumbnail (10716 bytes).

A common use case is adding or replacing copyright information in an image. Let’s say we’re starting with a default copyright value:

$ exif --tag=Copyright --machine-readable ./image.jpg
[None] (Photographer) - [None] (Editor)

We can set this to a more useful value and verify the result:

$ full_name="$(getent passwd "$USER" | cut --delimiter=: --fields=5)"
$ exif --ifd=0 --output=./new.jpg --set-value="Copyright ${full_name}" \
> --tag=Copyright ./image.jpg
Wrote file './new.jpg'.
$ exif --tag=Copyright --machine-readable ./new.jpg
Copyright Victor Engmark (Photographer) - [None] (Editor)

I believe the mandatory --ifd=0 parameter denotes a part of the Exif header, but I don’t know the details.

The “Editor” part of the “Copyright” tag is a bit of a mystery. It doesn’t show up in exif --list-tags ./image.jpg | grep Editor.

Archiving and Compression

Storage has never been cheaper, but modern systems often use so much of it that the costs can still be a significant part of maintenance and use. So to make the most of the space we have and to save money it’s useful to know our way around the standard tools. There are heaps of tools and formats available, and we’ll look at the most common ones.

tar

This is by far the most used compression tool on Linux, and is famously difficult to use.

This may have something to do with the huge range of options available and the fact that some tar implementations only support short options like -t , which lists (!) the files within a tarball, resulting in obtuse shorthands like tar -zxvf. Fortunately the long option names are obvious, so you’ll be disarming nukes in no time.

To create a tarball: tar --create --file=./backup.tar.gz --gzip ./project

--create is the flag to create a new archive.

--file=FILE is the key/value option to set the archive file we’re working with. .tar.gz is the conventional extension, since the resulting file is a tarball contained within a gzip archive.

--gzip specifies that the file is to be compressed — this is not the default!

To add other files and directories to the tarball, simply enter them after ./project.

After running, ./backup.tar.gz will contain the project directory and all the files in it, recursively.

As you can see, tar doesn’t have bareword subcommands like Git’s commit or pull — subcommands are just another set of flags. The main purpose of tar isn’t actually compression, but rather creating tarballs, an evocative name for a collection of files placed back–to–back within another file.

The other things you may want to do with archives is to list files and extract:

To list the files in a tarball: tar --list --file=./backup.tar.gz --gzip

And to extract a tarball: tar --extract --file=./backup.tar.gz --gzip

The --gzip flag in the listing and extraction commands is not strictly speaking necessary: in a test run, tar --extract --file=./backup.tar.gz did the same thing as the command above. It even worked after renaming the file to backup.tar, so tar must be looking at what are literally called magic patterns within the file to determine what type it is. Such shortcuts can save you seconds when writing the command — at the expense of risking minutes or hours of debugging.

The only other flag you are likely to use often is --verbose, which prints the paths as they are added to or extracted from the tarball.

Streaming files to another machine

tar --create . | ssh example.org tar --directory=/tmp --extract will recursively copy the current directory into the /tmp directory on example.org. This pattern is useful in lots of situations:

  • the transfer needs to be automated

  • physical access to the source or target machine is restricted, for example on a third–party cloud

  • an archive of the data and the decompressed files are too big to fit on the filesystem at the same time

  • you can’t or don’t want to install extra tools or services on either machine, especially in case of a one–time transfer

  • speed is of the essence (for example, the target machine could start processing files immediately rather than having to wait for the user to mount some external storage device, copy the files across, unmount, go to the other machine, mount the device and start copying)

Sometimes sneakernet is appropriate, but networks today have so much bandwidth that the above could be the fastest and easiest solution. If you’re moving around lots of data it’s probably worth comparing them. If the data is highly compressible it might also be faster to add -C to the ssh command to enable gzip compression.
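
For example (note that -C belongs to ssh, not tar):

$ tar --create . | ssh -C example.org tar --directory=/tmp --extract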

Plain gzip

One .gz file corresponds to one decompressed file. This is why files are often tarballed first – it’s just easier to deal with a single file than several, especially if you want to retain the directory structure. Raw gzip comes up rarely, but when it does the relevant tools are gzip to compress and gunzip to decompress. If you’re dealing with a stream of compressed data you’ll want to use the --stdout flag with both commands to send the streams to standard output rather than creating intermediate files.
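
A quick round trip demonstrating stream usage:

$ printf '%s\n' 'some data' | gzip --stdout | gunzip --stdout
some data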

Zip

This compressed format has been supported by the built–in file manager in Windows for decades, so it is still commonly found on that platform.

To create a zip file: zip --recurse-paths backup.zip ./project

To list the files in a zip file: unzip -l backup.zip

And to extract a zip file: unzip backup.zip

Other formats

There are heaps of other formats as well, but most of them use a similar command structure to tar and gzip. Tools for 7z, bzip2, rar, xz and more compressed file formats should all be available in your package manager. Memory use, processing time, compression ratios and features vary by several orders of magnitude depending on which format, tool and options you use. If you have particular needs I would recommend researching the available tools to get the best one for the job at hand.