PATHIE.
=======

This is the Pathie project. It aims to provide a C++ library that covers
all needs of pathname manipulation and filename fiddling, without
having to worry about the underlying platform. That is, it is a glue
library that allows you to create platform-independent filename
handling code with special regard to Unicode path names.

Supported systems
-----------------

Currently supported platforms are Linux and Windows, the latter via
MSYS2 GCC. Any other compiler or system might or might not work. Mac
OS should work as well, but I cannot test this due to lack of a Mac. I
gladly accept contributions for any system or compiler.

Pathie's source code itself is written conforming to C++98. On UNIX
systems, it assumes the system supports POSIX.1-2001. On Windows
systems, the minimum supported Windows version is Windows Vista.

Installation
------------

See INSTALL.md.

The library
-----------

The entire world is using UTF-8 as the primary Unicode encoding. The
entire world? No, a little company from Redmond resists the temptation
and instead uses UTF-16LE, causing cross-platform handling of Unicode
paths to be a nightmare.

One of the main problems the author ran into was compiler-dependent
code that was not marked as such. Many sites on the Internet claim
Unicode path handling on Windows is easy, but in fact, it only is if
you define “development for Windows” as “development with MSVC”,
Microsoft’s proprietary C/C++ compiler, which provides nonstandard
interfaces to allow for handling UTF-16LE filenames. The Pathie
library has been developed with a focus on MinGW and cross-compilation
from Linux to Windows and thus does not suffer from this problem.

The Pathie library has been developed to release the programmer from
the burden of handling the different encodings in use for filenames,
and does so by focusing its API on UTF-8 regardless of the platform in
use. Thus, if you use UTF-8 as your preferred encoding inside your
program (take a look at the [UTF8 Everywhere
website](http://www.utf8everywhere.org) for reasons why you should do
that), Pathie will be of the most use for you, since it transparently
converts whatever filesystem encoding is encountered to UTF-8 in its
public interface. Likewise, any pathname you pass to the library is
assumed to be UTF-8 and is transcoded transparently to the filesystem
encoding before invoking the respective OS's filesystem access
methods. Of course, explicit conversion functions are also provided,
in case you do need a string in the native encoding or need to
construct a path from a string in the native encoding.
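
To give an idea of what this transcoding involves, here is a minimal,
self-contained sketch that turns UTF-8 into UTF-16 code units. It is
not Pathie's implementation: the function name `utf8_to_utf16` is made
up for this illustration, it needs C++11 for `std::u16string`, and its
validation is deliberately incomplete.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative helper, not part of Pathie: converts UTF-8 to UTF-16 code
// units. Validation is minimal; overlong forms are not rejected.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::u16string result;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t codepoint = 0;
        std::size_t extra = 0; // number of continuation bytes that follow

        if (lead < 0x80)                { codepoint = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { codepoint = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { codepoint = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { codepoint = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= utf8.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            codepoint = (codepoint << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (codepoint > 0x10FFFF)
            throw std::runtime_error("code point out of Unicode range");

        if (codepoint < 0x10000) {
            // Fits into a single UTF-16 code unit.
            result.push_back(static_cast<char16_t>(codepoint));
        } else {
            // Needs a surrogate pair.
            codepoint -= 0x10000;
            result.push_back(static_cast<char16_t>(0xD800 + (codepoint >> 10)));
            result.push_back(static_cast<char16_t>(0xDC00 + (codepoint & 0x3FF)));
        }
    }
    return result;
}
```

With Pathie you normally never write such a function yourself, since
the conversion happens behind its UTF-8 interface.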

General Usage
-------------

The first thing to do is to include the main header:

~~~~~~~~~~~~~~~~~~{.cpp}
#include <pathie/path.hpp>
~~~~~~~~~~~~~~~~~~

Now consider the simple task of getting all children of a directory
that have Unicode filenames. Doing that manually will result in you
having to convert between UTF-8 and UTF-16 all the time. With Pathie,
you can just do this:

~~~~~~~~~~~~~~~~~~~{.cpp}
std::vector<Pathie::Path> children = your_path.children();
~~~~~~~~~~~~~~~~~~~

Done. Retrieving the parent directory of your directory is pretty easy:

~~~~~~~~~~~~~~~~~~~{.cpp}
Pathie::Path yourpath("foo/bar/baz");
Pathie::Path parent = yourpath.parent();
~~~~~~~~~~~~~~~~~~~

But Pathie is much more than just an abstraction of different filepath
encodings. It is a utility library for pathname manipulation, i.e. it
allows you to do things like finding the parent directory, expanding
relative to absolute paths, decomposing a filename into basename,
dirname, and extension, and so on. See the documentation of the
central Pathie::Path class for what you can do.

~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Assume current directory is /tmp
Pathie::Path p("foo/bar/../baz");
p.expand(); // => /tmp/foo/baz
~~~~~~~~~~~~~~~~~~~~~~
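
To illustrate what such an expansion means, the following sketch
resolves a path against a given working directory purely lexically.
The helper `expand_lexically` is hypothetical and not part of Pathie;
it assumes `/` as the separator and never looks at symlinks, so the
real `expand()` may well behave differently in such cases.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of Pathie: makes `path` absolute relative
// to `cwd` and resolves "." and ".." components without touching the
// filesystem.
std::string expand_lexically(const std::string& cwd, const std::string& path)
{
    const bool is_absolute = !path.empty() && path[0] == '/';
    const std::string full = is_absolute ? path : cwd + "/" + path;

    std::vector<std::string> parts;
    std::istringstream stream(full);
    std::string component;
    while (std::getline(stream, component, '/')) {
        if (component.empty() || component == ".")
            continue;                  // skip "//" and "."
        if (component == "..") {
            if (!parts.empty())
                parts.pop_back();      // go up one level, but never above "/"
        } else {
            parts.push_back(component);
        }
    }

    std::string result;
    for (std::size_t i = 0; i < parts.size(); ++i)
        result += "/" + parts[i];
    return result.empty() ? "/" : result;
}
```

Called as `expand_lexically("/tmp", "foo/bar/../baz")`, this sketch
yields `/tmp/foo/baz`, matching the example above.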

Or my personal favourite:

~~~~~~~~~~~~~~~~~~~{.cpp}
Pathie::Path p1("/tmp/foo/bar");
Pathie::Path p2("/tmp/bar/foo");
Pathie::Path p3 = p1.relative(p2); // => ../../foo/bar
~~~~~~~~~~~~~~~~~~~
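
The idea behind such a relative path can be sketched in a few lines:
drop the components both paths share, climb up once for every
remaining component of the base, then descend into what is left of the
target. The function `relative_to` below is a hypothetical stand-alone
helper, not Pathie's code, and it assumes two absolute, already
normalised paths with `/` as the separator.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits a path at '/' and drops empty components.
static std::vector<std::string> split_components(const std::string& path)
{
    std::vector<std::string> parts;
    std::istringstream stream(path);
    std::string component;
    while (std::getline(stream, component, '/'))
        if (!component.empty())
            parts.push_back(component);
    return parts;
}

// Hypothetical helper, not part of Pathie: expresses `target` relative
// to the directory `base`.
std::string relative_to(const std::string& target, const std::string& base)
{
    const std::vector<std::string> t = split_components(target);
    const std::vector<std::string> b = split_components(base);

    // Length of the common leading part of both paths.
    std::size_t common = 0;
    while (common < t.size() && common < b.size() && t[common] == b[common])
        ++common;

    std::string result;
    // One ".." for each component of `base` below the common part.
    for (std::size_t i = common; i < b.size(); ++i)
        result += result.empty() ? ".." : "/..";
    // Then the components of `target` below the common part.
    for (std::size_t i = common; i < t.size(); ++i) {
        if (!result.empty())
            result += "/";
        result += t[i];
    }
    return result.empty() ? "." : result;
}
```

For the two paths from the example, `relative_to("/tmp/foo/bar",
"/tmp/bar/foo")` gives `../../foo/bar`.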

It also provides you with commonly used paths like the user’s
configuration directory or the path to the running executable.

~~~~~~~~~~~~~~~~~~~~{.cpp}
Pathie::Path configdir = Pathie::Path::config_dir();
Pathie::Path exepath = Pathie::Path::exe();
~~~~~~~~~~~~~~~~~~~~

Pathie assumes that all string arguments passed are in UTF-8 and
transparently converts to the native filesystem encoding internally.

Still, if you interface directly with the Windows API or other external
libraries, you might want to retrieve the native representation from a
Path or construct a Path from the native representation. Pathie
doesn’t want to be in your way then. The following example constructs
from and converts to the native representation on Windows, which is
UTF-16LE:

~~~~~~~~~~~~~~~~~~~~{.cpp}
// Construct from native
wchar_t* utf16 = Win32ApiCall();
Pathie::Path mypath = Pathie::Path::from_native(utf16); // also accepts std::wstring

// Retrieve native (Note C++’ish std::wstring rather than
// raw wchar_t* on Windows)
std::wstring native_utf16 = mypath.native();
~~~~~~~~~~~~~~~~~~~~

On UNIX, these methods work with normal strings (std::string instead
of std::wstring) in the underlying filesystem encoding. In most cases,
that will be UTF-8, but some legacy systems may still use something
like ISO-8859-1, in which case the encoding will differ.

### Temporary files and directories

There are two classes, `Pathie::Tempdir` and `Pathie::Tempfile`, that
you can use if you need to work with temporary directories or files,
respectively. Constructing instances of these classes creates a
temporary entry, which is removed (recursively in case of directories)
when the instance is destroyed again. Use TempEntry::path() to get
access to the Path instance pointing to the created entry.

~~~~~~~~~~~~~~~~~~~~{.cpp}
#include <cstdlib>  // srand()
#include <ctime>    // time()
#include <iostream>
#include <pathie/tempdir.hpp>

//...

{
  srand(time(NULL)); // Needs a seeded random number generator
  Pathie::Tempdir tmpdir("foo"); // Pass a fragment to use as part of filename
  std::cout << "Temporary dir is: " << tmpdir.path() << std::endl;
}
// When `tmpdir' is destroyed, the destructor recursively
// deletes the directory that was created.
~~~~~~~~~~~~~~~~~~~~
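
The pattern behind these classes is plain RAII: create the entry in
the constructor, remove it in the destructor. The following sketch
shows that pattern for POSIX systems only. It is not how Pathie does
it: the class `ScopedTempDir` and the helper `is_directory` are
invented for this illustration, the name comes from POSIX `mkdtemp()`
rather than from `rand()`, and removal is not recursive.

```cpp
#include <cstdlib>     // std::getenv
#include <stdexcept>
#include <string>
#include <vector>
#include <stdlib.h>    // mkdtemp() (POSIX.1-2008)
#include <sys/stat.h>  // stat()
#include <unistd.h>    // rmdir()

// Hypothetical RAII wrapper, not Pathie::Tempdir: the directory lives
// exactly as long as the object.
class ScopedTempDir {
public:
    explicit ScopedTempDir(const std::string& fragment)
    {
        const char* base = std::getenv("TMPDIR");
        const std::string pattern =
            std::string(base ? base : "/tmp") + "/" + fragment + ".XXXXXX";

        // mkdtemp() rewrites the trailing XXXXXX in place, so it needs a
        // writable, NUL-terminated buffer.
        std::vector<char> buffer(pattern.begin(), pattern.end());
        buffer.push_back('\0');
        if (!mkdtemp(&buffer[0]))
            throw std::runtime_error("mkdtemp() failed");
        m_path = &buffer[0];
    }

    // Unlike Pathie::Tempdir, this only removes an empty directory.
    ~ScopedTempDir() { rmdir(m_path.c_str()); }

    const std::string& path() const { return m_path; }

private:
    ScopedTempDir(const ScopedTempDir&);            // non-copyable
    ScopedTempDir& operator=(const ScopedTempDir&); // (C++98 style)
    std::string m_path;
};

// Small helper to check the result from the outside.
bool is_directory(const std::string& path)
{
    struct stat st;
    return stat(path.c_str(), &st) == 0 && S_ISDIR(st.st_mode);
}
```

Inside a block, `ScopedTempDir dir("foo");` creates the directory, and
leaving the block removes it again.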

### Opening a file with a Unicode path name

On Windows with GCC, it is [not possible to open a file with a Unicode
pathname](https://stackoverflow.com/questions/821873) via C++'s usual
`std::ifstream` and `std::ofstream` mechanism. There's a nonstandard
extension provided by Microsoft's proprietary compiler that does this,
but GCC does not have this extension. Consequently, code that is
intended to compile on GCC (like Pathie) has to avoid it.

There *is* however a function in Microsoft's C runtime on Windows that
allows you to open a file with a Unicode pathname *and* that returns a
standard C `FILE*` handle,
[_wfopen()](http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx). The
method Path::fopen() uses this function on Windows and a regular C
`fopen()` on all other platforms, thus allowing you to just deal with
your Unicode filename via the regular C I/O interface. If you urgently
need C++ I/O streams, read on.
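
A rough sketch of that dispatch, for readers curious how it can look,
follows. It is not taken from Pathie's source: `fopen_utf8` and
`widen` are hypothetical names, and the non-Windows branch simply
assumes the filesystem encoding is UTF-8, which Pathie itself does not
assume.

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <vector>
#include <wchar.h>
#include <windows.h>

// Converts UTF-8 to UTF-16 with the Win32 function MultiByteToWideChar().
static std::wstring widen(const std::string& utf8)
{
    // First call: ask for the required length, including the final NUL.
    int count = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
    if (count <= 0)
        return std::wstring();
    std::vector<wchar_t> buffer(count);
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &buffer[0], count);
    return std::wstring(&buffer[0]);
}
#endif

// Hypothetical helper, not Pathie's code: opens a file whose name is
// given in UTF-8 and returns a plain C FILE* (NULL on failure).
std::FILE* fopen_utf8(const std::string& path, const std::string& mode)
{
#ifdef _WIN32
    return _wfopen(widen(path).c_str(), widen(mode).c_str());
#else
    // Assumption of this sketch: the filesystem encoding is UTF-8.
    return std::fopen(path.c_str(), mode.c_str());
#endif
}
```

The returned handle works with `fputs()`, `fread()`, `fclose()` and
the rest of the C I/O functions on every platform.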

### Stream replacements

Pathie mainly provides you with the means to handle paths, compose,
and decompose them. There is an experimental feature however that
provides replacements for C++ file streams that work with instances of
Pathie::Path instead of strings for opening a file. These replacements
are neither elegant nor portable: they do not honour the template
concept the STL is based on, because they directly subclass the
standard streams in the manner needed most frequently, and they
additionally rely on vendor-specific details. For GCC, an internal
(but at least documented) interface is used to exchange the file
descriptor inside a stream, and for MSVC, a nonstandard (but
documented) constructor is used. Other compilers are not supported by
this feature (which most notably affects clang, where I have no idea
which interfaces I need to use for such a trick).

In a word, these replacements are hacky and I consider them
experimental. If that does not strike you as problematic, you can
enable this feature by passing `-DPATHIE_BUILD_STREAM_REPLACEMENTS=ON`
when invoking `cmake` during the build process.

In order to use the replacements, include the respective header
(either `pathie_ifstream` or `pathie_ofstream`) and use the
`Pathie::ifstream` and `Pathie::ofstream` classes just like you would
use `std::ifstream` and `std::ofstream`, with the only difference
being that you construct them from a Pathie::Path instance instead of
a string. See the documentation of Pathie::ofstream for more
information.

~~~~~~~~~~~~~~~~~{.cpp}
#include <pathie/pathie_ofstream>

// ...

Pathie::Path p("Bärenstark.txt");
Pathie::ofstream file(p);
file << "Some content" << std::endl;
file.close();
~~~~~~~~~~~~~~~~~

There's also the unofficial
[boost::nowide](http://cppcms.com/files/nowide/html/), which is
similar to this feature and maybe more reliable. It has [recently been
accepted into
boost](https://lists.boost.org/boost-announce/2017/06/0516.php).

Dependencies and linking
------------------------

Pathie is standalone, that is, it requires no other libraries except
for those provided by your operating system. Note that there’s a
caveat on Windows: the system does provide the `Shlwapi` library by
default, but MinGW's GCC does not link it in automatically. Be sure
to link to this library explicitly when compiling for MinGW Windows
by appending `-lShlwapi` to the end of your linking command line.

It is recommended to link in Pathie as a dynamic library, because
there are some problems with it when linked statically on certain
operating systems (see _Caveats_ below). If you are sure you aren’t
affected by those problems, it is possible to link in Pathie
statically.

Caveats
-------

This library assumes that under all UNIX systems out there (I also
consider Mac OS X to be a UNIX system) the file system root always is
`/` and the directory separator also always is `/`. This structure is
mandatory as per POSIX -- in POSIX.1-2008, it’s specified in section
10.1. Systems that neither follow the POSIX directory structure nor
are Windows are unsupported.

On POSIX-compliant systems other than Mac OS X, the filesystem
encoding [generally is
unspecified](https://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux).
Pathnames are merely byte blobs which do not contain NUL bytes, and
components are separated by `/`. It’s up to the applications,
including utilities like a shell or the ls(1) program, to make
something of those byte streams. Therefore, it is perfectly possible
that on one system, user A uses ISO-8859-1 filenames and user B uses
UTF-8 filenames. Even the same user could use differently encoded
filenames. Programs that have to interpret the byte blobs in pathnames
on these systems look at the locale environment variables, namely
`LANG` and `LC_ALL`; see section 7 of POSIX.1-2008. As a consequence,
you may want to create filenames with characters that are not
supported in the user’s pathname encoding. For example, if you want to
create a file with a Hebrew filename and the user’s pathname encoding
is ISO-8859-1, there’s a problem: ISO-8859-1 has no Hebrew characters
in it, whereas UTF-8, which is the encoding you are advised to use and
which is what Pathie’s API expects from you, does. There is no
sensible solution to this problem that the Pathie library could
dictate; the `iconv()` function used by Pathie just replaces
characters that are unavailable in the target encoding with a
system-defined default (probably “?”). Note that on systems which
have a Unicode pathname encoding, especially modern Linuxes with
UTF-8, such a situation can’t ever arise, because the Unicode
encodings (UTF-*) cover all characters you can ever use.
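
For reference, a conversion with POSIX `iconv()` can be sketched as
follows. This is not Pathie's code, and `transcode` is a hypothetical
helper. What happens to characters missing from the target encoding is
up to the iconv implementation; the sketch reports every conversion
failure as an exception instead of guessing.

```cpp
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <iconv.h>

// Hypothetical helper, not part of Pathie: converts `input` from the
// encoding `from` to the encoding `to` using POSIX iconv().
std::string transcode(const std::string& input, const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from); // note the order: target first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open() failed");

    std::string in(input);             // iconv() wants a non-const char**
    char* inptr = &in[0];
    std::size_t inleft = in.size();

    std::string out;
    char buffer[256];
    while (inleft > 0) {
        char* outptr = buffer;
        std::size_t outleft = sizeof(buffer);
        const std::size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
        const int err = errno;         // save before anything can change it
        out.append(buffer, sizeof(buffer) - outleft);

        // E2BIG only means the output buffer is full; keep looping.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv() could not convert the input");
        }
    }
    // A stateful target encoding would need a final flush call here;
    // that is omitted in this sketch.
    iconv_close(cd);
    return out;
}
```

Converting the UTF-8 string `"B\xC3\xA4r"` to ISO-8859-1 this way
gives the three bytes `B`, `0xE4`, `r`.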

At least on FreeBSD, calling the POSIX `iconv()` function fails with
the cryptic error message “Service unavailable” if a program is linked
statically. I’ve reported [a bug on
this](https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=196567). This
means that you currently can’t link in Pathie statically on FreeBSD
and other systems which don’t allow statically linked executables to
call `iconv()`.

On Linux systems, it is recommended to set your program’s locale to the
environment’s locale before you call any functions the Pathie library
provides, because this will allow Pathie to use the correct encoding
for filenames. This is relevant where the environment’s encoding is
not UTF-8, e.g. with `$LANG` set to `de_DE.ISO-8859-1`. You can do this
as follows (the `""` locale always refers to the locale of the
environment):

~~~~~~~~~~~~~~~~~~~~~{.cpp}
#include <locale>
std::locale::global(std::locale(""));
~~~~~~~~~~~~~~~~~~~~~
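
To see which encoding the environment's locale actually implies, a
small POSIX-only sketch like the following can help. It assumes the
codeset is looked up with `nl_langinfo(CODESET)`; whether Pathie
queries it exactly this way is not stated here, and
`environment_codeset` is a made-up name. It uses `std::setlocale()`,
the C-level counterpart of the call shown above.

```cpp
#include <clocale>
#include <string>
#include <langinfo.h>  // nl_langinfo(), CODESET (POSIX)

// Hypothetical helper, not part of Pathie: returns the name of the
// character encoding of the environment's locale.
std::string environment_codeset()
{
    // Adopt the environment's locale. If this fails, setlocale() returns
    // NULL and the previous locale (by default "C") stays active.
    std::setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}
```

With `$LANG` set to `de_DE.ISO-8859-1` and that locale installed, the
function should report an ISO-8859-1 codeset; without the
`setlocale()` call, a program starts out in the `"C"` locale and
reports that locale's codeset no matter how the environment is set up.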

This is not required on Windows or Mac OS X, because these
operating systems always use UTF-16LE (Windows) or UTF-8 (Mac OS X) as
the filesystem encoding, regardless of the user's locale. Calling it
does no harm there, however; it simply makes no difference for
Pathie on these systems. If you urgently need to avoid this call on
Linux, you need to compile Pathie with the special build option
`PATHIE_ASSUME_UTF8_ON_UNIX`, which will force Pathie to assume that
UTF-8 is used as the filesystem encoding under any UNIX-based system.

Links
-----

* Project page: https://www.guelkerdev.de/projects/pathie/
* GitHub mirror: https://github.com/Quintus/pathie-cpp
* Issue tracker: https://github.com/Quintus/pathie-cpp/issues

Contributing
------------

Feel free to submit any contributions you deem useful. Try to make
separate branches for your new features, give a description on what
you changed, etc.

Don’t you duplicate boost::filesystem?
-------------------------------------

Yes and no.
[boost::filesystem](http://www.boost.org/doc/libs/1_56_0/libs/filesystem/doc/index.htm)
provides many methods Pathie provides, but has a major problem with
Unicode path handling if you are not willing to do the UTF-8/UTF-16
conversion manually. boost::filesystem always uses UTF-8 to store the
paths on UNIX, and, which is the problem, always uses UTF-16LE to
store the paths on a Windows system. There is no way to override
this, although there is a [hidden documentation
page](http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/default_encoding_under_windows.html)
that claims to solve the problem. I have wasted a great amount of time
trying to persuade boost::filesystem to automatically convert all
`std::string` input it receives into UTF-16LE, but failed. Each time
I wanted to create a file with a Unicode filename, the test failed on
Windows by producing garbage filenames. Finally I found out that the
neat trick shown in the documentation above indeed does work -- but
only if you use the Microsoft Visual C++ compiler (MSVC) to compile
your code. I don’t; I generally use g++ via the
[MinGW](http://www.mingw.org) toolchain. boost::filesystem fails with
g++ via MinGW with regard to Unicode filenames on Windows as of this
writing (September 2014).

Apart from that, Pathie provides some additional methods, especially
with regard to finding out where the user’s paths are. It is modelled
after Ruby’s popular
[Pathname](http://ruby-doc.org/stdlib-2.1.2/libdoc/pathname/rdoc/Pathname.html#method-i-rmtree)
class, but it doesn’t entirely duplicate its interface (which wouldn’t
be idiomatic C++).

Also, Pathie is a small library. Adding it to your project shouldn’t
hurt too much, while boost::filesystem is quite a large dependency.

License
-------

Pathie is BSD-licensed; see the file “LICENSE” for the exact license
conditions.