Discussion: inode numbers on ZFS
Joerg Schilling
2005-10-11 22:54:34 UTC
ZFS is a 128 bit filesystem, isn't it?

So I hope it uses 128 bit inode numbers too.....
but it should at least use 64 bit for inode numbers.

Now what happens when a 32-bit application calls stat(2) on a file
whose inode number is outside the 32-bit range? Will this cause
stat(2) to return an EOVERFLOW error to the 32-bit application?

Jörg
--
Jörg Schilling, D-13353 Berlin
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Casper Dik
2005-10-12 07:30:22 UTC
Post by Joerg Schilling
ZFS is a 128 bit filesystem, isn't it?
So I hope it uses 128 bit inode numbers too.....
but it should at least use 64 bit for inode numbers.
Now what happens when a 32-bit application calls stat(2) on a file
whose inode number is outside the 32-bit range? Will this cause
stat(2) to return an EOVERFLOW error to the 32-bit application?
Depends on whether it's large file aware or not, I'd say.

(the ino field in stat64 is 64 bits)
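
For illustration, a rough (untested) sketch of what a 32-bit program
compiled without largefile support would see; the error handling is just
an example:

#include <sys/stat.h>
#include <errno.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	struct stat sb;

	if (argc < 2) {
		(void) fprintf(stderr, "usage: %s file\n", argv[0]);
		return (2);
	}
	/*
	 * In a 32-bit app built without largefile support, st_ino is
	 * only 32 bits wide, so stat() has to fail if the file's inode
	 * number (or size) does not fit.
	 */
	if (stat(argv[1], &sb) == -1) {
		if (errno == EOVERFLOW)
			(void) printf("%s: value too large for "
			    "32-bit stat()\n", argv[1]);
		else
			perror(argv[1]);
		return (1);
	}
	(void) printf("%s: ino = %lu\n", argv[1], (unsigned long)sb.st_ino);
	return (0);
}

Compiling the same code with -D_FILE_OFFSET_BITS=64 (or calling stat64()
directly) gives a 64-bit st_ino, so the call succeeds.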

Casper
Joerg Schilling
2005-10-12 07:38:33 UTC
Post by Casper Dik
Post by Joerg Schilling
Now what happens when a 32-bit application calls stat(2) on a file
whose inode number is outside the 32-bit range? Will this cause
stat(2) to return an EOVERFLOW error to the 32-bit application?
Depends on whether it's large file aware or not, I'd say.
(the ino field in stat64 is 64 bits)
Thank you, I was confused by the #ifdef KERNEL in stat.h and
was checking the kernel part.....

Jörg
Daniel Rock
2005-10-12 16:35:56 UTC
Post by Casper Dik
Post by Joerg Schilling
ZFS is a 128 bit filesystem, isn't it?
Depends on whether it's large file aware or not, I'd say.
(the ino field in stat64 is 64 bits)
So to fully utilize a ZFS file system the average file size has to be 16 EB?
People are already moaning today that on MT-UFS the average file size has to
be 1 MB...
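
(The arithmetic behind the 16 EB figure, reading "128 bit" as a 2^128-byte
address space and assuming at most 2^64 inodes: 2^128 bytes / 2^64 files =
2^64 bytes per file on average, i.e. 16 EB.)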

I hope it is just an interface limitation and that ZFS's internals don't
impose such a limit.


Daniel
Joerg Schilling
2005-10-12 16:42:54 UTC
Post by Daniel Rock
Post by Casper Dik
Post by Joerg Schilling
ZFS is a 128 bit filesystem, isn't it?
Depends on whether it's large file aware or not, I'd say.
(the ino field in stat64 is 64 bits)
So to fully utilize a ZFS file system the average file size has to be 16 EB?
People are already moaning today that on MT-UFS the average file size has to
be 1 MB...
I hope it is just an interface limitation and that ZFS's internals don't
impose such a limit.
The maximum storage size (in case you cannot use a parallel universe to
hold the storage) is approximately 90 bits, i.e. 2^90 bytes.

With that constraint, and if the inode numbers in the on-disk version of
ZFS really were only 64 bits, the average file size needed to fill it
would be roughly 64 MB (2^90 / 2^64 = 2^26 bytes).

Jörg
Sarah Jelinek
2005-10-12 16:41:37 UTC
Daniel,

To clear up a misconception about MTB UFS: the maximum density of inodes
in an MTB UFS filesystem is one inode per megabyte of space. This does
not mean that a megabyte of space is used for every file. It simply
means you cannot have more than a million or so files per terabyte of
storage.

The reason for this is simple: otherwise it could take days or weeks to
fsck the filesystem.

sarah
Post by Daniel Rock
Post by Casper Dik
Post by Joerg Schilling
ZFS is a 128 bit filesystem, isn't it?
Depends on whether it's large file aware or not, I'd say.
(the ino field in stat64 is 64 bits)
So to fully utilize a ZFS file system the average file size has to be
16 EB? People are already moaning today that on MT-UFS the average
file size has to be 1 MB...
I hope it is just an interface limitation and that ZFS's internals
don't impose such a limit.
Daniel
Daniel Rock
2005-10-12 19:01:07 UTC
Post by Sarah Jelinek
Daniel,
To clear up a misconception about MTB UFS: the maximum density of inodes
in an MTB UFS filesystem is one inode per megabyte of space. This does
not mean that a megabyte of space is used for every file. It simply
means you cannot have more than a million or so files per terabyte of
storage.
So to fill up such a file system to 100 percent, your average file size
has to be 1 MB - that's exactly what I said. I didn't say the block size
was raised to 1 MB.

Since directories also consume inodes and are rarely filled with tens of
thousands of files, the average file size on MT-UFS has to be even larger
than 1 MB to fill the whole FS.

Luckily the limitation is only enforced in newfs/mkfs. You just have to redefine
#define MTB_NBPI (MB)
in usr/src/cmd/fs.d/ufs/mkfs/mkfs.c, recompile, (re-)newfs and you're done.
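
Roughly like this - the replacement density below is only an example, and
the surrounding source may of course differ:

/* usr/src/cmd/fs.d/ufs/mkfs/mkfs.c */
/* stock definition: one inode per megabyte on multi-terabyte UFS */
/* #define MTB_NBPI	(MB) */
/* example change: one inode per 128 kB, i.e. eight times as many inodes */
#define MTB_NBPI	(MB / 8)

After rebuilding mkfs, a fresh newfs picks up the new density - at the
price of correspondingly longer fsck runs, which is the reason for the
limit in the first place.
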
Post by Sarah Jelinek
The reason for this is simple: otherwise it could take days or weeks to
fsck the filesystem.
I think this limit is way too high for "general purpose" use:

Filesystem size used avail capacity Mounted on
/dev/md/dsk/d200 309G 288G 17G 95% /export
Filesystem iused ifree %iused Mounted on
/dev/md/dsk/d200 1535475 1037325 60% /export


BTW, a full fsck of the above filesystem ran in ~33 minutes, and fsck's
memory usage rose to 46 MB. Most of the time was spent in Phase 1 (Check
Blocks and Sizes) - see below. So it would have been a wiser decision to
raise the block/fragment size limit to something larger than 8 kByte.

Phase 1 - Check Blocks and Sizes ~24 Minutes
Phase 2 - Check Pathnames ~9 Minutes
Phase 3 - Check Connectivity <1 Second
Phase 4 - Check Reference Counts <1 Second
Phase 5 - Check Cyl groups <1 Minute
1535473 files, 151212388 used, 10602271 free (223755 frags, 2594629 blocks,
0.1% fragmentation)
Total: 33:12.95


Daniel
Peter Tribble
2005-10-12 20:24:17 UTC
Post by Sarah Jelinek
Daniel,
To clear up a misconception about MTB UFS: the maximum density of inodes
in an MTB UFS filesystem is one inode per megabyte of space. This does
not mean that a megabyte of space is used for every file. It simply
means you cannot have more than a million or so files per terabyte of
storage.
Which is nowhere near adequate in all cases:

Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c2t0d0s2 703246224 626354966 69858796 90% /export/data
Filesystem iused ifree %iused Mounted on
/dev/dsk/c2t0d0s2 16688199 67187641 20% /export/data

Note that this filesystem already has more files than a maximum-size
multiterabyte UFS (16 TB -> 16 million inodes).

My understanding is that the argument is that fsck could take
indefinitely long. I know that fsck on this takes about 3 hours,
although you would have to ask what goes wrong to need an fsck.
--
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Eric Schrock
2005-10-12 16:58:24 UTC
ZFS inode numbers are 64 bits. The current implementation restricts this
to a 48-bit usable range, but this is not an architectural restriction.
Future enhancements are planned to extend this to the full 64 bits.

32-bit apps that attempt to stat() a file whose inode number does not fit
in 32 bits will get EOVERFLOW. 64-bit apps and largefile aware apps will
have no problems.

The ZFS object allocation scheme always tries to allocate the lowest
object number first, so you will never have files with greater than
32-bit inode numbers until you have 2^32 files on the system[1].
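
(A toy illustration only - this is not the actual ZFS code - of what
"lowest object number first" means: numbers are handed out densely from
the bottom, and freed numbers are reused.)

#include <stdio.h>

#define NOBJ	16		/* tiny table, purely for illustration */
static int used[NOBJ];

/* hand out the lowest free object number */
static int
alloc_obj(void)
{
	for (int i = 0; i < NOBJ; i++) {
		if (!used[i]) {
			used[i] = 1;
			return (i);
		}
	}
	return (-1);		/* table full */
}

int
main(void)
{
	for (int k = 0; k < 4; k++)
		(void) printf("allocated object %d\n", alloc_obj());
	used[1] = 0;					/* "delete" a file */
	(void) printf("reused object %d\n", alloc_obj());
	return (0);
}

With such a scheme, inode numbers only exceed 32 bits once the number of
objects needed at one time approaches 2^32 (or 75% of that, given the
fuzz factor in [1]).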

There is little expectation that anyone will be able to fill a ZFS
filesystem, ever[2]. There is reasonable expectation, however, that in
the next 10-20 years we will pass the 64-bit limit for some use cases.

Hope that helps.

- Eric

[1] The actual algorithm allows for some "fuzz factor", so this could
theoretically occur at 75% of 2^32 files.

[2] For a complete discussion of these limits, see Jeff's blog:

http://blogs.sun.com/roller/page/bonwick?entry=128_bit_storage_are_you

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
Alan Coopersmith
2005-10-12 17:34:49 UTC
Post by Eric Schrock
ZFS inode numbers are 64 bits. The current implementation restricts this
to a 48-bit usable range, but this is not an architectural restriction.
Future enhancements are planned to extend this to the full 64 bits.
32-bit apps that attempt to stat() a file whose inode number does not fit
in 32 bits will get EOVERFLOW. 64-bit apps and largefile aware apps will
have no problems.
So does this mean 32-bit apps that didn't need to be largefile aware in
the past because they only touched small files now need to become largefile
aware to avoid problems with ZFS if they call stat()? (Granted, they've
already had problems with stat() with out-of-range dates from NFS servers
and other places, but those aren't as common as ZFS will be.)
--
-Alan Coopersmith-
Sun Microsystems, Inc. - X Window System Engineering
Casper Dik
2005-10-12 17:51:49 UTC
Post by Alan Coopersmith
Post by Eric Schrock
ZFS inode numbers are 64 bits. The current implementation restricts this
to a 48-bit usable range, but this is not an architectural restriction.
Future enhancements are planned to extend this to the full 64 bits.
32-bit apps that attempt to stat() a file whose inode number does not fit
in 32 bits will get EOVERFLOW. 64-bit apps and largefile aware apps will
have no problems.
So does this mean 32-bit apps that didn't need to be largefile aware in
the past because they only touched small files now need to become largefile
aware to avoid problems with ZFS if they call stat()? (Granted, they've
already had problems with stat() with out-of-range dates from NFS servers
and other places, but those aren't as common as ZFS will be.)
And xfs filesystems exported from SGI systems....

But as said, only when the number of inodes exceeds 75% of 2^32, or about
3 billion, which for current UFS inode densities would be a 24 TB
filesystem.

But since the typical filesystem only allocates around 25% of inodes
before it fills up, it would be more like a full 100TB before you get to
such huge inode numbers, with filesizes staying what they are.
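
(Rough arithmetic behind those figures, assuming the usual UFS default of
one inode per 8 kB for large filesystems: 0.75 * 2^32 is about 3.2 billion
inodes, and 3.2 billion * 8 kB is roughly 24 TB. If only about a quarter
of the inodes are ever allocated before the space runs out, you need about
four times that much space - on the order of 100 TB - to actually create
that many files.)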

Casper
Alan Coopersmith
2005-10-12 20:37:44 UTC
Post by Casper Dik
But since the typical filesystem only allocates around 25% of inodes
before it fills up, it would be more like a full 100TB before you get to
such huge inode numbers, with filesizes staying what they are.
Okay - I wasn't clear on whether inodes were allocated sequentially or
from all over the available address space.
--
-Alan Coopersmith-
Sun Microsystems, Inc. - X Window System Engineering
Eric Schrock
2005-10-12 17:58:11 UTC
Post by Alan Coopersmith
So does this mean 32-bit apps that didn't need to be largefile aware in
the past because they only touched small files now need to become largefile
aware to avoid problems with ZFS if they call stat()? (Granted, they've
already had problems with stat() with out-of-range dates from NFS servers
and other places, but those aren't as common as ZFS will be.)
Yes, unfortunately this is the case. But it will only affect
filesystems with more than 3 billion files on them. There's not much
that can be done about this - if you want to have more than 2^32 files,
you need more than 32 bits to uniquely identify them. The lightweight
ZFS filesystem model will also reduce this effect, since administrators
will be encouraged to have many filesystems (i.e. one per user) instead
of a single mammoth filesystem (all of /export/home).
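
For example (pool and user names made up), instead of one big /export/home
you would create a dataset per user:

    zfs create tank/home
    zfs create tank/home/alice
    zfs create tank/home/bob

Each user then lives in his or her own filesystem, so no single filesystem
has to hold anywhere near 2^32 files.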

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
Peter Tribble
2005-10-12 20:28:52 UTC
Post by Eric Schrock
Yes, unfortunately this is the case. But it will only affect
filesystems with more than 3 billion files on them. There's not much
that can be done about this - if you want to have more than 2^32 files,
you need more than 32 bits to uniquely identify them. The lightweight
ZFS filesystem model will also reduce this effect, since administrators
will be encouraged to have many filesystems (i.e. one per user) instead
of a single mammoth filesystem (all of /export/home).
How far has this been tested?

I know I tested it, just to see how well it worked, about 6 months ago.
On a fairly small machine, 10,000 filesystems was starting to get
"interesting".

I just wonder, seeing as we would need about 40,000 filesystems under
this model.
--
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Eric Schrock
2005-10-12 20:55:07 UTC
Post by Peter Tribble
How far has this been tested?
I know I tested it, just to see how well it worked, about 6 months ago.
On a fairly small machine, 10,000 filesystems was starting to get
"interesting".
I just wonder, seeing as we would need about 40,000 filesystems under
this model.
If I remember correctly, most of your problems were related to
performance in these situations (lots of filesystems). Much work has
gone into improving performance; I don't know for a fact whether we've
tried the 40,000-filesystem model. Right now the priority is getting ZFS
out the door, and our performance efforts are focused on getting
individual filesystems to perform well. I can say for a fact that there
is a lot of low-hanging fruit in the administration tools to make
various operations (listing, deletion, etc.) go faster. It's just not
something we've been able to focus our efforts on.

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
Joerg Schilling
2005-10-12 17:50:49 UTC
Post by Eric Schrock
There is little expectation that anyone will be able to fill a ZFS
filesystem, ever[2]. There is reasonable expectation, however, that in
the next 10-20 years we will pass the 64-bit limit for some use cases.
Do you believe that there are currently already systems with 2000 TB?

During the past 17 years, the capacity of a single 3.5" disk has
increased by a factor of 2000 (a factor of about 1.57 per year). At that
rate, in 20 years the capacity of a single disk will increase by a
factor of ~8000.


Jörg
Eric Schrock
2005-10-12 18:08:56 UTC
Yes, there are multi-petabyte systems out there. Though you may
disagree, I personally don't think it's unreasonable to expect such
filesystems to pass the 16 exabyte range within the next 20 years.
Neither did the ZFS designers, hence the 128-bit capability.

Note that we are talking about filesystems, not individual disks. ZFS
filesystems can span any number of disks, just as you could achieve by
layering on top of a volume manager or through a distributed filesystem.
Besides just being flat out larger, the growth rate of filesystem size
is not directly proportional to the growth rate of disks.

- Eric
Post by Joerg Schilling
Post by Eric Schrock
There is little expectation that anyone will be able to fill a ZFS
filesystem, ever[2]. There is reasonable expectation, however, that in
the next 10-20 years we will pass the 64-bit limit for some use cases.
Do you believe that there are currently already systems with 2000 TB?
During the past 17 years, the capacity of a single 3.5" disk has
increased by a factor of 2000 (a factor of about 1.57 per year). At that
rate, in 20 years the capacity of a single disk will increase by a
factor of ~8000.
--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
Joerg Schilling
2005-10-13 12:31:51 UTC
Post by Eric Schrock
Yes, there are multi-petabyte systems out there. Though you may
disagree, I personally don't think it's unreasonable to expect such
filesystems to pass the 16 exabyte range within the next 20 years.
Neither did the ZFS designers, hence the 128-bit capability.
Note that we are talking about filesystems, not individual disks. ZFS
filesystems can span any number of disks, just as you could achieve by
layering on top of a volume manager or through a distributed filesystem.
Besides just being flat out larger, the growth rate of filesystem size
is not directly proportional to the growth rate of disks.
So people would need 20,000 disks then....

The biggest FS I know of is 100 TB, built with Linux and ATA disks.

AFAIR, they have to replace 1-3 disks per day....

Jörg
Bill Sommerfeld
2005-10-12 18:53:43 UTC
Post by Eric Schrock
There is little expectation that anyone will be able to fill a ZFS
filesystem, ever[2]. There is reasonable expectation, however, that in
the next 10-20 years we will pass the 64-bit limit for some use cases.
and, unless my math is off by a few orders of magnitude, a 2^128-block
pool at current storage densities would require a data center of roughly
the scale of Larry Niven's "Ringworld"...
Rich Teer
2005-10-12 19:10:00 UTC
Post by Bill Sommerfeld
and, unless my math is off by a few orders of magnitude, a 2^128-block
pool at current storage densities would require a data center of roughly
the scale of Larry Niven's "Ringworld"...
Well at least it won't require a Dyson sphere sized data center! :-)
--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
Bruce Shaw
2005-10-12 19:17:10 UTC
Post by Rich Teer
Well at least it won't require a Dyson sphere sized data center! :-)
You do realize when we go to quantum computing, the topology of the disk
storage is no longer an issue, right?
