NFS Read latency - best way to identify culprit

Question

Hi, I have a repeatable event with a specific job run on SLES10 box acessing an EMC celerra. I see very high READ latency (Max RTT). How can I see which file(s) is responsible for this latency. Here are two examples: I used the following filter with tshark: -q -z rpc,rtt,100003,3,'nfs.nfsstat3!=70' Note the very high Max RTT for READ procedure.

NFS Version 3 RTT Statistics: Filter: nfs.nfsstat3!=70 Procedure Calls Min RTT Max RTT Avg RTT NULL 0 0.00000 0.00000 0.00000 GETATTR 1517132 0.00000 0.58692 0.00105 SETATTR 21 0.00088 0.00954 0.00155 LOOKUP 11581651 0.00001 0.98236 0.00036 ACCESS 887419 0.00000 0.18176 0.00024 READLINK 11462 0.00013 0.43970 0.00689 READ 376117 0.00009 8854.46070 0.46602 WRITE 748 0.00033 0.08924 0.01823 CREATE 8 0.00111 0.02262 0.00607 MKDIR 1 0.00633 0.00633 0.00633 SYMLINK 0 0.00000 0.00000 0.00000 MKNOD 0 0.00000 0.00000 0.00000 REMOVE 7 0.00021 0.00865 0.00342 RMDIR 1 0.00117 0.00117 0.00117 RENAME 2 0.00112 0.00113 0.00112 LINK 0 0.00000 0.00000 0.00000 READDIR 601 0.00016 0.03231 0.00037 READDIRPLUS 8254 0.00015 0.21962 0.00372 FSSTAT 0 0.00000 0.00000 0.00000 FSINFO 27 0.00013 0.00383 0.00071 PATHCONF 0 0.00000 0.00000 0.00000 COMMIT 3 0.00393 0.01380 0.00765

NFS Version 3 RTT Statistics: Filter: nfs.nfsstat3!=70 Procedure Calls Min RTT Max RTT Avg RTT NULL 0 0.00000 0.00000 0.00000 GETATTR 1425693 0.00000 0.59928 0.00099 SETATTR 21 0.00038 0.00070 0.00056 LOOKUP 11613451 0.00001 0.68981 0.00029 ACCESS 814987 0.00000 0.04566 0.00022 READLINK 11505 0.00003 0.37694 0.00704 READ 364942 0.00011 7943.28530 2.72942 WRITE 845 0.00029 0.57347 0.01104 CREATE 8 0.00053 0.01336 0.00282 MKDIR 1 0.00073 0.00073 0.00073 SYMLINK 0 0.00000 0.00000 0.00000 MKNOD 0 0.00000 0.00000 0.00000 REMOVE 5 0.00077 0.01270 0.00509 RMDIR 1 0.00067 0.00067 0.00067 RENAME 2 0.00044 0.00051 0.00048 LINK 0 0.00000 0.00000 0.00000 READDIR 601 0.00016 0.10647 0.00156 READDIRPLUS 8254 0.00014 0.11047 0.00389 FSSTAT 0 0.00000 0.00000 0.00000 FSINFO 3 0.00026 0.00032 0.00028 PATHCONF 0 0.00000 0.00000 0.00000 COMMIT 3 0.00241 0.00939 0.00486

Answer 1

0

You can use tshark -r <file> -R "rpc.time>3600" to find all responses that took more than an hour.

answered 07 Feb '11, 10:26

SYN-bit ♦♦
17.1k●9●57●245
accept rate: 20%

Thank you SYNbit. rpc.time > 3600 returns a number of frames with NFS [Illegal Segments], i'm not sure what that means ... reassembly issues, lock / permission issues. Are there doc's out there which explain what NFS [Illegal Segments] means?

thanks

(07 Feb '11, 13:02) debugme

There are no docs available on each and every field. As this is Wireshark generated (because of the square brackets around Illegal Segments), it is probably a problem in the re-assembly. Maybe because of a problem with the packets themselves (which might coincide with the problems you are facing) or maybe a bug in Wireshark.

Are you able to share the trace-file? If so, I can have a look if I can pinpoint it. See my profile for my e-mail address...

(07 Feb '11, 14:40) SYN-bit ♦♦

If there is a pcap file available we can take a look at it for you. The biggest issues with NFS is that v3 (pretty common) has a read block limit of 32KB. So per RTT, you can only transfer 32KB.

If the problem is CPU or IO issues within EMC, the trace file can point it out. If it's an intermittent problem, I would look at the usual suspects as a start (pkt loss, CPU, AntiVirus locking up files)

(07 Feb '11, 18:36) hansangb