Any LVM Wizards Out There


Right now, it seems like my Google-fu is failing me. A task that used to be fairly trivial on Solaris is utterly eluding me on Linux. Specifically, how does one extract a filesystem, intact, from an LVM container so as to be able to mount it directly from the underlying /dev/sd device?

 

Basically, I'm looking to avoid the whole Towers of Hanoi exercise of moving the data from disk-to-disk. If I knew what blocks on disk the filesystem's LVM structures pointed to, I could probably re-fdisk the device and mount the filesystem directly from the /dev/sd device. I know Linux's LVM uses a different partition tag (8e) for LVM versus native (83) partitions - just hoping that's a label rather than a geometry change.

 

Ideas? Am I clear in what I'm looking to accomplish?

Responses

Linux LVM is a volume manager, like various Veritas storage solutions, Microsoft LDM, AIX volume management, etc...

There are things stored in meta-data areas of the Physical Volume (PV), and there are not always 1:1 block mappings between Physical Extents (PE) in PVs to Logical Extents (LE) in a Logical Volume (LV).  This is especially the case where a LV seems like it has contiguous LEs, when it's actually spread over several sets of PEs.  So even if you delete the meta-data, and somehow "redefine" the slices/partitions to address where the PEs actually are for a LV, there's no guarantee they will be contiguous and usable.
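
If you want to see how your LEs actually map onto PEs, the LVM display commands will show the segment layout; a quick sketch (the VG/LV/PV names here are made up):

  # lvdisplay --maps /dev/vg00/data     # per-segment mapping of the LV's LEs onto PVs/PEs
  # pvdisplay --maps /dev/sdb1          # which PE ranges on the PV belong to which LV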

 

Also coming from a Solaris background, I don't see how you do this on Solaris.  Unless you're thinking LVM is like a Sun Disk Label.  It is not.  Disk Labels are completely different (and PC/Linux LVM often resides on the legacy PC BIOS/DOS disk label or the newer GPT disk label).  LVM is also not MD either, so don't confuse LVM with Solaris MD capabilities.

But more importantly ... why are you doing this?

 

LVM on kernel 2.6 has *0* overhead.  LVM is merely just leveraging the integrated DeviceMapper (DM) facilities of the kernel.  I.e., whether something is in LVM or in a "slice" of the underlying disk label (e.g., partition in legacy BIOS/DOS "MBR" partition table format), it performs the same.

I.e., the kernel accesses the blocks directly, as DM defines them -- LVM, underlying slices/partitions, etc...
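
If you're curious, you can look at the table DeviceMapper is actually using for a volume; for example (device name and numbers are purely illustrative):

  # dmsetup table vg00-data
  0 20971520 linear 8:17 384

That line reads: sectors 0-20971519 of the LV map via the linear target onto device 8:17, starting at sector 384. Nothing extra happens per I/O.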

There's no need to ever "undo" LVM.  LVM has all sorts of advantages, while I don't know of any disadvantages ... other than GRUB being unable to directly boot it (requiring a separate /boot).

The volumes in question are basically a single-column stack: 1 LUN; 1 (full-disk) primary partition; 1 VG; 1 LV; 1 filesystem (no point in using LVM at all, other than that the problematic system's builder uses LVM out of habit). If you're saying that, in such a configuration, LVM might be splattering data randomly about the physical volume, one would wonder why you would ever use LVM versus other options. Randomly scattering data around a physical device is asking for pointless head-thrash.

 

While I would like to believe that LVM has "*0* overhead", LVM is currently the only difference between four otherwise identically configured hosts (HP BL460G6 servers with 32GB of RAM and QLogic HBAs attached to a CLARiiON back-end). Three hosts - the ones not using LVM - scream along with their respective workloads; the gimpy host - the one that is using LVM - is eating memory like it's going outta style (all are running identical Oracle and Java application stacks - ironically, the software stacks on the screaming hosts are several times larger and busier than the one on the gimpy host).

 

So, to answer your "why are you doing this" question, it's to: A) eliminate the major configuration deviation from the gimpy host; B) have the gimpy host match the original design-spec for the application deployment (i.e., to bring it back into standard configuration for the host/workload).

 

As to Solaris, I'm not sure what you mean by "I don't see how you do this". Whether you're using ODS/SDS/SVM or VxVM (haven't had cause to try it on ZFS, so can't speak to that) for your volume management, it's fairly trivial to extract a UFS or VxFS filesystem, intact, from within either an ODS/SDS/SVM or VxVM volume. The only real gotchas to the extraction are that the data has to use sequentially contiguous blocks on disk, and that you don't have more logical volumes than you do available `format`able partitions. The only real challenge to doing it is making sure your math is correct.

Typically when you create a single logical volume to span an entire physical volume, it will use a contiguous range of physical extents.  However what Bryan was getting at is that if you've extended the LV over time to take up the PV, and possibly had other logical volumes in the mix on that PV at some point, then your current LV may have multiple segments that are not necessarily in order on the disk.  

 

I can't imagine any scenario in which your LVM layout is contributing to the high memory consumption.  Have you checked top to see what's eating the memory?  device-mapper and LVM are simply remapping I/O that is destined for a certain area of a logical device to a specific area on disk, and outside of LVM commands and metadata changes, there really shouldn't be any overhead.  

 

That said, if you really want to remove LVM from the picture, you can do it as you described, by repartitioning the disk so that a partition starts where the logical volume started previously.  This isn't trivial, but it can be done.  

 

Note: This can destroy your data.  Proceed at your own risk.  A backup is strongly recommended.  Or better yet, just copy the data from one disk to another and use that instead.  

 

In this example my PV is /dev/sda, my VG is test, and my LV is a 1G volume named lv1.

 

First you need to know the range of extents used by the LV.  This will only work if it's a single contiguous range.  

 

  # lvs -a -o +seg_pe_ranges test
    LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert PE Ranges     
    lv1  test -wi--- 1.00g                                       /dev/sda:0-255

 

Next you need the extent size:

 

  # vgs -a -o +extent_size test
    VG   #PV #LV #SN Attr   VSize  VFree Ext  
    test   1   1   0 wz--n- 10.00g 9.00g 4.00m

 

So, the new partition will have to cover all of extents 0-255 (256 extents), i.e. be at least 256*4m, or 1024m, large.

 

Now you need to know the offset to create it at on disk, which is the same as the start of the first PE:

 

  # pvs -a -o +pe_start /dev/sda
  PV         VG   Fmt  Attr PSize  PFree 1st PE
  /dev/sda   test lvm2 a-   10.00g 9.00g   1.00m

 

So: a 1024m or larger partition that starts 1m into the disk.  First you should back up your metadata in case you need it again (there's probably already a copy, but best to be safe):

 

  # vgcfgbackup test
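
Should you ever need that backup, the VG can usually be reconstructed from it; roughly (the UUID placeholder must be the PV's original UUID, which `pvs -o +pv_uuid` will show before you wipe anything):

  # pvcreate --uuid <original-pv-uuid> --restorefile /etc/lvm/backup/test /dev/sda
  # vgcfgrestore -f /etc/lvm/backup/test test
  # vgchange -ay test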

 

Now remove the metadata. Note: everything must be unmounted.  Commands:

 

  # vgremove test

  # pvremove /dev/sda

 

Now I create my partition with parted:

 

   # parted /dev/sda
  GNU Parted 2.1
  Using /dev/sda
  Welcome to GNU Parted! Type 'help' to view a list of commands.
  (parted) p                                                                
  Error: /dev/sda: unrecognised disk label                                  
  (parted) mklabel                                                          
  New disk label type? msdos                                                
  (parted) p                                                                
  Model: IET VIRTUAL-DISK (scsi)
  Disk /dev/sda: 10.7GB
  Sector size (logical/physical): 512B/512B
  Partition Table: msdos

 

  Number  Start  End  Size  Type  File system  Flags

  (parted) mkpart                                                           
  Partition type?  primary/extended? primary
  File system type?  [ext2]? ext3                                           
  Start? 1m                                                                 
  End? 1022m
  (parted) p                                                                
  Model: IET VIRTUAL-DISK (scsi)
  Disk /dev/sda: 10.7GB
  Sector size (logical/physical): 512B/512B
  Partition Table: msdos

 

  Number  Start   End     Size    Type     File system  Flags
   1      1049kB  1022MB  1021MB  primary  ext3

 

 

  (parted) quit                                                             
  Information: You may need to update /etc/fstab.                           

 

  # partprobe

 

Now I'm able to mount it via the partition:

 

  # mount /dev/sda1 /mnt/nfs/
  #
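
If you want to be extra careful, verify that the filesystem actually fits inside the new partition before trusting it:

  # dumpe2fs -h /dev/sda1 | grep -E 'Block count|Block size'
  # blockdev --getsize64 /dev/sda1

The filesystem's block count times its block size must not exceed the byte count blockdev reports.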

 

Again, I recommend you look closer at what's actually using the memory rather than go to these lengths to remove something that may benefit you in the future and probably isn't having any impact.  But, if you must, that's how it can be done.  

 

Regards,
John Ruemker, RHCA
Red Hat Software Maintenance Engineer
Online User Groups Moderator

Ok, dunno if it's something weird in my testing VM or what, but those steps don't seem to result in the right block range for me. Perhaps I'm interpreting the above procedure wrong. At any rate, this is what I get with my 10GB testing vDisk (not risking the production system's disk until I know my application of the procedure is fundamentally sound):

 

# lvs -a -o +seg_pe_ranges TestVG
 LV   VG     Attr   LSize  Origin Snap%  Move Log Copy%  Convert PE Ranges
 Vol0 TestVG -wi-ao 10.00G                                       /dev/sdc1:0-2558
# vgs -a -o +extent_size TestVG
 VG     #PV #LV #SN Attr   VSize  VFree Ext
 TestVG   1   1   0 wz--n- 10.00G    0  4.00M
# pvs -a -o +pe_start /dev/sdc1
 PV         VG     Fmt  Attr PSize  PFree 1st PE
 /dev/sdc1  TestVG lvm2 a-   10.00G    0  192.00K
# parted /dev/sdc
GNU Parted 1.8.1
Using /dev/sdc
Welcome to GNU Parted! Type 'help' to view a list of commands.

(parted) p

Model: VMware Virtual disk (scsi)
Disk /dev/sdc: 10.7GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      18.9kB  10.7GB  10.7GB  primary               lvm

(parted) rm 1
(parted) mkpart
Partition type?  primary/extended? primary
File system type?  [ext2]? ext3
Start? 192.00k
End? 10g
(parted) quit
Information: Don't forget to update /etc/fstab, if necessary.

# partprobe /dev/sdc
# mount /dev/sdc1 /tmp/a
mount: you must specify the filesystem type

I tried changing the ending block, mathing out the PE size times the number of PEs and converting to KB (both as an absolute and added to the starting block to get a relative extent). In none of those cases did it result in a recoverable filesystem. I also, just on a desperate grasp, tried changing the starting offset from 192.00K to 1MB, 4MB and 210.9KB (the original partition boundary plus the "start" returned by `pvs`). Still no dice.
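
One thing I may try next, to avoid re-running parted for every guess, is loop-mounting the raw disk read-only at a candidate offset (OFFSET being the old partition's start plus the pe_start value, both in bytes - a placeholder I'd have to compute from the outputs above):

  # mount -o ro,loop,offset=$OFFSET /dev/sdc /mnt/test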

 

At this point, it's more "academic interest" that's driving me. I anticipate that, by the time I get back from vacation, the SAN folks will have been able to present a parallel LUN to do the data-bouncing, making this an unnecessary exercise. Given that other volume managers can be de-encapsulated (and your apparent ability to do it on your test system), I'm simply genuinely curious as to "can it be done" (and what's the juju).

 

John -- that was exactly what I was referring to.  I was afraid the LV could have been extended at different points, and had ranges of extents that may not be completely continuous (even if still contiguous).  I didn't want to jump to such conclusions, and leave the poster upset with me as a result.  ;)

 

Nice to know about the "seg_pe_ranges" output field.  I had never tried it before.  I didn't want to send the poster down the more "raw" DeviceMapper route, which is all I could think of.  Thanx for that tidbit, and it's definitely much better.

 

To conclude, I think you and I see eye-to-eye.  As much as I wanted to go into the details of how PEs are laid out, how DeviceMapper uses the ranges, etc... and get into the DM tools, I figured the poster had a reason why he was asking.  Hence my question.

 

As it turns out, and I had a hunch (and I thought I remembered the poster from the 32-bit PAE thread ;) ), it was about performance.  And I came to the same conclusion as yourself: the performance problem is elsewhere.

 

Because whether you use LVM or not, you use DeviceMapper, always.  The kernel always accesses disks on a range of blocks, whether they are referring to PE ranges, or partition/slice ranges, it's 100% the same.

 

I can't think of any tunables that would modify performance on LVM.  Before DM-LVM2, back in 2.4 (when LVM did have overhead), readahead was set separately on an LV versus the underlying block device.  But it's all the same block device now in the DeviceMapper era of LVM2.
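
If you did want to compare something, readahead is about the only LVM-level knob, and it's cheap to check (using the poster's test VG as the example):

  # blockdev --getra /dev/mapper/TestVG-Vol0    # readahead on the DM device, in 512-byte sectors
  # blockdev --getra /dev/sdc                   # readahead on the underlying disk
  # lvchange -r 4096 /dev/TestVG/Vol0           # persistently set the LV's readahead, if they differ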

Anything you could think of?  Otherwise, I think the performance issue is in another area.

Issues like this are why I hate dealing with "hardened builds". No tcpdump, no strace, no crash-tools ...nothing. So, it leaves me with "reduce the deltas" as the main troubleshooting path. At this point in that process, LVM's the last identifiable delta. :(

Cool. Those are the types of things I'd been looking for. I'd found an article on mounting stacked LVs (i.e., you create an LV, present it through KVM to a guest that, in turn, partitions and LVs it) as bare devices, so I figured there was a method of extracting the filesystem from the LV (a la `vxunroot` and manual de-MD'ing on ODS/SDS/SVM).
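
(For anyone curious about the stacked-LV trick: kpartx is the usual tool for exposing the partitions inside a volume - a rough sketch, with a made-up /dev/vg00/guest volume:)

  # kpartx -av /dev/vg00/guest     # creates /dev/mapper entries for partitions found inside (e.g. vg00-guest1 or vg00-guestp1, depending on version)
  # vgscan && vgchange -ay         # pick up and activate any nested VGs on those partitions
  # kpartx -d /dev/vg00/guest      # tear the mappings back down when finished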

 

Unfortunately, while I don't know that LVM/dev-mapper is what's eating memory, it's the remaining configurational difference between this system and three others that don't suffer the memory problem. We use a hardened build, so there's not a lot in the way of profiling tools to allow me to track things down. So, the next best thing is to try to make things more like the systems that work. While I'd normally be ok with playing Towers of Hanoi with the data, our SAN operations folks are a bit slow in servicing storage requests. Since de-encapsulation is a fairly quick and easy procedure in other volume management systems, I figured it would be worth investigating the possibility of doing it in this case.

 

We're still in the process of unifying our enterprise backup infrastructure. This particular server sits in one of the non-unified data centers. So, while it has backups, I'd have to go through the remote ticketing system to get the restores (and, the people that own that backup island are the same people that provision storage for that site, so...).

 

This is all tied back to the suboptimal RHEL 5+PAE low-mem situation. Basically, this box runs "at the edge", so, anything that's chewing up chunks of low-mem pushes it past the tipping point.
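
A few read-only checks that even the hardened build leaves me, for watching the low-mem squeeze (assuming slabtop survived the hardening):

  # grep -E 'LowTotal|LowFree' /proc/meminfo    # how much low memory is left
  # cat /proc/buddyinfo                         # per-zone free-page fragmentation
  # slabtop -o -s c | head -20                  # biggest kernel slab consumers (mostly low-mem)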

 

Overall, I'll be a lot happier with life when we're able to migrate to the current version of the app-stack so I can move to RHEL 6 and 64-bit Oracle.

If you have Oracle on Ext3, you need to be tuning Ext3, and ensuring all systems have those tunables.  Are you using Direct I/O?

The app vendor got pissy about other attempts to optimize. Hell, they got pissed when I ran some of the Oracle-bundled report macros to see where the database might be optimized (but subsequently asked for the output of those reports for their troubleshooting). It's been a fun series of phone calls.

 

Always a joy when a larger company buys another company so they can sell that company's product, only to have all the original product's coders leave or be laid off.

Most of the recommendations I'm making, like Direct I/O, are standard practice when you have Oracle atop Ext3.  If it's not enabled, it could explain your performance issues, as it makes a significant difference.
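
Checking it is cheap; a sketch, assuming the stock filesystemio_options init parameter and an instance owner of 'oracle' (both assumptions on my part):

  # echo "show parameter filesystemio_options" | su - oracle -c "sqlplus -s / as sysdba"

A value of setall or directIO means datafile I/O bypasses the Ext3 page cache; none means it does not.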

 

I guess what I'm saying is that there could be several, major settings/configuration differences that explain your performance issue, and are far more likely (and actually significant in impact).  These are things that are easily changed and tested.

 

Don't know what to tell you at this point, just trying to give you possibilities that actually impact performance.

The vendor that pre-packaged the components gets pissy when you change things.

 

At any rate, many of the tunables that applied two Oracle (and two Red Hat) versions ago (though the DB and OS combo are each one major release behind current, at 10g and RHEL 5.5, respectively) are now either defaults or auto-tune themselves in the presence of each other. If those things aren't currently set, it's because the app-stack vendor has specifically unset them. Me setting them either accomplishes nothing (i.e., saying "flip this bit to on" when it's already on) or it overrides the vendor's setting (which will put me out of support).

 

We use automated build systems to ensure a consistent OS load and configuration. We use identical base hardware throughout the enterprise to reduce the number of possible hardware configuration-related issues. And, while the team that built the system originally exceeded spec with memory, that has been corrected. That leaves me with the fact that the only configurational difference between the flaky system and the other working systems is the presence of LVM on the filesystems used by the application. None of the systems have been tuned beyond those settings explicitly recommended by the vendor or automatically done by the app-stack's installation scripts. Were missing or inappropriate tunables responsible, I should reasonably expect to see the same behavior across all of the systems.

All I was saying is that it can't hurt to check if all the tunables match.  The lack of Direct I/O on one system versus another is the biggest red flag I've seen in my history of Oracle implementations when the storage is Ext3.
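
The cheapest way to prove they match is to diff them host-to-host (host names made up):

  # ssh good-host 'sysctl -a | sort' > good.sysctl
  # ssh gimpy-host 'sysctl -a | sort' > gimpy.sysctl
  # diff good.sysctl gimpy.sysctl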

 

-- Bryan

 

P.S.  I gotta ask ... these aren't Exadata systems, are they?  @-ppp

 

I know Exadata was shipping with Oracle Linux 5.5 (can't remember if stock or Unbreakable) just a couple of months ago, when a client of mine was looking at them.  I know they are often sold as an "appliance," but once a customer gets them, they are set up and managed like any old EL box, with some NAND EEPROM acceleration and per-disk storage licensing of Oracle Linux.

This whole setup has made me rather cranky and a touch oversensitive to perfectly valid questions.

 

At any rate, it's an SRM product that my customer had bought more than a year before involving me in the project. Worse, by the time I was brought in, the vendor was getting ready to EOL the particular version bought. Factor in an apparent shortfall in RHEL experience by the vendor's product support team and they've not been terribly helpful when we've run into issues. All of which has contributed to my inability to keep my normal levels of crankiness in check. :p

So, today I was backing the box up in preparation to do some more debug work on it. EVERYTHING was shut down - Oracle and the Java applications. Basically, the only thing running was the OS and the (NetBackup) backup processes. About 2/3 of the way through the backup, the damned thing OOM'ed on Xorg (the X desktops are left running on our build because workstations with X servers for display redirects are exceedingly rare). Fortunately, OOM took Xorg out and let the NetBackup process run to completion. Changed the initdefault to 3 and rebooted. I guess we'll see whether Xorg's the culprit or if it's something else (though previous hardware diags turned up no issues).
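
For the record, the kernel does log which processes were holding memory when the OOM killer fired, and that much I can get even on the hardened build (standard RHEL log location assumed):

  # grep -B2 -A20 'invoked oom-killer' /var/log/messages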

 

Thomas,

 

 

Why not open a new thread, as this is not an LVM issue?

Second, could you please avoid slang? As a non-native English speaker, I do not know what "OOM'ed" means; I can only guess.

 

 

Kind regards,

 

 

Jan Gerrit Kootstra

  1. When people have bothered to try to provide help, I generally like to let them know what the outcome of that effort was.
  2. Given that I don't have individual email addresses for the folks that tried to help on this (and a related) issue - posting to the thread lets them know the results of their help.

Not a lot of point to opening an entirely new thread to communicate "here's the outcome of threads X and Y" since I'm not specifically asking a new/separate question.

Jan --

 

Just FYI, the OP actually posted an OOM thread first, and I do believe it is related (as I noted in an earlier response, which he confirmed) ...

 

`oom-killer` problems on RHEL 5.5 32bit (w/ PAE kernel)

 

The OP is very constrained here, and yet trying various things under those constraints, although some seem like they would void the 3rd-party support anyway.

 

I still do not believe this is LVM-related at all.  On kernel 2.4, yes, LVM impacts performance.  But on kernel 2.6, the kernel's DeviceMapper (DM) facilities are in use whether you use LVM2 or "raw" slices (partitions) on the disk label (BIOS/DOS partition table) - period, exactly the same, they always use DM.

Hi,

 

I saw a lot of oom (Out of Memory) problems on RHEL5 - on 15-17 servers, I think - but I haven't seen an oom error on RHEL6 (so far). We have resorted to rebooting the servers weekly to try to stop the oom errors crashing them. It's a crude fix, but the customer did not want to try the hugemem kernel on RHEL5.

 

I'm surprised to hear folks trying to get rid of LVM. I was hoping to do the opposite, i.e. move data from a non-LVM ext4 filesystem to an LVM-backed ext4 filesystem. This will allow dynamic expansion of the underlying LV and then of the ext4 filesystem. Although I'm not sure whether this assumption still holds when SAN storage is used. One of the biggest headaches is to run out of filesystem space and need to extend a filesystem. I try to leave some room for expansion by creating an LV, then making a filesystem about 90% the size of the LV, leaving 10% for "emergencies".
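
The expansion itself is only two commands once LVM is underneath; a sketch assuming a vg00/data LV carrying ext4 (names made up; ext4 can be grown while mounted):

  # lvextend -L +5G /dev/vg00/data     # grow the LV into free space in the VG
  # resize2fs /dev/vg00/data           # grow the ext4 filesystem to match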

 

Lina

The value of LVM with SAN depends on your use-case:

  • If you prefer to have static device names and/or "friendly" device names and don't want to mess with udev, LVM is great.
  • If you're doing LUN-spanning or sub-disking, LVM is a great way to do it (it's hard to beat "free").
  • If you're not doing LUN-spanning or sub-disking, starting with LVM can be great from the standpoint of future configurational flexibility. If you ever need to re-lay out your storage, LVM already being in place gives you the flexibility to span or sub-disk on an as-needed basis. The only free solution I've encountered, to date, that's more flexible is ZFS/ZPools.
  • If you need to do online storage migration (e.g., you've bought a shiny new storage array and want to move onto it without downtime), LVM is great. You just add the new array's LUNs as PVs, mirror your data to the new array's LUNs, break off the old array's LUNs and you're done - no downtime really needed (a sketch follows this list).
  • If you've got none of the above and only *ever* intend to use 1:1 FS:LUN mappings and only ever plan to use array-side LUN resizing, LVM essentially becomes one more set of pointers you need to remember to update (meaning you have some additional commands to run to accommodate the array-side activities). When you resize a LUN, you have to offline all of the stacked structures (filesystem and LVs) on that device before RedHat will allow you to update the device's geometry.
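
A rough sketch of that migration path (using pvmove rather than explicit mirroring, which gets to the same place; VG and device names are made up):

  # pvcreate /dev/mapper/new_lun
  # vgextend datavg /dev/mapper/new_lun
  # pvmove /dev/mapper/old_lun             # shuffle all extents to the new PV, online
  # vgreduce datavg /dev/mapper/old_lun
  # pvremove /dev/mapper/old_lun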

For me, except when doing Storage Foundation-based installations, I use LVM as a hedge against future needs. My desire to remove it was simply a case of trying to eliminate it as an underlying cause of, or contributor to, the issues I was seeing (Troubleshooting 101 is to progressively simplify your configuration until the errors either go away or the error conditions/messages become more obvious). In other logical volume management solutions I've used, logical device encapsulation/de-encapsulation was a fairly trivial undertaking; thus, it was a frequently-used technique when potential issues presented themselves.

Lina --

 

I actually tracked the OP's OOM problem first, before this thread ...

 

`oom-killer` problems on RHEL 5.5 32bit (w/ PAE kernel)

 

In a nutshell, he's got 48GiB on EL x86 (PAE36) release 5 (and not release 4 or earlier, which offered the 32-bit 4/4G split memory-model option).  That's just a recipe for disaster, and is not supported by Red Hat with release 5, because one is always exhausting low memory.

 

EL x86-64 (PAE52) must be used  for more than 16GiB, as Red Hat only supports up to 16GiB on EL x86 (PAE36) for release 5+.

 

Red Hat Enterprise Linux Technology capabilities and limits (supported[/theoretical])