The other story for “zpool import failure”

Normally, after a replacement, I do a largely unnecessary 'zpool scrub'.*

 

But, I hear reports that something is still not quite right after I had left it resilvering….though what I could find in my terminal scrollback, before I got disconnected from the console session, didn't reveal any indication that there'd be a problem from that.

 

So, I decide to do a 'zpool scrub' on both E2900s (for sanity? comparison?).  The other E2900 completes the scrub in 30 minutes (with 0 errors).  But the recently worked-on E2900?….its zpool scrub is given an ETA of 600 hours, then 700 hours….then 800 hours….

 

Something isn't right; should I let it continue or not?  Apparently, I should've gone with not.

 

I wasn't on the console, but apparently it had been spewing messages about "retryable read errors" and "relocation area exhausted" (I think I saw something about them being correctable read errors), followed by "retries exhausted" and "disk replacement recommended (predictive failure imminent)".

 

Though what gets reported to Oracle is the later stream of messages, where, after successive soft resets of the controller don't resolve the issue and a hard reset fails…reloading the controller firmware is what the OS tries next, except it's incompatible…. leading to an endless cycle of incompatible firmware upload errors and IOC reset failures….

 

So, we got a bad replacement disk too….by the time I got this far, it was late Friday night….so I left both tickets (Oracle's and ours) to sit until somebody comes in on Monday to pick things up…and hopefully sorts out that pulling the replacement disk might get things working again.

 

The problems from this disk replacement have me in trouble, because the DNS upgrade that I was supposed to have done by now….well, isn't.

 

The DNS upgrade is to properly deal with the fact that, as of BIND 9.9.6, named will no longer start if the same slave zone in different views is using the same file.  When you think about it, it's amazing that it had ever worked….but it had been this way since we started doing split/stealth DNS in 2007.
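
(As a pre-upgrade sanity check, something like the rough Python sketch below can flag the offending zones before named refuses to start.  It assumes a simple, flattened named.conf with one statement per line and "type" before "file"; the path and the parsing are illustrative, not how our config is actually laid out.)

#!/usr/bin/env python3
"""Flag slave zone files shared between views, which BIND 9.9.6's
named will refuse to start on.  Assumes a flattened named.conf with
no includes and 'type' appearing before 'file'; purely illustrative."""
import re
import sys
from collections import defaultdict

named_conf = sys.argv[1] if len(sys.argv) > 1 else "/etc/named.conf"   # hypothetical path

view = zone = ztype = None
users = defaultdict(set)            # file path -> {(view, zone), ...}

for line in open(named_conf):
    if m := re.search(r'^\s*view\s+"([^"]+)"', line):
        view = m.group(1)
    elif m := re.search(r'^\s*zone\s+"([^"]+)"', line):
        zone, ztype = m.group(1), None
    elif m := re.search(r'^\s*type\s+(\w+)', line):
        ztype = m.group(1)
    elif (m := re.search(r'\bfile\s+"([^"]+)"', line)) and view and zone and ztype == "slave":
        users[m.group(1)].add((view, zone))

for path, owners in sorted(users.items()):
    if len({v for v, _ in owners}) > 1:          # same file, more than one view
        print(f"{path}: shared by {sorted(owners)}")

Anything it prints is a slave zone file that two views would both be writing, which is exactly what 9.9.6 now refuses to load.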

 

Only the internal side gets updates, and somehow the external side has worked (except for AD, where the external side gets updates and the internal side has worked…).

 

The AD exception made my kluge workaround a bit harder to create….which was to have a CFEngine promise copy the external view zone files from the internal view zone files, with if_repaired("rndc_reload")…. except for the AD zone, which goes the other way around.
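
(Outside of CFEngine, the kluge amounts to roughly the following; the directory layout and zone names here are made up for illustration, not our real ones.)

#!/usr/bin/env python3
"""Rough, non-CFEngine equivalent of the kluge: copy the internal
view's zone files over the external view's copies (reversed for the
AD zone), and only run 'rndc reload' if something actually changed.
Paths and zone names are hypothetical."""
import filecmp
import shutil
import subprocess

INTERNAL = "/var/named/internal"
EXTERNAL = "/var/named/external"
ZONES = ["example.edu", "10.in-addr.arpa"]      # internal side gets the updates
AD_ZONES = ["ad.example.edu"]                   # AD: external side gets the updates

def sync(src, dst):
    """Copy src over dst if they differ; return True if dst was 'repaired'."""
    try:
        if filecmp.cmp(src, dst, shallow=False):
            return False
    except FileNotFoundError:
        pass
    shutil.copy2(src, dst)
    return True

repaired = False
for z in ZONES:
    repaired |= sync(f"{INTERNAL}/{z}.zone", f"{EXTERNAL}/{z}.zone")
for z in AD_ZONES:
    repaired |= sync(f"{EXTERNAL}/{z}.zone", f"{INTERNAL}/{z}.zone")

if repaired:                                    # the if_repaired("rndc_reload") part
    subprocess.run(["rndc", "reload"], check=True)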

 

The fix solves the issue of not having any internal secondary authoritative-only nameservers…since I couldn't figure out how to have views on the authoritative-only nameservers that I have available to work from.

 

Hadn't really looked at TSIG until the vulnerability in it….(though we did have TSIG set up for the external view, since the provider for our external secondaries had asked about it…but we haven't switched them over to using it.)

 

The big holdup is rewriting the CFEngine 2.2 policy to deliver my new DNS changes.

 

Bluecat is coming next week to do a POC of what will eventually replace all my DNS servers (and all the other departmental DNS servers, by letting departments manage their own delegations).

 

The project was originally to replace our aging DHCP servers and to get an IPAM solution to offload management of departmental reservations to department admins.

 

We're currently under a directive to virtualize everything until it is later proven that it can't be virtualized (not that it shouldn't be), or it's an appliance.

 

Guess I'm the only one who remembers getting campus back online from my mom's kitchen table in Calgary, Alberta, on my Dad's 70th birthday (Dec 12, 2011)….because our only two authoritative DNS servers were in zones, on hosts that were both in the row that failed when the power blinked.

 

It was during this time that we had been ordered to stop cross-connecting servers between the UPSs and to fill them all the way up.  The transfer switch in the row's UPS had failed.

 

They could get it to reset, but any blip would knock it out again…the first time, I was still at the airport…the second time, they had to wait for me to land in Calgary…

 

By that time, they had decided the solution was to leave the datacenter on the generator until the transfer switch could be fixed….a week later.  (It took a while to get a service contract in place to get it fixed; the university lawyers were fast, for a change….)

 

Recently heard that our campus is going to go tobacco-free.  The University policy draft had June 30th, 2015 as the start of the transition (designated remote smoking spots, set out so that the nearest one is no more than 5 minutes from any building on campus….though not necessarily mindful of high-traffic pedestrian routes).  But it had only made its way back from the "Smoking Policy Committee" to the "Campus Environmental, Health & Safety Committee", where my recent birthday was my first official meeting as an appointee 😉

 

Got the impression that it might be 2-3 years before it works its way through the necessary approvals (and I'm not entirely sure who the final approval comes from), but first it needs to go back for a rewrite.

 

Since originally it was "Smoke Free", but they also want to break the culture of students who participate in certain things (like baseball or rodeo) having to learn to chew.

 

Also, it should apply to the entire University, and not just the main campus.  The draft was largely based on policy written for our Olathe campus, which was necessary for it to conform with city ordinances.  Manhattan only has a ban on smoking in public buildings and within 30' of entrances and air intakes…..

 

And no one is aware of whether a policy is in the works for our Salina campus, or whether Athletics falls under a campus policy.  (Or our Foundation building, which I had heard was an argument for moving all of central IT off campus to that building…..which, if true, would be an argument against from me, even though it's closer to my home.)

 

I kind of think it shouldn't take that long, given that we're trailing our rival…that other Kansas university….but we'll see.

—–

* Which is mainly needed because of a bug, that I've never done anything about, in our CFEngine zpool policy for doing periodic scrubs: it doesn't work if zpool status shows "NONE" for the last scrub.  (It also doesn't check the outcome of the scrub, leaving only catastrophic scrub failures as something to report.)
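
(For reference, what that check probably ought to be doing is roughly the sketch below: treat a pool whose status says it was never scrubbed as due, and actually look at the outcome of the last scrub.  The output matching assumes the older Solaris-style 'zpool status' text and is illustrative only, not our actual CFEngine policy.)

#!/usr/bin/env python3
"""Illustrative scrub check: flag pools that have never been scrubbed
and report the outcome of the last scrub, instead of only noticing
catastrophic failures.  Assumes older Solaris-style 'zpool status'
output; not the real policy."""
import re
import subprocess

status = subprocess.run(["zpool", "status"], capture_output=True,
                        text=True, check=True).stdout

pool = None
for line in status.splitlines():
    line = line.strip()
    if line.startswith("pool:"):
        pool = line.split(":", 1)[1].strip()
    elif line.startswith(("scrub:", "scan:")):
        if "none requested" in line:
            print(f"{pool}: never scrubbed, scrub is due")      # the missed "NONE" case
        elif "in progress" in line:
            print(f"{pool}: scrub currently running: {line}")
        elif re.search(r"with [1-9]\d* errors", line):
            print(f"{pool}: last scrub found errors: {line}")   # outcome worth reporting
        else:
            print(f"{pool}: last scrub looks clean: {line}")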