The issue was that ISOToDay, one of Pig’s contrib (contributed extension) functions, truncated datetimes based on Coordinated Universal Time (UTC), with no way to specify a local time zone that better matches a natural “day” in the source data. For example, if my friend in California browsed http://redfin.com from 3pm through 6pm yesterday in Pacific Standard Time, that’s 11pm yesterday through 2am today in UTC. If we want to count the number of unique website visitors each day, it would be wrong to count my friend’s single session as visits on two different days.
My patch allows Pig users to tag data with a time zone and to have that time zone respected when datetimes are truncated to whole days, so that a visit to a website on a California afternoon is counted on a single day.
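To make the difference concrete, here is a minimal sketch in Python rather than Pig. It does not show the patch's actual interface: the helper `day_of` and the sample dates are made up for illustration, and it assumes Python 3.9+ with the system time zone database available to `zoneinfo`.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def day_of(iso_utc: str, tz_name: str = "UTC") -> str:
    """Return the calendar day (YYYY-MM-DD) on which a timestamp falls in tz_name.

    Hypothetical helper for illustration; not part of Pig.
    """
    moment = datetime.fromisoformat(iso_utc)
    return moment.astimezone(ZoneInfo(tz_name)).date().isoformat()


# A 3pm-6pm Pacific Standard Time session, as it would be logged in UTC.
# The dates are invented for the example.
session_start = "2011-01-10T23:00:00+00:00"
session_end = "2011-01-11T02:00:00+00:00"

# Truncating in UTC splits the session across two days.
utc_days = {day_of(session_start), day_of(session_end)}

# Truncating in the visitor's time zone keeps it on one day.
pacific_days = {
    day_of(session_start, "America/Los_Angeles"),
    day_of(session_end, "America/Los_Angeles"),
}

print(sorted(utc_days))
print(sorted(pacific_days))
```

With UTC truncation the session lands on two calendar days; with Pacific truncation it lands on one, which is the behavior the patch is after.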
There’s still a problem of miscounting a person who visits the site from 2am to 4am in New York (11pm to 1am in Pacific Time, which spans two days), but that is a much smaller problem than using midnight UTC, which is 4pm or 5pm Pacific Time depending on daylight saving, as the daily cut-off. (The best we could do would be to find the hour of the day when the fewest visits are happening, and do our processing in the time zone where that hour is midnight. That would turn out to be pretty close to Pacific Time, anyway.)
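The "quietest hour" idea in the parenthetical can be sketched roughly as follows. This is only an illustration in Python: the function names are hypothetical, the visit data is synthetic, and using a fixed UTC offset ignores daylight saving, which a real time zone would handle.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def quietest_utc_hour(visits_utc):
    """Return the UTC hour (0-23) with the fewest visits (earliest hour on ties)."""
    counts = Counter(v.hour for v in visits_utc)
    return min(range(24), key=lambda h: counts.get(h, 0))


def cutoff_zone(quiet_hour):
    """Return a fixed-offset zone in which quiet_hour UTC is local midnight."""
    # local = UTC + offset, so we need quiet_hour + offset to be 0 (mod 24).
    offset = -quiet_hour if quiet_hour <= 12 else 24 - quiet_hour
    return timezone(timedelta(hours=offset))


# Synthetic traffic: one visit at 08:00 UTC, more visits the further from it.
visits = [
    datetime(2011, 1, 10, hour, minute, tzinfo=timezone.utc)
    for hour in range(24)
    for minute in range(1 + abs(hour - 8))
]

quiet_hour = quietest_utc_hour(visits)
zone = cutoff_zone(quiet_hour)

# Truncate a timestamp to a whole day in the chosen zone.
sample = datetime(2011, 1, 10, 23, 0, tzinfo=timezone.utc)
print(quiet_hour, zone, sample.astimezone(zone).date())
```

In this synthetic data the quietest hour is 08:00 UTC, so the chosen cut-off zone is UTC-8, the same offset as Pacific Standard Time.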
On a slightly political note, it is nice that my current employer has a policy of allowing open source contributions. I needed this bug fix to solve a real work problem, and since I was permitted to contribute the fix with no legal department overhead, I can continue to use the official common version of Apache Pig, instead of maintaining our own forked version.
More technical details are available at the Apache Pig Issues site. Happy hacking!