Closed Bug 637680 Opened 13 years ago Closed 13 years ago

Get top crashers for Firefox and Fennec where crash-stats are broken (linux, android)

Categories: Socorro :: General (task)
Platform: All (Linux)
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: glandium; Assigned: rhelmer
Attachments: 9 files, 1 obsolete file

I'll attach the two programs that can be used to fixup minidumps.
Attached file Fixup Linux minidumps
Build with -I$(topsrcdir)/toolkit/crashreporter/google-breakpad/src

Just give a bunch of minidumps on the command line, and it will modify them in-place.
The plan is to get these minidumps into a dev server (in bug 637678), where I'll run this tool on them, then we'll feed them into the Socorro staging server to generate topcrash lists.
How many dumps are we talking about?
Could we:
- run a MapReduce job to pull each busted dump, fix it, and replace it in HBase
- insert all fixed dumps into the legacy processing queue

This would get the data up on prod.
Actually, after chatting with Laura a bit on IRC, here is what I would offer for your consideration:

Create a Postgres query that can extract a list of submitted_timestamp that need to be fixed

Create a simple Python script that can iterate over those ooids and talk to the hbaseClient object

Call hbaseClient.get_dump(ooid)

Shell exec the fixer program on the dump

Insert the dump back into HBase using a subset of the code in hbaseClient.put_json_dump()

Insert the ooid back into the legacy processing queue by calling hbaseClient.put_crash_report_indices(ooid,CurrentTimestamp,['crash_reports_index_legacy_unprocessed_flag'])

Note that the current timestamp, in the same format as what is used for submitted_timestamp, should be used so that the reprocessed entries don't take priority over normal jobs.


The end result of this job, if it were run on a regular basis, is that we would update the record in HBase with a fixed copy of the dump file (the old one would still be present but not visible to the normal Socorro system). The monitor would see these entries in the queue and, as long as it doesn't reject them as already having been processed, would send them back through the system.

There would be no load increase on the production HBase cluster to support this. If we attempted to do a MapReduce job, we'd have to tune and test it carefully to make sure it wouldn't mess things up. If this were tens of thousands of crashes per day, that might be worth it, but for a small volume this should be a simple-to-implement solution.
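
A minimal sketch of the pull/fix/push loop described above, assuming an already-constructed Postgres connection and hbaseClient object are passed in. The function and parameter names (fix_broken_dumps, fixer_path) are placeholders, and put_fixed_dump() is the helper proposed later in comment 12, not an existing hbaseClient method:

import subprocess
import tempfile

def fix_broken_dumps(pg_conn, hbase_client, sql, fixer_path):
    cursor = pg_conn.cursor()
    cursor.execute(sql)                             # e.g. "select uuid from reports where ..."
    for (ooid,) in cursor:
        dump = hbase_client.get_dump(ooid)          # pull the raw dump out of HBase
        with tempfile.NamedTemporaryFile() as tmp:  # the fixer modifies a file in place
            tmp.write(dump)
            tmp.flush()
            subprocess.check_call([fixer_path, tmp.name])
            tmp.seek(0)
            fixed = tmp.read()
        # write the fixed dump back and queue the ooid for (non-priority) reprocessing
        hbase_client.put_fixed_dump(ooid, fixed, add_to_unprocessed_queue=True)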
Sorry, at the beginning, the first step should read:

Create a Postgres query that can extract a list of ooids that need to be fixed
Ted, do you want us to go ahead?
Daniel's proposal sounds fine to me. Let me know how I can help make this happen.
Attachment #515926 - Attachment mime type: text/x-csrc → text/plain
Attachment #515927 - Attachment mime type: text/x-csrc → text/plain
Rob: see comment 5 for the agreed procedure. We'll need two weeks' worth of dumps to get decent topcrasher info for the broken builds. You might need jberkus to run the query on prod PG for you for that part.

The other part is hacking up some Python to follow the above steps, and running that on prod.

We really need to get this done today - it fell off the radar this week. Can you manage it?
Assignee: ted.mielczarek → rhelmer
Severity: normal → blocker
Here's a count and example queries we'll be working with:

"""
breakpad=> select count(*) from reports where product = 'Firefox' and version = '4.0b11' or version = '4.0b12' and os_name = 'Linux' and date_processed > '2011-01-01';
  count  
---------
 1233417
(1 row)

breakpad=> select count(*) from reports where product = 'Fennec' and version = '4.0b5'; count 
-------
  7851
(1 row)
"""

Went over this with Ted on IRC; it looks good, but let me know if anyone notices anything odd. I am now working on the approach Daniel suggests in comment 5.
Status: NEW → ASSIGNED
(In reply to comment #5)
> Insert the dump back into HBase using a subset of the code in
> hbaseClient.put_json_dump()

Daniel, can you expand on which part(s) of put_json_dump() we don't want?
 
> Insert the ooid back into the legacy processing queue by calling
> hbaseClient.put_crash_report_indices(ooid,CurrentTimestamp,['crash_reports_index_legacy_unprocessed_flag'])

 
> Note that the current timestamp in the same format as what is used for
> submitted_timestamp should be used so that the entries to be reprocessed don't
> take priority over normal jobs.

I think for this we could easily add an optional param to put_json_dump() to override submitted_timestamp, passing in the current time (or whatever we want instead). Let me know if I'm understanding correctly.
Basically, we only want the lines in put_json_dump() that write the dump, none of the metadata manipulation or index management and such.

Something like this, with the one comment placeholder filled in:

  @optional_retry_wrapper
  def put_fixed_dump(self, ooid, dump, add_to_unprocessed_queue = True):
    """
    Update a crash report with a new dump file optionally queuing for processing
    """
    row_id = ooid_to_row_id(ooid)
    submitted_timestamp = # Python code for getting current timestamp in correct format

    columns =  [ 
                 ("raw_data:dump", dump)
               ]
    mutationList = [ self.mutationClass(column=c, value=v)
                         for c, v in columns if v is not None]

    indices = []

    if add_to_unprocessed_queue:
      indices.append('crash_reports_index_legacy_unprocessed_flag')

    self.client.mutateRow('crash_reports', row_id, mutationList) # unit test marker 233

    self.put_crash_report_indices(ooid,submitted_timestamp,indices)
Attached patch script dump fix and re-insertion (obsolete) — Splinter Review
This is based on comment 5 and comment 12 (thanks Daniel!)

Not sure if we want to keep it, but I set it up so we could trivially land this into Socorro and call this as a cron job if it's needed.

Building attachment 515926 [details] and 515927 with a Socorro checkout just needs:
make minidump_stackwalk
gcc -o minidump_hack-fennec -I google-breakpad/src/ minidump_hack-fennec.c 
gcc -o minidump_hack-firefox_linux -I google-breakpad/src/ minidump_hack-firefox_linux.c 

Names of the fixup commands and also the SQL queries are configurable, but "Fennec" and "Firefox Linux" are hardcoded in the config and the start script (hopefully we never need to expand this :))

It might be nice to make the fixup commands read from stdin and write their output to stdout so we don't need to touch the disk, but I'm not going to sweat this right now.
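
As a hedged illustration of that idea (not part of the attached script), if the fixup binaries were changed to read a dump on stdin and emit the fixed dump on stdout, the temporary file could be dropped entirely; fix_dump_in_memory is a hypothetical helper:

import subprocess

def fix_dump_in_memory(fixer_path, dump_bytes):
    # pipe the raw dump through the (hypothetically modified) fixer binary
    proc = subprocess.Popen([fixer_path],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    fixed, _ = proc.communicate(dump_bytes)
    if proc.returncode != 0:
        raise RuntimeError("%s exited with code %d" % (fixer_path, proc.returncode))
    return fixed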
Attachment #517027 - Flags: review?(lars)
Attachment #517027 - Flags: feedback?(deinspanjer)
We tested this on a single crash to start:
https://crash-stats.mozilla.com/report/index/30100333-b41e-4b2e-93fb-694472110220

I have the original dump, and the md5sum changed, but not sure how else to verify.

Who can help with this?
(In reply to comment #14)
> We tested this on a single crash to start:
> https://crash-stats.mozilla.com/report/index/30100333-b41e-4b2e-93fb-694472110220
> 
> I have the original dump, and the md5sum changed, but not sure how else to
> verify.
> 
> Who can help with this?

Taking a look at the original raw dump vs. the new one should help. In the original, you should see 3 modules for Fennec libraries such as libxul.so, while in the new one you should see only two, with the first one covering the address space that the original first two covered. The resolved function names should look better, too.
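
A rough way to make that comparison, assuming a locally built breakpad minidump_dump binary (this simply counts occurrences of the library name in the tool's text output, before and after fixing, rather than parsing the module list properly):

import subprocess

def count_library_mentions(dump_path, library="libxul.so"):
    # dump the minidump as text and count how often the library name shows up;
    # expect the count to drop for the fixed Fennec dumps
    out = subprocess.check_output(["minidump_dump", dump_path])
    return out.decode("utf-8", "replace").count(library)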
In the crash report you link, the stack traces on the other threads look almost normal, except for the ashmem (deleted) parts, which are not related to elfhack. Having the /proc/pid/maps output from the minidump would help there.
Note that the fixing behaviour is different on Linux: the original minidumps have one module for each Firefox library, but each is too small. The fixup makes each module's address space larger, so that it fits the actual address space used in the process.
glandium has been helping to test this in IRC; it looks good, so we're going to proceed with all Fennec 4.0b5 crashes. Doing a dry run now to make sure everything looks OK (processing the right number of crashes, calling the right binary).

There appears to be caching enabled on /rawdumps calls (which Apache rewrites to the socorro-api hostname); I imagine this is on the Zeus, not sure if this is valuable.

Also, just a reminder that per comment 5 these will get inserted into the normal (not priority) queue for processing, so it'll take a while for processors to pick these up.

I should have a reasonable estimate for how long this will take once we have it running for real for a bit.
(In reply to comment #10)
> Here's a count and example queries we'll be working with:
> 
> """
> breakpad=> select count(*) from reports where product = 'Firefox' and version =
> '4.0b11' or version = '4.0b12' and os_name = 'Linux' and date_processed >
> '2011-01-01';
>   count  
> ---------
>  1233417
> (1 row)

Oops, this is wrong; AND binds more tightly than OR, so we need to be explicit about precedence here (thanks glandium):

breakpad=> select count(*) from reports where product = 'Firefox' and (version = '4.0b11' or version = '4.0b12') and os_name = 'Linux' and date_processed > '2011-01-01';
 count 
-------
  7732
(1 row)
Same as attachment 517027 [details] [diff] [review] plus:

* fix Firefox SQL statement
* use /dev/shm for tmpfiles instead of disk
* catch/log exceptions in the pull/fix/push loop
Attachment #517027 - Attachment is obsolete: true
Attachment #517027 - Flags: review?(lars)
Attachment #517027 - Flags: feedback?(deinspanjer)
Attachment #517061 - Flags: review?(lars)
Attachment #517061 - Flags: feedback?(deinspanjer)
Attached file fennec OOIDs modified
started 2011-03-04 17:51:18
stopped 2011-03-04 18:02:55
Attached file results of fennec run
(In reply to comment #16)
> In the crash report you link, the stack trace on other threads almost look
> normal, except for the ashmem (deleted) parts, which are not related to
> elfhack. Having the /proc/pid/maps output from the minidump would help, there.

This looks like a separate, pre-existing issue, per IRC.
Expected number of OOIDs processed.

I need to step away for a little bit, will run this when I get back and can keep an eye on it.
Attached file firefox OOIDs modified
Comment on attachment 517078 [details]
firefox OOIDs modified

started 2011-03-04 20:59:10
stopped 2011-03-04 21:24:48
Worked awesomely. The top crashers list seems not to be updated, though. And new crashes obviously are broken, too.
Attachment #517061 - Flags: review?(lars) → review+
Fennec reports are fixed as of 2011-03-04 18:02:55, and Firefox as of 2011-03-04 21:24:48.

(In reply to comment #29)
> Worked awesomely. The top crashers list seems not to be updated, though. And
> new crashes obviously are broken, too.

We are looking into the top crashers issue now.

To run this on a regular basis, we can easily add it as a cron job in Socorro, but we should add a feature so the script keeps track of where it left off and doesn't fix crashes multiple times (the Bugzilla cron drops a timestamp into a pickled file; we could do something similar, perhaps using the last-fixed id from the reports table rather than a timestamp).
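
A minimal sketch of that checkpointing idea, with a placeholder file path (the real cron-job changes belong to bug 639514):

import os
import pickle

CHECKPOINT_FILE = "/var/tmp/fixBrokenDumps.last_id"  # placeholder path

def load_last_fixed_id():
    # return 0 the first time so the initial run starts from the beginning
    if not os.path.exists(CHECKPOINT_FILE):
        return 0
    with open(CHECKPOINT_FILE, "rb") as f:
        return pickle.load(f)

def save_last_fixed_id(report_id):
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump(report_id, f)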
(In reply to comment #30)
> Fennec reports are fixed as of 2011-03-04 18:02:55, and Firefox as of
> 2011-03-04 21:24:48.
> 
> (In reply to comment #29)
> > Worked awesomely. The top crashers list seems not to be updated, though. And
> > new crashes obviously are broken, too.
> 
> We are looking into the top crashers issue now.

Each run of the TCBS cron job will look for reprocessed jobs up to 2 hours prior, so we should be able to keep up with a batch job such as the one proposed by comment 30, as long as it was run at least hourly.

However, to catch up the backlog, the most straightforward way to do this would be to delete the top crashes by signature (TCBS) rows from the first crash processed which exhibits the problem ('2011-02-20 12:22:40') until '2011-03-04 21:24:48', and let the TCBS cron job rebuild based on the (now fixed) reports table.

We would expect this to take between 1 and 10 hours. This means that the top crashers list for crashes before the start date above (Feb 20) would be unavailable until the rebuild catches up. As it completes each hour of work, though, that hour would become available immediately.
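
For illustration only, the backlog cleanup amounts to something like the following; the table and column names (top_crashes_by_signature, window_end) are assumptions about the schema of that era, and the actual cleanup was done in bug 639512:

def clear_tcbs_window(pg_conn):
    # delete the TCBS rows for the affected window so the cron job rebuilds them
    cursor = pg_conn.cursor()
    cursor.execute("""
        delete from top_crashes_by_signature
        where window_end between '2011-02-20 12:22:40' and '2011-03-04 21:24:48'
    """)
    pg_conn.commit()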
(In reply to comment #31)
> Each run of the TCBS cron job will look for reprocessed jobs up to 2 hours
> prior, so we should be able to keep up with a batch job such as the one
> proposed by comment 30, as long as it was run at least hourly.

Filed bug 639514 to follow up on this.

> However to catch up the backlog, the most straightforward way to do this would
> be to delete the top crashes by signature (TCBS) rows from the first crash
> processed which exhibits the problem ('2011-02-20 12:22:40') until '2011-03-04
> 21:24:48', and let the TCBS cron job rebuild based on the (now fixed) reports
> table.

Filed bug 639512 for this.
Depends on: 639512, 639514
Comment on attachment 517061 [details] [diff] [review]
script dump fix and re-insertion

Landed this, which is appropriate for a one-time fix (given an appropriate SQL query in the config file). Going to add the changes needed for running from cron in bug 639514:

Committed revision 2997.
Backlog is caught up, top crashers lists updated, and a cron job is running hourly to fix broken crashes as they come in.

The Top Crashes By Signature table was rebuilt in bug 639512.
The fixBrokenDumps cron job was installed in bug 639514.

All work was completed by 2011-03-07 22:40:56 Pacific.

Please reopen if you see any problems, or mark bug verified if everything looks ok.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Attachment #517061 - Flags: feedback?(deinspanjer)
Component: Socorro → General
Product: Webtools → Socorro