Closed Bug 637680 Opened 13 years ago Closed 13 years ago

Get top crashers for Firefox and Fennec where crash-stats are broken (linux, android)

Categories: Socorro :: General (task)
Platform: All (Linux)
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: glandium; Assigned: rhelmer
Attachments: 9 files, 1 obsolete file

I'll attach the two programs that can be used to fixup minidumps.
Attached file Fixup Linux minidumps
Build with -I$(topsrcdir)/toolkit/crashreporter/google-breakpad/src

Just give a bunch of minidumps on the command line, and it will modify them in-place.
The plan is to get these minidumps into a dev server (in bug 637678), where I'll run this tool on them, then we'll feed them into the Socorro staging server to generate topcrash lists.
How many dumps are we talking about?
Could we:
- run a MapReduce job to pull each busted dump, fix it, and replace it in HBase
- insert all fixed dumps into the legacy processing queue

This would get the data up on prod.
Actually, after chatting with Laura a bit on IRC, here is what I would offer for your consideration:

Create a Postgres query that can extract a list of submitted_timestamp that need to be fixed

Create a simple Python script that can iterate over those ooids and talk to the hbaseClient object

Call hbaseClient.get_dump(ooid)

Shell exec the fixer program on the dump

Insert the dump back into HBase using a subset of the code in hbaseClient.put_json_dump()

Insert the ooid back into the legacy processing queue by calling hbaseClient.put_crash_report_indices(ooid,CurrentTimestamp,['crash_reports_index_legacy_unprocessed_flag'])

Note that the current timestamp, in the same format as what is used for submitted_timestamp, should be used so that the reprocessed entries don't take priority over normal jobs.


The end result of this job, if it were run on a regular basis, is that we would update the record in HBase with a fixed copy of the dump file (the old one would still be present but not visible to the normal Socorro system). The monitor would see these entries in the queue and, as long as it doesn't reject them as already having been processed, would send them back through the system.

There would be no load increase on the production HBase cluster to support this. If we attempted to do a MapReduce job, we'd have to tune and test it carefully to make sure it wouldn't mess things up. If this were tens of thousands of crashes per day, that might be worth it, but for a small volume this should be a simple-to-implement solution.
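
A minimal sketch of the pull/fix/push loop described above, assuming an already-constructed Postgres connection and hbaseClient object are passed in. The function and parameter names (fix_broken_dumps, fixer_path) are placeholders, and put_fixed_dump() is the helper proposed later in comment 12, not an existing hbaseClient method:

import subprocess
import tempfile

def fix_broken_dumps(pg_conn, hbase_client, sql, fixer_path):
    cursor = pg_conn.cursor()
    cursor.execute(sql)                             # e.g. "select uuid from reports where ..."
    for (ooid,) in cursor:
        dump = hbase_client.get_dump(ooid)          # pull the raw dump out of HBase
        with tempfile.NamedTemporaryFile() as tmp:  # the fixer modifies a file in place
            tmp.write(dump)
            tmp.flush()
            subprocess.check_call([fixer_path, tmp.name])
            tmp.seek(0)
            fixed = tmp.read()
        # write the fixed dump back and queue the ooid for (non-priority) reprocessing
        hbase_client.put_fixed_dump(ooid, fixed, add_to_unprocessed_queue=True)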
Sorry, at the beginning, the first step should read:

Create a Postgres query that can extract a list of ooids that need to be fixed
Ted, do you want us to go ahead?
Daniel's proposal sounds fine to me. Let me know how I can help make this happen.
Attachment #515926 - Attachment mime type: text/x-csrc → text/plain
Attachment #515927 - Attachment mime type: text/x-csrc → text/plain
Rob: see comment 5 for the agreed procedure. We'll need two weeks' worth of dumps to get decent topcrasher info for the broken builds. You might need jberkus to run the query on prod PG for you for that part.

The other part is hacking up some Python to follow the above steps, and running that on prod.

We really need to get this done today - it fell off the radar this week. Can you manage it?
Assignee: ted.mielczarek → rhelmer
Severity: normal → blocker
Here's a count and example queries we'll be working with:

"""
breakpad=> select count(*) from reports where product = 'Firefox' and version = '4.0b11' or version = '4.0b12' and os_name = 'Linux' and date_processed > '2011-01-01';
  count  
---------
 1233417
(1 row)

breakpad=> select count(*) from reports where product = 'Fennec' and version = '4.0b5'; count 
-------
  7851
(1 row)
"""

Went over this with Ted on IRC; it looks good, but let me know if anyone notices anything odd. I am now working on the approach Daniel suggests in comment 5.
Status: NEW → ASSIGNED
(In reply to comment #5)
> Insert the dump back into HBase using a subset of the code in
> hbaseClient.put_json_dump()

Daniel, can you expand on which part(s) of put_json_dump() we don't want?
 
> Insert the ooid back into the legacy processing queue by calling
> hbaseClient.put_crash_report_indices(ooid,CurrentTimestamp,['crash_reports_index_legacy_unprocessed_flag'])

 
> Note that the current timestamp in the same format as what is used for
> submitted_timestamp should be used so that the entries to be reprocessed don't
> take priority over normal jobs.

I think for this we could easily add an optional param to put_json_dump() to override submitted_timestamp, passing in the current time (or whatever we want instead). Let me know if I'm understanding correctly.
Basically, we only want the lines in put_json_dump() that write the dump, none of the metadata manipulation or index management and such.

Something like this, with the one comment placeholder filled in:

  @optional_retry_wrapper
  def put_fixed_dump(self, ooid, dump, add_to_unprocessed_queue = True):
    """
    Update a crash report with a new dump file optionally queuing for processing
    """
    row_id = ooid_to_row_id(ooid)
    submitted_timestamp = # Python code for getting current timestamp in correct format

    columns =  [ 
                 ("raw_data:dump", dump)
               ]
    mutationList = [ self.mutationClass(column=c, value=v)
                         for c, v in columns if v is not None]

    indices = []

    if add_to_unprocessed_queue:
      indices.append('crash_reports_index_legacy_unprocessed_flag')

    self.client.mutateRow('crash_reports', row_id, mutationList) # unit test marker 233

    self.put_crash_report_indices(ooid,submitted_timestamp,indices)
Attached patch script dump fix and re-insertion (obsolete) — Splinter Review
This is based on comment 5 and comment 12 (thanks Daniel!)

Not sure if we want to keep it, but I set it up so we could trivially land this into Socorro and call this as a cron job if it's needed.

Building attachment 515926 [details] and 515927 with a Socorro checkout just needs:
make minidump_stackwalk
gcc -o minidump_hack-fennec -I google-breakpad/src/ minidump_hack-fennec.c 
gcc -o minidump_hack-firefox_linux -I google-breakpad/src/ minidump_hack-firefox_linux.c 

Names of the fixup commands and also the SQL queries are configurable, but "Fennec" and "Firefox Linux" are hardcoded in the config and the start script (hopefully we never need to expand this :))

It might be nice to make the fixup commands read from stdin and write their output to stdout so we don't need to touch the disk, but I'm not going to sweat this right now.
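
As a hedged illustration of that idea (not part of the attached script), if the fixup binaries were changed to read a dump on stdin and emit the fixed dump on stdout, the temporary file could be dropped entirely; fix_dump_in_memory is a hypothetical helper:

import subprocess

def fix_dump_in_memory(fixer_path, dump_bytes):
    # pipe the raw dump through the (hypothetically modified) fixer binary
    proc = subprocess.Popen([fixer_path],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    fixed, _ = proc.communicate(dump_bytes)
    if proc.returncode != 0:
        raise RuntimeError("%s exited with code %d" % (fixer_path, proc.returncode))
    return fixed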
Attachment #517027 - Flags: review?(lars)
Attachment #517027 - Flags: feedback?(deinspanjer)
We tested this on a single crash to start:
https://crash-stats.mozilla.com/report/index/30100333-b41e-4b2e-93fb-694472110220

I have the original dump, and the md5sum changed, but not sure how else to verify.

Who can help with this?
(In reply to comment #14)
> We tested this on a single crash to start:
> https://crash-stats.mozilla.com/report/index/30100333-b41e-4b2e-93fb-694472110220
> 
> I have the original dump, and the md5sum changed, but not sure how else to
> verify.
> 
> Who can help with this?

Taking a look at the original raw dump vs. the new one should help. In the original, you should see 3 modules for Fennec libraries such as libxul.so, while in the new one you should see only two, with the first one covering the address space that the original first two covered. The resolved function names should look better, too.
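
A rough way to make that comparison, assuming a locally built breakpad minidump_dump binary (this simply counts occurrences of the library name in the tool's text output, before and after fixing, rather than parsing the module list properly):

import subprocess

def count_library_mentions(dump_path, library="libxul.so"):
    # dump the minidump as text and count how often the library name shows up;
    # expect the count to drop for the fixed Fennec dumps
    out = subprocess.check_output(["minidump_dump", dump_path])
    return out.decode("utf-8", "replace").count(library)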
In the crash report you link, the stack traces on the other threads look almost normal, except for the ashmem (deleted) parts, which are not related to elfhack. Having the /proc/pid/maps output from the minidump would help there.
Note that the fixing behaviour is different on Linux: the original minidumps have one module for each Firefox library, but each is too small. The fixup makes each module's address space larger, so that it fits the actual address space used in the process.
glandium has been helping to test this in IRC; it looks good, so we're going to proceed with all Fennec 4.0b5 crashes. Doing a dry run now to make sure everything looks OK (processing the right number of crashes, calling the right binary).

There appears to be caching enabled on /rawdumps calls (which Apache rewrites to the socorro-api hostname); I imagine this is on the Zeus, not sure if this is valuable.

Also, just a reminder that per comment 5 these will get inserted into the normal (not priority) queue for processing, so it'll take a while for processors to pick these up.

I should have a reasonable estimate for how long this will take once we have it running for real for a bit.
(In reply to comment #10)
> Here's a count and example queries we'll be working with:
> 
> """
> breakpad=> select count(*) from reports where product = 'Firefox' and version =
> '4.0b11' or version = '4.0b12' and os_name = 'Linux' and date_processed >
> '2011-01-01';
>   count  
> ---------
>  1233417
> (1 row)

Oops, this is wrong; AND binds more tightly than OR, so we need to be explicit about precedence here (thanks glandium):

breakpad=> select count(*) from reports where product = 'Firefox' and (version = '4.0b11' or version = '4.0b12') and os_name = 'Linux' and date_processed > '2011-01-01';
 count 
-------
  7732
(1 row)
Same as attachment 517027 [details] [diff] [review] plus:

* fix Firefox SQL statement
* use /dev/shm for tmpfiles instead of disk
* catch/log exceptions in the pull/fix/push loop
Attachment #517027 - Attachment is obsolete: true
Attachment #517027 - Flags: review?(lars)
Attachment #517027 - Flags: feedback?(deinspanjer)
Attachment #517061 - Flags: review?(lars)
Attachment #517061 - Flags: feedback?(deinspanjer)
Attached file fennec OOIDs modified
started 2011-03-04 17:51:18
stopped 2011-03-04 18:02:55
Attached file results of fennec run
(In reply to comment #16)
> In the crash report you link, the stack trace on other threads almost look
> normal, except for the ashmem (deleted) parts, which are not related to
> elfhack. Having the /proc/pid/maps output from the minidump would help, there.

This looks like a separate, pre-existing issue, per IRC.
Expected number of OOIDs processed.

I need to step away for a little bit, will run this when I get back and can keep an eye on it.
Attached file firefox OOIDs modified
Comment on attachment 517078 [details]
firefox OOIDs modified

started 2011-03-04 20:59:10
stopped 2011-03-04 21:24:48
Worked awesomely. The top crashers list seems not to be updated, though. And new crashes obviously are broken, too.
Attachment #517061 - Flags: review?(lars) → review+
Fennec reports are fixed as of 2011-03-04 18:02:55, and Firefox as of 2011-03-04 21:24:48.

(In reply to comment #29)
> Worked awesomely. The top crashers list seems not to be updated, though. And
> new crashes obviously are broken, too.

We are looking into the top crashers issue now.

To run this on a regular basis, we can easily add it as a cron job in Socorro, but we should add a feature so the script keeps track of where it left off and doesn't fix crashes multiple times (the Bugzilla cron drops a timestamp into a pickled file; we could do something similar, perhaps using the last-fixed id from the reports table rather than a timestamp).
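
A minimal sketch of that checkpointing idea, with a placeholder file path (the real cron-job changes belong to bug 639514):

import os
import pickle

CHECKPOINT_FILE = "/var/tmp/fixBrokenDumps.last_id"  # placeholder path

def load_last_fixed_id():
    # return 0 the first time so the initial run starts from the beginning
    if not os.path.exists(CHECKPOINT_FILE):
        return 0
    with open(CHECKPOINT_FILE, "rb") as f:
        return pickle.load(f)

def save_last_fixed_id(report_id):
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump(report_id, f)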
(In reply to comment #30)
> Fennec reports are fixed as of 2011-03-04 18:02:55, and Firefox as of
> 2011-03-04 21:24:48.
> 
> (In reply to comment #29)
> > Worked awesomely. The top crashers list seems not to be updated, though. And
> > new crashes obviously are broken, too.
> 
> We are looking into the top crashers issue now.

Each run of the TCBS cron job will look for reprocessed jobs up to 2 hours prior, so we should be able to keep up with a batch job such as the one proposed by comment 30, as long as it was run at least hourly.

However, to catch up the backlog, the most straightforward way to do this would be to delete the top crashes by signature (TCBS) rows from the first crash processed which exhibits the problem ('2011-02-20 12:22:40') until '2011-03-04 21:24:48', and let the TCBS cron job rebuild based on the (now fixed) reports table.

We would expect this to take between 1 and 10 hours. This means that the top crashers list for crashes before the start date above (Feb 20) would be unavailable until the rebuild catches up. As it completes each hour of work, though, that hour would become available immediately.
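
For illustration only, the backlog cleanup amounts to something like the following; the table and column names (top_crashes_by_signature, window_end) are assumptions about the schema of that era, and the actual cleanup was done in bug 639512:

def clear_tcbs_window(pg_conn):
    # delete the TCBS rows for the affected window so the cron job rebuilds them
    cursor = pg_conn.cursor()
    cursor.execute("""
        delete from top_crashes_by_signature
        where window_end between '2011-02-20 12:22:40' and '2011-03-04 21:24:48'
    """)
    pg_conn.commit()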
(In reply to comment #31)
> Each run of the TCBS cron job will look for reprocessed jobs up to 2 hours
> prior, so we should be able to keep up with a batch job such as the one
> proposed by comment 30, as long as it was run at least hourly.

Filed bug 639514 to follow up on this.

> However to catch up the backlog, the most straightforward way to do this would
> be to delete the top crashes by signature (TCBS) rows from the first crash
> processed which exhibits the problem ('2011-02-20 12:22:40') until '2011-03-04
> 21:24:48', and let the TCBS cron job rebuild based on the (now fixed) reports
> table.

Filed bug 639512 for this.
Depends on: 639512, 639514
Comment on attachment 517061 [details] [diff] [review]
script dump fix and re-insertion

Landed this, which is appropriate for a one-time fix (given an appropriate SQL query in the config file). Going to add the changes needed for running from cron in bug 639514:

Committed revision 2997.
Backlog is caught up, top crashers lists updated, and a cron job is running hourly to fix broken crashes as they come in.

The Top Crashes By Signature table was rebuilt in bug 639512.
The fixBrokenDumps cron job was installed in bug 639514.

All work was completed by 2011-03-07 22:40:56 Pacific.

Please reopen if you see any problems, or mark bug verified if everything looks ok.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Attachment #517061 - Flags: feedback?(deinspanjer)
Component: Socorro → General
Product: Webtools → Socorro